Removing Google Analytics cruft from urls

Posted on December 29, 2010

So I have decided to be productive over the Christmas break and continue work on my super secret, somewhat stalled twitter app. So apparently I am also going to blog about it here.

Warning: nerdery ahead, no cat photos.

The Problem

Today, I have begun work on a part of the application that processes the urls posted within tweets and attempts to normalise them as much as possible. By normalising them, I mean expanding short urls, such as bit.ly links, and removing certain pieces of tracking code within the url itself. The main culprit I have come across so far has been the tracking code inserted as part of Google Analytics campaigns.

It is a very common pattern to see in urls, and looks very similar to:

?utm_source=twitter&utm_medium=social&utm_campaign=Our+vain+attempt+to+track+you

Why is this a problem?

If someone posts a link without the Google tracking code and then an automated tool, such as twitterfeed or similar, posts a shortened link with Google tracking code inserted, I want my application to see that they both point to a single url.

The easiest way I can see to do this is to remove the Google tracking code. So I need to employ everyone’s favourite hammer, regular expressions.

The Solution

I took a selection of 500 urls containing Google tracking code and looked for the common factors in the urls. My first attempt at matching the tracking code involved looking at the various parameters in order and matching their possible values. The regex pattern came out like this:

/(\?)utm_source=[^\r\n]+&utm_medium=[^\r\n]+&utm_campaign=[^\r\n]+/

I ran a preg_replace against the 500 urls and found that it worked as expected about 95% of the time. Issues came from urls that were missing the utm_campaign parameter and urls where the parameters were in a different order to the majority of the urls.
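To make that failure mode concrete, here is a minimal sketch of the first pattern run through preg_replace. It assumes the pattern escapes the literal question mark as \? and uses [^\r\n] for the values. The example.com urls are made up for illustration and are not taken from the 500 test urls.

```php
<?php
// First attempt: all three parameters, in one fixed order.
// Assumption: \? and [^\r\n] are the intended escapes in the pattern.
$pattern = '/(\?)utm_source=[^\r\n]+&utm_medium=[^\r\n]+&utm_campaign=[^\r\n]+/';

// Hypothetical sample urls covering the cases described above.
$urls = [
    'expected order'   => 'http://example.com/post?utm_source=twitter&utm_medium=social&utm_campaign=launch',
    'different order'  => 'http://example.com/post?utm_medium=social&utm_source=twitter&utm_campaign=launch',
    'missing campaign' => 'http://example.com/post?utm_source=twitter&utm_medium=social',
];

$results = [];
foreach ($urls as $label => $url) {
    // Replace whatever the pattern matches with an empty string.
    $results[$label] = preg_replace($pattern, '', $url);
    echo $label, ': ', $results[$label], PHP_EOL;
}
```

Only the first url is cleaned. The other two come back unchanged, because the pattern insists on utm_source coming straight after the ? and on all three parameters being present.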

How to solve this issue? The obvious change to the regex was to match the parameters of the tracking code individually, so that neither the order nor a missing parameter would affect the matching. I came up with the following regex to achieve this:

/((\?)?(&)?utm_source=[^&]+|(\?)?(&)?utm_medium=[^&]+|(\?)?(&)?utm_campaign=[^&]+)/

This worked a lot better. However, it seemed far too verbose, and it led to me discovering a fourth parameter in Google tracking code, utm_content.
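As a rough sketch of what this second pattern does and does not catch, the snippet below runs it over two made-up example.com urls (again assuming the question marks are escaped as \?). The second url carries the utm_content parameter that the pattern knows nothing about.

```php
<?php
// Second attempt: each known parameter matched on its own, in any order.
$pattern = '/((\?)?(&)?utm_source=[^&]+|(\?)?(&)?utm_medium=[^&]+|(\?)?(&)?utm_campaign=[^&]+)/';

// Hypothetical urls: one reordered with no campaign, one with a fourth parameter.
$reordered   = 'http://example.com/post?utm_medium=social&utm_source=twitter';
$withContent = 'http://example.com/post?utm_source=twitter&utm_medium=social&utm_content=sidebar';

$cleanReordered   = preg_replace($pattern, '', $reordered);
$cleanWithContent = preg_replace($pattern, '', $withContent);

echo $cleanReordered, PHP_EOL;
echo $cleanWithContent, PHP_EOL;
```

The reordered url is cleaned down to http://example.com/post, but the second one keeps its utm_content parameter and ends up as http://example.com/post&utm_content=sidebar, which is the kind of leftover that points to the pattern needing a fourth alternative.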

Refactoring

Hmmmm, should I get even more verbose or can I refactor? Refactoring is always my preferred answer.

I find that when I need to refactor code, the best starting point is always to look again at what I am trying to do and work out if I understand the problem as well as I could. In this case, the problem I started with was that I had a specific list of parameters, in a specific order, that I wanted to get rid of from a url.

I had already worked out that they are not always in the same order, so I was now looking for each parameter individually, and each parameter begins with “utm_”…

There we go, we have the catalyst for refactoring. Let’s adjust the regex to match any parameter that begins with “utm_”.

The Finished Solution?

Adjusting the regex to match any parameter starting with utm_ was just a case of replacing the name part of the parameter with a pattern to match repeating lowercase letters and removing the redundant parts of the rest of the pattern.

/(\?|&)?utm_[a-z]+=[^&]+/

I tested this against the 500 urls and it worked 100% of the time.

Bingo, a short and concise regular expression to strip out Google Analytics tracking code from urls.
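Here is a minimal sketch of how the finished pattern might be wrapped up for use. The function name strip_utm and the example.com urls are my own placeholders for illustration, and the pattern is written with the question mark escaped as \?.

```php
<?php
// Hypothetical helper: remove every parameter whose name starts with "utm_".
function strip_utm(string $url): string
{
    // preg_replace returns null on error, so fall back to the original url.
    return preg_replace('/(\?|&)?utm_[a-z]+=[^&]+/', '', $url) ?? $url;
}

// A plain link and the same link with tracking code attached.
$plain   = 'http://example.com/post';
$tracked = 'http://example.com/post?utm_source=twitter&utm_medium=social&utm_campaign=Our+vain+attempt+to+track+you';

// After stripping, both should normalise to the same url.
var_dump(strip_utm($tracked) === strip_utm($plain));

// Tracking code after another parameter leaves that parameter intact.
echo strip_utm('http://example.com/post?id=7&utm_source=twitter'), PHP_EOL;

// Tracking code before another parameter takes the "?" with it.
echo strip_utm('http://example.com/post?utm_source=twitter&id=7'), PHP_EOL;
```

The last call is worth watching out for. When a utm_ parameter comes first and a non-tracking parameter follows, the ? is removed along with it and the result is http://example.com/post&id=7. That shape may simply not have appeared in the 500 test urls, but it would probably need an extra clean-up step if it turns up in the wild.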

See it in use here.

A Disclaimer…

This was tested with PHP’s preg_replace function; mileage in other environments will vary.

You can see how it is used here.

I am terrible at regular expressions compared to some. I welcome all and any suggestions or feedback via the comments or via email.

This article has 1 comment

  1. ZenPsycho

    Hmmm, if URL normalisation is the goal, I would have written a full URL parsing class, or found a really good one (which doesn’t seem likely to me in the PHP landscape, hmm, but I haven’t looked; I would want one with a test suite). It’s not too hard to make one (I’ve done it before), and it would enable you to just delete UTM-containing parameters from a parameter array, and “serialize” the result back into a url string. For this one case, that’s somewhat more complicated than a single regex. However, it would also, in one fell swoop, normalise percent encoding, remove username/password junk from the start of the URL, deal with weird characters that sometimes show up in URLs and break weak regex-based URL matchers, handle unicode URLs (IRIs) that break many, many poorly written twitter clients (as John Gruber of Daring Fireball discovered), and many more subtle, obscure tricks URLs play on poor developers. (There are various RFCs that I would recommend as good reading for this stuff)…

    Uhmmm.. but maybe that’s just me. -_-. I just imagine this approach ballooning out to a zillion regexes applied in sequence and quickly becoming unmaintainable, confusing and full of little parsing traps and bugs, with no assurance that what you end up with will be a valid URL. I worry too much.

    Other junk I can recommend normalising is various tracking code youtube adds to urls, canonical equivalents to mobile versions of urls, etc. Website specific url shorteners for flickr and youtube are things I’ve seen around too. You’re probably all over that though!
