Mark Twain popularised the expression "There are three kinds of lies: lies, damned lies, and statistics." It is a phrase that reminds us that when working with numbers, we have to be careful with the analysis we do.
Statistical sampling is a perfectly valid way to produce accurate, useful measurements from very large sets of data. The catch is that the sample has to be both large enough and random enough to be valid.
With Google's recent announcement that it is reducing the default threshold at which fast-access mode (sampling) kicks in from 500,000 to 250,000 visits, it is even more important to understand what sampling can do to the accuracy of your data.
The bottom line with sampling is this: are you getting enough samples for the results to be accurate?
Obviously, if your sample is 100% of the underlying set, then your results are 100% accurate.
Because Google sets an absolute limit on the number of sessions included in a query, the further the number of visits in the query's time frame exceeds this limit, the less accurate your results become.
The challenge is that the thing that most often triggers sampling is the use of advanced segments, and advanced segments are exactly the tool you reach for when you want to examine a small subset of very interesting visits within a much larger set of visits. So sampling is most likely to kick in precisely when it is most damaging.
An example of when sampling can cause serious inaccuracy in Google Analytics is when looking at very detailed information, like page views or keywords, over large numbers of visits. For example, if you are analysing a web site that gets 2.5 million visitors a month, and you look at specific pages and keywords over a period of a year, then depending on how you have sampling set, you will be looking at a very small fraction of the overall visits. Even at the maximum of 500,000 visits, your sample is still less than 2% of the roughly 30 million visits in the year, so looking at details like specific keywords or pages just isn't possible, as important as that might be for your analysis.
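The arithmetic behind that 2% figure is worth spelling out. A quick sketch (the traffic numbers are the illustrative ones from the example, not real data):

```python
# Rough arithmetic for the example above: how much of a year's traffic
# a fixed-size sample covers. Figures are illustrative.
monthly_visits = 2_500_000
yearly_visits = monthly_visits * 12  # 30,000,000 visits in the year

for sample_cap in (250_000, 500_000):
    fraction = sample_cap / yearly_visits
    print(f"cap {sample_cap:,}: {fraction:.2%} of yearly visits")
```

At the old 500,000-visit threshold the sample is about 1.67% of the year's visits; at the new 250,000 default it drops to about 0.83%.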
One solution to make sure you have all the data is to move to Google Analytics Premium. While the $150,000-a-year price tag makes this a solution aimed at enterprises with larger budgets, Google has a powerful offering, and companies are signing up because they have the traffic (and revenue) that make the price affordable. Premium lets you download unsampled data from your custom reports, and has higher limits for other data aspects as well.
But if you can’t afford that price tag, you are probably looking for alternatives to Google Analytics Premium, or for that matter the other equally expensive paid solutions.
In this case, the free version of Google Analytics does offer a solution: the Google Analytics Core Reporting API.
It is possible to control sampling through the API by spreading your queries out over time, storing the exact results, and then aggregating them back up. The Analytics Canvas tool and platform was designed to do exactly this, and is able to load millions of rows of data representing hundreds of millions of unique visitors using the Core Reporting API.
In the example above, you would run one query for each day in the year, and store the results about pages and keywords in a database. Because daily visits never exceed 500,000 (a 2.5-million-visit month averages roughly 83,000 visits a day), no sampling occurs and you have the exact data.
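The per-day strategy can be sketched as follows. This is a minimal illustration of the date-splitting idea, not Analytics Canvas's implementation; `fetch_report` is a hypothetical stand-in for whatever client code you use to call the Core Reporting API.

```python
from datetime import date, timedelta

def daily_date_ranges(start, end):
    """Yield one (day, day) range per day, so each query covers a
    single day and stays under the sampling threshold."""
    d = start
    while d <= end:
        yield d, d
        d += timedelta(days=1)

def collect_year(fetch_report, year):
    """Run one query per day of the year and accumulate exact rows.

    fetch_report(start, end) is a hypothetical callable standing in
    for a Core Reporting API request; it should return a list of
    (page, keyword, visits) rows for that date range."""
    rows = []
    for start, end in daily_date_ranges(date(year, 1, 1),
                                        date(year, 12, 31)):
        rows.extend(fetch_report(start, end))
    return rows
```

Because each call covers a single day, every response is unsampled, and the union of all the daily responses is the exact yearly data set.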
Once you have loaded your data from Google Analytics into the database, the 250,000 and 500,000 visit limits no longer apply, and sampling no longer has to be taken into account when designing queries, giving you the flexibility to do the analysis you need.
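Once the exact daily rows are in your own database, rolling them back up is ordinary SQL with no sampling limits. A minimal sketch using SQLite, with an illustrative table layout (the table and column names are assumptions, not anything Google Analytics prescribes):

```python
import sqlite3

# Illustrative schema: one row per day per keyword, as stored by the
# daily queries described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (day TEXT, keyword TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?, ?)",
    [("2013-01-01", "shoes", 1200),
     ("2013-01-02", "shoes", 1300),
     ("2013-01-01", "boots", 400)],
)

# Aggregate the exact daily rows back up to totals per keyword --
# no visit caps, no sampling, just SQL over your own data.
for keyword, total in conn.execute(
        "SELECT keyword, SUM(visits) FROM daily "
        "GROUP BY keyword ORDER BY SUM(visits) DESC"):
    print(keyword, total)
```

The same GROUP BY pattern works for pages, traffic sources, or any other dimension you stored, over any date range you choose.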
To discuss further how we can help you manage sampling without going all the way to Google Analytics Premium, contact us. We can get you full access to your data, and often avoid sampling completely, even if you have millions of visitors to your site each month. We can provide a completely hosted, software-as-a-service solution that stores your data securely and quickly, and gives you the reporting you need.