Google Analytics sampling data

James Standen

Analytics Canvas Tool, Avoiding sampling, Google Analytics API

Google Analytics is used by some very large websites, and those sites generate a lot of data. One of the realities of tracking a high-traffic website with Google Analytics is that data will often be returned as a statistical sample rather than as exact metrics.

Ask for too much data and you will get an estimate.

To return query results quickly and to manage the load on its infrastructure, Google may sample the data rather than calculate the exact values.

UPDATE: When this blog post was originally written, Google returned a “confidence interval”, a statistical measure of the accuracy of the sampling. This value has since been deprecated.

It is completely understandable why Google needs to sample, and it is only a concern at fairly high traffic levels. But for sites that use Google Analytics and have very high traffic, it happens. And sometimes you just need the actual data.

For users that have BIG data needs, we’re very excited to be adding some very cool query partitioning capabilities into Analytics Canvas.

Solution? Analytics Canvas partitions the queries.

With partitioning, you can pull the exact data you need, and get it the way you want it. Analytics Canvas lets you define a very large query, then automatically partitions it into a series of smaller API calls that can be managed to avoid sampling. All of this, of course, adheres strictly to the Google API rules.
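To give a feel for the idea, here is a minimal sketch of date-range partitioning in Python. It is not how Analytics Canvas is implemented internally; the function names `partition_date_range` and `run_partitioned` are made up for illustration, and `fetch` is a stand-in for whatever function you use to issue one API call for one date range and return a single additive metric.

```python
from datetime import date, timedelta


def partition_date_range(start, end, days_per_partition=1):
    """Split the inclusive range [start, end] into consecutive sub-ranges.

    Each sub-range covers at most `days_per_partition` days. The smallest
    useful partition is a single day.
    """
    if days_per_partition < 1:
        raise ValueError("days_per_partition must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")

    partitions = []
    current = start
    while current <= end:
        # Last day of this partition, clipped to the overall end date.
        last = min(current + timedelta(days=days_per_partition - 1), end)
        partitions.append((current, last))
        current = last + timedelta(days=1)
    return partitions


def run_partitioned(start, end, fetch, days_per_partition=1):
    """Call the hypothetical fetch(start, end) once per partition and
    sum the results. Only valid for additive metrics such as page views."""
    return sum(
        fetch(s, e)
        for s, e in partition_date_range(start, end, days_per_partition)
    )
```

For example, `partition_date_range(date(2012, 1, 1), date(2012, 1, 10), 4)` gives three sub-ranges: January 1 to 4, January 5 to 8, and January 9 to 10. Each one would become its own, smaller query.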

There are limits to what this kind of partitioning can achieve, however. The key limitation is that it cannot be used with queries that contain metrics such as “unique visitors” or “new visitors”. Because the query is broken down into multiple periods, Google assesses uniqueness within each individual period, so adding up the results does not give valid values: a visitor who returns in several periods is counted once in each. However, for overall visit counts, page views, source information and so on, partitioned queries can return exact numbers, provided sampling does not occur over the smallest partition period, the smallest possible being 1 day.
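The uniqueness problem is easy to see with a toy example. The sketch below uses invented visitor names and page view counts, not real Google Analytics data, and the helper names are hypothetical. It compares adding up per-day unique visitors with counting unique visitors over the whole period, and shows that page views do not have this problem.

```python
# Invented sample data: which visitors were seen on each day,
# and how many page views were recorded on each day.
daily_visitors = {
    "day 1": {"alice", "bob"},
    "day 2": {"bob", "carol"},
    "day 3": {"alice", "carol", "dave"},
}
daily_pageviews = {"day 1": 10, "day 2": 7, "day 3": 12}


def summed_daily_uniques(visitors_by_day):
    """What you get by adding up the unique-visitor count of each partition."""
    return sum(len(visitors) for visitors in visitors_by_day.values())


def true_uniques(visitors_by_day):
    """Unique visitors over the whole period: each person is counted once."""
    return len(set().union(*visitors_by_day.values()))


def total_pageviews(pageviews_by_day):
    """Page views are additive, so summing the partitions is exact."""
    return sum(pageviews_by_day.values())


print(summed_daily_uniques(daily_visitors))  # 7, returning visitors counted repeatedly
print(true_uniques(daily_visitors))          # 4, the figure for the whole period
print(total_pageviews(daily_pageviews))      # 29, same as a single unpartitioned count
```

The summed figure overstates unique visitors because anyone who comes back on a later day is counted again, which is why partitioned queries are only suitable for metrics that simply add up.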

We want YOU to try it out on your big data.

Have you got big traffic, or have clients that do? We’d like to hear from you.

Even if you don’t have huge data, we’re always looking for web analytics pros to help us test out and improve what we think is going to be a key new tool for those who analyze data of all kinds.