Understanding Google Analytics Sampling:

Sample Size and Sample Space

How to eliminate sampling using Analytics Canvas

Using Analytics Canvas, in the vast majority of cases you can eliminate sampling from Google Analytics.

The Google Analytics Core Reporting API has some new sampling features that let Analytics Canvas users see exactly how heavily sampled a data set is.

Google Analytics samples data to speed up query response times. By partitioning a query, it is possible to make multiple smaller queries that each avoid sampling. The trade-off is that the smaller the partition, the more API calls are required, and the longer it takes to get the full result back from the GA API.

What is interesting about this new capability is that it can be used to tune partitioning more effectively, and as a result let you optimize both the speed AND the accuracy of your queries.

New API values: Sample Size and Sample Space:

  • sampleSize: the number of visits that were used to calculate the values returned.
  • sampleSpace: the number of visits from which the sample was taken.

So if, for example, you get a million visits a month and GA uses 500,000 of them, then the sample size is 500k and the sample space is 1M. Dividing the first by the second tells you that your results are based on just 50% of the underlying data.
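The division above is simple enough to sketch in a few lines of Python. The function name `sampled_fraction` is made up for this illustration and is not part of the GA API or Analytics Canvas.

```python
def sampled_fraction(sample_size: int, sample_space: int) -> float:
    """Return the share of visits GA actually used, as a value in (0, 1]."""
    if sample_space <= 0:
        raise ValueError("sample_space must be positive")
    if not 0 < sample_size <= sample_space:
        raise ValueError("sample_size must be between 1 and sample_space")
    return sample_size / sample_space


# The example from the text: 500,000 visits used out of 1,000,000.
fraction = sampled_fraction(500_000, 1_000_000)
print(f"{fraction:.0%} of visits were used")  # prints "50% of visits were used"
```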

Understanding Google Analytics Core Reporting API sampling with Analytics Canvas


The first step to mastering sampling is knowing when it's happening.

In Analytics Canvas, when you make a query, if you see an exclamation mark on a query block and the word "Sampling!", you know that at least one API call involved sampled data.

Let's look at an example. We're looking at a website that has had about 4.1 million visits year to date. If we do a query by month for visits and bounces with a segment, here is what we see:

detecting-google-analytics-sampling-on-query

Sampling Detected!
Notice the exclamation mark. This indicates that sampling was detected.

If we add the new sampleSize and sampleSpace columns by using the Detect Sampling tab when we create a query, Analytics Canvas shows us this:

google-analytics-data-from-sampled-query-analytics-canvas

We can see our result is based on 493,668 visits out of 4,109,978. That is only about 12% of the visits!
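If you are reading the API response yourself rather than through Analytics Canvas, the same check might look something like the sketch below. It assumes the parsed JSON response carries a `containsSampledData` flag, and that `sampleSize` and `sampleSpace` arrive as strings and only when the data is sampled. The helper name `describe_sampling` and the example dictionary are invented for illustration.

```python
def describe_sampling(response: dict) -> str:
    """Summarise how sampled a (parsed) Core Reporting API response is."""
    # Assumption: an unsampled response omits the flag or sets it to False.
    if not response.get("containsSampledData", False):
        return "unsampled"
    # Assumption: both values are delivered as strings, so convert them.
    size = int(response["sampleSize"])
    space = int(response["sampleSpace"])
    return f"sampled: {size:,} of {space:,} visits ({size / space:.1%})"


# Hand-built stand-in for a response, using the numbers from the example above.
example = {
    "containsSampledData": True,
    "sampleSize": "493668",
    "sampleSpace": "4109978",
}
print(describe_sampling(example))
# prints "sampled: 493,668 of 4,109,978 visits (12.0%)"
```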

So to eliminate sampling, we probably need to run a separate query for every month or so. Analytics Canvas does this for you automatically, then combines the results into the single larger result you wanted, without sampling.

Let's specify that no query should cover more than 30 days, and see what we get:

eliminate-google-analytics-sampling-with-analytics-canvas-and-partitioning

The result is no exclamation mark and no sampling. Analytics Canvas automatically generated a series of queries that kept the number of visits in each query below the sampling limit:

partitions-created-30-days-max
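To make the idea of date partitioning concrete, here is a minimal sketch of splitting a date range into chunks of at most a given number of days. This is not how Analytics Canvas is implemented internally (that is not described here). The function name `partition_range` and the example dates are assumptions chosen to resemble a year-to-date query.

```python
from datetime import date, timedelta


def partition_range(start: date, end: date, max_days: int) -> list[tuple[date, date]]:
    """Split [start, end] (inclusive) into consecutive chunks of at most max_days days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")
    partitions = []
    cursor = start
    while cursor <= end:
        # Each chunk covers max_days days, except possibly the last one.
        chunk_end = min(cursor + timedelta(days=max_days - 1), end)
        partitions.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return partitions


# Hypothetical year-to-date range of 288 days.
ytd = partition_range(date(2013, 1, 1), date(2013, 10, 15), max_days=30)
print(len(ytd))  # prints 10
```

Each partition would then be queried separately and the rows combined afterwards. With `max_days=90`, the same range yields only 4 partitions.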

And what happens if we increase the partition size enough that sampling starts again? Let's increase it to 90 days, so instead of 10 partitions we will have only 4, and see what happens.

We can see that sampling has returned, and for each partition we can see exactly how much.

sampling-indication-and-partitioning-with-details

With Analytics Canvas, you can master sampling and control whether you want sampling and how much, up to the ultimate limit of 500,000 visits per day, since a date partition cannot be smaller than a single day. And quite frankly, if you are getting half a million visits a day, it's really time to consider Google Analytics Premium. (Analytics Canvas has a whole other set of capabilities there too, but that's for another post.)
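One way to picture how the sampleSize and sampleSpace values can drive partitioning is the sketch below: keep halving a date range until each piece comes back unsampled, or until a piece is a single day and cannot be split further. This is only an illustration of the idea, not Analytics Canvas's actual algorithm. `unsampled_ranges` and `fake_query` are hypothetical names, and `fake_query` is a stand-in that pretends the site gets a flat 14,000 visits a day with sampling above 500,000 visits.

```python
from datetime import date, timedelta
from typing import Callable

DateRange = tuple[date, date]
# A query callback returns (sample_size, sample_space) for a date range.
QueryFn = Callable[[date, date], tuple[int, int]]


def unsampled_ranges(start: date, end: date, run_query: QueryFn) -> list[DateRange]:
    """Bisect [start, end] until every piece is unsampled or one day long."""
    size, space = run_query(start, end)
    days = (end - start).days + 1
    if size >= space or days == 1:
        # Unsampled, or a single day that cannot be partitioned further.
        return [(start, end)]
    mid = start + timedelta(days=days // 2 - 1)
    left = unsampled_ranges(start, mid, run_query)
    right = unsampled_ranges(mid + timedelta(days=1), end, run_query)
    return left + right


def fake_query(start: date, end: date) -> tuple[int, int]:
    """Stand-in for a real API call: assumed flat traffic and a 500k threshold."""
    visits = ((end - start).days + 1) * 14_000
    return (min(visits, 500_000), visits)


ranges = unsampled_ranges(date(2013, 1, 1), date(2013, 10, 15), fake_query)
print(len(ranges))  # prints 16
```

Bisection is easy to reason about but not optimal: it ends up with 16 partitions of 18 days here, whereas fixed 30-day partitions would also stay under the assumed threshold with only 10.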

Sampling level- how to adjust when sampling starts


Finally, one other addition to the Core Reporting API for GA is the concept of "sampling level". Just as the web UI lets you change, to some extent, when sampling kicks in, this setting is now also part of the API call.

This parameter can have one of three values:

  • DEFAULT - this setting lets Google decide, and the API will balance speed and accuracy.
  • FASTER - this has sampling happen sooner to maximize speed at the possible expense of accuracy. This might mean a very small sample size is used.
  • HIGHER_PRECISION - this setting tries to get the best precision even if the query takes longer. At the time of this writing, that means sampling starts at 500,000 visits. This is the default setting for Analytics Canvas.
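For readers calling the API directly, the three values above are passed as the `samplingLevel` query parameter. The sketch below only builds the request URL and does not send anything. The view ID is a placeholder, the helper name `build_query_url` is invented for this example, and a real request would also need an OAuth access token, which is left out here.

```python
from urllib.parse import urlencode

SAMPLING_LEVELS = {"DEFAULT", "FASTER", "HIGHER_PRECISION"}
BASE_URL = "https://www.googleapis.com/analytics/v3/data/ga"


def build_query_url(view_id: str, start_date: str, end_date: str,
                    metrics: list[str],
                    sampling_level: str = "HIGHER_PRECISION") -> str:
    """Build a Core Reporting API query URL with an explicit sampling level."""
    if sampling_level not in SAMPLING_LEVELS:
        raise ValueError(f"unknown sampling level: {sampling_level}")
    params = {
        "ids": view_id,
        "start-date": start_date,
        "end-date": end_date,
        "metrics": ",".join(metrics),
        "samplingLevel": sampling_level,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# "ga:12345678" is a placeholder view ID, not a real one.
url = build_query_url("ga:12345678", "2013-01-01", "2013-01-30",
                      ["ga:visits", "ga:bounces"])
print(url)
```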



Full access to all API features in Analytics Canvas


All these new features are available as of V1.6.6 of Analytics Canvas. They can be found under the Detect Sampling tab when creating a query. Give it a try now and get rid of sampled data forever.

google-analytics-api-sampling-features-in-analytics-canvas