Note: This article was written for those using Google Analytics Standard accounts. For those using Analytics 360 check out our guide: Sampling in Analytics360? Explained and Solved with Analytics Canvas.
We talk to analysts all the time who aren’t aware that their Google Analytics data is being sampled. In some cases the sampling is so high that any attempts at conversion rate analysis are worthless.
In this post, we’re going to explore data sampling in Google Analytics, identify when it happens, why it matters, and talk about the best solution to eliminate sampling for truly accurate data that you can trust in making your business decisions.
What is data sampling?
As Wikipedia will tell you, sampling is the practice of “using a subset of individuals from within a statistical population to estimate characteristics of the whole population.”
In layman’s terms, sampling is when instead of gathering every single data point in a particular data set, you choose a smaller portion of data and use that sample to estimate what the whole set of data would look like based upon the smaller portion. Polling data is the most commonly known form of sampling. Pollsters survey a percentage of the population, then extrapolate the results to apply to the population as a whole.
Why does it matter if my Google Analytics data is sampled?
Sampling does offer benefits for data analysis since it speeds up and simplifies the process as seen in the example above.
However, sampling also has many drawbacks, not the least of which is that a sample is NEVER a complete data set, so you may be missing the long tail and outliers that actually contain meaningful data. And the smaller the sample is relative to the full data set, the less accurate the conclusions drawn from it.
While this might not be an issue when you’re looking for a representation of demographics or similar factors, it’s a big deal when doing any kind of comparative analysis. If you choose to use the sampled data, you’re making decisions or providing advice to your team based upon inaccurate data, and this can mean loss of your reputation at best and financial loss at worst.
How to identify sampling in Google Analytics data
When looking at your report in the Google Analytics web interface, take a look at the little shield that appears beside your segment/report name.
If the shield is green then your data is not being sampled!
If the shield is yellow then you are looking at sampled data.
Clicking on the shield will allow you to see how big the sample size is by telling you what percent of total sessions is being used in the reported data. The lower the percent of sessions used in the report the greater the sampling issue.
When you view a sampled report in the Google Analytics web interface you do have the choice to adjust the data sampling rate. This doesn’t allow you to eliminate sampling, but you can change the sample size by choosing between precision and speed. Faster speed means smaller samples and greater precision means a larger sample.
If you’re viewing the data in spreadsheets or dashboards other than Google Analytics then your task may be more complicated. Depending upon the tool you used to pull your data out of Google Analytics there may be a sampling indicator included with the dataset.
If there isn't, you should be concerned.
Some tools say that they attempt to eliminate sampling, and they do attempt it, but if any sampling remains, you need to know about it. Many very popular tools will not tell you when data is sampled, so you may be unknowingly working with sampled data! Don't put yourself in this position. There are tools that will keep you informed quickly and easily.
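If you pull data through the Reporting API V4 yourself, one way to check for sampling is to look for the `samplesReadCounts` and `samplingSpaceSizes` fields, which the API includes in a report's `data` only when the result is sampled. The sketch below assumes the response has already been parsed into a Python dict. The helper name `sampling_rates` and the example payload are made up for illustration.

```python
def sampling_rates(response):
    """Return, per report, the fraction of sessions read (1.0 means unsampled)."""
    rates = []
    for report in response.get("reports", []):
        data = report.get("data", {})
        # The API sends these counts as strings, one entry per date range.
        read = [int(n) for n in data.get("samplesReadCounts", [])]
        space = [int(n) for n in data.get("samplingSpaceSizes", [])]
        if not read or not space:
            rates.append(1.0)  # fields absent: the report was not sampled
        else:
            # Report the worst (lowest) rate across the requested date ranges.
            rates.append(min(r / s for r, s in zip(read, space)))
    return rates


# Hypothetical, trimmed-down response: the first report is sampled, the second is not.
example_response = {
    "reports": [
        {"data": {"samplesReadCounts": ["250000"], "samplingSpaceSizes": ["1000000"]}},
        {"data": {"rows": []}},
    ]
}

rates = sampling_rates(example_response)
for index, rate in enumerate(rates):
    print(f"report {index}: {rate:.0%} of sessions used")
```

A check like this is cheap to run on every pull, so a sampled dataset never reaches a spreadsheet or dashboard unlabeled.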
When Does Google Analytics Sample Data For Reporting?
In general, Google Analytics Standard Reports are unsampled reports. This is true even for free accounts. Google Analytics captures the raw session data (up to the data limit of 10 million hits per month for Standard accounts) and pre-aggregates that data into tables so that any user can access unsampled data quickly.
However, beyond the predetermined default reports, any segments, views, or dimensions being applied to ad-hoc reports run by a user can trigger sampling.
This is because Google Analytics is collecting and processing billions of hits daily, so in order to increase efficiency and save computing power, they have set limits upon how many sessions can be included in a single ad-hoc query before they will employ sampling.
According to the Google Analytics Help documentation, sampling is applied when a report exceeds 500k sessions at the property level for the date range you are using. In other words, Google Analytics allows you to pull up to 500K sessions per tracking ID, not per account and not per View, for a single date range without sampling the data. Once your query exceeds that threshold, sampling will be employed.
What Does “Other” Mean In Google Analytics Reports?
Here we should mention another type of sampling that happens when pulling a report out of Google Analytics indicated by the “(other)” entries in reports.
The “(other)” label appears in your reports when Google applies “report query limiting” to high-cardinality dimensions in order to control the cost of the query.
Cardinality refers to the total number of values that can be assigned to a dimension. For example, one dimension most websites are concerned with is “Device Category”, that is, what type of device the user was on when they visited the site. This dimension has only three values: Desktop, Mobile, and Tablet. However, other dimensions can have countless values, such as “page” on a site that dynamically generates pages.
A dimension that has a large number of these potential values is referred to as a “high-cardinality dimension” and for Standard Google Analytics accounts these high-cardinality dimensions are limited to 50k rows per date range. If a requested dimension on your report exceeds these limits then all values that exceed the limit will be rolled up into a single entry called “(other)”.
“Report query limiting” happens when the data for any date range exceeds the standard data limit of 1 million rows set by Google for a single report. In this case, ALL additional data exceeding 1 million rows will be rolled up into a single row labeled “(other)”.
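To make the roll-up concrete, here is a toy illustration with a row limit of 3 instead of 50k or 1 million. It assumes the rows with the most sessions are the ones kept, which is a simplification for the example rather than a description of Google's exact processing. The function name `roll_up_other` and the page data are hypothetical.

```python
def roll_up_other(rows, limit):
    """Keep the `limit` largest rows and fold the rest into one "(other)" row.

    `rows` is a list of (dimension_value, sessions) tuples.
    """
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    kept, overflow = ranked[:limit], ranked[limit:]
    if overflow:
        # Everything past the limit loses its own label and is summed together.
        kept.append(("(other)", sum(sessions for _, sessions in overflow)))
    return kept


pages = [("/home", 900), ("/pricing", 400), ("/blog/a", 30), ("/blog/b", 20), ("/blog/c", 10)]
report = roll_up_other(pages, limit=3)
for page, sessions in report:
    print(page, sessions)
```

The totals still add up, but the detail for `/blog/b` and `/blog/c` is gone, which is exactly the long-tail information an analyst often needs.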
How To Eliminate Google Analytics Data Sampling?
While Google Analytics is clearly providing a powerful service for enterprise-level data analysis, the reality is, sampling is the enemy of accuracy and decision-making, especially when those decisions can affect the bottom line. In these situations, it is essential to make sure you have the right solution in place to protect yourself from incomplete data.
However, the only way to accomplish this within the Google Analytics web interface is to download your data in smaller date ranges where there is no sampling and then stitch that data back together in your spreadsheets or databases. This process is very tedious, time-consuming, and completely inefficient while also introducing the opportunity for human error.
The best method to prevent sampling when using Google Analytics is to employ the Google Analytics API with the help of software that provides the flexibility and latitude to get what you want out of your data while doing all the work of eliminating sampling for you automatically.
Meet Analytics Canvas
Sampling in Google Analytics is not a new issue. It is a concern that has been around since the beginning, and for nearly as long as Google Analytics sampling has been a complication, Analytics Canvas has been dedicated to providing a simple and powerful solution.
As a Google Technology Partner, Analytics Canvas is used by some of the world’s largest brands, and their equally significant data sets, to support the extraction and use of large quantities of data without sampling.
This experience and these partnerships have made Analytics Canvas uniquely qualified to help data analysts extract their Google Analytics data and prepare it for use no matter what their level of expertise.
How Analytics Canvas eliminates sampling for Google Analytics Standard accounts
Surprisingly, the majority of resources offering help with Google Analytics sampling remain limited in their ability to resolve the issue sampling has on data accuracy.
Many guides suggest manually reducing the date ranges until the total number of sessions is below the sampling threshold, but this can be tedious and exacting work. Some suggest that the only solution is to use Google Analytics 360, but despite the common misconception, Google Analytics 360 is also subject to sampling. Others promote expensive solutions that require the use of BigQuery rather than simply loading the data to your desktop or your own databases and files.
Analytics Canvas, however, created the best solution for users to deal with data sampling by resolving issues through automated partitioning - and it does this for as little as $49/mo!
Partitioning is the process of breaking up a single query into multiple queries so that a single query does not exceed the limits for sampling.
For example, since sampling kicks in at 500,000 sessions in the query period, we can circumvent that by breaking a query up into multiple queries based on time period so that each individual query does not trigger sampling.
Imagine a site that gets 20,000 sessions a day. Any query of just one month in duration will trigger sampling. However, a query including half the month will only involve around 300,000 sessions, and therefore the API will return exact data. If you make two queries and combine them together, you’ll get the exact data for the time period even though it exceeds the 500K session limit!
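The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not how Analytics Canvas is implemented: it assumes traffic is roughly constant at a known, positive `sessions_per_day`, and the function name and the dates are chosen arbitrarily.

```python
from datetime import date, timedelta
from math import ceil

SAMPLING_THRESHOLD = 500_000  # sessions per query before sampling, per the article


def partition_date_range(start, end, sessions_per_day, threshold=SAMPLING_THRESHOLD):
    """Split [start, end] into evenly sized chunks that each stay within the threshold.

    Assumes traffic is roughly constant at `sessions_per_day` (must be positive).
    """
    total_days = (end - start).days + 1
    # Most days a single query can cover without exceeding the threshold (at least 1).
    max_days = max(1, threshold // sessions_per_day)
    chunks = ceil(total_days / max_days)
    base, extra = divmod(total_days, chunks)

    ranges = []
    cursor = start
    for i in range(chunks):
        # Spread any leftover days over the first few chunks.
        length = base + (1 if i < extra else 0)
        chunk_end = cursor + timedelta(days=length - 1)
        ranges.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return ranges


# The example above: 20,000 sessions a day over a 30-day month.
ranges = partition_date_range(date(2021, 6, 1), date(2021, 6, 30), sessions_per_day=20_000)
for chunk_start, chunk_end in ranges:
    print(chunk_start, "to", chunk_end)
```

For the 30-day month this yields two 15-day queries of about 300,000 sessions each. Real traffic is rarely constant, so in practice each result still needs to be checked for sampling and split further if necessary.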
Analytics Canvas does this for you automatically, no matter how big your data set, and no matter how many accounts and views you are querying.
How do I Eliminate Sampling From Google Analytics with Analytics Canvas?
- Open Analytics Canvas and apply your license key. If you don't already have one, sign up for an unlimited free trial.
- Go to New Source > Google Analytics and choose Reporting API V4
- If this is your first time using Canvas, it will ask you to authorize a connection to Google Analytics. The authorization is between your machine and the Google Analytics API. Your tokens are not accessible by us and we cannot access your data.
- Select the View or Views you wish to query from your list of Google Analytics Accounts and Properties
- Make your query by selecting your dimensions and metrics, the query time period, applying one or more segments, and/or applying filters. If you have multiple accounts, be sure to add additional metadata.
- Click on the Sampling Tab to view the default settings and click ‘OK’ to run the query
- By default, Canvas will scan for sampling in the query and, if found, will use partitioning to eliminate it. The number of partitions will be automatically determined for you. Canvas will also scan for Report Query Limiting and, if found, will partition the query in an attempt to eliminate it.
That's it! Canvas has returned an unsampled dataset which you can now process further using the tools available in the Block Library (filtering out bad data, creating your own custom channel groupings, cleaning page paths, etc.), or you can export the data to your preferred databases, file types, or our Google Data Studio community connector.
What's more, this can all be done without a developer, so the business user can make as many queries as they want, updating them as their analysis unfolds, and staying in the flow to understand the data and provide meaningful analysis.
Are There Situations Where Analytics Canvas Can’t Eliminate Sampling?
While Analytics Canvas provides powerful automated solutions to resolve sampling issues, there are circumstances in which sampling cannot be avoided for Google Analytics free accounts.
When running a report, Analytics Canvas will automatically partition the results, finding the range at which unsampled data can be accessed. Canvas will go so far as to partition the report down to a single day in order to eliminate sampling. However, it is not possible to request a smaller unit than a single day, which means if your property is receiving more than 500,000 sessions a day, sampling cannot be avoided without upgrading to Google Analytics 360 where the threshold for sampling is much higher.
The good news is, Analytics Canvas will reduce sampling as much as possible by partitioning down to the day and collecting the full data set on days when the threshold has not been exceeded. As well, Canvas will tell you both in the interface and in the dataset whether the data has been sampled for each row in the dataset.
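The behavior described here, splitting until the data is exact and flagging whatever cannot be split further, can be illustrated with a simple recursive halving strategy. This is a sketch of the general idea under stated assumptions, not Canvas's actual algorithm: `run_query` is a hypothetical callable standing in for an API request, and `fake_query` simulates it with made-up daily session counts.

```python
from datetime import date, timedelta


def fetch_unsampled(start, end, run_query):
    """Query [start, end], halving the range whenever the result comes back sampled.

    `run_query(start, end)` is a hypothetical callable returning (rows, is_sampled).
    Returns a list of (start, end, rows, is_sampled) partitions.
    """
    rows, is_sampled = run_query(start, end)
    if not is_sampled or start == end:
        # Either exact data, or a single day that cannot be split further.
        return [(start, end, rows, is_sampled)]
    middle = start + timedelta(days=(end - start).days // 2)
    return (fetch_unsampled(start, middle, run_query)
            + fetch_unsampled(middle + timedelta(days=1), end, run_query))


# Stand-in for a real API call: four days of traffic, one of them over the threshold.
daily_sessions = {
    date(2021, 6, 1): 200_000,
    date(2021, 6, 2): 250_000,
    date(2021, 6, 3): 700_000,
    date(2021, 6, 4): 100_000,
}


def fake_query(start, end):
    total = sum(n for day, n in daily_sessions.items() if start <= day <= end)
    return total, total > 500_000  # "rows" is just the session total here


partitions = fetch_unsampled(date(2021, 6, 1), date(2021, 6, 4), fake_query)
for part_start, part_end, rows, is_sampled in partitions:
    print(part_start, part_end, rows, "SAMPLED" if is_sampled else "exact")
```

In this simulation June 1 and 2 come back together as exact data, June 4 is exact on its own, and June 3 stays flagged as sampled because a single day is the smallest range that can be requested.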
There are a few other limitations that can be encountered when using partitioning to eliminate sampling, specifically when it comes to Calculated and Unique metrics.
Unique metrics are those that Google de-duplicates, counting each distinct item (such as a user) only once per report no matter how many times it occurred in the date range.
For example, "users" is a unique metric. If you were looking at data for the months of March, April, and May, and John Doe visited your site every month, then the date parameters for your report will significantly affect the totals returned by Google Analytics.
If you pulled all three months at once then John Doe would only show on the report once, as “1” user. But if you pulled the three months separately then summed them together, John Doe would show up 3 times (once for each month) as 3 unique users. This result is incorrect and will lead to bad conclusions.
For this reason, you cannot use unique metrics in a partitioned query.
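The John Doe problem is easy to reproduce with Python sets. The visitor IDs below are invented for illustration.

```python
# Hypothetical visitor IDs seen in each month.
visitors_by_month = {
    "March": {"john_doe", "alice"},
    "April": {"john_doe", "bob"},
    "May": {"john_doe"},
}

# One query over the whole period counts each visitor once.
users_single_query = len(set().union(*visitors_by_month.values()))

# Three monthly queries summed together count John Doe three times.
users_summed_partitions = sum(len(ids) for ids in visitors_by_month.values())

print(users_single_query, users_summed_partitions)  # prints "3 5"
```

Three real people become five "users" once the monthly results are added together, and the overcount grows with every additional partition.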
Similarly, calculated metrics are those where Google Analytics has pre-calculated a metric for you. Any metric that includes the term "rate" or the term "avg", such as Bounce Rate or Average Time On Page, is a calculated metric. Google has used standard metrics to create these calculated metrics for you. The problem is that, when partitioning, you cannot simply aggregate these calculated metrics, because you will get the wrong result.
The solution here is much simpler: include the metrics required for the calculation, such as bounces and sessions to calculate bounce rate, and create your calculated metrics at the right level of granularity either in Canvas or in your reporting tool. If you include the date dimension, this does not apply, because the partitions will not be summarized. They will simply be combined to create the full data set.
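A small worked example shows why the underlying metrics have to be summed before the rate is calculated. The partition figures are hypothetical.

```python
# Hypothetical results from two partitions of the same query.
partitions = [
    {"sessions": 1_000, "bounces": 500},    # 50% bounce rate
    {"sessions": 9_000, "bounces": 1_800},  # 20% bounce rate
]

# Wrong: averaging the per-partition rates ignores how much traffic each one had.
naive_rate = sum(p["bounces"] / p["sessions"] for p in partitions) / len(partitions)

# Right: sum the underlying metrics first, then calculate the rate once.
total_sessions = sum(p["sessions"] for p in partitions)
total_bounces = sum(p["bounces"] for p in partitions)
bounce_rate = total_bounces / total_sessions

print(f"averaged rates: {naive_rate:.0%}, recalculated: {bounce_rate:.0%}")
```

Averaging the two rates gives about 35%, while the true bounce rate across all 10,000 sessions is 23%, because the smaller partition is given the same weight as the larger one.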
Partitioning when there are calculated or unique metrics in the query will result in bad data. Analytics Canvas will warn you of this and prevent the query from running. Many tools don't!
Google Analytics is one of the best tools out there for tracking and reporting on website activity, and because of this, it is in extreme demand. As high-traffic websites take advantage of everything Google Analytics has to offer, it is not surprising that Google has had to put in place limitations on the amount of data that can be extracted.
These limits have resulted in data sampling, a situation where only a portion of the data is read and then used to estimate totals. While data sampling is a necessary solution for Google Analytics, it means many businesses are working from incomplete data sets, either unknowingly or because they don’t know how to eliminate sampling from Google Analytics data. Obviously, this is a problem: no business can make informed decisions when using incomplete data.
If you have sampling you can eliminate it by partitioning your query, either manually or programmatically. The good news is, with the right tools, like Analytics Canvas, eliminating sampling is as easy as clicking a button to pull the perfect query through the Google Analytics API.
Don’t put yourself in the position of reporting inaccurate data or manually merging reports together.
Take advantage of our 30-day unlimited free trial to see the power and peace of mind a pipeline built with Analytics Canvas can bring to your organization.