Introduction to the GA4 BigQuery Export Tutorial series
One of the most exciting things about GA4 is that access to raw, event-level data is now available to everyone, not just paying 360 customers. While this is exciting for those who are familiar with SQL and the GA4 data model, it can be intimidating for those who are new to it, as there is a LOT to learn! As such, our GA4 BigQuery Export tutorials aim to help both new and experienced users make the most of this rich data set.
In this post, we're going to help you understand the export feature, look at the BigQuery export table structure, and give some first, easy examples of SQL queries to get you started. In future posts, we'll build on this foundation, step by step, to help you write more and more sophisticated GA4 queries.
We will show you all the 'gotchas' and point out all the places where extra checks are needed. We'll also show you how Analytics Canvas handles those cases, and how it can be used to verify your work, or to generate the SQL on your behalf.
What is the Google Analytics 4 BigQuery Export?
The BigQuery export, also known as the BigQuery linking feature for GA4, is a free service where Google will push your raw, detailed GA4 data directly into BigQuery tables for both Standard and 360 accounts. All you have to do is enable it!
The BigQuery export allows you to use SQL to query that data, providing much more flexibility than querying through the API or the web interface.
With the data in BigQuery, it is then owned by your organization - it is your data - and as such it is not subject to retention limits. You could export it to another database or internal system, but with the low cost and high performance of BigQuery, there is seldom reason to.
Setting up your GA4 BigQuery Export
The first step is to set up the export so we can start using BigQuery SQL to get at all that detailed data. Google has a detailed setup guide here. Be sure to select both Daily and Streaming when you get to that step.
When you set it up, you select a BigQuery Project, and it will then create a Dataset that is named “analytics_123456789” where the number will be the GA4 Property ID.
Google will then write your data into this Dataset, creating two kinds of tables: Daily and Streaming (intraday).
When you see these tables in the BigQuery explorer they will be grouped together, but they are individual tables, each named following one of two patterns:
- events_YYYYMMDD for the Daily tables
- events_intraday_YYYYMMDD for the Streaming (intraday) tables
The structure of these tables is described by the BigQuery Export schema, which we'll dive into in more detail later in this series.
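To see which event tables already exist in your Dataset, you can list them with BigQuery's INFORMATION_SCHEMA. This is a sketch: "project" is a placeholder for your own BigQuery project name, and the Dataset number is a placeholder for your Property ID.

```sql
-- List every table Google has created in the GA4 export Dataset
SELECT table_name
FROM `project.analytics_123456789.INFORMATION_SCHEMA.TABLES`
ORDER BY table_name;
```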
Having data separated into multiple tables like this is referred to as "sharding". It means that when you query your data, you need to use specific syntax in your SQL to look at multiple tables at once to get the date range you want.
We’ll show you this syntax in this post.
The reason there are individual tables rather than one large table that just keeps getting appended to is to reduce the BigQuery processing cost. With data sharded into daily tables, when you query only the last 30 days, you only get charged for the data processed in the last 30 tables.
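As a sketch (with a placeholder project name), the wildcard syntax below scans only the daily tables whose suffix falls inside the range, so you are charged only for those shards rather than for the entire export:

```sql
-- Scans only the 30 daily shards in the suffix range
SELECT COUNT(1) AS total_events
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20220501' AND '20220530';
```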
Table structure and field data types
A table is a set of rows, each row is made up of fields (or columns) and each field has a data type and holds a value or data.
Every row takes up a certain amount of storage space. This can be different for each row since some rows will have longer strings, or might have missing values etc. The size of the table is the size of all the rows added together, and a little bit of overhead.
A table usually has an inherent granularity, which defines what each row represents and at what level of detail. This is important because it changes what you can query and how you should query it.
For example, a sales table might have one row for each day (daily total sales for all products) or it might have one row for each day/product (so there would be as many records for each day as different unique products that were sold that day). Or, at its most detailed granularity a sales table might have a row for every item on every order. Multiple rows would make up an order.
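As a sketch over a hypothetical sales table (the table and its columns are invented for illustration, with one row per item per order), you could roll up to either of the coarser granularities with GROUP BY:

```sql
-- One row per day: daily total sales for all products
SELECT sale_date, SUM(amount) AS total_sales
FROM `project.demo.sales`
GROUP BY sale_date;

-- One row per day/product
SELECT sale_date, product_id, SUM(amount) AS total_sales
FROM `project.demo.sales`
GROUP BY sale_date, product_id;
```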
In the GA4 export tables, the granularity is one event per row.
This is important because it means to get a count of events, you count the rows. It also means that a session involves more than one row.
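Because each row is one event, counting events of any type is just counting rows, as in this sketch (placeholder project name and fixed date range):

```sql
-- How many of each event type occurred in May 2022
SELECT event_name, COUNT(1) AS event_count
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20220501' AND '20220531'
GROUP BY event_name;
```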
We’ll go into more detail regarding granularity, and data types in future tutorials, but let’s take it a step at a time and start with a simple example.
Your first query to get started with the GA4 BigQuery Export
First, we point out a tricky syntax rule: when you look at the query, you will see that strings are surrounded by single quotes ('). But look closely and you will see that in BigQuery, column names are surrounded by a "backtick" (`), which looks a bit like a single quote but slants from top left to bottom right.
On US keyboards this character shares a key with the tilde (~) at the top left. The two are not interchangeable; the query only works if you use the right one in the right place! If you write 'geo.country' you will get the literal string "geo.country"; if you write `geo.country` you will get the actual values from the column.
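You can see the difference side by side with a small sketch like this (placeholder project name): the quoted version returns the same literal text on every row, while the unquoted column reference returns real values.

```sql
SELECT
  'geo.country' AS quoted_literal,  -- the string "geo.country" on every row
  geo.country   AS actual_value     -- the real country value from the column
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20220501'
LIMIT 5;
```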
This simple query is for a count of sessions by country. The example below shows you how to access the geo data in the event table, as well as how to convert the event_date column, which is a string, to a proper date:
Breaking down the query
SELECT PARSE_DATE('%Y%m%d',`event_date`) AS `date`, geo.country, COUNT(1) AS sessions
FROM `project.analytics_1234567.events_*`
WHERE event_name = 'session_start'
AND _table_suffix BETWEEN '20220501' AND '20220531'
GROUP BY 1,2
Let's look at all the parts of this query and understand them.
SELECT: Queries start with SELECT then they list the columns you want in the result. This can be just the column name or can be a calculation using multiple columns followed by “AS colName” to specify the name of the new column being created with the calculation.
In this example, we have three columns: date, country, and sessions.
The event_date column in the GA4 BQ Export is a string with the format YYYYMMDD, so we use the BigQuery function PARSE_DATE to convert it to a date, then use AS to rename it to just date.
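As a quick sanity check you can run in the BigQuery console without touching any table, PARSE_DATE with the '%Y%m%d' format string turns the export's string dates into real DATE values:

```sql
SELECT PARSE_DATE('%Y%m%d', '20220501') AS parsed_date;  -- returns 2022-05-01
```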
The country column comes straight from the BigQuery schema, but it lives inside the "geo" record, so you need to prefix it with geo. in your query. geo.country then becomes the field country in the output.
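The other fields nested in the geo record follow the same pattern, as in this sketch (placeholder project name and date):

```sql
-- Each geo field is accessed as geo.<field>
SELECT geo.continent, geo.country, geo.region, geo.city
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20220501'
LIMIT 10;
```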
And finally, because we are filtering `event_name` to 'session_start', every row in our result is a session, so we use the function COUNT(1) to count the rows, and we put "AS sessions" to name that column sessions.
FROM: This is the location of your table. Here you will replace 'project' with the name of your BigQuery project, and "1234567" with your GA4 Property ID. The "*" after "events_" means you are querying a sharded (or partitioned) table and will specify the date range with a suffix filter.
WHERE: This is a clause, or a filter, that specifies how you want to constrain your results. In this case, we're looking for just the session_start event because we are counting sessions.
AND: This is a further constraint on the query. In this case, we're specifying the date range we want to run the query over, which helps to reduce our BigQuery analysis charges. To specify the date range we use the special syntax "_table_suffix BETWEEN", which works because we used the wildcard "*" in our FROM statement. We then give the start and end dates of the range in the format "YYYYMMDD".
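If you prefer a rolling window to fixed dates, the suffix bounds can be computed with FORMAT_DATE, as in this sketch of the AND clause (here, the last 30 complete days):

```sql
AND _table_suffix BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
    AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
```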
GROUP BY: This summarizes rows that have the same values into summary rows. The GROUP BY list must include every dimension in the query, but not columns that are already wrapped in an aggregate function like SUM, MIN, MAX, or COUNT. To specify the columns to group by, we can use their position numbers or type out the full column names. In this case we used numbers, since the first two columns (date and country) are the ones we are grouping by. The third column, sessions, is an aggregate (COUNT) and should not be in the GROUP BY list.
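For reference, the same query written with column names instead of ordinals looks like this sketch (BigQuery lets you group by a select-list alias such as `date`; placeholder project name as before):

```sql
SELECT PARSE_DATE('%Y%m%d', `event_date`) AS `date`, geo.country, COUNT(1) AS sessions
FROM `project.analytics_1234567.events_*`
WHERE event_name = 'session_start'
  AND _table_suffix BETWEEN '20220501' AND '20220531'
GROUP BY `date`, country;
```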
Making a GA4 Query
Now that you know the basics of writing a query, where do you write it?
One place to execute the query is the BigQuery console. You can run the query, save it, schedule it, and explore it in Sheets or Data Studio. However, you are responsible for everything else, from writing the SQL, to creating date-partitioned summary tables, to performing incremental loads. The console is a great place to get started, but if you have a lot of queries and want to reuse and manage them across multiple users, it can be labor intensive and/or require you to write scripts.
Perhaps the best place to make your query is Analytics Canvas. When you make your queries in Canvas, you can:
- Make any query you want by writing the SQL yourself, or create sophisticated GA4 queries in just a few seconds using the SQL Builder
- Incrementally load your data into date-partitioned summary tables, saving significantly on query costs
- Combine data from multiple GA4 properties into a single table
- Blend your GA4 data with other data sources like your CRM
- Visually clean and prepare your data for reporting and analysis using no-code data blocks
- Schedule your data for regular updates or use API automation to trigger updates on demand
Analytics Canvas has all the tools you need to write the most complex GA4 queries in a matter of seconds, store that data in BigQuery summary tables, and process your data in a visual, no-code environment.
By building your queries and workflows with Analytics Canvas, you can have full confidence in your data!
You can continue your learning on the GA4 BigQuery Export in the next article in our series, or you can jump right into a trial of Analytics Canvas to skip the hard parts and get right into the data.