In this blog, I would like to share my experience of pulling data from Google Analytics 4 (GA4) and loading it into staging tables in Snowflake. The ETL tool I used was Matillion, with my instance set up on AWS. Matillion publishes many videos describing the available ETL components, and I found most of the integration steps in various blogs on the Matillion site itself: https://www.matillion.com/resources/blog.
In this blog we will focus on pushing data from GA4 to Snowflake. There are three ways this can be done, and we will discuss each of them. Since we are using GA4, linking to BigQuery is straightforward and is the precondition to start. The Firebase linkage is also available (https://support.google.com/google-ads/answer/6333536?hl=en&ref_topic=3121765).
- Use the Matillion Google BigQuery component in basic mode to pull data directly into Snowflake.
- Use Matillion to pull from BigQuery, stage the data as intermediate JSON files in Google Cloud Storage, then push it into staging tables and FLATTEN it using Snowflake queries or stored procedures.
- Use the Matillion Google BigQuery component to write queries that UNNEST the data, then pass the results on to staging tables.
Before we go deep into each of these methods, we need to understand how our analytics data is structured. The Firebase + GA4 integration allows clients (web browsers, mobile devices) to send data to the Firebase server. There are numerous standard fields whose values can be set, e.g. device details, location details, etc. In some cases these fields are not sufficient; in my case, the use cases required additional custom parameters. These custom parameters are stored as name/value pairs, either in user properties or in event properties. Once BigQuery is linked with GA4, we can query the events table; for example, a SELECT * on a particular event date will show how these key/value pairs are arranged. This context is needed to choose which method to use to push data into Snowflake.
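To make the name/value layout concrete, here is a small Python sketch of how a GA4 event carries custom parameters in its event_params array, and how a value is looked up by key. The field names follow the BigQuery export schema; the event and parameter values themselves are made up for illustration.

```python
# Hypothetical GA4 event as exported to BigQuery: custom parameters live
# in event_params as a list of {key, value} pairs, where value is a record
# with several typed fields and only one of them populated.
event = {
    "event_name": "select_content",
    "event_params": [
        {"key": "author", "value": {"string_value": "Amish"}},
        {"key": "page_views", "value": {"int_value": 42}},
    ],
}

def get_param(event, key):
    """Return the first populated typed value for a given parameter key."""
    for param in event["event_params"]:
        if param["key"] == key:
            value = param["value"]
            # Only one of the typed fields is populated per parameter.
            for typed in ("string_value", "int_value", "float_value", "double_value"):
                if typed in value and value[typed] is not None:
                    return value[typed]
    return None

print(get_param(event, "author"))      # Amish
print(get_param(event, "page_views"))  # 42
```

This is exactly the lookup that the UNNEST-based SQL later in the post performs on the BigQuery side.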
One more aspect required for integration is the ability to configure OAuth2 between the systems, mainly the Google Analytics API and the BigQuery API; the Google Analytics API is the interface to GA4 data. In my case, I was provided only an IP address for the Matillion server. This makes it difficult to configure OAuth2 connections, which in a production environment require fully qualified domain names. I was able to overcome this by adding a local domain entry to the hosts file on Windows; the same approach applies to Unix and macOS systems. (https://support.acquia.com/hc/en-us/articles/360004175973-Using-an-etc-hosts-file-for-custom-domains-during-development)
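For reference, a hosts-file entry of this kind looks like the fragment below. The IP address and hostname are placeholders; use your Matillion server's IP and whatever fully qualified domain name you registered in the OAuth2 redirect URI.

```
# /etc/hosts on Unix/macOS; C:\Windows\System32\drivers\etc\hosts on Windows.
# Placeholder values — substitute your own server IP and domain name.
203.0.113.10    matillion.mycompany.internal
```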
If we do not need custom event and user parameters, it is safe to choose the first option: using the BigQuery component to map the available rows in BigQuery to a target staging table in Snowflake. The link here is more than sufficient to get connected to both BigQuery and Snowflake and create the data pipeline (https://www.matillion.com/resources/blog/using-the-bigquery-query-component-in-matillion-etl-for-snowflake). In this case we will observe that event_params and user_properties are created automatically as VARCHAR(2000) columns, each storing its data as JSON.
In my case, I could not use the default size of the event_params column: these were fairly large documents that did not fit within VARCHAR(2000), and Matillion does not allow copying the data into an existing table with a different data type. A Snowflake table with a VARIANT column, by contrast, supports the many queries that JSON data requires. Hence I decided to pull the data from BigQuery, push it to Google Cloud Storage as a single large JSON file per day, and then load it from Cloud Storage into Snowflake. A separate stored procedure was then written to parse this data and load it into different tables. Here are two links: one presents how JSON can be traversed along a specific path for event or user parameters (https://interworks.com/blog/2020/02/26/zero-to-snowflake-reporting-in-tableau/), and the other shows how the stored procedure can be written (https://www.snowflake.com/blog/automating-snowflakes-semi-structured-json-data-handling/). There are two steps here: load the data into Google Cloud Storage, then copy it into the Snowflake staging table (see COPY INTO <table> and Copying Data from a Google Cloud Storage Stage in the Snowflake documentation). You can also skip the above and follow the Matillion documentation for the same flow (https://www.matillion.com/resources/blog/moving-google-analytics-360-data-into-snowflake).
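The parsing step can be pictured with a small Python analogue of what the Snowflake stored procedure does: take one large per-day JSON document and explode it into flat rows for a staging table. The field names mirror the BigQuery export schema; the sample events and the row layout are illustrative assumptions, not the actual procedure.

```python
import json

# Hypothetical one-day export: a JSON array of events, each carrying
# its custom parameters in event_params.
day_export = json.dumps([
    {"event_date": "20240101", "event_name": "page_view",
     "event_params": [{"key": "author", "value": {"string_value": "Amish"}}]},
    {"event_date": "20240101", "event_name": "purchase",
     "event_params": [{"key": "book", "value": {"string_value": "Shiva Trilogy"}}]},
])

def explode_events(doc):
    """Yield one flat (date, event, key, value) row per parameter."""
    rows = []
    for event in json.loads(doc):
        for param in event.get("event_params", []):
            # Pick whichever typed value field is populated.
            value = next((v for v in param["value"].values() if v is not None), None)
            rows.append((event["event_date"], event["event_name"], param["key"], value))
    return rows

for row in explode_events(day_export):
    print(row)
```

In Snowflake itself this flattening is done in SQL with LATERAL FLATTEN over the VARIANT column, as described in the linked posts.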
Again, in my case there was another problem: I had many key/value pairs, and the keys had to be converted to columns. The requirement became more complex because the custom parameter JSON structure changes from event to event, and JSON parsing with stored procedures was very time consuming. Hence we decided to use the power of BigQuery to process the data into columns and then map them to Snowflake columns using the BigQuery Matillion component itself. Assume we have event_params JSON of the following shape and we need columns named author and book. The CAST/COALESCE magic handles the fact that we cannot always be sure which typed field holds the value, while we are certain what data type the result must have.
[
  {
    "key": "author",
    "value": { "string_value": "Amish" }
  },
  {
    "key": "book",
    "value": { "string_value": "Shiva Trilogy" }
  }
]
SELECT
  (SELECT CAST(COALESCE(CAST(value.int_value AS STRING), CAST(value.string_value AS STRING), CAST(value.float_value AS STRING), CAST(value.double_value AS STRING)) AS STRING)
   FROM UNNEST(event_params) WHERE key = "author") AS author,
  (SELECT CAST(COALESCE(CAST(value.int_value AS STRING), CAST(value.string_value AS STRING), CAST(value.float_value AS STRING), CAST(value.double_value AS STRING)) AS STRING)
   FROM UNNEST(event_params) WHERE key = "book") AS book
FROM <YOUR BIG QUERY EVENT TABLE>
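The CAST/COALESCE pattern above can be spelled out in a few lines of Python: each event_params value record has several typed fields, only one of which is populated, and we normalize whichever one exists to a string. This is an illustrative re-statement of the SQL logic, not production code.

```python
def coalesce_as_string(value):
    """Return the first non-null typed field as a string, like the SQL COALESCE."""
    # Same precedence as the query: int, string, float, double.
    for field in ("int_value", "string_value", "float_value", "double_value"):
        if value.get(field) is not None:
            return str(value[field])
    return None

print(coalesce_as_string({"string_value": "Amish", "int_value": None}))  # Amish
print(coalesce_as_string({"int_value": 7}))                              # 7
```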
This data can now be fed into Snowflake as columns that Tableau users can consume directly. These queries run very fast, and we leverage the power of BigQuery.