[META] Snowplow Analytics for GitLab.com

Description

We've decided to use the open-source Snowplow Analytics to track events on GitLab.com. These events will be pushed to our data warehouse and visualized in Looker.

This meta issue is the single source of truth (SSOT) for the tasks needed. Our goal is to have a working pipeline by July 7th that tracks a small handful of events on GitLab.com and lets us visualize them in Looker.

Setup

Snowplow's guide to their pipeline is here, visualized by the following diagram:

[image: Snowplow pipeline diagram]

Since we already have a data warehouse (PostgreSQL on Cloud SQL...?), our primary concern is setting up subsystems 1 through 4 and getting tracked event data stored somewhere: ideally Cloud SQL, possibly S3.

1: Tracking

  • Set up a Snowplow JS tracker for pageviews on GitLab.com

2: Collection

  • Set up a Snowplow collector for tracking events from GitLab.com
  • Self-host snowplow.js for Snowplow analytics tracking
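
To make items 1 and 2 a bit more concrete, here is a rough sketch of what a single pageview looks like on the wire when it is sent to a collector as a GET request. It uses a few standard Snowplow tracker protocol fields (`e=pv`, `url`, `page`, `aid`, `p`, `tv`, `eid`, `dtm`). The collector hostname, the `gitlab` app ID, and the tracker version label are assumptions made up for this example, and in practice the JS tracker builds this request for us.

```python
import time
import uuid
from urllib.parse import urlencode


def build_pageview_url(collector_host, page_url, page_title, app_id="gitlab"):
    """Build a GET URL carrying one page view in Snowplow tracker-protocol form.

    collector_host and app_id are placeholders, not our real configuration.
    """
    params = {
        "e": "pv",                            # event type: page view
        "url": page_url,                      # URL of the page being viewed
        "page": page_title,                   # page title
        "aid": app_id,                        # application ID (assumed value)
        "p": "web",                           # platform
        "tv": "sketch-0.1",                   # tracker version label (made up)
        "eid": str(uuid.uuid4()),             # unique event ID
        "dtm": str(int(time.time() * 1000)),  # device timestamp, ms since epoch
    }
    # Collectors accept tracker-protocol GET requests on the /i path.
    return f"https://{collector_host}/i?{urlencode(params)}"


if __name__ == "__main__":
    print(build_pageview_url("collector.example.com",
                             "https://gitlab.com/explore", "Explore"))
```

Seeing the raw request is mostly useful for smoke-testing the collector before the tracker is deployed on GitLab.com.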

3: Enriching

  • Stand up EmrEtlRunner on a server and schedule it to parse and push logs to storage

4: Storage

5: Modeling

6: Analytics

  • Data team to model w/dbt and visualize in Looker

Related security review: https://gitlab.com/gitlab-com/security/issues/114

Once the cleaned and enriched data is in S3, we can ETL it into our data warehouse on a regular basis and then visualize it in Looker. Great success!
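
As a minimal sketch of that recurring ETL step, the snippet below parses tab-separated enriched event lines and loads them into a table, skipping events that were already loaded so the job can be re-run safely. Several things here are assumptions for illustration only: the shortened `COLUMNS` list (the real enriched event format has many more fields, and their order should be taken from Snowplow's documentation), the `events` table name, and the use of an in-memory SQLite database as a stand-in for our actual warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical, shortened column list; NOT the full Snowplow enriched format.
COLUMNS = ["app_id", "platform", "collector_tstamp", "event", "event_id", "page_url"]


def load_enriched(tsv_text, conn):
    """Load enriched TSV rows into an `events` table; return the table's row count."""
    column_defs = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS events ({column_defs}, PRIMARY KEY (event_id))"
    )
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only rows with the expected number of fields (this also drops blank lines).
    rows = [tuple(row) for row in reader if len(row) == len(COLUMNS)]
    placeholders = ", ".join("?" for _ in COLUMNS)
    # INSERT OR IGNORE skips event_ids already present, so reloads are idempotent.
    conn.executemany(f"INSERT OR IGNORE INTO events VALUES ({placeholders})", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


if __name__ == "__main__":
    sample = "gitlab\tweb\t2019-02-05 00:00:00\tpage_view\tabc-123\thttps://gitlab.com/\n"
    connection = sqlite3.connect(":memory:")
    print(load_enriched(sample, connection))
```

Keying on the event ID is one simple way to make a scheduled load tolerant of overlapping batches; the real job may need a different deduplication strategy.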

Edited Feb 05, 2019 by Jeremy Watson (ex-GitLab)