<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://nchammas.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nchammas.com/" rel="alternate" type="text/html" /><updated>2025-10-22T18:07:11+00:00</updated><id>https://nchammas.com/feed.xml</id><title type="html">Into the Light</title><subtitle></subtitle><author><name>Nicholas Chammas</name></author><entry><title type="html">A Query Language is also a Data Constraint Language</title><link href="https://nchammas.com/writing/query-language-constraint-language" rel="alternate" type="text/html" title="A Query Language is also a Data Constraint Language" /><published>2021-10-27T00:00:00+00:00</published><updated>2021-10-27T00:00:00+00:00</updated><id>https://nchammas.com/writing/query-language-constraint-language</id><content type="html" xml:base="https://nchammas.com/writing/query-language-constraint-language"><![CDATA[<p>What’s the difference between a data constraint and a data query? Is there anything that can be expressed in one form but not the other? My sense is that there is no such thing.</p>

<p>A constraint – or, similarly, a validation check<sup id="fnref:validation" role="doc-noteref"><a href="#fn:validation" class="footnote" rel="footnote">1</a></sup> – is a description of what your data should or shouldn’t look like. A query serves the same purpose. It describes data that has a specific “shape” or satisfies certain properties. Anything that can be thought of as a constraint can also be expressed as a query.</p>

<p>Consider one of SQL’s “built-in” constraints, <code class="language-plaintext highlighter-rouge">PRIMARY KEY</code>. It’s such a commonly used constraint that, presumably, SQL’s authors felt it merited a dedicated keyword. But a primary key constraint can also be expressed as a plain query.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- `PRIMARY KEY` expressed as a plain SQL query.</span>
<span class="c1">-- `id` is our primary key column.</span>
<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">id_count</span>
<span class="k">FROM</span> <span class="n">some_table</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span>
<span class="k">HAVING</span> <span class="p">(</span>
    <span class="n">id_count</span> <span class="o">&gt;</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">id</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<p>As long as this query returns nothing, our constraint holds. If some data violates our constraint, then this query will return precisely the violating data.</p>
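<p>To make this concrete, here is the same check run end to end against SQLite's in-memory database. This is a minimal sketch: the table and data are made up for illustration, and the aggregate is repeated in the <code class="language-plaintext highlighter-rouge">HAVING</code> clause for portability, since not every engine accepts a column alias there.</p>

```python
import sqlite3

# A toy table with two primary key violations: a duplicated id and a
# NULL id. (The table and data are illustrative.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "c"), (None, "d")],
)

# The primary key constraint expressed as a plain query.
violations = conn.execute("""
    SELECT id, COUNT(*) AS id_count
    FROM some_table
    GROUP BY id
    HAVING COUNT(*) > 1 OR id IS NULL
""").fetchall()

# An empty result would mean the constraint holds. Here the query
# returns precisely the violating groups: id 2 (twice) and the NULL id.
print(violations)
```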

<p>This ability to express constraints as queries isn’t specific to SQL. Any query language can do this. Let’s express this same constraint using Apache Spark’s DataFrame API, which is roughly equivalent to SQL in expressiveness:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># `PRIMARY KEY` expressed using PySpark's DataFrame API.
</span><span class="p">(</span>
    <span class="n">some_table</span>
    <span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">count</span><span class="p">()</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span>
        <span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"count"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)</span>
        <span class="o">|</span> <span class="n">col</span><span class="p">(</span><span class="s">"id"</span><span class="p">).</span><span class="n">isNull</span><span class="p">()</span>
    <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<!--
With a slight tweak, we can have this query instead return `TRUE` when our constraint is met and `FALSE` when it's violated.

```sql
-- Check that `id` is a valid primary key.
SELECT NOT EXISTS(
    SELECT id, COUNT(*) AS id_count
    ...
);
```
-->

<p>As another quick example, here is <code class="language-plaintext highlighter-rouge">FOREIGN KEY</code> expressed as an SQL query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- `FOREIGN KEY` expressed as a plain SQL query.</span>
<span class="c1">-- book.author_key is a foreign key to author.author_key.</span>
<span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="o">*</span>
<span class="k">FROM</span> 
    <span class="n">book</span> <span class="n">b</span>
    <span class="k">LEFT</span> <span class="k">OUTER</span> <span class="k">JOIN</span> <span class="n">author</span> <span class="n">a</span>
        <span class="k">ON</span> <span class="n">b</span><span class="p">.</span><span class="n">author_key</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">author_key</span>
<span class="k">WHERE</span>
        <span class="n">b</span><span class="p">.</span><span class="n">author_key</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span>
    <span class="k">AND</span> <span class="n">a</span><span class="p">.</span><span class="n">author_key</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="p">;</span>
</code></pre></div></div>

<p>And again using Spark’s DataFrame API:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># `FOREIGN KEY` expressed using PySpark's DataFrame API.
</span><span class="p">(</span>
    <span class="n">book</span>
    <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">author</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">"author_key"</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">"left_outer"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">book</span><span class="p">[</span><span class="s">"author_key"</span><span class="p">].</span><span class="n">isNotNull</span><span class="p">())</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">author</span><span class="p">[</span><span class="s">"author_key"</span><span class="p">].</span><span class="n">isNull</span><span class="p">())</span>
    <span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">book</span><span class="p">[</span><span class="s">"*"</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div>

<p>If every row in <code class="language-plaintext highlighter-rouge">book</code> points to a valid row in <code class="language-plaintext highlighter-rouge">author</code> (or points to no author at all), then these queries return nothing. If a <code class="language-plaintext highlighter-rouge">book</code> points to an author not in our <code class="language-plaintext highlighter-rouge">author</code> table, then our foreign key constraint has been violated and these queries will return the rows from <code class="language-plaintext highlighter-rouge">book</code> that point to a bad author key.</p>
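<p>The foreign key check runs end to end the same way. Again a sketch against SQLite's in-memory database, with made-up tables and rows:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (author_key INTEGER, name TEXT)")
conn.execute("CREATE TABLE book (book_key INTEGER, title TEXT, author_key INTEGER)")
conn.executemany("INSERT INTO author VALUES (?, ?)", [(1, "Woolf"), (2, "Baldwin")])
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [
        (10, "Orlando", 1),
        (11, "Giovanni's Room", 2),
        (12, "Mystery Book", 99),      # points to a nonexistent author
        (13, "Anonymous Work", None),  # no author at all; allowed
    ],
)

# The foreign key constraint expressed as a plain query: find books
# whose author_key matches no row in author.
orphans = conn.execute("""
    SELECT b.*
    FROM book b
    LEFT OUTER JOIN author a ON b.author_key = a.author_key
    WHERE b.author_key IS NOT NULL
      AND a.author_key IS NULL
""").fetchall()

# Only the book with the bad author key comes back.
print(orphans)
```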

<h2 id="not-just-the-built-in-sql-constraints">Not Just the Built-In SQL Constraints</h2>

<p>My feeling is that any possible constraint or check you can think of can be expressed in this way, using the query languages we are already familiar with. And since we are using mature and well-understood query languages, we can reuse all the constructs and patterns they provide: functions, subqueries, common table expressions, and so on.</p>

<p>To that end, let’s express a more complex constraint as a query, using this example scenario described in <em>Designing Data-Intensive Applications</em><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>:</p>

<blockquote>
  <p>[Y]ou are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call.</p>
</blockquote>

<p>You can’t express this constraint using any of SQL’s built-in keywords, but it’s pretty straightforward as a query.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Doctor on-call constraint expressed as SQL.</span>
<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span> <span class="k">AS</span> <span class="n">sufficient_on_call_coverage</span>
<span class="k">FROM</span> <span class="n">doctors</span>
<span class="k">WHERE</span> <span class="n">on_call</span><span class="p">;</span>
</code></pre></div></div>

<p>If we have at least one doctor on-call, then <code class="language-plaintext highlighter-rouge">sufficient_on_call_coverage</code> is <code class="language-plaintext highlighter-rouge">TRUE</code>, otherwise it’s <code class="language-plaintext highlighter-rouge">FALSE</code>. And as before, this constraint can be expressed in another query language like a DataFrame API, not just SQL:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Doctor on-call constraint expressed using PySpark's DataFrame API.
</span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">count</span>

<span class="p">(</span>
    <span class="n">doctors</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"on_call"</span><span class="p">))</span>
    <span class="c1"># I don't call .count() here but do it instead inside the
</span>    <span class="c1"># .select() so this query is semantically equivalent to the SQL one.
</span>    <span class="p">.</span><span class="n">select</span><span class="p">(</span>
        <span class="p">(</span><span class="n">count</span><span class="p">(</span><span class="s">"*"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">"sufficient_on_call_coverage"</span><span class="p">)</span>
    <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<p>At this point, you can probably imagine any number of arbitrary data constraints expressed as queries. When the constraint holds the query returns <code class="language-plaintext highlighter-rouge">TRUE</code> (or, alternatively, it returns nothing), and when the constraint is violated the query returns <code class="language-plaintext highlighter-rouge">FALSE</code> (or, alternatively, it returns the violating data). You can make the query as complicated as you’d like, as long as it fits this pattern.<sup id="fnref:non-deter" role="doc-noteref"><a href="#fn:non-deter" class="footnote" rel="footnote">3</a></sup></p>

<!--
One final example, just to drive home the point that queries can express any sort of constraint or check: If your underlying table format supports [time travel][8], you can check historical metrics that describe the data over time. Say we want to check that a table does not grow beyond a certain rate each day:

[8]: https://docs.delta.io/latest/delta-batch.html#-deltatimetravel

```sql
WITH recent_counts AS (
    SELECT
        (
            SELECT COUNT(*)
            FROM some_table TIMESTAMP AS OF TODAY()
        ) AS count_today,
        (
            SELECT COUNT(*)
            FROM some_table TIMESTAMP AS OF DATEADD(DAY, -1, TODAY())
        ) AS count_yesterday
)
-- Check that some_table's row count has not grown more than 10%
-- since yesterday.
SELECT count_today <= 1.1 * count_yesterday;
```

Constraints that reference non-deterministic functions like `TODAY()` do present added complications that I won't get into in this post.
-->

<h2 id="straight-from-the-sql-specification-create-assertion">Straight from the SQL specification: CREATE ASSERTION</h2>

<p>This idea that arbitrary queries can be used to express data constraints is not new. As part of my research for this post I stumbled on a feature of SQL that’s been part of the specification <a href="http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt">since 1992</a>: <code class="language-plaintext highlighter-rouge">CREATE ASSERTION</code>.</p>

<!-- [^sql-assert]: Section 11.34, page 324. -->

<p>The idea of SQL assertions is that you can specify constraints on your data via queries that return <code class="language-plaintext highlighter-rouge">TRUE</code> when the constraint holds and <code class="language-plaintext highlighter-rouge">FALSE</code> when it is violated.</p>

<p>So here’s our primary key constraint re-expressed as an SQL assertion.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">ASSERTION</span> <span class="n">some_table_primary_key</span>
<span class="k">CHECK</span> <span class="p">(</span>
    <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="p">(</span>
        <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">id_count</span>
        <span class="k">FROM</span> <span class="n">some_table</span>
        <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span>
        <span class="k">HAVING</span> <span class="p">(</span>
            <span class="n">id_count</span> <span class="o">&gt;</span> <span class="mi">1</span>
            <span class="k">OR</span> <span class="n">id</span> <span class="k">IS</span> <span class="k">NULL</span>
        <span class="p">)</span>
    <span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">some_table</code> should ever come to have more than one row with the same <code class="language-plaintext highlighter-rouge">id</code>, or a row with a <code class="language-plaintext highlighter-rouge">NULL</code> <code class="language-plaintext highlighter-rouge">id</code>, then the inner <code class="language-plaintext highlighter-rouge">SELECT</code> will return some results, causing the <code class="language-plaintext highlighter-rouge">NOT EXISTS</code> check to return <code class="language-plaintext highlighter-rouge">FALSE</code>, thus violating our <code class="language-plaintext highlighter-rouge">some_table_primary_key</code> constraint.</p>

<p>Here’s the doctor on-call constraint expressed as an SQL assertion.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">ASSERTION</span> <span class="n">sufficient_on_call_coverage</span>
<span class="k">CHECK</span> <span class="p">(</span>
    <span class="p">(</span>
        <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
        <span class="k">FROM</span> <span class="n">doctors</span>
        <span class="k">WHERE</span> <span class="n">on_call</span>
    <span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Any constraint you can express as an SQL query can be tweaked to fit this form of an <code class="language-plaintext highlighter-rouge">ASSERTION</code>. Popular DataFrame APIs like Spark’s don’t offer assertions, but we could easily imagine a couple of potential DataFrame equivalents to SQL assertions. Here, again, is the on-call doctors assertion:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">count</span>

<span class="c1"># Hypothetical DataFrame Assertion API, inspired by Spark's
# `DataFrame.createGlobalTempView()`.
</span><span class="p">(</span>
    <span class="n">doctors</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"on_call"</span><span class="p">))</span>
    <span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">count</span><span class="p">(</span><span class="s">"*"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">createAssertion</span><span class="p">(</span><span class="s">"sufficient_on_call_coverage"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Another hypothetical DataFrame Assertion API, inspired by the Delta
# Live Tables API.
</span><span class="o">@</span><span class="n">assertion</span>
<span class="k">def</span> <span class="nf">sufficient_on_call_coverage</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">(</span>
        <span class="n">doctors</span>
        <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"on_call"</span><span class="p">))</span>
        <span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">count</span><span class="p">(</span><span class="s">"*"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>

<!--
another example with join + aggregation; e.g. doctors must have completed minimum of 3 training courses
-->

<h2 id="efficiently-checking-assertions">Efficiently Checking Assertions</h2>

<p><code class="language-plaintext highlighter-rouge">CREATE ASSERTION</code> is, unfortunately, vaporware—no popular database supports it.<sup id="fnref:rdb" role="doc-noteref"><a href="#fn:rdb" class="footnote" rel="footnote">4</a></sup> Why is that? There probably isn’t a definitive answer to this question, but we can take some educated guesses at the reasons.</p>

<p>First, generic assertions are not a critical feature. The other constraints SQL supports – <code class="language-plaintext highlighter-rouge">PRIMARY KEY</code>, <code class="language-plaintext highlighter-rouge">FOREIGN KEY</code>, <code class="language-plaintext highlighter-rouge">CHECK</code> – cover most applications’ practical needs. Anything more complex that would need to be expressed as an assertion can instead be implemented in your application layer or as part of a stored procedure.</p>

<p>A deeper reason is that assertions are very difficult to check efficiently. An assertion can touch an arbitrary number of tables and involve expensive operations. Every time a table referenced in an assertion is modified, the assertion needs to be rechecked. That means running a potentially expensive query, possibly involving joins or aggregations, to confirm the assertion still holds. While the assertion is being checked, large ranges of data may need to be locked to ensure consistency in the face of concurrent readers and writers. And if the assertion fails, the original modification to the referenced tables must also fail and be rolled back. Compare this to the traditional constraints supported by SQL, whose checks are typically limited to the rows being updated and do not require looking outside that narrow range.</p>

<p>To express this problem differently, a key aspect of what makes assertions expensive to check is that they are expensive to <em>maintain</em>. That is, given a database in a consistent state with some active assertions, how do you check that an assertion still holds when an arbitrary change is made to the database? Naively, this means running the query that defines the assertion. If the query involves a join and aggregation across two large tables, that means computing the join and aggregation against those large tables from scratch. In the case of our doctors example, that means scanning the table to count how many doctors are on-call.</p>

<!-- [illustration of graph of queries with some marked red as assertions; use doctor example] -->

<!-- [illustration of small input change triggering large recomputation] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/query-constraint-language/on-call-doctors.png" width="600" />
    </span>
    <figcaption>
    One doctor is on call, and since `1 &gt;= 1`, our constraint query returns a single `True` row.
    </figcaption>
</figure>
</div>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/query-constraint-language/on-call-doctors-stale.png" width="600" />
    </span>
    <figcaption>
    The one doctor that was on-call is no longer on-call. Our constraint query output is now stale and needs to be recomputed to account for the change to the `Doctors` table.
    <!-- Ideally, we'd be able to update the constraint query just by looking at the rows in `Doctors` that changed, as opposed to scanning the whole table. -->
    </figcaption>
</figure>
</div>

<p>But why should we run anything from scratch? Given an incremental update to our database, we’d ideally want to be able to incrementally update the queries that define our assertions. That is, instead of recomputing the output of an assertion query from scratch, we want to incrementally update the output given the incremental changes to the input tables referenced in the query. So if one doctor updates their on-call status, we should be able to recompute the on-call constraint by looking at that one changed row.</p>
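<p>To sketch what incremental maintenance looks like in the simplest case, here is a toy Python class, not tied to any real system and with made-up names, that materializes the on-call count once and then adjusts it from each changed row in constant time:</p>

```python
# Toy sketch of incrementally maintaining the on-call assertion.
# Instead of rescanning the doctors table after every change, we keep
# a running count and adjust it from the delta alone. Real systems
# generalize this idea to arbitrary queries.

class OnCallAssertion:
    def __init__(self, doctors):
        # One full scan to initialize the materialized count.
        self.on_call_count = sum(1 for d in doctors.values() if d["on_call"])

    def apply_update(self, old_row, new_row):
        """Incorporate a single changed row in O(1) time."""
        if old_row is not None and old_row["on_call"]:
            self.on_call_count -= 1
        if new_row is not None and new_row["on_call"]:
            self.on_call_count += 1

    def holds(self):
        return self.on_call_count >= 1


doctors = {
    "alice": {"on_call": True},
    "bob": {"on_call": False},
}
assertion = OnCallAssertion(doctors)
assert assertion.holds()

# Alice goes off call. The one changed row is enough to recheck the
# assertion; no rescan of the table is needed.
assertion.apply_update({"on_call": True}, {"on_call": False})
print(assertion.holds())  # False
```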

<p>This problem is starting to look like incrementally maintaining a <a href="/writing/data-pipeline-materialized-view">materialized view</a>. To recap how we got here: A general data assertion or constraint can be modeled as a query with a name. A named query is a view. We need to persist the output of this view in order to check it quickly, meaning we need a materialized view. And when the view output changes, we want a way to efficiently update it without recomputing the whole view. So we conclude that – in its essence – <strong>maintaining a data constraint is the same problem as maintaining a materialized view.</strong></p>

<!--
[^hello-mz]
[^hello-mz]: Hello [Materialize][mz]! This seems right up your alley.
-->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/query-constraint-language/pipeline-with-constraint-nodes.png" width="600" />
    </span>
    <figcaption>
    If you've modeled your data platform as a <a href="/writing/data-pipeline-materialized-view">graph of queries</a>, then a constraint is simply another node in this graph with a special property: If any update to the graph violates the constraint, that update is rolled back. In this example, updates to either dataset A or B will trigger an update to constraint C. If C is violated, the updates to A and B are rolled back.
    </figcaption>
</figure>
</div>

<h2 id="so-what">So what?</h2>

<p>So data constraints can be modeled as queries, and efficiently checking arbitrary constraints is the same problem as efficiently updating a materialized view. What does that get us? If nothing else, understanding how these seemingly separate ideas are deeply connected provides some mental clarity.</p>

<p>I find that valuable for its own sake, but there are perhaps some practical benefits we can derive from this understanding, too.
The main one is this: if your data must conform to some constraints or <em><a href="/writing/how-not-to-die-hard-with-hypothesis">invariants</a></em> – that is, things that must always be true (for example, there must always be at least one on-call doctor) – then express them declaratively. That could be in SQL, some DataFrame API, Datalog, or something else.</p>

<p>Your data platform may not support realtime enforcement of complex constraints, but you can still periodically check for constraint violations using these expressive, high-level queries written in a declarative language. Declarative checks like this would be much easier to write, understand, and maintain compared to relatively low-level, imperative data tests. And you can run them against your real production data to find problems, not just against test data.</p>

<p>There already are modern systems for applying constraints or validation checks to data lakes, like <a href="https://github.com/awslabs/deequ">Deequ</a> and <a href="https://greatexpectations.io">Great Expectations</a>. These systems offer their own <a href="https://docs.greatexpectations.io/docs/guides/expectations/contributing/how_to_contribute_a_new_expectation_to_great_expectations/#1-choose-a-parent-class-to-help-your-implementation">custom APIs</a> for expressing checks, instead of building on widespread query languages like SQL or DataFrames. I feel this is a missed opportunity. On the other hand, I suspect at least one benefit they have derived from building custom APIs is to make it easier to compute groups of validation checks <a href="https://github.com/awslabs/deequ/blob/d243a7c592e30d0422c97988d1c5313c47c0eee0/src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala">efficiently</a>, e.g. by avoiding repeated scans of the same data that may be referenced by multiple constraints.</p>

<p>Other projects that are already based on building graphs of declarative queries – like <a href="https://www.getdbt.com">dbt</a>, <a href="https://materialize.com">Materialize</a>, <a href="https://ksqldb.io">ksqlDB</a>, and <a href="https://databricks.com/product/delta-live-tables">Delta Live Tables</a> – could likewise take advantage of the idea laid out in this post. Some of these projects already allow users to define row-level constraints on a dataset, analogous to the typical <a href="https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS">SQL <code class="language-plaintext highlighter-rouge">CHECK</code> constraint</a>. But what if you could declare the whole dataset itself to be a constraint? Whenever an update to the pipeline triggers a refresh of the dataset, the constraint is rechecked.</p>

<h2 id="wrapping-up">Wrapping Up</h2>

<p>Model your <a href="/writing/modern-data-lake-database">data platform as a database</a>. Use a declarative query language to interact with it. If you do both of these things, you can then use that same language to define the constraints on your data platform. That’s the message of this post.</p>

<p>This idea is not new. Many teams run their data platform on a traditional relational database where this idea fits most naturally. In fact, Oracle admins have literally been able to implement <a href="https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html">complex constraints as materialized views</a> – perhaps the most direct application of this post’s idea – since the early 2000s.</p>

<p>But you don’t have to be running on an actual relational database to take inspiration from this idea. Declarative query languages have stood the test of time and been reimplemented for the modern data lake. You can use them today to describe your data in ways that a simple <code class="language-plaintext highlighter-rouge">CHECK</code> constraint cannot capture. And hopefully, in the near future we will see more direct support for efficiently maintained, general data constraints.</p>

<p><em>Thanks to Matthew for reading a draft of this post.</em></p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:validation" role="doc-endnote">
      <p>The difference being that <em>constraints</em> prevent rule violations upfront, while <em>validation checks</em> identify violations after they have happened. <a href="#fnref:validation" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Martin Kleppmann, Designing Data-Intensive Applications, Chapter 7: Transactions, “Write Skew and Phantoms”, p.246-247 in the fourth release of the first edition. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:non-deter" role="doc-endnote">
      <p>Constraints that make use of non-deterministic functions like <a href="https://www.postgresql.org/docs/14/functions-datetime.html#FUNCTIONS-DATETIME-CURRENT"><code class="language-plaintext highlighter-rouge">CURRENT_TIME</code></a> present added complications that I won’t get into in this post. But they matter especially when we consider how to efficiently maintain constraints. <a href="#fnref:non-deter" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rdb" role="doc-endnote">
      <p>A user posting to a PostgreSQL mailing list back in 2006 <a href="https://www.postgresql.org/message-id/87sljd7gbn.fsf%40wolfe.cbbrowne.com">reported</a> that a database called Rdb supported assertions. But I cannot find mention of assertions in the <a href="https://www.oracle.com/database/technologies/related/rdb-doc.html#release73">Rdb 7.3 SQL reference manual</a>, which was released in late 2018 by Oracle. <a href="#fnref:rdb" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Nicholas Chammas</name></author><category term="data" /><category term="data-pipelines" /><summary type="html"><![CDATA[What’s the difference between a data constraint and a data query? Is there anything that can be expressed in one form but not the other? My sense is that there is no such thing.]]></summary></entry><entry><title type="html">A Data Pipeline is a Materialized View</title><link href="https://nchammas.com/writing/data-pipeline-materialized-view" rel="alternate" type="text/html" title="A Data Pipeline is a Materialized View" /><published>2021-01-23T00:00:00+00:00</published><updated>2021-01-23T00:00:00+00:00</updated><id>https://nchammas.com/writing/data-pipeline-materialized-view</id><content type="html" xml:base="https://nchammas.com/writing/data-pipeline-materialized-view"><![CDATA[<p>Say you run an online book store and want to build a data pipeline that figures out who the top-selling authors are. Logically, the input to the pipeline is a log of every individual book purchase on the store for all time, along with details about each book like who authored it. And the output is a list of the top-selling authors per month.</p>

<!-- [image: book purchases + authorship info -> top-selling authors of the month] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/top-selling-authors.png" width="1000" />
    </span>
    <figcaption>
    </figcaption>
</figure>
</div>

<p>The output of this data pipeline is a function of the input. In other words, the output is derived from the input by running the input through the pipeline.</p>

<!-- [image: f(input) -> output] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/f-input-output.png" width="400" />
    </span>
    <figcaption>
    </figcaption>
</figure>
</div>

<p>This is an important characteristic of the output. As long as the input data and pipeline transformations (i.e. the pipeline code) are preserved, the output can always be recreated. The input data is <em>primary</em>; if lost, it cannot be replaced. The output data, along with any intermediate stages in the pipeline, are <em>derivative</em>; they can always be recreated from the primary data using the pipeline.</p>
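<p>As a minimal Python sketch of this idea (the dataset shape and function names are invented for illustration, not from any real system), the output really is just a function of the input, so it can be recreated at will:</p>

```python
from collections import Counter

def top_selling_authors(purchases, month):
    """Derive the output (top-selling authors) purely from the input (purchase log)."""
    sales = Counter(p["author"] for p in purchases if p["month"] == month)
    return [author for author, _ in sales.most_common()]

purchases = [
    {"author": "Eric Carle", "month": "2021-01"},
    {"author": "Eric Carle", "month": "2021-01"},
    {"author": "Dr. Seuss",  "month": "2021-01"},
]

# The output is derivative: wipe it out, and the same call recreates it
# exactly, as long as the primary data (purchases) is preserved.
print(top_selling_authors(purchases, "2021-01"))  # ['Eric Carle', 'Dr. Seuss']
```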

<h2 id="the-logical-view">The Logical View</h2>

<p>Let’s represent our hypothetical “Top-Selling Authors” pipeline as a directed graph, where the nodes represent datasets and the edges represent transformations of those datasets. Furthermore, let’s color each dataset in the graph based on whether it’s primary or derivative.</p>

<!-- [image: colored graph of transformations] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/pipeline-graph.png" width="500" />
    </span>
    <figcaption>
    </figcaption>
</figure>
</div>

<p>Most data pipelines, if you zoom out far enough, look something like this. You have some source data; it gets sliced, diced, and combined in various ways to produce some outputs. If someone were to wipe out all the derived data in this pipeline, you’d be able to regenerate it without any data loss. The pipeline could include any number of arbitrary steps, like copying files from an FTP share, or scraping data from a web page. It doesn’t matter as long as the pipeline produces the same output when given the same input.</p>

<p>Any time someone queries the output of the pipeline, it’s logically equivalent to them running the entire pipeline on the source data to get the output they’re looking for. In this way, a pipeline is a <a href="https://docs.microsoft.com/en-us/sql/relational-databases/views/views?view=sql-server-ver15">view</a> into the source data.</p>

<h2 id="materializing-the-view">Materializing the View</h2>

<p>Of course, data pipelines don’t work this way in practice. It would be a waste of resources and a long wait for users if every query triggered a series of computations stretching all the way back to the primary data. When you ask for this month’s top-selling authors, you expect a quick response.</p>

<p>Hence, the typical real-world pipeline <em>materializes</em> its output, and often also several of the intermediate datasets required to produce that final output. Materializing a dataset simply means saving it to persistent storage, as opposed to repeatedly computing it on the fly. So when you ask for that list of authors, whatever system is answering your query can start from the closest materialized dataset, as opposed to starting at the source or primary data.</p>

<!-- [image: cached nodes in graph colored; key explaining colors; caption] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/pipeline-with-cached-nodes.png" width="500" />
    </span>
    <figcaption>
    A query against dataset B only needs to recompute the pipeline starting from A, since A is materialized.
    All derivative datasets, whether materialized or not, can be thrown away and recreated from the primary data.
    </figcaption>
</figure>
</div>

<p>So we’ve turned our view into a <em>materialized view</em>. “View” represents the logical transformations expressed in the pipeline. “Materialized” represents the fact that we cache the output of the pipeline, and perhaps also some of the intermediate steps. A complex set of interdependent data pipelines can be conceptualized in this way, as a graph of materialized views.</p>
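<p>Here is a toy Python sketch of that idea (all names are invented for illustration): a small graph of datasets where the intermediate node A is materialized, so a query against B never has to reach back to the primary data:</p>

```python
# A toy graph of datasets: each derived node has parent datasets and a
# transform. Materialized nodes cache their result; others are computed
# on the fly by walking back toward the primary data.
source = [1, 2, 3, 4]             # primary data

graph = {
    "A": (["source"], lambda parents: [x * 10 for x in parents[0]]),
    "B": (["A"],      lambda parents: sum(parents[0])),
}
cache = {}                        # node name -> materialized result

def compute(node):
    if node == "source":
        return source
    if node in cache:             # start from the closest materialized dataset
        return cache[node]
    parents, transform = graph[node]
    return transform([compute(p) for p in parents])

cache["A"] = compute("A")         # materialize the intermediate dataset A
print(compute("B"))               # 100 -- computed from A's cache, not from source
```

<p>If the cache is lost, <code class="language-plaintext highlighter-rouge">compute</code> simply falls through to the primary data, which is exactly the derivative-versus-primary distinction from earlier.</p>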

<p>Note that this concept can be applied very broadly, and not just to what we think of as “normal” data pipelines:</p>
<ul>
  <li>A traditional web cache alleviates read traffic from the primary database, which is the source of truth. The cache is derivative and can be regenerated from the database at any time. The data in the cache is materialized so that incoming queries do not need to go all the way back to the database to get an answer.</li>
  <li>A build system compiles or assembles source code into artifacts like executables or test reports. The artifacts are derivative, whereas the source code is primary. When you run a program over and over, you reuse the artifacts output by your build system, as opposed to recompiling them from source every time.</li>
</ul>

<h2 id="updating-a-materialized-view">Updating a Materialized View</h2>

<p>Materializing the output, though a practical necessity for most pipelines, adds an administrative cost. When the source data changes, the materialized views need to be updated. Otherwise, the data you get from the view will be <em>stale</em>.</p>

<!-- [image: highlighted new source row; out-of-date output aggregation] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/update-materialized-view.png" width="600" />
    </span>
    <figcaption>
    The sales total of 101 for Eric Carle is stale. The correct value is now 103.
    </figcaption>
</figure>
</div>

<p>To update a materialized view, there are two high-level properties you typically care about: the update <em>trigger</em>, and the update <em>granularity</em>. The former affects the freshness of your output, which impacts end-users of the data, and the latter affects the performance of your update process, which impacts the engineers or operators responsible for that process.</p>

<h3 id="update-trigger">Update Trigger</h3>

<p>The update trigger is the event that prompts a refresh of the materialized view—e.g. by running your pipeline against the latest source data.</p>

<p>That event may be a file landing in a shared drive, or some data arriving on an event stream, or another pipeline completing. For some pipelines, the update trigger may just be a certain time of day, in which case it might be more useful to talk about the update <em>frequency</em> rather than trigger.</p>

<p>A typical batch pipeline, for example, might run on a daily or hourly cadence, whereas a streaming pipeline may run every few seconds or minutes, or whenever a new event is delivered via some sort of event stream. Whenever the pipeline runs, it updates its output, and the whole process can be viewed as a <em>refresh</em> of the materialized view.</p>

<h3 id="update-granularity">Update Granularity</h3>

<p>The update granularity refers to how much of the materialized view needs to be modified to account for the latest changes to the source data.</p>

<p>A common update granularity is the full refresh. No matter how small or large the change to the source data, when the pipeline runs it throws away the entire output table and rebuilds it from scratch.</p>

<p>A more sophisticated pipeline might rebuild only a subset of the table, like a date partition. And an extremely precise pipeline may know how to update exactly the output rows that are impacted by the latest changes to the source data.</p>

<p>The update trigger and granularity are independent. You can have a pipeline that runs every second and does a full refresh of its output, and you can have a pipeline that runs once a day but carefully updates only the rows that it needs to.</p>

<h3 id="typical-examples">Typical Examples</h3>

<p>Let’s explore these two properties a bit using our example pipeline that computes the top-selling authors of the month.</p>

<h4 id="the-daily-batch-update">The Daily Batch Update</h4>

<p>Every night at 1 a.m., an automated process looks for a dump of the latest purchases from the previous day. The dump is a compressed CSV file.</p>

<p>The update process uses this dump to recompute the month’s sales numbers for all authors. It replaces the entire output table with all-new calculations for all authors. Many of the authors’ numbers may not have changed since the last update (because they had no new sales in that time period), but they all get recomputed nonetheless.</p>

<p>This is a very typical example of a batch pipeline. It has a scheduled update trigger at 1 a.m. every night, and an update granularity of the entire output.</p>
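<p>A rough Python sketch of such a nightly job might look like the following (the CSV layout and function names are invented for illustration):</p>

```python
import csv
import io

# The "materialized view": monthly sales totals per author.
output_table = {}

def nightly_full_refresh(purchase_dump_csv):
    """Rebuild the entire output table from the dump, even for authors
    whose numbers haven't changed since the last run."""
    totals = {}
    for row in csv.DictReader(io.StringIO(purchase_dump_csv)):
        totals[row["author"]] = totals.get(row["author"], 0) + 1
    output_table.clear()
    output_table.update(totals)   # full refresh: every row is recomputed

dump = "author\nEric Carle\nEric Carle\nDr. Seuss\n"
nightly_full_refresh(dump)
print(output_table)               # {'Eric Carle': 2, 'Dr. Seuss': 1}
```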

<h4 id="the-live-updating-table">The Live-Updating Table</h4>

<p>In this version of our top-selling authors pipeline, individual purchases are streamed in as they happen, via an event streaming platform like Apache Kafka. Every purchase on this stream triggers an update to the calculation of top-selling authors.</p>

<p>The update process uses each individual purchase to incrementally recompute the sales total for the relevant author. If an author has no new sales over a given span of updates, their sales total is not recomputed (though their rank in the top-selling authors may need to be updated).</p>

<p>This is an example of a precise streaming pipeline. The update trigger is the purchase event that is streamed in, and the update granularity is the sales total for a single author.</p>
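<p>Sketched in Python (again with invented names and data), the incremental update touches only the row affected by each event. This example deliberately mirrors the figure above, where Eric Carle's stale total of 101 becomes 103:</p>

```python
sales_totals = {"Eric Carle": 101, "Dr. Seuss": 87}

def on_purchase_event(event):
    """Incrementally update only the one row affected by this event."""
    author = event["author"]
    sales_totals[author] = sales_totals.get(author, 0) + 1

# Two purchase events arrive on the stream; only Eric Carle's row is touched.
on_purchase_event({"author": "Eric Carle"})
on_purchase_event({"author": "Eric Carle"})
print(sales_totals["Eric Carle"])  # 103
print(sales_totals["Dr. Seuss"])   # 87 -- never recomputed
```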

<h2 id="the-declarative-data-lake">The Declarative Data Lake</h2>

<p>We previously discussed the idea of conceptualizing your <a href="/writing/modern-data-lake-database">data lake as a database</a>. And here we’ve shown how you can conceptualize your data pipelines as materialized views.</p>

<p>But what if we could take this idea further than just as a conceptual tool? What if you could actually implement your data pipelines as a graph of materialized views?</p>

<p>Taken far enough, the promise of such an idea would be to build a <em>declarative data lake</em>, where the code that manages the lake focuses more on defining <em>what</em> the datasets are and less on <em>how</em> to mechanically build or update them.</p>

<p>Two relatively new projects express aspects of this vision in clear but different ways, and they merit some discussion here: <a href="https://www.getdbt.com">dbt</a> and <a href="https://materialize.com">Materialize</a>.</p>

<h3 id="dbt-pipelines-as-batch-updated-sql-queries">dbt: Pipelines as Batch-Updated SQL Queries</h3>

<p>The core of <a href="https://www.getdbt.com">dbt</a> is an engine for building <a href="https://docs.getdbt.com/docs/introduction#what-makes-dbt-so-powerful">a graph of SQL queries</a>. Parts of any given query can be generated dynamically using a templating language (<a href="https://docs.getdbt.com/tutorial/using-jinja/">Jinja</a>), and queries can reference other queries.</p>

<p>Every query has a configured materialization strategy, which defines whether the results of the query are generated ahead of time, and if so, how they are stored and updated.</p>

<p>If the results are materialized, they can be updated with a full refresh or <a href="https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models/#understanding-incremental-models">incrementally</a>, though there are some restrictions on what kinds of updates can be done incrementally. Updates are typically triggered on a schedule.</p>

<h3 id="materialize-pipelines-as-live-updated-materialized-views">Materialize: Pipelines as Live-Updated Materialized Views</h3>

<p><a href="https://materialize.com">Materialize</a> is an engine for building live, incrementally updated materialized views from streaming sources like Apache Kafka. A view can reference other live-updated views, as well as fixed tables.</p>

<p>The primary interface for creating these views is plain and elegant: A <a href="https://materialize.com/docs/sql/create-materialized-view/"><code class="language-plaintext highlighter-rouge">CREATE MATERIALIZED VIEW</code></a> SQL statement.</p>

<p>Conceptually, this is roughly the same statement that is available in 
<a href="https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/CREATE-MATERIALIZED-VIEW.html">traditional</a>
<a href="https://www.postgresql.org/docs/current/sql-creatematerializedview.html">relational</a>
<a href="https://docs.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views">databases</a>.
Materialize’s implementation, however, allows for very efficient incremental updates against very <a href="https://materialize.com/joins-in-materialize/">flexible and expressive</a> queries. Materialize’s capabilities are based on relatively <a href="https://timelydataflow.github.io/differential-dataflow/introduction.html">new research</a> done by its creators.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The ideas presented in this post are not new. But materialized views never saw widespread adoption as a primary tool for building data pipelines, likely due to their <a href="https://stackoverflow.com/a/25642149/877069">limitations</a> and ties to relational database technologies. Perhaps with this new wave of tools like dbt and Materialize we’ll see materialized views used more heavily as a primary building block in the typical data pipeline.</p>

<p>Regardless of whether we see that kind of broad change, materialized views are still a useful design tool for conceptualizing what we are doing when we build data pipelines.</p>

<p>Get clear on what data is primary and what is derivative. Map your pipeline to the concept of a graph of transformations with materialized, intermediate datasets, each with a specific update trigger and update granularity.</p>

<p>The exercise should help bring some conceptual order to even the messiest pipelines.</p>

<p><em>Read the discussion about this post <a href="https://news.ycombinator.com/item?id=26217911">on Hacker News</a>.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="data-pipelines" /><category term="databases" /><summary type="html"><![CDATA[Say you run an online book store and want to build a data pipeline that figures out who the top-selling authors are. Logically, the input to the pipeline is a log of every individual book purchase on the store for all time, along with details about each book like who authored it. And the output is a list of the top-selling authors per month.]]></summary></entry><entry><title type="html">Thoughts on States vs. Transitions</title><link href="https://nchammas.com/writing/states-vs-transitions" rel="alternate" type="text/html" title="Thoughts on States vs. Transitions" /><published>2020-09-27T00:00:00+00:00</published><updated>2020-09-27T00:00:00+00:00</updated><id>https://nchammas.com/writing/states-vs-transitions</id><content type="html" xml:base="https://nchammas.com/writing/states-vs-transitions"><![CDATA[<p>Many years ago, working as a database developer at a video game company, I was tasked with designing the database behind an in-game wallet service. The wallet service would store each player’s current balance and transaction history. In other words, it was a <a href="https://dba.stackexchange.com/q/5608/2660">simple banking database</a>.</p>

<p>Working on that problem gave me a hands-on introduction to several recurring themes in software engineering. One of those themes is the divide between two ways of thinking about data: thinking about the <strong>state</strong> of something, and thinking about the <strong>transition</strong> of something from one state to another.</p>

<div style="display: flex; justify-content: center; align-items: center;">
  <div style="text-align: center; margin-right: 20px;">
  <figure>
    <span>
      <img src="/assets/images/states-vs-transitions/balances.png" width="130" />
    </span>
    <figcaption>
    The state of Alice's balance.
    </figcaption>
  </figure>
  </div>

  <div style="text-align: center; margin-left: 20px;">
  <figure>
    <span>
      <img src="/assets/images/states-vs-transitions/transactions.png" width="300" />
    </span>
    <figcaption>
    The transitions that led to Alice's balance.
    </figcaption>
  </figure>
  </div>
</div>

<p>The <em>state</em> of something describes <em>what</em> it is at a specific point in time, and a <em>transition</em> describes a change from one state to another—i.e. <em>how</em> the state changed at a particular point in time. There are many other ways to describe the same concepts, but I’ll stick with <em>state</em> and <em>transition</em> for this post. In this player wallet scenario, a player’s balance is a piece of state, whereas the individual transactions that the player makes are transitions on that state.</p>

<p>In this post I’d like to share all the common threads I’ve found in software design and data management once I started to think about this divide.</p>

<h2 id="perspective-state-first-or-transition-first">Perspective: State-First or Transition-First</h2>

<p>When you have a problem that requires you to manage both states and transitions, like in the player wallet example, one of those ways of thinking about the problem will tend to dominate. What I’ve noticed is that whichever perspective dominates, you end up needing to build automatic and efficient ways of deriving the other representation of your problem.</p>

<ul>
  <li>If states are the primary way you think of a problem, then when a piece of state changes you need to automatically derive the transitions that implement that change.</li>
  <li>If transitions are the primary way you think of a problem, then as the number of transitions grows large you need to build a way to efficiently query the state at a specific point in time (which is typically “right now”).</li>
</ul>

<p>Our wallet scenario is a typical “transitions-first” problem. The primary data are the individual transactions against a balance, since that corresponds most naturally to the activity we’re capturing, so we need to build an efficient way to derive the current balance from the history of transactions.</p>
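<p>As a minimal Python sketch of this transitions-first shape (the account data is invented for illustration), the balance is just a fold over the transaction history:</p>

```python
transactions = [
    {"account": "alice", "amount": +50},
    {"account": "alice", "amount": -20},
    {"account": "alice", "amount": +5},
]

def balance(account, history):
    """Derive the state (a balance) by folding over the transitions
    (the transactions)."""
    return sum(t["amount"] for t in history if t["account"] == account)

print(balance("alice", transactions))  # 35

# A real system would also materialize a running balance so that reads
# don't have to scan the full history every time.
```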

<p>There are many problems that fit this “transitions-first” pattern, as well as problems that fit the “state-first” pattern. Let’s look at some examples of each, and see how in each case the secondary way of thinking about the problem needs to be automatically derived.</p>

<h2 id="state-first-thinking">State-First Thinking</h2>

<h3 id="infrastructure-management">Infrastructure Management</h3>

<p><a href="https://www.terraform.io">Terraform</a> is a tool for managing infrastructure that uses declarative configuration files to describe the infrastructure under its control. For example, here is a <a href="https://learn.hashicorp.com/terraform/getting-started/build#configuration">simple configuration</a> that describes a single EC2 instance running on AWS:</p>

<div class="language-terraform highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">provider</span> <span class="s2">"aws"</span> <span class="p">{</span>
  <span class="nx">profile</span> <span class="p">=</span> <span class="s2">"default"</span>
  <span class="nx">region</span>  <span class="p">=</span> <span class="s2">"us-east-1"</span>
<span class="p">}</span>

<span class="k">resource</span> <span class="s2">"aws_instance"</span> <span class="s2">"example"</span> <span class="p">{</span>
  <span class="nx">ami</span>           <span class="p">=</span> <span class="s2">"ami-00b882ac5193044e4"</span>
  <span class="nx">instance_type</span> <span class="p">=</span> <span class="s2">"t2.micro"</span>

  <span class="nx">tags</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">Name</span>  <span class="p">=</span> <span class="s2">"TerraformExample"</span>
    <span class="nx">Owner</span> <span class="p">=</span> <span class="s2">"Nick"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The configuration doesn’t explain <em>how</em> to create this infrastructure. It simply describes <em>what</em> the infrastructure is. In other words, it describes the <em>state</em> of the infrastructure.</p>

<p>When you deploy this configuration, Terraform compares the desired configuration against what is already out there and automatically figures out what operations are required to change the deployed infrastructure to match the configuration. To use the terminology we’re using in this post: The user specifies the desired infrastructure <em>state</em>, and Terraform automatically derives the required <em>transitions</em> to bring about that state.</p>

<p>So when you first deploy the above configuration against an empty environment (at least to Terraform’s knowledge), Terraform reports what actions it will take to bring about the specified infrastructure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.example will be created
  + resource "aws_instance" "example" {
      + ami                          = "ami-00b882ac5193044e4"
      ...
      + tags                         = {
          + "Name"  = "TerraformExample"
          + "Owner" = "Nick"
        }
      ...

Plan: 1 to add, 0 to change, 0 to destroy.
</code></pre></div></div>

<p>Suppose that after you deploy this infrastructure, you update your configuration to change the tags on the instance:</p>

<div class="language-terraform highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"aws_instance"</span> <span class="s2">"example"</span> <span class="p">{</span>
  <span class="p">...</span>

  <span class="nx">tags</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">Name</span>        <span class="p">=</span> <span class="s2">"TerraformExample"</span>
    <span class="nx">Owner</span>       <span class="p">=</span> <span class="s2">"Bob"</span>
    <span class="nx">Environment</span> <span class="p">=</span> <span class="s2">"Dev"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When you instruct Terraform to deploy this change, it figures out how to modify the existing infrastructure to match your desired configuration, even though you haven’t specified <em>how</em> it would do that, only <em>what</em> infrastructure you want at the end of the day:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # aws_instance.example will be updated in-place
  ~ resource "aws_instance" "example" {
        ami                          = "ami-00b882ac5193044e4"
        ...
      ~ tags                         = {
          + "Environment" = "Dev"
            "Name"        = "TerraformExample"
          ~ "Owner"       = "Nick" -&gt; "Bob"
        }
        ...

Plan: 0 to add, 1 to change, 0 to destroy.
</code></pre></div></div>

<p>As you can see in the plan, Terraform detects that only parts of the infrastructure need to be changed, and it figures out exactly how to change them to match the desired state.</p>

<p>As a Terraform user, you think about infrastructure primarily as what you want its current state to be, and Terraform figures out for you how to transition your infrastructure to match your desired configuration.</p>

<p>This seems like a natural way to approach infrastructure management, but you can probably imagine how a transitions-first approach to this problem would play out. Instead of specifying “I want one instance” in your configuration, you’d say “Add one instance”, and so on. You’d then have to carefully run just the appropriate steps as they are needed, or somehow build idempotency into each operation so it’s safe to rerun them without careful pre-planning. Otherwise, you’d likely end up creating duplicate infrastructure or making unwanted changes.</p>

<h3 id="database-schema-migrations">Database Schema Migrations</h3>

<p>Relational databases are typically tightly coupled to the applications they back. As an application changes, the database schema backing it often also needs to change. But where an application can be updated with a simple code push, the database needs more care because it’s stateful. In other words, it’s carrying all this data that you want to maintain; you typically don’t want to update your database schema by dropping the whole database and redeploying it from scratch, which is in effect what you do when you deploy a new version of your application—replace the old application code entirely with the new. Instead, you want to migrate the database schema in-place, preserving all the data.</p>

<p>When I worked as a database developer, one of my tasks was to plan and execute migrations like this. I’d compare the current database schema against the new one that needed to be deployed, and hand craft a migration script that would <code class="language-plaintext highlighter-rouge">ALTER</code> tables and make any other necessary changes to mutate the schema as needed.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Version 1</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">person</span><span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">first_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">),</span>
  <span class="n">last_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">)</span>
<span class="p">);</span>

<span class="c1">-- Version 2</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">person</span><span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">first_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">),</span>
  <span class="n">last_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">),</span>
  <span class="n">birth_date</span> <span class="nb">DATE</span>
<span class="p">);</span>

<span class="c1">-- Derived v1-&gt;v2 migration script</span>
<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">person</span>
<span class="k">ADD</span> <span class="k">COLUMN</span> <span class="n">birth_date</span> <span class="nb">DATE</span><span class="p">;</span>
</code></pre></div></div>

<p>Every release of an application had an associated database schema as well as a migration script to upgrade a database from the previous schema version. The full database schema at a given version was the primary description of the database, and the migration script was derived from the comparison of the full schema at two different versions.</p>

<p>To fit this into the common thread I’m tracing in this post, database schemas fit the state-first mode of thinking. The state of your database schema at a given version is primary (i.e. what the schema is), and the transitions from one schema version to another are secondary (i.e. how to get the schema to that state).</p>

<p>There are tools for approaching database schemas in this fashion, like <a href="https://www.red-gate.com/products/sql-development/sql-compare/">Redgate SQL Compare</a> for SQL Server and <a href="https://github.com/facebookincubator/OnlineSchemaChange">OnlineSchemaChange</a> for MySQL. You give these tools two full schemas, and they compute the appropriate migration script. There are some risks to performing automatic migrations in this fashion. There may be semantic changes to your schema or strict availability requirements that an automated schema migration tool cannot satisfy without human input. But I think these tools address the problem of schema migrations in a conceptually natural way.</p>
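<p>The core idea behind such tools can be sketched in a few lines of Python (a deliberately naive toy, nothing like the real products, which must also handle renames, type changes, and availability constraints): compare two schema states and derive the transitions between them.</p>

```python
# Two versions of a table schema, each described as state (column -> type).
v1 = {"id": "INT", "first_name": "VARCHAR(200)", "last_name": "VARCHAR(200)"}
v2 = {**v1, "birth_date": "DATE"}

def derive_migration(table, old, new):
    """Derive ALTER statements (transitions) by diffing two schema states."""
    statements = []
    for col in new.keys() - old.keys():
        statements.append(f"ALTER TABLE {table} ADD COLUMN {col} {new[col]};")
    for col in old.keys() - new.keys():
        statements.append(f"ALTER TABLE {table} DROP COLUMN {col};")
    return statements

print(derive_migration("person", v1, v2))
# ['ALTER TABLE person ADD COLUMN birth_date DATE;']
```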

<h3 id="source-control-commits">Source Control: Commits</h3>

<p>Consider git. Depending on what you’re doing, git seamlessly moves between a state-first and transition-first view of the world. One of the primary things you do with git is create new commits to capture changes to your codebase. When you create a commit, git <em>derives</em> the diff for the commit by comparing the current state of your codebase against its state at the most recent commit.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/.travis.yml b/.travis.yml
index c7938b6..c1a3c91 100644
</span><span class="gd">--- a/.travis.yml
</span><span class="gi">+++ b/.travis.yml
</span><span class="p">@@ -1,6 +1,5 @@</span>
 language: python
 python:
<span class="gd">-  - "3.4"
</span>   - "3.5"
   - "3.6"
 # Work-around for Python 3.7 on Travis CI pulled from here:
</code></pre></div></div>

<p>In other words, you as the developer focus simply on what you want your code to look like now, and git figures out for you how to capture that as an incremental change from the most recently committed state of the code. You specify the desired <em>state</em> of the code, and git computes the <em>transition</em> from one state of the code to the other.</p>
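<p>Python’s standard-library <code class="language-plaintext highlighter-rouge">difflib</code> module makes it easy to sketch this state-to-transition derivation, using the same file contents as the diff shown above:</p>

```python
import difflib

# Two states of a file; we only ever record the states.
old_state = ['language: python', 'python:', '  - "3.4"', '  - "3.5"']
new_state = ['language: python', 'python:', '  - "3.5"']

# Derive the transition (a unified diff) by comparing the two states,
# much as git does when you create a commit.
diff = list(difflib.unified_diff(old_state, new_state, lineterm=""))
print("\n".join(diff))
```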

<p>The derived transitions have a number of uses which you are probably familiar with. We’ll take a look at some of them in the next section.</p>

<h2 id="transition-first-thinking">Transition-First Thinking</h2>

<p>In contrast to these examples of state-first thinking, we have transition-first thinking. This form of thinking naturally fits many real-world problems that are oriented around capturing or responding to events: a customer bought something; a user added a comment; a player made a move.</p>

<p>We saw how bank transactions fit this way of thinking; the individual transactions are primary, and the account balance is derived from those transactions. Let’s take a look at a few more examples of transition-first thinking and see how the state of the world ends up being derived from those transitions.</p>

<h3 id="social-media-activity">Social Media Activity</h3>

<p>“Like, comment, and subscribe!” – a common refrain on social media platforms – also captures a straightforward example of a transition-first problem. Users each add individual likes to a post, for example. Each such action is recorded in the backing database. That’s the primary record of what happened.</p>

<div style="text-align: center;">
<figure>
  <span>
    <img src="/assets/images/states-vs-transitions/social-media.png" width="600" />
  </span>
    <!-- <figcaption>
    </figcaption> -->
</figure>
</div>

<p>When people view a given post, however, they don’t see all the individual likes (at least not by default). What they see is a summary which is derived from all the individual likes and represents the current state of the total. From the perspective of the backend database, the individual likes are primary; the derived total is secondary.</p>
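<p>A minimal Python sketch of this shape (with invented data): each like is stored as a primary event, and the displayed total is derived from those events on demand.</p>

```python
# Primary record: one event per individual like.
like_events = [
    {"post": 42, "user": "alice"},
    {"post": 42, "user": "bob"},
    {"post": 7,  "user": "alice"},
]

def like_count(post_id):
    """Derive the displayed total (state) from the individual likes (transitions)."""
    return sum(1 for e in like_events if e["post"] == post_id)

print(like_count(42))  # 2
```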

<h3 id="turn-based-games">Turn-Based Games</h3>

<p>Games like chess are well represented as a sequence of transitions. Each player takes a turn to make a move, and at any point in time the state of the game can be derived from the history of moves that have been made leading up to that point.</p>

<div style="text-align: center; float: right;">
<figure>
    <img src="/assets/images/states-vs-transitions/deepblue-kasparov-1996-game1.png" width="300" />
    <!-- <figcaption>
    </figcaption> -->
</figure>
</div>

<p>These are the first 10 moves of each player in <a href="https://en.wikipedia.org/wiki/Deep_Blue_versus_Kasparov,_1996,_Game_1">Game 1 of the 1996 match between Garry Kasparov and Deep Blue</a>.</p>

<ol>
  <li>e4 c5</li>
  <li>c3 d5</li>
  <li>exd5 Qxd5</li>
  <li>d4 Nf6</li>
  <li>Nf3 Bg4</li>
  <li>Be2 e6</li>
  <li>h3 Bh5</li>
  <li>0-0 Nc6</li>
  <li>Be3 cxd4</li>
  <li>cxd4 Bb4</li>
</ol>

<p>The state of the board shown in the image is derived from this sequence of moves.</p>

<h3 id="image-editing">Image Editing</h3>

<div style="text-align: center; float: right;">
<figure>
    <span>
        <img src="/assets/images/states-vs-transitions/pixlr-edit-history.png" width="300" />
    </span>
    <figcaption>
    Edit history from the <a href="https://pixlr.com/e/">Pixlr E</a> photo editor.
    </figcaption>
</figure>
</div>

<p>Image editing tools like Photoshop track your edits to the image you are working on, allowing you to easily undo changes or quickly flip between past image states.</p>

<p>You can think of the current state of the image as being derived from the history of edits to the original image. Each edit is a transition from one image state to another. The edits are the primary activity the user engages in, and the resulting image follows from them.</p>
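<p>A toy sketch makes the replay idea concrete. If we model the “image” as a simple dict of properties and each edit as a pure transition function (the edit operations here are hypothetical stand-ins for real Photoshop operations), then undo is nothing more than replaying a shorter prefix of the edit history:</p>

```python
from functools import reduce

# The "image" is a dict of properties; each edit is a pure transition
# function from one image state to the next. (The edits here are
# hypothetical stand-ins for real editor operations.)
original = {"brightness": 0, "cropped": False}

edits = [
    lambda img: {**img, "brightness": img["brightness"] + 10},
    lambda img: {**img, "cropped": True},
    lambda img: {**img, "brightness": img["brightness"] - 5},
]

def state_after(n):
    """Derive the image state after the first n edits.
    Undo is just replaying a shorter prefix of the history."""
    return reduce(lambda img, edit: edit(img), edits[:n], original)

current = state_after(len(edits))  # {'brightness': 5, 'cropped': True}
undone = state_after(1)            # {'brightness': 10, 'cropped': False}
```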

<h3 id="source-control-code-review">Source Control: Code Review</h3>

<p>This is the other side of source control systems like git. Git lets you focus on the state of your code when you make a commit, and derives from that the transition from one state of the code to another. But after that, almost everything else in git starts with the transitions first and derives from those the state of the code.</p>

<p>A great example of a transitions-first interface is what happens when you submit a code change for review, like a GitHub Pull Request.</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/states-vs-transitions/pull-request-diff.png" width="600" />
    </span>
    <figcaption>
    A pull request diff from the <a href="https://github.com/apache/spark/pull/29510/files">Apache Spark project</a>.
    </figcaption>
</figure>
</div>

<p>Instead of presenting the entire code base at once, the pull request focuses on a set of code changes – a “diff” – that a person can review in isolation. Focusing on this limited transformation of the code is what makes code review practical. And when a pull request gets merged in, git uses the encapsulated set of changes to efficiently update all downstream clones of the repository. The diff is primary, and the state of the codebase is derived from a series of transitions (i.e. diffs).</p>
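<p>The “state derived from a series of diffs” pattern can be sketched in a few lines of Python. This is a deliberately simplified model of a diff – a list of line-level operations, not git’s actual diff format – but the shape of the derivation is the same: fold the diffs over the original to get the current state.</p>

```python
from functools import reduce

# A much-simplified "diff": a list of line-level operations. Real git
# diffs are richer, but the derivation has the same shape.
def apply_diff(lines, diff):
    result = list(lines)  # leave the input state untouched
    for op, index, text in diff:
        if op == "replace":
            result[index] = text
        elif op == "insert":
            result.insert(index, text)
        elif op == "delete":
            del result[index]
    return result

original = ["def greet():", "    print('helo')"]
diffs = [
    [("replace", 1, "    print('hello')")],  # fix a typo
    [("insert", 2, "    return None")],      # add a line
]

# The current state of the file is the original plus every diff, in order.
current = reduce(apply_diff, diffs, original)
```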

<p><em>Thanks to Ivan, Michael, and Cip for reading drafts of this post.</em></p>]]></content><author><name>Nicholas Chammas</name></author><summary type="html"><![CDATA[Many years ago, working as a database developer at a video game company, I was tasked with designing the database behind an in-game wallet service. The wallet service would store each player’s current balance and transaction history. In other words, it was a simple banking database.]]></summary></entry><entry><title type="html">The Modern Data Lake is a Database</title><link href="https://nchammas.com/writing/modern-data-lake-database" rel="alternate" type="text/html" title="The Modern Data Lake is a Database" /><published>2020-05-10T00:00:00+00:00</published><updated>2020-05-10T00:00:00+00:00</updated><id>https://nchammas.com/writing/modern-data-lake-database</id><content type="html" xml:base="https://nchammas.com/writing/modern-data-lake-database"><![CDATA[<p>I was recently talking to some coworkers about the mix of data technology we have in our stack. Apache Spark, HDFS, Amazon Athena, Amazon S3, AWS Glue… The list is long. The technologies obviously work together somehow, but to a newcomer it may not be clear how each technology relates to the other. And in the details of how a given technology works it’s easy to lose sight of what purpose it serves in the grand scheme of things.</p>

<p>In a previous post we discussed why many teams use both <a href="/writing/database-access-patterns">Postgres and Redshift</a>, or some equivalent, in their data stack. In this post let’s look at the broader collection of data systems that constitute the modern <a href="https://en.wikipedia.org/wiki/Data_lake">data lake</a> and give you, the newcomer, a mental map of them organized around a longstanding and very useful abstraction—the database.</p>

<h2 id="what-is-a-database">What is a Database?</h2>

<p>First, what is a database? In the abstract, it’s a system for storing and retrieving data. For the purposes of this post, however, I want to take a more practical view inspired by the classic graphical database client interface that is widespread in the software industry.</p>

<p>So in practical terms, this is a database:</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/database-three-components.png" width="600" />
    </span>
    <figcaption>
        A typical database client highlighting the three main components of a database. (<a href="https://json8.wordpress.com/2011/10/30/heidisql-alternative-for-linux/">image source</a>)
    </figcaption>
</figure>
</div>

<p>Specifically, it’s a collection of three components that work together as one system:</p>

<ol>
  <li><strong>Catalog</strong>: The Catalog tracks what data you have – i.e. schemas, tables, and columns – and where in the Storage layer to find it. The Catalog also tracks statistics about the data, like how many rows are in a table, or what the most common values in a specific column are. The Query Engine uses these statistics to figure out how to execute a query efficiently.</li>
  <li><strong>Query Engine</strong>: The Query Engine is what takes your query, in this case a SQL query, and translates it into specific machine instructions that will fetch and assemble the data you asked for. In other words, it takes a <a href="https://neo4j.com/blog/imperative-vs-declarative-query-languages/">declarative query</a> describing <em>what</em> you want and translates it into instructions detailing <a href="https://docs.microsoft.com/en-us/sql/relational-databases/performance/execution-plans"><em>how</em> to get it</a>. The Query Engine uses the Catalog to lookup the datasets referenced in the query and find them in Storage.</li>
  <li><strong>Storage</strong>: The Storage layer holds all of the database’s data. Its job is to store all the rows of data for all the tables in the database and retrieve or update them as requested.</li>
</ol>

<p>Every traditional relational database system, like Postgres or MySQL, comes with all three of these components packaged into one coherent system. They work together seamlessly, but they’re also inseparable. You cannot, for example, query or update the data in the database using regular Unix tools like <code class="language-plaintext highlighter-rouge">grep</code> or <code class="language-plaintext highlighter-rouge">sed</code>; you have to go through the database’s query engine. And while some databases let you use the database’s query engine to query data from <a href="https://www.postgresql.org/docs/current/ddl-foreign-data.html">outside of its own storage layer</a>, it’s very much a secondary capability that you wouldn’t want to rely on heavily.</p>

<h2 id="breaking-up-the-database">Breaking up the Database</h2>

<p>In the years since this formula was first developed and perfected, there’s been an explosion of new database and data processing technology: Graph databases, document databases, column-oriented databases, stream processing systems, and more. Among these new technologies is the group of distributed data processing systems – also known as “Big Data” tools – dominated by the <a href="http://hadoop.apache.org">Hadoop</a> ecosystem. This ecosystem includes systems like Apache Spark.</p>

<p>Broadly speaking, what distinguishes these systems from <a href="/writing/database-access-patterns">traditional databases</a> is that they enable you to process
a) large amounts of data
b) in varied formats
c) quickly
d) and affordably.
They do that by distributing the work to process data over a large number of cheap machines that are clustered together, and by allowing you to process data as it is on your storage system. In other words, instead of sending your data to the query engine, you send your query engine to the data. This contrasts with a traditional database system, where you would need to load the data into a specialized format in an area managed exclusively by the database. So to give a simple example of something you could do with these systems, which wasn’t as easy or practical to do before, you could process 20 TB of plain old CSV data distributed across 100 cheap machines in a few minutes.</p>

<p>When these systems were first being developed, the focus was on making them scalable and fault-tolerant, and the programming APIs weren’t very <a href="https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0">friendly</a>. Over time, these distributed data processing systems evolved to recreate the convenience and productivity of the traditional database system. Instead of MapReduce, you could now query data using plain old SQL. And instead of referring to data by fixed paths on a filesystem, you could now refer to them by abstract schema and table names, just like in a traditional database.</p>

<p>In effect, the people building these systems took the three components of the traditional database – Catalog, Query Engine, and Storage – and reinvented each as a stand-alone component for the distributed, massively scalable world. These components interoperate through shared catalog and storage formats.</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/database-three-components-many-options.png" width="800" />
    </span>
    <figcaption>
        The modern data lake as a logical database--three components, many options.
    </figcaption>
</figure>
</div>

<!-- - Catalog -> Hive Metastore
- Query Engine -> Spark (and Hive, Presto, Amazon Athena, Snowflake, Impala, Redshift Spectrum)
- Storage
    - Storage Formats -> CSV, JSON, Parquet, ORC, Avro
    - Storage Systems -> HDFS, Amazon S3, Azure Blob Storage -->

<p>This means you can store your data in one place, like S3, and query it using multiple tools, like Spark and Presto. These query engines will have the same view of the available datasets by pointing to a shared instance of the Hive Metastore.</p>

<p>Another key point is that storage is split up into <em>formats</em> and <em>systems</em>. Instead of having your data in a closed format on a single server operated on by a single database system, you can have data in multiple, open formats (like CSV or Parquet), across several storage systems. And because the data formats are not specific to any query engine, data created by one query engine can easily be read by another.</p>
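<p>Here’s a small, self-contained illustration of that separation using only Python’s standard library, with SQLite standing in for a query engine like Spark or Presto: the data lives in an open format (CSV) that no engine owns, and a separate engine reads those same bytes and queries them with SQL.</p>

```python
import csv
import io
import sqlite3

# "Storage": data in an open format (CSV) that no single engine owns.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["day", "visits"])
writer.writerows([("2020-05-01", 3), ("2020-05-02", 5)])

# A separate "query engine" (SQLite here, standing in for Spark or
# Presto) reads the same data and answers a SQL query over it.
buf.seek(0)
reader = csv.reader(buf)
next(reader)  # skip the header row
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (day TEXT, visits INTEGER)")
con.executemany("INSERT INTO visits VALUES (?, ?)", reader)
total = con.execute("SELECT SUM(visits) FROM visits").fetchone()[0]
print(total)  # 8
```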

<h2 id="example-the-spark-database">Example: The Spark “Database”</h2>

<p>Apache Spark is extremely popular with teams building data lakes. If you’re reading this post, chances are that you’ve used it at some point. But if your experience with Spark was limited to its RDD or DataFrame APIs, you may not have realized that it can be integrated with these other systems to create a logical database with SQL as the primary query language. So let’s take a quick look at how to do that, keeping in mind that you can do something similar for many other “Big Data” query engines.</p>

<p>Spark comes with a command-line utility called <code class="language-plaintext highlighter-rouge">spark-sql</code>. It’s similar, for example, to Postgres’s <code class="language-plaintext highlighter-rouge">psql</code>. It gives you a SQL-only prompt where you can create, destroy, and query tables in a virtual database. By default, the catalog for this database is stored in a folder called <code class="language-plaintext highlighter-rouge">metastore_db</code>, and the data for the tables in the database is stored in a folder called <code class="language-plaintext highlighter-rouge">spark-warehouse</code>, typically in Parquet format. That’s already pretty neat, but you can take this further by calling <code class="language-plaintext highlighter-rouge">./sbin/start-thriftserver.sh</code> from the Spark home directory. This will start up a JDBC server that you can connect to with any old database client, like <a href="https://dbeaver.io">DBeaver</a>. That will give you the full “Spark is a database” experience. I won’t go over how to do this in detail, since that’s not the focus of this post, but the documentation for Spark’s JDBC server and SQL CLI <a href="http://spark.apache.org/docs/2.4.5/sql-distributed-sql-engine.html">is here</a>.</p>

<p>We can extend this experience to the cloud. If you work with Spark on Amazon EMR, you can <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html">connect Spark to the AWS Glue Data Catalog</a>. This gives Spark the same view into your datasets that several other AWS services have, including Amazon Athena and Amazon Redshift Spectrum. In other words, you can have one catalog, managed by AWS Glue, one location for your data, on S3, and any number of different services or query engines updating or querying that data using SQL. And just as you can with Spark running locally, on EMR you can <a href="https://aws.amazon.com/premiumsupport/knowledge-center/jdbc-connection-emr/">start a JDBC server</a> and connect to it with a regular database client.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I hope this post connected some dots for you about the various distributed data systems out there. There are many ways to conceptualize a data lake. Thinking of it as a database – i.e. a combination of catalog, query engine, and storage layer – provides a familiar abstraction that will help you mentally map out many of the technologies in this space.</p>

<p>This idea is more than just a conceptual tool, though! After all, a team may use these same technologies to build a data lake without integrating them to create that cohesive “database” package. What are they missing out on? As we touched on earlier, by actually building your data lake around the database abstraction, you can shift the focus of your work away from <em>where</em> the data is or <em>how</em> to manipulate it, and instead focus on <em>what</em> data you want. Let’s explore this idea in a <a href="/writing/data-pipeline-materialized-view">future post</a>.</p>

<p><em>Thanks to Michelle, Yuna, Cip, and Sophie for reading drafts of this post.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="apache-spark" /><category term="databases" /><summary type="html"><![CDATA[I was recently talking to some coworkers about the mix of data technology we have in our stack. Apache Spark, HDFS, Amazon Athena, Amazon S3, AWS Glue… The list is long. The technologies obviously work together somehow, but to a newcomer it may not be clear how each technology relates to the other. And in the details of how a given technology works it’s easy to lose sight of what purpose it serves in the grand scheme of things.]]></summary></entry><entry><title type="html">Postgres and Redshift: Why use both?</title><link href="https://nchammas.com/writing/database-access-patterns" rel="alternate" type="text/html" title="Postgres and Redshift: Why use both?" /><published>2020-03-28T00:00:00+00:00</published><updated>2020-03-28T00:00:00+00:00</updated><id>https://nchammas.com/writing/database-access-patterns</id><content type="html" xml:base="https://nchammas.com/writing/database-access-patterns"><![CDATA[<p>A couple of co-workers who are new to database technology recently asked me why we use both Postgres and Redshift in our stack. They’re both SQL databases and seem to do the same thing. So why not just use one technology? It would be simpler.</p>

<p>It’s a great question. In fact, it’s very common for teams to use a combination of databases in their stack, especially a combination like Postgres and Redshift. Let’s explore why.</p>

<h2 id="understanding-database-types-via-their-access-patterns">Understanding Database Types via their Access Patterns</h2>

<p>There are a lot of database types in the world. A very powerful way to understand any given database is via the <em>access patterns</em> it is designed for.</p>

<p>At a high level, we can divide databases into two broad categories:</p>

<ul>
  <li>transactional</li>
  <li>analytical</li>
</ul>

<p>(Note: We’re not talking about transactions in the sense of <a href="https://en.wikipedia.org/w/index.php?title=ACID_(computer_science)">ACID</a>; we’re focusing just on the transactional access pattern.)</p>

<h2 id="transactional-databases">Transactional Databases</h2>

<p>A transactional database is what most people think of when they hear “database”. It’s typically a database that backs online, interactive operations, like for a store or a game, where users expect instant responses to their queries.</p>

<p>Each query typically touches a very small amount of data, since the user executing the query is usually only reading or writing data about themselves, like updating their profile information or noting a new purchase. Tables tend to be narrow and highly <a href="https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description">normalized</a>. At any one time there may be thousands or tens of thousands of such queries executing against a transactional database, as a multitude of people interact with the service that the database backs.</p>

<p>Query times in well-operated transactional databases are typically measured in milliseconds or less. Engineers spend a lot of time designing table indexes to enable the database to sift through the minimum number of rows required to answer a query, and tuning database parameters to keep as much data in memory as possible and minimize disk I/O.</p>

<p>In summary, transactional databases are designed for:</p>
<ul>
  <li>point reads/writes, i.e. small amounts of data per query</li>
  <li>high concurrency, i.e. many queries running at the same time</li>
  <li>very low latency, i.e. quick query response times</li>
</ul>

<p>Examples of transactional databases include all the popular database systems you’ve heard of:</p>
<ul>
  <li>MySQL</li>
  <li>Oracle</li>
  <li>Postgres</li>
  <li>SQL Server</li>
</ul>

<div style="text-align: center;">
<figure>
    <a href="http://www.warfaremagazine.co.uk/articles/1415-The-Battle-of-Agincourt/171">
        <img src="/assets/images/battle-of-agincourt-compressed.jpg" width="300" />
    </a>
    <figcaption>
        The transactional database access pattern: Lots and lots of tiny chunks of data coming at you real fast.
    </figcaption>
</figure>
</div>

<h2 id="analytical-databases">Analytical Databases</h2>

<p>An analytical database is designed for a very different access pattern. Instead of backing an online store or game, an analytics database typically backs pipelines or tools that help users analyze large swathes of data.</p>

<p>A typical analytics query will touch a large range of data, like a reporting query that summarizes sales numbers by day for an entire quarter. Compared to a transactional database, there will only be a small number of queries running at one time against an analytics database, and each query will usually only touch a handful of columns in any given table. Tables tend to be wide (i.e. they have many columns) and highly denormalized.</p>

<p>Query times in an analytics database will typically be on the order of seconds or minutes. Indexes, which help queries quickly find their target row, aren’t relevant to analytics databases because queries rarely target a single row. Instead, engineers optimize analytical databases by reorganizing how the data is stored on disk to minimize the number of <em>columns</em> that need to be read, and by compressing the data so that large chunks of data can be read quickly. Trying to hold most of your data in memory is typically not possible or even necessary for analytics workloads.</p>

<p>In summary, analytical databases are designed for:</p>
<ul>
  <li>bulk reads/writes, i.e. large amounts of data per query</li>
  <li>lower concurrency, i.e. fewer queries running at the same time</li>
  <li>higher latency, i.e. longer query response times</li>
</ul>

<p>Popular analytical database systems include:</p>
<ul>
  <li>Vertica</li>
  <li>Redshift</li>
  <li>Teradata</li>
</ul>

<div style="text-align: center;">
<figure>
    <a href="https://en.wikipedia.org/wiki/Trebuchet">
        <img src="/assets/images/trebuchet-castelnaud-compressed.jpg" width="400" />
    </a>
    <figcaption>
        The analytical database access pattern: A handful of huge chunks of data coming at you relatively slowly.
    </figcaption>
</figure>
</div>

<h2 id="typical-usage-examples">Typical Usage Examples</h2>

<p>Now we can answer the question that opened this post: Why use both Postgres and Redshift?</p>

<p>A typical pattern is for teams to use both to build an analytics product. For example, consider a team building a product that tracks visits to your website and then shows you a handy chart summarizing your web traffic over the past few weeks.</p>

<p>The team uses Redshift to bulk load detailed event data tracking every visit to your site and then aggregate it down to a set of summary statistics and key metrics. They then load that summarized data – for example, total visits to your site per day for the past 12 weeks – into Postgres and serve it up from there to a website or API endpoint for users to access. Redshift answers a relatively small number of queries that crunch a lot of data and take a lot of time each, as part of a batch update pipeline, while Postgres answers a larger number of lighter queries that touch smaller amounts of data each, as users browse the summary statistics for their website.</p>
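<p>The Redshift side of that pipeline is, at its core, a big aggregation. As a toy sketch of the idea in Python (the event records here are hypothetical), the detailed per-visit events collapse into one small row per day, and it’s that small summary table that gets loaded into Postgres and served to users:</p>

```python
from collections import Counter

# Raw, detailed event data: one record per page visit. This is the kind
# of data the team bulk loads into Redshift. (Records are hypothetical.)
visits = [
    {"ts": "2020-03-01T09:15:00", "page": "/home"},
    {"ts": "2020-03-01T09:16:30", "page": "/pricing"},
    {"ts": "2020-03-02T11:00:00", "page": "/home"},
]

# The batch pipeline aggregates the detail down to one row per day --
# the small summary that then gets loaded into Postgres and served up
# with fast, point-read queries.
visits_per_day = Counter(v["ts"][:10] for v in visits)
print(visits_per_day)  # Counter({'2020-03-01': 2, '2020-03-02': 1})
```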

<p>It’s also common for the flow between the systems to go the other way around. Consider a team building an online game – an <a href="https://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game">MMORPG</a>, perhaps.</p>

<p>They use Postgres to back online game operations and track what actions a player is taking in the game. Those actions affect the online world and develop the player’s character as they are playing the game. The game only needs to know what a player has done in the current session, so to keep the transactional database light, the team regularly moves data for old sessions from Postgres to Redshift. In Redshift, analysts study player behavior across long stretches of time and try to answer questions like “What is the most popular path players take through our world?” or “Where are players quitting our game, and why?” Postgres handles the flurry of detail-level activity to serve thousands of online players, while Redshift answers big picture queries for a handful of in-house analysts.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>There are many more ways to understand and categorize database systems:</p>
<ul>
  <li>by the <a href="https://fauna.com/blog/demystifying-database-systems-introduction-to-consistency-levels">consistency guarantees</a> they provide;</li>
  <li>by the levels of <a href="http://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html">transaction isolation</a> they provide;</li>
  <li>by how they <a href="https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-scale-introduction#horizontal-and-vertical-scaling">scale</a> to handle additional load;</li>
  <li>by the <a href="https://neo4j.com/blog/why-database-query-language-matters/#cypher">query languages</a> and <a href="https://www.mongodb.com/document-databases">data structures</a> they support;</li>
  <li>or by how they <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS">lay out data on disk</a>, to name a few.</li>
</ul>

<p>In this post we’ve focused just on access patterns, though databases designed for different access patterns typically do so by differing on these other axes, too.</p>

<p>What about systems like Amazon Athena and Spark SQL, by the way? Many teams with data-intensive workflows tend to use these tools as well. And they certainly <em>look</em> like databases, though there’s something weird about them. Roughly speaking, systems like Athena and Spark SQL <em>can</em> be categorized as analytical databases, but there’s more to them than that. We’ll explore these systems in more detail in a <a href="/writing/modern-data-lake-database">follow-up post</a>.</p>

<p><em>Thanks to Michelle, Yuna, Sam, Cip, Fabian, and Roland for reading drafts of this post.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="databases" /><summary type="html"><![CDATA[A couple of co-workers who are new to database technology recently asked me why we use both Postgres and Redshift in our stack. They’re both SQL databases and seem to do the same thing. So why not just use one technology? It would be simpler.]]></summary></entry><entry><title type="html">Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis</title><link href="https://nchammas.com/writing/how-not-to-die-hard-with-hypothesis" rel="alternate" type="text/html" title="Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis" /><published>2017-03-28T00:00:00+00:00</published><updated>2017-03-28T00:00:00+00:00</updated><id>https://nchammas.com/writing/how-not-to-die-hard-with-hypothesis</id><content type="html" xml:base="https://nchammas.com/writing/how-not-to-die-hard-with-hypothesis"><![CDATA[<p>In the movie <a href="https://en.wikipedia.org/wiki/Die_Hard_with_a_Vengeance">Die Hard with a Vengeance</a> (aka Die Hard 3), there is a famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/die-hard/die-hard-water-jugs.jpg" width="500" />
    </span>
    <figcaption>
    McClane and Carver puzzle over the water jug problem.
    </figcaption>
</figure>
</div>

<p>Apparently, you can solve this problem using a formal specification
language like <a href="https://en.wikipedia.org/wiki/TLA%2B">TLA+</a>. I don’t know
much about this topic, but it appears that a <a href="https://en.wikipedia.org/wiki/Formal_specification">formal specification language</a>
is much like a programming language in that it lets you describe the
behavior of a system. However, it’s much more rigorous and builds
on mathematical techniques that enable you to reason more effectively
about the behavior of the system you’re describing than you can with
a typical programming language.</p>

<p>In a recent discussion on Hacker News about TLA+,
I came across <a href="https://news.ycombinator.com/item?id=13919251">this comment</a>
which linked to a fun and simple example
showing <a href="https://github.com/tlaplus/Examples/blob/master/specifications/DieHard/DieHard.tla">how to solve the Die Hard 3 problem with TLA+</a>.
I had to watch the first two lectures from <a href="http://lamport.azurewebsites.net/video/videos.html">Leslie Lamport’s video course on TLA+</a>
to understand the example well, but once I did I was reminded of the
idea of property-based testing and, specifically, <a href="http://hypothesis.works/">Hypothesis</a>.</p>

<p>So what’s property-based testing? It’s a powerful way of testing your
logic by giving your machine a high-level description of how your code
should behave and letting it generate test cases automatically to see
if that description holds. Compare that to traditional unit testing,
for example, where you manually code up specific inputs and outputs
and make sure they match.</p>

<h2 id="how-not-to-die-hard-with-hypothesis">How not to Die Hard with Hypothesis</h2>

<p>Hypothesis has an excellent implementation of property-based testing
<a href="https://github.com/HypothesisWorks/hypothesis-python">for Python</a>.
I thought to myself: I wonder if you can write that
Die Hard specification using Hypothesis? As it turns out, Hypothesis
supports <a href="https://hypothesis.readthedocs.io/en/latest/stateful.html">stateful testing</a>,
and I was able to port the <a href="https://github.com/tlaplus/Examples/blob/master/specifications/DieHard/DieHard.tla">TLA+ example</a>
to Python pretty easily:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">hypothesis</span> <span class="kn">import</span> <span class="n">note</span><span class="p">,</span> <span class="n">settings</span>
<span class="kn">from</span> <span class="nn">hypothesis.stateful</span> <span class="kn">import</span> <span class="n">RuleBasedStateMachine</span><span class="p">,</span> <span class="n">rule</span><span class="p">,</span> <span class="n">invariant</span>


<span class="c1"># The default for `max_examples` is sometimes not enough for Hypothesis
# to find a falsifying example.
</span><span class="o">@</span><span class="n">settings</span><span class="p">(</span><span class="n">max_examples</span><span class="o">=</span><span class="mi">2000</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">DieHardProblem</span><span class="p">(</span><span class="n">RuleBasedStateMachine</span><span class="p">):</span>
    <span class="n">small</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">big</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">fill_small</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="mi">3</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">fill_big</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="mi">5</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">empty_small</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">empty_big</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">pour_small_into_big</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">old_big</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">-</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">-</span> <span class="n">old_big</span><span class="p">)</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">pour_big_into_small</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">old_small</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">-</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">-</span> <span class="n">old_small</span><span class="p">)</span>

    <span class="o">@</span><span class="n">invariant</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">physics_of_jugs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">assert</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">&lt;=</span> <span class="mi">3</span>
        <span class="k">assert</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">&lt;=</span> <span class="mi">5</span>

    <span class="o">@</span><span class="n">invariant</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">die_hard_problem_not_solved</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">note</span><span class="p">(</span><span class="s">"&gt; small: {s} big: {b}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">small</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">big</span><span class="p">))</span>
        <span class="k">assert</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">!=</span> <span class="mi">4</span>


<span class="n">DieHardTest</span> <span class="o">=</span> <span class="n">DieHardProblem</span><span class="p">.</span><span class="n">TestCase</span>
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">pytest</code> on this file quickly digs up a solution:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>self = DieHardProblem({})

    @invariant()
    def die_hard_problem_not_solved(self):
        note("&gt; small: {s} big: {b}".format(s=self.small, b=self.big))
&gt;       assert self.big != 4
E       AssertionError: assert 4 != 4
E        +  where 4 = DieHardProblem({}).big

how-not-to-die-hard-with-hypothesis.py:17: AssertionError
----------------------------- Hypothesis -----------------------------
&gt; small: 0 big: 0
Step #1: fill_big()
&gt; small: 0 big: 5
Step #2: pour_big_into_small()
&gt; small: 3 big: 2
Step #3: empty_small()
&gt; small: 0 big: 2
Step #4: pour_big_into_small()
&gt; small: 2 big: 0
Step #5: fill_big()
&gt; small: 2 big: 5
Step #6: pour_big_into_small()
&gt; small: 3 big: 4
====================== 1 failed in 0.22 seconds ======================
</code></pre></div></div>

<p><em>Note: I last tested this against Python 3.13, pytest 8.4, and Hypothesis 6.138.</em></p>

<h2 id="whats-going-on-here">What’s Going on Here</h2>

<p>The code and test output are pretty self-explanatory, but here’s a
recap of what’s going on:</p>

<p>We’re defining a state machine. That state machine has an initial
state (two empty jugs) along with some possible transitions. Those
transitions are captured with the <code class="language-plaintext highlighter-rouge">rule()</code> decorator. The initial
state and possible transitions together define how our system works.</p>

<p>Next we define invariants, which are properties that must always hold
true in our system. Our first invariant, <code class="language-plaintext highlighter-rouge">physics_of_jugs</code>, says that
the jugs must hold an amount of water that makes sense. For example,
the big jug can never hold more than 5 gallons of water.</p>

<p>Our next invariant, <code class="language-plaintext highlighter-rouge">die_hard_problem_not_solved</code>, is where it gets
interesting. Here we’re declaring that the problem of getting exactly
4 gallons in the big jug <em>cannot</em> be solved. Since Hypothesis’s job
is to test our logic for bugs, it will give our state machine a
thorough shakedown and see if we ever violate our invariants.
In other words, we’re basically goading Hypothesis into solving the
Die Hard problem for us.</p>

<p>I’m not entirely clear on how Hypothesis does its work, but I know
the basic summary is this: It takes the program properties we’ve
specified – including things like rules, invariants, data types, and
function signatures – and generates data or actions to probe the
behavior of our program. If Hypothesis finds a piece of data or
sequence of actions that gets our program to violate its stated
properties, it tries to whittle that down to a <em>minimal falsifying
example</em> – i.e. something that exposes the same problem in as few
steps as possible. This makes it much easier for you to understand how Hypothesis
broke your code.</p>
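<p>Here’s a toy illustration of that shrinking behavior, separate from
the state-machine example above. The property below is deliberately
false (the test name and property are invented just for this sketch),
and when Hypothesis refutes it, it reports a small counterexample
rather than the first noisy one it happens to generate:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from hypothesis import given, strategies as st

# A deliberately false property: "the largest element of a list of
# integers is always non-negative." Any all-negative list refutes it.
@given(st.lists(st.integers()))
def test_max_is_nonnegative(xs):
    assert max(xs, default=0) == abs(max(xs, default=0))
</code></pre></div></div>

<p>Running this under <code class="language-plaintext highlighter-rouge">pytest</code>, Hypothesis finds a failing list and
shrinks it, typically reporting something as small as <code class="language-plaintext highlighter-rouge">xs=[-1]</code> rather
than whatever long random list it first stumbled on.</p>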

<p>Hypothesis’s output above tells us that it was able to violate the
<code class="language-plaintext highlighter-rouge">die_hard_problem_not_solved</code> invariant and provides us with a
minimal reproduction showing exactly how it did so. That reproduction
is our solution to the problem. It’s also how McClane and Carver did
it in the movie!</p>
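<p>You can sanity-check that reproduction by hand. Here’s a sketch that
mirrors the state machine’s transitions in a plain Python class – no
Hypothesis required; the <code class="language-plaintext highlighter-rouge">Jugs</code> class is just scaffolding invented for
this check – and replays the six steps:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Jugs:
    """Plain-Python mirror of the state machine's transitions."""

    def __init__(self):
        self.small, self.big = 0, 0

    def fill_big(self):
        self.big = 5

    def empty_small(self):
        self.small = 0

    def pour_big_into_small(self):
        # Pour until the small jug is full or the big jug is empty.
        poured = min(3 - self.small, self.big)
        self.small += poured
        self.big -= poured

jugs = Jugs()
for step in (jugs.fill_big, jugs.pour_big_into_small, jugs.empty_small,
             jugs.pour_big_into_small, jugs.fill_big, jugs.pour_big_into_small):
    step()

assert (jugs.small, jugs.big) == (3, 4)  # exactly 4 gallons in the big jug
</code></pre></div></div>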

<h2 id="final-thoughts">Final Thoughts</h2>

<p>All in all, I was pretty impressed with how straightforward it was to
translate the TLA+ example into Python using Hypothesis. And when
Hypothesis spit out the solution, I couldn’t help but smile. It’s
pretty cool to see your computer essentially generate a program that
solves a problem for you. And the Python version of the Die Hard
“spec” is not much more verbose than the
original in TLA+, though TLA+’s notation for current vs. next value
(e.g. <code class="language-plaintext highlighter-rouge">small</code> vs. <code class="language-plaintext highlighter-rouge">small'</code>) is elegant and cuts out the need to have
variables like <code class="language-plaintext highlighter-rouge">old_small</code> and <code class="language-plaintext highlighter-rouge">old_big</code>.</p>
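<p>Python’s tuple assignment can recover some of that elegance: both
right-hand sides are evaluated against the current state before either
attribute is updated. Here’s a sketch of <code class="language-plaintext highlighter-rouge">pour_small_into_big</code> rewritten
that way – behaviorally equivalent to the version above, with a minimal
<code class="language-plaintext highlighter-rouge">Jugs</code> class added just to make the example runnable:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Jugs:
    def __init__(self, small=0, big=0):
        self.small, self.big = small, big

    def pour_small_into_big(self):
        # Both right-hand sides see the *current* state, so no
        # old_big temporary is needed.
        poured = min(self.small, 5 - self.big)
        self.small, self.big = self.small - poured, self.big + poured

jugs = Jugs(small=3, big=3)
jugs.pour_small_into_big()
assert (jugs.small, jugs.big) == (1, 5)
</code></pre></div></div>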

<p>I don’t know how Hypothesis compares to TLA+ in a general sense. I’ve
only just started to learn about property-based testing and TLA+, and
I wonder if they have a place in the work that I do these days, which
is mostly Data Engineering-type stuff. Still, I found this little
exercise fun, and I hope you learned something interesting from it.</p>

<p><em>Thanks to <a href="http://jvns.ca/">Julia</a>, <a href="https://danluu.com/">Dan</a>, Laura, Anjana, and Cip for reading drafts
of this post.</em></p>

<p><em>Read the discussion about this post <a href="https://lobste.rs/s/alz5mw/solving_water_jug_problem_from_die_hard_3">on Lobsters</a>.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="python" /><category term="property-based testing" /><category term="hypothesis" /><category term="tla" /><summary type="html"><![CDATA[In the movie Die Hard with a Vengeance (aka Die Hard 3), there is a famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?]]></summary></entry></feed>