<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://nchammas.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://nchammas.com/" rel="alternate" type="text/html" /><updated>2025-10-22T18:07:11+00:00</updated><id>https://nchammas.com/feed.xml</id><title type="html">Into the Light</title><subtitle></subtitle><author><name>Nicholas Chammas</name></author><entry><title type="html">A Query Language is also a Data Constraint Language</title><link href="https://nchammas.com/writing/query-language-constraint-language" rel="alternate" type="text/html" title="A Query Language is also a Data Constraint Language" /><published>2021-10-27T00:00:00+00:00</published><updated>2021-10-27T00:00:00+00:00</updated><id>https://nchammas.com/writing/query-language-constraint-language</id><content type="html" xml:base="https://nchammas.com/writing/query-language-constraint-language"><![CDATA[<p>What’s the difference between a data constraint and a data query? Is there anything that can be expressed in one form but not the other? My sense is that there is no such thing.</p>

<p>A constraint – or, similarly, a validation check<sup id="fnref:validation" role="doc-noteref"><a href="#fn:validation" class="footnote" rel="footnote">1</a></sup> – is a description of what your data should or shouldn’t look like. A query serves the same purpose. It describes data that has a specific “shape” or satisfies certain properties. Anything that can be thought of as a constraint can also be expressed as a query.</p>

<p>Consider one of SQL’s “built-in” constraints, <code class="language-plaintext highlighter-rouge">PRIMARY KEY</code>. It’s such a commonly used constraint that, presumably, SQL’s authors felt it merited a dedicated keyword. But a primary key constraint can also be expressed as a plain query.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- `PRIMARY KEY` expressed as a plain SQL query.</span>
<span class="c1">-- `id` is our primary key column.</span>
<span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">id_count</span>
<span class="k">FROM</span> <span class="n">some_table</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span>
<span class="k">HAVING</span> <span class="p">(</span>
    <span class="n">id_count</span> <span class="o">&gt;</span> <span class="mi">1</span>
    <span class="k">OR</span> <span class="n">id</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="p">);</span>
</code></pre></div></div>

<p>As long as this query returns nothing, our constraint holds. If some data violates our constraint, then this query will return precisely the violating data.</p>
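<p>To make this concrete, here is the same check run end to end against SQLite's in-memory database. This is a minimal sketch: the table and data are made up for illustration, and the aggregate is repeated in the <code class="language-plaintext highlighter-rouge">HAVING</code> clause for portability, since not every engine accepts a column alias there.</p>

```python
import sqlite3

# A toy table with two primary key violations: a duplicated id and a
# NULL id. (The table and data are illustrative.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "c"), (None, "d")],
)

# The primary key constraint expressed as a plain query.
violations = conn.execute("""
    SELECT id, COUNT(*) AS id_count
    FROM some_table
    GROUP BY id
    HAVING COUNT(*) > 1 OR id IS NULL
""").fetchall()

# An empty result would mean the constraint holds. Here the query
# returns precisely the violating groups: id 2 (twice) and the NULL id.
print(violations)
```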

<p>This ability to express constraints as queries isn’t specific to SQL. Any query language can do this. Let’s express this same constraint using Apache Spark’s DataFrame API, which is roughly equivalent to SQL in expressiveness:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># `PRIMARY KEY` expressed using PySpark's DataFrame API.
</span><span class="p">(</span>
    <span class="n">some_table</span>
    <span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">count</span><span class="p">()</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span>
        <span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"count"</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)</span>
        <span class="o">|</span> <span class="n">col</span><span class="p">(</span><span class="s">"id"</span><span class="p">).</span><span class="n">isNull</span><span class="p">()</span>
    <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<!--
With a slight tweak, we can have this query instead return `TRUE` when our constraint is met and `FALSE` when it's violated.

```sql
-- Check that `id` is a valid primary key.
SELECT NOT EXISTS(
    SELECT id, COUNT(*) AS id_count
    ...
);
```
-->

<p>As another quick example, here is <code class="language-plaintext highlighter-rouge">FOREIGN KEY</code> expressed as an SQL query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- `FOREIGN KEY` expressed as a plain SQL query.</span>
<span class="c1">-- book.author_key is a foreign key to author.author_key.</span>
<span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="o">*</span>
<span class="k">FROM</span> 
    <span class="n">book</span> <span class="n">b</span>
    <span class="k">LEFT</span> <span class="k">OUTER</span> <span class="k">JOIN</span> <span class="n">author</span> <span class="n">a</span>
        <span class="k">ON</span> <span class="n">b</span><span class="p">.</span><span class="n">author_key</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">author_key</span>
<span class="k">WHERE</span>
        <span class="n">b</span><span class="p">.</span><span class="n">author_key</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="k">NULL</span>
    <span class="k">AND</span> <span class="n">a</span><span class="p">.</span><span class="n">author_key</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="p">;</span>
</code></pre></div></div>

<p>And again using Spark’s DataFrame API:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># `FOREIGN KEY` expressed using PySpark's DataFrame API.
</span><span class="p">(</span>
    <span class="n">book</span>
    <span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">author</span><span class="p">,</span> <span class="n">on</span><span class="o">=</span><span class="s">"author_key"</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s">"left_outer"</span><span class="p">)</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">book</span><span class="p">[</span><span class="s">"author_key"</span><span class="p">].</span><span class="n">isNotNull</span><span class="p">())</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">author</span><span class="p">[</span><span class="s">"author_key"</span><span class="p">].</span><span class="n">isNull</span><span class="p">())</span>
    <span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">book</span><span class="p">[</span><span class="s">"*"</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div>

<p>If every row in <code class="language-plaintext highlighter-rouge">book</code> points to a valid row in <code class="language-plaintext highlighter-rouge">author</code> (or points to no author at all), then these queries return nothing. If a <code class="language-plaintext highlighter-rouge">book</code> points to an author not in our <code class="language-plaintext highlighter-rouge">author</code> table, then our foreign key constraint has been violated and these queries will return the rows from <code class="language-plaintext highlighter-rouge">book</code> that point to a bad author key.</p>
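<p>The foreign key check runs end to end the same way. Again a sketch against SQLite's in-memory database, with made-up tables and rows:</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (author_key INTEGER, name TEXT)")
conn.execute("CREATE TABLE book (book_key INTEGER, title TEXT, author_key INTEGER)")
conn.executemany("INSERT INTO author VALUES (?, ?)", [(1, "Woolf"), (2, "Baldwin")])
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [
        (10, "Orlando", 1),
        (11, "Giovanni's Room", 2),
        (12, "Mystery Book", 99),      # points to a nonexistent author
        (13, "Anonymous Work", None),  # no author at all; allowed
    ],
)

# The foreign key constraint expressed as a plain query: find books
# whose author_key matches no row in author.
orphans = conn.execute("""
    SELECT b.*
    FROM book b
    LEFT OUTER JOIN author a ON b.author_key = a.author_key
    WHERE b.author_key IS NOT NULL
      AND a.author_key IS NULL
""").fetchall()

# Only the book with the bad author key comes back.
print(orphans)
```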

<h2 id="not-just-the-built-in-sql-constraints">Not Just the Built-In SQL Constraints</h2>

<p>My feeling is that any possible constraint or check you can think of can be expressed in this way, using the query languages we are already familiar with. And since we are using mature and well-understood query languages, we can reuse all the constructs and patterns they provide: functions, subqueries, common table expressions, and so on.</p>

<p>To that end, let’s express a more complex constraint as a query, using this example scenario described in <em>Designing Data-Intensive Applications</em><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>:</p>

<blockquote>
  <p>[Y]ou are writing an application for doctors to manage their on-call shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, but it absolutely must have at least one doctor on call.</p>
</blockquote>

<p>You can’t express this constraint using any of SQL’s built-in keywords, but it’s pretty straightforward as a query.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Doctor on-call constraint expressed as SQL.</span>
<span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span> <span class="k">AS</span> <span class="n">sufficient_on_call_coverage</span>
<span class="k">FROM</span> <span class="n">doctors</span>
<span class="k">WHERE</span> <span class="n">on_call</span><span class="p">;</span>
</code></pre></div></div>

<p>If we have at least one doctor on-call, then <code class="language-plaintext highlighter-rouge">sufficient_on_call_coverage</code> is <code class="language-plaintext highlighter-rouge">TRUE</code>, otherwise it’s <code class="language-plaintext highlighter-rouge">FALSE</code>. And as before, this constraint can be expressed in another query language like a DataFrame API, not just SQL:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Doctor on-call constraint expressed using PySpark's DataFrame API.
</span><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">count</span>

<span class="p">(</span>
    <span class="n">doctors</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"on_call"</span><span class="p">))</span>
    <span class="c1"># I don't call .count() here but do it instead inside the
</span>    <span class="c1"># .select() so this query is semantically equivalent to the SQL one.
</span>    <span class="p">.</span><span class="n">select</span><span class="p">(</span>
        <span class="p">(</span><span class="n">count</span><span class="p">(</span><span class="s">"*"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">).</span><span class="n">alias</span><span class="p">(</span><span class="s">"sufficient_on_call_coverage"</span><span class="p">)</span>
    <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<p>At this point, you can probably imagine any number of arbitrary data constraints expressed as queries. When the constraint holds the query returns <code class="language-plaintext highlighter-rouge">TRUE</code> (or, alternatively, it returns nothing), and when the constraint is violated the query returns <code class="language-plaintext highlighter-rouge">FALSE</code> (or, alternatively, it returns the violating data). You can make the query as complicated as you’d like, as long as it fits this pattern.<sup id="fnref:non-deter" role="doc-noteref"><a href="#fn:non-deter" class="footnote" rel="footnote">3</a></sup></p>

<!--
One final example, just to drive home the point that queries can express any sort of constraint or check: If your underlying table format supports [time travel][8], you can check historical metrics that describe the data over time. Say we want to check that a table does not grow beyond a certain rate each day:

[8]: https://docs.delta.io/latest/delta-batch.html#-deltatimetravel

```sql
WITH recent_counts AS (
    SELECT
        (
            SELECT COUNT(*)
            FROM some_table TIMESTAMP AS OF TODAY()
        ) AS count_today,
        (
            SELECT COUNT(*)
            FROM some_table TIMESTAMP AS OF DATEADD(DAY, -1, TODAY())
        ) AS count_yesterday
)
-- Check that some_table's row count has not grown more than 10%
-- since yesterday.
SELECT count_today <= 1.1 * count_yesterday;
```

Constraints that reference non-deterministic functions like `TODAY()` do present added complications that I won't get into in this post.
-->

<h2 id="straight-from-the-sql-specification-create-assertion">Straight from the SQL specification: CREATE ASSERTION</h2>

<p>This idea that arbitrary queries can be used to express data constraints is not new. As part of my research for this post I stumbled on a feature of SQL that’s been part of the specification <a href="http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt">since 1992</a>: <code class="language-plaintext highlighter-rouge">CREATE ASSERTION</code>.</p>

<!-- [^sql-assert]: Section 11.34, page 324. -->

<p>The idea of SQL assertions is that you can specify constraints on your data via queries that return <code class="language-plaintext highlighter-rouge">TRUE</code> when the constraint holds and <code class="language-plaintext highlighter-rouge">FALSE</code> when it is violated.</p>

<p>So here’s our primary key constraint re-expressed as an SQL assertion.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">ASSERTION</span> <span class="n">some_table_primary_key</span>
<span class="k">CHECK</span> <span class="p">(</span>
    <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="p">(</span>
        <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">id_count</span>
        <span class="k">FROM</span> <span class="n">some_table</span>
        <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span>
        <span class="k">HAVING</span> <span class="p">(</span>
            <span class="n">id_count</span> <span class="o">&gt;</span> <span class="mi">1</span>
            <span class="k">OR</span> <span class="n">id</span> <span class="k">IS</span> <span class="k">NULL</span>
        <span class="p">)</span>
    <span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>If <code class="language-plaintext highlighter-rouge">some_table</code> should ever come to have more than one row with the same <code class="language-plaintext highlighter-rouge">id</code>, or a row with a <code class="language-plaintext highlighter-rouge">NULL</code> <code class="language-plaintext highlighter-rouge">id</code>, then the inner <code class="language-plaintext highlighter-rouge">SELECT</code> will return some results, causing the <code class="language-plaintext highlighter-rouge">NOT EXISTS</code> check to return <code class="language-plaintext highlighter-rouge">FALSE</code>, thus violating our <code class="language-plaintext highlighter-rouge">some_table_primary_key</code> constraint.</p>

<p>Here’s the doctor on-call constraint expressed as an SQL assertion.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">ASSERTION</span> <span class="n">sufficient_on_call_coverage</span>
<span class="k">CHECK</span> <span class="p">(</span>
    <span class="p">(</span>
        <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
        <span class="k">FROM</span> <span class="n">doctors</span>
        <span class="k">WHERE</span> <span class="n">on_call</span>
    <span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Any constraint you can express as an SQL query can be tweaked to fit this form of an <code class="language-plaintext highlighter-rouge">ASSERTION</code>. Popular DataFrame APIs like Spark’s don’t offer assertions, but we could easily imagine a couple of potential DataFrame equivalents to SQL assertions. Here, again, is the on-call doctors assertion:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pyspark.sql.functions</span> <span class="kn">import</span> <span class="n">count</span>

<span class="c1"># Hypothetical DataFrame Assertion API, inspired by Spark's
# `DataFrame.createGlobalTempView()`.
</span><span class="p">(</span>
    <span class="n">doctors</span>
    <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"on_call"</span><span class="p">))</span>
    <span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">count</span><span class="p">(</span><span class="s">"*"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">.</span><span class="n">createAssertion</span><span class="p">(</span><span class="s">"sufficient_on_call_coverage"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Another hypothetical DataFrame Assertion API, inspired by the Delta
# Live Tables API.
</span><span class="o">@</span><span class="n">assertion</span>
<span class="k">def</span> <span class="nf">sufficient_on_call_coverage</span><span class="p">():</span>
    <span class="k">return</span> <span class="p">(</span>
        <span class="n">doctors</span>
        <span class="p">.</span><span class="n">where</span><span class="p">(</span><span class="n">col</span><span class="p">(</span><span class="s">"on_call"</span><span class="p">))</span>
        <span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">count</span><span class="p">(</span><span class="s">"*"</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1</span><span class="p">)</span>
    <span class="p">)</span>
</code></pre></div></div>

<!--
another example with join + aggregation; e.g. doctors must have completed minimum of 3 training courses
-->

<h2 id="efficiently-checking-assertions">Efficiently Checking Assertions</h2>

<p><code class="language-plaintext highlighter-rouge">CREATE ASSERTION</code> is, unfortunately, vaporware—no popular database supports it.<sup id="fnref:rdb" role="doc-noteref"><a href="#fn:rdb" class="footnote" rel="footnote">4</a></sup> Why is that? There probably isn’t a definitive answer to this question, but we can take some educated guesses at the reasons.</p>

<p>First, generic assertions are not a critical feature. The other constraints SQL supports – <code class="language-plaintext highlighter-rouge">PRIMARY KEY</code>, <code class="language-plaintext highlighter-rouge">FOREIGN KEY</code>, <code class="language-plaintext highlighter-rouge">CHECK</code> – cover most applications’ practical needs. Anything more complex that would need to be expressed as an assertion can instead be implemented in your application layer or as part of a stored procedure.</p>

<p>A deeper reason is that assertions are very difficult to check efficiently. An assertion can touch an arbitrary number of tables and involve expensive operations. Every time a table referenced in an assertion is modified, the assertion needs to be rechecked. That means running a potentially expensive query, possibly involving joins or aggregations, to confirm the assertion still holds. While the assertion is being checked, large ranges of data may need to be locked to ensure consistency in the face of concurrent readers and writers. And if the assertion fails, the original modification to the referenced tables must also fail and be rolled back. Compare this to the traditional constraints supported by SQL, whose checks are typically limited to the rows being updated and do not require looking outside that narrow range.</p>

<p>To express this problem differently, a key aspect of what makes assertions expensive to check is that they are expensive to <em>maintain</em>. That is, given a database in a consistent state with some active assertions, how do you check that an assertion still holds when an arbitrary change is made to the database? Naively, this means running the query that defines the assertion. If the query involves a join and aggregation across two large tables, that means computing the join and aggregation against those large tables from scratch. In the case of our doctors example, that means scanning the table to count how many doctors are on-call.</p>

<!-- [illustration of graph of queries with some marked red as assertions; use doctor example] -->

<!-- [illustration of small input change triggering large recomputation] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/query-constraint-language/on-call-doctors.png" width="600" />
    </span>
    <figcaption>
    One doctor is on call, and since `1 &gt;= 1`, our constraint query returns a single `True` row.
    </figcaption>
</figure>
</div>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/query-constraint-language/on-call-doctors-stale.png" width="600" />
    </span>
    <figcaption>
    The one doctor that was on-call is no longer on-call. Our constraint query output is now stale and needs to be recomputed to account for the change to the `Doctors` table.
    <!-- Ideally, we'd be able to update the constraint query just by looking at the rows in `Doctors` that changed, as opposed to scanning the whole table. -->
    </figcaption>
</figure>
</div>

<p>But why should we run anything from scratch? Given an incremental update to our database, we’d ideally want to be able to incrementally update the queries that define our assertions. That is, instead of recomputing the output of an assertion query from scratch, we want to incrementally update the output given the incremental changes to the input tables referenced in the query. So if one doctor updates their on-call status, we should be able to recompute the on-call constraint by looking at that one changed row.</p>
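<p>To sketch what incremental maintenance looks like in the simplest case, here is a toy Python class, not tied to any real system and with made-up names, that materializes the on-call count once and then adjusts it from each changed row in constant time:</p>

```python
# Toy sketch of incrementally maintaining the on-call assertion.
# Instead of rescanning the doctors table after every change, we keep
# a running count and adjust it from the delta alone. Real systems
# generalize this idea to arbitrary queries.

class OnCallAssertion:
    def __init__(self, doctors):
        # One full scan to initialize the materialized count.
        self.on_call_count = sum(1 for d in doctors.values() if d["on_call"])

    def apply_update(self, old_row, new_row):
        """Incorporate a single changed row in O(1) time."""
        if old_row is not None and old_row["on_call"]:
            self.on_call_count -= 1
        if new_row is not None and new_row["on_call"]:
            self.on_call_count += 1

    def holds(self):
        return self.on_call_count >= 1


doctors = {
    "alice": {"on_call": True},
    "bob": {"on_call": False},
}
assertion = OnCallAssertion(doctors)
assert assertion.holds()

# Alice goes off call. The one changed row is enough to recheck the
# assertion; no rescan of the table is needed.
assertion.apply_update({"on_call": True}, {"on_call": False})
print(assertion.holds())  # False
```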

<p>This problem is starting to look like incrementally maintaining a <a href="/writing/data-pipeline-materialized-view">materialized view</a>. To recap how we got here: A general data assertion or constraint can be modeled as a query with a name. A named query is a view. We need to persist the output of this view in order to check it quickly, meaning we need a materialized view. And when the view output changes, we want a way to efficiently update it without recomputing the whole view. So we conclude that – in its essence – <strong>maintaining a data constraint is the same problem as maintaining a materialized view.</strong></p>

<!--
[^hello-mz]
[^hello-mz]: Hello [Materialize][mz]! This seems right up your alley.
-->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/query-constraint-language/pipeline-with-constraint-nodes.png" width="600" />
    </span>
    <figcaption>
    If you've modeled your data platform as a <a href="/writing/data-pipeline-materialized-view">graph of queries</a>, then a constraint is simply another node in this graph with a special property: If any update to the graph violates the constraint, that update is rolled back. In this example, updates to either dataset A or B will trigger an update to constraint C. If C is violated, the updates to A and B are rolled back.
    </figcaption>
</figure>
</div>

<h2 id="so-what">So what?</h2>

<p>So data constraints can be modeled as queries, and efficiently checking arbitrary constraints is the same problem as efficiently updating a materialized view. What does that get us? If nothing else, understanding how these seemingly separate ideas are deeply connected provides some mental clarity.</p>

<p>I find that valuable for its own sake, but there are perhaps some practical benefits we can derive from this understanding, too.
The main one is this: if your data must conform to some constraints or <em><a href="/writing/how-not-to-die-hard-with-hypothesis">invariants</a></em> – that is, things that must always be true (for example, there must always be at least one on-call doctor) – then express them declaratively. That could be in SQL, some DataFrame API, Datalog, or something else.</p>

<p>Your data platform may not support realtime enforcement of complex constraints, but you can still periodically check for constraint violations using these expressive, high-level queries written in a declarative language. Declarative checks like this would be much easier to write, understand, and maintain compared to relatively low-level, imperative data tests. And you can run them against your real production data to find problems, not just against test data.</p>

<p>There already are modern systems for applying constraints or validation checks to data lakes, like <a href="https://github.com/awslabs/deequ">Deequ</a> and <a href="https://greatexpectations.io">Great Expectations</a>. These systems offer their own <a href="https://docs.greatexpectations.io/docs/guides/expectations/contributing/how_to_contribute_a_new_expectation_to_great_expectations/#1-choose-a-parent-class-to-help-your-implementation">custom APIs</a> for expressing checks, instead of building on widespread query languages like SQL or DataFrames. I feel this is a missed opportunity. On the other hand, I suspect at least one benefit they have derived from building custom APIs is to make it easier to compute groups of validation checks <a href="https://github.com/awslabs/deequ/blob/d243a7c592e30d0422c97988d1c5313c47c0eee0/src/main/scala/com/amazon/deequ/analyzers/Analyzer.scala">efficiently</a>, e.g. by avoiding repeated scans of the same data that may be referenced by multiple constraints.</p>

<p>Other projects that are already based on building graphs of declarative queries – like <a href="https://www.getdbt.com">dbt</a>, <a href="https://materialize.com">Materialize</a>, <a href="https://ksqldb.io">ksqlDB</a>, and <a href="https://databricks.com/product/delta-live-tables">Delta Live Tables</a> – could likewise take advantage of the idea laid out in this post. Some of these projects already allow users to define row-level constraints on a dataset, analogous to the typical <a href="https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-CHECK-CONSTRAINTS">SQL <code class="language-plaintext highlighter-rouge">CHECK</code> constraint</a>. But what if you could declare the whole dataset itself to be a constraint? Whenever an update to the pipeline triggers a refresh of the dataset, the constraint is rechecked.</p>

<h2 id="wrapping-up">Wrapping Up</h2>

<p>Model your <a href="/writing/modern-data-lake-database">data platform as a database</a>. Use a declarative query language to interact with it. If you do both of these things, you can then use that same language to define the constraints on your data platform. That’s the message of this post.</p>

<p>This idea is not new. Many teams run their data platform on a traditional relational database where this idea fits most naturally. In fact, Oracle admins have literally been able to implement <a href="https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html">complex constraints as materialized views</a> – perhaps the most direct application of this post’s idea – since the early 2000s.</p>

<p>But you don’t have to be running on an actual relational database to take inspiration from this idea. Declarative query languages have stood the test of time and been reimplemented for the modern data lake. You can use them today to describe your data in ways that a simple <code class="language-plaintext highlighter-rouge">CHECK</code> constraint cannot capture. And hopefully, in the near future we will see more direct support for efficiently maintained, general data constraints.</p>

<p><em>Thanks to Matthew for reading a draft of this post.</em></p>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:validation" role="doc-endnote">
      <p>The difference being that <em>constraints</em> prevent rule violations upfront, while <em>validation checks</em> identify violations after they have happened. <a href="#fnref:validation" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Martin Kleppmann, Designing Data-Intensive Applications, Chapter 7: Transactions, “Write Skew and Phantoms”, p.246-247 in the fourth release of the first edition. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:non-deter" role="doc-endnote">
      <p>Constraints that make use of non-deterministic functions like <a href="https://www.postgresql.org/docs/14/functions-datetime.html#FUNCTIONS-DATETIME-CURRENT"><code class="language-plaintext highlighter-rouge">CURRENT_TIME</code></a> present added complications that I won’t get into in this post. But they matter especially when we consider how to efficiently maintain constraints. <a href="#fnref:non-deter" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:rdb" role="doc-endnote">
      <p>A user posting to a PostgreSQL mailing list back in 2006 <a href="https://www.postgresql.org/message-id/87sljd7gbn.fsf%40wolfe.cbbrowne.com">reported</a> that a database called Rdb supported assertions. But I cannot find mention of assertions in the <a href="https://www.oracle.com/database/technologies/related/rdb-doc.html#release73">Rdb 7.3 SQL reference manual</a>, which was released in late 2018 by Oracle. <a href="#fnref:rdb" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Nicholas Chammas</name></author><category term="data" /><category term="data-pipelines" /><summary type="html"><![CDATA[What’s the difference between a data constraint and a data query? Is there anything that can be expressed in one form but not the other? My sense is that there is no such thing.]]></summary></entry><entry><title type="html">A Data Pipeline is a Materialized View</title><link href="https://nchammas.com/writing/data-pipeline-materialized-view" rel="alternate" type="text/html" title="A Data Pipeline is a Materialized View" /><published>2021-01-23T00:00:00+00:00</published><updated>2021-01-23T00:00:00+00:00</updated><id>https://nchammas.com/writing/data-pipeline-materialized-view</id><content type="html" xml:base="https://nchammas.com/writing/data-pipeline-materialized-view"><![CDATA[<p>Say you run an online book store and want to build a data pipeline that figures out who the top-selling authors are. Logically, the input to the pipeline is a log of every individual book purchase on the store for all time, along with details about each book like who authored it. And the output is a list of the top-selling authors per month.</p>

<!-- [image: book purchases + authorship info -> top-selling authors of the month] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/top-selling-authors.png" width="1000" />
    </span>
    <figcaption>
    </figcaption>
</figure>
</div>

<p>The output of this data pipeline is a function of the input. In other words, the output is derived from the input by running the input through the pipeline.</p>

<!-- [image: f(input) -> output] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/f-input-output.png" width="400" />
    </span>
    <figcaption>
    </figcaption>
</figure>
</div>

<p>This is an important characteristic of the output. As long as the input data and pipeline transformations (i.e. the pipeline code) are preserved, the output can always be recreated. The input data is <em>primary</em>; if lost, it cannot be replaced. The output data, along with any intermediate stages in the pipeline, are <em>derivative</em>; they can always be recreated from the primary data using the pipeline.</p>
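<p>As a minimal Python sketch of this idea (the dataset shape and function names are invented for illustration, not from any real system), the output really is just a function of the input, so it can be recreated at will:</p>

```python
from collections import Counter

def top_selling_authors(purchases, month):
    """Derive the output (top-selling authors) purely from the input (purchase log)."""
    sales = Counter(p["author"] for p in purchases if p["month"] == month)
    return [author for author, _ in sales.most_common()]

purchases = [
    {"author": "Eric Carle", "month": "2021-01"},
    {"author": "Eric Carle", "month": "2021-01"},
    {"author": "Dr. Seuss",  "month": "2021-01"},
]

# The output is derivative: wipe it out, and the same call recreates it
# exactly, as long as the primary data (purchases) is preserved.
print(top_selling_authors(purchases, "2021-01"))  # ['Eric Carle', 'Dr. Seuss']
```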

<h2 id="the-logical-view">The Logical View</h2>

<p>Let’s represent our hypothetical “Top-Selling Authors” pipeline as a directed graph, where the nodes represent datasets and the edges represent transformations of those datasets. Furthermore, let’s color each dataset in the graph based on whether it’s primary or derivative.</p>

<!-- [image: colored graph of transformations] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/pipeline-graph.png" width="500" />
    </span>
    <figcaption>
    </figcaption>
</figure>
</div>

<p>Most data pipelines, if you zoom out far enough, look something like this. You have some source data; it gets sliced, diced, and combined in various ways to produce some outputs. If someone were to wipe out all the derived data in this pipeline, you’d be able to regenerate it without any data loss. The pipeline could include any number of arbitrary steps, like copying files from an FTP share, or scraping data from a web page. It doesn’t matter as long as the pipeline produces the same output when given the same input.</p>

<p>Any time someone queries the output of the pipeline, it’s logically equivalent to them running the entire pipeline on the source data to get the output they’re looking for. In this way, a pipeline is a <a href="https://docs.microsoft.com/en-us/sql/relational-databases/views/views?view=sql-server-ver15">view</a> into the source data.</p>

<h2 id="materializing-the-view">Materializing the View</h2>

<p>Of course, data pipelines don’t work this way in practice. It would be a waste of resources and a long wait for users if every query triggered a series of computations stretching all the way back to the primary data. When you ask for this month’s top-selling authors, you expect a quick response.</p>

<p>Hence, the typical real-world pipeline <em>materializes</em> its output, and often also several of the intermediate datasets required to produce that final output. Materializing a dataset simply means saving it to persistent storage, as opposed to repeatedly computing it on the fly. So when you ask for that list of authors, whatever system is answering your query can start from the closest materialized dataset, as opposed to starting at the source or primary data.</p>

<!-- [image: cached nodes in graph colored; key explaining colors; caption] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/pipeline-with-cached-nodes.png" width="500" />
    </span>
    <figcaption>
    A query against dataset B only needs to recompute the pipeline starting from A, since A is materialized.
    All derivative datasets, whether materialized or not, can be thrown away and recreated from the primary data.
    </figcaption>
</figure>
</div>

<p>So we’ve turned our view into a <em>materialized view</em>. “View” represents the logical transformations expressed in the pipeline. “Materialized” represents the fact that we cache the output of the pipeline, and perhaps also some of the intermediate steps. A complex set of interdependent data pipelines can be conceptualized in this way, as a graph of materialized views.</p>
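<p>Here is a toy Python sketch of that idea (all names are invented for illustration): a small graph of datasets where the intermediate node A is materialized, so a query against B never has to reach back to the primary data:</p>

```python
# A toy graph of datasets: each derived node has parent datasets and a
# transform. Materialized nodes cache their result; others are computed
# on the fly by walking back toward the primary data.
source = [1, 2, 3, 4]             # primary data

graph = {
    "A": (["source"], lambda parents: [x * 10 for x in parents[0]]),
    "B": (["A"],      lambda parents: sum(parents[0])),
}
cache = {}                        # node name -> materialized result

def compute(node):
    if node == "source":
        return source
    if node in cache:             # start from the closest materialized dataset
        return cache[node]
    parents, transform = graph[node]
    return transform([compute(p) for p in parents])

cache["A"] = compute("A")         # materialize the intermediate dataset A
print(compute("B"))               # 100 -- computed from A's cache, not from source
```

<p>If the cache is lost, <code class="language-plaintext highlighter-rouge">compute</code> simply falls through to the primary data, which is exactly the derivative-versus-primary distinction from earlier.</p>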

<p>Note that this concept can be applied very broadly, and not just to what we think of as “normal” data pipelines:</p>
<ul>
  <li>A traditional web cache alleviates read traffic from the primary database, which is the source of truth. The cache is derivative and can be regenerated from the database at any time. The data in the cache is materialized so that incoming queries do not need to go all the way back to the database to get an answer.</li>
  <li>A build system compiles or assembles source code into artifacts like executables or test reports. The artifacts are derivative, whereas the source code is primary. When you run a program over and over, you reuse the artifacts output by your build system, as opposed to recompiling them from source every time.</li>
</ul>

<h2 id="updating-a-materialized-view">Updating a Materialized View</h2>

<p>Materializing the output, though a practical necessity for most pipelines, adds an administrative cost. When the source data changes, the materialized views need to be updated. Otherwise, the data you get from the view will be <em>stale</em>.</p>

<!-- [image: highlighted new source row; out-of-date output aggregation] -->

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/data-pipeline-materialized-view/update-materialized-view.png" width="600" />
    </span>
    <figcaption>
    The sales total of 101 for Eric Carle is stale. The correct value is now 103.
    </figcaption>
</figure>
</div>

<p>To update a materialized view, there are two high-level properties you typically care about: the update <em>trigger</em>, and the update <em>granularity</em>. The former affects the freshness of your output, which impacts end-users of the data, and the latter affects the performance of your update process, which impacts the engineers or operators responsible for that process.</p>

<h3 id="update-trigger">Update Trigger</h3>

<p>The update trigger is the event that prompts a refresh of the materialized view—e.g. by running your pipeline against the latest source data.</p>

<p>That event may be a file landing in a shared drive, or some data arriving on an event stream, or another pipeline completing. For some pipelines, the update trigger may just be a certain time of day, in which case it might be more useful to talk about the update <em>frequency</em> rather than trigger.</p>

<p>A typical batch pipeline, for example, might run on a daily or hourly cadence, whereas a streaming pipeline may run every few seconds or minutes, or whenever a new event is delivered via some sort of event stream. Whenever the pipeline runs, it updates its output, and the whole process can be viewed as a <em>refresh</em> of the materialized view.</p>

<h3 id="update-granularity">Update Granularity</h3>

<p>The update granularity refers to how much of the materialized view needs to be modified to account for the latest changes to the source data.</p>

<p>A common update granularity is the full refresh. No matter how small or large the change to the source data, when the pipeline runs it throws away the entire output table and rebuilds it from scratch.</p>

<p>A more sophisticated pipeline might rebuild only a subset of the table, like a date partition. And an extremely precise pipeline may know how to update exactly the output rows that are impacted by the latest changes to the source data.</p>

<p>The update trigger and granularity are independent. You can have a pipeline that runs every second and does a full refresh of its output, and you can have a pipeline that runs once a day but carefully updates only the rows that it needs to.</p>

<h3 id="typical-examples">Typical Examples</h3>

<p>Let’s explore these two properties a bit using our example pipeline that computes the top-selling authors of the month.</p>

<h4 id="the-daily-batch-update">The Daily Batch Update</h4>

<p>Every night at 1 a.m., an automated process looks for a dump of the latest purchases from the previous day. The dump is a compressed CSV file.</p>

<p>The update process uses this dump to recompute the month’s sales numbers for all authors. It replaces the entire output table with all-new calculations for all authors. Many of the authors’ numbers may not have changed since the last update (because they had no new sales in that time period), but they all get recomputed nonetheless.</p>

<p>This is a very typical example of a batch pipeline. It has a scheduled update trigger at 1 a.m. every night, and an update granularity of the entire output.</p>
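<p>A rough Python sketch of such a nightly job might look like the following (the CSV layout and function names are invented for illustration):</p>

```python
import csv
import io

# The "materialized view": monthly sales totals per author.
output_table = {}

def nightly_full_refresh(purchase_dump_csv):
    """Rebuild the entire output table from the dump, even for authors
    whose numbers haven't changed since the last run."""
    totals = {}
    for row in csv.DictReader(io.StringIO(purchase_dump_csv)):
        totals[row["author"]] = totals.get(row["author"], 0) + 1
    output_table.clear()
    output_table.update(totals)   # full refresh: every row is recomputed

dump = "author\nEric Carle\nEric Carle\nDr. Seuss\n"
nightly_full_refresh(dump)
print(output_table)               # {'Eric Carle': 2, 'Dr. Seuss': 1}
```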

<h4 id="the-live-updating-table">The Live-Updating Table</h4>

<p>In this version of our top-selling authors pipeline, individual purchases are streamed in as they happen, via an event streaming platform like Apache Kafka. Every purchase on this stream triggers an update to the calculation of top-selling authors.</p>

<p>The update process uses each individual purchase to incrementally recompute the sales total for the relevant author. If an author has no new sales over a given span of updates, their sales total is not recomputed (though their rank in the top-selling authors may need to be updated).</p>

<p>This is an example of a precise streaming pipeline. The update trigger is the purchase event that is streamed in, and the update granularity is the sales total for a single author.</p>
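<p>Sketched in Python (again with invented names and data), the incremental update touches only the row affected by each event. This example deliberately mirrors the figure above, where Eric Carle's stale total of 101 becomes 103:</p>

```python
sales_totals = {"Eric Carle": 101, "Dr. Seuss": 87}

def on_purchase_event(event):
    """Incrementally update only the one row affected by this event."""
    author = event["author"]
    sales_totals[author] = sales_totals.get(author, 0) + 1

# Two purchase events arrive on the stream; only Eric Carle's row is touched.
on_purchase_event({"author": "Eric Carle"})
on_purchase_event({"author": "Eric Carle"})
print(sales_totals["Eric Carle"])  # 103
print(sales_totals["Dr. Seuss"])   # 87 -- never recomputed
```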

<h2 id="the-declarative-data-lake">The Declarative Data Lake</h2>

<p>We previously discussed the idea of conceptualizing your <a href="/writing/modern-data-lake-database">data lake as a database</a>. And here we’ve shown how you can conceptualize your data pipelines as materialized views.</p>

<p>But what if we could take this idea further than just as a conceptual tool? What if you could actually implement your data pipelines as a graph of materialized views?</p>

<p>Taken far enough, the promise of such an idea would be to build a <em>declarative data lake</em>, where the code that manages the lake focuses more on defining <em>what</em> the datasets are and less on <em>how</em> to mechanically build or update them.</p>

<p>Two relatively new projects express aspects of this vision in clear but different ways, and they merit some discussion here: <a href="https://www.getdbt.com">dbt</a> and <a href="https://materialize.com">Materialize</a>.</p>

<h3 id="dbt-pipelines-as-batch-updated-sql-queries">dbt: Pipelines as Batch-Updated SQL Queries</h3>

<p>The core of <a href="https://www.getdbt.com">dbt</a> is an engine for building <a href="https://docs.getdbt.com/docs/introduction#what-makes-dbt-so-powerful">a graph of SQL queries</a>. Parts of any given query can be generated dynamically using a templating language (<a href="https://docs.getdbt.com/tutorial/using-jinja/">Jinja</a>), and queries can reference other queries.</p>

<p>Every query has a configured materialization strategy, which defines whether the results of the query are generated ahead of time, and if so, how they are stored and updated.</p>

<p>If the results are materialized, they can be updated with a full refresh or <a href="https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models/#understanding-incremental-models">incrementally</a>, though there are some restrictions on what kinds of updates can be done incrementally. Updates are typically triggered on a schedule.</p>

<h3 id="materialize-pipelines-as-live-updated-materialized-views">Materialize: Pipelines as Live-Updated Materialized Views</h3>

<p><a href="https://materialize.com">Materialize</a> is an engine for building live, incrementally updated materialized views from streaming sources like Apache Kafka. A view can reference other live-updated views, as well as fixed tables.</p>

<p>The primary interface for creating these views is plain and elegant: A <a href="https://materialize.com/docs/sql/create-materialized-view/"><code class="language-plaintext highlighter-rouge">CREATE MATERIALIZED VIEW</code></a> SQL statement.</p>

<p>Conceptually, this is roughly the same statement that is available in 
<a href="https://docs.oracle.com/en/database/oracle/oracle-database/21/sqlrf/CREATE-MATERIALIZED-VIEW.html">traditional</a>
<a href="https://www.postgresql.org/docs/current/sql-creatematerializedview.html">relational</a>
<a href="https://docs.microsoft.com/en-us/sql/relational-databases/views/create-indexed-views">databases</a>.
Materialize’s implementation, however, allows for very efficient incremental updates against very <a href="https://materialize.com/joins-in-materialize/">flexible and expressive</a> queries. Materialize’s capabilities are based on relatively <a href="https://timelydataflow.github.io/differential-dataflow/introduction.html">new research</a> done by its creators.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The ideas presented in this post are not new. But materialized views never saw widespread adoption as a primary tool for building data pipelines, likely due to their <a href="https://stackoverflow.com/a/25642149/877069">limitations</a> and ties to relational database technologies. Perhaps with this new wave of tools like dbt and Materialize we’ll see materialized views used more heavily as a primary building block in the typical data pipeline.</p>

<p>Regardless of whether we see that kind of broad change, materialized views are still a useful design tool for conceptualizing what we are doing when we build data pipelines.</p>

<p>Get clear on what data is primary and what is derivative. Map your pipeline to the concept of a graph of transformations with materialized, intermediate datasets, each with a specific update trigger and update granularity.</p>

<p>The exercise should help bring some conceptual order to even the messiest pipelines.</p>

<p><em>Read the discussion about this post <a href="https://news.ycombinator.com/item?id=26217911">on Hacker News</a>.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="data-pipelines" /><category term="databases" /><summary type="html"><![CDATA[Say you run an online book store and want to build a data pipeline that figures out who the top-selling authors are. Logically, the input to the pipeline is a log of every individual book purchase on the store for all time, along with details about each book like who authored it. And the output is a list of the top-selling authors per month.]]></summary></entry><entry><title type="html">Thoughts on States vs. Transitions</title><link href="https://nchammas.com/writing/states-vs-transitions" rel="alternate" type="text/html" title="Thoughts on States vs. Transitions" /><published>2020-09-27T00:00:00+00:00</published><updated>2020-09-27T00:00:00+00:00</updated><id>https://nchammas.com/writing/states-vs-transitions</id><content type="html" xml:base="https://nchammas.com/writing/states-vs-transitions"><![CDATA[<p>Many years ago, working as a database developer at a video game company, I was tasked with designing the database behind an in-game wallet service. The wallet service would store each player’s current balance and transaction history. In other words, it was a <a href="https://dba.stackexchange.com/q/5608/2660">simple banking database</a>.</p>

<p>Working on that problem gave me a hands-on introduction to several recurring themes in software engineering. One of those themes is the divide between two ways of thinking about data: thinking about the <strong>state</strong> of something, and thinking about the <strong>transition</strong> of something from one state to another.</p>

<div style="display: flex; justify-content: center; align-items: center;">
  <div style="text-align: center; margin-right: 20px;">
  <figure>
    <span>
      <img src="/assets/images/states-vs-transitions/balances.png" width="130" />
    </span>
    <figcaption>
    The state of Alice's balance.
    </figcaption>
  </figure>
  </div>

  <div style="text-align: center; margin-left: 20px;">
  <figure>
    <span>
      <img src="/assets/images/states-vs-transitions/transactions.png" width="300" />
    </span>
    <figcaption>
    The transitions that led to Alice's balance.
    </figcaption>
  </figure>
  </div>
</div>

<p>The <em>state</em> of something describes <em>what</em> it is at a specific point in time, and a <em>transition</em> describes a change from one state to another—i.e. <em>how</em> the state changed at a particular point in time. There are many other ways to describe the same concepts, but I’ll stick with <em>state</em> and <em>transition</em> for this post. In this player wallet scenario, a player’s balance is a piece of state, whereas the individual transactions that the player makes are transitions on that state.</p>

<p>In this post I’d like to share all the common threads I’ve found in software design and data management once I started to think about this divide.</p>

<h2 id="perspective-state-first-or-transition-first">Perspective: State-First or Transition-First</h2>

<p>When you have a problem that requires you to manage both states and transitions, like in the player wallet example, one of those ways of thinking about the problem will tend to dominate. What I’ve noticed is that whichever perspective dominates, you end up needing to build automatic and efficient ways of deriving the other representation of your problem.</p>

<ul>
  <li>If states are the primary way you think of a problem, then when a piece of state changes you need to automatically derive the transitions that implement that change.</li>
  <li>If transitions are the primary way you think of a problem, then as the number of transitions grows large you need to build a way to efficiently query the state at a specific point in time (which is typically “right now”).</li>
</ul>

<p>Our wallet scenario is a typical “transitions-first” problem. The primary data are the individual transactions against a balance, since that corresponds most naturally to the activity we’re capturing, so we need to build an efficient way to derive the current balance from the history of transactions.</p>
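<p>As a minimal Python sketch of this transitions-first shape (the account data is invented for illustration), the balance is just a fold over the transaction history:</p>

```python
transactions = [
    {"account": "alice", "amount": +50},
    {"account": "alice", "amount": -20},
    {"account": "alice", "amount": +5},
]

def balance(account, history):
    """Derive the state (a balance) by folding over the transitions
    (the transactions)."""
    return sum(t["amount"] for t in history if t["account"] == account)

print(balance("alice", transactions))  # 35

# A real system would also materialize a running balance so that reads
# don't have to scan the full history every time.
```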

<p>There are many problems that fit this “transitions-first” pattern, as well as problems that fit the “state-first” pattern. Let’s look at some examples of each, and see how in each case the secondary way of thinking about the problem needs to be automatically derived.</p>

<h2 id="state-first-thinking">State-First Thinking</h2>

<h3 id="infrastructure-management">Infrastructure Management</h3>

<p><a href="https://www.terraform.io">Terraform</a> is a tool for managing infrastructure that uses declarative configuration files to describe the infrastructure under its control. For example, here is a <a href="https://learn.hashicorp.com/terraform/getting-started/build#configuration">simple configuration</a> that describes a single EC2 instance running on AWS:</p>

<div class="language-terraform highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">provider</span> <span class="s2">"aws"</span> <span class="p">{</span>
  <span class="nx">profile</span> <span class="p">=</span> <span class="s2">"default"</span>
  <span class="nx">region</span>  <span class="p">=</span> <span class="s2">"us-east-1"</span>
<span class="p">}</span>

<span class="k">resource</span> <span class="s2">"aws_instance"</span> <span class="s2">"example"</span> <span class="p">{</span>
  <span class="nx">ami</span>           <span class="p">=</span> <span class="s2">"ami-00b882ac5193044e4"</span>
  <span class="nx">instance_type</span> <span class="p">=</span> <span class="s2">"t2.micro"</span>

  <span class="nx">tags</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">Name</span>  <span class="p">=</span> <span class="s2">"TerraformExample"</span>
    <span class="nx">Owner</span> <span class="p">=</span> <span class="s2">"Nick"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The configuration doesn’t explain <em>how</em> to create this infrastructure. It simply describes <em>what</em> the infrastructure is. In other words, it describes the <em>state</em> of the infrastructure.</p>

<p>When you deploy this configuration, Terraform compares the desired configuration against what is already out there and automatically figures out what operations are required to change the deployed infrastructure to match the configuration. To use the terminology we’re using in this post: The user specifies the desired infrastructure <em>state</em>, and Terraform automatically derives the required <em>transitions</em> to bring about that state.</p>

<p>So when you first deploy the above configuration against an empty environment (at least to Terraform’s knowledge), Terraform reports what actions it will take to bring about the specified infrastructure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.example will be created
  + resource "aws_instance" "example" {
      + ami                          = "ami-00b882ac5193044e4"
      ...
      + tags                         = {
          + "Name"  = "TerraformExample"
          + "Owner" = "Nick"
        }
      ...

Plan: 1 to add, 0 to change, 0 to destroy.
</code></pre></div></div>

<p>Suppose that after you deploy this infrastructure, you update your configuration to change the tags on the instance:</p>

<div class="language-terraform highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">resource</span> <span class="s2">"aws_instance"</span> <span class="s2">"example"</span> <span class="p">{</span>
  <span class="p">...</span>

  <span class="nx">tags</span> <span class="p">=</span> <span class="p">{</span>
    <span class="nx">Name</span>        <span class="p">=</span> <span class="s2">"TerraformExample"</span>
    <span class="nx">Owner</span>       <span class="p">=</span> <span class="s2">"Bob"</span>
    <span class="nx">Environment</span> <span class="p">=</span> <span class="s2">"Dev"</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When you instruct Terraform to deploy this change, it figures out how to modify the existing infrastructure to match your desired configuration, even though you haven’t specified <em>how</em> it would do that, only <em>what</em> infrastructure you want at the end of the day:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # aws_instance.example will be updated in-place
  ~ resource "aws_instance" "example" {
        ami                          = "ami-00b882ac5193044e4"
        ...
      ~ tags                         = {
          + "Environment" = "Dev"
            "Name"        = "TerraformExample"
          ~ "Owner"       = "Nick" -&gt; "Bob"
        }
        ...

Plan: 0 to add, 1 to change, 0 to destroy.
</code></pre></div></div>

<p>As you can see in the plan, Terraform detects that only parts of the infrastructure need to be changed, and it figures out exactly how to change them to match the desired state.</p>

<p>As a Terraform user, you think about infrastructure primarily as what you want its current state to be, and Terraform figures out for you how to transition your infrastructure to match your desired configuration.</p>

<p>This seems like a natural way to approach infrastructure management, but you can probably imagine how a transitions-first approach to this problem would play out. Instead of specifying “I want one instance” in your configuration, you’d say “Add one instance”, and so on. You’d then have to carefully run just the appropriate steps as they are needed, or somehow build idempotency into each operation so it’s safe to rerun them without careful pre-planning. Otherwise, you’d likely end up creating duplicate infrastructure or making unwanted changes.</p>

<h3 id="database-schema-migrations">Database Schema Migrations</h3>

<p>Relational databases are typically tightly coupled to the applications they back. As an application changes, the database schema backing it often also needs to change. But where an application can be updated with a simple code push, the database needs more care because it’s stateful. In other words, it’s carrying all this data that you want to maintain; you typically don’t want to update your database schema by dropping the whole database and redeploying it from scratch, which is in effect what you do when you deploy a new version of your application—replace the old application code entirely with the new. Instead, you want to migrate the database schema in-place, preserving all the data.</p>

<p>When I worked as a database developer, one of my tasks was to plan and execute migrations like this. I’d compare the current database schema against the new one that needed to be deployed, and hand craft a migration script that would <code class="language-plaintext highlighter-rouge">ALTER</code> tables and make any other necessary changes to mutate the schema as needed.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Version 1</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">person</span><span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">first_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">),</span>
  <span class="n">last_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">)</span>
<span class="p">);</span>

<span class="c1">-- Version 2</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">person</span><span class="p">(</span>
  <span class="n">id</span> <span class="nb">INT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
  <span class="n">first_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">),</span>
  <span class="n">last_name</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">200</span><span class="p">),</span>
  <span class="n">birth_date</span> <span class="nb">DATE</span>
<span class="p">);</span>

<span class="c1">-- Derived v1-&gt;v2 migration script</span>
<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">person</span>
<span class="k">ADD</span> <span class="k">COLUMN</span> <span class="n">birth_date</span> <span class="nb">DATE</span><span class="p">;</span>
</code></pre></div></div>

<p>Every release of an application had an associated database schema as well as a migration script to upgrade a database from the previous schema version. The full database schema at a given version was the primary description of the database, and the migration script was derived from the comparison of the full schema at two different versions.</p>

<p>To fit this into the common thread I’m tracing in this post, database schemas fit the state-first mode of thinking. The state of your database schema at a given version is primary (i.e. what the schema is), and the transitions from one schema version to another are secondary (i.e. how to get the schema to that state).</p>

<p>There are tools for approaching database schemas in this fashion, like <a href="https://www.red-gate.com/products/sql-development/sql-compare/">Redgate SQL Compare</a> for SQL Server and <a href="https://github.com/facebookincubator/OnlineSchemaChange">OnlineSchemaChange</a> for MySQL. You give these tools two full schemas, and they compute the appropriate migration script. There are some risks to performing automatic migrations in this fashion. There may be semantic changes to your schema or strict availability requirements that an automated schema migration tool cannot satisfy without human input. But I think these tools address the problem of schema migrations in a conceptually natural way.</p>
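<p>The core idea behind such tools can be sketched in a few lines of Python (a deliberately naive toy, nothing like the real products, which must also handle renames, type changes, and availability constraints): compare two schema states and derive the transitions between them.</p>

```python
# Two versions of a table schema, each described as state (column -> type).
v1 = {"id": "INT", "first_name": "VARCHAR(200)", "last_name": "VARCHAR(200)"}
v2 = {**v1, "birth_date": "DATE"}

def derive_migration(table, old, new):
    """Derive ALTER statements (transitions) by diffing two schema states."""
    statements = []
    for col in new.keys() - old.keys():
        statements.append(f"ALTER TABLE {table} ADD COLUMN {col} {new[col]};")
    for col in old.keys() - new.keys():
        statements.append(f"ALTER TABLE {table} DROP COLUMN {col};")
    return statements

print(derive_migration("person", v1, v2))
# ['ALTER TABLE person ADD COLUMN birth_date DATE;']
```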

<h3 id="source-control-commits">Source Control: Commits</h3>

<p>Consider git. Depending on what you’re doing, git seamlessly moves between a state-first and transition-first view of the world. One of the primary things you do with git is create new commits to capture changes to your codebase. When you create a commit, git <em>derives</em> the diff for the commit by comparing the current state of your codebase against its state at the most recent commit.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gh">diff --git a/.travis.yml b/.travis.yml
index c7938b6..c1a3c91 100644
</span><span class="gd">--- a/.travis.yml
</span><span class="gi">+++ b/.travis.yml
</span><span class="p">@@ -1,6 +1,5 @@</span>
 language: python
 python:
<span class="gd">-  - "3.4"
</span>   - "3.5"
   - "3.6"
 # Work-around for Python 3.7 on Travis CI pulled from here:
</code></pre></div></div>

<p>In other words, you as the developer focus simply on what you want your code to look like now, and git figures out for you how to capture that as an incremental change from the most recently committed state of the code. You specify the desired <em>state</em> of the code, and git computes the <em>transition</em> from one state of the code to the other.</p>
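<p>Python’s standard-library <code class="language-plaintext highlighter-rouge">difflib</code> module makes it easy to sketch this state-to-transition derivation, using the same file contents as the diff shown above:</p>

```python
import difflib

# Two states of a file; we only ever record the states.
old_state = ['language: python', 'python:', '  - "3.4"', '  - "3.5"']
new_state = ['language: python', 'python:', '  - "3.5"']

# Derive the transition (a unified diff) by comparing the two states,
# much as git does when you create a commit.
diff = list(difflib.unified_diff(old_state, new_state, lineterm=""))
print("\n".join(diff))
```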

<p>The derived transitions have a number of uses which you are probably familiar with. We’ll take a look at some of them in the next section.</p>

<h2 id="transition-first-thinking">Transition-First Thinking</h2>

<p>In contrast to these examples of state-first thinking, we have transition-first thinking. This form of thinking naturally fits many real-world problems that are oriented around capturing or responding to events: a customer bought something; a user added a comment; a player made a move.</p>

<p>We saw how bank transactions fit this way of thinking; the individual transactions are primary, and the account balance is derived from those transactions. Let’s take a look at a few more examples of transition-first thinking and see how the state of the world ends up being derived from those transitions.</p>

<h3 id="social-media-activity">Social Media Activity</h3>

<p>“Like, comment, and subscribe!” – a common refrain on social media platforms – also captures a straightforward example of a transition-first problem. Users each add individual likes to a post, for example. Each such action is recorded in the backing database. That’s the primary record of what happened.</p>

<div style="text-align: center;">
<figure>
  <span>
    <img src="/assets/images/states-vs-transitions/social-media.png" width="600" />
  </span>
    <!-- <figcaption>
    </figcaption> -->
</figure>
</div>

<p>When people view a given post, however, they don’t see all the individual likes (at least not by default). What they see is a summary which is derived from all the individual likes and represents the current state of the total. From the perspective of the backend database, the individual likes are primary; the derived total is secondary.</p>
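<p>A minimal Python sketch of this shape (with invented data): each like is stored as a primary event, and the displayed total is derived from those events on demand.</p>

```python
# Primary record: one event per individual like.
like_events = [
    {"post": 42, "user": "alice"},
    {"post": 42, "user": "bob"},
    {"post": 7,  "user": "alice"},
]

def like_count(post_id):
    """Derive the displayed total (state) from the individual likes (transitions)."""
    return sum(1 for e in like_events if e["post"] == post_id)

print(like_count(42))  # 2
```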

<h3 id="turn-based-games">Turn-Based Games</h3>

<p>Games like chess are well represented as a sequence of transitions. Each player takes a turn to make a move, and at any point in time the state of the game can be derived from the history of moves that have been made leading up to that point.</p>

<div style="text-align: center; float: right;">
<figure>
    <img src="/assets/images/states-vs-transitions/deepblue-kasparov-1996-game1.png" width="300" />
    <!-- <figcaption>
    </figcaption> -->
</figure>
</div>

<p>These are the first 10 moves of each player in <a href="https://en.wikipedia.org/wiki/Deep_Blue_versus_Kasparov,_1996,_Game_1">Game 1 of the 1996 match between Garry Kasparov and Deep Blue</a>.</p>

<ol>
  <li>e4 c5</li>
  <li>c3 d5</li>
  <li>exd5 Qxd5</li>
  <li>d4 Nf6</li>
  <li>Nf3 Bg4</li>
  <li>Be2 e6</li>
  <li>h3 Bh5</li>
  <li>0-0 Nc6</li>
  <li>Be3 cxd4</li>
  <li>cxd4 Bb4</li>
</ol>

<p>The state of the board shown in the image is derived from this sequence of moves.</p>

<h3 id="image-editing">Image Editing</h3>

<div style="text-align: center; float: right;">
<figure>
    <span>
        <img src="/assets/images/states-vs-transitions/pixlr-edit-history.png" width="300" />
    </span>
    <figcaption>
    Edit history from the <a href="https://pixlr.com/e/">Pixlr E</a> photo editor.
    </figcaption>
</figure>
</div>

<p>Image editing tools like Photoshop track your edits to the image you are working on, allowing you to easily undo changes or quickly flip between past image states.</p>

<p>You can think of the current state of the image as being derived from the history of edits to the original image. Each edit is a transition from one image state to another. The edits are the primary activity the user engages in, and the resulting image follows from them.</p>
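<p>A toy sketch makes the replay idea concrete. If we model the “image” as a simple dict of properties and each edit as a pure transition function (the edit operations here are hypothetical stand-ins for real Photoshop operations), then undo is nothing more than replaying a shorter prefix of the edit history:</p>

```python
from functools import reduce

# The "image" is a dict of properties; each edit is a pure transition
# function from one image state to the next. (The edits here are
# hypothetical stand-ins for real editor operations.)
original = {"brightness": 0, "cropped": False}

edits = [
    lambda img: {**img, "brightness": img["brightness"] + 10},
    lambda img: {**img, "cropped": True},
    lambda img: {**img, "brightness": img["brightness"] - 5},
]

def state_after(n):
    """Derive the image state after the first n edits.
    Undo is just replaying a shorter prefix of the history."""
    return reduce(lambda img, edit: edit(img), edits[:n], original)

current = state_after(len(edits))  # {'brightness': 5, 'cropped': True}
undone = state_after(1)            # {'brightness': 10, 'cropped': False}
```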

<h3 id="source-control-code-review">Source Control: Code Review</h3>

<p>This is the other side of source control systems like git. Git lets you focus on the state of your code when you make a commit, and derives from that the transition from one state of the code to another. But after that, almost everything else in git starts with the transitions first and derives from those the state of the code.</p>

<p>A great example of a transitions-first interface is what happens when you submit a code change for review, like a GitHub Pull Request.</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/states-vs-transitions/pull-request-diff.png" width="600" />
    </span>
    <figcaption>
    A pull request diff from the <a href="https://github.com/apache/spark/pull/29510/files">Apache Spark project</a>.
    </figcaption>
</figure>
</div>

<p>Instead of presenting the entire code base at once, the pull request focuses on a set of code changes – a “diff” – that a person can review in isolation. Focusing on this limited transformation of the code is what makes code review practical. And when a pull request gets merged in, git uses the encapsulated set of changes to efficiently update all downstream clones of the repository. The diff is primary, and the state of the codebase is derived from a series of transitions (i.e. diffs).</p>
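<p>The “state derived from a series of diffs” pattern can be sketched in a few lines of Python. This is a deliberately simplified model of a diff – a list of line-level operations, not git’s actual diff format – but the shape of the derivation is the same: fold the diffs over the original to get the current state.</p>

```python
from functools import reduce

# A much-simplified "diff": a list of line-level operations. Real git
# diffs are richer, but the derivation has the same shape.
def apply_diff(lines, diff):
    result = list(lines)  # leave the input state untouched
    for op, index, text in diff:
        if op == "replace":
            result[index] = text
        elif op == "insert":
            result.insert(index, text)
        elif op == "delete":
            del result[index]
    return result

original = ["def greet():", "    print('helo')"]
diffs = [
    [("replace", 1, "    print('hello')")],  # fix a typo
    [("insert", 2, "    return None")],      # add a line
]

# The current state of the file is the original plus every diff, in order.
current = reduce(apply_diff, diffs, original)
```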

<p><em>Thanks to Ivan, Michael, and Cip for reading drafts of this post.</em></p>]]></content><author><name>Nicholas Chammas</name></author><summary type="html"><![CDATA[Many years ago, working as a database developer at a video game company, I was tasked with designing the database behind an in-game wallet service. The wallet service would store each player’s current balance and transaction history. In other words, it was a simple banking database.]]></summary></entry><entry><title type="html">The Modern Data Lake is a Database</title><link href="https://nchammas.com/writing/modern-data-lake-database" rel="alternate" type="text/html" title="The Modern Data Lake is a Database" /><published>2020-05-10T00:00:00+00:00</published><updated>2020-05-10T00:00:00+00:00</updated><id>https://nchammas.com/writing/modern-data-lake-database</id><content type="html" xml:base="https://nchammas.com/writing/modern-data-lake-database"><![CDATA[<p>I was recently talking to some coworkers about the mix of data technology we have in our stack. Apache Spark, HDFS, Amazon Athena, Amazon S3, AWS Glue… The list is long. The technologies obviously work together somehow, but to a newcomer it may not be clear how each technology relates to the other. And in the details of how a given technology works it’s easy to lose sight of what purpose it serves in the grand scheme of things.</p>

<p>In a previous post we discussed why many teams use both <a href="/writing/database-access-patterns">Postgres and Redshift</a>, or some equivalent, in their data stack. In this post let’s look at the broader collection of data systems that constitute the modern <a href="https://en.wikipedia.org/wiki/Data_lake">data lake</a> and give you, the newcomer, a mental map of them organized around a longstanding and very useful abstraction—the database.</p>

<h2 id="what-is-a-database">What is a Database?</h2>

<p>First, what is a database? In the abstract, it’s a system for storing and retrieving data. For the purposes of this post, however, I want to take a more practical view inspired by the classic graphical database client interface that is widespread in the software industry.</p>

<p>So in practical terms, this is a database:</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/database-three-components.png" width="600" />
    </span>
    <figcaption>
        A typical database client highlighting the three main components of a database. (<a href="https://json8.wordpress.com/2011/10/30/heidisql-alternative-for-linux/">image source</a>)
    </figcaption>
</figure>
</div>

<p>Specifically, it’s a collection of three components that work together as one system:</p>

<ol>
  <li><strong>Catalog</strong>: The Catalog tracks what data you have – i.e. schemas, tables, and columns – and where in the Storage layer to find it. The Catalog also tracks statistics about the data, like how many rows are in a table, or what the most common values in a specific column are. The Query Engine uses these statistics to figure out how to execute a query efficiently.</li>
  <li><strong>Query Engine</strong>: The Query Engine is what takes your query, in this case a SQL query, and translates it into specific machine instructions that will fetch and assemble the data you asked for. In other words, it takes a <a href="https://neo4j.com/blog/imperative-vs-declarative-query-languages/">declarative query</a> describing <em>what</em> you want and translates it into instructions detailing <a href="https://docs.microsoft.com/en-us/sql/relational-databases/performance/execution-plans"><em>how</em> to get it</a>. The Query Engine uses the Catalog to lookup the datasets referenced in the query and find them in Storage.</li>
  <li><strong>Storage</strong>: The Storage layer holds all of the database’s data. Its job is to store all the rows of data for all the tables in the database and retrieve or update them as requested.</li>
</ol>

<p>Every traditional relational database system, like Postgres or MySQL, comes with all three of these components packaged into one coherent system. They work together seamlessly, but they’re also inseparable. You cannot, for example, query or update the data in the database using regular Unix tools like <code class="language-plaintext highlighter-rouge">grep</code> or <code class="language-plaintext highlighter-rouge">sed</code>; you have to go through the database’s query engine. And while some databases let you use the database’s query engine to query data from <a href="https://www.postgresql.org/docs/current/ddl-foreign-data.html">outside of its own storage layer</a>, it’s very much a secondary capability that you wouldn’t want to rely on heavily.</p>

<h2 id="breaking-up-the-database">Breaking up the Database</h2>

<p>In the years since this formula was first developed and perfected, there’s been an explosion of new database and data processing technology: Graph databases, document databases, column-oriented databases, stream processing systems, and more. Among these new technologies is the group of distributed data processing systems – also known as “Big Data” tools – dominated by the <a href="http://hadoop.apache.org">Hadoop</a> ecosystem. This ecosystem includes systems like Apache Spark.</p>

<p>Broadly speaking, what distinguishes these systems from <a href="/writing/database-access-patterns">traditional databases</a> is that they enable you to process
a) large amounts of data
b) in varied formats
c) quickly
d) and affordably.
They do that by distributing the work to process data over a large number of cheap machines that are clustered together, and by allowing you to process data as it is on your storage system. In other words, instead of sending your data to the query engine, you send your query engine to the data. This contrasts with a traditional database system, where you would need to load the data into a specialized format in an area managed exclusively by the database. So to give a simple example of something you could do with these systems, which wasn’t as easy or practical to do before, you could process 20 TB of plain old CSV data distributed across 100 cheap machines in a few minutes.</p>

<p>When these systems were first being developed, the focus was on making them scalable and fault-tolerant, and the programming APIs weren’t very <a href="https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0">friendly</a>. Over time, these distributed data processing systems evolved to recreate the convenience and productivity of the traditional database system. Instead of MapReduce, you could now query data using plain old SQL. And instead of referring to data by fixed paths on a filesystem, you could now refer to them by abstract schema and table names, just like in a traditional database.</p>

<p>In effect, the people building these systems took the three components of the traditional database – Catalog, Query Engine, and Storage – and reinvented each as a stand-alone component for the distributed, massively scalable world. These components interoperate through shared catalog and storage formats.</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/database-three-components-many-options.png" width="800" />
    </span>
    <figcaption>
        The modern data lake as a logical database--three components, many options.
    </figcaption>
</figure>
</div>

<!-- - Catalog -> Hive Metastore
- Query Engine -> Spark (and Hive, Presto, Amazon Athena, Snowflake, Impala, Redshift Spectrum)
- Storage
    - Storage Formats -> CSV, JSON, Parquet, ORC, Avro
    - Storage Systems -> HDFS, Amazon S3, Azure Blob Storage -->

<p>This means you can store your data in one place, like S3, and query it using multiple tools, like Spark and Presto. These query engines will have the same view of the available datasets by pointing to a shared instance of the Hive Metastore.</p>

<p>Another key point is that storage is split up into <em>formats</em> and <em>systems</em>. Instead of having your data in a closed format on a single server operated on by a single database system, you can have data in multiple, open formats (like CSV or Parquet), across several storage systems. And because the data formats are not specific to any query engine, data created by one query engine can easily be read by another.</p>
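<p>Here’s a small, self-contained illustration of that separation using only Python’s standard library, with SQLite standing in for a query engine like Spark or Presto: the data lives in an open format (CSV) that no engine owns, and a separate engine reads those same bytes and queries them with SQL.</p>

```python
import csv
import io
import sqlite3

# "Storage": data in an open format (CSV) that no single engine owns.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["day", "visits"])
writer.writerows([("2020-05-01", 3), ("2020-05-02", 5)])

# A separate "query engine" (SQLite here, standing in for Spark or
# Presto) reads the same data and answers a SQL query over it.
buf.seek(0)
reader = csv.reader(buf)
next(reader)  # skip the header row
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (day TEXT, visits INTEGER)")
con.executemany("INSERT INTO visits VALUES (?, ?)", reader)
total = con.execute("SELECT SUM(visits) FROM visits").fetchone()[0]
print(total)  # 8
```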

<h2 id="example-the-spark-database">Example: The Spark “Database”</h2>

<p>Apache Spark is extremely popular with teams building data lakes. If you’re reading this post, chances are that you’ve used it at some point. But if your experience with Spark was limited to its RDD or DataFrame APIs, you may not have realized that it can be integrated with these other systems to create a logical database with SQL as the primary query language. So let’s take a quick look at how to do that, keeping in mind that you can do something similar for many other “Big Data” query engines.</p>

<p>Spark comes with a command-line utility called <code class="language-plaintext highlighter-rouge">spark-sql</code>. It’s similar, for example, to Postgres’s <code class="language-plaintext highlighter-rouge">psql</code>. It gives you a SQL-only prompt where you can create, destroy, and query tables in a virtual database. By default, the catalog for this database is stored in a folder called <code class="language-plaintext highlighter-rouge">metastore_db</code>, and the data for the tables in the database is stored in a folder called <code class="language-plaintext highlighter-rouge">spark-warehouse</code>, typically in Parquet format. That’s already pretty neat, but you can take this further by calling <code class="language-plaintext highlighter-rouge">./sbin/start-thriftserver.sh</code> from the Spark home directory. This will start up a JDBC server that you can connect to with any old database client, like <a href="https://dbeaver.io">DBeaver</a>. That will give you the full “Spark is a database” experience. I won’t go over how to do this in detail, since that’s not the focus of this post, but the documentation for Spark’s JDBC server and SQL CLI <a href="http://spark.apache.org/docs/2.4.5/sql-distributed-sql-engine.html">is here</a>.</p>

<p>We can extend this experience to the cloud. If you work with Spark on Amazon EMR, you can <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html">connect Spark to the AWS Glue Data Catalog</a>. This gives Spark the same view into your datasets that several other AWS services have, including Amazon Athena and Amazon Redshift Spectrum. In other words, you can have one catalog, managed by AWS Glue, one location for your data, on S3, and any number of different services or query engines updating or querying that data using SQL. And just as you can with Spark running locally, on EMR you can <a href="https://aws.amazon.com/premiumsupport/knowledge-center/jdbc-connection-emr/">start a JDBC server</a> and connect to it with a regular database client.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>I hope this post connected some dots for you about the various distributed data systems out there. There are many ways to conceptualize a data lake. Thinking of it as a database – i.e. a combination of catalog, query engine, and storage layer – provides a familiar abstraction that will help you mentally map out many of the technologies in this space.</p>

<p>This idea is more than just a conceptual tool, though! After all, a team may use these same technologies to build a data lake without integrating them to create that cohesive “database” package. What are they missing out on? As we touched on earlier, by actually building your data lake around the database abstraction, you can shift the focus of your work away from <em>where</em> the data is or <em>how</em> to manipulate it, and instead focus on <em>what</em> data you want. Let’s explore this idea in a <a href="/writing/data-pipeline-materialized-view">future post</a>.</p>

<p><em>Thanks to Michelle, Yuna, Cip, and Sophie for reading drafts of this post.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="apache-spark" /><category term="databases" /><summary type="html"><![CDATA[I was recently talking to some coworkers about the mix of data technology we have in our stack. Apache Spark, HDFS, Amazon Athena, Amazon S3, AWS Glue… The list is long. The technologies obviously work together somehow, but to a newcomer it may not be clear how each technology relates to the other. And in the details of how a given technology works it’s easy to lose sight of what purpose it serves in the grand scheme of things.]]></summary></entry><entry><title type="html">Postgres and Redshift: Why use both?</title><link href="https://nchammas.com/writing/database-access-patterns" rel="alternate" type="text/html" title="Postgres and Redshift: Why use both?" /><published>2020-03-28T00:00:00+00:00</published><updated>2020-03-28T00:00:00+00:00</updated><id>https://nchammas.com/writing/database-access-patterns</id><content type="html" xml:base="https://nchammas.com/writing/database-access-patterns"><![CDATA[<p>A couple of co-workers who are new to database technology recently asked me why we use both Postgres and Redshift in our stack. They’re both SQL databases and seem to do the same thing. So why not just use one technology? It would be simpler.</p>

<p>It’s a great question. In fact, it’s very common for teams to use a combination of databases in their stack, especially a combination like Postgres and Redshift. Let’s explore why.</p>

<h2 id="understanding-database-types-via-their-access-patterns">Understanding Database Types via their Access Patterns</h2>

<p>There are a lot of database types in the world. A very powerful way to understand any given database is via the <em>access patterns</em> it is designed for.</p>

<p>At a high level, we can divide databases into two broad categories:</p>

<ul>
  <li>transactional</li>
  <li>analytical</li>
</ul>

<p>(Note: We’re not talking about transactions in the sense of <a href="https://en.wikipedia.org/w/index.php?title=ACID_(computer_science)">ACID</a>; we’re focusing just on the transactional access pattern.)</p>

<h2 id="transactional-databases">Transactional Databases</h2>

<p>A transactional database is what most people think of when they hear “database”. It’s typically a database that backs online, interactive operations, like for a store or a game, where users expect instant responses to their queries.</p>

<p>Each query typically touches a very small amount of data, since the user executing the query is usually only reading or writing data about themselves, like updating their profile information or noting a new purchase. Tables tend to be narrow and highly <a href="https://docs.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description">normalized</a>. At any one time there may be thousands or tens of thousands of such queries executing against a transactional database, as a multitude of people interact with the service that the database backs.</p>

<p>Query times in well-operated transactional databases are typically measured in milliseconds or less. Engineers spend a lot of time designing table indexes to enable the database to sift through the minimum number of rows required to answer a query, and tuning database parameters to keep as much data in memory as possible and minimize disk I/O.</p>

<p>In summary, transactional databases are designed for:</p>
<ul>
  <li>point reads/writes, i.e. small amounts of data per query</li>
  <li>high concurrency, i.e. many queries running at the same time</li>
  <li>very low latency, i.e. quick query response times</li>
</ul>

<p>Examples of transactional databases include all the popular database systems you’ve heard of:</p>
<ul>
  <li>MySQL</li>
  <li>Oracle</li>
  <li>Postgres</li>
  <li>SQL Server</li>
</ul>

<div style="text-align: center;">
<figure>
    <a href="http://www.warfaremagazine.co.uk/articles/1415-The-Battle-of-Agincourt/171">
        <img src="/assets/images/battle-of-agincourt-compressed.jpg" width="300" />
    </a>
    <figcaption>
        The transactional database access pattern: Lots and lots of tiny chunks of data coming at you real fast.
    </figcaption>
</figure>
</div>

<h2 id="analytical-databases">Analytical Databases</h2>

<p>An analytical database is designed for a very different access pattern. Instead of backing an online store or game, an analytics database typically backs pipelines or tools that help users analyze large swathes of data.</p>

<p>A typical analytics query will touch a large range of data, like a reporting query that summarizes sales numbers by day for an entire quarter. Compared to a transactional database, there will only be a small number of queries running at one time against an analytics database, and each query will usually only touch a handful of columns in any given table. Tables tend to be wide (i.e. they have many columns) and highly denormalized.</p>

<p>Query times in an analytics database will typically be on the order of seconds or minutes. Indexes, which help queries quickly find their target row, aren’t relevant to analytics databases because queries rarely target a single row. Instead, engineers optimize analytical databases by reorganizing how the data is stored on disk to minimize the number of <em>columns</em> that need to be read, and by compressing the data so that large chunks of data can be read quickly. Trying to hold most of your data in memory is typically not possible or even necessary for analytics workloads.</p>

<p>In summary, analytical databases are designed for:</p>
<ul>
  <li>bulk reads/writes, i.e. large amounts of data per query</li>
  <li>lower concurrency, i.e. fewer queries running at the same time</li>
  <li>higher latency, i.e. longer query response times</li>
</ul>

<p>Popular analytical database systems include:</p>
<ul>
  <li>Vertica</li>
  <li>Redshift</li>
  <li>Teradata</li>
</ul>

<div style="text-align: center;">
<figure>
    <a href="https://en.wikipedia.org/wiki/Trebuchet">
        <img src="/assets/images/trebuchet-castelnaud-compressed.jpg" width="400" />
    </a>
    <figcaption>
        The analytical database access pattern: A handful of huge chunks of data coming at you relatively slowly.
    </figcaption>
</figure>
</div>

<h2 id="typical-usage-examples">Typical Usage Examples</h2>

<p>Now we can answer the question that opened this post: Why use both Postgres and Redshift?</p>

<p>A typical pattern is for teams to use both to build an analytics product. For example, consider a team building a product that tracks visits to your website and then shows you a handy chart summarizing your web traffic over the past few weeks.</p>

<p>The team uses Redshift to bulk load detailed event data tracking every visit to your site and then aggregate it down to a set of summary statistics and key metrics. They then load that summarized data – for example, total visits to your site per day for the past 12 weeks – into Postgres and serve it up from there to a website or API endpoint for users to access. Redshift answers a relatively small number of queries that crunch a lot of data and take a lot of time each, as part of a batch update pipeline, while Postgres answers a larger number of lighter queries that touch smaller amounts of data each, as users browse the summary statistics for their website.</p>
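<p>The Redshift side of that pipeline is, at its core, a big aggregation. As a toy sketch of the idea in Python (the event records here are hypothetical), the detailed per-visit events collapse into one small row per day, and it’s that small summary table that gets loaded into Postgres and served to users:</p>

```python
from collections import Counter

# Raw, detailed event data: one record per page visit. This is the kind
# of data the team bulk loads into Redshift. (Records are hypothetical.)
visits = [
    {"ts": "2020-03-01T09:15:00", "page": "/home"},
    {"ts": "2020-03-01T09:16:30", "page": "/pricing"},
    {"ts": "2020-03-02T11:00:00", "page": "/home"},
]

# The batch pipeline aggregates the detail down to one row per day --
# the small summary that then gets loaded into Postgres and served up
# with fast, point-read queries.
visits_per_day = Counter(v["ts"][:10] for v in visits)
print(visits_per_day)  # Counter({'2020-03-01': 2, '2020-03-02': 1})
```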

<p>It’s also common for the flow between the systems to go the other way around. Consider a team building an online game – an <a href="https://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game">MMORPG</a>, perhaps.</p>

<p>They use Postgres to back online game operations and track what actions a player is taking in the game. Those actions affect the online world and develop the player’s character as they are playing the game. The game only needs to know what a player has done in the current session, so to keep the transactional database light, the team regularly moves data for old sessions from Postgres to Redshift. In Redshift, analysts study player behavior across long stretches of time and try to answer questions like “What is the most popular path players take through our world?” or “Where are players quitting our game, and why?” Postgres handles the flurry of detail-level activity to serve thousands of online players, while Redshift answers big picture queries for a handful of in-house analysts.</p>

<h2 id="final-thoughts">Final Thoughts</h2>

<p>There are many more ways to understand and categorize database systems:</p>
<ul>
  <li>by the <a href="https://fauna.com/blog/demystifying-database-systems-introduction-to-consistency-levels">consistency guarantees</a> they provide;</li>
  <li>by the levels of <a href="http://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html">transaction isolation</a> they provide;</li>
  <li>by how they <a href="https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-scale-introduction#horizontal-and-vertical-scaling">scale</a> to handle additional load;</li>
  <li>by the <a href="https://neo4j.com/blog/why-database-query-language-matters/#cypher">query languages</a> and <a href="https://www.mongodb.com/document-databases">data structures</a> they support;</li>
  <li>or by how they <a href="https://en.wikipedia.org/wiki/Column-oriented_DBMS">lay out data on disk</a>, to name a few.</li>
</ul>

<p>In this post we’ve focused just on access patterns, though databases designed for different access patterns typically do so by differing on these other axes, too.</p>

<p>What about systems like Amazon Athena and Spark SQL, by the way? Many teams with data-intensive workflows tend to use these tools as well. And they certainly <em>look</em> like databases, though there’s something weird about them. Roughly speaking, systems like Athena and Spark SQL <em>can</em> be categorized as analytical databases, but there’s more to them than that. We’ll explore these systems in more detail in a <a href="/writing/modern-data-lake-database">follow-up post</a>.</p>

<p><em>Thanks to Michelle, Yuna, Sam, Cip, Fabian, and Roland for reading drafts of this post.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="databases" /><summary type="html"><![CDATA[A couple of co-workers who are new to database technology recently asked me why we use both Postgres and Redshift in our stack. They’re both SQL databases and seem to do the same thing. So why not just use one technology? It would be simpler.]]></summary></entry><entry><title type="html">Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis</title><link href="https://nchammas.com/writing/how-not-to-die-hard-with-hypothesis" rel="alternate" type="text/html" title="Solving the Water Jug Problem from Die Hard 3 with TLA+ and Hypothesis" /><published>2017-03-28T00:00:00+00:00</published><updated>2017-03-28T00:00:00+00:00</updated><id>https://nchammas.com/writing/how-not-to-die-hard-with-hypothesis</id><content type="html" xml:base="https://nchammas.com/writing/how-not-to-die-hard-with-hypothesis"><![CDATA[<p>In the movie <a href="https://en.wikipedia.org/wiki/Die_Hard_with_a_Vengeance">Die Hard with a Vengeance</a> (aka Die Hard 3), there is a famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?</p>

<div style="text-align: center;">
<figure>
    <span>
        <img src="/assets/images/die-hard/die-hard-water-jugs.jpg" width="500" />
    </span>
    <figcaption>
    McClane and Carver puzzle over the water jug problem.
    </figcaption>
</figure>
</div>

<p>Apparently, you can solve this problem using a formal specification
language like <a href="https://en.wikipedia.org/wiki/TLA%2B">TLA+</a>. I don’t know
much about this topic, but it appears that a <a href="https://en.wikipedia.org/wiki/Formal_specification">formal specification language</a>
is much like a programming language in that it lets you describe the
behavior of a system. However, it’s much more rigorous and builds
on mathematical techniques that enable you to reason more effectively
about the behavior of the system you’re describing than you can with
a typical programming language.</p>

<p>In a recent discussion on Hacker News about TLA+,
I came across <a href="https://news.ycombinator.com/item?id=13919251">this comment</a>
which linked to a fun and simple example
showing <a href="https://github.com/tlaplus/Examples/blob/master/specifications/DieHard/DieHard.tla">how to solve the Die Hard 3 problem with TLA+</a>.
I had to watch the first two lectures from <a href="http://lamport.azurewebsites.net/video/videos.html">Leslie Lamport’s video course on TLA+</a>
to understand the example well, but once I did I was reminded of the
idea of property-based testing and, specifically, <a href="http://hypothesis.works/">Hypothesis</a>.</p>

<p>So what’s property-based testing? It’s a powerful way of testing your
logic by giving your machine a high-level description of how your code
should behave and letting it generate test cases automatically to see
if that description holds. Compare that to traditional unit testing,
for example, where you manually code up specific inputs and outputs
and make sure they match.</p>

<h2 id="how-not-to-die-hard-with-hypothesis">How not to Die Hard with Hypothesis</h2>

<p>Hypothesis has an excellent implementation of property-based testing
<a href="https://github.com/HypothesisWorks/hypothesis-python">for Python</a>.
I thought to myself: I wonder if you can write that
Die Hard specification using Hypothesis? As it turns out, Hypothesis
supports <a href="https://hypothesis.readthedocs.io/en/latest/stateful.html">stateful testing</a>,
and I was able to port the <a href="https://github.com/tlaplus/Examples/blob/master/specifications/DieHard/DieHard.tla">TLA+ example</a>
to Python pretty easily:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">hypothesis</span> <span class="kn">import</span> <span class="n">note</span><span class="p">,</span> <span class="n">settings</span>
<span class="kn">from</span> <span class="nn">hypothesis.stateful</span> <span class="kn">import</span> <span class="n">RuleBasedStateMachine</span><span class="p">,</span> <span class="n">rule</span><span class="p">,</span> <span class="n">invariant</span>


<span class="c1"># The default for `max_examples` is sometimes not enough for Hypothesis
# to find a falsifying example.
</span><span class="o">@</span><span class="n">settings</span><span class="p">(</span><span class="n">max_examples</span><span class="o">=</span><span class="mi">2000</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">DieHardProblem</span><span class="p">(</span><span class="n">RuleBasedStateMachine</span><span class="p">):</span>
    <span class="n">small</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">big</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">fill_small</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="mi">3</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">fill_big</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="mi">5</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">empty_small</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">empty_big</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">pour_small_into_big</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">old_big</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">-</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">-</span> <span class="n">old_big</span><span class="p">)</span>

    <span class="o">@</span><span class="n">rule</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">pour_big_into_small</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">old_small</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">-</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">-</span> <span class="n">old_small</span><span class="p">)</span>

    <span class="o">@</span><span class="n">invariant</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">physics_of_jugs</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">assert</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">small</span> <span class="o">&lt;=</span> <span class="mi">3</span>
        <span class="k">assert</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">&lt;=</span> <span class="mi">5</span>

    <span class="o">@</span><span class="n">invariant</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">die_hard_problem_not_solved</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">note</span><span class="p">(</span><span class="s">"&gt; small: {s} big: {b}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">s</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">small</span><span class="p">,</span> <span class="n">b</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">big</span><span class="p">))</span>
        <span class="k">assert</span> <span class="bp">self</span><span class="p">.</span><span class="n">big</span> <span class="o">!=</span> <span class="mi">4</span>


<span class="n">DieHardTest</span> <span class="o">=</span> <span class="n">DieHardProblem</span><span class="p">.</span><span class="n">TestCase</span>
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">pytest</code> on this file quickly digs up a solution:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>self = DieHardProblem({})

    @invariant()
    def die_hard_problem_not_solved(self):
        note("&gt; small: {s} big: {b}".format(s=self.small, b=self.big))
&gt;       assert self.big != 4
E       AssertionError: assert 4 != 4
E        +  where 4 = DieHardProblem({}).big

how-not-to-die-hard-with-hypothesis.py:17: AssertionError
----------------------------- Hypothesis -----------------------------
&gt; small: 0 big: 0
Step #1: fill_big()
&gt; small: 0 big: 5
Step #2: pour_big_into_small()
&gt; small: 3 big: 2
Step #3: empty_small()
&gt; small: 0 big: 2
Step #4: pour_big_into_small()
&gt; small: 2 big: 0
Step #5: fill_big()
&gt; small: 2 big: 5
Step #6: pour_big_into_small()
&gt; small: 3 big: 4
====================== 1 failed in 0.22 seconds ======================
</code></pre></div></div>

<p><em>Note: I last tested this against Python 3.13, pytest 8.4, and Hypothesis 6.138.</em></p>

<h2 id="whats-going-on-here">What’s Going on Here</h2>

<p>The code and test output are pretty self-explanatory, but here’s a
recap of what’s going on:</p>

<p>We’re defining a state machine. That state machine has an initial
state (two empty jugs) along with some possible transitions. Those
transitions are captured with the <code class="language-plaintext highlighter-rouge">rule()</code> decorator. The initial
state and possible transitions together define how our system works.</p>

<p>Next we define invariants, which are properties that must always hold
true in our system. Our first invariant, <code class="language-plaintext highlighter-rouge">physics_of_jugs</code>, says that
the jugs must hold an amount of water that makes sense. For example,
the big jug can never hold more than 5 gallons of water.</p>

<p>Our next invariant, <code class="language-plaintext highlighter-rouge">die_hard_problem_not_solved</code>, is where it gets
interesting. Here we’re declaring that the problem of getting exactly
4 gallons in the big jug <em>cannot</em> be solved. Since Hypothesis’s job
is to test our logic for bugs, it will give our state machine a
thorough shakedown and see if we ever violate our invariants.
In other words, we’re basically goading Hypothesis into solving the
Die Hard problem for us.</p>

<p>I’m not entirely clear on how Hypothesis does its work, but I know
the basic summary is this: It takes the program properties we’ve
specified – including things like rules, invariants, data types, and
function signatures – and generates data or actions to probe the
behavior of our program. If Hypothesis finds a piece of data or
sequence of actions that gets our program to violate its stated
properties, it tries to whittle that down to a <em>minimal falsifying
example</em> – i.e. something that exposes the same problem in as few
steps as possible. This makes it much easier for you to understand how Hypothesis
broke your code.</p>
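<p>Here’s a toy illustration of that shrinking behavior, separate from
the state-machine example above. The property below is deliberately
false (the test name and property are invented just for this sketch),
and when Hypothesis refutes it, it reports a small counterexample
rather than the first noisy one it happens to generate:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from hypothesis import given, strategies as st

# A deliberately false property: "the largest element of a list of
# integers is always non-negative." Any all-negative list refutes it.
@given(st.lists(st.integers()))
def test_max_is_nonnegative(xs):
    assert max(xs, default=0) == abs(max(xs, default=0))
</code></pre></div></div>

<p>Running this under <code class="language-plaintext highlighter-rouge">pytest</code>, Hypothesis finds a failing list and
shrinks it, typically reporting something as small as <code class="language-plaintext highlighter-rouge">xs=[-1]</code> rather
than whatever long random list it first stumbled on.</p>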

<p>Hypothesis’s output above tells us that it was able to violate the
<code class="language-plaintext highlighter-rouge">die_hard_problem_not_solved</code> invariant and provides us with a
minimal reproduction showing exactly how it did so. That reproduction
is our solution to the problem. It’s also how McClane and Carver did
it in the movie!</p>
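<p>You can sanity-check that reproduction by hand. Here’s a sketch that
mirrors the state machine’s transitions in a plain Python class – no
Hypothesis required; the <code class="language-plaintext highlighter-rouge">Jugs</code> class is just scaffolding invented for
this check – and replays the six steps:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Jugs:
    """Plain-Python mirror of the state machine's transitions."""

    def __init__(self):
        self.small, self.big = 0, 0

    def fill_big(self):
        self.big = 5

    def empty_small(self):
        self.small = 0

    def pour_big_into_small(self):
        # Pour until the small jug is full or the big jug is empty.
        poured = min(3 - self.small, self.big)
        self.small += poured
        self.big -= poured

jugs = Jugs()
for step in (jugs.fill_big, jugs.pour_big_into_small, jugs.empty_small,
             jugs.pour_big_into_small, jugs.fill_big, jugs.pour_big_into_small):
    step()

assert (jugs.small, jugs.big) == (3, 4)  # exactly 4 gallons in the big jug
</code></pre></div></div>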

<h2 id="final-thoughts">Final Thoughts</h2>

<p>All in all, I was pretty impressed with how straightforward it was to
translate the TLA+ example into Python using Hypothesis. And when
Hypothesis spit out the solution, I couldn’t help but smile. It’s
pretty cool to see your computer essentially generate a program that
solves a problem for you. And the Python version of the Die Hard
“spec” is not much more verbose than the
original in TLA+, though TLA+’s notation for current vs. next value
(e.g. <code class="language-plaintext highlighter-rouge">small</code> vs. <code class="language-plaintext highlighter-rouge">small'</code>) is elegant and cuts out the need to have
variables like <code class="language-plaintext highlighter-rouge">old_small</code> and <code class="language-plaintext highlighter-rouge">old_big</code>.</p>
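<p>Python’s tuple assignment can recover some of that elegance: both
right-hand sides are evaluated against the current state before either
attribute is updated. Here’s a sketch of <code class="language-plaintext highlighter-rouge">pour_small_into_big</code> rewritten
that way – behaviorally equivalent to the version above, with a minimal
<code class="language-plaintext highlighter-rouge">Jugs</code> class added just to make the example runnable:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Jugs:
    def __init__(self, small=0, big=0):
        self.small, self.big = small, big

    def pour_small_into_big(self):
        # Both right-hand sides see the *current* state, so no
        # old_big temporary is needed.
        poured = min(self.small, 5 - self.big)
        self.small, self.big = self.small - poured, self.big + poured

jugs = Jugs(small=3, big=3)
jugs.pour_small_into_big()
assert (jugs.small, jugs.big) == (1, 5)
</code></pre></div></div>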

<p>I don’t know how Hypothesis compares to TLA+ in a general sense. I’ve
only just started to learn about property-based testing and TLA+, and
I wonder if they have a place in the work that I do these days, which
is mostly Data Engineering-type stuff. Still, I found this little
exercise fun, and I hope you learned something interesting from it.</p>

<p><em>Thanks to <a href="http://jvns.ca/">Julia</a>, <a href="https://danluu.com/">Dan</a>, Laura, Anjana, and Cip for reading drafts
of this post.</em></p>

<p><em>Read the discussion about this post <a href="https://lobste.rs/s/alz5mw/solving_water_jug_problem_from_die_hard_3">on Lobsters</a>.</em></p>]]></content><author><name>Nicholas Chammas</name></author><category term="python" /><category term="property-based testing" /><category term="hypothesis" /><category term="tla" /><summary type="html"><![CDATA[In the movie Die Hard with a Vengeance (aka Die Hard 3), there is a famous scene where John McClane (Bruce Willis) and Zeus Carver (Samuel L. Jackson) are forced to solve a problem or be blown up: Given a 3 gallon jug and 5 gallon jug, how do you measure out exactly 4 gallons of water?]]></summary></entry></feed>