<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dev.arie.bovenberg.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dev.arie.bovenberg.net/" rel="alternate" type="text/html" /><updated>2024-10-24T19:49:51+00:00</updated><id>https://dev.arie.bovenberg.net/feed.xml</id><title type="html">Arie Bovenberg</title><subtitle></subtitle><entry><title type="html">__init__.py files are optional. Here’s why you should still use them</title><link href="https://dev.arie.bovenberg.net/blog/still-use-init-py/" rel="alternate" type="text/html" title="__init__.py files are optional. Here’s why you should still use them" /><published>2024-10-07T00:00:00+00:00</published><updated>2024-10-07T00:00:00+00:00</updated><id>https://dev.arie.bovenberg.net/blog/still-use-init-py</id><content type="html" xml:base="https://dev.arie.bovenberg.net/blog/still-use-init-py/"><![CDATA[<p>If you’ve ever googled the question
“Why do Python packages have empty <code class="language-plaintext highlighter-rouge">__init__.py</code> files?”,
you could get the idea that Python packages wouldn’t work without them.
This is a common misconception—they’ve been optional since Python 3.3!
Why, then, do most Python projects still have them?</p>

<h2 id="what-are-these-files-again">What are these files again?</h2>

<p><code class="language-plaintext highlighter-rouge">__init__.py</code> files are often used to mark directories as Python packages.
For example, a file structure like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my_package/
    __init__.py
    some_module.py
</code></pre></div></div>

<p>will allow you to run:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">my_package.some_module</span>
<span class="c1"># or
</span><span class="kn">from</span> <span class="nn">my_package</span> <span class="kn">import</span> <span class="n">some_module</span>
</code></pre></div></div>

<p>What you might not know is that in modern Python,
you can omit the <code class="language-plaintext highlighter-rouge">__init__.py</code> file
and still be able to run the same import!
So why not get rid of these files altogether?</p>
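
<p>To see this for yourself, here is a minimal sketch that recreates the layout above in a temporary directory, minus the <code class="language-plaintext highlighter-rouge">__init__.py</code> file. The module content (<code class="language-plaintext highlighter-rouge">VALUE = 42</code>) is made up for illustration.</p>

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Recreate the layout from above, but without an __init__.py
    package_dir = Path(tmp) / "my_package"
    package_dir.mkdir()
    (package_dir / "some_module.py").write_text("VALUE = 42\n")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        some_module = importlib.import_module("my_package.some_module")
        package = sys.modules["my_package"]
        value = some_module.VALUE
        # A package without __init__.py has no file of its own to point to
        package_file = getattr(package, "__file__", None)
    finally:
        sys.path.remove(tmp)
        # Clean up so the temporary package doesn't linger
        sys.modules.pop("my_package.some_module", None)
        sys.modules.pop("my_package", None)

print(value, package_file)
```

<p>The import succeeds, and the only visible difference is that the package object has no <code class="language-plaintext highlighter-rouge">__file__</code>.</p>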

<h2 id="the-benefits-of-being-explicit">The benefits of being explicit</h2>

<p>Imagine the following codebase without any <code class="language-plaintext highlighter-rouge">__init__.py</code> files:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>services/
    component_a/
        one.py
    component_b/
        child/
            two.py
        three.py
    scripts/
        my_script.py
</code></pre></div></div>

<p>Encountering this structure, you might wonder
which of these directories are meant to be packages,
and which are just directories that happen to contain Python files.</p>

<p>This matters in non-obvious ways.
Take the “services” directory for example. Are you meant to…</p>

<ol>
  <li>import <code class="language-plaintext highlighter-rouge">services.component_a.one</code>?</li>
  <li>or <code class="language-plaintext highlighter-rouge">component_a.one</code> with “services” as the working directory?</li>
</ol>

<p>The problem is: only one of these will actually work,
because the package internals likely assume one or the other.
For example, if <code class="language-plaintext highlighter-rouge">one.py</code> contains:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">component_b.three</span>
</code></pre></div></div>

<p>then only option 2 will work.</p>
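
<p>A rough way to check this is to build the tree in a temporary directory and try both import roots. The helper <code class="language-plaintext highlighter-rouge">can_import</code> below is a hypothetical name for this sketch, and only the two files needed for the experiment are created.</p>

```python
import importlib
import sys
import tempfile
from pathlib import Path

TOP_LEVEL = {"services", "component_a", "component_b"}

def can_import(root: str, name: str) -> bool:
    """Return True if `name` imports with `root` added to sys.path."""
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        sys.path.remove(root)
        # Forget what we imported so the next attempt starts clean
        for loaded in list(sys.modules):
            if loaded.split(".")[0] in TOP_LEVEL:
                del sys.modules[loaded]

with tempfile.TemporaryDirectory() as tmp:
    services = Path(tmp) / "services"
    (services / "component_a").mkdir(parents=True)
    (services / "component_b").mkdir()
    (services / "component_a" / "one.py").write_text("import component_b.three\n")
    (services / "component_b" / "three.py").write_text("")

    # Option 1: the parent of "services" is the import root
    option_1 = can_import(tmp, "services.component_a.one")
    # Option 2: "services" itself is the import root
    option_2 = can_import(str(services), "component_a.one")

print(option_1, option_2)
```

<p>Option 1 finds <code class="language-plaintext highlighter-rouge">one.py</code>, but then fails on its <code class="language-plaintext highlighter-rouge">import component_b.three</code>, because <code class="language-plaintext highlighter-rouge">component_b</code> isn’t a top-level name from that root.</p>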

<p>Adding the proper <code class="language-plaintext highlighter-rouge">__init__.py</code> files
takes away the guesswork and makes the structure clear:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>services/
    component_a/
        __init__.py
        one.py
    component_b/
        __init__.py
        child/
            __init__.py
            two.py
        three.py
    scripts/
        my_script.py
</code></pre></div></div>

<p>Now, it’s immediately clear that <code class="language-plaintext highlighter-rouge">component_a</code> and <code class="language-plaintext highlighter-rouge">component_b</code> are packages,
and <code class="language-plaintext highlighter-rouge">services</code> is just a directory.
It also makes clear that “scripts” isn’t a package at all, and <code class="language-plaintext highlighter-rouge">my_script.py</code>
isn’t something you should be importing.
<code class="language-plaintext highlighter-rouge">__init__.py</code> files help developers understand the structure of your codebase.</p>

<h2 id="tooling-needs-to-understand-your-package-structure-too">Tooling needs to understand your package structure too</h2>

<p>You might think: <em>“I’m not convinced. I know my codebase well enough.
And besides, I document how to import my packages in the README.”</em></p>

<p>What you may forget is that it’s not just humans that need
to understand the package structure.
Tools like <code class="language-plaintext highlighter-rouge">mypy</code> and <code class="language-plaintext highlighter-rouge">ruff</code> also need to understand what is a package and what isn’t,
in order to work correctly.
What makes it extra tricky is that you may not notice problems
at first, but they can crop up later as your codebase grows.
Fixing these issues can be a real headache, especially if you’re
not aware of the intricacies of Python’s import system.
By omitting <code class="language-plaintext highlighter-rouge">__init__.py</code> files,
you may be putting a maintenance timebomb in your codebase.</p>

<h2 id="what-about-implicit-namespace-packages">What about implicit namespace packages?</h2>

<p>When omitting <code class="language-plaintext highlighter-rouge">__init__.py</code> files,
you’re actually creating what’s called an <a href="https://peps.python.org/pep-0420/">“implicit namespace package”</a>.
This has some benefits, like allowing you to split a package across multiple directories.
If you use namespace packages for this purpose,
you’re probably aware of the trade-offs,
and you’ve likely already struggled with issues of tooling compatibility
and developer confusion.</p>
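
<p>As an illustration of that splitting, the sketch below puts two portions of one package in separate directories and imports from both. The names <code class="language-plaintext highlighter-rouge">shared_ns</code>, <code class="language-plaintext highlighter-rouge">root_a</code> and <code class="language-plaintext highlighter-rouge">root_b</code> are invented for this example.</p>

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Two separate roots, each holding one portion of "shared_ns"
    for root, module in [("root_a", "alpha"), ("root_b", "beta")]:
        portion = Path(tmp) / root / "shared_ns"
        portion.mkdir(parents=True)
        (portion / f"{module}.py").write_text(f"NAME = {module!r}\n")

    roots = [str(Path(tmp) / "root_a"), str(Path(tmp) / "root_b")]
    sys.path[:0] = roots
    try:
        importlib.invalidate_caches()
        alpha = importlib.import_module("shared_ns.alpha")
        beta = importlib.import_module("shared_ns.beta")
        names = (alpha.NAME, beta.NAME)
        # The package's search path spans both directories
        portion_count = len(list(sys.modules["shared_ns"].__path__))
    finally:
        for root in roots:
            sys.path.remove(root)
        for name in ("shared_ns.alpha", "shared_ns.beta", "shared_ns"):
            sys.modules.pop(name, None)

print(names, portion_count)
```

<p>Adding an <code class="language-plaintext highlighter-rouge">__init__.py</code> to either portion would turn it into a regular package and hide the other one, which is exactly the trade-off you accept when you choose this feature.</p>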

<p>For this reason, implicit namespace packages are rare.
So long as you don’t need the advanced features of implicit namespace packages,
you should stick to using <code class="language-plaintext highlighter-rouge">__init__.py</code> files.</p>

<h2 id="other-loose-ends">Other loose ends</h2>

<ul>
  <li>Although <code class="language-plaintext highlighter-rouge">__init__.py</code> files are often empty, they can also contain code.
For more information, see the <a href="https://docs.python.org/3/tutorial/modules.html#packages">Python documentation</a>.</li>
  <li>You can enforce the use of <code class="language-plaintext highlighter-rouge">__init__.py</code> files in your codebase
<a href="https://docs.astral.sh/ruff/rules/implicit-namespace-package/">using <code class="language-plaintext highlighter-rouge">ruff</code></a>
or a <a href="https://pypi.org/project/flake8-no-pep420/">flake8 plugin</a>.</li>
</ul>
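
<p>If you just want a quick look before adopting one of those tools, a crude stand-in is easy to write. This is only a sketch, not how <code class="language-plaintext highlighter-rouge">ruff</code> or the flake8 plugin work internally; the function name <code class="language-plaintext highlighter-rouge">dirs_missing_init</code> is hypothetical, and the demo tree borrows names from the example earlier in this post.</p>

```python
import tempfile
from pathlib import Path

def dirs_missing_init(root: Path) -> list[str]:
    """List directories under `root` that contain .py files but no __init__.py."""
    missing = []
    for directory in sorted(p for p in root.rglob("*") if p.is_dir()):
        has_python = any(directory.glob("*.py"))
        if has_python and not (directory / "__init__.py").exists():
            missing.append(directory.relative_to(root).as_posix())
    return missing

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for path in [
        "component_a/__init__.py",
        "component_a/one.py",
        "component_b/three.py",  # component_b lacks an __init__.py
        "component_b/child/__init__.py",
        "component_b/child/two.py",
    ]:
        file = root / path
        file.parent.mkdir(parents=True, exist_ok=True)
        file.touch()

    missing = dirs_missing_init(root)

print(missing)
```

<p>Note that a check this simple also flags directories that are intentionally not packages, such as a folder of scripts, so you’d want to exclude those.</p>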

<h2 id="recommendations">Recommendations</h2>

<p>You should use <code class="language-plaintext highlighter-rouge">__init__.py</code> files to make it clear which directories are packages and which aren’t.
This isn’t only helpful for other developers; it’s often necessary for tools like
<code class="language-plaintext highlighter-rouge">mypy</code> to work correctly.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[If you’ve ever googled the question “Why do Python packages have empty __init__.py files?”, you could get the idea that Python packages wouldn’t work without them. This is a common misconception—they’ve been optional since Python 3.3! Why then, do most Python projects still have them?]]></summary></entry><entry><title type="html">Ten Python datetime pitfalls, and what libraries are (not) doing about it</title><link href="https://dev.arie.bovenberg.net/blog/python-datetime-pitfalls/" rel="alternate" type="text/html" title="Ten Python datetime pitfalls, and what libraries are (not) doing about it" /><published>2024-01-20T00:00:00+00:00</published><updated>2024-01-20T00:00:00+00:00</updated><id>https://dev.arie.bovenberg.net/blog/python-datetime-pitfalls</id><content type="html" xml:base="https://dev.arie.bovenberg.net/blog/python-datetime-pitfalls/"><![CDATA[<p>It’s no secret that the Python datetime library has its quirks.
Not only are there probably more than you think;
third-party libraries don’t address most of them!
I created a <a href="https://github.com/ariebovenberg/whenever">new library</a> to explore what a better datetime library could look like.</p>

<p>💬 Discuss this post <a href="https://www.reddit.com/r/Python/comments/1ag6uxc/ten_python_datetime_pitfalls_and_what_libraries/">on Reddit</a>
or <a href="https://news.ycombinator.com/item?id=39417231">Hacker News</a>.</p>

<div class="toc">

  <h3 id="contents">Contents</h3>

  <p><strong>Before we start</strong></p>

  <ul>
    <li><a href="#whats-a-pitfall">What’s a pitfall?</a></li>
    <li><a href="#libraries-considered">Libraries considered</a></li>
  </ul>

  <p><strong>The pitfalls</strong></p>

  <ol>
    <li><a href="#1-incompatible-concepts-are-squeezed-into-one-class">Incompatible concepts are squeezed into one class</a></li>
    <li><a href="#2-operators-ignore-daylight-saving-time-dst">Operators ignore Daylight Saving Time (DST)</a></li>
    <li><a href="#3-the-meaning-of-naïve-is-inconsistent">The meaning of “naïve” is inconsistent</a></li>
    <li><a href="#4-non-existent-datetimes-pass-silently">Non-existent datetimes pass silently</a></li>
    <li><a href="#5-guessing-in-the-face-of-ambiguity">Guessing in the face of ambiguity</a></li>
    <li><a href="#6-disambiguation-breaks-equality">Disambiguation breaks equality</a></li>
    <li><a href="#7-inconsistent-equality-within-timezone">Inconsistent equality within timezone</a></li>
    <li><a href="#8-datetime-inherits-from-date">Datetime inherits from date</a></li>
    <li><a href="#9-datetimetimezone-isnt-enough-for-timezone-support"><code class="language-plaintext highlighter-rouge">datetime.timezone</code> isn’t enough for timezone support</a></li>
    <li><a href="#10-the-local-timezone-is-dst-unaware">The local timezone is DST-unaware</a></li>
  </ol>

  <p><strong>Takeaways</strong></p>

  <ul>
    <li><a href="#datetime-library-scorecard">Datetime library scorecard</a></li>
    <li><a href="#why-should-you-care">Why should you care?</a></li>
    <li><a href="#imagining-a-solution">Imagining a solution</a></li>
  </ul>

</div>

<h2 id="whats-a-pitfall">What’s a pitfall?</h2>

<p>Two notes before we start:</p>

<ul>
  <li>Pitfalls aren’t bugs. They’re cases where <code class="language-plaintext highlighter-rouge">datetime</code> behaves in a way
that is surprising or confusing. It’s always a bit
subjective whether something is a pitfall or not.</li>
  <li>Many pitfalls exist simply because the authors couldn’t
possibly anticipate all future needs.
Adding big features over 20 years—without breaking compatibility—isn’t easy.</li>
</ul>

<h2 id="libraries-considered">Libraries considered</h2>

<p>With that out of the way, these are the third-party datetime
libraries I’m looking at in this post:</p>

<ul>
  <li><a href="https://github.com/arrow-py/arrow"><code class="language-plaintext highlighter-rouge">arrow</code></a> — Probably the most historically popular
datetime library. Its goal is to make datetime easier to use,
and to add features that many people feel are missing from the standard library.</li>
  <li><a href="https://github.com/sdispater/pendulum"><code class="language-plaintext highlighter-rouge">pendulum</code></a> — The only library that
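rivals <code class="language-plaintext highlighter-rouge">arrow</code> in popularity. It has similar goals, while explicitly improving
on <code class="language-plaintext highlighter-rouge">arrow</code>’s handling of Daylight Saving Time (DST).</li>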
  <li><a href="https://github.com/glyph/DateType"><code class="language-plaintext highlighter-rouge">DateType</code></a> — a library that allows
type-checkers to distinguish between naïve and aware datetimes.
It doesn’t change the runtime behavior of <code class="language-plaintext highlighter-rouge">datetime</code>.</li>
  <li><a href="https://github.com/channable/heliclockter"><code class="language-plaintext highlighter-rouge">heliclockter</code></a> — a young library
that offers datetime subclasses for UTC, local, and zoned datetimes.</li>
</ul>

<p>These libraries I’m <em>not</em> looking at:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">pytz</code> and <code class="language-plaintext highlighter-rouge">python-dateutil</code>, which aren’t (full) datetime replacements</li>
  <li><code class="language-plaintext highlighter-rouge">delorean</code>, <code class="language-plaintext highlighter-rouge">maya</code>, and <code class="language-plaintext highlighter-rouge">moment</code> which all appear abandoned</li>
</ul>

<p>Now: on to the pitfalls!</p>

<h2 id="1-incompatible-concepts-are-squeezed-into-one-class">1. Incompatible concepts are squeezed into one class</h2>

<p>It’s an infamous pain point that a <code class="language-plaintext highlighter-rouge">datetime</code> instance can be either naïve or aware,
and that they can’t be mixed.
In any complex codebase, it’s difficult to be sure you won’t accidentally mix them
without actually running the code.
As a result, you end up writing redundant runtime checks,
or hoping all developers diligently read the docstrings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Naïve or aware? No way to tell...
</span><span class="k">def</span> <span class="nf">plan_mission</span><span class="p">(</span><span class="n">launch_utc</span><span class="p">:</span> <span class="n">datetime</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span> <span class="p">...</span>
</code></pre></div></div>

<p>There’s also the question of whether distinguishing aware and naïve is enough,
since within the “aware” category there are actually several different kinds
of datetimes.
While compatible,
the semantics of UTC/offset and IANA timezones are notably different when
it comes to ambiguity, for example.</p>

<h4 id="whats-being-done-about-it">What’s being done about it?</h4>

<ul>
  <li>:heavy_check_mark: <code class="language-plaintext highlighter-rouge">heliclockter</code> has separate classes for local, zoned, and UTC datetimes.</li>
  <li>:heavy_check_mark: <code class="language-plaintext highlighter-rouge">DateType</code> allows type-checkers to distinguish naïve or aware datetimes</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">arrow</code> and <code class="language-plaintext highlighter-rouge">pendulum</code> still have one class for naïve and aware.</li>
</ul>

<h2 id="2-operators-ignore-daylight-saving-time-dst">2. Operators ignore Daylight Saving Time (DST)</h2>

<p>Given that <code class="language-plaintext highlighter-rouge">datetime</code> supports timezones with DST transitions,
you’d reasonably expect that the <code class="language-plaintext highlighter-rouge">+/-</code> operators would take
them into account—but they don’t!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">paris</span> <span class="o">=</span> <span class="n">ZoneInfo</span><span class="p">(</span><span class="s">"Europe/Paris"</span><span class="p">)</span>
<span class="c1"># On the eve of moving the clock forward
</span><span class="n">bedtime</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">25</span><span class="p">,</span> <span class="mi">22</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">)</span>
<span class="n">wake_up</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">26</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">)</span>

<span class="c1"># It says 9 hours, but it's actually 8!
# (because we skipped directly from 2am to 3am due to DST)
</span><span class="n">sleep</span> <span class="o">=</span> <span class="n">wake_up</span> <span class="o">-</span> <span class="n">bedtime</span>
</code></pre></div></div>
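
<p>One common workaround with the standard library alone is to convert both values to UTC before subtracting. The sketch below repeats the example and compares the two results; it assumes the IANA timezone database is available to <code class="language-plaintext highlighter-rouge">zoneinfo</code> on your system.</p>

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")
bedtime = datetime(2023, 3, 25, 22, tzinfo=paris)
wake_up = datetime(2023, 3, 26, 7, tzinfo=paris)

# Same tzinfo on both sides: plain wall-clock arithmetic
wall_clock_sleep = wake_up - bedtime

# Converting to UTC first measures the time that actually elapsed
actual_sleep = wake_up.astimezone(timezone.utc) - bedtime.astimezone(timezone.utc)

print(wall_clock_sleep, actual_sleep)
```

<p>This works, but it relies on every caller remembering to do the conversion.</p>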

<h4 id="whats-being-done-about-it-1">What’s being done about it?</h4>

<ul>
  <li>:heavy_check_mark: <code class="language-plaintext highlighter-rouge">pendulum</code> explicitly fixes this issue</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">heliclockter</code>, <code class="language-plaintext highlighter-rouge">arrow</code>, and <code class="language-plaintext highlighter-rouge">DateType</code> don’t address it</li>
</ul>

<h2 id="3-the-meaning-of-naïve-is-inconsistent">3. The meaning of “naïve” is inconsistent</h2>

<p>In various parts of the standard library, “naïve” datetimes are interpreted
differently. Ostensibly, “naïve” means “detached from the real world”,
but in the datetime library it is often implicitly treated as local time.
Confusingly, it is sometimes treated as UTC<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, while in other places it is
treated as neither!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># a naïve datetime
</span><span class="n">d</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2024</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>

<span class="c1"># here: treated as a local time
</span><span class="n">d</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()</span>
<span class="n">d</span><span class="p">.</span><span class="n">astimezone</span><span class="p">(</span><span class="n">UTC</span><span class="p">)</span>

<span class="c1"># here: assumed UTC
</span><span class="n">d</span><span class="p">.</span><span class="n">utctimetuple</span><span class="p">()</span>
<span class="n">email</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">format_datetime</span><span class="p">(</span><span class="n">d</span><span class="p">)</span>
<span class="n">datetime</span><span class="p">.</span><span class="n">utcnow</span><span class="p">()</span>

<span class="c1"># here: neither! (error)
</span><span class="n">d</span> <span class="o">&gt;=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">UTC</span><span class="p">)</span>
</code></pre></div></div>
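
<p>Two of these interpretations can be observed regardless of your machine’s local timezone. The following sketch uses <code class="language-plaintext highlighter-rouge">timezone.utc</code> instead of the newer <code class="language-plaintext highlighter-rouge">UTC</code> alias so it also runs on older Python versions.</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime

d = datetime(2024, 1, 1)  # naïve

# Treated as UTC of unknown origin: note the "-0000" suffix
header = format_datetime(d)

# Treated as neither local nor UTC: ordering against an aware datetime fails
try:
    d >= datetime.now(timezone.utc)
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(header, comparison_failed)
```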

<h4 id="whats-being-done-about-it-2">What’s being done about it?</h4>

<ul>
  <li>:x: While <code class="language-plaintext highlighter-rouge">pendulum</code> and <code class="language-plaintext highlighter-rouge">arrow</code> do discourage using naïve datetimes,
they still support the same inconsistent semantics.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">DateType</code> and <code class="language-plaintext highlighter-rouge">heliclockter</code> don’t address this</li>
</ul>

<h2 id="4-non-existent-datetimes-pass-silently">4. Non-existent datetimes pass silently</h2>

<p>When the clock in a timezone is set forward, a “gap” is created. For example,
if DST moves the clock forward from 2am to 3am, the time 2:30am is skipped.
The standard library doesn’t warn you when you create such a non-existent time.
As soon as you operate on these objects, you run into problems.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># This time doesn't exist on this date
</span><span class="n">d</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">26</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">)</span>

<span class="c1"># No timestamp exists, so it takes another one from the future
</span><span class="n">t</span> <span class="o">=</span> <span class="n">d</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()</span>
<span class="n">datetime</span><span class="p">.</span><span class="n">fromtimestamp</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">tz</span><span class="o">=</span><span class="n">paris</span><span class="p">)</span> <span class="o">==</span> <span class="n">d</span>  <span class="c1"># False!?
</span></code></pre></div></div>
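
<p>If you need to detect such times yourself, one approach is to round-trip the value through UTC and see whether the wall-clock reading survives. This is a sketch of that idea; <code class="language-plaintext highlighter-rouge">exists</code> is a hypothetical helper, not part of the standard library.</p>

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

def exists(dt: datetime) -> bool:
    """Check whether an aware datetime survives a round trip through UTC."""
    round_trip = dt.astimezone(timezone.utc).astimezone(dt.tzinfo)
    # Compare the wall-clock readings only
    return round_trip.replace(tzinfo=None) == dt.replace(tzinfo=None)

skipped = datetime(2023, 3, 26, 2, 30, tzinfo=paris)  # inside the gap
ordinary = datetime(2023, 3, 26, 3, 30, tzinfo=paris)  # just after it

skipped_exists = exists(skipped)
ordinary_exists = exists(ordinary)
print(skipped_exists, ordinary_exists)
```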

<h4 id="whats-being-done-about-it-3">What’s being done about it?</h4>

<ul>
  <li>:x: <code class="language-plaintext highlighter-rouge">pendulum</code> replaces the current silent behavior with another: it
fast-forwards to a valid time <a href="https://github.com/sdispater/pendulum/issues/697">without warning</a>.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">arrow</code>, <code class="language-plaintext highlighter-rouge">DateType</code> and <code class="language-plaintext highlighter-rouge">heliclockter</code> don’t address this issue</li>
</ul>

<h2 id="5-guessing-in-the-face-of-ambiguity">5. Guessing in the face of ambiguity</h2>

<p>When the clock in a timezone is set backwards, an ambiguity is created.
For example, if DST sets the clock one hour back at 3am, the time 2:30am exists
twice: before and <em>after</em> the change.
The <code class="language-plaintext highlighter-rouge">fold</code> attribute <a href="https://peps.python.org/pep-0495/">was introduced</a>
to resolve these ambiguities.</p>

<p>The problem is that there is no objective default value for <code class="language-plaintext highlighter-rouge">fold</code>:
whether you want the “earlier” or “later”
option will depend on the particular context.
For backwards compatibility, the standard library defaults to <code class="language-plaintext highlighter-rouge">0</code>,
which has the effect of silently assuming that you want the earlier occurrence<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">2</a></sup>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Guesses your intent without warning
</span><span class="n">d</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">)</span>
</code></pre></div></div>
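
<p>You can at least detect the ambiguity before deciding how to resolve it, by comparing the UTC offsets for both values of <code class="language-plaintext highlighter-rouge">fold</code>. The helper name <code class="language-plaintext highlighter-rouge">is_ambiguous</code> is made up for this sketch, which assumes an aware datetime with a <code class="language-plaintext highlighter-rouge">ZoneInfo</code> timezone.</p>

```python
from datetime import datetime
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

def is_ambiguous(dt: datetime) -> bool:
    """True if the wall-clock time occurs twice in its timezone."""
    first = dt.replace(fold=0).utcoffset()
    second = dt.replace(fold=1).utcoffset()
    # When a time repeats, the first occurrence has the larger UTC offset;
    # for a skipped time it is the other way around.
    return first > second

repeated = is_ambiguous(datetime(2023, 10, 29, 2, 30, tzinfo=paris))
skipped = is_ambiguous(datetime(2023, 3, 26, 2, 30, tzinfo=paris))
ordinary = is_ambiguous(datetime(2023, 10, 29, 12, 0, tzinfo=paris))
print(repeated, skipped, ordinary)
```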

<h4 id="whats-being-done-about-it-4">What’s being done about it?</h4>

<ul>
  <li>:x: <code class="language-plaintext highlighter-rouge">pendulum</code> also guesses, but rather arbitrarily decides that <code class="language-plaintext highlighter-rouge">1</code>
is the better default<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">arrow</code>, <code class="language-plaintext highlighter-rouge">DateType</code> and <code class="language-plaintext highlighter-rouge">heliclockter</code> don’t address the issue.</li>
</ul>

<h2 id="6-disambiguation-breaks-equality">6. Disambiguation breaks equality</h2>

<p>Even though <code class="language-plaintext highlighter-rouge">fold</code> was introduced to disambiguate times,
comparisons of disambiguated times between timezones <em>always</em> evaluate to false due to
<a href="https://peps.python.org/pep-0495/#id12">backwards compatibility reasons</a>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># A properly disambiguated time...
</span><span class="n">d</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">,</span> <span class="n">fold</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">d_utc</span> <span class="o">=</span> <span class="n">d</span><span class="p">.</span><span class="n">astimezone</span><span class="p">(</span><span class="n">UTC</span><span class="p">)</span>
<span class="n">d_utc</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()</span> <span class="o">==</span> <span class="n">d</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()</span>  <span class="c1"># True: same moment in time
</span><span class="n">d_utc</span> <span class="o">==</span> <span class="n">d</span>  <span class="c1"># False!?
</span></code></pre></div></div>

<h4 id="whats-being-done-about-it-5">What’s being done about it?</h4>

<ul>
  <li>:x: None of the libraries addresses this issue</li>
</ul>

<h2 id="7-inconsistent-equality-within-timezone">7. Inconsistent equality within timezone</h2>

<p>In a mirror image of the previous pitfall, there is a false positive
when comparing two datetimes with the exact same <code class="language-plaintext highlighter-rouge">tzinfo</code> object.
In that case, they are compared by their “wall time”.
This is mostly the same <em>except</em> when <code class="language-plaintext highlighter-rouge">fold</code> is involved…</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># two times one hour apart (due to DST transition)
</span><span class="n">earlier</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">,</span> <span class="n">fold</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">later</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">paris</span><span class="p">,</span> <span class="n">fold</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">earlier</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()</span> <span class="o">==</span> <span class="n">later</span><span class="p">.</span><span class="n">timestamp</span><span class="p">()</span>  <span class="c1"># false, as expected
</span><span class="n">earlier</span> <span class="o">==</span> <span class="n">later</span>  <span class="c1"># true!?
</span></code></pre></div></div>

<p>Remember I said <em>exact same</em> <code class="language-plaintext highlighter-rouge">tzinfo</code> object? If you
compare with the same timezone, but you get its object from <code class="language-plaintext highlighter-rouge">dateutil.tz</code>
instead of <code class="language-plaintext highlighter-rouge">ZoneInfo</code>, you’ll get a different result!</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dateutil</span> <span class="kn">import</span> <span class="n">tz</span>
<span class="n">later2</span> <span class="o">=</span> <span class="n">later</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="n">tzinfo</span><span class="o">=</span><span class="n">tz</span><span class="p">.</span><span class="n">gettz</span><span class="p">(</span><span class="s">"Europe/Paris"</span><span class="p">))</span>
<span class="n">earlier</span> <span class="o">==</span> <span class="n">later2</span>  <span class="c1"># now false
</span></code></pre></div></div>

<h4 id="whats-being-done-about-it-6">What’s being done about it?</h4>

<ul>
  <li>:x: None of the libraries addresses this issue</li>
</ul>

<h2 id="8-datetime-inherits-from-date">8. Datetime inherits from date</h2>

<p>You may be surprised to know that <code class="language-plaintext highlighter-rouge">datetime</code> is a subclass of <code class="language-plaintext highlighter-rouge">date</code>.
This doesn’t seem problematic at first, but it leads to odd behavior.
Most notably, the fact that <code class="language-plaintext highlighter-rouge">date</code> and <code class="language-plaintext highlighter-rouge">datetime</code> cannot be compared
violates <a href="https://en.wikipedia.org/wiki/Liskov_substitution_principle">basic assumptions</a>
of how subclasses should work.
The <code class="language-plaintext highlighter-rouge">datetime/date</code> inheritance is now
<a href="https://discuss.python.org/t/renaming-datetime-datetime-to-datetime-datetime/26279/2">widely considered</a>
to be a <a href="https://github.com/python/typeshed/issues/4802">design flaw</a>
in the standard library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Breaks on a datetime, even though it's a subclass
</span><span class="k">def</span> <span class="nf">is_future</span><span class="p">(</span><span class="n">d</span><span class="p">:</span> <span class="n">date</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
    <span class="k">return</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="n">date</span><span class="p">.</span><span class="n">today</span><span class="p">()</span>

<span class="c1"># Some methods inherited from `date` don't make sense
</span><span class="n">datetime</span><span class="p">.</span><span class="n">today</span><span class="p">()</span>  <span class="c1"># fun exercise: what does this return?
</span></code></pre></div></div>
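
<p>To make the first problem concrete, here is a small sketch that passes a <code class="language-plaintext highlighter-rouge">datetime</code> to the function above and records what happens at runtime.</p>

```python
from datetime import date, datetime

def is_future(d: date) -> bool:
    return d > date.today()

# A type checker accepts this argument, because datetime is a subclass of date
accepted_as_date = isinstance(datetime.now(), date)

try:
    is_future(datetime.now())
    raised = False
except TypeError:
    # Ordering a datetime against a date is rejected at runtime
    raised = True

print(accepted_as_date, raised)
```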

<h4 id="whats-being-done-about-it-7">What’s being done about it?</h4>

<ul>
  <li>:heavy_check_mark: <code class="language-plaintext highlighter-rouge">DateType</code> was explicitly developed to fix this inheritance relationship
at type-checking time.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">arrow</code>, <code class="language-plaintext highlighter-rouge">pendulum</code>, and <code class="language-plaintext highlighter-rouge">heliclockter</code> don’t address the issue.
Their datetime classes all inherit from <code class="language-plaintext highlighter-rouge">datetime</code> (and thus also <code class="language-plaintext highlighter-rouge">date</code>).</li>
</ul>

<h2 id="9-datetimetimezone-isnt-enough-for-timezone-support">9. <code class="language-plaintext highlighter-rouge">datetime.timezone</code> isn’t enough for timezone support</h2>

<p>OK—so this is maybe something you learn once and then never forget.
But it’s still confusing that <code class="language-plaintext highlighter-rouge">datetime.timezone</code> is only for fixed offsets,
and you need <code class="language-plaintext highlighter-rouge">ZoneInfo</code> to express real-world timezone behavior with DST transitions.
For beginners who don’t know the difference, this is an unfortunate trap.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">timezone</span><span class="p">,</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="kn">from</span> <span class="nn">zoneinfo</span> <span class="kn">import</span> <span class="n">ZoneInfo</span>

<span class="c1"># Wrong: it's a fixed offset only valid in winter!
</span><span class="n">paris_tz</span> <span class="o">=</span> <span class="n">timezone</span><span class="p">(</span><span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="s">"CET"</span><span class="p">)</span>

<span class="c1"># Correct: accounts for all timezone changes
</span><span class="n">paris_tz</span> <span class="o">=</span> <span class="n">ZoneInfo</span><span class="p">(</span><span class="s">"Europe/Paris"</span><span class="p">)</span>
</code></pre></div></div>

<ul>
  <li>:heavy_check_mark: Both <code class="language-plaintext highlighter-rouge">arrow</code> and <code class="language-plaintext highlighter-rouge">pendulum</code> side-step this issue by specifying
timezones as strings instead of requiring a special class instance.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">heliclockter</code> and <code class="language-plaintext highlighter-rouge">DateType</code> don’t address this issue.</li>
</ul>
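
<p>The difference becomes visible when you ask both objects for their UTC offset in summer. This is a small sketch that assumes the IANA timezone database is available on the system (or through the <code class="language-plaintext highlighter-rouge">tzdata</code> package). The variable names are only illustrative.</p>

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

fixed_cet = timezone(timedelta(hours=1), "CET")
paris = ZoneInfo("Europe/Paris")

# Noon on a summer day, attached to each timezone object
summer_noon = datetime(2023, 7, 1, 12)
fixed_offset = summer_noon.replace(tzinfo=fixed_cet).utcoffset()
paris_offset = summer_noon.replace(tzinfo=paris).utcoffset()

# The fixed offset stays at +1 hour, while Paris is at +2 hours in summer
print(fixed_offset, paris_offset)
```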

<h2 id="10-the-local-timezone-is-dst-unaware">10. The local timezone is DST-unaware</h2>

<p>Calling <code class="language-plaintext highlighter-rouge">astimezone()</code> without arguments gives you the time in the local system
timezone. However, it returns it as a fixed offset (<code class="language-plaintext highlighter-rouge">datetime.timezone</code>) instead of a
full timezone (<code class="language-plaintext highlighter-rouge">ZoneInfo</code>) that knows about DST transitions.
In Paris, for example, <code class="language-plaintext highlighter-rouge">astimezone()</code> returns a fixed offset of UTC+1
or UTC+2 (depending on whether it’s winter or summer) instead
of the full <code class="language-plaintext highlighter-rouge">Europe/Paris</code> timezone.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># you think you've got the local timezone
</span><span class="n">my_tz</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">).</span><span class="n">astimezone</span><span class="p">().</span><span class="n">tzinfo</span>
<span class="c1"># but you actually only have the wintertime variant
</span><span class="k">print</span><span class="p">(</span><span class="n">my_tz</span><span class="p">)</span>  <span class="c1"># timezone(offset=timedelta(hours=1), "CET")
</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2023</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">my_tz</span><span class="p">)</span>  <span class="c1"># not valid for summer!
</span></code></pre></div></div>

<h4 id="whats-being-done-about-it-8">What’s being done about it?</h4>

<ul>
  <li>:heavy_check_mark: <code class="language-plaintext highlighter-rouge">pendulum</code> and <code class="language-plaintext highlighter-rouge">arrow</code> have methods to convert to the full local timezone.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">heliclockter</code> has a local datetime type with the same issue,
although a fix is in the works.</li>
  <li>:x: <code class="language-plaintext highlighter-rouge">DateType</code> doesn’t address this issue</li>
</ul>
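
<p>You can check this on your own machine with a short sketch like the one below. It only relies on the standard library, and the variable names are illustrative. Whatever your system timezone is, the returned <code class="language-plaintext highlighter-rouge">tzinfo</code> is a fixed-offset <code class="language-plaintext highlighter-rouge">datetime.timezone</code>, so it reports the same offset in January and July.</p>

```python
from datetime import datetime, timezone

# The "local timezone" as seen from a winter date
local_tz = datetime(2023, 1, 1).astimezone().tzinfo

# It is a fixed offset, not a DST-aware timezone
is_fixed_offset = isinstance(local_tz, timezone)

# Consequently, the offset never changes throughout the year
january = datetime(2023, 1, 1, tzinfo=local_tz).utcoffset()
july = datetime(2023, 7, 1, tzinfo=local_tz).utcoffset()
same_offset_all_year = january == july
```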

<h2 id="datetime-library-scorecard">Datetime library scorecard</h2>

<p>Below is a summary of how the libraries address the pitfalls (:heavy_check_mark:) or not (:x:).</p>

<table>
  <thead>
    <tr>
      <th>Pitfall</th>
      <th>Arrow</th>
      <th>Pendulum</th>
      <th>DateType</th>
      <th>Heliclockter</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>aware/naïve in one class</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:heavy_check_mark:</td>
      <td>:heavy_check_mark:</td>
    </tr>
    <tr>
      <td>Operators ignore DST</td>
      <td>:x:</td>
      <td>:heavy_check_mark:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>Unclear “naïve” semantics</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>Silent non-existence</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>Guesses on ambiguity</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>Disambiguation breaks equality</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>Inconsistent equality within zone</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>datetime inherits from date</td>
      <td>:x:</td>
      <td>:x:</td>
      <td>:heavy_check_mark:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">timezone</code> isn’t enough for timezone support</td>
      <td>:heavy_check_mark:</td>
      <td>:heavy_check_mark:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
    <tr>
      <td>DST-unaware local timezone</td>
      <td>:heavy_check_mark:</td>
      <td>:heavy_check_mark:</td>
      <td>:x:</td>
      <td>:x:</td>
    </tr>
  </tbody>
</table>

<h2 id="why-should-you-care">Why should you care?</h2>

<p>The pitfalls roughly fall into two categories:
<em>confusing design</em> and <em>surprising edge cases</em>.
Here is why you should care about both.</p>

<h3 id="confusing-design">Confusing design</h3>

<p>Confusing design is the larger problem,
because it amplifies the biggest source of bugs: human error.
While good design helps minimize the chance of mistakes,
bad design introduces more opportunities for them.
Looking at other languages, it’s clear that better designs are possible.
Java, C#, and Rust all have distinct classes for naïve and aware datetimes (and more).
We can also see that redesigns are worth the substantial effort:
Java <a href="https://jcp.org/en/jsr/detail?id=310">adopted Joda-Time</a>,
and JavaScript is <a href="https://tc39.es/proposal-temporal/docs/">modernizing as well</a>.
Will Python’s datetime be left behind?</p>

<h3 id="surprising-edge-cases">Surprising edge cases</h3>

<p>Because these pitfalls are rare, you may think they’re not worth worrying about.
After all, DST transitions only represent about 0.02% of the year.
While this sentiment is understandable, I’d argue that the opposite is true:</p>

<ul>
  <li>Getting timezones right is one of the main <em>reasons for existence</em> of
a datetime library. If it can’t do that reliably, what’s the point?</li>
  <li>Rare cases are the most dangerous: they are the ones you’re least likely to test,
and allow bad actors to trip up your code.</li>
  <li>Rare is still too common for such a fundamental concept as time.
Would you run your business on <code class="language-plaintext highlighter-rouge">numpy</code> if it had a
0.02% chance of returning the wrong result?
Would you accept a language in which 1 in 4000 booleans would arbitrarily be flipped?
There is no reason why these pitfalls shouldn’t be corrected.</li>
</ul>
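
<p>As a rough sanity check of these numbers, here is a back-of-the-envelope calculation. It assumes two DST transitions per year, each affecting about one hour of local time. That assumption is mine, for illustration, and is not a precise model of every timezone.</p>

```python
HOURS_PER_YEAR = 365 * 24  # 8760
AFFECTED_HOURS = 2  # assumption: one skipped hour plus one repeated hour

fraction = AFFECTED_HOURS / HOURS_PER_YEAR
percentage = round(fraction * 100, 3)
one_in = round(1 / fraction)

print(f"{percentage}% of the year, or about 1 in {one_in} hours")
```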

<h2 id="imagining-a-solution">Imagining a solution</h2>

<p>Inspired by these findings, I created a
<a href="https://github.com/ariebovenberg/whenever">new library</a> to explore
what a better datetime library could look like.
Here is how it addresses the pitfalls:</p>

<ol>
  <li>
    <p>It has distinct classes for the most common use cases:</p>

    <p>(note: the types have been updated since the original article)</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">whenever</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="c1"># In case you don't care about timezones
</span>    <span class="n">Instant</span><span class="p">,</span>
    <span class="c1"># Simple localization sans DST
</span>    <span class="n">OffsetDateTime</span><span class="p">,</span>
    <span class="c1"># Full-featured IANA timezones
</span>    <span class="n">ZonedDateTime</span><span class="p">,</span>
    <span class="c1"># The current system timezone
</span>    <span class="n">SystemDateTime</span><span class="p">,</span>
    <span class="c1"># 'Naive' local times without a timezone
</span>    <span class="n">LocalDateTime</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div>    </div>
  </li>
  <li>Addition and subtraction take DST into account.</li>
  <li>Naïve is always naïve. UTC and local time have their own separate classes.</li>
  <li>Creating non-existent datetimes raises an exception.</li>
  <li>
    <p>Ambiguous datetimes must be explicitly disambiguated.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ZonedDateTime</span><span class="p">(</span>
    <span class="mi">2023</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">tz</span><span class="o">=</span><span class="s">"Europe/Paris"</span><span class="p">,</span>
<span class="p">)</span>  <span class="c1"># ok: not ambiguous
</span><span class="n">ZonedDateTime</span><span class="p">(</span>
    <span class="mi">2023</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">tz</span><span class="o">=</span><span class="s">"Europe/Paris"</span><span class="p">,</span>
<span class="p">)</span>  <span class="c1"># ERROR: ambiguous!
</span><span class="n">ZonedDateTime</span><span class="p">(</span>
    <span class="mi">2023</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">29</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">tz</span><span class="o">=</span><span class="s">"Europe/Paris"</span><span class="p">,</span>
    <span class="n">disambiguate</span><span class="o">=</span><span class="s">"later"</span>
<span class="p">)</span>  <span class="c1"># that's better!
</span></code></pre></div>    </div>
  </li>
  <li>Disambiguated datetimes work correctly in comparisons.</li>
  <li>
    <p>Aware datetimes are equal if they occur at the same moment. No exceptions.</p>

    <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">==</span> <span class="n">b</span>
<span class="c1"># always equivalent to:
</span><span class="n">a</span><span class="p">.</span><span class="n">instant</span><span class="p">()</span> <span class="o">==</span> <span class="n">b</span><span class="p">.</span><span class="n">instant</span><span class="p">()</span>
</code></pre></div>    </div>
  </li>
  <li>The datetime classes don’t inherit from date.</li>
  <li>IANA timezones are used everywhere, no separate classes are needed.</li>
  <li>Local datetimes handle DST transitions correctly.</li>
</ol>

<p><a href="https://github.com/ariebovenberg/whenever">Feedback is welcome!</a> :star2:</p>

<h2 id="changelog">Changelog</h2>

<p>See the <a href="https://github.com/ariebovenberg/ariebovenberg.github.io/commits/main/_posts/2024-01-20-python-datetime-pitfalls.md">git history</a>
for exact changes to this article since initial publication.</p>

<h3 id="2024-02-01-1814000100">2024-02-01 18:14:00+01:00</h3>

<ul>
  <li>Clarified wording and code comments in pitfall #3.</li>
</ul>

<h3 id="2024-02-02-1013000100">2024-02-02 10:13:00+01:00</h3>

<ul>
  <li>Clarified wording around timezones and IANA tz database in pitfall #9,
and throughout the article.</li>
  <li>Added reddit link</li>
</ul>

<h3 id="2024-02-13-0840000100">2024-02-13 08:40:00+01:00</h3>

<ul>
  <li>Clarified wording on distinguishing “aware” types in pitfall #1.</li>
  <li>Added note about RFC 5545 in pitfall #5.</li>
</ul>

<h3 id="2024-02-18-2028000100">2024-02-18 20:28:00+01:00</h3>

<ul>
  <li>Added Hacker News link</li>
  <li>Clarification in pitfall #4, fix code example</li>
  <li>Added non-emoji text to scorecard for systems that don’t support it</li>
</ul>

<h3 id="2024-02-18-2110000100">2024-02-18 21:10:00+01:00</h3>

<ul>
  <li>A better solution for emoji :tada:</li>
</ul>

<h3 id="2024-10-03-1915000200">2024-10-03 19:15:00+02:00</h3>

<ul>
  <li>Updated the types in the example code to match the current version of the library</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>In the standard library, methods like <code class="language-plaintext highlighter-rouge">utcnow()</code> are slowly being deprecated,
  but many UTC-assuming parts remain. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>This does coincide with RFC 5545, but this is probably
  coincidental. PEP495 doesn’t mention RFC 5545, and its semantics
  aren’t followed in other areas of the standard library. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Interestingly, pendulum used to have an explicit <code class="language-plaintext highlighter-rouge">dst_rule</code> parameter that
  was silently <a href="https://github.com/sdispater/pendulum/issues/789">removed in 3.0</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[It’s no secret that the Python datetime library has its quirks. Not only are there probably more than you think; third-party libraries don’t address most of them! I created a new library to explore what a better datetime library could look like.]]></summary></entry><entry><title type="html">The curious case of Pydantic and the 1970s timestamps</title><link href="https://dev.arie.bovenberg.net/blog/pydantic-timestamps/" rel="alternate" type="text/html" title="The curious case of Pydantic and the 1970s timestamps" /><published>2024-01-08T00:00:00+00:00</published><updated>2024-01-08T00:00:00+00:00</updated><id>https://dev.arie.bovenberg.net/blog/pydantic-timestamps</id><content type="html" xml:base="https://dev.arie.bovenberg.net/blog/pydantic-timestamps/"><![CDATA[<p>When parsing Unix timestamps, Pydantic
guesses whether to interpret them in seconds or milliseconds.
While this is certainly convenient and works most of the time,
it can drastically (and silently) distort timestamps from a few decades ago.</p>

<p>Let’s imagine you’re dealing with some rocket launch data:</p>

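<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>

<span class="c1"># some timestamps in milliseconds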
</span><span class="n">marsrover</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2020</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">50</span><span class="p">).</span><span class="n">timestamp</span><span class="p">()</span> <span class="o">*</span> <span class="mi">1000</span>
<span class="n">pathfinder</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">1996</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">58</span><span class="p">).</span><span class="n">timestamp</span><span class="p">()</span> <span class="o">*</span> <span class="mi">1000</span>
<span class="n">apollo_13</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">1970</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">19</span><span class="p">,</span> <span class="mi">13</span><span class="p">).</span><span class="n">timestamp</span><span class="p">()</span> <span class="o">*</span> <span class="mi">1000</span>
</code></pre></div></div>

<p>When we use Pydantic to load this data, we notice something strange…</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span>

<span class="k">class</span> <span class="nc">Mission</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="n">launch</span><span class="p">:</span> <span class="n">datetime</span>

<span class="n">Mission</span><span class="p">(</span><span class="n">launch</span><span class="o">=</span><span class="n">marsrover</span><span class="p">)</span>   <span class="c1"># 2020-07-30 11:50
</span><span class="n">Mission</span><span class="p">(</span><span class="n">launch</span><span class="o">=</span><span class="n">pathfinder</span><span class="p">)</span>  <span class="c1"># 1996-12-04 06:58
</span><span class="n">Mission</span><span class="p">(</span><span class="n">launch</span><span class="o">=</span><span class="n">apollo_13</span><span class="p">)</span>   <span class="c1"># 2245-11-14 00:40 ???
</span></code></pre></div></div>

<p>While the first two timestamps are parsed correctly, the third is <em>wildly</em> different!
How did this happen?</p>

<p>Let’s take a closer look at the timestamp values:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="n">marsrover</span><span class="p">)</span>   <span class="c1"># 1596102600000
</span><span class="k">print</span><span class="p">(</span><span class="n">pathfinder</span><span class="p">)</span>  <span class="c1"># 849679080000
</span><span class="k">print</span><span class="p">(</span><span class="n">apollo_13</span><span class="p">)</span>   <span class="c1"># 8705580000
</span></code></pre></div></div>

<p>What jumps out is that the timestamp for Apollo 13 is a lot smaller.
This makes sense, since it’s much closer to the Unix epoch of 1970-01-01.</p>

<p>Pydantic draws a different conclusion:
it’s small because…<em>it probably represents seconds</em>, not milliseconds.
In other words: at some point in the seventies, it starts interpreting
millisecond timestamps as seconds instead.
At best, you’ll get a confusing error about out-of-bounds time data,
but at worst, your data is drastically and silently transformed.</p>
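
<p>The kind of guessing described above can be sketched in a few lines. This is not Pydantic’s actual code: the function name <code class="language-plaintext highlighter-rouge">guess_timestamp</code> and the cutoff value <code class="language-plaintext highlighter-rouge">WATERSHED</code> are assumptions chosen to illustrate a magnitude-based heuristic.</p>

```python
from datetime import datetime, timezone

# Assumed cutoff for illustration: anything larger is treated as milliseconds
WATERSHED = 2e10


def guess_timestamp(value: float) -> datetime:
    """Guess the unit of a Unix timestamp from its magnitude alone."""
    seconds = value / 1000 if abs(value) > WATERSHED else value
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Large enough to be recognized as milliseconds
pathfinder = guess_timestamp(849_679_080_000)
# Too small, so these milliseconds are silently read as seconds
apollo_13 = guess_timestamp(8_705_580_000)

print(pathfinder.year, apollo_13.year)
```

<p>With a heuristic like this, any millisecond timestamp below the cutoff is silently read as seconds, so it lands centuries in the future.</p>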

<p>You might think: <em>who cares? This is such a rare case — and it’s often helpful!</em></p>

<p>Yes, but:</p>

<ol>
  <li>It’s not uncommon for large companies to have data from the 70s,
and milliseconds are frequently used to store timestamps.</li>
  <li>Working with time is already complex and error-prone.
We should be critical of adding another edge case.</li>
</ol>

<p>Thankfully, the Pydantic team is quick to respond, and a
<a href="https://github.com/pydantic/pydantic/issues/7940">solution is in the works</a>.</p>

<h2 id="the-larger-lesson-here">The larger lesson here</h2>

<p>Libraries have become so good at ingesting our data <em>automagically</em><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>,
that we can forget to do proper software engineering.
With basic research, you can often find out what your data looks
like <em>before</em> you ingest it. And if you define an API, <em>you</em> dictate the data format!</p>

<p>Unless you’re dealing with unconstrained and erratic data,
you most likely already know whether the timestamps you’re reading are in seconds or milliseconds.
Don’t rely on a library to <em>guess</em> it correctly for you!
Yes, it may take slightly more time to code —
but your app will be safer, more predictable, and faster for it.</p>
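
<p>Being explicit doesn’t have to be much work. Below is a minimal sketch of a parser that only accepts milliseconds. The helper name <code class="language-plaintext highlighter-rouge">parse_unix_millis</code> is hypothetical, and returning UTC is a choice made for this example.</p>

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_unix_millis(ms: int) -> datetime:
    """Interpret the value as milliseconds since the Unix epoch. No guessing."""
    return EPOCH + timedelta(milliseconds=ms)


apollo_13 = parse_unix_millis(8_705_580_000)
print(apollo_13.isoformat())
```

<p>Because the unit is part of the function’s contract, a timestamp from 1970 is handled exactly like one from 2020.</p>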

<p>If you’re still unconvinced of the danger of automagical parsing,
look no further than Microsoft Excel.
Who among us hasn’t been tripped up by
its <a href="https://www.reddit.com/r/excel/comments/jfir5s/stop_automatically_reformatting_my_data_into/">notoriously overeager data inference</a>?
Let’s not repeat this mistake in Python.
The Zen of Python already warns us:</p>

<blockquote>
  <p>In the face of ambiguity, refuse the temptation to guess.</p>
</blockquote>

<p>Refuse the temptation of automagical parsing. <strong>Be explicit about data you ingest.</strong></p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I’m also looking at you, <code class="language-plaintext highlighter-rouge">pandas.read_csv()</code>… <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[When parsing Unix timestamps, Pydantic guesses whether to interpret them in seconds or milliseconds. While this is certainly convenient and works most of the time, it can drastically (and silently) distort timestamps from a few decades ago.]]></summary></entry><entry><title type="html">Finding broken slots in popular Python libraries (and so can you!)</title><link href="https://dev.arie.bovenberg.net/blog/finding-broken-slots-in-popular-python-libraries/" rel="alternate" type="text/html" title="Finding broken slots in popular Python libraries (and so can you!)" /><published>2022-01-03T00:00:00+00:00</published><updated>2022-01-03T00:00:00+00:00</updated><id>https://dev.arie.bovenberg.net/blog/finding-broken-slots-in-popular-python-libraries</id><content type="html" xml:base="https://dev.arie.bovenberg.net/blog/finding-broken-slots-in-popular-python-libraries/"><![CDATA[<p>Adding <code class="language-plaintext highlighter-rouge">__slots__</code> to a class in Python is a great way to reduce memory usage.
But to work properly, all base classes need to implement it.
This is easy to forget and there is nothing warning you that you messed up.
In popular projects, a few of these mistakes have gone undetected until now!</p>

<h2 id="show-me">Show me!</h2>

<p>I built a small tool, <a href="https://github.com/ariebovenberg/slotscheck/">slotscheck</a>,
that scans a package for these mistakes.
My hope is that libraries can get a free mini performance boost by fixing their slots.</p>

<p>Here’s how to use it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>slotscheck
<span class="nv">$ </span>pip <span class="nb">install </span>pandas  <span class="c"># or whatever library you'd like to check</span>
<span class="nv">$ </span>slotscheck pandas
ERROR: <span class="s1">'SingleArrayManager'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'Block'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'NumericBlock'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'DatetimeLikeBlock'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'ObjectBlock'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'CategoricalBlock'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'BaseArrayManager'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'BaseBlockManager'</span> has slots but inherits from non-slot class
ERROR: <span class="s1">'SingleBlockManager'</span> has slots but inherits from non-slot class
</code></pre></div></div>

<h2 id="fixing-slots-what-is-this-about">Fixing slots? What is this about?</h2>

<p>Declaring <code class="language-plaintext highlighter-rouge">__slots__</code> allows you to limit the attributes of a class’s instances to a fixed set.
With this information, Python can optimize the layout of the class<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.
However, to get the full advantages of slots,
all bases of a class also need to have it defined.</p>
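
<p>As a quick illustration of what “a fixed set” means, here is a small sketch. The <code class="language-plaintext highlighter-rouge">Point</code> class is a made-up example.</p>

```python
class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# Assigning an attribute that isn't listed in __slots__ fails
try:
    p.z = 3
    rejected = False
except AttributeError:
    rejected = True

# Instances no longer carry a per-instance __dict__
has_dict = hasattr(p, "__dict__")
```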

<p>Let’s look at slots without inheritance:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># checks complete size of objects in memory
</span><span class="kn">from</span> <span class="nn">pympler.asizeof</span> <span class="kn">import</span> <span class="n">asizeof</span>

<span class="k">class</span> <span class="nc">EmptyNoSlots</span><span class="p">:</span> <span class="k">pass</span>

<span class="k">class</span> <span class="nc">EmptyWithSlots</span><span class="p">:</span> <span class="n">__slots__</span> <span class="o">=</span> <span class="p">()</span>

<span class="k">class</span> <span class="nc">NoSlots</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>

<span class="k">class</span> <span class="nc">WithSlots</span><span class="p">:</span>
    <span class="n">__slots__</span> <span class="o">=</span> <span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>

<span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">EmptyNoSlots</span><span class="p">()))</span>    <span class="c1"># 152
</span><span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">EmptyWithSlots</span><span class="p">()))</span>  <span class="c1"># 32
</span><span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">NoSlots</span><span class="p">()))</span>         <span class="c1"># 328
</span><span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">WithSlots</span><span class="p">()))</span>       <span class="c1"># 112 !!!
</span></code></pre></div></div>

<p>Looks like quite the difference!
So what about inheritance?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">class</span> <span class="nc">WithSlotsAndProperBaseClass</span><span class="p">(</span><span class="n">EmptyWithSlots</span><span class="p">):</span>
    <span class="n">__slots__</span> <span class="o">=</span> <span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>

<span class="k">class</span> <span class="nc">NoSlotsAtAll</span><span class="p">(</span><span class="n">EmptyNoSlots</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>

<span class="k">class</span> <span class="nc">WithSlotsAndBadBaseClass</span><span class="p">(</span><span class="n">EmptyNoSlots</span><span class="p">):</span>
    <span class="n">__slots__</span> <span class="o">=</span> <span class="p">(</span><span class="s">"a"</span><span class="p">,</span> <span class="s">"b"</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">a</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">b</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span>

<span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">WithSlotsAndProperBaseClass</span><span class="p">()))</span>  <span class="c1"># 112
</span><span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">NoSlotsAtAll</span><span class="p">()))</span>                 <span class="c1"># 328
</span><span class="k">print</span><span class="p">(</span><span class="n">asizeof</span><span class="p">(</span><span class="n">WithSlotsAndBadBaseClass</span><span class="p">()))</span>     <span class="c1"># 232 !!!
</span></code></pre></div></div>

<p>As you can see, bad <code class="language-plaintext highlighter-rouge">__slots__</code> inheritance can really bloat your memory footprint!<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
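<p>A quick way to spot the problem without measuring memory is to check whether instances still carry a <code class="language-plaintext highlighter-rouge">__dict__</code>. The sketch below uses hypothetical class names that mirror the ones above, and a made-up helper <code class="language-plaintext highlighter-rouge">has_instance_dict</code>. It is only an illustration of the idea, not how <code class="language-plaintext highlighter-rouge">slotscheck</code> works internally.</p>

```python
class EmptyWithSlots:
    __slots__ = ()


class EmptyNoSlots:
    pass


class ProperSlots(EmptyWithSlots):
    __slots__ = ("a", "b")


class BrokenSlots(EmptyNoSlots):
    # The base class has no __slots__, so instances still get a __dict__.
    __slots__ = ("a", "b")


def has_instance_dict(cls):
    """Return True if instances of cls still carry a per-instance __dict__."""
    return hasattr(cls(), "__dict__")


print(has_instance_dict(ProperSlots))  # False
print(has_instance_dict(BrokenSlots))  # True
```

<p>If a slotted class reports <code class="language-plaintext highlighter-rouge">True</code> here, one of its base classes is probably missing <code class="language-plaintext highlighter-rouge">__slots__</code>.</p>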

<h2 id="what-did-i-find">What did I find?</h2>

<p>Having built <code class="language-plaintext highlighter-rouge">slotscheck</code>, I couldn’t wait to see what I could find.
I didn’t have to look far:
I found some missing slots <a href="https://bugs.python.org/issue46244">in the standard library</a>.</p>

<p>Also, a scan of the <a href="https://hugovk.github.io/top-pypi-packages/">5000 most popular PyPI packages</a>
showed that several of them seem to have classes with broken slots:</p>

<ul>
  <li>libcst (144) (<a href="https://github.com/Instagram/LibCST/issues/574">issue opened</a>)</li>
  <li>dpkt (129)</li>
  <li>scapy (85)</li>
  <li>exchangelib (39)</li>
  <li>sqlalchemy (12) (<a href="https://github.com/sqlalchemy/sqlalchemy/issues/7527">issue opened</a>)</li>
  <li>pandas (9) (<a href="https://github.com/pandas-dev/pandas/issues/45124">issue opened</a>)</li>
  <li>trafaret (9)</li>
  <li>acme (9)</li>
  <li>tensorflow_probability (6)</li>
  <li>torch (5)</li>
  <li>srsly (5)</li>
  <li>dynaconf (5)</li>
  <li>falcon (4)</li>
  <li>glom (4)</li>
  <li>aio_pika (4)</li>
  <li>returns (3) (<a href="https://github.com/dry-python/returns/pull/1147">solved</a>)</li>
  <li>parso (3)</li>
  <li>autobahn (3)</li>
  <li>rx (3)</li>
  <li>pika (3)</li>
  <li>aiormq (3)</li>
  <li>boxsdk (3)</li>
  <li>pipenv (3)</li>
  <li>oauthlib (2)</li>
  <li>xmlschema (2)</li>
  <li>aiohttp (2)</li>
  <li>dagster (2)</li>
  <li>peewee (2)</li>
  <li>xlrd (2)</li>
  <li>fiona (2)</li>
  <li>zeroconf (2)</li>
  <li>parsimonious (2)</li>
  <li>wand (2)</li>
  <li>werkzeug (1)</li>
  <li>josepy (1)</li>
  <li>llvmlite (1)</li>
  <li>sphinx (1)</li>
  <li>markupsafe (1)</li>
  <li>tensorflow (1)</li>
  <li>Pathy (1)</li>
  <li>sanic (1)</li>
  <li>mongoengine (1)</li>
  <li>requests_html (1)</li>
</ul>

<p>(Note that some may be false positives, or may have been fixed since this post was written.
The list is not exhaustive.)</p>

<p>I was actually surprised how many packages <em>didn’t</em> have issues.
This is mostly due to them not having many <code class="language-plaintext highlighter-rouge">__slots__</code> classes in the first place.
For example, <code class="language-plaintext highlighter-rouge">requests</code> has 43 classes, none with slots;
<code class="language-plaintext highlighter-rouge">azure</code> has 2391 classes, only 5 with slots.
I hope tools like <code class="language-plaintext highlighter-rouge">slotscheck</code> help more libraries adopt slots!</p>

<h2 id="what-now">What now?</h2>

<p>The first version of <code class="language-plaintext highlighter-rouge">slotscheck</code> is available on PyPI.
Include it in your CI pipeline to prevent slots mistakes from appearing again!
Or, contribute to the community by fixing slots in any of the packages listed above.
Check out the GitHub <a href="https://github.com/ariebovenberg/slotscheck/">repo</a>
to follow further development, and leave a ⭐️ if you like.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>If you’re interested in all the details,
  there is a great explanation <a href="https://stackoverflow.com/a/28059785">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Of course the exact numbers will change depending on how many attributes
  there are, and their type — among other things. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Adding __slots__ to a class in Python is a great way to reduce memory usage. But to work properly, all base classes need to implement it. This is easy to forget and there is nothing warning you that you messed up. In popular projects, a few of these mistakes have laid undetected — until now!]]></summary></entry><entry><title type="html">Is your Python code vulnerable to log injection?</title><link href="https://dev.arie.bovenberg.net/blog/is-your-python-code-vulnerable-to-log-injection/" rel="alternate" type="text/html" title="Is your Python code vulnerable to log injection?" /><published>2021-12-27T00:00:00+00:00</published><updated>2021-12-27T00:00:00+00:00</updated><id>https://dev.arie.bovenberg.net/blog/is-your-python-code-vulnerable-to-log-injection</id><content type="html" xml:base="https://dev.arie.bovenberg.net/blog/is-your-python-code-vulnerable-to-log-injection/"><![CDATA[<p>Following the news on log4j lately,
you may wonder if Python’s logging library is safe.
After all, there is a potential for injection attacks where string formatting
meets user input.
Thankfully, Python’s logging isn’t vulnerable to remote code execution.
Nonetheless it is still important to be careful with untrusted data.
This article will describe some common pitfalls,
and how the popular practice of logging f-strings could — in certain situations — leave you vulnerable to other types of attacks.</p>

<h2 id="the-basics-of-logging">The basics of logging</h2>

<p>Where does formatting meet user input exactly?
Let’s start with a basic logging setup.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>
</code></pre></div></div>

<p>Logging a message:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"hello world"</span><span class="p">)</span>
</code></pre></div></div>

<p>Which prints:</p>

<pre><code class="language-log">INFO:__main__:hello world
</code></pre>

<p>Let’s see the string formatting<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> in action:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">context</span> <span class="o">=</span> <span class="p">{</span><span class="s">'user'</span><span class="p">:</span> <span class="s">'bob'</span><span class="p">,</span> <span class="s">'msg'</span><span class="p">:</span> <span class="s">'hello everybody'</span><span class="p">}</span>
<span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"user '%(user)s' commented: '%(msg)s'."</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
</code></pre></div></div>

<p>This outputs the following:</p>

<pre><code class="language-log">INFO:__main__:user 'bob' commented: 'hello everybody'.
</code></pre>

<h2 id="simple-injection">Simple injection</h2>

<p>If you don’t sanitize your inputs,
you may be vulnerable to <a href="https://owasp.org/www-community/attacks/Log_Injection">log injection</a>.
Consider the following message:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"hello'.</span><span class="se">\n</span><span class="s">INFO:__main__:user 'alice' commented: 'I like pineapple pizza"</span>
</code></pre></div></div>

<p>If logged with the previous template, this results in:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INFO:__main__:user 'bob' commented: 'hello'.
INFO:__main__:user 'alice' commented: 'I like pineapple pizza'.
</code></pre></div></div>

<p>As you can see, an attacker can not only corrupt logs, but also incriminate others.</p>
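<p>To try this out, the sketch below sends log output to an in-memory buffer and counts the resulting lines. The logger name <code class="language-plaintext highlighter-rouge">injection_demo</code> and the explicit handler setup are assumptions made to keep the example self-contained.</p>

```python
import io
import logging

# Send log output to an in-memory buffer so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

demo_logger = logging.getLogger("injection_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

# Attacker-controlled comment containing a newline and a fake log prefix.
evil = "hello'.\nINFO:injection_demo:user 'alice' commented: 'I like pineapple pizza"
demo_logger.info("user '%(user)s' commented: '%(msg)s'.", {"user": "bob", "msg": evil})

lines = stream.getvalue().splitlines()
print(len(lines))  # 2: a single logging call produced two log lines
print(lines[1])
```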

<h3 id="mitigation">Mitigation</h3>

<p>We can mitigate this particular attack by <a href="https://github.com/darrenpmeyer/logging-formatter-anticrlf">escaping newline characters</a>.
But beware that there are plenty of other <a href="https://www.python.org/dev/peps/pep-0672/">evil Unicode control characters</a>
which can mess up your logs.
The safest solution is to simply not log untrusted text.
If you need to store it for an audit trail, use a database.
Alternatively, <a href="https://www.structlog.org/en/stable/">structured logging</a>
can prevent newline-based attacks.</p>
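<p>As a rough illustration of newline escaping, here is a minimal sketch of a formatter that replaces CR and LF in the final output. The class name <code class="language-plaintext highlighter-rouge">EscapeNewlinesFormatter</code> is made up for this example, and it is not the implementation of the linked library. It also does nothing about other control characters.</p>

```python
import logging


class EscapeNewlinesFormatter(logging.Formatter):
    """Hypothetical formatter that escapes CR and LF in the formatted output."""

    def format(self, record):
        formatted = super().format(record)
        return formatted.replace("\r", "\\r").replace("\n", "\\n")


formatter = EscapeNewlinesFormatter("%(levelname)s:%(name)s:%(message)s")
record = logging.LogRecord(
    "demo", logging.INFO, "example.py", 1,
    "user '%(user)s' commented: '%(msg)s'.",
    ({"user": "bob", "msg": "hello'.\nINFO:demo:forged"},),
    None,
)

# The forged entry now stays on the same line as the real one.
print(formatter.format(record))
```

<p>Because the escaping is applied to the whole formatted string, multi-line tracebacks would be flattened as well. Whether that trade-off is acceptable depends on how your logs are consumed.</p>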

<h2 id="double-formatting-trouble">Double formatting trouble</h2>

<p>There’s another interesting vulnerability which is particular to Python.
Because the old <code class="language-plaintext highlighter-rouge">%</code>-style formatting used by <code class="language-plaintext highlighter-rouge">logging</code> is often considered ugly,
many people prefer to use f-strings:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"user '</span><span class="si">{</span><span class="n">user</span><span class="si">}</span><span class="s">' commented: '</span><span class="si">{</span><span class="n">msg</span><span class="si">}</span><span class="s">'."</span><span class="p">)</span>
</code></pre></div></div>

<p>Admittedly this <em>looks</em> nicer, but it won’t stop <code class="language-plaintext highlighter-rouge">logging</code>
from trying to format the resulting string itself.
So if the <code class="language-plaintext highlighter-rouge">msg</code> is…</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"%(foo)s"</span>
</code></pre></div></div>

<p>…we are left with this after the f-string evaluates:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"user 'bob' commented: '%(foo)s'."</span><span class="p">)</span>
</code></pre></div></div>

<p>So what does <code class="language-plaintext highlighter-rouge">logging</code> do? Does it try to look up <code class="language-plaintext highlighter-rouge">foo</code> and crash?
Thankfully not. In the depths of the <code class="language-plaintext highlighter-rouge">logging</code> source code we find:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">args</span><span class="p">:</span>
    <span class="n">msg</span> <span class="o">=</span> <span class="n">msg</span> <span class="o">%</span> <span class="bp">self</span><span class="p">.</span><span class="n">args</span>
</code></pre></div></div>

<p>No arguments, no formatting. Makes sense.
But once there is an argument, things get interesting. Consider this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"user '%(user)s' commented: '</span><span class="si">{</span><span class="n">msg</span><span class="si">}</span><span class="s">'."</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span>
</code></pre></div></div>

<p>Of course, nobody is likely to mix formatting styles at first.
But it is plausible that one of the following happens:</p>
<ul>
  <li>Someone adds the <code class="language-plaintext highlighter-rouge">msg</code> parameter to an existing log statement in this way;</li>
  <li>Someone refactors to an f-string but forgets to remove the <code class="language-plaintext highlighter-rouge">context</code> argument;</li>
  <li>Log messages are passed through a user-defined function or logging filter
which adds a <code class="language-plaintext highlighter-rouge">context</code> argument.</li>
</ul>

<p>In this case we get an error like this:</p>

<pre><code class="language-log">--- Logging error ---
[...snip...]
KeyError: 'foo'
Call stack:
  File "example.py", line 29, in &lt;module&gt;
    logger.info(f"user '%(user)s' commented: '{msg}'.", context)
Message: "user '%(user)s' commented: '%(foo)s'."
Arguments: {'user': 'bob'}
</code></pre>
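<p>The same behaviour can be reproduced without any logging configuration by building a <code class="language-plaintext highlighter-rouge">LogRecord</code> directly. The helper <code class="language-plaintext highlighter-rouge">render</code> below is a hypothetical name for this sketch. Note that calling <code class="language-plaintext highlighter-rouge">getMessage()</code> yourself raises the exception, whereas a logging handler catches it and prints the report shown above.</p>

```python
import logging


def render(msg, *args):
    """Build a LogRecord like a logger call would, and render its message."""
    record = logging.LogRecord("demo", logging.INFO, "example.py", 1, msg, args, None)
    return record.getMessage()


user_msg = "%(foo)s"  # attacker-controlled input
template = f"user '%(user)s' commented: '{user_msg}'."

# Without arguments, the message is returned untouched.
print(render(template))

# With an argument, % formatting is applied to the already-formatted string.
try:
    render(template, {"user": "bob"})
except KeyError as exc:
    print("KeyError:", exc)
```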

<p>Annoying to have in the logs? Yes. Dangerous? Not…<em>yet</em>.
But by formatting an external string into our log message
(which in turn gets formatted again by <code class="language-plaintext highlighter-rouge">logging</code>)
we open the door to <a href="https://owasp.org/www-community/attacks/Format_string_attack">string formatting attacks</a>.
Thankfully, Python is a lot less vulnerable than C,
but there are still ways to abuse it.</p>

<h3 id="padding-a-ton">Padding a ton</h3>

<p>One such case is abusing <a href="https://pyformat.info/#string_pad_align">padding syntax</a>.
Consider this message:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"%(user)999999999s"</span>
</code></pre></div></div>

<p>This will pad the <code class="language-plaintext highlighter-rouge">user</code> with almost a gigabyte of whitespace.
Not only will this slow your log statement down to a crawl,
it could also clog up your logging infrastructure.</p>

<p>Why is this a problem? Attackers being able to crash your server is bad enough.
If they cripple your logging, you wouldn’t even know what hit you.</p>
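<p>The effect is easy to see with a much smaller width. In this sketch the width of 50 is an arbitrary stand-in so the example stays safe to run. An attacker would simply pick a far larger number.</p>

```python
# Attacker-chosen width; kept small here so the example is safe to run.
WIDTH = 50
evil = f"%(user){WIDTH}s"

padded = evil % {"user": "bob"}
print(len(padded))        # 50
print(repr(padded[-5:]))  # '  bob'
```

<p>The output size is controlled entirely by the number in the message, not by the length of the value being formatted.</p>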

<h3 id="leaky-logs">Leaky logs</h3>

<p>Another potential risk is leaking sensitive information.
In our example, if the <code class="language-plaintext highlighter-rouge">context</code> contains a <code class="language-plaintext highlighter-rouge">"secret"</code> key,
an attacker could leak it into the logs with the following message:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"%(secret)s"</span>
</code></pre></div></div>

<p>This is particularly dangerous when combined with the padding vulnerability,
as the attacker could use timing to sniff out which keys are present.</p>
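<p>A minimal sketch of the leak, using made-up values for <code class="language-plaintext highlighter-rouge">context</code> and building the <code class="language-plaintext highlighter-rouge">LogRecord</code> by hand instead of going through a configured logger:</p>

```python
import logging

context = {"user": "bob", "secret": "hunter2"}  # made-up values
user_msg = "%(secret)s"  # attacker-controlled input

record = logging.LogRecord(
    "demo", logging.INFO, "example.py", 1,
    f"user '%(user)s' commented: '{user_msg}'.",  # the risky pattern
    (context,),
    None,
)
print(record.getMessage())  # user 'bob' commented: 'hunter2'.
```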

<p>On the flip side, we can be thankful that Python’s <code class="language-plaintext highlighter-rouge">%</code>-style formatting syntax is so limited.
If logging used the newer brace-style formatting, an attacker wouldn’t even need
sensitive data to be present in <code class="language-plaintext highlighter-rouge">context</code>. They could reach it <a href="https://lucumr.pocoo.org/2016/12/29/careful-with-str-format/">using a message like</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"{0.__init__.__globals__[SECRET]}"</span>
</code></pre></div></div>
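<p>Here is a small sketch of that kind of attack with plain <code class="language-plaintext highlighter-rouge">str.format</code>. The <code class="language-plaintext highlighter-rouge">Comment</code> class and the <code class="language-plaintext highlighter-rouge">SECRET</code> value are invented for the example, and it assumes the code runs as an ordinary module so that <code class="language-plaintext highlighter-rouge">SECRET</code> is a module-level global. Index keys inside a format field are written without quotes.</p>

```python
SECRET = "hunter2"  # module-level value the attacker should never see


class Comment:
    def __init__(self, text):
        self.text = text


# Attacker-controlled template, formatted with an innocent-looking object.
evil = "{0.__init__.__globals__[SECRET]}"

# Attribute and index access in the format field walks from the object
# to the globals of the module that defines its __init__ method.
print(evil.format(Comment("hi")))
```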

<p>You might wonder what the big deal is if secrets land in the logs
— it’s not in the open, right?
The problem is that logs are usually a lot easier for an attacker to access
than a credential store.
Because of this, CWE ranks <em>“Insertion of Sensitive Information into Log File”</em> as number 39
on their list of <a href="https://cwe.mitre.org/top25/archive/2021/2021_cwe_top25.html">most dangerous software weaknesses</a>.</p>

<h3 id="mitigation-1">Mitigation</h3>

<p>To eliminate these risks, you should <em>always</em> let <code class="language-plaintext highlighter-rouge">logging</code> handle string formatting.
Don’t format log messages yourself with f-strings or otherwise <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.
Thankfully, there is a <a href="https://github.com/globality-corp/flake8-logging-format">flake8 plugin</a>
that can check this for you.
Also, once <a href="https://www.python.org/dev/peps/pep-0675">PEP 675</a> is implemented,
you could perhaps use a typechecker to check that only literal strings are passed to the logger.</p>
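<p>For comparison, here is a sketch of the recommended pattern: a constant template, with untrusted values passed only as arguments. The logger name <code class="language-plaintext highlighter-rouge">safe_demo</code> and the in-memory stream are assumptions to keep the example self-contained.</p>

```python
import io
import logging

stream = io.StringIO()
safe_logger = logging.getLogger("safe_demo")  # hypothetical logger name
safe_logger.setLevel(logging.INFO)
safe_logger.propagate = False
safe_logger.addHandler(logging.StreamHandler(stream))

user, msg = "bob", "%(secret)s %(user)999s"  # attacker-controlled msg

# The template is a constant; untrusted values are passed only as arguments.
safe_logger.info("user '%(user)s' commented: '%(msg)s'.", {"user": user, "msg": msg})

print(stream.getvalue())
```

<p>The malicious placeholders end up in the log verbatim, because the message is formatted only once. This does not help against newlines in <code class="language-plaintext highlighter-rouge">msg</code>, which still need separate handling.</p>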

<h2 id="recommendations">Recommendations</h2>

<ol>
  <li><em>Don’t log untrusted text</em>. Python’s logging library doesn’t protect you from
newlines or other Unicode control characters which allow
attackers to mess up — or even forge — logs.</li>
  <li><em>Don’t format logs yourself (with f-strings or otherwise)</em>.
In certain situations this could leave you vulnerable to
denial-of-service attacks or even sensitive data exposure.</li>
</ol>

<p>A full sample of the code used can be found
<a href="https://gist.github.com/ariebovenberg/dfd849ddc7a0dc7428a22b5b8a468134">here</a>,
so you can experiment for yourself.</p>

<p>You can discuss this post on <a href="https://www.reddit.com/r/Python/comments/rqaysb/is_your_python_code_vulnerable_to_log_injection/">reddit</a>
or <a href="https://news.ycombinator.com/item?id=29803530">Hacker News</a>.</p>

<h3 id="update-2022-01-04">Update 2022-01-04</h3>

<p>I’ve since created <a href="https://bugs.python.org/issue46200">an issue on the Python bug tracker</a>
to document security risks in the logging docs,
and perhaps even to create a more secure <code class="language-plaintext highlighter-rouge">logger</code> API.</p>

<h3 id="update-2022-02-19">Update 2022-02-19</h3>

<p>The log formatting DoS vulnerability has been
<a href="https://www.python.org/dev/peps/pep-0675/#logging-format-string-injection">included in PEP675</a>
as a potential use for the literal string type.
I’ve opened a discussion on
<a href="https://discuss.python.org/t/safer-logging-methods-for-f-strings-and-new-style-formatting/13802">discuss.python.org</a>
for improvements to the logging API.</p>

<h3 id="thanks">Thanks</h3>

<p>To <a href="https://daan.fyi">Daan Debie</a> for reviewing this post!</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This may be less well-known, but <code class="language-plaintext highlighter-rouge">logging</code> supports named formatting
  when a dictionary is passed as an argument.
  This is different from the <code class="language-plaintext highlighter-rouge">extra=</code> parameter!
  You can see for yourself how this works
  with the <code class="language-plaintext highlighter-rouge">%</code> operator:
  <code class="language-plaintext highlighter-rouge">"hello %(name)s" % {"name": "bob"}</code> and <code class="language-plaintext highlighter-rouge">"hello %s" % "bob"</code>.
  In this post I’ll be focussing on vulnerabilities when using
  the dictionary approach. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Using logging’s built-in formatting is also
  <a href="https://dev.to/izabelakowal/what-is-the-best-string-formatting-technique-for-logging-in-python-d1d">often better for performance, among other reasons</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Following the news on log4j lately, you may wonder if Python’s logging library is safe. After all, there is a potential for injection attacks where string formatting meets user input. Thankfully, Python’s logging isn’t vulnerable to remote code execution. Nonetheless it is still important to be careful with untrusted data. This article will describe some common pitfalls, and how the popular practice of logging f-strings could — in certain situations — leave you vulnerable to other types of attacks.]]></summary></entry></feed>