This is the fifth post in a series on asynchronous programming. The whole series explores a single question: What is asynchrony? When I first started digging into this, I thought I had a solid grasp of it. Turns out, I didn't know the first thing about asynchrony. So, let's dive in together!
Whole series:
- Asynchronous Programming. Blocking I/O and non-blocking I/O
- Asynchronous Programming. Threads and processes
- Asynchronous Programming. Cooperative multitasking
- Asynchronous Programming. Await the Future
- Asynchronous Programming. Python3.5+
In this post, we will look at the Python stack through the concepts we've covered so far — synchrony and asynchrony: from threads and processes to the asyncio library.
Asynchronous programming in Python has become increasingly popular, and there are many libraries for it. One of them is asyncio, which was added to the standard library in Python 3.4; Python 3.5 then gave us the async/await syntax. Asyncio and ASGI are probably among the reasons asynchronous programming keeps gaining traction in Python. In this article, we will explain what asynchronous programming is and compare some of these approaches.
Quick Recap
Here's what we've established in previous posts:
- Sync: One thread is assigned to one task and is effectively locked until the task completes.
- Async: Blocking operations are removed from the main application thread.
- Concurrency: Tasks making progress together.
- Parallelism: Tasks making progress simultaneously.
- Parallelism implies concurrency, but concurrency does not always mean parallelism.
Python code can generally be executed in one of two ways: synchronously or asynchronously. You can think of them as distinct worlds with different libraries and function call styles but with a shared syntax and variables.
Synchronous World
In Python's synchronous world, which has existed for decades, functions are called directly, and code executes in a consistent, straightforward order, exactly as you wrote it.
In this post, we'll compare different implementations of the same code, using two functions. The first calculates the power of a number:
def cpu_bound(a, b):
return a ** b
We'll run this function N times:
def simple_1(N, a, b):
for i in range(N):
cpu_bound(a, b)
The second function downloads data from the internet:
def io_bound(urls):
data = []
for url in urls:
data.append(urlopen(url).read())
return data
def simple_2(N, urls):
for i in range(N):
io_bound(urls)
To compare how long each version runs, we will implement a simple decorator/context manager that measures time:
import time
from contextlib import ContextDecorator
class timeit(ContextDecorator):
def __enter__(self):
self.start_time = time.time()
def __exit__(self, *args):
elapsed = time.time() - self.start_time
print("{:.3} sec".format(elapsed))
Now let's put it all together and run it to see how long this code takes on my machine:
import time
import functools
from urllib.request import urlopen
from contextlib import ContextDecorator
class timeit(ContextDecorator):
def __enter__(self):
self.start_time = time.time()
def __exit__(self, *args):
elapsed = time.time() - self.start_time
print("{:.3} sec".format(elapsed))
def cpu_bound(a, b):
return a ** b
def io_bound(urls):
data = []
for url in urls:
data.append(urlopen(url).read())
return data
@timeit()
def simple_1(N, a, b):
for i in range(N):
cpu_bound(a, b)
@timeit()
def simple_2(N, urls):
for i in range(N):
io_bound(urls)
if __name__ == '__main__':
a = 7777
b = 200000
urls = [
"http://google.com",
"http://yahoo.com",
"http://linkedin.com",
"http://facebook.com"
]
simple_1(10, a, b)
simple_2(10, urls)
On my hardware, the cpu_bound function took 2.18 sec, while io_bound took 31.4 sec.
This baseline helps us understand how these functions perform in a synchronous model. Let's move on to our first approach: threads.
Threads

Now that we've established a synchronous baseline, let's explore how threads allow us to achieve some concurrency.
A thread is the smallest unit of processing that the OS can schedule. Threads in a process share access to global variables, meaning that if one thread modifies a global variable, the change is visible to all threads.
Simply put, a thread is a sequence of operations in a program that can execute independently of the main thread. Threads can execute concurrently (using time-division multiplexing) and, depending on the system, may run in parallel.
In Python, threads are implemented using OS threads across implementations (CPython, PyPy, Jython). Each Python thread is an OS-level thread (e.g., POSIX or Windows threads).
A single thread runs on a single CPU core at any given moment. It executes until its time slice expires (the exact length is OS-dependent, typically on the order of milliseconds to tens of milliseconds) or until it voluntarily yields control by making a blocking system call.
Let's rewrite our functions using threads:
from threading import Thread
@timeit()
def threaded(n, func, *args):
jobs = []
for _ in range(n):
thread = Thread(target=func, args=args)
jobs.append(thread)
# Start the threads
for job in jobs:
job.start()
# Wait for all threads to finish
for job in jobs:
job.join()
if __name__ == '__main__':
...
threaded(10, cpu_bound, a, b)
threaded(10, io_bound, urls)
On my hardware, cpu_bound took 2.47 sec, and io_bound took 7.9 sec.
The I/O-bound function is about 4 times faster (31.4 sec vs. 7.9 sec) since the downloads overlap. But why did the CPU-bound function slow down?
This slowdown is due to the Global Interpreter Lock (GIL), a unique feature of Python's CPython implementation, which we'll explore next.
Global Interpreter Lock (GIL)
The Global Interpreter Lock (GIL) is a lock that Python's interpreter requires for any thread to access Python's runtime. This lock is necessary not only for executing Python code but also for calls made through the Python C API. Essentially, the GIL is a global semaphore that ensures only one thread can execute within the interpreter at any given time, limiting true parallel execution of CPU-bound tasks.
Strictly speaking, the only action a thread may perform against the interpreter without holding the GIL is acquiring it. Violating this rule can result in an immediate crash (the best-case scenario) or a delayed one (worse, and much harder to debug).
How does it work?
When a thread starts, it first captures the GIL. After a certain amount of time, the process scheduler may decide that this thread has done enough and shift control to another thread. The new thread (#2) checks for the GIL, sees that it's already held, and goes to sleep, yielding back to the first thread.
However, threads cannot hold the GIL indefinitely. Before the new GIL implementation landed in Python 3.2, the interpreter attempted a thread switch every 100 bytecode instructions; in later versions, a thread can hold the GIL for up to 5 ms (the default switch interval). Additionally, the GIL is released whenever a thread makes a system call or performs disk or network I/O.
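You can inspect and tune this switch interval from Python itself; a minimal sketch using the standard `sys` module:

```python
import sys

# CPython exposes the GIL "switch interval" in seconds (default is 5 ms)
old = sys.getswitchinterval()
print(f"switch interval: {old * 1000:.1f} ms")

# It is tunable: a longer interval means fewer forced GIL handoffs,
# which can help CPU-bound threads at the cost of responsiveness
sys.setswitchinterval(0.01)
print(f"now: {sys.getswitchinterval() * 1000:.1f} ms")
sys.setswitchinterval(old)  # restore the previous value
```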
This is why I/O-bound tasks can benefit from threading in Python: while waiting on I/O operations, threads release the GIL, allowing other threads to execute. However, the GIL prevents true parallelism for CPU-bound tasks, which don't involve I/O. For such tasks, using threads won't improve performance and may even slow things down, as threads share CPU time and incur the overhead of frequent context switching. To take full advantage of multi-core processors, CPU-bound tasks are best executed in separate processes.
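The standard `concurrent.futures` module offers a higher-level alternative to managing `Thread` objects by hand; a minimal sketch, with `time.sleep` standing in for a real blocking I/O call (sleeping, like I/O, releases the GIL):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    # stand-in for something like urlopen(url).read()
    time.sleep(delay)
    return delay

start = time.time()
# The pool starts, schedules, and joins the worker threads for us
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, [0.2, 0.2, 0.2, 0.2]))
elapsed = time.time() - start

print(results)
print(f"{elapsed:.1f} sec")  # roughly 0.2, not 0.8: the four waits overlapped
```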
Workarounds and Alternatives
- Multiprocessing: For CPU-bound tasks, the `multiprocessing` library is a popular alternative to threading, as each process has its own Python interpreter and memory space, bypassing the GIL.
- C Extensions: Some libraries, like `NumPy` and `Pandas`, release the GIL during heavy computations. Cython or `numba` also allow developers to write C-like code that can release the GIL for intensive loops.
- Alternative Python Implementations: Jython (Python for the JVM) and IronPython (Python for .NET) don't use the GIL, as they rely on the JVM and CLR threading models, respectively.
While these workarounds help, they introduce additional complexity, and not all libraries or programs benefit from them.
Why does the GIL exist?
The GIL helps protect Python's internal data structures from concurrent access issues. For instance, it prevents race conditions when modifying an object's reference counter, providing a simple, reliable solution for Python's memory management model. It also makes it easier to integrate non-thread-safe C libraries, enabling Python to offer fast modules and bindings for a wide range of functionalities.
Future of the GIL
The Python community continues to debate removing or optimizing the GIL to allow true multi-threaded concurrency for CPU-bound tasks. Projects like PEP 703 aim to make the GIL optional, paving the way for better concurrency and parallelism in future Python versions. However, due to compatibility and performance concerns, removing the GIL entirely remains challenging.
In conclusion, threads are effective for parallelizing I/O-bound tasks, but CPU-bound tasks are best executed in separate processes to avoid the GIL's limitations.
Processes
From the OS point of view, a process is a data structure that contains a memory space and some other resources, such as files opened by it.
A process usually starts with a single thread, called the main thread, but a program can create any number of additional threads. A new thread does not get resources of its own; instead, it uses the memory and resources of the process that spawned it. Because of this, threads can start and stop quickly.
Multi-tasking is handled by the scheduler, part of the operating system kernel, which in turn loads execution threads into the CPU.
Like threads, processes always execute concurrently, and depending on the hardware they can also run in parallel.
Process implementation:
from multiprocessing import Process
@timeit()
def multiprocessed(n, func, *args):
processes = []
for _ in range(n):
p = Process(target=func, args=args)
processes.append(p)
# start the processes
for p in processes:
p.start()
# ensure all processes have finished execution
for p in processes:
p.join()
if __name__ == '__main__':
...
multiprocessed(10, cpu_bound, a, b)
multiprocessed(10, io_bound, urls)
On my hardware, cpu_bound took 1.12 sec, io_bound — 7.22 sec.
Thus, the CPU-bound operation is faster than the threaded implementation because we are no longer fighting over the GIL, while the I/O-bound function took a little longer because processes are heavier than threads.
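The same fan-out can be written more compactly with `concurrent.futures`; a minimal sketch (the `power` helper here is illustrative, not taken from the series code):

```python
from concurrent.futures import ProcessPoolExecutor

def power(args):
    a, b = args
    return a ** b

if __name__ == "__main__":
    # Each worker is a separate interpreter with its own GIL,
    # so CPU-bound tasks can genuinely run in parallel
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(power, [(2, 10), (3, 5), (10, 3)]))
    print(results)  # [1024, 243, 1000]
```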
Asynchronous world
In the asynchronous world, things work differently from synchronous programming. At the core is the event loop — a small piece of code that allows you to run multiple coroutines concurrently. Coroutines work synchronously until they reach a point where they need to wait for a result. At that point, they pause and transfer control back to the event loop, allowing other tasks to proceed.
The event loop cycles through tasks, checking each for readiness and using a selector to monitor I/O operations (like reading data from a socket or file). When a task is ready, it runs in the main event loop thread until it reaches another await, at which point control returns to the loop to check for other tasks.
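The selector mentioned above is available directly in the standard `selectors` module; a toy sketch, with a local socket pair standing in for a network connection:

```python
import selectors
import socket

# A selector watches file objects and reports which ones are ready
sel = selectors.DefaultSelector()
reader, writer = socket.socketpair()
reader.setblocking(False)
sel.register(reader, selectors.EVENT_READ)

writer.send(b"ping")

# select() blocks for at most `timeout` seconds, then returns only
# the objects that are ready, so nothing waits on a silent socket
events = sel.select(timeout=1)
for key, _ in events:
    data = key.fileobj.recv(4)
    print(data)  # b'ping'

sel.unregister(reader)
reader.close()
writer.close()
```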
Green threads

Green threads represent a basic level of asynchronous programming. A green thread behaves like a standard thread, except that the application, rather than the operating system, manages context switching between threads.
Gevent is a well-known Python library that implements green threads. Gevent uses green threads to avoid blocking I/O. By "monkey-patching" the standard Python libraries with gevent.monkey, Gevent can modify their behavior to enable non-blocking I/O.
Other libraries that support green threads include eventlet and greenlet (the low-level library Gevent itself builds on).
Let's see how performance will change if we start using green threads with the gevent library on Python:
import gevent.monkey
# Patch standard libraries to enable non-blocking I/O
gevent.monkey.patch_all()
@timeit()
def green_threaded(n, func, *args):
jobs = []
for _ in range(n):
jobs.append(gevent.spawn(func, *args))
# ensure all jobs have finished execution
gevent.wait(jobs)
if __name__ == '__main__':
...
green_threaded(10, cpu_bound, a, b)
green_threaded(10, io_bound, urls)
Results are: cpu_bound — 2.23 sec, io_bound — 6.85 sec.
As expected, performance decreases for the CPU-bound function but improves for the I/O-bound function.
Asyncio
The asyncio package, as described in the Python documentation, is a library for writing concurrent code. asyncio is neither multithreading nor multiprocessing; instead, it implements an asynchronous event loop at a low level, intended for high-level frameworks like Twisted, Gevent, or Tornado to build upon. However, asyncio itself functions as a full-featured async framework.
Fundamentally, asyncio is single-threaded and single-process: it uses cooperative multitasking built around an event loop (selector-based on Unix, with a Proactor loop on Windows). asyncio allows us to write asynchronous programs within a single thread by leveraging an event loop to schedule tasks and multiplex I/O operations.
How It Works
Synchronous and asynchronous functions work differently, so they cannot be mixed directly. For example, using time.sleep(10) instead of await asyncio.sleep(10) in a coroutine will prevent control from returning to the event loop, blocking the entire process.
Think of your codebase as divided into synchronous and asynchronous segments. Everything inside an async def block is asynchronous; everything else (including the main script body or regular class methods) is synchronous.

Here's the basic concept: the event loop orchestrates asynchronous function execution. Declaring an async function using async def changes its behavior — the function call immediately returns a coroutine object rather than blocking, signaling that it can run and return a result once awaited.
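A quick way to see this two-step behavior, using `asyncio.run()` (available since Python 3.7) rather than the explicit loop API shown later in this post:

```python
import asyncio

async def add(a, b):
    await asyncio.sleep(0)  # yield to the event loop once
    return a + b

# Calling the async function does not run it; it only builds a coroutine
coro = add(2, 3)
print(type(coro))  # <class 'coroutine'>

# Only driving the coroutine through the event loop produces the value
result = asyncio.run(coro)
print(result)  # 5
```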
We pass these coroutines to the event loop, which returns a "future" object—a promise to provide a result when ready. We can periodically check this promise, and when it holds a value, we use it in subsequent operations.
When await is called, the function pauses, freeing the event loop to manage other tasks. Once the awaited operation finishes, the event loop resumes the function, passing back any result. Here's a simple example:
import asyncio
async def say(what, when):
await asyncio.sleep(when)
print(what)
loop = asyncio.get_event_loop()
loop.run_until_complete(say('hello world', 1))
loop.close()
In this example, the say() function pauses and hands control back to the event loop, which sees that sleep needs to run and calls it. The call is then suspended with a marker to resume it in one second. Once it resumes, say prints its message and finishes; the event loop picks up the returned value, and the main thread is ready to run again.
The event loop only starts executing the scheduled coroutines when we call loop.run_until_complete(), which blocks your program until the future you gave it as an argument is completed.
This is how asynchronous code can have so many things happening at once: anything that blocks calls await and is put onto the event loop's list of paused coroutines so that something else can run. Everything that's paused has an associated callback that will wake it up again: some callbacks are time-based, some are event-based, and most of them, like the example above, are waiting for a result from another coroutine.
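This interleaving is easy to observe: two coroutines that each sleep for one second complete in about one second overall, because their waits overlap on the loop (a minimal sketch, using `asyncio.run()` for brevity):

```python
import asyncio
import time

async def say(what, when):
    await asyncio.sleep(when)
    return what

async def main():
    # Both sleeps are registered with the loop and wait concurrently
    return await asyncio.gather(say("first", 1), say("second", 1))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start

print(results)               # ['first', 'second']
print(f"{elapsed:.1f} sec")  # roughly 1.0, not 2.0
```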
Handling Blocking Functions
Returning to our cpu_bound and io_bound blocking functions, we cannot directly mix synchronous and asynchronous code, so we need to make these functions asynchronous. While many operations have asynchronous equivalents, some code remains blocking and must run without blocking the event loop.
For blocking tasks, asyncio provides run_in_executor(), which runs specified code in a thread pool, keeping the main event loop unblocked. We'll use this for our CPU-bound function and rewrite the I/O-bound function with await at necessary points.
import asyncio
import aiohttp
async def async_func(N, func, *args):
coros = [func(*args) for _ in range(N)]
# run awaitable objects concurrently
await asyncio.gather(*coros)
async def a_cpu_bound(a, b):
result = await loop.run_in_executor(None, cpu_bound, a, b)
return result
async def a_io_bound(urls):
# a coroutine that downloads a single url
async def download_coroutine(session, url):
async with session.get(url, timeout=10) as response:
await response.text()
# set an aiohttp session and download all our urls
async with aiohttp.ClientSession(loop=loop) as session:
for url in urls:
await download_coroutine(session, url)
if __name__ == '__main__':
...
loop = asyncio.get_event_loop()
with timeit():
loop.run_until_complete(async_func(10, a_cpu_bound, a, b))
with timeit():
loop.run_until_complete(async_func(10, a_io_bound, urls))
Results: cpu_bound — 2.23 sec, io_bound — 4.37 sec.
The CPU-bound function remains slower, while the I/O-bound function runs nearly twice as fast compared to the threaded example.
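Note that run_in_executor() with None as its first argument uses a thread pool, so our CPU-bound coroutine still contends with the GIL. Passing a `ProcessPoolExecutor` moves the work into separate processes instead; a minimal sketch (using `asyncio.get_running_loop()`, available since Python 3.7):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(a, b):
    return a ** b

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # The blocking call runs in a worker process, so it neither
        # blocks the event loop nor contends for this process's GIL
        return await loop.run_in_executor(pool, cpu_bound, 2, 10)

if __name__ == "__main__":
    print(asyncio.run(main()))  # 1024
```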
Optimizing Asyncio with uvloop
For network-intensive applications, you can further optimize asyncio by using uvloop, a high-performance drop-in replacement for asyncio's event loop that provides substantial speed improvements. To use uvloop, simply replace the event loop policy:
import uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
Making the Right Choice
- CPU-bound -> multiprocessing
- I/O-bound, fast I/O, limited number of connections -> multithreading
- I/O-bound, slow I/O, many connections -> asyncio
Conclusion
Threads may be easier if you have a typical web application that is independent of external services and serves a relatively limited number of users, for whom response time is predictably short.
Asynchrony is appropriate if the application spends most of its time reading and writing data rather than processing it: for example, you have many slow requests (web sockets, long polling), or slow external synchronous backends whose requests finish at unpredictable times.
Synchronous programming is the easiest way to start developing applications: commands execute sequentially. Even with conditional branching, loops, and function calls, we reason about the code as performing one step at a time.
An asynchronous application behaves differently. It still works one step at a time, but the system does not wait for the current step to complete before moving forward. As a result, we arrive at event-driven programming.
asyncio is a great library, and it's great that it was included in the standard Python library. An ecosystem (aiohttp, asyncpg, etc.) has already grown around asyncio for application development. There are other implementations of the event loop (uvloop, dabeaz/curio, python-trio/trio), and I think that in time asyncio will evolve into an even more powerful tool than it is today.
Additional materials
- Asynchronous Programming by Kirill Bobrov
- Grokking Concurrency by Kirill Bobrov
- PEP 342
- PEP 492
- Check out Guido's old presentation of the asyncio approach.
- An interesting talk by Robert Smallshire, "Get to grips with asyncio in Python 3"
- David Beazley's talk about getting rid of asyncio
- uvloop - faster event-loop for asyncio
- Some thoughts on asynchronous API design in a post-async/await world