Category Archives: python


Monitoring Celery Tasks with Sentry

Sentry is a great tool for monitoring celery tasks, and alerting when they fail or don’t run on time. But it requires a bit of work to set up properly. Below is some sample code for setting up sentry monitoring of periodic tasks, followed by an explanation.

import math

import sentry_sdk
from celery import signals
from sentry_sdk import monitor
from sentry_sdk.integrations.celery import CeleryIntegration


@signals.beat_init.connect  # only needed if you use celery beat
@signals.celeryd_init.connect
def init_sentry(**kwargs):
    sentry_sdk.init(
        dsn=...,
        integrations=[
            CeleryIntegration(monitor_beat_tasks=False)
        ]
    )


@signals.worker_shutdown.connect
@signals.task_postrun.connect
def flush_sentry(**kwargs):
    sentry_sdk.flush(timeout=5)


def add_periodic_task(celery, schedule, task):
    # schedule is in seconds; sentry cron monitors work in whole minutes
    max_runtime = math.ceil(schedule * 4 / 60)
    monitor_config = {
        "recovery_threshold": 1,
        "failure_issue_threshold": 10,
        "checkin_margin": max_runtime,
        "max_runtime": max_runtime,
        "schedule": {
            "type": "interval",
            "value": math.ceil(schedule / 60.0),
            "unit": "minute"
        }
    }
    name = task.__name__
    # wrap the function in a sentry cron monitor, then register and schedule it
    task = monitor(monitor_slug=name, monitor_config=monitor_config)(task)
    celery.add_periodic_task(schedule, celery.task(task).s(), name=name)

Initialize Sentry

The init_sentry function must be called before any tasks start executing. The sentry docs for celery recommend using the celeryd_init signal. And if you use celery beat for periodic task execution, then you also need to initialize on the beat_init signal.

Monitoring Beat Tasks

In this example, I’m setting monitor_beat_tasks=False to show how you can do manual monitoring. Setting monitor_beat_tasks=True is much simpler and doesn’t require any of the code in add_periodic_task, but in my experience it’s not reliable with async celery functions: the automatic beat monitoring relies on celery signals that likely don’t fire correctly under async execution. Manual monitoring isn’t hard, though, with a function wrapper as shown above.

Adding a Periodic Task

The add_periodic_task function takes a Celery instance, a periodic interval in seconds, and a function to execute. This function can be normal or async. It then does the following:

  1. Calculates a max_runtime in minutes, so that sentry knows when a task has gone over time. This is also used for checkin_margin, giving the task plenty of buffer time before an issue is created. You should adjust these according to your needs.
  2. Creates a monitor_config for sentry, specifying the following:
    • schedule in minutes (rounded up, because sentry doesn’t handle schedules in seconds)
    • the number of failures allowed before creating an issue (I put 10, but you should adjust as needed)
    • how many successful checkins are required before the issue is marked as resolved (1 is the default, but adjust as needed)
  3. Wraps the function in the sentry monitor decorator, using the function’s name as the monitor_slug. With default beats monitoring, the slug is set to the full package.module.function path, which can be quite long and becomes hard to scan when you have many tasks.
  4. Schedules the task in celery.
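Here’s a minimal sketch of wiring this up; the app name, broker URL, and refresh_cache task are hypothetical placeholders:

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # hypothetical broker URL

def refresh_cache():
    print("refreshing cache")  # placeholder work; an async function works here too

# run refresh_cache every 10 minutes, with a matching sentry cron monitor
add_periodic_task(app, 600, refresh_cache)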

Sentry Flush

While this may not be strictly necessary, calling sentry_sdk.flush on the worker_shutdown and task_postrun signals ensures that any pending events are sent to sentry when a celery task completes or a worker shuts down.

Monitoring your crons

Once this is all set up and running, you should be able to go to Insights > Crons in your sentry web UI, and see all your celery tasks. Double check your monitor settings to make sure they’re correct, then sit back and relax, while sentry keeps track of how your tasks are running.


Async Python Functions with Celery

Celery is a great tool for scheduled function execution in Python. You can also use it for running functions in the background, asynchronously from your main process. However, it does not support Python asyncio. This is a big limitation, because async functions are usually much more I/O efficient, and many libraries provide great async support. And parallel data processing with asyncio.gather becomes impossible in celery without async support.

Celery Async Issues

Unfortunately, based on the current Open status of these issues, celery will not support async functions anytime soon.

But luckily there are two projects that provide async celery support.

AIO Celery

This project is an alternative independent asyncio implementation of Celery

aio-celery “does not depend on the celery codebase”. Instead, it provides a new implementation of the Celery Message Protocol that enables asyncio tasks and workers.

It is written completely from scratch as a thin wrapper around aio-pika (which is an asynchronous RabbitMQ python driver) and it has no other dependencies

It is actively developed, and seems like a great celery alternative. But there are some downsides:

  1. “Only RabbitMQ as a message broker” means you cannot use any other broker such as Redis
  2. “Only Redis as a result backend” means you can’t store results in any other database
  3. “Complete feature parity with upstream Celery project is not the goal”, so there may be features from celery you want that are not present in aio-celery

Celery AIO Pool

celery-aio-pool provides a custom worker pool implementation that works with celery 5.3+. Unlike aio-celery, you can keep using your existing celery implementation. All you have to do to get async task support in celery is:

  1. Start your celery worker with this environment variable: CELERY_CUSTOM_WORKER_POOL='celery_aio_pool.pool:AsyncIOPool'
  2. Run the celery worker process with --pool=custom

So your worker command will look like

CELERY_CUSTOM_WORKER_POOL='celery_aio_pool.pool:AsyncIOPool' celery worker --pool=custom

plus whatever other arguments or environment variables you need. Once you have this in place, you can start using async functions as celery tasks.
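As a rough sketch (the app name, broker URL, and task below are hypothetical), an async task then looks like any other celery task:

import asyncio
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # hypothetical broker

@app.task
async def check_url(url):
    # any async I/O can happen here once the worker runs with the AsyncIOPool
    await asyncio.sleep(1)  # stand-in for an async HTTP request
    return f"checked {url}"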

While celery-aio-pool is not as actively developed, it works, and has the following benefits:

  • Simple to install and configure with Celery >= 5.3
  • Works with any celery-supported message broker or result backend
  • Works with your existing celery setup without requiring any other changes

Python Async Gather in Batches

Python’s asyncio.gather function is great for I/O bound parallel processing. There’s a simple utility function I like to use that I call gather_in_batches:

import asyncio

async def gather_in_batches(tasks, batch_size=100, return_exceptions=False):
    # run the awaitables batch_size at a time, yielding results as each batch completes
    for i in range(0, len(tasks), batch_size):
        batch = tasks[i:i + batch_size]
        for result in await asyncio.gather(*batch, return_exceptions=return_exceptions):
            yield result

The way you use it is

  1. Generate a list of tasks
  2. Gather your results

Here’s some simple sample code to demonstrate:

tasks = [process_async(obj) for obj in objects]
return [result async for result in gather_in_batches(tasks)]

objects could be all sorts of things:

  • records from a database
  • urls to scrape
  • filenames to read

And process_async is an async function that does whatever processing you need on that object. Assuming the work is mostly I/O bound, this is a very simple and effective way to process data in parallel, without getting into threads, multiprocessing, greenlets, or any other concurrency mechanism.

You’ll need to experiment to figure out the optimal batch_size for your use case. And unless you don’t care about errors, you should set return_exceptions=True, then check isinstance(result, Exception) on each result to do proper error handling.
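For example, here’s a sketch of that error handling, reusing process_async and objects from the example above:

async def process_all(objects):
    tasks = [process_async(obj) for obj in objects]
    results = []
    async for result in gather_in_batches(tasks, batch_size=50, return_exceptions=True):
        if isinstance(result, Exception):
            print(f"task failed: {result!r}")  # replace with real logging/handling
        else:
            results.append(result)
    return results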

Python Vulnerability Checking Links

NLTK Trainer Updates

I’ve recently pushed some updates to nltk-trainer, so that it now supports Python 3.7 and NLTK 3.4.5 or greater. NLTK also just released version 3.5.

One significant change is that the default part of speech tagger is now the PerceptronTagger, originally written by Matthew Honnibal (author of spacy) before it was ported to NLTK.

Using word2vec with NLTK

word2vec is an algorithm for constructing vector representations of words, also known as word embeddings. The vector for each word is a semantic description of how that word is used in context, so two words that are used similarly in text will get similar vector representations. Once you map words into vector space, you can then use vector math to find words that have similar semantics.

gensim provides a nice Python implementation of Word2Vec that works perfectly with NLTK corpora. The model takes a list of sentences, and each sentence is expected to be a list of words. This is exactly what is returned by the sents() method of NLTK corpus readers. So let’s compare the semantics of a couple words in a few different NLTK corpora:

>>> from gensim.models import Word2Vec
>>> from nltk.corpus import brown, movie_reviews, treebank
>>> b = Word2Vec(brown.sents())
>>> mr = Word2Vec(movie_reviews.sents())
>>> t = Word2Vec(treebank.sents())

>>> b.most_similar('money', topn=5)
[('pay', 0.6832243204116821), ('ready', 0.6152011156082153), ('try', 0.5845392942428589), ('care', 0.5826011896133423), ('move', 0.5752171277999878)]
>>> mr.most_similar('money', topn=5)
[('unstoppable', 0.6900672316551208), ('pain', 0.6289106607437134), ('obtain', 0.62665855884552), ('jail', 0.6140228509902954), ('patients', 0.6089504957199097)]
>>> t.most_similar('money', topn=5)
[('short-term', 0.9459682106971741), ('-LCB-', 0.9449775218963623), ('rights', 0.9442864656448364), ('interested', 0.9430986642837524), ('national', 0.9396077990531921)]

>>> b.most_similar('great', topn=5)
[('new', 0.6999611854553223), ('experience', 0.6718623042106628), ('social', 0.6702290177345276), ('group', 0.6684836149215698), ('life', 0.6667487025260925)]
>>> mr.most_similar('great', topn=5)
[('wonderful', 0.7548679113388062), ('good', 0.6538234949111938), ('strong', 0.6523671746253967), ('phenomenal', 0.6296845078468323), ('fine', 0.5932096242904663)]
>>> t.most_similar('great', topn=5)
[('won', 0.9452997446060181), ('set', 0.9445616006851196), ('target', 0.9342271089553833), ('received', 0.9333916306495667), ('long', 0.9224691390991211)]

>>> b.most_similar('company', topn=5)
[('industry', 0.6164317727088928), ('technical', 0.6059585809707642), ('orthodontist', 0.5982754826545715), ('foamed', 0.5929019451141357), ('trail', 0.5763031840324402)]
>>> mr.most_similar('company', topn=5)
[('colony', 0.6689200401306152), ('temple', 0.6546304225921631), ('arrival', 0.6497283577919006), ('army', 0.6339291334152222), ('planet', 0.6184555292129517)]
>>> t.most_similar('company', topn=5)
[('panel', 0.7949466705322266), ('Herald', 0.7674347162246704), ('Analysts', 0.7463694214820862), ('amendment', 0.7282689809799194), ('Treasury', 0.719698429107666)]

I hope it’s pretty clear from the above examples that the semantic similarity of words can vary greatly depending on the textual context. In this case, we’re comparing a wide selection of text from the brown corpus with movie reviews and financial news from the treebank corpus.

Note that if you call most_similar() with a word that was not present in the sentences, you will get a KeyError exception. This can be a common occurrence with smaller corpora like treebank.
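If you’re not sure whether a word is in a model’s vocabulary, a simple try/except keeps things safe:

word = 'money'
try:
    print(t.most_similar(word, topn=5))
except KeyError:
    print('%s is not in this model\'s vocabulary' % word)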

For more details, see this tutorial on using Word2Vec. And Word Embeddings for Fashion is a great introduction to these concepts, although it uses a different word embedding algorithm called GloVe.

NLTK 3 Changes

NLTK 3 has quite a number of changes from NLTK 2, many of which will break old code. You can see a list of documented changes in the wiki page, Porting your code to NLTK 3.0. Below are the major changes I encountered while working on the NLTK 3 Cookbook.

Probability Classes

The FreqDist API has changed. It now inherits from collections.Counter, which implements most of the previous functionality, but in a different way. So instead of fd.inc(tag), you now need to do fd[tag] += 1.

fd.samples() doesn’t exist anymore. Instead, you can use fd.most_common(), which is a method of collections.Counter that returns a list that looks like [(word, count)].
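For example:

from nltk.probability import FreqDist

fd = FreqDist()
for tag in ['NN', 'VB', 'NN']:
    fd[tag] += 1          # replaces fd.inc(tag)

print(fd.most_common(1))  # [('NN', 2)] -- replaces fd.samples()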

ConditionalFreqDist now inherits from collections.defaultdict (one of my favorite Python data structures) which provides most of the previous functionality for free.

WordNet API

NLTK 3 has changed many wordnet Synset attributes to methods:

  • syn.definition -> syn.definition()
  • syn.examples -> syn.examples()
  • syn.lemmas -> syn.lemmas()
  • syn.name -> syn.name()
  • syn.pos -> syn.pos()

Same goes for the Lemma class. For example, lemma.antonyms() is now a method.
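A quick example of the method-based API:

from nltk.corpus import wordnet

syn = wordnet.synsets('cookbook')[0]
print(syn.name())        # 'cookbook.n.01' -- was syn.name in NLTK 2
print(syn.definition())  # was syn.definition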

Tagging

The batch_tag() method is now tag_sents(). The brill tagger API has changed significantly: brill.FastBrillTaggerTrainer is now brill_trainer.BrillTaggerTrainer, and the brill templates have been replaced by the tbl.feature.Feature interface with brill.Pos or brill.Word as implementations of the interface.
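Here’s a rough sketch of the new brill training API, using a unigram tagger trained on treebank as the initial tagger (the corpus slice and max_rules are arbitrary):

from nltk.corpus import treebank
from nltk.tag import UnigramTagger, brill, brill_trainer

train_sents = treebank.tagged_sents()[:3000]
initial_tagger = UnigramTagger(train_sents)

# brill24() builds Template objects from brill.Pos and brill.Word,
# replacing the old FastBrillTaggerTrainer template lists
trainer = brill_trainer.BrillTaggerTrainer(initial_tagger, brill.brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=100)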

Universal Tagset

Simplified tags have been replaced with the universal tagset. So tagged_corpus.tagged_sents(simplify_tags=True) becomes tagged_corpus.tagged_sents(tagset='universal'). In order to make this work, TaggedCorpusReader should be initialized with a known tagset, using the tagset kwarg, so that its tags can be mapped to the universal tagset. Known tagset mappings are stored in nltk_data/taggers/universal_tagset. The treebank tagset is called en-ptb (PennTreeBank) and the brown tagset is called en-brown. These files are simply two-column, tab-separated mappings of source tag to universal tag. The function nltk.tag.mapping.map_tag(source, target, source_tag) is used to perform the mapping.
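For example:

from nltk.corpus import treebank
from nltk.tag.mapping import map_tag

# corpus readers with a known tagset can map straight to universal tags
print(treebank.tagged_sents(tagset='universal')[0][:3])

# or map a single tag: en-ptb is the PennTreeBank mapping
print(map_tag('en-ptb', 'universal', 'NNP'))  # NOUN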

Chunking & Parse Trees

The main change in chunkers & parsers is replacing the term node with label. RegexpChunkParser now takes a chunk_label argument instead of chunk_node, while in the Tree class, the node attribute has been replaced with the label() method.
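For example:

from nltk.tree import Tree

t = Tree.fromstring('(S (NP the cat) (VP sat))')
print(t.label())  # 'S' -- replaces the old t.node attribute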

Classification

The SVM classifiers and scipy based MaxentClassifier algorithms (like CG) have been removed, but the addition of the SklearnClassifier more than makes up for it. This classifier allows you to make use of most scikit-learn classification algorithms, which are generally faster and more memory efficient than the other NLTK classifiers, while being at least as accurate.

Python 3

NLTK 3 is compatible with both Python 2 and Python 3. If you are new to Python 3, then you’ll likely be puzzled when you find that training the same model on the same data can result in slightly different accuracy metrics, because hash randomization is enabled by default in Python 3, which makes dictionary iteration order vary between runs. This is a deliberate decision to improve security, but you can control it with the PYTHONHASHSEED environment variable. Just run $ PYTHONHASHSEED=0 python to get consistent dictionary ordering & accuracy metrics.

Python 3 has also removed the separate unicode string object, so that now all strings are unicode. But some of the NLTK corpus functions return byte strings, which look like b"raw string", so you may need to convert these to normal strings before doing any further string processing.
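The conversion is just a decode call:

raw = b'raw string'          # what some corpus methods return
text = raw.decode('utf-8')   # now a normal unicode str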

Here\’s a few other Python 3 changes I ran into:

  • itertools.izip -> zip
  • dict.iteritems() doesn’t exist, use dict.items() instead
  • dict.keys() does not produce a list (it returns a view). If you want a list, use list(dict.keys())

Upgrading

Because of the above switching costs, upgrading right away may not be worth it. I’m still running plenty of NLTK 2 code, because it’s stable and works great. But if you’re starting a new project, or want to take advantage of new functionality, you should definitely start with NLTK 3.

Python 3 Text Processing with NLTK 3 Cookbook


After many weekend writing sessions, the 2nd edition of the NLTK Cookbook, updated for NLTK 3 and Python 3, is available at Amazon and Packt. Code for the book is on github at nltk3-cookbook. Here’s some details on the changes & updates in the 2nd edition:

First off, all the code in the book is for Python 3 and NLTK 3. Most of it should work for Python 2, but not all of it. And NLTK 3 has made many backwards incompatible changes since version 2.0.4. One of the nice things about Python 3 is that it’s unicode all the way. No more issues with ASCII versus unicode strings. However, you do have to deal with byte strings in a few cases. Another interesting change is that hash randomization is on by default, which means that if you don’t set the PYTHONHASHSEED environment variable, training accuracy can change slightly on each run, because the iteration order of dictionaries is no longer consistent by default.

In Chapter 1, Tokenizing Text and WordNet Basics, I added a recipe for training a sentence tokenizer using the PunktSentenceTokenizer. This is surprisingly easy, and you can find the code in chapter1.py.

Chapter 2, Replacing and Correcting Words, shows the additional languages supported by the SnowballStemmer. An unfortunate removal from this chapter is babelizer, which was a fun library to use, but is no longer supported by Yahoo.

NLTK 3 replaced simplify_tags with universal tagset mappings, so I updated Chapter 3, Creating Custom Corpora to show how to use these tagset mappings to get the universal tags.

In Chapter 4, Part-of-Speech Tagging, the last recipe shows how to use train_tagger.py from NLTK-Trainer to replicate most of the tagger training recipes detailed earlier in the chapter. NLTK-Trainer was largely inspired by my experience writing Python Text Processing with NLTK 2.0 Cookbook, after realizing that many aspects of training part-of-speech taggers could be encapsulated in a command line script.

Chapter 5, Extracting Chunks, adds examples for using train_chunker.py to train phrase chunkers.

Chapter 7, Text Classification, adds coverage of train_classifier.py, along with examples of using the SklearnClassifier, which provides access to many of the scikit-learn classification algorithms. The scikit-learn classifiers tend to be at least as accurate as NLTK’s classifiers, are often faster to train, and have much smaller memory & disk footprints. And since NLTK 3 removed support for scipy based MaxentClassifier algorithms and SVM classifiers, the choice of which classifiers to use has become very easy: when in doubt, choose SklearnClassifier (code examples can be found in chapter7.py).

There are a few library changes in Chapter 9, Parsing Specific Data Types:

  • timex and SimpleParse recipes have been removed due to lack of Python 3 compatibility
  • uses beautifulsoup4 with examples of UnicodeDammit
  • chardet was replaced with charade, which is compatible with both Python 2 & 3. But since publication, charade was merged back into chardet and is no longer maintained. I recommend installing chardet and replacing all instances of the charade module name with chardet.

So if you want to learn the latest & greatest NLTK 3, pick up your copy of Python 3 Text Processing with NLTK 3 Cookbook, and check out the code at nltk3-cookbook. If you like the book, please review it at Amazon or goodreads.

Instant PyGame Book Review

This is a review of the book Instant Pygame for Python Game Development How-to, by Ivan Idris. Packt asked me to review the book, and I agreed because like many developers, I’ve thought about writing my own game, and I’ve been curious about the capabilities of pygame. It’s a short book, ~120 pages, so this is a short review.

The book covers pygame basics like drawing images, rendering text, playing sounds, creating animations, and altering the mouse cursor. The author has helpfully posted some video demos of some of the exercises, which are linked from the book. I think this is a great way to show what’s possible, while also giving the reader a clear idea of what they are creating & what should happen. After the basic intro exercises, I think the best content was how to manipulate pixel arrays with numpy (the author has also written two books on numpy: NumPy Beginner’s Guide & NumPy Cookbook), how to create & use sprites, and how to make your own version of the game of life.

There were 3 chapters whose content puzzled me. When you’ve got such a short book on a specific topic, why bring up matplotlib, profiling, and debugging? These chapters seemed off-topic and just thrown in there randomly. The organization of the book could have been much better too, leading the reader from the basics all the way to a full-fledged game, with each chapter adding to the previous chapters. Instead, the chapters sometimes felt like unrelated low-level examples.

Overall, the book was a quick & easy read, that rapidly introduces you to basic pygame functionality, and leads you on to more complex activities. My main takeaway is that pygame provides an easy to use & low-level framework for building simple games, and can be used to create more complex games (but probably not FPS or similar graphically intensive games). The ideal games would probably be puzzle based and/or dialogue heavy, and only require simple interactions from the user. So if you’re interested in building such a game in Python, you should definitely get a copy of Instant Pygame for Python Game Development How-to.

Text Classification for Sentiment Analysis – NLTK + Scikit-Learn

Now that NLTK versions 2.0.1 & higher include the SklearnClassifier (contributed by Lars Buitinck), it’s much easier to make use of the excellent scikit-learn library of algorithms for text classification. But how well do they work?

Below is a table showing both the accuracy & F-measure of many of these algorithms using different feature extraction methods. Unlike the standard NLTK classifiers, sklearn classifiers are designed for handling numeric features. So there are 3 different values under the feats column for each algorithm. bow means bag-of-words feature extraction, where every word gets a 1 if present, or a 0 if not. int means word counts are used, so if a word occurs twice, it gets the number 2 as its feature value (whereas with bow it would still get a 1). And tfidf means the TfidfTransformer is used to produce a floating point number that measures the importance of a word, using the tf-idf algorithm.
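To make the bow and int representations concrete, here’s a toy illustration (tfidf would replace the counts with floating point tf-idf weights):

from collections import Counter

words = 'great movie great acting'.split()
bow_feats = {word: 1 for word in words}  # presence only
int_feats = dict(Counter(words))         # word counts
print(bow_feats)  # {'great': 1, 'movie': 1, 'acting': 1}
print(int_feats)  # {'great': 2, 'movie': 1, 'acting': 1}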

All numbers were determined using nltk-trainer, specifically, python train_classifier.py movie_reviews --no-pickle --classifier sklearn.ALGORITHM --fraction 0.75. For int features, the option --value-type int was used, and for tfidf features, the options --value-type float --tfidf were used. This was with NLTK 2.0.3 and sklearn 0.12.1.

algorithm           feats  accuracy  neg f-measure  pos f-measure
BernoulliNB         bow    82.2      82.7           81.6
BernoulliNB         int    82.2      82.7           81.6
BernoulliNB         tfidf  82.2      82.7           81.6
GaussianNB          bow    66.4      65.1           67.6
GaussianNB          int    66.8      66.3           67.3
MultinomialNB       bow    82.2      82.7           81.6
MultinomialNB       int    81.2      81.5           80.1
MultinomialNB       tfidf  81.6      83.0           80.0
LogisticRegression  bow    85.6      85.8           85.4
LogisticRegression  int    83.2      83.0           83.4
LogisticRegression  tfidf  82.0      81.5           82.5
SVC                 bow    67.6      75.3           52.9
SVC                 int    67.8      71.7           62.6
SVC                 tfidf  50.2      0.8            66.7
LinearSVC           bow    86.0      86.2           85.8
LinearSVC           int    81.8      81.7           81.9
LinearSVC           tfidf  85.8      85.5           86.0
NuSVC               bow    85.0      85.5           84.5
NuSVC               int    81.4      81.7           81.1
NuSVC               tfidf  50.2      0.8            66.7

As you can see, the best algorithms are BernoulliNB, MultinomialNB, LogisticRegression, LinearSVC, and NuSVC. Surprisingly, int and tfidf features either provide a very small performance increase, or significantly decrease performance. So let’s see if we can improve performance with the same techniques used in previous articles in this series, specifically bigrams and high information words.

Bigrams

Below is a table showing the accuracy of the top 5 algorithms using just unigrams (the default, a.k.a single words), and using unigrams + bigrams (pairs of words) with the option --ngrams 1 2.

algorithm           unigrams  bigrams
BernoulliNB         82.2      86.0
MultinomialNB       82.2      86.0
LogisticRegression  85.6      86.6
LinearSVC           86.0      86.4
NuSVC               85.0      85.2

Only BernoulliNB & MultinomialNB got a modest boost in accuracy, putting them on-par with the rest of the algorithms. But we can do better than this using feature scoring.

Feature Scoring

As I’ve shown previously, eliminating low information features can have significant positive effects. Below is a table showing the accuracy of each algorithm at different score levels, using the option --min_score SCORE (and keeping the --ngrams 1 2 option to get bigram features).

algorithm           score 1  score 2  score 3
BernoulliNB         62.8     97.2     95.8
MultinomialNB       62.8     97.2     95.8
LogisticRegression  90.4     91.6     91.4
LinearSVC           89.8     91.4     90.2
NuSVC               89.4     90.8     91.0

LogisticRegression, LinearSVC, and NuSVC all get a nice gain of ~4-5%, but the most interesting results are from the BernoulliNB & MultinomialNB algorithms, which drop down significantly at --min_score 1, but then skyrocket up to 97% with --min_score 2. The only explanation I can offer for this is that Naive Bayes classification, because it does not weight features, can be quite sensitive to changes in training data (see Bayesian Poisoning for an example).

Scikit-Learn

If you haven’t yet tried using scikit-learn for text classification, then I hope this article convinces you that it’s worth learning. NLTK’s SklearnClassifier makes the process much easier, since you don’t have to convert feature dictionaries to numpy arrays yourself, or keep track of all known features. The scikit-learn classifiers also tend to be more memory efficient than the standard NLTK classifiers, due to their use of sparse arrays.
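If you want to try it outside of nltk-trainer, a minimal sketch looks something like this, where train_feats is assumed to be a list of (feature dictionary, label) pairs as used by all NLTK classifiers:

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import LinearSVC

classifier = SklearnClassifier(LinearSVC())
classifier.train(train_feats)
print(classifier.classify({'great': 1, 'movie': 1}))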