Category Archives: programming


Monitoring Celery Tasks with Sentry

Sentry is a great tool for monitoring celery tasks and alerting you when they fail or don’t run on time. But it requires a bit of work to set up properly. Below is some sample code for setting up sentry monitoring of periodic tasks, followed by an explanation.

import math

import sentry_sdk
from celery import signals
from sentry_sdk import monitor
from sentry_sdk.integrations.celery import CeleryIntegration


@signals.beat_init.connect  # if you use beats
@signals.celeryd_init.connect
def init_sentry(**kwargs):
    sentry_sdk.init(
        dsn=...,
        integrations=[
            CeleryIntegration(monitor_beat_tasks=False)
        ]
    )


@signals.worker_shutdown.connect
@signals.task_postrun.connect
def flush_sentry(**kwargs):
    sentry_sdk.flush(timeout=5)


def add_periodic_task(celery, schedule, task):
    max_runtime = math.ceil(schedule * 4 / 60)
    monitor_config = {
        "recovery_threshold": 1,
        "failure_issue_threshold": 10,
        "checkin_margin": max_runtime,
        "max_runtime": max_runtime,
        "schedule": {
            "type": "interval",
            "value": math.ceil(schedule / 60.0),
            "unit": "minute"
        }
    }
    name = task.__name__
    task = monitor(monitor_slug=name, monitor_config=monitor_config)(task)
    celery.add_periodic_task(schedule, celery.task(task).s(), name=name)

Initialize Sentry

The init_sentry function must be called before any tasks start executing. The sentry docs for celery recommend using the celeryd_init signal. And if you use celery beats for periodic task execution, then you also need to initialize on the beat_init signal.

Monitoring Beats Tasks

In this example, I’m setting monitor_beat_tasks=False to show how you can do manual monitoring. monitor_beat_tasks=True is much simpler, and doesn’t require any code like in add_periodic_task. But in my experience, it’s not reliable when using async celery functions. The automatic beats monitoring uses some celery signals that likely don’t get executed correctly under async conditions. But manual monitoring isn’t that hard with a function wrapper, as shown above.
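For comparison, if all your periodic tasks are synchronous and the automatic behavior is good enough, the only change is the integration flag. A minimal sketch (the dsn is whatever your project already uses):

import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration

# Automatic beats monitoring: sentry creates and checks in on a cron
# monitor for every beat task, with no add_periodic_task wrapper needed.
sentry_sdk.init(
    dsn=...,
    integrations=[
        CeleryIntegration(monitor_beat_tasks=True)
    ]
)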

Adding a Periodic Task

The add_periodic_task function takes a Celery instance, a periodic interval in seconds, and a function to execute, which can be normal or async. It then does the following (a usage sketch follows the list):

  1. Calculates a max_runtime in minutes, so that sentry knows when a task has gone over time. This is also used for checkin_margin, giving the task plenty of buffer time before an issue is created. You should adjust these according to your needs.
  2. Creates a monitor_config for sentry, specifying the following:
    • schedule in minutes (rounded up, because sentry doesn’t handle schedules in seconds)
    • the number of failures allowed before creating an issue (I put 10, but you should adjust as needed)
    • how many successful checkins are required before the issue is marked as resolved (1 is the default, but adjust as needed)
  3. Wraps the function in the sentry monitor decorator, using the function’s name as the monitor_slug. With default beats monitoring, the slug is set to the full package.module.function path, which can be quite long and becomes hard to scan when you have many tasks.
  4. Schedules the task in celery.
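
Here’s a hypothetical example of wiring up a task with this helper (the task name, interval, and broker are placeholders, not from my actual project):

from celery import Celery

app = Celery("myapp", broker=...)  # placeholder app setup

async def cleanup_stale_sessions():
    ...  # whatever periodic work you need; a normal def works too

# Run every 15 minutes, with a sentry monitor named "cleanup_stale_sessions".
add_periodic_task(app, schedule=15 * 60, task=cleanup_stale_sessions)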

Sentry Flush

While this may not be strictly necessary, calling sentry_sdk.flush on the worker_shutdown and task_postrun signals ensures that events are sent to sentry when a celery task completes.

Monitoring your crons

Once this is all set up and running, you should be able to go to Insights > Crons in your sentry web UI, and see all your celery tasks. Double check your monitor settings to make sure they’re correct, then sit back and relax, while sentry keeps track of how your tasks are running.


Async Python Functions with Celery

Celery is a great tool for scheduled function execution in python. You can also use it for running functions in the background asynchronously from your main process. However, it does not support python asyncio. This is a big limitation, because async functions are usually much more I/O efficient, and there are many libraries that provide great async support. And parallel data processing with asyncio.gather becomes impossible in celery without async support.

Celery Async Issues

Unfortunately, based on the current Open status of these issues, celery will not support async functions anytime soon.

But luckily there are two projects that provide async celery support.

AIO Celery

This project is an alternative independent asyncio implementation of Celery

aio-celery “does not depend on the celery codebase”. Instead, it provides a new implementation of the Celery Message Protocol that enables asyncio tasks and workers.

It is written completely from scratch as a thin wrapper around aio-pika (which is an asynchronous RabbitMQ python driver) and it has no other dependencies

It is actively developed, and seems like a great celery alternative. But there are some downsides:

  1. “Only RabbitMQ as a message broker” means you cannot use any other broker such as Redis
  2. “Only Redis as a result backend” means you can’t store results in any other database
  3. “Complete feature parity with upstream Celery project is not the goal”, so there may be features from celery you want that are not present in aio-celery

Celery AIO Pool

celery-aio-pool provides a custom worker pool implementation that works with celery 5.3+. Unlike aio-celery, you can keep using your existing celery implementation. All you have to do to get async task support in celery is:

  1. Start your celery worker with this environment variable: CELERY_CUSTOM_WORKER_POOL='celery_aio_pool.pool:AsyncIOPool'
  2. Run the celery worker process with --pool=custom

So your worker command will look like

CELERY_CUSTOM_WORKER_POOL='celery_aio_pool.pool:AsyncIOPool' celery worker --pool=custom

plus whatever other arguments or environment variables you need. Once you have this in place, you can start using async functions as celery tasks.
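
The task definitions themselves should then look like ordinary celery tasks, just declared as async functions. A rough sketch under that assumption (the task body is made up for illustration):

import asyncio
from celery import Celery

app = Celery("myapp", broker=...)  # placeholder app setup

@app.task
async def process_record(record_id):
    # Stand-in for real async I/O (database calls, HTTP requests, etc.)
    await asyncio.sleep(1)
    return record_id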

While celery-aio-pool is not as actively developed, it works, and has the following benefits:

  • Simple to install and configure with Celery >= 5.3
  • Works with any celery-supported message broker or result backend
  • Works with your existing celery setup without requiring any other changes

Python Async Gather in Batches

Python’s asyncio.gather function is great for I/O bound parallel processing. There’s a simple utility function I like to use that I call gather_in_batches:

import asyncio

async def gather_in_batches(tasks, batch_size=100, return_exceptions=False):
    for i in range(0, len(tasks), batch_size):
        batch = tasks[i:i+batch_size]
        for result in await asyncio.gather(*batch, return_exceptions=return_exceptions):
            yield result

The way you use it is

  1. Generate a list of tasks
  2. Gather your results

Here’s some simple sample code to demonstrate:

tasks = [process_async(obj) for obj in objects]
return [result async for result in gather_in_batches(tasks)]

objects could be all sorts of things:

  • records from a database
  • urls to scrape
  • filenames to read

And process_async is an async function that does whatever processing you need on that object. Assuming the work is mostly I/O bound, this is a very simple and effective way to process data in parallel, without getting into threads, multi-processing, greenlets, or any other method.

You’ll need to experiment to figure out what the optimal batch_size is for your use case. And unless you don’t care about errors, you should set return_exceptions=True, then check if isinstance(result, Exception) to do proper error handling.
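
Here’s roughly what that error handling looks like, reusing the batching helper from above (the print call is just a placeholder for whatever you do with failures):

tasks = [process_async(obj) for obj in objects]
results = []
async for result in gather_in_batches(tasks, batch_size=50, return_exceptions=True):
    if isinstance(result, Exception):
        print(f"task failed: {result!r}")  # log, retry, or collect failures here
    else:
        results.append(result)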

Salt Recipe for Creating a MySQL User with Grants for Scalyr

Salt is a great tool for managing the configuration of many servers. And when you have many servers, you should also be monitoring them with a tool like Dataset (aka Scalyr). The scalyr agent can monitor many things, but in this example, I’m going to show you how to create a MySQL user for the scalyr agent with just the right amount of permissions.

Salt Formula

{% set scalyr_user = salt['pillar.get']('scalyr:mysql:user', 'scalyr-agent-monitor') %}
mysql_scalyr_user:
  mysql_user.present:
    # - host: localhost
    - name: {{ scalyr_user }}
    - password: {{ pillar['scalyr']['mysql']['password'] }}
  mysql_grants.present:
    - grant: 'process, replication client'
    - database: '*.*'
    # - host: localhost
    - user: {{ scalyr_user }}
    - require:
      - mysql_user: {{ scalyr_user }}

Salt uses yaml with jinja templating to define states. This template does the following:

  1. Creates a MySQL user for scalyr
  2. Grants permissions for that scalyr user to access MySQL process & replication metrics on all databases

You can view the full range of options for the mysql_user and mysql_grants states if you need to customize it more.

Pillar Configuration

The above salt recipe requires a corresponding pillar configuration that looks like this:

scalyr:
  mysql:
    user: scalyr-agent-monitor
    password: RANDOM_PASSWORD

Scalyr Agent Configuration

Then in your scalyr agent JSON, you can use a template like this:

{
  logs: [{
    path: "/var/log/mysql/error.log",
    attributes: {parser: "mysql_error"}
  }, {
    path: "/var/log/mysql/slow.log",
    attributes: {parser: "mysql_slow"}
  }],
  monitors: [{
    module: "scalyr_agent.builtin_monitors.mysql_monitor",
    database_username: "{{ salt['pillar.get']('scalyr:mysql:user') }}",
    database_password: "{{ salt['pillar.get']('scalyr:mysql:password') }}",
    database_socket: "/var/run/mysqld/mysqld.sock"
  }]
}

How to use it

If you’re already familiar with salt, then hopefully this all makes sense. Let’s say you named your state mysql_user in a scalyr state directory. Then you could apply it like this:

salt SERVER_ID state.sls scalyr.mysql_user

And now you have a MySQL user just for scalyr. This same idea can likely be applied to any other MySQL monitoring program.

If you’d like some help automating your server configuration and monitoring using tools and formulas like this, contact us at Streamhacker Technologies.

NLP for Log Analysis – Tokenization

This is part 1 of a series of posts based on a presentation I gave at the Silicon Valley Cyber Security Meetup on behalf of my company, Insight Engines. Some of the ideas are speculative and I do not know if they are used in practice. If you have any experience applying these techniques on logs, please share in the comments below.

Natural language processing is the art of applying software algorithms to human language. However, the techniques operate on text, and there’s a lot of text that is not natural language. These techniques have been applied to code authorship classification, so why not apply them to log analysis?

Tokenization

To process any kind of text, you need to tokenize it. For natural language, this means splitting the text into sentences and words. But for logs, the tokens are different. Some tokens may be words, but other tokens could be symbols, timestamps, numbers, and more.

Another difference is punctuation. For many human languages, punctuation is mostly regular and predictable, although social media & short text writing has been challenging this assumption.

Logs come in a whole variety of formats. If you only have 1 type of log, then you may be able to tokenize it with a regular expression, like apache access logs for example. But when you have multiple types of logs, regular expressions can become overwhelming or even unusable. Many logs are written by humans, and there’s few rules or conventions when it comes to formatting or use of punctuation. A generic tokenizer could be a useful first pass at parsing arbitrary logs.

Whitespace Tokenizer

Tokenizing on whitespace is an obvious thing to try first. Here’s an example log and the result when run through my NLTK tokenization demo.

Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644

NLTK Whitespace Tokenizer log example

As you can see, it does get some tokens, but there’s punctuation in weird places that would have to be cleaned up later.

Wordpunct Tokenizer

My preferred NLTK tokenizer is the WordPunctTokenizer, since it’s fast and the behavior is predictable: split on whitespace and punctuation. But this is a terrible choice for log tokenization.

NLTK WordPunct Tokenizer log example

Logs are filled with punctuation, so this produces far too many tokens.

Treebank Tokenizer

Out of curiosity, I tried the TreebankWordTokenizer on the log example. This tokenizer uses a statistical model trained on news text, and it does surprisingly well.

NLTK Treebank Word Tokenizer log example

There’s no weird punctuation in the tokens: some punctuation is split out on its own, and the rest stays within tokens. It all looks pretty logical & useful. This was an unexpected result, and indicates that perhaps logs are often closer to natural language text than one might think.
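
If you want to reproduce this comparison yourself instead of using the demo, something like this should do it (using the example log line from above):

from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

log = ("Sep 19 19:18:40 acmepayroll syslog: 04/30/10 12:18:51 39480627 wksh: "
       "HANDLING TELNET CALL (User: root, Branch: ABCDE, Client: 10101) pid=9644")

for tokenizer in (WhitespaceTokenizer(), WordPunctTokenizer(), TreebankWordTokenizer()):
    print(tokenizer.__class__.__name__)
    print(tokenizer.tokenize(log))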

Next Steps

After tokenization, you’ll want to do something with the log tokens. Maybe extract certain features, and then cluster or classify the log. These topics will be covered in a future post.

Recent Advances in Deep Learning for Natural Language Processing

This article was originally published at The New Stack under the title “How Deep Learning Supercharges Natural Language Processing”.

Voice search, intelligent assistants, and chatbots are becoming common features of modern technology. Users and customers are demanding a better, more human experience when interacting with computers. According to Tableau’s business trends report, IDC predicts that by 2019, intelligent assistants will become commonly accessible to enterprise workers, while Gartner predicts that by 2020, 50 percent of analytics queries will involve some form of natural language processing. Chatbots, intelligent assistants, natural language queries, and voice-enabled applications all involve various forms of natural language processing. To fully realize these new user experiences, we will need to build upon the latest methods, some of which I will cover here.

Let’s start with the basics: what is natural language processing? Natural language processing (NLP) is a collection of techniques for helping machines understand human language. For example, one of the essential techniques is tokenization: breaking up text into “tokens,” such as words. Given individual words in sequence, you can start to apply reason to them, and do things like sentiment analysis to determine if a piece of text is positive or negative. But even a task as simple as word identification can be quite tricky. Is the word what’s really one word or two (what + is, or what + was)? What about languages that use characters to represent multi-word concepts, like Kanji?
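
To see this in code, NLTK’s default word tokenizer splits contractions into separate tokens (a quick sketch; it assumes you’ve downloaded the punkt tokenizer data):

from nltk import word_tokenize

# "What's" comes back as two tokens, roughly ["What", "'s", ...]
print(word_tokenize("What's the capital of France?"))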

Deep learning is an advanced type of machine learning using neural networks. It became popular due to the success of the techniques at solving problems such as image classification (labeling an image based on visual content) and speech recognition (converting sounds into text). Many people thought that deep learning techniques, when applied to natural language, would quickly achieve similar levels of performance. But because of all the idiosyncrasies of natural language, the field has not seen the same kind of breakthrough success with deep learning as other fields, like image processing. However, that appears to be changing. In the past few years, researchers have been applying newer deep learning methods to natural language processing, and I will share some of these recent successes.

Deep learning — through recent improvements to word embeddings, a focus on attention, mobile enablement, and its appearance in the home — is starting to capture natural language processing like it previously captured image processing. In this article, I will cover some recent deep learning-based NLP research successes that have made an impact on the field. Because of these improvements, we will see simpler and more natural user experiences, better software performance, and more powerful home and mobile applications.

Word Embeddings

Words are essential to every natural language processing system. Traditional NLP looks at words as strings, but deep learning techniques can only process numeric vectors. Word embeddings were invented as a way to transform words into vectors, enabling new kinds of mathematical feature analysis. But the vector representation of words is only as good as the text it was trained on.

The most common word embeddings are trained on Wikipedia, but Wikipedia text may not be representative of whatever text you’re processing. It’s generally written as well-structured factual statements, which is nothing like text found on Twitter, and both of these are different from restaurant reviews. So vectors trained on Wikipedia might be mathematically misleading if you use them to analyze a different style of text. Text from the Common Crawl provides a more diverse set of text for training a word embedding model. The FastText library provides some great pre-trained English word vectors, along with tools for training your own. Training your own vectors is essential if you’re processing any language other than English.
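
As a rough sketch of what working with those pre-trained vectors looks like, here’s one way to load a downloaded FastText .vec file with gensim (the filename and the gensim usage are my assumptions, not from the FastText docs):

from gensim.models import KeyedVectors

# Load a pre-trained English .vec file downloaded from fasttext.cc
# (the filename is illustrative; use whichever vectors you downloaded).
vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")

# Nearest neighbors are a quick sanity check on the embedding space.
print(vectors.most_similar("coffee", topn=5))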

Character level embeddings have also shown surprising results. This technique tries to learn vectors for individual characters, where words would be represented as a composition of the individual character vectors. In an effort to learn how to predict the next character in reviews, researchers discovered a sentiment neuron, which they could control to produce positive or negative review output. Using the sentiment neuron, they were able to beat the previous top accuracy score on the sentiment treebank. This is quite an impressive result for something discovered as a side effect of other research.

CNNs, RNNs, and Attention

Moving beyond vectors, deep learning requires training neural networks for various tasks. Vectors are the input and output, in between are layers of nodes connected together in a network. The nodes represent functions on the input data, with each function taking the input from the previous layer and producing output for the next layer. The structure of the network and how the nodes are connected very much determines the learning capabilities and performance.

In general, the deeper and more complicated a network, the longer it takes to train. When using large datasets, many networks can only be effectively trained using clusters of graphics processors (GPUs), because GPUs are optimized for the necessary floating point math. This puts some types of deep learning outside the reach of anyone not at large companies or institutions that can afford the expensive GPU clusters necessary for deep learning on big data.

Standard neural networks are feedforward networks, where each node in a layer is forward connected to every node in the next layer. A Recurrent Neural Network (RNN) is a network where the nodes in each layer also connect back to the previous layer. This creates a kind of memory that can be great for learning from sequences, such as words in a sentence.

A Convolutional Neural Network (CNN) is a type of feedforward network, but with more layers, and where the forward connections have been manipulated, or convoluted, to achieve certain properties. CNNs tend to be good at extracting position invariant features, meaning they do not care so much about sequence ordering. Because of this, CNNs can be trained in a more parallel manner, leading to faster training and optimization compared to RNNs.

While CNNs may win in raw speed, both types of neural networks tend to have comparable performance characteristics.  In fact, RNNs have a slight edge when it comes to sequence oriented tasks like Part-of-Speech tagging, where you are trying to identify the part of speech (such as “noun” or “verb”) for each word in a sentence. For a detailed performance comparison of CNNs and RNNs applied to NLP see: Comparative Study of CNN and RNN for Natural Language Processing.

The most successful RNN models are the LSTM (Long Short-Term Memory) and GRU (gated recurrent unit). These use gating mechanisms, which act as a kind of short-term memory for the network. However, a newer research paper implies that attention may be all you need. By doing away with recurrence and convolution, and keeping only attention mechanisms, these models can be trained in parallel like a CNN, but even faster, and achieve comparable or better performance than RNNs on some sequence learning tasks, such as machine translation.

Reducing the training cost while maintaining comparable performance means that smaller companies and individuals can throw more data at their deep learning models, and potentially compete more effectively with larger companies and institutions.

Software 2.0

One of the nice properties of neural network models is that the core algorithms and math are mostly the same. Once you have the infrastructure, model definition, and training algorithms all setup, these models are very reusable. “Software 2.0” is the idea that significant components of an application or system can be replaced by neural network models. Instead of writing code, developers:

  1. Collect training data
  2. Clean and label the data
  3. Train a model
  4. Integrate the model

While the most interesting parts are often steps three and four, most of the work happens in the data preparation steps one and two. Collecting and curating good, useful, clean data can be a significant amount of work, which is why methods like corpus bootstrapping are important for getting to good data faster. In the long run, it is often easier to make better data than it is to design better algorithms.

The past few years have demonstrated that neural networks can achieve much better performance than many alternatives, sometimes even in areas not traditionally touched by machine learning. One of the most recent and interesting advances is in learning data indexing structures. B-tree indexes are a commonly used data structure that provides an efficient way of finding data, assuming the tree is structured well. However, these newly learned indexes significantly outperformed the traditional B-tree indexes in both speed and memory usage. Such low-level data structure performance improvements could have far-reaching impacts if they can be integrated into standard development practices.

As research progresses, and the necessary infrastructure becomes cheaper and more available, deep learning models are likely to be used in more and more parts of the software stack, including mobile applications.

Mobile Machine Learning

Most deep learning requires clusters of expensive GPUs and lots of RAM. This level of compute power is only accessible to those who can afford it, usually in the cloud. But consumers are increasingly using mobile devices, and much of the world does not have reliable and affordable full-time wireless connectivity. Getting machine learning into mobile devices will enable more developers to create all sorts of new applications.

  • Apple’s CoreML framework enables a number of NLP capabilities on iOS devices, such as language identification and named entity recognition.
  • Baidu developed a CNN library for mobile deep learning that works on both iOS and Android.
  • Qualcomm created a Neural Processing Engine for its mobile processors, enabling popular deep learning frameworks to operate on mobile devices.

Expect a lot more of this in the near future, as mobile devices continue to become more powerful and ubiquitous. Marc Andreessen famously said that “software is eating the world,” and now machine learning appears to be eating software. Not only is it in our pocket, it is also in our homes.

Deep Learning in the Home

Alexa and other voice assistants became mainstream in 2017, bringing NLP into millions of homes. Mobile users are already familiar with Siri and Google Assistant, but the popularity of Alexa and Google Home shows how many people have become comfortable having conversations with voice-activated dialogue systems. How much these systems rely on deep learning is somewhat unknown, but it is fairly certain that significant parts of their dialogue systems use deep learning models for core functions such as speech to text, part of speech tagging, natural language generation, and text to speech.

As research advances and these companies collect increasing amounts of data from their users, deep learning capabilities will improve as well, and implementations of “software 2.0” will become pervasive. While a few large companies are creating powerful data moats, there is always room on the edges for highly specialized, domain-specific applications of natural languages, such as cybersecurity, IT operations, and data analytics.

Deep learning has become a core component of modern natural language processing systems.

However, many traditional natural language processing techniques are still quite effective and useful, especially in areas that lack the huge amounts of training data necessary for deep learning. I will cover these traditional statistical techniques in an upcoming article.

Monetizing the Text-Processing API with Mashape

This is a short story about the text-processing.com API, and how it became a profitable side-project, thanks to Mashape.

Text-Processing API

When I first created text-processing.com, in the summer of 2010, my initial intention was to provide an online demo of NLTK’s capabilities. I trained a bunch of models on various NLTK corpora using nltk-trainer, then started making some simple Django forms to display the results. But as I was doing this, I realized I could fairly easily create an API based on these models. Instead of rendering HTML, I could just return the results as JSON.

I wasn’t sure if anyone would actually use the API, but I knew the best way to find out was to just put it out there. So I did, initially making it completely open, with a rate limit of 1000 calls per day per IP address. I figured at the very least, I might get some PHP or Ruby users that wanted the power of NLTK without having to interface with Python. Within a month, people were regularly exceeding that limit, and I quietly increased it to 5000 calls/day, while I started searching for the simplest way to monetize the API. I didn’t like what I found.

Monetizing APIs

Before Mashape, your options for monetizing APIs were either building a custom solution for authentication, billing, and tracking, or paying thousands of dollars a month for an “enterprise” solution from Mashery or Apigee. While I have no doubt Mashery & Apigee provide quality services, they are not in the price range for most developers. And building a custom solution is far more work than I wanted to put into it. Even now, when companies like Stripe exist to make billing easier, you’d still have to do authentication & call tracking. But Stripe didn’t exist 2 years ago, and the best billing option I could find was Paypal, whose API documentation is great at inducing headaches. Lucky for me, Mashape was just opening up for beta testing, and appeared to be in the process of solving all of my problems 🙂

Mashape

Mashape was just what I needed to monetize the text-processing API, and it’s improved tremendously since I started using it. They handle all the necessary details, like integrated billing, plus a lot more, such as usage charts, latency & uptime measurements, and automatic client library generation. This last is one of my favorite features, because the client libraries are generated using your API documentation, which provides a great incentive to accurately document the ins & outs of your API. Once you’ve documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective C.

Here’s a little history for the curious: Mashape originally did authentication and tracking by exchanging tokens thru an API call. So you had to write some code to call their token API on every one of your API calls, then check the results to see if the call was valid, or if the caller had reached their limit. They didn’t have all of the nice charts they have now, and their billing solution was the CEO manually handling Paypal payments. But none of that mattered, because it worked, and from conversations with them, I knew they were focused on more important things: building up their infrastructure and positioning themselves as a kind of app-store for APIs.

Mashape has been out of beta for a while now, with automated billing, and a custom proxy server for authenticating, routing, and tracking all API calls. They’re releasing new features on a regular basis, and sponsoring events like MusicHackDay. I’m very impressed with everything they’re doing, and on top of that, they’re good hard-working people. I’ve been over to their “hacker house” in San Francisco a few times, and they’re very friendly and accommodating. And if you’re ever in the neighborhood, I’m sure they’d be open to a visit.

Profit

Once I had integrated Mashape, which was maybe 20 lines of code, the money started rolling in :). Just kidding, but using the typical definition of profit, when income exceeds costs, the text-processing API was profitable within a few months, and has remained so ever since. My only monetary cost is a single Linode server, so as long as people keep paying for the API, text-processing.com will remain online. And while it has a very nice profit margin, total monthly income barely approaches the cost of living in San Francisco. But what really matters to me is that text-processing.com has become a self-sustaining excuse for me to experiment with natural language processing techniques & data sets, test my models against the market, and provide developers with a simple way to integrate NLP into their own projects.

So if you’ve got an idea for an API, especially if it’s something you could charge money for, I encourage you to build it and put it up on Mashape. All you need is a working API, a unique image & name, and a Paypal account for receiving payments. Like other app stores, Mashape takes a 20% cut of all revenue, but I think it’s well worth it compared to the cost of replicating everything they provide. And unlike some app stores, you’re not locked in. Many of the APIs on Mashape also provide alternative usage options (including text-processing), but they’re on Mashape because of the increased exposure, distribution, and additional features, like client library generation. SaaS APIs are becoming a significant part of modern computing infrastructure, and Mashape provides a great platform for getting started.

Testing Command Line Scripts with Roundup

As nltk-trainer becomes more stable, I realized that I needed some way to test the command line scripts. My previous ad-hoc method of “test whatever script options I can remember” was becoming unwieldy and unreliable. But how do you make repeatable tests for a command line script? It doesn’t really fit into the standard unit testing model.

Enter roundup by Blake Mizerany. (NOTE: do not try to do apt-get install roundup. You will get an issue tracking system, not a script testing tool).

Roundup provides a great way to prevent shell bugs by creating simple test functions within a shell script. Here’s the first dozen lines of train_classifier.sh, which you can probably guess tests train_classifier.py:

[sourcecode language="bash"]
#!/usr/bin/env roundup

describe "train_classifier.py"

it_displays_usage_when_no_arguments() {
./train_classifier.py 2>&1 | grep -q "usage: train_classifier.py"
}

it_cannot_find_foo() {
last_line=$(./train_classifier.py foo 2>&1 | tail -n 1)
test "$last_line" "=" "ValueError: cannot find corpus path for foo"
}
[/sourcecode]

describe is like the name of a module or test case, and all test functions begin with it_. Within the test functions, you use standard shell commands that must exit successfully (like grep -q or the test command). You can also match multiple lines of output, as in:

[sourcecode language="bash"]
it_trains_movie_reviews_paras() {
test "$(./train_classifier.py movie_reviews --no-pickle --no-eval --fraction 0.5 --instances paras)" "=" "loading movie_reviews
2 labels: ['neg', 'pos']
1000 training feats, 1000 testing feats
training NaiveBayes classifier"
}
[/sourcecode]

Once you’ve got all your test functions defined, make sure your test script is executable and roundup is installed, then run your test script. You’ll get nice output that looks like:

nltk-trainer$ tests/train_classifier.sh
train_classifier.py
  it_displays_usage_when_no_arguments:             [PASS]
  it_cannot_find_foo:                              [PASS]
  it_cannot_import_reader:                         [PASS]
  it_trains_movie_reviews_paras:                   [PASS]
  it_trains_corpora_movie_reviews_paras:           [PASS]
  it_cross_fold_validates:                         [PASS]
  it_trains_movie_reviews_sents:                   [PASS]
  it_trains_movie_reviews_maxent:                  [PASS]
  it_shows_most_informative:                       [PASS]
=========================================================
Tests:    9 | Passed:   9 | Failed:   0

So far, roundup has been a perfect tool for testing all the nltk-trainer scripts, and the only downside is the one-time manual installation. I highly recommend it for anyone writing custom commands and scripts, no matter what language you use to write them.

Hierarchical Classification

Hierarchical classification is an obscure but simple concept. The idea is that you arrange two or more classifiers in a hierarchy such that the classifiers lower in the hierarchy are only used if a higher classifier returns an appropriate result.

For example, the text-processing.com sentiment analysis demo uses hierarchical classification by combining a subjectivity classifier and a polarity classifier. The subjectivity classifier is first, and determines whether the text is objective or subjective. If the text is objective, then a label of neutral is returned, and the polarity classifier is not used. However, if the text is subjective (or polar),  then the polarity classifier is used to determine if the text is positive or negative.

Hierarchical Sentiment Classification Model

Hierarchical classification is a useful way to combine multiple binary classifiers, if you have a hierarchy of labels that can be modeled as a binary tree. In this model, each branch of the tree either continues on to a new pair of branches, or stops, and at each branching you use a classifier to determine which branch to take.
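
A minimal sketch of that two-level arrangement in Python (the classifier objects and their classify calls are placeholders for whatever models you’ve trained):

def hierarchical_sentiment(text, subjectivity_classifier, polarity_classifier):
    # First level: objective text short-circuits to neutral.
    if subjectivity_classifier.classify(text) == "objective":
        return "neutral"
    # Second level: only subjective (polar) text reaches the polarity classifier.
    return polarity_classifier.classify(text)  # "positive" or "negative"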

The Perils of Web Crawling

I’ve written a lot of web crawlers, and these are some of the problems I’ve encountered in the process. Unfortunately, most of these problems have no solution, other than quitting in disgust. Instead of looking for a way out when you’re already deep in it, take this article as a warning of what to expect before you start down the perilous path of making a web spider. Now to be clear, I’m referring to deep web crawlers coded for a specific site, not a generic link harvester or image downloader. Generic crawlers probably have their own set of problems on top of everything mentioned below.

Most web pages suck

I have yet to encounter a website whose HTML didn’t generate at least 3 WTFs.

The only valid measure of code quality: WTFs/minute

Seriously, I now have a deep respect for web browsers that are able to render the mess of HTML that exists out there. If only browsers were better at memory management, I might actually like them again (see unified theory of browser suckage).

Inconsistent information & navigation

2 pages on the same site, with the same url pattern, and the same kind of information, may still not be the same. One will have additional information that you didn’t expect, breaking all your assumptions about what HTML is where. Another may not have essential information you expected all similar pages to have, again throwing your HTML parsing assumptions out the window.

Making things worse, the same kind of page on different sections of a site may be completely different. This generally occurs on larger sites where different sections have different teams managing them. The worst is when the differences are non-obvious, so that you think they’re the same, until you realize hours later that the underlying HTML structure is completely different, and CSS had been fooling you, making it look the same when it’s not.

90% of writing a crawler is debugging

Your crawler will break. It’s not a matter of if but when. Sometimes it breaks immediately after you write it, because it can’t deal with the malformed HTML garbage that you’re trying to parse. Sometimes it happens randomly, weeks or months later, when the site makes a minor change to their HTML that sends your crawler into a chaotic death spiral. Whatever the case, it’ll happen, it may not be your fault, but you’ll have to deal with it (I recommend cursing profusely). If you can’t deal with random sessions of angry debugging because some stranger you don’t know but want to punch in the face made a change that broke your crawler, quit now.

You can’t make their connection any faster

If your crawler is slow, it’s probably not your fault. Chances are, your incoming connection is much faster and less congested than their outgoing connection. The internet is a series of tubes, and while their tube may be a lot larger than yours, it’s filled with other people’s crap, slowing down your completely legitimate web crawl to, well, a crawl. The only solution is to request more pages in parallel, which leads directly to the next problem…

Sites will ban you

Many web sites don’t want to be crawled, at least not by you. They can’t really stop you, but they’ll try. In the end, they only win if you let them, but it can be quite annoying to deal with a site that doesn’t want you to crawl them. There’s a number of techniques for dealing with this:

  • route your crawls thru TOR, if you don’t mind crawling slower than a pentium 1 with a 2.4 bps modem (if you’re too young to know what that means, think facebook or myspace, but 100 times slower, and you’re crying because you’re peeling a dozen onions while you wait for the page to load)
  • use anonymous proxies, which of course are always completely reliable, and never randomly go offline for no particular reason
  • slow down your crawl to the point a one armed blind man could do it faster
  • ignore robots.txt – what right do these sites have to tell you what you can & can’t crawl?!? This is America, isn’t it!?!

Conclusion

Don’t write a web crawler. Seriously, it’s the most annoying programming work in the world. Of course, chances are that if you’re actually considering writing a crawler, it’s because there’s no other option. In that case, good luck, and stock up on liquor.