Category Archives: python

Upcoming Python Book Reviews

Programming Collective Intelligence

I recently finished reading Programming Collective Intelligence and will be posting a review soon. The TL;DR review is: get it if you want a great introduction to machine learning with Python. It covers a lot of complex algorithms in a simple way, and provides some great example use cases.

Python Testing Cookbook

Testing is something nearly every developer can do more of, and this Python Testing Cookbook looks to be full of techniques for integrating testing at various levels of a project. As a preview, you can download a PDF of Chapter 3 – Creating Testable Documentation with doctest.

Python 3 Web Development Beginner’s Guide

I haven’t used Python 3 yet, so Python 3 Web Development Beginner’s Guide is a good excuse to do so. I also haven’t done any web development outside of Django in a few years, and I’m interested to see how it compares to doing it from scratch. As a preview, you can download a PDF of Chapter 3 – Tasklist I Persistence.

Kindle 3

I’m reading all of these on a Kindle 3, which has worked out surprisingly well. It’s obviously not good for copy & pasting code snippets, but that’s generally a bad idea anyway. And if you don’t want to type the code in yourself, you can always download it from the publisher’s site.

Analyzing Tagged Corpora and NLTK Part of Speech Taggers

NLTK Trainer includes 2 scripts for analyzing both a tagged corpus and the coverage of a part-of-speech tagger.

Analyze a Tagged Corpus

You can get part-of-speech tag statistics on a tagged corpus using analyze_tagged_corpus.py. Here are the tag counts for the treebank corpus:

$ python analyze_tagged_corpus.py treebank
loading nltk.corpus.treebank
100676 total words
12408 unique words
46 tags
  Tag      Count
=======  =========
#               16
$              724
''             694
,             4886
-LRB-          120
-NONE-        6592
-RRB-          126
.             3874
:              563
CC            2265
CD            3546
DT            8165
EX              88
FW               4
IN            9857
JJ            5834
JJR            381
JJS            182
LS              13
MD             927
NN           13166
NNP           9410
NNPS           244
NNS           6047
PDT             27
POS            824
PRP           1716
PRP$           766
RB            2822
RBR            136
RBS             35
RP             216
SYM              1
TO            2179
UH               3
VB            2554
VBD           3043
VBG           1460
VBN           2134
VBP           1321
VBZ           2125
WDT            445
WP             241
WP$             14
WRB            178
``             712
=======  =========

By default, analyze_tagged_corpus.py sorts by tag, but you can sort by the highest count using --sort count --reverse. You can also see counts for simplified tags using --simplify_tags:

$ python analyze_tagged_corpus.py treebank --simplify_tags
loading nltk.corpus.treebank
100676 total words
12408 unique words
31 tags
  Tag      Count
=======  =========
              7416
#               16
$              724
''             694
(              120
)              126
,             4886
.             3874
:              563
ADJ           6397
ADV           2993
CNJ           2265
DET           8192
EX              88
FW               4
L               13
MOD            927
N            19213
NP            9654
NUM           3546
P             9857
PRO           2698
S                1
TO            2179
UH               3
V             6000
VD            3043
VG            1460
VN            2134
WH             878
``             712
=======  =========

Analyze Tagger Coverage

You can analyze the coverage of a part-of-speech tagger against any corpus using analyze_tagger_coverage.py. Here’s the results for the treebank corpus using NLTK’s default part-of-speech tagger:

$ python analyze_tagger_coverage.py treebank
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
  Tag      Found
=======  =========
#               16
$              724
''             694
,             4887
-LRB-          120
-NONE-        6591
-RRB-          126
.             3874
:              563
CC            2271
CD            3547
DT            8170
EX              88
FW               4
IN            9880
JJ            5803
JJR            386
JJS            185
LS              12
MD             927
NN           13166
NNP           9427
NNPS           246
NNS           6055
PDT             21
POS            824
PRP           1716
PRP$           766
RB            2800
RBR            130
RBS             33
RP             213
SYM              1
TO            2180
UH               3
VB            2562
VBD           3035
VBG           1458
VBN           2145
VBP           1318
VBZ           2124
WDT            440
WP             241
WP$             14
WRB            178
``             712
=======  =========

If you want to analyze the coverage of your own pickled tagger, use --tagger PATH/TO/TAGGER.pickle. You can also get detailed metrics on Found vs Actual counts, as well as Precision and Recall for each tag, by using the --metrics argument with a corpus that provides a tagged_sents method, like treebank:

$ python analyze_tagger_coverage.py treebank --metrics
loading tagger taggers/maxent_treebank_pos_tagger/english.pickle
analyzing tag coverage of treebank with ClassifierBasedPOSTagger
Accuracy: 0.995689
Unknown words: 440
  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               16          16  1.0            1.0
$              724         724  1.0            1.0
''             694         694  1.0            1.0
,             4887        4886  1.0            1.0
-LRB-          120         120  1.0            1.0
-NONE-        6591        6592  1.0            1.0
-RRB-          126         126  1.0            1.0
.             3874        3874  1.0            1.0
:              563         563  1.0            1.0
CC            2271        2265  1.0            1.0
CD            3547        3546  0.99895833333  0.99895833333
DT            8170        8165  1.0            1.0
EX              88          88  1.0            1.0
FW               4           4  1.0            1.0
IN            9880        9857  0.99130434782  0.95798319327
JJ            5803        5834  0.99134948096  0.97892938496
JJR            386         381  1.0            0.91489361702
JJS            185         182  0.96666666666  1.0
LS              12          13  1.0            0.85714285714
MD             927         927  1.0            1.0
NN           13166       13166  0.99166034874  0.98791540785
NNP           9427        9410  0.99477911646  0.99398073836
NNPS           246         244  0.99029126213  0.95327102803
NNS           6055        6047  0.99515235457  0.99722414989
PDT             21          27  1.0            0.66666666666
POS            824         824  1.0            1.0
PRP           1716        1716  1.0            1.0
PRP$           766         766  1.0            1.0
RB            2800        2822  0.99305555555  0.975
RBR            130         136  1.0            0.875
RBS             33          35  1.0            0.5
RP             213         216  1.0            1.0
SYM              1           1  1.0            1.0
TO            2180        2179  1.0            1.0
UH               3           3  1.0            1.0
VB            2562        2554  0.99142857142  1.0
VBD           3035        3043  0.990234375    0.98065764023
VBG           1458        1460  0.99650349650  0.99824868651
VBN           2145        2134  0.98852223816  0.99566473988
VBP           1318        1321  0.99305555555  0.98281786941
VBZ           2124        2125  0.99373040752  0.990625
WDT            440         445  1.0            0.83333333333
WP             241         241  1.0            1.0
WP$             14          14  1.0            1.0
WRB            178         178  1.0            1.0
``             712         712  1.0            1.0
=======  =========  ==========  =============  ==========

These additional metrics can be quite useful for identifying which tags a tagger has trouble with. Precision answers the question “for each word that was given this tag, was it correct?”, while Recall answers the question “for all words that should have gotten this tag, did they get it?”. If you look at PDT, you can see that Precision is 100%, but Recall is 66%, meaning that every word that was given the PDT tag was correct, but 6 out of the 27 words that should have gotten PDT were mistakenly given a different tag. Or if you look at JJS, you can see that Precision is 96.6% because it gave JJS to 3 words that should have gotten a different tag, while Recall is 100% because all words that should have gotten JJS got it.
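
If you want to see how per-tag metrics like these can be computed, here’s a minimal sketch using NLTK’s set-based precision and recall functions. This is a simplification of what analyze_tagger_coverage.py does, not its actual code; the pickle path is the default tagger location mentioned above:

[sourcecode language="python"]
import nltk.data
from nltk.corpus import treebank
from nltk.metrics import precision, recall

tagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
actual, found = {}, {}

for i, sent in enumerate(treebank.tagged_sents()):
    predicted = tagger.tag([word for word, tag in sent])

    for j, (word, tag) in enumerate(sent):
        # record the position of every actual tag and every predicted tag
        actual.setdefault(tag, set()).add((i, j))
        found.setdefault(predicted[j][1], set()).add((i, j))

for tag in sorted(actual):
    # precision & recall return None when given an empty set
    test = found.get(tag, set())
    print tag, precision(actual[tag], test), recall(actual[tag], test)
[/sourcecode]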

Training Part of Speech Taggers with NLTK Trainer

NLTK trainer makes it easy to train part-of-speech taggers with various algorithms using train_tagger.py.

Training Sequential Backoff Taggers

The fastest algorithms are the sequential backoff taggers. You can specify the backoff sequence using the --sequential argument, which accepts any combination of the following letters:

a: AffixTagger
u: UnigramTagger
b: BigramTagger
t: TrigramTagger

For example, to train the same kinds of taggers that were used in Part of Speech Tagging with NLTK Part 1 – Ngram Taggers, you could do the following:

python train_tagger.py treebank --sequential ubt

You can rearrange ubt any way you want to change the order of the taggers (though ubt is generally the most accurate order).

Training Affix Taggers

The --sequential argument also recognizes the letter a, which will insert an AffixTagger into the backoff chain. If you do not specify the --affix argument, it will include one AffixTagger with a 3-character suffix. You can change this by specifying one or more --affix N options, where N should be a positive number for prefixes and a negative number for suffixes. For example, to train an aubt tagger with 2 AffixTaggers, one that uses a 3-character suffix and another that uses a 2-character prefix, specify the --affix argument twice:

python train_tagger.py treebank --sequential aubt --affix -3 --affix 2

The order of the --affix arguments is the order in which each AffixTagger will be trained and inserted into the backoff chain.

Training Brill Taggers

To train a BrillTagger in a similar fashion to the one trained in Part of Speech Tagging Part 3 – Brill Tagger (using FastBrillTaggerTrainer), use the --brill argument:

python train_tagger.py treebank --sequential aubt --brill

The default training options are a maximum of 200 rules with a minimum score of 2, but you can change them with the --max_rules and --min_score arguments. You can also change the rule template bounds, which default to 1, using the --template_bounds argument.

Training Classifier Based Taggers

Many of the arguments used by train_classifier.py can also be used to train a ClassifierBasedPOSTagger. If you don’t want this tagger to backoff to a sequential backoff tagger, be sure to specify --sequential ''. Here’s an example for training a NaiveBayesClassifier based tagger, similar to what was shown in Part of Speech Tagging Part 4 – Classifier Taggers:

python train_tagger.py treebank --sequential '' --classifier NaiveBayes

If you do want to backoff to a sequential tagger, be sure to specify a cutoff probability, like so:

python train_tagger.py treebank --sequential ubt --classifier NaiveBayes --cutoff_prob 0.4

Any of the NLTK classification algorithms can be used for the --classifier argument, such as Maxent or MEGAM, and every algorithm other than NaiveBayes has specific training options that can be customized.

Phonetic Feature Options

You can also include phonetic algorithm features using the following arguments:

--metaphone: Use metaphone feature
--double-metaphone: Use double metaphone feature
--soundex: Use soundex feature
--nysiis: Use NYSIIS feature
--caverphone: Use caverphone feature

These options create phonetic codes that will be included as features along with the default features used by the ClassifierBasedPOSTagger. The --double-metaphone algorithm comes from metaphone.py, while all the other phonetic algorithms have been copied from the advas project (which appears to be abandoned).

I created these options after discussions with Michael D Healy about Twitter Linguistics, in which he explained the prevalence of regional spelling variations. These phonetic features may be able to reduce that variation where a tagger is concerned, as slightly different spellings might generate the same phonetic code.
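
To illustrate the idea, here’s a simplified Soundex (not the advas implementation, and it skips full Soundex’s special handling of h and w) showing two regional spellings mapping to the same code:

[sourcecode language="python"]
def soundex(word):
    # simplified Soundex for illustration: keep the first letter,
    # encode the rest, drop adjacent duplicate codes, pad to 4 chars
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
        's': '2', 'x': '2', 'z': '2', 'd': '3', 't': '3',
        'l': '4', 'm': '5', 'n': '5', 'r': '6'}
    word = word.lower()
    encoded, prev = word[0].upper(), codes.get(word[0], '')

    for char in word[1:]:
        code = codes.get(char, '')
        if code and code != prev:
            encoded += code
        prev = code

    return (encoded + '000')[:4]

print soundex('color'), soundex('colour')  # C460 C460 - same code for both spellings
[/sourcecode]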

A tagger trained with any of these phonetic features will be an instance of nltk_trainer.tagging.taggers.PhoneticClassifierBasedPOSTagger, which means nltk_trainer must be included in your PYTHONPATH in order to load & use the tagger. The simplest way to do this is to install nltk-trainer using python setup.py install.

Spelling Replacers in Microsoft Speller Challenge

Microsoft/Bing recently introduced its Speller Challenge, and I immediately thought about using my spelling replacer code from Chapter 2, Replacing and Correcting Words, in Python Text Processing with NLTK Cookbook. The API is now online, and can be accessed by doing a GET request to http://text-processing.com/api/spellcorrect/?runID=replacers&q=WORD. With an Expected F1 of ~0.5, I’m currently at number 12 on the Leaderboard, though I don’t expect that position to last long (I was at 10 when I first wrote this). I’m actually quite surprised the score is as high as it is, considering its simplicity and lack of sophistication: it means there’s merit in replacing repeating characters, and/or that Enchant generally gives decent spelling suggestions when controlled by edit distance. Here’s an outline of the code, which should make sense if you’re familiar with the replacers module from Replacing and Correcting Words in Python Text Processing with NLTK Cookbook:

[sourcecode language="python"]
from replacers import RepeatReplacer, SpellingReplacer

repeat_replacer = RepeatReplacer()
spelling_replacer = SpellingReplacer()

def replacer_suggest(word):
    suggest = repeat_replacer.replace(word)

    if suggest == word:
        suggest = spelling_replacer.replace(word)

    return [(suggest, 1.0)]
[/sourcecode]
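
Usage is a single function call; for example (assuming the book’s replacers module is importable, and its default replacer behavior):

[sourcecode language="python"]
print replacer_suggest('looooove')  # [('love', 1.0)] - handled by the repeat replacer
print replacer_suggest('author')    # [('author', 1.0)] - already spelled correctly
[/sourcecode]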

Python Text Processing with NLTK Cookbook Chapter 2 Errata

It has come to my attention that there are two errors in Chapter 2, Replacing and Correcting Words of Python Text Processing with NLTK Cookbook. My thanks to the reader who went out of their way to verify my mistakes and send in corrections.

In Lemmatizing words with WordNet, on page 29, under How it works…, I said that “cooking” is not a noun and does not have a lemma. In fact, cooking is a noun, and as such is its own lemma. Of course, “cooking” is also a verb, and the verb form has the lemma “cook”.
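
You can see this for yourself in an interpreter:

[sourcecode language="python"]
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print lemmatizer.lemmatize('cooking')           # 'cooking' - the default pos is noun
print lemmatizer.lemmatize('cooking', pos='v')  # 'cook'
[/sourcecode]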

In Removing repeating characters, on page 35, under How it works…, I explained the repeat_regexp match groups incorrectly. The actual match grouping of the word “looooove” is (looo)(o)o(ve) because the pattern matching is greedy. The end result is still correct.
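
You can verify the grouping in an interpreter; the pattern below is the repeat_regexp from the recipe:

[sourcecode language="python"]
import re

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
print repeat_regexp.match('looooove').groups()  # ('looo', 'o', 've') - the greedy \w* grabs 'looo'
print repeat_regexp.sub(r'\1\2\3', 'looooove')  # 'loooove' - one repeated char removed per pass
[/sourcecode]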

NLTK Default Tagger CoNLL2000 Tag Coverage

Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.

NLTK Default Tagger Performance on CoNLL2000

The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.

  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               46          47  1              1
$             2122        2134  1              0.6
''            1811        1809  1              1
(                0         351  None           0
)                0         358  None           0
,            13160       13160  1              1
-LRB-          351           0  0              None
-NONE-          59           0  0              None
-RRB-          358           0  0              None
.            10800       10802  1              1
:             1288        1285  0.7143         1
CC            6589        6586  0.6875         0.7333
CD           10325       10233  0.972          0.9919
DT           22301       22355  0.7826         1
EX             229         254  1              1
FW               1          42  1              0.0455
IN           27798       27835  0.7315         0.7899
JJ           15370       16049  0.7372         0.7303
JJR           1114        1055  0.5412         0.575
JJS            611         451  0.6912         0.7966
LS              13           0  0              None
MD            2616        2637  0.7143         0.75
NN           38023       36789  0.7345         0.8441
NNP          24967       24690  0.8752         0.9421
NNPS           589         550  0.4553         0.3684
NNS          17068       16653  0.8572         0.9527
PDT             24          65  0.6667         1
POS           2224        2203  0.6667         1
PRP           4620        4634  0.8438         0.7941
PRP$          2292        2302  0.6364         1
RB            7681        7961  0.8076         0.8582
RBR            288         392  0.5            0.3684
RBS             90         240  0.5            0.1667
RP              63         495  0.1176         1
SYM              0           6  None           0
TO            6257        6259  1              0.75
UH               2          17  1              0.1111
VB            6681        7286  0.9042         0.8313
VBD           8501        8424  0.7521         0.8605
VBG           3730        4000  0.8493         0.8603
VBN           5763        5867  0.8164         0.8721
VBP           3232        3407  0.6754         0.6638
VBZ           5224        5561  0.7273         0.6906
WDT           1156        1157  0.6            0.5
WP             637         639  1              1
WP$             38          39  1              1
WRB            566         571  0.9            0.75
``            1855        1854  0.6667         1
=======  =========  ==========  =============  ==========

Unknown Words in CoNLL2000

The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words combined with a “-“. You might think this can be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.

Missing Symbols and Rare Tags

The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the more rare tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportion. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.

NLTK Default Tagger Treebank Tag Coverage

For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.

NLTK Default Tagger Performance on Treebank

Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn’t have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).

  Tag      Found      Actual      Precision      Recall
=======  =========  ==========  =============  ==========
#               16          16  1              1
$              724         724  1              1
''             694         694  1              1
,             4887        4886  1              1
-LRB-          120         120  1              1
-NONE-        6591        6592  1              1
-RRB-          126         126  1              1
.             3874        3874  1              1
:              563         563  1              1
CC            2271        2265  1              1
CD            3547        3546  0.999          0.999
DT            8170        8165  1              1
EX              88          88  1              1
FW               4           4  1              1
IN            9880        9857  0.9913         0.958
JJ            5803        5834  0.9913         0.9789
JJR            386         381  1              0.9149
JJS            185         182  0.9667         1
LS              12          13  1              0.8571
MD             927         927  1              1
NN           13166       13166  0.9917         0.9879
NNP           9427        9410  0.9948         0.994
NNPS           246         244  0.9903         0.9533
NNS           6055        6047  0.9952         0.9972
PDT             21          27  1              0.6667
POS            824         824  1              1
PRP           1716        1716  1              1
PRP$           766         766  1              1
RB            2800        2822  0.9931         0.975
RBR            130         136  1              0.875
RBS             33          35  1              0.5
RP             213         216  1              1
SYM              1           1  1              1
TO            2180        2179  1              1
UH               3           3  1              1
VB            2562        2554  0.9914         1
VBD           3035        3043  0.9902         0.9807
VBG           1458        1460  0.9965         0.9982
VBN           2145        2134  0.9885         0.9957
VBP           1318        1321  0.9931         0.9828
VBZ           2124        2125  0.9937         0.9906
WDT            440         445  1              0.8333
WP             241         241  1              1
WP$             14          14  1              1
WRB            178         178  1              1
``             712         712  1              1
=======  =========  ==========  =============  ==========

Unknown Words in Treebank

Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it’s not that bad, since it’s only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.

Django Application Conventions

A Django application is really just a python package with a few conventionally named modules. Most apps will not need all of the modules described below, but it’s important to follow the naming conventions and code organization because it will make your application easier to use. Following these conventions gives you a common model for understanding and building the various pieces of a Django application. It also makes it possible for others who share the same common model to quickly understand your code, or at least have an idea of where certain parts of code are located and how everything fits together. This is especially important for reusable applications. For examples, I highly recommend browsing through the code of applications in django.contrib, as they all (mostly) follow the same conventional code organization.

models.py

models.py is the only module that’s required by Django, even if you don’t have any code in it. But chances are that you’ll have at least 1 database model, signal handler, or perhaps an API connection object. models.py is the best place to put these because it is the one app module that is guaranteed to be imported early. This also makes it a good location for connection objects to NoSQL databases such as Redis or MongoDB. Generally, any code that deals with data access or storage should go in models.py, except for simple lookups and queries.
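
As a sketch (the Bookmark model and the redis connection are hypothetical examples, not a prescribed layout):

[sourcecode language="python"]
from django.db import models
import redis

# a module-level connection object belongs in models.py,
# since models.py is always imported early
redis_connection = redis.Redis()

class Bookmark(models.Model):
    url = models.URLField()
    created = models.DateTimeField(auto_now_add=True)
[/sourcecode]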

managers.py

Model managers are sometimes placed in a separate managers.py module. This is optional, and often overkill, as it usually makes more sense to define custom managers in models.py. However, if there’s a lot going on in your custom manager, or if you have a ton of models, it might make sense to separate the manager classes for clarity’s sake.

admin.py

To make your models viewable within Django’s admin system, create an admin.py module with ModelAdmin objects for each necessary model. These models can then be autodiscovered if you call admin.autodiscover() in your top level urls.py.
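
A typical admin.py is only a few lines; here’s a sketch using the hypothetical Bookmark model from above:

[sourcecode language="python"]
from django.contrib import admin
from myapp.models import Bookmark  # hypothetical app & model

class BookmarkAdmin(admin.ModelAdmin):
    list_display = ('url', 'created')

admin.site.register(Bookmark, BookmarkAdmin)
[/sourcecode]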

views.py

View functions (or classes) have 3 responsibilities:

  1. request handling
  2. form processing
  3. template rendering

If a view function is doing anything else, then you’re doing it wrong. There are many things that fall under request handling, such as session management and authentication, but any code that does not directly use the request object, or that will not be used to render a template, does not belong here. One valid exception is sending signals, but I’d argue that a form or models.py is a better location. View functions should be short & simple, and any data access should be primarily read-only. Code that updates data in a database should either be in models.py or the save() method of a form.

Keep your view functions short & simple – this will make it clear how a specific request will produce a corresponding response, and where potential bottlenecks are. Speed has business value, and the easiest way to speed up code is to make it simpler. Do less, and move the complexity elsewhere, such as forms.py.

Use decorators generously for validating requests. require_GET, require_POST, or require_http_methods should go first. Next, use login_required or permission_required as necessary. Finally, use ajax_request or render_to from django-annoying so that your view can simply return a dict of data that will be translated into a JSON response or a RequestContext. It’s not unheard of to have view functions with more decorators than lines of code, and that’s ok because the process flow is still clear, since each decorator has a specific purpose. However, if you’re distributing a pluggable app, then do not use render_to. Instead, use a template_name keyword argument, which will allow developers to override the default template name if they wish. This template name should be prefixed by an appropriate subdirectory. For example, django.contrib.auth.views uses the template subdirectory registration/ for all its templates. This encourages template organization to mirror application organization.
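
Here’s a sketch of what such a decorated view might look like (BookmarkForm and the template name are hypothetical):

[sourcecode language="python"]
from django.contrib.auth.decorators import login_required
from django.views.decorators.http import require_POST
from annoying.decorators import render_to
from myapp.forms import BookmarkForm  # hypothetical form

@require_POST
@login_required
@render_to('bookmarks/create.html')
def create_bookmark(request):
    # request handling & form processing only - no data logic here
    form = BookmarkForm(request.POST)

    if form.is_valid():
        form.save()

    return {'form': form}
[/sourcecode]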

If you have lots of views that can be grouped into separate functionality, such as account management vs everything else, then you can create separate view modules. A good way to do this is to create a views subpackage with separate modules within it. The comments contrib app organizes its views this way, with the user facing comments views in views/comments.py, and the moderator facing moderation views in views/moderation.py.

decorators.py

Before you write your own decorators, check out the http decorators, admin.views.decorators, auth.decorators, and annoying.decorators. What you want may already be implemented, and if not, you’ll at least get to see a bunch of good examples of how to write useful decorators.

If you do decide to write your own decorators, put them in decorators.py. This module should contain functions that take a function as an argument and return a new function, making them higher order functions. This enables you to attach many decorators to a single view function, since each decorator wraps the function returned by the next decorator, until the final view function is reached.

You can also create functions that take arguments, then return a decorator. So instead of being a decorator itself, this kind of function generates and returns a decorator based on the arguments provided. render_to is such a higher order function: it takes a template name as an argument, then returns a decorator that renders that template.
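
For example, a simplified version of render_to could be written like this (this is a sketch, not django-annoying’s actual implementation):

[sourcecode language="python"]
from functools import wraps
from django.shortcuts import render_to_response
from django.template import RequestContext

def render_to(template_name):
    # not a decorator itself - it returns a decorator
    # configured with template_name
    def decorator(view_func):
        @wraps(view_func)
        def wrapper(request, *args, **kwargs):
            context = view_func(request, *args, **kwargs)
            return render_to_response(template_name, context,
                context_instance=RequestContext(request))
        return wrapper
    return decorator
[/sourcecode]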

middleware.py

Any custom request/response middleware should go in middleware.py. Two commonly used middleware classes are AuthenticationMiddleware and SessionMiddleware. You can think of middleware as global view decorators, in that a middleware class can pre-process every request or post-process every response, no matter what view is used.
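
A minimal (hypothetical) example that times every request:

[sourcecode language="python"]
import time

class PageTimerMiddleware(object):
    def process_request(self, request):
        # pre-process every request
        request._start_time = time.time()
        return None  # None means continue processing as usual

    def process_response(self, request, response):
        # post-process every response
        if hasattr(request, '_start_time'):
            response['X-Page-Time'] = str(time.time() - request._start_time)
        return response
[/sourcecode]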

urls.py

It’s good practice to define urls for all your application’s views in their own urls.py. This way, these urls can be included in the top level urls.py with a simple include call. Naming your urls is also a good idea – see django.contrib.comments.urls for an example.
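
For example (app and view names are hypothetical, using the Django 1.2-era patterns syntax):

[sourcecode language="python"]
# myapp/urls.py
from django.conf.urls.defaults import patterns, url

urlpatterns = patterns('myapp.views',
    url(r'^$', 'index', name='myapp_index'),
    url(r'^create/$', 'create_bookmark', name='myapp_create_bookmark'),
)

# the top level urls.py can then do:
# (r'^bookmarks/', include('myapp.urls')),
[/sourcecode]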

forms.py

Custom forms should go in forms.py. These might be model forms, formsets, or any kind of data validation & transformation that needs to happen before storing or passing on request data. The incoming data will generally come from a request QueryDict, such as request.GET or request.POST, though it could also come from url parameters or view keyword arguments. The main job of forms.py is to transform that incoming data into a form suitable for storage, or for passing on to another API.

You could have this code in a view function, but then you’d be mixing data validation & transformation in with request processing & template rendering, which just makes your code confusing and more deeply nested. So the secondary job of forms.py is to contain complexity that would otherwise be in a view function. Since form validation is often naturally complicated, this is appropriate, and keeps the complexity confined to a well defined area. So if you have a view function that’s accessing more than one variable in request.GET or request.POST, strongly consider using a form instead – that’s what they’re for!

Forms often save data, and the convention is to use a save method that can be called after validation. This is how model forms behave, but you can do the same thing in your own non-model forms. For example, let’s say you want to update a list in Redis based on incoming request data. Instead of putting the code in a view function, create a Form with the necessary fields, and implement a save() method that updates the list in redis based on the cleaned form data. Now your view simply has to validate the form and call save() if the data is valid.
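
Here’s a sketch of that Redis example (the field names and key layout are hypothetical):

[sourcecode language="python"]
from django import forms
import redis

class ListAppendForm(forms.Form):
    key = forms.CharField()
    value = forms.CharField()

    def save(self):
        # call only after is_valid() returns True
        r = redis.Redis()
        r.rpush(self.cleaned_data['key'], self.cleaned_data['value'])
[/sourcecode]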

There should generally be no template rendering in forms.py, except for sending emails. All other template rendering belongs in views.py. Email template rendering & sending should also be implemented in a save() method. If you’re creating a pluggable app, then the template name should be a keyword argument so that developers can override it if they want. The PasswordResetForm in django.contrib.auth.forms provides a good example of how to do this.

tests.py

Tests are always a good idea (even if you’re not doing TDD), especially for reusable apps. There are 2 places that Django’s test runner looks for tests:

  1. doctests in models.py
  2. unit tests or doctests in tests.py

You can put doctests elsewhere, but then you have to define your own test runner to run them. It’s often easier to just put all non-model tests into tests.py, either in doctest or unittest form. If you’re testing views, be sure to use Django’s TestCase, as it provides easy access to the test client, making view testing quite simple. For a complete account of testing Django, see Django Testing and Debugging.

backends.py

If you need custom authentication backends, such as using an email address instead of a username, put these in backends.py. Then include them in the AUTHENTICATION_BACKENDS setting.
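
A sketch of such an email backend (assuming authentication against the standard User model):

[sourcecode language="python"]
from django.contrib.auth.backends import ModelBackend
from django.contrib.auth.models import User

class EmailBackend(ModelBackend):
    # ModelBackend provides get_user(), so only authenticate() is needed
    def authenticate(self, username=None, password=None):
        try:
            user = User.objects.get(email=username)
        except User.DoesNotExist:
            return None

        if user.check_password(password):
            return user
        return None
[/sourcecode]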

signals.py

If your app is defining signals that others can connect to, signals.py is where they should go. If you look at django.contrib.comments.signals, you’ll see it’s just a few lines of code with many more lines of comments explaining when each signal is sent. This is about right, as signals are essentially just global objects, and what’s important is how they are used, and in what context they are sent.
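
Defining a signal is only a line or two; for example (the signal name and arguments are hypothetical):

[sourcecode language="python"]
import django.dispatch

# sent after a bookmark is created; receivers get bookmark & created
bookmark_saved = django.dispatch.Signal(providing_args=['bookmark', 'created'])

# senders then do: bookmark_saved.send(sender=Bookmark, bookmark=b, created=True)
[/sourcecode]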

management.py

The post_syncdb signal is a management signal that can only be connected to within a module named management.py. So if you need to connect to the post_syncdb signal, management.py is the only place to do it.
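
For example, a handler that runs after syncdb (the handler body is hypothetical):

[sourcecode language="python"]
# myapp/management.py
from django.db.models.signals import post_syncdb
from myapp import models

def create_defaults(sender, **kwargs):
    # create any default rows your app needs
    pass

post_syncdb.connect(create_defaults, sender=models)
[/sourcecode]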

feeds.py

To define your own syndication feeds, put the subclasses in feeds.py, then import them in urls.py.

sitemaps.py

Custom Sitemap classes should go in sitemaps.py. Much like the classes in admin.py, Sitemap subclasses are often fairly simple. Ideally, you can just use GenericSitemap and bypass custom Sitemap objects altogether.

context_processors.py

If you need to write custom template context processors, put them in context_processors.py. A good case for a custom context processor is to expose a setting to every template. Context processors are generally very simple, as they only return a dict with no more than a few key-values. And don’t forget to add them to the TEMPLATE_CONTEXT_PROCESSORS setting.
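
For example, to expose a (hypothetical) ANALYTICS_ID setting:

[sourcecode language="python"]
from django.conf import settings

def analytics(request):
    # makes {{ ANALYTICS_ID }} available in every template
    return {'ANALYTICS_ID': getattr(settings, 'ANALYTICS_ID', '')}
[/sourcecode]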

templatetags

The templatetags subpackage is necessary when you want to provide custom template tags or filters. If you’re only creating one templatetag module, give it the same name as your app. This is what django.contrib.humanize does, among others. If you have more than one templatetag module, then you can namespace them by prefixing each module with the name of your app followed by an underscore. And be sure to create __init__.py in templatetags/, so python knows it’s a proper subpackage.

management/commands

If you want to provide custom management commands that can be used through manage.py or django-admin.py, these must be modules within the commands/ subdirectory of a management/ subdirectory. Both of these subdirectories must have __init__.py to make them python subpackages. Each command should be a separate module whose name will be the name of the command. This module should contain a single class named Command, which must inherit from BaseCommand or a BaseCommand subclass. For example, django.contrib.auth provides 2 custom management commands: changepassword and createsuperuser. Both of these commands are modules of the same name within django.contrib.auth.management.commands. For more details, see creating Django management commands.
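
A minimal (hypothetical) command module:

[sourcecode language="python"]
# myapp/management/commands/cleanup_bookmarks.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Deletes bookmarks whose urls are no longer reachable'

    def handle(self, *args, **options):
        # the command's actual work goes here
        self.stdout.write('done\n')
[/sourcecode]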

Python Text Processing with NLTK Book Reviews

If you’ve been considering buying Python Text Processing with NLTK 2.0 Cookbook, but haven’t yet, below are a couple reviews that may help convince you how awesome it is 🙂

Jaganadh says in his review of Python Text Processing with NLTK Cookbook at Jaggu’s World:

The eight chapter a revolutionary one which deals with Distributed data processing and handling large scale data with NLTK. (…) This chapter will be really helpful for industry people who is looking for to adopt NLTK in to NLP projects.

I give 9 out of 10 for the book. Natural Language Processing students, teachers, professional hurry and bag a copy of this book.

Sum-Wai says in his review of Python Text Processing with NLTK Cookbook at Tips Tank:

I like it where in each recipe, the author provides extra knowledge on the particular problem, like how a problem can be enhance and solve in another way, or what we need to do if the problem on hand changed, and some extra technical tips, which is very nice and useful.

If you’re thinking about the O’Reilly’s NLTK book – Natural Language Processing with Python, IMHO this book and the O’Reilly NLTK book complements each other. The O’Reilly NLTK book focuses more on getting you to know NLP and the features and usage of NLTK , while Python Text Processing with NLTK teaches us how we would implement NLP/NLTK with tools like MongoDB into solving real world problems.

And Neil Kodner, @neilkod, says:

I’m loving python text processing with nltk cookbook by @japerk, its an excellent companion to the O’Reilly NLTK book

Christmas is coming up, and who doesn’t think about python text processing during the holidays?

If you want a reviewer copy to write your own review, contact Packt at reviewrequest@packtpub.com. And if you do write a review and want to let me know about it, leave a comment here, or contact me on twitter.

The Beginning of Python Text Processing with NLTK Cookbook

It all started with an email to the baypiggies mailing list. An acquisition editor for Packt was looking for authors to expand their line of python cookbooks. For some reason I can’t remember, I thought they wanted to put together a multi-author cookbook, where each author contributes a few recipes. That sounded doable, because I’d already written a number of articles that could serve as the basis for a few recipes. So I replied with links to the following articles:

The reply back was:

The next step is to come up with around 8-14 topics/chapters and around 80-100 recipes for the book as a whole.

My first reaction was “WTF?? No way!” But luckily, I didn’t send that email. Instead, I took a couple days to think it over, and realized that maybe I could come up with that many recipes, if I broke my knowledge down into small pieces. I also decided to choose recipes that I didn’t already know how to write, and use them as motivation for learning & research. So I replied back with a list of 92 recipes, and got to work. Not surprisingly, the original list of 92 changed significantly while writing the book, and I believe the final recipe count is 81.

I was keenly aware that there’d be some necessary overlap with the original NLTK book, Natural Language Processing with Python. But I did my best to minimize that overlap, and to present a different take on similar content. And there’s a number of recipes that (as far as I know) you can’t find anywhere else, the largest group of which can be found in Chapter 6, Transforming Chunks and Trees. I’m very pleased with the result, and I hope everyone who buys the book is too. I’d like to think that Python Text Processing with NLTK 2.0 Cookbook is the practical companion to the more teaching oriented Natural Language Processing with Python.

If you’d like a taste of the book, check out the online sample chapter (pdf) Chapter 3, Custom Corpora, which details how many of the included corpus readers work, how to use them, and how to create your own corpus readers. The last recipe shows you how to create a corpus reader on top of MongoDB, and it should be fairly easy to modify for use with any other database.

Packt has also published two excerpts from Chapter 8, Distributed Processing and Handling Large Datasets, which are partially based on those original 2 articles: