Category Archives: python

Python and Django Testing and Continuous Integration Links

Django Continuous Integration:

Python Testing:

NLTK Classifier Based Chunker Accuracy

The NLTK Book has been updated with an explanation of how to train a classifier based chunker, and I wanted to compare its accuracy against my previous tagger based chunker.

Tag Chunker

I already covered how to train a tagger based chunker, with the discovery that a Unigram-Bigram TagChunker is the narrow favorite. I'll use this Unigram-Bigram TagChunker as the baseline for comparison below.

Classifier Chunker

A Classifier based Chunker uses a classifier such as the MaxentClassifier to determine which IOB chunk tags to use. It’s very similar to the TagChunker in that the Chunker class is really a wrapper around a Classifier based part-of-speech tagger. And both are trainable alternatives to a regular expression parser. So first we need to create a ClassifierTagger, and then we can wrap it with a ClassifierChunker.

Classifier Tagger

The ClassifierTagger below is an abstracted version of what's described in the Information Extraction chapter of the NLTK Book. It should theoretically work with any feature extractor and classifier class when created with the train classmethod. The kwargs are passed through to the classifier's train method.

[sourcecode language="python"]
from nltk.tag import TaggerI, untag

class ClassifierTagger(TaggerI):
    '''Abstracted from the "Training Classifier-Based Chunkers" section of
    http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
    '''
    def __init__(self, feature_extractor, classifier):
        self.feature_extractor = feature_extractor
        self.classifier = classifier

    def tag(self, sent):
        history = []

        for i, word in enumerate(sent):
            featureset = self.feature_extractor(sent, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)

        return zip(sent, history)

    @classmethod
    def train(cls, train_sents, feature_extractor, classifier_cls, **kwargs):
        train_set = []

        for tagged_sent in train_sents:
            untagged_sent = untag(tagged_sent)
            history = []

            for i, (word, tag) in enumerate(tagged_sent):
                featureset = feature_extractor(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)

        classifier = classifier_cls.train(train_set, **kwargs)
        return cls(feature_extractor, classifier)
[/sourcecode]

Classifier Chunker

The ClassifierChunker is a thin wrapper around the ClassifierTagger that converts between tagged tuples and parse trees. args and kwargs in __init__ are passed in to ClassifierTagger.train().

[sourcecode language="python"]
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree

class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, *args, **kwargs):
        tag_sents = [tree2conlltags(sent) for sent in train_sents]
        train_chunks = [[((w,t),c) for (w,t,c) in sent] for sent in tag_sents]
        self.tagger = ClassifierTagger.train(train_chunks, *args, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent: return None
        chunks = self.tagger.tag(tagged_sent)
        return conlltags2tree([(w,t,c) for ((w,t),c) in chunks])
[/sourcecode]

Feature Extractors

Classifiers work on featuresets, which are created with feature extraction functions. Below are the feature extractors I evaluated, partly copied from the NLTK Book.

[sourcecode language="python"]
def pos(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos}

def pos_word(sent, i, history):
    word, pos = sent[i]
    return {'pos': pos, 'word': word}

def prev_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos}

def prev_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    return {'pos': pos, 'prevpos': prevpos, 'word': word}

def next_pos(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos}

def next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word}

def prev_next_pos(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'prevpos': prevpos}

def prev_next_pos_word(sent, i, history):
    word, pos = sent[i]

    if i == 0:
        prevword, prevpos = '<START>', '<START>'
    else:
        prevword, prevpos = sent[i-1]

    if i == len(sent) - 1:
        nextword, nextpos = '<END>', '<END>'
    else:
        nextword, nextpos = sent[i+1]

    return {'pos': pos, 'nextpos': nextpos, 'word': word, 'prevpos': prevpos}
[/sourcecode]

Training

Now that we have all the pieces, we can put them together with training.

NOTE: training the classifier takes a long time. If you want to reduce the time, you can increase min_lldelta or decrease max_iter, but you risk reducing the accuracy. Also note that the MaxentClassifier will sometimes produce nan for the log likelihood (I’m guessing this is a divide-by-zero error somewhere). If you hit Ctrl-C once at this point, you can stop the training and continue.

[sourcecode language="python"]
from nltk.corpus import conll2000
from nltk.classify import MaxentClassifier

train_sents = conll2000.chunked_sents('train.txt')
# featx is one of the feature extractors defined above
chunker = ClassifierChunker(train_sents, featx, MaxentClassifier,
    min_lldelta=0.01, max_iter=10)
[/sourcecode]

Accuracy

I ran the above training code for each feature extractor defined above, and generated the charts below. ub still refers to the TagChunker, which is included to provide a comparison baseline. All the other labels on the X-Axis refer to a classifier trained with one of the above feature extraction functions, using the first letter of each part of the name (p refers to pos(), pnpw refers to prev_next_pos_word(), etc).

conll2000 chunk training accuracy
treebank chunk training accuracy

One of the most interesting results of this test is how including the word in the featureset affects the accuracy. The only time including the word improves the accuracy is if the previous part-of-speech tag is also included in the featureset. Otherwise, including the word decreases accuracy. And looking ahead with next_pos() and next_pos_word() produces the worst results of all, until the previous part-of-speech tag is included. So whatever else you have in a featureset, the most important features are the current & previous pos tags, which, not surprisingly, is exactly what the TagChunker trains on.

Custom Training Data

Not only can the ClassifierChunker be significantly more accurate than the TagChunker, it is also superior for custom training data. For my own custom chunk corpus, I was unable to get above 94% accuracy with the TagChunker. That may seem pretty good, but it means the chunker is unable to parse over 1000 known chunks! However, after training the ClassifierChunker with the prev_next_pos_word feature extractor, I was able to get 100% parsing accuracy on my own chunk corpus. This is a huge win, and means that the behavior of the ClassifierChunker is much more controllable thru manual annotation.

jQuery Validation with Django Forms

Django has everything you need to do server-side validation, but it’s also a good idea to do client-side validation. Here’s how you can integrate the jQuery Validation plugin with your Django Forms.

jQuery Validation Rules

jQuery validation works by assigning validation rules to each element in your form. These rules can be assigned a couple different ways:

  1. Class Rules
  2. Metadata Rules
  3. Rules Object

Django Form Class Rules

The simplest validation rules, such as required, can be assigned as classes on your form elements. To do this in Django, you can specify custom widget attributes.

[sourcecode language="python"]
from django import forms
from django.forms import widgets

class MyForm(forms.Form):
    title = forms.CharField(required=True, widget=widgets.TextInput(attrs={
        'class': 'required'
    }))
[/sourcecode]

In Django 1.2, there’s support for a required css class, but you can still use the technique above to specify other validation rules.

Django Form Metadata Rules

For validation methods that require arguments, such as minlength and maxlength, you can create metadata in the class attribute. You'll have to include the jQuery metadata plugin for this style of rules.

[sourcecode language="python"]
from django import forms
from django.forms import widgets

class MyForm(forms.Form):
    title = forms.CharField(required=True, min_length=2, max_length=100, widget=widgets.TextInput(attrs={
        'class': '{required:true, minlength:2, maxlength:100}'
    }))
[/sourcecode]
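Hand-writing that metadata string is error-prone. As a sketch (this helper is hypothetical, not part of Django or the jQuery plugins), you could serialize a rules dict into the metadata class format:

```python
def metadata_class(rules):
    '''Serialize a dict of jQuery validation rules into the string
    format expected by the metadata plugin, e.g. '{required:true, minlength:2}'.
    Hypothetical helper, not part of Django or jQuery validation.'''
    def fmt(value):
        if isinstance(value, bool):
            # javascript booleans are lowercase
            return 'true' if value else 'false'
        if isinstance(value, str):
            return '"%s"' % value
        return str(value)

    pairs = ', '.join('%s:%s' % (key, fmt(val)) for key, val in rules.items())
    return '{%s}' % pairs
```

Calling `metadata_class({'required': True, 'minlength': 2, 'maxlength': 100})` produces the same string as the hand-written class attribute above, so the rules only have to be stated once.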

jQuery Validate Rules Object

If your validation requirements are more complex, or you don’t want to use the metadata plugin or class based rules, you can create a rules object to pass as an option to the validate method. This object can be generated in your template like so:

[sourcecode language="html"]
<script type="text/javascript">
FORM_RULES = {
    '{{ form.title.name }}': 'required'
};

$(document).ready(function() {
    $('form').validate({
        rules: FORM_RULES
    });
});
</script>
[/sourcecode]

The reason I suggest generating the rules object in your template is to avoid hardcoding the field name in your javascript. A rules object can also be used in conjunction with class and metadata rules, so you could have some rules assigned in individual element classes or metadata, and other rules in your rules object.

Error Messages

If you want to keep the client-side validation error messages consistent with Django’s validation error messages, you’ll need to copy Django’s error messages and specify them in the metadata or in a messages object.

Metadata Messages

Messages must be specified per-field, and per-rule. Here’s an example where I specify the minlength message for the title field.

[sourcecode language="python"]
from django import forms
from django.forms import widgets

class MyForm(forms.Form):
    title = forms.CharField(min_length=2, widget=widgets.TextInput(attrs={
        'class': '{minlength:2, messages:{minlength:"Ensure this value has at least 2 characters"}}'
    }))
[/sourcecode]

Messages Object

Messages can also be specified in a javascript object, like so:

[sourcecode language="html"]
<script type="text/javascript">
FORM_RULES = {
    '{{ form.title.name }}': 'required'
};

FORM_MESSAGES = {
    '{{ form.title.name }}': 'This field is required'
};

$(document).ready(function() {
    $('form').validate({
        rules: FORM_RULES,
        messages: FORM_MESSAGES
    });
});
</script>
[/sourcecode]

Just like with validation rules, messages in element metadata can be used in conjunction with a global messages object. Note: if an element has a title attribute, then the title will be used as the default error message, unless you specify ignoreTitle: false in the jQuery validate options.

Error Labels vs Errorlist

Django’s default error output is an error list, while the default for jQuery Validation errors is a label with class="error". So in order to unify your validation errors, there are two options:

  1. make jQuery Validation output an error list
  2. output error labels instead of an error list in the template

Personally, I prefer the simple error labels produced by jQuery validation. To make Django generate those instead of an error list, you can do the following in your templates:

[sourcecode language="html"]
{{ field }}
{% if field.errors %}
{# NOTE: must use id_NAME for jquery.validation to overwrite error label #}
<label class="error" for="id_{{ field.name }}" generated="true">{{ field.errors|join:". " }}</label>
{% endif %}
[/sourcecode]

You could also create your own error_class for outputting the error labels, but then you’d lose the ability to specify the for attribute.
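If you do go the error_class route, the core rendering logic is simple enough to sketch stand-alone. The function below is hypothetical (Django's real ErrorList lives in django.forms, and a real subclass would override its rendering); it just shows the label format and the id_NAME guess that would stand in for the for attribute:

```python
def error_label(field_name, errors):
    '''Render a list of error messages as a jQuery-validation-style
    error label. The id_NAME value is a guess based on Django's
    default auto_id convention, since an ErrorList has no access
    to the field itself.'''
    if not errors:
        return ''
    return '<label class="error" for="id_%s" generated="true">%s</label>' % (
        field_name, '. '.join(errors))
```

The for attribute here is only as reliable as the auto_id guess, which is exactly the limitation mentioned above.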

If you want to try to make jQuery validation produce an error list, that’s a bit harder. You can specify a combination of jQuery validation options and get a list, but there’s not an obvious way to get the errorlist class on the ul.

[sourcecode language="javascript"]
$('form').validate({
    errorElement: 'li',
    wrapper: 'ul'
});
[/sourcecode]

Other options you can look into are errorLabelContainer, errorContainer, and a highlight function.

Final Recommendations

I find it’s easiest to specify class and metadata rules in custom widget attributes 90% of the time, and use a rules object only when absolutely necessary. For example, if I want to require only the first elements in a formset, but not the rest, then I may use a rules object in addition to class and metadata rules. For error messages, I generally use a field template like the above example that I include for each field:

{% with form.title as field %}{% include "field.html" %}{% endwith %}

Or if the form is really simple, I do

{% for field in form %}{% include "field.html" %}{% endfor %}

Django Model Formsets

Django model formsets provide a way to edit multiple model instances within a single form. This is especially useful for editing related models inline. Below is some knowledge I’ve collected on some of the lesser documented and undocumented features of Django’s model formsets.

Model Formset Factory Methods

Django Model Formsets are generally created using a factory method. The default is modelformset_factory, which wraps formset_factory to create Model Forms. You can also create inline formsets to edit related objects, using inlineformset_factory. inlineformset_factory wraps modelformset_factory to restrict the queryset and set the initial data to the instance’s related objects.

Adding Fields to a Model Formset

Just like with a normal Django formset, you can add additional fields to a model formset by creating a base formset class with an add_fields method, then passing it in to the factory method. The only difference is the class you inherit from. For inlineformset_factory, you should inherit from BaseInlineFormSet.

If you’re using modelformset_factory, then you should import and inherit from BaseModelFormSet instead. Also remember that form.instance may be used to set initial data for the fields you’re adding. Just check to make sure form.instance is not None before you try to access any properties.

[sourcecode language="python"]
from django.forms.models import BaseInlineFormSet, inlineformset_factory

class BaseFormSet(BaseInlineFormSet):
    def add_fields(self, form, index):
        super(BaseFormSet, self).add_fields(form, index)
        # add fields to the form

FormSet = inlineformset_factory(MyModel, MyRelatedModel, formset=BaseFormSet)
[/sourcecode]

Changing the Default Form Field

If you’d like to customize one or more of the form fields within your model formset, you can create a formfield_callback function and pass it to the formset factory. For example, if you want to set required=False on all fields, you can do the following.

[sourcecode language="python"]
def custom_field_callback(field):
    return field.formfield(required=False)

FormSet = modelformset_factory(model, formfield_callback=custom_field_callback)
[/sourcecode]

field.formfield() will create the default form field with whatever arguments you pass in. You can also create different fields, and use field.name to do field specific customization. Here’s a more advanced example.

[sourcecode language="python"]
from django.forms import IntegerField, Textarea

def custom_field_callback(field):
    if field.name == 'optional':
        return field.formfield(required=False)
    elif field.name == 'text':
        return field.formfield(widget=Textarea)
    elif field.name == 'integer':
        return IntegerField()
    else:
        return field.formfield()
[/sourcecode]

Deleting Models in a Formset

Pass can_delete=True to your factory method, and you’ll be able to delete the models in your formsets. Note that inlineformset_factory defaults to can_delete=True, while modelformset_factory defaults to can_delete=False.

Creating New Models with Extra Forms

As with normal formsets, you can pass an extra argument to your formset factory to create extra empty forms. These empty forms can then be used to create new models. Note that when you have extra empty forms in the formset, you’ll get an equal number of None results when you call formset.save(), so you may need to filter those out if you’re doing any post-processing on the saved objects.
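Filtering out those None results is a one-liner. In this sketch, saved_objects is a stand-in for what formset.save() would return when two extra forms were left empty:

```python
# stand-in for a formset.save() result where two extra forms were left empty
saved_objects = ['first instance', None, 'second instance', None]

# drop the None placeholders before doing any post-processing
instances = [obj for obj in saved_objects if obj is not None]
```

After this, instances contains only the models that were actually created or modified.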

If you want to set an upper limit on the number of extra forms, you can use the max_num argument to restrict the maximum number of forms. For example, if you want up to 6 forms in the formset, do the following:

[sourcecode language="python"]
MyFormSet = inlineformset_factory(MyModel, MyRelatedModel, extra=6, max_num=6)
[/sourcecode]

Saving Django Model Formsets

Model formsets have a save method, just like with model forms, but in this case, you’ll get a list of all modified instances instead of a single instance. Unmodified instances will not be returned. As mentioned above, if you have any extra empty forms, then those list elements will be None.

If you want to create custom save behavior, you can override 2 methods in your BaseFormSet class: save_new and save_existing. These methods look like this:

[sourcecode language="python"]
from django.forms.models import BaseInlineFormSet

class BaseFormSet(BaseInlineFormSet):
    def save_new(self, form, commit=True):
        # custom save behavior for new objects, form is a ModelForm
        return super(BaseFormSet, self).save_new(form, commit=commit)

    def save_existing(self, form, instance, commit=True):
        # custom save behavior for existing objects
        # instance is the existing object, and form has the updated data
        return super(BaseFormSet, self).save_existing(form, instance, commit=commit)
[/sourcecode]

Inline Model Admin

Django’s Admin Site includes the ability to specify InlineModelAdmin objects. Subclasses of InlineModelAdmin can use all the arguments of inlineformset_factory, plus some admin specific arguments. Everything mentioned above applies equally to InlineModelAdmin arguments: you can specify the number of extra forms, the maximum number of inline forms, and even your own formset with custom save behavior.

Far Future Expires Header with django-storages S3Storage

One way to decrease your site’s load time is to set a far future Expires header on all your static content. This doesn’t help first-time visitors, but can greatly improve the experience of returning visitors. And you get to decrease your bandwidth needs at the same time, because all your static content will be cached by their browser.

S3

weotta puts all of its awesome plan images in Amazon’s S3 using django-storages S3Storage backend, which by default does not set any Expires header. To remedy this, I set AWS_HEADERS in settings.py like so

[sourcecode language="python"]
from datetime import date, timedelta

tenyrs = date.today() + timedelta(days=365*10)

# Expires 10 years in the future at 8PM GMT
AWS_HEADERS = {
    'Expires': tenyrs.strftime('%a, %d %b %Y 20:00:00 GMT')
}
[/sourcecode]

Now every uploaded file gets an Expires header set to 10 years in the future.

upload_to

One potential drawback to using a far future Expires header is that if you change the file content without also changing the file name, no one will notice because they’ll keep using the old cached version of the file. Luckily, Django makes it easy to create (mostly) unique new file names by letting you include strftime formatting codes in a FileField or ImageField upload_to path, such as upload_to='images/%Y/%m/%d'. This way, every uploaded file automatically gets stored by date, which means it would take some deliberate effort to change the contents of a file without also changing the file name.
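The strftime expansion Django applies to upload_to can be previewed directly; this stand-alone sketch mimics what the storage layer does with the pattern (the date is just an example):

```python
from datetime import date

# the upload_to pattern from the text
upload_to = 'images/%Y/%m/%d'

# Django expands the strftime codes against the upload date, so a file
# uploaded on March 5th, 2010 would land under this path prefix
path = date(2010, 3, 5).strftime(upload_to)
# path is 'images/2010/03/05'
```

Two uploads of the same filename on different days therefore get distinct paths, which is what makes the far future Expires header safe in practice.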

Execnet vs Disco for Distributed NLTK

There are a number of options for distributed processing and mapreduce in python. Before execnet surfaced, I’d been using Disco to do distributed NLTK. Now that I’ve happily switched to distributed NLTK with execnet, I can explain some of the differences and why execnet is so much better for my purposes.

Disco Overhead

Disco is a mapreduce framework for python, with an erlang core. This is very cool, but unfortunately introduces overhead costs when your functions are not pure (meaning they require external code and/or data). And part of speech tagging with NLTK is definitely not pure; the map function requires a part of speech tagger in order to do anything. So to use a part of speech tagger within a Disco map function, it must be loaded inline, which means unpickling the object before doing any work. And since a pickled part of speech tagger can easily exceed 500K, unpickling it can take over 2 seconds. When every map call has a fixed overhead of 2 seconds, your mapreduce task can take orders of magnitude longer to complete.

As an example, let’s say you need to do 6000 map calls, at 1 second of pure computation each. That’s 100 minutes, not counting overhead. Now add in the 2s fixed overhead on each call, and you’re at 300 minutes. What should be just over 1.6 hours of computation has jumped to 5 hours.
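The arithmetic is worth making explicit; a quick sanity check of the numbers above:

```python
calls = 6000      # number of map calls
compute = 1       # seconds of pure computation per call
overhead = 2      # fixed unpickling overhead per call, in seconds

pure_minutes = calls * compute / 60.0            # 100 minutes
total_minutes = calls * (compute + overhead) / 60.0  # 300 minutes
```

So the 2-second fixed cost triples the total runtime, turning roughly 1.6 hours of real work into 5 hours.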

Execnet FTW

execnet provides a very different computational model: start some gateways and communicate thru message channels. In my case, all the fixed overhead can be done up-front, loading the part of speech tagger once per gateway, resulting in greatly reduced compute times. I did have to change my old Disco based code to work with execnet, but I actually ended up with less code that’s easier to understand.

Conclusion

If you’re just doing pure mapreduce computations, then consider using Disco. After the one time setup (which can be non-trivial), writing the functions will be relatively easy, and you’ll get a nice web UI for configuration and monitoring. But if you’re doing any dirty operations that need expensive initialization procedures, or can’t quite fit what you need into a pure mapreduce framework, then execnet is for you.

Distributed NLTK with execnet

(This page has been translated into Spanish by Maria Ramos, and has also been translated into Belarusian)

Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don’t know how to distribute NLTK across many cores?

Well, here’s how you can use execnet to do distributed part of speech tagging with NLTK.

execnet

execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in python. With it, you can start python shells over ssh, send code and/or data, then receive results. Below are 2 scripts that will test the accuracy of NLTK’s recommended part of speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.

Runner

The runner does the following:

  1. Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
  2. Loads and pickles the default NLTK part of speech tagger.
  3. Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
  4. Sends the pickled tagger and the name of a corpus (brown) thru the channel.
  5. Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
  6. Finally, it creates a receive queue and prints the accuracy response from each channel.

run_tag_files.py

[sourcecode language="python"]
import execnet
import nltk.corpus, nltk.tag, nltk.data
import cPickle as pickle
import tag_files

HOSTS = {
    'localhost': 2
}

NICE = 20

channels = []

tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

for host, count in HOSTS.items():
    print 'opening %d gateways at %s' % (count, host)

    for i in range(count):
        gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
        channel = gw.remote_exec(tag_files)
        channels.append(channel)
        channel.send(tagger)
        channel.send('brown')

count = 0
chan = 0

for fileid in nltk.corpus.brown.fileids():
    print 'sending %s to channel %d' % (fileid, chan)
    channels[chan].send(fileid)
    count += 1
    # alternate channels
    chan += 1
    if chan >= len(channels): chan = 0

multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()

for i in range(count):
    channel, response = queue.get()
    print response
[/sourcecode]

Remote Module

The remote module is much simpler.

  1. Receives and unpickles the tagger.
  2. Receives the corpus name and loads it.
  3. For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.

tag_files.py

[sourcecode language="python"]
import nltk.corpus
import cPickle as pickle

if __name__ == '__channelexec__':
    tagger = pickle.loads(channel.receive())
    corpus_name = channel.receive()
    corpus = getattr(nltk.corpus, corpus_name)

    for fileid in channel:
        accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
        channel.send('%s: %f' % (fileid, accuracy))
[/sourcecode]

Putting it all together

Make sure you have NLTK and the corpus data installed on every host. You must also have passwordless ssh access to each host from the master host (the machine you run run_tag_files.py on).

run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways followed by a bunch of send messages. Then, just wait and watch the accuracy responses to see how accurate the built in part of speech tagger is on the brown corpus.

If you’d like to test the accuracy on a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus.

If you want to test your own tagger, pickle it to a file, then load and send it instead of NLTK’s tagger. Or you can train it on the master first, then send it once training is complete.

Distributed File Processing

In practice, it’s often a PITA to make sure every host has every file you want to process, and you’ll want to process files outside of NLTK’s builtin corpora. My recommendation is to setup a GlusterFS storage cluster so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.

Django Tools and Links

Using Django
Social Apps
Forms
Notifications
Geolocation
Misc

Machine Learning Links

Django IA: Registration-Activation

django-registration is a pluggable Django app that implements a common registration-activation flow. This flow is quite similar to the password reset flow, but slightly simpler with only 3 views:

  1. register
  2. registration_complete
  3. activate

The basic idea is that an anonymous user can create a new account, but cannot login until they activate their account by clicking a link they’ll receive in an activation email. It’s a way to automatically verify that the new user has a valid email address, which is generally an acceptable proxy for proving that they’re human. Here’s an Information Architecture diagram, again using jjg’s visual vocabulary.

Django Registration IA

Here’s a more in-depth walk-thru with our fictional user named Bob:

  1. Bob encounters a section of the site that requires an account, and is redirected to the login page.
  2. But Bob does not have an account, so he goes to the registration page where he fills out a registration form.
  3. After submitting the registration form, Bob is taken to a page telling him that he needs to activate his account by clicking a link in an email that he should be receiving shortly.
  4. Bob checks his email, finds the activation email, and clicks the activation link.
  5. Bob is taken to a page that tells him his account is active, and he can now login.

As with password reset, I think the last step is unnecessary, and Bob should be automatically logged in when his account is activated. But to do that, you’ll have to write your own custom activate view. Luckily, this isn’t very hard. If you take a look at the code for registration.views.activate, the core code is actually quite simple:

[sourcecode language="python"]
from registration.models import RegistrationProfile

def activate(request, activation_key):
    user = RegistrationProfile.objects.activate_user(activation_key.lower())

    if not user:
        pass  # handle invalid activation key
    else:
        pass  # do stuff with the user, such as automatically login, then redirect
[/sourcecode]

The rest of the custom activate view is up to you.