Extended Depth of Field In Python using Mahotas

I wanted to implement an extended depth of field algorithm last week for visualisation.

The idea of extended depth of focus is that you have a stack of images, taken at different focal points, and you build a single artificial image so that you get everything in focus. Some simple methods work OK, some very advanced methods work better. We were happy with OK for our setting. Here’s how to do it with mahotas:

Start with standard imports:

import numpy as np
import mahotas as mh

We are going to assume that you have an image object, which has three dimensions: the stack, height, and width:

stack,h,w = image.shape

We use mh.sobel as the measure of infocusness for each pixel [1]:

focus = np.array([mh.sobel(t, just_filter=True) for t in image])

Now, we select the best slice at each pixel location:

best = np.argmax(focus, 0)

So far, very easy. The next part is the hard part. We want to do the following:

r = np.zeros((h,w))-1
for y in xrange(h):
    for x in xrange(w):
        r[y,x] = image[best[y,x], y, x]

But this is very slow (never run nested loops in Python if you can avoid it). We get the same result with a slightly less legible, but faster manipulation [2]:

image = image.reshape((stack,-1)) # image is now (stack, nr_pixels)
image = image.transpose() # image is now (nr_pixels, stack)
r = image[np.arange(len(image)), best.ravel()] # Select the right pixel at each location
r = r.reshape((h,w)) # reshape to get final result

Et voilà!

§

Here is an example, from a stack of plankton images. This is is the maximum intensity projection:

Maximum Intensity Projection

This is the most in-focus slice (using the sobel operator):

Most in-focus slice

And this is the extended depth of field result:

Extended Depth of Field

It is clearly sharper (look at some of the organelles on the top-left: they are a blur in the other two images),at the expense of some possible noise. I actually played around with blurring the image a little bit and it did improve things ever so slightly.

§

Cite the mahotas paper if using this in a scientific publication.

[1] Other methods simply use a different measure here.
[2] I am not 100% convinced that this is the best. After all we create an array of size len(image) just to index. I would be happy to find an alternative.

Jug now outputs metadata on the computation

This week-end, I added new functionality to jug (previous related posts). Jug can now output the final result of a computation including metadata on all the intermediate inputs!

For example:

from jug import TaskGenerator
from jug.io import write_metadata # <---- THIS IS THE NEW STUFF

@TaskGenerator
def double(x):
    return 2*x

x = double(2)
x2 = double(x)write_metadata(x2, 'x2.meta.yaml')

When you execute this script (with jug execute), the write_metadata function will write a YAML description of the computation to the file x.meta.yaml!

This file will look like this:

args:
- args: [2]
  meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
  name: jugfile.double
meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
name: jugfile.double

It tells you that a computation named jugfile.double was computated at 18h31 on Saturday August 3. It also gives the same information recursively for all intermediate results.

§

This is the result of  a few conversations I had at BOSC 2013.

§

This is part of the new release, 0.9.6, which I put out yesterday.

Mahotas: A continuous improvement example

Last week, I tweeted:

Let me highlight a good example of continuous improvement in mahotas and the advantages of eating your own dog food.

I was using mahotas to compute some wavelets, but I couldn’t remember the possible parameter values. So I got some error:

[1]: mh.daubechies(im, 'db4')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-2e939df57e6a> in <module>()
----> 1 mh.daubechies(im, 'db4')

/home/luispedro/work/mahotas/mahotas/convolve.pyc in daubechies(f, code, inline)
    492     '''
    493     f = _wavelet_array(f, inline, 'daubechies')
--> 494     code = _daubechies_codes.index(code)
    495     _convolve.daubechies(f, code)
    496     _convolve.daubechies(f.T, code)ValueError: 'db4' is not in list

I could have just looked up the right code and moved on. Instead, I considered the unhelpfulness of the error message a bug and fixed it (here is the commit)

Now we get a better error message:

ValueError: mahotas.convolve: Known daubechies codes are ['D2', 'D4', 'D6',
'D8', 'D10', 'D12', 'D14', 'D16', 'D18', 'D20']. You passed in db4.

You still get an error, but at least it tells you what you should be doing.

Good software works well. Excellent software fails well too.

Mahotas software paper published

I got a new paper published [1]:

Mahotas: open source software for scriptable computer vision by Luis Pedro Coelho in Journal of Open Research Software [DOI]

This is about my computer vision software package, mahotas. It started as a way to do bioimage informatics, but the sotware is actually generic to computer vision.

Figure 1

Earlier, I posted a tutorial on image segmentation which used mahotas.

The journal calls these metapapers in that they are not the work but a reference to the work, i.e., the software. This is an interesting new iniciative to reward scientific software development. As I argued before, release of scientific software is a collective action problem: It is better for science to have software released, but not for the individual researcher. I also wrote that it would be a good idea to change the incentives to make it more profitable to do the right thing. The Journal of Open Research Software is exactly one such step.

It has already been used in a few publications, both by myself and others:

As you can see, this is not all about Bioimage Informatics, which is sort of nice as it means that this is actually useful outside of the field in which it was initially developed.

[1] Yes it has been a good month for publications.

The Hard Part is Motivation. Books. &c

Building Machine Learning Systems with Python

Because my book just came out, I am again excerpting from a Portuguese interview with me I mentioned previously:

Who is the target audience for this book?

Luis Pedro: There are two distinct audiences: The first are programmers who do not know much of machine learning, but liked to use a classifier, for example. The second are people who do not need an introduction to machine learning (because they already know), but maybe do not know how to do it with the tools in Python.

What were the main difficulties encountered in this process?

Luis Pedro: The challenge is always to find examples that are not too easy, but they are not also too difficult for an introduction. In one case (image classification), I took some photographs to create a dataset. In the literature, there are classical examples which are trivial to handle with modern techniques and current research problems that are very difficult. My dataset consists of poorly framed photographs taken with a mobile phone camera. Therefore, it also has a style more akin to a problem area or mobile web. The aim is to distinguish photographs of buildings, natural landscape, or texts.

There was nothing too complex for anyone who knows the area. In fact, when it comes to writing about something you already know how to do (as was the case), the hard part is the motivation.

[…]

One thing we stress throughout is principled evaluation and we warn against overselling your results. Even in the world of research, we still find papers that mix training set and test set! Obviously not coming from groups that work in machine learning, but in more applied areas, we still find people who test hyper-parameters in the test set.

Making Your Mistakes in Public

I recently released a new version of mahotas, my computer vision package. It was version 1.0.1, which was quickly followed by 1.0.2.

1.0 had introduced a mistake, which was caught and fixed by Tony S Yu on github (issue 33). Along with a few other improvements, this warranted a new release.

Then, 1.0.1 had a little mistake of its own, caught by Jean-Patrick Pommier [jip’s homepage] on the pythonvision mailing list.

Thus, 1.0.2 was necessary. I released 1.0.2 about an hour after I was alerted to the errors.

§

This is why I develop my scientific code as open source. (1) Other people catch your mistakes and fix them for you  (and the whole community wins)! (2) because the code was public, I felt a great urge to fix the mistakes very fast. I even added tests to make sure they would not happen again. I am harnessing public pressure for good.

Public code becomes better as its openness pushes me to have it as pristine as possible.

Putting My Effort Where My Mouth Is

Yesterday, I argued that computer programming should be part of basic instruction.

Today, I’ll talk about a particular group of people who need to learn to programme: scientists, even lab scientists.

Personally, one of the great draws of moving away from academia into industry (the good industry places, at least) is to be able to use decent tools. Code sharing is a solved problem (the solution is version control). CVS was released in 1986. That’s almost 30 years ago. In academia, use of this type of tool is not (yet?) widespread.

§

In this context, I have (for a long time now) been putting my effort where my mouth is and created a course called Programming for Scientists, which I have taught a couple of times.

Last week, at EMBL, I helped teach a course on Python. This was done on a tight schedule and still we were overwhelmed with demand (we sold out in about 24 after a single email announcement).

Software Carpentry is another great project to teach scientists about programming and computer usage.

Building Machine Learning Systems with Python

I wrote a book. Well, only in part. Willi Richert and I wrote a book.

It is called Building Machine Learning Systems With Python and is now available from Amazon (or Amazon.co.uk), although it has already been partially available directly from the publisher for a while (in a form where you get chapters as editing is finished).

§

The book is an introduction to using machine learning in Python.

We mostly rely on scikit-learn, which is the most complete package for machine learning in Python. I do prefer my own code for my own projects, but milk is not as complete. It has stuff that scikit-learn does not (and stuff they have, correctly, appropriated).

We try to cover all the major modes in machine learning and, in particular, have:

  1. classification
  2. regression
  3. clustering
  4. dimensionality reduction
  5. topic modeling

and also, towards the end, three more applied chapters:

  1. classification of music
  2. pattern recognition in images
  3. using jug for parallel processing (including in the cloud).

§

The approach is tutorial-like, without much math but lots of code examples.

This should get people started and will be more than enough if the problem is easy (and there are still many easy problems out there). With good features (which are problem-specific, anyway) knowing how to run an SVM will very often be enough.

Lest you fear we are giving people enough just enough knowledge to be dangerous, we stress correct evaluation of the results throughout the book. We warn repeatedly against mixing up your training and testing data. This simple principle is, unfortunately, still often disregarded in scientific publications. [1]

§

There is an aspect that I really enjoyed about this whole process:

Before starting the book, I had already submitted two papers, neither of which is out already (even though, after some revisions, they are in accepted state). In the meanwhile, the book has been written, edited (only a few minor issues are still pending) and people have been able to buy parts of it for a few months now.

I have now a renewed confidence in the choice to stay in science (also because I moved from a place where things are completely absurd to a place where the work very well). But the delay in publications that is common in the life sciences is an emotional drag. In some cases, the bulk of the work was finished a few years before the paper is finally out.

Update (July 26 2013): Amazon is now shipping the book! I changed the wording above to reflect this.

[1] It is rare to see somebody just report training accuracy and claim their algorithm does well. In fact, I have never seen it in a recent paper. However, performing feature selection or parameter tuning on the whole data prior to cross-validating on the selected features with the tuned parameters is pretty common still today (there are other sins of evaluation too: “we used multiple parameters and report the best”). This leads to inflated results all around. One of the problems is that, if you do things correctly in this environment, you risk that reviewers of your work will say “looks great, but so-and-so got better results” because so-and-so tuned on the testing set and seems to have “beaten” you. (Yes, I’ve had this happen, multiple times; but that is a rant for another day.)

Parameter Optimization and Early Exit

Andrew Gelman calls this the folk theorem of statistical computing:

When you have computational problems, often there’s a problem with your model.

(This is not a folk theorem per se, as it is not a theorem; but the name stuck. [1])

There is an interesting corollary:

When you are performing parameter selection using cross-validation or a similar blind procedure, the bad parameter choices take longer than the good ones!

This reminds me of the advertising quip (typically attributed to John Wanamaker) that Half the money I spend on advertising is wasted; the trouble is I don’t know which half.

Of course, you can reply: if I knew the right parameters, then you would not need to find them in the first place. Still, understanding that your computation is slower in the really bad cases can be actionable.

§

Here is an example from machine learning: if you are using a support vector machine based system, you will often need to fit two parameters:

  1. The SVM penalty $C$.
  2. The kernel parameter (in the case of radial basis functions, the width $sigma$).

A simple solution is to try a grid of these parameters:

for c in [2**(-2), 2**(-1), 2**0,...]:
    for s in [2**(-2), 2**(-1), 2**0,...]:
        use cross validation to evaluate using c & s
pick best one

There is a fair share of wasted computation. For example, let’s say you have a parameter choice that is so awful it gives you 100% error rate and another which is so good it gives you 0% error. If you are lucky and you check the good combination first, you can abort the bad parameter choices early: the first time you see an error, you know it will never be as good.

This leads to the following algorithm:

error = { param -> 0 for all parameters }
until done:
    next_param = parameter with less error which has not run all folds
    run crossvalidation fold on next_param
    error[next_param] += error on this fold
    best_err = error[ best completed_parameter_value ]
    if best_err is minimum error:
        return best_completed_parameter_value

This aborts as early as possible with the best error.

§

This is implemented in my machine learning package (github link) by default.

I tested it on murphylab_slf7dna (with a bit of hacking of the internals to print statistics &c). I see that fitting with the right parameters takes 650ms (after preprocessing). We check a total of 48 parameter values. So we might expect that to take 0.65 * 48 = 31s. Since bad parameters take longer, it actually takes 48s (50% longer).

Using the early exit trick, it takes it down to 24s. This is half the time of running the full grid. This despite the fact that slightly more than the full grid was run: 57%.

[1] A really interesting question is whether you can formalise and prove it. A good model will often be one that has a nice little “fitness” peak, which is also easy to fit. Likelihood functions with local optima all over the place or ill-conditioned may correspond to worse models. There may be a proof hiding in here.