The False Hope of Usable Data Analysis

I changed the regular schedule of posts because I wanted to write these ideas down.

A few days ago, in a panel at EuBIAS, I argued again that scientists should learn how to program. I also argued that usability of bioimage analysis was a false hope.

Now, to be sure: usability is great, but usability does not mean usable without programming skills. Good, usable programming environments can be the most usable way to achieve something [1]. I find the Python environment one of the most usable for data analysis currently, although there is still much that could be improved.

§

[Image: road signs]

We can build communication systems without words, but only if the vocabulary is very limited. Otherwise, people need to learn how to read [2]. I think this is a good analogy for non-programming environments.

§

The problem is that image analysis (or data analysis) is not a closed goal. Whatever we are doing today will probably be packaged into simple-to-use tools, but the problems will keep growing in size and complexity.

For a fixed target, like sending email or writing a blog, we can build nice tools that do not require programming. Any modern email client basically does email well enough. There is probably only a small set of behaviours we want our blogs to have (like scheduling a post), and I think a small set of features can cover 95%+ of uses. There might be a need for a few hundred plugins, but not constant innovation. There is no constant pressure to do 10 times more.

But data analysis is not in the same category as sending email. It’s an open-ended problem, which has been growing continuously and will keep growing. Only a full-blown artificial intelligence system would be able to handle the sort of analyses that we will want to do in 10 years. There are even analyses that we already want to do, but for which we do not yet have the right code and tools.

§

If anything, as time has passed, I have felt more and more of a need to think in low-level terms [3].

A few years ago, push-button analysis was sufficient for most problems. Load your data into Excel, select the rows, and plot. Fit a line, compute some stats. Stata gave you a bit more power if Excel did not suffice. Now the problems have grown and push-button solutions do not scale. Not only do we have more data, we have more complex, more unstructured data.

A few years ago, pointing out that Excel can only handle 1 million rows would have made you seem like a technically-obsessed weirdo; now it is a serious limitation.

A few years ago, people were writing things like “feel free to use interpreted languages; it doesn’t matter that you’re losing performance compared to C, computers are super-fast, waste them”. Now, there is much more interest in building implementations that are as fast as C (normally using just-in-time compilation).

This will not get better, and just saying that tools should be easier for non-programmers misses the point.

§

Programming is like writing: a general purpose technological skill which transforms all activities. And this means that, eventually, it becomes useful (or even necessary) for many activities which are outside the core of programming (who’d have thought a salesperson would have to know how to read and write? A firefighter?).

Almost any job that does not require programming is one which can be done by a robot. Except entertainment and those jobs that Tyler Cowen, for lack of a better word, calls marketing. Tyler calls them marketing, but prostitution might be just as accurate, as it is about providing not a specific service or product, which could be provided by a machine, but the general positive feeling that comes from human contact [4].

Related

Bayes and Big Data by Cosma Shalizi

Average Is Over by Tyler Cowen

[1] If you wish, read scripting for programming. I never cared much for this division.
[2] If you google for traffic signs you’ll see that actually most images have at least one sign with words or images.
[3] The need to manage parallelism (as our cores multiply, but do not get faster) and memory access patterns (as data grows faster than RAM) has forced me to think about exactly what is happening in my machines.
[4] Obviously, Tyler is right to use the word marketing even if it’s not a good fit. Prostitution has a strong negative charge.

Mahotas and the Python Scientific Ecosystem for Bioimage Analysis

This week, I’m in Barcelona to talk about mahotas at the EuBIAS 2013 Meeting.

You can get a preview of my talk here. The title is “Mahotas and the Python Scientific Ecosystem for Bioimage Analysis” and, as you’ll see, it does not talk exclusively about mahotas, but about the whole ecosystem. Comments are welcome (especially if they come in the next 24 hours).

In preparation, I released new versions of mahotas (and its sister project mahotas-imread) yesterday:

There are a few bugfixes and small improvements throughout.

*

If you like and use mahotas, please cite the software paper.

Removing a string prefix in Python

A little Python thing I like to do, but have never seen in other people’s code is to remove a prefix like this:

s = 'some string'
if s.startswith('some '):
    s = s[len('some '):]

I like the s[len('some '):] approach as I find it both error-robust (as opposed to typing the actual number like s[5:]) and self-documenting. For example, consider:

from glob import glob
files = glob('datadir/experiment/*.txt')

ids = [f[len('datadir/'):] for f in files]

It is pretty clear that what I want to do is remove the datadir/ prefix.
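The pattern can be wrapped in a tiny helper (the name `remove_prefix` is hypothetical, just for illustration):

```python
def remove_prefix(s, prefix):
    # Only strip the prefix when it is actually there;
    # otherwise return the string unchanged.
    if s.startswith(prefix):
        return s[len(prefix):]
    return s

remove_prefix('datadir/exp1.txt', 'datadir/')  # 'exp1.txt'
remove_prefix('other.txt', 'datadir/')         # unchanged: 'other.txt'
```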

It works for suffixes too:

without_ext = filename[:-len('.txt')]
combined = filename[len('datadir/experiment/'):-len('.txt')]

This is much better than [1]:

combined = filename[18:-4]

§

(One may be tempted to write filename.replace('.txt','') to get rid of a suffix, but this is wrong! It does not work with 'datadir/experiments/datafiles.txt/filename.txt', which is perfectly legal.)
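A quick demonstration of the difference, using that made-up filename:

```python
filename = 'datadir/experiments/datafiles.txt/filename.txt'

# str.replace removes *every* occurrence, not just the suffix:
filename.replace('.txt', '')
# -> 'datadir/experiments/datafiles/filename'

# Slicing only touches the end of the string:
filename[:-len('.txt')]
# -> 'datadir/experiments/datafiles.txt/filename'
```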

§

It is slightly inefficient because the Python interpreter will actually create a string, then compute its length. [2]

However, this is generally in code where it does not matter that much. If it did, I’d be doing it in C(++) or using some other method.
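You can check this with the standard `dis` module (a minimal sketch): the disassembly of the expression shows the string constant being loaded and `len` being called at runtime; nothing is folded away:

```python
import dis

# Disassemble the slicing expression: the constant 'some ' is loaded
# and len() is called every time, because len could have been redefined.
dis.dis("s[len('some '):]")
```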

[1] It should have been filename[19:-4], but it’s hard to see immediately. In any case, writing a number always makes me think, and code should not make you think too much.
[2] It is not allowed to just replace it by the result statically because you may have redefined the function len. It could have a check for the common case, I suppose.

To reproduce the paper, you cannot use the code we used for the paper

Over the last few posts, I described my nuclear segmentation paper.

It has a reproducible research archive.

§

If you now download that code, that is not the code that was used for the paper!

In fact, the version that generates the tables in the paper does not run anymore, because it only runs with old versions of numpy!

In order for it to perform the computation in the paper, I had to update the code. In order to run the code exactly as used for the paper, you need to get old versions of the software.

§

To some extent, this is due to numpy’s frustrating lack of backward compatibility [1]. The issue at hand was the changed semantics of the histogram function.
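The change was roughly of this sort (a sketch, not the code from the paper): modern numpy returns all of the bin edges, one more value than there are counts, while the old semantics returned only the left edges, so code that indexed the second return value broke silently:

```python
import numpy as np

counts, edges = np.histogram([1, 2, 2, 3], bins=3)
# Modern numpy returns every bin edge: one more edge than there are counts.
# The old semantics returned only the left edges (same length as counts),
# so code indexing the second return value broke when the default changed.
print(len(counts), len(edges))  # 3 4
```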

In the end, I think I completely avoided that function in my code for a few years as it was toxic (when you write libraries for others, you never know which version of numpy they are running).

§

But as much as I can gripe about numpy breaking code between minor versions, they would eventually be justified in changing their API with the next major version change.

In the end, the half-life of code is such that each year, it becomes harder to reproduce older papers even if the code is available.

[1] I used to develop for the KDE Project, where you did not break users’ code, ever, and so I find it extremely frustrating to have to explain that you should not change an API on aesthetic grounds between minor versions.

Mahotas-imread Now Accepts Options When Writing

This week, I committed some code to mahotas-imread to allow for setting options when saving:

from imread import imsave
image = ...
imsave('file.jpeg', image, opts={ 'jpeg:quality': 95 })

This saves the image array to file file.jpeg with quality 95 (out of 100).

§

This is only available in the version from github (at the moment), but I will probably put up a new release soon.

(If you like and use mahotas, please cite the software paper.)

Hard and Soft Documentation

A good poker player polarizes their hands. This means that, for example, they might play a check-raise (first refrain from betting, then raise over the top of your opponent’s bet; this is normally done to give the impression that you have a strong hand) when they do have a very strong hand or when they completely missed the flop (they have very bad cards and are just bluffing). Intermediate hands are played differently [1].

I think good software documentation is often also polarized: you should have hard documentation and soft documentation, but nothing in the middle.

Hard Documentation

This is generally low-level documentation, of the kind that a Unix manpage gives you. It tells you exactly what each function and each argument does. When it is good, it is often very succinct.

Mahotas has always excelled at this level. Here is the sobel edge function:

def sobel(img, just_filter=False):
    '''
    edges = sobel(img, just_filter=False)

    Compute edges using Sobel's algorithm

    `edges` is a binary image of edges computed according to Sobel's algorithm.

    This implementation is tuned to match MATLAB's implementation.

    Parameters
    ----------
    img : Any 2D-ndarray
    just_filter : boolean, optional
        If true, then return the result of filtering the image with the sobel
        filters, but do not threshold (default is False).

    Returns
    -------
    edges : ndarray
        Binary image of edges, unless `just_filter`, in which case it will be
        an array of floating point values.
    '''

This is because I can remember the general ideas behind each function, but I might like to look up the exact arguments. So, every little detail is documented.

Soft Documentation

Soft documentation consists of tutorials and other higher-level guides. It does not pertain to a single function or a single object, but to the overall structure and thinking behind the software.

Mahotas has not had so much of these, but I have been trying to add some over the past few months (Finding Wally, for example). Some more mahotas blogging might also help.

The Intermediate Level

I don’t care so much for intermediate level documentation. I rarely find that level helpful. Unfortunately, this is the level at which too much bad documentation is written. Stuff like:

This function is part of the image segmentation pipeline. It can be used after pre-filtering or directly on the raw image data.

Ok, sort of helpful, but not really.

[1] The reason for the randomness is that if you always do a single thing, people will catch on and exploit it (if you bluff a lot, people will call it; if you always have a strong hand, then you won’t get the added benefit of having someone try to call your bluff and give you even more chips). Intermediate hands should not be played like this because if the opponent pushes back, they probably have something that beats your intermediate hand. As always in poker, YMMV.

Merging directories without loss of data

A problem I often have is two directories which are probably mostly the same, but maybe not completely, as some of the files might be newer (edited) versions of the others.

For example, directory A:

A/
A/document.txt
A/blogpost.txt
A/photo.jpg
A/me.jpg

and B:

B/
B/document.txt
B/photo.jpg
B/me.jpg
B/you.jpg

Now, I want to merge A and B. With only this small number of files, I could easily check by hand whether document.txt is the same on both sides, &c. However, in a large directory, this becomes impossible, so I wrote a small utility to do so:

mergedirs B A

This will go through all of the files in B and check whether an equivalent file exists in A. If so, it will check the contents (and flags, depending on the command-line arguments used), and it will refuse to remove any file for which you do not have a copy.

Another cute thing it can do is compute a hash of a directory with all its files:

mergedirs --mode=hash

Prints out (for a directory called merge):

merge                    4a44a8706698da50f41fef5fdcffd163

This can be useful to check whether two directories in different computers are exactly the same (in terms of file contents, flags &c).
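A directory hash of this sort can be sketched in a few lines of Python (this illustrates the idea; it is not necessarily the algorithm mergedirs uses): hash the relative paths and file contents in a fixed order, so two trees compare equal exactly when their files do.

```python
import hashlib
import os

def hash_directory(path):
    # Hash relative paths and file contents in a deterministic order,
    # so identical trees hash identically regardless of filesystem order.
    h = hashlib.md5()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # fix the traversal order of subdirectories
        for name in sorted(files):
            fpath = os.path.join(root, name)
            h.update(os.path.relpath(fpath, path).encode('utf-8'))
            with open(fpath, 'rb') as f:
                h.update(f.read())
    return h.hexdigest()
```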

It’s mostly a tool I wrote to scratch my own itch. I have no plans to develop it beyond my needs, but it might be useful for others too.

Working Around Bugs in Third Party Libraries

This is another story of continuous improvement, this time in mahotas’ little brother imread.

Last week, Volker Hilsenstein here at EMBL had a few problems with imread on Windows. This is one of those very hard issues: how to help someone on a different platform, especially one which you know nothing about?

In the end, the problem was not Windows per se, but an old version of libtiff. In that version, there is a logic error (literally, there is a condition which is miswritten and always false) and the code will attempt to read a TIFF header from a file even when writing. Mahotas-imread was not ready for this.

Many (especially in the research open-source world, unfortunately) would just say: well, I won’t support broken versions of libtiff; if your code does not adhere to the spec, then mine is just not going to work for you. See this excellent old essay by Joel Spolsky on this sort of thing.

In my case, I prefer to work around the bug: when libtiff tries to read in write mode, return no data, which it handles correctly. I wrote the following data reading function to pass to libtiff:

tsize_t tiff_no_read(thandle_t, void*, tsize_t) {
        return 0;
}

The purpose of this code is simply to make imread work even on a broken, 5 year old version of a third party library.

§

In the meanwhile, we also fixed compilation in Cygwin as well as a code path which led to a hard crash.

The possibility of a hard crash, in particular, made me decide that this was important enough to merit a new release.