Snippet: sgn function

From time to time, I find that I need a function to return +1, 0, or -1 representing the sign of a value: +1 or -1 for positive or negative values, or 0 for a zero value. I remember learning this as the sgn function in high school. Python’s standard lib leaves this out, but does supply cmp as a built-in, so the standard approach would probably be to define sgn using:

sgn = lambda x : cmp(x,0)

I found a nice website of Bit-Twiddling Hacks the other day, and it had a nice alternative for sgn, which in Python would be:

sgn = lambda x : (x>0) - (x<0)
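
Just as a sanity check, the two forms agree on the three sample values used in the timing pass below:

for x in (-100, 0, 14):
    assert cmp(x, 0) == (x>0) - (x<0)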

The elegance of this appeals to me and I did a quick timing pass:

C:\Users\Paul>python -m timeit "[cmp(x,0) for x in (-100,0,14)]"
1000000 loops, best of 3: 0.645 usec per loop

C:\Users\Paul>python -m timeit "[lambda x:x>0 - x<0 for x in (-100,0,14)]"
1000000 loops, best of 3: 0.496 usec per loop

So by cutting out the repeated function calls to cmp, this also has the benefit of being just a tad faster.


Snippet: Uniquify a sequence, preserving order

I came across some code today that used a set to keep track of previously seen values while iterating over a sequence, keeping just the not-seen-before items. A brute force kind of thing:

seen = set()
unique = []
for item in sequence:
    if item not in seen:
        unique.append(item)
        seen.add(item)

I remembered that I had come up with a simple form of this a while ago, using a list comprehension to do it in a single expression. I dug up the code and wrapped it up in a nice little function. This version accepts any sequence or generator, and makes an effort to return a value of the same type as the input sequence:

def unique(seq):
    """Function to keep only the unique values supplied in a given 
       sequence, preserving original order."""
       
    # determine what type of return sequence to construct
    if isinstance(seq, (list,tuple)):
        returnType = type(seq)
    elif isinstance(seq, basestring):
        returnType = type(seq)('').join
    else:
        # - generators and their ilk should just return a list
        returnType = list

    try:
        seen = set()
        # "item in seen or seen.add(item)" is False for not-yet-seen items,
        # since set.add returns None - so new items get added to seen and
        # kept by the filter, all in a single expression
        return returnType(item for item in seq if not (item in seen or seen.add(item)))
    except TypeError:
        # sequence items are not of a hashable type, can't use a set for uniqueness
        seen = []
        return returnType(item for item in seq if not (item in seen or seen.append(item)))

My first pass at this tried to compare the benefit of using a list vs. a set for seen – it turns out both versions are needed: if the items in the incoming sequence aren’t hashable, a set is no longer an option, and seen must be a list.
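
A few quick checks at the prompt show the type-matching behavior:

>>> unique([1, 2, 1, 3, 2])
[1, 2, 3]
>>> unique("banana")
'ban'
>>> unique(c for c in "mississippi")
['m', 'i', 's', 'p']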


RK integrator

While moving files from my old laptop drive to my new one, I found a nice Runge-Kutta integrator class that I had written ages ago. So long ago, in fact, that I was a little embarrassed at the newbiness of some of my code. So I decided to update it into a nice RK class, using list comprehensions instead of “for i in range” loops, and including an integrate method that acts as a generator, so that the calling code can cycle through each integration step. As is typical in R-K, the system state is maintained in a vector X, and the caller must provide a callback function that returns dX/dt.

Here is the class:

class RKIntegrator :
    "Class used to perform Runge-Kutta integration of set of ODE's"

    def __init__( self, dt, derivFunc, degree=0, initConds=None ):
        self.dt = float(dt)
        self.dt_2 = dt / 2.0
        self.t = float(0)
        if not (degree or initConds):
            raise ValueError("must specify degree or initial conditions")
        if initConds is not None:
            self.x = initConds[:]
        else:
            self.x = [0.0] * degree
        self.derivFunc = derivFunc


    def doIntegrationStep( self ):
        dt = self.dt
        dxFunc = self.derivFunc
        t2 = self.t + self.dt_2

        # classic 4th-order Runge-Kutta: evaluate the derivatives at the
        # start of the step, twice at the midpoint, and at the end
        dx = dxFunc( self.t, self.x )
        delx0 = [ dx_i*dt for dx_i in dx ]
        xv = [x_i + delx0_i/2.0 for x_i, delx0_i in zip(self.x, delx0)]
        dx = dxFunc( t2, xv )
        delx1 = [ dx_i*dt for dx_i in dx ]
        xv = [x_i + delx1_i/2.0 for x_i,delx1_i in zip(self.x,delx1)]
        dx = dxFunc( t2, xv )
        delx2 = [ dx_i*dt for dx_i in dx ]
        xv = [x_i + delx2_i for x_i,delx2_i in zip(self.x,delx2)]

        self.t += dt
        dx = dxFunc(self.t, xv)

        # combine the four estimates with the standard 1-2-2-1 weighting
        self.x = [
            x_i + ( delx0_i + dx_i*dt + 2.0*(delx1_i + delx2_i) ) / 6.0
                for x_i, dx_i, delx0_i, delx1_i, delx2_i in
                    zip(self.x, dx, delx0, delx1, delx2)
            ]
    
    def integrate(self):
        while True:
            self.doIntegrationStep()
            yield self.t, self.x

Here is an example of finding X with constant acceleration of 4:

def getDX( t, x ):
  return [ 
    x[1], 
    4.0 
    ]

isWhole = lambda x : abs(x-round(x)) < 1e-6

rk = RKIntegrator( dt=0.1, derivFunc=getDX, initConds = [0.0, 0.0] )
for t,x in rk.integrate():
    if t > 10: 
        break
    if isWhole(t):
        print t, ', '.join('%.2f' % x_i for x_i in x)

Googling for ‘Python runge kutta’, I came across this blog posting:
http://doswa.com/blog/2009/01/02/fourth-order-runge-kutta-numerical-integration/
This does a good job, but hardcodes the vector size to just x, velocity, and acceleration. Here is how my R-K integrator would implement Doswa’s code:

def accel(t,x):
    stiffness = 1
    damping = -0.005
    x,v = x
    return -stiffness*x - damping*v

def getDX(t,x):
    return [
        x[1],
        accel(t,x)
        ]

rk = RKIntegrator( dt=1.0/40.0, derivFunc=getDX, initConds = [50.0, 5.0] )
for t,x in rk.integrate():
    if t > 100.1: 
        break
    if isWhole(t):
        print t,', '.join('%.2f' % x_i for x_i in x)

My results match the posted results to 2 places.


“dulce” becomes “littletable”

It’s funny how a little experiment can start to take on momentum all by itself. After looking at other Python databases, it wasn’t long before Google’s BigTable cropped up in my searches.  This suggested a more descriptive and maybe more appropriate name for my experiment – littletable.  Its expectations are modest, and so it has a fairly modest-sounding name.

Tables of objects are created simply by creating an empty table and loading like objects into it.  No schema, no SQL.  The attributes of the objects themselves, and the attributes used in the queries and joins, describe an organic, emergent schema.  I loaded a data table of zipcodes by state (from xxx), and a table of states.  There are a total of over 42,000 defined zipcodes (data as of 1999).  Here is a join of the zipcodes table with the states table:

fullzips = (zips.join_on("statecode") + states)()

A table can keep an index on a particular attribute, with the option to require uniqueness or not.  Indexes are used at join time to optimize the join performance, by minimizing the number of records that have to be sifted through.
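
Creating an index is a one-liner. Here is a sketch for the zipcode/state tables (assuming create_index and its optional unique flag, as in the littletable API):

    # statecodes are unique in states, repeated in zips
    states.create_index("statecode", unique=True)
    zips.create_index("statecode")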

The latest version of littletable (0.3) now includes table pivoting.  This makes it very easy to look at data in a large table to see how it is distributed across particular keys.  For instance, here is a table of the top 20 states with the most zip codes:

    TX Texas             2676
    CA California        2675
    NY New York          2238
    PA Pennsylvania      2224
    IL Illinois          1595
    OH Ohio              1470
    FL Florida           1449
    VA Virginia          1253
    MO Missouri          1192
    MI Michigan          1169
    NC North Carolina    1083
    IA Iowa              1073
    MN Minnesota         1036
    KY Kentucky          1016
    IN Indiana            992
    GA Georgia            975
    WV West Virginia      930
    WI Wisconsin          914
    AL Alabama            847
    TN Tennessee          806

created by “pivoting” the zip code table on the single attribute stateabbr.
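
The pivot itself is equally small – something like this sketch (assuming the pivot method returns a pivot-table object with a dump_counts helper to print tallies like those above):

    piv = zips.pivot("stateabbr")
    piv.dump_counts()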

The states with the fewest zip codes are:

    GU Guam                21
    VI Virgin Islands      16
    FM Federated State      4
    MP Northern Marian      3
    MH Marshall Island      2
    AS American Samoa       1
    PW Palau                1

And this query:

    nozips = states.where(lambda o:o.statecode not in zips.statecode)

returns a single record:

    ['UM', None]

(“UM” is the postal state abbreviation for the U.S. Minor Outlying Islands, a group of uninhabited islands SW of Hawaii – see  http://en.wikipedia.org/wiki/U.S._Minor_Outlying_Islands).

A nice characteristic of littletable queries and joins is that they each return a new fully-functional table, containing the joined and/or filtered records described in the query.  Tables can then be exported to CSV files, making it easy to save and restore the results of a particular query.  Tables are just wrappers around Python lists, so it is still possible to access parts of them using slice notation.
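
Saving the results of the earlier zipcode/state join, for instance, would be just one more call – a sketch, assuming Table exposes a csv_export method:

    # persist the joined zipcode+state records for later reuse
    fullzips.csv_export("fullzips.csv")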

Here is a query from a database of US place names, retrieving all of the tunnels in the US, sorted by descending elevation.

    tunnels = us_names.query(feature="TNL", _orderby="elev desc")

Using basic Python slicing, we can then find the 15 highest and 15 lowest tunnels in the country:

    for t in tunnels[:15]+tunnels[-15:]:
        print "%-30.30s %s %5d" % (t.name, t.state, t.elev)

Giving:

    Twin Lakes Reservoir and Canal CO  4003
    Harold D Roberts Tunnel        CO  3763
    Eisenhower Memorial Tunnel     CO  3728
    Twin Lakes Reservoir Tunnel Nu CO  3709
    Ivanhoe Tunnel                 CO  3680
    Vasquez Tunnel                 CO  3653
    Old Alpine Tunnel (historical) CO  3639
    Hagerman Tunnel                CO  3637
    McCullough Tunnel              CO  3635
    Strickler Tunnel               CO  3608
    August P Gumlick Tunnel        CO  3605
    Charles H Boustead Tunnel      CO  3603
    Quandary Tunnel                CO  3574
    Chapman Tunnel                 CO  3561
    Hoosier Pass Tunnel            CO  3552
    Harvey Tunnel                  LA     2
    Posey Tube                     CA     2
    Harbor Tunnel                  MD     0
    Baytown Tunnel                 TX     0
    Chesapeake Channel Tunnel      VA     0
    Downtown Tunnel                VA     0
    Midtown Tunnel                 VA     0
    Thimble Shoal Channel Tunnel   VA     0
    Holland Tunnel                 NJ     0
    Lincoln Tunnel                 NJ     0
    Brooklyn-Battery Tunnel        NY     0
    Pennsylvania Tunnnels          NY     0
    Queens Midtown Tunnel          NY     0
    Webster Street Tube            CA     0
    Rapid Transit Trans-Bay Tube   CA    -2

So it isn’t necessary to support every SQL feature in the littletable API, since the objects *are* in what is essentially a list structure.

So far littletable has been a handy little tool for quick data manipulation, and maybe some simple database concept experimentation. Not sure if there is much more I really need to add – we’ll see if there are many takers out there for this little recipe.


dulce – a little Python ORM

Not sure how I got started with this; I think I was looking at some ORM-style APIs and wanted to try my hand at it. Not too surprisingly, my result is reminiscent of pyparsing – using operator ‘+’ to define table joins, and __call__ to execute joins and queries. I called this little project “dulce”, as it is really little more than a syntactic sweet, a wrapper around a Python list of Python objects. But here are two things I like:

  • a simple join syntax:
    wishlists = customers.join_on("id") + wishitems.join_on("custid") + catalog.join_on("sku")
  • a simple query-by-key syntax:
    print customers.id["0030"].name

Also, all queries and joins return a new full-fledged dulce Table object, so chaining queries, or exporting the output of joins, is very easy.  And there is no schema definition – the schema “emerges” from the object attributes used in queries and joins.

As it is pure Python, I think it would be ridiculous to make claims about how fast this is, although with psyco acceleration, loading and joining a table with 10,000+ zip codes takes about 2 seconds. I guess the biggest advantages would be:

  • pure Python portability
  • small Python footprint – single script file, about 500 lines long, including docs
  • quick start to organize a collection of objects
  • simple join interface

I created a SourceForge project for this little thing, you can find it here. I haven’t even packaged any releases yet, but the source can be extracted from the project SVN repository.


Code Golf – Converting seconds to hh:mm:ss.mmm, or reduce() ad absurdum

I’ve had to do some timing using time.clock(), but I like seeing output as hh:mm:ss.000.  divmod is the perfect tool for this, as it does the division and remainder in a single call (rekindling my love for Python’s ease in returning multiple values from a function):

def to_hhmmss(sec):
    min,sec = divmod(sec,60)
    hr,min = divmod(min,60)
    return "%d:%02d:%06.3f" % (hr,min,sec)

Right about the time I wrote that, I happened to see an example using reduce, and thought about using successive calls to divmod to build up the tuple to be passed in to this function’s string interpolation.  That is, I needed to call reduce in such a way as to convert 3601.001 to the tuple (1, 0, 1, 1) (1 hour, 0 minutes, 1 second, and 1 millisecond).

reduce() applies a binary operation to a sequence by taking the first two items of the sequence, performing the operation on them and saving the result to a temporary accumulator, then applying the binary operation to the accumulated value and the 3rd item in the sequence, then to the 4th item, and so on.  For instance, to use reduce to sum up the list of integers [1,5,10,50], we would call reduce with the binary operation:

fn = lambda a,b: a+b

and reduce would work through the list as if running this explicit code:

acc = 1
acc = fn(acc, 5)
acc = fn(acc, 10)
acc = fn(acc, 50)
return acc

I need reduce to build up a tuple, which is a perfectly good thing to do with tuple addition.
I’m also going to make the milliseconds field an integer field of its own, so I’ll first convert the initial value from seconds to milliseconds (which also lets me round off any decimal portion smaller than 1 msec). And I’ll use divmod with a succession of divisors to get the correct time value for each field.

So the sequence of values that I will pass to reduce will be the succession of divisors for each field in the tuple, working right to left.  To convert to hh, mm, ss, and msec values, these divisors are (1000, 60, 60), and these will be the 2nd thru nth values of the input sequence – the 1st value will be the value in milliseconds to be converted.

The last thing to do is to define our binary function, which will perform the successive divmods while accumulating our growing tuple of time fields.  Maybe it would be easiest to just map out how our sample conversion of 3601.001 seconds (or 3601001 msec) would work, with some yet-to-be-defined function F.  The last step in the reduce would give us:

(1,0,1,1) = F((0,1,1), 60)
  (0,1,1) = F((1,1), 60)
    (1,1) = F(msec, 1000)

Well, that last line (representing the first binary reduction) looks bad, since all the others are taking a tuple as the first value.  It seems that each successive reduction grows that value by one more term, so the first term should be a 1-value tuple, containing our value in milliseconds:

    (1,1) = F((msec,), 1000)

Also, it’s not really true that this step would have to return (1,1).  Just so long as the final tuple ends with (…,1,1).  So I’ll redo the sequence of steps showing X’s for the unknown intermediate values:

(1,0,1,1) = F((X,1,1), 60)
  (X,1,1) = F((X,1), 60)
    (X,1) = F((msec,), 1000)

The heart of our mystery function F is essentially a call to divmod, using the 0’th element in the current accumulator:

F = lambda a,b : divmod(a[0],b)

But F has to also carry forward (accumulate) the previous divmod remainders, found in the 1-thru-n values in the tuple in a.  So we modify F to include these:

F = lambda a,b : divmod(a[0],b) + a[1:]

Now we have just about all the pieces we need. We have our binary function F.  The sequence of values to pass to F is composed of the initial accumulator (msec,), followed by the divisors (1000, 60, 60).  We just need to build our call to reduce, and use it to interpolate into a suitable hh:mm:ss-style string:

def to_hhmmss(s):
    # convert s to integer milliseconds first, rounding off any
    # sub-millisecond remainder
    return "%d:%02d:%02d.%03d" % \
        reduce(lambda a,b : divmod(a[0],b) + a[1:], [(int(round(s*1000)),),1000,60,60])
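
And a quick check shows it matches the divmod version:

>>> to_hhmmss(3601.001)
'1:00:01.001'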

This may not be the most easy-to-read version of to_hhmmss, but it is good to stretch our thinking into some alternative problem-solving approaches once in a while.



Another super() wrinkle – raising TypeError

I came across a very interesting wrinkle on the behavior of super() yesterday at work. In our most recent major release, we upgraded the Python we use from 2.4 to 2.6, and have been learning a lot about forward compatibility issues between the two versions.

In one part of our code, we import plugins at runtime, and the plugins expose a method process. One particular plugin PluginA has a subclass PluginAAlias that adds some simple aliasing to one of the arguments. So once the subclass does the argument translation, it finishes by calling the process method in its superclass PluginA, using this statement:

return super(PluginAAlias, self).process(a,b,c)

We got a customer report that this code, which worked under Python 2.4, was now failing under 2.6 with this error:

TypeError: super(type, obj): obj must be an instance or subtype of type

Now the simplest solution would be to dump super() and just hardwire the call to PluginA.process. But we use super() in a number of places in the product, and I didn’t want to chase them all down and change them too. I really wanted to better understand what the problem was.

Our first guess (suggested by colleague Mike Thornton) was that PluginA was not a new-style class. Our codebase includes some old and dusty code in places, and it is not inconceivable that we still have an old-style class floating around here and there. But sure enough, PluginA inherits from Plugin, which inherits from object. So that theory, while a good guess, was busted.

I googled for super(), Python 2.5 and 2.6 release notes, and the error message itself, and found a number of cautionary articles on the perils of super(), particularly in class hierarchies with multiple inheritance – but in this case, we are strictly singly inheriting. Just about all of the examples showed a subclass making upcalls to the superclass __init__ method, which also didn’t quite match my situation. I did learn that in 2.5, super was tightened up to ensure that the object passed to super was in fact an instance of the named class, but I knew this was the case since the problem code statement was *in* a method of that class, so how else would we have gotten there if it weren’t?

So then I started to add some print statements around the problem line of code, and sure enough, what I thought was impossible was in fact possible.

print PluginAAlias
print self.__class__
print isinstance(self, PluginAAlias)

Gave me:

<class 'pluginAAlias.PluginAAlias'>
<class 'pluginAAlias.PluginAAlias'>
False

Huh??!!! I then expanded the print statements to print the id() of the classes, and sure enough the id’s were different – so that was why isinstance was failing, and super() was raising the TypeError.

With a little more instrumenting, I found that our code for loading the plugin modules can get run repeatedly. And this was the final piece of the puzzle. Our plugin-loading code was not using import, but imp.load_module(). The docs for imp.load_module() tell us that repeated calls with the same module reference will act like a reload. And so here is the smoking gun.

Completely outside of the product, I created a little module, a.py, containing an empty class A. Then from the Python prompt, I ran these commands:

>>> import imp
>>> m = imp.find_module("a")
>>> a = imp.load_module("a", *m)
>>> a.A
<class 'a.A'>
>>> aobj = a.A()
>>> aobj.__class__
<class 'a.A'>

Now that I had an object created, I reloaded the a module:

>>> a = imp.load_module("a", *m)
>>> print a.A
<class 'a.A'>
>>> isinstance(aobj, a.A)
False

I had recreated my “impossible” condition, an instance of a.A that fails isinstance(aobj, a.A).

The final proof, calling super() as in the original bug:

>>> super(a.A, aobj)
Traceback (most recent call last):
File "<stdin>", line 1, in
TypeError: super(type, obj): obj must be an instance or subtype of type

Voila! The root cause was found! Because the repeated calls to load_module act as a reload, the objects created using the old class no longer satisfy the isinstance test, so super will fail.

My solution? Originally, I thought I would memoize our plugin loader, so that plugins won’t get reloaded for the same module name. But I’m still a bit new on this team, and it occurred to me that reloading of plugins without having to restart a daemon might be of some advantage. So instead, I added an __init__ method to PluginAAlias.

def __init__(self, *args, **kwargs):
    self.as_super = super(PluginAAlias, self)
    self.as_super.__init__(*args, **kwargs)

My readings on super told me that super doesn’t just do a cast of self, but returns a proxy object that delegates attribute lookups to self, following the MRO beginning with self’s superclass. I knew that super() must succeed at the time the plugin object was originally created, since at that point we could not yet have reloaded the plugin.

Then I changed the offending line to read:

return self.as_super.process(a,b,c)

The as_super attribute, being built in the original __init__ method, contains a proxy to the correct superclass of the plugin, even if the plugin module gets reloaded later. (I also thought about saving the superclass’s process method in a super_process attribute, and just calling that; but I wanted a more general solution, in case there were other methods on plugins that I hadn’t found yet.)

So that was the solution. I thought I’d write this up, as it was a slightly different form of super() failure than any I had found in my own googling, so it might be of interest to someone else struggling with crazy-looking TypeError messages, when you just know the object is of the correct class.



Starting a new Blog

I was trying to keep a blog on some of my experiences and insights while coding in Python.  My first post on Blogger was a disaster, as the editor has very limited options for highlighting keywords or formatting code samples.  So I looked about, and found out that I had already created an account on WordPress, just never started a blog here.  Let’s hope I can post more often than once a year.
