• Alternate terrain shading

    Mike pointed me to an interesting slide deck by Leland Brown in October 2010, about using a fractional Laplacian to colour a terrain map instead of the usual altitude greyscale or hill shading. The results are pretty nice and the presentation of how he did it is quite clear.

    Conceptually, what he’s doing is a bit like a standard altitude greyscale. Ie: 0′ is black, 15,000′ is white, and use a grey ramp inbetween. But the transform he applies highlights local terrain differences. Ie: the shade of a pixel isn’t just its absolute altitude, but also how it differs based on its neighboring pixels. It’s a little analagous to edge enhancement on an image: find regions of high variance and highlight them. The result is a map that’s essentially altitude coloured, but with highlighting of local terrain details.

    One nice aspect of the math he’s using is that it’s scale invariant. That means the algorithm makes sense at all scales and, also, doesn’t need to be recalculated at different scales.

    The computation is more expensive than hill shading but not terrible: he says he can do 10,000 x 10,000 pixels in a few minutes. He notes at the end it’s a bit tricky to do this with tiled data, because you should consider data outside the tile itself when doing the rendering.

  • Quick PostgreSQL tuning notes

    My little database became a big database, 100 gigs, and I never bothered to tune PostgreSQL. The defaults look tuned for a machine with about 1 gig of RAM total. I’ve got 8 gigs. Here’s a quick stab at some tuning:

    shared_buffers = 256MB
    work_mem = 16MB
    maintenance_work_mem = 256MB
    wal_buffers = 16MB
    checkpoint_segments = 10
    effective_cache_size = 1024MB

    I’m just guessing, not testing. Got some hints from

    Those defaults blow out MacOS’ default shared memory limits. I bumped them up with notes from here and here
    $ sudo sysctl -w kern.sysv.shmmax=536870912
    $ sudo sysctl -w kern.sysv.shmall=131072
  • Contours

    I wanted to play with contours for some mapping. Ie: a shapefile of where the 8000′ or higher terrain is across the US. Turns out to be pretty simple to generate something basically working from SRTM data. The gdal_contour program does the work. The OpenStreetMap folks have some nice notes on how to use it with SRTM data. The details, however, are much harder.

    Basics

    $ gdal_contour -a elevation N40W108.hgt N40W108.shp -i 304.8 -snodata 32767

    That generates 1000′ contours, very fast, maybe 150ms with the data in cache. The output file size varies wildly: for that hilly part of Colorado it’s 1 meg. Generating 100′ contours instead takes about 6x as long and results in 10x the resulting file size. (gdal_contour is dumb about files: if the output file already exists, the error you get is “ERROR 1: N40W108.shp is not a directory”)

    North America has 2415 data files so even on a slow computer 1000′ contours can be generated in about an hour. Once you have the contour shapefiles, you can import them into PostGIS. It goes pretty easily, but is slower than generating the shape files themselves. Should probably create an index, too.

    $ shp2pgsql -p N20W076.hgt.shp srtm_contours | psql contours
    $ shp2pgsql -a N20W076.hgt.shp srtm_contours | psql contours
    $ shp2pgsql -a N20W075.hgt.shp srtm_contours | psql contours
    $ shp2pgsql -a N19W075.hgt.shp srtm_contours | psql contours
    $ shp2pgsql -a N19W076.hgt.shp srtm_contours | psql contours

    The import of all of North America took a couple of hours and the table is 14 gigs on disk. 32,865,353 rows. Some quick efforts to make maps with TileMill suggest it’s too much data. Need indexing on elevation and the geometry, also would really like to simplify those contour lines.

    Thinking in feet

    The elevations are in meters, stored as a double, but I think in feet and it’d be nice to simplify. So let’s convert them to integer feet and index on elevation.

    alter table srtm_contours add column elevation_feet integer;
    update srtm_contours set elevation_feet = round(3.28084*elevation);

    This is very slow, over an hour. PostgreSQL is completely copying the whole table to make rows with the new column. Maybe it would have been better to update the value in place.

    Indexing the data

    Let’s create a basic spatial index!

    create index idx_srtm_contours_the_geom on srtm_contours using gist(the_geom);  -- about 30 minutes, 2 gigs
    vacuum analyze srtm_contours; -- 30+ minutes, at least before database tuning

    Subsetting for TileMill

    Even with the index the contours are just way, way too much for TileMill. I’m not sure how smart TileMill is about querying only data it needs. A different solution is to create a special table with just rows for the 8000′ elevation level (the one we care about). It also seems to help to simplify the contours some.

    create table contour_8000_simple (
      gid integer,
      id integer,
      the_geom geometry,
      constraint contour_8000_pk primary key (gid)
     );
    
    insert into contour_8000_simple
    select gid, id, ST_SimplifyPreserveTopology(the_geom, 0.001) from srtm_contours where elevation_feet = 8000;

    (SQL is approximate, not cut-and-paste ready). The 0.001 parameter means “simplify to 0.001 degree”, in this case.

    Next Steps

    After all the selecting and simplifying the resulting data works OK in TileMill. It renders at an acceptable speed, at least, and if you draw big fat lines you can schmudge over the awkward too-detailed contour. But this work has gotten pretty hacky. I really need polygons, not linestrings, so I can fill the 8000’+ range. But because the contours were calculated on 1degree square tiles in the first place, closing the polygons takes a bit of work. Also I haven’t dealt properly with NODATA regions. Need to sit down and re-do it with more time.

    Here’s what it looks like, for a region of mountains SE of Salt Lake City, UT.

  • Special use airspace for mapping

    I’m playing around with drawing airspace. The FADDS database has nice ESRI shapefiles for B, C, D, and E. Easy to use them directly with QuantumGIS or TileMill, or import them into PostGIS, or whatever. But they don’t publish special use airspace in shapefile format.

    One option is to download the SUA data from the SoaringWeb community in OpenAir format, then convert and import it. It seems to work.

    $ ogr2ogr -f "ESRI Shapefile" allusa allusa.v11.08-25.1.txt
    $ shp2pgsql -s 4326 airspaces.shp | psql faadata
    $ echo '\d airspaces' | psql faadata
    
     gid      | integer               |
     class    | character varying(80) |
     name     | character varying(80) |
     floor    | character varying(80) |
     ceiling  | character varying(80) |
     the_geom | geometry              |
    
    $ echo 'select class, count(*) from airspaces group by class' | psql faadata
    
     class | count
    -------+-------
     Q     |   456
     B     |   341
     P     |    12
     C     |   321
     R     |   417
     D     |   550
    
    

    Classes.. B, C, and D are Bravo, Charlie, and Delta. P is prohibited, R is restricted, Q is “danger” (MOAs and Alert areas, I think). Name is free text. Floor and ceiling are also free text, things like “SFC”, “12500 MSL”, “300 AGL”, and “FL 180”.

    FADDS does provide the special use airspace geometry in a couple of formats. There’s SUA.txt (10 megs), an old text format that ogr2ogr doesn’t seem to support. (The “SUA” driver is something else). There’s also the new SAA AIXM files. They are in FADDS and ogr2ogr has some support for them, but a quick test made it seem pretty inconvenient to use the data. (Update: I wrote Lynn, who makes the Soaring Web files, and she says she wrote her own complex code to handle SUA.txt)

  • Postgres professionalism

    I’m an old MySQL guy, it’s the only relational database I’ve used a lot. So it’s been fun to learn about Postgres. It’s like a database for grownups! Constraints! Views! Transactions!

    MySQL has gotten better over the years, particularly with InnoDB. But fundamentally it was designed as a fast and loose database and it shows. Does it still coerce any string to an integer on insert rather than throwing errors? I can’t tell you how many 0s corrupted my database because someone inserted “apple” into an integer column, or even better null (when the column was marked not null). Also the MySQL driver for Python has always been pretty cryptic and poorly documented. It got the job done, but it didn’t lend a lot of confidence.

    The psycopg2 docs are terrific, really well thought out Python driver. I love that it supports asynchronous queries, including with various coroutine libraries. The way you can extend the cursor and connection classes is quite spiffy, too. (DictCursors and named tuples! Logging cursors!) And lots of nice caveats about using the library properly, including what resources are held by cursors and connections and the like. Docs written by someone who’s really used the library in production.

    I’m told MySQL is still faster than Postgres for big data. I’ve also heard Postgres replication is error prone. Nothing’s perfect, I guess. Looking forward to finding out. Of course my #1 reason to use Postgres is PostGIS: MySQL really can’t touch Postgres for geodata, it’s not even an option.

     

  • Comparing Mongo and Postgres for wind calculation

    I’m playing around with my weather data, CSV files containing things like “KSQL 1136073600 270 10”. (Station name, timestamp, wind direction, wind speed). Each row is 10 bytes of honest data and I have oh, 300 million rows.

    I would like a sophisticated query language for doing aggregate stats, like “show me a histogram of directions for winds > 5kts on Tuesday mornings”. I have been hacking reports up by hand in Python by loading CSV files into lists and doing lots of dict and slices, which is just awful.

    Insert

    I tried Mongo first because it was new and fun, but I think its query facilities are too limited to be useful to me. More on that in a bit. But it’s sure fast for inserts: 30,000 rows a second, 47 bytes / row on disk. It’s also fun to work with a schema-free database, it always takes me 20 minutes to get a SQL create table right.

    Postgres is slower on inserts, about 10,000 rows a second and 52 bytes / row. (The Postgres FAQ says there’s about 28 bytes of overhead per row.) This is with naive Python executemany() insert; a bulk CSV import would be faster.

    But I don’t much care about speed, and less about scalability. This is all one off calculations I’m running by myself. And I’m not equipped to do proper benchmarking anyway.

    One caveat: I’m doing all of this measurement without indices. Well, not entirely, Mongo gets a sort of implicit index because I’m storing each station in a separate collection. In the real world of course you’d want an index on station, maybe also on aspects of the timestamp.

    Queries

    Mongo has very limited query support. It has a basic query predicate language that’s awkward and verbose, no date and time calculations, and precious little aggregation capability like SQL’s AVERAGE() operator. You can write all your queries in Javascript to run on the server. And Mongo has a nifty mapreduce capability for big processing. Only mapreduce doesn’t actually run in parallel in a single Mongo instance, so there’s not much benefit to that computing paradigm unless you have a sharded database.

    The big drawback with Mongo is a lot of code to do something that’d be a one liner in SQL. For instance, here’s getting the average wind speed at a station.

    # select average(s) from KSFO
    # This map reduce version is slow, about 1s on 50,000 rows. It's single threaded :-(
    # http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Parallelism
    
    # Map function: turn a document into the value (1, speed)
    # The key for this map is a constant, the number 1: we're summing over the whole collection
    mapFunction = bson.Code("""
    function() {
      emit(1, { count: 1, sum: this.s });
    }""")
    
    # Reduce function: sum up the counts and sums
    # Note this can be called on its own output, so have to match types from map
    reduceFunction = bson.Code("""
    function(key, values) {
        var result = {count: 0, sum: 0};
    
        values.forEach(function(value) {
          result.count += value.count;
          result.sum += value.sum;
        });
    
        return result;
    }""")
    
    # Execute the query and grab the results. Up to us to do the average calculation.
    result = coll.map_reduce(mapFunction, reduceFunction, "myresults")
    total = result.find_one()["value"]["sum"]
    count = result.find_one()["value"]["count"]
    
    print "Average speed: %.1fkts (%d samples)" % ((total / count), count)
    

    That’s a lot of code just to do an average! Code I have to test, and maintain. It’s not particularly fast, either, 1000ms.

    By contrast Postgres really is just one line of code for averages select avg(speed) from winds where station = 'KPAO'. And SQL has great complex date math, stuff I’d have to write myself for Mongo. Postgres is faster on this query, too: 600ms to do the average and that’s on a full table scan of millions of rows without even an index on the station. (It goes to 5ms with the index).

    Conclusion

    I’m not trying to say Postgres good Mongo bad, not at all. But Mongo’s not a good match for what I’m doing. I don’t have terabytes of data I need to shard across multiple servers. I don’t need sophisticated schema agility. I don’t really need relations, either. All I really need is efficient storage and a fast, convenient query language. Postgres and SQL are pretty good at that.

  • NFS notes

    I set up some NFS mounts from my Linux (Debian) box to my Mac. Some notes.

    Servers: install nfs-kernel-server. Configure it along these lines. NFSv4 seems to want to serve all its fileshares from a single directory, which seems weird, so you do this “mount –bind” thing to collect them on the server. Options on the server are ro,nohide,insecure,no_subtree_check,async

    Client: Use MacOS Disk Utility (File / NFS Mounts..) to add static mount points. They seem to mount automatically once you exit the tool, but I’m not sure. I set the following advanced options: ro,intr,soft,nfsvers=4. Here’s some Mac docs.

    NFSv4 is the new hotness. You can verify you’re using it with “nfsstat -4 -s” on the server and count the numbers.

    These notes are all for a read-only export of world readable data. Next step is to export read-write, which requires synchronizing users. Unless you go some authenticated route (ie: Kerberos) the best bet is to be sure numeric user IDs are the same on the client and the server. On MacOS you can change your uid by right clicking on your name in the Users & Groups preferences pane. I’m guessing you’ll have to manually chown the files afterwards.

  • MongoDB size notes

    I’m playing with using MongoDB to store a bunch of wind time series data for my wind map. First time trying it out. Some quick notes:

    I have a bunch of data that’s 4-tuples like this:

    “KSQL”, 2011-09-23 11:47:00, 200, 7

    That says that at KSQL at a specific time the wind was 7kts from 200 degrees. In an ideal encoding each row would take 10 bytes. In a comfortable encoding it’ll take 20. I’ll settle for 50 bytes per row in the unindexed database. MongoDB is taking 67 bytes per object. Or more!

    Sample input: KDTW and KMDS. 116,000 rows total, 2 megs of CSV, 400k of csv.gz. After import, Mono is reporting a 7.8 meg data size, or 68 bytes / object. That’s with very short element names (“d”, “s”, etc). If I use long element names (“direction”, “speed”) it balloons to 87 bytes / object. And yes, the extra 19 bytes is exactly the # of characters I added to my element name.

    Lame! Apparently Mongo compresses nothing, it doesn’t even try to use a lookup table to avoid storing a zillion copies of element names? (See this user request to tokenize field names.) That seems very bizarre; not only would a simple table save space, wouldn’t it save time by making the memory working set smaller?

    I don’t have any need for queries across multiple station names. So instead of having one giant table (er, “collection”) for all my data, I’m going to make one collection per station. That’ll give me some 5000 collections. I’m still at 47 bytes per row, which I said was acceptable but seems awfully wasteful. I’m beginning to wonder if MongoDB is a good match for this data. All those ObjectIds aren’t doing me any good, that’s for sure.

  • Notes on getting TileStache renders working on MacOS

    Personal work notes for this project I’m working on. Mac DLL hell included. Unlikely to be of interest to any unfortunate reader who found this post.

    I’m trying to run some tile generation software that generates odd TIFFs. The TIFFs have two separate images in one single file. This is a legitimate TIFF but it’s unusual and I’m having a hard time producing it on my Mac. The exact same end code works fine on Linux.

    There’s a bunch of software involved here. Python, gdal, boost, MapNik, Cairo, swig, TileStache, … When it’s all mashed up the code I’m running is basically a Python program with a lot of bindings to geo and image C libraries, using boost and swig. Ie: a big fucking hairball.

    If I use MacOS Lion’s version of libtiff, the code all runs fine but the images it produces are corrupted. Same error as this discussion. gdalinfo can’t load them:

    gdalinfo out/10/000/163/000/395.tiff
    ERROR 1: out/10/000/163/000/395.tiff:Failed to allocate memory for to read TIFF directory (0 elements of 12 bytes each)
    ERROR 1: TIFFReadDirectory:Failed to read directory at offset 52744
    gdalinfo failed - unable to open 'out/10/000/163/000/395.tiff'.

    If I use homebrew’s default libtiff (3.9.5) the code hangs. Poking at it with gdb I see the program is spinning at 100% CPU deep inside some error binding between libtiff, Python, and either Boost or Swig. Yuck!

    If I build my own tiff-4.0.0beta7 in homebrew and use it, it runs and produces the same corrupted TIFFs as Lion. (To build in Homebrew, you have to edit the libtiff.rb formula and add an option to configure, “--x-includes=/usr/X11/include”. Otherwise, it can’t find some of the GLheader files when compiling.). tiff-4.0.0beta5 is no better.

    A big confounding issue has been getting the Python environment right in Homebrew. More notes on that here. I think everything’s working anyway, various Python and gdal programs do, but you never know.

    Another confounding issue has been compiling things correctly. The transition in C compilers on Lion makes a mess. Homebrew gamely tries to use LLVM but some stuff only works with gcc-4.2 (including Numpy). I was worried that mixing the two compilers might have been causing problems, so finally rebuilt everything with just gcc using this environment:

    export HOMEBREW_USE_GCC=true
    export CC=gcc-4.2
    export CXX=g++-4.2
    export FFLAGS=-ff2c
    export HOMEBREW_BUILD_FROM_SOURCE=true

    Extra notes: “man dyld” is very useful for manipulating shared libraries. In particular the DYLD_INSERT_LIBRARIES and DYLD_PRINT_LIBRARIES environment variables let you see and control which versions of libraries are being used.

  • Serving mbtiles tiles for slippy maps

    I’m playing with TileMill (see my blog post here). TileMill publishes to an mbtiles file, a sqlite3 database that contains the rendered image tiles. One simple file, so nice! Now, how to serve it? The simplest solution is TileStream hosting, but I’m too cheap and hackish to pay the very reasonable $50 / month.

    So I set up a tile server on TileStache instead, a Python system. You can see the setup on GitHub. It took about four hours to get working, even though none of it is too complex. Here’s my notes.

    1. Install TileStache via Python easy_install and configure. It’s pretty simple, the only subtle thing is that TileStache wants a cache. This may not be necessary with pre-rendered mbtiles bundles but it was easy enough to set up memcached and configure it.
    2. Set up TileStache to render via CGI, with a CGI script and a bit of Apache config. CGI is a bad way to serve things, but I’d hoped it was sufficient and it was easy to get going. Turns out to be unacceptably slow.
    3. I tried a couple of failed approaches for WSGI. First I tried TileStache’s built in WerkZeug server. This works and is faster than CGI, but lacks administrative scripts to wrap it. Then I tried Apache WSGI and ran into an error with no useful debug output. Gave up on that.
    4. Set up gunicorn, the excellent Python asynchronous server for WSGI. It’s easy to set up on Debian if you get version 0.13 from Backports, it comes with nice administration scripts. Just needs a little config file. I also installed python-gevent for an async worker.
    Doing all this, my crappy old single CPU server with 1 gig of RAM can handle serving a map to a single user at acceptable speed. It can probably do more than that, I haven’t load tested it.
    I left some things not done, this was intended as a quick hack:
    • gunicorn really wants a proxy HTTP server like nginx in front of it. Good idea, but I didn’t want to bother. The bummer is gunicorn doesn’t set any caching headers at all; really need those for a real deployment.
    • The cool kids building TileMill are doing everything with Node.js these days. Their TileStream server would be another option, but getting Node running on Debian was too much work for a quick hack.
    It’s a little surprising that MBTiles is so complex to serve. It’s basically just a big archive of JPG images, it’d be nice to have a little C program to serve it blindingly fast. “mbserv foo.mbtiles –port 8989”. TileStache is a big program that does a lot of stuff, but even TileStream is 900+ lines of code and all it does is serve MBTiles.