• A place where bzip2 is useful

    I almost always regret using uber-compressor bzip2. It generally compresses to, oh, 90% of the size of gzip but takes 10x as long to compress and 2x as long to decompress. Not worth it, bad space/time tradeoff.

    The exception is Apache access log files. Something about the terribly repetitive text of these things is really good for bzip2. Here’s a quick test on a 26 meg access.log (using bzip2 1.0.5 and gzip 1.4 on Debian i686)

    gzip: 10.8%, 860ms to compress, 160ms to decompress

    bzip2: 5.7%, 12700ms to compress, 1600ms to decompress

    The bzip2 is literally half the size of the gzip. That’s a lot. Too bad it took 15x as much time to generate it and 10x as much time to decompress it again. A quick Google suggests most people find bzip2 about 3x-5x slower, not sure why it’s so bad for me. Maybe because my sample is also unusually good at compressing!
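
    A rough sketch of how one might rerun this comparison with Python’s standard gzip and bz2 modules. The repeated log line is a synthetic stand-in I am assuming for illustration; it compresses far better than a real access.log, so the numbers will not match the ones above.

```python
import bz2
import gzip
import time


def measure(name, compress, decompress, data):
    """Return (name, ratio, compress_seconds, decompress_seconds) for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    unpacked = decompress(packed)
    decompress_s = time.perf_counter() - start

    assert unpacked == data  # the round trip must be lossless
    return name, len(packed) / len(data), compress_s, decompress_s


# Synthetic stand-in for access.log (an assumption, not the original data).
# For a real test, use: sample = open("access.log", "rb").read()
line = (b'207.46.13.138 - - [04/Jul/2010:06:32:55 +0000] '
        b'"GET /robots.txt HTTP/1.1" 200 107 "-" "msnbot/2.0b"\n')
sample = line * 20000

results = [
    measure("gzip", gzip.compress, gzip.decompress, sample),
    measure("bzip2", bz2.compress, bz2.decompress, sample),
]
for name, ratio, c_s, d_s in results:
    print(f"{name}: {ratio:.1%}, {c_s * 1000:.0f}ms to compress, "
          f"{d_s * 1000:.0f}ms to decompress")
```

    Timing the library calls in-process leaves out file I/O and process startup, so treat the results as relative rather than as a match for the command line tools.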

    Some other fast compression options: pigz, pbzip2, and threadzip for parallel compressors, or snappy for Google’s very-fast-but-not-as-tight compressor.

    Here’s the sample data I tested:

    
    207.46.13.138 - - [04/Jul/2010:06:32:55 +0000] "GET /robots.txt HTTP/1.1" 301 329 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
    207.46.13.138 - - [04/Jul/2010:06:32:55 +0000] "GET /robots.txt HTTP/1.1" 200 107 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
    61.135.216.105 - - [04/Jul/2010:06:32:55 +0000] "GET /weblog/index.atom HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible;YoudaoFeedFetcher/1.0;http://www.youdao.com/help/reader/faq/topic006/;2 subscribers;)"
    
    
  • Ubuntu server cron jobs

    Here’s all the cron jobs on my relatively vanilla Ubuntu Server setup. I’ve added Apache and a few other minor things to the default install.

    Daily:

    • htcacheclean, for Apache’s mod_cache disk cache
    • apport, cleans up old reports in /var/crash
    • apt, cache management
    • aptitude, making logs of what’s installed
    • dpkg, also making logs of what’s installed
    • popularity-contest, telling Ubuntu what people installed
    • bsdmainutils, managing user’s ~/calendar files
    • logrotate
    • man-db, managing the man page database
    • mlocate, an efficient implementation of good ol’ locate for fast finds
    • ntp, managing ntpstats (which aren’t enabled by default)
    • passwd, making backups of /etc/passwd and friends. (wtf?)
    • standard, the “standard daily maintenance script”. backs up /etc/passwd (again?) and looks for lost+found files. Dumb script, and redundant
    Weekly:
    Logrotate is configured to rotate logs weekly, keep 4 weeks’ worth, and, surprisingly, not compress old logs. Apache overrides this to keep 52 weeks of logs and to compress them. I’m pleased to see the Ubuntu logrotate has dateext and dateformat options to name logfiles like “access.log-20120108” rather than naming them “access.log.3” and then renaming that to “access.log.4” the next week. That should be the default behavior but apparently that option isn’t available in older versions of logrotate.
  • Ubuntu and FQDNs

    Setting the fully qualified domain name on an Ubuntu host is way too hard. The Internet is full of confusing advice. Most people first seem to notice the FQDN isn’t right when setting up Apache, but it’s a broader problem than that. (You can fix Apache just by giving an explicit ServerName directive.) “hostname -f” should return the right thing.

    My solution:

    • /etc/hostname contains just the system name, “wi”
    • /etc/hosts contains an entry like this:
      208.110.64.210  wi.nelson.monkey.org wi
    • /etc/resolv.conf contains an entry like this:
      domain nelson.monkey.org

    And now “hostname” returns “wi”, “hostname -f” returns “wi.nelson.monkey.org”. Note I don’t have a reverse DNS entry for that IP address. (I do have forward DNS so wi.nelson.monkey.org resolves, but I don’t think it’s used.)

    Part of the confusion is a lot of people (and Ubuntu by default) want you to do something like

    127.0.1.1 wi.nelson.monkey.org wi

    This kind of works, despite the odd loopback address number. But it’s not really correct: your FQDN should resolve to a routable IP address, not a loopback.
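
    A small sketch for checking the result from Python. It assumes socket.getfqdn() is a good enough stand-in for “hostname -f”, which may not hold on every system.

```python
import ipaddress
import socket


def is_loopback(address):
    """True if the address is in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def describe_host():
    """Return (short name, FQDN, address the FQDN resolves to, or None)."""
    short = socket.gethostname()
    fqdn = socket.getfqdn()
    try:
        address = socket.gethostbyname(fqdn)
    except OSError:
        address = None  # the FQDN does not resolve at all
    return short, fqdn, address


if __name__ == "__main__":
    short, fqdn, address = describe_host()
    print(short, fqdn, address)
    if address is not None and is_loopback(address):
        print("FQDN resolves to a loopback address")
```

    With the 127.0.1.1 style of /etc/hosts entry the last message should print; with a real address in /etc/hosts it should not.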

     

     

  • Ubuntu server notes from an old Debian user

    I’m setting up a new Linux server and using Ubuntu for it for the first time. I’ve been a Debian guy for a long time now (and before then, Red Hat), so Ubuntu is both new and old. I switched to it because it’s got a faster upgrade cycle than Debian, but I like all the extra tools they’ve added, too.

    Here’s some of the new things in Ubuntu I’ve learned about. The Ubuntu server docs are an excellent place to start.

    • upstart, a replacement for SysV init scripts that runs things in parallel. The configuration files live in /etc/init. There’s still old SysV stuff in /etc/rc2.d and /etc/init.d being run. The new service command is nice, ie “service --status-all”. There’s plans for upstart to replace cron and at, too. Docs: UbuntuBootupHowto, original rationale, upstart tutorial, and cookbook. (Ubuntu does not use systemd, another modern init replacement).
    • plymouth, the graphic splash boot screen. For some reason this is complicated and intertwined with upstart.
    • ureadahead (aka Über-readahead), a clever system to improve boot times by reading all necessary user files off of disk in one big transfer at the beginning of boot. The boot image is kept in /var/lib/ureadahead and there’s clever stuff to invalidate and re-write the files when new things are installed. It seems to save a few seconds on boot for me, not positive. More info here.
    • ext4fs is the default filesystem. It has various incremental improvements to ext3fs, many of them aimed at efficiently handling big files and disks.
    • AppArmor, a security mechanism. It lets the kernel restrict what access various programs have. For instance, ntpd is only allowed to write to certain log files so if there’s ever an exploit against ntpd it won’t go writing random files as root. It seems like a fairly limited security mechanism given how big the attack surface of Linux is, but also harmless and potentially useful.
    • ufw, Uncomplicated Firewall, is a friendlier interface for firewall rules than iptables.
    • Automatic software updates. The setup process is embarrassingly manual, but you can have Ubuntu install security updates (or all updates) automatically.
    • pam_motd, the absurd thing that makes /etc/motd show current system stats. It seems to update on login. If it ever breaks I will be very angry if I can’t log into my system.
    • JeOS is a stripped down Ubuntu server for running as a guest in a virtualization environment. I have no need for this now, but it’s a good idea.
    • Bacula, a fancy backup system. I will probably stick with rsnapshot and rsync.
    • Launchpad, a website where many Ubuntu projects are hosted. Bug tracking, etc.

    One thing they haven’t improved is libresolv; it still queries nameservers serially, one at a time. Worse, the first entry in /etc/resolv.conf for me named 192.168.1.100 as the nameserver, a bogus entry from the initial setup in a different DHCP environment. So every DNS query was waiting 5 seconds for that to time out. I fixed it by putting Google’s 8.8.8.8 as the first nameserver.
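
    Since the order of nameserver lines matters so much, here is a minimal sketch that lists them in the order the resolver would try them. The example file contents are hypothetical, apart from the bogus 192.168.1.100 entry described above.

```python
def nameservers(resolv_conf_text):
    """Return nameserver addresses in the order they appear in the file."""
    servers = []
    for raw in resolv_conf_text.splitlines():
        # Drop comments, which resolv.conf marks with '#' or ';'.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Hypothetical file contents; the second nameserver is a made-up placeholder.
example = """\
domain nelson.monkey.org
nameserver 192.168.1.100
nameserver 8.8.4.4
"""
print(nameservers(example))  # ['192.168.1.100', '8.8.4.4']
```

    To inspect a real machine, pass open("/etc/resolv.conf").read() instead of the example string. If the first address printed is unreachable, every lookup pays the timeout before the next server is tried.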

    My next project is to automate all my deployment of my own stuff to the server using Puppet or maybe Chef Solo. I’m trying to avoid the thing where my servers become an irreproducible rat’s nest of random crap I’ve collected over time. OTOH, I can tell already it’s going to be a lot of work for one little personal projects server.

     

  • METAR parsing libraries

    METAR, the weather observation format, is a great little text format

    KOAK 032353Z 00000KT 5SM BR SCT200 11/08 A3029 RMK AO2 SLP256 VIS SE 1 1/2

    It’s also not a very standardized format, making parsing a bit of an art. Here’s a few different METAR parsing libraries I’ve found:

    1. Tom Pollard’s python-metar (sourceforge, github). This is what I used to prepare windhistory.com and it’s pretty good.
    2. Joe Yates’ metar-parser for Ruby. Has some test coverage. Also references other Ruby implementations.
    3. METAR decoding software from NWS. The closest thing to official code, written by Eric McCarthy, now on SourceForge. It’s 12,000+ lines of C code (!), albeit with a lot of comments, unit tests, etc. Has some interesting code for interpreting the meaning of METARs, like whether it codes for a tornado or the like.

    There’s lots of other parsers for various languages: PHP, .NET, etc.
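
    To give a feel for why parsing is an art, here is a deliberately tiny sketch that pulls a few fields out of the sample report above with regular expressions. It is not how any of the listed libraries work, and the field names are my own; it ignores most of the format (visibility, clouds, weather codes, remarks).

```python
import re

# Station, day/time group, and wind group at the start of the report.
HEADER_RE = re.compile(
    r"^(?P<station>[A-Z][A-Z0-9]{3})\s+"
    r"(?P<day>\d{2})(?P<hour>\d{2})(?P<minute>\d{2})Z\s+"
    r"(?P<wind_dir>\d{3}|VRB)(?P<wind_speed>\d{2,3})(?:G(?P<gust>\d{2,3}))?KT\b"
)
TEMP_RE = re.compile(r"\s(?P<temp>M?\d{2})/(?P<dew>M?\d{2})\s")
ALTIMETER_RE = re.compile(r"\sA(?P<inches>\d{4})\b")


def to_celsius(field):
    """METAR prefixes negative temperatures with M, e.g. M05 is -5 C."""
    return -int(field[1:]) if field.startswith("M") else int(field)


def parse_metar(report):
    """Extract a few common fields; return None if the header does not match."""
    body = report.split(" RMK", 1)[0] + " "  # ignore the free-form remarks
    head = HEADER_RE.match(body)
    if head is None:
        return None
    parsed = {
        "station": head.group("station"),
        "day": int(head.group("day")),
        "time_utc": head.group("hour") + ":" + head.group("minute"),
        "wind_dir": head.group("wind_dir"),
        "wind_speed_kt": int(head.group("wind_speed")),
    }
    temp = TEMP_RE.search(body)
    if temp:
        parsed["temp_c"] = to_celsius(temp.group("temp"))
        parsed["dewpoint_c"] = to_celsius(temp.group("dew"))
    altimeter = ALTIMETER_RE.search(body)
    if altimeter:
        parsed["altimeter_inhg"] = int(altimeter.group("inches")) / 100
    return parsed


sample = ("KOAK 032353Z 00000KT 5SM BR SCT200 11/08 A3029 "
          "RMK AO2 SLP256 VIS SE 1 1/2")
print(parse_metar(sample))
```

    Real reports bend these patterns often enough that one of the libraries above is the better choice for anything serious.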

     

  • Full-height HTML divs

    I keep forgetting the right way to make a DIV take up the full height of the web browser. For, say, a full screen map. Chrome and Safari default to filling the whole container but Firefox prefers to calculate a minimum height div. Long story short, here’s what to do:

    html, body, div#map { height: 100% }

    I’m not entirely sure this is correct but it seems to work and is in line with advice on this StackOverflow discussion. It’s not enough to set the div’s height, you have to set the containers too. There’s also something about using min-height instead but it seems unnecessary. What I don’t know is how to set the height of a div to be 100% minus 100px or the like; something that leaves room for a header or footer.

  • Sublime Text 2 and Python

    I’m working on replacing my existing IDE (WingIDE) with Sublime Text 2. I prefer Sublime as an editor, particularly since WingIDE runs under X11 on MacOS (yuck). Also I don’t use most IDE features of Wing; just a bit of smart completion, unit test integration, pdb integration, .. OK, several things.

    Anyway, here’s a list of Sublime Text modifications I’m finding useful for Python. I’ll update this post over time. (As an aside it’s hard to search for plugins for Python developers, because the plugins themselves are written in Python and so you get lots of false matches.)

    • sublimetext_python_checker: one of several linter plugins. I like this one because I can set it up to run PyFlakes in an unobtrusive way.
    • PdbSublimeTextSupport 0.2: a tiny shim to have Sublime show you which line of your source code pdb is currently on. I’ve never been good with PDB; I immediately miss all the graphical support of a full IDE like Wing. So this may not be enough.

    I’ve also set up my own build command to use to run unit tests. It’s not complete, I need to figure out Sublime’s support for showing where errors occurred. But it’s a start.

    
    {
     "cmd": ["python", "-m", "unittest", "discover", "-p", "*.py"]
    }
    
    

    Things that look promising that I haven’t tried:

  • Python linters

    I was playing around with Python lint tools. There’s a lot of terrific info in this StackOverflow question, so all I really have to offer is some very simple knee jerk opinions.

    • pyflakes is very brief, but did a good job finding actual errors
    • pep8 is a nice way to check specifically for code style issues and nothing else.
    • pychecker is useless, since it executes the code
    • pylint is amazingly detailed and anal. Also useless because it prints way way too much. I really don’t care that my test functions don’t have docstrings or that my comments are more than 80 characters. Maybe if I were working on more team projects I’d build up a set of “ignore this message” rules and turn it into a useful tool to me.
    Conclusion: pyflakes. If I cared more, I’d use pylint. If I were contributing code to libraries, pep8.
  • Trying out Riak on MacOS with Homebrew

    Homebrew install notes

    I decided to try out Riak on MacOS, using Homebrew to install. Here’s some rough notes.

    “brew install riak” worked. However, it didn’t leave you the Makefile you need to do “make devrel” in order to set up the 3 node test cluster. So I interrupted Homebrew during the install, copied the patched sources, did “make all rel devrel” in my own source tree, then copied the “dev/dev?” directories off. Still using the Homebrew binaries. Lot of work for 4 lousy config files :-P

    The Python client you want is riak-python from Basho. Unfortunately “pip install riak” doesn’t quite work, because Pip first tries to install protobuf and there’s a three year old bug open on Google’s protobuf client about how it doesn’t install right in Python. Some helpful soul forked Google’s library and his version can be installed with pip install “git+http://github.com/rem/python-protobuf.git@v2.4.1#egg=protobuf”

    With that you have a working Riak and Python environment that you can use to work with the tutorials online.

    Insert notes

    Trying to insert my weather data: millions of rows of 4 tuples: (station name, timestamp, speed, direction). Testing with 3 Riak nodes running on my single fast iMac. This data is not a good match for document oriented stores: I already learned that Mongo was not good and expect Riak not to be, either. Still, it’s real data I have and I want to try it.

    Wow, is Riak slow if you just use it as a naive user. 3 node cluster running on a fast iMac, inserting 45,000 records one at a time with a store() between each took 120 seconds.

    Some write optimization.. there appears to be no batch insert in the Python library I’m using (although see this). Riak’s exposure of consistency via the W parameter is really neat, but what I need is to commit 10,000 records at once. I’ve tried optimizing my writes (ie: windData.store(w = 0, dw = 0, return_body = False)). Switching from HTTP to Protocol Buffers gives about a 4x increase. Still, no matter what I do, I can’t do better than 1900 records / second. Compare Mongo’s 30,000 records / second or Postgres’ 10,000 rows a second. Am I doing something terribly wrong?
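
    For reference, a sketch of the kind of timing harness that produces records / second figures like these. The store callable here is a hypothetical stand-in (a list append), not a Riak call, and the row values are made up.

```python
import time


def records_per_second(store, records):
    """Call store(record) for each record and return the achieved rate."""
    start = time.perf_counter()
    count = 0
    for record in records:
        store(record)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


# Hypothetical stand-in for a datastore write; replace with the real call.
sink = []
rows = [("KSFO", 1325376000 + i, 12, 270) for i in range(45000)]
rate = records_per_second(sink.append, rows)
print(f"{rate:.0f} records / second")

# The naive run described above: 45,000 records in 120 seconds.
print(45000 / 120)  # 375.0
```

    So the tuned 1900 records / second is about a 5x improvement over the naive 375, and still well short of the Mongo and Postgres figures.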

    Sadly, I get a 10% speed increase (from 1900 to 2100 records / second) if I only have one node in the cluster. Not useful in the real world, but interesting.

    With just a single node running Riak seems to use about 1 kilobyte / document (650,000 rows, 720 megs on disk). Not good, since the docs themselves are like 20 bytes of data. Compare about 50 bytes / row for both Postgres and Mongo. With 3 nodes the combined disk usage is about the same size; I’m surprised, I thought the default config was three copies of all data.
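
    The per-document arithmetic, worked through. I am assuming “megs” means MiB here; using decimal megabytes changes the answer by only a few percent.

```python
rows = 650_000
disk_bytes = 720 * 1024 * 1024  # assumption: 720 megs read as MiB

per_row = disk_bytes / rows
payload_bytes = 20       # rough size of one document, from the text
other_stores_bytes = 50  # rough per-row cost quoted for Postgres and Mongo

print(f"{per_row:.0f} bytes per document")
print(f"{per_row / payload_bytes:.0f}x the payload size")
print(f"{per_row / other_stores_bytes:.0f}x the per-row cost of the other stores")
```

    That comes out a little over 1.1 kB per document, which is where the “about 1 kilobyte” figure comes from.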

    I sure miss the ability to drop a whole bucket. I don’t really have a way to reset my database while testing.

    Query Notes

    Once again I petulantly note that raw map/reduce is awfully low level if you just want a damn SQL function. Of course it gets wonderful when you want to query using 100 machines and a distributed data store.

    Trying to list 50,000 keys in a bucket (via /keys?keys=true) kills both Chrome and curl: they complain the HTTP header returned is too big. Also the docs warn you to never do this in production. In other words, queries over the whole dataset are difficult. I think you need to use links or some extra indexing system to do queries. Riak has a search capability, btw, but it looks more full text oriented than simple data iteration.

    I wrote some simple mapreduce code to do “select count(*) from bucket” and “select count(*) from bucket where parameter > 30”. It takes 16 seconds to run over 60,000 rows, that’s just unacceptable. Also the answer it gives is wrong; 43561 instead of 59072. Huh. While it’s cool I could run this query while nodes were dropping out, it’s uncool that the answer it gave varied as data was moved around.

    I don’t think I have a bug in my query code, but you never know… Here it is.

    import riak
    
    # Connect using the Protocol Buffers transport
    client = riak.RiakClient(port=8081, transport_class=riak.RiakPbcTransport)
    
    # select count(*) from KSFO
    # The map phase emits a 1 per object; the reduce phase sums them.
    query = client.add('KSFO')
    query.map("function(v) { return [1]; }")
    query.reduce("Riak.reduceSum")
    for result in query.run():
        print(result)
    
    # select count(*) from KSFO where data[1] > 20
    # The map phase emits a 1 only for objects that pass the filter.
    query = client.add('KSFO')
    query.map('''function(v) {
        var data = JSON.parse(v.values[0].data);
        if (data[1] > 20) {
            return [1];
        } else {
            return [];
        }}''')
    query.reduce("Riak.reduceSum")
    for result in query.run():
        print(result)
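
    One way to sanity check the query logic without a cluster is to mimic the two phases in plain Python. This is only an analogue I am sketching: the row layout [timestamp, speed, direction] and the sample values are assumptions, and the batching stands in for the fact that a reduce function can be applied more than once to partial results.

```python
import json


def map_all(value):
    """Analogue of the first map function: every object counts as 1."""
    return [1]


def map_over_threshold(value, threshold=20):
    """Analogue of the second map function: count rows with data[1] > threshold."""
    data = json.loads(value)
    return [1] if data[1] > threshold else []


def reduce_sum(values):
    """Sum the inputs; returns a one-element list so it can be reduced again."""
    return [sum(values)]


def run(map_fn, stored_values, batch_size=3):
    """Map each value, reduce in batches, then reduce the partial sums."""
    mapped = [out for value in stored_values for out in map_fn(value)]
    partials = []
    for i in range(0, len(mapped), batch_size):
        partials.extend(reduce_sum(mapped[i:i + batch_size]))
    return reduce_sum(partials)


# Made-up rows; the layout [timestamp, speed, direction] is an assumption.
stored = [json.dumps([1325376000 + i, speed, 270])
          for i, speed in enumerate([5, 25, 30, 12, 21, 20, 40])]
print(run(map_all, stored))             # [7]
print(run(map_over_threshold, stored))  # [4]
```

    Because summing is safe to re-apply to partial sums, the batch size does not change the answer here. That suggests looking at which objects reach the map phase, rather than at the functions themselves, when the cluster’s count comes up short.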
    

    Evaluation

    I knew this data was a poor match for Riak and now I’m doubly sure. The 1k / row is the real problem. I don’t think Riak is bad or anything, quite the opposite. Just curious how it feels to try it out on some data.

    I really admire how Riak is distributed to the core. Most datastores design for a single server, then later add replication and partitioning. It’s a total mess. Riak started distributed, and if they did it right it should be significantly more robust for real network usage. Also for parallel data analysis. I also think it’s freaking amazing how you can do “riak-admin join” and “riak-admin leave” and the data transparently repartitions while the cluster is still up and serving write requests.

    I wonder how robust Riak is in recovering if a node drops offline without a clean shutdown, then joins later with inconsistent data? The docs on CAP mention some voting and validation protocols, so they’ve thought about it. More a question of convenience than usability.

  • Postgres 9.0 to 9.1 upgrade with Mac / Homebrew

    I just had a terrible time trying to upgrade PostgreSQL 9.0 to 9.1. I finally gave up and am still running the old one.

    pg_upgrade is the recommended upgrade path. It’s pretty ugly, and among other things requires you have the old server binaries around. But it looks like it’d do the job. I had a few minor problems, like I ran “--check” and the check failed but left something around that made Postgres think a server was running. (Fortunately pg_ctl start just ignored that and started a server anyway, then pg_ctl stop fixed it.)

    Then I hit a bigger problem: PostGIS. I’ve got that installed in my 9.0. I *think* it’s installed in 9.1, too, but I can’t be sure with Homebrew. In any event pg_upgrade doesn’t think it’s there so I’m SOL for now.

    An alternative is to do pg_dump on the old server, then pg_restore on the new. I suspect that’s what pg_upgrade does behind the scenes anyway. And I still have to resolve my PostGIS problem.
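
    A sketch of what the dump and restore route might look like, written as a Python helper that only builds the command lines and does not run them. The paths, ports, and database name are all placeholders I made up. It uses the newer version’s pg_dump against the old server, which is what the PostgreSQL docs recommend, and it assumes the target database already exists on the new server.

```python
import shlex


def dump_restore_commands(bin_dir, dbname, dump_file, old_port, new_port):
    """Build (dump, restore) argument lists; nothing is executed here."""
    # -Fc writes a custom-format archive, which pg_restore can read.
    dump = [f"{bin_dir}/pg_dump", "-p", str(old_port),
            "-Fc", "-f", dump_file, dbname]
    # -d names the database on the new server to restore into.
    restore = [f"{bin_dir}/pg_restore", "-p", str(new_port),
               "-d", dbname, dump_file]
    return dump, restore


# Every value below is a placeholder, not taken from a real setup.
dump, restore = dump_restore_commands(
    bin_dir="/path/to/postgresql-9.1/bin",
    dbname="mydb",
    dump_file="mydb.dump",
    old_port=5432,
    new_port=5433,
)
print(shlex.join(dump))
print(shlex.join(restore))
```

    This needs both servers running at once on different ports, and it does nothing for the PostGIS problem: the restore would still fail if the new server can’t find PostGIS.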