dormando

MySQL NUMA allocations under 2.6.32+

dormando — Thu, 30 Dec 2010 10:16:32 GMT

Referring to Jeremy Cole's post on swapstorming under NUMA hardware, I'll note something potentially new.

While I've seen this "brick wall swapstorming" a few times before and since that post, I just saw some new OS installs that don't do this by default, and using numactl to change the defaults there is actually harmful to system interactivity.

In the brick-wall cases, two NUMA zones of ~30G each, plus a mysqld (or memcached) running with 45G of ram, would equal 30G in memory, and 15G in swap. Ugly.

In this case, I'm getting a little bit in swap, but a relatively even node distribution.

Here's a box with no numactl tuning:

N0        :      7068733 ( 26.97 GB)
N1        :      7120258 ( 27.16 GB)
active    :     13355529 ( 50.95 GB)
anon      :     14187441 ( 54.12 GB)
dirty     :     14185099 ( 54.11 GB)
mapmax    :          265 (  0.00 GB)
mapped    :         1580 (  0.01 GB)
swapcache :         2350 (  0.01 GB)

similar hardware, same OS/kernel running under numactl --interleave=all:

N0        :      6778742 ( 25.86 GB)
N1        :      6313382 ( 24.08 GB)
active    :     12395957 ( 47.29 GB)
anon      :     13090566 ( 49.94 GB)
dirty     :     13090566 ( 49.94 GB)
mapmax    :          255 (  0.00 GB)
mapped    :         1588 (  0.01 GB)

... just a touch in swap on the first guy. Though I'm going to wait a few days to declare victory or defeat, since I did see the first guy dump nearly a whole gig of swap once, but wasn't able to confirm if the swapped memory was mysql yet.
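For anyone wanting to reproduce the tables above, here is a minimal sketch of the arithmetic. It assumes 4 KiB pages (the usual x86-64 default) and that the per-node counts come from summing the `N<node>=<pages>` fields in `/proc/<pid>/numa_maps`; the sample text and helper names are made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_to_gb(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB (the tables above label this GB)."""
    return pages * page_size / 2**30


def node_pages(numa_maps_text):
    """Sum the N<node>=<pages> fields found in numa_maps-style text."""
    totals = {}
    for node, pages in re.findall(r"\bN(\d+)=(\d+)", numa_maps_text):
        totals[int(node)] = totals.get(int(node), 0) + int(pages)
    return totals


# Hypothetical two-line excerpt, not taken from a real box.
sample = (
    "7f0000000000 default anon=300 dirty=300 N0=100 N1=200\n"
    "7f1000000000 default anon=50 dirty=50 N0=50\n"
)
print(node_pages(sample))
print(round(pages_to_gb(7068733), 2))  # the N0 page count from the first table
```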

The side note here is that my numactl-modified node is exhibiting some extreme latency on interactivity. Appears to be related to anything that needs to fork having a half-second delay. MySQL seems to be running fine though.

I haven't investigated at all as to how numa distribution has changed in recent kernels (though I know it's been steadily improving over the years). Unfortunately every other box I've used which *has* the problem, runs on a redhat/centos5 kernel. Which is ancient to an extreme.

In this case it's debian squeeze with its default 2.6.32 kernel. Anyone try a recent ubuntu or redhat6 yet and see if the NUMA/swap issues are better on there?

Redis VS Memcached (slightly better bench)

dormando — Tue, 21 Sep 2010 22:13:27 GMT

Hello! First read this if you haven't yet.

I will now continue the back-and-forth obnoxiousness that benchmarking seems to be!

In my tests, I've taken the exact testing method antirez has used here, the same test software, the same versions of daemon software, and tweaked it a bit. Below are graphs of the results, and below that is the discussion of what I did.

Wow! That's pretty different from the first two benchmarks.

First, here's a tarball of the work I did. A small bash script, a small perl script to interpret the results (takes some hand fiddling to get it into gnuplot format), and the raw logs from my runs pre-rollup.

What I did

The "toilet" bench and antirez's benches both share a common issue; they're busy-looping a single client process against a single daemon server. The antirez benchmark is written much better than the original one; it tries to be asynchronous and is much more efficient.

However, it's still one client. memcached is multi-threaded and has a very, very high performance ceiling. redis is single-threaded and is very performant on its own.

There is a trivial patch I did to the benchmarks to make them just run the GET|SET tests. It is included in the tarball.

What I did was take the same tests, but I ran several of them in parallel. This required a slight change in pulling the performance figures and running the test. The tests were changed to run indefinitely, either doing sets, or sets then indefinite gets (I wanted to run some sets before the get tests so they weren't just hammering air).

The benchmarks were then fired off in parallel via the bash script, with the daemon fresh started before each run. After a rampup time (to allow the sets to happen, as well as let the daemons settle a bit), a script was used to talk to the daemons and sample the request rate. Since the benchmark is running several times in parallel, it's now most accurate to directly ask the daemon for how many requests it's doing. I did some quick hand verification and the sampling code lines up with the output of a non-parallel benchmark. So far so good.
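The sampling idea can be sketched roughly like this. It is not the script from the tarball, just an illustration: it assumes you grab the text of memcached's `stats` command twice, some seconds apart, and diff the `cmd_get` and `cmd_set` counters. The function names and sample text are invented for the example.

```python
def parse_stats(text):
    """Parse memcached 'stats' output lines of the form 'STAT name value'."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def request_rate(before, after, seconds, counters=("cmd_get", "cmd_set")):
    """Requests per second between two stats samples taken `seconds` apart."""
    delta = sum(int(after[c]) - int(before[c]) for c in counters)
    return delta / seconds


# Two hypothetical samples taken 10 seconds apart.
first = parse_stats("STAT cmd_get 1000\r\nSTAT cmd_set 500\r\nEND\r\n")
second = parse_stats("STAT cmd_get 801000\r\nSTAT cmd_set 200500\r\nEND\r\n")
print(request_rate(first, second, 10))
```

Asking the daemon directly sidesteps having to add up the output of several benchmark processes, which is the point made above.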

I checked in with antirez to ensure I was running the tests correctly, and re-ran them as close to the original as I could get. Same number of clients *per daemon*, but there were 4 daemons in this case, so the actual number of clients is actually 4x what's listed in the graphs.

The tests ran on localhost using a dual cpu quadcore xeon machine, clocked at 2.27ghz (with turbo boost enabled, I'm pretty sure). The OS is centos5 but with a custom 2.6.27 kernel. I verified the single-process benchmark results on my ubuntu laptop running 2.6.35 and a 2.50ghz core2duo and got similarish-but-slightly-lower numbers. I also tried the tests on several slightly differing machines after getting some odd initial results. Memcached was using the default of 4 threads. Performance might suffer in this particular test with more threads, as you'd land with more lock contention.

So these numbers look correct, for what I was trying to do here.

Nothing else was changed. I used the same tools.

Why I did it

Both tests are busy loops. All three of these benchmarks are wrong, but this can be slightly closer to reality. In most setups, you have many independent processes contacting your cache servers. In some cases, tens of thousands of apache/perl/ruby/python processes across hundreds of machines, all poking and prodding your cluster in parallel.

I don't have the room here to explain the difference between running two processes and one process against the same daemon, so I'll hand-wave with "context switches n' stuff". There're plenty of good unix textbooks on this topic :)

So in this case, four very high speed benchmark programs soaked up CPU and hammered a single instance of redis and a single instance of memcached, which displays the strong point for the scalability of a single instance in each case.

Why the bench is still wrong

These are contrived benchmarks. They don't test memcached incr/decr or append/prepend (mc /does/ have a few more features than pure get/set).

Real world benchmarks will require a mix of sets, gets, incrs, decrs. Also, it requires testing each in isolation; some users might use their key/value store as a counter and hammer incr/decr hard. Others might hammer set hard, others might be near-purely gets.

All of these need to be tested. All features should be benchmarked and load tested in isolation, and also when mixed. All features need to be tested under abuse as well.

The test also doesn't try very hard to ensure the 'get' requests actually match anything. A better benchmark would preload some data across 100,000 keys and then randomly fetch them. I might try this next, but for the sake of argument I'm matching the same testing situation as the original blog post.

The interpretation for memcached

Memcached sticks to a constrained featureset and multithreads itself for a highly consistent rate of scale and performance. When pushed to the extreme, it needs to keep up. We also need to stay highly memory efficient. For a bulk of our users, the more keys they can stuff in, the more bang for the buck. Scalable performance is almost secondary to this. This is why we have features like -C, which disables the 8-byte CAS per object.

In a single-threaded benchmark against a multi-threaded memcached instance, memcached will lose out a bit due to the extra accounting overhead it must perform. However, when used in a realistic scale, it really shines.

There are some trivial ways we are able to greatly increase this ceiling. It's not hard to get memcached to run above 500,000 gets per second via some tweaks on some of its central locks. Sets have a lot of room for improvement due to this as well. We plan to accomplish this. Our timing has been bad for quite a while though :)

In almost all cases, the network hardware for a memcached server will give out before the daemon itself starts to limit your performance. This is a lot of why we haven't rushed to improve the lock scale.

Computers are absolutely trending toward more cores and not toward higher clocks. Threading is how we will scale single instances.

I really hate drawing conclusions from these sort of things. The entire point of this post is more or less me posturing about how shitty benchmarks tend to be. They are created in myopia and usually lauded with fear or pride.

You can't benchmark the fact that Redis has atomic list operations against memcached. They do different things and exist in different spaces, and the real differences are philosophical and perhaps barely technical. I'm merely illustrating the proper scalable performance of issuing bland SETs and GETs against a single instance of both pieces of software.

Understand what your app needs feature-wise, scale-wise, and performance-wise, then use the right tool for the damn job. Don't just read benchmarks and flip around ignorantly, please :)

Finally, here's one more graph... I noticed that redis seemed to do slightly better in the non-parallel benchmark, so I ran the numbers again with a single parallel benchmark in case anyone wants to look into it. Yes, the memcached numbers were lower for the single benchmark test, but I don't really care since it's higher when you actually give it multiple clients :)

Rolling back 432 million rows in MySQL 5.0

dormando — Thu, 14 Jan 2010 05:17:03 GMT

Believe it or not, I haven't put myself into this position before. The day before yesterday I started a LOAD DATA INFILE for 760 million rows. Didn't really think too hard about it, figured I'd let it run until it finished.

Unfortunately MySQL did that thing where the row insertion rate slowed to molasses over time, and the box was hosed. I killed the query. So it started rolling back the transaction. Which was 431 million rows in.

Using `SHOW ENGINE INNODB STATUS` you can look at the number of undo entries a transaction has left to go:

---TRANSACTION 0 1161525892, ACTIVE 18598 sec, process no 20139, OS thread id 1131772224
ROLLING BACK , undo log entries 431301691

Something like that. I left it to rollback overnight. The next day, it had only moved through a few million entries. It's going to take a week or more! I could just drop the table but it's locked from the transaction (Is this always true?)
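If you want to guess how long a rollback has left, one rough approach is to sample the undo entry count twice and extrapolate. This is only a sketch: it assumes the rollback rate stays constant (it may not), and the helper names are mine, not anything MySQL provides.

```python
import re

UNDO_RE = re.compile(r"undo log entries (\d+)")


def undo_entries(status_text):
    """Pull the undo entry count out of SHOW ENGINE INNODB STATUS text."""
    match = UNDO_RE.search(status_text)
    return int(match.group(1)) if match else None


def rollback_eta_seconds(first, second, interval_seconds):
    """Estimate seconds remaining from two samples taken interval_seconds apart."""
    rate = (first - second) / interval_seconds  # entries undone per second
    if rate <= 0:
        return None  # no progress between samples, no estimate possible
    return second / rate


line = "ROLLING BACK , undo log entries 431301691"
print(undo_entries(line))
# Hypothetical samples: 1000 entries, then 900 entries ten seconds later.
print(rollback_eta_seconds(1000, 900, 10))
```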

So, I made sure replication was stopped, flushed logs, ensured nothing was talking to the DB.
Then, tried to shut down mysql. It hung, waiting for the transaction. Waited 15 minutes.
Then, I kill -9'ed mysqld.
Then, I waited a few hours for InnoDB crash recovery to run (large 512M redo logs + no fast recovery patch).
Then, I see this:

InnoDB: Apply batch completed
[etc]
InnoDB: Starting in background the rollback of uncommitted transactions
100113 20:53:31  InnoDB: Rolling back trx with id 0 1161525892, 431119521 rows to undo

Still rolling back, but now in the background!

Finally, I dropped the table housing the offending transaction.

Bam, now the undo rows are being chewed through at the rate of 1,000,000 every couple seconds. It'll be undone in a few minutes and I can go finish the maintenance work on the DB without having to reslave it. (It's a few TB in size, reslaving is a little painful).

Is there a better way to do this? I had no idea if this process would work or not, and there is the undesirable step of kill -9'ing mysqld.

Intel Nehalem CRC32 hardware instruction and HAProxy

dormando — Sat, 19 Dec 2009 04:54:29 GMT

Was staring at a wall earlier trying to think of things to optimize in an experimental HAProxy setup. I have the proxy configured to do very little processing (even using splice(2) when useful).

I'm asking HAProxy to do just a couple things in L7:
- Add an X-Forwarded-For header
- Hash the URI onto a list of backend servers
- Shuffle data between sockets

... pretty much nothing else. blind copy/forward otherwise.

But, hey, I have a box here with an Intel Nehalem based xeon core. Nehalem has a new hardware instruction for calculating a CRC32 hash. I wonder if it's any faster? How does the numeric distribution compare?

Well getting it working wasn't too hard at all. I'll post a patch once I clean it up.
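Until the patch is posted, here is a software sketch of the same idea, not the patch itself. The Nehalem (SSE4.2) `crc32` instruction uses the CRC-32C (Castagnoli) polynomial rather than the zlib one, so the bitwise loop below computes that. The initial value, final XOR, and the modulo mapping onto a backend list are the common conventions and are assumptions here, as are the backend names.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def pick_backend(uri: str, backends):
    """Map a URI onto one backend by hash modulo list length (an assumption)."""
    return backends[crc32c(uri.encode()) % len(backends)]


backends = ["web1", "web2", "web3"]  # hypothetical server list
print(hex(crc32c(b"123456789")))
print(pick_backend("/images/cat.jpg", backends))
```

The hardware instruction does the inner eight-step loop in one shot per byte (or per word), which is where the speed would come from.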

I threw about 9,000 requests with different URIs at it to see how well it balanced.

... a handful of requests were lost in the transfer, so these numbers are a little suspect. I'll try again to see if I can get a more accurate reading. You can already see that the built-in algorithm for haproxy is notably more even than the intel CRC32. *But* the CRC32 isn't terribly far off. If you double or triple up your server list, it could end up being more even. I'll try this later.

Now, the speed test. I wrote a separate bench tool that takes some input and runs the hash algorithms N times. I ran each test 3 times in a row per algorithm to ensure the times were close. The bench is for 10 million loops on the hash algo.

The results are "uri length: N", then the middle time, for each algorithm.

uri length: 6

haproxy:
0m7.459s
intel:
0m6.528s

uri length: 13

haproxy:
0m14.139s
intel:
0m6.138s

uri length: 39

haproxy:
0m40.366s
intel:
0m8.483s

... wow, quite a bit of a speedup! BUT, the hash algos do a little extra processing, looking for '?' or '/' characters to short circuit the hash if necessary. I took these out to compare them more on the raw hash speed. Also, technically the haproxy hash was counting '/' characters where the intel one wasn't.

uri length: 39
haproxy:
0m33.486s
intel:
0m7.459s

... still retains most of the speedup. Neat!

Unfortunately this speedup is likely marginal at best. In the final case with most of the tests in there, haproxy is running at 246,998 hashes per second. Intel is running at 1,340,662 per second. 246,998 per second already isn't slow. I'll have to run the algorithm in a full scale test to see if the difference is even measurable, or if too much CPU time is soaked up everywhere else in haproxy.
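The per-second figures are just the loop count divided by wall time. A tiny sketch, using the 10 million loops and the timings from the last run above (the function name is made up):

```python
LOOPS = 10_000_000  # loops per bench run, as stated above


def hashes_per_second(elapsed_seconds, loops=LOOPS):
    """Hash rate for one bench run."""
    return loops / elapsed_seconds


intel = hashes_per_second(7.459)     # intel, uri length 39, checks removed
haproxy = hashes_per_second(33.486)  # haproxy, same run
print(int(intel))
print(round(intel / haproxy, 1))     # speedup factor for that run
```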

Also you can tell HAProxy to only consider the first N characters of a URI for hashing. Given the pretty linear decrease in speed given the length of the URI, capping the hash length at 10-15 gives you a speedup comparable to switching to the intel algorithm. If you want to hash arbitrary string lengths, the nehalem CRC32 algo is pretty darn impressive with a very minor increase in time under a wide increase in string length :)

So I'll clean up the patch, test it more at full scale, and post again later.

The "multiget hole" and how none of this is new

dormando — Tue, 27 Oct 2009 07:12:00 GMT

http://highscalability.com/blog/2009/10/26/facebooks-memcached-multiget-hole-more-machines-more-capacit.html

This has been making the rounds today... In my usual fashion I'm going to write an overly complicated post in response.

The basic claims of the memcached "multiget hole" are thus:

- If you are primarily using multigets to batch requests.
- Memcached is out of CPU.
- Adding a second memcached instance will split the batch request across both hosts.
- This will make things slower, since your multiget request gets split in half and hits both servers, instead of just one.

Let's break down this last claim into some detail, then discuss potential workarounds! Everybody put on your bath robe and thinking cap.

Claim: A multiget, split into two, will be slower than a single multiget against a single server.

What's really happening: A multiget, as referenced here, is when you combine a fetch for several keys into a single request. Let's say in this exercise you are trying to fetch keys 'foo1' through 'foo100', in one single request. The process for a typical memcached client and server instance is:

- Take the full list of keys requested.
- Hash each key individually against the list of memcached servers. If you have one server, they all go to the same place, if you have two, they are split.
- For each server that will get keys, issue a special multiget request against that server. For the ASCII protocol this looks like: `get foo1 foo2 foo3` to the first instance, and `get foo4 foo5 foo6` sent to the second instance. A single write will get multiple responses back. This is faster than doing them one at a time, since you would be waiting for a response between each get. "get foo1" (wait for response) "get foo2" (wait for response) etc.
- Wait for each server to respond, collect keys, return to caller.
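The client-side steps above can be sketched as follows. This is a toy, not any real client's code: the modulo-of-CRC32 key hashing and the server names are assumptions, and real clients often use consistent hashing instead.

```python
import zlib
from collections import defaultdict


def split_multiget(keys, servers):
    """Group keys by server and build one ASCII 'get' command per server."""
    by_server = defaultdict(list)
    for key in keys:
        # Assumption: simple modulo hashing over the server list.
        server = servers[zlib.crc32(key.encode()) % len(servers)]
        by_server[server].append(key)
    return {
        server: ("get " + " ".join(grouped) + "\r\n").encode()
        for server, grouped in by_server.items()
    }


keys = [f"foo{i}" for i in range(1, 101)]
one = split_multiget(keys, ["cache1:11211"])
two = split_multiget(keys, ["cache1:11211", "cache2:11211"])
for server, command in two.items():
    print(server, len(command.split()) - 1, "keys")
```

With one server everything lands in a single command; with two, the same 100 keys are spread over up to two writes.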

Let's break down the steps even more!

- For each multiget request issued, a *client* may either use a *blocking* or *non blocking* mode.
- In an optimized case, the client will issue a multiget against *both* servers *in parallel* and then call poll(2) (or similar) and wait for the responses.
- In a non-optimised case, the client will issue multigets to each server in turn and wait for the response. libmemcached did this until recently, so you might be surprised, if you look!
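A back-of-the-envelope model of the difference, assuming each server's response time is independent and ignoring client CPU: a serial client waits for the sum of the response times, a parallel one waits for roughly the slowest. The numbers are invented.

```python
def serial_wait_ms(latencies_ms):
    """Blocking client: ask each server in turn, waiting for every answer."""
    return sum(latencies_ms)


def parallel_wait_ms(latencies_ms):
    """Non-blocking client: write to all servers, then poll for the answers."""
    return max(latencies_ms)


latencies = [2, 3, 2, 4]  # hypothetical per-server response times in ms
print(serial_wait_ms(latencies), parallel_wait_ms(latencies))
```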

On the server end:

- Read all keys requested.
- For each key, hash the key and look it up against the internal hash table.
- Load any valid items for return and...
- ... write them to the socket.

The binary protocol more closely combines all these steps, but the idea is the same.

What the hell are you getting at?

Well, my point is slicing a multiget actually *shifts a tradeoff* as much as it becomes more or less efficient. There is a certain amount of overhead for a *server* to read from a client and respond, but there is also a particular amount of effort for that server to look up the key in its hash table and build a response. It is a fact that issuing a smaller multiget against a particular server will take *less* CPU time than a larger one. Adding servers does reduce CPU time on the server.

However, when the client has to issue separate writes to more servers, it is doing more complex work and will thus take longer and use more CPU time than if all of the requests were in a single write.

Hence, adding servers to a cluster *will* reduce the CPU usage on the cluster. The addition is non linear, but it will not make it *worse*. It could however negatively affect clients, and a bad client can be especially affected, if it has to wait for responses in serial.

Part the next: The subtle issue

Depending on how large your multigets are, it may take less time to split them and issue them against multiple servers. How you want to handle this is *entirely up to you*; it requires testing, and can be affected by kernel tunables.

If you are issuing a multiget with many keys, or with very large responses, you will be more likely to run up against the (I hope I'm quoting this right) TCP window scale. After so many bytes TCP needs to roundtrip an ACK packet to confirm that the remote end has received the preceding data. This window will open at a certain size, and then expand or contract depending on how you're using the connection.

This is why some downloads or uploads will start out a little slow, then rapidly speed up. It's also why connections over a laggy or long link might not go above, say, 40k per second, but you can open multiple connections and run them all at 40k/sec to the same server (see also: download accelerators).

This last example should illustrate what I mean here. Stuffing too much down the pipe at once will cause more roundtrips to the remote server for the TCP acks. If you split a large list and run the data nonblocking, in parallel, to multiple servers, it might take less time to issue the request, but will use more CPU on the client.
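To put rough numbers on it: a single connection can't move more than about one window of data per round trip. The sketch below uses made-up values (an 8 KiB window over a 200 ms link) that happen to land near the 40k/sec figure; real windows grow and shrink as described.

```python
def max_throughput_bytes_per_sec(window_bytes, rtt_ms):
    """Upper bound for one connection: one window of data per round trip."""
    return window_bytes * 1000 / rtt_ms


def aggregate_throughput(connections, window_bytes, rtt_ms):
    """Several parallel connections each get their own window."""
    return connections * max_throughput_bytes_per_sec(window_bytes, rtt_ms)


# Hypothetical link: 8 KiB window, 200 ms round trip.
print(max_throughput_bytes_per_sec(8192, 200))
print(aggregate_throughput(4, 8192, 200))
```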

Part the next to last: The workarounds

With the above in mind, the typical workaround has been in use for ages. A long time ago, in a galaxy far far away, brad fitz (or someone over there, I'm not sure who) realized that fetching all cache keys for a livejournal profile is a trivial multiget of 10+ keys. It was also stupid to issue this (relatively small) multiget across all of the memcached hosts.

So he added a set and fetch by master key mode to the perl client (Cache::Memcached at the time). When you issue a set or a get request in "by key" mode, you give any given key a second key. That is, the key your client uses to hash your data out to the list of memcached servers, is different from the key you hand to memcached for storage. So:

- You assign a master key "dormando", to keys "dormando-birthday", "dormando-website", etc. This is bad key naming, but bear with me.
- Your client, instead of using "dormando-birthday" to decide where to store the keys, uses the key "dormando"
- Your client then sends *just* "dormando-birthday" and "dormando-website" to whatever server "dormando" hashed to.
- Memcached happily stores those keys, without any idea of what the master key was (you can't get it back).

Then you issue a multiget back, with the master key of "dormando". Both keys resolve to the *same* server, and the multiget hits a single host. With a single write, and ideally a single roundtrip.

If I had a metric pantload of keys to fetch and don't want to issue them all to the same server (noting the above subtle issue), I can semi-intelligently split the master keys into "dormando-chunk1", "dormando-chunk2" - depends on what your app can handle.

This is a simple and elegant way of avoiding having multigets spread thinly across your memcached cluster.
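A toy version of the by-key idea, to make the routing concrete. It is not Cache::Memcached's implementation: the class, the dict-backed "servers", and the modulo hashing are all stand-ins I made up for the sketch.

```python
import zlib


def server_for(hash_key, servers):
    """Pick a server from the key used for hashing (assumed modulo scheme)."""
    return servers[zlib.crc32(hash_key.encode()) % len(servers)]


class ByKeyClient:
    def __init__(self, servers):
        self.servers = servers
        # Stand-in for real memcached instances: one dict per server.
        self.storage = {server: {} for server in servers}

    def set_by_key(self, master_key, key, value):
        # Route on the master key, but store under the real key only.
        self.storage[server_for(master_key, self.servers)][key] = value

    def mget_by_key(self, master_key, keys):
        # One server, one request, however many keys are asked for.
        store = self.storage[server_for(master_key, self.servers)]
        return {key: store[key] for key in keys if key in store}


client = ByKeyClient(["cache1", "cache2", "cache3"])
client.set_by_key("dormando", "dormando-birthday", "someday")
client.set_by_key("dormando", "dormando-website", "example.org")
print(client.mget_by_key("dormando", ["dormando-birthday", "dormando-website"]))
```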

You could use UDP!

Yeah I guess you could. What about keys with larger responses? This does have a lot of the same issues, but in a different flavor. Could be faster or slower depending on what you're doing.

You could REPLICATE!

annnnnnnnnd do something really complicated where you have to store all of your keys in two places (halving the effective size of your cache!) and having your client randomly pick where each key goes to or comes from each time they're fetched? When issuing against a cluster of more than *two* machines, this isn't going to help nearly as much as cutting 50 separate fetches down into a single request, deterministically, by clustering the keys intelligently.

Note that replication adds a lot more failure scenarios. Network blips can lead to inconsistent cache data, among other things.

Sounds simpler to use a feature that already exists (look for "mget_by_key")?

But you could make either work.

Fortunately, there's also a really short answer to all of this.

Memcached 1.4.0

dormando — Fri, 10 Jul 2009 06:08:02 GMT

Everyone's favorite MySQL load relief system, memcached, has just hit the next major stable release: 1.4.0

This release sports a new binary protocol, major performance improvements, and many new statistics. Major kudos to the work of other people (Trond, Dustin, Toru) who put most of the effort into this new release.

Check out the release notes and give it a shot on your site. Please let us know if you've deployed it and any feedback you might have :)

Memcached 1.2.7 and 1.3.3

dormando — Sat, 04 Apr 2009 00:48:56 GMT

original post

... and this is my usual plea to those mysql/web/industrial folks to try out the latest code. Help us on our quest to scale the crap out of all of your stuff. :)

Find us on the mailing list, on #memcached on freenode, or on twitter as dormando, dlsspy, tmaesaka, and trondn. :) All others are fakers.

---

Two new memcached releases are available today.

Stable 1.2.7

The new stable release which is a maintenance release of the 1.2
series containing several bugfixes and a few features.

This version is recommended for any production memcached instances.

Release Notes:
http://code.google.com/p/memcached/wiki/ReleaseNotes127

Download:
http://memcached.googlecode.com/files/memcached-1.2.7.tar.gz

Beta 1.3.3

The new 1.3 beta brings lots of new features, performance, protocol
support and more to memcached.

Everyone is encouraged to get this into their labs and abuse it as
much as possible. This will be the stable tree. We've been testing
it quite thoroughly in the memcached community already and find it to
be quite stable, but we're always looking for more complaints.

Release notes:
http://code.google.com/p/memcached/wiki/ReleaseNotes133

Download:
http://memcached.googlecode.com/files/memcached-1.3.3.tar.gz

Memcached 1.3.2 beta

dormando — Fri, 20 Mar 2009 23:54:36 GMT

Yo,

I didn't write most of this code, but most of the new changes are pretty awesome; go read about it, grab it, and try it.

We're very careful about getting as much testing as possible before declaring a new release as stable. Please try it out in your development environments, beat up on it, maybe try it out in staging. Perhaps even be naughty and swap out one production machine with it some late night.

A lot of the changes in this release are good for those high end mysql/etc backed sites where you might be worried about (or are) hitting performance issues with memcached itself. Others like expanded statistics, optional memory optimisations, should be useful for most folks. Please check the release announcements for dustin's more thorough notes :)

A 1.2.7 stable release should follow on its heels shortly, but don't expect anything but bugfixes and a few minor feature enhancements over there.

Goodbye again, LJ

dormando — Wed, 19 Nov 2008 02:55:10 GMT

I said it once in 2002, and now again in 2008. See ya, LJ :)

Mucho thanks to dwell and tupshin for their mentions for the work burr86 and I have put in to help them take over LJ. Especially burr86 - I did some hard stuff but he did most of the work from 6A's side. Kudos to LJ's ops/eng teams for diving into one of the more complicated web architectures and actually getting it running.

We're there for ya (on whatever personal commitment we're comfortable with :P), but LJ's in good hands. Enjoy.

I did some pretty cool hacks for MogileFS to help facilitate their move, which I should be posting to the mailing list as soon as I clean up the commits. What's good for us is good for everyone :) Please remember LJ's open source history as you march forward.

Should you cache?

dormando — Sun, 17 Aug 2008 09:21:42 GMT

Should you use memcached? Should you just shard mysql more?

Memcached's popularity is expanding its use into some odd places. It's becoming an authoritative datastore for some large sites, and almost more importantly it's sneaking into the lowly web startup. This is causing some discussion.

Most of which seems to be missing the point. In this post I attempt to explain my point of view for how memcached should really influence your bouncing baby startups, and even give some pointers to the big guys who might have trouble seeing the forest for the trees.

Using memcached does not scale your website! Entertain me, I'm playing semantics here: This thing is not for scaleout. Mostly. What memcached really is, is a giant floating magnifying glass. It takes what you have already built and makes it stretch ten times further. I insist on not confusing caching with scaleout as when your little stretch-armstrong of a website hits that tenfold limit, you're still screwed. There's no magic switch or configuration option in memcached that will save you from dealing with proper optimization and sharding.

You sure can get away with a hell of a lot though!

Keep it in the front of your mind; no it will not help you batch your writes, or make them smaller, or really help you deal with them in any useful way. If you want to write data you will need back later, you must shard. If it's data you don't care about, maybe write it to memcached and make a note of it in your business plan.

Also strongly keep in mind; memcached won't help your cache misses suck less. If you're writing awful data warehouse quality queries which you expect to run live on the site, go bust out the failboat and get-a-rowin'. You're screwed. As your dataset grows you will find new slices of hell in which your queries behave in all new ways. What once scanned "a few extra rows" now might hit tens of thousands. Cache misses will suck. You will have to deal with this. That's not something this solves.

Sometimes memcached does let you achieve the impossible, or scale the unlikely. Take slightly complex queries, or even template operations, which under the best of conditions might take 15-20 milliseconds each. An obnoxious join, a weird subquery, a tree walk, or fancy HTML templating. Being able to do this live could mean the difference between your website standing apart or having to settle with an awful workaround. In these cases, with a high enough hit rate, you can soak those cache misses and make the feature work.

My example isn't translating a 5 second query into 0.5ms with memcached, it's a 15-20ms query. If you had a dozen of these in a page load, a bad load might take an extra quarter second to render, but it wouldn't ruin the user experience. The issue memcached solves here is subtle. Tacking on 0.25 seconds per page render might not make the site completely unusable, but realize these queries are using solid resources on your expensive hardware for that extra quarter second. With a quadcore database, it's possible under the best conditions you would only be able to render 14-16 pages per second off of that machine. Throw in all the other things you have to do on a page load, writes, internal database whoosits and uneven CPU usage and you'd be lucky to get 5 pages per second.

In this case, it's still walking the line of scalability, but it turns something mildly impossible into something highly probable. On the cheap.
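The arithmetic behind that example, as a rough sketch. It assumes each query pins one database core for its whole duration and nothing else competes for the box, which is the "best conditions" case; the function names are made up.

```python
def db_seconds_per_page(queries, ms_per_query):
    """Database time one page render costs if every query misses the cache."""
    return queries * ms_per_query / 1000


def pages_per_second(cores, seconds_per_page):
    """Best case: every core does nothing but serve these queries."""
    return cores / seconds_per_page


print(db_seconds_per_page(12, 20))  # a dozen 20ms queries
print(pages_per_second(4, 0.25))    # quadcore, a quarter second per page
```

Every cache hit removes its query from the first number, which is how the same hardware ends up rendering many more pages.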

The cost equation

Now the most important factor here has reared its ugly head: Cost.

Cost. Ugly for startups. Ugly for established companies. Nightmares for venture capital. What is your cost? Why am I talking cash about companies who have millions of dollars in VC or sales? Just buy more servers! Whatever, right?

Well no. The largest cost is time. All others pale in comparison. The best physical goods investments your company can make are more related to your people than your hardware. Hardware has horrific depreciation. Most of the value is lost immediately, the rest over the first year of operation.

In comparison, buying your employees really fucking nice chairs, desks, and monitors in a swanky comfortable office are much more solid investments for your company. Aeron chairs have great resale value for that inevitable going-bumpkus dot bomb sale. Also anything you do to make your workers happier and more productive will pay out more than any hardware investment. Your product ships on time, you react to the market faster.

To sidestep into hardware a little... Always max out the RAM in your databases. Everyone should. I didn't realize people don't actually do this until I read some of these arguments against memcached. Whenever I add memcached to a website, the RAM memcached gets is RAM that didn't fit into the databases, but easily fits into empty memory slots in webservers or cheaper hardware. A good solid database might cost $5,000, but a beefy memcached box will cost less than half that. Way less than that if you just add memory to existing hardware. So "adding that extra RAM to your databases" isn't a very fair apples-to-apples comparison unless you're already doing something wrong.

So it should be obvious just what the hell I'm getting at now, and what seems to be bothering everyone else about this whole stupid memcached fad.

You're all wasting your goddamn time! Yeesh!

How can a small site or startup benefit from memcached?

Simple: The idea.

Caching really wedges your whole RDBMS worldview. You don't just CRUD anymore. Your data is a process. A flow between points instead of just the store and display. At any time in this flow an idea may be injected. Maybe it's serializing a generated object and caching it, maybe it's utilizing gearman to shift off some asynchronous work. There is just more to it now.

But that's all messy and complicated. What can you do? What should you do?

Design for having cache, design for change.
... but don't write all the code yet.
... but certainly design for change.

Think good object design. A "user" is a class. That user has base properties which you might find in the `user` table. A "user" object might have a profile, which is really another object with another class representing a `profile` table.

my $user; is an invaluable abstraction.

That user object must load and store data. When you build this at first it's all standard CRUD. Straight to a database.

Where would you think to add caching to this system? I hope I've made it too obvious.

At the query layer! Use a database abstraction class and have it memcache resultset objects and... No no no, that's a lie. I'm lying. Don't do that.

Do it inside that $user object. At the highest level possible. Take the whole object state and shovel it somewhere. That object is its own biggest authority. It knows when it's been updated, when it needs to load data, and when to write to the database. It might've had to read from several tables or load dependent objects based on what you ask it to do.

Instead of wrangling your best and brightest into figuring out a cache invalidation algorithm which might work "okay" against your schemas, do what's simple for the object. If adding caching to the $user object means the load() function tries memcached first, and all write operations hit memcached with a delete operation, so be it. You just added basic caching to one of the hottest objects in your website in, oh, half an hour. Maybe a few days if you're really scraping the bottom of the talent barrel.
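Here's a minimal sketch of that in Python, under some loud assumptions: a plain dict stands in for the database, FakeCache stands in for a memcached client, and every class, method, and key name is made up for illustration. It is not code from any real site.

```python
import copy


class FakeCache:
    """Stand-in for a memcached client (assumption: only get/set/delete matter)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        # Copy on the way out, the way a real client hands back a fresh value.
        return copy.deepcopy(self.data.get(key))

    def set(self, key, value):
        self.data[key] = copy.deepcopy(value)

    def delete(self, key):
        self.data.pop(key, None)


class User:
    """The object owns its caching: load() tries the cache, save() invalidates."""

    def __init__(self, db, cache, user_id):
        self.db = db            # hypothetical: dict of user_id -> row dict
        self.cache = cache
        self.user_id = user_id
        self.state = None

    def _key(self):
        return "user:%d" % self.user_id

    def load(self):
        state = self.cache.get(self._key())
        if state is None:
            # Miss: build the whole object state from the database,
            # then shovel it into the cache for the next reader.
            state = dict(self.db[self.user_id])
            self.cache.set(self._key(), state)
        self.state = state
        return self

    def save(self):
        # Write to the database, then delete the cached copy so the
        # next load() rebuilds it from fresh data.
        self.db[self.user_id] = dict(self.state)
        self.cache.delete(self._key())
```

Callers only ever see User(db, cache, 1).load() and user.save(). Whether a cache is behind it is the object's business, which is the whole point.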

Now we're back where we started. Reap the time benefits! Abstract your data access methods properly, plan for caching. Actually go write caching into a few objects. Maybe turn it off when you're done. You don't need it yet. Write your objects to talk directly to your database and save time.

Same idea for sharding. Either focus on that now, or realize you can take a $user object and extend its load() magic to find and write to users based on a sharding scheme. You probably don't have to rewrite all of the code to make this happen. Refactor to win.
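A rough sketch of what "extend its load() magic" could look like, assuming the dumbest possible scheme (user id modulo the number of shards) and dicts standing in for databases. ShardedUserStore and its methods are hypothetical names.

```python
class ShardedUserStore:
    """Hypothetical: same load/save interface, but picks a shard per user."""

    def __init__(self, shards):
        self.shards = shards  # list of dicts, each standing in for one database

    def _shard_for(self, user_id):
        # Assumption: plain modulo. A lookup table is friendlier when you
        # need to move users between shards later.
        return self.shards[user_id % len(self.shards)]

    def load(self, user_id):
        return self._shard_for(user_id).get(user_id)

    def save(self, user_id, row):
        self._shard_for(user_id)[user_id] = row
```

Nothing above the store has to change, which is why abstracting data access early pays off.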

So now you're ready. You're building your site fast and abstracting where you can. Brace for change. Be ready to shard, be ready to cache. React and adapt to whichever of the things you push out actually turns out to be popular, instead of overplanning and wasting valuable time. Keeping it simple is gold here.

You're building something new and you're going to fail at it. Your design will be wrong, you will anticipate the wrong feature to be popular. Dealing with this quickly can set you apart. Being able to slap memcached into a bunch of objects in a few days (or even hours) can mean the difference between riding a load spike or riding the walrus.

Bullet points for fun! How can your small site benefit from memcached:

- Design for change! Holy crap I can't say this enough.
- Don't cache in ways that piss off your users.
- Not keeping it simple is fail.
- Cache and shard at the highest level possible relative to your data.
- Read High Performance MySQL 2nd ed. Memcached won't fix your lack of database knowledge.
- The same ideas which help you prepare for caching also help you prepare for sharding.
- Don't waste all your time getting it right now. Get it close, get an idea, try it out, and prepare to be wrong.

Finally:

- Keep an open mind. Sites like grazr and fotolog do things differently. Doesn't mean they're right, doesn't mean they're wrong. Be inventive where it makes sense for your business.

There. Sorry this came out so long :)

Cache your sessions. Don't piss off your users

dormando — Sun, 10 Aug 2008 02:14:58 GMT

I hope you're all enjoying the 1.2.6 stable release of memcached. Don't want to hear no whining about it crashing!

One of the most common questions in memcached land is the ever obnoxious "how do I put my sessions in memcached?". The long standing answer is usually "you don't", or "carefully", but people often walk the dark path instead. Many libraries do this as well, although I've seen at least one which gets it.

This isn't as huge of a deal as people make it out to be. I've been asked about this over the mailing list, in IRC, in person, and even in job interviews. What people end up doing gives me the willies! Why! Why why why... Well, I know why.

So what is the deal with sessions? Why does everyone want to jettison them from mysql/postgres/disk/whatever? Well, a session is:

- Almost always larger than 250 bytes, and almost always smaller than 5 kilobytes.
- Read from datastore for every logged in (and often logged out) user for every dynamic page load.
- Written to the datastore for every dynamic page load.
- Eventually reaped from the database after N minutes of inactivity.

Ok well that sucks I guess. Every time a user loads a page we read a blob row from mysql, then write a blob row back. This is a lot slower than rows without blobs. Alright, so I see it now. Memcached to the rescue!

Er, except maybe it's a little complicated to actually memcache these things, since we need a write for every read... Why not just use memcached for sessions!? It lines up perfectly! Check it out:

- Set a memcached expire time for the max inactivity for a session. Say 30 minutes...
- Read from memcached.
- Write to memcached.
- A miss from memcached means the user is logged out.

Voila! ZERO reads or writes to the database, fantastic! Fast. Except I really don't like the tradeoffs here. This is one example where I believe the experience of both your users and your operations team is cheapened. Users now get logged out when anything goes wrong with memcached! Operations has to dance on eggshells. Or needles. Painful.

- Evictions are serious business. Even if you disable them (-M), out of memory errors mean no one can log into your site.
- Upgrading memcached, OS kernel, hardware, etc, now means kicking everyone off your site.
- Adding/removing memcached servers kicks people off your site. Even with consistent hashing, while the miss rate is low it's not going to be zero.

So now what? Well we have zero accesses on our database, so it's fast! But we can't ever touch memcached again in fear of ticking off users. Progress be damned! Before you all think I'm completely off my rocker, I will admit there are some legitimate reasons to do this. If the way your site works doesn't really impact users on loss of a session, or impacts few enough users, you can use this design pattern. How many people are actually affected if you get logged out of wikipedia.org? Well, the people writing revisions certainly mind, but the greater userbase is unaffected. They're a non profit, they understand the tradeoff, etc. So that's fine. It's not fine for a lot of the people I see suggesting it or doing it. As developers get more comfy with memcached the session issue will become more of an obvious bottleneck.

The memcached/mysql hybrid really isn't that bad at all. You can get rid of over 90% of the database reads, a lot of the writes, and leave your users logged in during rolling upgrades of memcached.

First, recap the components involved: The page session handler itself, and some batch job which reaps dead sessions. For small websites (like a vbulletin forum) these batch jobs are often run during page loads. For larger sites they will be crons and so forth. This batch job can also be used to save data about sessions for later analysis.

The pattern is simple. For reads, fetch from memcached first, database second. For writes, always write to memcached, and also write to the database if you haven't synced the session there in the last N seconds. So with N set to 120, a user who is clicking around will only write to the database once every 120 seconds, but will write to memcached every time.
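A sketch of the session handler half of that pattern in Python. Assumptions: plain dicts stand in for memcached and the database, SYNC_INTERVAL is the "N" from the text, and SessionStore with its method names is invented for illustration rather than taken from any session library.

```python
import time

SYNC_INTERVAL = 120  # the "N seconds" between database syncs (assumed value)


class SessionStore:
    """Hybrid session handler sketch; dicts stand in for both stores."""

    def __init__(self, cache, db, now=time.time):
        self.cache = cache  # session_id -> {"data": ..., "synced_at": ...}
        self.db = db        # session_id -> data
        self.now = now      # injectable clock, handy for testing

    def read(self, session_id):
        entry = self.cache.get(session_id)
        if entry is not None:
            return entry["data"]          # the common case: no database hit
        data = self.db.get(session_id)    # cache miss: fall back to the database
        if data is not None:
            self.cache[session_id] = {"data": data, "synced_at": self.now()}
        return data

    def write(self, session_id, data):
        entry = self.cache.get(session_id)
        synced_at = entry["synced_at"] if entry else None
        now = self.now()
        if synced_at is None or now - synced_at >= SYNC_INTERVAL:
            self.db[session_id] = data    # only every SYNC_INTERVAL seconds
            synced_at = now
        # The cache gets every write, and remembers when the database last did.
        self.cache[session_id] = {"data": data, "synced_at": synced_at}
```

If the cache entry vanishes (restart, eviction, upgrade), read() quietly falls back to the database copy, so the user stays logged in and loses at most SYNC_INTERVAL seconds of session changes.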

Now modify the batch job. Crawl all sessions the database thinks are expired, and check memcached for the latest data. If a session is not really expired, don't expire it; if it is, use the latest possible data from memcached. Write that back to the database. Easy.
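The batch job might look something like this. It's a sketch under assumptions: both stores are dicts mapping a session id to {"data", "last_seen"}, timestamps are plain numbers, and reap_sessions is a hypothetical name. Returning the reaped sessions is a stand-in for "save data for later analysis".

```python
def reap_sessions(db, cache, now, max_idle):
    """Expire sessions idle longer than max_idle, trusting the cache's newer view."""
    reaped = {}
    for sid in list(db):
        if now - db[sid]["last_seen"] <= max_idle:
            continue  # the database copy alone says it's still alive
        latest = cache.get(sid)
        if latest is not None and now - latest["last_seen"] <= max_idle:
            # Not really expired: the user has been active, the database
            # just hasn't heard yet. Sync it and move on.
            db[sid] = dict(latest)
            continue
        # Really expired: keep the freshest copy we can find, then drop it.
        reaped[sid] = dict(latest) if latest is not None else db[sid]
        del db[sid]
        cache.pop(sid, None)
    return reaped
```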

You take the tradeoff of sessions being mildly lossy for recent information, but you gain reliability back in your system. Reads against the database should be almost nonexistent, and write load should drop significantly, but not as much as reads.

So please, if you run some website I might eventually use, don't put memcached in a place where restarting individual servers might piss me off. Thanks :)

I'd like to also challenge maintainers of session libraries for all languages to turn this design pattern into tunable (note all the places where I wrote N) libraries folks can plug in and use.

The more standard this stuff is the more likely the next fancy startup is going to get it right. Reuse is a great thing. I can't say enough about how far efforts like krow's libmemcached go toward standardizing how we use memcached, but it's also a great help to ship libraries for common design patterns.

Dormando's Proxy for MySQL R6

dormando — Sun, 11 May 2008 23:03:14 GMT

Previously.
As usual, hit up the homepage for the latest and greatest downloads. Or simply 'git pull' and use the tag release-6 if you're cool enough.

I'd like to use this post to explain in a more general fashion about what DPM is and why it's different from the rest of the proxies.

First, milestones since R5:
- BSD licensed. You are now free to roam about the cabin.
- Several C level bugs fixed.
- Many improvements to the lua library dpml.lua
- All of the demos were rewritten using dpml.lua, and are now far easier to use.

Now, what is DPM?

- It's a proxy for MySQL. It is event driven, embeds lua, and is written in C. It allows you to write plugins in lua, and easily extend the core with extra C.
- It supports a subset of the MySQL protocol's features, and offers a very flexible view into the protocol.
- Fast. The C part takes very little processing time. The more you do in lua the slower it will get. Identify the hot spots and rewrite them in C: The lua's there for a good reason.
- Supports MySQL's authentication for 4.1+ client/servers. Authenticate to it like a real database.
- Supports arbitrary timer objects. Kick off events with millisecond granularity.
- Very flexible. DPM's API intends to be a swiss army chainsaw for MySQL. You can do things in lua which will require extensive amounts of C in other proxies.

What DPM is not:

- A fork of MySQL Proxy. DPM was started during a short period of time when MySQL Proxy was closed source software. I started it from scratch, and have done most of the development myself. It has since taken a similar but different direction.
- Super easy to use. It is missing lots of documentation and is harder to use than MySQL Proxy. As of R6 this is starting to improve... The demos are much easier to work with.
- Controlled by MySQL AB (now Sun Microsystems). I'll leave it to the reader on whether this is good, or bad. DPM is a community project. Fork it with github and send me patches.
- Brand new. I've been hacking on it on and off since May of 2007. Most of the features that exist have been supported for over six months.
- Enterprise ready. Plop it in a non-critical place until we get more of the bugs out of it. Monitoring, benchmarking, etc are great starting places.

Okay, so it uses libevent, lua, written in C... Why is it different?

- Flexibility is the big deal. Being able to do as much evil as possible from within lua was the goal off the bat.
- Instead of writing library functions in C, the lua API is low level, and a lua library (dpml.lua) is distributed with it, containing handy functions and wrappers. Quick prototyping of library functions, and hot spots can be rewritten in C.
- Facilitates use of lua libraries with a flexible callback structure.
- Almost every high level concept is a lua object or command. Listeners, logical protocol packets, etc. This takes a little explanation.
- DPM acts like mod_perl, while MySQL Proxy is more like mod_php. Once the startfile is loaded and running, DPM will not automatically reload anything.

Who are you, anyway?

- I'm Dormando! Well not really, but it's easier to remember than my real name.
- I work for Six Apart as a MySQL DBA.
- I have done systems, DBA, and scalability work for a bunch of huge sites (livejournal.com, gaiaonline.com, and a few others).
- I ride brad's coattails of success, and am presently the release maintainer for memcached. I also answer mail for, hack on, and help run instances of perlbal, mogilefs, etc. While I'm not a genius by any stretch, it means I have scalability and flexibility absolutely in mind while writing this tool, and plenty of experience to draw on.

Using DPM

Since nothing is reloaded automatically, DPM has a concept of a "startup" file:
./dpm --startfile yourproject.lua
... and this file initializes all of the functionality your running DPM instance will have. This is the only argument you actually need to start up. No specifying listening sockets, admin panels, ports, etc.

So, how does one specify what port and IPs to listen on? Let's look:

listen = dpm.listener("127.0.0.1", 5500)
listen:register(dpm.MYC_CONNECT, new_client)

This creates, _at runtime_, a listening socket on localhost, port 5500. When a mysql client connects to it, the function 'new_client' will be called. At that point you can either pass the client through to a mysql server (see demo-direct.lua), or handle the client yourself and do connection pooling, load balancing, or implement your own service (see startup.lua).

Weird, right? Complicated too. You have to set up callbacks and wingwangs just to get clients authenticated, or passed through to a backend server.

This is where the DPML library comes in. If you check out the bottom of demo-direct.lua and startup.lua you'll see we pass the listener object into a convenience function.

dpml.passthrough_new_clients(listen, { new_client = new_client, 
                             new_command = new_command,
                             finished_command = finished_command,
                             closing = closing_client, },
                             { "127.0.0.1", 3306 })

This takes care of the nitty gritty and allows us to just focus on, say, a 'new_command' function. Now we're getting back to the usability of MySQL Proxy a little. Want to have your scripts automatically reload? Write a dpml library to dynamically load, or have a timed reload of, your new_command function from a file. Then everyone can do it easily, but serious users won't lose the speed from having to work around script caches.

Neat. Now, it should be obvious what other sorts of evil we can do here.

Since listeners are arbitrary, and you specify the exact callback function, you can easily have _one_ DPM instance run completely different code on different ports, IP's, etc. Make your plugins modular, and load a bunch of them at once for some fun.

How does DPML manage to do some of the things it does?

When you have a client or server connection object, you may register your own event callbacks on them. Watch for commands, parse resultsets, etc. Let's say you want to parse the results of queries in your own special way, but sometimes you just want to grab some easy results. So you have something like:

server:register(dpm.MYS_SENT_RSET, read_rset_packet)
[... etc]
dpml.execute_query_buffered(server, "SELECT 1 + 5", read_results)

... execute_query_buffered has to use its _own_ set of callbacks to make this work! What happened to yours?

They step aside for a minute, actually.

local br_callbacks = dpm.new_callback()
register_callbacks(br_callbacks, {
                   [dpm.MYS_SENT_RSET]      = br_read_rset,
                   [dpm.MYS_SENDING_FIELDS] = br_read_fields,
                   [dpm.MYS_SENT_FIELDS]    = br_read_endfields,
                   [dpm.MYS_SENDING_ROWS]   = br_read_rows,
                   [dpm.MYS_WAIT_CMD]       = br_read_finish,
                   [dpm.MYS_RECV_ERR]       = br_read_error,
                   })
[... etc]
server:package_register(br_callbacks)

Any library function, in any file, may temporarily override the callbacks on a client or server object with its own. package_register is a very fast operation, and you'd want to keep your callback object cached somewhere. Now you can have your own processing, but temporarily pass connections over to completely different callbacks. This is limited by a single depth; a library function cannot call another library function and expect to not have its package callbacks overridden.

This means you may safely use most of dpml's functionality from your own scripts to help make your life easier. It's also trivial for anyone to start their own library and contribute it back, or maintain on their own.

Timers are also way cool, but they were described in a previous post.

The polling multiplexer

With R6 all of the included demos were rewritten, and a new plugin has been started. Something more complete and useful: A query multiplexer.

I wrote this in bits over the last two days, and it's not complete, but it is functional, and very cool. The concept of the multiplexer is that DPM can send queries to many servers, asynchronously, in parallel, and aggregate the results for you. With only lua. If the code were made to be more verbose and less readable it could take up a lot less CPU time, but as-is it's not bad at all.

First, let's connect and say hello:

mysql> hello;
ERROR 666 (66666): Multiplexer does not understand.

Rude. When we talk to the multiplexer, we are never directly talking to a MySQL database. We're in the rudimentary command parser supplied with it. I also like making up error codes that are clearly wrong.

Let's do something more useful:

mysql> ADD DB host=127.0.0.1,user=happy,pass=wheefun;
Query OK, 1 row affected (0.00 sec)

mysql> SHOW DB STATUS;
+-----------+--------+-------+-------------+---------+---------+
| DSN       | active | queue | disconnects | retries | queries |
+-----------+--------+-------+-------------+---------+---------+
| 127.0.0.1 | 1      | 0     | 0           | 0       | 0       | 
+-----------+--------+-------+-------------+---------+---------+
1 row in set (0.00 sec)

Spiffy! We've added a single DB to the service, and it has successfully connected to said backend. We can also remove it again with "REMOVE DB 127.0.0.1" - note that DPM doesn't presently support asynchronous DNS lookups, so we have to use IP addresses for the time being.

Now we use a really awful syntax:

mysql> POLL ALL 5 SHOW DATABASES;       
Query OK, 1 row affected (0.00 sec)

"ALL" specifies a collection of servers, which is not currently supported. The '5' would be a timeout, which is also not presently supported. They are reserved for now.
... If we had a bunch of databases (and I have tested it with 10+), that would have sent the command "SHOW DATABASES" to _all_ of them. It will not wait for the response before returning though. While we can, we don't need to here. Why? Because we can do this:

mysql> POLL ALL 5 SHOW FULL PROCESSLIST;
Query OK, 1 row affected (0.00 sec)

... now we have two commands in flight. Well, they're probably finished by now, but you get the point. We can connect, fire off 5+ commands to run on all of our DB's, and go to sleep for a while.

Then:

mysql> FETCH ALL SHOW DATABASES;
+-----------+--------------------+
| dpm_dsn   | Database           |
+-----------+--------------------+
| 127.0.0.1 | information_schema | 
| 127.0.0.1 | docpart            | 
| 127.0.0.1 | mfs                | 
| 127.0.0.1 | mysql              | 
| 127.0.0.1 | sakila             | 
| 127.0.0.1 | test               | 
| 127.0.0.1 | world              | 
+-----------+--------------------+
7 rows in set (0.00 sec)

... we get our responses back. DPM has kindly rewritten the results so we can see which server sent which data. Now we can fetch all of the results from the last round, fire off new POLL queries, and go parse what we just got.
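To make the POLL/FETCH idea concrete outside of DPM, here is a loose Python sketch. It is not how DPM does it (DPM is event driven and does this in lua); it uses a thread pool instead. The Multiplexer class and the run_query callable are assumptions invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class Multiplexer:
    """Fire a query at every backend, collect the answers later."""

    def __init__(self, backends, run_query):
        self.backends = backends    # list of DSN strings
        self.run_query = run_query  # hypothetical: (dsn, sql) -> list of row tuples
        self.pool = ThreadPoolExecutor(max_workers=max(len(backends), 1))
        self.pending = {}           # sql -> {dsn: Future}

    def poll(self, sql):
        # Returns immediately; the queries run in the background.
        self.pending[sql] = {
            dsn: self.pool.submit(self.run_query, dsn, sql)
            for dsn in self.backends
        }

    def fetch(self, sql):
        # Block until every backend has answered, then tag each row with
        # the DSN it came from (like the dpm_dsn column above).
        futures = self.pending.pop(sql)
        rows = []
        for dsn, future in futures.items():
            rows.extend((dsn,) + tuple(row) for row in future.result())
        return rows
```

Call poll() a few times, go do something else, then fetch() each query when you want the aggregated rows.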

This is a big help for monitoring the databases at Six Apart. The script will get better over time, and will be perfect to plug into, say, Innotop, for its querying of groups of servers.

Lots of features/fixes possible, but as is it works fine. I've not made significant changes to the C or lua API (aside from bugfixes) in order to write this. I've used the same API as existed five+ months ago.

Download it, give it a shot, and let me know what you think!

If you think it'd be awesome to get to play with this in production, and use other fancy danga tools to help maintain one of the coolest web infrastructures, we're looking for Senior Sysadmins :) There are also a bunch of other jobs available if you want to take a look.

Silly update to Dormando's Proxy for MySQL

dormando — Sat, 19 Apr 2008 10:04:26 GMT

If you fetch the latest snapshot you'll find the code has been relicensed under BSD.

Huh. Wonder what you could do with that :)

Memcached 1.2.5

dormando — Tue, 04 Mar 2008 09:58:47 GMT

This is absolutely the most awesome release of memcached ever, and it has absolutely nothing to do with me being a patchmonkey for it.

Trees are stabilizing and people are getting to work. Fun and exciting things to come :)

Kudos to MySQL and Sun (like krow) for jumping in on the fun.

Next up: 1.3.0-rc. Time to start hammering down the binary protocol before it wanders off even more.

I'm up way too late :) Praise insomnia!

Dormando's [crappy] Operations Mantras

dormando — Mon, 04 Feb 2008 06:23:23 GMT

Ops Mantras (as made popular by Dormando).
I've been doing this shit for a while now. I'm presently acting as a MySQL DBA for SixApart, but these views are mine and not those of my employer. This is an omega post of all of the generalized one off mantras I find valuable when approaching operations management. Even if these end up being idealistic, my humble view is to shoot for these and you'll be better off with what you end up with.

It's uh, long, sorry. This _was_ inspired by another post which I'll not be direct linking. Aside from the list-style, I've not stolen anything else. The Mantras are broken up into major sections:
- The Technical Element
- The Human Element
- The Practice

The Technical Element

Design for change

- The old google mantra is right. Design for change. Change is having to deploy new software, upgrade existing software, scaling, equipment breaking, and people shifting around.
- Everything in this mantra is about finding balance. You might think it's a good idea to tightly marry your system to a particular OS or Linux distro. It's just as bad of an idea to separate them entirely. Use layers and a _little_ indirection if you must.
- This does not mean complete and total platform agnosticism. It's about making one system two, two systems twenty. Dealing with it if a sysadmin gets hit by a bus, if that dangling harddrive dies, if someone runs rm -rf /. It's for the incremental changes. Security updates, pushing new corporate content.

Use automatic, repeatable builds

- Don't build anything by hand. If you do, do it twice, and grab every single command the second time around.
- I cannot stress how important this is. It should take no more than 15 minutes from bare metal to production for new hardware. There does not need to be a human element to screw it up, or get punished when a server goes down and no one knows how to replace it.
- This is true for anything. There is _no_ such thing as a "one off" server build. If you've built it once, and it only needs to exist once, it will exist twice. The second time around is when it breaks, or if you need to do a major upgrade or consolidation two years down the road and have no frickin' clue how it was put together.
- Test, vet new builds. This should be easy because your builds are all automatic, correct?
- Scripted builds means that upgrade from Linux Distro Version 3 to Version 4 is absolutely clear cut. Install Version 4 and test the scripts. Read documentation and fix until it works again. This should be a week's worth of work at most, not a yearlong project. (to finish just in time for Version 5 to come out!).

Use redundancy

- Just because something might be easy to rebuild, doesn't mean you can ignore redundancy. Jump boxes, mail servers, billing gateways, whatever. Wouldn't it be a hell of a lot easier if you could swap out one half of the equation without causing downtime for your customers?
- ... and along those lines, you get to "deal with it later!" when a box goes down at 3am, and the redundant machine kicks in.
- Even if it's not ideal, go for it anyway. Rsync'ing configs to a second box is a step above nothing. DRBD might not be perfect but it can provide an amazing service.

Use backups

- We shouldn't even joke about this. Use harddrives, burn the tapes. Compress them, move them, run them in parallel. Backup EVERYTHING!
- If your builds are automatic, the entire process can be backed up. If you're following along past this point a *real* Disaster Recovery plan might not seem so far fetched.

Keep monitoring specific

- Monitor every damn thing you can, but do it right. Don't get a thousand alerts if your NFS server craps its pants. Don't alert on timeouts if it doesn't make sense for your system. Test for success at the most specific level; sure the service might allow a new TCP connection, and it might even say hello, but does it remember how to do its job?
- If you have 500 webservers, you probably don't need to know immediately if one goes down. You _should_ know if the load balancer decided to not take it out of rotation and real human people are seeing its uglyriffic error messages.
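As a sketch of "does it remember how to do its job", here's a check that does a real set/get round trip instead of just opening a socket. Assumptions: client is any object with set and get methods (a memcached client would fit), and the function name and canary key are invented for the example.

```python
import time


def check_cache_roundtrip(client, key="monitor:canary"):
    """Return (ok, detail). Passing means a stored value actually came back."""
    token = str(time.time())
    try:
        client.set(key, token)
        got = client.get(key)
    except Exception as exc:
        # Connection refused, timeout, protocol error: all of it is "down".
        return False, "error: %s" % exc
    if got != token:
        # It answered, it may even have said hello, but it forgot the value.
        return False, "wrong value"
    return True, "ok"
```

A plain TCP connect check would pass for a service that accepts connections and then loses every write. This one wouldn't.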

Graph data, keep exact historical data

- Graphs are for visualizing trends. Historical data is for crunching numbers. Don't mix the two! It's too easy to get wrong numbers from eyeballing graphs. Many sites use rrd's or other aggregating data systems which will average and smooth out data over time to save on storage space. This means it's not only hard to read, it's wrong.
- Don't get trapped having to skim through hundreds of graphs just to pinpoint an issue. If you're trying to find outliers in the graphs, you can pull those out via scripts as well.
- If you must use graphs for troubleshooting, try to aggregate high level concepts into a single page, which link into drill-down pages from there. If you can see a spike in the database load, you'll know to click to the page overviewing the databases, then you'd see the one or two iffy machines in question. The idea is to narrow something down fast. Remove as much guesswork as possible.

Log useful information, use multiple streams of data

- Work on your own, and with development, to log as much useful information as you can. Doesn't matter if you live analyze it and store the data somewhere, or lump it into a database and run reports. Information is useful.
- Useful examples: Page rendering time (what page, what box, etc), user-facing errors, database and internal service errors, bandwidth usage, etc.
- Establish graphs, reports, and do historical comparisons from generalized data.
- Reports are really important. Get digested data week-to-week or day-to-day about changes in your infrastructure.

Understand your data storage, databases

- There's an entirely separate set of understanding about operating databases, but sometimes you can't leave all of this up to your DBA.
- Having multiple, redundant databases affords you many luxuries. Operations that were once many hours of downtime can be done "online" without shelling out for a huge Oracle instance. MySQL and replication is a fantastic thing.
- Work with the DBAs to get the best possible hardware for the database in question. RAID10, gobs of RAM, many fast spindles, and potentially RAM disks and SSD's. Ops has access to the vendors, DBA's can beat the pants off the hardware. Find out what works best and save tons of cash in the long run.
- Database configurations are changing. Software like HiveDB, MySQL Proxy, DPM exist now. We're absolutely doing partitioned data for huge datasets. We're also thinking outside of the box with software like starling and Gearman. Learn what these are, and understand that not everything will be in a database.
- Get a good grip on your filers! If the data's important, back it up! Snapshots on monolithic NFS servers are fantastic, wonderful, and NOT a backup!
- Consider alternatives. MogileFS gets better year after year. There're likely other projects for freely and cheaply maintaining massive stores of files. Similar systems were developed for youtube.com, archive.org, etc. We're finally free of expensive NFS filers being the standard!

Scale out a lot, up a little

- You've seen all of the papers. Scale out is really the way to go. Get commodity (read: available, affordable, standard, NOT super cheap) hardware and work with everyone to ensure all aspects possible can scale out.
- Scaling out starts at two, work from there. This also happens to encompass redundancy.
- Scale out as far as you can without being idiotic about it. The example of MySQL replication with single master, many slaves, is a fantastic example of one form of scale-out sucking. All slaves must do all writes, so as the number of writes scale up with the reads (if they do for your app, which I bet they certainly do), you get less capacity per slave you add.
- Keep alternatives in mind. User or range partitioning onto many databases, avoiding production slaves where possible, etc. Good ideas, many ways to implement them.
- Everything can scale if you give it a chance! Routers, switches, load balancers, webservers, databases.
- Remember scale up? Big evil machines with many slow cores, lots of IO boards, and very expensive storage equipment? They're coming back. Well, the CPU part is.
- RAM is cheap.
- Combine the two, and you just may end up combining services again. A load balancer here, a webserver there... If an application can use many CPUs (apache) this is perfect. If it can't (memcached doesn't get much benefit from it, usually) you can end up wasting tons of available resources by segregating services too much.
- Job systems could potentially fill in gaps here. Where there're extra cores, slap up more workers.
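The replication point above is easy to see with back-of-the-envelope arithmetic. This is a deliberately crude model with assumptions all over it: every box can do a fixed number of operations per second, a read and a write cost the same, and every slave has to replay every write. The function names are hypothetical.

```python
def read_capacity(slaves, per_box_ops, write_ops):
    """Reads/sec left across all slaves after each one replays every write."""
    return slaves * max(per_box_ops - write_ops, 0)


def sharded_read_capacity(shards, per_box_ops, write_ops):
    """Same boxes, but partitioned: each box only takes its share of writes."""
    return shards * max(per_box_ops - write_ops / shards, 0)
```

With four boxes at 10,000 ops/sec each, 2,000 writes/sec leaves 32,000 reads/sec across the slaves. Push writes to 8,000/sec and the same four slaves are down to 8,000 reads/sec total, and every slave you add only buys another 2,000. Partition those four boxes instead and each one takes a quarter of the writes, which brings you back to 32,000.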

Cache

- Caching is good. Developers, sysops, etc. Get on this! Yes, it's weird. It's different. Sometimes you may even need to, gasp, make a tradeoff for it. Effective use of caching can yield as much as a tenfold increase in overall system performance. That's a giant magnifying glass over the systems you have already, at a fraction of the overall cost.
- Memcached. Service cache, denormalize DB structures (where it makes performance sense!), squid cache, or even make better usage of OS caches.
- Test it, toy with it, and break it. There will be new and different problems with caching. Be prepared for it.

Asynchronous jobs

- Starling, Gearman, The Schwartz, whatever. Job systems allow much more application flexibility. Workers can be spawned one-off, be persistent (load cached data, prepare data, etc), be on different hardware, different locations, and be synchronous or asynchronous.
- Maintaining these things is an ops issue. Using them is both a developer and an ops issue.
- User clicks "send all my friends an e-mail". Schedule a job, immediately say "okay done! Your friends will receive your spam shortly!" - let the job service multiplex and deal with the issue.
- Job systems are great places to bridge services. Blog post -> IM notification, billing cron -> billing services, authentication gateways, etc.
- Easy to scale. There will be choke points for where requests come in, and all the workers need to do is pull. This is in contrast with the largely push/pull state of HTTP.
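The "send all my friends an e-mail" flow, sketched in Python. Big assumption: an in-process queue.Queue and one thread stand in for a real job system like Gearman or The Schwartz, and appending to a list stands in for actually sending mail. All names are made up.

```python
import queue
import threading

jobs = queue.Queue()
sent = []  # stand-in for mail actually going out


def worker():
    """Pull jobs until told to stop. Workers only ever pull."""
    while True:
        job = jobs.get()
        if job is None:  # shutdown signal
            jobs.task_done()
            break
        sent.append("mail to %s" % job["to"])
        jobs.task_done()


def handle_click(friends):
    """The web request: schedule the work, answer the user right away."""
    for friend in friends:
        jobs.put({"to": friend})
    return "okay done! Your friends will receive your spam shortly!"
```

To try it, start the worker with threading.Thread(target=worker, daemon=True).start(), call handle_click([...]), and use jobs.join() to wait for the queue to drain. The page load never waits on the mail.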

Security and patrols

- Install security updates! Seriously! There's a whole crazy network of people who are dedicated to giving these to you in the shortest period of time possible. Don't let them sit for _years_ because you're afraid of change.
- Security is in layers. Accept what you can and cannot secure. Just because mysql has password access doesn't mean it gets to be directly accessible from the internet.
- Disable passwords over ssh. Use passphrase encrypted key auth. Remote users _cannot_ guess your private key. They _have_ to get it from you. Keep it safe, and there's no point in firewalling off your ssh port.
- Understand how the application works, exactly what it needs to do, and work that to your advantage. If the only part of your application which needs outbound internet access _at all_ are the billing pages and some twitter-posting service, those can easily become job workers. Put the job workers on specific boxes and allow those access to specific hosts. Keep the rest of your network in the dark.
- The above is especially important for php sites, but probably works great elsewhere. If someone breaks in, it's most likely going to be through your application. When someone gets in through the front gate, they'll need to haul in their toolbox to get into the safe. Don't let them pull in data and get what they need, or upload the contents of your database somewhere!
- These specific suggestions aside, read a lot. Use your best judgement, and test. If you have no understanding of how a security model works, that might not immediately make it worthless, but you certainly don't know where its limits are or even if it works.
- Secure based on testing, theory, attack trees, don't stab in the dark. I love it when people dream up obscure security models and ordinary folks like me can smush it to crumbles.
- Patrol what you can! Audit logins, logouts, commands used. All accesses to external facing services, including all arguments given in the request. Find outliers, outright ban input outside of the scope of your application, and do what you can actively and have the data to work retroactively.
- If you suspect something's been cracked, *take proper precaution* and understand a little computer forensics (or get a company that does). Respond by removing network access, checking the system through serial console or direct terminal, and avoiding using any service, config file, or data on the compromised machine. Too many people "clean up a trojan" and never understand how it got there, or if they've _actually cleaned it up_.
- If you do have a security team, forensics expert, or anyone else onhand, you must touch the machine as little as possible and isolate it. This means not rebooting it to "clear out some funky running processes". They need to be able to get at those. If you need to half ass it, go ahead, but remember to wipe the system completely clean, apply any security updates, and do your best to figure out if they've compromised any important data. Do what you can.
- Security is an incredible balancing act. If you do it wrong, developers, users, etc, will revolt and find ways around it. If they _can_ get around it, you're not doing your job right. If they _can't_ get around it, they might just give up and leave.
- Keep an iron grip on access control. This means ops must absolutely provide windows for what doors have been locked. Kicking development off of production entirely means they get to stab in the dark on fixing hard problems. Providing logging, debugging tools, etc, without allowing them to directly change the service, will be a win for all aspects of production.
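
As one small example of "ban input outside of the scope of your application" while keeping data to work with retroactively, here's a sketch of an allowlist check that records everything it refuses. The username rule and the names `USERNAME_RE`, `check_username` and `rejected` are assumptions made up for the example.

```python
import re

# Hypothetical rule: usernames are 3-16 lowercase letters, digits or underscores.
USERNAME_RE = re.compile(r"[a-z0-9_]{3,16}")
rejected = []             # audit trail of refused input, for later review


def check_username(raw):
    """Accept only input matching the allowlist; record everything else."""
    if isinstance(raw, str) and USERNAME_RE.fullmatch(raw):
        return True
    rejected.append(repr(raw))
    return False
```

Anything that lands in `rejected` is an outlier worth looking at later.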

The Human Element

Learn from many sources

- Fill up some RSS feeds, and read at least a few good articles per week. LWN, kerneltrap, undeadly.org, whatever's relevant, or even loosely related, to what you do.
- Read blogs from smart people. Sometimes they post interesting topics, and comment streams give us the unique ability to directly converse with the masters.
- Read a few blogs from not so smart people. Get a feel for what stumps them, or what they do that doesn't work so well.
- Get to know people who can kick your ass, at anything. Stay humble.
- Help find your own strengths by taking in from many sources, and gobbling up what invigorates you.
- Read up on success and failure stories from other companies. Ring up their CTOs and get them to divulge advice over free lunch.

Try many things

- You'll be amazed at what you can do if you keep trying. Never seen something before? Give it a shot.
- Try to not be a dangerous newbie. Play in the sandbox until you're comfortable enough to not burn down the house.

Understand redundancy

- Really understand how redundancy affects things. How it works, how it doesn't work.
- Break redundant systems in a test lab, sometimes in production. Learn what you can while you're in control. Unplug the power, yank cards out, kill processes, run the box out of memory, yank a harddrive, yank ethernet.
- Test replacing and upgrading systems in a redundant setup. Maybe you can toss in that brand new hal-o-tron 8000 without taking downtime.

Understand scalability

- There're tons of papers on making scalable systems. Even if you can't write one yourself, try to understand the theory.
- Learn with virtualization. Set up a few virtual machines and try tossing up applications to multiple machines. Run multiple instances locally on different ports.
- It's usually the job of operations to do proper capacity planning. You won't know what to add unless you truly understand where resources should be added.

Become a troubleshooting superstar

- The moment something breaks the clock is ticking. You must be able to pull out your arsenal and use it effectively.
- Practice troubleshooting. Pick a perfectly good, working page, and try to track down how it works.
- strace, ltrace, lsof, logs.
- Understand that load != load. Look at all available information as to how a host is performing or behaving.
- Be very familiar with the tools for your IO system. Often "mysterious" performance problems happen because your RAID or SAN setup isn't happy for some reason.
- Leave documentation. Checklists, troubleshooting tips, build tools.
- Build more tools. For yourself, for other people, or add features to existing ones.

Work with IT

- Believe it or not, there is overlap.
- Ops has to maintain high bandwidth network access for servers. IT has to do the same for people, and is often the bridge ops has *into* the datacenter. It may make sense to work together on this one.
- Draw the right line. IT should manage mail, but ops should manage development servers. Don't offload things you don't need to, and offer to do what you do best if necessary.
- Don't alienate people. Macs are popular, linux is (slowly) gaining share. Believe it or not, forcing everyone to use microsoft productivity software can bite you. There are plenty of alternatives, try one. Odds are more people in your company are familiar with google apps than they are with outlook.
- Don't make it more difficult than you have to for people to run a unix system natively. Unless your backend is a windows shop, wouldn't you want people to have more familiarity with the OS they're supposed to support?

Work with developers

- You both work on the same product, for the same purpose. Try working together a little more.
- Having strategy meetings is not working together.
- Development understands the code resources the best, and operations understands the hardware and deployment the best. You can design something more efficient by taking all of this into mind.
- Cross training. Disseminating information can show how tools and designs on both sides can be improved to be more manageable and resilient.
- Be careful of being too demanding on either side. It's not an Us vs Them. Everyone's human. Everyone should be doing as much as they can for the company, not for themselves.
- It's more pleasant to handle crunch times and emergencies when everyone gets along.

Work with ops

- Ops folks have their specialties. Networking, databases, OS. Don't forget to talk to each other!
- Getting stuck in a rut is demotivating, boring, and a good way to lose people. Even if your systems ops guy has the ability to look over the shoulder of the network guy, they have the opportunity to learn.
- Always give people an opportunity to try, learn, and grow.
- Be careful of rewarding your best with too much work. If there're people who can pick up slack, use them.
- Bad eggs. It happens. Be tough enough to deal with them. Most people can be turned around with a little help, but they need to be able to be independent.

The Practice

Fix it now, not later

- If a webserver goes offline, don't care about it. You have ten spare, right?
- Pick a day during the week to sweep up broken crap. Replace any broken hardware, ensure everything's 100% before swinging into the weekend.
- If small, annoying problems crop up, fix them permanently first thing in the morning. Logs fill up the disk twice last week? Come in fresh the next day, and fix it for good. These stack up, and suck.
- If you have automated builds, use this to your advantage to fix what you can right away, or in bulk.

Automate everything

- Humans can't screw up scripted tasks (as easily).
- Do it twice. Once by hand if you must, then roll up what you did into a script.
- Commented scripts make fantastic documentation. Instead of writing twenty pages detailing how to install something (which is up to interpretation of the reader!), write a script which explains what it does.
- Scripts can be rolled up into automated builds. The more often something is done, the closer it should get to becoming a zero time task.

Change what's necessary

- Make small, isolated changes.
- If you don't have to change it, leave it.
- This also means you must understand _when_ to change. Find what's necessary and upgrade it, switch it out, make it standard.

Design for change

- If you can't do it right immediately, get on the road to it being right.
- This means if you don't have time to do something right, get the basics going with a clear migration roadmap to the right thing. While your new mail system might not be the crazy cool redundant bounce-processing spam monster you dream of, installing postfix and setting up two hosts with a clean configuration gets you closer than you might think.
- This does have a tendency to leave unfinished projects everywhere, but you were going to do that anyway. :)

Practice updating content, fast

- It's usually the job of operations to push out code. Don't suck at it. Push in parallel, apply rolling restarts, be an efficient machine.
- This includes software updates, security patches, and configuration changes.
- Use puppet, cfengine, whatever you need to control the configuration. Keep it clean, simple, and easy.
- The fewer files one must change to make a necessary adjustment the better. If you're adding one line to 20 files just to push out a new database, you're doing it wrong. Build simple templates, build outward, and don't repeat data which needs to be edited by hand.
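
A rough sketch of the "simple templates, don't repeat hand-edited data" point: one list is edited by a human and the config text is generated from it. The hostnames, the config keys and `render_app_conf` are invented for the example; a real setup would more likely live in puppet or cfengine templates.

```python
from string import Template

# The single place a human edits. Hostnames are invented for the example.
DATABASES = ["db1.example.com", "db2.example.com"]

APP_CONF = Template("db_hosts = $hosts\npool_size = $pool\n")


def render_app_conf(databases, pool_per_host=5):
    """Build the config text from the one authoritative list."""
    return APP_CONF.substitute(
        hosts=",".join(databases),
        pool=pool_per_host * len(databases),
    )
```

Pushing out a new database then means appending one entry to `DATABASES` and regenerating, rather than touching 20 files.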

Standardize, stick to the standard

- Pick one or two standard OS's, httpd's, databases, package systems.
- Stick with them. Adjust and upgrade methods as it makes sense.
- Don't stick with that major version forever. Unless your product is going to be feature frozen forever, you'll need to keep the standard rolling forward, and everything behind it.
- The _more_ is standard, the more places your tools will work. The more packages for other parts of the operation will "just work" everywhere else too.

Document well

- Document process
- Document product
- Categorize into shallow trees.
- Don't redundantly document. If a script has a long help, ask the reader to refer to that. The closer the documentation is to the program being discussed, the more likely it is to stay accurate.
- Marry documentation into code. perldoc, pydoc, etc.
- Out of date documentation is poisonous. Reserve time to keep things up to date. Sit down with new employees and update documentation as they run into problems.
- Use ticketing systems, with moderation. Documentation of history is important as well. Forcing people to create detailed process tickets for a DNS change is just pissing in other people's cheerios.

Use source control

- Use git, or mercurial. Avoid SVN like the black plague.
- Put all of your configurations, scripts, hacks, whatever, into source control.
- Keep checkouts everywhere...
- Keep strict, clean, master checkouts. No one should be able to push changes that aren't committed, but it should also be easy to test changes (in a VM, directly on a single test machine) without having to wrestle with the source control.

Hire well

- Discern between stubborn and smart
- Don't avoid hiring senior. Some people really know their shit. Some _seem_ like they do. Others are "senior" in a particular area and will fall behind as technology changes. While you might want to avoid some, there are definitely rockstars out there.
- Don't avoid hiring junior. I know so many people who've started really junior (including myself! I still view myself as junior), who've shot up through the ranks and now have firmly established careers. I'd believe most of us have. The exceptions are the ones who don't learn, don't have the motivation, or are in the wrong field.

Avoid vendor lock in, and keep a good relationship with the vendors you do use

- Buying proprietary hardware has the major downside of potentially locking you into always using it. It might be a particular SAN, NAS, special-case direct attached storage, backup systems, etc. Avoid getting sucked in. If you follow all of the above design advice, you should be able to build test environments on different platforms quickly. You're then able to keep on top of hardware evaluations and keep choices open.
- If everything's deep, dark, gnarled, undocumented, and directly dependent on your fancy proprietary load balancer, you'll never wriggle free of it.
- Be nice to the vendors you do end up using. If you "push them _hard_ on price!" for every single purchase, expect some shit hardware to show up.
- Datacenters these days have a lot of potentially useful resources. Try to throw some free remote hands service into your contract and abuse that to get harddrives replaced, vendor items shipped/RMA'ed, and some basic hardware installs. I've had entire racks of equipment delivered and installed with barely a visit from an employee... and damn, it's nice.

Give Open Source a serious try

- nginx, mongrel, lighttpd, apache, perlbal, mogilefs, memcached, squid, OpenBGPD, PF, IPTables, LVS, MySQL, Postgres, blah, blah, blah. Before you hop back on that trusty, reliable, expensive proprietary setup, give open source a shot. You might find yourself adding plugins, extensions, code fixes or contracting help to bring features you'd never be able to do otherwise. In my own experience OSS is just as reliable as, and often more so than, big expensive hardware when put under significant load.
- The idea of "you get what you pay for" is a complete lie. If you can't make OSS work for you and need the hand holding, you _can_ still go with a vendor. If you have a smart, motivated team, who really want to learn and understand how their infrastructure runs, you just can't beat some hardy GPL'ed or BSD'ed systems.
- MySQL and Postgres are fine. Call them tradeoffs if you will; nothing's going to crawl out of your closet at night and eat your data. Sure, it does happen, but you're much more likely to be screwed over with monolithic oracle instances going offline (it happens!) than you are with a well tested and stable MySQL instance, in a redundant master<->master cluster pair.
- I'd say 'cite references' - but go look around. Check out any number of articles on the LAMP stack. Most major dot coms, ISPs, and even corporations are now adopting it. Give it a shot. The worst you'll have is some lost time, and another product to scare your vendor into dropping price with.

Dormando's Proxy for MySQL Release 5

dormando — Tue, 15 Jan 2008 18:49:48 GMT

previously
Stroll on over to the DPM's minimalist homepage and grab the latest release tarball, export tarball, clone the git repo, or peruse gitweb.

While I did some porting work, this release has not been explicitly tested on all of the platforms yet. If there are bugs with a particular platform, please report.

This release fixes a lot of outstanding complaints I had with the power of the API, and many known obnoxious bugs and restrictions. Like the previous inability to listen on INADDR_ANY, or use unix domain sockets, etc. There are still a number of usability/troubleshooting gotchas when writing programs using DPM, but aside from the learning curve most of it should work now. There are no known crash bugs or memory leaks (aside from a "leak" in the dpml library under specific conditions).

Changes since last post:

- Backend authentication bug fixed.
- Fix for crashes if a listener socket object gets garbage collected (now gracefully closes the socket and cleans up after itself).
- All sockets being written to now get properly flushed. This _also_ means multiplexing commands between sockets now works. You can take a command from one client and send, in parallel, to multiple servers, then retrieve the resultsets in parallel. Works in either direction, like sending a single resultset back to multiple clients (for coalescing similar requests).
- Lua connections can actually show the IP address now via conn:socket_address().
- Unix domain socket support works for both clients and servers. Might be shaky, but tested okay.
- New dpm commands for retrieving time:
gettimeofday time time_hires

dpm.gettimeofday() returns two values: seconds and microseconds
dpm.time() returns seconds
dpm.time_hires() returns milliseconds
... would be fun to add some dpml functions to make time diffing easier!
- New callback timer API:
timer = dpm.new_timer()
timer:schedule(seconds, milliseconds, callback_function, argument)
Schedule a repeating timer to run the lua function 'callback_function'
For subsecond resolution set seconds to zero, and milliseconds to nonzero.

Callbacks look like:


-- 'self' is the 'timer' object.
function callback_function(self, arg)
    print("Hello! I have ran " .. arg["count"] .. " queries.")
end

Timers are cancelled by running timer:cancel()
... if you want callbacks to be run exactly once, they must unschedule themselves via self:cancel()
- DPML function dpml.execute_query_buffered(server, query, callback)
Runs callback with a buffered resultset. Same table structure as accepted by dpml.send_resultset
- Silly barely-tested dpml.dump_table() utility to recursively pretty-print lua tables for debugging.
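
To spell out the timer semantics described above (timers repeat until cancelled, and a one-shot callback has to cancel itself), here's a toy model in Python driven by a fake clock. This is not DPM's implementation or its Lua API. The `Timer` class, `run_until` and the `now_ms` parameter are made up purely to illustrate the behaviour.

```python
class Timer:
    """Toy model of a repeating timer; not DPM's implementation."""

    def __init__(self):
        self.interval_ms = None   # None means "not scheduled"
        self.next_fire_ms = None
        self.callback = None
        self.arg = None

    def schedule(self, seconds, milliseconds, callback, arg, now_ms=0):
        """Arm the timer to fire every (seconds, milliseconds) from now_ms on."""
        interval = seconds * 1000 + milliseconds
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval_ms = interval
        self.next_fire_ms = now_ms + interval
        self.callback = callback
        self.arg = arg

    def cancel(self):
        self.interval_ms = None

    def run_until(self, end_ms):
        """Fire the callback for every interval that elapses up to end_ms."""
        while self.interval_ms is not None and self.next_fire_ms <= end_ms:
            self.next_fire_ms += self.interval_ms
            self.callback(self, self.arg)


def count_forever(self, arg):
    """Repeating callback: keeps firing until someone cancels the timer."""
    arg["count"] += 1


def run_once(self, arg):
    """One-shot callback: unschedules itself after the first run."""
    arg["count"] += 1
    self.cancel()
```

As in the Lua callback above, the first argument to the callback is the timer itself, which is what lets `run_once` cancel its own schedule.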

Fun, huh? :)

For R6 I'm still trying to switch the focus to writing more lua infrastructure than C, but I'd like to support a few more packet types first.
R6 should see the start of a test suite, which is a prerequisite to going back and cleaning up the codebase.

R7 will start the focus on more testing, and refactoring the code.
It's pretty ugly; I want to abstract out into more files, and work towards BSD licensing, support C plugins, etc. Function, variable, define naming all needs to be cleaned. Error handling for the lua libs needs to be ironed out... Lots to refactor, but that work is quick when you have a test library.

BSD licensing is important for me, and this project. It'll happen.

Long announcement, sorry. Wrote part of this on my way into work, and decided to release one feature early. Early, often.

Magic tricks

dormando — Sat, 12 Jan 2008 09:43:48 GMT

mysql> CREATE TABLE `blah2` ( `hello` int(11) default NULL );
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO blah2 VALUES (1);   
Query OK, 1 row affected (0.00 sec)

mysql> select * from blah2;
+-------+
| hello |
+-------+
|     1 | 
|     1 | 
+-------+
2 rows in set (0.00 sec)


mysql> show full processlist;
+--------+-------+-----------------+------+---------+------+----------------+-----------------------+
| Id     | User  | Host            | db   | Command | Time | State          | Info                  |
+--------+-------+-----------------+------+---------+------+----------------+-----------------------+
| 503235 | root  | localhost       | NULL | Sleep   |  742 |                | NULL                  | 
| 503238 | happy | localhost:36013 | test | Query   |    0 | NULL           | show full processlist | 
| 503239 | happy | localhost:36014 | test | Query   |    0 | Writing to net | show full processlist | 
+--------+-------+-----------------+------+---------+------+----------------+-----------------------+
3 rows in set (0.00 sec)

mysql> use sakila;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> show full processlist;
+--------+-------+-----------------+--------+---------+------+-------+-----------------------+
| Id     | User  | Host            | db     | Command | Time | State | Info                  |
+--------+-------+-----------------+--------+---------+------+-------+-----------------------+
| 503235 | root  | localhost       | NULL   | Sleep   |  801 |       | NULL                  | 
| 503238 | happy | localhost:36013 | sakila | Query   |    0 | NULL  | show full processlist | 
| 503239 | happy | localhost:36014 | sakila | Sleep   |   17 |       | NULL                  | 
+--------+-------+-----------------+--------+---------+------+-------+-----------------------+
3 rows in set (0.00 sec)

Easy.

No DPM -r5 tonight

dormando — Wed, 02 Jan 2008 10:50:25 GMT

Out of time, finally getting tired. Tomorrow, for sure :) One last bug to hunt down.

For now, HEAD's looking pretty good. grab the export tarball, check out the git repo, etc. If you're bored.

The most interesting addition is the CMake build and the fact that DPM's now portable to at least four operating systems. I've read Jan's recent posts on the subject of build systems, and I'll admit up front that I haven't touched DPM recently due to my reluctance to use autotools.

I've decided CMake isn't evil enough to warrant avoiding it. It's an extra dependency, so we'll see how it goes. There's even been recent discussion on using lua as CMake's build language :)

Also, I've finally done a few tests with mysql's sql-bench "run-all-tests" against DPM. So far, it's passed with no memory leaks or crashes! My one outstanding crash bug was a simple fix noted below.

Updates for R5 (slight recap from yesterday):

- Uses CMake as a build system. File tries really hard to satisfy dependency checks and to build it correctly.
- Ported to multiple operating systems (not win32, not really interested). OS X, OpenBSD, FreeBSD.
- Will now optionally build with -g, instead of always.
- With addition of CMake, 'make install' now exists and works. No more path hacking to get it loading the default libraries.
- Bugfixes: auth packets can now specify default databases, close open connections if the associated lua object is garbage collected.
- Specify under --verbose if closed connections are listeners
- Finally (argh!) able to set a listener to INADDR_ANY by specifying nil for an IP address: dpm.listener(nil, 3306) would emulate mysql's default network behavior.
- Some documentation updates.
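
For anyone unsure what listening on INADDR_ANY means at the socket level, here's a small Python illustration. It has nothing to do with DPM's internals; `listen_any` is a hypothetical helper, and it uses port 0 (let the OS pick) instead of 3306 so it won't collide with a running mysqld.

```python
import socket


def listen_any(port=0):
    """Listen on all IPv4 interfaces; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))     # "" is Python's spelling of INADDR_ANY
    sock.listen(5)
    return sock
```

Calling `getsockname()` on the returned socket reports the wildcard address `0.0.0.0`, meaning connections are accepted on every local interface.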

Still not done for R5:

- Bug: Backend authentication does not always work, due to the random scramble generating invalid characters.
- Unix domain socket support (easy, but not five minutes of work).
- Allow lua connection objects to retrieve the connected IP address. (for authentication, etc).
- Potentially a fix for ensuring network flushes happen if packets have been written to the buffer, as presently it will only flush connections which directly receive packets, and proxied connections. Dumb.
- Potentially a few simple dpml additions to help ease new users in.

For R6:

TBD ;) Really needs a test suite.

Happy new year! DPM, memcached, etc

dormando — Tue, 01 Jan 2008 07:24:24 GMT

r5 of DPM will appear ... tomorrow! It's close now, but I'ma go party a bit. There's new code in HEAD if you're bored.

New stuff:

- Bugfixes (crash bugs, silly things)
- CMake build file.
- Code ported from Linux to OS X (leopard) PPC, OpenBSD 4.2, FreeBSD 6.2
- (not done yet) support INADDR_ANY, unix domain sockets, few more things.

Also, new releases of memcached coming up as soon as possible, along with nice clear useful documentation.

- 1.3.0, with binary protocol!
- 1.2.5, with portability fixes!
- Overview of all outstanding projects/ideas worth doing, now that memcached's development has been jumpstarted.

Then later:

- 1.3.1, probably with all of the binary protocol bugs fixed!

Happy new year!

DPM Release-4

dormando — Sat, 10 Nov 2007 07:16:01 GMT

DPM homepage

http://consoleninja.net/code/dpm/rel/dpm-r4.tar.gz - tarball of r4
git clone http://consoleninja.net/code/dpm/dpm.git - to get the latest code, always
http://consoleninja.net/code/dpm/dpm-export.tar.gz - a tarball of the latest code, for those unwilling to git it.

It's been a long time. I suck. One of these days I'll be productive enough to pop out releases.

Unfortunately I've been sitting on this one for a long time. There are significant API updates in this one, but I've cut it just short of being really good. Impatient :)

Notable API changes:

- All instances of 'myp' have been renamed to 'dpm' for coherency.
- New 'dpml' lua-based library. Most demos updated to use it. While the low level API offers a huge amount of flexibility, it's barely usable by a novice. The goal of 'dpml' is to provide pure lua wrappers around the low level API to emulate functions you might normally expect.
- Start of a programming manual in doc/API.txt
- 'server_status' flags are now accessible in lua, off of eof packets. This means you may detect whether a transaction is in use, if autocommit mode is enabled, if an index was missed, etc. All becoming easier to use with advances in the dpml library.
- New demo "demo-lib.lua" which shows off some of the dpml library functions.
- startup.lua, etc, updated for API changes.

- Callback system has been optimized. This is one case of a 5x+ speedup while keeping the API almost completely the same. The hacky proxy_until functions have been removed since it's always fast now.

- Package level callbacks have been added. All connections have callbacks wired directly into them, but there now exists floating callback structures. You may use this convention to temporarily take over a connection's callbacks from within a library. The dpml command 'connect_mysql_server' uses this, and demo-autoexplain.lua has a more clear example of its use.
I love the way this works :) It's now easy to build standard packages to build resultsets, handle resultsets, authentication, error conditions, etc.

- dpm.close() command for forcefully closing a connection. ;)
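
As a rough idea of what the 'server_status' flags mentioned above carry, here's a Python sketch that decodes the bitfield. The bit values are the ones I understand the MySQL client/server protocol to define; `describe_status` and the labels are hypothetical and unrelated to how dpml exposes them.

```python
# Status bits as defined by the MySQL client/server protocol.
SERVER_STATUS_IN_TRANS = 0x0001
SERVER_STATUS_AUTOCOMMIT = 0x0002
SERVER_QUERY_NO_GOOD_INDEX_USED = 0x0010
SERVER_QUERY_NO_INDEX_USED = 0x0020

# Human-readable labels (made up for this example).
FLAG_NAMES = {
    SERVER_STATUS_IN_TRANS: "in transaction",
    SERVER_STATUS_AUTOCOMMIT: "autocommit",
    SERVER_QUERY_NO_GOOD_INDEX_USED: "no good index used",
    SERVER_QUERY_NO_INDEX_USED: "no index used",
}


def describe_status(status):
    """Return the labels of every known bit set in `status`, lowest bit first."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if status & bit]
```

A proxy script could use this kind of check to log queries that missed an index, or to avoid handing a pooled connection to another client while a transaction is still open.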

Other changes:

- Small number of bugfixes, mostly related to testing the API. It's now possible to actually set username/passwords when connecting to a server, to read/change server_status flags, etc.
- Fix from John Loehrer to allow building on OS X (intel, I think?)

Sweet :)

Now, release 5 will be a ways off. Will be pushing smaller changes into the head of git more often, but I will be dedicating more of my spare time over the next two weeks to other things. That doesn't mean work stops, so if you're going to use it and contribute back, still please do! :)

Proxy for MySQL quickie

dormando — Sun, 14 Oct 2007 00:47:23 GMT

Some small (but meaningful) updates in HEAD at http://consoleninja.net/code/dpm/dpm.git :

- You can now actually specify a username/password when connecting to a server. A few other similar spots were fixed.
- You can get the remote connection ID from a connection object.
- New myp.close() command for closing connections ;)

I've tested kill -9'ing clients, servers, in various states of transit with no obvious bugs. It's probably still unstable but these changes make a big difference.

I'm not an API designer by any stretch. I'm stuck wondering:

Should connobj:remote_id() (or just remote()) return the object id, or the full object?

On that same line, all callbacks have the affected connection ID as an argument. Should that just be a reference to the actual connection object, or continue to be the ID?

A huge annoyance with the API (and biggest performance issue) is the way the callback API is handled internally. This is changing significantly. I have the new approach but the API details elude me for the moment.

Dormando's Proxy for MySQL, release 3

dormando — Thu, 11 Oct 2007 09:34:34 GMT

previous post

http://consoleninja.net/code/dpm/rel/dpm-r3.tar.gz - tarball of r3
git clone http://consoleninja.net/code/dpm/dpm.git - to get the latest code, always
http://consoleninja.net/code/dpm/dpm-export.tar.gz - a tarball of the latest code, for those unwilling to git it.

It's been way, way too long. Next release will be within a week and a half, and will try to stick to the once-a-week schedule from then on. There've been over 210 commits to the repo at this point. The speed has been picking up.

Notable changes:

- Resultsets are fully parseable and writeable! This lacks a decent demo, but the new one makes use of it. -r4 will have many new demos. This was the big missing feature. It's certainly able to do a lot of work now.
- It's possible to inject and modify packets being streamed between servers. Inject rows, or modify fields _and_ rewrite rows on the way through :)
- Combine authentication support with resultset management, and you can connect to and query the proxy without a backend mysqld at all. (demo coming soon)
- Many many bugfixes and some code cleanup. Lots of bugfixes.
- New simplistic "auto explain" demo to show off non sucky resultset handling on the raw api.
- Better detection of closed sockets.
- Internal defines pushed up into lua, so no more external lua defines files.

Upcoming:

- Callback handling will be restructured. Should be much faster and more flexible. Worst case, faster :) The "proxy_until" hack will be going away.
- Better detection of connection death during internal proxying.
- Large packet detection, if not support.
- Start of basic "easy to use" lua libraries on top of the base API.
- Code cleanups. So many code cleanups.
- Test suite

Fun API tricks:
- [auto]reloading modules, along with static support
- using multiple module features against a single connection via routing
- different modules living on different listening sockets (easy)

If someone wants to help out with autoconf/automake/make, or at least let me know how much magic sauce I should apply, I'd be pretty thrilled. I hate this crap and the project needs an installer sometime soon.

Sleep now. Taking a day off from the proxy then resuming.

Dormando's Proxy for MySQL, release 2

dormando — Sat, 25 Aug 2007 23:37:08 GMT

Previous post

http://consoleninja.net/code/dpm/rel/dpm-r2.tar.gz - tarball of r2
git clone http://consoleninja.net/code/dpm/dpm.git - to get the latest code, always
http://consoleninja.net/code/dpm/dpm-export.tar.gz - a tarball of the latest code, for those unwilling to git it.

I'll keep this short. Also, just noticed my tags aren't getting uploaded to the remote git. Sorry.

Bunch of commits worthy of a new release:
- new included pass-through demo.
- cleaned the included connection pooling demo a little.
- 'proxy_until' demo. Disables packet processing temporarily. Huge speedup if you're only doing partial inspection. On an 8.4 million row select (small rows, memory cached):
    - no proxy: 5.80s avg
    - default proxy: 19.4s avg
    - with proxy_until: 6.2s avg
  Nice?
- \s works with mysql CLI now.
- Packet sequencer is less sucky.
- Misc bug fixes.
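
To put the proxy_until timings above in perspective, here's the arithmetic as a few lines of Python. The three numbers are the averages quoted above; the function names are just for the example.

```python
# Average timings quoted above, in seconds.
NO_PROXY = 5.80
DEFAULT_PROXY = 19.4
PROXY_UNTIL = 6.2


def slowdown(measured, baseline=NO_PROXY):
    """How many times slower than talking to mysqld directly."""
    return measured / baseline


def overhead_pct(measured, baseline=NO_PROXY):
    """Extra time spent in the proxy, as a percentage of the baseline."""
    return (measured - baseline) / baseline * 100.0
```

That works out to roughly 3.3x slower for the default proxy, and about 7% overhead with proxy_until.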

Big goal for r3 is to have field and row packet parsing working. This will enable query handling from all within the proxy. Then things get a little more interesting!

Dormando's Proxy for MySQL, release 1

dormando — Tue, 21 Aug 2007 08:15:46 GMT

original post
second post

http://consoleninja.net/code/dpm/rel/dpm-r1.tar.gz - release 1 of DPM, tarballed sources. Also available via cloning git and tracking the 'release-1' tag

git clone http://consoleninja.net/code/dpm/dpm.git - to get the latest code, always
http://consoleninja.net/code/dpm/dpm-export.tar.gz - a tarball of the latest code, for those unwilling to git it.

I'd still call this a preview release, but it's too juicy to not release in some form. Early and often, right?

This is the first real release, as it actually does something. It crashes a lot less, has a few options, and is pretty hackable. Many many bugfixes since the last preview release. At this point I can do selects with 8.5 million rows of results through the proxy without much trouble, although it will still segfault if you try too hard.

The included demo lua file demonstrates very basic query rewriting. Send "HELLO;" in your mysql client and it should return the results of "SELECT 1 + 1" - you may edit this and have some fun.
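
The real demo is a lua file driven by the proxy's own API, which isn't reproduced here. As a rough, standalone illustration of the same idea, this Python sketch rewrites the payload of a COM_QUERY packet (command byte 0x03 followed by the query text). The function name and the decision to tolerate a trailing ";" are assumptions of this sketch:

```python
COM_QUERY = 0x03  # MySQL protocol command byte for a text query


def rewrite_query(payload: bytes) -> bytes:
    """Swap a bare HELLO query for SELECT 1 + 1; pass everything else through."""
    if not payload or payload[0] != COM_QUERY:
        return payload  # not a query packet, leave it alone
    query = payload[1:].decode("utf-8", errors="replace")
    # Tolerate whitespace and a trailing delimiter in case the client sends one.
    query = query.strip().rstrip(";").strip()
    if query.upper() == "HELLO":
        return bytes([COM_QUERY]) + b"SELECT 1 + 1"
    return payload
```

Since the rewritten query has a different length, a real proxy would also have to rebuild the 4-byte packet header before forwarding it to the server.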

Now that life is settling a bit I can get back into this. My goal is one release per week at least.

I've been slowed since I changed jobs (I now work at the parent company of this site as a DBA!), and Dell has delayed my new hacking machine into oblivion. I'm making do with suboptimal wired-to-the-wall hacking provisions for now.

So, submit patches, questions, requests, hate mail, etc! Or do nothing until I get five more releases out and it actually does something you might want.

Sorry for the delay! Happy hacking :)

Dormando's Proxy for MySQL, preview 2

dormando — Sat, 16 Jun 2007 09:16:16 GMT

Made some commits to my proxy for MySQL tonight. These are significant enough that they warrant another preview post.

original post

git clone http://consoleninja.net/code/dpm/dpm.git - to get the latest code, always
http://consoleninja.net/code/dpm/dpm-export.tar.gz - a tarball of the latest code, for those unwilling to git.

This feels more like a "real" preview now. Aside from one silly bug, it looks like I fixed most of the really bad stability issues. There're still a handful left, but I was pretty abusive just now and wasn't able to break it.

Grab it, build it, read the README.

The demo allows you to do some fun things:

- Connect to the proxy four times.
- Run 'show processlist' - wow, only shows one connection!
- 'use database' in one window, run 'show tables' in another. Weird :P
- Disconnect from the server, and the lua end (poorly) intercepts the COM_QUIT query and lets the client die on its own.
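
The tricks above all come from several clients sharing one server connection. A toy Python sketch of the COM_QUIT part, assuming a single shared backend and that a client's quit (command byte 0x01) should be swallowed rather than forwarded. The class and function names are hypothetical and this is not the demo's lua code:

```python
COM_QUIT = 0x01  # MySQL protocol command byte for "close the connection"


class SharedBackend:
    """Stand-in for the single server connection that every client shares."""

    def __init__(self):
        self.sent = []  # payloads that were forwarded to the server

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


def handle_client_packet(backend: SharedBackend, payload: bytes) -> bool:
    """Forward a client packet unless it is COM_QUIT. Returns True if forwarded."""
    if payload[:1] == bytes([COM_QUIT]):
        # Forwarding this would close the backend for every other client,
        # so drop it and let the quitting client go away on its own.
        return False
    backend.send(payload)
    return True
```

The same sharing explains the 'use database' oddity: session state lives on the one server connection, so every client sees whatever the last client changed.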

I got stuck for a day on one of these bugs I just fixed. Releases should be hard and fast for a while.