<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <generator uri="http://jekyllrb.com" version="3.10.0">Jekyll</generator>
  
  
  <link href="https://lalith.in/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://lalith.in/" rel="alternate" type="text/html" />
  <updated>2026-03-16T12:25:36+00:00</updated>
  <id>https://lalith.in//</id>

  
    <title type="html">Comfortably Geek</title>
  

  

  

  
  
    <entry>
      
      <title type="html">Introducing DCM</title>
      
      
      <link href="https://lalith.in/2020/10/07/introducing-dcm/" rel="alternate" type="text/html" title="Introducing DCM" />
      
      <published>2020-10-07T00:00:00+00:00</published>
      <updated>2020-10-07T00:00:00+00:00</updated>
      <id>https://lalith.in/2020/10/07/introducing-dcm</id>
      <content type="html" xml:base="https://lalith.in/2020/10/07/introducing-dcm/">&lt;p&gt;I’m happy to (finally) share our &lt;a href=&quot;/papers/dcm-osdi2020.pdf&quot;&gt;OSDI 2020 paper&lt;/a&gt; on &lt;em&gt;Declarative Cluster
Managers&lt;/em&gt; (DCM).&lt;/p&gt;

&lt;p&gt;The premise for DCM is that building modern cluster managers is notoriously hard, given that they routinely
grapple with hard combinatorial optimization problems. Think of capabilities like policy-based load balancing,
placement, scheduling, and configuration: features required not only in dedicated cluster management systems
like Kubernetes, but also in enterprise-grade distributed systems like databases and storage platforms. Today, cluster
manager developers implement such features with system-specific best-effort heuristics, which achieve
scalability by significantly sacrificing the cluster manager’s decision quality, feature set, and extensibility over
time. This is proving untenable: across the industry, solutions to largely similar cluster management problems are
routinely developed from scratch in different settings.&lt;/p&gt;

&lt;p&gt;With DCM, we propose a radically different architecture where developers specify the cluster manager’s behavior
&lt;em&gt;declaratively&lt;/em&gt;, using SQL queries over cluster state stored in a relational database. From the SQL specification, the
DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management
decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state
as an optimization problem and solves it using a constraint solver, freeing developers from having to design ad-hoc
heuristics.&lt;/p&gt;
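&lt;p&gt;To make that concrete, here’s a toy sketch, in Python with SQLite, of what phrasing a placement policy as a query over relational cluster state can look like. To be clear, this is not DCM’s actual schema or API; the tables, columns, and policy below are invented purely for illustration.&lt;/p&gt;

```python
import sqlite3

# Toy "cluster state" in a relational database. The schema and the policy
# below are invented for illustration; they are not DCM's actual tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (name TEXT, cpu_free INTEGER, mem_free INTEGER)")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
               [("node-1", 2, 4), ("node-2", 8, 16), ("node-3", 4, 8)])

# A placement policy phrased declaratively: nodes that can fit a pod
# needing 3 CPUs and 6 GB of memory, least-loaded candidates first.
rows = db.execute(
    "SELECT name FROM nodes WHERE cpu_free >= 3 AND mem_free >= 6 "
    "ORDER BY cpu_free DESC"
).fetchall()
candidates = [name for (name,) in rows]
print(candidates)
```

&lt;p&gt;DCM’s compiler goes much further than a plain query, of course: it turns such specifications into an optimization problem for a constraint solver. The snippet is only meant to convey the flavor of “cluster state as tables, policy as SQL”.&lt;/p&gt;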

&lt;p&gt;We show that DCM significantly lowers the barrier to building scalable and extensible cluster managers. We validate our
claim by powering three systems with it: a Kubernetes scheduler, a virtual machine management solution, and a
distributed transactional datastore.&lt;/p&gt;

&lt;p&gt;If you’re interested in the details, check out the &lt;a href=&quot;/papers/dcm-osdi2020.pdf&quot;&gt;paper&lt;/a&gt;. If you’d like
to try out DCM, have a look at our &lt;a href=&quot;https://github.com/vmware/declarative-cluster-management/&quot;&gt;GitHub repository&lt;/a&gt;. We
welcome all feedback, questions, and contributions!&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I’m happy to (finally) share our OSDI 2020 paper on Declarative Cluster Managers (DCM). The premise for DCM is that writing modern cluster management code is notoriously hard, given that they routinely grapple with hard combinatorial optimization problems. Think of capabilities like policy-based load balancing, placement, scheduling, and configuration, which are features not only required in dedicated cluster management systems like Kubernetes, but also in enterprise-grade distributed systems like databases and storage platforms. Today, cluster manager developers implement such features by developing system-specific best-effort heuristics, which achieve scalability by significantly sacrificing the cluster manager’s decision quality, feature set, and extensibility over time. This is proving untenable, as solutions for cluster management problems are routinely developed from scratch in the industry to solve largely similar problems across different settings. With DCM, we propose a radically different architecture where developers specify the cluster manager’s behavior declaratively, using SQL queries over cluster state stored in a relational database. From the SQL specification, the DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state as an optimization problem and solves it using a constraint solver, freeing developers from having to design ad-hoc heuristics. We show that DCM significantly lowers the barrier to building scalable and extensible cluster managers. We validate our claim by powering three systems with it: a Kubernetes scheduler, a virtual machine management solution, and a distributed transactional datastore. If you’re interested in the details, check out the paper. If you’d like to try out DCM, have a look at our Github repository. 
We welcome all feedback, questions, and contributions!</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Low-level advice for systems research</title>
      
      
      <link href="https://lalith.in/2020/09/27/Low-Level-Advice-For-Systems-Research/" rel="alternate" type="text/html" title="Low-level advice for systems research" />
      
      <published>2020-09-27T00:00:00+00:00</published>
      <updated>2020-09-27T00:00:00+00:00</updated>
      <id>https://lalith.in/2020/09/27/Low-Level-Advice-For-Systems-Research</id>
<content type="html" xml:base="https://lalith.in/2020/09/27/Low-Level-Advice-For-Systems-Research/">&lt;p&gt;There’s no shortage of “how to do research” advice on the Internet for graduate students. Such advice, while inspiring,
is extremely hard for a systems PhD student to translate into daily or weekly productivity.&lt;/p&gt;

&lt;p&gt;This is unfortunate, because I believe following good practices can offset a lot of the stress associated with systems
research. Having discussed this topic often with students, I think it’s time to blog about it.&lt;/p&gt;

&lt;p&gt;We’ll cover two broad topics: how to effectively prototype and how to run experiments systematically.&lt;/p&gt;

&lt;p&gt;On the prototyping side, we’ll first cover the tracer bullet methodology to gather incremental feedback about an idea.
We’ll then discuss the importance of testing your code.&lt;/p&gt;

&lt;p&gt;Regarding experiments, I’ll first argue for preparing your experiment infrastructure as early as possible in a
project. I’ll then share some tips on automating experiments, avoiding tunnel vision, experimentation-friendly
prototyping, and end with some advice on understanding your results.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;disclaimer&quot;&gt;Disclaimer&lt;/h3&gt;

&lt;p&gt;As with any advice from people who landed research positions after
their PhDs, remember that survivor bias is a thing. I can only say that
following these guidelines works well for me; there is no guarantee that it
will work for you.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;who-is-this-article-for&quot;&gt;Who is this article for?&lt;/h3&gt;

&lt;p&gt;This article might help you if you’re a systems PhD student, and one or more of the following bullets 
resonate with you:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You’re spending a lot of time building your system, and you’re unsure whether
it will be worth it at all.&lt;/li&gt;
  &lt;li&gt;There are many components of your system left to implement, and you’re not sure what
you should prioritize over others.&lt;/li&gt;
  &lt;li&gt;You find setting up, understanding, and debugging experiments overwhelming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-premise&quot;&gt;The premise&lt;/h3&gt;

&lt;p&gt;The premise for this article is the following comic I drew a while ago.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://lalith.in/img/codequality.png&quot; alt=&quot;drawing&quot; width=&quot;500&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;You have a finite amount of time, there’s a lot of systems work to do, there’s a submission deadline looming, 
and that worries you. As you get closer to the deadline, things will inevitably get chaotic.
This article will hopefully help you effectively plan and
alleviate some of that chaos.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;effective-prototyping&quot;&gt;Effective Prototyping&lt;/h3&gt;

&lt;p&gt;Your research prototype’s goal is to exercise a certain hypothesis. 
To validate
that hypothesis, you will eventually subject your prototype to a set of experiments.
More often than not, the experiments will compare your system against one or more baseline
systems too.&lt;/p&gt;

&lt;p&gt;Your goal through this process is to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;prototype in a way that you get continuous, incremental feedback about your idea&lt;/li&gt;
  &lt;li&gt;test for correctness during development time, so you don’t have to during experiments;
debugging performance problems and system behavior over an experiment is hard enough&lt;/li&gt;
  &lt;li&gt;build the experiment setup and infrastructure as early as possible, and preferably
even before you build the system (rather than building the system first, and 
&lt;em&gt;then&lt;/em&gt; thinking about experiments)&lt;/li&gt;
  &lt;li&gt;save time by automating the living crap out of your experiment workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;the-tracer-bullet-methodology&quot;&gt;The Tracer Bullet Methodology&lt;/h4&gt;

&lt;p&gt;I picked up this term from &lt;a href=&quot;https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X&quot;&gt;The Pragmatic Programmer&lt;/a&gt;
 book.&lt;/p&gt;

&lt;p&gt;Here’s the problem. Let’s say you have multiple components to build before your system will work end-to-end. 
Each of these components might be complex by themselves, so building them one after the other in sequence and stitching 
them together might take a lot of time. That’s risky if you don’t yet know whether this system is worth building:
you don’t want to spend a year building something, only to find out you’ve hit a dud.&lt;/p&gt;

&lt;p&gt;Instead, focus on building a working end-to-end version early that might even only work for one example input, taking as
many shortcuts as you need to get there (e.g., an interpreter that can only run the program “1 + 1” and nothing else).
From there, evolve the system to work with increasingly complex inputs, filling in gaps in your skeleton code and eating
away at hardcoded assumptions and shortcuts as you go along. The benefit of this approach is that you always see the
system work end-to-end, for more and more examples, and you receive continuous, incremental feedback on a variety of
factors throughout the process. This strategy applies at any granularity of your code: the entire system itself,
specific modules within them, or even specific functions.&lt;/p&gt;

&lt;p&gt;For example, in the &lt;a href=&quot;/papers/dcm-osdi2020.pdf&quot;&gt;DCM&lt;/a&gt; project, 
we observed that developers expend a lot of effort to hand-craft ad-hoc heuristics for cluster management
problems. We hypothesized that a compiler could instead synthesize the required implementation from an SQL 
specification. Doing so would make it easier to build cluster managers and schedulers that perform well, are flexible, 
and compute high-quality decisions.&lt;/p&gt;

&lt;p&gt;When we started working on DCM, we weren’t even sure if this idea was practical, or whether SQL was an expressive
enough language for the task at hand. Building a fully functional compiler that generated fast-enough code would have
taken a while, and we wanted to validate the core hypothesis early. So to get started, we took a cluster management
system, expressed some of its logic in SQL, and had the compiler emit code that we’d mostly written by hand. We tried
this for several cluster management policies and received steady feedback about our design’s feasibility along the way.
Importantly, even before we had filled in the compiler’s gaps, we had a working end-to-end demo running within a real
system, which gave us the required confidence to double down on the idea.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;testing&quot;&gt;Testing&lt;/h4&gt;

&lt;p&gt;Test your code thoroughly before you run experiments. The additional time you invest
in writing tests and running them on every build will be more than made up by &lt;em&gt;not&lt;/em&gt; having to debug your system for
correctness when you run experiments. Debugging performance issues and system behavior during
an experiment is daunting enough as it is: don’t compound this by debugging correctness issues
at the same time.&lt;/p&gt;

&lt;p&gt;When working on a complex system (especially one with multiple authors), tests are essential for slowing down the rate at
which bugs and regressions creep in over time. When you find a new bug in your system, add a test case that reproduces
the bug.&lt;/p&gt;

&lt;p&gt;Only use a given commit/build of your system for an experiment if it has passed all tests.&lt;/p&gt;

&lt;p&gt;Test every build and commit using a continuous integration (CI) infrastructure.
A CI system monitors a repository for new commits and runs a series of prescribed test workflows. 
There are free solutions like &lt;a href=&quot;https://travis-ci.org/&quot;&gt;Travis&lt;/a&gt; you can easily use. You can also self-host your CI
infrastructure using &lt;a href=&quot;https://docs.gitlab.com/ee/ci/&quot;&gt;Gitlab CI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A common pushback I’ve heard about rigorously testing research prototypes is “Hey! We’re trying to write a paper, not
production code!”. That statement, however, is a non sequitur. A research prototype with tests is not a production
system! I’m not suggesting that you add metrics, log retention, compliance checks, rolling upgrades, backwards
compatibility, and umpteen other things that production systems need (unless they’re relevant to your research,
obviously).&lt;/p&gt;

&lt;p&gt;Use any automated tooling available to harden your code. For example, I configure all my Java projects to use
 &lt;a href=&quot;https://github.com/checkstyle/checkstyle&quot;&gt;Checkstyle&lt;/a&gt; to enforce a coding style, and both 
 &lt;a href=&quot;https://github.com/spotbugs/spotbugs&quot;&gt;SpotBugs&lt;/a&gt; and &lt;a href=&quot;https://errorprone.info/&quot;&gt;Google ErrorProne&lt;/a&gt;
 to run a suite of static analysis passes on the code. I use &lt;a href=&quot;https://github.com/jacoco/jacoco&quot;&gt;Jacoco&lt;/a&gt; and
  tools like &lt;a href=&quot;https://codecov.io&quot;&gt;CodeCov&lt;/a&gt; for tracking code coverage of my tests. 
  Every build invokes these tools and fails the build if there are problems. The 
 CI infrastructure also runs these checks whenever I push commits to a git repository.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;experiments&quot;&gt;Experiments&lt;/h3&gt;

&lt;p&gt;A common trap that students fall into is to not think about experiments until their system is “ready”. 
This leads to the same problem we just discussed about not getting incremental feedback.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;set-up-the-experiment-infrastructure-early&quot;&gt;Set up the experiment infrastructure early&lt;/h4&gt;

&lt;p&gt;Most systems papers have one or more baselines to compare against. If so, set up the experiment infrastructure
to measure the baselines as early as possible, even &lt;em&gt;before&lt;/em&gt; you start formulating a hypothesis
about something you’d like to improve. Use the data collected from studying the baselines to formulate
your hypothesis.&lt;/p&gt;

&lt;p&gt;From my own experience, you’re likely to find problems others haven’t noticed 
(prior art can be surprisingly flimsy sometimes). You’ll also learn whether the problem you’re studying
 is even real at all. Either way, you’ll gain valuable ammunition towards formulating 
a good hypothesis.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;automate-your-experiments-like-a-maniac&quot;&gt;Automate your experiments like a maniac&lt;/h4&gt;

&lt;p&gt;The main goal for your experiment setup should be to completely automate the entire workflow. And I mean
&lt;strong&gt;the entire workflow&lt;/strong&gt;. For example, have a single script that, given a commit ID from your system’s git repository,
 checks out that commit ID, builds the artifacts, sets up or refreshes the infrastructure, 
 runs experiments with all the parameter combinations for the system and the baselines, cleans up between runs, 
 downloads all the relevant logs into an archive with the necessary metadata, processes the logs, 
 generates a report with the plots, and sends that report to a Slack channel.&lt;/p&gt;
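&lt;p&gt;A minimal sketch of such a driver, in Python: the stage commands below are hypothetical stand-ins, and a real setup would point at scripts checked into your repository. The point is only the shape: one entry point, stages chained in order, and a hard stop on the first failure.&lt;/p&gt;

```python
import subprocess

# Hypothetical stage commands, for illustration only.
def pipeline_stages(commit_id):
    return [
        ["git", "checkout", commit_id],
        ["./build_artifacts.sh"],
        ["./provision_infra.sh"],
        ["./run_experiments.sh", "--all-param-combinations"],
        ["./collect_logs.sh", commit_id],
        ["./make_report.sh", commit_id],
    ]

def run_pipeline(commit_id, run=subprocess.run):
    # Chain every stage and abort on the first failure, so a broken
    # build or half-provisioned cluster never silently yields a report.
    for cmd in pipeline_stages(commit_id):
        if run(cmd).returncode != 0:
            return False
    return True
```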

&lt;p&gt;When running experiments, make sure to collect not only log data and traces but also the experiment’s metadata.
Common bits of metadata that I tend to record include a timestamp for when I triggered the experiment, 
the timestamps for the individual experiment runs and repetitions, the relevant git commit IDs, metadata
about the environment (e.g., VM sizes, cluster information), and all the parameter combinations that were run. 
Save all the metadata to a file in a machine-friendly format (like CSV or JSON). 
Avoid encoding experiment metadata into file or folder names (“experiment_1_param1_param2_param3”). This makes it hard
to introduce changes over time (e.g., adding a new parameter will likely break your workflow).
Propagate the experiment metadata all the way to your graphs, to be sure you know what commit ID and parameter
 combinations produced a given set of results. Aggressively pepper the workflow with sanity checks (e.g.,
 a graph should never present data obtained from two different commits of your system).&lt;/p&gt;
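&lt;p&gt;As a sketch of what recording metadata in a machine-friendly format can look like (the field names and values here are illustrative, not a prescribed schema):&lt;/p&gt;

```python
import json
import time

def write_metadata(path, commit_id, params, environment):
    # One JSON file per experiment run, instead of cramming metadata
    # into file names. Field names are illustrative.
    metadata = {
        "triggered_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "commit_id": commit_id,
        "environment": environment,
        "params": params,
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

meta = write_metadata(
    "/tmp/run_metadata.json",
    commit_id="0f3c2ab",
    params={"batch_size": [32, 64], "threads": [1, 8]},
    environment={"vm_size": "m5.xlarge", "cluster_nodes": 4},
)
```

&lt;p&gt;Adding a new parameter later is then a one-line change to the dictionary, rather than a renaming exercise across every results folder.&lt;/p&gt;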

&lt;p&gt;Never, ever, configure your infrastructure manually (e.g., bringing up VMs on AWS using the EC2 GUI, followed by logging
 into the VMs and
installing software libraries and configuring them). Instead, always script these workflows up 
(I like using &lt;a href=&quot;https://www.ansible.com/&quot;&gt;Ansible&lt;/a&gt; for such tasks).
The reason is that accumulating ad-hoc tweaks and commands with side effects (like changing the OS configuration)
impairs reproducibility. Instead, be disciplined about only introducing changes to the infrastructure via a set of
well-maintained scripts. They come in handy especially when unexpected failures happen and you need to migrate to new
infrastructure (literally every project I’ve worked on had to go through this!).&lt;/p&gt;

&lt;p&gt;The sooner you start with the above workflows, the better. Once it’s up, just like your tests, 
you’ll be able to continuously run experiments as part of your development process, and 
make sure your system is on track as far as your evaluation goals go. Every commit gets backed by a report
full of data about how that change affected your system’s metrics of concern. The experiment workflow then becomes 
a part of your testing and regression suites.&lt;/p&gt;

&lt;p&gt;The tracer bullet methodology applies to your experiment setup just as much as it does to the system you’re
building. You’ll find your experiment setup evolving in tandem with your system: you’ll add new logging and 
tracing points, you’ll add new parameter combinations you want to test, and new baselines.&lt;/p&gt;

&lt;p&gt;I have a fairly standard workflow I use for every project. It starts with using 
&lt;a href=&quot;https://www.ansible.com/&quot;&gt;Ansible&lt;/a&gt; to set up the infrastructure (e.g., bring up and configure VMs on EC2), 
deploy the artifacts, run the experiments, and collect the logs. I use Python to parse the raw logs and produce 
an &lt;a href=&quot;https://sqlite.org/index.html&quot;&gt;SQLite&lt;/a&gt; database with the necessary traces and experiment metadata. 
I then use &lt;a href=&quot;https://www.r-project.org/&quot;&gt;R&lt;/a&gt; to analyze the 
traces, &lt;a href=&quot;https://ggplot2.tidyverse.org/&quot;&gt;ggplot&lt;/a&gt; for plotting, and &lt;a href=&quot;https://rmarkdown.rstudio.com/&quot;&gt;RMarkdown&lt;/a&gt; to 
produce a report that I can then send to a Slack channel. A single 
top-level bash script takes a commit ID and chains all the previous steps together.&lt;/p&gt;
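&lt;p&gt;As an illustration of the “Python parses raw logs into SQLite” step, here is a minimal sketch; the log format is invented, and a real parser would also join in the experiment metadata discussed earlier.&lt;/p&gt;

```python
import sqlite3

# An invented log format: timestamp, metric name, value.
RAW_LOG = """\
1696600000.120 request_latency_ms 12.5
1696600000.480 request_latency_ms 9.1
1696600001.020 request_latency_ms 15.8"""

def logs_to_db(raw_log, commit_id):
    # One row per trace point, tagged with the commit that produced it,
    # so the metadata survives all the way to the plots.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE traces (ts REAL, metric TEXT, value REAL, commit_id TEXT)")
    for line in raw_log.splitlines():
        ts, metric, value = line.split()
        db.execute("INSERT INTO traces VALUES (?, ?, ?, ?)",
                   (float(ts), metric, float(value), commit_id))
    return db

db = logs_to_db(RAW_LOG, "0f3c2ab")
count, avg = db.execute("SELECT COUNT(*), AVG(value) FROM traces").fetchone()
```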

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;avoiding-tunnel-vision&quot;&gt;Avoiding tunnel-vision&lt;/h4&gt;

&lt;p&gt;A common risk with system building is to fall into the experiment tunnel-vision trap. For example, frantically trying to
improve performance for a benchmark or metric that may not matter, while ignoring those that are important to assess
your system.&lt;/p&gt;

&lt;p&gt;There are two things you can do to avoid this trap.&lt;/p&gt;

&lt;p&gt;First, try to sketch out the first few paragraphs of your paper’s evaluation section as early as possible.
This makes you think carefully about the main theses you’d like your evaluation to support.
I find this simple trick helps me stay focused when planning experiments (and it has often made me realize my priorities
were wrong!). You’ll find yourself refining both the text for the evaluation section and the experiment workflow over time.&lt;/p&gt;

&lt;p&gt;The second is to follow what I recommended in the previous section: wait to see a report with data about all your
experiments before iterating on a change to your system. Otherwise, you’ll be stuck in experiment whack-a-mole limbo!
Make sure your report contains the same graphs and data that you plan to add to the evaluation section.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;experimentation-friendly-prototyping&quot;&gt;Experimentation-friendly prototyping&lt;/h4&gt;

&lt;p&gt;Build experimentation-friendly prototypes. Always use feature flags and configuration parameters for your system to
toggle different settings (including log levels). If you find yourself modifying and recompiling your code only to
enable/disable a certain feature, you’re doing it &lt;strong&gt;completely wrong&lt;/strong&gt;.&lt;/p&gt;
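&lt;p&gt;In Python, for instance, this can be as simple as an argparse-based front end; the flag names below are invented for illustration:&lt;/p&gt;

```python
import argparse

def build_parser():
    # Every toggle is a command-line flag, so changing a setting never
    # requires recompiling. Flag names are illustrative.
    p = argparse.ArgumentParser(description="research prototype")
    p.add_argument("--enable-batching", action="store_true")
    p.add_argument("--enable-caching", action="store_true")
    p.add_argument("--log-level", default="info",
                   choices=["debug", "info", "warn", "error"])
    return p

args = build_parser().parse_args(["--enable-caching", "--log-level", "debug"])
```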

&lt;p&gt;Here’s one scenario where feature flags come in handy. Often, a paper proposes a collection of techniques that when
combined, improve over the state-of-the-art (e.g., five new optimization passes in a compiler). If your paper fits that
description, always make sure to have the quintessential “look how X changes as we turn on these features, one after the
other” experiment. If you only test the combination of these features and not their individual contributions, you (and
the readers) won’t know if some of your proposed features &lt;em&gt;negatively&lt;/em&gt; impact your system!&lt;/p&gt;
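&lt;p&gt;Generating the configurations for that experiment is trivial once everything is behind flags. A sketch, with invented feature names: start from the baseline with everything off, then enable one feature at a time cumulatively.&lt;/p&gt;

```python
# Invented feature names, for illustration.
FEATURES = ["batching", "caching", "prefetching"]

def cumulative_configs(features):
    # Baseline first (everything off), then each feature enabled on top
    # of the previous ones: the classic ablation ladder.
    configs = [set()]
    for i in range(len(features)):
        configs.append(set(features[: i + 1]))
    return configs

configs = cumulative_configs(FEATURES)
```

&lt;p&gt;Per-feature ablations (each feature off while the rest stay on) are an equally easy variation, and help catch a feature that hurts.&lt;/p&gt;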

&lt;p&gt;And no, for the majority of systems projects, a few additional &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt; statements are hardly going to affect your
performance (actual pushback I’ve heard!). If production-ready JVMs can ship with hundreds of such flags, your research
prototype can too.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;measure-measure-measure&quot;&gt;Measure, measure, measure&lt;/h4&gt;

&lt;p&gt;Systems are unfortunately way too complex. If you want to understand what happened over an experiment, there is no
substitute for measuring aggressively. Log any data you think will help you understand what’s going on, even if it is
data that you won’t present in a paper. If in doubt, always over-measure rather than under-measure.&lt;/p&gt;

&lt;p&gt;To borrow a quote from &lt;a href=&quot;https://web.stanford.edu/~ouster/cgi-bin/sayings.php&quot;&gt;John Ousterhout&lt;/a&gt;: “Use your intuition to
ask questions, not answer them”. Often, when faced with a performance problem or a bug, it’s tempting to assume you know why
(your intuition) and introduce changes to fix that problem. Don’t! Treat your intuition as a hypothesis, and dig in
further to confirm exactly why that problem occurs and how it manifests. For example, expand your workflow with more logs or
set up additional experiments to control for the suspected factors.&lt;/p&gt;

&lt;p&gt;A particularly dangerous example I’ve seen is to declare victory the moment one sees their system beat the baseline 
in end-to-end metrics. For example, “our nifty algorithm improves transaction throughput by 800x over baseline Y” 
(ratios that are quite fashionable these days!).
Again, don’t assume it was because of your algorithm, but follow up with the required lower-level analysis to confirm you can
 thoroughly explain &lt;em&gt;why&lt;/em&gt; the performance disparity exists. 
Some basic questions you can ask of your data, based on unfortunate examples I’ve encountered in the wild:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;are you sure your database is not faster on reads because all reads return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;?&lt;/li&gt;
  &lt;li&gt;does your “Big Data” working set fit in an L3 cache?&lt;/li&gt;
  &lt;li&gt;is your system dropping requests whereas the baseline isn’t?&lt;/li&gt;
  &lt;li&gt;are you measuring latency differently for the baseline vs your system (round-trip vs one-way)?&lt;/li&gt;
  &lt;li&gt;are you introducing competition side effects [1] (section 1.5.3)?&lt;/li&gt;
  &lt;li&gt;are the baselines faster than your system, but you’ve driven them to congestion collapse?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several aspects of measurement methodology that I think every systems student should know. For example,
the relationship between latency and throughput, 
open vs closed loop workload generation, how to understand bottlenecks, 
how to summarize performance data, the use of confidence intervals, and much more. 
I highly recommend these books and articles to learn more:&lt;/p&gt;

&lt;p&gt;[1] &lt;a href=&quot;https://perfeval.epfl.ch/&quot;&gt;“Performance Evaluation of Computer and Communication Systems”&lt;/a&gt;, Jean-Yves Le Boudec.&lt;br /&gt;
[2] &lt;a href=&quot;https://www.cse.wustl.edu/~jain/books/perfbook.htm&quot;&gt;“The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”&lt;/a&gt;, Raj Jain.&lt;br /&gt;
[3] &lt;a href=&quot;https://www.informit.com/store/mathematical-foundations-of-computer-networking-9780321792105&quot;&gt;“Mathematical foundations of computer networking”&lt;/a&gt;, Srinivasan Keshav.&lt;br /&gt;
[4] &lt;a href=&quot;http://gernot-heiser.org/benchmarking-crimes.html&quot;&gt;“Systems Benchmarking Crimes”&lt;/a&gt;, Gernot Heiser.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">There’s no shortage of “how to do research advice” on the Internet for graduate students. Such advice, while inspiring, is extremely hard to translate into daily or weekly productivity as a systems PhD student. This is unfortunate, because I believe following good practices can offset a lot of the stress associated with systems research. Having discussed this topic often with students, I think it’s time to blog about it. We’ll cover two broad topics: how to effectively prototype and how to run experiments systematically. On the prototyping side, we’ll first cover the tracer bullet methodology to gather incremental feedback about an idea. We’ll then discuss the importance of testing your code. Regarding experiments, I’ll first argue for preparing your experiment infrastructure as early as possible in a project. I’ll then share some tips on automating experiments, avoiding tunnel vision, experimentation-friendly prototyping, and end with some advice on understanding your results. Disclaimer As with any advice from people who landed research positions after their PhDs, remember that survivor bias is a thing. I can only say that following these guidelines work well for me, there is no guarantee that it will work for you. Who is this article for? This article might help you if you’re a systems PhD student, and one or more of the following bullets resonate with you: You’re spending a lot of time building your system, and you’re unsure whether it will be worth it at all. There are many components of your system left to implement, and you’re not sure what you should prioritize over others. You find setting up, understanding, and debugging experiments overwhelming. The premise The premise for this article is the following comic I drew a while ago. You have a finite amount of time, there’s a lot of systems work to do, there’s a submission deadline looming, and that worries you. 
As you get closer to the deadline, things will inevitably get chaotic. This article will hopefully help you effectively plan and alleviate some of that chaos. Effective Prototyping Your research prototype’s goal is to exercise a certain hypothesis. To validate that hypothesis, you will eventually subject your prototype to a set of experiments. More often than not, the experiments will compare your system against one or more baseline systems too. Your goal through this process is to: prototype in a way that you get continuous, incremental feedback about your idea test for correctness during development time, so you don’t have to during experiments; debugging performance problems and system behavior over an experiment is hard enough build the experiment setup and infrastructure as early as possible, and preferably even before you build the system (rather than building the system first, and then thinking about experiments) Save time by automating the living crap out of your experiment workflow The Tracer Bullet Methodology I picked up this term from The Pragmatic Programmer book. Here’s the problem. Let’s say you have multiple components to build before your system will work end-to-end. Each of these components might be complex by themselves, so building them one after the other in sequence and stitching them together might take a lot of time. That’s risky if you don’t yet know whether this system is worth building: you don’t want to spend a year building something, only to find out you’ve hit a dud. Instead, focus on building a working end-to-end version early that might even only work for one example input, taking as many shortcuts as you need to get there (e.g., an interpreter that can only run the program “1 + 1” and nothing else). From there evolve the system to work with increasingly complex inputs, filling in gaps in your skeleton code and eating away at hardcoded assumptions and shortcuts as you go along. 
The benefit of this approach is that you always see the system work end-to-end, for more and more examples, and you receive continuous, incremental feedback on a variety of factors throughout the process. This strategy applies at any granularity of your code: the entire system itself, specific modules within them, or even specific functions. For example, in the DCM project, we observed that developers expend a lot of effort to hand-craft ad-hoc heuristics for cluster management problems. We hypothesized that a compiler could instead synthesize the required implementation from an SQL specification. Doing so would make it easier to build cluster managers and schedulers that perform well, are flexible, and compute high-quality decisions. When we started working on DCM, we weren’t even sure if this idea was practical, and whether SQL was an expressive enough language for the task-at-hand. Building a fully functional compiler that generated fast-enough code would have taken a while, and we wanted to validate the core hypothesis early. So to get started, we took a cluster management system, expressed some of its logic in SQL, and had the compiler emit code that we’d mostly written by hand. We tried this for several cluster management policies and received steady feedback about our design’s feasibility along the way. Importantly, even before we had filled in the compiler’s gaps, we had a working end-to-end demo running within a real system, which gave us the required confidence to double down on the idea. Testing Test your code thoroughly before you run experiments. The additional time you invest in writing tests and running them on every build will be more than made up by not having to debug your system for correctness when you run experiments. Debugging performance issues and system behavior during an experiment is daunting enough as it is: don’t compound this by debugging correctness issues at the same time. 
When working on a complex system (especially with multiple authors), tests are essential to slowing down the rate at which bugs and regressions creep in over time. When you find a new bug in your system, add a new test case that reproduces the bug. Only use a given commit/build of your system for an experiment if it has passed all tests.

Test every build and commit using a continuous integration (CI) infrastructure. A CI system monitors a repository for new commits and runs a series of prescribed test workflows. There are free solutions like Travis you can easily use. You can also self-host your CI infrastructure using Gitlab CI.

A common pushback I’ve heard about rigorously testing research prototypes is “Hey! We’re trying to write a paper, not production code!”. That statement, however, is a non sequitur. A research prototype with tests is not a production system! I’m not suggesting that you add metrics, log retention, compliance checks, rolling upgrades, backwards compatibility, and umpteen other things that production systems need (unless they’re relevant to your research, obviously).

Use any automated tooling available to harden your code. For example, I configure all my Java projects to use Checkstyle to enforce a coding style, and both SpotBugs and Google ErrorProne to run a suite of static analysis passes on the code. I use Jacoco and tools like CodeCov for tracking code coverage of my tests. Every build invokes these tools and fails the build if there are problems. The CI infrastructure also runs these checks whenever I push commits to a git repository.

Experiments

A common trap that students fall into is to not think about experiments until their system is “ready”. This leads to the same problem we just discussed about not getting incremental feedback.

Set up the experiment infrastructure early

Most systems papers have one or more baselines to compare against.
If so, set up the experiment infrastructure to measure the baselines as early as possible, even before you start formulating a hypothesis about something you’d like to improve. Use the data collected from studying the baselines to formulate your hypothesis. From my own experience, you’re likely to find problems others haven’t noticed (prior art can be surprisingly flimsy sometimes). You’ll also learn whether the problem you’re studying is even real at all. Either way, you’ll gain valuable ammunition towards formulating a good hypothesis.

Automate your experiments like a maniac

The main goal for your experiment setup should be to completely automate the entire workflow. And I mean the entire workflow. For example, have a single script that, given a commit ID from your system’s git repository, checks out that commit ID, builds the artifacts, sets up or refreshes the infrastructure, runs experiments with all the parameter combinations for the system and the baselines, cleans up between runs, downloads all the relevant logs into an archive with the necessary metadata, processes the logs, generates a report with the plots, and sends that report to a Slack channel.

When running experiments, make sure to not only collect log data and traces but also the experiment’s metadata. Common bits of metadata that I tend to record include a timestamp for when I triggered the experiment, the timestamps for the individual experiment runs and repetitions, the relevant git commit IDs, metadata about the environment (e.g., VM sizes, cluster information), and all the parameter combinations that were run. Save all the metadata to a file in a machine-friendly format (like CSV or JSON). Avoid encoding experiment metadata into file or folder names (“experiment_1_param1_param2_param3”). This makes it hard to introduce changes over time (e.g., adding a new parameter will likely break your workflow).
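To make the metadata advice concrete, here is a minimal sketch of dumping experiment metadata to a JSON file instead of a folder name. Every field name below is illustrative, not a prescribed schema:

```python
# Hypothetical sketch: record experiment metadata in a machine-friendly file
# rather than encoding it into folder names. All field names are made up.
import json
import time

metadata = {
    "triggered_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "system_commit": "abc1234",   # git commit ID the artifacts were built from
    "baseline_commit": "def5678",
    "environment": {"vm_size": "m5.xlarge", "cluster_nodes": 50},
    "parameters": {"num_clients": 128, "batch_size": 16},
    "repetitions": 5,
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Adding a new parameter is then a one-line change to the dictionary, and the log-processing scripts can join these fields against the traces so the commit ID travels all the way to the plots.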
Propagate the experiment metadata all the way to your graphs, to be sure you know what commit ID and parameter combinations produced a given set of results. Aggressively pepper the workflow with sanity checks (e.g., a graph should never present data obtained from two different commits of your system).

Never, ever, configure your infrastructure manually (e.g., bringing up VMs on AWS using the EC2 GUI, followed by logging into the VMs and installing software libraries and configuring them). Instead, always script these workflows up (I like using Ansible for such tasks). The reason is that accumulating ad-hoc tweaks and commands with side effects (like changing the OS configuration) impairs reproducibility. Be disciplined about only introducing changes to the infrastructure via a set of well-maintained scripts. They come in handy especially when unexpected failures happen, and you need to migrate to a new infrastructure (literally every project I’ve worked on had to go through this!).

The sooner you start with the above workflows, the better. Once the workflow is up, just like your tests, you’ll be able to continuously run experiments as part of your development process, and make sure your system is on track as far as your evaluation goals go. Every commit gets backed by a report full of data about how that change affected your system’s metrics of concern. The experiment workflow then becomes a part of your testing and regression suites. The tracer bullet methodology applies to your experiment setup just as much as it does to the system you’re building. You’ll find your experiment setup evolving in tandem with your system: you’ll add new logging and tracing points, new parameter combinations you want to test, and new baselines.

I have a fairly standard workflow I use for every project. It starts with using Ansible to set up the infrastructure (e.g., bring up and configure VMs on EC2), deploy the artifacts, run the experiments, and collect the logs.
I use Python to parse the raw logs and produce an SQLite database with the necessary traces and experiment metadata. I then use R to analyze the traces, ggplot for plotting, and RMarkdown to produce a report that I can then send to a Slack channel. A single top-level bash script takes a commit ID and chains all the previous steps together.

Avoiding tunnel-vision

A common risk with system building is to fall into the experiment tunnel-vision trap: frantically trying to improve performance for a benchmark or metric that may not matter, for example, while ignoring those that are important to assess your system. There are two things you can do to avoid this trap. First, try to sketch out the first few paragraphs of your paper’s evaluation section as early as possible. This makes you think carefully about the main theses you’d like your evaluation to support. I find this simple trick helps me stay focused when planning experiments (and it often made me realize my priorities were wrong!). You’ll find yourself refining both the text for the evaluation section and the experiment workflow over time. The second is to use what I recommended in the previous section: wait to see a report with data about all your experiments before iterating on a change to your system. Otherwise, you’ll be stuck in experiment whack-a-mole limbo! Make sure your report contains the same graphs and data that you plan to add to the evaluation section.

Experimentation-friendly prototyping

Build experimentation-friendly prototypes. Always use feature flags and configuration parameters for your system to toggle different settings (including log levels). If you find yourself modifying and recompiling your code only to enable/disable a certain feature, you’re doing it completely wrong. Here’s one scenario where feature flags come in handy. Often, a paper proposes a collection of techniques that, when combined, improve over the state-of-the-art (e.g., five new optimization passes in a compiler).
If your paper fits that description, always make sure to have the quintessential “look how X changes as we turn on these features, one after the other” experiment. If you only test the combination of these features and not their individual contributions, you (and the readers) won’t know if some of your proposed features negatively impact your system! And no, for the majority of systems projects, a few additional if statements are hardly going to affect your performance (actual pushback I’ve heard!). If production-ready JVMs can ship with hundreds of such flags, your research prototype can too.

Measure, measure, measure

Systems are unfortunately way too complex. If you want to understand what happened over an experiment, there is no substitute for measuring aggressively. Log any data you think will help you understand what’s going on, even if it is data that you won’t present in a paper. If in doubt, always over-measure rather than under-measure. To borrow a quote from John Ousterhout: “Use your intuition to ask questions, not answer them”. Often, when faced with a performance problem or a bug, it’s tempting to assume you know why (your intuition) and introduce changes to fix that problem. Don’t! Treat your intuition as a hypothesis, and dig in further to confirm exactly why that problem occurs and how it manifests. For example, expand your workflow with more logs or set up additional experiments to control for the suspected factors. A particularly dangerous example I’ve seen is to declare victory the moment one sees their system beat the baseline in end-to-end metrics. For example, “our nifty algorithm improves transaction throughput by 800x over baseline Y” (ratios that are quite fashionable these days!). Again, don’t assume it was because of your algorithm, but follow up with the required lower-level analysis to confirm you can thoroughly explain why the performance disparity exists.
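Coming back to feature flags: a minimal sketch of what environment-variable-driven flags might look like, so a deployed build can toggle features without recompiling. The flag names and the toy pipeline are made up for illustration:

```python
# Hypothetical sketch of feature flags read from the environment. Flag names
# and the toy pipeline below are illustrative, not from any real system.
import os

def flag(name, default="off"):
    # A flag is "on" only if explicitly set to "on" in the environment.
    return os.environ.get(name, default) == "on"

def run_pipeline(data):
    # Each optimization is independently toggleable, which enables the
    # "turn features on one after the other" experiment.
    result = sorted(data) if flag("OPT_PRESORT") else list(data)
    if flag("OPT_DEDUP"):
        result = list(dict.fromkeys(result))  # order-preserving de-dup
    return result
```

Your experiment scripts can then sweep over flag combinations to produce the feature-by-feature breakdown, with no rebuilds in between.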
Some basic questions you can ask your data, based on unfortunate examples I’ve encountered in the wild:

- are you sure your database is not faster on reads because all reads return null?
- does your “Big Data” working set fit in an L3 cache?
- is your system dropping requests whereas the baseline isn’t?
- are you measuring latency differently for the baseline vs your system (round-trip vs one-way)?
- are you introducing competition side effects [1] (section 1.5.3)?
- are the baselines faster than your system, but you’ve driven them to congestion collapse?

There are several aspects of measurement methodology that I think every systems student should know. For example, the relationship between latency and throughput, open vs closed loop workload generation, how to understand bottlenecks, how to summarize performance data, the use of confidence intervals, and much more. I highly recommend these books and articles to learn more:

[1] “Performance Evaluation of Computer and Communication Systems”, Jean-Yves Le Boudec.
[2] “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”, Raj Jain.
[3] “Mathematical foundations of computer networking”, Srinivasan Keshav.
[4] “Systems Benchmarking Crimes”, Gernot Heiser</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">State Of Valhalla</title>
      
      
      <link href="https://lalith.in/2020/03/25/State-of-Valhalla/" rel="alternate" type="text/html" title="State Of Valhalla" />
      
      <published>2020-03-25T00:00:00+00:00</published>
      <updated>2020-03-25T00:00:00+00:00</updated>
      <id>https://lalith.in/2020/03/25/State-of-Valhalla</id>
      <content type="html" xml:base="https://lalith.in/2020/03/25/State-of-Valhalla/">&lt;p&gt;I highly recommend the latest &lt;a href=&quot;http://cr.openjdk.java.net/~briangoetz/valhalla/sov/01-background.html&quot;&gt;“State of
Valhalla”&lt;/a&gt;
document by Brian Goetz if you’re interested in the Java language (or just
language runtimes in general). It gives an accessible overview of
this huge shakeup of the JVM, made possible by the breakthrough observation
that inline classes and object references can simply re-use the JVM’s
“L-carrier type”.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I highly recommend the latest “State of Valhalla” document by Brian Goetz if you’re interested in the Java language (or just language runtimes in general). It gives an accessible overview of this huge shakeup of the JVM, made possible by the breakthrough observation that inline classes and object references can simply re-use the JVM’s “L-carrier type”.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Who Do I Sue?</title>
      
      
      <link href="https://lalith.in/2018/10/03/Who-Do-I-Sue/" rel="alternate" type="text/html" title="Who Do I Sue?" />
      
      <published>2018-10-03T00:00:00+00:00</published>
      <updated>2018-10-03T00:00:00+00:00</updated>
      <id>https://lalith.in/2018/10/03/Who-Do-I-Sue</id>
      <content type="html" xml:base="https://lalith.in/2018/10/03/Who-Do-I-Sue/">&lt;p&gt;&lt;em&gt;“Thus I will claim that the future of technology will be less determined by
what technology can do, than social, legal and other restraints on what we can
do.  Thus, if you stop to think about computer-controlled highway traffic – it
sounds good to you but ask yourself: who do I sue in an accident?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;– Richard Hamming, two decades before self-driving cars were a thing.&lt;/p&gt;

&lt;p&gt;You can see the lecture &lt;a href=&quot;https://youtu.be/AD4b-52jtos?t=1807&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">“Thus I will claim that the future of technology will be less determined by what technology can do, than social, legal and other restraints on what we can do. Thus, if you stop to think about computer-controlled highway traffic – it sounds good to you but ask yourself: who do I sue in an accident?” – Richard Hamming, two decades before self-driving cars were a thing. You can see the lecture here.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Rapid: Stable and Consistent Membership at Scale</title>
      
      
      <link href="https://lalith.in/2018/09/13/Rapid/" rel="alternate" type="text/html" title="Rapid: Stable and Consistent Membership at Scale" />
      
      <published>2018-09-13T00:00:00+00:00</published>
      <updated>2018-09-13T00:00:00+00:00</updated>
      <id>https://lalith.in/2018/09/13/Rapid</id>
      <content type="html" xml:base="https://lalith.in/2018/09/13/Rapid/">&lt;p&gt;This post gives an overview of our recent work on the Rapid system, presented
at USENIX ATC 2018 (you can find the paper, slides and presentation audio &lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;here
&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Rapid is a scalable, distributed membership service
– it allows processes to form clusters and receive notifications when the
   membership changes.&lt;/p&gt;

&lt;h4 id=&quot;why-design-another-membership-service&quot;&gt;&lt;em&gt;Why design another membership service?&lt;/em&gt;&lt;/h4&gt;

&lt;p&gt;Rapid is motivated by two key challenges.&lt;/p&gt;

&lt;p&gt;First, we observe that datacenter failure scenarios are not always crash
failures, but commonly involve misconfigured firewalls, one-way connectivity
loss, flip-flops in reachability, and some-but-not-all packets being dropped.
However, existing membership solutions struggle with these common failure
scenarios, despite being able to cleanly detect crash faults. In particular,
existing tools take a long time to converge, or never converge, to a stable state where the
faulty processes are removed. This is problematic in any distributed system:
membership changes often trigger failure recovery workflows, and repeatedly
triggering these workflows can not only degrade system performance, but also
cause widespread outages.&lt;/p&gt;

&lt;p&gt;Second, we note that inconsistent membership views in a system pose a
challenging programming abstraction for developers, especially for building
critical features in a distributed system like failure recovery. Without
strong consistency semantics, developers need to write code where a process
can make no assumptions about the world view of the rest of the system.&lt;/p&gt;

&lt;h4 id=&quot;enter-rapid&quot;&gt;&lt;em&gt;Enter Rapid&lt;/em&gt;&lt;/h4&gt;

&lt;p&gt;Rapid is a scalable, distributed membership system that is &lt;em&gt;stable&lt;/em&gt; in the
face of a diverse range of failure scenarios, and provides participating
processes a &lt;em&gt;strongly consistent view&lt;/em&gt; of the system’s membership. In
particular, Rapid guarantees that all processes see the same sequence
of membership changes to the system.&lt;/p&gt;

&lt;p&gt;Rapid drives membership changes through three steps: maintaining a monitoring
overlay, identifying a membership change proposal from monitoring alerts, and
arriving at agreement among processes on a proposal. In each of these steps,
we make design decisions that contribute to our goals of stability and
consistency.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/rapid-flow.jpg&quot; alt=&quot;Configuration changes in Rapid&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expander-based monitoring edge overlay&lt;/strong&gt;. Rapid organizes a set of processes
    (a configuration) into a stable failure detection topology comprising
    &lt;em&gt;observers&lt;/em&gt; that monitor and disseminate reports about their communication
    edges to their &lt;em&gt;subjects&lt;/em&gt;. The monitoring relationships between processes
    forms a directed expander graph with strong connectivity properties, which
    ensures with a high probability that healthy processes detect failures. We
    interpret &lt;em&gt;multiple&lt;/em&gt; reports about a subject’s edges as a high-fidelity
    signal that the subject is faulty. The monitoring edges represent &lt;em&gt;edge
    failure detectors&lt;/em&gt; that are pluggable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-process cut detection&lt;/strong&gt;. For stability, processes in Rapid (i) suspect
    a faulty process &lt;em&gt;p&lt;/em&gt; only upon receiving alerts from multiple observers of
    &lt;em&gt;p&lt;/em&gt;, and (ii) delay acting on alerts about different processes until the
    churn stabilizes, thereby converging to detect a global, possibly
    multi-node &lt;em&gt;cut&lt;/em&gt; of processes to add or remove from the membership. This
    filter is remarkably simple to implement, yet it suffices by itself to
    achieve &lt;em&gt;almost-everywhere agreement&lt;/em&gt; – unanimity among a large fraction of
    processes about the detected cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical consensus&lt;/strong&gt;. For consistency, we show that converting
    almost-everywhere agreement into full agreement is practical even in
    large-scale settings. Rapid’s consensus protocol drives configuration
    changes by a low-overhead, leaderless protocol in the common case: every
    process simply validates consensus by counting the number of identical cut
    detections. If there is a quorum containing three-quarters of the
    membership set with the same cut, then without a leader or further
    communication, this is a safe consensus decision.&lt;/p&gt;

&lt;h4 id=&quot;how-well-does-it-work&quot;&gt;&lt;em&gt;How well does it work?&lt;/em&gt;&lt;/h4&gt;

&lt;p&gt;We experimented with Rapid in moderately scalable settings comprising
1000-2000 nodes.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;paper&lt;/a&gt; has
detailed head-to-head comparisons against widely used membership solutions,
like Akka Cluster, ZooKeeper and Memberlist. The higher-order bit is that
these solutions deal well with crash failures and network partitions, but
face stability issues under complex network failure scenarios like asymmetric
reachability issues and high-packet loss. Rapid is stable under these
circumstances because of its expander-based monitoring overlay that can better
localize faults and its approach of removing entire cuts of faulty processes.&lt;/p&gt;

&lt;p&gt;At the same time, Rapid is also &lt;em&gt;fast&lt;/em&gt;. It can bootstrap a 2000 node cluster
2-5.8x faster than the alternatives we compared against, despite the fact that
Rapid offers much stronger guarantees around stability and consistency.&lt;/p&gt;

&lt;p&gt;Lastly, Rapid is comparable in cost to Memberlist (a gossip-based protocol) in
terms of network bandwidth and memory utilization.&lt;/p&gt;

&lt;p&gt;We also found Rapid easy to integrate in two existing applications:
a distributed transaction data-store, and a service discovery use case.&lt;/p&gt;

&lt;p&gt;We believe the insights we identify in Rapid are &lt;em&gt;easy&lt;/em&gt; to apply to existing
systems, and are happy to work with you on your use case. Feel free to reach
out to me if you’d like to chat!&lt;/p&gt;

&lt;h4 id=&quot;references&quot;&gt;References&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;USENIX ATC 2018 paper, slides and
presentation&lt;/a&gt;
&lt;br /&gt;
&lt;a href=&quot;https://github.com/lalithsuresh/rapid/&quot;&gt;Code on Github&lt;/a&gt; 
&lt;br /&gt;
&lt;a href=&quot;https://research.vmware.com/&quot;&gt;VMware Research
Group&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">This post gives an overview of our recent work on the Rapid system, presented at USENIX ATC 2018 (you can find the paper, slides and presentation audio here). Rapid is a scalable, distributed membership service – it allows processes to form clusters and receive notifications when the membership changes. Why design another membership service? Rapid is motivated by two key challenges. First, we observe that datacenter failure scenarios are not always crash failures, but commonly involve misconfigured firewalls, one-way connectivity loss, flip-flops in reachability, and some-but-not-all packets being dropped. However, existing membership solutions struggle with these common failure scenarios, despite being able to cleanly detect crash faults. In particular, existing tools take a long time to converge, or never converge, to a stable state where the faulty processes are removed. This is problematic in any distributed system: membership changes often trigger failure recovery workflows, and repeatedly triggering these workflows can not only degrade system performance, but also cause widespread outages. Second, we note that inconsistent membership views in a system pose a challenging programming abstraction for developers, especially for building critical features in a distributed system like failure recovery. Without strong consistency semantics, developers need to write code where a process can make no assumptions about the world view of the rest of the system. Enter Rapid Rapid is a scalable, distributed membership system that is stable in the face of a diverse range of failure scenarios, and provides participating processes a strongly consistent view of the system’s membership. In particular, Rapid guarantees that all processes see the same sequence of membership changes to the system. 
Rapid drives membership changes through three steps: maintaining a monitoring overlay, identifying a membership change proposal from monitoring alerts, and arriving at agreement among processes on a proposal. In each of these steps, we make design decisions that contribute to our goals of stability and consistency. Expander-based monitoring edge overlay. Rapid organizes a set of processes (a configuration) into a stable failure detection topology comprising observers that monitor and disseminate reports about their communication edges to their subjects. The monitoring relationships between processes form a directed expander graph with strong connectivity properties, which ensures with high probability that healthy processes detect failures. We interpret multiple reports about a subject’s edges as a high-fidelity signal that the subject is faulty. The monitoring edges represent edge failure detectors that are pluggable. Multi-process cut detection. For stability, processes in Rapid (i) suspect a faulty process p only upon receiving alerts from multiple observers of p, and (ii) delay acting on alerts about different processes until the churn stabilizes, thereby converging to detect a global, possibly multi-node cut of processes to add or remove from the membership. This filter is remarkably simple to implement, yet it suffices by itself to achieve almost-everywhere agreement – unanimity among a large fraction of processes about the detected cut. Practical consensus. For consistency, we show that converting almost-everywhere agreement into full agreement is practical even in large-scale settings. Rapid’s consensus protocol drives configuration changes by a low-overhead, leaderless protocol in the common case: every process simply validates consensus by counting the number of identical cut detections. 
If there is a quorum containing three-quarters of the membership set with the same cut, then without a leader or further communication, this is a safe consensus decision. How well does it work? We experimented with Rapid in moderately scalable settings comprising 1000-2000 nodes. The paper has detailed head-to-head comparisons against widely used membership solutions, like Akka Cluster, ZooKeeper and Memberlist. The higher-order bit is that these solutions deal well with crash failures and network partitions, but face stability issues under complex network failure scenarios like asymmetric reachability issues and high-packet loss. Rapid is stable under these circumstances because of its expander-based monitoring overlay that can better localize faults and its approach of removing entire cuts of faulty processes. At the same time, Rapid is also fast. It can bootstrap a 2000 node cluster 2-5.8x faster than the alternatives we compared against, despite the fact that Rapid offers much stronger guarantees around stability and consistency. Lastly, Rapid is comparable in cost to Memberlist (a gossip-based protocol) in terms of network bandwidth and memory utilization. We also found Rapid easy to integrate in two existing applications: a distributed transaction data-store, and a service discovery use case. We believe the insights we identify in Rapid are easy to apply to existing systems, and are happy to work with you on your use case. Feel free to reach out to me if you’d like to chat! References USENIX ATC 2018 paper, slides and presentation Code on Github VMware Research Group</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">People In Order</title>
      
      
      <link href="https://lalith.in/2018/08/05/People-In-Order/" rel="alternate" type="text/html" title="People In Order" />
      
      <published>2018-08-05T00:00:00+00:00</published>
      <updated>2018-08-05T00:00:00+00:00</updated>
      <id>https://lalith.in/2018/08/05/People-In-Order</id>
      <content type="html" xml:base="https://lalith.in/2018/08/05/People-In-Order/">&lt;p&gt;Here’s a fascinating piece I found on &lt;a href=&quot;https://aeon.co&quot;&gt;aeon.co&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://aeon.co/videos/what-can-73-homes-arranged-by-household-income-say-about-their-residents&quot;&gt;What can 73 homes arranged by household income say about their residents?&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Here’s a fascinating piece I found on aeon.co: What can 73 homes arranged by household income say about their residents?</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Invictus</title>
      
      
      <link href="https://lalith.in/2017/10/01/Invictus/" rel="alternate" type="text/html" title="Invictus" />
      
      <published>2017-10-01T00:00:00+00:00</published>
      <updated>2017-10-01T00:00:00+00:00</updated>
      <id>https://lalith.in/2017/10/01/Invictus</id>
      <content type="html" xml:base="https://lalith.in/2017/10/01/Invictus/">&lt;p&gt;Ever so rarely, I find a poem that gives me goosebumps every time I read it. One such poem is Invictus, by William Ernest Henley.&lt;/p&gt;

&lt;p&gt;The poem is perhaps best known for having been a favourite of Nelson Mandela, who used to recite it to his fellow inmates.&lt;/p&gt;

&lt;p&gt;This tribute by the one and only &lt;a href=&quot;http://zenpencils.com/comic/140-invictus-a-comic-tribute-to-nelson-mandela/&quot;&gt;Zen Pencils&lt;/a&gt; is therefore
something I keep coming back to.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Ever so rarely, I find a poem that gives me goosebumps every time I read it. One such poem is Invictus, by William Ernest Henley. The poem is perhaps best known for having been a favourite of Nelson Mandela, who used to recite it to his fellow inmates. This tribute by the one and only Zen Pencils is therefore something I keep coming back to.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">The New Colossus</title>
      
      
      <link href="https://lalith.in/2017/01/03/The-New-Colossus/" rel="alternate" type="text/html" title="The New Colossus" />
      
      <published>2017-01-03T00:00:00+00:00</published>
      <updated>2017-01-03T00:00:00+00:00</updated>
      <id>https://lalith.in/2017/01/03/The-New-Colossus</id>
      <content type="html" xml:base="https://lalith.in/2017/01/03/The-New-Colossus/">&lt;p&gt;Given the news lately, &lt;a href=&quot;https://en.wikipedia.org/wiki/The_New_Colossus&quot;&gt;The New Colossus&lt;/a&gt; by Emma Lazarus seems very relevant.&lt;/p&gt;

&lt;center&gt;
&lt;em&gt;
Not like the brazen giant of Greek fame,  &lt;br /&gt;
With conquering limbs astride from land to land;  &lt;br /&gt;
Here at our sea-washed, sunset gates shall stand  &lt;br /&gt;
A mighty woman with a torch, whose flame  &lt;br /&gt;
Is the imprisoned lightning, and her name  &lt;br /&gt;
MOTHER OF EXILES. From her beacon-hand  &lt;br /&gt;
Glows world-wide welcome; her mild eyes command  &lt;br /&gt;
The air-bridged harbor that twin cities frame.  &lt;br /&gt;
&lt;br /&gt;
&quot;Keep, ancient lands, your storied pomp!&quot; cries she  &lt;br /&gt;
With silent lips. &quot;Give me your tired, your poor,  &lt;br /&gt;
Your huddled masses yearning to breathe free,  &lt;br /&gt;
The wretched refuse of your teeming shore.  &lt;br /&gt;
Send these, the homeless, tempest-tost to me,  &lt;br /&gt;
I lift my lamp beside the golden door!  &lt;br /&gt;
&lt;/em&gt;
&lt;/center&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Given the news lately, The New Colossus by Emma Lazarus seems very relevant. Not like the brazen giant of Greek fame, With conquering limbs astride from land to land; Here at our sea-washed, sunset gates shall stand A mighty woman with a torch, whose flame Is the imprisoned lightning, and her name MOTHER OF EXILES. From her beacon-hand Glows world-wide welcome; her mild eyes command The air-bridged harbor that twin cities frame. &quot;Keep, ancient lands, your storied pomp!&quot; cries she With silent lips. &quot;Give me your tired, your poor, Your huddled masses yearning to breathe free, The wretched refuse of your teeming shore. Send these, the homeless, tempest-tost to me, I lift my lamp beside the golden door!</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Is -0 a number?</title>
      
      
      <link href="https://lalith.in/2016/11/12/Signed-Zeroes/" rel="alternate" type="text/html" title="Is -0 a number?" />
      
      <published>2016-11-12T00:00:00+00:00</published>
      <updated>2016-11-12T00:00:00+00:00</updated>
      <id>https://lalith.in/2016/11/12/Signed-Zeroes</id>
      <content type="html" xml:base="https://lalith.in/2016/11/12/Signed-Zeroes/">&lt;p&gt;Reproducing an old answer of mine from Quora.&lt;/p&gt;

&lt;p&gt;Is -0 a number? Yes it is, and it is equal to 0.&lt;/p&gt;

&lt;p&gt;Interestingly though, signed zeros are a necessary representation in computing because of rounding off errors and limitations with floating point precision. In certain classes of computations, the sign of a number before it was rounded off to zero is of practical importance.&lt;/p&gt;
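
&lt;p&gt;Both facts are easy to demonstrate. Here’s a minimal Python sketch (the behaviour follows from IEEE 754 itself, not anything Python-specific):&lt;/p&gt;

```python
import math

# -0.0 and 0.0 compare equal, as stated above...
assert -0.0 == 0.0

# ...but the sign is preserved and observable:
print(str(-0.0))                 # "-0.0"
print(math.copysign(1.0, -0.0))  # -1.0

# atan2 distinguishes the two zeros, which matters near branch cuts:
print(math.atan2(0.0, 0.0))      # 0.0
print(math.atan2(0.0, -0.0))     # 3.141592653589793 (pi)
```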

&lt;p&gt;Here’s an example from the Wikipedia article on &lt;a href=&quot;https://en.wikipedia.org/wiki/Signed_zero&quot;&gt;signed zeros&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Informally, one may use the notation “−0” for a negative value that was rounded to zero. This notation may be useful when a negative sign is significant; for example, when tabulating Celsius temperatures, where a negative sign means below freezing.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Reproducing an old answer of mine from Quora. Is -0 a number? Yes it is, and it is equal to 0. Interestingly though, signed zeros are a necessary representation in computing because of rounding off errors and limitations with floating point precision. In certain classes of computations, the sign of a number before it was rounded off to zero is of practical importance. Here’s an example from the Wikipedia article on signed zeros: Informally, one may use the notation “−0” for a negative value that was rounded to zero. This notation may be useful when a negative sign is significant; for example, when tabulating Celsius temperatures, where a negative sign means below freezing.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">The Garden of Earthly Delights</title>
      
      
      <link href="https://lalith.in/2016/09/30/Hieronymus-Bosch/" rel="alternate" type="text/html" title="The Garden of Earthly Delights" />
      
      <published>2016-09-30T00:00:00+00:00</published>
      <updated>2016-09-30T00:00:00+00:00</updated>
      <id>https://lalith.in/2016/09/30/Hieronymus-Bosch</id>
      <content type="html" xml:base="https://lalith.in/2016/09/30/Hieronymus-Bosch/">&lt;p&gt;Today, my colleague &lt;a href=&quot;https://udiwieder.wordpress.com&quot;&gt;Udi&lt;/a&gt; introduced me to Hieronymus Bosch’s “&lt;a href=&quot;http://www.esotericbosch.com/Garden.htm&quot;&gt;The Garden of Earthly Delights&lt;/a&gt;”. I’m far from being an art connoisseur, but I’m speechless.&lt;/p&gt;

&lt;p&gt;More than an hour in, I still haven’t made it past the central panel.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Today, my colleague Udi introduced me to Hieronymus Bosch’s “The Garden of Earthly Delights”. I’m far from being an art connoisseur, but I’m speechless. More than an hour in, I still haven’t made it past the central panel.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Defended</title>
      
      
      <link href="https://lalith.in/2016/06/27/defended/" rel="alternate" type="text/html" title="Defended" />
      
      <published>2016-06-27T00:00:00+00:00</published>
      <updated>2016-06-27T00:00:00+00:00</updated>
      <id>https://lalith.in/2016/06/27/defended</id>
      <content type="html" xml:base="https://lalith.in/2016/06/27/defended/">&lt;p&gt;And I finally defended my PhD thesis.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/defended.jpg&quot; alt=&quot;Drawing&quot; style=&quot;width: 700px&quot; /&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">And I finally defended my PhD thesis.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection</title>
      
      
      <link href="https://lalith.in/2015/06/07/c3/" rel="alternate" type="text/html" title="C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection" />
      
      <published>2015-06-07T00:00:00+00:00</published>
      <updated>2015-06-07T00:00:00+00:00</updated>
      <id>https://lalith.in/2015/06/07/c3</id>
      <content type="html" xml:base="https://lalith.in/2015/06/07/c3/">&lt;p&gt;After a long hiatus of technical posts, I’m finally getting around to blogging
about my PhD research. Today, I’ll give a brief overview of some of my recent work on
the C3 system that was published at &lt;a href=&quot;https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/suresh&quot;&gt;NSDI 2015&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My research has focused on techniques to reduce latency in the context of
large-scale distributed storage systems. A common pattern in the way people
architect scalable web-services today is to have large request fanouts, where even a
single end-user request can trigger tens to thousands of data accesses to the
storage tier. In the presence of such access patterns, the &lt;em&gt;tail latency&lt;/em&gt; of
your storage servers becomes very important since it begins to dominate the
overall query time.&lt;/p&gt;

&lt;p&gt;At the same time, storage servers are typically chaotic. Skewed demands across
storage servers, queueing delays across various layers of the stack,
background activities such as garbage collection and SSTable compaction, as
well as resource contention with co-located workloads are some of the many
factors that lead to &lt;em&gt;performance fluctuations&lt;/em&gt; across storage servers. These
sources of performance fluctuations can quickly inflate the tail latency of
your storage system, and degrade the performance of application services that
depend on the storage tier.&lt;/p&gt;

&lt;p&gt;In light of this issue, we investigate how &lt;em&gt;replica selection&lt;/em&gt;, wherein a
database client can select one out of multiple replicas to service a read
request, can be used to cope with server-side performance fluctuations at the
storage layer. That is, can clients carefully select replicas for serving
reads with the objective of improving their response times?&lt;/p&gt;

&lt;p&gt;This is challenging for several reasons. First of all, clients need a way to
reliably measure and adapt to performance fluctuations across storage servers.
Secondly, a fleet of clients needs to ensure that they do not enter herd
behaviours or load oscillations because all of them are trying to improve
their response times by going after faster servers. As it turns out,
many popular systems either do a poor job of replica selection because
they are agnostic to performance heterogeneity across storage servers,
or are prone to herd behaviours because they get performance-aware
replica selection wrong.&lt;/p&gt;

&lt;p&gt;C3 addresses these problems through a careful combination of two mechanisms.
First, clients in a C3 system, with some help from the servers, carefully
rank replicas in order to balance request queues across servers in proportion
to their performance differences. We refer to this as replica ranking. Second,
C3 clients use a congestion-control-esque approach to distributed rate
control, where clients adjust and throttle their sending rates to individual
servers in a fully decentralized fashion. This ensures that C3 clients
do not collectively send more requests per second to a server than it
can actually process.&lt;/p&gt;
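
&lt;p&gt;To make the ranking idea concrete, here is a toy sketch. To be clear, this is &lt;em&gt;not&lt;/em&gt; C3’s actual scoring function (see the paper for that); it only illustrates the general notion of preferring replicas whose queues, scaled by their service rates, imply the lowest expected wait:&lt;/p&gt;

```python
# Toy illustration of performance-aware replica ranking; NOT the actual
# C3 scoring function from the paper. Each replica reports an estimated
# queue size and a service rate (requests/sec); clients prefer the
# replica with the lowest expected wait, queue_size / service_rate.

def rank_replicas(replicas):
    """replicas: dict mapping replica name to (queue_size, service_rate).
    Returns names ordered from most to least preferred."""
    return sorted(replicas, key=lambda r: replicas[r][0] / replicas[r][1])

servers = {
    "a": (4.0, 2.0),  # expected wait 2.0s
    "b": (9.0, 1.0),  # expected wait 9.0s (e.g. mid-compaction)
    "c": (9.0, 3.0),  # expected wait 3.0s
}
print(rank_replicas(servers))  # ['a', 'c', 'b']
```

&lt;p&gt;C3’s real ranking function additionally penalizes queue sizes super-linearly, which discourages every client from piling onto the same lightly loaded server at once.&lt;/p&gt;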

&lt;p&gt;The combination of these two mechanisms gives C3 some impressive performance
improvements over Cassandra’s Dynamic Snitching, which we used as a baseline.
In experiments conducted on Amazon EC2, we found C3 to improve the 99.9th
percentile latency by a factor of 3, while improving read throughput by up to
50%. See the &lt;a href=&quot;https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/suresh&quot;&gt;paper&lt;/a&gt;
for details regarding the various experiments we ran as well as the settings considered.&lt;/p&gt;

&lt;p&gt;While the system evaluation in the paper was conducted using the Yahoo Cloud
Serving Benchmark (YCSB), I’m currently investigating how C3 performs under
production settings through some companies who’ve agreed to give it a test
run. So far, the tests have been rather positive and we’ve been learning
a lot more about C3 and the problem of replica selection in general. Stay tuned
for more results!&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">After a long hiatus of technical posts, I’m finally getting around to blogging about my PhD research. Today, I’ll give a brief overview of some of my recent work on the C3 system that was published at NSDI 2015. My research has focused on techniques to reduce latency in the context of large-scale distributed storage systems. A common pattern in the way people architect scalable web-services today is to have large request fanouts, where even a single end-user request can trigger tens to thousands of data accesses to the storage tier. In the presence of such access patterns, the tail latency of your storage servers becomes very important since it begins to dominate the overall query time. At the same time, storage servers are typically chaotic. Skewed demands across storage servers, queueing delays across various layers of the stack, background activities such as garbage collection and SSTable compaction, as well as resource contention with co-located workloads are some of the many factors that lead to performance fluctuations across storage servers. These sources of performance fluctuations can quickly inflate the tail latency of your storage system, and degrade the performance of application services that depend on the storage tier. In light of this issue, we investigate how replica selection, wherein a database client can select one out of multiple replicas to service a read request, can be used to cope with server-side performance fluctuations at the storage layer. That is, can clients carefully select replicas for serving reads with the objective of improving their response times? This is challenging for several reasons. First of all, clients need a way to reliably measure and adapt to performance fluctuations across storage servers. Secondly, a fleet of clients needs to ensure that they do not enter herd behaviours or load oscillations because all of them are trying to improve their response times by going after faster servers. 
As it turns out, many popular systems either do a poor job of replica selection because they are agnostic to performance heterogeneity across storage servers, or are prone to herd behaviours because they get performance-aware replica selection wrong. C3 addresses these problems through a careful combination of two mechanisms. First, clients in a C3 system, with some help from the servers, carefully rank replicas in order to balance request queues across servers in proportion to their performance differences. We refer to this as replica ranking. Second, C3 clients use a congestion-control-esque approach to distributed rate control, where clients adjust and throttle their sending rates to individual servers in a fully decentralized fashion. This ensures that C3 clients do not collectively send more requests per second to a server than it can actually process. The combination of these two mechanisms gives C3 some impressive performance improvements over Cassandra’s Dynamic Snitching, which we used as a baseline. In experiments conducted on Amazon EC2, we found C3 to improve the 99.9th percentile latency by a factor of 3, while improving read throughput by up to 50%. See the paper for details regarding the various experiments we ran as well as the settings considered. While the system evaluation in the paper was conducted using the Yahoo Cloud Serving Benchmark (YCSB), I’m currently investigating how C3 performs under production settings through some companies who’ve agreed to give it a test run. So far, the tests have been rather positive and we’ve been learning a lot more about C3 and the problem of replica selection in general. Stay tuned for more results!</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Interactive Classroom Hack</title>
      
      
      <link href="https://lalith.in/2015/01/23/interactive-classroom-hack/" rel="alternate" type="text/html" title="Interactive Classroom Hack" />
      
      <published>2015-01-23T00:00:00+00:00</published>
      <updated>2015-01-23T00:00:00+00:00</updated>
      <id>https://lalith.in/2015/01/23/interactive-classroom-hack</id>
      <content type="html" xml:base="https://lalith.in/2015/01/23/interactive-classroom-hack/">&lt;p&gt;I gave a lecture yesterday as part of a lab course I was TA-ing. The
assignment for this week had to do with understanding how different TCP
variants perform in a wireless setting.&lt;/p&gt;

&lt;p&gt;To prepare students for the assignment, my lecture was designed to be a
refresher on TCP’s basics.&lt;/p&gt;

&lt;p&gt;My plan was to discuss what TCP sets out to accomplish, some of the
early design problems associated with it, and how each subsequent improvement
of TCP solved a problem that the previous one didn’t (or introduced). This
could have been a very one-sided lecture, with me parroting all of the above.
But the best way to keep a classroom interactive is to deliver a lecture
packed with &lt;em&gt;questions&lt;/em&gt;, and have the students come up with the answers.&lt;/p&gt;

&lt;p&gt;This meant that I began the lecture by asking students what TCP tries to
accomplish. The students threw all kinds of answers at me, and we discussed
each of them one after the other. We talked  about what reliability means, how
reliable TCP’s guarantee of reliability actually is, and from a performance
standpoint, what TCP tries to accomplish. Note, at this point I’m still on
slide number 1 with only “&lt;em&gt;What are TCP’s objectives?&lt;/em&gt;” on it.  Next, we went
into the law of conservation of packets, and I asked them why that matters.
After that round of discussions was complete, we started with TCP Tahoe. I
posed each problem that TCP Tahoe tries to fix,  the problems it doesn’t fix,
and also asked them what the ramification is/would be of a certain design
decision of Tahoe. This went on for a while, with the students getting more
and more worked up about the topic, until we finally covered all the TCP
variants I had planned on teaching. By this point, the students themselves had
discussed, debated and attempted to solve each of the many issues associated
with making TCP perform well.&lt;/p&gt;

&lt;p&gt;Next, we moved on to the problems associated with TCP over wireless, and I asked
them to suggest avenues for constructing a solution. The discussion that followed
was pretty exciting, and at some point they even began correcting and arguing with
each other. Little did they know that this one-line problem statement I offered them
took several PhD theses to even construct partially working solutions.&lt;/p&gt;

&lt;p&gt;I’ve tried different variations of this strategy in the past, and after
all these years I’ve concluded this: Leaving students with questions during a lecture puts them in the shoes of
those before them who tried to find the answers. Leaving students with the answers
makes them mere consumers of knowledge.&lt;/p&gt;

&lt;p&gt;When we tell students about a solution alongside the problem itself, we’ve
already put horse blinders on their chain of thought. We’re directing their
thoughts through a linear chain. Leaving them with the questions long enough
makes them think more, and in my opinion, works very well in
making a classroom interactive.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I gave a lecture yesterday as part of a lab course I was TA-ing. The assignment for this week had to do with understanding how different TCP variants perform in a wireless setting. To prepare students for the assignment, my lecture was designed to be a refresher on TCP’s basics. My plan was to discuss what TCP sets out to accomplish, some of the early design problems associated with it, and how each subsequent improvement of TCP solved a problem that the previous one didn’t (or introduced). This could have been a very one-sided lecture, with me parroting all of the above. But the best way to keep a classroom interactive is to deliver a lecture packed with questions, and have the students come up with the answers. This meant that I began the lecture by asking students what TCP tries to accomplish. The students threw all kinds of answers at me, and we discussed each of them one after the other. We talked about what reliability means, how reliable TCP’s guarantee of reliability actually is, and from a performance standpoint, what TCP tries to accomplish. Note, at this point I’m still on slide number 1 with only “What are TCP’s objectives?” on it. Next, we went into the law of conservation of packets, and I asked them why that matters. After that round of discussions were complete, we started with TCP Tahoe. I posed each problem that TCP Tahoe tries to fix, the problems it doesn’t fix, and also asked them what the ramification is/would be of a certain design decision of Tahoe. This went on for a while, with the students getting more and more worked up about the topic, until we finally covered all the TCP variants I had planned on teaching. By this point, the students themselves had discussed, debated and attempted to solve each of the many issues associated with making TCP perform well. Next, we moved on to the problems associated with TCP over wireless, and I asked them to suggest avenues for constructing a solution. 
The discussion that followed was pretty exciting, and at some point they even began correcting and arguing with each other. Little did they know that this one-line problem statement I offered them took several PhD theses to even construct partially working solutions. I’ve tried different variations of this strategy in the past, and after all these years I’ve concluded this: Leaving students with questions during a lecture puts them in the shoes of those before them who tried to find the answers. Leaving students with the answers makes them mere consumers of knowledge. When we tell students about a solution alongside the problem itself, we’ve already put horse blinders on their chain of thought. We’re directing their thoughts through a linear chain. Leaving them with the questions long enough makes them think more, and in my opinion, works very well in making a classroom interactive.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Academic Abandonware</title>
      
      
      <link href="https://lalith.in/2014/03/24/academic-abandonware/" rel="alternate" type="text/html" title="Academic Abandonware" />
      
      <published>2014-03-24T00:00:00+00:00</published>
      <updated>2014-03-24T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/03/24/academic-abandonware</id>
      <content type="html" xml:base="https://lalith.in/2014/03/24/academic-abandonware/">&lt;p&gt;I recently stumbled upon &lt;a href=&quot;http://reproducibility.cs.arizona.edu/tr.pdf&quot;&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The gist of the discussion is that a good deal of CS research published at
reputable venues is notoriously difficult or even impossible to replicate.
Hats off to the team from Arizona for helping to bring this to the limelight.
It’s something we as a community ought to be really concerned about.&lt;/p&gt;

&lt;p&gt;Among the most common reasons seem to be:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;None of the authors can be contacted for any help relating to the paper.&lt;/li&gt;
  &lt;li&gt;Single points of failure: the only author capable of reproducing the work
has graduated.&lt;/li&gt;
  &lt;li&gt;The objective of publishing a paper being accomplished,
the software went unmaintained and requires divine intervention to even
build/setup, let alone use.&lt;/li&gt;
  &lt;li&gt;The software used or built in the paper cannot be publicly released. This is either due to licensing reasons,
the first two points, or plain refusal by the authors.&lt;/li&gt;
  &lt;li&gt;Critical details that are required to re-implement the work are omitted from the paper.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the criticisms I have of
the study is that their methodology involved marking a piece of code as
“cannot build” if 30 minutes of programmer time was insufficient to build the
tool. I doubt many of my own sincere attempts to make code publicly available
would pass this test. &lt;a href=&quot;http://github.com/lalithsuresh/odin&quot;&gt;Odin&lt;/a&gt; comes to my mind
here, which is a pain to set up despite the fact that others have and do
successfully use it for their research.&lt;/p&gt;

&lt;p&gt;So what can we do to minimise academic abandonware? Packaging your entire
software environment into VMs and releasing them via a project website sounds
to me like an idea worth pursuing. It avoids the problem of having to find,
compile and link combinations of ancient libraries. True, it doesn’t help if
one requires special hardware resources or a testbed in order to run the
system, but it’s a start nevertheless. Investing time and research into
building thoroughly validated simulators and emulators may also aid in this
direction.&lt;/p&gt;

&lt;p&gt;I’ll end this post with a comic I once drew.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/codequality.png&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I recently stumbled upon this. The gist of the discussion is that a good deal of CS research published at reputable venues is notoriously difficult or even impossible to replicate. Hats off to the team from Arizona for helping to bring this to the limelight. It’s something we as a community ought to be really concerned about. Among the most common reasons seem to be: None of the authors can be contacted for any help relating to the paper. Single points of failure: the only author capable of reproducing the work has graduated. The objective of publishing a paper being accomplished, the software went unmaintained and requires divine intervention to even build/setup, let alone use. The software used or built in the paper cannot be publicly released. This is either due to licensing reasons, the first two points, or plain refusal by the authors. Critical details that are required to re-implement the work are omitted from the paper. One of the criticisms I have of the study is that their methodology involved marking a piece of code as “cannot build” if 30 minutes of programmer time was insufficient to build the tool. I doubt many of my own sincere attempts to make code publicly available would pass this test. Odin comes to my mind here, which is a pain to set up despite the fact that others have and do successfully use it for their research. So what can we do to minimise academic abandonware? Packaging your entire software environment into VMs and releasing them via a project website sounds to me like an idea worth pursuing. It avoids the problem of having to find, compile and link combinations of ancient libraries. True, it doesn’t help if one requires special hardware resources or a testbed in order to run the system, but it’s a start nevertheless. Investing time and research into building thoroughly validated simulators and emulators may also aid in this direction. I’ll end this post with a comic I once drew.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Mental Detox</title>
      
      
      <link href="https://lalith.in/2014/02/18/mental-detox/" rel="alternate" type="text/html" title="Mental Detox" />
      
      <published>2014-02-18T00:00:00+00:00</published>
      <updated>2014-02-18T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/02/18/mental-detox</id>
      <content type="html" xml:base="https://lalith.in/2014/02/18/mental-detox/">&lt;p&gt;I’ve had a couple of paper deadlines in the last few months, all of which were
not-so-conveniently placed a couple of hours apart from each other. While the
month leading up to it was insanely stressful, I managed to push out most of
what I had in the pipeline and don’t have any more paper deadlines to worry
about for a few months.&lt;/p&gt;

&lt;p&gt;I’m now doing the usual post-submission-mental-detox to clear up my head,
where I’ve been taking it easy at work and catching up on life in general.
I’ve been completing some pending reviews, preparing an undergraduate course
for the upcoming semester, and rabidly catching up on lost gaming time. I’m
also going on holiday to Argentina in a week, an opportunity to completely disconnect
from work altogether.&lt;/p&gt;

&lt;p&gt;This freedom to manage my time the way that suits me best is what I enjoy the
most about doing a PhD.  I can be working insanely hard in the weeks leading
up to a deadline to push out a paper, and then slow down for a while
to clear up again.&lt;/p&gt;

&lt;p&gt;Now back to exploring dungeons in Skyrim.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I’ve had a couple of paper deadlines in the last few months, all of which were not-so-conveniently placed a couple of hours apart from each other. While the month leading up to it was insanely stressful, I managed to push out most of what I had in the pipeline and don’t have any more paper deadlines to worry about for a few months. I’m now doing the usual post-submission-mental-detox to clear up my head, where I’ve been taking it easy at work and catching up on life in general. I’ve been completing some pending reviews, preparing an undergraduate course for the upcoming semester, and rabidly catching up on lost gaming time. I’m also going on holiday to Argentina in a week, an opportunity to completely disconnect from work altogether. This freedom to manage my time the way that suits me best is what I enjoy the most about doing a PhD. I can be working insanely hard in the weeks leading up to a deadline to push out a paper, and then slow down for a while to clear up again. Now back to exploring dungeons in Skyrim.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Dream Setup</title>
      
      
      <link href="https://lalith.in/2014/01/22/dream-setup/" rel="alternate" type="text/html" title="Dream Setup" />
      
      <published>2014-01-22T00:00:00+00:00</published>
      <updated>2014-01-22T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/01/22/dream-setup</id>
      <content type="html" xml:base="https://lalith.in/2014/01/22/dream-setup/">&lt;p&gt;&lt;a href=&quot;http://usesthis.com/&quot;&gt;The Setup Interviews&lt;/a&gt; is an interesting website which features interviews with professionals
from different fields about the hardware/software they use on a daily basis. The interviews conclude with the interviewees
being asked what their dream setup is. While most people tend to answer this question with some set of gizmos they’d like to own,
I feel &lt;a href=&quot;http://usesthis.com/&quot;&gt;Matt Might&lt;/a&gt; got it right in his answer:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When I was young, I dreamed about building a “nerd cave” full of fast hardware, big monitors, sleek software and cool gadgets.
I see now that technology can only nip at the margins of happiness, creativity and productivity relative to the effect of having sharp colleagues, good friends and close family nearby.
I have many sharp colleagues that double as good friends.
And, there’s an outside chance that in the next two or three years both of my brothers and all three of my sisters-in-law (each of whom is like an actual sister to me) will have joined me and my wife in Utah.
I hope it happens.
That’s my dream setup.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">The Setup Interviews is an interesting website which features interviews with professionals from different fields about the hardware/software they use on a daily basis. The interviews conclude with the interviewees being asked what their dream setup is. While most people tend to answer this question with some set of gizmos they’d like to own, I feel Matt Might got it right in his answer: When I was young, I dreamed about building a “nerd cave” full of fast hardware, big monitors, sleek software and cool gadgets. I see now that technology can only nip at the margins of happiness, creativity and productivity relative to the effect of having sharp colleagues, good friends and close family nearby. I have many sharp colleagues that double as good friends. And, there’s an outside chance that in the next two or three years both of my brothers and all three of my sisters-in-law (each of whom is like an actual sister to me) will have joined me and my wife in Utah. I hope it happens. That’s my dream setup.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Moving to Github</title>
      
      
      <link href="https://lalith.in/2014/01/02/moving-to-github/" rel="alternate" type="text/html" title="Moving to Github" />
      
      <published>2014-01-02T00:00:00+00:00</published>
      <updated>2014-01-02T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/01/02/moving-to-github</id>
      <content type="html" xml:base="https://lalith.in/2014/01/02/moving-to-github/">&lt;p&gt;It’s the New Year and that means it’s time for change. I’ve finally moved my blog off wordpress.com and onto Github + Jekyll.&lt;/p&gt;

&lt;p&gt;Jekyll has been a pleasure to deal with so far. The import from WordPress was mostly trivial but with some rough edges.&lt;/p&gt;

&lt;p&gt;First, export your wordpress.com blog using the admin console (you should get an xml dump of your site) and then run:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# Assumes the XML dump is named wordpressdotcom.xml,&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$:&lt;/span&gt; jekyll import wordpressdotcom&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I only exported my posts because I wanted to set up pages myself. The above command should populate Jekyll’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_posts&lt;/code&gt; folder with your blog’s posts as html files.&lt;/p&gt;

&lt;p&gt;I found the generated html to be rather mangled; there were no paragraph separations and blockquotes looked ugly. This required
some monkey patching with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;awk&lt;/code&gt; to fix. There are still a few loose ends left, which I’ll get around to later.
I’ve set up &lt;a href=&quot;http://disqus.com&quot;&gt;Disqus&lt;/a&gt; for comments, and I still need to import all the comments from the WordPress site.&lt;/p&gt;
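
&lt;p&gt;For the curious, here’s a small Python sketch of the kind of cleanup involved. The real fixes were throwaway sed/awk one-liners, so treat this as illustrative only:&lt;/p&gt;

```python
import re

# Illustrative cleanup of an imported post (the real fixes were throwaway
# sed/awk one-liners): normalize Windows line endings and collapse runs
# of blank lines down to a single paragraph break.
def tidy(html):
    html = html.replace("\r\n", "\n")
    return re.sub(r"\n{3,}", "\n\n", html)

print(tidy("one\r\n\r\n\r\n\r\ntwo"))  # "one", a blank line, then "two"
```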

&lt;p&gt;I’m currently on the &lt;a href=&quot;http://andhyde.com&quot;&gt;Hyde theme&lt;/a&gt; which I modified a bit to suit my liking.&lt;/p&gt;

&lt;p&gt;All in all, it’s been a breeze to deploy over Github and I’m quite happy to have a lot more control over how my site looks.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">It’s the New Year, and that means it’s time for change. I’ve finally moved my blog off wordpress.com and onto GitHub + Jekyll. Jekyll has been a pleasure to deal with so far. The import from WordPress was mostly trivial, but with some rough edges. First, export your wordpress.com blog using the admin console (you should get an XML dump of your site) and then run: # Assumes the XML dump is named wordpressdotcom.xml $: jekyll import wordpressdotcom I only exported my posts because I wanted to set up the pages myself. The above command should populate Jekyll’s _posts folder with your blog’s posts as HTML files. I found the generated HTML to be rather mangled; there were no paragraph separations, and blockquotes looked ugly. This required some monkey patching with sed and awk to fix. There are still a few loose ends left, which I’ll get around to later. I’ve set up Disqus for comments, and I still need to import all the comments from the WordPress site. I’m currently on the Hyde theme which I modified a bit to suit my liking. All in all, it’s been a breeze to deploy on GitHub, and I’m quite happy to have a lot more control over how my site looks.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Procrastination post: Erdos Number</title>
      
      
      <link href="https://lalith.in/2013/12/15/procrastination-post-erdos-number/" rel="alternate" type="text/html" title="Procrastination post: Erdos Number" />
      
      <published>2013-12-15T00:00:00+00:00</published>
      <updated>2013-12-15T00:00:00+00:00</updated>
      <id>https://lalith.in/2013/12/15/procrastination-post-erdos-number</id>
      <content type="html" xml:base="https://lalith.in/2013/12/15/procrastination-post-erdos-number/">&lt;p&gt;While waiting for my combinatorial explosion of a factorial design experiment to complete, I decided to find out what my &lt;a href=&quot;http://en.wikipedia.org/wiki/Erd%C5%91s_number&quot;&gt;Erdos Number&lt;/a&gt; looks like, which is defined as the collaborative distance between yourself and the famous mathematician Paul Erdős. It turns out I have an Erdos number of 4, via the following chain of co-authorship.&lt;/p&gt;

&lt;p&gt;/me -- Anja Feldmann -- Edward G. Coffman, Jr -- Joel H. Spencer -- Paul Erdős.&lt;/p&gt;

&lt;p&gt;You can calculate your Erdos number &lt;a href=&quot;http://www.ams.org/mathscinet/collaborationDistance.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content>

      
      
      
      
      

      

      
        <category term="Research" />
      

      

      
        <summary type="html">While waiting for my combinatorial explosion of a factorial design experiment to complete, I decided to find out what my Erdos Number looks like, which is defined as the collaborative distance between yourself and the famous mathematician Paul Erdős. It turns out I have an Erdos number of 4, via the following chain of co-authorship. /me -- Anja Feldmann -- Edward G. Coffman, Jr -- Joel H. Spencer -- Paul Erdős. You can calculate your Erdos number here.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Comic: Academic Future Work Slides</title>
      
      
      <link href="https://lalith.in/2013/11/30/comic-academic-future-work-slides/" rel="alternate" type="text/html" title="Comic: Academic Future Work Slides" />
      
      <published>2013-11-30T00:00:00+00:00</published>
      <updated>2013-11-30T00:00:00+00:00</updated>
      <id>https://lalith.in/2013/11/30/comic-academic-future-work-slides</id>
      <content type="html" xml:base="https://lalith.in/2013/11/30/comic-academic-future-work-slides/">&lt;p&gt;I&apos;ve been going through a lot of academic talks lately, and almost every slide on future work feels like this to me.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://lalithsuresh.files.wordpress.com/2013/11/academic_future_work2.png&quot;&gt;&lt;img class=&quot;aligncenter size-full wp-image-1837&quot; alt=&quot;academic_future_work&quot; src=&quot;http://lalithsuresh.files.wordpress.com/2013/11/academic_future_work2.png&quot; width=&quot;580&quot; height=&quot;604&quot; /&gt;&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      
        <category term="Comic" />
      

      

      
        <summary type="html">I&apos;ve been going through a lot of academic talks lately and almost every slide on future work feels like this to me.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Optimize your life for learning</title>
      
      
      <link href="https://lalith.in/2013/11/24/optimize-your-life-for-learning/" rel="alternate" type="text/html" title="Optimize your life for learning" />
      
      <published>2013-11-24T00:00:00+00:00</published>
      <updated>2013-11-24T00:00:00+00:00</updated>
      <id>https://lalith.in/2013/11/24/optimize-your-life-for-learning</id>
      <content type="html" xml:base="https://lalith.in/2013/11/24/optimize-your-life-for-learning/">&lt;p&gt;Just stumbled upon a gem of an email [1] from Professor Alexander Coward at Berkeley, explaining why he isn&apos;t going to cancel a class despite a strike.&lt;/p&gt;

&lt;p&gt;The last couple of paragraphs ought to resonate well with anyone who&apos;s obsessed with learning. To quote the email:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In order for you to navigate the increasing complexity of the 21st century you need a world-class education, and thankfully you have an opportunity to get one. I don’t just mean the education you get in class, but I mean the education you get in everything you do, every book you read, every conversation you have, every thought you think.&lt;/p&gt;

&lt;p&gt;You need to optimize your life for learning.&lt;/p&gt;

&lt;p&gt;You need to live and breathe your education.&lt;/p&gt;

&lt;p&gt;You need to be *obsessed* with your education.&lt;/p&gt;

&lt;p&gt;Do not fall into the trap of thinking that because you are surrounded by so many dazzlingly smart fellow students that means you’re no good. Nothing could be further from the truth.&lt;/p&gt;

&lt;p&gt;And do not fall into the trap of thinking that you focusing on your education is a selfish thing. It’s not a selfish thing. It’s the most noble thing you could do.&lt;/p&gt;

&lt;p&gt;Society is investing in you so that you can help solve the many challenges we are going to face in the coming decades, from profound technological challenges to helping people with the age old search for human happiness and meaning.&lt;/p&gt;

&lt;p&gt;That is why I am not canceling class tomorrow. Your education is really really important, not just to you, but in a far broader and wider reaching way than I think any of you have yet to fully appreciate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;[1]: &lt;a href=&quot;http://alumni.berkeley.edu/california-magazine/just-in/2013-11-21/cal-lecturers-email-students-goes-viral-why-i-am-not&quot;&gt;http://alumni.berkeley.edu/california-magazine/just-in/2013-11-21/cal-lecturers-email-students-goes-viral-why-i-am-not&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      
        <category term="Education" />
      

      

      
        <summary type="html">Just stumbled upon a gem of an email [1] from Professor Alexander Coward at Berkeley, explaining why he isn&apos;t going to cancel a class despite a strike. The last couple of paragraphs ought to resonate well with anyone who&apos;s obsessed with learning. To quote the email: In order for you to navigate the increasing complexity of the 21st century you need a world-class education, and thankfully you have an opportunity to get one. I don’t just mean the education you get in class, but I mean the education you get in everything you do, every book you read, every conversation you have, every thought you think. You need to optimize your life for learning. You need to live and breathe your education. You need to be *obsessed* with your education. Do not fall into the trap of thinking that because you are surrounded by so many dazzlingly smart fellow students that means you’re no good. Nothing could be further from the truth. And do not fall into the trap of thinking that you focusing on your education is a selfish thing. It’s not a selfish thing. It’s the most noble thing you could do. Society is investing in you so that you can help solve the many challenges we are going to face in the coming decades, from profound technological challenges to helping people with the age old search for human happiness and meaning. That is why I am not canceling class tomorrow. Your education is really really important, not just to you, but in a far broader and wider reaching way than I think any of you have yet to fully appreciate. [1]: http://alumni.berkeley.edu/california-magazine/just-in/2013-11-21/cal-lecturers-email-students-goes-viral-why-i-am-not</summary>
      

      
      
    </entry>
  
  
</feed>
