<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <generator uri="http://jekyllrb.com" version="3.10.0">Jekyll</generator>
  
  
  <link href="https://lalith.in/feed.xml" rel="self" type="application/atom+xml" />
  <link href="https://lalith.in/" rel="alternate" type="text/html" />
  <updated>2026-03-16T12:25:36+00:00</updated>
  <id>https://lalith.in//</id>

  
    <title type="html">Comfortably Geek</title>
  

  

  

  
  
    <entry>
      
      <title type="html">Introducing DCM</title>
      
      
      <link href="https://lalith.in/2020/10/07/introducing-dcm/" rel="alternate" type="text/html" title="Introducing DCM" />
      
      <published>2020-10-07T00:00:00+00:00</published>
      <updated>2020-10-07T00:00:00+00:00</updated>
      <id>https://lalith.in/2020/10/07/introducing-dcm</id>
      <content type="html" xml:base="https://lalith.in/2020/10/07/introducing-dcm/">&lt;p&gt;I’m happy to (finally) share our &lt;a href=&quot;/papers/dcm-osdi2020.pdf&quot;&gt;OSDI 2020 paper&lt;/a&gt; on &lt;em&gt;Declarative Cluster
Managers&lt;/em&gt; (DCM).&lt;/p&gt;

&lt;p&gt;The premise for DCM is that building modern cluster managers is notoriously hard, given that they routinely
grapple with hard combinatorial optimization problems. Think of capabilities like policy-based load balancing,
placement, scheduling, and configuration: features required not only in dedicated cluster management systems
like Kubernetes, but also in enterprise-grade distributed systems like databases and storage platforms. Today, cluster
manager developers implement such features with system-specific best-effort heuristics, which achieve
scalability by significantly sacrificing the cluster manager’s decision quality, feature set, and extensibility over
time. This is proving untenable: across the industry, solutions to largely similar cluster management problems are
routinely developed from scratch in different settings.&lt;/p&gt;

&lt;p&gt;With DCM, we propose a radically different architecture where developers specify the cluster manager’s behavior
&lt;em&gt;declaratively&lt;/em&gt;, using SQL queries over cluster state stored in a relational database. From the SQL specification, the
DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management
decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state
as an optimization problem and solves it using a constraint solver, freeing developers from having to design ad-hoc
heuristics.&lt;/p&gt;
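&lt;p&gt;To make that concrete, here’s a toy sketch, in Python with SQLite, of what phrasing a placement policy as a query over relational cluster state can look like. To be clear, this is not DCM’s actual schema or API; the tables, columns, and policy below are invented purely for illustration.&lt;/p&gt;

```python
import sqlite3

# Toy "cluster state" in a relational database. The schema and the policy
# below are invented for illustration; they are not DCM's actual tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE nodes (name TEXT, cpu_free INTEGER, mem_free INTEGER)")
db.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
               [("node-1", 2, 4), ("node-2", 8, 16), ("node-3", 4, 8)])

# A placement policy phrased declaratively: nodes that can fit a pod
# needing 3 CPUs and 6 GB of memory, least-loaded candidates first.
rows = db.execute(
    "SELECT name FROM nodes WHERE cpu_free >= 3 AND mem_free >= 6 "
    "ORDER BY cpu_free DESC"
).fetchall()
candidates = [name for (name,) in rows]
print(candidates)
```

&lt;p&gt;DCM’s compiler goes much further than a plain query, of course: it turns such specifications into an optimization problem for a constraint solver. The snippet is only meant to convey the flavor of “cluster state as tables, policy as SQL”.&lt;/p&gt;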

&lt;p&gt;We show that DCM significantly lowers the barrier to building scalable and extensible cluster managers. We validate our
claim by powering three systems with it: a Kubernetes scheduler, a virtual machine management solution, and a
distributed transactional datastore.&lt;/p&gt;

&lt;p&gt;If you’re interested in the details, check out the &lt;a href=&quot;/papers/dcm-osdi2020.pdf&quot;&gt;paper&lt;/a&gt;. If you’d like
to try out DCM, have a look at our &lt;a href=&quot;https://github.com/vmware/declarative-cluster-management/&quot;&gt;GitHub repository&lt;/a&gt;. We
welcome all feedback, questions, and contributions!&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I’m happy to (finally) share our OSDI 2020 paper on Declarative Cluster Managers (DCM). The premise for DCM is that writing modern cluster management code is notoriously hard, given that they routinely grapple with hard combinatorial optimization problems. Think of capabilities like policy-based load balancing, placement, scheduling, and configuration, which are features not only required in dedicated cluster management systems like Kubernetes, but also in enterprise-grade distributed systems like databases and storage platforms. Today, cluster manager developers implement such features by developing system-specific best-effort heuristics, which achieve scalability by significantly sacrificing the cluster manager’s decision quality, feature set, and extensibility over time. This is proving untenable, as solutions for cluster management problems are routinely developed from scratch in the industry to solve largely similar problems across different settings. With DCM, we propose a radically different architecture where developers specify the cluster manager’s behavior declaratively, using SQL queries over cluster state stored in a relational database. From the SQL specification, the DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state as an optimization problem and solves it using a constraint solver, freeing developers from having to design ad-hoc heuristics. We show that DCM significantly lowers the barrier to building scalable and extensible cluster managers. We validate our claim by powering three systems with it: a Kubernetes scheduler, a virtual machine management solution, and a distributed transactional datastore. If you’re interested in the details, check out the paper. If you’d like to try out DCM, have a look at our Github repository. 
We welcome all feedback, questions, and contributions!</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Low-level advice for systems research</title>
      
      
      <link href="https://lalith.in/2020/09/27/Low-Level-Advice-For-Systems-Research/" rel="alternate" type="text/html" title="Low-level advice for systems research" />
      
      <published>2020-09-27T00:00:00+00:00</published>
      <updated>2020-09-27T00:00:00+00:00</updated>
      <id>https://lalith.in/2020/09/27/Low-Level-Advice-For-Systems-Research</id>
<content type="html" xml:base="https://lalith.in/2020/09/27/Low-Level-Advice-For-Systems-Research/">&lt;p&gt;There’s no shortage of “how to do research” advice on the Internet for graduate students. Such advice, while inspiring,
is extremely hard for a systems PhD student to translate into daily or weekly productivity.&lt;/p&gt;

&lt;p&gt;This is unfortunate, because I believe following good practices can offset a lot of the stress associated with systems
research. Having discussed this topic often with students, I think it’s time to blog about it.&lt;/p&gt;

&lt;p&gt;We’ll cover two broad topics: how to effectively prototype and how to run experiments systematically.&lt;/p&gt;

&lt;p&gt;On the prototyping side, we’ll first cover the tracer bullet methodology to gather incremental feedback about an idea.
We’ll then discuss the importance of testing your code.&lt;/p&gt;

&lt;p&gt;Regarding experiments, I’ll first argue for preparing your experiment infrastructure as early as possible in a
project. I’ll then share some tips on automating experiments, avoiding tunnel vision, experimentation-friendly
prototyping, and end with some advice on understanding your results.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;disclaimer&quot;&gt;Disclaimer&lt;/h3&gt;

&lt;p&gt;As with any advice from people who landed research positions after
their PhDs, remember that survivor bias is a thing. I can only say that
following these guidelines works well for me; there is no guarantee that it
will work for you.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;who-is-this-article-for&quot;&gt;Who is this article for?&lt;/h3&gt;

&lt;p&gt;This article might help you if you’re a systems PhD student, and one or more of the following bullets 
resonate with you:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You’re spending a lot of time building your system, and you’re unsure whether
it will be worth it at all.&lt;/li&gt;
  &lt;li&gt;There are many components of your system left to implement, and you’re not sure what
you should prioritize over others.&lt;/li&gt;
  &lt;li&gt;You find setting up, understanding, and debugging experiments overwhelming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-premise&quot;&gt;The premise&lt;/h3&gt;

&lt;p&gt;The premise for this article is the following comic I drew a while ago.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;img src=&quot;https://lalith.in/img/codequality.png&quot; alt=&quot;drawing&quot; width=&quot;500&quot; /&gt;
&lt;/p&gt;

&lt;p&gt;You have a finite amount of time, there’s a lot of systems work to do, there’s a submission deadline looming, 
and that worries you. As you get closer to the deadline, things will inevitably get chaotic.
This article will hopefully help you effectively plan and
alleviate some of that chaos.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;effective-prototyping&quot;&gt;Effective Prototyping&lt;/h3&gt;

&lt;p&gt;Your research prototype’s goal is to exercise a certain hypothesis. 
To validate
that hypothesis, you will eventually subject your prototype to a set of experiments.
More often than not, the experiments will compare your system against one or more baseline
systems too.&lt;/p&gt;

&lt;p&gt;Your goal through this process is to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;prototype in a way that you get continuous, incremental feedback about your idea&lt;/li&gt;
  &lt;li&gt;test for correctness during development time, so you don’t have to during experiments;
debugging performance problems and system behavior over an experiment is hard enough&lt;/li&gt;
  &lt;li&gt;build the experiment setup and infrastructure as early as possible, and preferably
even before you build the system (rather than building the system first, and 
&lt;em&gt;then&lt;/em&gt; thinking about experiments)&lt;/li&gt;
  &lt;li&gt;save time by automating the living crap out of your experiment workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;the-tracer-bullet-methodology&quot;&gt;The Tracer Bullet Methodology&lt;/h4&gt;

&lt;p&gt;I picked up this term from &lt;a href=&quot;https://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X&quot;&gt;The Pragmatic Programmer&lt;/a&gt;
 book.&lt;/p&gt;

&lt;p&gt;Here’s the problem. Let’s say you have multiple components to build before your system will work end-to-end. 
Each of these components might be complex by themselves, so building them one after the other in sequence and stitching 
them together might take a lot of time. That’s risky if you don’t yet know whether this system is worth building:
you don’t want to spend a year building something, only to find out you’ve hit a dud.&lt;/p&gt;

&lt;p&gt;Instead, focus on building a working end-to-end version early that might even only work for one example input, taking as
many shortcuts as you need to get there (e.g., an interpreter that can only run the program “1 + 1” and nothing else).
From there, evolve the system to work with increasingly complex inputs, filling in gaps in your skeleton code and eating
away at hardcoded assumptions and shortcuts as you go along. The benefit of this approach is that you always see the
system work end-to-end, for more and more examples, and you receive continuous, incremental feedback on a variety of
factors throughout the process. This strategy applies at any granularity of your code: the entire system itself,
specific modules within them, or even specific functions.&lt;/p&gt;

&lt;p&gt;For example, in the &lt;a href=&quot;/papers/dcm-osdi2020.pdf&quot;&gt;DCM&lt;/a&gt; project, 
we observed that developers expend a lot of effort to hand-craft ad-hoc heuristics for cluster management
problems. We hypothesized that a compiler could instead synthesize the required implementation from an SQL 
specification. Doing so would make it easier to build cluster managers and schedulers that perform well, are flexible, 
and compute high-quality decisions.&lt;/p&gt;

&lt;p&gt;When we started working on DCM, we weren’t even sure if this idea was practical, or whether SQL was an expressive
enough language for the task at hand. Building a fully functional compiler that generated fast-enough code would have
taken a while, and we wanted to validate the core hypothesis early. So to get started, we took a cluster management
system, expressed some of its logic in SQL, and had the compiler emit code that we’d mostly written by hand. We tried
this for several cluster management policies and received steady feedback about our design’s feasibility along the way.
Importantly, even before we had filled in the compiler’s gaps, we had a working end-to-end demo running within a real
system, which gave us the required confidence to double down on the idea.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;testing&quot;&gt;Testing&lt;/h4&gt;

&lt;p&gt;Test your code thoroughly before you run experiments. The additional time you invest
in writing tests and running them on every build will be more than made up by &lt;em&gt;not&lt;/em&gt; having to debug your system for
correctness when you run experiments. Debugging performance issues and system behavior during
an experiment is daunting enough as it is: don’t compound this by debugging correctness issues
at the same time.&lt;/p&gt;

&lt;p&gt;When working on a complex system (especially one with multiple authors), tests are essential for slowing down the rate at
which bugs and regressions creep in over time. When you find a new bug in your system, add a test case that reproduces
the bug.&lt;/p&gt;

&lt;p&gt;Only use a given commit/build of your system for an experiment if it has passed all tests.&lt;/p&gt;

&lt;p&gt;Test every build and commit using a continuous integration (CI) infrastructure.
A CI system monitors a repository for new commits and runs a series of prescribed test workflows. 
There are free solutions like &lt;a href=&quot;https://travis-ci.org/&quot;&gt;Travis&lt;/a&gt; you can easily use. You can also self-host your CI
infrastructure using &lt;a href=&quot;https://docs.gitlab.com/ee/ci/&quot;&gt;Gitlab CI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A common pushback I’ve heard about rigorously testing research prototypes is “Hey! We’re trying to write a paper, not
production code!”. That statement, however, is a non sequitur. A research prototype with tests is not a production
system! I’m not suggesting that you add metrics, log retention, compliance checks, rolling upgrades, backwards
compatibility, and umpteen other things that production systems need (unless they’re relevant to your research,
obviously).&lt;/p&gt;

&lt;p&gt;Use any automated tooling available to harden your code. For example, I configure all my Java projects to use
 &lt;a href=&quot;https://github.com/checkstyle/checkstyle&quot;&gt;Checkstyle&lt;/a&gt; to enforce a coding style, and both 
 &lt;a href=&quot;https://github.com/spotbugs/spotbugs&quot;&gt;SpotBugs&lt;/a&gt; and &lt;a href=&quot;https://errorprone.info/&quot;&gt;Google ErrorProne&lt;/a&gt;
 to run a suite of static analysis passes on the code. I use &lt;a href=&quot;https://github.com/jacoco/jacoco&quot;&gt;Jacoco&lt;/a&gt; and
  tools like &lt;a href=&quot;https://codecov.io&quot;&gt;CodeCov&lt;/a&gt; for tracking code coverage of my tests. 
  Every build invokes these tools and fails the build if there are problems. The 
 CI infrastructure also runs these checks whenever I push commits to a git repository.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h3 id=&quot;experiments&quot;&gt;Experiments&lt;/h3&gt;

&lt;p&gt;A common trap that students fall into is to not think about experiments until their system is “ready”. 
This leads to the same problem we just discussed about not getting incremental feedback.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;set-up-the-experiment-infrastructure-early&quot;&gt;Set up the experiment infrastructure early&lt;/h4&gt;

&lt;p&gt;Most systems papers have one or more baselines to compare against. If so, set up the experiment infrastructure
to measure the baselines as early as possible, even &lt;em&gt;before&lt;/em&gt; you start formulating a hypothesis
about something you’d like to improve. Use the data collected from studying the baselines to formulate
your hypothesis.&lt;/p&gt;

&lt;p&gt;From my own experience, you’re likely to find problems others haven’t noticed 
(prior art can be surprisingly flimsy sometimes). You’ll also learn whether the problem you’re studying
 is even real at all. Either way, you’ll gain valuable ammunition towards formulating 
a good hypothesis.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;automate-your-experiments-like-a-maniac&quot;&gt;Automate your experiments like a maniac&lt;/h4&gt;

&lt;p&gt;The main goal for your experiment setup should be to completely automate the entire workflow. And I mean
&lt;strong&gt;the entire workflow&lt;/strong&gt;. For example, have a single script that, given a commit ID from your system’s git repository,
 checks out that commit ID, builds the artifacts, sets up or refreshes the infrastructure, 
 runs experiments with all the parameter combinations for the system and the baselines, cleans up between runs, 
 downloads all the relevant logs into an archive with the necessary metadata, processes the logs, 
 generates a report with the plots, and sends that report to a Slack channel.&lt;/p&gt;
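&lt;p&gt;A minimal sketch of such a driver, in Python: the stage commands below are hypothetical stand-ins, and a real setup would point at scripts checked into your repository. The point is only the shape: one entry point, stages chained in order, and a hard stop on the first failure.&lt;/p&gt;

```python
import subprocess

# Hypothetical stage commands, for illustration only.
def pipeline_stages(commit_id):
    return [
        ["git", "checkout", commit_id],
        ["./build_artifacts.sh"],
        ["./provision_infra.sh"],
        ["./run_experiments.sh", "--all-param-combinations"],
        ["./collect_logs.sh", commit_id],
        ["./make_report.sh", commit_id],
    ]

def run_pipeline(commit_id, run=subprocess.run):
    # Chain every stage and abort on the first failure, so a broken
    # build or half-provisioned cluster never silently yields a report.
    for cmd in pipeline_stages(commit_id):
        if run(cmd).returncode != 0:
            return False
    return True
```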

&lt;p&gt;When running experiments, make sure to collect not only log data and traces but also the experiment’s metadata.
Common bits of metadata that I tend to record include a timestamp for when I triggered the experiment, 
the timestamps for the individual experiment runs and repetitions, the relevant git commit IDs, metadata
about the environment (e.g., VM sizes, cluster information), and all the parameter combinations that were run. 
Save all the metadata to a file in a machine-friendly format (like CSV or JSON). 
Avoid encoding experiment metadata into file or folder names (“experiment_1_param1_param2_param3”). This makes it hard
to introduce changes over time (e.g., adding a new parameter will likely break your workflow).
Propagate the experiment metadata all the way to your graphs, to be sure you know what commit ID and parameter
 combinations produced a given set of results. Aggressively pepper the workflow with sanity checks (e.g.,
 a graph should never present data obtained from two different commits of your system).&lt;/p&gt;
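&lt;p&gt;As a sketch of what recording metadata in a machine-friendly format can look like (the field names and values here are illustrative, not a prescribed schema):&lt;/p&gt;

```python
import json
import time

def write_metadata(path, commit_id, params, environment):
    # One JSON file per experiment run, instead of cramming metadata
    # into file names. Field names are illustrative.
    metadata = {
        "triggered_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "commit_id": commit_id,
        "environment": environment,
        "params": params,
    }
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

meta = write_metadata(
    "/tmp/run_metadata.json",
    commit_id="0f3c2ab",
    params={"batch_size": [32, 64], "threads": [1, 8]},
    environment={"vm_size": "m5.xlarge", "cluster_nodes": 4},
)
```

&lt;p&gt;Adding a new parameter later is then a one-line change to the dictionary, rather than a renaming exercise across every results folder.&lt;/p&gt;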

&lt;p&gt;Never, ever, configure your infrastructure manually (e.g., bringing up VMs on AWS using the EC2 GUI, followed by logging
 into the VMs and
installing software libraries and configuring them). Instead, always script these workflows up 
(I like using &lt;a href=&quot;https://www.ansible.com/&quot;&gt;Ansible&lt;/a&gt; for such tasks).
The reason is that accumulating ad-hoc tweaks and commands with side effects (like changing the OS configuration)
impairs reproducibility. Instead, be disciplined about only introducing changes to the infrastructure via a set of
well-maintained scripts. They come in handy especially when unexpected failures happen and you need to migrate to new
infrastructure (literally every project I’ve worked on had to go through this!).&lt;/p&gt;

&lt;p&gt;The sooner you start with the above workflows, the better. Once it’s up, just like your tests, 
you’ll be able to continuously run experiments as part of your development process, and 
make sure your system is on track as far as your evaluation goals go. Every commit gets backed by a report
full of data about how that change affected your system’s metrics of concern. The experiment workflow then becomes 
a part of your testing and regression suites.&lt;/p&gt;

&lt;p&gt;The tracer bullet methodology applies to your experiment setup just as much as it does to the system you’re
building. You’ll find your experiment setup evolving in tandem with your system: you’ll add new logging and 
tracing points, you’ll add new parameter combinations you want to test, and new baselines.&lt;/p&gt;

&lt;p&gt;I have a fairly standard workflow I use for every project. It starts with using 
&lt;a href=&quot;https://www.ansible.com/&quot;&gt;Ansible&lt;/a&gt; to set up the infrastructure (e.g., bring up and configure VMs on EC2), 
deploy the artifacts, run the experiments, and collect the logs. I use Python to parse the raw logs and produce 
an &lt;a href=&quot;https://sqlite.org/index.html&quot;&gt;SQLite&lt;/a&gt; database with the necessary traces and experiment metadata. 
I then use &lt;a href=&quot;https://www.r-project.org/&quot;&gt;R&lt;/a&gt; to analyze the 
traces, &lt;a href=&quot;https://ggplot2.tidyverse.org/&quot;&gt;ggplot&lt;/a&gt; for plotting, and &lt;a href=&quot;https://rmarkdown.rstudio.com/&quot;&gt;RMarkdown&lt;/a&gt; to 
produce a report that I can then send to a Slack channel. A single 
top-level bash script takes a commit ID and chains all the previous steps together.&lt;/p&gt;
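&lt;p&gt;As an illustration of the “Python parses raw logs into SQLite” step, here is a minimal sketch; the log format is invented, and a real parser would also join in the experiment metadata discussed earlier.&lt;/p&gt;

```python
import sqlite3

# An invented log format: timestamp, metric name, value.
RAW_LOG = """\
1696600000.120 request_latency_ms 12.5
1696600000.480 request_latency_ms 9.1
1696600001.020 request_latency_ms 15.8"""

def logs_to_db(raw_log, commit_id):
    # One row per trace point, tagged with the commit that produced it,
    # so the metadata survives all the way to the plots.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE traces (ts REAL, metric TEXT, value REAL, commit_id TEXT)")
    for line in raw_log.splitlines():
        ts, metric, value = line.split()
        db.execute("INSERT INTO traces VALUES (?, ?, ?, ?)",
                   (float(ts), metric, float(value), commit_id))
    return db

db = logs_to_db(RAW_LOG, "0f3c2ab")
count, avg = db.execute("SELECT COUNT(*), AVG(value) FROM traces").fetchone()
```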

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;avoiding-tunnel-vision&quot;&gt;Avoiding tunnel-vision&lt;/h4&gt;

&lt;p&gt;A common risk with system building is to fall into the experiment tunnel-vision trap. For example, frantically trying to
improve performance for a benchmark or metric that may not matter, while ignoring those that are important to assess
your system.&lt;/p&gt;

&lt;p&gt;There are two things you can do to avoid this trap.&lt;/p&gt;

&lt;p&gt;First, try to sketch out the first few paragraphs of your paper’s evaluation section as early as possible.
This makes you think carefully about the main theses you’d like your evaluation to support.
I find this simple trick helps me stay focused when planning experiments (and it has often made me realize my priorities
were wrong!). You’ll find yourself refining both the text for the evaluation section and the experiment workflow over time.&lt;/p&gt;

&lt;p&gt;The second is to follow what I recommended in the previous section: wait to see a report with data about all your
experiments before iterating on a change to your system. Otherwise, you’ll be stuck in experiment whack-a-mole limbo!
Make sure your report contains the same graphs and data that you plan to add to the evaluation section.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;experimentation-friendly-prototyping&quot;&gt;Experimentation-friendly prototyping&lt;/h4&gt;

&lt;p&gt;Build experimentation-friendly prototypes. Always use feature flags and configuration parameters for your system to
toggle different settings (including log levels). If you find yourself modifying and recompiling your code only to
enable/disable a certain feature, you’re doing it &lt;strong&gt;completely wrong&lt;/strong&gt;.&lt;/p&gt;
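&lt;p&gt;In Python, for instance, this can be as simple as an argparse-based front end; the flag names below are invented for illustration:&lt;/p&gt;

```python
import argparse

def build_parser():
    # Every toggle is a command-line flag, so changing a setting never
    # requires recompiling. Flag names are illustrative.
    p = argparse.ArgumentParser(description="research prototype")
    p.add_argument("--enable-batching", action="store_true")
    p.add_argument("--enable-caching", action="store_true")
    p.add_argument("--log-level", default="info",
                   choices=["debug", "info", "warn", "error"])
    return p

args = build_parser().parse_args(["--enable-caching", "--log-level", "debug"])
```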

&lt;p&gt;Here’s one scenario where feature flags come in handy. Often, a paper proposes a collection of techniques that when
combined, improve over the state-of-the-art (e.g., five new optimization passes in a compiler). If your paper fits that
description, always make sure to have the quintessential “look how X changes as we turn on these features, one after the
other” experiment. If you only test the combination of these features and not their individual contributions, you (and
the readers) won’t know if some of your proposed features &lt;em&gt;negatively&lt;/em&gt; impact your system!&lt;/p&gt;
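&lt;p&gt;Generating the configurations for that experiment is trivial once everything is behind flags. A sketch, with invented feature names: start from the baseline with everything off, then enable one feature at a time cumulatively.&lt;/p&gt;

```python
# Invented feature names, for illustration.
FEATURES = ["batching", "caching", "prefetching"]

def cumulative_configs(features):
    # Baseline first (everything off), then each feature enabled on top
    # of the previous ones: the classic ablation ladder.
    configs = [set()]
    for i in range(len(features)):
        configs.append(set(features[: i + 1]))
    return configs

configs = cumulative_configs(FEATURES)
```

&lt;p&gt;Per-feature ablations (each feature off while the rest stay on) are an equally easy variation, and help catch a feature that hurts.&lt;/p&gt;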

&lt;p&gt;And no, for the majority of systems projects, a few additional &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt; statements are hardly going to affect your
performance (actual pushback I’ve heard!). If production-ready JVMs can ship with hundreds of such flags, your research
prototype can too.&lt;/p&gt;

&lt;p&gt;&lt;br /&gt;&lt;/p&gt;

&lt;h4 id=&quot;measure-measure-measure&quot;&gt;Measure, measure, measure&lt;/h4&gt;

&lt;p&gt;Systems are unfortunately way too complex. If you want to understand what happened over an experiment, there is no
substitute for measuring aggressively. Log any data you think will help you understand what’s going on, even if it is
data that you won’t present in a paper. If in doubt, always over-measure rather than under-measure.&lt;/p&gt;

&lt;p&gt;To borrow a quote from &lt;a href=&quot;https://web.stanford.edu/~ouster/cgi-bin/sayings.php&quot;&gt;John Ousterhout&lt;/a&gt;: “Use your intuition to
ask questions, not answer them”. Often, when faced with a performance problem or a bug, it’s tempting to assume you know why
(your intuition) and introduce changes to fix that problem. Don’t! Treat your intuition as a hypothesis, and dig in
further to confirm exactly why that problem occurs and how it manifests. For example, expand your workflow with more logs or
set up additional experiments to control for the suspected factors.&lt;/p&gt;

&lt;p&gt;A particularly dangerous example I’ve seen is to declare victory the moment one sees their system beat the baseline 
in end-to-end metrics. For example, “our nifty algorithm improves transaction throughput by 800x over baseline Y” 
(ratios that are quite fashionable these days!).
Again, don’t assume it was because of your algorithm, but follow up with the required lower-level analysis to confirm you can
 thoroughly explain &lt;em&gt;why&lt;/em&gt; the performance disparity exists. 
Some basic questions you can ask of your data, based on unfortunate examples I’ve encountered in the wild:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;are you sure your database is not faster on reads because all reads return &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;null&lt;/code&gt;?&lt;/li&gt;
  &lt;li&gt;does your “Big Data” working set fit in an L3 cache?&lt;/li&gt;
  &lt;li&gt;is your system dropping requests whereas the baseline isn’t?&lt;/li&gt;
  &lt;li&gt;are you measuring latency differently for the baseline vs your system (round-trip vs one-way)?&lt;/li&gt;
  &lt;li&gt;are you introducing competition side effects [1] (section 1.5.3)?&lt;/li&gt;
  &lt;li&gt;are the baselines faster than your system, but you’ve driven them to congestion collapse?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are several aspects of measurement methodology that I think every systems student should know. For example,
the relationship between latency and throughput, 
open vs closed loop workload generation, how to understand bottlenecks, 
how to summarize performance data, the use of confidence intervals, and much more. 
I highly recommend these books and articles to learn more:&lt;/p&gt;

&lt;p&gt;[1] &lt;a href=&quot;https://perfeval.epfl.ch/&quot;&gt;“Performance Evaluation of Computer and Communication Systems”&lt;/a&gt;, Jean-Yves Le Boudec.&lt;br /&gt;
[2] &lt;a href=&quot;https://www.cse.wustl.edu/~jain/books/perfbook.htm&quot;&gt;“The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”&lt;/a&gt;, Raj Jain.&lt;br /&gt;
[3] &lt;a href=&quot;https://www.informit.com/store/mathematical-foundations-of-computer-networking-9780321792105&quot;&gt;“Mathematical foundations of computer networking”&lt;/a&gt;, Srinivasan Keshav.&lt;br /&gt;
[4] &lt;a href=&quot;http://gernot-heiser.org/benchmarking-crimes.html&quot;&gt;“Systems Benchmarking Crimes”&lt;/a&gt;, Gernot Heiser.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">There’s no shortage of “how to do research advice” on the Internet for graduate students. Such advice, while inspiring, is extremely hard to translate into daily or weekly productivity as a systems PhD student. This is unfortunate, because I believe following good practices can offset a lot of the stress associated with systems research. Having discussed this topic often with students, I think it’s time to blog about it. We’ll cover two broad topics: how to effectively prototype and how to run experiments systematically. On the prototyping side, we’ll first cover the tracer bullet methodology to gather incremental feedback about an idea. We’ll then discuss the importance of testing your code. Regarding experiments, I’ll first argue for preparing your experiment infrastructure as early as possible in a project. I’ll then share some tips on automating experiments, avoiding tunnel vision, experimentation-friendly prototyping, and end with some advice on understanding your results. Disclaimer As with any advice from people who landed research positions after their PhDs, remember that survivor bias is a thing. I can only say that following these guidelines work well for me, there is no guarantee that it will work for you. Who is this article for? This article might help you if you’re a systems PhD student, and one or more of the following bullets resonate with you: You’re spending a lot of time building your system, and you’re unsure whether it will be worth it at all. There are many components of your system left to implement, and you’re not sure what you should prioritize over others. You find setting up, understanding, and debugging experiments overwhelming. The premise The premise for this article is the following comic I drew a while ago. You have a finite amount of time, there’s a lot of systems work to do, there’s a submission deadline looming, and that worries you. 
As you get closer to the deadline, things will inevitably get chaotic. This article will hopefully help you effectively plan and alleviate some of that chaos. Effective Prototyping Your research prototype’s goal is to exercise a certain hypothesis. To validate that hypothesis, you will eventually subject your prototype to a set of experiments. More often than not, the experiments will compare your system against one or more baseline systems too. Your goal through this process is to: prototype in a way that you get continuous, incremental feedback about your idea test for correctness during development time, so you don’t have to during experiments; debugging performance problems and system behavior over an experiment is hard enough build the experiment setup and infrastructure as early as possible, and preferably even before you build the system (rather than building the system first, and then thinking about experiments) Save time by automating the living crap out of your experiment workflow The Tracer Bullet Methodology I picked up this term from The Pragmatic Programmer book. Here’s the problem. Let’s say you have multiple components to build before your system will work end-to-end. Each of these components might be complex by themselves, so building them one after the other in sequence and stitching them together might take a lot of time. That’s risky if you don’t yet know whether this system is worth building: you don’t want to spend a year building something, only to find out you’ve hit a dud. Instead, focus on building a working end-to-end version early that might even only work for one example input, taking as many shortcuts as you need to get there (e.g., an interpreter that can only run the program “1 + 1” and nothing else). From there evolve the system to work with increasingly complex inputs, filling in gaps in your skeleton code and eating away at hardcoded assumptions and shortcuts as you go along. 
The benefit of this approach is that you always see the system work end-to-end, for more and more examples, and you receive continuous, incremental feedback on a variety of factors throughout the process. This strategy applies at any granularity of your code: the entire system itself, specific modules within them, or even specific functions. For example, in the DCM project, we observed that developers expend a lot of effort to hand-craft ad-hoc heuristics for cluster management problems. We hypothesized that a compiler could instead synthesize the required implementation from an SQL specification. Doing so would make it easier to build cluster managers and schedulers that perform well, are flexible, and compute high-quality decisions. When we started working on DCM, we weren’t even sure if this idea was practical, and whether SQL was an expressive enough language for the task-at-hand. Building a fully functional compiler that generated fast-enough code would have taken a while, and we wanted to validate the core hypothesis early. So to get started, we took a cluster management system, expressed some of its logic in SQL, and had the compiler emit code that we’d mostly written by hand. We tried this for several cluster management policies and received steady feedback about our design’s feasibility along the way. Importantly, even before we had filled in the compiler’s gaps, we had a working end-to-end demo running within a real system, which gave us the required confidence to double down on the idea. Testing Test your code thoroughly before you run experiments. The additional time you invest in writing tests and running them on every build will be more than made up by not having to debug your system for correctness when you run experiments. Debugging performance issues and system behavior during an experiment is daunting enough as it is: don’t compound this by debugging correctness issues at the same time. 
When working on a complex system (especially with multiple authors), tests are essential to slowing down the rate at which bugs and regressions creep in over time. When you find a new bug in your system, add a new test case that reproduces the bug. Only use a given commit/build of your system for an experiment if it has passed all tests.

Test every build and commit using a continuous integration (CI) infrastructure. A CI system monitors a repository for new commits and runs a series of prescribed test workflows. There are free solutions like Travis you can easily use. You can also self-host your CI infrastructure using Gitlab CI.

A common pushback I’ve heard about rigorously testing research prototypes is “Hey! We’re trying to write a paper, not production code!”. That statement, however, is a non sequitur. A research prototype with tests is not a production system! I’m not suggesting that you add metrics, log retention, compliance checks, rolling upgrades, backwards compatibility, and umpteen other things that production systems need (unless they’re relevant to your research, obviously).

Use any automated tooling available to harden your code. For example, I configure all my Java projects to use Checkstyle to enforce a coding style, and both SpotBugs and Google ErrorProne to run a suite of static analysis passes on the code. I use Jacoco and tools like CodeCov for tracking code coverage of my tests. Every build invokes these tools and fails the build if there are problems. The CI infrastructure also runs these checks whenever I push commits to a git repository.

Experiments

A common trap that students fall into is to not think about experiments until their system is “ready”. This leads to the same problem we just discussed about not getting incremental feedback.

Set up the experiment infrastructure early

Most systems papers have one or more baselines to compare against.
If so, set up the experiment infrastructure to measure the baselines as early as possible, even before you start formulating a hypothesis about something you’d like to improve. Use the data collected from studying the baselines to formulate your hypothesis. From my own experience, you’re likely to find problems others haven’t noticed (prior art can be surprisingly flimsy sometimes). You’ll also learn whether the problem you’re studying is even real at all. Either way, you’ll gain valuable ammunition towards formulating a good hypothesis.

Automate your experiments like a maniac

The main goal for your experiment setup should be to completely automate the entire workflow. And I mean the entire workflow. For example, have a single script that, given a commit ID from your system’s git repository, checks out that commit ID, builds the artifacts, sets up or refreshes the infrastructure, runs experiments with all the parameter combinations for the system and the baselines, cleans up between runs, downloads all the relevant logs into an archive with the necessary metadata, processes the logs, generates a report with the plots, and sends that report to a Slack channel.

When running experiments, make sure to not only collect log data and traces but also the experiment’s metadata. Common bits of metadata that I tend to record include a timestamp for when I triggered the experiment, the timestamps for the individual experiment runs and repetitions, the relevant git commit IDs, metadata about the environment (e.g., VM sizes, cluster information), and all the parameter combinations that were run. Save all the metadata to a file in a machine-friendly format (like CSV or JSON). Avoid encoding experiment metadata into file or folder names (“experiment_1_param1_param2_param3”). This makes it hard to introduce changes over time (e.g., adding a new parameter will likely break your workflow).
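To make the metadata advice concrete, here is a minimal sketch of dumping experiment metadata to a JSON file instead of a folder name. Every field name below is illustrative, not a prescribed schema:

```python
# Hypothetical sketch: record experiment metadata in a machine-friendly file
# rather than encoding it into folder names. All field names are made up.
import json
import time

metadata = {
    "triggered_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "system_commit": "abc1234",   # git commit ID the artifacts were built from
    "baseline_commit": "def5678",
    "environment": {"vm_size": "m5.xlarge", "cluster_nodes": 50},
    "parameters": {"num_clients": 128, "batch_size": 16},
    "repetitions": 5,
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

Adding a new parameter is then a one-line change to the dictionary, and the log-processing scripts can join these fields against the traces so the commit ID travels all the way to the plots.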
Propagate the experiment metadata all the way to your graphs, to be sure you know what commit ID and parameter combinations produced a given set of results. Aggressively pepper the workflow with sanity checks (e.g., a graph should never present data obtained from two different commits of your system).

Never, ever, configure your infrastructure manually (e.g., bringing up VMs on AWS using the EC2 GUI, followed by logging into the VMs and installing software libraries and configuring them). Instead, always script these workflows up (I like using Ansible for such tasks). The reason is that accumulating ad-hoc tweaks and commands with side effects (like changing the OS configuration) impairs reproducibility. Be disciplined about only introducing changes to the infrastructure via a set of well-maintained scripts. They come in handy especially when unexpected failures happen, and you need to migrate to a new infrastructure (literally every project I’ve worked on had to go through this!).

The sooner you start with the above workflows, the better. Once the workflow is up, just like your tests, you’ll be able to continuously run experiments as part of your development process, and make sure your system is on track as far as your evaluation goals go. Every commit gets backed by a report full of data about how that change affected your system’s metrics of concern. The experiment workflow then becomes a part of your testing and regression suites. The tracer bullet methodology applies to your experiment setup just as much as it does to the system you’re building. You’ll find your experiment setup evolving in tandem with your system: you’ll add new logging and tracing points, new parameter combinations you want to test, and new baselines.

I have a fairly standard workflow I use for every project. It starts with using Ansible to set up the infrastructure (e.g., bring up and configure VMs on EC2), deploy the artifacts, run the experiments, and collect the logs.
I use Python to parse the raw logs and produce an SQLite database with the necessary traces and experiment metadata. I then use R to analyze the traces, ggplot for plotting, and RMarkdown to produce a report that I can then send to a Slack channel. A single top-level bash script takes a commit ID and chains all the previous steps together.

Avoiding tunnel-vision

A common risk with system building is to fall into the experiment tunnel-vision trap: frantically trying to improve performance for a benchmark or metric that may not matter, for example, while ignoring those that are important to assess your system. There are two things you can do to avoid this trap. First, try to sketch out the first few paragraphs of your paper’s evaluation section as early as possible. This makes you think carefully about the main theses you’d like your evaluation to support. I find this simple trick helps me stay focused when planning experiments (and it often made me realize my priorities were wrong!). You’ll find yourself refining both the text for the evaluation section and the experiment workflow over time. The second is to use what I recommended in the previous section: wait to see a report with data about all your experiments before iterating on a change to your system. Otherwise, you’ll be stuck in experiment whack-a-mole limbo! Make sure your report contains the same graphs and data that you plan to add to the evaluation section.

Experimentation-friendly prototyping

Build experimentation-friendly prototypes. Always use feature flags and configuration parameters for your system to toggle different settings (including log levels). If you find yourself modifying and recompiling your code only to enable/disable a certain feature, you’re doing it completely wrong. Here’s one scenario where feature flags come in handy. Often, a paper proposes a collection of techniques that, when combined, improve over the state-of-the-art (e.g., five new optimization passes in a compiler).
If your paper fits that description, always make sure to have the quintessential “look how X changes as we turn on these features, one after the other” experiment. If you only test the combination of these features and not their individual contributions, you (and the readers) won’t know if some of your proposed features negatively impact your system! And no, for the majority of systems projects, a few additional if statements are hardly going to affect your performance (actual pushback I’ve heard!). If production-ready JVMs can ship with hundreds of such flags, your research prototype can too.

Measure, measure, measure

Systems are unfortunately way too complex. If you want to understand what happened over an experiment, there is no substitute for measuring aggressively. Log any data you think will help you understand what’s going on, even if it is data that you won’t present in a paper. If in doubt, always over-measure rather than under-measure. To borrow a quote from John Ousterhout: “Use your intuition to ask questions, not answer them”. Often, when faced with a performance problem or a bug, it’s tempting to assume you know why (your intuition) and introduce changes to fix that problem. Don’t! Treat your intuition as a hypothesis, and dig in further to confirm exactly why that problem occurs and how it manifests. For example, expand your workflow with more logs or set up additional experiments to control for the suspected factors. A particularly dangerous example I’ve seen is to declare victory the moment one sees their system beat the baseline in end-to-end metrics. For example, “our nifty algorithm improves transaction throughput by 800x over baseline Y” (ratios that are quite fashionable these days!). Again, don’t assume it was because of your algorithm, but follow up with the required lower-level analysis to confirm you can thoroughly explain why the performance disparity exists.
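Coming back to feature flags: a minimal sketch of what environment-variable-driven flags might look like, so a deployed build can toggle features without recompiling. The flag names and the toy pipeline are made up for illustration:

```python
# Hypothetical sketch of feature flags read from the environment. Flag names
# and the toy pipeline below are illustrative, not from any real system.
import os

def flag(name, default="off"):
    # A flag is "on" only if explicitly set to "on" in the environment.
    return os.environ.get(name, default) == "on"

def run_pipeline(data):
    # Each optimization is independently toggleable, which enables the
    # "turn features on one after the other" experiment.
    result = sorted(data) if flag("OPT_PRESORT") else list(data)
    if flag("OPT_DEDUP"):
        result = list(dict.fromkeys(result))  # order-preserving de-dup
    return result
```

Your experiment scripts can then sweep over flag combinations to produce the feature-by-feature breakdown, with no rebuilds in between.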
Some basic questions you can ask your data, based on unfortunate examples I’ve encountered in the wild:

- are you sure your database is not faster on reads because all reads return null?
- does your “Big Data” working set fit in an L3 cache?
- is your system dropping requests whereas the baseline isn’t?
- are you measuring latency differently for the baseline vs your system (round-trip vs one-way)?
- are you introducing competition side effects [1] (section 1.5.3)?
- are the baselines faster than your system, but you’ve driven them to congestion collapse?

There are several aspects of measurement methodology that I think every systems student should know. For example, the relationship between latency and throughput, open vs closed loop workload generation, how to understand bottlenecks, how to summarize performance data, the use of confidence intervals, and much more. I highly recommend these books and articles to learn more:

[1] “Performance Evaluation of Computer and Communication Systems”, Jean-Yves Le Boudec.
[2] “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”, Raj Jain.
[3] “Mathematical foundations of computer networking”, Srinivasan Keshav.
[4] “Systems Benchmarking Crimes”, Gernot Heiser</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">State Of Valhalla</title>
      
      
      <link href="https://lalith.in/2020/03/25/State-of-Valhalla/" rel="alternate" type="text/html" title="State Of Valhalla" />
      
      <published>2020-03-25T00:00:00+00:00</published>
      <updated>2020-03-25T00:00:00+00:00</updated>
      <id>https://lalith.in/2020/03/25/State-of-Valhalla</id>
      <content type="html" xml:base="https://lalith.in/2020/03/25/State-of-Valhalla/">&lt;p&gt;I highly recommend the latest &lt;a href=&quot;http://cr.openjdk.java.net/~briangoetz/valhalla/sov/01-background.html&quot;&gt;“State of
Valhalla”&lt;/a&gt;
document by Brian Goetz if you’re interested in the Java language (or just
language runtimes in general). It gives an accessible overview of
this huge shakeup of the JVM, made possible by the breakthrough observation
that inline classes and object references can simply re-use the JVM’s
“L-carrier type”.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I highly recommend the latest “State of Valhalla” document by Brian Goetz if you’re interested in the Java language (or just language runtimes in general). It gives an accessible overview of this huge shakeup of the JVM, made possible by the breakthrough observation that inline classes and object references can simply re-use the JVM’s “L-carrier type”.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Who Do I Sue?</title>
      
      
      <link href="https://lalith.in/2018/10/03/Who-Do-I-Sue/" rel="alternate" type="text/html" title="Who Do I Sue?" />
      
      <published>2018-10-03T00:00:00+00:00</published>
      <updated>2018-10-03T00:00:00+00:00</updated>
      <id>https://lalith.in/2018/10/03/Who-Do-I-Sue</id>
      <content type="html" xml:base="https://lalith.in/2018/10/03/Who-Do-I-Sue/">&lt;p&gt;&lt;em&gt;“Thus I will claim that the future of technology will be less determined by
what technology can do, than social, legal and other restraints on what we can
do.  Thus, if you stop to think about computer-controlled highway traffic – it
sounds good to you but ask yourself: who do I sue in an accident?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;– Richard Hamming, two decades before self-driving cars were a thing.&lt;/p&gt;

&lt;p&gt;You can see the lecture &lt;a href=&quot;https://youtu.be/AD4b-52jtos?t=1807&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">“Thus I will claim that the future of technology will be less determined by what technology can do, than social, legal and other restraints on what we can do. Thus, if you stop to think about computer-controlled highway traffic – it sounds good to you but ask yourself: who do I sue in an accident?” – Richard Hamming, two decades before self-driving cars were a thing. You can see the lecture here.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Rapid: Stable and Consistent Membership at Scale</title>
      
      
      <link href="https://lalith.in/2018/09/13/Rapid/" rel="alternate" type="text/html" title="Rapid: Stable and Consistent Membership at Scale" />
      
      <published>2018-09-13T00:00:00+00:00</published>
      <updated>2018-09-13T00:00:00+00:00</updated>
      <id>https://lalith.in/2018/09/13/Rapid</id>
      <content type="html" xml:base="https://lalith.in/2018/09/13/Rapid/">&lt;p&gt;This post gives an overview of our recent work on the Rapid system, presented
at USENIX ATC 2018 (you can find the paper, slides and presentation audio &lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;here
&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Rapid is a scalable, distributed membership service
– it allows processes to form clusters and receive notifications when the
   membership changes.&lt;/p&gt;

&lt;h4 id=&quot;why-design-another-membership-service&quot;&gt;&lt;em&gt;Why design another membership service?&lt;/em&gt;&lt;/h4&gt;

&lt;p&gt;Rapid is motivated by two key challenges.&lt;/p&gt;

&lt;p&gt;First, we observe that datacenter failure scenarios are not always crash
failures, but commonly involve misconfigured firewalls, one-way connectivity
loss, flip-flops in reachability, and some-but-not-all packets being dropped.
However, existing membership solutions struggle with these common failure
scenarios, despite being able to cleanly detect crash faults. In particular,
existing tools take a long time to converge, or never converge, to a stable state where the
faulty processes are removed. This is problematic in any distributed system:
membership changes often trigger failure recovery workflows, and repeatedly
triggering these workflows can not only degrade system performance, but also
cause widespread outages.&lt;/p&gt;

&lt;p&gt;Second, we note that inconsistent membership views in a system pose a
challenging programming abstraction for developers, especially for building
critical features in a distributed system like failure recovery. Without
strong consistency semantics, developers need to write code where a process
can make no assumptions about the world view of the rest of the system.&lt;/p&gt;

&lt;h4 id=&quot;enter-rapid&quot;&gt;&lt;em&gt;Enter Rapid&lt;/em&gt;&lt;/h4&gt;

&lt;p&gt;Rapid is a scalable, distributed membership system that is &lt;em&gt;stable&lt;/em&gt; in the
face of a diverse range of failure scenarios, and provides participating
processes a &lt;em&gt;strongly consistent view&lt;/em&gt; of the system’s membership. In
particular, Rapid guarantees that all processes see the same sequence
of membership changes to the system.&lt;/p&gt;

&lt;p&gt;Rapid drives membership changes through three steps: maintaining a monitoring
overlay, identifying a membership change proposal from monitoring alerts, and
arriving at agreement among processes on a proposal. In each of these steps,
we make design decisions that contribute to our goals of stability and
consistency.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/rapid-flow.jpg&quot; alt=&quot;Configuration changes in Rapid&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expander-based monitoring edge overlay&lt;/strong&gt;. Rapid organizes a set of processes
    (a configuration) into a stable failure detection topology comprising
    &lt;em&gt;observers&lt;/em&gt; that monitor and disseminate reports about their communication
    edges to their &lt;em&gt;subjects&lt;/em&gt;. The monitoring relationships between processes
    forms a directed expander graph with strong connectivity properties, which
    ensures with a high probability that healthy processes detect failures. We
    interpret &lt;em&gt;multiple&lt;/em&gt; reports about a subject’s edges as a high-fidelity
    signal that the subject is faulty. The monitoring edges represent &lt;em&gt;edge
    failure detectors&lt;/em&gt; that are pluggable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-process cut detection&lt;/strong&gt;. For stability, processes in Rapid (i) suspect
    a faulty process &lt;em&gt;p&lt;/em&gt; only upon receiving alerts from multiple observers of
    &lt;em&gt;p&lt;/em&gt;, and (ii) delay acting on alerts about different processes until the
    churn stabilizes, thereby converging to detect a global, possibly
    multi-node &lt;em&gt;cut&lt;/em&gt; of processes to add or remove from the membership. This
    filter is remarkably simple to implement, yet it suffices by itself to
    achieve &lt;em&gt;almost-everywhere agreement&lt;/em&gt; – unanimity among a large fraction of
    processes about the detected cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical consensus&lt;/strong&gt;. For consistency, we show that converting
    almost-everywhere agreement into full agreement is practical even in
    large-scale settings. Rapid’s consensus protocol drives configuration
    changes by a low-overhead, leaderless protocol in the common case: every
    process simply validates consensus by counting the number of identical cut
    detections. If there is a quorum containing three-quarters of the
    membership set with the same cut, then without a leader or further
    communication, this is a safe consensus decision.&lt;/p&gt;

&lt;h4 id=&quot;how-well-does-it-work&quot;&gt;&lt;em&gt;How well does it work?&lt;/em&gt;&lt;/h4&gt;

&lt;p&gt;We experimented with Rapid in moderately scalable settings comprising
1000-2000 nodes.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;paper&lt;/a&gt; has
detailed head-to-head comparisons against widely used membership solutions,
like Akka Cluster, ZooKeeper and Memberlist. The higher-order bit is that
these solutions deal well with crash failures and network partitions, but
face stability issues under complex network failure scenarios like asymmetric
reachability issues and high-packet loss. Rapid is stable under these
circumstances because of its expander-based monitoring overlay that can better
localize faults and its approach of removing entire cuts of faulty processes.&lt;/p&gt;

&lt;p&gt;At the same time, Rapid is also &lt;em&gt;fast&lt;/em&gt;. It can bootstrap a 2000 node cluster
2-5.8x faster than the alternatives we compared against, despite the fact that
Rapid offers much stronger guarantees around stability and consistency.&lt;/p&gt;

&lt;p&gt;Lastly, Rapid is comparable in cost to Memberlist (a gossip-based protocol) in
terms of network bandwidth and memory utilization.&lt;/p&gt;

&lt;p&gt;We also found Rapid easy to integrate in two existing applications:
a distributed transaction data-store, and a service discovery use case.&lt;/p&gt;

&lt;p&gt;We believe the insights we identify in Rapid are &lt;em&gt;easy&lt;/em&gt; to apply to existing
systems, and are happy to work with you on your use case. Feel free to reach
out to me if you’d like to chat!&lt;/p&gt;

&lt;h4 id=&quot;references&quot;&gt;References&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;https://www.usenix.org/conference/atc18/presentation/suresh&quot;&gt;USENIX ATC 2018 paper, slides and
presentation&lt;/a&gt;
&lt;br /&gt;
&lt;a href=&quot;https://github.com/lalithsuresh/rapid/&quot;&gt;Code on Github&lt;/a&gt; 
&lt;br /&gt;
&lt;a href=&quot;https://research.vmware.com/&quot;&gt;VMware Research
Group&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">This post gives an overview of our recent work on the Rapid system, presented at USENIX ATC 2018 (you can find the paper, slides and presentation audio here). Rapid is a scalable, distributed membership service – it allows processes to form clusters and receive notifications when the membership changes. Why design another membership service? Rapid is motivated by two key challenges. First, we observe that datacenter failure scenarios are not always crash failures, but commonly involve misconfigured firewalls, one-way connectivity loss, flip-flops in reachability, and some-but-not-all packets being dropped. However, existing membership solutions struggle with these common failure scenarios, despite being able to cleanly detect crash faults. In particular, existing tools take a long time to converge, or never converge, to a stable state where the faulty processes are removed. This is problematic in any distributed system: membership changes often trigger failure recovery workflows, and repeatedly triggering these workflows can not only degrade system performance, but also cause widespread outages. Second, we note that inconsistent membership views in a system pose a challenging programming abstraction for developers, especially for building critical features in a distributed system like failure recovery. Without strong consistency semantics, developers need to write code where a process can make no assumptions about the world view of the rest of the system. Enter Rapid Rapid is a scalable, distributed membership system that is stable in the face of a diverse range of failure scenarios, and provides participating processes a strongly consistent view of the system’s membership. In particular, Rapid guarantees that all processes see the same sequence of membership changes to the system. 
Rapid drives membership changes through three steps: maintaining a monitoring overlay, identifying a membership change proposal from monitoring alerts, and arriving at agreement among processes on a proposal. In each of these steps, we make design decisions that contribute to our goals of stability and consistency. Expander-based monitoring edge overlay. Rapid organizes a set of processes (a configuration) into a stable failure detection topology comprising observers that monitor and disseminate reports about their communication edges to their subjects. The monitoring relationships between processes form a directed expander graph with strong connectivity properties, which ensures with high probability that healthy processes detect failures. We interpret multiple reports about a subject’s edges as a high-fidelity signal that the subject is faulty. The monitoring edges represent edge failure detectors that are pluggable. Multi-process cut detection. For stability, processes in Rapid (i) suspect a faulty process p only upon receiving alerts from multiple observers of p, and (ii) delay acting on alerts about different processes until the churn stabilizes, thereby converging to detect a global, possibly multi-node cut of processes to add or remove from the membership. This filter is remarkably simple to implement, yet it suffices by itself to achieve almost-everywhere agreement – unanimity among a large fraction of processes about the detected cut. Practical consensus. For consistency, we show that converting almost-everywhere agreement into full agreement is practical even in large-scale settings. Rapid’s consensus protocol drives configuration changes by a low-overhead, leaderless protocol in the common case: every process simply validates consensus by counting the number of identical cut detections. 
If there is a quorum containing three-quarters of the membership set with the same cut, then without a leader or further communication, this is a safe consensus decision. How well does it work? We experimented with Rapid in moderately scalable settings comprising 1000-2000 nodes. The paper has detailed head-to-head comparisons against widely used membership solutions, like Akka Cluster, ZooKeeper and Memberlist. The higher-order bit is that these solutions deal well with crash failures and network partitions, but face stability issues under complex network failure scenarios like asymmetric reachability issues and high-packet loss. Rapid is stable under these circumstances because of its expander-based monitoring overlay that can better localize faults and its approach of removing entire cuts of faulty processes. At the same time, Rapid is also fast. It can bootstrap a 2000 node cluster 2-5.8x faster than the alternatives we compared against, despite the fact that Rapid offers much stronger guarantees around stability and consistency. Lastly, Rapid is comparable in cost to Memberlist (a gossip-based protocol) in terms of network bandwidth and memory utilization. We also found Rapid easy to integrate in two existing applications: a distributed transaction data-store, and a service discovery use case. We believe the insights we identify in Rapid are easy to apply to existing systems, and are happy to work with you on your use case. Feel free to reach out to me if you’d like to chat! References USENIX ATC 2018 paper, slides and presentation Code on Github VMware Research Group</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">People In Order</title>
      
      
      <link href="https://lalith.in/2018/08/05/People-In-Order/" rel="alternate" type="text/html" title="People In Order" />
      
      <published>2018-08-05T00:00:00+00:00</published>
      <updated>2018-08-05T00:00:00+00:00</updated>
      <id>https://lalith.in/2018/08/05/People-In-Order</id>
      <content type="html" xml:base="https://lalith.in/2018/08/05/People-In-Order/">&lt;p&gt;Here’s a fascinating piece I found on &lt;a href=&quot;https://aeon.co&quot;&gt;aeon.co&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://aeon.co/videos/what-can-73-homes-arranged-by-household-income-say-about-their-residents&quot;&gt;What can 73 homes arranged by household income say about their residents?&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Here’s a fascinating piece I found on aeon.co: What can 73 homes arranged by household income say about their residents?</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Invictus</title>
      
      
      <link href="https://lalith.in/2017/10/01/Invictus/" rel="alternate" type="text/html" title="Invictus" />
      
      <published>2017-10-01T00:00:00+00:00</published>
      <updated>2017-10-01T00:00:00+00:00</updated>
      <id>https://lalith.in/2017/10/01/Invictus</id>
      <content type="html" xml:base="https://lalith.in/2017/10/01/Invictus/">&lt;p&gt;Ever so rarely, I find a poem that gives me goosebumps every time I read it. One such poem is Invictus, by William Ernest Henley.&lt;/p&gt;

&lt;p&gt;The poem is perhaps best known for having been a favourite of Nelson Mandela, who used to recite it to his fellow inmates.&lt;/p&gt;

&lt;p&gt;This tribute by the one and only &lt;a href=&quot;http://zenpencils.com/comic/140-invictus-a-comic-tribute-to-nelson-mandela/&quot;&gt;Zen Pencils&lt;/a&gt; is therefore
something I keep coming back to.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Ever so rarely, I find a poem that gives me goosebumps every time I read it. One such poem is Invictus, by William Ernest Henley. The poem is perhaps best known for having been a favourite of Nelson Mandela, who used to recite it to his fellow inmates. This tribute by the one and only Zen Pencils is therefore something I keep coming back to.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">The New Colossus</title>
      
      
      <link href="https://lalith.in/2017/01/03/The-New-Colossus/" rel="alternate" type="text/html" title="The New Colossus" />
      
      <published>2017-01-03T00:00:00+00:00</published>
      <updated>2017-01-03T00:00:00+00:00</updated>
      <id>https://lalith.in/2017/01/03/The-New-Colossus</id>
      <content type="html" xml:base="https://lalith.in/2017/01/03/The-New-Colossus/">&lt;p&gt;Given the news lately, &lt;a href=&quot;https://en.wikipedia.org/wiki/The_New_Colossus&quot;&gt;The New Colossus&lt;/a&gt; by Emma Lazarus seems very relevant.&lt;/p&gt;

&lt;center&gt;
&lt;em&gt;
Not like the brazen giant of Greek fame,  &lt;br /&gt;
With conquering limbs astride from land to land;  &lt;br /&gt;
Here at our sea-washed, sunset gates shall stand  &lt;br /&gt;
A mighty woman with a torch, whose flame  &lt;br /&gt;
Is the imprisoned lightning, and her name  &lt;br /&gt;
MOTHER OF EXILES. From her beacon-hand  &lt;br /&gt;
Glows world-wide welcome; her mild eyes command  &lt;br /&gt;
The air-bridged harbor that twin cities frame.  &lt;br /&gt;
&lt;br /&gt;
&quot;Keep, ancient lands, your storied pomp!&quot; cries she  &lt;br /&gt;
With silent lips. &quot;Give me your tired, your poor,  &lt;br /&gt;
Your huddled masses yearning to breathe free,  &lt;br /&gt;
The wretched refuse of your teeming shore.  &lt;br /&gt;
Send these, the homeless, tempest-tost to me,  &lt;br /&gt;
I lift my lamp beside the golden door!  &lt;br /&gt;
&lt;/em&gt;
&lt;/center&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Given the news lately, The New Colossus by Emma Lazarus seems very relevant. Not like the brazen giant of Greek fame, With conquering limbs astride from land to land; Here at our sea-washed, sunset gates shall stand A mighty woman with a torch, whose flame Is the imprisoned lightning, and her name MOTHER OF EXILES. From her beacon-hand Glows world-wide welcome; her mild eyes command The air-bridged harbor that twin cities frame. &quot;Keep, ancient lands, your storied pomp!&quot; cries she With silent lips. &quot;Give me your tired, your poor, Your huddled masses yearning to breathe free, The wretched refuse of your teeming shore. Send these, the homeless, tempest-tost to me, I lift my lamp beside the golden door!</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Is -0 a number?</title>
      
      
      <link href="https://lalith.in/2016/11/12/Signed-Zeroes/" rel="alternate" type="text/html" title="Is -0 a number?" />
      
      <published>2016-11-12T00:00:00+00:00</published>
      <updated>2016-11-12T00:00:00+00:00</updated>
      <id>https://lalith.in/2016/11/12/Signed-Zeroes</id>
      <content type="html" xml:base="https://lalith.in/2016/11/12/Signed-Zeroes/">&lt;p&gt;Reproducing an old answer of mine from Quora.&lt;/p&gt;

&lt;p&gt;Is -0 a number? Yes it is, and it is equal to 0.&lt;/p&gt;

&lt;p&gt;Interestingly though, signed zeros are a necessary representation in computing because of rounding off errors and limitations with floating point precision. In certain classes of computations, the sign of a number before it was rounded off to zero is of practical importance.&lt;/p&gt;
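
&lt;p&gt;Both facts are easy to demonstrate. Here’s a minimal Python sketch (the behaviour follows from IEEE 754 itself, not anything Python-specific):&lt;/p&gt;

```python
import math

# -0.0 and 0.0 compare equal, as stated above...
assert -0.0 == 0.0

# ...but the sign is preserved and observable:
print(str(-0.0))                 # "-0.0"
print(math.copysign(1.0, -0.0))  # -1.0

# atan2 distinguishes the two zeros, which matters near branch cuts:
print(math.atan2(0.0, 0.0))      # 0.0
print(math.atan2(0.0, -0.0))     # 3.141592653589793 (pi)
```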

&lt;p&gt;Here’s an example from the Wikipedia article on &lt;a href=&quot;https://en.wikipedia.org/wiki/Signed_zero&quot;&gt;signed zeros&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Informally, one may use the notation “−0” for a negative value that was rounded to zero. This notation may be useful when a negative sign is significant; for example, when tabulating Celsius temperatures, where a negative sign means below freezing.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Reproducing an old answer of mine from Quora. Is -0 a number? Yes it is, and it is equal to 0. Interestingly though, signed zeros are a necessary representation in computing because of rounding off errors and limitations with floating point precision. In certain classes of computations, the sign of a number before it was rounded off to zero is of practical importance. Here’s an example from the Wikipedia article on signed zeros: Informally, one may use the notation “−0” for a negative value that was rounded to zero. This notation may be useful when a negative sign is significant; for example, when tabulating Celsius temperatures, where a negative sign means below freezing.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">The Garden of Earthly Delights</title>
      
      
      <link href="https://lalith.in/2016/09/30/Hieronymus-Bosch/" rel="alternate" type="text/html" title="The Garden of Earthly Delights" />
      
      <published>2016-09-30T00:00:00+00:00</published>
      <updated>2016-09-30T00:00:00+00:00</updated>
      <id>https://lalith.in/2016/09/30/Hieronymus-Bosch</id>
      <content type="html" xml:base="https://lalith.in/2016/09/30/Hieronymus-Bosch/">&lt;p&gt;Today, my colleague &lt;a href=&quot;https://udiwieder.wordpress.com&quot;&gt;Udi&lt;/a&gt; introduced me to Hieronymus Bosch’s “&lt;a href=&quot;http://www.esotericbosch.com/Garden.htm&quot;&gt;The Garden of Earthly Delights&lt;/a&gt;”. I’m far from being an art connoisseur, but I’m speechless.&lt;/p&gt;

&lt;p&gt;More than an hour in, I still haven’t made it past the central panel.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">Today, my colleague Udi introduced me to Hieronymus Bosch’s “The Garden of Earthly Delights”. I’m far from being an art connoisseur, but I’m speechless. More than an hour in, I still haven’t made it past the central panel.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Defended</title>
      
      
      <link href="https://lalith.in/2016/06/27/defended/" rel="alternate" type="text/html" title="Defended" />
      
      <published>2016-06-27T00:00:00+00:00</published>
      <updated>2016-06-27T00:00:00+00:00</updated>
      <id>https://lalith.in/2016/06/27/defended</id>
      <content type="html" xml:base="https://lalith.in/2016/06/27/defended/">&lt;p&gt;And I finally defended my PhD thesis.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/defended.jpg&quot; alt=&quot;Drawing&quot; style=&quot;width: 700px&quot; /&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">And I finally defended my PhD thesis.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection</title>
      
      
      <link href="https://lalith.in/2015/06/07/c3/" rel="alternate" type="text/html" title="C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection" />
      
      <published>2015-06-07T00:00:00+00:00</published>
      <updated>2015-06-07T00:00:00+00:00</updated>
      <id>https://lalith.in/2015/06/07/c3</id>
      <content type="html" xml:base="https://lalith.in/2015/06/07/c3/">&lt;p&gt;After a long hiatus of technical posts, I’m finally getting around to blogging
about my PhD research. Today, I’ll give a brief overview of some of my recent work on
the C3 system that was published at &lt;a href=&quot;https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/suresh&quot;&gt;NSDI 2015&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My research has focused on techniques to reduce latency in the context of
large-scale distributed storage systems. A common pattern in the way people
architect scalable web-services today is to have large request fanouts, where even a
single end-user request can trigger tens to thousands of data accesses to the
storage tier. In the presence of such access patterns, the &lt;em&gt;tail latency&lt;/em&gt; of
your storage servers becomes very important since it begins to dominate the
overall query time.&lt;/p&gt;

&lt;p&gt;At the same time, storage servers are typically chaotic. Skewed demands across
storage servers, queueing delays across various layers of the stack,
background activities such as garbage collection and SSTable compaction, as
well as resource contention with co-located workloads are some of the many
factors that lead to &lt;em&gt;performance fluctuations&lt;/em&gt; across storage servers. These
sources of performance fluctuations can quickly inflate the tail latency of
your storage system, and degrade the performance of application services that
depend on the storage tier.&lt;/p&gt;

&lt;p&gt;In light of this issue, we investigate how &lt;em&gt;replica selection&lt;/em&gt;, wherein a
database client can select one out of multiple replicas to service a read
request, can be used to cope with server-side performance fluctuations at the
storage layer. That is, can clients carefully select replicas for serving
reads with the objective of improving their response times?&lt;/p&gt;

&lt;p&gt;This is challenging for several reasons. First of all, clients need a way to
reliably measure and adapt to performance fluctuations across storage servers.
Secondly, a fleet of clients needs to ensure that they do not enter herd
behaviours or load oscillations because all of them are trying to improve
their response times by going after faster servers. As it turns out,
many popular systems either do a poor job of replica selection because
they are agnostic to performance heterogeneity across storage servers,
or are prone to herd behaviours because they get performance-aware
replica selection wrong.&lt;/p&gt;

&lt;p&gt;C3 addresses these problems through a careful combination of two mechanisms.
First, clients in a C3 system, with some help from the servers, carefully
rank replicas in order to balance request queues across servers in proportion
to their performance differences. We refer to this as replica ranking. Second,
C3 clients use a congestion-control-esque approach to distributed rate
control, where clients adjust and throttle their sending rates to individual
servers in a fully decentralized fashion. This ensures that C3 clients
do not collectively send more requests per second to a server than it
can actually process.&lt;/p&gt;
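
&lt;p&gt;To make the ranking idea concrete, here is a toy sketch. To be clear, this is &lt;em&gt;not&lt;/em&gt; C3’s actual scoring function (see the paper for that); it only illustrates the general notion of preferring replicas whose queues, scaled by their service rates, imply the lowest expected wait:&lt;/p&gt;

```python
# Toy illustration of performance-aware replica ranking; NOT the actual
# C3 scoring function from the paper. Each replica reports an estimated
# queue size and a service rate (requests/sec); clients prefer the
# replica with the lowest expected wait, queue_size / service_rate.

def rank_replicas(replicas):
    """replicas: dict mapping replica name to (queue_size, service_rate).
    Returns names ordered from most to least preferred."""
    return sorted(replicas, key=lambda r: replicas[r][0] / replicas[r][1])

servers = {
    "a": (4.0, 2.0),  # expected wait 2.0s
    "b": (9.0, 1.0),  # expected wait 9.0s (e.g. mid-compaction)
    "c": (9.0, 3.0),  # expected wait 3.0s
}
print(rank_replicas(servers))  # ['a', 'c', 'b']
```

&lt;p&gt;C3’s real ranking function additionally penalizes queue sizes super-linearly, which discourages every client from piling onto the same lightly loaded server at once.&lt;/p&gt;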

&lt;p&gt;The combination of these two mechanisms gives C3 some impressive performance
improvements over Cassandra’s Dynamic Snitching, which we used as a baseline.
In experiments conducted on Amazon EC2, we found C3 to improve the 99.9th
percentile latency by a factor of 3, while improving read throughput by up to
50%. See the &lt;a href=&quot;https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/suresh&quot;&gt;paper&lt;/a&gt;
for details regarding the various experiments we ran as well as the settings considered.&lt;/p&gt;

&lt;p&gt;While the system evaluation in the paper was conducted using the Yahoo Cloud
Serving Benchmark (YCSB), I’m currently investigating how C3 performs under
production settings through some companies who’ve agreed to give it a test
run. So far, the tests have been rather positive and we’ve been learning
a lot more about C3 and the problem of replica selection in general. Stay tuned
for more results!&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">After a long hiatus of technical posts, I’m finally getting around to blogging about my PhD research. Today, I’ll give a brief overview of some of my recent work on the C3 system that was published at NSDI 2015. My research has focused on techniques to reduce latency in the context of large-scale distributed storage systems. A common pattern in the way people architect scalable web-services today is to have large request fanouts, where even a single end-user request can trigger tens to thousands of data accesses to the storage tier. In the presence of such access patterns, the tail latency of your storage servers becomes very important since it begins to dominate the overall query time. At the same time, storage servers are typically chaotic. Skewed demands across storage servers, queueing delays across various layers of the stack, background activities such as garbage collection and SSTable compaction, as well as resource contention with co-located workloads are some of the many factors that lead to performance fluctuations across storage servers. These sources of performance fluctuations can quickly inflate the tail latency of your storage system, and degrade the performance of application services that depend on the storage tier. In light of this issue, we investigate how replica selection, wherein a database client can select one out of multiple replicas to service a read request, can be used to cope with server-side performance fluctuations at the storage layer. That is, can clients carefully select replicas for serving reads with the objective of improving their response times? This is challenging for several reasons. First of all, clients need a way to reliably measure and adapt to performance fluctuations across storage servers. Secondly, a fleet of clients needs to ensure that they do not enter herd behaviours or load oscillations because all of them are trying to improve their response times by going after faster servers. 
As it turns out, many popular systems either do a poor job of replica selection because they are agnostic to performance heterogeneity across storage servers, or are prone to herd behaviours because they get performance-aware replica selection wrong. C3 addresses these problems through a careful combination of two mechanisms. First, clients in a C3 system, with some help from the servers, carefully rank replicas in order to balance request queues across servers in proportion to their performance differences. We refer to this as replica ranking. Second, C3 clients use a congestion-control-esque approach to distributed rate control, where clients adjust and throttle their sending rates to individual servers in a fully decentralized fashion. This ensures that C3 clients do not collectively send more requests per second to a server than it can actually process. The combination of these two mechanisms gives C3 some impressive performance improvements over Cassandra’s Dynamic Snitching, which we used as a baseline. In experiments conducted on Amazon EC2, we found C3 to improve the 99.9th percentile latency by a factor of 3, while improving read throughput by up to 50%. See the paper for details regarding the various experiments we ran as well as the settings considered. While the system evaluation in the paper was conducted using the Yahoo Cloud Serving Benchmark (YCSB), I’m currently investigating how C3 performs under production settings through some companies who’ve agreed to give it a test run. So far, the tests have been rather positive and we’ve been learning a lot more about C3 and the problem of replica selection in general. Stay tuned for more results!</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Interactive Classroom Hack</title>
      
      
      <link href="https://lalith.in/2015/01/23/interactive-classroom-hack/" rel="alternate" type="text/html" title="Interactive Classroom Hack" />
      
      <published>2015-01-23T00:00:00+00:00</published>
      <updated>2015-01-23T00:00:00+00:00</updated>
      <id>https://lalith.in/2015/01/23/interactive-classroom-hack</id>
      <content type="html" xml:base="https://lalith.in/2015/01/23/interactive-classroom-hack/">&lt;p&gt;I gave a lecture yesterday as part of a lab course I was TA-ing. The
assignment for this week had to do with understanding how different TCP
variants perform in a wireless setting.&lt;/p&gt;

&lt;p&gt;To prepare students for the assignment, my lecture was designed to be a
refresher on TCP’s basics.&lt;/p&gt;

&lt;p&gt;My plan was to discuss what TCP sets out to accomplish, some of the
early design problems associated with it, and how each subsequent improvement
of TCP solved a problem that the previous one didn’t (or introduced). This
could have been a very one-sided lecture, with me parroting all of the above.
But the best way to keep a classroom interactive is to deliver a lecture
packed with &lt;em&gt;questions&lt;/em&gt;, and have the students come up with the answers.&lt;/p&gt;

&lt;p&gt;This meant that I began the lecture by asking students what TCP tries to
accomplish. The students threw all kinds of answers at me, and we discussed
each of them one after the other. We talked  about what reliability means, how
reliable TCP’s guarantee of reliability actually is, and from a performance
standpoint, what TCP tries to accomplish. Note, at this point I’m still on
slide number 1 with only “&lt;em&gt;What are TCP’s objectives?&lt;/em&gt;” on it.  Next, we went
into the law of conservation of packets, and I asked them why that matters.
After that round of discussions was complete, we started with TCP Tahoe. I
posed each problem that TCP Tahoe tries to fix,  the problems it doesn’t fix,
and also asked them what the ramification is/would be of a certain design
decision of Tahoe. This went on for a while, with the students getting more
and more worked up about the topic, until we finally covered all the TCP
variants I had planned on teaching. By this point, the students themselves had
discussed, debated and attempted to solve each of the many issues associated
with making TCP perform well.&lt;/p&gt;

&lt;p&gt;Next, we moved on to the problems associated with TCP over wireless, and I asked
them to suggest avenues for constructing a solution. The discussion that followed
was pretty exciting, and at some point they even began correcting and arguing with
each other. Little did they know that this one-line problem statement I offered them
took several PhD theses to even construct partially working solutions.&lt;/p&gt;

&lt;p&gt;I’ve tried different variations of this strategy in the past, and after
all these years I’ve concluded this: Leaving students with questions during a lecture puts them in the shoes of
those before them who tried to find the answers. Leaving students with the answers
makes them mere consumers of knowledge.&lt;/p&gt;

&lt;p&gt;When we tell students about a solution alongside the problem itself, we’ve
already put horse blinders on their chain of thought. We’re directing their
thoughts through a linear chain. Leaving them with the questions long enough
makes them think more, and in my opinion, works very well in
making a classroom interactive.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I gave a lecture yesterday as part of a lab course I was TA-ing. The assignment for this week had to do with understanding how different TCP variants perform in a wireless setting. To prepare students for the assignment, my lecture was designed to be a refresher on TCP’s basics. My plan was to discuss what TCP sets out to accomplish, some of the early design problems associated with it, and how each subsequent improvement of TCP solved a problem that the previous one didn’t (or introduced). This could have been a very one-sided lecture, with me parroting all of the above. But the best way to keep a classroom interactive is to deliver a lecture packed with questions, and have the students come up with the answers. This meant that I began the lecture by asking students what TCP tries to accomplish. The students threw all kinds of answers at me, and we discussed each of them one after the other. We talked about what reliability means, how reliable TCP’s guarantee of reliability actually is, and from a performance standpoint, what TCP tries to accomplish. Note, at this point I’m still on slide number 1 with only “What are TCP’s objectives?” on it. Next, we went into the law of conservation of packets, and I asked them why that matters. After that round of discussions were complete, we started with TCP Tahoe. I posed each problem that TCP Tahoe tries to fix, the problems it doesn’t fix, and also asked them what the ramification is/would be of a certain design decision of Tahoe. This went on for a while, with the students getting more and more worked up about the topic, until we finally covered all the TCP variants I had planned on teaching. By this point, the students themselves had discussed, debated and attempted to solve each of the many issues associated with making TCP perform well. Next, we moved on to the problems associated with TCP over wireless, and I asked them to suggest avenues for constructing a solution. 
The discussion that followed was pretty exciting, and at some point they even began correcting and arguing with each other. Little did they know that this one-line problem statement I offered them took several PhD theses to even construct partially working solutions. I’ve tried different variations of this strategy in the past, and after all these years I’ve concluded this: Leaving students with questions during a lecture puts them in the shoes of those before them who tried to find the answers. Leaving students with the answers makes them mere consumers of knowledge. When we tell students about a solution alongside the problem itself, we’ve already put horse blinders on their chain of thought. We’re directing their thoughts through a linear chain. Leaving them with the questions long enough makes them think more, and in my opinion, works very well in making a classroom interactive.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Academic Abandonware</title>
      
      
      <link href="https://lalith.in/2014/03/24/academic-abandonware/" rel="alternate" type="text/html" title="Academic Abandonware" />
      
      <published>2014-03-24T00:00:00+00:00</published>
      <updated>2014-03-24T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/03/24/academic-abandonware</id>
      <content type="html" xml:base="https://lalith.in/2014/03/24/academic-abandonware/">&lt;p&gt;I recently stumbled upon &lt;a href=&quot;http://reproducibility.cs.arizona.edu/tr.pdf&quot;&gt;this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The gist of the discussion is that a good deal of CS research published at
reputable venues is notoriously difficult or even impossible to replicate.
Hats off to the team from Arizona for helping to bring this to the limelight.
It’s something we as a community ought to be really concerned about.&lt;/p&gt;

&lt;p&gt;Among the most common reasons seem to be:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;None of the authors can be contacted for any help relating to the paper.&lt;/li&gt;
  &lt;li&gt;Single points of failure: the only author capable of reproducing the work
has graduated.&lt;/li&gt;
  &lt;li&gt;The objective of publishing a paper being accomplished,
the software went unmaintained and requires divine intervention to even
build/setup, let alone use.&lt;/li&gt;
  &lt;li&gt;The software used or built in the paper cannot be publicly released. This is either due to licensing reasons,
the first two points, or plain refusal by the authors.&lt;/li&gt;
  &lt;li&gt;Critical details that are required to re-implement the work are omitted from the paper.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the criticisms I have of
the study is that their methodology involved marking a piece of code as
“cannot build” if 30 minutes of programmer time was insufficient to build the
tool. I doubt many of my own sincere attempts to make code publicly available
would pass this test. &lt;a href=&quot;http://github.com/lalithsuresh/odin&quot;&gt;Odin&lt;/a&gt; comes to my mind
here, which is a pain to set up despite the fact that others have and do
successfully use it for their research.&lt;/p&gt;

&lt;p&gt;So what can we do to minimise academic abandonware? Packaging your entire
software environment into VMs and releasing them via a project website sounds
to me like an idea worth pursuing. It avoids the problem of having to find,
compile and link combinations of ancient libraries. True, it doesn’t help if
one requires special hardware resources or a testbed in order to run the
system, but it’s a start nevertheless. Investing time and research into
building thoroughly validated simulators and emulators may also aid in this
direction.&lt;/p&gt;

&lt;p&gt;I’ll end this post with a comic I once drew.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/img/codequality.png&quot; alt=&quot;My helpful screenshot&quot; /&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I recently stumbled upon this. The gist of the discussion is that a good deal of CS research published at reputable venues is notoriously difficult or even impossible to replicate. Hats off to the team from Arizona for helping to bring this to the limelight. It’s something we as a community ought to be really concerned about. Among the most common reasons seem to be: None of the authors can be contacted for any help relating to the paper. Single points of failure: the only author capable of reproducing the work has graduated. The objective of publishing a paper being accomplished, the software went unmaintained and requires divine intervention to even build/setup, let alone use. The software used or built in the paper cannot be publicly released. This is either due to licensing reasons, the first two points, or plain refusal by the authors. Critical details that are required to re-implement the work are omitted from the paper. One of the criticisms I have of the study is that their methodology involved marking a piece of code as “cannot build” if 30 minutes of programmer time was insufficient to build the tool. I doubt many of my own sincere attempts to make code publicly available would pass this test. Odin comes to my mind here, which is a pain to set up despite the fact that others have and do successfully use it for their research. So what can we do to minimise academic abandonware? Packaging your entire software environment into VMs and releasing them via a project website sounds to me like an idea worth pursuing. It avoids the problem of having to find, compile and link combinations of ancient libraries. True, it doesn’t help if one requires special hardware resources or a testbed in order to run the system, but it’s a start nevertheless. Investing time and research into building thoroughly validated simulators and emulators may also aid in this direction. I’ll end this post with a comic I once drew.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Mental Detox</title>
      
      
      <link href="https://lalith.in/2014/02/18/mental-detox/" rel="alternate" type="text/html" title="Mental Detox" />
      
      <published>2014-02-18T00:00:00+00:00</published>
      <updated>2014-02-18T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/02/18/mental-detox</id>
      <content type="html" xml:base="https://lalith.in/2014/02/18/mental-detox/">&lt;p&gt;I’ve had a couple of paper deadlines in the last few months, all of which were
not-so-conveniently placed a couple of hours apart from each other. While the
month leading up to it was insanely stressful, I managed to push out most of
what I had in the pipeline and don’t have any more paper deadlines to worry
about for a few months.&lt;/p&gt;

&lt;p&gt;I’m now doing the usual post-submission-mental-detox to clear up my head,
where I’ve been taking it easy at work and catching up on life in general.
I’ve been completing some pending reviews, preparing an undergraduate course
for the upcoming semester, and rabidly catching up on lost gaming time. I’m
also going on holiday to Argentina in a week, an opportunity to completely disconnect
from work altogether.&lt;/p&gt;

&lt;p&gt;This freedom to manage my time the way that suits me best is what I enjoy the
most about doing a PhD.  I can be working insanely hard in the weeks leading
up to a deadline to push out a paper, and then slow down for a while
to clear up again.&lt;/p&gt;

&lt;p&gt;Now back to exploring dungeons in Skyrim.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">I’ve had a couple of paper deadlines in the last few months, all of which were not-so-conveniently placed a couple of hours apart from each other. While the month leading up to it was insanely stressful, I managed to push out most of what I had in the pipeline and don’t have any more paper deadlines to worry about for a few months. I’m now doing the usual post-submission-mental-detox to clear up my head, where I’ve been taking it easy at work and catching up on life in general. I’ve been completing some pending reviews, preparing an undergraduate course for the upcoming semester, and rabidly catching up on lost gaming time. I’m also going on holiday to Argentina in a week, an opportunity to completely disconnect from work altogether. This freedom to manage my time the way that suits me best is what I enjoy the most about doing a PhD. I can be working insanely hard in the weeks leading up to a deadline to push out a paper, and then slow down for a while to clear up again. Now back to exploring dungeons in Skyrim.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Dream Setup</title>
      
      
      <link href="https://lalith.in/2014/01/22/dream-setup/" rel="alternate" type="text/html" title="Dream Setup" />
      
      <published>2014-01-22T00:00:00+00:00</published>
      <updated>2014-01-22T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/01/22/dream-setup</id>
      <content type="html" xml:base="https://lalith.in/2014/01/22/dream-setup/">&lt;p&gt;&lt;a href=&quot;http://usesthis.com/&quot;&gt;The Setup Interviews&lt;/a&gt; is an interesting website which features interviews with professionals
from different fields about the hardware/software they use on a daily basis. The interviews conclude with the interviewees
being asked what their dream setup is. While most people tend to answer this question with some set of gizmos they’d like to own,
I feel &lt;a href=&quot;http://usesthis.com/&quot;&gt;Matt Might&lt;/a&gt; got it right in his answer:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When I was young, I dreamed about building a “nerd cave” full of fast hardware, big monitors, sleek software and cool gadgets.
I see now that technology can only nip at the margins of happiness, creativity and productivity relative to the effect of having sharp colleagues, good friends and close family nearby.
I have many sharp colleagues that double as good friends.
And, there’s an outside chance that in the next two or three years both of my brothers and all three of my sisters-in-law (each of whom is like an actual sister to me) will have joined me and my wife in Utah.
I hope it happens.
That’s my dream setup.&lt;/p&gt;
&lt;/blockquote&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">The Setup Interviews is an interesting website which features interviews with professionals from different fields about the hardware/software they use on a daily basis. The interviews conclude with the interviewees being asked what their dream setup is. While most people tend to answer this question with some set of gizmos they’d like to own, I feel Matt Might got it right in his answer: When I was young, I dreamed about building a “nerd cave” full of fast hardware, big monitors, sleek software and cool gadgets. I see now that technology can only nip at the margins of happiness, creativity and productivity relative to the effect of having sharp colleagues, good friends and close family nearby. I have many sharp colleagues that double as good friends. And, there’s an outside chance that in the next two or three years both of my brothers and all three of my sisters-in-law (each of whom is like an actual sister to me) will have joined me and my wife in Utah. I hope it happens. That’s my dream setup.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Moving to Github</title>
      
      
      <link href="https://lalith.in/2014/01/02/moving-to-github/" rel="alternate" type="text/html" title="Moving to Github" />
      
      <published>2014-01-02T00:00:00+00:00</published>
      <updated>2014-01-02T00:00:00+00:00</updated>
      <id>https://lalith.in/2014/01/02/moving-to-github</id>
      <content type="html" xml:base="https://lalith.in/2014/01/02/moving-to-github/">&lt;p&gt;It’s the New Year and that means it’s time for change. I’ve finally moved my blog off wordpress.com and onto Github + Jekyll.&lt;/p&gt;

&lt;p&gt;Jekyll has been a pleasure to deal with so far. The import from WordPress was mostly trivial but with some rough edges.&lt;/p&gt;

&lt;p&gt;First, export your wordpress.com blog using the admin console (you should get an xml dump of your site) and then run:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# Assumes the XML dump is named wordpressdotcom.xml,&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$:&lt;/span&gt; jekyll import wordpressdotcom&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I only exported my posts because I wanted to set up pages myself. The above command should populate Jekyll’s
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_posts&lt;/code&gt; folder with your blog’s posts as html files.&lt;/p&gt;

&lt;p&gt;I found the generated html to be rather mangled; there were no paragraph separations and blockquotes looked ugly. This required
some monkey patching with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sed&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;awk&lt;/code&gt; to fix. There are still a few loose ends left, which I’ll get around to later.
I’ve set up &lt;a href=&quot;http://disqus.com&quot;&gt;Disqus&lt;/a&gt; for comments, and I still need to import all the comments from the WordPress site.&lt;/p&gt;
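
&lt;p&gt;For the curious, here’s a small Python sketch of the kind of cleanup involved. The real fixes were throwaway sed/awk one-liners, so treat this as illustrative only:&lt;/p&gt;

```python
import re

# Illustrative cleanup of an imported post (the real fixes were throwaway
# sed/awk one-liners): normalize Windows line endings and collapse runs
# of blank lines down to a single paragraph break.
def tidy(html):
    html = html.replace("\r\n", "\n")
    return re.sub(r"\n{3,}", "\n\n", html)

print(tidy("one\r\n\r\n\r\n\r\ntwo"))  # "one", a blank line, then "two"
```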

&lt;p&gt;I’m currently on the &lt;a href=&quot;http://andhyde.com&quot;&gt;Hyde theme&lt;/a&gt; which I modified a bit to suit my liking.&lt;/p&gt;

&lt;p&gt;All in all, it’s been a breeze to deploy over Github and I’m quite happy to have a lot more control over how my site looks.&lt;/p&gt;</content>

      
      
      
      
      

      

      

      

      
        <summary type="html">It’s the New Year, and that means it’s time for change. I’ve finally moved my blog off wordpress.com and onto GitHub + Jekyll. Jekyll has been a pleasure to deal with so far. The import from WordPress was mostly trivial, but with some rough edges. First, export your wordpress.com blog using the admin console (you should get an XML dump of your site) and then run: # Assumes the XML dump is named wordpressdotcom.xml $: jekyll import wordpressdotcom I only exported my posts because I wanted to set up the pages myself. The above command should populate Jekyll’s _posts folder with your blog’s posts as HTML files. I found the generated HTML to be rather mangled; there were no paragraph separations, and blockquotes looked ugly. This required some monkey patching with sed and awk to fix. There are still a few loose ends left, which I’ll get around to later. I’ve set up Disqus for comments, and I still need to import all the comments from the WordPress site. I’m currently on the Hyde theme which I modified a bit to suit my liking. All in all, it’s been a breeze to deploy on GitHub, and I’m quite happy to have a lot more control over how my site looks.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Procrastination post: Erdos Number</title>
      
      
      <link href="https://lalith.in/2013/12/15/procrastination-post-erdos-number/" rel="alternate" type="text/html" title="Procrastination post: Erdos Number" />
      
      <published>2013-12-15T00:00:00+00:00</published>
      <updated>2013-12-15T00:00:00+00:00</updated>
      <id>https://lalith.in/2013/12/15/procrastination-post-erdos-number</id>
      <content type="html" xml:base="https://lalith.in/2013/12/15/procrastination-post-erdos-number/">&lt;p&gt;While waiting for my combinatorial explosion of a factorial design experiment to complete, I decided to find out what my &lt;a href=&quot;http://en.wikipedia.org/wiki/Erd%C5%91s_number&quot;&gt;Erdos Number&lt;/a&gt; looks like, which is defined as the collaborative distance between yourself and the famous mathematician Paul Erdős. It turns out I have an Erdos number of 4, via the following chain of co-authorship.&lt;/p&gt;

&lt;p&gt;/me -- Anja Feldmann -- Edward G. Coffman, Jr -- Joel H. Spencer -- Paul Erdős.&lt;/p&gt;

&lt;p&gt;You can calculate your Erdos number &lt;a href=&quot;http://www.ams.org/mathscinet/collaborationDistance.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;</content>

      
      
      
      
      

      

      
        <category term="Research" />
      

      

      
        <summary type="html">While waiting for my combinatorial explosion of a factorial design experiment to complete, I decided to find out what my Erdos Number looks like, which is defined as the collaborative distance between yourself and the famous mathematician Paul Erdős. It turns out I have an Erdos number of 4, via the following chain of co-authorship. /me -- Anja Feldmann -- Edward G. Coffman, Jr -- Joel H. Spencer -- Paul Erdős. You can calculate your Erdos number here.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Comic: Academic Future Work Slides</title>
      
      
      <link href="https://lalith.in/2013/11/30/comic-academic-future-work-slides/" rel="alternate" type="text/html" title="Comic: Academic Future Work Slides" />
      
      <published>2013-11-30T00:00:00+00:00</published>
      <updated>2013-11-30T00:00:00+00:00</updated>
      <id>https://lalith.in/2013/11/30/comic-academic-future-work-slides</id>
      <content type="html" xml:base="https://lalith.in/2013/11/30/comic-academic-future-work-slides/">&lt;p&gt;I&apos;ve been going through a lot of academic talks lately, and almost every slide on future work feels like this to me.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://lalithsuresh.files.wordpress.com/2013/11/academic_future_work2.png&quot;&gt;&lt;img class=&quot;aligncenter size-full wp-image-1837&quot; alt=&quot;academic_future_work&quot; src=&quot;http://lalithsuresh.files.wordpress.com/2013/11/academic_future_work2.png&quot; width=&quot;580&quot; height=&quot;604&quot; /&gt;&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      
        <category term="Comic" />
      

      

      
        <summary type="html">I&apos;ve been going through a lot of academic talks lately and almost every slide on future work feels like this to me.</summary>
      

      
      
    </entry>
  
  
  
    <entry>
      
      <title type="html">Optimize your life for learning</title>
      
      
      <link href="https://lalith.in/2013/11/24/optimize-your-life-for-learning/" rel="alternate" type="text/html" title="Optimize your life for learning" />
      
      <published>2013-11-24T00:00:00+00:00</published>
      <updated>2013-11-24T00:00:00+00:00</updated>
      <id>https://lalith.in/2013/11/24/optimize-your-life-for-learning</id>
      <content type="html" xml:base="https://lalith.in/2013/11/24/optimize-your-life-for-learning/">&lt;p&gt;Just stumbled upon a gem of an email [1] from Professor Alexander Coward at Berkeley, explaining why he isn&apos;t going to cancel a class despite a strike.&lt;/p&gt;

&lt;p&gt;The last couple of paragraphs ought to resonate well with anyone who&apos;s obsessed with learning. To quote the email:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In order for you to navigate the increasing complexity of the 21st century you need a world-class education, and thankfully you have an opportunity to get one. I don’t just mean the education you get in class, but I mean the education you get in everything you do, every book you read, every conversation you have, every thought you think.&lt;/p&gt;

&lt;p&gt;You need to optimize your life for learning.&lt;/p&gt;

&lt;p&gt;You need to live and breathe your education.&lt;/p&gt;

&lt;p&gt;You need to be *obsessed* with your education.&lt;/p&gt;

&lt;p&gt;Do not fall into the trap of thinking that because you are surrounded by so many dazzlingly smart fellow students that means you’re no good. Nothing could be further from the truth.&lt;/p&gt;

&lt;p&gt;And do not fall into the trap of thinking that you focusing on your education is a selfish thing. It’s not a selfish thing. It’s the most noble thing you could do.&lt;/p&gt;

&lt;p&gt;Society is investing in you so that you can help solve the many challenges we are going to face in the coming decades, from profound technological challenges to helping people with the age old search for human happiness and meaning.&lt;/p&gt;

&lt;p&gt;That is why I am not canceling class tomorrow. Your education is really really important, not just to you, but in a far broader and wider reaching way than I think any of you have yet to fully appreciate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;[1]: &lt;a href=&quot;http://alumni.berkeley.edu/california-magazine/just-in/2013-11-21/cal-lecturers-email-students-goes-viral-why-i-am-not&quot;&gt;http://alumni.berkeley.edu/california-magazine/just-in/2013-11-21/cal-lecturers-email-students-goes-viral-why-i-am-not&lt;/a&gt;&lt;/p&gt;</content>

      
      
      
      
      

      

      
        <category term="Education" />
      

      

      
        <summary type="html">Just stumbled upon a gem of an email [1] from Professor Alexander Coward at Berkeley, explaining why he isn&apos;t going to cancel a class despite a strike. The last couple of paragraphs ought to resonate well with anyone who&apos;s obsessed with learning. To quote the email: In order for you to navigate the increasing complexity of the 21st century you need a world-class education, and thankfully you have an opportunity to get one. I don’t just mean the education you get in class, but I mean the education you get in everything you do, every book you read, every conversation you have, every thought you think. You need to optimize your life for learning. You need to live and breathe your education. You need to be *obsessed* with your education. Do not fall into the trap of thinking that because you are surrounded by so many dazzlingly smart fellow students that means you’re no good. Nothing could be further from the truth. And do not fall into the trap of thinking that you focusing on your education is a selfish thing. It’s not a selfish thing. It’s the most noble thing you could do. Society is investing in you so that you can help solve the many challenges we are going to face in the coming decades, from profound technological challenges to helping people with the age old search for human happiness and meaning. That is why I am not canceling class tomorrow. Your education is really really important, not just to you, but in a far broader and wider reaching way than I think any of you have yet to fully appreciate. [1]: http://alumni.berkeley.edu/california-magazine/just-in/2013-11-21/cal-lecturers-email-students-goes-viral-why-i-am-not</summary>
      

      
      
    </entry>
  
  
</feed>
