<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Oh The Huge Manatee!</title>
  <link href="https://ohthehugemanatee.org/atom.xml" rel="self"/>
  <link href="https://ohthehugemanatee.org/"/>
  <updated>2022-12-20T09:01:00+01:00</updated>
  <id>https://ohthehugemanatee.org/</id>
  <author>
    <name>Campbell Vertesi (ohthehugemanatee)</name>
    <uri>https://ohthehugemanatee.org/</uri>
  </author>
  <generator>Hugo -- gohugo.io</generator>
  <entry>
    <title type="html"><![CDATA[Software Estimation Deniers: the Flat-Earthers of Project Management]]></title>
    <link href="https://ohthehugemanatee.org/blog/2022/12/20/software-estimation-deniers-the-flat-earthers-of-project-management/"/>
    <id>https://ohthehugemanatee.org/blog/2022/12/20/software-estimation-deniers-the-flat-earthers-of-project-management/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2022-12-20T09:01:00+01:00</published>
    <updated>2022-12-20T09:01:00+01:00</updated>
    <content type="html"><![CDATA[<p>Another day, another <a href="https://news.ycombinator.com/item?id=34061206">Hacker News post</a> complaining about estimation in software build projects. These articles and comment sections should come with a trigger warning for armchair speculation. <strong>Project management and time estimation are fields of serious research. Drawing conclusions from your personal experience, with no idea what the research says, is as silly as writing your experience that the world is obviously flat.</strong></p>
<p>Estimation is actually well understood in academic project management. There is (Nobel Prize-winning!) research into which problems actually are inherent to estimation, and how to produce specific and accurate estimates despite them. This academic field is almost 50 years old, and no one who complains about estimation in blog posts or comments is aware of it.</p>
<p>Stop navel gazing and actually go READ about the subject. I know it&rsquo;s hard for us to take in anything longer than a StackExchange post, but please try, BEFORE you write about your shitty experience with estimates and generalize to the entire problem space.</p>
<p>Here are some things that the research has said for <em>decades</em>:</p>
<ul>
<li>
<p>Humans are <em>all</em> bad at time estimation. Even the ones who consider themselves good at it, estimating tasks with which they are very familiar, &ldquo;only&rdquo; underestimate by 30% <em>at best</em>.</p>
</li>
<li>
<p>Humans are pretty good at estimating non-time attributes of work, even those with a direct correlation to time. Like effort, complexity, or &ldquo;cups of coffee.&rdquo;</p>
</li>
<li>
<p>If you estimate something with a time correlation (e.g. complexity) in a consistent way and measure the average throughput over time, you can very precisely and accurately estimate time to completion. This is the Law of Large Numbers, which is how casinos can accurately predict profits while dealing with much more randomness than exists in software projects.</p>
</li>
<li>
<p>Note that long term estimates produced with the Law of Large Numbers include unexpected complexity, personal issues, illness, Windows updates, etc. It&rsquo;s a statistical law.</p>
</li>
<li>
<p>The accuracy of average time estimates is proportional to the time left on the project, running opposite to the uncertainty of distant features. I.e. this method does not predict how much you can build in a week; at that scale you&rsquo;re better off with your relatively intimate knowledge of the feature and a gut check. Rather, it predicts how much you will build over 12 weeks, with extraordinary accuracy.</p>
</li>
<li>
<p>Estimates are better understood when presented with a confidence interval, e.g. &ldquo;The work as we understand it today will take 8 weeks, with 95% certainty.&rdquo;</p>
</li>
<li>
<p>Estimates are damaging to teams when they are treated as prescriptive. They are constructive when used as predictions. I.e. rather than &ldquo;you must get this work done in 8 weeks,&rdquo; it&rsquo;s &ldquo;we understand this as 8 weeks of work.&rdquo; Prescriptive estimates compound the problems with human estimation (see the first point).</p>
</li>
</ul>
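<p>For readers who like concrete numbers, here is a minimal sketch of the throughput-based forecasting described above. The sprint history, point values, and backlog size are all invented for illustration; only the consistency of the estimates matters.</p>

```python
import statistics

# Hypothetical history: complexity points completed per week.
# The point scale is arbitrary; what matters is estimating consistently.
weekly_throughput = [21, 34, 18, 27, 30, 24, 29, 25]

remaining_points = 180  # estimated complexity left in the backlog

mean = statistics.mean(weekly_throughput)    # average pace: 26.0 points/week
stdev = statistics.stdev(weekly_throughput)  # week-to-week variability

# Point forecast: weeks to completion at the average pace.
expected_weeks = remaining_points / mean

# Conservative forecast: assume throughput runs ~2 standard deviations
# below the mean (roughly the bottom of a 95% interval).
pessimistic_weeks = remaining_points / (mean - 2 * stdev)

print(f"Expected: {expected_weeks:.1f} weeks")
print(f"95%-ish upper bound: {pessimistic_weeks:.1f} weeks")
```

<p>Note that the spread between the two numbers shrinks as history accumulates and as the remaining work shrinks, which is exactly the long-horizon behavior described in the bullets above.</p>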
<p>What I HAVEN&rsquo;T seen in the research, but which is undoubtedly true, is that most teams violate these fundamentals. A certain percentage of those team members then complain on the Internet that estimates are useless.</p>
<p>Asking your team to estimate in time units IS useless. Summing those time estimates to create a long term plan is doubly useless. Cracking the whip on your team when that estimate proves incorrect is triply useless. And complaining about it on the Internet because you&rsquo;ve never read any of the grown up work on the subject&hellip; well that&rsquo;s hacker culture.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Whatsapp Is Too Expensive for Me]]></title>
    <link href="https://ohthehugemanatee.org/blog/2022/09/19/whatsapp-is-too-expensive-for-me/"/>
    <id>https://ohthehugemanatee.org/blog/2022/09/19/whatsapp-is-too-expensive-for-me/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2022-09-19T19:30:16+02:00</published>
    <updated>2022-09-19T19:30:16+02:00</updated>
    <content type="html"><![CDATA[<p>Plenty of my friends and colleagues use WhatsApp and enjoy it. But I think few people are really considering the information pricetag they&rsquo;re paying for a chat client. It comes up frequently enough that I thought I&rsquo;d catalog the information you give Meta/Facebook/Zuckerberg by installing WhatsApp.</p>
<p>I imagine collecting metadata over a period of a few years to glean these insights. This is all extracted from &ldquo;metadata&rdquo;, by the way.  You don&rsquo;t need access to the contents of someone&rsquo;s messages to know all about them.</p>
<ul>
<li>Contacts
<ul>
<li>everyone you have ever known, when you last contacted them and when you last updated their info.</li>
<li>what high school, universities you attended</li>
<li>what you studied</li>
<li>where you have worked and when</li>
<li>who are your family (contacts named &ldquo;mom&rdquo; or with same last name)</li>
<li>who are your friends</li>
<li>approximately where you live</li>
<li>your hobbies</li>
<li>your political leanings and affiliations</li>
<li>what doctors you have visited regularly enough to store</li>
<li>what medical conditions you&rsquo;ve probably had</li>
</ul>
</li>
<li>Storage
<ul>
<li>every place you&rsquo;ve taken a picture</li>
<li>who you were with</li>
<li>what you were doing</li>
<li>what you wear (brands etc)</li>
<li>your interests/hobbies</li>
<li>when you travel, and where to</li>
<li>do you have kids, how many and how old</li>
</ul>
</li>
<li>Location
<ul>
<li>home address</li>
<li>work address (and therefore likely job)</li>
<li>friends and where they live</li>
<li>who you&rsquo;re sleeping with</li>
<li>your path to work, what transit you use, where you change stations etc</li>
<li>social events</li>
<li>hobbies</li>
<li>doctors / medical history</li>
<li>how often you use the toilet and for how long</li>
<li>kids&rsquo; schools and paths there</li>
<li>where you shop and what brands</li>
</ul>
</li>
<li>Generally from having an app
<ul>
<li>your sleep/wake schedule</li>
<li>how often you use your phone</li>
<li>your phone make/model/year</li>
<li>any phone peripherals you own (eg earbuds)</li>
<li>what other social applications you share from</li>
</ul>
</li>
</ul>
<p>Of course I haven&rsquo;t included any data that&rsquo;s publicly available to purchase and correlate with you to build a more complete profile, which Meta certainly does. Credit card information, brands you frequent, your subscriptions, which video services you use (netflix, hulu, etc), your SSN, income bracket, age, sex, and more. Nor have I included information we get by combining metadata, like psychological profile, alcohol/drug habits, risk of medical conditions like heart attack, etc.</p>
<p>Even talking about that knowledge in the abstract like this, the significance doesn&rsquo;t really land. Another way to look at it is the specific knowledge about your life that they gather from this information. Here are some examples from <a href="https://ssd.eff.org/en/module/why-metadata-matters">the EFF</a>:</p>
<blockquote>
<ul>
<li>They know you rang a phone sex line at 2:24 am and spoke for 18 minutes. But they don&rsquo;t know what you talked about.</li>
<li>They know you called the suicide prevention hotline from the Golden Gate Bridge. But the topic of the call remains a secret.</li>
<li>They know you got an email from an HIV testing service, then called your doctor, then visited an HIV support group website in the same hour. But they don&rsquo;t know what was in the email or what you talked about on the phone.</li>
<li>They know you received an email from a digital rights activist group with the subject line “Let’s Tell Congress: Stop SESTA/FOSTA” and then called your elected representative immediately after. But the content of those communications remains safe from government intrusion.</li>
<li>They know you called a gynecologist, spoke for a half hour, and then called the local abortion clinic’s number later that day.</li>
</ul>
</blockquote>
<p>This is not quite the same as knowing when you were masturbating and to what, that you&rsquo;re suicidally depressed, have HIV, that you&rsquo;re a digital rights activist, and are having an abortion&hellip; but it&rsquo;s just as bad.</p>
<p>When an app asks me for permissions, I try to mentally translate it into a request for specific information like this. So when I go to install WhatsApp, it says &ldquo;in order to use Whatsapp, you have to tell me everyone you have ever known, when you last talked to them, where you studied, your doctors&rsquo; names, your medical history, family and friends name, where you live and work, your route between them, when you&rsquo;re sleeping vs awake, your hobbies, etc&hellip;&rdquo;</p>
<p>There are plenty of apps where that tradeoff is worth it for me. But not a chat app. They&rsquo;re a dime a dozen! Especially not when there are alternatives like Signal available which ask for no information and offer the same features.</p>
<p>I think most people don&rsquo;t consider app installs this way. That&rsquo;s not an accident, it&rsquo;s a feature of how the system is built. People who are tricked - practically everyone - are not fools. They are victims. So when I see that you have WhatsApp installed, I don&rsquo;t judge you. I judge Meta.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Four Drupalists]]></title>
    <link href="https://ohthehugemanatee.org/blog/2022/09/08/four-drupalists/"/>
    <id>https://ohthehugemanatee.org/blog/2022/09/08/four-drupalists/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2022-09-08T21:07:31+02:00</published>
    <updated>2022-09-08T21:07:31+02:00</updated>
    <content type="html"><![CDATA[<p>Recently some of my work on prenotes came up, and reminded me that I&rsquo;ve wanted to post favorite moments here for some time. For those who don&rsquo;t know, a <em>prenote</em> is a session before the keynote of a conference or event. During that session, notables from across the community put on a humorous show of parody lyrics karaoke, terribly scripted dialogue, and general silliness. It&rsquo;s a great way to kick off an event, and really sets the tone if you&rsquo;re into flat-hierarchy and openness for newbies.</p>
<p>Without further ado, here is the first of my favorite prenote moments, performed for Drupalcon Amsterdam in 2019.</p>
<h2 id="the-four-drupalists">The four Drupalists</h2>
<p>(the four of us pass a joint around)</p>
<p><strong>Rob:</strong> Ahh.. Very passable, this, very passable.</p>
<p><strong>Cam:</strong> Nothing like a good hit of Lemon Kush Blue Drop Haze, ay Adam?</p>
<p><strong>Adam:</strong> Dude&hellip;</p>
<p><strong>Jam:</strong> Who&rsquo;d a thought ten years ago we&rsquo;d all be sittin&rsquo; here smoking a bong on the prenote stage?</p>
<p><strong>Rob:</strong> Yep. In those days, we&rsquo;d a&rsquo; been glad to even give a session.</p>
<p><strong>Cam:</strong> A session OFF the mainstage.</p>
<p><strong>Jam:</strong> Without mics or speaker notes.</p>
<p><strong>Adam:</strong> OR a session.</p>
<p><strong>Rob:</strong> In a filthy broom closet.</p>
<p><strong>Jam:</strong> I never used to have a broom closet. I used to have to present in the alley behind the convention center.</p>
<p><strong>Cam:</strong> The best WE could manage was talking to strangers at a bus stop.</p>
<p><strong>Adam:</strong> But you know, we were happy in those days, though we weren’t Drupal Famous.</p>
<p><strong>Rob:</strong> Aye. BECAUSE we weren’t Drupal Famous. Dries used to say to me: &ldquo;Rob, Twitter followers don&rsquo;t buy you happiness.&rdquo;</p>
<p><strong>Jam:</strong> He was right. I was happier then and I had NO followers. We used to have this tiiiny old PHPBB bulletin board, with enormous fatal errors in the forums.</p>
<p><strong>Cam:</strong> Bulletin board! You were lucky to have a bulletin board! We used to share one ICQ account, all twenty-six of us, no wifi. Half the router was dead; we were all plugged into the other half for fear of PACKET LOSS!</p>
<p><strong>Adam:</strong> You were lucky to have a ROUTER! <em>We</em> used to have to use a token ring network!</p>
<p><strong>Rob:</strong> Ohhh we used to DREAM of a token ring network! Woulda’ been luxury to us. We used to share a dial-up connection on a copper phone line. We got woken up every morning by the sound of SCREEEEE-WOOOEOOOOOEOOO-BIBOMMMM-BIBOMM-BIBOMM. Internet connection! Hmpfh.</p>
<p><strong>Jam:</strong> Well when I say “Internet connection”, it was just a 14.4k modem connected to a tin can and some string, but it was an Internet connection to US.</p>
<p><strong>Cam:</strong> Our tin can and string were cut; we had to send packets by smoke signal!</p>
<p><strong>Adam:</strong> You were lucky to have smoke signals! There were a hundred and fifty of us sending our TCP packets on USB sticks by carrier pigeon.</p>
<p><strong>Rob:</strong> 1 Gigabyte sticks?</p>
<p><strong>Adam:</strong> Yep.</p>
<p><strong>Rob:</strong> You were lucky. For us, once a week old Geppeto would come to town pushing a hand cart with UDP packets. SYN and ACK, hmpf! We used to join stand-up at 6 in the morning, clean the backlog, drink a cup of cold coffee, go down to code in the basement for fourteen hours a day week-in-week-out. When we finished a project, the PM would schedule a 2 week retrospective.</p>
<p><strong>Cam:</strong> Luxury. We used to join standup at 3 o’clock in the morning, rewrite the whole ticketing system from scratch, eat a handful of coffee grinds, write our modules without an IDE for 2 cents a month, clock out, and the PM would beat us over the head with a laptop, if we were LUCKY!</p>
<p><strong>Adam:</strong> Well we had it tough. We used to have to wake up the carrier pigeons at twelve o’clock at night, and format the USB sticks by hand. We drank a cup of sulfuric acid, worked twenty-four hours a day writing modules on a punch card machine for four cents every six years, and when we got home, our PM would strangle us with a mouse cable.</p>
<p><strong>Jam:</strong> Right. I had to join morning standup at ten o’clock at night, half an hour before I went to bed, drink a cup of molten lava, work twenty-nine hours a day making punch cards with my teeth, paying the CEO for permission to come to work, and when we got home, our PM would kill us and dance about on our graves singing the Drupal song.</p>
<p><strong>Rob:</strong> But you try and tell the young Drupalists today that&hellip; and they won&rsquo;t believe ya'.</p>
<p><strong>ALL:</strong> Nope, nope&hellip;</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[The purpose of Music Notation and Theory]]></title>
    <link href="https://ohthehugemanatee.org/blog/2022/04/25/the-purpose-of-music-notation-and-theory/"/>
    <id>https://ohthehugemanatee.org/blog/2022/04/25/the-purpose-of-music-notation-and-theory/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2022-04-25T11:00:15+02:00</published>
    <updated>2022-04-25T11:00:15+02:00</updated>
    <content type="html"><![CDATA[<p>I talk a fair bit about the connections between music and engineering, and a <a href="https://news.ycombinator.com/item?id=31145804#31146334">discussion on Hacker News</a> came up about one of my favorite points. I&rsquo;ve adapted my comments here.</p>
<p>Classical musicians are overwhelmingly taught that their art form is about reproducing the notes, dynamics, and composer&rsquo;s intention as accurately as possible. This is <a href="">mistaking the finger for the moon</a>. Music, of any genre, is never about reproducing the notes and dynamics accurately. MIDI does that, and the day we can create historically informed MIDI we will have it perfected.</p>
<p><em>Music is the spontaneous communication and individuality that occurs <strong>while</strong> reproducing those notes and dynamics and intents.</em> Genres differ in which dimensions of freedom the musician can use, but not in that fundamental objective. In classical music it&rsquo;s what makes it worthwhile to hear different interpreters of the same piece. It&rsquo;s why people join the <a href="https://www.ticketstosee.com/tickets/bayreuth-festival-tickets/">10 year waiting list to see Wagner in Bayreuth</a>, when arguably the greatest rendition of The Ring in 100 years (<a href="https://www.bbc.com/culture/article/20140430-the-10-best-classical-recordings">Solti, Vienna</a>, fight me) is available relatively cheaply everywhere.</p>
<p>If that&rsquo;s the case&hellip;</p>
<ol>
<li>
<p>Encourage your classical friends to break out of the mindset of &ldquo;cooking the recipe the master chef made exactly.&rdquo; Audiences complain that classical music is sterile and doesn&rsquo;t speak to them, precisely because this mindset is so common among all but the most elite performers. Also, notice that NO elite performers approach their musicianship that way. As Ansel Adams said, &ldquo;craft facility liberates expression.&rdquo; The point of learning all this technique and theory is to give you the tools to express <em>yourself</em> in music. Mozart&rsquo;s notes are only the vehicle.</p>
</li>
<li>
<p>Satisfying the written note, the theory, and the historical practice, is not making music. They&rsquo;re simply there as tools of communication between musicians, to describe recurring patterns we hear. Otherwise we&rsquo;d be forever saying things like &ldquo;Beethoven 3, it&rsquo;s the one that starts with that thing where it sounds like it&rsquo;s going to end but really doesn&rsquo;t.&rdquo; Calling it a &ldquo;deceptive cadence&rdquo; or even a &ldquo;surprise chromatic sixth&rdquo; is just way easier and lets us operate on a higher level of abstraction. It&rsquo;s the same way &ldquo;dependency injector class&rdquo; tells you about a given chunk of code, and lets you reason about its place in the larger structure. Further, a great engineer isn&rsquo;t defined by their ability to reproduce a textbook dependency injection class, but rather by their ability to adapt the concept of dependency injection in the right places and times.</p>
</li>
</ol>
<p>I think it&rsquo;s a common mistake, especially in classical music, to think about musical traditions as rooted in clear rules and numerical relationships. It&rsquo;s analogous to thinking that language is rooted in clear rules and structures.</p>
<p>The truth is just the reverse: the rules, numbers, and structures are rooted in music. <em>They exist to describe an organic, emergent cultural mechanism that is continuously changing.</em> Ask yourself: which came first, music or theory? Language or grammar? There are plenty of musical styles with no formalized structure or numeric relationships, just as there are plenty of languages with no formalized structure or even spelling. <strong>Grammar and music theory/notation are fingers pointing at the moon, and we are looking at the finger.</strong></p>
<p>When we try to engineer systems to help people point at the moon, we <em>should</em> be focused on the finger. Musical structure and numeric relationships are the way people communicate about music, and the tools should speak their language. Unfortunately they then have to grapple with the painful inconsistencies in that language and those structures. Computer music comprehension has problems analogous to computer language comprehension - emergent complexity so high it took neural nets to finally achieve real utility.</p>
<p>PS - it&rsquo;s true that Western music has, near the bottom, some relationships that were mathematically derived by none other than Pythagoras, among others. Relationships of fourths, fifths, octaves, and equal temperament all had numeric justification&hellip; But that justification was still only there to describe the practice which had already become common, an explanation of &ldquo;why this sounds harmonious&rdquo; as well as a prescription for &ldquo;how to sound harmonious&rdquo;. That some rules are internalized by some generations and broken by others illustrates exactly the problem with the system.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Time Management Advice for Effective Leaders]]></title>
    <link href="https://ohthehugemanatee.org/blog/2022/03/17/time-management-advice-for-effective-leaders/"/>
    <id>https://ohthehugemanatee.org/blog/2022/03/17/time-management-advice-for-effective-leaders/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2022-03-17T16:26:28+01:00</published>
    <updated>2022-03-17T16:26:28+01:00</updated>
    <content type="html"><![CDATA[<p>I coach a number of leaders of various stripes around tech, and one issue that comes up for <em>everyone</em> is time management. I think of it more as &ldquo;obligation control.&rdquo;</p>
<p>As any kind of leader, you are the first resource people tend to call, not just for obstacles, but for FYIs and general information. Your job may involve knowing what&rsquo;s happening across a broad set of projects. Just think of the math: if you have a small team of 5 people to lead, you are managing 200 hours of work per week. Everyone on your team has things where they want - or should want - a bit of your brain space. Add to that the number of communication channels (25), timezones, and schedules, and you are already overloaded.</p>
<p>So how do effective leaders manage their time? How do you control the virtually unlimited number of obligations that seek your attention? I had a chat with a colleague today on the subject and thought I&rsquo;d share.</p>
<p>Personally, I&rsquo;m very strict about my schedule and up front about that with everyone. I have very hard start and stop times every day, and I try to be dogmatic about my &ldquo;no meeting day&rdquo; (more on that in another post). This has two important effects:</p>
<ol>
<li>
<p>It forces me to &ldquo;switch off&rdquo; my work brain at a certain time every day, driving a baseline level of work/life balance.</p>
</li>
<li>
<p>It imposes scarcity on my schedule, forcing me to be very picky about which meetings I attend.</p>
</li>
</ol>
<p>The basic skill you have to nurture is to say &ldquo;no&rdquo; more than you think you can.</p>
<p>It&rsquo;s hard. Some people find it helps to have a scapegoat&hellip; so think in advance of a couple of good excuses. &ldquo;I need to circle back with my team about that,&rdquo; and &ldquo;My child has an appointment at that time,&rdquo; are favorites from colleagues I&rsquo;ve mentored. Personally I just say &ldquo;I&rsquo;m not available at that time.&rdquo; No one questions it.</p>
<p>It&rsquo;s important to follow the &ldquo;no&rdquo; with a request to proceed without you (if that&rsquo;s possible), or with a suggested future time. Very very very few things we deal with have life or death consequences within a 1 week time-frame. As an aside: if it <em>was</em> possible to proceed without you, that should trigger you to reconsider whether you really need to be at that meeting cadence at all. A good leader tries to grow their people so that they are not personally needed for anything.</p>
<p>It sounds mean. It even <em>feels</em> a little mean. But the habit is essential: with every meeting, first consider whether your attendance is really critical, or whether you can get the information another way. Here are some alternatives to consider:</p>
<ul>
<li>Meeting recordings</li>
<li>Written meeting notes</li>
<li>Have a representative attend and report back to you</li>
<li>Simply trust that the team will continue operating as they have in the past.</li>
</ul>
<p>You can make use of all of these tactics at different times. A word of caution, however: you will miss some of the context, particularly the unspoken conditions on the team, this way. It&rsquo;s important that you book regular sessions with the people actually doing the work to compensate for this loss. If you build healthy relationships with the people where &ldquo;the rubber meets the road&rdquo;, they will tell you when there is unspoken dysfunction or other trouble on the team.</p>
<p>One final trick:  the schedule I tell people is a little shorter than my ACTUAL availability. I say I&rsquo;m only available till 7.30pm local time, but the truth is it&rsquo;s fine if I work till 8. Most nights that extra half hour is for closing out the day&rsquo;s business, but I can make someone feel special and extra-important by &ldquo;staying on late&rdquo; for them.</p>
<p>You might note this all leaves you with very little agency to impact your own schedule. It&rsquo;s quite reactive! I can offer you two tricks for proactively deciding where to have impact and for making efficiency gains:</p>
<ol>
<li>
<p>I keep a <a href="https://en.wikipedia.org/wiki/Responsibility_assignment_matrix">RACI chart</a> of all the different areas of work, and I only allow myself ONE item in the Responsible column. Everything else I monitor from the outside, often asynchronously. This chart has no validity for the rest of my organization; I&rsquo;m ultimately responsible for all of the threads. It still helps as a mental tool though, to make me consciously decide where I will lean in to have impact&hellip; and to limit those places. Without it, it is too easy for me to see potential impact everywhere and dilute my time too much.</p>
</li>
<li>
<p>I do everything I can to make sure the work is organized and information is structured so it&rsquo;s easy to context switch. That&rsquo;s the best source for time savings I have. I use a project management tool, so I can view the information from different perspectives:</p>
</li>
</ol>
<ul>
<li>I&rsquo;m always taking notes in a relevant ticket.</li>
<li>I can &ldquo;zoom out&rdquo; to overview all of the workstreams</li>
<li>I can track progress across all the workstreams</li>
<li>I can quickly report on status of any given workstream and issue with a high degree of granularity.</li>
<li>If I&rsquo;m feeling flashy I can even make dashboards about it</li>
</ul>
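<p>The one-item-in-the-Responsible-column rule from the RACI chart above is simple enough to express as a toy sketch. The workstream names here are invented for illustration:</p>

```python
# Toy model of a personal RACI chart: each workstream maps to my role
# in it (R=Responsible, A=Accountable, C=Consulted, I=Informed).
raci = {
    "Platform migration": "R",  # the one place I lean in directly
    "Partner onboarding": "A",
    "Hiring pipeline":    "C",
    "Docs overhaul":      "I",
}

# Enforce the self-imposed constraint: exactly one "R".
responsible = [area for area, role in raci.items() if role == "R"]
assert len(responsible) == 1, f"Over-committed: {responsible}"

print(f"Leaning in on: {responsible[0]}")
```

<p>The value is not in the code, of course; it&rsquo;s in being forced to pick the single row that gets the &ldquo;R&rdquo;.</p>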
<p>A good leader makes conscious decisions about where to invest their time, builds relationships and collects information from up and down their whole organization, leverages their whole team and every available option to maximize their reach, and exercises conscious restraint to avoid over-committing. Hopefully these tips help you get a few steps closer.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Starting a New Role: Global Partner Strategy ISV CTO for Red Hat]]></title>
    <link href="https://ohthehugemanatee.org/blog/2022/02/16/starting-a-new-role-global-partner-strategy-isv-cto-for-red-hat/"/>
    <id>https://ohthehugemanatee.org/blog/2022/02/16/starting-a-new-role-global-partner-strategy-isv-cto-for-red-hat/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2022-02-16T21:17:35+01:00</published>
    <updated>2022-02-16T21:17:35+01:00</updated>
    <content type="html"><![CDATA[<p>This month I start a new challenge at Microsoft, as <strong>Global Partner Strategy ISV CTO</strong> for Red Hat. Translated: I am taking on technical leadership of the <a href="https://blogs.microsoft.com/blog/2015/11/04/microsoft-and-red-hat-partner-to-deliver-more-flexibility-and-choice/">Microsoft/Red Hat partnership</a>.</p>
<p><img src="/images/microsoft-redhat-puzzle.jpg" alt="Microsoft and Red Hat logos as puzzle pieces fitting together" title="The Microsoft-Red Hat partnership"></p>
<p>I had a wonderful and fruitful 4 years as a technical and team lead in the Commercial Software Engineering department, continuing my project-based work with some of Microsoft&rsquo;s biggest commercial customers. I got to build interesting projects with brilliant engineers, working in close collaboration with Volkswagen, Daimler, E.ON, the United Nations, and others. I got to learn what it&rsquo;s like working in an enormous company like Microsoft, and how global enterprises engage with each other.</p>
<p>Working with Red Hat is an exciting next step for me. Red Hat 5.2 was my first exposure to Linux back in 1998, and though I never did get it working successfully, I tried again a couple of years later and built my first homemade router and file server on 6.1. It was my go-to distro for years: when I ran a research lab in university, the systems and servers were all Red Hat 8. If someone had told me back then, that 20 years later <em>this</em> would be my job, I wouldn&rsquo;t have believed it.</p>
<p>I feel like my life has completed one of those little circles, and I am over the moon about where I landed.</p>
<p>Microsoft and Red Hat have a lot in common: they are both enterprise focused, with strong technical fundamentals in infrastructure and cloud native in particular. Of all the hyperscalers, Azure is the one that treats hybrid cloud as a first class citizen, and not just an onramp to public cloud. Red Hat is also hybrid focused, with a suite of powerful products including OpenShift, Ansible, and others. Though Microsoft has made great leaps in open source, we have a lot to learn by working closely with a company for whom open source principles are truly fundamental. And enterprise-focused though Red Hat is, no one meets enterprises where they live as well as Microsoft.</p>
<p>With all this in common, and so much to learn from each other, the partnership is already a warm and welcoming place to land. In my first weeks here it is clear that this is very fertile ground. Look for exciting things to come!</p>
<p>Thank you to all my colleagues from Microsoft Commercial Software Engineering. I had a lot of fun with you, and learned a lot. On to the next adventure!</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Building blocks for autonomous driving simulation environments]]></title>
    <link href="https://ohthehugemanatee.org/blog/2021/11/12/building-blocks-for-autonomous-driving-simulation-environments/"/>
    <id>https://ohthehugemanatee.org/blog/2021/11/12/building-blocks-for-autonomous-driving-simulation-environments/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2021-11-12T19:15:33+01:00</published>
    <updated>2021-11-12T19:15:33+01:00</updated>
<content type="html"><![CDATA[<p>My team&rsquo;s work with Volkswagen on autonomous driving simulation environments just turned into a new <a href="https://docs.microsoft.com/en-us/azure/architecture/industries/automotive/building-blocks-autonomous-driving-simulation-environments">Azure Well-Architected</a> page, which makes this a good time to add some behind-the-scenes commentary from real-world experience. The architecture recommended there came from a lot of discovery and experimentation work, and suits quite a broad case that I think is not (yet) well represented in cloud computing products.</p>
<h2 id="the-challenge">The challenge</h2>
<p>Testing and CI-style validation in autonomous driving development are a tricky challenge. See, an autonomous driving simulation environment is actually composed of several different components, all working together. There is always a system for time synchronization, typically coupled with some kind of data bus that all the other components can plug into. This is where the challenges start:</p>
<ul>
<li>Each component is a separate executable with separate requirements. In our case we had both Windows and Linux components, several of which needed exclusive GPU access.</li>
<li>Exactly which components you need for a test is totally dependent on what part of the autonomous driving system you&rsquo;re testing. For example, a lidar sensor dev team needs very high fidelity models of objects it might be detecting. Reflective and absorptive properties and shape require a lot of detail. So do weather conditions as far as visual occlusion and light scattering effects go. They <em>don&rsquo;t</em> need much of a road surface simulation, or pedestrians, or other drivers. The dev team working on the highway control component, however, needs a totally different set of components, practically mutually exclusive.</li>
<li>Even within one team the components may vary, as lots of this stuff is non-deterministic. It sometimes helps to validate against more than one simulation of the same environment.</li>
<li>Components have hard and complex interdependencies, despite separate execution environments. For example, Linux and Windows based components may work closely together to simulate other cars and their drivers&rsquo; behaviors. This impacts startup order, where some tools can&rsquo;t even start initializing until others - or until certain sets of others - are reporting ready. The whole simulation of course, can&rsquo;t start until everything is ready. These tools are designed to start and operate in a highly stateful manner.</li>
<li>Because we&rsquo;re dealing with human lives and liability here, each component needs to be independently validated and verifiable, starting from a known good state. Five years later, we need to be able to reproduce as closely as possible the starting point of the simulation. &ldquo;Formal Reproducibility&rdquo; would be amazing, but depends on component development too much to be within our control.</li>
<li>Finally, some components are designed for human UI interaction. For example the test runner, which has its own system of debug breakpoints and step debugging that can&rsquo;t be replicated in code (&ldquo;break when the car is &lt; 2m from the obstacle, or after 68000 virtual milliseconds&rdquo;). Developers rely on this kind of debugging to build.</li>
</ul>
<p>Anyone with DevOps engineering experience probably looks at that laundry list and thinks &ldquo;may as well wish for the f@#$king moon while you&rsquo;re at it.&rdquo; I urge a bit more empathy. Remember, these developers are working with non-deterministic code to handle non-deterministic environments. This is not trivial stuff.</p>
<p>A platform for automated driving simulation needs to address those problems. Put another way, the <em>system</em> needs total composability and the ability to model a directed acyclic graph for startup and execution of the simulation; the <em>components</em> need flexibility in OS and interactivity.</p>
<p>My team came up with this (admittedly complex) diagram to describe the problems in abstract.</p>
<p><a href="/images/building-blocks-autonomous-driving-simulation-environments.png"><img src="/images/building-blocks-autonomous-driving-simulation-environments.png" alt="Diagram of the abstract components of the simulation system" title="Simulation system abstracted components"></a></p>
<p>Go ahead and zoom in for a better view, it&rsquo;s a lot to take in. It&rsquo;s easiest to look at the big segments (&ldquo;layers&rdquo; in the parlance of the image): User input, Orchestration, Building Block Factory, Simulation Infrastructure, Storage. Most of these are self-explanatory. The building block factory is whatever system creates and validates components, the &ldquo;building blocks&rdquo; of any simulation run.</p>
<p>The workflow in the above system is straightforward to describe:</p>
<ol>
<li>The developer submits a human-readable document to the UI, which validates it and passes it to the orchestration layer.</li>
<li>The orchestration layer interprets the document and provisions the necessary resources - whatever they may be - in an appropriate networking, monitoring, and storage context. The resources themselves come from the Building Block Factory, so they&rsquo;re all validated and conform to a documented API (the &ldquo;building block contract&rdquo;). The provisioned and started components, together with their networking and monitoring context, form the Simulation Infrastructure Layer.</li>
<li>The simulation runs autonomously, as far as possible. At the end of the simulation, outputs and logs are in the Storage Layer, and the Simulation Infrastructure Layer is deprovisioned.</li>
</ol>
<p>Just three steps. But the number of boxes and lines in that diagram shows just how complicated it is on the inside.</p>
<p>The key to the whole system is clearly in the Orchestration Layer. There are great options for the Building Block Factory in most any CI/CD and versioning toolkit. The infrastructure layer is provided by the cloud provider. But this bit of interpreting the user input into a complex graph of components, starting them, monitoring for readiness, then starting execution&hellip; that&rsquo;s hard to find in a single tool.</p>
<p>Some parts of this problem space are well handled by a variety of tools. Provisioning and tearing down cloud resources, for example, could be done through any number of tools like Terraform, or even straight up ARM (Azure Resource Manager) templates. But then what monitors component startup state, managing that directed acyclic graph through to readiness? We evaluated several options for core technologies around which we could architect, including Puppet/Chef, Ansible, Terraform, and a collection of Azure services held together by Azure Functions and duct tape. The clear winner by a substantial margin was Kubernetes.</p>
<p>This was a suspicious conclusion for us, a team of infra specialists. We always have to be careful about our own technical biases, especially towards the new-and-cool tools in our kit (had Nomad existed at the time, it probably would have been a good contender, and made us feel a bit better about having at least two alternatives). But however you cut the problem, Kubernetes out of the box solved almost all of the problems in the Orchestrator Layer space, and provided a great starting point for the remaining boxes.</p>
<p>The boxes that Kubernetes does <strong>not</strong> cover OOTB are the flexibility to run VMs or containers interchangeably (with an abstraction layer to ensure identical APIs), and the monitoring of buildup and simulation start/end.</p>
<p>There are a number of projects to provide the former, including integrations into Azure, and our choice, <a href="https://kubevirt.io/">KubeVirt</a>. Importantly for us at the time: KubeVirt was an early-stage CNCF project and therefore still rough around the edges, but it was already a supported part of Azure Red Hat OpenShift, and the supported way to run Azure IoT Edge on Kubernetes.</p>
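<p>The payoff of the KubeVirt approach is that a VM-based component gets the same declarative lifecycle as its container siblings. A minimal sketch of a KubeVirt <code>VirtualMachine</code> wrapping one building block - the image name and sizing here are purely illustrative, not from our project:</p>

```yaml
# Sketch: a VM-based building block declared like any other Kubernetes object.
# Image name and memory request are hypothetical.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: test-runner
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio    # paravirtualized disk for reasonable I/O
      volumes:
        - name: rootdisk
          containerDisk:
            image: registry.example.com/sim/test-runner-vm:latest  # hypothetical
```

Once declared this way, the VM can be scheduled, probed, and torn down by the same orchestration machinery as the containers.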
<p>The latter problem is the core business logic, so it is the Right Focus for Custom Code. But even there Kubernetes gives us a good start, as it is already an event-driven architecture with deep connections to container lifecycle and readiness/liveness probes. What&rsquo;s more, the model of Kubernetes objects and controllers already implements the concept of a larger object abstracting a group of lower-level ones, in Deployments.</p>
<p>In fact, we found that our custom controller would have to implement a CRD quite similar to a Kubernetes Deployment. We could even extend the Deployment object to do it. The user ultimately could provide a simple YAML like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">name</span>: <span style="color:#ae81ff">my simulation run</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">start</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">component</span>: <span style="color:#ae81ff">test-runner</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">command</span>: <span style="color:#ae81ff">C:\start.exe</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">components</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">sync-server</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">image</span>: <span style="color:#ae81ff">sim/sync-server:5.4-dev</span>
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">test-runner</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">image</span>: <span style="color:#ae81ff">sim/test-runner:3.2.2</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">interactive</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">environment-sim</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">image</span>: <span style="color:#ae81ff">sim/vtd:2.1</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">requires</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">gpu</span>: <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">depends</span>:
</span></span><span style="display:flex;"><span>            - <span style="color:#ae81ff">test-runner</span>
</span></span><span style="display:flex;"><span>            - <span style="color:#ae81ff">sync-server</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">results</span>:
</span></span><span style="display:flex;"><span>       <span style="color:#f92672">storageClass</span>: <span style="color:#ae81ff">AzureBlob</span>
</span></span></code></pre></div><p>Given some metadata on each component, this is probably sufficient information for a controller to create the required deployments and run the simulation.</p>
<ul>
<li><code>test-runner</code> is a Windows VM image, as indicated by the component&rsquo;s own metadata. The <code>interactive</code> flag indicates that the controller should return a proxy to the RDP port on the VM.</li>
<li>The other two components are containers.</li>
<li>One container shows a resource request for a GPU core, which is handled by assigning the container to the right node pool.</li>
<li>That container also declares its dependencies. Probes are defined in the component metadata; the Kubernetes controller watches for both dependencies to pass their Readiness probes, and then starts <code>environment-sim</code>.</li>
<li>Once all three components pass their Readiness probes, <code>spec.start.command</code> is run inside <code>spec.start.component</code>. When the command terminates, the simulation is considered complete.</li>
<li>The <code>spec.results</code> key helps the Controller create a PVC for results. The actual PVC is later accessible through a well-known naming scheme, such as <code>[datetime]-[name]</code>.</li>
<li>Component definitions include default ports to expose, but by default nothing is exposed to the outside world.</li>
</ul>
<p>Of course the actual Deployments and/or Replicasets involved need a lot more information than this. But that information is all consistent enough between runs that it can be templated with default values. The structure of this CRD is similar enough to other Kubernetes objects that we can allow the user freedom to add other keys from the container spec, for example, to override those defaults. The 99% case, however, profits from opinionated defaults.</p>
<h2 id="implications">Implications</h2>
<p>If you&rsquo;ve followed along this far, you&rsquo;ve noticed that this domain is not unique to automotive simulation. Lots of computational problems require a directed acyclic graph for building up interdependent components, with a defined execution and teardown afterwards. But execution automation tools like batch runners rarely offer the kind of composability that has become the norm in service-oriented infrastructure. The key insight this approach offers is that an event-driven architecture for managing infrastructure, like Kubernetes, has a lot to offer discrete computational tasks as well. This is particularly the case in very domain-knowledge-heavy areas like simulation, but probably also in data science, IoT, biochemistry, and others.</p>
<p>This diagram, architecture, and first steps with a commercial customer were only the beginning. I&rsquo;m no longer on that project, but I look forward to seeing how it develops.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[The Cluster in My Closet - Advice for running kubernetes at home]]></title>
    <link href="https://ohthehugemanatee.org/blog/2021/05/15/the-cluster-in-my-closet-advice-for-running-kubernetes-at-home/"/>
    <id>https://ohthehugemanatee.org/blog/2021/05/15/the-cluster-in-my-closet-advice-for-running-kubernetes-at-home/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2021-05-15T15:00:05+01:00</published>
    <updated>2021-05-15T15:00:05+01:00</updated>
    <content type="html"><![CDATA[<p>What do most people do with their old computers? I&rsquo;ve never been good at getting rid of mine. They were all repurposed into servers, running whatever key services for my household I could think of at the time. This year I decided to move out of the &ldquo;old sysadmin&rdquo; patterns of my roots, and try running my homelab in a more modern way. That&rsquo;s right: I set up a kubernetes cluster.</p>
<p>All my old devices now go to the great cluster in the sky, and live on serving us files and media, blocking ads, and running small automations. They&rsquo;re happy, now.</p>
<p>They may be happy, but I am not. I thought I was comfortable with Kubernetes, but boy was I wrong. Turns out, I was comfortable with Kubernetes&hellip; in a stable, homogeneous, cloud-based environment. Those cloud vendors really do cover a lot of complexity for you: using Azure Kubernetes Service to manage enterprise and microservices apps is <em>way easier</em> than running your own cluster for the kind of apps you use in the home.</p>
<p>My home cluster consists of four Raspberry Pi 4s, two old laptops, one mini desktop, and a Pine64 laptop. First pain point: that&rsquo;s a mix of CPU architectures and capabilities. I&rsquo;ve got it all covered: x64, arm7, and arm64&hellip; and many container images are only built for a subset of those. At least the laptops are x64, but they&rsquo;re different makes and models, with different quirks. One of them has an unreliable USB port, which wreaks intermittent havoc with its USB ethernet adapter. The other has a dead internal IDE chip, and has to boot off of an external drive duct-taped to the back of the screen. And of course the whole thing lives in my laundry room, where temperatures and vibration from the floor will vary with the drying cycle, and where dust is all-pervasive.</p>
<p>I&rsquo;m happy I&rsquo;ve stuck with it - this has made me <em>really</em> learn the ins and outs of detailed troubleshooting in kubernetes. I&rsquo;ve a much better understanding now of <em>why</em> different abstractions may exist, in a very practical way. Here are some of the fun features and workarounds I have in place. Each of these could be their own post, someday:</p>
<ul>
<li>Lots of home-oriented services run with SQLite as a backing database, which is a problem because SQLite&rsquo;s normal WAL mode of accessing database files has a hard dependency on block-level file access. That means network filesystems, and all of k8s&rsquo; built-in abstractions over them, are out.
<ul>
<li>I ran with iSCSI devices for a while, but the unreliability of home networking hardware was sufficient to produce data corruption every few months.</li>
<li>I wrote my own sidecar service to periodically freeze a shared PV and sync all the files to NFS. But if the freeze comes at a bad time for sqlite, that will also cause data corruption.</li>
<li>Longhorn doesn&rsquo;t have images for armv7 yet.</li>
<li>Finally I settled on using the local provisioner for volumes, and using <a href="https://github.com/benbjohnson/litestream">litestream</a> to back the DBs up to NFS. This seems to be working well&hellip; even though the Pi SD cards are a slow place to write even interim data. Not to mention, when you use the local provisioner, your pod is always scheduled back to the same node. That breaks the flexibility which is half of the value of the system in the first place!</li>
</ul>
</li>
<li>Very few container images offer all three of my CPU architectures, so every one of my manifests needs to use <code>nodeSelector</code>.</li>
<li>I&rsquo;m running a single master k3s node on one of the Raspberry Pi 4s. That&rsquo;s plenty of capacity for a normal load, but when you add monitoring through <a href="https://www.netdata.cloud/">Netdata</a> and prometheus, log monitoring through Loki and Grafana, and a handful of multi-master applications, the load from cluster DNS can cause problems. I ended up implementing <a href="https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/">Nodelocal DNS cache</a> to lighten the load.</li>
<li>The tiny fans that come on most Raspberry Pi cases are not reliable in the difficult environment of my laundry room. I&rsquo;ve had to replace all the case fans several times and eventually bought a rack for the Pis.</li>
<li>One of my most critical services is Plex, which does much better with hardware decoding capabilities. I&rsquo;m using <a href="https://github.com/kubernetes-sigs/node-feature-discovery">node-feature-discovery</a> to get node capabilities into labels, and <a href="https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/cmd/gpu_plugin/README.md">intel-gpu-plugin</a> to make the hardware available in the pod. Oh, but node feature discovery doesn&rsquo;t have a build for armv7, so that only works on some nodes.</li>
<li>I use images from <a href="https://linuxserver.io/">linuxserver.io</a>, which are great&hellip; but all based on <a href="https://github.com/just-containers/s6-overlay">s6-overlay</a>. This means they&rsquo;re not compatible with common privilege-restriction approaches on Kubernetes like security contexts. Fortunately(?) they support including arbitrary scripts on container startup, so I can hack my way around most problems.</li>
<li>Internal services often prefer to use HTTPS. Fair enough, but they are internal services, so letsencrypt can&rsquo;t validate/issue certificates for them. That means either an internal CA, or app configurations that allow self signed certs.</li>
<li>I mentioned I have one node with an unreliable ethernet connection. I tuned Kubernetes&rsquo; heartbeat and status check timings to minimize downtime when that node disappears, detecting it early and redistributing its pods. But it&rsquo;s a delicate balance: it&rsquo;s easy to get the math wrong, or just be overzealous, and your nodes start popping into <code>NotReady</code> state for no discernible reason. Of course that disrupts intra-service communication, which can cause cascading failures.</li>
<li>One of my cats knocked a couple of nodes down while I was away on vacation without physical access to the machines. I ended up building a private network connection to Azure and adding support for scaling with VMs there, to keep services running.</li>
</ul>
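<p>On the <code>nodeSelector</code> pain, at least the fix is cheap: the kubelet labels every node with the well-known <code>kubernetes.io/arch</code> label, so pinning a workload to the architectures its image actually supports is one stanza in the pod template:</p>

```yaml
# Keep amd64-only images off the arm nodes.
# kubernetes.io/arch is a standard, automatically-set node label.
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64
```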
<p>It goes on and on. All of these are problems that you basically never encounter when running on a cloud provider. Your money really does go to a valuable purpose.</p>
<p>On the other hand, my graveyard of machines has taught me more about the internals of enterprise orchestration than I ever would have learned by implementing in real-world enterprise contexts. So it&rsquo;s been worth it for me professionally, at least.</p>
<h2 id="whats-the-point">What&rsquo;s the point?</h2>
<p>The moral of the story is: running a home kubernetes cluster is a lot harder than you think; and if you have half a brain you already think it&rsquo;s pretty hard. It definitely does have benefits (beyond learning) though! My home services <em>are</em> actually self-healing, easy to diagnose, and very resilient to failure. Capacity planning is a non-issue. I have a repo of human-readable files which describe my application environments in their entirety. The trade-off has been worth it for me, but only when I look through a pretty broad lens. I could have spent less time on this had I just stuck with docker-compose, for example.</p>
<p>If you&rsquo;re considering using kubernetes for your home lab, here&rsquo;s my hard-won advice for you:</p>
<ul>
<li>Use uniform nodes. Identical CPUs and capabilities make your life so much better.</li>
<li>Use a lightweight kubernetes distro, like <a href="https://k3s.io/">k3s</a>. It&rsquo;s all API compatible anyway and the community will be full of users in similar situations to yours.</li>
<li>Don&rsquo;t worry about always doing things the Kubernetes Way. For many home applications, StatefulSets really do make for a more reliable result.</li>
<li>Set up centralized logging first. The <a href="https://github.com/grafana/helm-charts/tree/main/charts/loki-stack">Loki stack</a> helm chart is relatively turnkey.</li>
<li>Then set up centralized monitoring. <a href="https://netdata.cloud/">Netdata</a> beats the hell out of manually configuring everything in grafana.</li>
<li>Keep your manifests in a repo, for the love of all that is good in the world.</li>
<li>If any of your services are mission critical for your family members (Nextcloud and Plex in my house), leave them out of the cluster for as long as possible. Architect those applications for high availability. In my case, that means a multi-master MariaDB cluster, multiple web heads, and layered failover&hellip; with all the health checks I can think of.</li>
<li>Automate node setup with Rancher, Ansible, or similar. It&rsquo;s hard to make home hardware act like &ldquo;cattle instead of pets&rdquo;; use every tool at your disposal.</li>
<li>Keep a timestamped log of every change you make that isn&rsquo;t in a YAML file, and every problem you encounter. Often that&rsquo;s the fastest route to find your foot guns.</li>
<li>Have fun! Remember you&rsquo;re (hopefully) not doing this to be pragmatic. Don&rsquo;t listen to the haters, as long as <em>you&rsquo;re</em> getting what <em>you</em> want out of the experience.</li>
</ul>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[5 Project Manager Comments That Out You as an Amateur]]></title>
    <link href="https://ohthehugemanatee.org/blog/2021/03/22/5-project-manager-comments-that-out-you-as-an-amateur/"/>
    <id>https://ohthehugemanatee.org/blog/2021/03/22/5-project-manager-comments-that-out-you-as-an-amateur/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2021-03-22T10:13:04+01:00</published>
    <updated>2021-03-22T10:13:04+01:00</updated>
    <content type="html"><![CDATA[<p>A lot of the project managers you&rsquo;ll meet in technical consulting are new. That&rsquo;s OK! The number of developers - and correspondingly, their project managers - is growing exponentially from year to year. Even if we could assign everyone a mentor with at least a year of experience, there wouldn&rsquo;t be enough mentors to go around. A lot of people are learning as they go, and land in a role because someone thinks they have an aptitude.</p>
<p>That said, the downside to this explosive growth is that there aren&rsquo;t always strong norms that show newcomers how highly successful professional technical project management actually <em>looks</em>, in practice. People end up substituting what makes them feel good, or what suits their personality, for what&rsquo;s actually successful in an evidence-based way.</p>
<p>So here&rsquo;s a bit of insight in listicle format for you: 5 things that learner PMs say that make them look like amateurs.</p>
<h2 id="5-oh-wow-look-at-your-calendar-i-cant-work-that-way-planning-every-block-of-time">5) &ldquo;Oh wow, look at your calendar. I can&rsquo;t work that way, planning every block of time.&rdquo;</h2>
<p>Effective and precise management of a project extends necessarily into precise control of time. A good project manager has a lot of competing demands on their time. &ldquo;Ad hoc&rdquo; planning ultimately translates into &ldquo;whatever fire is most visible for me at the moment.&rdquo; This is not the same as the most important activities. These people will always find themselves with fires to put out, however, which can make them feel like real heroes!</p>
<h2 id="4-whatever-the-customer-says">4) &ldquo;Whatever the customer says&hellip;&rdquo;</h2>
<p>It often feels like the customer should always be right. It <em>is</em> their property you&rsquo;re building, after all! But a consultant&rsquo;s job is (usually) not just blindly implementing. The real value is in the consultant&rsquo;s unique insight: if the customer knew how best to solve the problem, they would have done it themselves! This means that good consulting necessarily involves a certain amount of conflict, or &ldquo;push back&rdquo; if you prefer. Blindly doing whatever the customer thinks is not the mark of a professional.</p>
<h2 id="3-we-have-to-get-this-done-on-time-i-dont-care-what-it-takes">3) &ldquo;We <em>have</em> to get this done on time, I don&rsquo;t care what it takes&rdquo;</h2>
<p>No, we don&rsquo;t. Development is a creative activity; tasks are very hard to estimate accurately. No matter what the customer&rsquo;s deadline is, even if you set the deadline yourselves, it does not have priority over the reality of how long it takes to build something well. A good PM identifies early when reality will diverge from estimated timelines, and <em>manages</em> the situation by adjusting project scope, team size, or deadlines.</p>
<h2 id="2-ive-been-working-60-hour-weeks">2) &ldquo;I&rsquo;ve been working 60 hour weeks&hellip;&rdquo;</h2>
<p>Especially in North America, it&rsquo;s easy for overwork to feel like a virtue. Leaving the cultural aspect aside, the project manager&rsquo;s <em>job</em> is to control scarce work resources over time. If they can&rsquo;t manage their own scarce work resources over time (only 40 hours each week!), they certainly can&rsquo;t manage for a team of other people!</p>
<h2 id="1-i-dont-bother-with-estimates--how-many-hours-will-this-take">1) &ldquo;I don&rsquo;t bother with estimates&hellip;&rdquo; / &ldquo;How many hours will this take?&rdquo;</h2>
<p>There is an entire (Nobel Prize winning) field of research on best approaches to estimation. To no one&rsquo;s surprise, estimation in time is one of the least accurate options. So inaccurate that it appears in some situations to be worse than no estimates at all. Both of these comments indicate a PM who has never run estimates using any of the more effective approaches suggested by research, such as three-point estimation, third-party estimation, or abstracted unit estimation. Any of those produce usefully accurate results. My own team&rsquo;s timelines land within 5% of reality over a 4-month project, and we can predict updates based on changes in scope or architecture weeks or months in advance. We&rsquo;re not doing anything magical or revolutionary, just some basic math with abstracted estimation units.</p>
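<p>To make &ldquo;basic math&rdquo; concrete: three-point (PERT) estimation, one standard technique from that research, blends optimistic (O), most likely (M), and pessimistic (P) guesses into a weighted expected value. The numbers below are illustrative:</p>

```bash
# Classic three-point (PERT) estimate: E = (O + 4*M + P) / 6
# Example estimates are illustrative - use days, hours, or abstract units.
O=4; M=6; P=14
echo $(( (O + 4*M + P) / 6 ))   # prints 7
```

Tracking the variance between these estimates and reality over a few sprints is what turns them into the actionable numbers described above.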
<p>I hope this helps some new PMs out there. <strong>TL;DR</strong>:</p>
<ul>
<li>Get control of your schedule. Use some kind of a system to ensure you&rsquo;re spending your scarce time on the most <em>important</em> tasks, not the most urgent.</li>
<li>Don&rsquo;t be afraid to push back on stupid customer requests! The customer is there for <em>your</em> advice. If something won&rsquo;t work, or is counter-productive, or endangers the timeline, <em>tell them about it</em>. Document that you had the conversation. If the customer still decides to proceed, you can proceed with a clear conscience and likely a better idea of why.</li>
<li>No one wins from rushed, poor quality development at the end of a project. When it starts to look like you&rsquo;ll have a timeline crunch, <em>talk to your customer</em>. Help them decide how to adjust the scope, team size, or deadline so the crunch disappears.</li>
<li>Limit your work time to &ldquo;just&rdquo; 40 hours per week. Do not take your laptop home with you, do not open your email after hours. This forces you to be effective with the above.</li>
<li>Learn about different estimation systems. Go read some articles about how different Agile frameworks do it, and consider the core principles that they all share. Start estimating in <em>some</em> way that&rsquo;s more accurate than asking people for hours, and track the variance from reality over time. Estimate + variance = accurate, actionable estimates.</li>
</ul>
<p>The core job of a project manager is to <em>talk with your customer</em> when it looks like you need to adjust scope, deadline, or team size&hellip; and to do that as early as possible. If you make some effort towards accurate estimated timelines, constantly compare your timeline to what&rsquo;s happening in reality, and help your customer adjust scope/team/deadline when the numbers seem off&hellip; Congratulations, you are a good project manager.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[How to Format Video for Fast Playback on the Web]]></title>
    <link href="https://ohthehugemanatee.org/blog/2020/11/11/how-to-format-video-for-fast-playback-on-the-web/"/>
    <id>https://ohthehugemanatee.org/blog/2020/11/11/how-to-format-video-for-fast-playback-on-the-web/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2020-11-11T20:47:39+02:00</published>
    <updated>2021-04-14T20:47:39+02:00</updated>
    <content type="html"><![CDATA[<p>It&rsquo;s a pain in the ass to get your video optimized for web. Not only do you have to work out codecs and their support across browsers, but even within codecs there are tricks to help it stream more easily. Here&rsquo;s the short version.</p>
<p>In terms of format, there are some really great options if you only care about compatibility with the most popular browsers (ahem - chrome). webm/VP8 is the way to go here, or webm/VP9 if you really only care about chrome. You&rsquo;ll get a very small filesize and high quality, with hardware playback on the latest devices. But if you want to include Safari and others, or if you care about non-flagship devices or ones more than a couple of years old, it has to be MP4/h.264. Your filesize won&rsquo;t be as small, but it will play with hardware acceleration on any device made since about 2011.</p>
<p>One common optimization is to use multiple <code>&lt;source&gt;</code> elements to default to webm/VP9 and fall back to other formats based on the browser&rsquo;s capabilities. Do that if you want, but for my use cases I prefer simplicity over bandwidth savings.</p>
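<p>For example, a minimal fallback setup with multiple <code>&lt;source&gt;</code> elements looks like this (filenames are placeholders; the <code>type</code> attribute is what lets the browser choose):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-html" data-lang="html">&lt;video autoplay muted loop playsinline&gt;
  &lt;!-- The browser plays the first source whose type it supports. --&gt;
  &lt;source src=&#34;background.webm&#34; type=&#34;video/webm&#34;&gt;
  &lt;source src=&#34;background.mp4&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
</code></pre></div>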
<p>I use <a href="https://ffmpeg.org/">ffmpeg</a>, because it does everything except my laundry (and I&rsquo;m pretty sure that&rsquo;s because I don&rsquo;t know the right flags for &ldquo;spin cycle&rdquo;). To convert one video into a format that is universally compatible:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="color:#75715e"># -c:v libx264         : the h264 encoder
# -profile:v main      : h264 features to make available. For &gt;5 year old phones, use &#34;baseline&#34;.
# -preset veryslow     : slowest option, best compression
# -s hd720             : rarely need higher resolution than this
# -b:v 1.5M            : sets the video bitrate
# -an                  : no audio track; this is for a background video
# -movflags +faststart : allows playing while the file is downloading</span>
ffmpeg -i input.mp4 \
  -c:v libx264 \
  -profile:v main \
  -preset veryslow \
  -s hd720 \
  -b:v 1.5M \
  -an \
  -movflags +faststart \
  output.mp4
</code></pre></div><p>The non-obvious options:</p>
<ul>
<li><code>-profile:v main</code> sets which features of h264 to use. The standard evolved over time, and depending on just how new your playback devices are, there are improvements available. Most of the time <code>main</code> is the right choice, but if you want to target older devices use <code>baseline</code>.</li>
<li><code>-b:v 1.5M</code> sets a 1.5 megabit/second video bitrate. You should experiment with this number to find the right tradeoff between quality and filesize. You can find out the bitrate of your original source with <code>ffprobe</code>, or from the output of this <code>ffmpeg</code> command.</li>
<li><code>-an</code> puts no audio in the finished video. I worked this out while making a background video for a site, so this was appropriate. Side benefit that it simplifies this blog post. :) If you want audio, you could replace this with <code>-codec:a aac</code>, a very compatible option.</li>
<li><code>-movflags +faststart</code> is critical if you want the video to play before being fully downloaded. Video files contain multiple streams of data (at least one stream of video and one of audio). Usually the metadata that indexes those streams (the &ldquo;moov atom&rdquo;) sits at the <em>end</em> of the file, meaning that a player has to download the whole file before it knows how to play the data. Faststart moves that metadata to the front of the file, so your player can start playing the video as soon as the first bytes arrive.</li>
</ul>
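<p>If you do want a webm/VP9 variant to serve alongside the mp4, a rough equivalent of the command above looks like this (a sketch - treat the <code>-crf</code> value as a starting point to tune for your content):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="color:#75715e"># VP9 works best in &#34;constrained quality&#34; mode:
# -crf sets the quality target, and -b:v 0 lets the bitrate float to meet it.</span>
ffmpeg -i input.mp4 \
  -c:v libvpx-vp9 \
  -crf 33 \
  -b:v 0 \
  -s hd720 \
  -an \
  output.webm
</code></pre></div>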
<p>That&rsquo;s all you really need to make good looking video files, optimized for web, which play before they&rsquo;re fully downloaded, and work on every device. Have fun!</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Chinese censorship, values decisions, and free software]]></title>
    <link href="https://ohthehugemanatee.org/blog/2019/10/08/chinese-censorship-and-free-software/"/>
    <id>https://ohthehugemanatee.org/blog/2019/10/08/chinese-censorship-and-free-software/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2019-10-08T19:10:15+00:00</published>
    <updated>2019-10-08T19:10:15+00:00</updated>
<content type="html"><![CDATA[<p>Chinese censors are in the news this weekend: <a href="https://variety.com/2019/gaming/news/blizzard-bans-blitzchung-hearthstone-hong-kong-china-statement-1203363050/">Blizzard banned a grandmaster Hearthstone player for supporting Hong Kong in an interview</a>, <a href="https://www.newsweek.com/south-park-band-china-why-banned-china-s23-e02-23x02-1463762">South Park was added to the &ldquo;banned&rdquo; list for their critique of Chinese censorship in Hollywood</a>, and <a href="https://www.cnn.com/2019/10/07/business/houston-rockets-nba-china-daryl-morey/index.html">an NBA general manager&rsquo;s job was on the line for a pro-HK tweet</a>. Increasingly, Western companies find themselves forced to weigh western liberal values against access to the enormous Chinese market. We can imagine the difficult, high-pressure decisions for executives in this situation.</p>
<p>As a consumer it feels like we don&rsquo;t face that kind of pressure, or that kind of decision. Those of us whose choices do not impact thousands of employees&rsquo; livelihoods, and millions of consumers&rsquo; information environments, seem to have little leverage. If <a href="https://www.theguardian.com/technology/2019/sep/25/revealed-how-tiktok-censors-videos-that-do-not-please-beijing">TikTok quietly hides any videos of Hong Kong unrest</a>, or <a href="https://sputniknews.com/asia/201801121060701327-china-delta-apology-taiwan-tibet/">Delta lists Taiwan as a part of China</a>, or <a href="https://mothership.sg/2018/01/marriott-zara-qantas-taiwan-tibet-sovereignty/">Marriott includes cities in Tibet, Taiwan, Hong Kong, and Macau as inside China</a>&hellip; we can&rsquo;t tell the difference. The whole point of censorship is that you remain blissfully unaware of what you&rsquo;re missing.</p>
<p>When we as consumers think about China&rsquo;s censorship power in our lives the important question is: <em>How can we tell when it&rsquo;s happening</em>? It&rsquo;s not unthinkable for Chrome to invisibly hide certain content, or Facebook, or your iPhone. In fact most of the information services you use likely do this already, in the name of curating content you will &ldquo;like.&rdquo; This isn&rsquo;t necessarily a problem in and of itself; it only gets problematic when the filters are <em>invisible</em>.</p>
<p>This is a part of the point of Free and Open Source software. Filters can&rsquo;t be applied in secret, and by definition you have the right to fork software to do things <em>your</em> way if you like. It&rsquo;s <em>your</em> right to have <em>your</em> device behave the way <em>you</em> want it to. And black boxes should inspire some suspicion that they may be doing things you don&rsquo;t like.</p>
<p>At this point it would be easy to descend into the typical open source rant: if only Facebook open sourced its filters! If only Instagram were open! I&rsquo;ll leave that for others. I want to point out something a little subtler:</p>
<p>The executive&rsquo;s values decision about western liberalism vs a larger addressable market is remarkably close to the developer&rsquo;s values decision about a license that respects user control vs the easier monetization and control of a black box. That decision in turn is related to the consumer&rsquo;s values decision, about software that respects your rights vs the convenience of a popular black box.</p>
<p>It&rsquo;s easy to criticize Blizzard&rsquo;s decision to bow to Chinese sensibilities. But how do <em>you</em> decide, on your own much smaller scale? Do <em>you</em> prioritize values over convenience? Or do you accept the censored experience of a black box as a user? Do you enjoy the control of that black box as a developer?</p>
<p>Hearthstone players are by definition running a black box OS. Users of iOS, Windows, and Android demonstrate comfort with invisible censors in other parts of their devices. Is their presence in video streams and gameplay all that different? Do they have any right to complain? Having given away control to black boxes, perhaps complaining is the only lever they have left over how their devices run.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[An Open Letter to my MEPs about Article 17 (formerly article 13)]]></title>
    <link href="https://ohthehugemanatee.org/blog/2019/03/21/an-open-letter-to-my-meps-about-article-17-formerly-article-13/"/>
    <id>https://ohthehugemanatee.org/blog/2019/03/21/an-open-letter-to-my-meps-about-article-17-formerly-article-13/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2019-03-21T23:44:43+00:00</published>
    <updated>2019-03-21T23:44:43+00:00</updated>
    <content type="html"><![CDATA[<p>The proposal for a directive on copyright in the digital single market is disastrous for the <strong>EU economy, culture, and democracy in the digital world</strong>. It is particularly bad for my country of Germany, as a leading light in Europe in all three areas. I am writing all of my MEPs listed in support of this impossibly bad proposal.</p>
<p>The German and European economies would be terribly damaged by this article, which <strong>effectively rules out small and medium sized competition in favor of the largest incumbents</strong>. I work for Microsoft on precisely the kind of machine-understanding tasks involved in the copyright filter requirement. I can tell you with authority: <strong>it is an impossible task which only the deepest pockets can approach</strong>. <strong>Article 17 makes Germany and Europe into hostile venues for Internet startups</strong>. The next generation of Youtubes, Soundclouds, and Netflixes are not possible under this Article - unless of course, it&rsquo;s an existing mega-corporation who decides to start it. <strong>This is a tremendous handicap in the fastest growing sector of the global economy.</strong></p>
<p>The disaster for culture exemplifies the impossibility of such a filter. My &ldquo;side business&rdquo; is an entertainment company, making opera music accessible for tens of thousands of Europeans every year. The entire classical music industry opposes digital filters because we all know the consequence: <strong>a machine or untrained human can&rsquo;t tell the difference between the 300 different versions of Bach&rsquo;s Wohltemperiertes Klavier</strong>. The piece is identified as probably copyrighted because copyrighted versions exist, and taken down as a precaution. <strong>Famously even the European anthem, <em>An die Freude</em>, suffers automated takedowns</strong> because there are copyrighted versions of the piece. <strong>Asking platforms to bulk-police content means that the more influential a piece of music is for our culture - that is to say, the more versions of it exist - the more prone it is to spurious blocking on copyright grounds.</strong></p>
<p>As much as the technical infeasibility and halting the spread of the most important pieces of our European culture bother me, <strong>the effect on democracy in the digital age is the worst part of this Article. User-generated content is foundational to online discussion</strong>. It is precisely this content which enriches online debate and engagement, which reaches younger generations and pulls them into a very participatory democratic environment. You&rsquo;ve no doubt heard that <strong>memes are culture</strong>, but <strong>they are also the medium of exchange in the biggest democratic commons humankind has ever created</strong>. Legislation which shuts down this medium of exchange, or which forces the commons into channels controlled by the largest (foreign) economic actors in history, is bad for the EU.</p>
<p><strong>Perhaps the world needs a digital copyright equivalent of Brexit</strong>, to scare everyone else away from the copyright lobby. Perhaps we all need a material example to see just how poorly the copyright lobby&rsquo;s 1960&rsquo;s-era ideology fits the 21st century economy.</p>
<p><strong>But I would prefer it not be my country, my continent that makes an example of itself.</strong></p>
<p>Please heed <a href="https://saveyourinternet.eu/statements/">the warnings from internet experts, the UN Special Rapporteur on Freedom of Expression, NGOs, programmers, and academics</a>. I urge you to reconsider your position on this digital Brexit.</p>
<p>Sincerely,</p>
<p>Campbell Vertesi</p>
<p>Berlin, Germany</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[BTRFS and free space - emergency response]]></title>
    <link href="https://ohthehugemanatee.org/blog/2019/02/11/btrfs-out-of-space-emergency-response/"/>
    <id>https://ohthehugemanatee.org/blog/2019/02/11/btrfs-out-of-space-emergency-response/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2019-02-11T13:58:39+00:00</published>
    <updated>2019-02-11T13:58:39+00:00</updated>
    <content type="html"><![CDATA[<p>I run BTRFS on my root filesystem (on Linux), mostly for the quick snapshot and restore functionality. Yesterday I ran into a common problem: <strong>my drive was suddenly full</strong>. I went from 4GB of free space on my system drive to 0 in an instant, causing all sorts of chaos on my system.</p>
<p>This problem happens to lots of people because BTRFS doesn&rsquo;t have a simple, linear notion of &ldquo;free space available&rdquo;. There are a few concepts that get in the way:</p>
<ul>
<li><strong>Compression</strong>: BTRFS supports compressing data as it writes. This obviously changes the amount of data that can be stored: 50MB of text may take only 5MB of &ldquo;room&rdquo; on the drive.</li>
<li><strong>Metadata</strong>: BTRFS stores your data separately from metadata. Both data and metadata occupy &ldquo;space&rdquo;.</li>
<li><strong>Chunk allocation</strong>: BTRFS allocates space for your data in large chunks, ahead of actually filling them - so &ldquo;allocated&rdquo; and &ldquo;used&rdquo; are different numbers.</li>
<li><strong>Multiple devices</strong>: BTRFS supports multiple devices working together, RAID-style. That means there&rsquo;s extra information to store for every file. For example, RAID-1 stores two copies of every file, so a 50MB file takes 100MB of space.</li>
<li><strong>Snapshots</strong>: BTRFS can store snapshots of your filesystem, which are really stored more like a diff from the current state. How much data is in the diff depends on your current state&hellip; so the snapshot itself doesn&rsquo;t have a consistent size.</li>
<li><strong>Nested volumes</strong>: BTRFS lets you divide the filesystem into &ldquo;subvolumes&rdquo; - each of which can (someday) have its own RAID configuration.</li>
</ul>
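<p>Snapshots and subvolumes are easy to lose track of; you can list the ones on a filesystem with the <code>btrfs</code> tool itself (requires root):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="color:#75715e"># Snapshots are just subvolumes, so this lists both.</span>
$ sudo btrfs subvolume list /
</code></pre></div>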
<p>It&rsquo;s easy to look at the drive and tell how many MiB of space have not been used yet. But it&rsquo;s very hard to say accurately how much of your data you can write in that space. For this reason the amount of &ldquo;free space&rdquo; reported on BTRFS volumes by system utilities like <code>df</code> can jump a lot - like my disappearing 4GiB. Worse, the free space reported by general tools is misleading. BTRFS can run out of space while <code>df</code> still thinks you have lots available.</p>
<p>Let&rsquo;s walk through how BTRFS stores data, to understand the problem a bit better. Then we can solve it with some of BTRFS&rsquo; own tools.</p>
<h2 id="how-much-free-space-do-i-have">How much free space do I have?</h2>
<p>Rather than using general tools like <code>df</code> to answer this question, it&rsquo;s better to get more detail using the <code>btrfs</code> CLI tool.</p>
<p>BTRFS starts out with a big pool of raw storage, and allocates as it goes. You can get a listing of all the devices in a filesystem like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ sudo btrfs <span style="color:#66d9ef">fi</span> show
</span></span><span style="display:flex;"><span>Label: <span style="color:#e6db74">&#39;OS&#39;</span>  uuid: c0d21ade-5570-41a3-b0cf-a5ce219e7a8e
</span></span><span style="display:flex;"><span>  Total devices <span style="color:#ae81ff">1</span> FS bytes used 31.74GiB
</span></span><span style="display:flex;"><span>  devid    <span style="color:#ae81ff">1</span> size 48.83GiB used 47.80GiB path /dev/nvme0n1p2
</span></span></code></pre></div><p>In this case, I only have one physical device involved. You can see that it gives me a total number of bytes allocated, compared to the total size. In another filesystem this might be the number reported to <code>df</code>. Not so with BTRFS! Let&rsquo;s dig deeper.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ btrfs <span style="color:#66d9ef">fi</span> df /
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Data, single: total<span style="color:#f92672">=</span>45.75GiB, used<span style="color:#f92672">=</span>30.56GiB
</span></span><span style="display:flex;"><span>System, single: total<span style="color:#f92672">=</span>32.00MiB, used<span style="color:#f92672">=</span>16.00KiB
</span></span><span style="display:flex;"><span>Metadata, single: total<span style="color:#f92672">=</span>2.02GiB, used<span style="color:#f92672">=</span>1.17GiB
</span></span><span style="display:flex;"><span>GlobalReserve, single: total<span style="color:#f92672">=</span>89.31MiB, used<span style="color:#f92672">=</span>0.00B
</span></span></code></pre></div><p>The &ldquo;total&rdquo; values here are the breakdown of what the first command counts as &ldquo;used&rdquo;. <code>btrfs fi df</code> shows us, of the allocated space, how much is actually storing data and how much is just empty allocation. In this case: on my 48GiB device, 47GiB is allocated. Of that allocation, 31GiB is actually storing data. Side note: if you&rsquo;re in a multi-drive situation this command will take RAID metadata into account.</p>
<p>Here&rsquo;s an easier view:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ sudo btrfs <span style="color:#66d9ef">fi</span> usage /
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Overall:
</span></span><span style="display:flex;"><span>    Device size:      48.83GiB
</span></span><span style="display:flex;"><span>    Device allocated:     47.80GiB
</span></span><span style="display:flex;"><span>    Device unallocated:      1.03GiB
</span></span><span style="display:flex;"><span>    Device missing:        0.00B
</span></span><span style="display:flex;"><span>    Used:       31.74GiB
</span></span><span style="display:flex;"><span>    Free <span style="color:#f92672">(</span>estimated<span style="color:#f92672">)</span>:     16.22GiB  <span style="color:#f92672">(</span>min: 16.22GiB<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>    Data ratio:           1.00
</span></span><span style="display:flex;"><span>    Metadata ratio:         1.00
</span></span><span style="display:flex;"><span>    Global reserve:     89.31MiB  <span style="color:#f92672">(</span>used: 0.00B<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Data,single: Size:45.75GiB, Used:30.56GiB
</span></span><span style="display:flex;"><span>   /dev/nvme0n1p2   45.75GiB
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Metadata,single: Size:2.02GiB, Used:1.18GiB
</span></span><span style="display:flex;"><span>   /dev/nvme0n1p2    2.02GiB
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>System,single: Size:32.00MiB, Used:16.00KiB
</span></span><span style="display:flex;"><span>   /dev/nvme0n1p2   32.00MiB
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>Unallocated:
</span></span><span style="display:flex;"><span>   /dev/nvme0n1p2    1.03GiB
</span></span></code></pre></div><p>This shows the breakdown of space allocated and used across all the devices in this filesystem. &ldquo;Overall&rdquo; is for the whole filesystem, and that &ldquo;Free (estimated)&rdquo; number is what gets reported to <code>df</code>.</p>
<p>This is a problem: <strong>most of my normal tools tell me I have 16GiB of free space. But if I write 1GiB more data, BTRFS will run out of space anyway.</strong> This issue is a pain in the ass and hard to diagnose. It&rsquo;s even harder to fix, since most of the solutions require having some extra space on the device.</p>
<h2 id="converting-unused-allocation-to-free-space">Converting unused allocation to free space</h2>
<p>So, why does BTRFS allocate so much space to store such a small amount of data? Here I am storing 31GiB of data in 47GiB of allocation - a used/total ratio of 0.66! This is very inefficient. It&rsquo;s an unfortunate consequence of being a copy-on-write filesystem - BTRFS starts every write in a freshly allocated chunk. But the chunk size is fixed, and files come in all sizes. So much of the time, a chunk is incompletely filled. That&rsquo;s the &ldquo;allocated but not used&rdquo; space we&rsquo;re complaining about.</p>
<p>Fortunately there&rsquo;s a way to address this problem: BTRFS has a tool to &ldquo;rebalance&rdquo; your filesystem. It was originally designed for balancing the data stored across multiple drives (hence the name). It is also useful in single drive configurations though, to rebalance how data is stored within the allocation.</p>
<p>By default, <code>balance</code> will rewrite <em>all</em> the data on the disk. This is probably unnecessary. Chunks will be unevenly filled, but we saw above that the average should be about 66% used. So we&rsquo;ll filter based on data (<code>-d</code>) usage, and only rebalance chunks that are less than 66% used. That leaves alone any partially filled chunks which are already fuller than average.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Run it in the background, cause it takes a long time.</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs balance start -dusage<span style="color:#f92672">=</span><span style="color:#ae81ff">66</span> / &amp;
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check status</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs balance status -v /       
</span></span><span style="display:flex;"><span>Balance on <span style="color:#e6db74">&#39;/&#39;</span> is running
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1</span> out of about <span style="color:#ae81ff">27</span> chunks balanced <span style="color:#f92672">(</span><span style="color:#ae81ff">5</span> considered<span style="color:#f92672">)</span>,  96% left
</span></span><span style="display:flex;"><span>Dumping filters: flags 0x1, state 0x1, 
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or be lazy, and have bash report status every 60 seconds.</span>
</span></span><span style="display:flex;"><span>$ <span style="color:#66d9ef">while</span> :; <span style="color:#66d9ef">do</span> sudo btrfs balance status -v / ; sleep 60; <span style="color:#66d9ef">done</span>
</span></span><span style="display:flex;"><span>Balance on <span style="color:#e6db74">&#39;/&#39;</span> is running
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">3</span> out of about <span style="color:#ae81ff">27</span> chunks balanced <span style="color:#f92672">(</span><span style="color:#ae81ff">12</span> considered<span style="color:#f92672">)</span>,  89% left
</span></span><span style="display:flex;"><span>Dumping filters: flags 0x1, state 0x1, force is off
</span></span><span style="display:flex;"><span>  DATA <span style="color:#f92672">(</span>flags 0x2<span style="color:#f92672">)</span>: balancing, usage<span style="color:#f92672">=</span><span style="color:#ae81ff">66</span>
</span></span><span style="display:flex;"><span>Balance on <span style="color:#e6db74">&#39;/&#39;</span> is running
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">4</span> out of about <span style="color:#ae81ff">27</span> chunks balanced <span style="color:#f92672">(</span><span style="color:#ae81ff">13</span> considered<span style="color:#f92672">)</span>,  85% left
</span></span><span style="display:flex;"><span>Dumping filters: flags 0x1, state 0x1, force is off
</span></span><span style="display:flex;"><span>  DATA <span style="color:#f92672">(</span>flags 0x2<span style="color:#f92672">)</span>: balancing, usage<span style="color:#f92672">=</span><span style="color:#ae81ff">66</span>
</span></span><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span><span style="color:#75715e"># When the balance operation finishes:</span>
</span></span><span style="display:flex;"><span>Done, had to relocate <span style="color:#ae81ff">19</span> out of <span style="color:#ae81ff">59</span> chunks
</span></span></code></pre></div><p>There&rsquo;s a nice big difference once it&rsquo;s finished:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ btrfs filesystem df /
</span></span><span style="display:flex;"><span>Data, single: total<span style="color:#f92672">=</span>32.53GiB, used<span style="color:#f92672">=</span>30.83GiB
</span></span><span style="display:flex;"><span>System, single: total<span style="color:#f92672">=</span>32.00MiB, used<span style="color:#f92672">=</span>16.00KiB
</span></span><span style="display:flex;"><span>Metadata, single: total<span style="color:#f92672">=</span>2.02GiB, used<span style="color:#f92672">=</span>1.17GiB
</span></span><span style="display:flex;"><span>GlobalReserve, single: total<span style="color:#f92672">=</span>84.67MiB, used<span style="color:#f92672">=</span>0.00B
</span></span></code></pre></div><p>That&rsquo;s 13GiB of allocation freed up for other use. My usage ratio is now 0.94. Huzzah! In some rare cases you may need to do this on the Metadata allocation (use <code>-musage</code> instead of <code>-dusage</code> above).</p>
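<p>The metadata variant uses the same filter idea, applied to metadata chunks:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="color:#75715e"># Rebalance only metadata chunks that are less than 66% full.</span>
$ sudo btrfs balance start -musage=66 /
</code></pre></div>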
<h2 id="if-youve-already-run-out-of-space">If you&rsquo;ve already run out of space</h2>
<p>If you have already run out of space, you can&rsquo;t run a <code>balance</code>! In that case you have to get sneaky. Here are your options:</p>
<h3 id="1-free-up-space">1) Free up space</h3>
<p>This is harder than it sounds. If you just delete data, it will probably leave those chunks partially filled and therefore allocated. What you really need is <em>unallocated</em> space. The easiest place to get this is by deleting snapshots. Start from the oldest one, since it will be the biggest.</p>
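<p>Deleting a snapshot is a subvolume operation. Assuming your snapshots live somewhere like <code>/.snapshots</code> (the path is an example - yours depends on how your snapshots are set up), it looks like this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="color:#75715e"># Snapshots are subvolumes; list them to find the oldest one.</span>
$ sudo btrfs subvolume list /
<span style="color:#75715e"># Delete it by path.</span>
$ sudo btrfs subvolume delete /.snapshots/oldest-snapshot
</code></pre></div>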
<p>Once you have a little bit of wiggle room, rebalance a small segment, like Metadata. Then proceed with rebalancing data as described above.</p>
<h3 id="2-add-some-space">2) Add some space</h3>
<p>Don&rsquo;t forget, a BTRFS volume can span multiple devices! I had to exercise this option recently. Grab a device - a flash drive will do, but choose the fastest thing you can - and add it to the BTRFS volume.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Add your extra drive (/dev/sda).</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs device add -f /dev/sda / 
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Now run the smallest balance operation you can.</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs balance start -dusage<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span> /
</span></span><span style="display:flex;"><span>Done, had to relocate <span style="color:#ae81ff">1</span> out of <span style="color:#ae81ff">59</span> chunks
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Remove the device, and run a proper balance.</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs device remove /dev/sda /
</span></span><span style="display:flex;"><span>$ sudo btrfs balance start -dusage<span style="color:#f92672">=</span><span style="color:#ae81ff">66</span> /
</span></span><span style="display:flex;"><span>Done, had to relocate <span style="color:#ae81ff">18</span> out of <span style="color:#ae81ff">59</span> chunks
</span></span></code></pre></div><p>Balance operations usually take a long time - more than an hour is not unusual. It will take even longer with slow flash media involved. For that reason, I use a very low balance filter (<code>-dusage=</code>) in this example. We only need to free up a teensy bit of space to run balance again without the flash disk in the mix.</p>
<p>And this last option is how I saved my computer last night. I hope this helps someone out of a similar predicament someday.</p>
<p><strong>Update to the update</strong>: Do not do this! A friendly commenter from the BTRFS community let me know that this is actually a <em>really bad idea</em>, since anything that interrupts your RAM will wreck your filesystem irreparably. Stick with the USB drive solution, above. Thank you <code>@Zygo</code> for the correction, and sorry to anyone who suffered for my learning.</p>
<p><strong>UPDATE</strong>: <del>Now that I&rsquo;ve had to do this a few times, it&rsquo;s <em>way</em> better to rebalance a full filesystem by adding a ramdisk to it. Not only is it faster than a flash device, it&rsquo;s also more reliable in most cases&hellip; and certainly for my kind of use case (a developer laptop) the important preconditions apply: lots of RAM, reliable power source. Here&rsquo;s the recipe:</del></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Create a ramdisk. Make sure /dev/ram0 isn&#39;t in use already before doing this!</span>
</span></span><span style="display:flex;"><span>$ sudo mknod -m <span style="color:#ae81ff">660</span> /dev/ram0 b <span style="color:#ae81ff">1</span> <span style="color:#ae81ff">0</span> 
</span></span><span style="display:flex;"><span>$ sudo chown root:disk /dev/ram0
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Mount the ramdisk with a concrete size. Otherwise it grows to whatever is needed.</span>
</span></span><span style="display:flex;"><span>$ sudo mkdir /mnt/ramdisk
</span></span><span style="display:flex;"><span>$ sudo mount -t ramfs -o size<span style="color:#f92672">=</span>4G,maxsize<span style="color:#f92672">=</span>4G /dev/ram0 /mnt/ramdisk
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create a file on the ramdisk to use as a loopback device.</span>
</span></span><span style="display:flex;"><span>$ sudo dd <span style="color:#66d9ef">if</span><span style="color:#f92672">=</span>/dev/zero of<span style="color:#f92672">=</span>/mnt/ramdisk/extend.img bs<span style="color:#f92672">=</span>4M count<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>
</span></span><span style="display:flex;"><span>$ sudo losetup -fP /mnt/ramdisk/extend.img
</span></span><span style="display:flex;"><span><span style="color:#75715e"># figure out which loopback device ID is yours</span>
</span></span><span style="display:flex;"><span>$ sudo losetup -a |grep extend.img
</span></span><span style="display:flex;"><span>/dev/loop10: <span style="color:#f92672">[</span>5243078<span style="color:#f92672">]</span>:8563965 <span style="color:#f92672">(</span>/mnt/ramdisk/extend.img<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Add the loopback device to the btrfs filesystem</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs device add /dev/loop10 /
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Decide on your balance ratio and balance as usual.</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs <span style="color:#66d9ef">fi</span> usage / |head -n <span style="color:#ae81ff">6</span>
</span></span><span style="display:flex;"><span>Overall:
</span></span><span style="display:flex;"><span>    Device size:		 400.91GiB
</span></span><span style="display:flex;"><span>    Device allocated:		 396.36GiB
</span></span><span style="display:flex;"><span>    Device unallocated:		   4.55GiB
</span></span><span style="display:flex;"><span>    Device missing:		     0.00B
</span></span><span style="display:flex;"><span>    Used:			 348.91GiB
</span></span><span style="display:flex;"><span>$ echo <span style="color:#e6db74">&#39;scale=2;348/396&#39;</span> |bc
</span></span><span style="display:flex;"><span>.87
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>$ sudo btrfs balance start -dusage<span style="color:#f92672">=</span><span style="color:#ae81ff">87</span> /
</span></span><span style="display:flex;"><span>Done, had to relocate <span style="color:#ae81ff">46</span> out of <span style="color:#ae81ff">400</span> chunks
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Remove the device and destroy it.</span>
</span></span><span style="display:flex;"><span>$ sudo btrfs device delete /dev/loop10 /
</span></span><span style="display:flex;"><span>$ sudo losetup -d /dev/loop10
</span></span><span style="display:flex;"><span>$ sudo umount /mnt/ramdisk
</span></span><span style="display:flex;"><span>$ sudo rm -rf /dev/ram0 
</span></span></code></pre></div>]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Serverless is the MS Access of the Future]]></title>
    <link href="https://ohthehugemanatee.org/blog/2019/01/24/serverless-is-the-ms-access-of-the-future/"/>
    <id>https://ohthehugemanatee.org/blog/2019/01/24/serverless-is-the-ms-access-of-the-future/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2019-01-24T15:12:30+02:00</published>
    <updated>2019-01-24T15:12:30+02:00</updated>
<content type="html"><![CDATA[<p>Controversial opinion time: the usefulness of what we presently call &ldquo;serverless&rdquo; will always be limited to simple use cases. It is a great choice for glue code or simple projects, but it will never be the best choice for even medium-complexity development problems. Containers and similar technologies will eat its lunch the same way RDBMSes ate MSAccess&rsquo;s.</p>
<p>The benefits of serverless are real. Don&rsquo;t worry about infrastructure, don&rsquo;t worry about scaling or availability. Just paste your code here and we&rsquo;ll take care of the rest! Minimal running costs, infinite scalability, simpler units of code to maintain!</p>
<p>These benefits are only possible because the cloud provider made a lot of decisions for you, in a way that is sensible for a majority use case. These are all decisions that you could hypothetically configure for yourself, given the time to do so. And it&rsquo;s exactly here that the core value of serverless is found. If your use case happens to fit within the frame set out by your serverless provider, there&rsquo;s real value on the table in terms of setup and maintenance time/cost (and SLA).</p>
<p>These decisions come with limitations, like most technical choices. What languages and versions can you use? What modules, what dependency management systems? What external binaries are available? What memory or disk is available, with what kind of I/O throughput? What&rsquo;s the local development environment, and how well does it replicate the live one? What are the scaling characteristics? And so on.</p>
<p>In a simple use case, most of these probably don&rsquo;t matter, and the ones that <em>do</em> matter are often exposed by the cloud provider. You can choose your VM size, for example, and preemptive scaling rules, and attach disks and external services, and supply secondary binaries yourself, and&hellip;</p>
<p>Very quickly you&rsquo;ve taken on a similar complexity, setup, and maintenance cost to what you were trying to avoid in the first place. This might be OK if it were a net zero transaction, but it&rsquo;s not. You&rsquo;ve taken on those costs, in exchange for&hellip; the rest of the limitations and a platform over which you have no control.</p>
<p>This feels a lot like so many MSAccess applications I worked with in the early 2000&rsquo;s. When the application was simple, it was great to have a visual data engine. But with a larger data model, the UI increasingly became an obstacle. You would increasingly use text to express your queries, duplicate data in more convenient tables to avoid tricky joins, and &hellip; In the end, the workarounds piled up. Managing a complex application on MSAccess is just as hard as managing it on any other RDBMS, but without the flexibility and power, and with more kludgy workarounds. Access is a great product for simple or straightforward use cases. But the moment your application grows too big, you start paying a heavy tax.</p>
<p>It&rsquo;s not controversial to suggest that the core value of Serverless is providing OOTB great hosting for simpler, highly encapsulated code, even code segments. The controversial part is the future prediction, where the metaphor really kicks in.</p>
<p>What happened to MSAccess? Other RDBMSes got much better, and friendlier. From easy-to-use ORMs for easy-to-use programming languages, to GUIs like MySQLAdmin, to cloud-based application builders backed by RDBMSes, the full-powered RDBMS ecosystem gradually took over the use case for MSAccess. The ease-of-use benefit which was so core to the Access value proposition gradually disappeared, and users ended up with the choice between a fully-flexible power system and a limited one, both relatively easy to use. Finally, Access 2010 became a GUI on top of SQL, integrated with SharePoint.</p>
<p>Serverless is headed in a similar direction. Other tools, largely from the container ecosystem, are already nibbling at its lunch. If you&rsquo;re at the point where you need to configure VM sizes and scaling rules on your Serverless provider, you&rsquo;re probably considering jumping to a managed Kubernetes provider instead. If you don&rsquo;t need to configure that stuff, you&rsquo;re looking at pure-container cloud solutions like Azure Container Instances. Same flexibility and cost structure, but with run-anywhere compatibility and a development environment which matches the CI testbed and prod. The only uncontested ground left is applications that are too small to bother containerizing.</p>
<p>Meanwhile the container ecosystem is taking off at rocket speed, making that &ldquo;flexibility tradeoff&rdquo; worse and worse for serverless. Where is the multi-cloud ecosystem for serverless? Where is the network security modeling market? Compare the difficulty of local dev and hosted CI environments for serverless code, to the out of the box auto-detected container builds available in every major CI platform. It&rsquo;s clear: serverless isn&rsquo;t actually all that much easier anymore. And it&rsquo;s only going one way. Soon enough, Serverless will become a nice frontend for a container runtime.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Kubernetes for stateful applications: Scaling macroservices]]></title>
    <link href="https://ohthehugemanatee.org/blog/2019/01/07/kubernetes-tricks-for-stateful-applications/"/>
    <id>https://ohthehugemanatee.org/blog/2019/01/07/kubernetes-tricks-for-stateful-applications/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2019-01-07T11:10:21+00:00</published>
    <updated>2019-01-07T11:10:21+00:00</updated>
<content type="html"><![CDATA[<p>I recently got to proctor an <a href="https://openhack.microsoft.com/">Openhack</a> event on modern containerization. It ended up being an excuse to dig deep on one of those corner cases that we all encounter, but no one likes to talk about.</p>
<p><a href="https://kubernetes.io/">Kubernetes</a> is one of the greatest <strong>orchestration</strong> and <strong>scaling</strong> tools ever built, designed for modern <strong>decoupled</strong>, <strong>stateless</strong> architectures. Kubernetes tutorials abound to show you these strong use cases. <strong>But in the real world where you don&rsquo;t get to build &ldquo;green field&rdquo; every time, there are a lot of applications that don&rsquo;t fit that model</strong>.</p>
<p>Lots of people out there are still writing tightly-coupled monoliths, in many cases for good reason. In some use cases microservices-style scalability isn&rsquo;t even useful - you actually <em>prefer</em> stateful applications with tight coupling. Take a game server, for example: you don&rsquo;t want to scale player capacity per game, you want to add more games (server instances).</p>
<p>So today I&rsquo;m writing about <strong>stateful, non-scalable applications in kubernetes.</strong></p>
<p>There are a few different approaches to coupling application components:</p>
<h2 id="multi-container-pods">Multi-container pods</h2>
<p>Level 0 is to simply specify multiple components (containers) in your deployment.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">Deployment</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">nginx-deployment</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">replicas</span>: <span style="color:#ae81ff">3</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">selector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">matchLabels</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">template</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">containers</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">php</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">image</span>: <span style="color:#ae81ff">php:fpm</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">containerPort</span>: <span style="color:#ae81ff">9000</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">image</span>: <span style="color:#ae81ff">nginx:latest</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">containerPort</span>: <span style="color:#ae81ff">80</span>
</span></span></code></pre></div><p>This specifies 3 copies of the same application, with the same two containers in each replica. This is a coupled application, but it&rsquo;s still stateless. Let&rsquo;s add a volume - that&rsquo;s where we get into trouble.</p>
<p>The problem: If you add a Volume the normal way (<a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/">persistentVolumeClaim</a>), each of your replicas will try to connect to the same volume. It&rsquo;ll act like a network shared drive. Maybe that&rsquo;s OK for your application, but not if it&rsquo;s our super-stateful example! And depending on your volume class, the volume may reject multiple connections outright (as <a href="https://docs.microsoft.com/en-us/azure/aks/azure-disks-dynamic-pv">Azure Disk</a> does, for example).</p>
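<p>For illustration, this is the shape of the &ldquo;normal way&rdquo; that causes the problem: one claim, referenced by name from every replica&rsquo;s pod template, so they all mount the same volume. (The names here are hypothetical.)</p>

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data          # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce          # single-attach volume classes (e.g. Azure Disk)
  resources:
    requests:
      storage: 5Gi
---
# In a Deployment's pod template, ALL replicas reference the SAME claim:
# volumes:
#   - name: data
#     persistentVolumeClaim:
#       claimName: shared-data
```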
<p>So how do we get around this limitation? I want a separate volume for each instance of the application.</p>
<p>Kubernetes supports a different object type for this use case, called a <a href="https://kubernetes.io/docs/tutorials/stateful-application/basic-stateful-set/">StatefulSet</a>. This is exactly what it sounds like: a set of objects that define a stateful application. It&rsquo;s a template for creating multiple copies of <em>all resources</em> defined therein.</p>
<p>A StatefulSet will create replicas similar to a deployment, but it will set up separate Volumes and VolumeClaims for each one. The replicas will be identical except for an ordinal index at the end of their names. The first one might be called <code>nginx-0</code>, the second: <code>nginx-1</code>, and so on.  The result is a set of tightly coupled components, which can be individually addressed, and scaled using normal Kubernetes tools.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">StatefulSet</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">replicas</span>: <span style="color:#ae81ff">3</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">selector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">matchLabels</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">serviceName</span>: <span style="color:#e6db74">&#34;nginx&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">template</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">containers</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">image</span>: <span style="color:#ae81ff">nginx:latest</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">containerPort</span>: <span style="color:#ae81ff">80</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">php</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">image</span>: <span style="color:#ae81ff">php:fpm</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">volumeMounts</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">mountPath</span>: <span style="color:#e6db74">&#34;/var/www/html&#34;</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">name</span>: <span style="color:#ae81ff">data</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">volumeClaimTemplates</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">name</span>: <span style="color:#ae81ff">data</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">storageClassName</span>: <span style="color:#ae81ff">default</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">accessModes</span>:
</span></span><span style="display:flex;"><span>          - <span style="color:#ae81ff">ReadWriteOnce</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">resources</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">requests</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#f92672">storage</span>: <span style="color:#ae81ff">5Gi</span>
</span></span><span style="display:flex;"><span>---
</span></span><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">Service</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>    - <span style="color:#f92672">port</span>: <span style="color:#ae81ff">80</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">name</span>: <span style="color:#ae81ff">http</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">clusterIP</span>: <span style="color:#ae81ff">None</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">selector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">app</span>: <span style="color:#ae81ff">nginx</span>
</span></span></code></pre></div><p>There are a few details to notice here.</p>
<p>Yes, we&rsquo;ve replaced <em>Deployment</em> with <em>StatefulSet</em>. You get a shiny gold star if you noticed that one.</p>
<p>The interesting part is the <em>VolumeClaimTemplates</em> section, below the containers definition. This keyword only exists inside a StatefulSet, and it&rsquo;s just what it sounds like: a template for creating Persistent Volume Claims.</p>
<p>If you apply this config, you&rsquo;ll see three PVs created, with three PVCs, attached to three Pods. You can apply HPA rules to scale these up and down just like you would with deployments.</p>
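<p>As a sketch of what to expect: the generated PVC names follow the pattern <em>claim-template</em>-<em>set-name</em>-<em>ordinal</em>, so for this config you should see something like:</p>

```shell
# Names produced by a 3-replica StatefulSet "nginx" with a
# volumeClaimTemplate named "data".
set_name=nginx
claim_template=data
for i in 0 1 2; do
  echo "pod=${set_name}-${i} pvc=${claim_template}-${set_name}-${i}"
done
```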
<p>There&rsquo;s also that weird Service at the bottom. A naked service with no clusterIP? What&rsquo;s the point? The point is as a helper for Kubernetes&rsquo; internal DNS. All of those nice StatefulSet pods will come under a neat subdomain, e.g. nginx-0.nginx, nginx-1.nginx, etc. Additionally you can connect to active members of the StatefulSet by using that nginx domain component. A DNS lookup on it returns A records with the IPs of all the active members.</p>
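<p>The fully-qualified names look like this (a sketch; the <code>default</code> namespace and the <code>cluster.local</code> suffix are assumptions about your cluster):</p>

```shell
# Per-pod DNS names behind a headless service "nginx".
svc=nginx
ns=default                   # assumption: StatefulSet lives in "default"
for i in 0 1 2; do
  echo "${svc}-${i}.${svc}.${ns}.svc.cluster.local"
done
```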
<p>&ldquo;But what about external access?&rdquo; I hear you cry. Yes, we&rsquo;ve built a great stateful application that can scale instances, but it&rsquo;s only internally addressable! Good luck hosting those games&hellip;</p>
<h2 id="external-access-and-metacontroller">External access and metacontroller</h2>
<p>Normally you would put a LoadBalancer service in front of your application. But a Kubernetes load balancer will grab all of these StatefulSet members - so you can&rsquo;t address them externally one-by-one. What you <em>really</em> want to do, is create an external IP address for each statefulset member.</p>
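<p>Concretely, what we want to end up with is one Service like this per StatefulSet member (written by hand here for pod <code>nginx-0</code>; the <code>pod-name</code> selector label is something we still have to arrange):</p>

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-0              # one Service per StatefulSet member
spec:
  type: LoadBalancer         # gets its own external IP
  selector:
    pod-name: nginx-0        # per-pod label, not present by default
  ports:
    - port: 80
      targetPort: 80
```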
<p>One solution is to use a reverse proxy like nginx or HAProxy, configured to differentiate based on hostnames. But this is a blog post about Kubernetes, so we&rsquo;re going to do this the Kubernetes way!</p>
<p>Kubernetes is very extensible. If Pods, Services, etc don&rsquo;t make sense for your application or domain, you can define custom object types and behaviors, through <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">custom resources and controllers</a>. That&rsquo;s pretty edge case, but as we&rsquo;ve seen, some kubernetes edge cases are mainstream cases in the real world.</p>
<p>In our super-stateful application, we don&rsquo;t need a custom resource type. But we do want to attach <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#custom-controllers">custom behaviors</a> to our StatefulSet: every time we start up a pod we should create a LoadBalancer for it. We should be nice and tear them down when the pods are scaled down, of course.</p>
<p>We&rsquo;ll use the <a href="https://github.com/GoogleCloudPlatform/metacontroller">Metacontroller</a> add-on to make our lives easier. Metacontroller makes it &ldquo;easy&rdquo; to add custom behaviors. Just write a short script, stick it into a ConfigMap or FaaS, and let Metacontroller work its magic!</p>
<p>The Metacontroller project comes with several well-documented examples, including one that&rsquo;s very close to our requirement: <a href="https://github.com/GoogleCloudPlatform/metacontroller/tree/master/examples/service-per-pod">service-per-pod</a>.</p>
<p>Step 1 is to install Metacontroller, of course:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Create &#39;metacontroller&#39; namespace, service account, and role/binding.</span>
</span></span><span style="display:flex;"><span>kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/metacontroller/master/manifests/metacontroller-rbac.yaml
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create CRDs for Metacontroller APIs, and the Metacontroller StatefulSet.</span>
</span></span><span style="display:flex;"><span>kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/metacontroller/master/manifests/metacontroller.yaml
</span></span></code></pre></div><p>Then we&rsquo;ll add some new metadata to our existing StatefulSet. The metacontroller script will use these values to configure the load balancers.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">StatefulSet</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">annotations</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">service-per-pod-label</span>: <span style="color:#e6db74">&#34;pod-name&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">service-per-pod-ports</span>: <span style="color:#e6db74">&#34;80:80&#34;</span>
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><p>We also need to tell Kubernetes to decorate each Pod with a pod-name label. We do this in the StatefulSet&rsquo;s pod template.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>...
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">template</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">annotations</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">pod-name-label</span>: <span style="color:#e6db74">&#34;pod-name&#34;</span>
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><p>Note: this only works in k8s 1.9+ - if you&rsquo;re stuck with a lower version, you can script this action with Metacontroller, too. :).</p>
<p>Now you&rsquo;re going to need two hooks. Put them in a directory together so they&rsquo;re easy to apply at once. These ones are written in jsonnet, but you could write this in whatever language you like.</p>
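<p>One detail worth understanding first: the <code>service-per-pod-ports</code> annotation packs mappings as comma-separated <code>port:targetPort</code> pairs. The same parsing the hooks perform, sketched in shell with an illustrative value:</p>

```shell
ports="80:80,443:8443"       # illustrative annotation value
IFS=','
for pair in $ports; do       # split on commas
  echo "port=${pair%%:*} targetPort=${pair##*:}"
done
unset IFS
```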
<p>The first hook actually creates the LoadBalancer for each Pod.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-c#" data-lang="c#"><span style="display:flex;"><span>function(request) {
</span></span><span style="display:flex;"><span>  local statefulset = request.<span style="color:#66d9ef">object</span>,
</span></span><span style="display:flex;"><span>  local labelKey = statefulset.metadata.annotations[<span style="color:#e6db74">&#34;service-per-pod-label&#34;</span>],
</span></span><span style="display:flex;"><span>  local ports = statefulset.metadata.annotations[<span style="color:#e6db74">&#34;service-per-pod-ports&#34;</span>],
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">// Create a service for each Pod, with a selector on the given label key.</span>
</span></span><span style="display:flex;"><span>  attachments: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      apiVersion: <span style="color:#e6db74">&#34;v1&#34;</span>,
</span></span><span style="display:flex;"><span>      kind: <span style="color:#e6db74">&#34;Service&#34;</span>,
</span></span><span style="display:flex;"><span>      metadata: {
</span></span><span style="display:flex;"><span>        name: statefulset.metadata.name + <span style="color:#e6db74">&#34;-&#34;</span> + index,
</span></span><span style="display:flex;"><span>        labels: {app: <span style="color:#e6db74">&#34;service-per-pod&#34;</span>}
</span></span><span style="display:flex;"><span>      },
</span></span><span style="display:flex;"><span>      spec: {
</span></span><span style="display:flex;"><span>        type: <span style="color:#e6db74">&#34;LoadBalancer&#34;</span>,
</span></span><span style="display:flex;"><span>        selector: {
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">          [labelKey]</span>: statefulset.metadata.name + <span style="color:#e6db74">&#34;-&#34;</span> + index
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        ports: [
</span></span><span style="display:flex;"><span>          {
</span></span><span style="display:flex;"><span>            local parts = std.split(portnums, <span style="color:#e6db74">&#34;:&#34;</span>),
</span></span><span style="display:flex;"><span>            port: std.parseInt(parts[<span style="color:#ae81ff">0</span>]),
</span></span><span style="display:flex;"><span>            targetPort: std.parseInt(parts[<span style="color:#ae81ff">1</span>]),
</span></span><span style="display:flex;"><span>          }
</span></span><span style="display:flex;"><span>          <span style="color:#66d9ef">for</span> portnums <span style="color:#66d9ef">in</span> std.split(ports, <span style="color:#e6db74">&#34;,&#34;</span>)
</span></span><span style="display:flex;"><span>        ]
</span></span><span style="display:flex;"><span>      }
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> index <span style="color:#66d9ef">in</span> std.range(<span style="color:#ae81ff">0</span>, statefulset.spec.replicas - <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>  ]
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>The other hook is the &ldquo;finalizer&rdquo; - it responds to changes or deletions in pods by tearing down the corresponding LoadBalancers.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-jsonnet" data-lang="jsonnet"><span style="display:flex;"><span>function(request) {
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">// If the StatefulSet is updated to no longer match our decorator selector,</span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">// or if the StatefulSet is deleted, clean up any attachments we made.</span>
</span></span><span style="display:flex;"><span>  attachments: [],
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">// Mark as finalized once we observe all Services are gone.</span>
</span></span><span style="display:flex;"><span>  finalized: std.length(request.attachments[<span style="color:#e6db74">&#39;Service.v1&#39;</span>]) == <span style="color:#ae81ff">0</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Add those into a subdirectory, and put them into a configmap together. Metacontroller will run them from there.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>kubectl create configmap service-per-pod-hooks -n metacontroller --from-file<span style="color:#f92672">=</span>hooks
</span></span></code></pre></div><p>Now apply the actual decorator controller which will run those functions. Note that you have to identify your hook jsonnet files by (file) name! Get the name wrong, and the finalizer will hang forever, <a href="https://github.com/kubernetes/kubernetes/issues/72598">preventing you from deleting your statefulset</a>. In my case, the files were called <code>create-lb-per-pod.jsonnet</code> and <code>finalizer.json</code>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">metacontroller.k8s.io/v1alpha1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">DecoratorController</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">service-per-pod</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">resources</span>:
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1beta1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">resource</span>: <span style="color:#ae81ff">statefulsets</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">annotationSelector</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">matchExpressions</span>:
</span></span><span style="display:flex;"><span>      - {<span style="color:#f92672">key: service-per-pod-label, operator</span>: <span style="color:#ae81ff">Exists}</span>
</span></span><span style="display:flex;"><span>      - {<span style="color:#f92672">key: service-per-pod-ports, operator</span>: <span style="color:#ae81ff">Exists}</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">attachments</span>:
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">v1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">resource</span>: <span style="color:#ae81ff">services</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">hooks</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">sync</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">webhook</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">url</span>: <span style="color:#ae81ff">http://service-per-pod.metacontroller/create-lb-per-pod</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">finalize</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">webhook</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">url</span>: <span style="color:#ae81ff">http://service-per-pod.metacontroller/finalizer</span>
</span></span><span style="display:flex;"><span>---
</span></span><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">apps/v1beta1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">Deployment</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">service-per-pod</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">namespace</span>: <span style="color:#ae81ff">metacontroller</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">replicas</span>: <span style="color:#ae81ff">1</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">selector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">matchLabels</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">app</span>: <span style="color:#ae81ff">service-per-pod</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">template</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">labels</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">app</span>: <span style="color:#ae81ff">service-per-pod</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">containers</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">hooks</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">image</span>: <span style="color:#ae81ff">metacontroller/jsonnetd:0.1</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">imagePullPolicy</span>: <span style="color:#ae81ff">Always</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">workingDir</span>: <span style="color:#ae81ff">/hooks</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">volumeMounts</span>:
</span></span><span style="display:flex;"><span>        - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">hooks</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">mountPath</span>: <span style="color:#ae81ff">/hooks</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">hooks</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">configMap</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">name</span>: <span style="color:#ae81ff">service-per-pod-hooks</span>
</span></span><span style="display:flex;"><span>---
</span></span><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">Service</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">service-per-pod</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">namespace</span>: <span style="color:#ae81ff">metacontroller</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">spec</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">selector</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">app</span>: <span style="color:#ae81ff">service-per-pod</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>  - <span style="color:#f92672">port</span>: <span style="color:#ae81ff">80</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">targetPort</span>: <span style="color:#ae81ff">8080</span>
</span></span></code></pre></div><p>That&rsquo;s it! Now you can scale complete replicas of a very-stateful application with a simple <code>kubectl scale sts nginx --replicas=900</code>.</p>
<p>Enjoy bragging to your friends about your &ldquo;macroservices architecture&rdquo;, pushing the limits of Kubernetes to run and replicate a stateful monolith!</p>
<p><em>Everyone hates writing YAML. Check out the <a href="https://github.com/ohthehugemanatee/kubernetes-stateful-example">sample code for this post on Github</a>.</em></p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Optimizing data transfer speeds]]></title>
    <link href="https://ohthehugemanatee.org/blog/2018/12/27/optimizing-data-transfer-speeds/"/>
    <id>https://ohthehugemanatee.org/blog/2018/12/27/optimizing-data-transfer-speeds/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2018-12-27T10:30:23+00:00</published>
    <updated>2018-12-27T10:30:23+00:00</updated>
    <content type="html"><![CDATA[<p>One of my holiday projects was to set up my home &ldquo;data warehouse.&rdquo; Ever since <a href="https://www.dropboxforum.com/t5/Syncing-and-uploads/Linux-Dropbox-client-warn-me-that-it-ll-stop-syncing-in-Nov-why/m-p/290065/highlight/true#M42255">Dropbox killed modern Linux filesystem support</a> I&rsquo;ve been using (and loving) <a href="https://nextcloud.com/">Nextcloud</a> from my home. It backs up to an encrypted <a href="https://www.duplicati.com/">Duplicati</a> store on <a href="https://azure.microsoft.com/en-us/services/storage/blobs/">Azure blob store</a>, so that&rsquo;s offsite backups taken care of. But it was time to knit all my various drives together into a single RAID data warehouse. The only problem: how to transfer my 2 terabytes (rounded to make the math in the post easier) of data, without nasty downtime during the holidays?</p>
<p>A local network transfer is the fastest, with the least downtime. I have a switched gigabit network in my house, and all my servers are hard wired. That&rsquo;s about 125 megabytes per second; a theoretical 5 hours to transfer everything. Not bad! Start up an rsync and I&rsquo;m all done! So I kicked it off and went to bed:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ ssh nextcloud.vert
</span></span><span style="display:flex;"><span>$ rsync -axz /media/usbdrive/ warehouse:/mnt/storage/ --log-file<span style="color:#f92672">=</span>transfer-to-warehouse.log &amp;
</span></span></code></pre></div><p>I woke up in the morning with the excitement of a kid on Christmas. Everything should be done, right?</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ ssh warehouse df -h |grep md0
</span></span><span style="display:flex;"><span>/dev/md0        2.7T  501G  2.1T  20% /mnt/storage
</span></span><span style="display:flex;"><span>$
</span></span></code></pre></div><p>Wait, what? How had it only transferred 500 gigabytes overnight? Including time for <em>Doctor Who</em> and breakfast, that was barely a tenth of my gigabit pipe! I knew it was time to play everyone&rsquo;s favorite game: <em>&ldquo;where&rsquo;s the bottleneck?&rdquo;</em></p>
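<p><em>Back-of-the-envelope on those numbers (a sketch; the ten-hour overnight window is my own estimate):</em></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span># Theoretical: ~2 TB at gigabit wire speed
</span></span><span style="display:flex;"><span>total_mb=$((2 * 1000 * 1000))    # ~2 TB in MB
</span></span><span style="display:flex;"><span>wire=125                         # gigabit ethernet, in MB/s
</span></span><span style="display:flex;"><span>echo "theoretical: $(( total_mb / wire )) seconds"    # 16000 s, call it five hours
</span></span><span style="display:flex;"><span># Actual: ~500 GB moved in a ~10 hour overnight window
</span></span><span style="display:flex;"><span>moved_mb=$((500 * 1000))
</span></span><span style="display:flex;"><span>hours=10
</span></span><span style="display:flex;"><span>echo "actual: $(( moved_mb / (hours * 3600) )) MB/s"  # ~13 MB/s, an order of magnitude short
</span></span></code></pre></div>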
<p>I guess it could be rsync scanning all those small files. If that&rsquo;s the case, we&rsquo;ll see high CPU usage, and even higher load numbers (as processes are I/O blocked):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ ssh nextcloud
</span></span><span style="display:flex;"><span>$ top
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>top - 08:22:27 up 10:26,  <span style="color:#ae81ff">1</span> user,  load average: 1.20, 1.34, 1.33
</span></span><span style="display:flex;"><span>Tasks: <span style="color:#ae81ff">170</span> total,   <span style="color:#ae81ff">2</span> running, <span style="color:#ae81ff">106</span> sleeping,   <span style="color:#ae81ff">0</span> stopped,   <span style="color:#ae81ff">0</span> zombie
</span></span><span style="display:flex;"><span>%Cpu<span style="color:#f92672">(</span>s<span style="color:#f92672">)</span>: 28.0 us,  2.1 sy,  0.0 ni, 69.5 id,  0.1 wa,  0.0 hi,  0.3 si,  0.0 st
</span></span><span style="display:flex;"><span>KiB Mem : <span style="color:#ae81ff">16330372</span> total,   <span style="color:#ae81ff">180568</span> free,   <span style="color:#ae81ff">657104</span> used, <span style="color:#ae81ff">15492700</span> buff/cache
</span></span><span style="display:flex;"><span>KiB Swap:  <span style="color:#ae81ff">4194300</span> total,  <span style="color:#ae81ff">4162556</span> free,    <span style="color:#ae81ff">31744</span> used. <span style="color:#ae81ff">15300068</span> avail Mem 
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                  
</span></span><span style="display:flex;"><span> <span style="color:#ae81ff">8755</span> ohthehu+  <span style="color:#ae81ff">20</span>   <span style="color:#ae81ff">0</span>  <span style="color:#ae81ff">130572</span>  <span style="color:#ae81ff">58456</span>   <span style="color:#ae81ff">2672</span> R  99.0  0.4 513:14.75 rsync                                                                                    
</span></span><span style="display:flex;"><span> <span style="color:#ae81ff">8756</span> ohthehu+  <span style="color:#ae81ff">20</span>   <span style="color:#ae81ff">0</span>   <span style="color:#ae81ff">49596</span>   <span style="color:#ae81ff">6648</span>   <span style="color:#ae81ff">5152</span> S  16.9  0.0  92:12.29 ssh 
</span></span><span style="display:flex;"><span>...
</span></span></code></pre></div><p>OK, let&rsquo;s kill the transfer and start again using a single large, piped tarball. No more small file scans!</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ ssh nextcloud
</span></span><span style="display:flex;"><span>$ cd /media/bigdrive <span style="color:#f92672">&amp;&amp;</span> tar cf - . | ssh warehouse <span style="color:#e6db74">&#34;cd /mnt/storage &amp;&amp; tar xpvf -&#34;</span>
</span></span></code></pre></div><p>That helps, but we&rsquo;re still compressing lots of data unnecessarily (most of my data is already compressed), and encrypting it, too. We can improve it with a lightweight ssh cipher and disabled compression:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ ssh nextcloud
</span></span><span style="display:flex;"><span>$ cd /media/bigdrive <span style="color:#f92672">&amp;&amp;</span> tar cf - . | ssh -o Compression<span style="color:#f92672">=</span>no -c chacha20-poly1305@openssh.com warehouse <span style="color:#e6db74">&#34;cd /mnt/storage &amp;&amp; tar xpf -&#34;</span>
</span></span></code></pre></div><p>That chacha20-poly1305 is a very fast cipher indeed - faster than the old arcfour cipher we used to use in this case. But SSH still puts extra work on the CPU. So let&rsquo;s remove it completely from the equation and just use netcat.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ ssh nextcloud <span style="color:#e6db74">&#39;cd /media/bigdrive &amp;&amp; tar cf - . | pv | nc -l -q 5 -p 9999&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># in a separate terminal</span>
</span></span><span style="display:flex;"><span>$ ssh warehouse <span style="color:#e6db74">&#39;cd /mnt/storage &amp;&amp; nc nextcloud 9999 | pv | tar -xf -&#39;</span>
</span></span></code></pre></div><p>Transfer speeds now average about 61 megabytes per second. That&rsquo;s fast enough to kick in the law of diminishing returns on my optimization effort: this will take about 8 hours to transfer if I keep it running. I had to pause work for an hour; now if I spend another hour on this, it has to shave more than 25% off my transfer time to finish any earlier tonight. I&rsquo;m not confident I can beat those numbers.</p>
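<p><em>The break-even arithmetic, sketched out:</em></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span># One hour already lost plus one more hour of tuning, against ~8 hours of transfer
</span></span><span style="display:flex;"><span>sunk_hours=2
</span></span><span style="display:flex;"><span>transfer_hours=8
</span></span><span style="display:flex;"><span>echo "must save more than $(( sunk_hours * 100 / transfer_hours ))% to come out ahead"
</span></span></code></pre></div>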
<p>Still - What happened to my 125 theoretical megabytes per second? Here are the culprits I suspect - and can&rsquo;t really do anything about:</p>
<ul>
<li>
<p><strong>Slow disk</strong>: We are writing to a software RAID5 array of old drives. In my head I was using the channel width of SATA-II for my calculations. In reality, and especially on spinning metal, write speeds are much slower. I looked up my component disks on <a href="https://www.userbenchmark.com">userbenchmark.com</a>, and the slowest member has an average sustained sequential write speed of 69 MB/s. This is very likely my first bottleneck. At most I can only use half of my available bandwidth.</p>
</li>
<li>
<p><strong>TCP</strong>: After replacing all my drives with SSDs, TCP is the next culprit I would go after. The protocol technically only has about 6% of overhead, but it also dynamically seeks the maximum send rate through <a href="https://en.wikipedia.org/wiki/TCP_congestion_control">TCP Congestion Control</a>. It keeps trying to send &ldquo;just a little faster&rdquo;, until the number of unacknowledged packets exceeds a threshold. Then it backs off to 50%, and goes back to &ldquo;just a little faster&rdquo; mode. This loop means your practical speed with a TCP stream is about 75% of the pipe&rsquo;s theoretical maximum. Think of it like John Cleese offering just one more &ldquo;wafer thin&rdquo; packet. I considered using UDP to avoid this, but I actually <em>want</em> the error-checking in TCP. Probably the best solution is <a href="https://github.com/LabAdvComp/UDR">something esoteric like UDR</a>.</p>
</li>
<li>
<p><strong>Slow CPU</strong>: This is the last bottleneck here. <em>Warehouse</em> is an old Intel Core2 Duo I had lying around the house. Untar and netcat aren&rsquo;t exactly CPU hungry beasts, but at some point there IS a maximum. If you believe the FreeNAS folks, a fileserver needs an i5 and 8 gigs of RAM for basic functionality. I haven&rsquo;t found that to be the case, but then I&rsquo;m not using RAID-Z, either.</p>
</li>
</ul>
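<p><em>Stacking those suspects against each other, as a sketch (the 69 MB/s and ~75% figures come from the list above):</em></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>wire=125                       # gigabit wire speed, MB/s
</span></span><span style="display:flex;"><span>tcp=$(( wire * 75 / 100 ))     # ~75% practical TCP throughput
</span></span><span style="display:flex;"><span>disk=69                        # slowest RAID member, sustained sequential write
</span></span><span style="display:flex;"><span>min=$disk
</span></span><span style="display:flex;"><span>if [ "$tcp" -lt "$min" ]; then min=$tcp; fi
</span></span><span style="display:flex;"><span>echo "expected ceiling: $min MB/s"    # the slowest stage wins
</span></span></code></pre></div>
<p><em>69 MB/s is comfortably above the ~61 MB/s observed, once you allow for RAID5 parity writes and protocol overhead.</em></p>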
<p>I&rsquo;m happy with the outcome here. I have another drive to copy later, with another terabyte. I&rsquo;m considering removing that slowest drive from my RAID array, since the next-slowest one is almost 50% faster. Then I can copy to the array while it&rsquo;s in degraded mode, and re-add the slowpoke afterwards. We&rsquo;ll see.</p>
<p>Happy holidays!</p>
<h3 id="appendix-easy-performance-testing">Appendix: easy performance testing</h3>
<p>If you&rsquo;re working on a similar problem for yourself, you might find these performance testing commands helpful. The idea is to tease apart each component of the transfer. There are better, more detailed, dedicated tools for each of these, but in a game of &ldquo;find the bottleneck&rdquo; you really only need quick and dirty validation. Fun fact: the command <em>dd</em> is actually short for <strong>D</strong>own and <strong>D</strong>irty. Well it should be, at any rate.</p>
<p><strong>Read speed (on the source)</strong> is easy: hand an arbitrary large file to dd, and write down the numbers it gives.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ dd <span style="color:#66d9ef">if</span><span style="color:#f92672">=</span>large-file.tar.bz2 of<span style="color:#f92672">=</span>/dev/null bs<span style="color:#f92672">=</span>1M
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1021317200</span> bytes <span style="color:#f92672">(</span><span style="color:#ae81ff">1</span> GB<span style="color:#f92672">)</span> copied, 3.9888 s, <span style="color:#ae81ff">256</span> MB/s
</span></span></code></pre></div><p><strong>Network speed</strong> can be tested by netcatting a gigabyte of zeros from one machine to the other.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># On the receiving machine, open a port to /dev/null</span>
</span></span><span style="display:flex;"><span>$ nc -vvlnp <span style="color:#ae81ff">12345</span> &gt;/dev/null
</span></span><span style="display:flex;"><span><span style="color:#75715e"># On the sending machine, send a gig of zeroes to that port</span>
</span></span><span style="display:flex;"><span>$ dd <span style="color:#66d9ef">if</span><span style="color:#f92672">=</span>/dev/zero bs<span style="color:#f92672">=</span>1M count<span style="color:#f92672">=</span>1K | nc -vvn 192.168.1.50 <span style="color:#ae81ff">12345</span>
</span></span><span style="display:flex;"><span>Connection to 192.168.1.50 <span style="color:#ae81ff">12345</span> port <span style="color:#f92672">[</span>tcp/*<span style="color:#f92672">]</span> succeeded!
</span></span><span style="display:flex;"><span>1024+0 records in
</span></span><span style="display:flex;"><span>1024+0 records out
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1073741824</span> bytes <span style="color:#f92672">(</span>1.1 GB, 1.0 GiB<span style="color:#f92672">)</span> copied, 11.7811 s, 91.1 MB/s
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Remember, 8 bits to a byte!</span>
</span></span><span style="display:flex;"><span>$ echo <span style="color:#e6db74">&#34;</span><span style="color:#66d9ef">$(</span>bc -l <span style="color:#f92672">&lt;&lt;&lt;</span> 91*8<span style="color:#66d9ef">)</span><span style="color:#e6db74"> Megabits&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">728</span> Megabits
</span></span></code></pre></div><p><strong>Write speed on the destination</strong> can be tested with dd, too:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ dd <span style="color:#66d9ef">if</span><span style="color:#f92672">=</span>/dev/zero bs<span style="color:#f92672">=</span>1M count<span style="color:#f92672">=</span><span style="color:#ae81ff">1024</span> of<span style="color:#f92672">=</span>/mnt/storage/test.img
</span></span><span style="display:flex;"><span>1024+0 records in
</span></span><span style="display:flex;"><span>1024+0 records out
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1073741824</span> bytes <span style="color:#f92672">(</span>1.1 GB, 1.0 GiB<span style="color:#f92672">)</span> copied, 55.3836 s, 19.4 MB/s
</span></span></code></pre></div><p>(note: these tests were run while the copy was happening on warehouse. Your numbers should be better than this!)</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[Drupal Does Face Recognition: Introducing Image Auto Tag module]]></title>
    <link href="https://ohthehugemanatee.org/blog/2018/04/19/face-recognition-on-drupal/"/>
    <id>https://ohthehugemanatee.org/blog/2018/04/19/face-recognition-on-drupal/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2018-04-19T18:06:51+00:00</published>
    <updated>2018-04-19T18:06:51+00:00</updated>
    <content type="html"><![CDATA[<p>Last week I wrote a Drupal module that uses face recognition to automatically tag images with the people in them. You can find it on <a href="https://github.com/ohthehugemanatee/image_auto_tag">Github</a>, of course. With this module, you can add an image to a node, and automatically populate an entity_reference field with the names of the people in the image. This isn&rsquo;t such a big deal for individual nodes of course; it&rsquo;s really interesting for bulk use cases, like Digital Asset Management systems.</p>
<p><img src="/images/image-auto-tag.gif" alt="Automatic tags, now in a Gif."></p>
<p>I had a great time at Drupalcon Nashville, reconnecting with friends, mentors, and colleagues as always. But this time I had some fresh perspective. After 3 months working with Microsoft&rsquo;s (badass) CSE unit - building cutting edge proofs-of-concept for some of their biggest customers - the contrast was powerful. The Drupal core development team are famously obsessive about code quality and about optimizing the experience for developers and users. The velocity in the platform is truly amazing. But we&rsquo;re missing out on a lot of the recent stuff that large organizations are building in their more custom applications. You may have noticed the same: all the cool kids are posting about Machine Learning, sentiment analysis, and computer vision. We don&rsquo;t see any of that at Drupalcon.</p>
<p>There&rsquo;s no reason to miss out on this stuff, though. Services like Azure are making it extremely easy to do all of these things, layering simple HTTP-based APIs on top of the complexity. As far as I can tell, the biggest obstacle is that there aren&rsquo;t well defined standards for how to interact with these kinds of services, so it&rsquo;s hard to make a generic module for them. This isn&rsquo;t like the Lucene/Solr/ElasticSearch world, where one set of syntax - indeed, one model of how to think of content and communicate with a search-specialized service - has come to dominate. Great modules like search_api depend on these conceptual similarities between backends, and they just don&rsquo;t exist yet for cognitive services.</p>
<p>So I set out to try and explore those problems in a Drupal module.</p>
<p><strong>Image Auto Tag</strong> is my first experiment. It works, and I encourage you to play around with it, but please don&rsquo;t even think of using it in production yet. It&rsquo;s a starting point for how we might build an analog to the great <a href="https://drupal.org/project/search_api">search_api</a> framework, for cognitive services rather than search.</p>
<p>I built it on Azure&rsquo;s Cognitive Services Face API to start. Since the service is free for up to 5000 requests per month, this seemed like a place that most Drupalists would feel comfortable playing. Next up I&rsquo;ll abstract the Azure portion of it into a plugin system, and try to define a common interface that makes sense whether it&rsquo;s referring to Azure cognitive services, or a self-hosted, open source system like <a href="https://cmusatyalab.github.io/openface/">OpenFace</a>. That&rsquo;s the actual &ldquo;hard work&rdquo;.</p>
<p>In the meantime, I&rsquo;ll continue to make this more robust with more tests, an easier UI, asynchronous operations, and so on. At a minimum it&rsquo;ll become a solid &ldquo;Azure Face Detection&rdquo; module for Drupal, but I would love to make it more generally useful than that.</p>
<p>Comments, Issues, and helpful PRs are welcome.</p>
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[The #1 Question I Get Asked Working at MS: Why Do You Run Linux?]]></title>
    <link href="https://ohthehugemanatee.org/blog/2018/02/07/the-number-1-question-i-get-asked-working-at-msy-do-you-run-linux/"/>
    <id>https://ohthehugemanatee.org/blog/2018/02/07/the-number-1-question-i-get-asked-working-at-msy-do-you-run-linux/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2018-02-07T16:42:50+00:00</published>
    <updated>2018-02-07T16:42:50+00:00</updated>
    <content type="html"><![CDATA[<!-- raw HTML omitted -->
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[My War on Systemd-resolved]]></title>
    <link href="https://ohthehugemanatee.org/blog/2018/01/25/my-war-on-systemd-resolved/"/>
    <id>https://ohthehugemanatee.org/blog/2018/01/25/my-war-on-systemd-resolved/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2018-01-25T11:05:31+00:00</published>
    <updated>2018-01-25T11:05:31+00:00</updated>
    <content type="html"><![CDATA[<!-- raw HTML omitted -->
]]></content>
  </entry>
  <entry>
    <title type="html"><![CDATA[I&#39;m joining Microsoft, because they&#39;re doing Open Source Right]]></title>
    <link href="https://ohthehugemanatee.org/blog/2018/01/10/im-joining-microsoft-because-theyre-doing-open-source-right/"/>
    <id>https://ohthehugemanatee.org/blog/2018/01/10/im-joining-microsoft-because-theyre-doing-open-source-right/</id>
    <author>
      <name>Campbell Vertesi (ohthehugemanatee)</name>
    </author>
    <published>2018-01-10T20:32:50+00:00</published>
    <updated>2018-01-10T20:32:50+00:00</updated>
    <content type="html"><![CDATA[<p>I&rsquo;m excited to announce that I&rsquo;ve signed with <strong>Microsoft</strong> as a Principal Software Engineering Manager. <strong>I&rsquo;m joining Microsoft because they are doing enterprise Open Source the Right Way, and I want to be a part of it</strong>. This is a sentence that I never believed I would write or say, so I want to explain.</p>
<p>First I have to acknowledge the history. I co-founded my first tech company just as the <a href="https://en.wikipedia.org/wiki/Halloween_documents">Halloween documents</a> were leaked. That&rsquo;s where the world learned that Microsoft considered Open Source (and Linux in particular) a threat, and was intentionally spreading FUD as a strategic counter. It was also the origin of their famous <a href="https://en.wikipedia.org/wiki/Embrace%2C_extend%2C_and_extinguish">Embrace, Extend, and Extinguish</a> strategy. The Microsoft approach to Open Source only got more aggressive from there, funneling money to <a href="https://en.wikipedia.org/wiki/SCO/Linux_controversies">SCO&rsquo;s lawsuits</a> against Linux and its users, calling OSS licensing a &ldquo;cancer&rdquo;, and accusing Linux of violating MS intellectual property.</p>
<p>I don&rsquo;t need to get exhaustive about this to make my point: <strong>for the first decade of my career (or more), Microsoft was rightly perceived as a villain in the OSS world</strong>. They did real damage and disservice to the open source movement, and ultimately to their own customers. Five years ago I wouldn&rsquo;t have even entertained the thought of working for &ldquo;the evil empire.&rdquo;</p>
<p>Yes, Microsoft has made friendly moves toward open source since the new CEO (Satya Nadella) took over in 2014. They open sourced .NET and Visual Studio Code, they released TypeScript, they joined the <a href="https://www.linuxfoundation.org/">Linux Foundation</a> and went platinum with the <a href="https://opensource.org/">Open Source Initiative</a>, but come on. I&rsquo;m an open source warrior, an evangelist, and a developer. I could see through the bullshit. Even when Microsoft announced the Linux subsystem on Windows, I was certain that this was just another round of Embrace, Extend, Extinguish.</p>
<p>Then I met <a href="http://www.joshholmes.com/">Josh Holmes</a> at the <a href="https://www.phpconference.nl/">Dutch PHP Conference</a>.</p>
<p>First of all, I was shocked to meet a Microsoft representative at an open source conference. He didn&rsquo;t even have bodyguards. I remember my first question for him was &ldquo;What are you <em>doing</em> here?&rdquo;</p>
<p>Josh told me a story about visiting startup conferences in Silicon Valley on behalf of Microsoft in 2007, and reporting back to Ballmer&rsquo;s office:</p>
<blockquote>
<p>&ldquo;The good news is, no one is making jokes about Microsoft anymore. The bad news is, <strong>they aren&rsquo;t even making jokes about Microsoft anymore</strong>.&rdquo;</p>
</blockquote>
<p>For Josh, this was a big &ldquo;aha&rdquo; moment. The booming tech startup space was focused on Open Source, so if Microsoft wanted to survive there, they had to come to the table.</p>
<p>That revelation led to the creation of the Microsoft Partner Catalyst Team. Here&rsquo;s Josh&rsquo;s explanation of the team and its job, from an <a href="https://www.youtube.com/watch?v=qkTioWRH-Ws">interview</a> at the time I met him:</p>
<blockquote>
<p>&ldquo;We work with a lot of startups, at the very top edge of the enterprise mix. We look at their toughest problems, and we go solve those problems with open source. We&rsquo;ve got 70 engineers and architects, and we go work with the startups hand in hand. We&rsquo;ll sit down for a little pair programming with them, sometimes it will be a large enough problem that will take it off on our own and we&rsquo;ll work on it for a while, and we&rsquo;ll come back and give them the code. Everything that we do ends up in Github under typically an MIT or Apache license if it&rsquo;s original work that we&rsquo;re doing on our own, or a lot of times we&rsquo;re actually working within other open source projects.&rdquo;</p>
</blockquote>
<p>Meeting with Josh was a turning point for my understanding of Microsoft. This wasn&rsquo;t just something that I could begrudgingly call &ldquo;OK for open source&rdquo;. This wasn&rsquo;t just lip service. This was a whole department of people that were doing <em>exactly</em> what I believe in. Not only did I like the sound of this; I found that <strong>I actually wanted to work with this group</strong>.</p>
<p>Still, when I considered interviewing with Microsoft, <strong>I knew that my first question had to be about &ldquo;Embrace, Extend, and Extinguish&rdquo;</strong>. Josh is a nice guy, and very smart, but I wasn&rsquo;t going to let the wool be pulled over <em>my</em> eyes.</p>
<p>Over the next months, I spoke with five different people doing exactly this kind of work at Microsoft. I did my research, and I plumbed all my back-channel resources for dirt. And everything I came back with said <strong>I was wrong</strong>.</p>
<p>Microsoft really <em>is</em> undergoing a fundamental shift towards Open Source.</p>
<p>CEO Satya Nadella is frank that <strong>closed-source licensing as a profit model is a dead end</strong>. Since 2014, Microsoft has been transitioning its core business from licensed software to platform services. After all, why sell a license once when you can rent it out monthly? So they move every licensed product they can online and rent it instead of selling it. Then they rent out the infrastructure itself, too - hence Azure. Suddenly flexibility is at a premium. As one CTO put it, <strong>for Azure to be Windows-only would be a liability</strong>.</p>
<p>This shift is old news for most of the world. As much as the Hacker News crowd still bitches about it as FUD, this strategic direction has been in and out of the financial pages for years now. Microsoft has pivoted to platform services. Look at their profits by product over the last 8 years:</p>
<p><img src="/images/microsoft-profits-by-product.png" alt="Microsoft profits by product, over year."></p>
<p>The trend is obvious: <strong>server and platform services are the place to invest</strong>. Office only remains at the top of the heap because it transitioned to SaaS. Even Windows license profits are declining. This means focusing on interoperability. Make sure <em>everything</em> can run on your platform, because anything less handicaps the source of your biggest short- and medium-term profit. In fact, <strong>remaining adversarial to Open Source would kill the golden goose</strong>. Microsoft <em>has</em> to change its values in order to make this shift.</p>
<p>So much for financial and strategic direction; but this is a hundred-thousand-person company. That ship doesn&rsquo;t turn on a dime, no matter what the press releases tell you. So <strong>my second interview question became &ldquo;How is the transition going?&rdquo;</strong> This sort of question makes people uncomfortable: the answer is either transparently unrealistic, or critical of your environment and colleagues. Over and over again, I heard the right answer: It&rsquo;s freakin&rsquo; hard.</p>
<p>MS has more than 40 years of proprietary development experience and institutional momentum. All of their culture and systems - from hiring, to code reviews, to legal authorizations - have been organized around that model. That&rsquo;s very hard to change! I heard horror stories about the beginning of the transition, having to pass every line of contribution past the Legal department. I heard about managers feeling lost, or losing a sense of authority over their own team. I heard about development teams struggling to understand that their place in an OSS project was on par with some Rando Calrissian contributor from Kansas. And I heard about how the company was helping people with the transition, changing systems and structures to make this cultural shift happen.</p>
<p>The stories I heard were important evidence, and they contradicted the old narrative I had in my head. <strong>Embrace, extend, extinguish does not involve leadership challenges, or breaking down of hierarchies</strong>. It does not involve personal struggle and departmental reorganization. The stories I heard described an organization attempting a real paradigm shift, for tens of thousands of people around the world. It is not perfect, and it is not finished, but I believe that the transition is real.</p>
<p><strong>When you accept that Microsoft is trying to reorient its own culture to Open Source, suddenly all those &ldquo;transparent&rdquo; PR moves you dismissed get re-framed</strong>. They are accomplishments. It&rsquo;s incredibly difficult to change the culture of one of the biggest companies in the world&hellip; but today, almost half of Azure users run Linux. Microsoft&rsquo;s virtualization work made them the <a href="http://www.techradar.com/news/software/operating-systems/inside-the-linux-kernel-3-0-1035353/2">fifth largest contributor to the 3.x Linux kernel</a>. Microsoft maintains <a href="https://octoverse.github.com/">the biggest project on Github (by contributor count)</a>. They maintain a BSD distribution <em>and</em> a Linux distribution. And a huge part of LXD (the container-based virtualization system for Linux) comes from Microsoft&rsquo;s work with Canonical.</p>
<p>That&rsquo;s impressive for any company. But Microsoft? It boggles the mind. This level of contribution is not lip service. You don&rsquo;t maintain a 15,000-person community just for PR. <strong>Microsoft is contributing as much or more to open source than many other major players who have had this in their culture from the start</strong> (Google, Facebook, Twitter, LinkedIn&hellip;). It&rsquo;s an accomplishment, and it&rsquo;s impressive!</p>
<p>In the group I&rsquo;m entering, a strong commitment to Open Source is built into the project structure, the team responsibilities, and the budgeting practice. Every project has time specifically set aside for contribution; developers&rsquo; connections to their communities are respected and encouraged. After a decade of working with companies who try to engage with open source responsibly, I can say that <strong>this is the strongest institutional commitment to &ldquo;giving back&rdquo; that I have ever seen</strong>. It&rsquo;s a stronger support for contribution than I&rsquo;ve ever been able to offer in any of my roles, from sole proprietor to CTO.</p>
<p>This does mean a lot more work outside of the Drupal world, though. I will still attend Drupalcons. I will still give technical talks, participate, and help make great open source communities for Drupal and other OSS projects. If anything, I will do those things <em>more</em>. And I will do them wearing a Microsoft shirt.</p>
<p>Microsoft is making a genuine, and enormous, push toward being open source community members and leaders. From everything I&rsquo;ve seen, they are doing it extremely well. From the outside at least, <strong>this is what it looks like to do enterprise Open Source The Right Way</strong>.</p>
]]></content>
  </entry>
</feed>
