Adaptive Patchwork

Agents of productivity and chaos

2026-05-24T00:00:00-07:00

It’s been over a year since I published Creative Flow vs. Critical Review, as much of my writing on AI has been internal. I’m hoping to publish a bit more here and wanted to start by sharing a little bit of my agent setup (as of May 2026). A version of this was originally published on an internal discussion board titled “Agents of productivity and chaos - multi-repo, multi-agent learnings and workflows”. For context, I’m mostly working in the GitHub Copilot App and the GitHub Copilot CLI, across a few dozen repos. The thing I wanted to share is that I’ve got a couple of skills that have been doing a lot of heavy lifting lately: one for planning multi-agent projects and another for delegating work. They show up a few times below; if you don’t read any further, check them out!

Agents multiply effort, but if you’re going in the wrong direction they don’t help at all: they make it worse! Nobody wants to go faster in the wrong direction. Agents are also prone to making messes. They tend towards more code and more chaos. Entropy and all that. If you want to code with agents, one of your primary tasks is to counter both those behaviors. You need to:

Understand what you want to build (direction) and why and how it intersects with reality. Understand this deeply.
Leverage the tools for what they excel in, don’t fight the quirks of their particular brands of “intelligence”, know when to lean in and when to push back.
Go back to your computer science and engineering fundamentals and then focus on thinking at a systems level. You’re not a worker on the factory floor of Toyota, you’re designing the Lean manufacturing process itself.

The key is setting direction, steering, validating, and iterating on your processes. You’ve got to a) set the agents up for success, b) validate the outputs, and c) feed back what works and what doesn’t. I’m NOT perfect at this (see the Failure modes section below), but I have learned a lot in the last few months. Here’s how I’m thinking about all this right now:

Personalize — Start with your own agent harness setup. Dial in an AGENTS.md and/or copilot-instructions.md with your taste, values, preferences. Next, curate a small set of user-level skills and wire up a few critical MCPs. Be careful to keep things concise and focused: don’t spend your entire context window budget here. Personalization is only the first layer of context design. This is also a good place to put things that are unique to how you like to work. For example, I have this fun little signature appended to all comments that Copilot leaves on my behalf:

Generated via Copilot (Claude Opus 4.7) on behalf of @tclem

It’s enforced by a rule in my global Copilot instructions, the model name swaps in automatically, and the linked attribution post explains what this is.
Provide great context — Next you want to make sure each project you work on also has agent building blocks: project-specific skills, design documentation, ADRs. You want this to be indexed by your search engine of choice, and you want to ensure that the agents reach for the right skills and tools without getting overwhelmed and polluting their context. This is not a totally solved problem, but tool search and meta skills (e.g. a choosing-skill skill that helps the agent pick from the project’s skill library) can help. Before you start any work, make sure you’ve either directly supplied high quality, relevant context or have a system in place for agents to find that context themselves.
Plan — The tools encourage this step (see plan mode) but I take it even further. When you plan, don’t blindly accept the plan: ask for the opinion of a different model, read the plan, ask questions, try to poke holes in it. Ask the agent to debate with another agent. Eventually, you’ll start to converge on a plan that’s actually reasonable. This is your chance to steer with fundamental principles. Use your design and engineering expertise. Think like an architect. This is where you’re picking an initial direction. Don’t run fast until you feel relatively confident it’s the right direction. I’ve got a fun skill for planning multi-agent projects.
Implement — This part’s fun. As part of planning you can ask the agent to evaluate which parts of the work can happen in parallel and which parts need to be sequential. Regardless, the agents are pretty good here now — check out this delegating work skill for ways to make them even better. I always advise letting another agent review the code of the primary agent, fix all CI issues, address all code review comments, etc. Only at this point is it worth getting a human involved for code review. I’m still manually reviewing a lot of code written by my agents. Look for patterns, this is how you’re going to tune your skills and your processes. Push back on design decisions. Open the code in your editor. Read the diff. Force the agent to solve the root cause and make fixes that will last and scale. Don’t let it get away with band-aids or workarounds. Just this week, I caught the agents happily disabling parts of CI to make the builds pass instead of fixing the actual failures — teaching our project’s merge skill to refuse those workarounds was a small but important fix.
Verify — You do still need to verify the work for most things. Reality is unforgiving. Until you run the code, you don’t know if it actually works. As Knuth put it, “beware of bugs in the above code; I have only proved it correct, not tried it.” This is another area that needs more investment from the industry at large: How do you scale verification? How do you take the tedium out of manual testing? How do you know that the agent has done what you asked it to do? And how do you know that what you asked it to do actually solves whatever root problem or job-to-be-done you had in mind? Be careful of pushing the burden of this to your end users: we’re all going to get really sick of being beta testers for vibe coded apps.
Feed back — Chances are, all sorts of things are going to go wrong in this process. Observe, take notes, let themes emerge, don’t be afraid of doing manual stuff until you actually understand the patterns and can articulate a good abstraction. Then: feed that back. Update and tune skills, add new ones, delete broken ones. Make some things deterministic: scripts, programs, etc. (see Examples below for a couple of real instances of this). Because the models themselves are also improving, you want to have a process that flexes and absorbs those changes instead of fighting them. One trick is to interrogate the models themselves. Ask, “why did/didn’t you use X skill?”, “what’s in the context that made you decide Y?”.

Failure modes

Not everything is golden in the world of agentic coding. Here are a few of the things I’m struggling with right now, offered for discussion:

Focus — With agents, it is so easy to spin up sessions that you end up with fractured attention. Too many things going on. Lots of context switching. It is easy to make mistakes, misread or forget something, etc. If you work with a team of people who are also each running their own agents, this problem multiplies.
Code review — Good code review is already hard. Now, how do you do good code review for hundreds of thousands of lines of code? What if you have 5 teammates who are also producing a huge volume of agentic code?
Verification — Manual validation is slow and tedious and automated validation is a hard problem, but not doing this just multiplies work: fixes on fixes on fixes.
Good foundations — It’s hard to get good foundations in place because everyone is moving fast, but for some reason we keep skipping known computer science fundamentals and then we pay the price later in quality issues, regressions, and tech debt. There has to be a way to move fast on solid fundamentals.

In some ways, our code and coding process have always had these problems. GitHub itself was coded by thousands of engineers over almost 2 decades: it is the very definition of legacy code. Agents mean you get legacy code like that in a few weeks, days, hours (?). There’s no silver bullet, but I expect there are better abstractions yet to be discovered.

Examples

A few real examples of this process in action…

Decomposing a 1,800-line function

A vibe-coded Rust service I work on had a match in handle_client_message that had grown to ~1,800 lines and 228 message-type arms — the next likely candidate in a string of tokio worker stack-overflow incidents we’d been dealing with. I turned on clippy::large_futures at 16 KiB during diagnosis; the function blew past it immediately, which gave me a concrete lint to anchor on.

The fix was relatively mechanical, but this code base moves forward so quickly that it was important to sequence out a small series of patches.

Introduce the futures size lint as a hard backstop so that we get CI failures, not application crashes.
Land some repo-level skill changes so that everyone else’s agents know how to avoid the bad pattern going forward.
Refactor the large match in a series of 7 sequential PRs designed to minimize merge conflicts, immediately reduce the stack sizes in this code path, and be authored largely in parallel by agents.

handle_client_message went from 16,296 bytes to 664 bytes, and the chain fed a pile of lessons back into the planning-multi-agent-projects and delegating-plan-work skills (the same skills that helped plan and execute the work to begin with).
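To make the size numbers above a little more tangible, here is a minimal sketch of how one heavy arm can bloat the future of everything that awaits it, and how boxing moves that state to the heap. The function names and the 16 KiB buffer are hypothetical, and the post does not say which technique the refactor actually used, so treat boxing as one illustrative option rather than a description of the real change.

```rust
use std::hint::black_box;
use std::mem::size_of_val;

// In a real crate the lint can be turned on with `#![deny(clippy::large_futures)]`;
// its threshold is the `future-size-threshold` setting in clippy.toml.

/// Hypothetical handler that keeps a 16 KiB buffer alive across an await,
/// so the buffer has to be stored inside the future's state machine.
async fn handle_big_message() -> u8 {
    let buf = [0u8; 16 * 1024];
    std::future::ready(()).await;
    black_box(&buf)[0]
}

/// Awaiting inline embeds the callee's entire state in the caller's future.
async fn dispatch_inline() -> u8 {
    handle_big_message().await
}

/// Boxing first leaves only a pointer-sized handle in the caller's future.
async fn dispatch_boxed() -> u8 {
    Box::pin(handle_big_message()).await
}

/// Returns (inline size, boxed size) in bytes. Nothing is polled here,
/// the futures are only constructed and measured.
fn future_sizes() -> (usize, usize) {
    (size_of_val(&dispatch_inline()), size_of_val(&dispatch_boxed()))
}
```

Calling `future_sizes()` should show the inline variant carrying at least the full 16 KiB while the boxed variant stays small. The exact byte counts depend on the compiler version, which is one reason a lint in CI is a more dependable backstop than eyeballing sizes.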

Running Copilot CLI inside GitHub Actions

The GitHub Copilot App vendors the Rust copilot-sdk and the vendored copy needs to track upstream multiple times a day. We do this with a mix of deterministic and agentic Actions workflows.

A standard Actions workflow just syncs any upstream changes into our vendored copy (obeying some rules about additional code we have that we haven’t upstreamed yet).
That job pushes and opens a PR, assigns some humans as reviewers, and CCR (Copilot code review) starts reviewing as well.
If everything is green and there are no review comments: we’re done.
If anything fails, another action running the Copilot CLI kicks off and does the work to fix any breaking changes. It addresses CI failures, review comments, etc., and pushes to the same PR when it’s done.
Everything still requires a human review and approval before merging.
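The branching in the steps above is small enough to write down as a pure function. This is only a sketch of the decision logic: `PrStatus`, `NextStep`, and `next_step` are hypothetical names, and treating open review comments the same as a CI failure is my reading of the steps, not something the workflow definition confirms.

```rust
/// Hypothetical snapshot of the sync PR after CI and automated review have run.
#[derive(Debug, Clone, Copy)]
struct PrStatus {
    ci_green: bool,
    open_review_comments: usize,
}

/// What should happen next. Neither outcome merges anything:
/// a human still has to review and approve.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    /// Nothing left for automation to do.
    AwaitHumanApproval,
    /// Start the agentic job that fixes failures and pushes to the same PR.
    RunAgentFix,
}

/// Decides whether the agentic fix-up job needs to run.
fn next_step(status: PrStatus) -> NextStep {
    if status.ci_green && status.open_review_comments == 0 {
        NextStep::AwaitHumanApproval
    } else {
        NextStep::RunAgentFix
    }
}
```

Keeping the decision deterministic and leaving only the fix-up work to the agent mirrors the mix of deterministic and agentic workflows described above.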

A complexity-first rewrite

The vibe-coded diff viewing feature in the GitHub Copilot App had cost functions that grew with the size of the underlying diffs. To fix this, we didn’t need a faster diff algorithm or more caching (there were already too many layers of unnecessary caching from overeager agents trying to make things better). What we needed was some computer science fundamentals around complexity and the use of appropriate data structures.

We’re still untangling this, but the process is: a multi-agent plan was written from the output of /research as a series of markdown documents that laid out a phased approach for refactoring the full stack feature. We wanted the complexity contract to be O(V) where V is the size of the viewport (the visible diff), not the size of the entire patch. In addition, we developed a skill that would hold agents to that contract and would pair with some deterministic tests to verify behavior. The skill and associated design documentation now live in the repo as markdown files for humans and agents to reference. Scrolling large diffs is much improved, but we’re still working through untangling some of the misguided caching and unnecessary code complexity, moving carefully to not break end users.
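As a rough illustration of what an O(V) contract paired with a deterministic test can look like, here is a minimal sketch. `DiffView`, `render_viewport`, and the touched-row counter are all hypothetical; the real feature spans a full stack and is certainly more involved than slicing a vector.

```rust
use std::cell::Cell;

/// Hypothetical diff model: rows live in a Vec, so any row is reachable by index.
struct DiffView {
    rows: Vec<String>,
    /// Instrumentation only: how many rows the render path has touched.
    rows_touched: Cell<usize>,
}

impl DiffView {
    fn new(rows: Vec<String>) -> Self {
        DiffView { rows, rows_touched: Cell::new(0) }
    }

    /// Returns only the rows in `[top, top + height)`, clamped to the diff.
    /// The work done is proportional to the viewport, not to `rows.len()`.
    fn render_viewport(&self, top: usize, height: usize) -> Vec<&str> {
        let start = top.min(self.rows.len());
        let end = top.saturating_add(height).min(self.rows.len());
        let visible = &self.rows[start..end];
        self.rows_touched.set(self.rows_touched.get() + visible.len());
        visible.iter().map(String::as_str).collect()
    }
}

/// Builds a diff with `total` rows, renders one viewport, and reports how
/// many rows the render path touched. Building the rows is O(total), but
/// that is test setup, not the path under contract.
fn rows_touched_for(total: usize, top: usize, height: usize) -> usize {
    let view = DiffView::new((0..total).map(|i| format!("+ line {i}")).collect());
    let rendered = view.render_viewport(top, height);
    assert_eq!(rendered.len(), height.min(total.saturating_sub(top)));
    view.rows_touched.get()
}
```

A deterministic check can then assert that `rows_touched_for(100, 10, 40)` and `rows_touched_for(100_000, 10, 40)` return the same number, which is the property the contract cares about: growing the patch must not grow the render cost.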

This post was written with the help of AI (Claude Opus 4.7). The vast majority of the text was handwritten, but I had an agent copy edit, fill in links, and resolve placeholders (e.g. TODO: grab this stat from datadog). See my AI attribution page for more.

Thoughts on LLMs, Part 2

2025-10-08T00:00:00-07:00

This was originally part of an internal discussions post, surfacing on my blog because it is interesting to see how far we’ve come since October 2025 and how much my thinking has changed. For context, Claude Opus 4.5 was released November 24, 2025. Read part 1.

One hypothesis of the LLM true believers is that mastery of language is the foundation on which all other intelligent capabilities can be built. As LLMs scale, as more and more data and compute is poured into foundational models, this mastery of human language results in a system that can be an expert in any domain. Language means being able to write code; it means being able to “think out loud” and “reason”; it means being able to predict what humans say and what they want you to say in any given situation: in a bar exam or as a therapist or as a Disney character; it means being able to provide iterative outputs based on feedback; it even means being able to take real world actions: run this program, book this flight.

Honestly, it is amazing that we’ve been able to scale up modeling language like this! And the flexibility and general applicability of LLMs is interesting and surprising.

However, I’m not convinced that language is the essence of intelligence: I think the arrow points the other way. I don’t believe that the mastery of predicting language actually equates to understanding, or learning, or a falsifiable modeling of reality. This doesn’t mean that LLMs aren’t useful, and yes there are a number of other ML technologies and techniques that DO model or attempt to model reality, but it does provide clarity into where to use LLMs (and where not to) and what to expect out of them.

The trap I see individuals and companies falling into is this: they truly believe that the foundational LLMs model the real world (if not today: soon). This isn’t true at all. They are trained on language (and sure, pictures and videos too) that humans have written about the real world, but they are stuck in a simulation playing a one-way game of telephone. They can’t learn, they can’t test hypotheses, and therefore they can’t build knowledge - they can only simulate how humans talk about knowledge. This means that they are fundamentally untrustworthy for certain applications. They might be excellent retrieval systems but they are doomed for any generative endeavor that must meet the harsh reality of how the world actually works.

This is actually a very helpful delineation!

When we think about writing code, there is a broad spectrum of software and jobs-to-be-done by software out there in the world. Let’s classify broadly as Type 1 and Type 2 software.

Type 1

Prototyping? Building yet another CRUD web app? Doing a task with code that many have done before you? LLMs are going to be amazing. They don’t understand, but they don’t need to and there’s not a lot of “reality” to crash into. The quality of the output will increase based on how often there are examples of solving this particular sort of thing in the training set. And honestly, for many of these applications: low quality and one-off use is fine. For many people, this will be the difference between having a software app or not. This in itself changes the world dramatically. I expect a collapse of 90% of software into a few languages (Python, JavaScript) and there will be SO much of this type of software that it’ll overwhelm what we have now but it’ll be like cheap shit from Amazon off the internet: inexpensive, useful, disposable, of varying quality, and challenging to manage because it’ll clog up everything else. We’ll need waste management solutions. It’ll super-charge spam and abuse and bots and bad-actors on the internet. It’ll also let a billion people who are unable to do so today “code” apps.

Type 2

Software Engineering. Systems of consequence. Internet platform providers. Financial systems. Novel ideas. Health care. Governmental data. Fundamental building blocks. Open source stalwarts. Novel algorithms and data structures and architectures. Performance, correctness. Anywhere that software’s job is to model reality. No matter how good they get, LLMs are going to be frustrating to use for Type 2 software. They will require extra effort from human experts to check their work, they will require complex online and offline eval setups to validate their performance, they will do a bad job even with lots of iterations. I predict the juice won’t be worth the squeeze. Completions, Q&A, etc. will still provide some value in Type 2 development, but agentic coding by LLMs will be a dead end here. (NOTE: other ML techniques may be developed and used successfully here, but LLMs writing this sort of code will be hard and frustrating.)

⚠️ Some people will be confused about what type of software they are writing and will use LLMs poorly, and there will be direct fallout and consequences.

Common to all software

I think there will be a variety of tasks common to all software development where LLMs will provide value and productivity benefits. Hopefully we’ll get some assistance with the toil of repetitive tasks (LLMs are good at following patterns without necessarily understanding). I expect LLMs to continue to augment and work with traditional information retrieval systems to provide fast access to common/general programming knowledge. Hopefully we can elevate/expose the patterns that produce high quality, performant, and maintainable code. This will be true for completion products and “ask/search” workflows.

This post was handwritten, but in porting it to my public blog, I did use a model (Claude Opus 4.7) to copy edit; the changes were minor spelling and grammar fixes. See my AI attribution page for more.

Thoughts on LLMs

2025-10-06T00:00:00-07:00

This was originally an internal discussions post, surfacing on my blog because it is interesting to see how far we’ve come since October 2025 and how much my thinking has changed. For context, Claude Opus 4.5 was released November 24, 2025. Part 1 of 2.

Last week, I had an insight about LLMs and where they are a good tool for the job and where they are not. I think this captures some of the frustration I feel in using Copilot and GitHub’s various LLM-driven features. As an engineer, my experience so far is this: there are a limited number of places where the LLM is genuinely helpful, but for the vast majority of my interactions, LLMs live somewhere on a scale from counterproductive and frustrating to downright harmful.

Some places they are useful:

Editing (especially prose) for spelling and grammar.
Asking questions and rubber ducking.
A shortcut to searching on the internet, but not a replacement for reading the manual.
Exposure to knowledge you didn’t know existed.
Code completions and NES (next edit suggestions), maybe 20% of the time; it depends on the language and job to be done.
Summarization. Often useful, not always, tends to be too verbose and to miss subtle important details/context.
Prototyping (success is language specific, domain specific)
Translation (both between human languages and programming languages, though be careful with the latter)

Some places I find them incredibly frustrating and useless:

Code review of any great depth
Writing (new) code of any consequence
Editing existing code, especially in larger projects
PR summarization (verbose, not helpful, distracting)
Alert and notification fatigue (LLMs don’t get tired and can produce a lot more text a lot faster than any team of humans)
Abuse of human attention (LLMs are essentially designed to maximize the attention we give them)
Verbosity (too many words, too many iterations required, human languages aren’t precise or specific so verbosity reigns)
Human distraction (feeling the need to respond to a hallucination or getting sucked in)
Subtle errors that require exorbitant effort to identify and fix later
Loss of learning and loss of expertise
Loss of mentorship opportunities (e.g. instead of pairing with a colleague, you do the work with an LLM)
Loss of (human) mental modeling and context building that leads to later poor decision making or deskilling.
Porting code to another language. I guess I would say: use with care. This is something LLMs could be good at, but you have to pick your battles.

My insight is that this all comes down to the fact that we are using language models, and as such they are incredibly good at modeling/predicting human language (including code languages), but they have no actual representation of the real world or how the real world works. This is very important to understand. An LLM “knows” (to a surprising degree) about how humans talk about how reality works, but has zero ability to predict, reason, or learn about reality itself. I think this might be a fundamental limitation. Language does not equal intelligence; the arrow goes the other way (I know that people might fervently disagree with this).

Armed with this insight, you can make better decisions about how and when to use LLMs and what we might expect out of their future capabilities. Oh and I should be careful to say: we don’t want to conflate LLMs with AI or ML. The broader set of machine learning technologies and techniques (outside of large language models) may hold the keys for more sophisticated code generation and software engineering; I’m speaking here about the LLMs that form the foundation of products like Copilot in 2025.

So the heuristic then becomes simple. Ask yourself: What am I trying to accomplish right now and can that be done well (or accelerated) by a sophisticated model of how humans use language? Remember, the LLM doesn’t understand your problem in any way, but it can simulate how humans generally talk about many, many things. Here are a few examples, I may add more as they come up:

Should I use an LLM to review my code? Maybe. With care. They are very useful for reviewing spelling and grammar and sentence structure (especially in docs and comments), but don’t forget that they don’t understand what you’re trying to do and they don’t understand e.g. how the Rust compiler works. They are OK at surface level review of code written in programming languages in their training sets, but they struggle with anything nuanced or detailed because they don’t understand what your program is trying to do and why or how or what the best way to model that particular problem in code might be. Today, you’re much better off grabbing another human.

Should I respond to each and every CCR comment? No. Definitely not. Ignore and dismiss them, do not waste your words unless it is more important for you to be giving the CCR team feedback than it is for you to accomplish your main day job.

Should I use an LLM agent to write this code? No, unless the code is throwaway (prototypes, demos) or you’re committed to iterating on and owning the output. Today I find that it takes 2x to 3x the time and effort to write good code with an agent for a small real-world task vs just doing it yourself. If you do use an agent, your name should be on the blame for all future bugs, security vulnerabilities, and availability issues related to that code. Do not submit code that you do not fully understand and have signed off on.

Should I use Copilot completions and/or NES in my editor? Yes, but toggle it on and off liberally and don’t accept suggestions blindly. It’s a good skill to be able to work with and without the assistance of LLM completions and you should learn to discern rapidly whether a suggestion is valid/invalid. I recommend limiting multi-line completions unless you’re going slowly enough to read and digest everything that’s being written. Be aware of the tendency of LLMs to be verbose and to generate slop (hallucinations). Again, remember that this code is YOUR responsibility.

Should I rely on this LLM summary? Maybe. Depends on the context. Company all-hands? Probably fine. Technical requirements document? Maybe use it to get oriented, but if you need to load that context up in your mental model the summary is unlikely to be sufficient. Use summaries as a shortcut for knowing whether or not you need to dive in and invest or move on.

Should I use an LLM to port this library to Rust? Probably not, but it depends on the source and destination language and the task. LLMs are heavily trained on Python and JavaScript and translation is actually an ideal task for a language model, but I would only recommend using one for a port if you don’t care too much about the clients of the library or the quality of the code. I would say this is maybe one quality step down from outsourcing that work to a team in another part of the world that has a similar amount of limited context, and I expect that LLMs will eventually be much better at this.

Continued in Part 2: Models of reality vs. models of what people say.

Creative Flow vs. Critical Review

2025-04-18T00:00:00-07:00

Something I struggle with when writing code in the current state of AI is the quick oscillation between creative flow and critical review that many AI tools force upon their humans. Here’s what happens…

You’re building something, and it’s going well, and you find yourself in that magical zone of creative flow. Ideas wash freely through your mind and into your keyboard-clacking fingers. There’s great clarity and an intimate sense of forward motion and progress.

Sometimes, AI plays along like a good improv comedy partner that says “Yes! And…”. But mostly, my AI tries to complete my sentences in ways that require very close scrutiny. I think: Is that right? Does that struct actually define that method? I’d like the first part of that, but the second half is nonsense, and so on… The larger the edits, the harder it gets. A single-line completion can be reviewed relatively quickly; multiple lines take more thought. Edits in many files or an agent that’s been working for many minutes generating all sorts of code? Good luck.

Now I’ve lost my flow state and I’m in reviewer mode. In reviewer mode, I’m a critic. I’m thinking about potential bugs, security vulnerabilities, resiliency of the system, design and maintainability, and what the user experience will be when things go wrong. In critical reviewer mode, I care deeply about precision and correctness and about saving myself work down the road. Unfortunately, this mental state is hard to square with continuing to tap into your creativity. Overcoming the initial inertia of getting started creatively is hard, so once you do that, you want to keep moving—it’s better to plow through the typos to get your ideas out and then circle back later to revise.

The problem is, it is very hard (for me) to oscillate between these two states of mind. Maybe this is something the next generation of engineers will be much better at, and it’s a skill to be practiced? For now, it’s like having a comedy partner that says “Yes! And…”, and sometimes hilarity ensues, but then, frequently, you find yourself in jail at the end of the night. So you can’t always say “yes, and…” back: you have to correct, reject, edit, scrutinize. It’s much more like reviewing code from a junior developer or a new teammate. Until they ramp up, you know they don’t have enough experience, skill, or context to write high-quality code, so your goal as a reviewer is to a) level them up and b) make sure the changes that land actually accomplish the goals and are as high quality as possible. Even if the AI-written code was incredibly high quality, you still need to map that back into your brain and sync your own mental context, and reading code is much harder than writing it.

In a more traditional work style, this moment of critical review still happens. It’s just that there’s a big gap in time between getting the ideas out and bringing in the experts to give feedback and help refine. And the different roles are often assigned to different people.

I think “vibe coding” might be just embracing that improv “yes, and…”, long-term consequences be damned! And maybe that does make for a fun toy application or an amazing natural language-to-working-app demo. The problem is that subtle errors and misunderstandings add up very fast in engineering systems. We know that catching these early makes them much easier, cheaper, and faster to fix. Is it possible to vibe code with an AI on real, revenue-generating, production systems and not end up buried under a mountain of tech debt and existential business risk? Maybe the future of coding is more like software archaeology, where we’re all sifting through the midden trying to make sense of it all.

Secondarily, any effort I put towards reviewing the AI is essentially lost in the void. The AI system isn’t getting better when I correct it. And worse, my new colleague, who would absorb and learn, isn’t even in the conversation (because I’m talking to my AI). So I’ve both wasted my breath on the AI and missed an opportunity to level up a coworker (not to mention the benefits of sharing knowledge and context and the old-fashioned fulfillment of human interaction).

I’m not sure what to make of all this yet. I’ve been mulling over a few recommendations and ideas though:

AI systems should narrow the scope of their changes and engage only when the probability of acceptance (by me) is above some threshold (90%+). As they improve, AI systems can take on more, but until then, they need to reduce verbosity and noise. I’d like a dial that I can control that’s basically: don’t suggest a change unless the AI’s confidence is above some threshold.
AI systems need much deeper context and an internal loop to validate output before presenting it to me. Don’t suggest something with a typo (Copilot does this all the time) or something that won’t compile or something with fundamental performance problems. I understand that some of the reasoning and thinking modes of the latest models can do this, but the interaction lag is way too slow.
Humans should pair together and then add an AI. Senior developers should be especially sensitive to this because I observe that more experienced and senior engineers gain significantly more productivity by using AI compared to junior or early-career developers. The senior engineer can rein in the tool and has the skills to review deeply. The junior engineer is just happy that the model spit out a bunch of code that does a thing and then gets lost down wrong turns and convoluted solutions with code they do not understand.
There has to be a way to personalize and/or further train the AI systems that I use so that my corrections aren’t going into the void. I don’t know if this is fine-tuning my own models, custom model memory systems, intricate personal system prompts, or what.
Generative coding AI tools need to provide proper attribution for code provenance (preferably at the span level). It’s important to know who or what wrote a section of code (co-authorship is fine). This is important for normal git blame development/debugging but also for future training of models or fine-tuning on your private code.
Everybody needs to level up in how they read and review code. Learn to read code and read it in a way where you actually understand what’s going to happen when that code runs. It’s also important to have some insight into whether that code is going to stand the test of time. Ask yourself: What will it be like to read, debug, and fix this code 1 year from now?

NOTE: I had GitHub Copilot lightly edit the final draft of this post for spelling, punctuation, and grammar. It did a pretty good job making concise and minor edits as asked.

UTF-8 Conversions with BitRank

2023-07-10T00:00:00-07:00

One problem I’ve run into a number of times over the years in working on parsers and information retrieval systems is the need to identify positions in a document by multiple units. These positions are used, for example, to identify the location of a cursor, to highlight a matching term, to denote an identifier for lookup, or to do syntax highlighting. On a backend system, it is desirable to store only the most compact and efficient representation (e.g. UTF-8 and UTF-8 byte offsets), but frontends will often deal with characters (code points) and/or graphemes, and sometimes even represent strings with different encodings. This means there has to be a way to convert between UTF-8 byte offsets and code points, or in some cases between UTF-8 code units (bytes) and UTF-16 code units (words).

In blackbird, the code search engine I’ve been working on for the past few years, we store all documents as UTF-8 and represent the positions and spans of text in those documents as byte offsets into the UTF-8 encoded string. This is motivated by two things:

UTF-8 is a variable length encoding, making it space efficient, especially for code which tends to primarily use ASCII characters.
Using byte offsets means slicing out or otherwise finding and manipulating text in a document is O(1): it’s just an index lookup.

However, once search results appear in the front end, we need to be able to identify positions by character offsets (not bytes) and the more human friendly line/column coordinates. This is what you expect out of a text editor or viewing a blob on github.com. Additionally, our front end is written in JavaScript, which uses UTF-16 to represent strings so there is an encoding mismatch between the backend and frontend—rendering UTF-8 byte offsets not-so-helpful. (Even the LSP specification used to require and still defaults to identifying spans with UTF-16 code units.)

The question is: how do you efficiently convert between these different unit systems?

Representing strings in UTF-8 and UTF-16

Here’s an example string so that we can visualize the differences between the two encodings and the units we want to convert between:

h✅😆x

Below is how this is represented in memory in UTF-8. The ✅ emoji takes 3 bytes and 😆 takes 4 bytes to represent. NOTE: In UTF-8, each unicode code point is represented using one to four one-byte (8-bit) code units.

   h |             ✅ |                  😆  |    x |
0x68 | 0xe2 0x9c 0x85 | 0xf0 0x9f 0x98 0x86 | 0x78 | <- hex representation (utf-8)
   0 |    1    2    3 |    4    5    6    7 |    8 | <- byte index
   0 |   _1_  _2_   3 |   _4_  _5_  _6_   7 |    8 | <- code unit offset (utf-8, so same as byte index)
   0 |              1 |                   2 |    3 | <- code point offset

And here’s what it looks like in UTF-16. The ✅ emoji takes 2 bytes (one UTF-16 code unit) and 😆 takes 4 bytes (two UTF-16 code units) to represent. NOTE: In UTF-16, each unicode code point is represented with one or two 16-bit code units.

     h |     ✅ |            😆  |      x |
0x0068 | 0x2705 | 0xd83d 0xde06 | 0x0078 | <- hex representation (utf-16)
   0 1 |    2 3 |    4 5    6 7 |    8 9 | <- byte index
     0 |      1 |     _2_     3 |      4 | <- code unit offset (utf-16 uses words)
     0 |      1 |             2 |      3 | <- code point offset
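
As a quick sanity check on the two layouts above, here is a small Rust sketch that reports how many UTF-8 and UTF-16 code units each code point needs. It relies only on the standard library's char::len_utf8 and char::len_utf16; the helper name unit_widths is made up for illustration.

```rust
/// For each code point in `s`, returns (char, UTF-8 code units, UTF-16 code units).
/// `unit_widths` is a hypothetical helper name, not part of any library.
fn unit_widths(s: &str) -> Vec<(char, usize, usize)> {
    s.chars()
        .map(|c| (c, c.len_utf8(), c.len_utf16()))
        .collect()
}
```

Calling unit_widths("h✅😆x") gives widths of 1/1, 3/1, 4/2 and 1/1, which lines up with the 9 bytes of UTF-8 and 5 UTF-16 code units shown in the diagrams.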

It’s clear now that locating, for example the "x", by character offset (character meaning a unicode code point) or in a UTF-16 encoded version of the string, is going to require a linear scan of the content due to the variable encoding and the fact that converting between these units is content dependent. I’m sure you can think of a number of naive approaches to enable this conversion: lookup tables, re-encoding and measuring string lengths, ad hoc linear scanning when you need to, etc.; but there’s a clever solution that my colleague Alexander Neubeck came up with using BitRank.

BitRank

BitRank allows efficient RANK operations on bit vectors. Bit vectors are a well-known succinct data structure that provides lossless compression close to the information-theoretic lower bound, while still allowing efficient query operations. We use them in a number of places in blackbird.

For a bit vector, we define the operation rank(i) as the number of set bits (1s) in the range [0, i) (i.e. an exclusive rank). There’s some additional good background on rank (and select) in Rank-Select optimizations in BitMagic Library by Anatoliy Kuznetsov, but our implementation and use case is quite different.
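
To make the definition concrete, here is a minimal sketch of an exclusive rank in Rust. This is not blackbird's BitRank implementation: the SimpleBitRank type and its layout (bits packed into 64-bit words, plus a running count of the set bits before each word) are assumptions chosen to keep the example short.

```rust
/// A toy rank structure: bits packed into u64 words, plus the number of
/// set bits that precede each word. Hypothetical, for illustration only.
struct SimpleBitRank {
    words: Vec<u64>,
    // prefix[w] = number of set bits in words[0..w]
    prefix: Vec<u32>,
}

impl SimpleBitRank {
    fn from_bits(bits: &[bool]) -> Self {
        let mut words = vec![0u64; (bits.len() + 63) / 64];
        for (i, &b) in bits.iter().enumerate() {
            if b {
                words[i / 64] |= 1u64 << (i % 64);
            }
        }
        let mut prefix = Vec::with_capacity(words.len());
        let mut total = 0u32;
        for w in &words {
            prefix.push(total);
            total += w.count_ones();
        }
        SimpleBitRank { words, prefix }
    }

    /// Exclusive rank: number of set bits in [0, i).
    fn rank(&self, i: usize) -> usize {
        let (word, bit) = (i / 64, i % 64);
        if word >= self.words.len() {
            // Past the end: every set bit counts.
            return self.prefix.last().map_or(0, |&p| p as usize)
                + self.words.last().map_or(0, |w| w.count_ones() as usize);
        }
        // The mask keeps only the bits strictly below position `bit`.
        let mask = (1u64 << bit) - 1;
        self.prefix[word] as usize + (self.words[word] & mask).count_ones() as usize
    }
}
```

With the stored prefix counts, each rank query touches one word and one popcount, so it does not depend on the length of the vector. Production implementations typically use coarser blocks to shrink the overhead.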

Converting between UTF-8 byte offsets and code points

What we’re going to do is scan the content once to build up a couple of BitRank data structures. These will then let us efficiently convert 1) UTF-8 code units (byte offsets) to code points and 2) UTF-8 code units (byte offsets) to UTF-16 code units (word offsets).

[h, ✅,        😆,x]
[0,1,2,3,4,5,6,7,8] - byte idx
[1,0,0,1,0,0,0,1,1] - bit vector for utf8 -> code points
[1,0,0,1,1,0,0,1,1] - bit vector for utf8 code units -> utf16 code units

[  h, ✅,      😆,  x] - what this string would look like in utf-16
[0,1,2,3,4,5,6,7,8,9] - byte idx

NOTE: These data structures are going to be ~1/8 the size of the content (one bit per byte of content) plus maybe 10-20% overhead, depending on the bitrank implementation.

Now, we can compute the rank of these bit vectors and use that to convert UTF-8 byte offsets to code points or UTF-16 code units as needed. For example, given byte index 8, what is the unicode code point offset? We do this by finding the (exclusive) rank at index 8 in the bit vector. That rank is our code point offset.

[h,   ✅,      😆,x]
[0,1,2,3,4,5,6,7,8] - byte idx
                 ^
                 | byte idx: 8

[1,0,0,1,0,0,0,1,1] - bit vector for utf8 -> code points
[1,1,1,2,2,2,2,3,4] - rank
[0,1,1,1,2,2,2,2,3] - rank (exclusive)
                 ^
                 | idx: 8, rank: 3, code point offset is 3

[h,   ✅,      😆,x]
[0,1,2,3,4,5,6,7,8] - byte idx
[0,    1,      2,3] - code point offset
                 ^
                 | sure enough, our math checks out
                 | for byte idx: 8, the code point offset is 3

OK, now let’s convert UTF-8 code units to UTF-16 code units at the same original offset. Again, we find the rank of the element at index 8, but this time looking in the UTF-8 to UTF-16 bit vector.

[  h, ✅,      😆,  x] - utf16 (takes one more byte to represent)
[h,   ✅,      😆,x] - utf8
[0,1,2,3,4,5,6,7,8] - byte idx
[1,0,0,1,1,0,0,1,1] - bit vector for utf8 code units -> utf16 code units
[1,1,1,2,3,3,3,4,5] - rank
[0,1,1,1,2,3,3,3,4] - rank (exclusive)
                 ^
                 | idx: 8, rank: 4, utf16 code unit offset is 4

[  h, ✅,      😆,  x] - utf16
[0,1,2,3,4,5,6,7,8,9] - byte idx
[  0,  1,  2,  3,  4] - utf16 code unit offset
                   ^
                   | check our math again:
                   | for byte idx: 8 in the utf8 repr,
                   | the utf16 code unit offset is 4

You can also easily go the other way (from character offset to byte position). Given the code point offset 3 (the "x"): what is the UTF-8 byte offset in the original UTF-8 content?

[h,   ✅,      😆,x] - utf8
[0,    1,      2,3] - code point offset
                 ^
                 | what's the byte offset of the "x" at line: 0, col: 3?

Instead of rank, now we perform a select operation. Go find the first element that has rank 3: its index is the byte offset we’re looking for.

[1,0,0,1,0,0,0,1,1] - bit vector for utf8 -> code points
[1,1,1,2,2,2,2,3,4] - rank
[0,1,1,1,2,2,2,2,3] - rank (exclusive)
                 ^
                 | rank: 3 is at idx: 8

[h,   ✅,      😆,x] - utf8
[0,1,2,3,4,5,6,7,8] - byte idx
                 ^
                 | byte offset: 8
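
The lookup above can be sketched in Rust as a binary search over exclusive rank, which works because rank never decreases as the index grows. This is only an illustration under simplifying assumptions: the bit vector is a plain slice of bools, rank is a linear scan, and the function names are made up here rather than taken from any BitRank library.

```rust
/// Exclusive rank over a plain slice of bits: set bits in [0, i).
fn rank(bits: &[bool], i: usize) -> usize {
    bits[..i.min(bits.len())].iter().filter(|&&b| b).count()
}

/// Returns the first index whose exclusive rank equals `k`, if any.
/// The index may equal bits.len(), which denotes the end of the content.
fn select(bits: &[bool], k: usize) -> Option<usize> {
    // Exclusive rank is monotonically non-decreasing, so binary search applies.
    let (mut lo, mut hi) = (0, bits.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if rank(bits, mid) < k {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if rank(bits, lo) == k {
        Some(lo)
    } else {
        None
    }
}
```

For the code point vector [1,0,0,1,0,0,0,1,1], select(bits, 3) returns Some(8), the byte offset of the "x". Swapping the linear rank for a constant-time one would make the whole lookup logarithmic in the content length.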

Building the bit vectors

Building the bit vectors is straightforward. We iterate through the bytes of the content, jumping forward by the byte_len of each UTF-8 character. One of the cool things about UTF-8 is that it’s designed so that you can quickly determine the length of the character just by reading the first byte (first nibble really).

/// Returns the number of bytes this utf8 char occupies given the first byte of the utf8 encoding.
/// Returns 0 if the byte is not a valid first byte of a utf8 char.
fn utf8_width(c: u8) -> usize {
    // Every nibble represents the utf8 length given the first 4 bits of a utf8 encoded byte.
    const UTF8_WIDTH: u64 = 0x4322_0000_1111_1111;
    ((UTF8_WIDTH >> ((c >> 4) * 4)) & 0xf) as usize
}

For the UTF-8 to code point conversion, when you get to a character boundary, you toggle that bit in the bit vector. For the UTF-8 to UTF-16 code unit conversion, you also toggle a bit on each character boundary, but then there’s one extra step. In the cases where you have a character that takes 4 bytes to represent in UTF-8, it is required to toggle an additional bit (because they will require 2 UTF-16 code units to represent). You can set the extra bit before the boundary bit that’s always set.
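
One possible way to express this construction in Rust is sketched below. It makes a few assumptions that are not from the original implementation: plain Vec<bool> stands in for a real BitRank, the input is an already validated &str so char_indices is used instead of the utf8_width first-byte trick, and the names build_bit_vectors and rank are hypothetical.

```rust
/// Builds (utf8 -> code point bits, utf8 -> utf16 code unit bits) for `content`.
fn build_bit_vectors(content: &str) -> (Vec<bool>, Vec<bool>) {
    let mut to_code_points = vec![false; content.len()];
    let mut to_utf16 = vec![false; content.len()];
    for (start, c) in content.char_indices() {
        // The boundary bit sits on the last byte of each character.
        let last = start + c.len_utf8() - 1;
        to_code_points[last] = true;
        to_utf16[last] = true;
        // 4-byte characters need two UTF-16 code units, so set one extra
        // bit before the boundary bit (here: on the character's first byte).
        if c.len_utf8() == 4 {
            to_utf16[start] = true;
        }
    }
    (to_code_points, to_utf16)
}

/// Exclusive rank over a plain slice of bits: set bits in [0, i).
fn rank(bits: &[bool], i: usize) -> usize {
    bits[..i].iter().filter(|&&b| b).count()
}
```

For the example string this reproduces the two vectors from the walkthrough, and rank at byte index 8 gives 3 for code points and 4 for UTF-16 code units.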

Back to work!

And that’s about it! A clever data structure with excellent space and time complexity, happily working away behind the scenes to keep code search delightfully fast.

NOTE: Since I wrote this post, we’ve extracted and published string-offsets, an open source Rust library that can be found in our rust-gems project along with a number of other algorithms and data structures you might find interesting!

Special thanks to Alexander Neubeck for introducing me to the subject, writing our internal BitRank implementation, and helping me improve the utf8/16 conversion code. Also thanks to Rick Winfrey for reading and giving feedback on my early drafts of this post.

Particle Photon Hack

2020-10-13T00:00:00-07:00

I’ve been slowly working on some basic home automation, tackling minor home projects that involve manual and repetitive steps. One of my core values is that any automation I add to the house must still have a low-fidelity fallback. Obviously operating a light requires power, but you should still be able to turn your lights on/off if the home network is down and in general there’s only a small class of things where automating genuinely saves you time.

I had recently switched my front lights and a couple of key indoor lights over to Lutron Caseta wireless switches, which are awesome btw, and I’m running Home Assistant on a Raspberry Pi that also operates as the family network print server. For this project I wanted to get a different set of lights on the same automation schedule (e.g. turn on at sunset), but these were some inexpensive LED landscape lights that are controlled by a 12VAC transformer/power supply with a rotary timer (the kind where you put the pins in for the on/off times). The timer is a pain to manage because it drifts, doesn’t track the changing sunlight with the seasons, and gets off when the power goes out. It’s also the only way to control these lights (there’s no direct on/off switch).

What I came up with is pretty neat. I had a Particle Photon lying around along with a relay shield that I just wired into the circuit to control the landscape lights. These Particle chips are pretty awesome: essentially a microcontroller but connected to Particle’s cloud so you can centrally manage and program (plus they sort out the WiFi setup for you which is often a bit of a pain on microcontrollers).

Details and setup

I took the transformer and put one of the relays in the circuit so that it can be controlled by the Photon and then took the pins out of the rotary dial and just left it in the on position.

Then I wrote a bit of code for the Photon to turn on/off the relay and talk to Home Assistant over MQTT. The latter is what took the most setup.

Steps:

Install MQTT integration in Home Assistant

Install mosquitto on the pi:

 pi:~ $ sudo apt-get install -y mosquitto mosquitto-clients

 # here's how you'd test subscribing/publishing
 pi:~ $ mosquitto_sub  -v -t home/topic_name
 pi:~ $ mosquitto_pub -t home/topic/light -m 'turn on'

Program the Photon to be an MQTT discoverable device. This is the interesting part so go check out that code. You’ll need to include the MQTT library in your Particle project to get things to compile.

Add this to configuration.yaml in Home Assistant:

 mqtt:
   discovery: true
   discovery_prefix: homeassistant

Now, in Home Assistant the front lights just show up like another light entity and can be toggled on/off or included in automations. I have a single automation that turns all the outdoor lights on about 10 minutes after sunset.

Here’s the final hardware setup mounted with software up and running on the Photon.

(What am I going to do with all those extra relays?!)

And here’s what it looks like from Home Assistant (the Walkway light is the Photon-enabled front lights).

A few extra notes:

The Photons, like many IoT devices and microcontrollers, operate on 2.4GHz, but modern WiFi routers often want to “upgrade” them to 5GHz. I had to set my router to force the Photon’s MAC address to only use the 2.4GHz radio because otherwise the device would intermittently lose connectivity.
I’ve hard coded the IP Address of my Home Assistant server and also configured the router to give that device a static IP.

Shelter In Place Vignettes

2020-07-01T00:00:00-07:00

A tour of strings

2020-06-28T00:00:00-07:00

Much has been written about encodings—and specifically unicode—by programmers and for programmers, but it’s easy to get lost in the weeds and easier still to feel enlightened until the next text encoding bug bites. Recently, I’ve been exploring how text and strings are represented in different programming languages and I think the various language specific implementations are actually a critical component of why encodings are (still) so hard for programmers. There’s the standard adage: all programmers should know about encodings which is true and sure, some programmers still just don’t know (I certainly didn’t learn any of that in my formal education), but at a deeper level text encodings are hard because: a) trying to represent the world’s writing is an ambitious task, b) different programming tasks favor differing representations, and c) different programming languages do subtly different things in how they represent and handle text.

So with that in mind, I present to you a brief tour of 6 different languages: JavaScript, Ruby, Go, Rust, Swift, and Haskell—and how each implements and represents text.

Why encode things?

First, it’s essential to understand that in order for written human languages to be used, stored, and shared on computers we have to have a system by which we can:

Take text and turn it into numbers. (e.g., hi -> [104, 105].)
Take numbers and turn them back into text. (e.g., [104, 105] -> hi.)

The former is called encoding, the latter decoding, and critically we all have to agree on the big table that translates character-like-things to numbers and back again. The relevant big table that we all agree on (for the purposes of this article) is Unicode, but know that there are others[1]. All this is important because computers only know about numbers: everything you interact with on a computer has to eventually be represented as a number. Pixels? Yep. Sound? Yep. PDF files? Yep. Text? Yep, that’s what this is all about. Even the number 1 in text is represented as another number (the unicode code point U+0031).
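
As a rough illustration of that round trip, here is a small Rust sketch. It assumes UTF-8 as the encoding (one choice among several), and the function names encode and decode are made up for this example.

```rust
/// Encoding: text -> numbers (UTF-8 bytes in this sketch).
fn encode(text: &str) -> Vec<u8> {
    text.bytes().collect()
}

/// Decoding: numbers -> text. Fails if the bytes are not valid UTF-8.
fn decode(bytes: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}
```

Here encode("hi") gives [104, 105] and decode(&[104, 105]) gives back "hi". Decoding returns a Result because not every sequence of numbers is valid text in a given encoding.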

What I’m not going to talk about but find interesting

I’m not going to talk about detecting encodings from arbitrary text or binary arrays of bytes. I’m also not going to talk about fonts and the rendering or display of characters and graphemes. Those are both topics for another post.

Things we expect from our strings

One more aside. Programmers have a bunch of different things they would like to do with strings. And because of this, we have some very high expectations for what our strings can do:

Iterate, enumerate and index into them
Find the length of
Search and use regexes
Find & replace
Create, combine, and build new strings
Sort (trickier than you might think!)
Reverse (also tricky)
Manipulate by intercalating, interspersing, or transposing.
Change case (up, down, title, camel, pascal, snake, etc)
Change encodings
Detect encodings (very difficult!)
Strip, trim
Break up in various ways

My explorations aren’t super rigorous and I mostly explore basic representation, enumeration, and indexing, but even with that surface scratching the breadth of different decisions made is fascinating.

JavaScript

JavaScript, language darling of the web, uses UTF-16 to represent strings despite the fact that the entire modern web uses UTF-8 to represent text. This has some important consequences and surprises.

Quick: How long is this string? hi 👋🏻?

> "hi 👋🏻".length
7

Huh? What we really should have asked next is: in what units?

JavaScript strings are represented in UTF-16 and this representation leaks through into things like the length and how you might go about indexing strings. I find the API is much less obvious than many of the other languages.

We can get at unicode code points using the for of String iterator or using Array.from.

> let arr = []; for (const c of "hi 👋🏻") { arr.push(c) }; arr
["h", "i", " ", "👋", "🏻"]
> [..."hi 👋🏻"  ].length
5
> [..."hi 👋🏻"]
["h", "i", " ", "👋", "🏻"]
> [..."hi 👋🏻"]  .reverse()
[ '🏻', '👋', ' ', 'i', 'h' ]

Be careful if you index as if the string was an array:

const str = "hi 👋🏻"; let arr = []; for (let i = 0; i < str.length; i++) { arr.push(str[i]) }; arr
(7) ["h", "i", " ", "�", "�", "�", "�"]

And good luck if you want to deal with UTF-8 instead (See TextDecoder and TextEncoder).

Ruby

Next there’s Ruby, a delightful and quirky old friend of a language. I don’t want to spend too much time on historical language versions, but I will briefly note that until Ruby 1.9 (and even then it was kinda broken), Ruby basically considered strings just arrays of bytes with some extra helper methods. Eventually explicit encoding support was added and has evolved over time and as of writing, in Ruby 2.7, it’s relatively straightforward to deal with and understand text. Ruby has fully embraced UTF-8 as the default encoding for strings, but it’s easy to convert to/from other encodings if that’s your thing.

>> RUBY_VERSION
=> "2.7.1"

# Let's get the length again
> "hi 👋🏻".size
=> 5

# We can iterate over code points
>> "hi 👋🏻".codepoints.map{ |b| "0x%x" % b }
=> ["0x68", "0x69", "0x20", "0x1f44b", "0x1f3fb"]

# Or bytes
>> "hi 👋🏻".bytes.map{ |b| "0x%x" % b }
=> ["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]

# Or change to UTF-16 and do the same
# Notice how it takes a few more bytes
>> "hi 👋🏻".encode("UTF-16").bytes.map{ |b| "0x%x" % b }
=> ["0xfe", "0xff", "0x0", "0x68", "0x0", "0x69", "0x0", "0x20", "0xd8", "0x3d", "0xdc", "0x4b", "0xd8", "0x3c", "0xdf", "0xfb"]

# You can lookup elements like this
>> "hi 👋🏻"['👋🏻']
=> "👋🏻"

# Skipping the skin tone modifier is fine
>> "hi 👋🏻"['👋']
=> "👋"

# But reversing it doesn't quite work
>> "hi 👋🏻".reverse
=> "🏻👋 ih"

# You can look at characters
>> "hi 👋🏻".chars
=> ["h", "i", " ", "👋", "🏻"]

# Or even grapheme cluster
>> "hi 👋🏻".each_grapheme_cluster.to_a
=> ["h", "i", " ", "👋🏻"]
>> "hi 👋🏻".each_grapheme_cluster.reverse_each.to_a
=> ["👋🏻", " ", "i", "h"]

# It's interesting to see how something like the family emoji is constructed
>> "👨‍👩‍👧‍👧".chars
=> ["👨", "‍", "👩", "‍", "👧", "‍", "👧"]
>> "👨‍👩‍👧‍👧".reverse
=> "👧‍👧‍👩‍👨"
>> "👨‍👩‍👧‍👧".each_grapheme_cluster.to_a
=> ["👨‍👩‍👧‍👧"]

Haskell

Haskell has a number of data types that can be used to represent text: String, Text, and ByteString (the latter two of which also have lazy and strict implementations, but I’ll spare you the details of that). Let’s start with String. You can use the OverloadedStrings language pragma, but I’m going to be explicit about the types for clarity (note: λ is just my prompt in ghci). String is literally a list of Char (it’s a type synonym: type String = [Char]) where Char is an enum of all the unicode code points.

-- The wave with a skin tone is made up of two unicode code points
λ length ("hi 👋🏻" :: String)
5

-- ghci just bails on printing these
λ reverse ("hi 👋🏻" :: String)
"\127995\128075 ih"

-- index into the string
λ ("hi 👋🏻" :: String) !! 0
'h'
λ ("hi 👋🏻" :: String) !! 3
'\128075'
λ ("hi 👋🏻" :: String) !! 4
'\127995'

-- Deconstruct the list
λ let _:_:_:x:_ = ("hi 👋🏻" :: String)
λ x
'\128075'

-- Check out the code points
λ import Numeric
λ foldr (\x acc -> "U+" <> showHex (fromEnum x) "" : acc ) [] ("hi 👋🏻" :: String)
["U+68","U+69","U+20","U+1f44b","U+1f3fb"]

Notice that the underlying representation is unicode code points, not UTF-16 or UTF-8 like we’ve seen in other languages so far.

The trouble with String is that the implementation is very inefficient. You’re better off with Text (though String is in the base library and used all over the place for things like error). There is a similar API, and Text operates mostly like a list, but you can’t use list deconstruction in quite the same way.

λ import qualified Data.Text as T
λ T.length ("hi 👋🏻" :: T.Text)
5
λ T.index ("hi 👋🏻" :: T.Text) 3
'\128075'
λ T.index ("hi 👋🏻" :: T.Text) 4
'\127995'
λ T.reverse ("hi 👋🏻" :: T.Text)
"\127995\128075 ih"

-- This doesn't work
λ let _:x:_ = ("hi 👋🏻" :: T.Text)
<interactive>:8:14-30: error:
    • Couldn't match expected type ‘[a]’ with actual type ‘T.Text’
    • In the expression: ("hi 👋🏻" :: T.Text)
      In a pattern binding: _ : x : _ = ("hi 👋🏻" :: T.Text)
    • Relevant bindings include x :: a (bound at <interactive>:8:7)

λ T.foldr (\x acc -> "U+" <> showHex (fromEnum x) "" : acc ) [] ("hi 👋🏻" :: T.Text)
["U+68","U+69","U+20","U+1f44b","U+1f3fb"]

ByteString should really just be called Bytes and is useful when dealing with true binary data (it can be even more efficient than Text for that reason) and it’s how we can represent UTF-8 or UTF-16. Text lets you encode and decode to different representations. Notice that the types produced are all ByteString.

λ import Data.Text.Encoding
λ import qualified Data.ByteString as B
λ :t encodeUtf8
encodeUtf8 :: T.Text -> Data.ByteString.Internal.ByteString

-- encode to utf8
λ encodeUtf8 ("hi 👋🏻" :: T.Text)
"hi \240\159\145\139\240\159\143\187"

-- or, a bit easier to read bytes in hex
λ B.foldr (\x acc -> "0x" <> showHex x "" : acc) [] $ encodeUtf8 ("hi 👋🏻" :: T.Text)
["0x68","0x69","0x20","0xf0","0x9f","0x91","0x8b","0xf0","0x9f","0x8f","0xbb"]

-- encode to utf16
λ B.foldr (\x acc -> "0x" <> showHex x "" : acc) [] $ encodeUtf16BE ("hi 👋🏻" :: T.Text)
["0x0","0x68","0x0","0x69","0x0","0x20","0xd8","0x3d","0xdc","0x4b","0xd8","0x3c","0xdf","0xfb"]

-- ghci won't let me insert 👨‍👩‍👧‍👧 into a string literal. You can manually spell out the unicode, but if you paste in or insert, the repl just takes the first code point and drops everything else:
λ "👨" :: T.Text
"\128104"

Notice that you have to pick little or big endian for UTF-16 using either encodeUtf16BE or encodeUtf16LE.

Swift

In Swift, we have another interesting point in the design landscape: a String (declared as a struct) is a collection of extended grapheme clusters with views into the various ways we might want to access the data. This actually seems quite sane to me, though I understand there are still some arguments about specific grapheme clusters.

import Cocoa

var str = "hi 👋🏻"

print(str)
print("grapheme count: \(str.count)")
print(Array(str))
print(str.reversed().map{ c in c })

print("unicode code points (scalars) count: \(str.unicodeScalars.count)")
print(str.unicodeScalars.map { i in String(format:"U+%x", i.value) })

print("utf8 count: \(str.utf8.count)")
print(Array(str.utf8).map(toHex))

print("utf16 count: \(str.utf16.count)")
print(Array(str.utf16).map(toHex))

func toHex(_ v: CVarArg) -> String {
    return String(format:"0x%x", v)
}

output

hi 👋🏻
grapheme count: 4
["h", "i", " ", "👋🏻"]
["👋🏻", " ", "i", "h"]
unicode code points (scalars) count: 5
["U+68", "U+69", "U+20", "U+1f44b", "U+1f3fb"]
utf8 count: 11
["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]
utf16 count: 7
["0x68", "0x69", "0x20", "0xd83d", "0xdc4b", "0xd83c", "0xdffb"]

This is fascinating because it’s our first language to have graphemes be the primary unit (you can get to them from Ruby, but the standard methods use code points).

Go

Go uses UTF-8 to represent text in the standard “strings” package. You’re able to both get at the underlying UTF-8 bytes as well as iterate over the unicode code points. There’s a utf16 package if needed. However, you’re on your own if you want to reverse a string (make sure you iterate properly, hint: don’t use len()).

package main

import "fmt"

func main() {
	str := "hi 👋🏻"
	fmt.Println(str)
	fmt.Printf("bytes: %v\n", []byte(str))
	fmt.Printf("length %v\n", len(str))

	fmt.Println()
	fmt.Println("iterate forward")
	for i, c := range str {
		fmt.Printf("%v: %v\n", i, c)
	}

	fmt.Println()
	fmt.Println("reverse")
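	// NOTE: this loop does not actually reverse anything. Indexing up to
	// len() walks the UTF-8 bytes, not the code points, which is why a
	// reverse built on len() and str[i] goes wrong.
	for i := 0; i < len(str); i++ {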
		fmt.Printf("%v: %v\n", i, str[i])
	}
}

output

hi 👋🏻
bytes: [104 105 32 240 159 145 139 240 159 143 187]
length 11

iterate forward
0: 104
1: 105
2: 32
3: 128075
7: 127995

reverse
0: 104
1: 105
2: 32
3: 240
4: 159
5: 145
6: 139
7: 240
8: 159
9: 143
10: 187

Rust

Rust’s std::string::String uses a UTF-8 representation and forces that strings only contain valid UTF-8 (you can use OsString if you must). Also, indexing into a String is not allowed: doing "hello"[0] is a compile time error.

// ❯ rustc --version
// rustc 1.40.0 (73528e339 2019-12-16)

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let str = "hi 👋🏻";

    // println!("{}", str[0]); Indexing a str is not allowed, compile time error

    println!("{}", str);
    println!("UTF-8 len: {}", str.len());
    println!("UTF-8 bytes: {:?}", str.bytes().map(|b| format!("0x{:x}", b)).collect::<Vec<_>>());
    println!("code points len: {:?}", str.chars().count());
    println!("code points: {:?}", str.chars().collect::<Vec<_>>());

    // have to pull in a library to iterate by grapheme clusters.
    let graphemes = UnicodeSegmentation::graphemes(str, true).collect::<Vec<_>>();
    println!("graphemes len: {:?}", graphemes.len());
    println!("graphemes: {:?}", graphemes);

    // operations like reverse can be done on any of the specific representations
    println!("graphemes reversed: {:?}", graphemes.into_iter().rev().collect::<Vec<_>>());
}

hi 👋🏻
UTF-8 len: 11
UTF-8 bytes: ["0x68", "0x69", "0x20", "0xf0", "0x9f", "0x91", "0x8b", "0xf0", "0x9f", "0x8f", "0xbb"]
code points len: 5
code points: ['h', 'i', ' ', '👋', '🏻']
graphemes len: 4
graphemes: ["h", "i", " ", "👋🏻"]
graphemes reversed: ["👋🏻", " ", "i", "h"]
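
For comparison with the JavaScript and Swift results, the Rust standard library can also produce the UTF-16 view of a string via str::encode_utf16. A small sketch (the helper name utf16_units is made up here):

```rust
/// Returns the UTF-16 code units of `s`, formatted as hex strings.
/// `utf16_units` is a hypothetical helper, not a standard library function.
fn utf16_units(s: &str) -> Vec<String> {
    s.encode_utf16().map(|u| format!("0x{:x}", u)).collect()
}
```

For "hi 👋🏻" this yields 7 code units, the same count that JavaScript's length reported and the same values as Swift's utf16 view.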

Conclusions

I’m not really sure what to conclude here other than to confirm that encodings are hard and particularly difficult due to how many programming languages use subtly different (and often leaky) abstractions. On the other hand, I find this kind of delightful and interesting. What’s your favorite way to work with text? What are some of the other interesting data points in the programming language design space for representing and manipulating strings?

[1] Popular encodings include: ASCII, ISO 8859, Windows-1252, Mac OS Roman, Shift JIS, and many, many others.

On programming languages

2020-05-06T00:00:00-07:00

In the programming language design space there are so many interesting facets of language design and subsequent arguments about the various merits thereof that it can often bring about a paralysis of choice when deciding what to use or what to standardize on. Or, instead of paralysis, you’ll get entrenchment and an inability to form consensus: people love their language of choice and will argue its benefits to the death (or at least their exit). But what to do if you’ve got an application to write? A problem to solve? Another job to be done? What language should you use?

There are so many aspects to a programming language that what you’re selecting is much more than just the syntax and language features. What is the community like? How long has the language been around? How painful are the changes between major versions? What’s the documentation like? What’s included in the standard library? What about the extended library and the greater ecosystem? Package management? What is the tooling like to do development? How do you model things in the language? What’s required to package and deploy/run? What kind of runtime behavior do you need and where is the application going to run? What other systems do you need to interact with and what languages are they written in? What is the core domain of your problem space? And, perhaps controversially, what does it feel like to write code in that language?

In short, I think that choice of programming language depends on team size and problem space—and then a dose of feel. If you’re an individual or a small team (and expect to stay small) and want to go far and fast pick a more sophisticated, niche language. If you’re working in a larger organization, standardize on a well known top-10 language widely used in your problem domain.

Powerful languages, often with steep learning curves, can be a secret weapon for individuals and small teams to move quickly with high quality and great flexibility. Once learned, and if well-matched to the domain, they are multipliers of effort. A common feature of these languages is that they allow programming language driven development in some form via meta programming (macros, domain specific languages, etc). This allows the host language to fit your domain like a glove and acts like a lever to make the most of your limited people power. Small teams can take on much larger competitors and move quickly to find product/market fit and innovate into new territory. See, for example, PG’s writing on LISP, or the Scheme, OCaml, Racket, and Haskell communities. The critical insight here is that your constraint is not having a lot of humans working on the problem so: flexibility > standardization, advanced features > learning curve, agility > correctness.

Beyond being hard to learn, the downside is that these languages are usually niche enough that the pool of engineering talent with sufficient expertise is relatively limited and once you start creating languages within languages you have to train people to learn the base language, the domain, and then the domain specific language—which is no small task. Since skills in the derived language are much less transferable, it also means that unless you’re one of the engineers working on the meta level, it can be a dead end career-wise to know some random in-house DSL (I had a job once where said secondary language was affectionately named BobScript). Another downside is that you’ll often not be able to rely as much on the larger community so you’ll need to be competent and comfortable with quickly writing integrations into modern operational and business systems (e.g., you need to do billing but Stripe doesn’t have an out of the box Haskell client library). This usually means some combination of basic systems, net/http, security, and algorithmic programming is required. Additionally, editor tooling here can be spartan and basic. Developers tend to have high fidelity mental models of the inner workings of the language that they lean on instead of tooling support like autocomplete and a visual debugger.

On the other hand, if you have a large number of engineers, it’s important to select a more run-of-the-mill language because your constraints are finding and effectively training more humans to work on your software project and this is where standardization and ubiquity matter greatly. The humans are going to come and go and it’s OK if it takes verbose code and tedious refactorings over the years because language popularity and ease of training matter more than language power. One nice advantage here is tooling. Just based on sheer numbers of people working in the ecosystem and attention to the beginner onramp, a top 10 language will have great libraries and sophisticated tooling for editing, debugging, profiling, etc. It’s also true that these languages tend to slowly slurp up some of the best ideas from programming language research and more advanced languages. Within that top 10 or 20 list, does it matter what you pick? Beyond being aware of your problem domain, probably not as much as you think. Doing some sort of scientific computing or data science work? Python or R are going to be good bets. Android app? Kotlin. iOS? Swift. Embedded device? C. Systems Programming? C/C++, Java, Go. Gaming? C++. Building websites? You’ll definitely write some JavaScript/TypeScript, but after that the choice of a backend language isn’t all that important: Ruby, Java, Python, Go, C#, even more JavaScript/TypeScript are all going to be sufficient for your problem space. The larger the team, the more features like static typing and strict linting/formatting standardization are going to be important.

Beyond language, what matters more is the other patterns you standardize on as there will be entire teams concerned about the meta health of all the code in your organization. Does it build consistently? Are security patches applied quickly and ubiquitously? Do you track open source license compliance? Is PII handled appropriately? Are you moving from one time series stats collection system to another? There’s always some project going on where you’re moving all the code from platform A to platform B (e.g., moving source control providers, cloud providers, runtime platforms, etc.) and you’ll need teams and tools that can execute effectively across all projects. There are also issues of ownership and who to page when something goes wrong, catalogs of services and their SLOs and status pages. Having a uniform code base (one or a limited number of languages, identical formatting, single package system, standard scripts for setup/deploy, familiar interfaces, same RPC system) is a huge strategic advantage for that large organization. Small teams have almost all of these same needs, but not the surface area and human level communication/synchronization problems.

So. Just you and a buddy or two? Reach for advanced languages that make you happy. Invest the time to learn them in depth and enjoy your freedom and agility. Building a large engineering team or working in a large organization? Standardization and ubiquity are your friends: stick with the usual suspects, mind your domain, and focus on a single way of doing things. But watch out, because this is very much like the innovator’s dilemma: some Lisp-wielding high school kid from Hack Club is going to out-innovate an entire subtree of your org chart one day.

Thailand

2019-07-02T00:00:00-07:00

Bangkok

Bangkok’s heat presses down from the humid night sky, browning out sharp thinking, slowing the pace of life, and bringing a sheen to exposed skin. The heat radiates out of the pavement and the street food and the kindness of the people as we find ourselves street side with steaming bowls of Kuai Tiao (spicy beef noodle soup) and the greatest of pleasures—good friends far from home.

Bangkok is vast and sprawling, and we experience only a sliver of a sliver of this great Southeast Asian city, sampling tourist attractions, everyday life, and food. So much good food! Traveling often makes you aware of how life revolves around getting from one meal to the next and Thailand is the perfect place to just embrace this notion to its logical extent, bridging hotel breakfast buffets and lunch with walking fruit bought from street side vendors and served in a bag with a wooden stick to stab the pieces: pineapple, watermelon, papaya, guava, rose apples, dragon fruit, mangosteens, lychees, longans, rambutans, multiple varieties of mangos and bananas, and many others that we don’t recognize. The fruit is always ripe and sweet and a delicious complement to the sticky heat. Ellery has a real thing for mangos so we buy and experiment with a number of varieties from the dripping sweet yellow standards to the crisp and tart green mangos. Kate, always the adventurous eater, forges out into new territory: one day a green, sour (pickled?), thumb-sized fruit a bit like a large caper berry that was served with a small bag of sugar and ground chili pepper for dipping.

And then it’s time for lunch! Basil chicken with fried eggs on rice or Pad Thai followed by afternoon snacks: larb flavored pretzel sticks, banana crunchies, dried mangos, beans. Before moving on to dinner: Tom Kha Gai (coconut soup), fried crabs, or a full international cuisine: Sukiyaki, Korean fried chicken. And then later, or anytime really, it’s coconut ice cream (sticky rice, coconut ice cream of various flavors, peanuts, condensed milk), mango sticky rice (sticky rice covered with a full fresh cut mango and topped with condensed milk and crispy rice), or fresh fruit smoothies.

As tourists, we wander the halls and grounds of the Grand Palace, gazing up at the Emerald Buddha and navigating the mobs of Chinese tourists, and hire a boat to explore the klongs, the canals that run like streets through the city. As purveyors of everyday life, we ride the sky train and experience IconoSiam (tagline: The icon of eternal prosperity of course), a massive mall accessible by a ferry, where you can buy a Maserati and then stroll through artificial canal-side street markets complete with vendors hawking their wares (at mall-level prices, but with AC!). We observe the monks out collecting alms (Louisa has a small conversation with one on the BTS), walk the streets of K, A, and K’s neighborhood, explore the non-tourist markets, eat street side and in small cafes around the city, warm houses of friends of friends, and make a strong contention for the “swim every day” award (though I must note that the bobbing part of the daily dive makes me exhausted just watching).

And then we take the night train from Bangkok to Chiang Mai, getting surprisingly good sleep bookended by gazing out the windows—the last light of the day giving glimpses into the shanty towns on the outskirts of Bangkok and the first light of the morning reflecting green jungle and mountains as the train pulls into the station.

Chiang Mai

Chiang Mai brings new sounds and sensations. The bright and insistent gong from the Wat right next to our house wakes me at 4:30am (but only the first night) and is but one of the novel street noises in the old city as we fall asleep. Dishes clink with soft conversation in Thai from one side, cicadas belt out their dissonant calls on the other, every now and then a motor scooter purrs by, much quieter than the Hells Angels that ride by our place in San Francisco.

At Wat Phra Sing, monks sit like statues and in an adjacent building some of them actually have turned to bronze—still unmoving in their meditation. In the main temple, the Buddha’s eyes draw us in and turn our attention inward, if for only a brief moment.

At the public park just inside the old town wall, the community is out in force—streams of people lap around the perimeter path, stopping at workout stations every 10 feet. A pick up futsal game with car tires balanced vertically as goals is in play, and a Sepak Takraw match (like volleyball but with your feet) is just starting up. There’s AcroYoga on the grass, a rumba-like dance/exercise class going on, a playground full of kids, and scores of fish roaming the pond for picnickers with extra to share.

We eat dinner on the street. Louisa survives on fried eggs and fruit, storing up whenever there is a breakfast buffet with French toast. Ellery tries just about everything once and then sticks to the meat. For dessert, Kate buys roasted bananas on a stick from a little, hunched-over lady with a charcoal grill diligently turning her skewers.

In the mountains outside the city we meet people from the ethnic hill tribes and spend a day with a small family of elephants, feeding and bathing and caring for them. The encounter is up close and hands on: dexterous trunks grabbing for small bananas out of your hands, huge elephant bodies pushing against yours, soft tongues, wide flat molars, bristly hair, flat tails like bottle brushes on a cord, tips of trunks like alien creatures with two holes and a thumb-like grasp, surprising grace and awareness, those obviously intelligent eyes, and a joyful playfulness. One of the babies, naughty boy, is always running away. Picture a 1.5 ton juvenile elephant loose on your property! The largest female mama elephant is 13 (out of 24) months pregnant and you can see and feel her baby stretching her skin taut. We feed them pretty much constantly, pausing only for our own lunch and a lively mud bath: the 4 of us waist deep in mud with 5 elephants rolling blissfully around in the muck before we all head to the river to wash off.

Krabi

And then we head south to find ourselves in another world, accessible only by boat just outside of Krabi, where the sky is big again and the Andaman Sea shoots out limestone crags that are startling and picturesque—pale yellow/white rock with green life in every crevice, dramatically set off against the aquamarine.

The sensations here are again new. The water is so warm that we debate if the air is actually colder or if it’s just the evaporative effect of your wet skin. Ellery dives right in with confidence, ducking under waves and independently catching a few body surfs. Louisa gets a full spectrum of ocean experiences—initially shy to swim, complete delight as she overcomes her fear and body surfs for the first time, and then terror as she and I both get stung by a jellyfish (not dangerous but certainly an interesting sensation: lingering sharp, needling pain). Back to where we started, I guess. Ellery gets stung later too (much screaming), but everyone seems to recover all right (vinegar takes the sting out).

We snorkel for the first time with the girls. Ellery takes to it immediately, head glued underwater following the fish around, grin spilling out from under her mask. Louisa is not so sure at first (too much equipment, the mask is uncomfortable), but then makes friends with one of the Thai boat assistants who shows her around underwater while she wears her swimming goggles and holds her breath (“I saw Nemo, Dad!” Sigh). Kate finds a school of squid that she follows around and I’m intrigued by these long needle-like fish (2-4’ in length, incredibly pointy) that make Kate nervous. She can only think: barracuda!

Back at the resort, life teems as well. My favorite regular sighting: a rather large water monitor (in the same family as the Komodo dragon, but only 4-5’ long, though big ones can be 50kg) that the girls first see at the top of the water slide (again, much screaming), but then we catch sight of a number of them around the property just going about their business. We walk the monkey trail (to the beach with the jellyfish) and promptly show up for happy hour every day (though we’re ashamed to admit to Alli that we missed it the first day). Throughout all this, the limestone cliffs loom over us, a dramatic half circle trapping us between them and the water in what has to be the most beautifully exotic place I’ve ever stayed. It’s monsoon season, though we manage to dodge the rain, which means the resort is quiet. The undulating dock is a source of constant entertainment as guests come and go, rolling and bouncing like a fairground ride. We don’t see anyone fall in but they must lose people and luggage into the water every now and then! Our departure at 6am on a private long tail boat at sunrise is stunning with the sea and cliffs glowing from first light as we motor away to the final leg of our trip.

Taipei

We spend our last 48 hours in Kate’s birth city: Taipei, our lodging the historic Grand Hotel where a one-year-old Kate took her first steps on the landing of the iconic red carpeted staircase in the lobby. The hotel is superlative. Our room has sweeping views of downtown Taipei and the halls and common spaces are covered with orchids, artwork, dragons, ornate carvings, and architectural details.

Out and about in the city, we gaze at the jade cabbage and other treasures in the National Palace Museum, hunt down brown sugar bubble tea, ride the MRT, and experience monsoon season firsthand by soaking our travel clothes right before catching 16 hours’ worth of flights home. But the highlight by far is an evening with San Francisco friends who are home for the summer in Taiwan. Dinner is a feast of some of the best food of our trip: fried rice and noodles, clams in a ginger garlic soup, roasted snails, squid, fried oysters and shrimp, asparagus with crabs, sea cucumber, and half a goose. Ellery and Louisa both venture out into new culinary territory, egged on by their peers. Kate and I drink it all in as the last few hours of our trip slip by. The kids pair off with school friends and explore the night market, playing carnival games and eating shaved ice (frozen condensed milk shaved thinly with fresh fruit). It’s still hot, but I don’t notice anymore, maybe just adjusting to the warmth of the culture and the incandescence of good friends.

Photos by Tim Clem, Kate Clem, Anita Roth, and Chiangmai Elephant Legend staff