<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:thoughtbot="https://thoughtbot.com/feeds/" xmlns:feedpress="https://feed.press/xmlns" xmlns:media="http://search.yahoo.com/mrss/" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <feedpress:locale>en</feedpress:locale>
  <link rel="hub" href="https://feedpress.superfeedr.com/"/>
  <title>Giant Robots Smashing Into Other Giant Robots</title>
  <subtitle>Written by thoughtbot, your expert partner for design and development.
</subtitle>
  <id>https://robots.thoughtbot.com/</id>
  <link href="https://thoughtbot.com/blog"/>
  <link href="https://feed.thoughtbot.com/" rel="self"/>
  <updated>2026-05-04T00:00:00+00:00</updated>
  <author>
    <name>thoughtbot</name>
  </author>
  <entry>
    <title>Simple, affordable unsupervised agentic coding from my phone with Claude Code in GitHub Actions</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17331339/creating-software-while-afk-with-github-workflows"/>
    <author>
      <name>Fritz Meissner</name>
    </author>
    <id>https://thoughtbot.com/blog/creating-software-while-afk-with-github-workflows</id>
    <published>2026-05-04T00:00:00+00:00</published>
    <updated>2026-05-01T15:47:03Z</updated>
    <content type="html"><![CDATA[<p>If you’re following the AI hype train, you’ll have heard serious software people talk about building software (without needing to code) while they take their kids to the park. This is obviously appealing, but it’s often discussed in ways that imply serious barriers to entry: the interviewee from Anthropic mentions that they get to use all the tokens they want for free; the AI-coding consultancy says that if you’re not spending $1,000 per day per developer you’re doing it wrong; discussions of a new agentic coding paradigm refer to people who switch between multiple top-end Claude Code Max plans.</p>

<p>Although these conversations are frequently intertwined, building software away from a desk doesn’t have to come with a 4-digit daily bill. I don’t have that sort of cash, but I’ve still been able to get in on the fun.</p>
<h3 id="the-20-30month-workflow">
  
    The $20-$30/month workflow
  
</h3>

<p>Here’s a workflow that I’ve been using for unsupervised AI coding on a side project that will never justify serious focus or budget:</p>

<ol>
<li>I type one or two lines into a new issue on the project GitHub repo (usually from my phone)</li>
<li>I add a label, <code>needs-elaboration</code>, to the issue</li>
<li>GitHub starts an <a href="https://docs.github.com/en/actions/get-started/understand-github-actions#workflows">Actions workflow</a> that invokes Claude Code in the cloud to review the pre-existing code and write a plan for turning my few sentences into reality</li>
<li>I review the plan and, if I’m happy, I add a <code>ready-for-dev</code> label</li>
<li>GitHub starts Claude Code to implement the issue and raise a pull request</li>
<li>I click the preview links that Vercel (this is a Next.js app) adds to the PR and inspect the results</li>
<li>If I’m happy, I merge the PR and Vercel deploys to production</li>
<li>If I’m not happy, I add a comment and an <code>agent-review</code> label and GOTO 5</li>
</ol>
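<p>Steps 1 and 2 are a few taps in the GitHub mobile app, but they can also be done in one shot with the <code>gh</code> CLI. A sketch (the issue title and body here are invented for illustration):</p>
<div class="highlight"><pre class="highlight plaintext"><code># file the issue and label it in one step
gh issue create \
  --title "Export results as CSV" \
  --body "One or two lines is enough; the elaboration workflow does the rest." \
  --label needs-elaboration
</code></pre></div>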

<p>The total cost is $20 for my Claude Code Pro plan and $10 for a GitHub Pro plan (educators and open source contributors can often get this for free). There are usage limits on both plans<sup id="fnref1"><a href="https://thoughtbot.com/blog#fn1">1</a></sup>, but I mostly run out of ideas before I run out of limits. I’m not focused on making an agent work 24/7 while I’m doing other things; my only concern is that there is something new ready for me each time I have spare attention for this project.</p>

<p>In case it wasn’t clear, I’m not writing or even reading any of the code as part of this loop<sup id="fnref2"><a href="https://thoughtbot.com/blog#fn2">2</a></sup>. That would take focus that I can’t dedicate on a per-feature basis to this project.</p>
<h3 id="minimal-time-and-money-for-decent-results">
  
    Minimal time and money for decent results
  
</h3>

<p>This project wasn’t always an experiment in agentic coding. It was originally an exploration of JS web development with Next.js that I pursued during Friday <a href="https://thoughtbot.com/blog/investment-time">investment time</a> over a seven-month period a few years back. This allows me to compare before and after, with striking results.</p>

<p>In the past month, with 10 minutes here and there on my phone, I’ve far exceeded what I did in the original seven months of Fridays:</p>

<ul>
<li>features that have never existed in any previous equivalent software (from myself and others in this niche hobby space)</li>
<li>huge improvements to the UI, including working mobile portrait and desktop modes</li>
<li>500+ tests and 77% coverage versus 0 tests and 0% coverage</li>
<li>optimisations that allowed me to downgrade my Vercel account to a free plan</li>
<li>GDPR-compliant analytics and error tracking</li>
<li>extending my integration with Google auth from test-mode (allow-listed users only) to production (anyone can sign up)</li>
</ul>

<p>Not bad for effort spent while out for a walk or on the London Underground<sup id="fnref3"><a href="https://thoughtbot.com/blog#fn3">3</a></sup>.</p>
<h3 id="how-it-works-github-workflows-for-cloud-hosted-agentic-coding">
  
    How it works: GitHub workflows for cloud-hosted agentic coding
  
</h3>

<p>When I say “GitHub starts Claude Code in the cloud” it may not be clear exactly what I mean. Here’s an example of a workflow file from my project’s <code>.github/workflows</code> folder:</p>
<div class="highlight"><pre class="highlight plaintext"><code>name: Claude Issue Triage

on:
  issues:
    types: [labeled]

jobs:
  claude-triage:
    if: github.event.label.name == 'needs-elaboration'
    ...
    steps:
      - name: Checkout repository
      ...
      - name: Run Claude Code
        uses: anthropics/claude-code-action@v1.0.70
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          prompt: |
            GitHub issue #${{ github.event.issue.number }} ("${{ github.event.issue.title }}")
            requires your analysis...
          claude_args: "--max-turns 20 --dangerously-skip-permissions"
          show_full_output: "true"
</code></pre></div>
<p>Think of it like running a CI workflow in GitHub Actions, except that instead of running against a branch or pull request, this workflow runs when an issue is labelled <code>needs-elaboration</code>. First it checks out the code, then it runs <a href="https://code.claude.com/docs/en/github-actions">Anthropic’s Claude Code action</a>. The action passes a fixed prompt to Claude Code plus metadata from the GitHub workflow (like the contents of the labelled issue).</p>
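<p>The <code>ready-for-dev</code> step is a second workflow file of the same shape. Here’s a sketch of what it might look like (the prompt and turn limit are illustrative, not copied from my actual file):</p>
<div class="highlight"><pre class="highlight plaintext"><code>name: Claude Implement Issue

on:
  issues:
    types: [labeled]

jobs:
  claude-implement:
    if: github.event.label.name == 'ready-for-dev'
    ...
    steps:
      - name: Checkout repository
      ...
      - name: Run Claude Code
        uses: anthropics/claude-code-action@v1.0.70
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          prompt: |
            Implement GitHub issue #${{ github.event.issue.number }} according to
            the plan in its comments, then open a pull request.
          claude_args: "--max-turns 50 --dangerously-skip-permissions"
</code></pre></div>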

<p>It doesn’t have to be Claude Code; I could go with <a href="https://opencode.ai">Opencode</a> or the Copilot agent instead.</p>
<h3 id="notes-for-people-unfamiliar-with-unsupervised-agentic-coding">
  
    Notes for people unfamiliar with unsupervised agentic coding
  
</h3>

<p>If AI-coding terms like <a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04">gastown</a>, <a href="https://www.task-master.dev">taskmaster</a>, <a href="https://ghuntley.com/ralph/">ralph loops</a>, or <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents">minions</a> mean nothing to you, these notes about what becomes different when we stop intervening in agentic code may be helpful.</p>

<p>First of all, the <code>--dangerously-skip-permissions</code> option on Claude Code may look scary, but it’s standard practice for folks who want to run Claude Code without any human intervention. I am less worried about doing this in a random GitHub Actions runner in the cloud than I am running it on my local machine. My laptop remains safe for my day job.</p>

<p>In unsupervised agentic coding, there’s a <a href="https://github.com/ghuntley/how-to-ralph-wiggum?tab=readme-ov-file#-move-outside-the-loop">transition from “human in the loop” (approval of individual code changes that the agent makes) to “human on the loop”</a> where the agent’s output over time is assessed and its standard instructions are improved. I’m not telling Claude which code to modify or how. Instead, it decides and I have to live with the results until I can give it what is needed to solve a problem in a better way the next time. This is a lot like being a manager of human coders (at least one who does not want to be a bottleneck to team output).</p>

<p>Some of the code in this project is really bad, for example a 1500-line JSX file. I only noticed this because of how Claude was struggling to implement changes to this code. Ideally I would have noticed it earlier, but the path to recovery is fairly straightforward: I’ll tell Claude to identify pieces of the component that can be split into separate components and files, and have it create new GitHub issues that the agents themselves can plan and implement. This is just how I would work as a developer myself (though my threshold for breaking the code down would be somewhat lower than 1500 lines).</p>

<p>I wouldn’t trade the progress I’ve made in this month for perfect code. There’s a lightbulb moment for every developer when they work on software that they themselves want to use: suddenly the choice to pursue higher quality code has to fight for development priority with wanting to use a new shiny feature. The delicate balance is in knowing that with too much tech debt (e.g. 1500-line components), the shiny features will take longer and longer to build until eventually “improve quality then build new feature” takes less time than just “build new feature”.</p>

<p>A crucial factor is that the number of lines of code is no longer linked to the time it took to produce them. I often start by asking for options to address a problem. Out of five options, three may be easily rejected. When two or more realistic options remain, I don’t need to reason about them in the abstract: I can have Claude build both and see how each of them feels. There is no opportunity-cost trade-off between one option and the other.</p>
<h3 id="validation-from-y-combinator">
  
    Validation from Y Combinator
  
</h3>

<p>The AI coding scene moves fast. As I was writing the first draft of this post I was puzzled that I hadn’t seen anyone else discussing a workflow like this. Between then and now, <a href="https://docs.twill.ai/overview">Twill.AI</a> announced a real product that seems like the same idea, but with professional features like more advanced sandboxing and integrations with Slack, WhatsApp, etc. that they’re betting are worth paying for.</p>
<h3 id="conclusion">
  
    Conclusion
  
</h3>

<p>Mine is a very low-stakes project which gives me freedom to experiment. There are serious gaps that I would need to close before applying this workflow to professional work, but I don’t think they’re insurmountable (certainly Twill.AI and many others are trying their best). I expect that over time I’ll be able to extend the agents’ work to close more and more gaps without breaking the bank. In the meantime, you can find me at the park with my daughter.</p>

<aside class="related-articles"><h2>If you enjoyed this post, you might also like:</h2>
<ul>
<li><a href="https://thoughtbot.com/blog/how-to-use-chatgpt-to-find-custom-software-consultants">How to Use ChatGPT to Find Custom Software Consultants</a></li>
<li><a href="https://thoughtbot.com/blog/debugging-why-your-specs-have-slowed-down">Debugging Why Your Specs Have Slowed Down</a></li>
<li><a href="https://thoughtbot.com/blog/from-idea-to-impact-the-role-of-rapid-prototyping-in-agetech">From idea to impact: The role of rapid prototyping in AgeTech</a></li>
</ul></aside>

<div class="footnotes">
<hr>
<ol>

<li id="fn1">
<p>Your actions will simply stop running if either your Claude or GitHub plan runs out; both have time-based limits that refresh over different periods. Claude’s limit is token-based, and I use the cheaper tier of models (Sonnet) to make it go further. GitHub Pro plans include 3,000 minutes of Actions usage per month, and I use far less than that on my side project. GitHub has recently changed its pricing model, but mostly with regard to AI token pricing, and in this case I’m getting my tokens directly from Anthropic. <a href="https://thoughtbot.com/blog#fnref1">↩</a></p>
</li>

<li id="fn2">
<p>Before you worry about the safety of running code without reading it: there is no backend, no personal information is tracked, and all data is stored in the user’s browser or their own Google Drive. Also note that I don’t read code per-feature <em>during</em> the loop, but I do get to look at the code as a whole over time. <a href="https://thoughtbot.com/blog#fnref2">↩</a></p>
</li>

<li id="fn3">
<p>A few weeks back I was watching people play a game that made me think of a new feature. I took out my phone to write an issue, then carried on watching. By the time I was home, the feature was ready in production. <a href="https://thoughtbot.com/blog#fnref3">↩</a></p>
</li>

</ol>
</div>
<img src="https://feed.thoughtbot.com/link/24077/17331339.gif" height="1" width="1"/>]]></content>
    <summary>Podcasts and the blogosphere are awash with people talking about building software while away from their desks, but often burning serious cash in the process. Here's a simple and cheap way to try this out for yourself.</summary>
    <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
  </entry>
  <entry>
    <title>Your carousel might not be accessible: designing for reduced motion</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17328934/your-carousel-might-not-be-accessible"/>
    <author>
      <name>Valeria Graffeo</name>
    </author>
    <id>https://thoughtbot.com/blog/your-carousel-might-not-be-accessible</id>
    <published>2026-05-01T00:00:00+00:00</published>
    <updated>2026-04-30T06:41:47Z</updated>
    <content type="html"><![CDATA[<p>On a website I worked on, I noticed some logos stacked vertically on the left
instead of being spread horizontally and evenly spaced, as I expected,
and as the design clearly suggested.</p>

<p>I opened an issue. Carefully listed browsers and versions.
I investigated like a careful end user, not like a developer jumping
straight to solutions and reading the code.
I thought I’d simply found a bug while doing something else; I wasn’t there
to drop everything and fix it on the spot.</p>

<p>Hard reload. Cleared cache. Tried again. Same problem. Oh well, it’s a bug.</p>

<p>A designer checked and told me it looked fine to them.
So… at that point I was puzzled. 🧐</p>

<p>Then it clicked.
I have <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/At-rules/@media/prefers-reduced-motion">Reduce Motion</a>
enabled in my system settings.</p>

<p>So yeah, it wasn’t a bug… or was it just a different kind of bug?</p>

<p>That’s when it hit me: a scrolling set of logos, a carousel, isn’t inherently
accessible. And more importantly, when a user opts into reduced motion for
accessibility reasons, we need to ensure that the layout stays intact, and with
it, the meaning of the content.</p>

<p>A pretty common UI pattern like a carousel suddenly became… not so inclusive.</p>

<p>If you’d asked me yesterday whether a carousel of moving items is accessible
design, I would have said, “sure, why not?”</p>

<p>Today, I’d say: not entirely.
Not unless there’s a proper fallback, a static list of items that preserves
layout and still communicates the same information clearly.</p>

<p>Because accessibility isn’t just about turning things off.
It’s about making sure the experience still works when things change.</p>

<p>And the good news is, this is something we can account for.</p>
<div class="highlight"><pre class="highlight plaintext"><code>@media (prefers-reduced-motion: reduce) {
  /* provide a non-animated, stable layout */
}
</code></pre></div>
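<p>What the fallback contains depends on the component. For a strip of logos, a hypothetical <code>.logo-carousel</code> could stop animating and wrap into a static row (the selector and values here are illustrative):</p>
<div class="highlight"><pre class="highlight plaintext"><code>@media (prefers-reduced-motion: reduce) {
  .logo-carousel {
    /* stop the auto-scroll and keep the layout intact */
    animation: none;
    display: flex;
    flex-wrap: wrap;
    gap: 1rem;
  }
}
</code></pre></div>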
<p>A small detail, but one that can make the difference between something that
looks fine, and something that actually works for everyone.</p>

<aside class="related-articles"><h2>If you enjoyed this post, you might also like:</h2>
<ul>
<li><a href="https://thoughtbot.com/blog/better-grids-lessons-learned-from-design-for">Better Grids: Lessons Learned</a></li>
<li><a href="https://thoughtbot.com/blog/lazy-mans-responsive-web-design">Lazy Man’s Responsive Design</a></li>
<li><a href="https://thoughtbot.com/blog/design-101-stop-yelling">Design 101: Stop Yelling</a></li>
</ul></aside>
<img src="https://feed.thoughtbot.com/link/24077/17328934.gif" height="1" width="1"/>]]></content>
    <summary>Accessibility as the key factor for debugging, designing and developing a carousel with logos on a webpage.</summary>
    <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
  </entry>
  <entry>
    <title>Reviewing Dependabot PRs is boring. Let Claude do it for you.</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17328151/reviewing-dependabot-prs-is-boring-let-claude-do-it-for-you"/>
    <author>
      <name>Jose Blanco</name>
    </author>
    <id>https://thoughtbot.com/blog/reviewing-dependabot-prs-is-boring-let-claude-do-it-for-you</id>
    <published>2026-04-30T00:00:00+00:00</published>
    <updated>2026-04-29T15:13:56Z</updated>
    <content type="html"><![CDATA[<p>I’m not going to lie, when I start my day and I see 10 Dependabot PRs open
in the project, I just want to close the laptop and go for a walk. And I
have the feeling that, like me, many other developers feel the same way,
because I keep seeing Dependabot PRs sit open in projects for weeks. Nobody
wants to read the changelog, check the dependencies, look for breaking
changes, and still risk shipping a regression because they missed an
important line buried in the notes.</p>

<p>In the age of AI and automation, we can definitely get some help with this.
This is what my colleague <a href="https://thoughtbot.com/blog/authors/fritz-meissner">Fritz</a>
suggested. We were watching Dependabot PRs pile up and figured the real pain
point was that people lacked the information they needed to merge with
confidence. So I used Claude’s <code>skill-creator</code> to build a
<a href="https://claude.com/skills">skill</a> that gives me exactly that:
a short summary of the changes, the risk, and a recommendation: can I merge,
or do I need to look more carefully myself?</p>
<h2 id="a-dependabot-pr-review-skill">
  
    A Dependabot PR review skill
  
</h2>

<p>You point the skill at a Dependabot PR or at the whole repo and it gives
you back the one thing the PR description never tells you: <em>should I merge
this, and if not, why not?</em></p>

<p>It works in two modes:</p>

<ul>
<li>
<strong>Single-PR mode</strong> — paste a Dependabot PR URL and you get a full review
for that one PR.</li>
<li>
<strong>Audit mode</strong> — ask it to “review all open dependabot PRs” and it
discovers every open Dependabot PR in the repo with <code>gh</code>, analyzes them
one by one, and produces a single triage report.</li>
</ul>

<p>For each PR, the skill does roughly what a careful human would do, just
faster and without getting bored:</p>

<ol>
<li>
<strong>Reads the PR diff</strong> to figure out the gem name, the old and new
version, and whether the bump is patch, minor, or major.</li>
<li>
<strong>Pulls the changelog</strong> between those two versions from GitHub
releases, <code>CHANGELOG.md</code>, or RubyGems and only keeps the parts that
matter: breaking changes, deprecations, security fixes, notable
behaviour changes. The release-notes-style noise gets stripped out.</li>
<li>
<strong>Greps the codebase</strong> to see where the gem is actually used. A bump
to a gem that lives in three test files is a very different story
from a bump to a gem that runs in your payment flow, and the skill
calls that out.</li>
<li>
<strong>Hands you a verdict</strong> in one of four buckets:

<ul>
<li>
<code>Merge</code> — safe, low risk</li>
<li>
<code>Verify</code> — looks safe but here are the specific things to check first</li>
<li>
<code>Investigate</code> — needs human judgment, here’s why</li>
<li>
<code>Hold</code> — there are breaking changes, you’ll need code work before
merging</li>
</ul>
</li>
</ol>
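<p>Step 1 is the only purely mechanical part of that list. A minimal Ruby sketch of the idea (my illustration, not code from the skill, and it ignores pre-release suffixes):</p>

```ruby
# Hypothetical sketch: classify a Dependabot bump as :major, :minor,
# or :patch from the two version strings in the PR diff.
def bump_type(old_version, new_version)
  old_parts = old_version.split(".").map(&:to_i)
  new_parts = new_version.split(".").map(&:to_i)

  return :major if new_parts[0] != old_parts[0]
  return :minor if new_parts[1] != old_parts[1]
  :patch
end

bump_type("7.2.4", "8.0.10")  # => :major
bump_type("1.65.0", "1.68.0") # => :minor
```

<p>The interesting work is everything after that: the changelog and usage analysis that justify the verdict.</p>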

<p>In audit mode you get all of that as a summary table at the top: PR
number, gem, bump (<code>7.2.4 → 8.0.10</code>), type, age, verdict, and a “why”,
sorted worklist style: <code>Merge</code> first, then <code>Verify</code>, <code>Investigate</code>,
<code>Hold</code>.</p>
<h2 id="what-the-output-actually-looks-like">
  
    What the output actually looks like
  
</h2>

<p>Here’s a trimmed example of what audit mode prints back into the chat
for a repo with a handful of open Dependabot PRs:</p>
<div class="highlight"><pre class="highlight plaintext"><code>Found 5 open Dependabot PRs. Analyzing each now…

| #     | Gem            | Bump            | Type     | Age | Verdict     | Why                                  |
|-------|----------------|-----------------|----------|-----|-------------|--------------------------------------|
| #9170 | rubocop        | 1.65.0 → 1.68.0 | minor    | 12d | Merge       | dev-only, no breaking changes        |
| #9168 | sidekiq        | 7.2.4 → 7.3.1   | minor    | 6d  | Verify      | check Redis 6.2+ in production       |
| #9165 | aws-sdk-s3     | 1.143 → 1.150   | minor    | 21d | Verify      | new default checksum algorithm       |
| #9159 | devise         | 4.9.3 → 4.9.4   | patch    | 3d  | Merge       | patch, safe to merge                 |
| #9142 | rails          | 7.2.4 → 8.0.10  | major    | 30d | Hold        | breaking: deprecated Active Job APIs |
</code></pre></div>
<p>And then for each PR, a per-PR section with the changelog highlights,
the files in your codebase that touch that gem, and the reasoning behind
the verdict so you can scan the table for the easy wins.</p>
<h2 id="posting-the-review-back-to-the-pr">
  
    Posting the review back to the PR
  
</h2>

<p>The chat transcript is not where teams review code, so after the review
is done, the skill asks if you want to post it as a comment on the PR.
Nothing gets posted without an explicit yes.</p>

<p><img src="https://images.thoughtbot.com/xk062s3mpcx88di40bw992cvgs5b_image.png" alt="An example of the dependabot review comment in a PR"></p>

<p>The comment uses a collapsible <code>&lt;details&gt;</code> block, with the verdict and a
one-line reason above the fold so a teammate scrolling the timeline can
triage without expanding, and the full review tucked underneath. There’s
also an invisible marker in the comment, so if you re-run the audit a
week later, it can detect its own previous comments and skip PRs that
already have a review instead of spamming duplicates.</p>
<h2 id="give-it-a-try">
  
    Give it a try
  
</h2>

<p>Curious about the skill? Check it out <a href="https://github.com/thoughtbot/dependabot-review-thoughtbot">here</a>.
If you have any improvements or feedback, please open an issue or pull request. We love feedback!</p>

<aside class="related-articles"><h2>If you enjoyed this post, you might also like:</h2>
<ul>
<li><a href="https://thoughtbot.com/blog/internbot-chronicles-4-ci-test-metrics">Internbot Chronicles #4: CI &amp; Test Metrics</a></li>
<li><a href="https://thoughtbot.com/blog/feature-branch-code-reviews">Feature branch code reviews</a></li>
<li><a href="https://thoughtbot.com/blog/introducing-copycopter">Introducing Copycopter: let your clients do the copy writing</a></li>
</ul></aside>
<img src="https://feed.thoughtbot.com/link/24077/17328151.gif" height="1" width="1"/>]]></content>
    <summary>A Claude skill to do the boring stuff for us.</summary>
    <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
  </entry>
  <entry>
    <title>Retro-driven development</title>
    <link rel="alternate" href="https://feed.thoughtbot.com/link/24077/17326678/retro-driven-development"/>
    <author>
      <name>Rob Whittaker</name>
    </author>
    <id>https://thoughtbot.com/blog/retro-driven-development</id>
    <published>2026-04-28T00:00:00+00:00</published>
    <updated>2026-04-27T15:38:37Z</updated>
    <content type="html"><![CDATA[<p>Every session ends with a retro. This week, twenty-four
commits out of about a hundred and forty started with that
retro. Only a handful added anything new. I wasn’t building
the system anymore. It was refactoring itself.</p>

<p>It is Week Four.</p>
<h3 id="tuesday-four-commits-before-lunch">
  
    Tuesday: four commits before lunch
  
</h3>

<p>The 17th. Four refactor-from-retro commits before noon.
Reusing API connections across commands instead of
reconnecting each time. <code>/morning</code> filtering rules. Stale 1:1
prep dropped from the daily log. The system had been
running for three weeks, and friction points had
accumulated. I was working through them in fifteen-minute
bursts between meetings.</p>

<p>By the end of the day, I had added eight more commits. An
actionability check for <code>/context</code>. Top 7 priorities in
the daily log. A self-management outcome for my Fusion
goal. Retro after retro, feeding back into the commands.</p>
<h3 id="wednesday-picking-sides">
  
    Wednesday: picking sides
  
</h3>

<p>Wednesday ran hard. Nine refactor commits between meetings.</p>

<p>I read Sally Lait’s post on semantic calendar emoji and
colours. I copied her system straight into <code>/calendar</code>:</p>

<ul>
<li>🦚 Peacock (default): 1:1s, ad-hoc work</li>
<li>🫐 Blueberry: recurring group meetings</li>
<li>🌿 Sage: pairing, workshops, active work</li>
<li>🍌 Banana: internal socials, external community</li>
<li>✏️ Graphite: transit, food</li>
</ul>

<p>By evening, I’d refactored <code>/evening</code> to use Ruby instead
of Python. It was a small religious war. I picked the side
my team knows. The CLAUDE.md gained a preference note. This
sort of thing accumulates.</p>
<h3 id="thursday-the-anytime-problem">
  
    Thursday: the Anytime problem
  
</h3>

<p>The Anytime list from Things hit 75,000 characters. That’s
around 19,000 tokens. It triggered context compaction
mid-session. I noticed overdue items slipping through the
cracks.</p>

<p>I needed a filter. Not a prompt. A real script. I wrote
<code>bin/filter-anytime</code>.</p>
<div class="highlight"><pre class="highlight ruby"><code><span class="no">KEEP_FIELDS</span> <span class="o">=</span> <span class="sx">%w[Title UUID Tags Area Project Deadline Notes]</span>

<span class="n">items</span><span class="p">.</span><span class="nf">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">raw</span><span class="o">|</span>
  <span class="n">status</span> <span class="o">=</span> <span class="n">raw</span><span class="p">[</span><span class="sr">/^Status:\s*(.+)/</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
  <span class="k">next</span> <span class="k">if</span> <span class="n">status</span> <span class="o">=~</span> <span class="sr">/completed|canceled/</span>

  <span class="c1"># Tags and Deadline appear on lines of the same shape as Status</span>
  <span class="n">tags</span> <span class="o">=</span> <span class="n">raw</span><span class="p">[</span><span class="sr">/^Tags:\s*(.+)/</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span><span class="p">.</span><span class="nf">to_s</span>
  <span class="n">deadline_str</span> <span class="o">=</span> <span class="n">raw</span><span class="p">[</span><span class="sr">/^Deadline:\s*(.+)/</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>

  <span class="k">if</span> <span class="n">tags</span><span class="p">.</span><span class="nf">include?</span><span class="p">(</span><span class="s2">"Waiting"</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">deadline_str</span>
    <span class="n">deadline</span> <span class="o">=</span> <span class="no">Date</span><span class="p">.</span><span class="nf">parse</span><span class="p">(</span><span class="n">deadline_str</span><span class="p">)</span> <span class="k">rescue</span> <span class="kp">nil</span>
    <span class="k">next</span> <span class="k">if</span> <span class="n">deadline</span> <span class="o">&amp;&amp;</span> <span class="n">deadline</span> <span class="o">&gt;=</span> <span class="n">today</span>
  <span class="k">end</span>

  <span class="nb">puts</span> <span class="n">raw</span>
<span class="k">end</span>
</code></pre></div>
<p>Fifty-five lines of Ruby. It runs before the agent sees the
list. The filter runs on rules. It doesn’t guess. Overdue
items stopped slipping.</p>

<p>Left to itself, the agent would have regenerated the filter
every session. Not this time. I wrote real code. Guessing has
limits.</p>
<h3 id="monday-a-stretch-of-quiet-time">
  
    Monday: a stretch of quiet time
  
</h3>

<p>A quiet Monday morning. Six refactor commits in one
sitting. <code>/calendar</code>, <code>/inbox</code>, <code>/weekly</code>, <code>/context</code>. A
stretch of uninterrupted time before the week’s meetings
started.</p>

<p>That is when I realised the system had shifted. I wasn’t
grinding through tasks. I was editing the system that edits
my day. Maintenance, not task-grinding. The point of
building a system is to make it fade into the background.</p>
<h3 id="tuesday-the-cap">
  
    Tuesday: the cap
  
</h3>

<p>By the 24th, I noticed something else. My Anytime list
kept growing. Each session, I added new tasks from
retrospectives, meetings, and the inbox. The filter was
treating the symptom. The disease was that the input
exceeded the throughput.</p>

<p>I added a commitment cap.</p>
<div class="highlight"><pre class="highlight markdown"><code>Commitment cap: No more than 20 active next actions in
Things at any time across all areas (work and personal).
If /morning surfaces items that would exceed the cap,
flag it and ask what to defer before proceeding.
</code></pre></div>
<p><code>/morning</code> now blocks the Top 7 until I’ve deferred enough
items to sit under 20. The check is mechanical; I can’t talk
my way past it.</p>
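<p>“Mechanical” here means a few lines of logic rather than a judgment call. A hypothetical sketch of what the check amounts to (the real cap lives in the command file’s instructions, not in Ruby):</p>

```ruby
# Hypothetical sketch: how many items must be deferred before /morning
# may proceed. CAP mirrors the commitment cap in the command file.
CAP = 20

def deferrals_needed(active_next_actions)
  [active_next_actions.size - CAP, 0].max
end

deferrals_needed(Array.new(23) { "task" }) # => 3
deferrals_needed(Array.new(12) { "task" }) # => 0
```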
<h3 id="what-i-learned">
  
    What I learned
  
</h3>

<p>The dominant mode this week wasn’t invention. It was
refactoring. Twenty-four commits out of about a hundred and
forty say “from retro” or “from feedback.” The system
improves by use, not by planning.</p>

<p>Retro-driven development. It works because the signal is
cheap and the fix is small. Notice a friction point. Name
it. In the next session, the command that caused the
friction receives a line of new guidance. No meetings. No
sprints. No planning.</p>

<p>The commitment cap came from one of those retros. So did
<code>bin/filter-anytime</code>. So did Sally Lait’s colour
conventions, which found their way into <code>/calendar</code>. Each
started as an irritation, ended as a line in a command file, and
changed how the next session ran.</p>
<h3 id="try-it">
  
    Try it
  
</h3>

<p>Retros don’t need to be long. End each session with one.
In the next session, fix what rubbed you the wrong way.
The system is yours.</p>

<aside class="related-articles"><h2>If you enjoyed this post, you might also like:</h2>
<ul>
<li><a href="https://thoughtbot.com/blog/theme-based-iterations">Theme-Based Iterations</a></li>
<li><a href="https://thoughtbot.com/blog/retrospective-fashionopoly">Retrospective: Fashionopoly</a></li>
<li><a href="https://thoughtbot.com/blog/this-week-in-open-source-11">This week in open source</a></li>
</ul></aside>
<img src="https://feed.thoughtbot.com/link/24077/17326678.gif" height="1" width="1"/>]]></content>
    <summary>Twenty-four refactor-from-retro commits in a week. How the management system started refactoring itself.</summary>
    <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
  </entry>
</feed>
