AI keeps getting better. Every month there are new models, new capabilities, and developers can do more, faster. But if you’ve been using AI where quality matters, you’ve probably noticed something: every time you let it go a little too far, it runs off the rails. Architecture degrades, weird patterns emerge, things get harder to change, and eventually you don’t want to touch the code because you’ve lost context on what it even did.
There’s a paradox here: AI lets you move faster, but the faster you go, the more time you spend afterward repairing what it broke, which pushes you to move slower in the first place.
This is not new to software engineers. I learned this lesson the hard way at one of my first hackathons as a junior developer. I spent sleepless nights for almost a week building a feature the company really wanted. I demoed it to everyone from the founders on down, and they were excited and impressed, only to be incredibly let down when I told them it was nowhere near shipping. I spent my weekends over the next couple months trying to get it into shape, only to realize I needed to rewrite the whole thing. I’d rushed it, never reviewed it, nothing was done right. It looked great in the demo and had zero stability — threading issues everywhere, a total nightmare. Hackathon demos are deceiving. What looks like a polished, vetted feature is often smoke and mirrors. AI output can be the same way, and a good mentor will give you the same advice for both: break it down into small pieces that can be verified before moving on to the next.
Whether you’re building software by hand or with amazing tools like AI, you learn the balance: how far can I let it go before I need to slow down and build trust in the code? That threshold has been steadily rising — from autocomplete, to functions, to classes, to clusters of classes — but it still exists. You can only let it go so far.
So the goal becomes minimizing the number of unique problems you’re asking AI to solve at once. There are two ways to approach this, and the difference matters more than you’d think.
Vertical vs. Horizontal Refactors
Imagine a contest: whoever can refactor and merge the most code in a week wins a prize — maybe a Claude subscription so you can refactor even more code. How do you win? There are two very different approaches.
Vertical Refactors
A vertical refactor takes one feature and refactors it top to bottom — UI, service layer, data layer. You’re only touching one feature, so the scope feels contained.
But the AI is tackling a hundred different types of problems in that one feature. If you haven’t given guidance for each of those, it’s going to guess. It’ll reference legacy patterns for some, pull in something from its training data that doesn’t apply to your domain for others. By the time it’s done, you have 10,000 lines of new code and maybe 3,000 of them are wrong.
Now you need to reverse-engineer every decision. Did it get this one right? What about this one? And unlike reviewing a coworker’s PR — where you have shared context, shared training, and trust built from working together — you have zero baseline trust with the AI. It learned from whatever open source projects appeared in its training data and has no knowledge of the hard lessons your team learned: the crash from a year ago, the edge case that only shows up on certain devices, the business rule that exists only in code.
You end up spending an intense amount of time studying that code with zero trust, recreating the entire thing in your head. When you find issues, you’ll likely need to make many corrections — and you can use AI for those too, but that’s just more code you then need to re-review. Your boss keeps asking, “Is it ready?” and your answer is “I don’t know yet,” because you don’t. That’s the opposite of efficient.
Horizontal Refactors
A horizontal refactor takes one type of change and applies it across the codebase. You’re swapping out a framework, migrating an API, adopting a new pattern — the same blueprint, repeated many times.
Up front, you spend time studying how you want the refactor to work. You codify it in a skill or instruction with clear examples. Get that right once. Then ask the AI to repeat it across many files.
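As a sketch of what that codified instruction might look like, here’s a hypothetical skill file for a “migrate callbacks to async/await” refactor. The file name, pattern, and rules are all illustrative, not taken from any real project:

```markdown
<!-- skills/async-migration.md (hypothetical example) -->
# Skill: migrate callback-style data access to async/await

Rewrite every function that takes a `callback` parameter to return a Promise
and use `async`/`await`. Do not change behavior or error semantics.

Before:

    function loadUser(id, callback) {
      db.query("SELECT * FROM users WHERE id = ?", [id], (err, rows) => {
        if (err) return callback(err);
        callback(null, rows[0]);
      });
    }

After:

    async function loadUser(id) {
      const rows = await db.query("SELECT * FROM users WHERE id = ?", [id]);
      return rows[0];
    }

Rules:
- Preserve function names and exported signatures, apart from the callback.
- Propagate errors by throwing, not by returning error values.
- Leave unrelated code in the file untouched.
```

The before/after pair is the important part: it turns the AI’s hundred open-ended guesses into one mechanical transformation you can check at a glance.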
What comes out is far more predictable. You generated just as much code, maybe just as much value. It doesn’t look as flashy because you didn’t refactor an entire feature. But the review is dramatically faster because you’re checking the same pattern over and over, not reverse-engineering a hundred unique decisions.
Think of these horizontals as layers in your application. You go through them piece by piece, and each one makes the codebase more consistent, more modern, and more ready for the next layer.
Why Horizontal Wins
The math is straightforward. In a vertical refactor, you spend a small amount of time prompting and a large amount of time reviewing decisions you don’t trust. In a horizontal refactor, you spend a larger amount of time up front defining the pattern, and a much smaller amount of time reviewing because the output is predictable.
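To make that concrete, here’s a back-of-the-envelope sketch with entirely made-up numbers (nothing here is measured; the point is the shape of the trade-off, not the specific values):

```python
# Hypothetical, made-up numbers to illustrate the trade-off; nothing is measured.
MINUTES_PER_HOUR = 60

# Vertical: cheap prompt, then every unique decision must be reverse-engineered.
vertical_prompt = 60            # minutes spent prompting
unique_decisions = 100          # distinct choices the AI made
review_per_decision = 15        # minutes to verify each one with zero baseline trust
vertical_total = vertical_prompt + unique_decisions * review_per_decision

# Horizontal: expensive up-front pattern definition, then cheap repeated review.
pattern_definition = 480        # minutes to study and codify the pattern once
files_touched = 100             # times the same blueprint is applied
review_per_file = 3             # minutes to confirm a pattern you already trust
horizontal_total = pattern_definition + files_touched * review_per_file

print(vertical_total / MINUTES_PER_HOUR)    # 26.0 hours
print(horizontal_total / MINUTES_PER_HOUR)  # 13.0 hours
```

Tweak the numbers however you like; as long as per-decision review with zero trust costs meaningfully more than per-file review of a known pattern, the horizontal approach comes out ahead, and the gap widens as the refactor grows.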
Back to our contest — I’d bet my Claude subscription that horizontal refactors would win the day.
As you build up horizontal skills for different types of changes, you’re also building toward the ability to eventually compose them into full vertical refactors — but with real patterns backing each layer, not bespoke AI decisions.
Claude Chain
This is the approach behind Claude Chain, an open source tool I’ve been building. (It uses Claude but has no affiliation with Anthropic — I’m waiting for the cease and desist, at which point I’ll change the name.)
You define a specification and a list of tasks in markdown, and it works through them one by one, staging a pull request for each. It’s a natural fit for horizontal refactors: you define the specification for how to refactor that one thing, create one task per source code file or cluster of files in your codebase, and let it rip. As long as you have the capacity to review those PRs, you can move through them fast.
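I don’t know the exact file format Claude Chain expects, so treat this as a hypothetical sketch of the idea rather than a real input file: one specification describing the refactor, and one task per source file, each becoming its own PR.

```markdown
<!-- Hypothetical sketch; not the actual Claude Chain file format -->
# Specification
Apply the agreed refactor pattern (e.g., migrate callback-style functions
to async/await) to the file named in each task, following the before/after
examples in the pattern definition. One pull request per task.

# Tasks
- [ ] src/users/repository.js
- [ ] src/orders/repository.js
- [ ] src/billing/repository.js
```

The one-PR-per-task structure is what keeps review cheap: each PR is the same known pattern applied to one more file, so approving the tenth one takes a fraction of the time the first one did.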
I’ve tried the same approach with vertical, feature-level work. It’s a waste of time — too many unique decisions, too much variance. But for horizontal refactors where the pattern is well-defined, it works extremely well.
A note on AI and trust: while writing this post, AI helpfully generated a link to someone else’s open source project instead of my own Claude Chain repo. I didn’t catch it until seconds before posting. Case in point for review.