
Why Large Language Models Won’t Replace Human Coders

The promise of LLMs for the software development world is to transform coders into architects. Not all LLMs are created equal, though.
Feb 29th, 2024 7:55am by
Photo by Philipp Katzenberger on Unsplash.

Is generative AI going to replace human programmers? Probably not. A human using GenAI might, though. But with so many Large Language Models (LLMs) in the mix today, mileage will vary.

If you’re struggling to keep up with all the LLMs, you’re not the only one. We’re witnessing an enthusiastic arms race in LLMs. Google’s GenAI offerings alone have become plentiful: its newest open models, called Gemma, are the latest example of the rapid downsizing of LLMs (is it time to call them Small Language Models?).

More pertinent to the DevOps community is the lightning-fast development of other LLMs tailored specifically to code generation, such as Meta’s recently updated Code Llama 70B. Naturally, GenAI has spooked more than a few developers: in one recent study, nearly half expressed fears about succeeding in a GenAI world with their current technical skill sets.

But is this concern really warranted? Reports of the death of the human programmer may be greatly exaggerated. Humans may even have more time to prepare for a GenAI-dominated world than they might realize.

In fact, the more appropriate question a developer should ask isn’t “Will GenAI take my job?” but rather, “Which LLM do I use?”

Too Big to Succeed for Coding

The promise of LLMs for the software development world is transforming coders into architects. Not all LLMs are created equal, though, and it’s worth exploring why smaller LLMs are even popping up in the first place.

Even the most powerful mainstream models, like GPT-4 and Claude 2, still solve less than 5% of real-world GitHub issues. ChatGPT still hallucinates a lot: fake variables, or even concepts that have been deprecated for over a decade. Worse, it makes that nonsense look convincing. You can try to “prompt engineer” your way out of it, but there’s a sweet spot to how much context is beneficial: too much leads to more confused and random results, at the expense of more processing power.
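Hallucinated code is at least partly machine-checkable before a human ever reads it. As a purely illustrative sketch (the helper name and the sample snippet are invented for this example), a few lines of Python can parse a generated snippet and flag imports that don’t resolve in the local environment, catching one common class of hallucination: made-up packages.

```python
import ast
import importlib.util


def hallucinated_imports(generated_code: str) -> list[str]:
    """Return imported module names that can't be resolved locally.

    A crude guard against LLM-invented packages: parse the snippet
    with the ast module and check each imported module's top-level
    name against the current environment.
    """
    tree = ast.parse(generated_code)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if importlib.util.find_spec(root) is None:
                missing.append(name)
    return missing


# A snippet an LLM might produce, importing a package that doesn't exist:
snippet = "import json\nimport fastjsonify\n\nprint(json.dumps({'ok': True}))"
print(hallucinated_imports(snippet))  # → ['fastjsonify']
```

This only catches unresolvable imports, not subtler hallucinations like plausible-but-wrong function signatures, which is exactly why the human review discussed later in this piece still matters.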

The bigger concern with LLMs for coding is trust. Historically, mainstream LLMs have indiscriminately sucked up everything online like a big digital vacuum cleaner, with little transparency about where their data comes from. If even one percent of the code a company ships contains another organization’s copyrighted code, that’s a problem. You can imagine a nightmare recall scenario in which a shipped product has no over-the-air update capability to patch out the dubious code.

The LLM landscape is quickly changing, though.

Are LLMs Specialized Enough for Coding?

When Meta announced updates to Code Llama earlier in the year, it felt like a welcome attempt to fix the lack of focus on coding among mainstream LLMs. The family is offered in four sizes: 7 billion, 13 billion, 34 billion, and now 70 billion parameters. It’s trained on 500 billion tokens of code and code-related data, and it supports a large context window of 100,000 tokens.

The most exciting of these, in theory, is Code Llama Python, a version of Code Llama specialized for Python, mainly because of what it represents for the evolution of LLMs going forward. Unlike some models from Meta’s Big Tech peers, this one is dedicated purely to a single programming language, fine-tuned on roughly 100 billion additional tokens of Python code. This level of tailored modeling for specific use cases is exactly what the industry needs more of.

Emphasis on “in theory,” because it remains to be seen how useful something like Code Llama will actually prove to developers. Hop over to Reddit and the early verdict seems to be frustration over, among other things, a complicated prompt format, overly strict guardrails and, crucially, hallucinations. That last point is another humbling reminder that any model is only as good as the data it’s trained on.

Flawed or not, though, Meta’s bespoke LLM approach has drawn important attention to the fact that the largest language models aren’t the only way to succeed in AI-assisted code generation. We’re seeing this reflected in the industry’s gathering momentum for smaller, more focused LLMs specializing in coding, such as BigCode, Codegen, and CodeAlpaca. StarCoder is another, which, despite being only 15.5 billion parameters in size, has been found to outperform much larger models like PaLM, LaMDA, and LLaMA on evaluation benchmarks.

Each of these options has pros and cons, but the most important thing is that smaller models will be far safer to use than larger ones. If you’re coding in C++, do you really need your LLM chock-full of irrelevant knowledge like who the third president of the United States was? The smaller the data pool, the easier it is to keep things relevant, the cheaper the model is to train, and the less likely you are to inadvertently steal someone’s copyrighted data.

DevOps teams in 2024 would do well to thoroughly research all LLM options available on the market, rather than defaulting to the most visible ones. It may even be worth using more than one for different use cases.

But back to the existential question at hand…

Will GenAI Replace Humans?

Are any of these GenAI tools likely to become substitutes for real programmers? Unless the accuracy of the coding answers these models supply rises to within an acceptable margin of error (i.e., 98%-100%), probably not.

Let’s assume for argument’s sake, though, that GenAI does reach this margin of error. Does that mean the role of software engineering will shift so that you simply review and verify AI-generated code instead of writing it? Such a hypothesis could prove faulty if the four-eyes principle is anything to go by. It’s one of the most important mechanisms of internal risk control, mandating that any activity of material risk (like shipping software) be reviewed and double-checked by a second, independent, and competent individual. Unless AI is reclassified as an independent and competent lifeform, then it shouldn’t qualify as one pair of eyes in that equation anytime soon.
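The four-eyes logic above can be made concrete. As a minimal, hypothetical sketch (the `Change` type, the `"ai"` author tag, and the gate function are all invented here, not any real CI system’s API), a merge gate that honors the principle simply refuses to count the AI as a reviewer:

```python
from dataclasses import dataclass


@dataclass
class Change:
    author: str           # "ai" marks a machine-generated patch
    approvers: list[str]  # reviewers who signed off on it


def passes_four_eyes(change: Change) -> bool:
    """Ship a change only if a human other than its author approved it.

    The AI never counts as a pair of eyes: any "ai" approval is
    ignored, and the author can't approve their own work, so every
    machine-generated change needs at least one independent human.
    """
    human_approvers = {a for a in change.approvers if a != "ai"}
    independent = human_approvers - {change.author}
    return len(independent) >= 1


# AI wrote it, an independent human reviewed it: ships.
print(passes_four_eyes(Change(author="ai", approvers=["dana"])))  # → True
# AI wrote it and "approved" itself: blocked.
print(passes_four_eyes(Change(author="ai", approvers=["ai"])))    # → False
```

Note that the same rule blocks a human self-approving their own patch, which is the point of the principle: the second pair of eyes must be both independent and competent, and today only a human qualifies.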

If there’s a future where GenAI becomes capable of end-to-end development and of building human-machine interfaces, it isn’t near. LLMs can do an adequate job of interacting with text and elements of an image; there are even tools that can convert web designs into frontend code. Compared to coding, however, it’s much harder (albeit not impossible) for AI to single-handedly take on design work that ties together graphical and UI/UX workflows. Coding is also just one part of development. The rest is investing in something novel, figuring out who the audience is, translating ideas into something buildable, and polishing. That’s where the human element comes in.

Regardless of how good LLMs ever get, one principle should always prevail for programmers: treat every piece of code as if it were your own. Do the peer review and ask your colleague, “Is this good code?” Never blindly trust it.
