Back to Kernel Debugging
... with AI as a partner!
It's been a while since I've had to do kernel debugging. My jobs at VMware/Broadcom & Weka dealt mostly with user space code that was either using or bypassing the OS entirely. Now at CoreWeave I'm back dealing with the Linux kernel, and even though most things are largely the same, there have been major changes in a few things related to debugging Linux due to ... as you can probably guess ... AI.
Kernel Debugging with AI
In my first week at CoreWeave I debugged a relatively simple crash happening in an NVIDIA driver (you can find more about it here - BTW even though there was no feedback, the PR was merged in the 590 versions). By the end of my first month I had also root-caused a soft-lockup issue and verified that it was an existing problem reported for older v6.5 kernels (relevant issue in Launchpad).
After root-causing the above myself, I decided to run an exercise to see how I could have used AI most effectively to do this job for me, or at least to help me do my job better. I then started using it in internal projects and side projects. Now it's an indispensable part of my debugging workflow. Here's what I learned:
Giving the AI the dmesg output (the stack trace of the crash/lockup plus surrounding context) and access to a repo containing the Linux kernel branch running on that system can get you a very long way. Depending on the nature of your issue, the AI can either get you to root cause (RC) or very close. For the hard bugs, at the very least it can do the grunt work of opening tabs to all the relevant code paths involved.
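Before pasting a full dmesg into a prompt, I like to trim it down to just the oops and its call trace. A minimal sketch of that kind of pre-processing, in plain Python (the sample oops below is made up, including the nvidia_frob symbol):

```python
import re

# Fabricated example of an oops as it appears in dmesg.
SAMPLE_DMESG = """\
[ 1234.567890] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1234.567891] RIP: 0010:nvidia_frob+0x1a/0x40 [nvidia]
[ 1234.567892] Call Trace:
[ 1234.567893]  <TASK>
[ 1234.567894]  nvidia_frob+0x1a/0x40 [nvidia]
[ 1234.567895]  do_something+0x2b/0x90
[ 1234.567896]  ksys_ioctl+0x5e/0xb0
[ 1234.567897]  </TASK>
"""

def call_trace_symbols(dmesg: str) -> list[str]:
    """Extract the bare function names from a Call Trace block."""
    symbols = []
    in_trace = False
    for line in dmesg.splitlines():
        # Strip the "[ seconds ]" timestamp prefix.
        text = re.sub(r"^\[\s*[\d.]+\]\s*", "", line)
        if text.startswith("Call Trace:"):
            in_trace = True
            continue
        if in_trace:
            # Match "symbol+0xOFF/0xLEN", skipping "? " speculative frames.
            m = re.match(r"(?:\?\s+)?([\w.]+)\+0x", text.strip())
            if m:
                symbols.append(m.group(1))
    return symbols

print(call_trace_symbols(SAMPLE_DMESG))
# → ['nvidia_frob', 'do_something', 'ksys_ioctl']
```

The symbol list is exactly what I'd then ask the AI to chase through the kernel tree.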
Another helpful tip is to point the AI to the appropriate LKML list, Canonical's Launchpad, or the kernel's Bugzilla instance to see whether your issue has already come up before you analyze the failure yourself. Even though there's nothing like the feeling of RCing a failure yourself, when you have to triage and debug many of them, time is of the essence.
If you have a crash dump, you can also point the AI to drgn and sdb (together with their relevant APIs and usage documents so the model learns how to use them, if it doesn't know already). I also have to say that drgn is just a much better tool for this at the current time: it's a Python library, most models already know how to use Python, and its online reference docs are much more comprehensive.
If the AI cannot debug your issue in one shot, it is still useful as a debugging partner. Yes, sometimes it can hallucinate and lead you down the wrong rabbit hole, but in my experience this is not very different from what happens without AI. There have been many times in the past where I made a wrong assumption or read an if-statement wrong myself (damn you, double negation) and wasted days down the wrong path. The AI can be your rubber ducky and so much more. I treat it as a coworker whom I keep asking for proof that their hypothesis about a bug is true.
As a corollary to the above, your debugging experience will be as good as you make it. If you're used to the psychology of debugging (be curious, question assumptions, don't be afraid of being wrong, use the scientific method of forming hypotheses and proving/disproving them, etc.), then debugging with AI can 10x your productivity or more. Vague prompts, prompts with opinions rather than facts, not pushing back when the model speculates, and treating AI output as an answer instead of a hypothesis are the anti-patterns here.
Finally, AI inherently has two things that are invaluable for debugging: stamina and tenacity. It can go down all the rabbit holes, working tirelessly, no questions asked. It will also work relentlessly until it has some solution to your problem. Both of these characteristics can also lead to annoying behavior, like a model getting stuck in some kind of logical loop, so again it's up to you to steer it clear.

