
Henry Ndubuaku

LinkedIn Twitter Email Spotify

I could train a 1B-A200m model on an iPhone 17 Pro at ~650 tokens/sec. It would take 360 days to train on 20B tokens of data and use 156 kWh of electricity, which costs about $51.
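A quick back-of-the-envelope check of those figures (the ~$0.33/kWh electricity price is my assumption, inferred from $51 / 156 kWh; the throughput and token counts are from the claim above):

```python
# Sanity-check the training-time and energy figures above.
TOKENS = 20e9        # training tokens
TOK_PER_SEC = 650    # claimed throughput on an iPhone 17 Pro

seconds = TOKENS / TOK_PER_SEC
days = seconds / 86_400
print(f"{days:.0f} days")                    # ~356 days, i.e. roughly a year

ENERGY_KWH = 156     # claimed total energy
hours = seconds / 3_600
avg_watts = ENERGY_KWH * 1_000 / hours
print(f"{avg_watts:.1f} W average draw")     # ~18 W sustained, far above phone thermals

PRICE_PER_KWH = 0.33  # assumed electricity price, USD/kWh
print(f"${ENERGY_KWH * PRICE_PER_KWH:.0f}")  # ~$51
```

The implied ~18 W sustained draw for a year is why the "phone will fry" line below is not a joke.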

The phone would fry, of course, so instead I wrote algorithms to run inference on your phone. We named the project after a plant that survives in resource-constrained environments: the Cactus.

Cactus can run a similar model on your grandma's Pixel 6a at 36 tokens/second, draining only 10% battery per hour of continuous inference and using just 250MB of RAM.

I had an offer from Nvidia, one of my dream companies, but went on to build Cactus instead, starting in Jan 2025. Cactus launched in July 2025, grew to 4k GitHub stars, and completed 10m inference tasks across 900+ projects in 2025.

Cactus raised funding in Aug 2025 from YCombinator, FCVC (whose portfolio includes Slack, Coinbase, GitLab, Instacart, etc.), Oxford, and 6 smaller funds like Transpose (run by Garry Tan's brother).

Besides VCs, Cactus also got checks from fellow YC founders, as well as 62 tech CTOs/VPs/Directors, both via syndicate and directly, at Google DeepMind and elsewhere.

We have now grown to 8 exceptionally gifted MTS from UCLA, Nokia, Google, Stanford, and Oxford. The project is also maintained by UCLA's BruinAI, UWaterloo's WatAI, Yale's YAA, and NUS's SCAIS.

Follow the journey!

Core Expertise

Maths Computing AI/ML/RL Distributed Systems GPU

Main Tools

Python C++ PyTorch Jax CUDA Vulkan Neon Cloud

Career Progression

  • 2025-XX: Cactus (YC S25) - Founder & CTO (tiny inference engine for phones and wearables).
  • 2024-25: Deep Render - AI Research Engineer (realtime video models that run on phone GPU/NPU).
  • 2021-24: Wisdm - ML Software Engineer (distributed perception AI for Maxar Defence satellite views).
  • 2019-21: MSc + Open-source activities (JAX/NanoDl, Torch/SuperLazyAutograd, CUDARepo, etc.).
  • 2018-19: Google GADS Scholarship Programme with Andela (pre-MSc), around systems design.
  • 2017-18: National Youth Service, posted to software engineering after bootcamp, mostly ARM.
  • 2012-16: Started university at 15; covered EECS, data structures, algorithms, maths, physics.

Fun Highlights

  • Wrote Math & CS For ML (with code).
  • Gave this lecture to a small ML group in Nigeria, on optimising large-scale ML in JAX.
  • Co-host this monthly dinner for AI researchers, engineers and founders in London.
  • Kevin Murphy (DeepMind Principal), Thomas Wolf (HuggingFace Co-founder), David Holz (Midjourney Founder), Steve Messina (IBM CTO) followed back on X.
  • After CUDARepo, Nvidia reached out; I did 7 technical rounds, got a verbal offer, went back and forth over YOE/pay, then got into YC.
  • Did MSc at QMUL just to work with Prof Matt Purver (ex-Stanford researcher on CALO); did my project/thesis with his team.
  • Did BEng under Prof Onyema Uzoamaka (rumoured to be the first Nigerian CS grad from MIT); he taught computer architecture off the top of his head!

Pinned

  1. cactus-compute/cactus — Kernels & AI inference engine for mobile devices. (C++, 4.1k stars)

  2. nanodl — A JAX-based library for building transformers; includes implementations of GPT, Gemma, LLaMA, Mixtral, Whisper, Swin, ViT, and more. (Python, 298 stars)

  3. cuda-tutorials — CUDA tutorials for Maths & ML with examples; covers multi-GPU, fused attention, Winograd convolution, reinforcement learning. (CUDA, 206 stars)

  4. super-lazy-autograd — Hand-derived, memory-efficient, super-lazy PyTorch VJPs for training LLMs on a laptop, all using one op (bundled scaled matmuls). (Python, 39 stars)

  5. pete — Parameter-efficient transformer embeddings: replaces learned embeddings with hardware-aware polynomial expansions of token IDs. (Python, 7 stars)

  6. tango — Decentralised ML engine where tiny edge devices like smart watches, phones, VR headsets, game consoles, etc. can contribute. (Go, 5 stars)