Code Yarns – index

Hi! I am Ashwin Nanjappa. Welcome to my corner of the web.

I accelerate DL inference at NVIDIA with TensorRT. Prior to that, I got a PhD in GPU algorithms, did a postdoc in Computer Vision, and worked at an AI startup. More info can be found at my old personal website.

I write regularly here, maintaining both a ✍ tech blog and a ✍ personal blog.

I am active on @codeyarns@mastodon.social (🐘 Mastodon) and not so much on @codeyarns.bsky.social (🦋 BlueSky).

Rest of my stuff:

📄 Articles
- NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut (2025-09-09)
  NVIDIA Technical Blog
- NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0 (2025-04-02)
  NVIDIA Technical Blog
- MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI (2025-04-02)
  MLCommons
- Introducing a Graph Neural Network Benchmark in MLPerf Inference v5.0 (2025-04-02)
  MLCommons
- NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1 (2024-08-24)
  NVIDIA Technical Blog
- SDXL: An MLPerf Inference benchmark for text-to-image generation (2024-08-24)
  MLCommons
- NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records (2024-03-27)
  NVIDIA Technical Blog
- Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models (2024-03-27)
  MLCommons
- Leading MLPerf Inference v3.1 Results with NVIDIA GH200 Grace Hopper Superchip Debut (2023-09-11)
  NVIDIA Developer Blog
- New MLPerf Inference Network Division Showcases NVIDIA InfiniBand and GPUDirect RDMA Capabilities (2023-07-06)
  NVIDIA Developer Blog
- Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI (2023-04-05)
  NVIDIA Developer Blog
- Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA (2022-08-08)
  NVIDIA Developer Blog
- Getting the Best Performance on MLPerf Inference 2.0 (2022-04-06)
  NVIDIA Developer Blog
- GTC Connect with the Experts (2022-03-23)
  Optimize Deep Learning Inference Workloads using NVIDIA TensorRT and Deploying AI Models in Production with NVIDIA Triton Inference Server
- GTC Connect with the Experts session (2020-03-23)
  NVIDIA TensorRT Applications: Conversational AI, Recommenders, and Object Detection
- Visual Search as a Cloud Service by Large-Scale Commodity GPU Adoption (2017-03-13)
  SuperComputing Frontiers 2017, Singapore
- Developer stories - Ashwin Nanjappa from Singapore (2017-02-08)
  Interview by Workshape.io
- Hand Pose Estimation Demo Booth
  Best Booth Award, A*STAR Scientific Conference (ASC) 2014
📚 Books
- Caffe2 Quick Start Guide
  Packt Publishing (May 31, 2019)
- Instant GLEW
  Packt Publishing (July 25, 2013)
📃 Papers
- Mouse pose estimation from depth images
  Ashwin Nanjappa, Li Cheng, Wei Gao, Chi Xu, Adam Claridge-Chang, Zoe Bichler
  Paper, arXiv
- GHand: A GPU algorithm for realtime hand pose estimation using depth camera
  Ashwin Nanjappa, Chi Xu, Li Cheng
  Eurographics, 2015
  Paper, Video, DOI
- Estimate Hand Poses Efficiently from Single Depth Images
  Chi Xu, Ashwin Nanjappa, Xiaowei Zhang, Li Cheng
  International Journal of Computer Vision (IJCV), 2015
  Paper, DOI
- Real-time hand pose estimation from depth camera using GPU
  Ashwin Nanjappa, Chi Xu, Li Cheng
  GPU Technology Conference 2014 (South East Asia)
  Poster, BibTeX
- Efficient hand pose estimation from single depth images
  X-periment!, Singapore Science Festival, 2014
  Poster
- Delaunay mesh generation using the GPU
  Ashwin Nanjappa, Thanh-Tung Cao, Mingcen Gao, Meng Qi, Tiow-Seng Tan, Zhiyong Huang
  Merit Award, NVIDIA Poster Contest, GPU Technology Conference 2014 (South East Asia)
  Poster, BibTeX
- A GPU accelerated algorithm for 3D Delaunay triangulation
  Ashwin Nanjappa, Thanh-Tung Cao, Mingcen Gao, Tiow-Seng Tan
  ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), 2014
  Paper, Video, Code, BibTeX, DOI
- gHull: A GPU algorithm for 3D Convex Hull
  Mingcen Gao, Thanh-Tung Cao, Ashwin Nanjappa, Tiow-Seng Tan
  ACM Transactions on Mathematical Software (TOMS), 2013
  Paper, Video, BibTeX, DOI
- Delaunay triangulation in R³ on the GPU
  PhD Thesis, National University of Singapore, 2012
  Thesis, Code [1, 2], BibTeX
💾 Code
- gStar4D
  The gStar4D algorithm computes the 3D Delaunay triangulation on the GPU. The CUDA implementation of gStar4D is robust and achieves a speedup of up to 5 times over the 3D Delaunay triangulator of CGAL.
- gDel3D
  The gDel3D algorithm constructs the Delaunay triangulation of a set of points in 3D using the GPU. The algorithm uses a novel combination of incremental insertion, flipping and star splaying to construct the triangulation. The CUDA implementation is robust and runs 10 times faster than the Delaunay triangulator of CGAL.
- gReg3D
  The gReg3D algorithm computes the 3D regular (weighted Delaunay) triangulation on the GPU. Our CUDA implementation of gReg3D is robust and achieves a speedup of up to 4 times over the 3D regular triangulator of CGAL.
- GPU Coursera
  I created this library of code to work offline on the assignments of Heterogeneous Parallel Programming, a GPU/CUDA course offered by Coursera. Many folks have chipped in and turned this into an easy-to-use library for the course.