Hi! I am Ashwin Nanjappa. Welcome to my corner of the web.
I accelerate DL inference at NVIDIA with TensorRT. Prior to that, I earned a PhD in GPU algorithms, did a postdoc in Computer Vision, and worked at an AI startup. More info can be found at my old personal website.
I write regularly here, maintaining both a ✍ tech blog and a ✍ personal blog.
I am active on @codeyarns@mastodon.social (🐘 Mastodon) and not so much on @codeyarns.bsky.social (🦋 BlueSky).
Rest of my stuff:
📄 Articles
NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut (2025-09-09)
NVIDIA Technical Blog
NVIDIA Blackwell Delivers Massive Performance Leaps in MLPerf Inference v5.0 (2025-04-02)
NVIDIA Technical Blog
MLPerf Inference v5.0 Advances Language Model Capabilities for GenAI (2025-04-02)
MLCommons
Introducing a Graph Neural Network Benchmark in MLPerf Inference v5.0 (2025-04-02)
MLCommons
NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1 (2024-08-24)
NVIDIA Technical Blog
SDXL: An MLPerf Inference benchmark for text-to-image generation (2024-08-24)
MLCommons
NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records (2024-03-27)
NVIDIA Technical Blog
Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models (2024-03-27)
MLCommons
Leading MLPerf Inference v3.1 Results with NVIDIA GH200 Grace Hopper Superchip Debut (2023-09-11)
NVIDIA Developer Blog
New MLPerf Inference Network Division Showcases NVIDIA InfiniBand and GPUDirect RDMA Capabilities (2023-07-06)
NVIDIA Developer Blog
Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI (2023-04-05)
NVIDIA Developer Blog
Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA (2022-08-08)
NVIDIA Developer Blog
Getting the Best Performance on MLPerf Inference 2.0 (2022-04-06)
NVIDIA Developer Blog
GTC Connect with the Experts (2022-03-23)
Optimize Deep Learning Inference Workloads using NVIDIA TensorRT and Deploying AI Models in Production with NVIDIA Triton Inference Server
GTC Connect with the Experts session (2020-03-23)
NVIDIA TensorRT Applications: Conversational AI, Recommenders, and Object Detection
Visual Search as a Cloud Service by Large-Scale Commodity GPU Adoption (2017-03-13)
SuperComputing Frontiers 2017, Singapore
Developer stories - Ashwin Nanjappa from Singapore (2017-02-08)
Interview by Workshape.io
Hand Pose Estimation Demo Booth
Best Booth Award, A*STAR Scientific Conference (ASC) 2014
📚 Books
Caffe2 Quick Start Guide
Packt Publishing (May 31, 2019)
Instant GLEW
Packt Publishing (July 25, 2013)
📃 Papers
Mouse pose estimation from depth images
Ashwin Nanjappa, Li Cheng, Wei Gao, Chi Xu, Adam Claridge-Chang, Zoe Bichler
Paper, arXiv
GHand: A GPU algorithm for realtime hand pose estimation using depth camera
Ashwin Nanjappa, Chi Xu, Li Cheng
Eurographics, 2015
Paper, Video, DOI
Estimate Hand Poses Efficiently from Single Depth Images
Chi Xu, Ashwin Nanjappa, Xiaowei Zhang, Li Cheng
International Journal of Computer Vision (IJCV), 2015
Paper, DOI
Real-time hand pose estimation from depth camera using GPU
Ashwin Nanjappa, Chi Xu, Li Cheng
GPU Technology Conference 2014 (South East Asia)
Poster, BibTeX
Efficient hand pose estimation from single depth images
X-periment!, Singapore Science Festival, 2014
Poster
Delaunay mesh generation using the GPU
Ashwin Nanjappa, Thanh-Tung Cao, Mingcen Gao, Meng Qi, Tiow-Seng Tan, Zhiyong Huang
Merit Award, NVIDIA Poster Contest, GPU Technology Conference 2014 (South East Asia)
Poster, BibTeX
A GPU accelerated algorithm for 3D Delaunay triangulation
Ashwin Nanjappa, Thanh-Tung Cao, Mingcen Gao, Tiow-Seng Tan
ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), 2014
Paper, Video, Code, BibTeX, DOI
gHull: A GPU algorithm for 3D Convex Hull
Mingcen Gao, Thanh-Tung Cao, Ashwin Nanjappa, Tiow-Seng Tan
ACM Transactions on Mathematical Software (TOMS), 2013
Paper, Video, BibTeX, DOI
Delaunay triangulation in R³ on the GPU
PhD Thesis, National University of Singapore, 2012
Thesis, Code [1, 2], BibTeX
💾 Code
gStar4D
The gStar4D algorithm computes the 3D Delaunay triangulation on the GPU.
The CUDA implementation of gStar4D is robust and achieves a speedup of
up to 5 times over the 3D Delaunay triangulator of CGAL.
gDel3D
The gDel3D algorithm constructs the Delaunay triangulation of a set of
points in 3D using the GPU. The algorithm uses a novel combination
of incremental insertion, flipping and star splaying to construct
the triangulation. The CUDA implementation is robust and runs 10 times
faster than the Delaunay triangulator of CGAL.
gReg3D
The gReg3D algorithm computes the 3D regular (weighted Delaunay)
triangulation on the GPU. Our CUDA implementation of gReg3D is robust
and achieves a speedup of up to 4 times over the 3D regular triangulator
of CGAL.
GPU Coursera
I created this library of code to work offline on the assignments of
Heterogeneous Parallel Programming, a GPU/CUDA course offered by
Coursera. Many folks chipped in and have converted this into an
easy-to-use library for the course.