Nitin Gupta

Tensor Deduplication for Multi-Model Inference

ngupta@nitingupta.dev (Nitin Gupta) — Mon, 08 Dec 2025 00:00:00 +0000

Summary

Problem: Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory scales linearly with model count, and VRAM is the limiting resource.
Solution: Tensor deduplication automatically identifies and shares bit-identical weight tensors across models, requiring no checkpoint modifications.
Results: Across diffusion and LLM workloads, real-world savings range from 3–32%. DeepFloyd IF stages share 18.87 GB (32% reduction). Synthetic upper bound is 50%.
Overhead: Hashing adds <1% to model load time. Zero runtime overhead since the forward pass is unchanged.
Compatibility: Works with HuggingFace safetensors, GGUF, and Diffusers pipelines. No changes to training or checkpoints required.
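The core idea in the summary above can be sketched in a few lines. This is a minimal illustration, not the post's actual implementation: `TensorPool`, `intern`, and `fnv1a` are hypothetical names, plain host byte buffers stand in for GPU tensors, and FNV-1a is used only to keep the sketch dependency-free (a real system would likely use a stronger hash).

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
using Bytes = std::vector<std::uint8_t>;
using SharedTensor = std::shared_ptr<const Bytes>;

// 64-bit FNV-1a over the raw bytes (assumption: stand-in for a stronger hash).
std::uint64_t fnv1a(const Bytes& data) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::uint8_t b : data) {
        h ^= b;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

class TensorPool {
public:
    // Returns a shared buffer holding `data`. If a bit-identical buffer is
    // already pooled, that one is returned and the new copy is dropped.
    SharedTensor intern(Bytes data) {
        std::vector<SharedTensor>& bucket = pool_[fnv1a(data)];
        for (const SharedTensor& t : bucket) {
            if (*t == data) return t;  // full compare guards against hash collisions
        }
        bucket.push_back(std::make_shared<Bytes>(std::move(data)));
        return bucket.back();
    }

    // Total bytes actually held after deduplication.
    std::size_t unique_bytes() const {
        std::size_t total = 0;
        for (const auto& entry : pool_)
            for (const SharedTensor& t : entry.second) total += t->size();
        return total;
    }

private:
    std::unordered_map<std::uint64_t, std::vector<SharedTensor>> pool_;
};
```

Each model would call `intern` on every weight tensor at load time. Two models that ship the same backbone end up holding pointers to the same buffers, and the forward pass does not need to know the difference.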

Multi-Model Memory Bloat

Modern inference deployments rarely serve a single model. Production systems routinely load:

Shared Backbones: Loading Weights Once, Serving Many Models

ngupta@nitingupta.dev (Nitin Gupta) — Sat, 29 Nov 2025 00:00:00 +0000

I keep running into the same pattern when trying to self-host models (which is a lot of fun): we run several big models side by side, all of them valuable, all of them slightly different, and all of them wasting VRAM by reloading nearly the same weights.

This post is my attempt to explore a specific idea:

Can we load a shared backbone of weights once on a GPU, then load only the small, unique pieces per model that reuse that backbone?

Proactive Compaction

ngupta@nitingupta.dev (Nitin Gupta) — Sat, 07 Mar 2020 22:33:52 -0800

This feature has now been accepted and merged in the upstream kernel and will be part of kernel release 5.9. This post has been updated to match the upstream version of this feature.

In my previous post, I described how the on-demand compaction scheme hurts hugepage allocation latencies on Linux. To improve the situation, I have been working on Proactive Compaction for the Linux kernel, which tries to reduce higher-order allocation latencies by compacting memory in the background.

Linux kernel hugepage allocation latencies

ngupta@nitingupta.dev (Nitin Gupta) — Tue, 04 Feb 2020 00:00:00 +0000

Some drivers need to allocate almost all memory as hugepages to reduce (on-device or CPU) TLB pressure. However, on a running system, higher-order allocations can fail if the memory is fragmented. The Linux kernel can do on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency.

To show the effect of on-demand compaction on hugepage allocation latency, I created a test program “frag” which allocates almost all available system memory and then frees $\frac{3}{4}$ of the pages from each hugepage-sized aligned chunk. This allocation pattern results in a ~300% fragmented address space w.r.t. order 9, i.e. the physical mappings of our VA space are spread over 3x more hugepage-aligned chunks than what is ideally required.
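One way to read the ~300% figure, as a back-of-the-envelope sketch: an order-9 chunk is $2^9 = 512$ base pages, and $N$ (a symbol introduced here only for illustration) is the number of hugepage-aligned chunks the test touches. After $\frac{3}{4}$ of each chunk is freed, 128 pages stay mapped in every chunk:

$$
\text{pages still mapped} = N \cdot \tfrac{1}{4} \cdot 512 = 128N, \qquad
\text{chunks ideally needed} = \frac{128N}{512} = \frac{N}{4}, \qquad
\text{excess} = \frac{N - N/4}{N/4} = 3 = 300\%.
$$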

A layered object store design in Elixir (Part VI)

ngupta@nitingupta.dev (Nitin Gupta) — Fri, 24 Jan 2020 00:00:00 +0000

We built an object store from scratch in Elixir using a layered design approach. The overall theme has been to avoid generalizing the design too much, which kept the implementation of each layer/module simple. We were also careful when adding any third-party dependencies, which has multiple advantages: a deeper understanding of your codebase and easier debugging (I hate unknown code-paths in backtraces).

For reference, here are links for all five parts along with their summaries:

A layered object store design in Elixir (Part V)

ngupta@nitingupta.dev (Nitin Gupta) — Thu, 23 Jan 2020 00:00:00 +0000

Part I introduces the overall design of our object store. In this post we focus on the Web layer. This is the final layer of our object store, responsible for exposing it over the web. It exposes two endpoints: /upload for uploading a file and /file/:file_id for getting a file by ID. A typical GraphQL application will also expose a /graphql endpoint, which plugs directly into your API layer; however, I will not discuss this part and will stay focused on the object store side of things.

A layered object store design in Elixir (Part IV)

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 22 Jan 2020 00:00:00 +0000

Part I introduces the overall design of our object store. In this post we focus on the API layer. All layers until now were concerned only with storing the input file together with some file-format-specific transforms (like thumbnails). It is at the API layer that we store per-file system and user metadata. This metadata can be used to support application-specific business logic and security policies.

This layer will depend on all per-file-format modules: ImageStore, VideoStore, etc. We will use Postgres for storing per-file metadata, so we also depend on the postgrex package. A typical API layer will also be exposing a GraphQL interface which forms the core of application specific business logic. I am not going to include an example GraphQL interface here but absinthe would be my preferred way of doing it, anytime.

A layered object store design in Elixir (Part III)

ngupta@nitingupta.dev (Nitin Gupta) — Mon, 20 Jan 2020 00:00:00 +0000

Part I layer.

ImageStore

The ImageStore module is responsible for storing images along with their thumbnails. It will use the FileStore layer to actually store files on disk. Before we define module interfaces, let's see our application requirements:

All images must be stored in the jpg format.
Images cannot be larger than 1920x1080. We do not want to store the user-provided version at all.
Thumbnails should use the same jpg format.
All thumbnails must have the same size of 256x256.

Note that we are going for highly application specific requirements rather than a more general, configurable design. I have seen most of the complexity in software stacks is due to the temptation of making them “reusable”. As you will see, the implementation is going to be so simple, with clearly defined interfaces, that it would be much easier for you to create such a module for each of your applications, with its specific requirements baked in.

A layered object store design in Elixir (Part II)

ngupta@nitingupta.dev (Nitin Gupta) — Mon, 13 Jan 2020 00:00:00 +0000

Part I introduces the overall design of our object store. In this post we focus on its first layer, the FileStore.

The FileStore layer is responsible for actually storing the file in our object store. At this level, we are not concerned about what kind of file it is (image, video, document, or whatever else), nor do we have any notion of security. We just store whatever input path is given to us.

A layered object store design in Elixir (Part I)

ngupta@nitingupta.dev (Nitin Gupta) — Sun, 12 Jan 2020 00:00:00 +0000

I recently designed an object store from scratch in Elixir. It has been serving me well as a backend for an app which needs to store all kinds of files: images, videos, documents. I wanted something simple to avoid dealing with off-the-shelf object stores which require complex configurations and to avoid cloud storage which is dead simple to use but can get very expensive, very quickly. For this project, simplicity was the key to make sure I can debug any failures quickly.

Elixir collections

ngupta@nitingupta.dev (Nitin Gupta) — Thu, 02 Jan 2020 00:00:00 +0000

Elixir is a functional programming language that I have been using a lot in recent months to build all kinds of applications. Understanding the built-in collection types is essential to using any language effectively, and Elixir is no different.

This post summarizes all the collection types, along with pros, cons, and gotchas for each of them.

Collection	Example	When
Tuples	`{:ok, "All good"}`	Returning data from a function
Lists	`[1, "two", :three]`	For a collection of items
Keyword lists	`[one: 1, two: 2]`	Passing options to a function
Maps	`%{one: 1, two: 2}`	Flexible key/value store
Structs	`%User{name: "John", age: 32}`	Typed/fixed key/value store

Tuples

`{:ok, foo}` is a “tagged” tuple, since it begins with an atom like `:ok` or `:error`, for example `{:error, 543, "some error"}`.

Examples:

Subtle Errors in C++ Programs

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 24 Apr 2019 15:21:51 -0700

I recently stumbled upon a subtle bug in some benchmark code, which again reminds me never to use C++ again if I can help it.

Here’s a buggy snippet from this code (simplified):

// BUGGY
ostringstream os;
int i = 1;
os << "foo-" << i << ".dat";
const char *filename = os.str().c_str();
int fd = open(filename, O_RDONLY);

You might expect the above code to try to open a file named foo-1.dat, but that’s not what is happening here.

In this snippet, os.str() creates a temporary string object which is destroyed at the end of that statement, right after the call to the c_str() method. So, filename ends up pointing to freed memory, which can of course contain arbitrary content (up to the first null byte).
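One way to avoid the dangling pointer is to keep the string in a named variable that outlives the call to open(). The sketch below assumes a POSIX system; `make_filename` and `open_data_file` are hypothetical helper names introduced for illustration, not part of the original benchmark.

```cpp
#include <fcntl.h>   // open, O_RDONLY (POSIX)
#include <sstream>
#include <string>

// Build the name as a std::string that the caller owns.
std::string make_filename(int i) {
    std::ostringstream os;
    os << "foo-" << i << ".dat";
    return os.str();
}

// The named std::string keeps its buffer alive for the whole call to
// open(), so the pointer returned by c_str() never dangles.
int open_data_file(int i) {
    const std::string filename = make_filename(i);
    return open(filename.c_str(), O_RDONLY);
}
```

Calling `open(os.str().c_str(), O_RDONLY)` in a single expression would also be safe, because the temporary lives until the end of the full expression. The bug only appears when the pointer is stored and used in a later statement.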

Setting Up Backup Snapshots on Linux

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 24 Apr 2019 00:00:00 +0000

For some time I’ve been looking for a backup solution for Linux that can periodically take snapshots of data, allowing me to go back in history of any file just like git. I finally found restic which fits these requirements. Here is how I set it up to take snapshots of particular directories, say every 15 minutes.

Installing restic

Though restic is available in the repositories of almost all Linux distros, I recommend downloading the latest release directly from GitHub to avoid dealing with a potentially outdated version. Helpfully, you can stay current with restic releases using restic self-update.

Google Drive on Linux

ngupta@nitingupta.dev (Nitin Gupta) — Mon, 22 Apr 2019 00:00:00 +0000

There is no official Google drive client for Linux. I tried many different clients found all over GitHub but none of them worked reliably for me except rclone. I also tried third-party proprietary clients like Insync but allowing read-write access to all your Google drive files to a closed source blob is too much to swallow.

One caveat with rclone is that it does not natively support bi-directional sync (github issue), but someone developed a Python script, rclonesync-V2, which wraps rclone and does the job. With these two pieces of software we can get a close-to-official Google Drive client experience.

Faster compilation with distcc

ngupta@nitingupta.dev (Nitin Gupta) — Sat, 19 Jun 2010 05:22:00 -0700

Often, you have more than one system at your disposal but no clear way of distributing your compilation workloads over to all or some of them. They might be running different OSes which makes it look even more difficult. In my case, I have one laptop (2 cores) and a desktop (4 cores) connected with a WiFi network. The laptop runs Linux (Fedora 13 64-bit) while the desktop runs Windows 7 (64-bit). I wanted to somehow offload Linux kernel compilation over to my powerful desktop and keep my laptop cool :)

Compressed RAM disk for Windows, The Virtual Way!

ngupta@nitingupta.dev (Nitin Gupta) — Sun, 30 May 2010 06:53:00 -0700

Recently, I developed a Linux kernel driver which creates generic RAM-based compressed block devices (called zram). Being RAM disks, they do not provide persistent storage, but there are many use cases where persistence is not required: /tmp, various caches under /var, swap disks, etc. These cases can benefit greatly from high-speed RAM disks along with the savings which compression brings!

However, all this seems to be completely Linux-centric. But with virtualization, zram can be used for Windows too! The trick is to expose zram as a ‘raw disk’ to Windows running inside a Virtual Machine (VM). I will be using VirtualBox as an example, but exposing raw disks should be supported by other virtualization solutions like VMware and KVM too.

Comprehensive graphical Git diff viewer

ngupta@nitingupta.dev (Nitin Gupta) — Sun, 27 Dec 2009 06:28:00 -0800

For a long time, I had been looking for a graphical git diff viewer which could show the original and modified files side-by-side and highlight the changes. There are a few solutions, but none of them is sufficient:

A tool included with git called ‘git-difftool’ is partially helpful – it can show changes graphically but diff for each file is shown one-by-one. This is very irritating. In fact, unusable even with just 10-15 files.
Another alternative is the meld diff viewer which is “git aware”. The problem here is that it can show diff for uncommitted changes only which is very limiting. What if you want to see what changes between Linux kernel, say 2.6.33-rc1 and 2.6.33-rc2? or changes between last two commits? meld cannot do it, AFAIK.
Finally, with kompare, you can do something like: ‘git diff master | kompare -o -’. This method however, does not show original and new files side-by-side. It is simply prettier diff highlighting.

None of the above methods is sufficient. So, I wrote the following script which solves our problem: it shows the complete contents of the original and new files side-by-side and highlights the differences.

Linux kernel workflow with Git

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 23 Sep 2009 10:21:00 -0700

You worked on some part of Linux kernel. It works great. Now, how to generate the patch series and send it out for review? For this, I always used to generate diffs, create a set of draft mails (one for each patch) in KMail or Thunderbird, and send all these mails one-by-one. This workflow quickly became a big headache. Then I learned Git (and some related tools) to do all this from command line and wow! what a relief!

ccache to speed-up Linux kernel compile

ngupta@nitingupta.dev (Nitin Gupta) — Thu, 19 Mar 2009 14:40:00 -0700

In case you are unfamiliar with ccache, it’s a “compiler cache”. Compiling is primarily a CPU-intensive task, so ccache caches compiled objects: the next time we compile the same code, it reuses these objects, significantly speeding up compilation.

I usually need to recompile the Linux kernel several times a day, with different permutations of config settings. This almost forces a ‘make clean’ or ‘make mrproper’, which deletes all compiled objects in the build tree, and then we have to rebuild everything all over again. This takes an enormous amount of time. ccache comes to the rescue! I’m surprised I didn’t use it earlier.

SLOB memory allocator

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 18 Mar 2009 18:08:00 -0700

The Linux kernel includes a few SLAB allocator variants: SLAB, SLUB and SLOB. Of these, SLOB is especially meant to be used on embedded devices: it tries to be more memory space efficient than the other SLAB variants.

Yesterday, I had a detailed look at the SLOB allocator for possible use in the compcache project and found it unacceptable for the purpose. I did it in response to feedback on the xvmalloc allocator, which is part of the compcache patches posted for inclusion in the mainline Linux kernel: http://lkml.org/lkml/2009/3/17/116

Anti-tip of the month

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 18 Mar 2009 17:48:00 -0700

Very old but still as relevant… and very interesting too! Directly go to “anti-tip” section of this article.

“The moral of the story is: don’t get tricky. C programmers often try to minimize the number of lines of C in their program without consideration for what the compiler will generate. When in doubt, write clear code and give the optimizer a chance to maximize performance. Look at the compiler output. Your code will be easier to debug and probably faster too.”

Fedora 10 instability issue solved!

ngupta@nitingupta.dev (Nitin Gupta) — Wed, 18 Mar 2009 17:35:00 -0700

One of my Fedora 10 systems used to freeze very frequently. After a lot of looking around, I found it's because of “KWin Compositing”, which provides OpenGL-driven special effects for the desktop. Unfortunately, Linux has always been bad at radeon drivers, so it's better to disable these effects, especially if you have a radeon video card.

In ~/.kde/share/config/kwinrc, in the [Compositing] section, change Enabled=true to Enabled=false.

Reboot after this change. Now I never get any system freeze - as is expected from a solid Linux system :)

Difference Engine - Harnessing Memory Redundancy in Virtual Machines

ngupta@nitingupta.dev (Nitin Gupta) — Fri, 13 Mar 2009 05:30:00 -0700

Here is a link to the paper (pdf) (MP3)

Recently I came across this paper published in OSDI ’08. It's an extension to VMware’s page-sharing and shows some amazing and hard-to-believe results. The VMware page-sharing mechanism scans memory for all VMs and maps pages with the same contents to a single page. This achieves memory savings if multiple hosted VMs are running the same OS. The technique discussed in this paper, however, also finds pages that are nearly the same. For such pages, they save a base page and store other similar pages as deltas against it. Pages which are not similar to any other page are simply compressed. Their benchmarks show up to 45% more memory savings over ESX page-sharing under some (specially crafted) workloads.

Nitin Gupta

ngupta@nitingupta.dev (Nitin Gupta) — Mon, 01 Jan 0001 00:00:00 +0000

I am deeply passionate about optimizing GPU performance and delving into the intricacies of resolving bottlenecks within render and compute workloads.

With nearly 15 years of experience in the nitty-gritty details of technology, I’ve dedicated a significant portion of my career to low-level components, contributing Proactive Compaction (LWN.net article, Phoronix article), zram, zsmalloc, etc. to the Linux kernel, with a particular emphasis on scalability and performance enhancements.