India Data Lab Initiative (IDLI)

Overview

The India Data Lab Initiative (IDLI) harmonizes India’s flagship household and firm surveys so researchers can work with consistent, analysis-ready microdata. By standardizing layouts, reconciling evolving classification systems, and validating outputs against official benchmarks, the lab lowers the fixed cost of using datasets such as the National Sample Surveys (NSS) and the Annual Survey of Industries (ASI).

This repository provides a standardized, reproducible Stata-based pipeline for processing and cleaning publicly available NSS labour, NSS consumption, NSS enterprise, and ASI datasets. (The ASI, NSS enterprise, and NSS consumption datasets are currently being cleaned and validated and will be released soon.) The goal of this project is to make high-quality, fully cleaned, analysis-ready datasets easily accessible to:

Researchers
Academicians
Policy analysts
Students and data users

The scripts convert publicly available raw data into consistent, harmonized, clean .dta outputs, ensuring that users can directly begin analysis without spending time on data wrangling.

Key Features

Fully automated data cleaning and data processing pipeline
Generates standardized clean .dta files
Master do-files allow one-click end-to-end execution
Modular script structure (extract → clean → process → validate)
Compatibility across systems — users only update their paths, not the code
Ensures consistency, reproducibility, and minimal manual intervention

Repository layout for NSS Labour Dataset (1987 – 2011)

This repository, titled idli_ext, contains the full codebase for cleaning, harmonizing, and preparing the NSS Labour datasets for the years 1987 to 2011.

idli_ext
└── code
    └── nss
        └── nss_lab

Within the nss_lab folder, there are multiple do files:

00_master_nss_lab.do # This is the master script, it runs the entire pipeline
Household-level cleaning scripts (*_hc) # These are multiple .do files for cleaning household-level NSS labour datasets for years 1987-2011
Person-level cleaning scripts (*_pc) # These are multiple .do files for cleaning person-level NSS labour datasets for years 1987-2011
Harmonization scripts # These are multiple .do files for district, industry and occupation code harmonization

Additionally, there is:

A preamble file located in the nss folder that initializes the coding environment to configure paths, install packages, and register shared directories
A district concordance folder in the nss folder used for harmonizing district identifiers across survey rounds.

NOTE: The household-level cleaning scripts (HC) and person-level cleaning scripts (PC) clean the micro datasets. As a user, you do not need to execute them separately; the master do-file runs the entire workflow.

Please note that similar datasets for the ASI (Annual Survey of Industries), NSS (National Sample Surveys) consumption, and NSS enterprise surveys will be uploaded to the IDLI website soon. This README will be updated accordingly.

How to Use the Code

Download Raw Data

Raw NSS and ASI datasets (CSV and DTA) are publicly available on the MoSPI and IDLI (https://www.idli.dev/) websites. Download them and store them anywhere on your system (preferably in your Documents folder).

Clone `idli_ext` GitHub repository

Clone the repository and place it inside the shared directory referenced by your global root (e.g., Dropbox or OneDrive) so the relative paths defined in 00_preamble.do resolve correctly.

Set Your System Path

In the provided preamble script, simply enter the local system path where the raw datasets are stored (for example "C:/Users/username_as_per_your_system" on Windows or "/Users/username_if_using_a_mac_device/Documents" on a Mac).

You DO NOT need to:

edit code logic
modify global macros
change any processing steps

Only update the required path location where indicated.
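As a rough illustration of the path formats involved, the sketch below picks a raw-data folder by operating system and stops early if the folder does not exist. It is a standalone check you could run before the pipeline, not code from the preamble; the local name raw_root and the folder names are placeholders.

```stata
* Hypothetical sketch: choose a raw-data root by operating system.
* raw_root and the folder names are placeholders, not names used by the pipeline.
if "`c(os)'" == "Windows" {
    local raw_root "C:/Users/username/Documents"
}
else if "`c(os)'" == "MacOSX" {
    local raw_root "/Users/username/Documents"
}
else {
    local raw_root "/home/username"
}

* Stop with a clear message if the folder is missing.
mata: st_local("root_ok", strofreal(direxists(st_local("raw_root"))))
if "`root_ok'" != "1" {
    display as error "Raw data folder not found: `raw_root'"
    exit 601
}
display as text "Raw data root: `raw_root'"
```

Forward slashes work in Stata on all three systems, which is why the Windows example avoids backslashes.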

Run the master do-file on Stata

Open Stata and run the master script: do 00_master_nss_lab.do

This will:

Check/install required packages (if enabled in preamble)
Run year-specific household and person cleaning scripts
Apply variable harmonization and code mappings
Validate outputs and export .dta and .csv files into the output folder
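One way to keep a record of a full run is to wrap the master script in a named log, as sketched below. The clone location and the log name are placeholders, and whether the master script expects to be run from its own folder is an assumption.

```stata
* Hypothetical sketch: run the master script with a log of the session.
* Replace the path with the location of your own clone.
cd "path/to/idli_ext/code/nss/nss_lab"

* A named log avoids clashing with any unnamed log the scripts may open.
log using "nss_lab_master_run.log", replace text name(idli_run)

* Run the pipeline, but keep control so the log is always closed.
capture noisily do 00_master_nss_lab.do
local rc = _rc
log close idli_run

if `rc' {
    display as error "Master script stopped with return code `rc'"
    exit `rc'
}
```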

Check outputs

After the master run completes, go to your output folder and verify that nss_lab_final.dta has been saved.
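The same check can be done from within Stata. The sketch below assumes a placeholder output folder; only the file name nss_lab_final.dta comes from this README.

```stata
* Hypothetical sketch: confirm the final dataset exists and is not empty.
* outdir is a placeholder for your own output folder.
local outdir "path/to/output"

confirm file "`outdir'/nss_lab_final.dta"   // errors out if the file is missing
use "`outdir'/nss_lab_final.dta", clear
describe, short                             // observations, variables, file date
assert _N > 0                               // fail loudly on an empty dataset
```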

Note: The district concordance spreadsheets in documentation/district_concordance/ are imported directly by the Stata code to reconcile NSS labour district codes before merging or validation.
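To give a sense of what reconciling district codes through a concordance involves, here is a minimal sketch on made-up data. The variable names (nss_round, district_code, district_harmonized, person_id) and all values are invented for illustration and are not taken from the pipeline or the actual concordance files.

```stata
* Hypothetical sketch: attach harmonized district codes via a concordance.
* All names and values below are invented toy data.
clear
input int nss_round int district_code int district_harmonized
43 101 1
43 102 1
50 201 1
end
isid nss_round district_code        // concordance must be unique per round-code
tempfile concord
save `concord'

clear
input int nss_round int district_code long person_id
43 101 1
43 102 2
50 201 3
50 999 4
end

* Many person records map to one concordance row.
merge m:1 nss_round district_code using `concord', keep(master match)
tabulate _merge                      // unmatched rows flag codes missing from the concordance
```

In this toy example the record with district_code 999 stays unmatched, which is the kind of case a validation step would need to surface.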

How to run only part of the pipeline

If you only want to run one round’s person or household cleaning (without running everything), run that specific file after running the preamble:

do 01_1_2007_clean_hc.do // household for 2007
do 01_2_2007_clean_pc.do // person for 2007

Important: Only run individual scripts for inspection or validation. Do not modify them.
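A partial run could look like the sketch below. It assumes that the preamble is the 00_preamble.do file in the nss folder and that the cleaning scripts are run from the nss_lab folder; the clone location is a placeholder.

```stata
* Hypothetical sketch: run the preamble, then a single round's scripts.
* repo is a placeholder for the location of your clone.
local repo "path/to/idli_ext"

* Initialize paths and packages first (assumed file name and location).
do "`repo'/code/nss/00_preamble.do"

* Then run only the 2007 household and person cleaning scripts.
cd "`repo'/code/nss/nss_lab"
do 01_1_2007_clean_hc.do
do 01_2_2007_clean_pc.do
```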

Requirements

Stata 17 or higher
Basic system path defined by the user
Raw data downloaded from the MoSPI/IDLI website
Internet connection optional (only for installing missing SSC packages)

All required Stata packages (including gtools, reghdfe, grstyle, palettes, distinct, ftools, mipolate, nicelabels, and others) are automatically checked and installed by the script.
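A check-and-install step of this kind can be sketched as follows. This is an illustration rather than the repository's own code, and it covers only packages whose main command shares the package name, because which looks up a command, not a package.

```stata
* Hypothetical sketch: install a package from SSC only if its command is missing.
foreach pkg in gtools reghdfe ftools distinct {
    capture which `pkg'
    if _rc {
        display as text "Installing `pkg' from SSC..."
        ssc install `pkg', replace
    }
    else {
        display as text "`pkg' already installed"
    }
}
```

Packages such as palettes, whose commands carry different names, would need a check on one of their commands instead.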

Users may install additional packages locally, but project scripts should remain unchanged.

Best practices & rules

Do not modify the cleaning scripts — edits will break consistency across years and across users.
Only change the small USER CONFIG block in 00_master_nss_lab.do that sets paths.
Keep raw data outside the repo (e.g., in ~/data/NSS_raw/) and keep outputs in ~/data/NSS_working/.
Add outputs/, raw data folders, and .dta files to .gitignore.

If you must change a script for research, make a personal copy and document the changes — but do not commit those changes to the main pipeline.

Troubleshooting

Missing Packages: If any required package is missing, install it with ssc install, e.g. ssc install gtools, ssc install reghdfe, or ssc install nicelabels.
Path Errors: Make sure your system path uses the correct format:

Windows: "C:/Users/username/Documents/..."

Mac: "/Users/username/Documents/..."

Linux: "/home/username/..."

Large File Warning: For large NSS/ASI files, Stata may require: set excelxlsxlargefile on (this is already included in the script).

Best Practices

Always use the master do-file for full processing.
Use individual scripts (e.g. code\nss\nss_lab\01_variable_clean.do) only for reviewing logic.
Never change script structure, variable definitions, or processing rules.
Store raw and processed data in clearly separated directories.

This project is maintained by the IDLI research and data engineering team.

License

This project uses publicly available NSS datasets. Processed datasets and scripts follow IDLI licensing and documentation standards.

Team

Ananya Kotia – Founder and Director • www.ananyakotia.com
Bharat Singhal – Research Associate
Naila Fatima – Research Associate (2023–25)
Bommi Reddy Meghana Vardhan – Research Manager
Ayush Chaudhary – Research Associate

Contributing

Fork the repository and create a feature branch.
Run the relevant master script(s) and validation routines to ensure harmonized outputs remain consistent.
Submit a pull request summarizing the methodological change, affected rounds, and validation evidence.

Users may submit issues, suggest enhancements, and contribute documentation.

However, core scripts must not be altered under any circumstances to maintain pipeline integrity.

Private raw data are not stored in this repository; only code and documentation needed to reproduce the harmonized releases are included.
