The India Data Lab Initiative (IDLI) harmonizes India’s flagship household and firm surveys so researchers can work with consistent, analysis-ready microdata. By standardizing layouts, reconciling evolving classification systems, and validating outputs against official benchmarks, the lab lowers the fixed cost of using datasets such as the National Sample Surveys (NSS) and the Annual Survey of Industries (ASI).
This repository provides a standardized, reproducible Stata-based pipeline for processing and cleaning publicly available NSS labour, NSS consumption, NSS enterprise, and ASI datasets. (The ASI, NSS enterprise, and NSS consumption datasets are currently being cleaned and validated and will be released soon.) The goal of this project is to make high-quality, fully cleaned, analysis-ready datasets easily accessible to:
- Researchers
- Academicians
- Policy analysts
- Students and data users
The scripts convert publicly available raw data into consistent, harmonized, clean .dta outputs, ensuring that users can directly begin analysis without spending time on data wrangling.
- Fully automated data cleaning and data processing pipeline
- Generates standardized clean .dta files
- Master do-files allow one-click end-to-end execution
- Modular script structure (extract → clean → process → validate)
- Compatibility across systems — users only update their paths, not the code
- Ensures consistency, reproducibility, and minimal manual intervention
This repository, titled idli_ext, contains the full codebase for cleaning, harmonizing, and preparing the NSS labour datasets for the years 1987 to 2011.
idli_ext
└── code
    └── nss
        └── nss_lab
Within the nss_lab folder, there are multiple do files:
- 00_master_nss_lab.do # This is the master script, it runs the entire pipeline
- Household-level cleaning scripts (*_hc) # These are multiple .do files for cleaning household-level NSS labour datasets for the years 1987-2011
- Person-level cleaning scripts (*_pc) # These are multiple .do files for cleaning person-level NSS labour datasets for the years 1987-2011
- Harmonization scripts # These are multiple .do files for district, industry and occupation code harmonization
Additionally, there is:
- A preamble file located in the nss folder that initializes the coding environment to configure paths, install packages, and register shared directories
- A district concordance folder in the nss folder used for harmonizing district identifiers across survey rounds.
NOTE: The household-level cleaning scripts (HC) and person-level cleaning scripts (PC) clean the micro datasets. As a user, you do not need to execute them separately; the master do-file runs the entire workflow.
Similar pipelines and datasets for ASI (Annual Survey of Industries), NSS (National Sample Surveys) consumption, and NSS enterprise will be uploaded to the IDLI website soon, and this README will be updated accordingly.
Raw NSS and ASI datasets (CSV and DTA) are publicly available on the MoSPI website and the IDLI website (https://www.idli.dev/). Download them and store them anywhere on your system, preferably in your Documents folder.
Clone the repository and place it inside the shared directory referenced by your global root (e.g., Dropbox or OneDrive) so the relative paths defined in 00_preamble.do resolve correctly.
In the provided preamble script, enter the local path where the raw datasets are stored, for example "C:/Users/username_as_per_your_system" on Windows or "/Users/username_if_using_a_mac_device/Documents" on a Mac.
You DO NOT need to:
- edit code logic
- modify global macros
- change any processing steps
Only update the required path location where indicated.
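As a rough illustration of the kind of edit involved, the sketch below sets a single path and checks that the folder exists before anything else runs. The macro name `raw_root` is hypothetical; use whatever name the preamble script already defines.

```stata
* Hypothetical sketch: raw_root is an illustrative name, not necessarily
* the macro used in the preamble script.
global raw_root "C:/Users/username/Documents"   // the only line to edit

* Stop early with a clear message if the folder cannot be found.
mata: st_numscalar("raw_ok", direxists(st_global("raw_root")))
if scalar(raw_ok) == 0 {
    display as error "Raw data folder not found: ${raw_root}"
    exit 601
}
```

Forward slashes work in Stata on Windows as well as on Mac and Linux, so the same style of path can be used on every system.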
Open Stata and run the master script: do 00_master_nss_lab.do
This will:
- Check/install required packages (if enabled in preamble)
- Run year-specific household and person cleaning scripts
- Apply variable harmonization and code mappings
- Validate outputs and export .dta and .csv files into the output folder
After the master run completes, go to your output folder and verify that nss_lab_final.dta has been saved.
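The same check can be done from within Stata. This is a minimal sketch that assumes a hypothetical macro `out_dir` pointing at your output folder; the pipeline may use a different name.

```stata
* out_dir is a hypothetical macro; point it at your own output folder.
global out_dir "C:/Users/username/Documents/output"

* confirm file stops with an error if the dataset is missing.
confirm file "${out_dir}/nss_lab_final.dta"

* Print the number of observations and variables without loading the data.
describe using "${out_dir}/nss_lab_final.dta", short
```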
Note: The district concordance spreadsheets in documentation/district_concordance/ are imported directly by the Stata code to reconcile NSS labour district codes before merging or validation.
If you only want to run one round’s person or household cleaning (without running everything), run that specific file after running the preamble:
do 01_1_2007_clean_hc.do // household for 2007
do 01_2_2007_clean_pc.do // person for 2007
Important: Only run individual scripts for inspection or validation. Do not modify them.
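A single-round run might look like the sketch below. It assumes that Stata's working directory is code/nss/nss_lab and that the preamble is the 00_preamble.do file in the parent nss folder; adjust the relative path if your layout differs.

```stata
* Assumption: the working directory is code/nss/nss_lab and the preamble
* sits one level up, in the nss folder.
do "../00_preamble.do"

* Household-level cleaning for the 2007 round only.
do 01_1_2007_clean_hc.do
```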
- Stata 17 or higher
- Basic system path defined by the user
- Raw data downloaded from the MoSPI/IDLI website
- Internet connection optional (only for installing missing SSC packages)
All required Stata packages (including gtools, reghdfe, grstyle, palettes, distinct, ftools, mipolate, nicelabels, and others) are automatically checked and installed by the script.
Users may install additional packages locally, but project scripts should remain unchanged.
- Do not modify the cleaning scripts — edits will break consistency across years and across users.
- Only change the small USER CONFIG block in 00_master_nss_lab.do that sets paths.
- Keep raw data outside the repo (e.g., in ~/data/NSS_raw/) and keep outputs in ~/data/NSS_working/.
- Add outputs/, raw data folders, and .dta files to .gitignore.
If you must change a script for research, make a personal copy and document the changes — but do not commit those changes to the main pipeline.
- Missing Packages: If any required package is missing, install it with ssc install, e.g. ssc install gtools, ssc install reghdfe, ssc install nicelabels
- Path Errors: Make sure your system path uses the correct formatting:
Windows: "C:/Users/username/Documents/..."
Mac: "/Users/username/Documents/..."
Linux: "/home/username/..."
- Large File Warning: For large NSS/ASI files, Stata may require: set excelxlsxlargefile on (This is already included in the script.)
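For the missing-package case, a check-then-install loop avoids reinstalling packages that are already present. This is a sketch and not necessarily how the preamble does it; it assumes each package provides a command with the same name as the package, which holds for the three shown here.

```stata
* Install each package from SSC only if its command cannot be found.
foreach pkg in gtools reghdfe nicelabels {
    capture which `pkg'
    if _rc != 0 {
        display as text "Installing `pkg' from SSC"
        ssc install `pkg', replace
    }
}
```

This step needs an internet connection only when at least one package is actually missing.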
- Always use the master do-file for full processing.
- Use individual scripts (e.g. code\nss\nss_lab\01_variable_clean.do) only for reviewing logic.
- Never change script structure, variable definitions, or processing rules.
- Store raw and processed data in clearly separated directories.
This project is maintained by the IDLI research and data engineering team.
This project uses publicly available NSS datasets. Processed datasets and scripts follow IDLI licensing and documentation standards.
- Ananya Kotia – Founder and Director • www.ananyakotia.com
- Bharat Singhal – Research Associate
- Naila Fatima – Research Associate (2023–25)
- Bommi Reddy Meghana Vardhan – Research Manager
- Ayush Chaudhary – Research Associate
- Fork the repository and create a feature branch.
- Run the relevant master script(s) and validation routines to ensure harmonized outputs remain consistent.
- Submit a pull request summarizing the methodological change, affected rounds, and validation evidence.
Users may:
- Submit issues
- Suggest enhancements
- Contribute documentation
However, core scripts must not be altered under any circumstances to maintain pipeline integrity.
Private raw data are not stored in this repository; only code and documentation needed to reproduce the harmonized releases are included.