This workflow turns survey responses into privacy‑safe synthetic datasets while keeping the statistical patterns intact.
Objective: allow analysis, sharing, and portfolio demonstration without exposing any respondent’s real data.
- The end product is a synthetic dataset that can be shared publicly without privacy risk.
- Built for my current workflow as Survey Lead of the Data Engineering Pilipinas group.
- Comparable distributions per column: side-by-side frequency distribution plots for every column compare the original vs. synthetic data.
- See all the plots here: Distribution Plots of Original and Synthetic Data
- Salary vs. education stacked bar plots show slight deviations from the original dataset; slight manual edits will be made.
- This is expected based on the algorithm.
- The random Salary entries in a few specific ranges will be deleted where Career Stage == Students and Educational Status == Secondary education.
- Note that the original dataset also contains Salary entries for Students (Education) and Students/Career Break (Career Stage).
- Raw survey data often contains personally identifiable or sensitive information (e.g., salary).
- Directly sharing it — even internally — can breach trust or compliance rules.
- This workflow uses a Bayesian network approach (`bnlearn` in R) to model relationships between variables, then generates synthetic records that mimic the original dataset’s structure and distributions. As an added step, rows in the synthetic dataset that exactly match rows in the original dataset are dropped.
- You can read more about bnlearn here: bnlearn documentation
- See the workflow for generating the synthetic dataset
- See the workflow for generating the plots
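The core idea — fit a discrete Bayesian network as conditional frequency tables, sample new rows from it, then drop exact matches against the original — can be sketched in miniature. This is stdlib Python for illustration only, not the workflow's actual R/`bnlearn` code, and the tiny two-column "survey" and its values are invented:

```python
import random
from collections import Counter, defaultdict

# Toy "survey": two categorical columns, with edu as salary's parent in the network.
real = [("BS", "low"), ("BS", "mid"), ("MS", "mid"), ("MS", "high"), ("BS", "low")]

# "Fit" the network: marginal P(edu) and conditional P(salary | edu),
# estimated as simple frequency tables (as bn.fit does for discrete data).
p_edu = Counter(e for e, _ in real)
p_sal_given_edu = defaultdict(Counter)
for e, s in real:
    p_sal_given_edu[e][s] += 1

def sample_row(rng):
    """Ancestral sampling: draw edu from its marginal, then salary from its CPT."""
    edu = rng.choices(list(p_edu), weights=p_edu.values())[0]
    cpt = p_sal_given_edu[edu]
    sal = rng.choices(list(cpt), weights=cpt.values())[0]
    return (edu, sal)

rng = random.Random(42)
synthetic = [sample_row(rng) for _ in range(len(real))]

# Privacy step: drop synthetic rows that exactly match a real record.
# (With only two columns most rows collide; a real survey's many columns
# make exact matches much rarer.)
real_set = set(real)
synthetic_clean = [row for row in synthetic if row not in real_set]
```

The sampling order follows the network's edge direction: parents are drawn before children so each child is sampled from the CPT row selected by its parent's value.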
- Load & Preprocess
- Read raw survey CSV, already cleaned of identifiers.
- Create an `age_grp` factor from numeric age.
- Remove non‑modeling columns.
- Normalize text encoding and replace blanks with `"CODEASBLANK"`.
- Convert all variables to factors for modeling consistency.
- Model & Synthesize
- Learn Bayesian network structure via Hill‑Climbing. Below is the resulting network.
- Fit conditional probability tables.
- Set the target duplicate count (20) and the maximum iterations (2000), then start the loop.
- Generate a synthetic dataset with the same number of rows as the original dataset.
- Count exact duplicates between the synthetic and original datasets; if the count exceeds 20, loop back to the generation step.
- Once the duplicate count meets the target, save the seed and the synthetic dataset.
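The seed-search loop above can be sketched like this. It is a stdlib Python illustration: `generate()` is a hypothetical stand-in for sampling the fitted network (`bnlearn::rbn` in the real workflow), the toy data is invented, and the target is scaled down from the workflow's 20 to fit the example:

```python
import random

# Toy original dataset and the value levels the sampler can emit (invented).
real = [("BS", "low"), ("MS", "high"), ("BS", "mid"), ("MS", "mid")]
real_set = set(real)
levels_edu, levels_sal = ["BS", "MS"], ["low", "mid", "high"]

def generate(seed, n):
    """Stand-in for drawing n rows from the fitted network under a fixed seed."""
    rng = random.Random(seed)
    return [(rng.choice(levels_edu), rng.choice(levels_sal)) for _ in range(n)]

TARGET, MAX_ITER = 2, 2000   # the workflow uses target = 20, max iters = 2000
found = None
for seed in range(MAX_ITER):
    synthetic = generate(seed, len(real))
    dup_count = sum(row in real_set for row in synthetic)
    if dup_count <= TARGET:  # few enough exact overlaps: keep this seed
        found = (seed, synthetic)
        break

assert found is not None, "no seed met the target within MAX_ITER"
seed, synthetic = found   # saving the seed makes the accepted draw reproducible
```

Saving the winning seed is what makes the accepted synthetic dataset reproducible later without storing intermediate rejected draws.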
- Audit for Privacy
- Tag datasets as `real` or `synthetic`.
- Combine them and check for exact record matches across all factor combinations.
- Export the duplication check (synthetic rows flagged for deletion) as a CSV file.
- Drop the `row_id`s of synthetic rows identified as exact matches to the original dataset.
- Export both the original and the cleaned synthetic data frames for the frequency distribution plots.
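The audit steps can be sketched as below, again as a stdlib Python illustration of the R workflow; the two-column toy rows and the CSV column names are invented:

```python
import csv
import io

# Toy datasets (invented); each tuple is one full factor combination.
real = [("BS", "low"), ("MS", "high")]
synthetic = [("BS", "low"), ("MS", "mid")]

# Tag every row with its source, then flag synthetic rows whose exact
# factor combination also appears in the real data.
tagged = [("real", *r) for r in real] + [("synthetic", *s) for s in synthetic]
real_set = set(real)
matches = [(i, row) for i, row in enumerate(synthetic) if row in real_set]

# Export the duplication check (synthetic row ids slated for deletion) as CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["row_id", "edu", "salary"])
for i, row in matches:
    writer.writerow([i, *row])

# Drop the flagged row ids from the synthetic dataset.
flagged = {i for i, _ in matches}
synthetic_clean = [row for i, row in enumerate(synthetic) if i not in flagged]
```

Exporting the flagged rows before dropping them leaves an audit trail showing exactly which records were removed and why.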
- Frequency Distributions
- Compute per‑variable counts and proportions for both real and synthetic datasets.
- Combine into a single table for plotting.
- Export the combined long table of frequencies as a CSV file.
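Building the long frequency table can be sketched as follows (stdlib Python illustration; the variable names and toy rows are invented). Each output row is one (source, variable, level) combination with its count and proportion, which is the shape the side-by-side plots need:

```python
from collections import Counter

# Toy datasets as lists of dicts (invented values).
real = [{"edu": "BS", "salary": "low"}, {"edu": "MS", "salary": "mid"}]
synthetic = [{"edu": "BS", "salary": "mid"}, {"edu": "BS", "salary": "low"}]

def freq_long(rows, source):
    """Per-variable counts and proportions, in long format for plotting."""
    out = []
    n = len(rows)
    for var in rows[0]:
        counts = Counter(r[var] for r in rows)
        for level, cnt in sorted(counts.items()):
            out.append({"source": source, "variable": var,
                        "level": level, "count": cnt, "prop": cnt / n})
    return out

# One long table covering both datasets, ready to facet by variable
# and fill/group by source in the plots.
combined = freq_long(real, "real") + freq_long(synthetic, "synthetic")
```

Using proportions rather than raw counts keeps the comparison fair even if the two datasets ever differ in row count.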
- Visualization & Export
- Loop through all variables, generating side‑by‑side bar plots (Original vs Synthetic).
- Save plots as PNGs and embed in an HTML report for easy review.
- Quick checks: Salary by Career Stage and by Education Status.
- Manually edit the synthetic dataset to delete the few random Salary entries where Career Stage = Students and Educational Status = Secondary education.
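That final manual correction amounts to a simple filter. The sketch below is stdlib Python for illustration; the column names and the set of flagged salary ranges are hypothetical placeholders for the "few specific ranges" mentioned above:

```python
# Toy synthetic rows (invented); only CODEASBLANK comes from the workflow.
rows = [
    {"career_stage": "Students", "education": "Secondary education",
     "salary": "20k-40k"},
    {"career_stage": "Students", "education": "Secondary education",
     "salary": "CODEASBLANK"},
    {"career_stage": "Professional", "education": "Bachelor",
     "salary": "40k-60k"},
]

FLAGGED_RANGES = {"20k-40k"}  # hypothetical stand-in for the flagged ranges

def is_flagged(row):
    """Flag the combination being hand-corrected: secondary-education students
    carrying a Salary value in one of the flagged ranges."""
    return (row["career_stage"] == "Students"
            and row["education"] == "Secondary education"
            and row["salary"] in FLAGGED_RANGES)

edited = [r for r in rows if not is_flagged(r)]
```

Blank (`CODEASBLANK`) salaries for students are deliberately kept, matching the note that the original dataset also contains Salary entries for students.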