This workflow turns survey responses into privacy‑safe synthetic datasets while keeping the statistical patterns intact.
Objective: allow analysis, sharing, and portfolio demonstration without exposing any respondent’s real data.
- The end product is a synthetic dataset that can be shared publicly without privacy risk.
- Built for my current workflow as Survey Lead of the Data Engineering Pilipinas group.
- Comparable distributions per column: side-by-side frequency distribution plots for every column compare the original vs. synthetic data.
- See all the plots here: Distribution Plots of Original and Synthetic Data
- Salary vs. education stacked bar plots show slight deviations from the original dataset; slight manual edits will be made.
- This is expected based on the algorithm.
- The random Salary entries in a few specific ranges will be deleted where Career Stage == Students and Educational Status == Secondary education.
- Note that the original dataset also contains Salary entries for Students (Education) and Students/Career Break (Career Stage).
- Raw survey data often contains personally identifiable or sensitive information (e.g., salary).
- Directly sharing it — even internally — can breach trust or compliance rules.
- This workflow uses a Bayesian network approach (`bnlearn` in R) to model relationships between variables, then generates synthetic records that mimic the original dataset’s structure and distributions. As an added step, rows in the synthetic dataset that exactly match rows in the original dataset are dropped.
- You can read more about bnlearn here: bnlearn documentation
- See the workflow for generating the synthetic dataset
- See the workflow for generating the plots
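The core idea — fit a discrete Bayesian network as conditional frequency tables, sample new rows from it, then drop exact matches against the original — can be sketched in miniature. This is stdlib Python for illustration only, not the workflow's actual R/`bnlearn` code, and the tiny two-column "survey" and its values are invented:

```python
import random
from collections import Counter, defaultdict

# Toy "survey": two categorical columns, with edu as salary's parent in the network.
real = [("BS", "low"), ("BS", "mid"), ("MS", "mid"), ("MS", "high"), ("BS", "low")]

# "Fit" the network: marginal P(edu) and conditional P(salary | edu),
# estimated as simple frequency tables (as bn.fit does for discrete data).
p_edu = Counter(e for e, _ in real)
p_sal_given_edu = defaultdict(Counter)
for e, s in real:
    p_sal_given_edu[e][s] += 1

def sample_row(rng):
    """Ancestral sampling: draw edu from its marginal, then salary from its CPT."""
    edu = rng.choices(list(p_edu), weights=p_edu.values())[0]
    cpt = p_sal_given_edu[edu]
    sal = rng.choices(list(cpt), weights=cpt.values())[0]
    return (edu, sal)

rng = random.Random(42)
synthetic = [sample_row(rng) for _ in range(len(real))]

# Privacy step: drop synthetic rows that exactly match a real record.
# (With only two columns most rows collide; a real survey's many columns
# make exact matches much rarer.)
real_set = set(real)
synthetic_clean = [row for row in synthetic if row not in real_set]
```

The sampling order follows the network's edge direction: parents are drawn before children so each child is sampled from the CPT row selected by its parent's value.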
- Load & Preprocess
- Read raw survey CSV, already cleaned of identifiers.
- Create an `age_grp` factor from numeric age.
- Remove non‑modeling columns.
- Normalize text encoding and replace blanks with `"CODEASBLANK"`.
- Convert all variables to factors for modeling consistency.
- Model & Synthesize
- Learn Bayesian network structure via Hill‑Climbing. Below is the resulting network.
- Fit conditional probability tables.
- Set the target duplicate count (20) and the maximum iterations (2000), then start the loop.
- Generate a synthetic dataset with the same number of rows as the original dataset.
- Count exact duplicates between the synthetic and original datasets; if the count exceeds 20, loop back to the generation step.
- Once the duplicate count meets the target, save the seed and the synthetic dataset.
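The seed-search loop above can be sketched like this. It is a stdlib Python illustration: `generate()` is a hypothetical stand-in for sampling the fitted network (`bnlearn::rbn` in the real workflow), the toy data is invented, and the target is scaled down from the workflow's 20 to fit the example:

```python
import random

# Toy original dataset and the value levels the sampler can emit (invented).
real = [("BS", "low"), ("MS", "high"), ("BS", "mid"), ("MS", "mid")]
real_set = set(real)
levels_edu, levels_sal = ["BS", "MS"], ["low", "mid", "high"]

def generate(seed, n):
    """Stand-in for drawing n rows from the fitted network under a fixed seed."""
    rng = random.Random(seed)
    return [(rng.choice(levels_edu), rng.choice(levels_sal)) for _ in range(n)]

TARGET, MAX_ITER = 2, 2000   # the workflow uses target = 20, max iters = 2000
found = None
for seed in range(MAX_ITER):
    synthetic = generate(seed, len(real))
    dup_count = sum(row in real_set for row in synthetic)
    if dup_count <= TARGET:  # few enough exact overlaps: keep this seed
        found = (seed, synthetic)
        break

assert found is not None, "no seed met the target within MAX_ITER"
seed, synthetic = found   # saving the seed makes the accepted draw reproducible
```

Saving the winning seed is what makes the accepted synthetic dataset reproducible later without storing intermediate rejected draws.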
- Audit for Privacy
- Tag datasets as `real` or `synthetic`.
- Combine them and check for exact record matches across all factor combinations.
- Export the duplication check (synthetic rows flagged for deletion) as a CSV file.
- Drop the `row_id`s of synthetic rows identified as exact matches to the original dataset.
- Export both the original and the cleaned synthetic data frames for the frequency distribution plots.
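The audit steps can be sketched as below, again as a stdlib Python illustration of the R workflow; the two-column toy rows and the CSV column names are invented:

```python
import csv
import io

# Toy datasets (invented); each tuple is one full factor combination.
real = [("BS", "low"), ("MS", "high")]
synthetic = [("BS", "low"), ("MS", "mid")]

# Tag every row with its source, then flag synthetic rows whose exact
# factor combination also appears in the real data.
tagged = [("real", *r) for r in real] + [("synthetic", *s) for s in synthetic]
real_set = set(real)
matches = [(i, row) for i, row in enumerate(synthetic) if row in real_set]

# Export the duplication check (synthetic row ids slated for deletion) as CSV.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["row_id", "edu", "salary"])
for i, row in matches:
    writer.writerow([i, *row])

# Drop the flagged row ids from the synthetic dataset.
flagged = {i for i, _ in matches}
synthetic_clean = [row for i, row in enumerate(synthetic) if i not in flagged]
```

Exporting the flagged rows before dropping them leaves an audit trail showing exactly which records were removed and why.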
- Frequency Distributions
- Compute per‑variable counts and proportions for both real and synthetic datasets.
- Combine into a single table for plotting.
- Export the combined long table of frequencies as a CSV file.
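Building the long frequency table can be sketched as follows (stdlib Python illustration; the variable names and toy rows are invented). Each output row is one (source, variable, level) combination with its count and proportion, which is the shape the side-by-side plots need:

```python
from collections import Counter

# Toy datasets as lists of dicts (invented values).
real = [{"edu": "BS", "salary": "low"}, {"edu": "MS", "salary": "mid"}]
synthetic = [{"edu": "BS", "salary": "mid"}, {"edu": "BS", "salary": "low"}]

def freq_long(rows, source):
    """Per-variable counts and proportions, in long format for plotting."""
    out = []
    n = len(rows)
    for var in rows[0]:
        counts = Counter(r[var] for r in rows)
        for level, cnt in sorted(counts.items()):
            out.append({"source": source, "variable": var,
                        "level": level, "count": cnt, "prop": cnt / n})
    return out

# One long table covering both datasets, ready to facet by variable
# and fill/group by source in the plots.
combined = freq_long(real, "real") + freq_long(synthetic, "synthetic")
```

Using proportions rather than raw counts keeps the comparison fair even if the two datasets ever differ in row count.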
- Visualization & Export
- Loop through all variables, generating side‑by‑side bar plots (Original vs Synthetic).
- Save plots as PNGs and embed in an HTML report for easy review.
- Quick checks: Salary by Career Stage and by Education Status.
- Manually edit the synthetic dataset to delete the few random Salary entries where Career Stage = Students and Educational Status = Secondary education.
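That final manual correction amounts to a simple filter. The sketch below is stdlib Python for illustration; the column names and the set of flagged salary ranges are hypothetical placeholders for the "few specific ranges" mentioned above:

```python
# Toy synthetic rows (invented); only CODEASBLANK comes from the workflow.
rows = [
    {"career_stage": "Students", "education": "Secondary education",
     "salary": "20k-40k"},
    {"career_stage": "Students", "education": "Secondary education",
     "salary": "CODEASBLANK"},
    {"career_stage": "Professional", "education": "Bachelor",
     "salary": "40k-60k"},
]

FLAGGED_RANGES = {"20k-40k"}  # hypothetical stand-in for the flagged ranges

def is_flagged(row):
    """Flag the combination being hand-corrected: secondary-education students
    carrying a Salary value in one of the flagged ranges."""
    return (row["career_stage"] == "Students"
            and row["education"] == "Secondary education"
            and row["salary"] in FLAGGED_RANGES)

edited = [r for r in rows if not is_flagged(r)]
```

Blank (`CODEASBLANK`) salaries for students are deliberately kept, matching the note that the original dataset also contains Salary entries for students.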