SandyGCabanes/Survey-Data-Privacy-Protection-Using-R-and-Bayesian-Networks

Synthetic Data Generation & Validation Using R's bnlearn Package

Why This Project Exists:

This workflow turns survey responses into privacy‑safe synthetic datasets while keeping the statistical patterns intact.
Objective: allow analysis, sharing, and portfolio demonstration without exposing any respondent’s real data.


What This Project Produces and Who It Is For:

  • The end product is a synthetic dataset that can be shared publicly without any privacy risk.
  • It supports my current workflow as Survey Lead for the Data Engineering Pilipinas group.

Findings

  • Comparable distributions per column, based on side‑by‑side frequency distribution plots that compare original and synthetic data for every column.

  • See all the plots: Distribution Plots of Original and Synthetic Data

  • Age Group: Original vs. Synthetic

  • Educational Status: Original vs. Synthetic

  • Salary vs. education stacked bar plots indicate slight deviations from the original dataset. Minor manual edits will be made.

  • Such deviations are expected, because synthetic rows are randomly sampled from the fitted network rather than copied from the original data.

  • The few stray Salary entries in specific ranges will be deleted for rows where Career Stage == Students and Educational Status == Secondary education.

  • Note that the original dataset also contains Salary entries for Students (Education) or Students/Career Break (Career Stage).

  • Original salary vs. education splits

  • Synthetic salary vs. education splits
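
The comparisons behind these plots come down to computing, for each column or pair of columns, the share of rows per category in both datasets. The project does this in R. The Python sketch below only illustrates the logic using the standard library, and the column names and toy rows are made up for the example rather than taken from the survey.

```python
from collections import Counter

def proportions(rows, columns):
    """Return {tuple of values: share of rows} for the given column names."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def compare(original, synthetic, columns):
    """Long-format comparison: (values, original share, synthetic share)."""
    p_orig = proportions(original, columns)
    p_syn = proportions(synthetic, columns)
    keys = sorted(set(p_orig) | set(p_syn))
    return [(key, p_orig.get(key, 0.0), p_syn.get(key, 0.0)) for key in keys]

# Hypothetical toy rows, not real survey data.
original = [
    {"education": "Bachelor", "salary": "high"},
    {"education": "Bachelor", "salary": "low"},
    {"education": "Secondary", "salary": "CODEASBLANK"},
    {"education": "Secondary", "salary": "low"},
]
synthetic = [
    {"education": "Bachelor", "salary": "high"},
    {"education": "Bachelor", "salary": "high"},
    {"education": "Secondary", "salary": "low"},
    {"education": "Secondary", "salary": "high"},
]

# Single column: the data behind one side-by-side bar plot.
for key, share_orig, share_syn in compare(original, synthetic, ["education"]):
    print(key, share_orig, share_syn)

# Two columns: the data behind a salary vs. education split.
for key, share_orig, share_syn in compare(original, synthetic, ["education", "salary"]):
    print(key, share_orig, share_syn)
```

A combination with an original share of 0.0 and a positive synthetic share is the kind of entry the manual edit above targets.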


Background

  • Raw survey data often contains personally identifiable or sensitive information (e.g., salary).
  • Sharing it directly, even internally, can breach trust or compliance rules.
  • This workflow uses a Bayesian network approach (bnlearn in R) to model relationships between variables, then generates synthetic records that mimic the original dataset’s structure and distributions. As an added step, synthetic rows that exactly match rows in the original dataset are dropped.
  • You can read more about bnlearn in the bnlearn documentation.
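
To make the idea concrete, here is a minimal sketch of what a discrete Bayesian network does once its structure is known: it estimates each variable's conditional probability table by counting, then samples new records one variable at a time. This is not the project's code, which uses bnlearn in R and also learns the structure from the data. The Python below uses only the standard library, and the two-node structure, column names, and toy rows are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

# Assumed structure: education -> salary. Nodes are listed so that
# parents always come before their children.
PARENTS = {"education": [], "salary": ["education"]}

def fit_cpts(rows, parents):
    """Estimate P(node | parents) as raw counts per parent configuration."""
    cpts = {}
    for node, pa in parents.items():
        table = defaultdict(Counter)
        for row in rows:
            table[tuple(row[p] for p in pa)][row[node]] += 1
        cpts[node] = table
    return cpts

def sample_row(cpts, parents, rng):
    """Draw one record, sampling each node given its already-drawn parents."""
    row = {}
    for node, pa in parents.items():
        counts = cpts[node][tuple(row[p] for p in pa)]
        values = list(counts)
        weights = [counts[v] for v in values]
        row[node] = rng.choices(values, weights=weights, k=1)[0]
    return row

# Hypothetical toy rows, not real survey data.
rows = [
    {"education": "Bachelor", "salary": "high"},
    {"education": "Bachelor", "salary": "low"},
    {"education": "Secondary", "salary": "low"},
    {"education": "Secondary", "salary": "low"},
]

cpts = fit_cpts(rows, PARENTS)
rng = random.Random(42)  # fixed seed so the draw can be reproduced
synthetic = [sample_row(cpts, PARENTS, rng) for _ in range(len(rows))]
print(synthetic)
```

With only two fully connected variables every sampled pair already occurs in the original data. In a larger network, conditional independence assumptions let the sampler produce combinations that never appeared together in the source rows.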

Workflow:

  1. Load & Preprocess

    • Read raw survey CSV, already cleaned of identifiers.
    • Create an age_grp factor from numeric age.
    • Remove non‑modeling columns.
    • Normalize text encoding and replace blanks with "CODEASBLANK".
    • Convert all variables to factors for modeling consistency.
  2. Model & Synthesize

    • Learn the Bayesian network structure via hill‑climbing. The resulting network is shown in the figure "Bayesian network graph".
    • Fit conditional probability tables.
    • Set the target duplicate count to 20 and the maximum number of iterations to 2000, then start the loop.
    • Generate a synthetic dataset with the same number of rows as the original dataset.
    • Count the synthetic rows that duplicate original rows. If the count exceeds 20, loop back to the generation step; stop once the target is met.
    • Once the target is met, save the seed and the synthetic dataset.
  3. Audit for Privacy

    • Tag datasets as real or synthetic.
    • Combine and check for exact record matches across all factor combinations.
    • Export the duplication check, listing the synthetic rows to delete, as a CSV file.
    • Drop the synthetic rows whose row_ids were identified as matching records in the original dataset.
    • Export both the original and the cleaned synthetic data frames for the frequency distribution plots.
  4. Frequency Distributions

    • Compute per‑variable counts and proportions for both real and synthetic datasets.
    • Combine into a single table for plotting.
    • Export the combined long table of frequencies as a CSV file.
  5. Visualization & Export

    • Loop through all variables, generating side‑by‑side bar plots (Original vs Synthetic).
    • Save plots as PNGs and embed in an HTML report for easy review.
    • Quick checks: Salary by Career Stage and by Education Status.
    • Manually edit the synthetic dataset to delete the few stray Salary entries where Career Stage = Students and Educational Status = Secondary education.
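
The generate, check, and drop logic of steps 2 and 3 can be sketched as follows. This is an illustration in Python using only the standard library, not the project's R code. The `toy_sampler` function is a hypothetical stand-in for drawing one row from the fitted network, the toy rows are invented, and the sketch assumes that the duplicate target counts as "met" when the count is at most the target.

```python
import random

def row_key(row):
    """Hashable form of a record, independent of column order."""
    return tuple(sorted(row.items()))

def count_duplicates(original, synthetic):
    """Number of synthetic rows that exactly match some original row."""
    seen = {row_key(row) for row in original}
    return sum(row_key(row) in seen for row in synthetic)

def generate_until_target(original, sampler, target=20, max_iter=2000):
    """Try seeds 1..max_iter; return (seed, rows) for the first acceptable draw."""
    for seed in range(1, max_iter + 1):
        rng = random.Random(seed)
        synthetic = [sampler(rng) for _ in range(len(original))]
        # Assumption: the target is met when duplicates <= target.
        if count_duplicates(original, synthetic) <= target:
            return seed, synthetic
    raise RuntimeError("no draw met the duplicate target within max_iter")

def drop_matches(original, synthetic):
    """Remove synthetic rows that exactly match an original row."""
    seen = {row_key(row) for row in original}
    return [row for row in synthetic if row_key(row) not in seen]

def toy_sampler(rng):
    """Hypothetical stand-in for sampling one row from a fitted network."""
    return {
        "education": rng.choice(["Bachelor", "Secondary"]),
        "salary": rng.choice(["low", "mid", "high"]),
    }

# Hypothetical toy rows, not real survey data.
original = [
    {"education": "Bachelor", "salary": "high"},
    {"education": "Bachelor", "salary": "mid"},
    {"education": "Secondary", "salary": "low"},
    {"education": "Secondary", "salary": "low"},
]

seed, synthetic = generate_until_target(original, toy_sampler, target=2, max_iter=200)
cleaned = drop_matches(original, synthetic)
print("seed:", seed, "rows kept:", len(cleaned), "of", len(synthetic))
```

Saving the seed makes the accepted draw reproducible, and dropping the remaining matches afterwards means no synthetic row is an exact copy of a respondent's record.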
