The following tools are selected based on GitHub activity and sorted by GitHub star count in descending order. They cover the main use cases for sensitive data discovery: metadata cataloging with lineage, agentless scanning, and API-based detection of PII, PCI data, and credentials at rest.
Read more: Sensitive data discovery & classification tools, DLP software.
Administrative features
| Tool | Graphical dashboard | Search-based | Data lineage | Federated database system |
|---|---|---|---|---|
| DataHub | ✅ | ✅ | ✅ | ✅ |
| Apache Atlas | ✅ | ✅ | ✅ | ❌ |
| Marquez | ✅ | ✅ | ✅ | Not shared |
| OpenDLP | ❌ | ❌ | ❌ | ❌ |
| Piiano Vault – ReDiscovery | ❌ | Not shared | ❌ | ❌ |
| Nightfall AI – Sensitive data scanner | ✅ | ✅ | ❌ | ❌ |
Feature descriptions:
- Graphical dashboard – allows users to visualize their data findings.
- Search-based functionality – allows searching for data assets.
- Data lineage – allows users to visualize how data is generated, transformed, transmitted, and used across a system over time.
- Federated database system – maps multiple autonomous database systems into a single federated database.
These capabilities (especially data lineage and search) allow businesses to:
- Uncover the location of personally identifiable information (PII), payment card industry (PCI) data, and other sensitive records stored across multiple databases, apps, and user endpoints.
- Comply with regulatory data protection and privacy standards such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).
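Data lineage can be pictured as a directed graph of datasets and transformations. The short sketch below walks such a graph to trace a downstream report back to the source tables holding PII; the pipeline and node names are purely illustrative, not taken from any tool in this list.

```python
# Hypothetical pipeline: each node maps to the list of nodes it reads from.
lineage = {
    "crm.users": [],                           # source table holding PII
    "etl.clean_users": ["crm.users"],          # transformation step
    "warehouse.users_dim": ["etl.clean_users"],
    "report.marketing": ["warehouse.users_dim"],
}

def upstream_sources(node, graph):
    """Walk the lineage graph to collect every upstream ancestor of a node."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, []))
    return seen

# A report that touches PII can be traced back to its source table.
print(upstream_sources("report.marketing", lineage))
```

This upstream walk is the core question lineage tools answer when you need to know where a given piece of personal data originated.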
Data security features
Feature descriptions:
- Data masking – hides data by modifying its original letters and numbers, so that it has no value to unauthorized intruders while remaining usable for authorized employees.
- Data loss prevention (DLP) – detects potential data breaches and prevents them by blocking sensitive data.
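Data masking can be illustrated with a short, generic sketch (not tied to any tool above): a payment card number is reduced to its last four digits, which keeps it recognizable for support staff while making it useless to an intruder.

```python
import re

def mask_pan(pan: str) -> str:
    """Mask a payment card number, keeping only the last four digits."""
    digits = re.sub(r"\D", "", pan)   # strip spaces and dashes
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_pan("4111 1111 1111 1234"))  # ************1234
```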
Categories and GitHub stars
Tool selection & sorting:
- Popularity: 10+ GitHub stars.
- Update release: At least one update released within the week prior to our evaluation in November 2024.
- Sorting: Tools are sorted by GitHub stars in descending order.
DataHub
DataHub is an open-source, unified platform for sensitive data discovery, observability, and governance, built by Acryl Data and LinkedIn. It is also commercially offered by Acryl Data as a cloud-hosted SaaS offering.
Key features:
- Column-level data lineage: traces data flow from source to consumption across platforms.
- AI-assisted data quality: anomaly detection flags data quality issues automatically.
- Extensibility: REST APIs, Python SDK, and LangChain integration for building agents with access to DataHub metadata.
- 80+ native connectors: Snowflake, BigQuery, Redshift, Hive, Athena, Postgres, MySQL, SQL Server, Trino, Looker, Power BI, Tableau, Okta, LDAP, S3, Delta Lake, and others.
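As an example of DataHub's extensibility, the sketch below builds a search request for its GraphQL endpoint (commonly served at `/api/graphql` on the GMS host). The field names follow DataHub's documented search schema but may vary between versions, so treat them as an approximation.

```python
import json

# Hedged sketch: construct a GraphQL search over DataHub metadata for
# datasets whose name or fields mention "ssn".
query = """
query search($input: SearchInput!) {
  search(input: $input) {
    searchResults { entity { urn type } }
  }
}
"""
payload = {
    "query": query,
    "variables": {"input": {"type": "DATASET", "query": "ssn", "start": 0, "count": 10}},
}
body = json.dumps(payload)
# POST `body` to http://<gms-host>:8080/api/graphql (with an auth token),
# then inspect searchResults for matching dataset URNs.
print(json.loads(body)["variables"]["input"]["query"])
```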
Consideration: DataHub’s architecture runs multiple interconnected services (GMS, MCE consumer, MAE consumer, search index, graph store). Production deployments typically require Kubernetes. Setup complexity is the most frequently cited pain point in the community.
Apache Atlas
Apache Atlas is an open-source tool for metadata management and governance, designed primarily for Hadoop and big data ecosystems. It supports classification, lineage tracking, and search across data assets in environments built on Hive, HBase, Kafka, Spark, Sqoop, and Storm.
Key features
- Dynamic classification: Apache Atlas allows creating custom classifications such as PII (Personally Identifiable Information), EXPIRES_ON, DATA_QUALITY, and SENSITIVE.
- Metadata types: The platform provides pre-defined metadata types for Hadoop and non-Hadoop environments. This allows users to manage metadata for several data sources, such as HBase, Hive, Sqoop, Kafka, and Storm.
- SQL-like query language (DSL): The platform supports a domain-specific language (DSL) that provides SQL-like query functionality to search entities. This makes it accessible for users familiar with SQL.
- Integration with external tools: Apache Hive, Apache Spark, Kafka, and Presto, making it adaptable for big data environments.
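To give a feel for the DSL, the sketch below composes an Atlas search URL against the v2 REST endpoint. The host, port, and the `ssn*` name filter are assumptions for illustration; check the query syntax against your Atlas version's DSL reference.

```python
from urllib.parse import urlencode

# Hedged sketch: an Atlas DSL search issued over REST. Adjust the entity
# type and filter to your own type system.
base = "http://atlas-host:21000/api/atlas/v2/search/dsl"
dsl = 'hive_table where name like "ssn*" select name, owner'

url = f"{base}?{urlencode({'query': dsl, 'limit': 25})}"
print(url)
# Issue a GET against `url` (with Atlas credentials) to receive the
# matching entities as JSON.
```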
Considerations:
- Configuring Atlas in a multi-cloud environment is complex, particularly when bridging AWS, Azure, and Databricks APIs. Atlas does not have native connectors for these platforms, so additional configuration is required to record lineage from Amazon Redshift or Azure Synapse.
- Cloud-native cataloging services (e.g., AWS Glue) may offer lower-overhead lineage tracking for teams already committed to a single cloud provider.
- Atlas is best suited to organizations running Hadoop, Spark, and Hive at scale. Teams without a Hadoop-centric stack will find its architecture adds unnecessary complexity.
Marquez
Marquez is an open-source data catalog for collecting, aggregating, and visualizing metadata from a data ecosystem. It provides a Web UI and REST API for browsing datasets, understanding their dependencies, and tracking changes through data pipelines.
- Search datasets: Users can easily search for datasets, view their attributes, and understand their dependencies across the data ecosystem.
- Visualize lineage: The lineage graph in Marquez provides a clear, interactive view of how datasets are connected and transformed through workflows. This is crucial for understanding data pipelines, tracing errors, and ensuring data reliability.
- Centralized metadata repository: Marquez aggregates metadata from diverse sources, consolidating it into a single system for easy access and management.
Example workflow: To inspect lineage metadata, navigate to the Marquez UI and search for a job (e.g., etl_delivery_7_days) using the search box. From the job’s output dataset (public.delivery_7_days), you can view the dataset name, schema, description, and upstream inputs.
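The same lineage can also be fetched programmatically from Marquez's REST API via its lineage endpoint. The sketch below builds the request URL; the `food_delivery` namespace is assumed from Marquez's example dataset, so substitute your own namespace and dataset name.

```python
from urllib.parse import urlencode

# Hedged sketch: Marquez identifies lineage nodes as "<kind>:<namespace>:<name>".
node_id = "dataset:food_delivery:public.delivery_7_days"
url = "http://localhost:5000/api/v1/lineage?" + urlencode({"nodeId": node_id})
print(url)
# GET this URL to receive the lineage graph (nodes and edges) as JSON.
```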
Piiano Vault – ReDiscovery
Piiano Vault is a privacy vault for storing and securing sensitive personal data within your own cloud environment. Rather than scanning existing databases for sensitive data, Vault is designed as the authoritative store for the most sensitive fields (credit card numbers, bank account numbers, national IDs such as SSNs, names, emails, and phone numbers), installed alongside your existing application databases.
Vault is deployed within your architecture via Docker or Kubernetes (Helm charts available). SDKs are available for Python (Django ORM), TypeScript, Java, and Go. The vault-releases repository was last updated in August 2025.
Use case distinction: Vault is not a data discovery scanner. It is a structured storage system for sensitive data that organizations want to centralize and protect, not a tool for finding sensitive data already scattered across existing systems.
Nightfall
Nightfall is a commercial AI-native DLP platform, not a fully open-source tool. Its GitHub repositories include open-source scanner scripts (Apache 2.0) that use Nightfall’s API to scan directories, exports, and backups. Executing scans requires a Nightfall API key and calls Nightfall’s commercial detection engine. The free tier allows up to 100 scans per month on public and private repositories.
Open-source scanner capabilities (free tier):
- Scans the full commit history of public and private repositories.
- Detects credentials, secrets, PII, and credit card numbers.
- Runs up to 100 scans per month.
Distinct feature: Nightfall can send alerts to Slack when violations are detected and push results to a SIEM, reporting tool, or webhook endpoint.
Example use case: Scan a Salesforce backup to detect sensitive data at rest. The scanner (1) submits backup files to Nightfall’s API for scanning, (2) runs a local webhook server to receive results, and (3) exports findings to a CSV file.
Nightfall provides a temporarily signed S3 URL from which the sensitive findings it identified can be retrieved.
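Step (3) of the workflow above, exporting findings to CSV, can be sketched as follows. The field names are illustrative, not Nightfall's exact response schema; a real script would populate `findings` from the webhook results.

```python
import csv
import io

# Hypothetical findings as they might arrive from the webhook server.
findings = [
    {"detector": "CREDIT_CARD_NUMBER", "file": "accounts.csv", "snippet": "****1234"},
    {"detector": "US_SOCIAL_SECURITY_NUMBER", "file": "contacts.csv", "snippet": "***-**-6789"},
]

# Flatten the findings into CSV rows for reporting.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["detector", "file", "snippet"])
writer.writeheader()
writer.writerows(findings)
print(buf.getvalue())
```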
Cem's work has been cited by leading global publications including Business Insider, Forbes, and the Washington Post; global firms like Deloitte and HPE; NGOs like the World Economic Forum; and supranational organizations like the European Commission. You can see more reputable companies and resources that referenced AIMultiple.
Throughout his career, Cem served as a tech consultant, tech buyer and tech entrepreneur. He advised enterprises on their technology decisions at McKinsey & Company and Altman Solon for more than a decade. He also published a McKinsey report on digitalization.
He led technology strategy and procurement of a telco while reporting to the CEO. He has also led commercial growth of deep tech company Hypatos that reached a 7 digit annual recurring revenue and a 9 digit valuation from 0 within 2 years. Cem's work in Hypatos was covered by leading technology publications like TechCrunch and Business Insider.
Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.