dank-extract is a CLI tool for fetching, cleaning, and exporting cannabis datasets. It extracts data from public sources, applies cleaning and validation, and outputs to multiple formats (CSV, JSON, DuckDB).
The cleaned data snapshots are published to the dank-data repository. These datasets power the dank-mcp Model Context Protocol server.
- Installation
- Usage
- Supported Datasets
- Data Cleaning
- Building
- Contribution and Conduct
- Credits and License
Install using go install:
$ go install github.com/AgentDank/dank-extract@latest
It will be installed in your $GOPATH/bin directory, which is often ~/go/bin.
dank-extract [options]
Options:
  -c, --compress                 Compress output files with zstd
      --db string                DuckDB file path (default: .dank/dank-extract.duckdb)
  -h, --help                     Show help
      --max-cache-age duration   Maximum age of cached data before re-fetching (default 24h0m0s)
  -n, --no-fetch                 Don't fetch data, use existing cache
  -o, --output string            Output directory for exports (default: current directory)
      --root string              Root directory for .dank data (default ".")
  -t, --token string             ct.data.gov App Token
  -v, --verbose                  Verbose output
Fetch, clean, and export CT cannabis brand data:
$ dank-extract --verbose
2026/01/09 16:46:31 Fetching CT brands data...
2026/01/09 16:46:35 Fetched 30841 brands from API
2026/01/09 16:46:35 Cleaned brands: 30841 -> 30836 (removed 5 erroneous records)
2026/01/09 16:46:35 Wrote CSV to us_ct_brands.csv
2026/01/09 16:46:35 Wrote JSON to us_ct_brands.json
2026/01/09 16:46:41 Wrote DuckDB to .dank/dank-extract.duckdb
Successfully processed 30836 CT cannabis brands
Output files:
- us_ct_brands.csv: CSV format with all fields
- us_ct_brands.json: JSON format with all fields
- .dank/dank-extract.duckdb: DuckDB database with indexed tables
Use --compress to output .zst compressed files.
Currently the following dataset is supported:
- US Connecticut (CT) cannabis brands, fetched from ct.data.gov and exported as us_ct_brands
Because upstream datasets are not perfect, we apply cleaning:
- Empty/Trace Values: Fields with "TRC", "<LOQ", "<0.1", etc. are treated as trace amounts
- Error Detection: Multiple decimal points, invalid characters, letters at start
- Validation: Cannabinoid/terpene percentages must be 0-100%
- Missing Data: Empty brand names are filtered out
Generally, we remove weird characters and treat detected "trace" amounts as 0. We also remove rows with ridiculous data (e.g., 90,385% THC entries from decimal point errors).
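The rules above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the code in measure.go: `cleanPercent` and `errInvalidMeasure` are hypothetical names, any value starting with "<" is assumed to be a trace marker, and empty fields are assumed to clean to 0 like trace amounts.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errInvalidMeasure marks a value that should cause its row to be dropped.
// Hypothetical name, used only for this sketch.
var errInvalidMeasure = errors.New("invalid measure")

// cleanPercent parses a raw cannabinoid/terpene percentage field.
// Hypothetical helper; the real cleaning logic may differ.
func cleanPercent(raw string) (float64, error) {
	s := strings.TrimSpace(raw)
	s = strings.TrimSpace(strings.TrimSuffix(s, "%"))

	// Empty and trace markers ("TRC", "<LOQ", "<0.1", ...) become 0 (assumption).
	if s == "" || strings.EqualFold(s, "TRC") || strings.HasPrefix(s, "<") {
		return 0, nil
	}

	// Error detection: only digits and a single decimal point are accepted.
	if strings.Count(s, ".") > 1 {
		return 0, errInvalidMeasure
	}
	for _, r := range s {
		if (r < '0' || r > '9') && r != '.' {
			return 0, errInvalidMeasure
		}
	}

	v, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return 0, errInvalidMeasure
	}

	// Validation: percentages must lie in 0-100.
	if v < 0 || v > 100 {
		return 0, errInvalidMeasure
	}
	return v, nil
}

func main() {
	for _, raw := range []string{"12.5", "TRC", "<LOQ", "1.2.3", "90385", "x4.2"} {
		v, err := cleanPercent(raw)
		fmt.Printf("%q -> %v, err=%v\n", raw, v, err)
	}
}
```

Returning an error rather than silently clamping lets the caller decide whether to drop the whole row, which matches the description of removing erroneous records.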
Building is performed with standard Go tooling:
$ go build -o dank-extract ./cmd/dank-extract
Or using go install:
$ go install ./cmd/dank-extract

dank-extract/
├── cmd/dank-extract/main.go    # CLI entry point
├── sources/
│   ├── cache.go                # Cache file management
│   └── us/ct/
│       ├── brand.go            # CT Brand struct, fetch, clean, export
│       └── measure.go          # Measure type with validation
├── internal/db/
│   ├── db.go                   # DuckDB utilities
│   └── duckdb_up.sql           # Schema migration
├── go.mod
└── go.sum
Pull requests and issues are welcome. Or fork it. You do you.
Either way, obey our Code of Conduct. Be shady, but don't be a jerk.
Copyright (c) 2025 Neomantra Corp. Authored by Evan Wies for AgentDank.
Released under the MIT License, see LICENSE.txt.
Made with 🌿 and 🔥 by the team behind AgentDank.