Feature: OSINT Investigation Skill — Public Data Sources, Entity Resolution, and Evidence Chains (inspired by OpenPlanter) #355

## Overview

[OpenPlanter](https://github.com/ShinMegamiBoson/OpenPlanter) is an open-source recursive investigation agent (MIT license, 1.4k stars) that acts as a "Community Edition of Palantir" — ingesting heterogeneous public datasets, resolving entities across them, and surfacing non-obvious connections through evidence-backed analysis. Its most distinctive and transferable feature is a **complete OSINT investigation framework**: a curated wiki of 16 public data sources with structured documentation and Python acquisition scripts, plus reusable investigation templates for entity resolution, cross-link analysis, and statistical timing correlation.

Hermes Agent currently has `domain-intel` (passive DNS/WHOIS/SSL reconnaissance) and `arxiv` (academic paper search) as its only research/OSINT capabilities. There is no framework for structured investigations across heterogeneous data sources, no entity resolution tooling, and no evidence chain construction. The proposed skill would make Hermes Agent capable of genuine investigative work, from following campaign finance trails to cross-referencing government contracts with lobbying disclosures.

This should be a **skill** (likely Skills Hub rather than bundled, given the specialized audience) because the entire capability can be expressed as instructions + shell commands + existing Hermes tools (`terminal`, `web_extract`, `read_file`, `write_file`, `search_files`). The acquisition scripts use only Python stdlib. No custom tool integration needed.

---

## Research Findings

### How OpenPlanter's Investigation Framework Works

#### 1. Data Source Wiki (16 Sources, 10 Categories)

OpenPlanter maintains a structured wiki of public data sources, each following a standardized 9-section template:

| Category | Sources |
|:---|:---|
| Campaign Finance | MA OCPF, FEC Federal |
| Government Contracts | Boston Open Checkbook, USASpending.gov, SAM.gov |
| Corporate Registries | MA Secretary of Commonwealth, SEC EDGAR |
| Financial | FDIC BankFind |
| Lobbying | Senate LD-1/LD-2 Disclosures |
| Nonprofits | ProPublica 990 / IRS 990 |
| Regulatory | EPA ECHO, OSHA Inspections |
| Sanctions | OFAC SDN List |
| International | ICIJ Offshore Leaks |
| Infrastructure | US Census Bureau ACS |

**Template structure per source:**
1. Summary — what it is, who publishes, why it matters
2. Access Methods — API endpoints, bulk download URLs, auth requirements, rate limits
3. Data Schema — key fields, record types, table relationships
4. Coverage — jurisdiction, time range, update frequency, data volume
5. Cross-Reference Potential — which other sources can be joined and on what keys
6. Data Quality — known issues (formatting, missing fields, duplicates)
7. Acquisition Script — path to the Python script that downloads/transforms the data
8. Legal & Licensing — public records law, terms of use
9. References — official docs, data dictionaries

The **cross-reference potential** section is particularly valuable — it explicitly maps join keys between sources, creating a graph of data source interconnections that an agent can traverse to plan multi-source investigations.
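
As a rough illustration of how an agent might traverse that graph, the sketch below stores join keys as edges and uses breadth-first search to plan a chain of joins between two sources. The source names, the join keys, and the helper names `build_graph` and `plan_joins` are hypothetical placeholders, not entries or functions taken from OpenPlanter's wiki.

```python
from collections import deque

# Hypothetical join-key map: (source_a, source_b) -> shared key.
# These pairs and keys are illustrative placeholders only.
CROSS_REFERENCES = {
    ("fec", "usaspending"): "normalized_name",
    ("usaspending", "sam_gov"): "entity_id",
    ("sam_gov", "sec_edgar"): "normalized_name",
    ("sec_edgar", "senate_ld"): "normalized_name",
}


def build_graph(cross_refs):
    """Turn the pairwise map into an undirected adjacency list."""
    graph = {}
    for (a, b), key in cross_refs.items():
        graph.setdefault(a, []).append((b, key))
        graph.setdefault(b, []).append((a, key))
    return graph


def plan_joins(graph, start, goal):
    """Return the shortest list of (source, next_source, key) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, key in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, neighbor, key)]))
    return None


graph = build_graph(CROSS_REFERENCES)
for hop in plan_joins(graph, "fec", "senate_ld"):
    print(hop)
```

Each hop names the two sources to join and the key to join them on, so the resulting plan can be read directly as a sequence of merges.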

#### 2. Acquisition Scripts (Python Stdlib Only)

Each data source has a corresponding fetch script (`scripts/fetch_fec.py`, `scripts/fetch_sec_edgar.py`, etc.) that uses only Python stdlib (`urllib.request`, `json`, `csv`, `xml.etree`, `argparse`). Zero external dependencies for data acquisition.

Scripts handle: API pagination, rate limiting, data normalization, CSV/JSON output, error handling.
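
A minimal stdlib-only sketch of the pagination and rate-limiting pattern is shown below. It is not taken from OpenPlanter's scripts: the `page` query parameter, the `{"results": [...]}` response shape, and the function names `http_get_json` and `fetch_all` are assumptions chosen for illustration, and real APIs will differ.

```python
import json
import time
import urllib.parse
import urllib.request


def http_get_json(url, timeout=30):
    """Fetch one URL and decode its JSON body using only the stdlib."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def fetch_all(base_url, params, get_json=http_get_json, delay=0.5, max_pages=100):
    """Walk a page-numbered API until an empty page comes back.

    Assumes a hypothetical `page` parameter and a {"results": [...]} payload.
    """
    records = []
    for page in range(1, max_pages + 1):
        query = urllib.parse.urlencode({**params, "page": page})
        payload = get_json(f"{base_url}?{query}")
        batch = payload.get("results", [])
        if not batch:
            break  # an empty page signals the end of the data
        records.extend(batch)
        time.sleep(delay)  # crude rate limiting between requests
    return records
```

Passing the HTTP function in as `get_json` keeps the pagination logic testable offline, since a stub can stand in for the network call.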

#### 3. Investigation Templates

**Entity Resolution (`entity_resolution.py`, ~741 lines):**
Three-tier name normalization:
1. Standard: uppercase, remove legal suffixes (LLC, Inc, Corp), remove punctuation, collapse whitespace
2. Aggressive: alphabetically sorted unique tokens (word-bag comparison)
3. Token overlap: >60% token overlap with at least 2 shared tokens
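
The three tiers could look roughly like the sketch below. It is a simplified reading of the description above, not the code in `entity_resolution.py`: the function names are hypothetical, suffixes are stripped only at the end of a name, and the overlap ratio is measured against the smaller token set, which is an assumption because the denominator is not specified here.

```python
import re

LEGAL_SUFFIXES = {"LLC", "INC", "CORP"}


def normalize_standard(name):
    """Tier 1: uppercase, drop punctuation, collapse whitespace, strip trailing suffixes."""
    tokens = re.sub(r"[^\w\s]", "", name.upper()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_aggressive(name):
    """Tier 2: sorted unique tokens, so word order no longer matters."""
    return " ".join(sorted(set(normalize_standard(name).split())))


def token_overlap_match(a, b, min_ratio=0.6, min_shared=2):
    """Tier 3: more than 60% token overlap with at least two shared tokens.

    The ratio uses the smaller token set as denominator (an assumption).
    """
    tokens_a = set(normalize_standard(a).split())
    tokens_b = set(normalize_standard(b).split())
    if not tokens_a or not tokens_b:
        return False
    shared = tokens_a & tokens_b
    ratio = len(shared) / min(len(tokens_a), len(tokens_b))
    return len(shared) >= min_shared and ratio > min_ratio


print(normalize_standard("Acme Paving, Inc."))
print(normalize_aggressive("Paving Acme LLC"))
print(token_overlap_match("Acme Paving and Construction", "Acme Paving Services"))
```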

Three-tier matching with explicit confidence:
- `employer_exact` / `donor_exact` — high confidence
- `employer_fuzzy` / `donor_fuzzy` — medium confidence  
- `employer_token_overlap` — low confidence

Red flag analysis targeting pay-to-play indicators:
- Sole-source vendor whose employees donate (HIGH if >$1M contracts)
- Bundled donations (3+ donors from same employer to same candidate)
- Significant donation amounts relative to contract value
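
The bundled-donation indicator is the simplest of these to sketch. The example below groups donations by employer and candidate and flags any group with three or more distinct donors. The record keys (`donor`, `employer`, `candidate`, `amount`) and the function name are hypothetical and are not OpenPlanter's schema.

```python
from collections import defaultdict


def find_bundled_donations(donations, min_donors=3):
    """Flag (employer, candidate) pairs with at least `min_donors` distinct donors.

    `donations` is an iterable of dicts with the hypothetical keys
    donor, employer, candidate, and amount.
    """
    groups = defaultdict(lambda: {"donors": set(), "total": 0.0})
    for d in donations:
        group = groups[(d["employer"], d["candidate"])]
        group["donors"].add(d["donor"])  # repeat donors count once
        group["total"] += d["amount"]

    flags = []
    for (employer, candidate), group in sorted(groups.items()):
        if len(group["donors"]) >= min_donors:
            flags.append({
                "employer": employer,
                "candidate": candidate,
                "donor_count": len(group["donors"]),
                "total_amount": group["total"],
            })
    return flags
```

In practice the `employer` values would be normalized first, since free-text employer fields rarely match exactly.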

**Cross-Link Analysis (`cross_link_analysis.py`, ~586 lines):**
Alternative matching pipeline using pandas + optional rapidfuzz (token_sort_ratio at threshold 82). Detects contractor-donor matches and bundled donations.
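
A stripped-down sketch of the optional-dependency pattern follows. It uses plain lists rather than pandas and is not the code in `cross_link_analysis.py`. `fuzz.token_sort_ratio` is the real rapidfuzz function; the fallback, which compares sorted tokens for exact equality, and the function names `similarity` and `match_contractors_to_employers` are assumptions made for this example.

```python
try:
    from rapidfuzz import fuzz  # optional dependency

    def similarity(a, b):
        """Order-insensitive fuzzy score in the range 0 to 100."""
        return fuzz.token_sort_ratio(a, b)

except ImportError:

    def similarity(a, b):
        """Fallback: exact comparison of sorted tokens, scored 100 or 0."""
        return 100.0 if sorted(a.split()) == sorted(b.split()) else 0.0


def match_contractors_to_employers(contractors, employers, threshold=82):
    """Return (contractor, employer, score) for every pair at or above threshold."""
    matches = []
    for contractor in contractors:
        for employer in employers:
            score = similarity(contractor.upper(), employer.upper())
            if score >= threshold:
                matches.append((contractor, employer, score))
    return matches


print(match_contractors_to_employers(["Acme Paving"], ["PAVING ACME", "Zenith Bank"]))
```

The nested loop compares every contractor with every employer, which is fine for a sketch but would need blocking or indexing on large datasets.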

**Timing Analysis (`timing_analysis.py`, ~338 lines):**
Statistical permutation testing for donation-contract timing correlation:
1. For each vendor-politician pair, calculate mean distance from donations to nearest contract award
2. Generate 1,000 permutations, each with randomly drawn award dates representing the null hypothesis of no timing relationship
3. P-value = fraction of permutations where mean distance ≤ observed
4. Effect size = (null_mean - observed) / null_std
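
The four steps can be sketched for a single vendor-politician pair as below. This is a hedged illustration, not the code in `timing_analysis.py`. Dates are represented as integer day offsets, and the null award dates are drawn uniformly from a study window, which is an assumption about how the null distribution is built. The function names and the `window` parameter are hypothetical.

```python
import random
import statistics


def mean_nearest_distance(donation_days, award_days):
    """Step 1: mean distance in days from each donation to its nearest award."""
    return statistics.fmean(
        [min(abs(d - a) for a in award_days) for d in donation_days]
    )


def timing_permutation_test(donation_days, award_days, window, n_perm=1000, seed=0):
    """Steps 2 to 4 for one pair; days are integer offsets from a common start."""
    rng = random.Random(seed)
    observed = mean_nearest_distance(donation_days, award_days)
    lo, hi = window

    # Step 2: randomized award dates, uniform over the window (an assumption).
    null_means = []
    for _ in range(n_perm):
        fake_awards = [rng.randint(lo, hi) for _ in award_days]
        null_means.append(mean_nearest_distance(donation_days, fake_awards))

    # Step 3: share of permutations at least as tightly clustered as observed.
    p_value = sum(m <= observed for m in null_means) / n_perm

    # Step 4: standardized gap between the null mean and the observed mean.
    null_mean = statistics.fmean(null_means)
    null_std = statistics.pstdev(null_means)
    effect_size = (null_mean - observed) / null_std if null_std > 0 else 0.0

    return {"observed": observed, "p_value": p_value, "effect_size": effect_size}


print(timing_permutation_test([100, 400, 700], [101, 398, 705], window=(0, 1000)))
```

A small p-value together with a large positive effect size indicates that donations sit closer to award dates than chance alone would suggest.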

This is genuine computational investigative journalism methodology.

**Findings Builder (`build_findings_json.py`, ~163 lines):**
Assembles structured investigation reports with machine-readable evidence chains: each finding has id, title, severity, confidence, summary, evidence list, and source files.
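
One possible shape for such a report is sketched below. The field names follow the list above, but the `Finding` dataclass, the top-level `findings` key, and the sample values are assumptions for illustration and are not the schema produced by `build_findings_json.py`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    """One finding with the fields named above; the types are assumptions."""
    id: str
    title: str
    severity: str
    confidence: str
    summary: str
    evidence: list = field(default_factory=list)
    source_files: list = field(default_factory=list)


def build_findings_json(findings):
    """Serialize findings under a hypothetical top-level "findings" key."""
    return json.dumps({"findings": [asdict(f) for f in findings]}, indent=2)


# Sample values are invented purely to show the structure.
example = Finding(
    id="F-001",
    title="Bundled donations from one employer",
    severity="medium",
    confidence="probable",
    summary="Three donors sharing an employer gave to the same candidate.",
    evidence=["donation record 1", "donation record 2", "donation record 3"],
    source_files=["data/donations.csv"],
)
print(build_findings_json([example]))
```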

### Key Design Decisions

1. **Wiki-as-knowledge-base** — Data source documentation is structured for both humans AND AI agents. An agent reads the wiki entry and knows the API endpoint, auth requirements, schema, rate limits, and CLI commands.
2. **Stdlib-only acquisition** — Zero dependency installation for data fetching. Maximum portability.
3. **Explicit confidence levels** — Every entity match has a confidence tier (confirmed/probable/possible/unresolved), preventing false positives from being treated as certainties.
4. **Evidence chain construction** — Every claim traces to a specific record: claim → evidence → source → confidence.
5. **Template-driven extensibility** — Adding new data sources follows a copy-template-and-fill pattern.

---

## Current State in Hermes Agent

**What we have:**
- `domain-intel` skill — passive DNS/WHOIS/SSL/crt.sh reconnaissance. Domain-focused only.
- `arxiv` skill — academic paper search via free REST API.
- `web_search` + `web_extract` tools — general web research.
- `terminal` tool — can run Python scripts for data analysis.
- `execute_code` tool — sandboxed Python execution with RPC tool access.

**What we don't have:**
- No structured investigation framework
- No entity resolution capabilities
- No public data source catalog
- No evidence chain tracking
- No investigation templates (entity matching, cross-linking, timing analysis)
- No red flag / anomaly detection patterns

The `domain-intel` skill is the closest analog but covers only DNS/infrastructure OSINT. This proposal covers the much broader space of financial, regulatory, corporate, and political OSINT.

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **skill** because:
- The entire capability is instructions + shell commands + existing tools
- Acquisition scripts are standalone Python (run via `terminal`)
- Entity resolution scripts are standalone Python (run via `terminal`)
- Data source documentation is reference material (read via `read_file`)
- No custom Python integration or API key management needed in the agent harness
- No binary data, streaming, or real-time events

**Bundled vs. Skills Hub:** This is specialized (investigative journalists, researchers, OSINT analysts) rather than broadly useful to most users. Recommend **Skills Hub** with documentation for how to install and use it.

### What We'd Need

1. **Skill SKILL.md** — Investigation workflow instructions with trigger conditions
2. **Reference wiki** — Adapted data source entries (references/ directory)  
3. **Investigation scripts** — Entity resolution, cross-link analysis, timing analysis (scripts/ directory)
4. **Acquisition scripts** — Fetch scripts for each data source (scripts/ directory)
5. **Template** — Data source template for users to add their own sources

### Phased Rollout

**Phase 1: Core Skill + Top Data Sources**
- Create the skill with investigation workflow instructions
- Include the wiki template and 5-6 most broadly useful data sources:
  - FEC (campaign finance — federal)
  - SEC EDGAR (corporate filings)
  - USASpending (government contracts)
  - Senate LD (lobbying disclosures)
  - OFAC SDN (sanctions)
  - ICIJ Offshore Leaks (international)
- Include the entity resolution script (adapted from OpenPlanter)
- Include the cross-link analysis script
- Basic investigation quickstart guide

**Phase 2: Full Data Source Catalog + Advanced Analysis**
- Add remaining data sources (OSHA, EPA, FDIC, ProPublica 990, Census ACS, SAM.gov)
- Add the timing analysis script (permutation testing)
- Add the findings builder (structured evidence chain output)
- Red flag analysis templates
- Add investigation templates for common scenarios:
  - "Follow the money" (campaign finance + contracts)
  - "Corporate connections" (EDGAR + lobbying + contracts)
  - "Sanctions screening" (OFAC + corporate registries)

**Phase 3: Integration & Polish**
- Integration with `delegate_task` for parallel multi-source investigations
- Integration with structured memory (#346) for persistent entity graphs
- Visualization support (network graphs of entity connections)
- Export templates (Markdown reports, JSON evidence packages)
- User-contributed data source template + contribution guide

---

## Pros & Cons

### Pros
- **Fills a major capability gap** — No existing Hermes skill covers structured investigation workflows
- **High-impact use case** — Investigative journalism, compliance, due diligence, academic research
- **No new dependencies for data acquisition** — Acquisition scripts use Python stdlib only; the cross-link analysis script, also planned for Phase 1, additionally needs pandas
- **Reusable patterns** — Entity resolution and cross-linking are applicable far beyond the included data sources
- **MIT licensed source material** — OpenPlanter is MIT, safe to adapt
- **Self-contained** — As a skill, it doesn't touch the core codebase at all
- **Extensible** — Template-driven design makes it easy for users to add data sources

### Cons / Risks
- **Specialized audience** — Not everyone needs OSINT investigation capabilities
- **Data source maintenance** — Government APIs change; wiki entries may become stale
- **Legal considerations** — While all data sources are public records, bulk collection may violate some sources' terms of service. The skill should include usage guidelines.
- **Scope creep** — Investigation is a huge domain; must stay focused on the data-to-evidence pipeline
- **rapidfuzz dependency** — Cross-link analysis uses pandas and benefits from the rapidfuzz fuzzy matching library (rapidfuzz is optional; without it, matching falls back to exact comparison)

---

## Open Questions

1. Should the skill ship with a "quickstart investigation" demo (like OpenPlanter's Boston corruption investigation) or just the building blocks?
2. Should entity resolution be a separate reusable script, or embedded in the skill instructions for the agent to adapt per-investigation?
3. How should the data source wiki be structured within the skill? One big reference file, or separate files per source?
4. Should we include non-US data sources (UK Companies House, EU lobbying register) in Phase 1 or defer?
5. Should evidence chain output integrate with #346 (Structured Memory) once that's implemented?

---

## References

- [OpenPlanter](https://github.com/ShinMegamiBoson/OpenPlanter) — Source repo (MIT license)
- [MarkTechPost coverage](https://www.marktechpost.com/2026/02/21/is-there-a-community-edition-of-palantir-meet-openplanter-an-open-source-recursive-ai-agent-for-your-micro-surveillance-use-cases/) — Overview article
- [OpenPlanter wiki/](https://github.com/ShinMegamiBoson/OpenPlanter/tree/main/wiki) — Data source catalog
- [OpenPlanter scripts/](https://github.com/ShinMegamiBoson/OpenPlanter/tree/main/scripts) — Investigation scripts
- Hermes `domain-intel` skill — Existing OSINT capability (domain-focused only)
- #346 — Structured Memory System (potential integration for evidence chains)
