Feature: OSINT Investigation Skill — Public Data Sources, Entity Resolution, and Evidence Chains (inspired by OpenPlanter) #355

@teknium1

Overview

OpenPlanter is an open-source recursive investigation agent (MIT license, 1.4k stars) that acts as a "Community Edition of Palantir" — ingesting heterogeneous public datasets, resolving entities across them, and surfacing non-obvious connections through evidence-backed analysis. Its most distinctive and transferable feature is a complete OSINT investigation framework: a curated wiki of 16 public data sources with structured documentation and Python acquisition scripts, plus reusable investigation templates for entity resolution, cross-link analysis, and statistical timing correlation.

Hermes Agent currently has domain-intel (passive DNS/WHOIS/SSL reconnaissance) and arxiv (academic paper search) as its only research/OSINT capabilities. There is no framework for structured investigations across heterogeneous data sources, no entity resolution tooling, and no evidence chain construction. Adding these would make Hermes Agent capable of genuine investigative work, from following campaign finance trails to cross-referencing government contracts with lobbying disclosures.

This should be a skill (likely Skills Hub rather than bundled, given the specialized audience) because the entire capability can be expressed as instructions + shell commands + existing Hermes tools (terminal, web_extract, read_file, write_file, search_files). The acquisition scripts use only Python stdlib. No custom tool integration needed.


Research Findings

How OpenPlanter's Investigation Framework Works

1. Data Source Wiki (16 Sources, 10 Categories)

OpenPlanter maintains a structured wiki of public data sources, grouped into the categories below. Each entry follows a standardized 9-section template.

Category | Sources
Campaign Finance | MA OCPF, FEC Federal
Government Contracts | Boston Open Checkbook, USASpending.gov, SAM.gov
Corporate Registries | MA Secretary of Commonwealth, SEC EDGAR
Financial | FDIC BankFind
Lobbying | Senate LD-1/LD-2 Disclosures
Nonprofits | ProPublica 990 / IRS 990
Regulatory | EPA ECHO, OSHA Inspections
Sanctions | OFAC SDN List
International | ICIJ Offshore Leaks
Infrastructure | US Census Bureau ACS

Template structure per source:

  1. Summary — what it is, who publishes, why it matters
  2. Access Methods — API endpoints, bulk download URLs, auth requirements, rate limits
  3. Data Schema — key fields, record types, table relationships
  4. Coverage — jurisdiction, time range, update frequency, data volume
  5. Cross-Reference Potential — which other sources can be joined and on what keys
  6. Data Quality — known issues (formatting, missing fields, duplicates)
  7. Acquisition Script — path to the Python script that downloads/transforms the data
  8. Legal & Licensing — public records law, terms of use
  9. References — official docs, data dictionaries

The cross-reference potential section is particularly valuable — it explicitly maps join keys between sources, creating a graph of data source interconnections that an agent can traverse to plan multi-source investigations.
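
A minimal sketch of how an agent might traverse such a join-key graph, assuming the cross-reference sections have already been parsed into (source, source, join key) triples. The source identifiers and join keys below are illustrative placeholders, not OpenPlanter's actual mappings.

```python
from collections import defaultdict, deque

# Hypothetical edges: the real join keys live in each wiki entry's
# "Cross-Reference Potential" section.
CROSS_REFS = [
    ("fec", "senate_ld", "organization_name"),
    ("fec", "usaspending", "employer_name"),
    ("usaspending", "sam_gov", "uei"),
    ("sec_edgar", "usaspending", "company_name"),
    ("ofac_sdn", "sec_edgar", "company_name"),
]


def build_graph(edges):
    """Build an undirected adjacency list: source -> [(neighbor, join_key)]."""
    graph = defaultdict(list)
    for a, b, key in edges:
        graph[a].append((b, key))
        graph[b].append((a, key))
    return graph


def plan_joins(graph, start, goal):
    """Breadth-first search for the shortest chain of joins.

    Returns a list of (from_source, to_source, join_key) steps,
    or None when the two sources are not connected.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            steps = []
            while parent[node] is not None:
                prev, key = parent[node]
                steps.append((prev, node, key))
                node = prev
            return steps[::-1]
        for neighbor, key in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = (node, key)
                queue.append(neighbor)
    return None


graph = build_graph(CROSS_REFS)
for step in plan_joins(graph, "fec", "sam_gov"):
    print(step)
```

With edges like these, a question such as "how do I get from donors to registered federal vendors?" reduces to a path lookup, and each step names the column to join on.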

2. Acquisition Scripts (Python Stdlib Only)

Each data source has a corresponding fetch script (scripts/fetch_fec.py, scripts/fetch_sec_edgar.py, etc.) that uses only Python stdlib (urllib.request, json, csv, xml.etree, argparse). Zero external dependencies for data acquisition.

Scripts handle: API pagination, rate limiting, data normalization, CSV/JSON output, error handling.
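
As a rough illustration of the pagination and rate-limiting pattern, here is a stdlib-only sketch. The `page` / `per_page` query parameters and the `{"results": [...]}` response shape are assumptions for the example, not the contract of any specific API listed above; a real fetch script would follow the wiki entry for its source.

```python
import json
import time
import urllib.parse
import urllib.request


def http_get_json(url, params, timeout=30):
    """Fetch one page over HTTP using only the standard library."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=timeout) as resp:
        return json.load(resp)


def fetch_all(url, per_page=100, delay=0.5, max_pages=1000, get=http_get_json):
    """Walk a page-numbered API until an empty page comes back.

    Assumes hypothetical `page` / `per_page` parameters and a
    hypothetical {"results": [...]} response body.
    """
    records = []
    for page in range(1, max_pages + 1):
        payload = get(url, {"page": page, "per_page": per_page})
        batch = payload.get("results", [])
        if not batch:
            break
        records.extend(batch)
        time.sleep(delay)  # crude rate limiting between requests
    return records
```

The `get` parameter exists so the loop can be exercised against a stub without touching the network.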

3. Investigation Templates

Entity Resolution (entity_resolution.py, ~741 lines):
Three-tier name normalization:

  1. Standard: uppercase, remove legal suffixes (LLC, Inc, Corp), remove punctuation, collapse whitespace
  2. Aggressive: alphabetically sorted unique tokens (word-bag comparison)
  3. Token overlap: >60% token overlap with at least 2 shared tokens

Three-tier matching with explicit confidence:

  • employer_exact / donor_exact — high confidence
  • employer_fuzzy / donor_fuzzy — medium confidence
  • employer_token_overlap — low confidence
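
The normalization tiers and confidence levels above can be sketched roughly as follows. This is not OpenPlanter's code: the suffix list is limited to the three suffixes named above, and measuring overlap against the smaller token set is an assumption, since the description does not say which denominator is used.

```python
import re

# Only the suffixes named in the description; a real list would be longer.
LEGAL_SUFFIXES = {"LLC", "INC", "CORP"}


def normalize_standard(name):
    """Tier 1: uppercase, drop punctuation and legal suffixes, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.upper())
    return " ".join(t for t in cleaned.split() if t not in LEGAL_SUFFIXES)


def normalize_aggressive(name):
    """Tier 2: sorted unique tokens, so word order no longer matters."""
    return " ".join(sorted(set(normalize_standard(name).split())))


def token_overlap(a, b):
    """Tier 3 input: (shared token count, overlap ratio vs. the smaller set)."""
    ta = set(normalize_standard(a).split())
    tb = set(normalize_standard(b).split())
    if not ta or not tb:
        return 0, 0.0
    shared = len(ta & tb)
    return shared, shared / min(len(ta), len(tb))


def match_confidence(a, b):
    """Return 'high', 'medium', 'low', or None when no tier matches."""
    na, nb = normalize_standard(a), normalize_standard(b)
    if not na or not nb:
        return None  # nothing left after normalization; refuse to match
    if na == nb:
        return "high"
    if normalize_aggressive(a) == normalize_aggressive(b):
        return "medium"
    shared, ratio = token_overlap(a, b)
    if shared >= 2 and ratio > 0.6:
        return "low"
    return None
```

Returning the tier alongside the match, rather than a bare boolean, is what lets downstream reports keep weak matches visibly separate from exact ones.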

Red flag analysis targeting pay-to-play indicators:

  • Sole-source vendor whose employees donate (HIGH if >$1M contracts)
  • Bundled donations (3+ donors from same employer to same candidate)
  • Significant donation amounts relative to contract value
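
For example, the bundled-donation check could look something like the sketch below. The record field names (`donor`, `employer`, `candidate`, `amount`) are hypothetical, and employer names are assumed to have been normalized already.

```python
from collections import defaultdict


def find_bundled_donations(donations, min_donors=3):
    """Flag (employer, candidate) pairs with at least `min_donors` distinct donors."""
    donors = defaultdict(set)
    totals = defaultdict(float)
    for d in donations:
        key = (d["employer"], d["candidate"])
        donors[key].add(d["donor"])
        totals[key] += d["amount"]
    return [
        {
            "employer": employer,
            "candidate": candidate,
            "donor_count": len(names),
            "total": totals[(employer, candidate)],
        }
        for (employer, candidate), names in sorted(donors.items())
        if len(names) >= min_donors
    ]
```

Counting distinct donors rather than donation rows keeps one person's repeat contributions from tripping the flag.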

Cross-Link Analysis (cross_link_analysis.py, ~586 lines):
Alternative matching pipeline using pandas + optional rapidfuzz (token_sort_ratio at threshold 82). Detects contractor-donor matches and bundled donations.
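
A simplified sketch of the optional-dependency pattern, using plain lists instead of pandas to stay short. `rapidfuzz.fuzz.token_sort_ratio` is the real library function. The exact-match fallback mirrors the behaviour described in this issue, but the helper names here are made up for illustration.

```python
try:
    from rapidfuzz import fuzz  # optional third-party dependency
except ImportError:
    fuzz = None

THRESHOLD = 82  # threshold quoted above


def names_match(a, b, threshold=THRESHOLD):
    """Fuzzy-compare two names, or fall back to exact comparison."""
    a, b = a.upper().strip(), b.upper().strip()
    if fuzz is not None:
        # token_sort_ratio sorts tokens first, so word order is ignored
        return fuzz.token_sort_ratio(a, b) >= threshold
    return a == b  # exact-match fallback when rapidfuzz is not installed


def cross_link(vendors, donor_employers, threshold=THRESHOLD):
    """Return (vendor, employer) pairs whose names match."""
    return [
        (v, e)
        for v in vendors
        for e in donor_employers
        if names_match(v, e, threshold)
    ]
```

The nested loop is quadratic, so a real pipeline over large contract and donation tables would want blocking or indexing before comparing names.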

Timing Analysis (timing_analysis.py, ~338 lines):
Statistical permutation testing for donation-contract timing correlation:

  1. For each vendor-politician pair, calculate mean distance from donations to nearest contract award
  2. Generate 1,000 sets of random award dates under the null hypothesis and recompute the mean distance for each
  3. P-value = fraction of permutations where mean distance ≤ observed
  4. Effect size = (null_mean - observed) / null_std

This is genuine computational investigative journalism methodology.
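
The four steps can be sketched with the standard library as below. Two choices are assumptions rather than facts about timing_analysis.py: null award dates are drawn uniformly from an observation window, and the population standard deviation is used for the effect size.

```python
import random
import statistics
from datetime import date


def mean_nearest_distance(donation_days, award_days):
    """Mean gap in days from each donation to its nearest award."""
    return statistics.mean(
        [min(abs(d - a) for a in award_days) for d in donation_days]
    )


def timing_test(donations, awards, window_start, window_end, n_perm=1000, seed=0):
    """Compare observed donation-to-award proximity against random award dates."""
    rng = random.Random(seed)
    lo, hi = window_start.toordinal(), window_end.toordinal()
    don = [d.toordinal() for d in donations]
    observed = mean_nearest_distance(don, [a.toordinal() for a in awards])

    # Null distribution: same number of awards, dates drawn uniformly
    # from the window (an assumption about how the null is generated).
    null = [
        mean_nearest_distance(don, [rng.randint(lo, hi) for _ in awards])
        for _ in range(n_perm)
    ]

    p_value = sum(m <= observed for m in null) / n_perm
    null_std = statistics.pstdev(null)
    effect = (statistics.mean(null) - observed) / null_std if null_std else 0.0
    return {"observed": observed, "p_value": p_value, "effect_size": effect}


result = timing_test(
    donations=[date(2023, 3, 1), date(2023, 6, 15), date(2023, 10, 1)],
    awards=[date(2023, 3, 1), date(2023, 6, 15), date(2023, 10, 1)],
    window_start=date(2023, 1, 1),
    window_end=date(2023, 12, 31),
)
print(result)
```

A small p-value says only that donations cluster near award dates more tightly than chance placement would; it is one piece of evidence, not a finding on its own.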

Findings Builder (build_findings_json.py, ~163 lines):
Assembles structured investigation reports with machine-readable evidence chains: each finding has id, title, severity, confidence, summary, evidence list, and source files.
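
One possible shape for such a report, using the fields listed above. The top-level `findings` key, the `source_files` field name, and the layout of each evidence item are assumptions. The example values are placeholders, not real data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    id: str
    title: str
    severity: str
    confidence: str
    summary: str
    evidence: list = field(default_factory=list)
    source_files: list = field(default_factory=list)


def build_findings_json(findings):
    """Serialize findings into a machine-readable report."""
    return json.dumps({"findings": [asdict(f) for f in findings]}, indent=2)


example = Finding(
    id="F-001",
    title="Placeholder finding",
    severity="HIGH",
    confidence="probable",
    summary="Placeholder summary of what the evidence suggests.",
    evidence=[
        # Hypothetical evidence item: claim -> record -> source
        {"claim": "placeholder claim", "record": "row 12", "source": "contracts.csv"}
    ],
    source_files=["contracts.csv", "donations.csv"],
)
print(build_findings_json([example]))
```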

Key Design Decisions

  1. Wiki-as-knowledge-base — Data source documentation is structured for both humans AND AI agents. An agent reads the wiki entry and knows the API endpoint, auth requirements, schema, rate limits, and CLI commands.
  2. Stdlib-only acquisition — Zero dependency installation for data fetching. Maximum portability.
  3. Explicit confidence levels — Every entity match has a confidence tier (confirmed/probable/possible/unresolved), preventing false positives from being treated as certainties.
  4. Evidence chain construction — Every claim traces to a specific record: claim → evidence → source → confidence.
  5. Template-driven extensibility — Adding new data sources follows a copy-template-and-fill pattern.

Current State in Hermes Agent

What we have:

  • domain-intel skill — passive DNS/WHOIS/SSL/crt.sh reconnaissance. Domain-focused only.
  • arxiv skill — academic paper search via free REST API.
  • web_search + web_extract tools — general web research.
  • terminal tool — can run Python scripts for data analysis.
  • execute_code tool — sandboxed Python execution with RPC tool access.

What we don't have:

  • No structured investigation framework
  • No entity resolution capabilities
  • No public data source catalog
  • No evidence chain tracking
  • No investigation templates (entity matching, cross-linking, timing analysis)
  • No red flag / anomaly detection patterns

The domain-intel skill is the closest analog but covers only DNS/infrastructure OSINT. This proposal covers the much broader space of financial, regulatory, corporate, and political OSINT.


Implementation Plan

Skill vs. Tool Classification

This should be a skill because:

  • The entire capability is instructions + shell commands + existing tools
  • Acquisition scripts are standalone Python (run via terminal)
  • Entity resolution scripts are standalone Python (run via terminal)
  • Data source documentation is reference material (read via read_file)
  • No custom Python integration or API key management needed in the agent harness
  • No binary data, streaming, or real-time events

Bundled vs. Skills Hub: This is specialized (investigative journalists, researchers, OSINT analysts) rather than broadly useful to most users. Recommend Skills Hub with documentation for how to install and use it.

What We'd Need

  1. Skill SKILL.md — Investigation workflow instructions with trigger conditions
  2. Reference wiki — Adapted data source entries (references/ directory)
  3. Investigation scripts — Entity resolution, cross-link analysis, timing analysis (scripts/ directory)
  4. Acquisition scripts — Fetch scripts for each data source (scripts/ directory)
  5. Template — Data source template for users to add their own sources

Phased Rollout

Phase 1: Core Skill + Top Data Sources

  • Create the skill with investigation workflow instructions
  • Include the wiki template and the six most broadly useful data sources:
    • FEC (campaign finance — federal)
    • SEC EDGAR (corporate filings)
    • USASpending (government contracts)
    • Senate LD (lobbying disclosures)
    • OFAC SDN (sanctions)
    • ICIJ Offshore Leaks (international)
  • Include the entity resolution script (adapted from OpenPlanter)
  • Include the cross-link analysis script
  • Basic investigation quickstart guide

Phase 2: Full Data Source Catalog + Advanced Analysis

  • Add remaining data sources (OSHA, EPA, FDIC, ProPublica 990, Census ACS, SAM.gov)
  • Add the timing analysis script (permutation testing)
  • Add the findings builder (structured evidence chain output)
  • Red flag analysis templates
  • Add investigation templates for common scenarios:
    • "Follow the money" (campaign finance + contracts)
    • "Corporate connections" (EDGAR + lobbying + contracts)
    • "Sanctions screening" (OFAC + corporate registries)

Phase 3: Integration & Polish


Pros & Cons

Pros

  • Fills a major capability gap — No existing Hermes skill covers structured investigation workflows
  • High-impact use case — Investigative journalism, compliance, due diligence, academic research
  • Zero new dependencies for data acquisition — Acquisition scripts use Python stdlib only (the cross-link analysis script additionally relies on pandas, with rapidfuzz optional)
  • Reusable patterns — Entity resolution and cross-linking are applicable far beyond the included data sources
  • MIT licensed source material — OpenPlanter is MIT, safe to adapt
  • Self-contained — As a skill, it doesn't touch the core codebase at all
  • Extensible — Template-driven design makes it easy for users to add data sources

Cons / Risks

  • Specialized audience — Not everyone needs OSINT investigation capabilities
  • Data source maintenance — Government APIs change; wiki entries may become stale
  • Legal considerations — While all data sources are public records, bulk collection may violate some ToS. Skill should include usage guidelines.
  • Scope creep — Investigation is a huge domain; must stay focused on the data-to-evidence pipeline
  • Third-party dependencies — Cross-link analysis uses pandas and benefits from the optional rapidfuzz fuzzy matching library (it falls back to exact matching without it)

Open Questions

  1. Should the skill ship with a "quickstart investigation" demo (like OpenPlanter's Boston corruption investigation) or just the building blocks?
  2. Should entity resolution be a separate reusable script, or embedded in the skill instructions for the agent to adapt per-investigation?
  3. How should the data source wiki be structured within the skill? One big reference file, or separate files per source?
  4. Should we include non-US data sources (UK Companies House, EU lobbying register) in Phase 1 or defer?
  5. Should evidence chain output integrate with #346 (Structured Memory System: Typed Nodes, Graph Edges, and Hybrid Search) once that's implemented?
