Role Collector is an autonomous job-sourcing agent focused on finding fresh job postings (especially <24h) across ATS platforms and discovery channels, then deduplicating, storing, and surfacing them in a review dashboard.
We implemented a full pipeline that continuously discovers job URLs, extracts structured job data, deduplicates records, and ranks results for fast application workflows.
- Autonomous worker loop with scheduling, source prioritization, backoff, and cycle summaries
- Multi-source discovery:
  - ATS API discovery (seed + learned slugs)
  - ATS Google search (host-targeted + broad)
  - Funding/event discovery + company resolver + watchlist polling
  - LinkedIn public post discovery
- ATS API connector expansion (pattern-port approach):
  - Existing: `ashby`, `lever`, `greenhouse`
  - Added: `workday`, `smartrecruiters`, `icims`
- Deterministic sharded seed traversal for large catalogs
- Freshness metadata across pipeline:
  - `posted_at_source`, `observed_at`, `freshness_bucket` (`lt_24h`, `24_72h`, `gt_72h`, `unknown`)
- Two-layer dedupe:
  - Exact idempotency (description hash)
  - Semantic duplicate scoring
- Premium dashboard UI with freshness-first ranking and filtering
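
The two-layer dedupe listed above can be illustrated with a minimal sketch. This is not the project's implementation: the function names are hypothetical, the record fields `title` and `description` are assumed, and the second layer uses the standard library's `difflib.SequenceMatcher` as a stand-in for the RapidFuzz and embedding-based scoring the project actually relies on.

```python
import hashlib
from difflib import SequenceMatcher


def description_hash(description: str) -> str:
    """Layer 1: SHA-256 of the case- and whitespace-normalized description."""
    normalized = " ".join(description.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Layer 2 stand-in: character-level ratio in [0, 1], not an embedding score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(candidate: dict, existing: list[dict], threshold: float = 0.9) -> str:
    """Return 'exact', 'near', or 'new' for a candidate job record."""
    cand_hash = description_hash(candidate["description"])
    if any(description_hash(job["description"]) == cand_hash for job in existing):
        return "exact"
    cand_text = f'{candidate["title"]} {candidate["description"]}'
    for job in existing:
        job_text = f'{job["title"]} {job["description"]}'
        if similarity(cand_text, job_text) >= threshold:
            return "near"
    return "new"
```

Running the cheap hash check first means the more expensive similarity pass only happens for records that are not byte-for-byte repeats. The `0.9` threshold is an arbitrary illustration and would need tuning against real postings.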
What: Expanded ATS API discovery coverage and quality.
How:
- Added Workday / SmartRecruiters / iCIMS clients under `src/job_agent/sources/ats_api/clients.py`
- Ported robust ATS-scrapers-style patterns (pagination, retries, canonical URL shaping, metadata extraction)
- Added normalization helpers in `src/job_agent/sources/ats_api/normalize.py` for employment type, remote type, timestamps, and freshness bucketing
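
As an illustration of the freshness bucketing mentioned above, here is a minimal sketch. The bucket names come from this README; the function name, the treatment of naive timestamps as UTC, and the exact boundary handling (exactly 24h falls into `24_72h`, exactly 72h stays in `24_72h`) are assumptions, not the behavior of `normalize.py`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def _as_utc(value: datetime) -> datetime:
    # Assumption: timestamps without tzinfo are already in UTC.
    return value if value.tzinfo is not None else value.replace(tzinfo=timezone.utc)


def freshness_bucket(posted_at_source: Optional[datetime], observed_at: datetime) -> str:
    """Map a posting's age at observation time to one of the four buckets."""
    if posted_at_source is None:
        return "unknown"
    age = _as_utc(observed_at) - _as_utc(posted_at_source)
    if age < timedelta(hours=24):
        return "lt_24h"  # also covers postings dated slightly in the future
    if age <= timedelta(hours=72):
        return "24_72h"
    return "gt_72h"
```

Computing the age against `observed_at` rather than the current clock keeps the bucket reproducible for a stored record; a real pipeline may still want to recompute it at query time so that old rows age out of `lt_24h`.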
What: Prevent offset drift and inconsistent shard traversal across cycles.
How:
- Seed providers are now processed in deterministic sorted order before flattening/sharding
- Offset memory remains stable across worker runs
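
The idea behind deterministic traversal can be sketched as follows. The function names and the `{provider: [slug, ...]}` seed shape are hypothetical, and sorting slugs within each provider is an added assumption; the README only states that providers are sorted before flattening.

```python
def flatten_seeds(seeds: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Flatten seeds into (provider, slug) pairs in a stable, sorted order."""
    return [
        (provider, slug)
        for provider in sorted(seeds)
        for slug in sorted(seeds[provider])  # assumption: slugs are sorted too
    ]


def next_shard(
    seeds: dict[str, list[str]], offset: int, shard_size: int
) -> tuple[list[tuple[str, str]], int]:
    """Return the shard starting at `offset` and the offset for the next cycle."""
    flat = flatten_seeds(seeds)
    if not flat:
        return [], 0
    start = offset % len(flat)
    count = min(shard_size, len(flat))
    # Wrap around so every seed is eventually visited.
    shard = [flat[(start + i) % len(flat)] for i in range(count)]
    return shard, (start + count) % len(flat)
```

Because the flattened order no longer depends on dictionary insertion order, a persisted offset points at the same seed on every run, which is what prevents drift between cycles.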
What: Make "apply-now" jobs easier to prioritize.
How:
- Persisted freshness metadata in the `jobs` table (`posted_at_source`, `observed_at`, `freshness_bucket`)
- Added DB migration support in `src/job_agent/db/migrate.py`
- Jobs API now sorts freshness-first
- Dashboard adds:
  - `Freshness` filter
  - `<24h only` toggle
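
One way to express freshness-first ordering in SQLite is a `CASE` expression over the bucket, sketched below. The three freshness columns are taken from this README; the `title` column, the function names, the tie-break on `COALESCE(posted_at_source, observed_at)`, and the demo rows are assumptions made for illustration and may not match the actual Jobs API query.

```python
import sqlite3

# Lower rank sorts first; unrecognized or NULL buckets fall to the end.
BUCKET_RANK_SQL = """
CASE freshness_bucket
    WHEN 'lt_24h' THEN 0
    WHEN '24_72h' THEN 1
    WHEN 'gt_72h' THEN 2
    ELSE 3
END
"""


def ranked_titles(conn: sqlite3.Connection, only_lt_24h: bool = False) -> list[str]:
    """Return job titles ordered by bucket, then by most recent timestamp."""
    where = "WHERE freshness_bucket = 'lt_24h'" if only_lt_24h else ""
    sql = (
        f"SELECT title FROM jobs {where} "
        f"ORDER BY {BUCKET_RANK_SQL}, COALESCE(posted_at_source, observed_at) DESC"
    )
    return [row[0] for row in conn.execute(sql)]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (title TEXT, posted_at_source TEXT, "
        "observed_at TEXT, freshness_bucket TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?)",
        [
            ("Old role", "2024-01-01T00:00:00", "2024-01-10T00:00:00", "gt_72h"),
            ("Fresh role", "2024-01-09T20:00:00", "2024-01-10T00:00:00", "lt_24h"),
            ("Unknown role", None, "2024-01-10T00:00:00", "unknown"),
            ("Recent role", "2024-01-08T00:00:00", "2024-01-10T00:00:00", "24_72h"),
            ("Fresher role", "2024-01-09T23:00:00", "2024-01-10T00:00:00", "lt_24h"),
        ],
    )
    return conn
```

The `only_lt_24h` flag mirrors the dashboard toggle: it narrows the result set without changing the ordering logic. Timestamps are stored as ISO 8601 text here so that string comparison matches chronological order.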
What: Keep watchlist and existing flows stable with richer ATS outputs.
How:
- Watchlist enumeration now supports both legacy `list[str]` and richer ATS metadata records
- Added full-suite test entrypoint and validation target (`make test-all`)
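
Supporting both shapes usually comes down to normalizing each entry at the boundary. The sketch below assumes the legacy strings are job URLs and that richer records are dicts carrying a `url` key plus optional freshness fields; neither detail is confirmed by this README, and `normalize_entry` is a hypothetical name.

```python
from typing import Any, Union

WatchlistEntry = Union[str, dict[str, Any]]


def normalize_entry(entry: WatchlistEntry) -> dict[str, Any]:
    """Coerce a legacy string or a richer record into one common shape."""
    if isinstance(entry, str):
        # Legacy form: only the URL is known (assumption: strings are URLs).
        return {"url": entry, "posted_at_source": None, "freshness_bucket": "unknown"}
    if isinstance(entry, dict) and "url" in entry:
        return {
            "url": entry["url"],
            "posted_at_source": entry.get("posted_at_source"),
            "freshness_bucket": entry.get("freshness_bucket", "unknown"),
        }
    raise TypeError(f"unsupported watchlist entry: {entry!r}")


def normalize_watchlist(entries: list[WatchlistEntry]) -> list[dict[str, Any]]:
    """Normalize a mixed list so downstream code handles a single record type."""
    return [normalize_entry(entry) for entry in entries]
```

Downstream code then only ever sees dicts, so existing flows keep working while the newer connectors pass their metadata through.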
- Worker cycle starts (`agent/worker.py`)
- Source schedule computed (`agent/scheduler.py`)
- Discovery tools run (`agent/tools.py` + `graph/nodes.py`)
- Candidate URLs normalized and fetched
- Job extraction pipeline runs (ATS parser → JSON-LD → DOM → LLM fallback)
- Dedupe checks (exact + semantic)
- Upsert into SQLite + source audit records
- UI/API serves ranked review surface
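
The extraction step in the flow above is a fallback chain: each extractor is tried in order and the first one that yields a result wins. The sketch below shows only the JSON-LD and DOM stages using the standard library; the ATS parser would sit in front and the LLM fallback at the end. All function names and the output fields are hypothetical, and the regex-based parsing is a simplification of what BeautifulSoup or lxml would do in practice.

```python
import json
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]

JSON_LD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def extract_json_ld(html: str) -> Optional[dict]:
    """Read a schema.org JobPosting object from embedded JSON-LD, if present."""
    for match in JSON_LD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed block: try the next one
        if isinstance(data, dict) and data.get("@type") == "JobPosting":
            return {"title": data.get("title"), "posted_at_source": data.get("datePosted")}
    return None


def extract_dom_title(html: str) -> Optional[dict]:
    """Crude DOM fallback: use the page <title> as the job title."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return {"title": match.group(1).strip(), "posted_at_source": None}


def run_chain(html: str, extractors: list[tuple[str, Extractor]]) -> Optional[dict]:
    """Try extractors in order and tag the result with the stage that produced it."""
    for name, extractor in extractors:
        result = extractor(html)
        if result is not None:
            return {**result, "extracted_by": name}
    return None


CHAIN: list[tuple[str, Extractor]] = [("json_ld", extract_json_ld), ("dom", extract_dom_title)]
```

Recording which stage produced the record (`extracted_by`) makes it easier to audit how often the cheaper, more reliable stages succeed before the pipeline falls back to an LLM.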
- Python 3.12
- FastAPI + Jinja2 (dashboard)
- SQLite (primary storage)
- Requests + BeautifulSoup + lxml (HTTP + HTML parsing)
- Playwright / nodriver (browser-assisted paths)
- Pydantic (typed config/schemas)
- LangGraph-style node workflow
- Optional Ollama-backed LLM classification/extraction
- RapidFuzz (string similarity)
- Sentence-transformers embeddings (semantic scoring)
- Langfuse integration (configurable)
- Rich run/cycle/source stats + event logging
- `src/job_agent/agent/`: worker loop, scheduler, tool registry
- `src/job_agent/graph/`: orchestration nodes and state transitions
- `src/job_agent/sources/`: discovery channels (ATS, funding, LinkedIn)
- `src/job_agent/extract/`: extraction pipeline + parsers + schema
- `src/job_agent/dedupe/`: hashing, embeddings, scoring
- `src/job_agent/db/`: schema, migrations, repository methods
- `ui/`: FastAPI server, templates, static assets
- `tests/`: unit/integration tests
    make setup
    cp .env.example .env
    make migrate

Optional (if using local Ollama flows):

    make ollama

    make worker

    make run

    make review
    # http://127.0.0.1:8501

    make test-all  # full suite with PYTHONPATH=src
    make test
    make lint
    make format

Primary entities:
- `jobs`
- `job_sources`
- `search_runs`
- `agent_cycles`
- `agent_source_stats`
- `funding_events`
- `companies`
- `linkedin_posts`
Freshness fields in jobs:
- `posted_at_source`
- `observed_at`
- `freshness_bucket`
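
Adding these fields to an existing SQLite table can be done idempotently by checking `PRAGMA table_info` first. This is a minimal sketch under stated assumptions: the column types, the `'unknown'` default, and the function name are illustrative and not taken from `src/job_agent/db/migrate.py`.

```python
import sqlite3

# Assumed column declarations; timestamps stored as ISO 8601 text.
FRESHNESS_COLUMNS = {
    "posted_at_source": "TEXT",
    "observed_at": "TEXT",
    "freshness_bucket": "TEXT DEFAULT 'unknown'",
}


def add_freshness_columns(conn: sqlite3.Connection) -> list[str]:
    """Add any missing freshness columns to `jobs`; return the names added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(jobs)")}
    added = []
    for name, declaration in FRESHNESS_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE jobs ADD COLUMN {name} {declaration}")
            added.append(name)
    conn.commit()
    return added
```

Because the function only adds what is missing, it is safe to run on every startup: a second call on an already-migrated database returns an empty list.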
- End-to-end discovery, extraction, dedupe, persistence, and review workflow is operational
- ATS API coverage expanded from three connectors (ashby, lever, greenhouse) to six, adding workday, smartrecruiters, and icims
- Full test suite passing
- Premium UI and freshness-prioritized application workflow in place
- Config lives in `config.yaml`
- Environment variables documented in `.env.example`
- Database path defaults to `./data/jobs.db`