Role Collector

Role Collector is an autonomous job-sourcing agent. It finds fresh job postings (especially those posted within the last 24 hours) across applicant tracking system (ATS) platforms and other discovery channels, then deduplicates, stores, and surfaces them in a review dashboard.

What We Built

We implemented a full pipeline that continuously discovers job URLs, extracts structured job data, deduplicates records, and ranks results for fast application workflows.

Major Capabilities

Autonomous worker loop with scheduling, source prioritization, backoff, and cycle summaries
Multi-source discovery:
- ATS API discovery (seed + learned slugs)
- ATS Google search (host-targeted + broad)
- Funding/event discovery + company resolver + watchlist polling
- LinkedIn public post discovery
ATS API connector expansion (pattern-port approach):
- ashby, lever, greenhouse (existing)
- workday, smartrecruiters, icims (added)
Deterministic sharded seed traversal for large catalogs
Freshness metadata across pipeline:
- posted_at_source, observed_at, freshness_bucket (lt_24h, 24_72h, gt_72h, unknown)
Two-layer dedupe:
- Exact idempotency (description hash)
- Semantic duplicate scoring
Premium dashboard UI with freshness-first ranking and filtering
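
The backoff behavior mentioned for the worker loop could be sketched roughly as follows. The function name `next_delay`, the base and cap values, and the doubling policy are all assumptions for illustration; the actual scheduler logic is not shown in this README.

```python
def next_delay(consecutive_failures: int,
               base_seconds: float = 60.0,
               cap_seconds: float = 3600.0) -> float:
    """Return how long to wait before retrying a source.

    Hypothetical policy: the delay doubles with each consecutive failure
    and never exceeds cap_seconds. A healthy source uses base_seconds.
    """
    if consecutive_failures <= 0:
        return base_seconds
    return min(cap_seconds, base_seconds * (2 ** consecutive_failures))
```

A capped exponential delay keeps a failing source from being hammered while still letting it recover within a bounded time.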

Recent Implementation Highlights (What + How)

1) ATS Discovery Strengthening

What: Expanded ATS API discovery coverage and quality.

How:

Added Workday / SmartRecruiters / iCIMS clients under src/job_agent/sources/ats_api/clients.py
Ported robust patterns in the ATS-scrapers style: pagination, retries, canonical URL shaping, and metadata extraction
Added normalization helpers in src/job_agent/sources/ats_api/normalize.py for employment type, remote type, timestamps, and freshness bucketing
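
As an illustration of the freshness bucketing helper, here is a minimal sketch that maps a posting timestamp to one of the four bucket names used in this README. The function name, the boundary handling (exactly 24h falls into `24_72h`, exactly 72h stays in `24_72h`), and the treatment of naive timestamps as UTC are assumptions, not the behavior of `normalize.py`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_bucket(posted_at: Optional[datetime],
                     now: Optional[datetime] = None) -> str:
    """Classify a posting by age into lt_24h, 24_72h, gt_72h, or unknown."""
    if posted_at is None:
        return "unknown"
    if now is None:
        now = datetime.now(timezone.utc)
    # Assumption: naive timestamps are interpreted as UTC.
    if posted_at.tzinfo is None:
        posted_at = posted_at.replace(tzinfo=timezone.utc)
    age = now - posted_at
    # A timestamp slightly in the future (negative age) counts as fresh.
    if age < timedelta(hours=24):
        return "lt_24h"
    if age <= timedelta(hours=72):
        return "24_72h"
    return "gt_72h"
```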

2) Deterministic Seed Sharding

What: Prevent offset drift and inconsistent shard traversal across cycles.

How:

Seed providers are now processed in deterministic sorted order before flattening/sharding
Offset memory remains stable across worker runs
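
The idea can be sketched as below: sort providers before flattening so the flattened order is identical on every run, then take a fixed-size window starting at a stored offset. The function name `next_shard`, the wrap-around behavior, and sorting slugs within each provider are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Seed = Tuple[str, str]  # (provider, slug)


def next_shard(seeds_by_provider: Dict[str, List[str]],
               offset: int,
               shard_size: int) -> Tuple[List[Seed], int]:
    """Return the next shard of seeds and the offset to store for the next cycle."""
    # Sorting makes the flattened order independent of dict/list insertion order.
    flat: List[Seed] = [
        (provider, slug)
        for provider in sorted(seeds_by_provider)
        for slug in sorted(seeds_by_provider[provider])
    ]
    if not flat:
        return [], 0
    start = offset % len(flat)
    count = min(shard_size, len(flat))
    # Wrap around the end of the catalog so every seed is eventually visited.
    shard = [flat[(start + i) % len(flat)] for i in range(count)]
    return shard, (start + count) % len(flat)
```

Because the flattened order no longer depends on how the seed providers were loaded, a stored offset points at the same seed in every worker run.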

3) Freshness-First Pipeline + UI

What: Make "apply-now" jobs easier to prioritize.

How:

Persisted freshness metadata in jobs table (posted_at_source, observed_at, freshness_bucket)
Added DB migration support in src/job_agent/db/migrate.py
Jobs API now sorts freshness-first
Dashboard adds:
- Freshness filter
- <24h only toggle

4) Compatibility + Reliability

What: Keep watchlist and existing flows stable with richer ATS outputs.

How:

Watchlist enumeration now supports both legacy list[str] and richer ATS metadata records
Added full-suite test entrypoint and validation target (make test-all)
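
One way to accept both watchlist shapes is to normalize everything to records at the boundary, as in this sketch. The `url` key and the function name are hypothetical; the real record fields are not listed in this README.

```python
from typing import Any, Dict, Iterable, List, Union

WatchlistEntry = Union[str, Dict[str, Any]]


def normalize_watchlist(entries: Iterable[WatchlistEntry]) -> List[Dict[str, Any]]:
    """Convert legacy URL strings and richer metadata dicts to a single record shape."""
    records: List[Dict[str, Any]] = []
    for entry in entries:
        if isinstance(entry, str):
            # Legacy form: a bare job URL with no metadata.
            records.append({"url": entry})
        elif isinstance(entry, dict) and "url" in entry:
            # Richer form: copy so callers cannot mutate the source record.
            records.append(dict(entry))
        else:
            raise TypeError(f"unsupported watchlist entry: {entry!r}")
    return records
```

Downstream code then only handles one shape, and legacy callers that pass list[str] keep working unchanged.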

High-Level Architecture

Worker cycle starts (agent/worker.py)
Source schedule computed (agent/scheduler.py)
Discovery tools run (agent/tools.py + graph/nodes.py)
Candidate URLs normalized and fetched
Job extraction pipeline runs (ATS parser → JSON-LD → DOM → LLM fallback)
Dedupe checks (exact + semantic)
Upsert into SQLite + source audit records
UI/API serves ranked review surface
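
The extraction step in the flow above is a fallback chain: each extractor is tried in order and the first one that returns data wins. A minimal sketch, assuming each extractor is a callable that takes HTML and returns a dict or None (the `Extractor` type, the function name, and the `extracted_by` field are illustrative assumptions):

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical extractor signature: HTML in, structured fields or None out.
Extractor = Callable[[str], Optional[Dict[str, str]]]


def extract_with_fallback(
    html: str,
    extractors: List[Tuple[str, Extractor]],
) -> Optional[Dict[str, str]]:
    """Run extractors in priority order and return the first non-empty result."""
    for name, extractor in extractors:
        result = extractor(html)
        if result:
            # Record which stage produced the data, for auditing.
            return {**result, "extracted_by": name}
    return None
```

Ordering the chain from cheapest and most structured (ATS parser, JSON-LD) to most expensive (LLM fallback) means the LLM only runs when the earlier stages return nothing.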

Tech Stack

Core Runtime

Python 3.12
FastAPI + Jinja2 (dashboard)
SQLite (primary storage)
Requests + BeautifulSoup + lxml (HTTP + HTML parsing)
Playwright / nodriver (browser-assisted paths)
Pydantic (typed config/schemas)

Agent / Intelligence

LangGraph-style node workflow
Optional Ollama-backed LLM classification/extraction

Dedupe

RapidFuzz (string similarity)
Sentence-transformers embeddings (semantic scoring)
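
The two dedupe layers can be illustrated with a standard-library sketch. Note the substitution: `difflib.SequenceMatcher` stands in here for RapidFuzz and the sentence-transformers embeddings, so the second layer in this example is lexical rather than truly semantic. The function names, the whitespace/case normalization, and the 0.9 threshold are assumptions.

```python
import hashlib
import re
from difflib import SequenceMatcher


def description_hash(description: str) -> str:
    """Exact-idempotency key: SHA-256 of the whitespace- and case-normalized text."""
    normalized = re.sub(r"\s+", " ", description).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for the real similarity scoring."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify_duplicate(new_desc: str, existing_desc: str,
                       threshold: float = 0.9) -> str:
    """Return 'exact', 'near', or 'distinct' for a pair of job descriptions."""
    if description_hash(new_desc) == description_hash(existing_desc):
        return "exact"
    if similarity(new_desc, existing_desc) >= threshold:
        return "near"
    return "distinct"
```

The cheap hash check runs first, so the more expensive similarity scoring is only needed for descriptions that are not byte-for-byte equivalent after normalization.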

Observability / Ops

Langfuse integration (configurable)
Rich run/cycle/source stats + event logging

Repository Layout

src/job_agent/agent/ — worker loop, scheduler, tool registry
src/job_agent/graph/ — orchestration nodes and state transitions
src/job_agent/sources/ — discovery channels (ATS, funding, LinkedIn)
src/job_agent/extract/ — extraction pipeline + parsers + schema
src/job_agent/dedupe/ — hashing, embeddings, scoring
src/job_agent/db/ — schema, migrations, repository methods
ui/ — FastAPI server, templates, static assets
tests/ — unit/integration tests

Setup

make setup
cp .env.example .env
make migrate

Optional (if using local Ollama flows):

make ollama

Run

Autonomous worker

make worker

Manual run

make run

Dashboard

make review
# http://127.0.0.1:8501

Test & Quality

make test-all    # full suite with PYTHONPATH=src
make test
make lint
make format

Data Model Notes

Primary entities:

jobs
job_sources
search_runs
agent_cycles
agent_source_stats
funding_events
companies
linkedin_posts

Freshness fields in jobs:

posted_at_source
observed_at
freshness_bucket
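
For orientation, here is a hypothetical, reduced version of the jobs table with the three freshness fields and an idempotent upsert. Only `posted_at_source`, `observed_at`, and `freshness_bucket` come from this README; the other columns, the choice of `description_hash` as primary key, and the function name are assumptions, and the real schema lives in src/job_agent/db/.

```python
import sqlite3
from typing import Any, Dict

# Assumed, simplified schema; not the project's actual DDL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    description_hash TEXT PRIMARY KEY,
    url              TEXT NOT NULL,
    posted_at_source TEXT,
    observed_at      TEXT NOT NULL,
    freshness_bucket TEXT NOT NULL DEFAULT 'unknown'
)
"""

# Re-observing the same job refreshes its freshness fields instead of adding a row.
UPSERT = """
INSERT INTO jobs (description_hash, url, posted_at_source, observed_at, freshness_bucket)
VALUES (:description_hash, :url, :posted_at_source, :observed_at, :freshness_bucket)
ON CONFLICT(description_hash) DO UPDATE SET
    observed_at = excluded.observed_at,
    freshness_bucket = excluded.freshness_bucket
"""


def upsert_job(conn: sqlite3.Connection, job: Dict[str, Any]) -> None:
    """Insert a job, or update its freshness fields if the hash already exists."""
    conn.execute(UPSERT, job)
    conn.commit()
```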

Current Status

End-to-end discovery, extraction, dedupe, persistence, and review workflow is operational
ATS API coverage expanded from three connectors (ashby, lever, greenhouse) to six (adding workday, smartrecruiters, icims)
Full test suite passing
Premium UI and freshness-prioritized application workflow in place

Notes

Config lives in config.yaml
Environment variables documented in .env.example
Database path defaults to ./data/jobs.db
