tavaresgmg/data-sync-ingestion-coding-challenge

DataSync Ingestion (Coding Challenge)

Production-grade ingestion job that extracts 3,000,000 events from the DataSync Analytics API and persists them into PostgreSQL, with pagination, rate limit handling, retries, and resumability. Everything runs in Docker Compose.

Requirements (must)

  • Runs entirely in Docker using the provided docker-compose.yml.
  • Works with: sh run-ingestion.sh (scripts are POSIX sh compatible).
  • TypeScript implementation under packages/.
  • Stores events in Postgres table ingested_events.
  • Resumable (safe to restart; continues from DB checkpoint).
  • No external dependencies beyond the target API and local Postgres (no extra services, no extra keys).

How To Run

Preflight:

sh preflight.sh

Run ingestion:

export TARGET_API_KEY=YOUR_API_KEY
sh run-ingestion.sh

Export all event IDs (after ingestion completes):

sh export-event-ids.sh

Prepare submission (prints a curl command):

export GITHUB_REPO_URL=https://github.com/tavaresgmg/data-sync-ingestion-coding-challenge
sh prepare-submission.sh event_ids.txt

Architecture (high level)

  • Source
    • Stream-first: POST /internal/dashboard/stream-access -> streamAccess.endpoint + token header/value.
    • Fallback: GET /api/v1/events (cursor + limit).
  • Reliability
    • Idempotent inserts via PRIMARY KEY (event_id) + ON CONFLICT DO NOTHING.
    • Checkpoint table ingestion_state stores cursor + last timestamp for time-overlap resume.
  • Throughput
    • Fetch/insert pipelining (no cursor parallelism).
    • Batch insert uses UNNEST(...) to avoid massive VALUES(...) overhead.
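
The idempotent UNNEST insert described above can be sketched as a query builder. This is a minimal sketch, not the repository's actual code: the EventRow shape, the column names occurred_at and payload, and the function name buildBatchInsert are assumptions made for illustration. Only the table name ingested_events, the event_id primary key, and ON CONFLICT DO NOTHING come from this README.

```typescript
// Hypothetical event shape; the real columns of ingested_events are not listed in this README.
interface EventRow {
  eventId: string;
  occurredAt: string; // ISO-8601 timestamp
  payload: unknown;
}

interface BatchInsert {
  text: string;
  values: [string[], string[], string[]];
}

// Builds one parameterized statement for a whole batch: three array
// parameters instead of 3 * N scalar placeholders in a VALUES list.
function buildBatchInsert(rows: EventRow[]): BatchInsert {
  const text = `
    INSERT INTO ingested_events (event_id, occurred_at, payload)
    SELECT u.event_id, u.occurred_at, u.payload::jsonb
    FROM UNNEST($1::text[], $2::timestamptz[], $3::text[])
      AS u(event_id, occurred_at, payload)
    ON CONFLICT (event_id) DO NOTHING`;
  return {
    text,
    values: [
      rows.map((r) => r.eventId),
      rows.map((r) => r.occurredAt),
      rows.map((r) => JSON.stringify(r.payload)),
    ],
  };
}
```

Because the statement text is identical for every batch size, a driver can reuse it unchanged, and re-sending a batch after a restart only skips rows whose event_id already exists.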

Architecture Decisions (summary)

  • Stream-first ingestion with /events fallback (throughput-first, but safe).
  • Resumability: continue from the stored cursor when it is still valid; otherwise resume by time window using until = lowWatermark + overlap, relying on the idempotent insert to drop the re-fetched duplicates.
  • Keep the system “single-service” (no extra infra) to satisfy the Docker-only verification requirement.
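
The resume decision above can be illustrated with a small pure function. This is a sketch under stated assumptions: the Checkpoint fields, the cursorValid flag (how validity is actually detected is not described here), and the name planResume are hypothetical. Only the rule until = lowWatermark + overlap comes from this README.

```typescript
// Hypothetical view of one ingestion_state row.
interface Checkpoint {
  cursor: string | null;
  cursorValid: boolean; // assumption: the caller has already checked the cursor
  lowWatermarkMs: number | null; // last checkpointed event timestamp, epoch ms
}

type ResumePlan =
  | { mode: "cursor"; cursor: string }
  | { mode: "time-window"; untilMs: number }
  | { mode: "fresh" };

// Prefer the cursor; otherwise re-enter by time with a small overlap so that
// no event near the watermark is missed. Duplicates inside the overlap are
// expected and are dropped by the idempotent insert.
function planResume(cp: Checkpoint, overlapMs: number): ResumePlan {
  if (cp.cursor !== null && cp.cursorValid) {
    return { mode: "cursor", cursor: cp.cursor };
  }
  if (cp.lowWatermarkMs !== null) {
    return { mode: "time-window", untilMs: cp.lowWatermarkMs + overlapMs };
  }
  return { mode: "fresh" };
}
```

The overlap trades a few redundant rows per restart for not having to reason about events that share the watermark timestamp.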

API Discovery Notes

From the dashboard bundle (static analysis), Chrome DevTools Network inspection (clicking Start Stream), and unauthenticated endpoints:

  • /api/v1/events appears to be capped at limit=5000 and rate-limited to 10 requests/min (via X-RateLimit-*).
  • Rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
    • Reset can be epoch seconds, epoch milliseconds, or delta seconds (handled).
  • 429 handling: Retry-After can be delta-seconds or HTTP-date (handled).
  • Stream access: POST /internal/dashboard/stream-access (auth required) provides a high-throughput stream endpoint.
    • In practice, this route appears to be gated behind “dashboard-like” request headers. I captured the request headers used by the dashboard's Start Stream button in Chrome DevTools and mirrored them in the ingestion job to unlock the stream.
    • The stream feed also appears to be capped at limit=5000 per request (higher values are ignored).
  • Bulk details: POST /api/v1/events/bulk accepts { ids: [...] }.
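
The header handling noted above (three possible X-RateLimit-Reset formats, two Retry-After formats) can be sketched as two parsers that each return a wait time in milliseconds. The function names and the magnitude thresholds used to tell the reset formats apart are assumptions for illustration, not necessarily how the job decides.

```typescript
// X-RateLimit-Reset -> milliseconds to wait, or null if unparseable.
// Assumed heuristic: values >= 1e12 are epoch milliseconds, values >= 1e9
// are epoch seconds, anything smaller is a delta in seconds.
function resetDelayMs(raw: string, nowMs: number): number | null {
  const trimmed = raw.trim();
  const n = Number(trimmed);
  if (trimmed === "" || !Number.isFinite(n) || n < 0) return null;
  if (n >= 1e12) return Math.max(0, n - nowMs);
  if (n >= 1e9) return Math.max(0, n * 1000 - nowMs);
  return n * 1000;
}

// Retry-After -> milliseconds to wait, or null if unparseable.
// Accepts delta-seconds ("30") or an HTTP-date.
function retryAfterDelayMs(raw: string, nowMs: number): number | null {
  const trimmed = raw.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}
```

Passing the current time in as nowMs instead of calling Date.now() inside keeps both functions deterministic and easy to unit test.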

Performance Tuning (runbook)

Progress logs (ingest.progress) include:

  • eventsPerMinute
  • fetchMsEma, dbTxMsEma, dbInsertMsEma, dbStateMsEma
  • rateLimit snapshot

Notes:

  • eventsPerMinute is computed from the most recent progress interval only, so it fluctuates with network jitter and server-side response time. The dominant cost is fetchMsEma; DB work is typically under 200 ms.
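
For readers interpreting these fields, here is a minimal sketch of how such metrics are commonly computed. It assumes the Ema suffix denotes an exponential moving average; the smoothing factor alpha and both function names are hypothetical and are not taken from the job's code.

```typescript
// Exponential moving average: the first sample seeds the average, later
// samples are blended in with weight alpha (0 < alpha <= 1).
function updateEma(prev: number | null, sample: number, alpha: number): number {
  return prev === null ? sample : alpha * sample + (1 - alpha) * prev;
}

// Rate over a single progress interval, scaled to one minute.
function eventsPerMinute(eventsInInterval: number, intervalMs: number): number {
  return intervalMs > 0 ? (eventsInInterval * 60_000) / intervalMs : 0;
}
```

For example, one 5000-event batch inside a 10000 ms interval reports 30000 events per minute, which is why a single slow response visibly moves the number.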

Safe knobs:

  • BATCH_SIZE (default 5000)
  • RATE_LIMIT_BUFFER (default 1)
  • PROGRESS_LOG_INTERVAL_MS (default 10000)
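
A sketch of reading these knobs with the defaults listed above. The variable names and defaults come from this README; the Knobs interface, the helper names, and the validation rule (non-negative integers only) are assumptions.

```typescript
interface Knobs {
  batchSize: number;
  rateLimitBuffer: number;
  progressLogIntervalMs: number;
}

type Env = Record<string, string | undefined>;

// Returns the fallback when the variable is unset or blank, and fails fast
// on anything that is not a non-negative integer.
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return n;
}

function loadKnobs(env: Env): Knobs {
  return {
    batchSize: readInt(env, "BATCH_SIZE", 5000),
    rateLimitBuffer: readInt(env, "RATE_LIMIT_BUFFER", 1),
    progressLogIntervalMs: readInt(env, "PROGRESS_LOG_INTERVAL_MS", 10000),
  };
}
```

In a Node.js process this would be called as loadKnobs(process.env). Taking the environment as an argument keeps the function testable without mutating global state.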

Speed-run knob (durability trade-off):

export PG_SYNC_COMMIT=off

Testing

Local:

npm test

Notes:

  • Unit tests always run.
  • Integration tests are enabled when TEST_DATABASE_URL is set (CI runs them with a Postgres service).

What I Would Improve With More Time

  • Time-window sharding with multiple workers (per-shard checkpoint + overlap) using since/until to push throughput well beyond a single cursor stream.
  • Deeper HTTP tuning for lower per-request latency: undici Agent (keep-alive), more aggressive pipelining, and a small prefetch buffer (queue) to reduce jitter.
  • Postgres COPY-based ingestion pipeline for maximum DB throughput (more complex, higher payoff).
  • Add an ingestion health endpoint (optional) and structured metrics export (OTel/Prometheus).
  • Add coverage reporting (vitest --coverage) as a non-gating signal.

Docs Index

  • ADRs: docs/adr/

AI Tools Used

This solution was developed with assistance from:

  • OpenAI Codex (GPT-5.3-Codex, high/xhigh) for architecture iteration, edge cases, tests, scripts, and documentation.
  • Claude Opus 4.6 (via Claude Code) for additional review/refinement on parts of the solution.
