Production-grade ingestion job that extracts 3,000,000 events from the DataSync Analytics API and persists them into PostgreSQL, with pagination, rate limit handling, retries, and resumability. Everything runs in Docker Compose.
Key links:
- Dashboard: http://datasync-dev-alb-101078500.us-east-1.elb.amazonaws.com
- API base: http://datasync-dev-alb-101078500.us-east-1.elb.amazonaws.com/api/v1
- Runs entirely in Docker using the provided `docker-compose.yml`.
- Works with `sh run-ingestion.sh` (scripts are POSIX `sh` compatible).
- TypeScript implementation under `packages/`.
- Stores events in Postgres table `ingested_events`.
- Resumable (safe to restart; continues from DB checkpoint).
- No external dependencies beyond the target API and local Postgres (no extra services, no extra keys).
Preflight:

    sh preflight.sh

Run ingestion:

    export TARGET_API_KEY=YOUR_API_KEY
    sh run-ingestion.sh

Export all event IDs (after ingestion completes):

    sh export-event-ids.sh

Prepare submission (prints a curl command):

    export GITHUB_REPO_URL=https://github.com/tavaresgmg/data-sync-ingestion-coding-challenge
    sh prepare-submission.sh event_ids.txt

- Source
  - Stream-first: `POST /internal/dashboard/stream-access` -> `streamAccess.endpoint` + token header/value.
  - Fallback: `GET /api/v1/events` (cursor + limit).
- Reliability
  - Idempotent inserts via `PRIMARY KEY (event_id)` + `ON CONFLICT DO NOTHING`.
  - Checkpoint table `ingestion_state` stores cursor + last timestamp for time-overlap resume.
- Throughput
  - Fetch/insert pipelining (no cursor parallelism).
  - Batch insert uses `UNNEST(...)` to avoid massive `VALUES(...)` overhead.
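
The idempotent batch insert described above could look roughly like the sketch below. It is not the repository's actual query: only the table name `ingested_events` and the `event_id` key come from this README, while the `occurred_at` and `payload` columns, the `IngestEvent` shape, and the `buildUnnestInsert` helper are assumptions made for illustration.

```typescript
// Hypothetical event shape; only `event_id` is confirmed by the README.
interface IngestEvent {
  eventId: string;
  occurredAt: string; // ISO-8601 timestamp (assumed column)
  payload: unknown; // raw event body (assumed column)
}

interface UnnestInsert {
  text: string;
  values: [string[], string[], string[]];
}

// Builds one INSERT that binds three array parameters instead of
// 3 * N scalar placeholders, so the statement text does not grow with
// the batch size. ON CONFLICT DO NOTHING makes replays harmless.
function buildUnnestInsert(events: IngestEvent[]): UnnestInsert {
  const text = `
    INSERT INTO ingested_events (event_id, occurred_at, payload)
    SELECT u.event_id, u.occurred_at, u.payload::jsonb
    FROM UNNEST($1::text[], $2::timestamptz[], $3::text[])
      AS u(event_id, occurred_at, payload)
    ON CONFLICT (event_id) DO NOTHING`;
  return {
    text,
    values: [
      events.map((e) => e.eventId),
      events.map((e) => e.occurredAt),
      events.map((e) => JSON.stringify(e.payload)),
    ],
  };
}
```

With a Postgres client that accepts a statement plus a parameter array, the result would be passed as `(text, values)`. A batch of 5000 events still sends exactly three parameters.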
- Stream-first ingestion with `/events` fallback (throughput-first, but safe).
- Resumability via cursor when valid; otherwise time-window resume using `until = lowWatermark + overlap` with idempotent dedupe.
- Keep the system “single-service” (no extra infra) to satisfy the Docker-only verification requirement.
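
The resume decision above can be sketched as a small pure function. This is a minimal illustration, not the repository's code: the `Checkpoint` shape, the `planResume` name, and the reading of `lowWatermark` as the oldest timestamp ingested so far (which assumes events are read newest-first) are all assumptions.

```typescript
// Hypothetical checkpoint shape; the README only says the state table
// stores a cursor and a last timestamp.
interface Checkpoint {
  cursor: string | null;
  lowWatermarkMs: number | null; // assumed: oldest event time ingested so far
}

type ResumePlan =
  | { mode: "fresh" }
  | { mode: "cursor"; cursor: string }
  | { mode: "time-window"; untilMs: number };

// Prefers the stored cursor when it is still valid. Otherwise resumes by
// time window, re-reading `overlapMs` of already-ingested events; the
// duplicates are dropped by the idempotent insert.
function planResume(
  cp: Checkpoint,
  cursorIsValid: boolean,
  overlapMs: number,
): ResumePlan {
  if (cp.cursor !== null && cursorIsValid) {
    return { mode: "cursor", cursor: cp.cursor };
  }
  if (cp.lowWatermarkMs !== null) {
    // until = lowWatermark + overlap
    return { mode: "time-window", untilMs: cp.lowWatermarkMs + overlapMs };
  }
  return { mode: "fresh" };
}
```

The overlap trades a little redundant fetching for the guarantee that no event near the watermark boundary is skipped.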
From the dashboard bundle (static analysis), Chrome DevTools Network inspection (clicking Start Stream), and unauthenticated endpoints:
- `/api/v1/events` appears to be capped at `limit=5000` and rate-limited to `10 requests/min` (via `X-RateLimit-*`).
- Rate limit headers: `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`.
  - Reset can be epoch seconds, epoch milliseconds, or delta seconds (handled).
- 429 handling: `Retry-After` can be delta-seconds or HTTP-date (handled).
- Stream access: `POST /internal/dashboard/stream-access` (auth required) provides a high-throughput stream endpoint.
  - In practice, this route appears to be gated behind “dashboard-like” request headers. I captured the request headers used by the dashboard's Start Stream button in Chrome DevTools and mirrored them in the ingestion job to unlock the stream.
  - The stream feed also appears to be capped at `limit=5000` per request (higher values are ignored).
- Bulk details: `POST /api/v1/events/bulk` accepts `{ ids: [...] }`.
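
One way to handle the header variants listed above is sketched below. The function names and the magnitude thresholds used to tell epoch milliseconds, epoch seconds, and delta seconds apart are assumptions for this example; the ingestion job may distinguish them differently.

```typescript
// Converts an X-RateLimit-Reset value into a delay in milliseconds.
// The thresholds are a heuristic assumed for this sketch: values of at
// least 1e12 are read as epoch milliseconds, values of at least 1e9 as
// epoch seconds, and anything smaller as delta seconds.
function resetDelayMs(raw: string, nowMs: number): number | null {
  const v = raw.trim();
  const n = Number(v);
  if (v === "" || !Number.isFinite(n) || n < 0) return null;
  if (n >= 1e12) return Math.max(0, n - nowMs); // epoch milliseconds
  if (n >= 1e9) return Math.max(0, n * 1000 - nowMs); // epoch seconds
  return n * 1000; // delta seconds
}

// Converts a Retry-After value (delta-seconds or HTTP-date) into a
// delay in milliseconds. Returns null when the value is unparseable.
function retryAfterDelayMs(raw: string, nowMs: number): number | null {
  const v = raw.trim();
  if (/^\d+$/.test(v)) return Number(v) * 1000; // delta-seconds form
  const t = Date.parse(v); // HTTP-date form
  return Number.isNaN(t) ? null : Math.max(0, t - nowMs);
}
```

Passing `nowMs` explicitly keeps both functions deterministic and easy to unit test; a caller would typically pass `Date.now()`.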
Progress logs (`ingest.progress`) include:
- `eventsPerMinute`
- `fetchMsEma`, `dbTxMsEma`, `dbInsertMsEma`, `dbStateMsEma`
- `rateLimit` snapshot

Notes:
- `eventsPerMinute` is computed from the most recent progress interval and will naturally fluctuate with network jitter and server-side response time (the dominant cost is `fetchMsEma`; DB work is typically under 200 ms).
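
For readers unfamiliar with these metrics, the sketch below shows how an exponential moving average and a per-interval rate are commonly computed. It is an illustration only: the smoothing factor `alpha = 0.2` and both function names are assumptions, not values taken from the ingestion job.

```typescript
// Standard exponential moving average. The first sample seeds the
// average; `alpha` (assumed 0.2 here) weights the newest sample.
function updateEma(prev: number | null, sample: number, alpha = 0.2): number {
  return prev === null ? sample : alpha * sample + (1 - alpha) * prev;
}

// Rate over the most recent progress interval only, scaled to one
// minute. Because it ignores earlier intervals, it moves with every
// slow or fast fetch.
function eventsPerMinute(eventsInInterval: number, intervalMs: number): number {
  if (intervalMs <= 0) return 0;
  return Math.round((eventsInInterval * 60_000) / intervalMs);
}
```

For example, 50,000 events inserted during one 10,000 ms interval corresponds to 300,000 events per minute.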
Safe knobs:
- `BATCH_SIZE` (default `5000`)
- `RATE_LIMIT_BUFFER` (default `1`)
- `PROGRESS_LOG_INTERVAL_MS` (default `10000`)
Speed-run knob (durability trade-off):

    export PG_SYNC_COMMIT=off

Local:

    npm test

Notes:
- Unit tests always run.
- Integration tests are enabled when `TEST_DATABASE_URL` is set (CI runs them with a Postgres service).
- Time-window sharding with multiple workers (per-shard checkpoint + overlap) using `since`/`until` to push throughput well beyond a single cursor stream.
- Deeper HTTP tuning for lower per-request latency: `undici` `Agent` (keep-alive), more aggressive pipelining, and a small prefetch buffer (queue) to reduce jitter.
- Postgres `COPY`-based ingestion pipeline for maximum DB throughput (more complex, higher payoff).
- Add an ingestion health endpoint (optional) and structured metrics export (OTel/Prometheus).
- Add coverage reporting (`vitest --coverage`) as a non-gating signal.
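
As a rough illustration of the time-window sharding idea in the first item, the sketch below splits a time range into per-worker `since`/`until` windows with a small overlap. Nothing like this exists in the current implementation; the `Shard` shape, the `planShards` name, and the choice to extend each shard backwards by the overlap are all assumptions.

```typescript
// Hypothetical shard descriptor: one since/until window per worker.
interface Shard {
  index: number;
  sinceMs: number;
  untilMs: number;
}

// Splits [startMs, endMs) into up to `workers` contiguous windows and
// extends each window backwards by `overlapMs`, so events that sit on a
// boundary are fetched twice and deduplicated by the idempotent insert.
function planShards(
  startMs: number,
  endMs: number,
  workers: number,
  overlapMs: number,
): Shard[] {
  if (workers < 1 || endMs <= startMs) return [];
  const span = Math.ceil((endMs - startMs) / workers);
  const shards: Shard[] = [];
  for (let i = 0; i < workers; i++) {
    const since = startMs + i * span;
    if (since >= endMs) break; // range exhausted before all workers are used
    shards.push({
      index: i,
      sinceMs: Math.max(startMs, since - overlapMs),
      untilMs: Math.min(endMs, since + span),
    });
  }
  return shards;
}
```

Each shard would need its own checkpoint row, and the API rate limit would still cap the combined request rate across workers.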
- ADRs: `docs/adr/`
This solution was developed with assistance from:
- OpenAI Codex (GPT-5.3-Codex, high/xhigh) for architecture iteration, edge cases, tests, scripts, and documentation.
- Claude Opus 4.6 (via Claude Code) for additional review/refinement on parts of the solution.