DataSync Ingestion (Coding Challenge)

Production-grade ingestion job that extracts 3,000,000 events from the DataSync Analytics API and persists them into PostgreSQL, with pagination, rate-limit handling, retries, and resumability. Everything runs in Docker Compose.

Key links:

Dashboard: http://datasync-dev-alb-101078500.us-east-1.elb.amazonaws.com
API base: http://datasync-dev-alb-101078500.us-east-1.elb.amazonaws.com/api/v1

Requirements (must)

Runs entirely in Docker using the provided docker-compose.yml.
Works with: sh run-ingestion.sh (scripts are POSIX sh compatible).
TypeScript implementation under packages/.
Stores events in Postgres table ingested_events.
Resumable (safe to restart; continues from DB checkpoint).
No external dependencies beyond the target API and local Postgres (no extra services, no extra keys).

How To Run

Preflight:

sh preflight.sh

Run ingestion:

export TARGET_API_KEY=YOUR_API_KEY
sh run-ingestion.sh

Export all event IDs (after ingestion completes):

sh export-event-ids.sh

Prepare submission (prints a curl command):

export GITHUB_REPO_URL=https://github.com/tavaresgmg/data-sync-ingestion-coding-challenge
sh prepare-submission.sh event_ids.txt

Architecture (high level)

Source
- Stream-first: POST /internal/dashboard/stream-access returns streamAccess.endpoint plus the token header name and value.
- Fallback: GET /api/v1/events (cursor + limit).
Reliability
- Idempotent inserts via PRIMARY KEY (event_id) + ON CONFLICT DO NOTHING.
- Checkpoint table ingestion_state stores the cursor and the last timestamp for time-overlap resume.
Throughput
- Fetch/insert pipelining: the next page is fetched while the current one is inserted (the cursor itself is never parallelized).
- Batch insert uses UNNEST(...) to avoid the overhead of a massive VALUES(...) list.
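
A minimal sketch of the idempotent UNNEST batch insert described above. Only the table name ingested_events and the event_id key come from this README; the IngestEvent shape, the occurred_at and payload columns, and the function name are hypothetical and used for illustration only.

```typescript
// Hypothetical event shape: only event_id is confirmed by the README.
interface IngestEvent {
  eventId: string;
  occurredAt: string; // ISO-8601 timestamp (assumed column)
  payload: unknown; // raw event body (assumed column)
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

// Builds one parameterized statement for a whole batch. The batch travels as
// three array parameters instead of 3 * N scalar placeholders.
function buildBatchInsert(events: IngestEvent[]): SqlQuery {
  const ids = events.map((e) => e.eventId);
  const times = events.map((e) => e.occurredAt);
  const payloads = events.map((e) => JSON.stringify(e.payload ?? null));
  const text = `
    INSERT INTO ingested_events (event_id, occurred_at, payload)
    SELECT * FROM UNNEST($1::text[], $2::timestamptz[], $3::jsonb[])
    ON CONFLICT (event_id) DO NOTHING`;
  return { text, values: [ids, times, payloads] };
}
```

With node-postgres, the result could be passed as pool.query(q.text, q.values). Because conflicting rows are skipped, replaying the same batch after a restart does not create duplicates.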

Architecture Decisions (summary)

Stream-first ingestion with /events fallback (throughput-first, but safe).
Resumability via cursor when it is still valid; otherwise time-window resume using until = lowWatermark + overlap. The overlap re-fetches some already-stored events, which the idempotent insert discards.
Keep the system “single-service” (no extra infra) to satisfy the Docker-only verification requirement.
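
The resume decision above can be sketched roughly as follows. This is an illustration, not the actual implementation: the Checkpoint shape, the planResume name, and the cursorIsValid flag are hypothetical, and lowWatermark is assumed to be the oldest event timestamp persisted so far (epoch milliseconds).

```typescript
// Hypothetical view of what the ingestion_state checkpoint holds.
interface Checkpoint {
  cursor: string | null; // last cursor returned by the API, if any
  lowWatermarkMs: number | null; // assumed: oldest persisted event timestamp
}

type ResumePlan =
  | { mode: "fresh" }
  | { mode: "cursor"; cursor: string }
  | { mode: "time-window"; untilMs: number };

// Prefer the stored cursor; if it is no longer usable, fall back to a time
// window that overlaps the data already stored.
function planResume(
  cp: Checkpoint,
  cursorIsValid: boolean,
  overlapMs: number,
): ResumePlan {
  if (cp.cursor !== null && cursorIsValid) {
    return { mode: "cursor", cursor: cp.cursor };
  }
  if (cp.lowWatermarkMs !== null) {
    // until = lowWatermark + overlap: a few events are fetched twice and
    // dropped again by ON CONFLICT DO NOTHING.
    return { mode: "time-window", untilMs: cp.lowWatermarkMs + overlapMs };
  }
  return { mode: "fresh" };
}
```

A larger overlap costs some redundant fetches but lowers the risk of missing events that share a timestamp with the watermark.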

API Discovery Notes

Findings from static analysis of the dashboard bundle, Chrome DevTools Network inspection (clicking Start Stream), and unauthenticated endpoints:

/api/v1/events appears to be capped at limit=5000 and rate-limited to 10 requests/min (via X-RateLimit-*).
Rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
- Reset can be epoch seconds, epoch milliseconds, or delta seconds (handled).
429 handling: Retry-After can be delta-seconds or HTTP-date (handled).
Stream access: POST /internal/dashboard/stream-access (auth required) provides a high-throughput stream endpoint.
- In practice, this route appears to be gated behind “dashboard-like” request headers. I captured the request headers used by the dashboard's Start Stream button in Chrome DevTools and mirrored them in the ingestion job to unlock the stream.
- The stream feed also appears to be capped at limit=5000 per request (higher values are ignored).
Bulk details: POST /api/v1/events/bulk accepts { ids: [...] }.
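
One way to normalize the header formats listed above into a wait time in milliseconds. This is a sketch under stated assumptions: the function names are hypothetical, and the magnitude thresholds used to tell epoch milliseconds, epoch seconds, and delta seconds apart are a heuristic rather than anything the API documents.

```typescript
// X-RateLimit-Reset: epoch seconds, epoch milliseconds, or delta seconds.
// Returns the delay until reset in ms, or null if the value is unusable.
function resetDelayMs(raw: string, nowMs: number): number | null {
  const trimmed = raw.trim();
  const n = Number(trimmed);
  if (trimmed === "" || !Number.isFinite(n) || n < 0) return null;
  // Heuristic (assumption): >= 1e12 looks like epoch ms, >= 1e9 looks like
  // epoch seconds, anything smaller is treated as a delta in seconds.
  if (n >= 1e12) return Math.max(0, n - nowMs);
  if (n >= 1e9) return Math.max(0, n * 1000 - nowMs);
  return n * 1000;
}

// Retry-After: either delta-seconds or an HTTP-date.
function retryAfterDelayMs(raw: string, nowMs: number): number | null {
  const trimmed = raw.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}
```

Passing the current time as a parameter instead of calling Date.now() inside keeps both helpers deterministic and easy to unit test.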

Performance Tuning (runbook)

Progress logs (ingest.progress) include:

eventsPerMinute
fetchMsEma, dbTxMsEma, dbInsertMsEma, dbStateMsEma
rateLimit snapshot

Notes:

eventsPerMinute is computed from the most recent progress interval, so it naturally fluctuates with network jitter and server-side response time. Fetch latency (fetchMsEma) is the dominant cost; DB work is typically under 200 ms.
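
For reference, metrics of this kind could be computed roughly as follows. The Ema class, the smoothing factor alpha, and the helper name are illustrative assumptions; the actual smoothing used by the job may differ.

```typescript
// Exponential moving average: each new sample gets weight alpha, the
// accumulated history gets weight (1 - alpha).
class Ema {
  private value: number | null = null;
  private readonly alpha: number;

  constructor(alpha: number) {
    this.alpha = alpha;
  }

  update(sample: number): number {
    this.value =
      this.value === null
        ? sample // the first sample seeds the average
        : this.alpha * sample + (1 - this.alpha) * this.value;
    return this.value;
  }
}

// Throughput for a single progress interval, scaled to one minute.
function eventsPerMinute(eventsInInterval: number, intervalMs: number): number {
  return intervalMs > 0 ? (eventsInInterval / intervalMs) * 60_000 : 0;
}
```

A smaller alpha gives a smoother but slower-reacting fetchMsEma. Because eventsPerMinute covers only one interval, a single slow page is enough to move it noticeably.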

Safe knobs:

BATCH_SIZE (default 5000)
RATE_LIMIT_BUFFER (default 1)
PROGRESS_LOG_INTERVAL_MS (default 10000)
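
A hedged sketch of reading these knobs with their documented defaults. The Knobs interface and both function names are hypothetical, and clamping BATCH_SIZE to 5000 is an assumption that mirrors the server-side cap noted in API Discovery Notes.

```typescript
interface Knobs {
  batchSize: number;
  rateLimitBuffer: number;
  progressLogIntervalMs: number;
}

type Env = Record<string, string | undefined>;

// Reads a non-negative integer from the environment or returns the fallback.
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return n;
}

function loadKnobs(env: Env): Knobs {
  return {
    // Assumption: larger values are pointless because the API ignores them.
    batchSize: Math.min(readInt(env, "BATCH_SIZE", 5000), 5000),
    rateLimitBuffer: readInt(env, "RATE_LIMIT_BUFFER", 1),
    progressLogIntervalMs: readInt(env, "PROGRESS_LOG_INTERVAL_MS", 10000),
  };
}
```

In Node.js this would be called as loadKnobs(process.env). Failing fast on a malformed value is a design choice here; silently falling back to the default would hide typos.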

Speed-run knob (durability trade-off):

export PG_SYNC_COMMIT=off

Testing

Local:

npm test

Notes:

Unit tests always run.
Integration tests are enabled when TEST_DATABASE_URL is set (CI runs them with a Postgres service).

What I Would Improve With More Time

Time-window sharding with multiple workers (per-shard checkpoint + overlap) using since/until to push throughput well beyond a single cursor stream.
Deeper HTTP tuning for lower per-request latency: undici Agent (keep-alive), more aggressive pipelining, and a small prefetch buffer (queue) to reduce jitter.
Postgres COPY-based ingestion pipeline for maximum DB throughput (more complex, higher payoff).
Add an ingestion health endpoint (optional) and structured metrics export (OTel/Prometheus).
Add coverage reporting (vitest --coverage) as a non-gating signal.

Docs Index

ADRs: docs/adr/

AI Tools Used

This solution was developed with assistance from:

OpenAI Codex (GPT-5.3-Codex, high/xhigh) for architecture iteration, edge cases, tests, scripts, and documentation.
Claude Opus 4.6 (via Claude Code) for additional review/refinement on parts of the solution.
