I still remember the first time I had to explain file sync on a whiteboard. I used the analogy of a shared notebook that keeps rewriting itself whenever anyone edits a page. It sounds simple until you ask: what happens when two people edit the same page at the same time, offline, on a spotty connection? That question is why designing a Dropbox-like system is such a classic interview problem. It forces you to balance user expectations (files should “just appear” everywhere) with hard constraints (latency, storage cost, consistency, and conflict resolution).
I’ll walk you through a realistic design for a modern file sync and sharing system, the kind of system you’d build in 2026. I’ll frame it like I would in an interview: requirements first, then architecture, then deep dives into sync, metadata, storage, and scale. I’ll include pitfalls I’ve seen in real systems, practical ranges for performance, and a few code snippets where they clarify tricky mechanics. My goal is to help you tell a clear story under pressure and walk away with a design you’d actually be proud to ship.
Requirements and scope I’d lock down early
When I’m in the room, I start by asking for scope so I don’t overbuild. For a Dropbox-like system, I usually assume:
- Core: file upload, download, sync across devices, sharing via links or shared folders
- Devices: web, desktop, and mobile clients
- Files: any type, up to, say, 10 GB per file in the base version
- Sync: near real-time; I target 1–5 seconds for small file propagation under normal load
- Availability: high; “I can’t find my files” is a hard failure
- Consistency: users expect to see their own changes immediately; cross-device consistency can be eventual as long as conflicts are handled clearly
- Scale: tens of millions of users, billions of files
Out of scope for the base interview design: content editing collaboration like Google Docs, advanced content search, and regulatory compliance features beyond basic access control.
I call this out because it lets me design for fast sync of files rather than real-time collaboration, which changes everything about conflict resolution and storage.
High-level architecture and data flow
At a high level, I break the system into three planes:
1) Control plane for metadata: user accounts, file/folder tree, versions, permissions, share links.
2) Data plane for file content: chunk storage, encryption at rest, deduplication.
3) Sync plane for notification and client coordination: change feeds, push notifications, conflict resolution.
Here’s the big picture I describe:
- Clients track a local folder and maintain a local metadata database.
- When a local change happens (create, update, move, delete), the client computes a diff, uploads file chunks to the data plane, then writes metadata (file version, chunk list) to the control plane.
- The control plane emits change events to a sync service.
- Other clients subscribed to that user or shared folder receive a push signal (or long-poll) and pull the change list from the control plane, then download missing chunks from the data plane.
I explicitly separate metadata from file bytes. It simplifies caching, keeps most requests lightweight, and makes metadata consistency manageable without huge data movement.
Data model and metadata design
A clean metadata model is what makes the rest of the system sane. I use a versioned file tree and immutable file versions.
Core entities:
- User: id, email, plan, root_folder_id
- Folder: id, parent_id, name, owner_id
- File: id, parent_id, name, owner_id, size, current_version_id
- FileVersion: id, file_id, created_at, created_by, content_hash, chunk_list, size
- Chunk: id (hash), size, storage_location
- Permission: subject (user or group), resource (file or folder), role (viewer/editor)
- ShareLink: id, resource, permissions, expiration
I rely on immutable versions because they are natural for rollback, conflict resolution, and audit. The “current_version_id” pointer in the File entity keeps common reads fast.
I also include a “tree_version” on folders, which increments when child entries change. That helps clients detect if their cached folder listings are stale without re-fetching entire directories.
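To make the staleness check concrete, here's a minimal sketch of how a client could use tree_version to skip full re-listings. The function names (`fetch_folder_meta`, `fetch_folder_listing`) and the cache layout are illustrative assumptions, not a real API:

```python
# Sketch: use tree_version to decide whether a cached folder listing is stale.
# fetch_folder_meta and fetch_folder_listing are hypothetical server calls.

def list_folder(folder_id, cache, fetch_folder_meta, fetch_folder_listing):
    meta = fetch_folder_meta(folder_id)  # lightweight: id + tree_version only
    cached = cache.get(folder_id)
    if cached and cached["tree_version"] == meta["tree_version"]:
        return cached["entries"]  # cache still valid, skip the heavy listing
    entries = fetch_folder_listing(folder_id)  # full child listing
    cache[folder_id] = {"tree_version": meta["tree_version"], "entries": entries}
    return entries
```

The cheap metadata call happens every time; the expensive listing only happens when tree_version moved.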
Practical note on IDs
I typically choose ULIDs or Snowflake-style IDs for ordering and sharding. For chunk IDs, I prefer a content hash (SHA-256 or BLAKE3) so deduplication becomes trivial.
Client sync model: watching, batching, and safety
The client is often the most complex component. It has to handle local filesystem events, network issues, and conflicts.
I use this model:
- A file watcher detects local changes.
- A local metadata DB (SQLite on desktop, a local KV on mobile) tracks file IDs, version IDs, and pending operations.
- A sync engine batches changes into a local queue, uploads chunks, then commits metadata.
- A pull worker listens for server changes, fetches deltas, and applies them to the local folder.
Key behaviors:
- Batching: I batch changes over 250–500 ms to avoid thrashing on rapid edits.
- Backoff: Exponential backoff on repeated failures to reduce load.
- Atomicity: A file is only moved into the local synced folder when the content and metadata commit succeed. This avoids “half-synced” files.
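Two of those behaviors are easy to sketch. The batching step can coalesce rapid edits to the same path, and retries can use exponential backoff with jitter. This is a minimal illustration, not the full sync engine:

```python
import random

def coalesce(events):
    # Keep only the latest pending operation per path within a batch window,
    # so rapid save-save-save becomes one upload instead of three.
    latest = {}
    for ev in events:
        latest[ev["path"]] = ev
    return list(latest.values())

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    # Exponential backoff with full jitter: the retry window doubles per
    # failure, is capped, and randomized so clients don't retry in lockstep.
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))
```

In a real client, `coalesce` runs when the 250–500 ms batch window closes, and `backoff_delays` drives the retry loop for failed uploads.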
A concise pseudo-implementation of the upload path helps explain this:
import hashlib
from pathlib import Path

def chunk_file(path: Path, chunk_size=4 * 1024 * 1024):
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

def upload_file(path, api, user_id, parent_id):
    chunks = []
    total = 0
    for data in chunk_file(Path(path)):
        h = hashlib.sha256(data).hexdigest()
        total += len(data)
        api.put_chunk(h, data)  # idempotent; OK if already exists
        chunks.append({"hash": h, "size": len(data)})
    meta = {
        "user_id": user_id,
        "parent_id": parent_id,
        "name": Path(path).name,
        "size": total,
        "chunks": chunks,
    }
    api.commit_file_version(meta)  # creates new FileVersion
The important piece here is idempotence: uploading a chunk should be safe to retry. If the chunk already exists, the server just returns success.
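On the server side, idempotence falls out naturally when chunks are keyed by content hash. Here's a sketch with an in-memory dict standing in for object storage; a real service would also stream to S3-like storage rather than hold bytes in memory:

```python
import hashlib

class ChunkStore:
    """Idempotent chunk store keyed by content hash (sketch)."""

    def __init__(self):
        self.blobs = {}  # hash -> bytes; stand-in for object storage

    def put_chunk(self, chunk_hash, data):
        # Verify the client-supplied hash so a corrupt retry fails loudly
        # instead of silently poisoning the deduplicated store.
        actual = hashlib.sha256(data).hexdigest()
        if actual != chunk_hash:
            raise ValueError("hash mismatch")
        if chunk_hash in self.blobs:
            return "exists"  # retry-safe: already stored, nothing to do
        self.blobs[chunk_hash] = data
        return "stored"
```

Because the key is the content hash, a retried upload of the same bytes is a no-op, and deduplication across users is the same code path.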
Metadata service and consistency model
I keep metadata in a strongly consistent store, usually a relational database or a strongly consistent distributed KV. I accept that cross-device propagation is eventual, but metadata writes for a user are linearizable.
Why? Users get confused if they rename a file and immediately open the folder on another device and see the old name for several minutes. I target sub-second commit visibility in the control plane, with propagation through the sync plane in a few seconds.
To do this, I:
- Use a primary metadata store that supports transactions for file tree operations (move, rename, delete).
- Use a change log table or stream keyed by user_id and by shared folder id for clients to pull incremental updates.
- Keep a snapshot endpoint so clients can rebuild state on first sync or after data loss.
A common approach:
- changes(user_id, cursor) returns ordered changes since cursor.
- Each change includes an entity type, id, and new version.
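The client side of that endpoint is a simple cursor loop. This is a sketch; the page shape (`changes`, `cursor`, `has_more`) is an assumption about the response format:

```python
# Sketch of the client's delta pull against a cursor-based changes endpoint.

def pull_changes(api, user_id, cursor, apply_change):
    while True:
        page = api.changes(user_id, cursor)
        for change in page["changes"]:
            apply_change(change)
        cursor = page["cursor"]
        if not page["has_more"]:
            return cursor  # persist locally; the next pull resumes from here
```

The client persists the returned cursor, so a crash mid-pull just replays a page; applying changes must therefore be idempotent too.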
Handling folder operations
Folder moves and renames are a common source of bugs. I treat them as metadata-only changes. The key is to ensure the file’s “path” is derived, not stored as a string. Storing full paths will create painful update storms on move.
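Here's what "derived, not stored" looks like in practice. A path is computed by walking parent pointers, so a folder move rewrites exactly one parent_id and every descendant's path changes implicitly. The `folders` mapping below is an illustrative in-memory stand-in for the metadata store:

```python
# Sketch: derive a file's path from parent pointers instead of storing it.
# folders maps folder id -> {"name": ..., "parent_id": ...}.

def derive_path(file_name, parent_id, folders):
    parts = [file_name]
    while parent_id is not None:
        folder = folders[parent_id]
        parts.append(folder["name"])
        parent_id = folder["parent_id"]
    return "/" + "/".join(reversed(parts))
```

Moving a folder is then a single-row update, instead of rewriting a path string on every file underneath it.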
File storage and deduplication
The data plane is about storing bytes cheaply and reliably. I use a chunk store backed by object storage (S3-like), fronted by a service that handles chunk deduplication and encryption.
My typical flow:
- Client computes chunk hashes.
- Client uploads missing chunks only.
- Server stores chunks by hash.
- FileVersion stores chunk list (hashes + size).
This gives you:
- Deduplication across users and files for identical chunks
- Resumable uploads because chunks are small
- Parallel downloads for faster sync
Chunk sizing is a classic interview discussion. I usually start with 4 MB chunks; for large files, I may use content-defined chunking (CDC) to reduce re-upload on small edits. For the interview, I acknowledge CDC is more complex and offer it as an enhancement.
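If the interviewer pushes on CDC, a gear-hash style sketch shows the core idea: chunk boundaries depend on content, not offsets, so an insertion only disturbs nearby chunks. The gear table, mask, and size bounds below are illustrative, not tuned production values:

```python
import hashlib

# Deterministic 256-entry table of pseudo-random 64-bit values, one per byte.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]

def cdc_chunks(data, min_size=256, avg_mask=(1 << 10) - 1, max_size=4096):
    # Declare a boundary when the low bits of a rolling hash are zero,
    # subject to min/max size bounds.
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & avg_mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

With fixed-size chunking, inserting one byte at the front shifts every subsequent chunk; with CDC, boundaries resynchronize shortly after the edit, which is exactly the re-upload savings the enhancement buys.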
Encryption and access control
I assume server-side encryption at rest and TLS in transit. For client-side encryption, I mention it as a premium feature, but note it complicates server-side deduplication. In a base design, I keep deduplication global and rely on ACL checks for access.
If asked about security, I mention:
- Per-file encryption keys stored in a KMS
- Encrypted chunks with per-tenant keys if compliance requires it
- Signed URLs for client downloads
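A signed download URL is just an HMAC over the resource and an expiry, in the spirit of S3 presigned URLs. The base URL, key handling, and query format below are assumptions for illustration:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-signing-key"  # in practice fetched from a KMS

def sign_download_url(chunk_hash, ttl_seconds=300, now=None):
    # Bind the signature to both the chunk and an expiry timestamp.
    expires = int(now if now is not None else time.time()) + ttl_seconds
    msg = f"{chunk_hash}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"https://cdn.example.com/chunks/{chunk_hash}?expires={expires}&sig={sig}"

def verify_download(chunk_hash, expires, sig, now=None):
    current = int(now if now is not None else time.time())
    if current > int(expires):
        return False  # link expired
    msg = f"{chunk_hash}:{int(expires)}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)  # constant-time comparison
```

The nice property is that the CDN or blob front end can verify the signature without touching the metadata store at all.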
Sync notifications and long-poll
The sync plane connects metadata changes to clients. I keep it simple:
- A change log and cursor-based polling for reliability
- Push notifications (WebSockets or mobile push) as a signal to wake clients
The push doesn’t include data, just “there are changes.” Clients then call the changes endpoint. This prevents missing updates when a device is offline and keeps the protocol robust.
I also avoid over-notifying. Clients include their last known cursor; the server only notifies if there are changes beyond that cursor.
Typical latencies:
- Commit metadata: 50–150 ms
- Notify clients: 200–500 ms
- Total perceived sync for small files: 1–5 seconds
Those are real-world ranges, not hard promises.
Conflict resolution strategy
Conflicts happen when two devices modify the same file without seeing each other’s changes. If you don’t handle this well, the system feels unreliable.
I use a simple, user-friendly approach:
- Each file version has a parent version ID.
- If a new version arrives and the server sees that the client’s parent isn’t the current version, it’s a conflict.
- The server keeps both versions. One becomes the current version; the other is renamed with a conflict suffix.
On the client, I surface this as “Report (conflicted copy).pdf” and let the user decide. It’s not elegant, but it’s predictable, which is what matters in file sync.
For a more modern enhancement, I mention: if the file type is text and small, I can attempt a three-way merge. But I do not assume that in the base design because it adds complexity and is error-prone.
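The parent-version check itself is a few lines at commit time. This sketch uses an in-memory record where the real system would do a transactional compare-and-set in the metadata store:

```python
# Sketch of parent-version conflict detection: if the client's edit wasn't
# based on the current version, keep both copies instead of overwriting.

def commit_version(file_record, new_version):
    if new_version["parent_version_id"] != file_record["current_version_id"]:
        # Stale parent: the client never saw the latest version.
        file_record["conflict_versions"].append(new_version)
        return "conflict"
    file_record["current_version_id"] = new_version["id"]
    file_record["versions"].append(new_version)
    return "ok"
```

On "conflict", the server materializes the losing version as the conflicted copy described above, rather than discarding anyone's work.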
Sharing model and permissions
Sharing is a core feature. I support:
- Shared folders (multiple collaborators)
- Share links (public or restricted)
I implement access control with an ACL model:
- Resource: file or folder
- Subject: user or group
- Role: owner, editor, viewer
Inheritance is key: a folder’s permissions cascade to child files, with explicit overrides allowed. When a share link is created, it becomes a subject with a token-based identity and limited scope.
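Resolution with inheritance is a walk up the tree: the nearest explicit ACL entry wins, which gives you overrides for free. The data shapes here (`acls` as a dict, `parents` as a pointer map) are illustrative:

```python
ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}

def effective_role(resource_id, subject, acls, parents):
    # Walk from the resource toward the root; the nearest explicit
    # entry wins, which is what makes overrides work.
    node = resource_id
    while node is not None:
        role = acls.get((node, subject))
        if role is not None:
            return role
        node = parents.get(node)
    return None  # no entry anywhere on the path: no access

def can_edit(resource_id, subject, acls, parents):
    role = effective_role(resource_id, subject, acls, parents)
    return role is not None and ROLE_RANK[role] >= ROLE_RANK["editor"]
```

In production I'd cache the resolved role per (resource, subject) and invalidate on ACL or tree changes, since this walk sits on the hot path of every request.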
When a user joins a shared folder, I create a mapping between the folder and the user. Their client starts receiving change events for that folder’s change log.
I also mention rate-limiting for public links to avoid abuse, and expiration support.
Scaling the metadata service
Metadata is the hottest part of the system. Here’s how I scale it:
- Shard by user_id for most endpoints.
- Use a separate shard key for shared folders to distribute heavy collaboration workloads.
- Add read replicas for folder listings and share link resolution.
- Cache folder listings and file metadata in a distributed cache with short TTLs.
Hotspot example: a shared team folder with thousands of users. I handle this by:
- Storing shared folder change logs separately
- Fanning out notifications via a push service
- Clients pulling deltas based on folder-specific cursors
This prevents a single user shard from getting crushed by collaboration activity.
Handling large files and resumable transfers
Large files stress upload time, reliability, and storage costs. I use:
- Multipart upload with chunk-level retries
- Client-side resume by storing uploaded chunk hashes
- Server-side cleanup for abandoned uploads
A quick example of a resume strategy:
async function resumeUpload(file, api) {
  const chunkSize = 4 * 1024 * 1024;
  const uploaded = new Set(await api.listUploadedChunks(file.name));
  for (let i = 0; i < file.size; i += chunkSize) {
    const chunk = file.slice(i, i + chunkSize);
    const hash = await hashChunk(chunk);
    if (uploaded.has(hash)) continue;
    await api.uploadChunk(hash, chunk);
  }
  await api.commitVersion(file.name);
}
The key is that the server can list chunks already uploaded for a pending version, which makes resume fast.
Observability and failure handling
If you can’t see what’s happening, you can’t fix it. I instrument:
- Per-request latency in the metadata and data planes
- Sync success rate per client version
- Queue length in the change notification system
- Conflict rates per file type
Failures to plan for:
- Partial upload: clean up or allow resume
- Metadata commit failure: client retries, dedupe by idempotency key
- Out-of-order change delivery: client uses version IDs to ignore stale updates
- Clock skew: never rely on timestamps for ordering; use server-assigned sequence numbers
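The out-of-order and clock-skew points combine into one small client-side rule: apply a change only if its server-assigned sequence number is newer than what's already applied. A sketch, with an assumed `local_state` layout:

```python
# Sketch: server-assigned sequence numbers make out-of-order and duplicate
# delivery harmless on the client.

def apply_change(local_state, change):
    current = local_state.get(change["file_id"])
    if current is not None and change["seq"] <= current["seq"]:
        return False  # stale or duplicate delivery: ignore it
    local_state[change["file_id"]] = change
    return True
```

Because the comparison uses server sequence numbers rather than timestamps, skewed device clocks can never reorder updates.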
I also add a “health banner” in clients when the sync engine is down or credentials are stale. You’d be surprised how much trust you gain by telling people what’s wrong.
Common mistakes I see in interviews
I make these explicit because they can sink an otherwise good design:
- Storing full file paths: massive update storms on move or rename
- No idempotency: retries create duplicate versions or corrupt state
- Assuming always-online clients: offline is the norm for mobile and laptops
- Ignoring conflict resolution: this is a core problem, not an edge case
- Mixing metadata and blob storage: makes scaling and caching far harder
If you avoid those five, you’re already ahead of most candidates.
When a Dropbox-like design is the wrong choice
It’s also important to say when not to build this:
- If your product is a real-time collaborative editor, you should use OT/CRDT patterns instead of file sync.
- If your primary content is huge media, you might prioritize streaming and partial download over sync semantics.
- If your users are strictly on a private network, you might prefer a central SMB-style model for simplicity.
I say this because design is about fit, not just scale.
Modern tooling and AI-assisted workflows (2026 context)
In 2026, I expect teams to integrate AI-assisted workflows into their storage and sync stack. That doesn’t mean AI is in the data path; it means it helps with reliability and developer velocity.
Examples I’d mention in an interview:
- AI-assisted log analysis to detect sync regression patterns
- Automatic schema migration assistants for metadata changes
- Security review automation for share link changes
- Synthetic client agents that replay real workloads to validate new sync algorithms
These are not required for the base design, but they signal that you’re thinking about operational reality, not just whiteboard architecture.
Traditional vs modern approaches
When asked to compare, I use a small table and pick a recommendation.
| Traditional approach | My recommendation |
| --- | --- |
| Fixed 4–8 MB chunks | Start fixed; add CDC only for large files |
| Polling only | Use push to signal, pull for reliability |
| Single SQL database | Sharded metadata with a change log |
| Last write wins | Preserve both versions |
| Single-request upload | Multipart with resumable chunks |

I prefer the modern approach where it improves user experience without adding fragile complexity. For CDC, I keep it optional unless I know the workload is dominated by large files with small edits.
A practical walkthrough: uploading and syncing a file
Here’s how I narrate a simple end-to-end flow:
1) You edit Quarterly Report.pdf on your laptop.
2) The client detects the change, chunks the file, uploads missing chunks, and commits a new file version.
3) The metadata service writes a change log entry and increments the folder tree version.
4) The sync service notifies your phone and desktop.
5) Those devices fetch the change list, see the file version, and download the missing chunks.
6) The file appears on both devices within a few seconds.
This shows how metadata, data, and sync planes work together while keeping the system robust to partial failures.
Performance considerations and practical ranges
I avoid exact numbers because they vary, but I do use realistic ranges:
- Metadata read latency: typically 10–30 ms from cache, 50–150 ms from DB
- Chunk upload: 50–300 ms per chunk depending on network
- Sync propagation: 1–5 seconds for small files in normal conditions
- Conflict rates: usually under 1% for typical consumer workloads
These numbers help ground your design in real-world expectations.
Edge cases worth mentioning
These are small details that show you’ve built a system like this before:
- Renaming a file while it uploads: client pauses upload, recomputes metadata, continues
- Deleting a file during download: client cancels download and cleans partial data
- Shared folder removed: client stops syncing and retains a local archive with a warning
- Storage dedup and privacy: dedup is content-based; access control still enforced at metadata layer
I keep these brief, but I always mention at least two to demonstrate operational thinking.
A short checklist for interview success
When I coach candidates, I give them this practical sequence:
- Clarify scope and constraints in the first 2 minutes
- Separate metadata from file bytes
- Show a clean sync flow with push + pull
- Handle conflicts explicitly
- Add a brief scaling plan
- Mention one operational concern (monitoring, retries, or data retention)
It keeps the conversation crisp and shows good judgment.
Key takeaways and what I’d do next
If you remember only a few things, make them these: separate metadata and blob storage, design a reliable sync protocol with idempotent chunk uploads, and treat conflict resolution as a core feature. Those are the pillars that make file sync feel “magical” to users while staying sane for engineers.
If you want to practice, I recommend doing a timed mock: 5 minutes for requirements, 10 minutes for architecture and data model, 10 minutes for deep dive. Then iterate by adding one enhancement per round, such as CDC, team sharing, or audit trails. You’ll be surprised how quickly your narrative becomes smoother and your design decisions become sharper.
If you’re building a real system, start with fixed-size chunking and a reliable metadata store, then measure before adding complexity. I’ve seen too many teams build CDC or real-time collaboration prematurely and spend months untangling operational issues. Build the boring core first, instrument it well, and let your data guide the next step.
That’s how I design a Dropbox-like system in interviews and in practice: clear separation of concerns, reliable sync primitives, and a focus on user trust over cleverness.