10 Distributed storage & computing systems interview Q&As – Big Data

A distributed system consists of multiple software components that are on multiple computers (aka nodes), but run as a single system. These components can be stateful, stateless, or serverless, and these components can be created in different languages running on hybrid environments, using open-source technologies, open standards, and interoperability.

Q01. What are the key requirements to be a distributed system?
A01. A distributed system must satisfy the following 3 characteristics.

1) The computers or nodes operate concurrently.

2) The computers or nodes fail independently, hence the system must be fault tolerant. They also scale independently, so you can add new nodes to the cluster.

3) The computers or nodes do not share a global clock. The nodes could be on the cloud across different availability zones & time zones. For example, “ap-southeast-2a” & “ap-southeast-2b” are different availability zones in the AWS region “ap-southeast-2”, which is Asia Pacific (Sydney).
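Since there is no global clock, nodes often order events with logical clocks instead of wall-clock time. Below is a minimal sketch of a Lamport logical clock, which is one common technique and not something the characteristics above prescribe. The class and variable names are made up for illustration.

```python
class LamportClock:
    """Logical clock: orders events without relying on wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances the counter by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; the timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Move past both the local time and the sender's timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


node_a, node_b = LamportClock(), LamportClock()
node_a.tick()                      # local event on node A
stamp = node_a.send()              # A sends a message stamped with its time
node_b.tick()                      # unrelated local event on node B
received = node_b.receive(stamp)   # B receives and jumps ahead of the stamp
print(stamp, received)
```

The receive timestamp is always greater than the send timestamp, so "send happened before receive" holds even if the two machines’ wall clocks disagree.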

Q02. What are some of the distributed systems, and why are they so popular?
A02. Microservices, MPP (i.e. Massively Parallel Processing) systems like Redshift, Snowflake, Databricks, etc and NoSQL databases like MongoDB, Cassandra, etc are distributed systems. Modern architectures use:

1) distributed Storage &
2) distributed computation.

In this post let’s focus on Big Data technologies.

Big Data

Big Data systems use both distributed storage & computation. In Hadoop the storage & computation are co-located on multiple nodes, whereas on the cloud they are decoupled, which is often preferred. For example, S3 (i.e. Simple Storage Service) decouples the storage from the compute (e.g. EMR on AWS). This decoupling allows you to elastically & independently scale the storage & compute up or down.

Hadoop

1) Distributed Storage: is accomplished via HDFS (i.e. Hadoop Distributed File System) where data is split & stored across hundreds of “Data Nodes”.

2) Distributed Computing: is accomplished via “YARN” (i.e. Yet Another Resource Negotiator), which is the cluster resource manager responsible for allocating resources & scheduling the code on the nodes where the data is. The distributed code is then executed against the subset of the data on the respective nodes.

The data is distributed across multiple machines (aka nodes) & the computation code is also copied across multiple machines & executed (E.g. via Apache Spark executors) against each machine’s subset of the data concurrently. Each machine returns its partial result to the master node, which combines the partial results & returns the final result to the client. So, unlike multithreading, where threads share the CPU, storage & memory of the same machine, distributed systems use a shared-nothing architecture whereby each node has its own CPU, storage & memory. These nodes scale horizontally. In cloud computing this is known as “auto scaling”, which can be enabled with a configuration change via a console.
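The "split the data, run the same code on each subset, combine at the master" flow can be sketched on a single machine. This is only an illustration: threads stand in for worker nodes (real nodes would not share memory), and the function names are hypothetical, not Spark or Hadoop APIs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    # Round-robin split; each partition stands in for one node's data.
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    # The same code runs against every partition independently.
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def run_job(records, num_partitions=3):
    partitions = split_into_partitions(records, num_partitions)
    # Threads play the role of worker nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, partitions))
    # The "master" combines the partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["spark hive spark", "hive kafka", "spark kafka kafka", "hdfs"]
result = run_job(lines)
print(result)
```

Each worker only ever sees its own partition, and only the small partial counts travel back to be merged.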

A typical Hadoop Architecture with storage & compute resources

Distributed storage & computing

1) Distributed storage systems like Apache Hadoop Distributed File System (i.e. HDFS), NoSQL databases like HBase, Cassandra, MongoDB, Redis, etc and data warehouses like Snowflake, Apache Hive, Amazon Redshift, etc.

2) Distributed computing with Microservices, Apache Spark, Apache Storm, AWS EMR (i.e. Elastic Map Reduce), etc.

3) Distributed messaging systems like Apache Kafka, Amazon Kinesis Data Streams, Amazon SQS (i.e. Simple Queue Service), Amazon SNS (i.e. Simple Notification Service with pub/sub model), etc. Amazon SQS & SNS are useful in building Microservices.

Q03. Why are distributed systems popular?
A03. The interviewer will also ask about the cluster size & data size to understand the environment you have worked on. For example, 250 million transactions or events per day, 100+ worker nodes, 1 petabyte of data storage with a 2 year retention policy, etc.

1) Firstly, modern architectures are cloud based.

2) Modern architectures need to scale independently, with fault-tolerant server nodes in the hundreds as opposed to in the tens. So, be prepared to use terms like the application was deployed to a “24 core 100 node cluster in AWS”, “3.6 Petabytes of structured & semi-structured data was stored across a 280 node cluster”, etc.

3) Data sizes in Terabytes & Petabytes as opposed to gigabytes. You need to handle not only structured data, but also semi-structured data like XML, JSON, log files, event logs, etc and unstructured data like e-mail messages, PDFs, Word documents, etc.

4) Distributed architectures are often 1) event-driven and 2) asynchronous, which makes them more complex to design & debug.

Q04. What are some of the key considerations when designing distributed systems?
A04.

a) The CAP theorem asserts that any networked shared-data system can have only two of three desirable properties, as in “Consistency & Availability”, “Availability & Partition-tolerance”, or “Consistency & Partition-tolerance”.

b) Nodes can fail independently & new nodes can be added. So, concepts like the Consistent Hashing ring & coordinating services like ZooKeeper, where you can have distributed configuration settings across nodes, distributed counters, distributed locks, etc, become important to learn.

c) Out-of-order processing, as events are potentially streamed out of order from various event sources, requiring additional logic.

d) Replay considerations, as the master data in the HDFS (i.e. Hadoop Distributed File System) raw zone needs to be rerun in batch mode with renewed access patterns, business logic & rules. Append-only systems (E.g. Big Data on HDFS) don’t update data models in place, but keep appending to the existing data & evaluate the current state by 1) deleting the current data models and 2) replaying the event logs (i.e. the master data). These jobs are run as batch jobs with eventual consistency. This is known as event-sourcing.

e) The NoSQL (i.e. Not Only SQL) databases make development quicker as they are schema-less (i.e. implicit schema), scale rapidly by adding more nodes & handle large volumes of combined data (i.e. you store Order, Line Items, and possibly Customer details together as a JSON clob or key-value pairs), which minimizes the number of joins. Many NoSQL databases offer only limited support for ACID properties (e.g. transaction boundaries, isolation levels, constraints like foreign keys, etc), and shift that responsibility to the application layer. This can add more challenges.

f) The data structuring, partitioning & distribution challenges: skewed data, data shuffling, cartesian join prevention, etc. You will see potential challenges and solutions in the post 15 Apache Spark best practices & performance tuning interview FAQs.

g) Race conditions, performance issues, and out-of-memory issues when dealing with larger volumes of immutable & append-only data logs that need to be replayed into memory or indexed data models for responsiveness & real-time querying.
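The out-of-order and replay considerations above can be illustrated with a small sketch. It assumes each event carries an `event_time` field set by its source; the event shapes and the `replay` function are hypothetical and only show the idea of discarding the derived state and rebuilding it from the append-only log.

```python
def replay(events):
    """Rebuild the current state from scratch out of an append-only event log."""
    state = {}  # the derived data model is thrown away and rebuilt on each replay
    # Events may have arrived out of order, so apply them in event-time order.
    for event in sorted(events, key=lambda e: e["event_time"]):
        account = state.setdefault(event["account"], {"balance": 0, "email": None})
        if event["type"] == "deposit":
            account["balance"] += event["amount"]
        elif event["type"] == "email_changed":
            account["email"] = event["email"]  # last write (by event time) wins
    return state


# The log is only ever appended to, in arrival order (not event-time order).
event_log = [
    {"event_time": 30, "account": "A", "type": "email_changed", "email": "new@example.com"},
    {"event_time": 10, "account": "A", "type": "deposit", "amount": 100},
    {"event_time": 20, "account": "A", "type": "email_changed", "email": "old@example.com"},
    {"event_time": 25, "account": "A", "type": "deposit", "amount": 50},
]
state = replay(event_log)
print(state)
```

Applying the events in arrival order would leave the older email address in place, which is why the replay sorts by event time first. Changing the business rules only requires changing `replay` and running it again over the same log.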

Q05. Can you name some of the distributed systems?
A05.

Snowflake is a fully cloud-based data platform with elastically scalable storage and compute. Snowflake is a massively parallel processing (i.e. MPP) database that is fully relational, ACID compliant, and processes standard SQL.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.

Databricks provides a unified analytics platform with integrated tools and services built around Delta Lake.

Apache Kafka is a distributed streaming platform that is used to build real-time streaming data pipelines and applications that react to data streams.

Q06. How will you go about choosing between an SQL and a NoSQL database?
A06. Any non-trivial application requires a persistent store. Nowadays you have many choices between monolithic RDBMS databases & distributed NoSQL databases like MongoDB, HBase, Cassandra, etc.

CAP Theorem

CAP Theorem – source: https://dzone.com/articles/better-explaining-cap-theorem

Q07. What is a distributed database?
A07. A distributed database is a collection of logically interrelated databases that is often presented as a single logical database, but the data is physically stored across multiple nodes in multiple sites and independently managed. In general, distributed databases have the following features:

1) Location independent: For example, across multiple AWS availability zones.

2) Distributed query processing: For example, Apache Spark, Hive, Impala, etc

3) Distributed transaction management (E.g. Paxos, Raft, Zab, etc) & independence from the hardware, operating system & network.

Q08. When will you consider using a distributed database?
A08.

1) When the amount of data stored & the number of reads keep growing, as distributed databases are much easier to scale horizontally by adding more nodes/computers. They can also store structured (e.g. relational data), semi-structured (e.g. JSON, XML) & unstructured (e.g. documents, log files) data.

2) When the data and DBMS software are distributed over several nodes & sites, one site may fail whilst other sites continue to operate. Due to data replication, you can still access not only the data at the surviving sites, but also the data that existed at the failed site. This gives increased reliability and availability as there is no single point of failure.

3) When the queries are broken up & executed in parallel, better performance can be achieved for a very large volume of data.

Q09. What are some of the basic concepts you must know when working with the distributed systems?
A09.

1. Partitioning

You need to have a way to distribute the data across multiple nodes. For example, hashing a part of every table’s primary key (called the partition key) and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider:

Partition key must have enough values to spread the data for each table evenly across all the nodes in the cluster. Otherwise your data will be skewed, where some nodes will be working harder than the other nodes. The partitions which are highly loaded become the bottlenecks for the system, and this is known as hotspotting.

Typical real-world partition keys are user id, device id, account number etc. along with a time modifier like year-month or year added to the partition key to spread the data.

Key Range partitioning is a form of partitioning where you divide the entire key space into contiguous ranges and assign each range to a partition. The downside of key range partitioning is that if the range boundaries are not chosen properly, it may lead to hotspots.

Key Hash partitioning is a form of partitioning where you apply a hash function to the key, and then take the resulting hash value modulo the number of partitions. The same key will always return the same hash code. The real problem with hash partitioning starts when you change the number of partitions, as most keys then map to a different partition & the data has to be moved.

Consistent Hashing comes to the rescue, where the output range of a hash function is treated as a fixed circular space or “ring”. Each node in the system is assigned a random value within this space, which represents its “position” on the ring. Each key is stored on the first node found when moving clockwise from the key’s position on the ring, so adding or removing a node only moves the keys in the neighbouring segment.
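The difference between modulo-based hash partitioning and a consistent hashing ring shows up when a node is added. The sketch below makes some assumptions that are not in the text above: node positions come from hashing the node name rather than a random value, each node gets several positions ("virtual nodes") to even out the load, and `HashRing`, `stable_hash` and the node names are hypothetical.

```python
import bisect
import hashlib


def stable_hash(key):
    # md5 yields the same value on every machine and every run.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def mod_partition(key, num_partitions):
    # Plain key hash partitioning: hash, then mod by the partition count.
    return stable_hash(key) % num_partitions


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at several positions (virtual nodes) on the ring.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key):
        # Walk clockwise to the first node position at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._positions, stable_hash(key))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user-{i}" for i in range(10_000)]

# Growing from 4 to 5 partitions with modulo hashing.
moved_mod = sum(mod_partition(k, 4) != mod_partition(k, 5) for k in keys)

# Growing from 4 to 5 nodes on the ring.
before = HashRing(["node-a", "node-b", "node-c", "node-d"])
after = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"])
moved_ring = sum(before.node_for(k) != after.node_for(k) for k in keys)

print(f"modulo: {moved_mod / len(keys):.0%} of keys moved")
print(f"ring:   {moved_ring / len(keys):.0%} of keys moved")
```

With modulo hashing most keys change partition, whereas on the ring the only keys that move are the ones taken over by the new node.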

2. Replication

The same piece of data is usually replicated to multiple machines/nodes to give fault tolerance & high availability. For example, if the replication factor is set to 3, you will have the same data copied 3 times across different nodes. If one node goes down, the data on a different node is copied to another live node to maintain 3 copies.
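A replication factor of 3 and the re-replication after a node failure can be sketched as follows. This is a toy model with a hypothetical `Cluster` class and a "least loaded node" placement rule chosen for simplicity; real systems such as HDFS also consider racks and disk usage.

```python
class Cluster:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.blocks = {node: set() for node in nodes}  # node -> block ids it holds

    def holders(self, block):
        return [node for node, held in self.blocks.items() if block in held]

    def _least_loaded(self, exclude):
        candidates = [node for node in self.blocks if node not in exclude]
        return min(candidates, key=lambda node: (len(self.blocks[node]), node))

    def write(self, block):
        # Keep copying the block to new nodes until enough replicas exist.
        while len(self.holders(block)) < self.replication_factor:
            target = self._least_loaded(exclude=self.holders(block))
            self.blocks[target].add(block)

    def fail(self, node):
        lost = self.blocks.pop(node)
        # Each lost block is copied from a surviving replica to another live node.
        for block in lost:
            self.write(block)


cluster = Cluster(["n1", "n2", "n3", "n4", "n5"])
block_ids = ["blk-1", "blk-2", "blk-3"]
for block_id in block_ids:
    cluster.write(block_id)

cluster.fail("n1")
for block_id in block_ids:
    print(block_id, sorted(cluster.holders(block_id)))
```

After the failure every block is back to three copies, none of them on the failed node.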

3. Consistency

Every read receives the most recent write or an error and all replicas in the system have the same data at the same time.

When you update that piece of information on one of the machines/nodes, it may take some time (usually milliseconds) to reach every machine/node that holds the replicated data. This creates the possibility that you might read information that hasn’t yet been updated on a replica. In many scenarios, this anomaly is acceptable. In scenarios where this type of behaviour is not acceptable, the NoSQL databases may give you the choice, per transaction, of whether the data is to be eventually consistent or strongly consistent.

Consistency can be achieved using a leader-follower replication scheme, where one replica acts as the leader and is responsible for updating the data, while the other replicas follow the leader and replicate the updates.
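The per-request choice between eventual and strong consistency can be sketched with a quorum rule: with N replicas, a read is guaranteed to see the latest write when the number of replicas written (W) plus the number of replicas read (R) exceeds N. This is a simplified, leaderless model with a hypothetical `ReplicatedRegister` class; it is not how any specific database implements it.

```python
class ReplicatedRegister:
    """One value replicated on N nodes, each copy tagged with a version."""

    def __init__(self, num_replicas=3):
        self.replicas = [{"version": 0, "value": None} for _ in range(num_replicas)]

    def write(self, value, acks_required):
        # Update `acks_required` replicas synchronously; the rest lag behind
        # until asynchronous replication (sync) catches up.
        version = max(r["version"] for r in self.replicas) + 1
        for replica in self.replicas[:acks_required]:
            replica.update(version=version, value=value)

    def read(self, replicas_to_ask):
        # Worst case: ask the replicas least likely to have the latest write,
        # then return the newest copy among the ones asked.
        asked = self.replicas[-replicas_to_ask:]
        return max(asked, key=lambda r: r["version"])["value"]

    def sync(self):
        # Background replication: every replica converges on the newest copy.
        newest = dict(max(self.replicas, key=lambda r: r["version"]))
        for replica in self.replicas:
            replica.update(newest)


register = ReplicatedRegister(num_replicas=3)
register.write("v1", acks_required=3)
register.write("v2", acks_required=2)          # W = 2

stale = register.read(replicas_to_ask=1)       # R = 1, R + W = 3, not > N
fresh = register.read(replicas_to_ask=2)       # R = 2, R + W = 4 > N
register.sync()
eventual = register.read(replicas_to_ask=1)    # replicas have converged
print(stale, fresh, eventual)
```

Reading from a single replica is faster but may return the old value until replication completes; reading from a quorum overlaps with at least one replica that has the latest write.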

4. Availability

Every request receives a response without the guarantee that it contains the most recent version of the information, even when there is a node failure or network partition. This means the system is always available to serve the client requests.

A load-balancing mechanism is used to distribute user requests across multiple replicas, ensuring that the system can handle high levels of traffic.

It can also use multi-leader replication, where multiple nodes act as leaders, and each leader can accept writes independently. When a write occurs, it is replicated to all other nodes in the system.

5. Partition tolerance

Network failures are inevitable & distributed systems will have network partitions. Network partition means nodes losing contact with each other and having communication & synchronisation issues. Partition tolerance means tolerating partitioning in the network. For example, when you have a number of EC2 nodes in a VPC across multiple availability zones, a particular availability zone can go down.

The system must continue to work even when some parts of the network are partitioned or become disconnected.

Partition tolerance is achieved using a consensus algorithm like Paxos or Raft, which helps the replicas to agree on the state of the system even if parts of the network fail or become disconnected.

6. CAP theorem

The CAP theorem states that you can only guarantee partition tolerance together with either availability or consistency, but not both. For example, HBase, MongoDB, Redis, etc favour consistency, whilst Cassandra and Amazon DynamoDB favour availability.

If a distributed system is designed to provide strong consistency and partition tolerance at the cost of lower availability, then every write operation must be confirmed by all replicas before the write is considered committed. As a result, write operations may take longer, or fail when a replica is unreachable, hence reducing the system’s availability.

If you favour availability, then writes can be made to any replica and then distributed asynchronously to the other replicas. This means you get eventual consistency.

7. File formats & compression

Distributed systems use container data formats like Avro, Parquet, ORC, Sequence file, etc where the data can be compressed with algorithms like LZO, Snappy, LZ4, etc and data can be easily split to be distributed across multiple nodes. 4 key considerations in choosing the storage file formats are:

1) Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
2) Splittability to be processed in parallel.
3) Block compression saving storage space vs read/write/transfer performance
4) Schema evolution to add fields, modify fields, and rename fields.
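The first consideration (usage patterns) is the main reason to choose between row-oriented formats like Avro and column-oriented formats like Parquet or ORC. The sketch below models the two layouts with plain Python structures and counts how many values each layout has to touch to return 5 columns out of 50. It is a simplification that ignores compression, encoding and file block boundaries, and the function names are hypothetical.

```python
# 1,000 records with 50 columns each.
rows = [{f"col{c}": r * 100 + c for c in range(50)} for r in range(1000)]


def row_scan(rows, wanted):
    # Row-oriented layout: every record is read whole, then projected.
    values_read = 0
    out = []
    for row in rows:
        values_read += len(row)
        out.append([row[name] for name in wanted])
    return out, values_read


# Column-oriented layout: the values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def column_scan(columns, wanted):
    # Only the requested columns are read at all.
    values_read = 0
    selected = []
    for name in wanted:
        values_read += len(columns[name])
        selected.append(columns[name])
    return [list(values) for values in zip(*selected)], values_read


wanted = ["col0", "col1", "col2", "col3", "col4"]
row_result, row_cost = row_scan(rows, wanted)
col_result, col_cost = column_scan(columns, wanted)
print(row_cost, col_cost)
```

Both scans return the same data, but the columnar layout touches a tenth of the values here. When most of the columns are needed, or when whole records are written one at a time, the row-oriented layout is usually the better fit.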

8. Schema-on-read

Data can be stored schema-less and a structure can be applied during processing (i.e. on-read) based on the requirements of the processing application. This is different from “Schema-On-Write”, which is used in RDBMS databases where the schema needs to be defined before the data can be loaded.

Despite the schema-less nature, schema design is an important consideration. This includes directory structures and schema of objects stored as metadata (E.g. Hive meta store).
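Schema-on-read can be sketched as raw JSON lines that are stored as-is, with each reader applying its own schema when it processes them. The data, the schema layout and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw zone: records are stored as they arrived, with no enforced structure.
RAW_ZONE = [
    '{"id": "1", "amount": "10.50", "country": "AU"}',
    '{"id": "2", "amount": "3", "channel": "web"}',
    '{"id": "3"}',
]

# A reader declares only the fields it needs: name -> (type converter, default).
ORDERS_SCHEMA = {
    "id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "UNKNOWN"),
}


def read_with_schema(raw_lines, schema):
    # The structure is applied at read time; unknown fields are ignored
    # and missing fields fall back to the schema's default.
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


orders = list(read_with_schema(RAW_ZONE, ORDERS_SCHEMA))
for order in orders:
    print(order)
```

A different application could read the same raw lines with a different schema, for example one that picks up `channel`, without the stored data being rewritten.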

9. Shared nothing architecture

In distributed systems each node is completely independent of other nodes. There are no shared resources like CPU, memory, and disk storage that can become a bottleneck. For example, Hadoop’s processing frameworks like Spark, Pig, Hive, Impala, etc process distinct subsets of the data and there is no need to manage access to the shared data. “Shared nothing” architectures are very:

1. Scalable as more nodes can be added without further contention.

2. Fault tolerant as each node is independent, and there are no single points of failure, and the system can quickly recover from a failure of an individual node.

Q10. What are the needs of distributed applications?
A10. Distributed applications need a proper lifecycle management involving

1) Deployment & rollback of the apps.
2) Scheduling the app.
3) Distributed config management.
4) Resource failure isolation.
5) Auto or manual scaling.
6) Handling hybrid workloads as in stateful, stateless, serverless, etc.
7) Service discovery & failover.
8) Dynamic routing.
9) Service retry, timeout & circuit breaking.
10) Security & data privacy
11) Rate limiting or throttling the API calls.
12) Monitoring & distributed logging.
13) Distributed caching.
14) Protocol conversions.
15) Message transformation.
16) Message routing with point-to-point or pub/sub.
17) Application state management, transaction management (e.g. SAGA pattern, event sourcing, etc), etc.
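As an illustration of item 9 in the list above, here is a minimal circuit breaker sketch. After a number of consecutive failures it "opens" and rejects calls immediately, then lets a trial call through once a cool-down has passed. The class, its parameters and the injected `clock` are assumptions made for this example; production systems typically rely on a library or a service mesh for this.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the example needs no sleeping
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; downstream is unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


now = [0.0]  # fake clock for the demonstration
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise ConnectionError("service down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # the cool-down has passed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

The third call is rejected without touching the failing service, which protects both the caller and the struggling downstream service.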

Q11. What is the Paxos protocol?
A11. Distributed systems are implemented for high availability and scalability using many commodity machines. These machines are not reliable, and they come up and go down quite often. In such systems, there is often a need to agree upon “something” i.e. to have consensus. The system needs to ensure that all servers can agree on a single source of truth, even if some servers fail. In other words, the system must be fault-tolerant.

Each server node keeps its own log and nodes can communicate with each other to establish the order of the events. How can all nodes agree on a consistent view of the log? Paxos is a consensus protocol to agree on the order. A distributed consensus ensures that the nodes in a distributed system agree on the data (e.g. logs, configurations, etc) or reach an agreement on a proposal. This is applicable to any distributed system such as HDFS, ZooKeeper, Kafka, Redis, and Elasticsearch.

Paxos/Raft/Zab are different variations of distributed consensus algorithms. They follow 2 key principles:

1) two phase commit, and
2) quorum majority vote.

This means a value is committed only when a majority of the nodes have gone through the above 2-phase commit process.

1) A leader proposes values to the followers.
2) The leader waits for acknowledgements from a quorum of followers before considering a proposal committed.
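The two steps above can be sketched as a leader that proposes a value, counts acknowledgements, and commits only on a majority. This is deliberately not Paxos, Raft or Zab: it leaves out leader election, ballot or term numbers, and recovery, and the `Leader` and `Follower` classes are hypothetical.

```python
def quorum_size(cluster_size):
    # A strict majority of the cluster.
    return cluster_size // 2 + 1


class Follower:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []
        self.pending = None

    def propose(self, value):
        # Phase 1: stage the value and acknowledge, if this node is reachable.
        if not self.up:
            return False
        self.pending = value
        return True

    def commit(self):
        # Phase 2: apply the staged value to the log.
        if self.up and self.pending is not None:
            self.log.append(self.pending)
            self.pending = None

    def abort(self):
        self.pending = None


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []

    def replicate(self, value):
        cluster_size = len(self.followers) + 1
        # The leader counts itself as one acknowledgement.
        acks = 1 + sum(follower.propose(value) for follower in self.followers)
        if acks < quorum_size(cluster_size):
            for follower in self.followers:
                follower.abort()
            return False
        self.log.append(value)
        for follower in self.followers:
            follower.commit()
        return True


followers = [
    Follower("f1"),
    Follower("f2"),
    Follower("f3", up=False),
    Follower("f4", up=False),
]
leader = Leader(followers)

committed = leader.replicate("x=1")   # 3 of 5 nodes acknowledge
followers[1].up = False
rejected = leader.replicate("x=2")    # only 2 of 5 nodes acknowledge
print(committed, rejected, leader.log)
```

A five-node cluster keeps committing with two nodes down, but refuses to commit once a majority can no longer acknowledge.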

Q12. What is a Quorum?
A12. The minimum number of servers required to run ZooKeeper is called the quorum. ZooKeeper replicates the whole data tree to all the quorum servers. This number is also the minimum number of servers required to store a client’s data before telling the client it is safely stored.

Zab stands for ZooKeeper Atomic Broadcast. ZooKeeper itself is not a consensus protocol; it is a coordination service that uses Zab internally. Learn more at Apache Zookeeper interview Q&As

The Raft algorithm simplifies Paxos. It is equivalent to Paxos in fault-tolerance and performance. The difference is that it is decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems.

