System Design - Introduction | System Design Primer

The system design interview is mostly a discussion of how you would build a software application that works at scale. It is a critical round for senior software developer roles (2+ years of experience; SDE II and above). For SDE-I roles, you might still be expected to have a basic idea of how scalable systems are built, so it is worth getting a high-level understanding.

Before we start designing a system, let's define a few terms that are used throughout the different concepts.

System Design Terminologies

Machine: Machine denotes a single computer where we can host our application, database, etc.
It could denote a logical entity as well. Example: AWS and other cloud providers provide a slice of a bigger machine. The abstracted slice of the machine is also said to be a machine.
Server/Application: Server/Application denotes the application that is running the backend code and is serving requests on a particular port on the machine.
Client: Clients are the devices/applications through which requests are sent to the server.
Distributed System: A system with multiple components on different machines which communicate and coordinate with one another.
Node: A node denotes a single machine in the distributed system.
Memory: RAM.
Storage: File-System/Disk space.
In-Memory: Data stored in memory (RAM). Reading from and writing to memory is much faster than reading from and writing to disk.
Resources: Memory, Storage, CPU, etc. Resources are what cost money. Per gigabyte, Storage is much cheaper than Memory.
High-Level Design (HLD): Designing the architecture of the system without going into the details of the code design or database schema design.
Low-Level Design (LLD): Designing the code/class structure and the database tables, considering the entities and their relationships, interactions, properties, methods, etc.

Performance and Scalability

Performance: The amount of work that the system can do. Improving performance means that the system is able to do more work.
Scalability: A service is said to be scalable if adding resources to the system increases its performance in proportion to the resources added.

Generally, the aim should be to have both good performance (fast for a single user) as well as a scalable system (fast for multiple users).
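
As a rough illustration of the scalability definition above, the following sketch compares how much throughput grew with how much the resources grew. The function name and the sample measurements are hypothetical and only meant to make the "in proportion" idea concrete.

```python
def scaling_efficiency(base_resources, base_throughput, new_resources, new_throughput):
    """Ratio of throughput growth to resource growth.

    A value of 1.0 means performance grew in proportion to the resources
    added; lower values mean diminishing returns.
    """
    resource_growth = new_resources / base_resources
    throughput_growth = new_throughput / base_throughput
    return throughput_growth / resource_growth


# Hypothetical measurements: requests per second with 2 and then 4 machines.
proportional = scaling_efficiency(2, 1000, 4, 2000)
diminishing = scaling_efficiency(2, 1000, 4, 1400)
print(proportional, diminishing)
```

In the first case, doubling the machines doubles the throughput, so the system scales as defined above. In the second case, doubling the machines yields only 40% more throughput, which usually points to a shared bottleneck such as the database.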

Reference: A Word on Scalability

Latency and Throughput

Latency: Latency is the time required to perform some action or to produce some result. Latency is measured in units of time -- hours, minutes, seconds, nanoseconds, or clock periods.
Throughput: Throughput is the number of actions executed or results produced per unit of time.

Generally, the aim should be to have low latency and high throughput.
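
To see how the two measures relate, here is a minimal sketch under a simplifying assumption: every request takes the same time, and workers are always busy and do not slow each other down. The function and parameter names are hypothetical.

```python
def throughput_per_second(latency_seconds, concurrent_workers=1):
    """Requests completed per second when each request takes
    `latency_seconds` and `concurrent_workers` requests run in parallel.

    Simplifying assumption: workers never wait on each other.
    """
    return concurrent_workers / latency_seconds


# Same latency per request, different throughput.
single_worker = throughput_per_second(0.2)
ten_workers = throughput_per_second(0.2, concurrent_workers=10)
print(single_worker, ten_workers)
```

With a latency of 0.2 seconds, one worker completes about 5 requests per second and ten workers about 50. Each individual user still waits 0.2 seconds, so adding workers improves throughput without improving latency.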

Reference: Understanding Latency versus Throughput

CAP Theorem

Before learning about CAP Theorem, let's try to understand the following terminologies.

Consistency: After a write request updates some data, every subsequent read should return the updated data.

Availability: Every request gets a response irrespective of whether the data is consistent or not.
Availability is generally measured in terms of the number of 9s. An availability of 99.99% (four 9s) means that the system may be down for at most about 52.6 minutes per year.
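
The downtime figure for a given number of 9s is simple arithmetic, sketched below. The function name is hypothetical, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def max_downtime_minutes_per_year(availability_percent):
    """Maximum yearly downtime allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = max_downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} minutes per year")
```

Each additional 9 cuts the allowed downtime by a factor of ten, from roughly 3.65 days per year at 99% down to a little over 5 minutes per year at 99.999%.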

Partition Tolerance: In a distributed system, a network failure between any two nodes can leave them unable to communicate with each other (this is known as a partition). A distributed system is partition tolerant if it continues to work even after a partition has happened.

The CAP Theorem states that a distributed system can provide at most two of the three guarantees (Consistency, Availability, and Partition Tolerance) at the same time.

Since networks are bound to fail, you need to support Partition Tolerance in a distributed system which leaves us with two options:

CP (Consistency and Partition Tolerance): The system will either give a consistent response or it will fail. It will never give an incorrect response.
AP (Availability and Partition Tolerance): The system will always give a response even though it might be inconsistent.
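
The difference between the two options can be sketched with a toy store that keeps one value on two replicas. This is a deliberately simplified model, not how any real database implements it: the class names, the `partitioned` flag, and the rule that writes arrive at replica `a` while reads arrive at replica `b` are all assumptions made for illustration.

```python
class Replica:
    def __init__(self):
        self.value = None


class TinyStore:
    """Two replicas of a single value; `mode` is "CP" or "AP"."""

    def __init__(self, mode):
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True when a and b cannot communicate

    def write(self, value):
        """Writes arrive at replica a and are copied to b when possible."""
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot replicate during a partition")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value

    def read_from_b(self):
        """Reads arrive at replica b."""
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot confirm the latest value")
        return self.b.value


# AP: the system keeps answering during the partition, but b is stale.
ap = TinyStore("AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")
ap_read = ap.read_from_b()

# CP: the system refuses requests during the partition instead.
cp = TinyStore("CP")
cp.write("v1")
cp.partitioned = True
try:
    cp.write("v2")
    cp_result = "accepted"
except RuntimeError:
    cp_result = "rejected"
print(ap_read, cp_result)
```

In AP mode the read returns the old value "v1" even though "v2" has been written, while in CP mode the write is rejected, so no reader can ever observe a stale value.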

Choosing CP or AP depends on the business use case.

Examples:

In a social network, AP might make more sense. It is fine if a like/comment does not show up for a few seconds.
In a banking system, CP might make more sense. A completed transaction must be reflected in every subsequent read; otherwise, stale data (such as an outdated balance) could be exploited. It is relatively fine if the system is unable to accept transactions for a while.

Most system design architectures/patterns optimize for one of the above two options (CP or AP).

Designing a system

Let's design a scalable system with a real-world example (Twitter).

When we build our own social media platform, we will not have a lot of load/traffic at the beginning. We can start with a single machine containing our backend server (Port: 80/443) and a MySQL database (Port: 3306). Ports 80 and 443 will be open to the outside world, whereas all other ports can only be accessed internally.

Our domain is mapped to the public IP address of this machine in the DNS. Whenever a user opens Twitter, the client sends requests to the backend server, and the server interacts with the database.
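
The request flow on this single machine can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the MySQL database so that the sketch runs without a database server, and the `handle_request` function, the `/tweets` path, and the table layout are hypothetical.

```python
import sqlite3

# Stand-in for the MySQL database running on the same machine.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")


def handle_request(method, path, payload=None):
    """Hypothetical backend entry point: route a request and talk to the database."""
    if method == "POST" and path == "/tweets":
        cursor = db.execute(
            "INSERT INTO tweets (author, body) VALUES (?, ?)",
            (payload["author"], payload["body"]),
        )
        db.commit()
        return 201, {"id": cursor.lastrowid}
    if method == "GET" and path == "/tweets":
        rows = db.execute("SELECT id, author, body FROM tweets ORDER BY id").fetchall()
        return 200, [{"id": r[0], "author": r[1], "body": r[2]} for r in rows]
    return 404, {"error": "not found"}


# A client posts a tweet and then loads the timeline.
post_status, created = handle_request("POST", "/tweets", {"author": "alice", "body": "hello"})
get_status, tweets = handle_request("GET", "/tweets")
print(post_status, created, get_status, tweets)
```

Because the application and the database share one machine, they also share its CPU, memory, and disk, which is why both become bottlenecks once traffic grows.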

We now have a simple, performant system that works as expected. Users are delighted and start talking about it on other social media platforms.

Our website starts getting a lot of traffic, and we now face the problem that every small company is happy to have: the traffic is higher than anticipated, and our single server is unable to handle it. The two major components that we need to scale are the application server and the database. Let's look at how to scale both of them in the next parts of this series.
