System Design - Scaling the application

System Design Series - Tutorials and Interview Template

System Design - Introduction | System Design Primer | Beginners

System Design - Scaling the application | System Design Primer

System Design - Scaling the database | System Design Primer

SQL vs NoSQL: Should we use SQL or NoSQL? | Databases

System Design Interview Template | Ace the System Design Interview

Note: This is the 2nd part of the System Design series and assumes that you know the basic terminologies of System Design.

Create a distributed system

The first thing that we can do to reduce the load on our machine is to move the database to another machine. Now, we have a distributed system.

Note: Both the machines would be in the same network. Our backend server would interact with the database through its private IP. No other computer should be able to access the database.

This solution is definitely a necessity. But, a single machine hosting our backend application would have two issues:

Single point of failure: If this machine dies, we will not be able to serve requests and our website goes down. This machine is a single point of failure.
High load: By moving the database to a different machine, we would be able to reduce the resource utilization on the machine. But, the resources could still not be enough for the load.

Scaling for high load: Vertical Scaling

After getting more traffic on our website, we are short of resources on the machine. What can we do to fix this?

Replace this machine with a larger machine in terms of resources (CPU, memory, etc).

This type of scaling is known as Vertical Scaling. Vertical Scaling means scaling the system by providing more resources to the system through larger machines.

Pros and Cons of Vertical Scaling

Vertical Scaling has the following advantages:

Scale: Vertical Scaling solves our problem of handling the high load.
Simplicity: It is simple as we do not need to make any code changes.

Vertical Scaling has the following disadvantages:

Limited resources on a single machine: We can get limited resources from a single machine. Even supercomputers might not be able to handle the load that the most used websites get.
Single point of failure: If this machine dies, our website goes down. This machine is a single point of failure.
Expensive: We would need a machine that can handle our website's peak load. But, most of the time, the resources would be underutilized.

Scaling for high load: Horizontal Scaling

Vertical Scaling has its limitation in terms of the amount of resources that we can get in a single machine. We can fix this by having multiple smaller machines serving the requests. The requests would be sent to our original server which will route the traffic to one of the nodes. The routing will happen through some algorithm.

This type of scaling is known as Horizontal Scaling. Horizontal Scaling means scaling the system by distributing the traffic among multiple machines/nodes.

How to make Horizontal Scaling work?

Load Balancer

The original machine will route and balance the load among the different nodes. The machine/service that balances the load is known as a load balancer.

There are various applications/services that we can use to create a load balancer. Load balancers can route traffic based on various metrics like least-busy, random, round-robin, sticky, etc.

Pros and Cons of Horizontal Scaling

Horizontal Scaling is generally better than Vertical Scaling as it provides the following advantages:

Distributed load: Load is distributed across multiple machines. If we expect a higher load, we can add new nodes to handle the increased traffic.
No single point of failure: Even if a machine dies, the system would still work as other nodes can take up the requests.
You might say that the load balancer is now a single point of failure which is true. We can fix this by having two or more load balancers.
Even DNS servers allow mapping multiple IP Addresses to a domain name. The DNS server provides one of the IP Addresses to the client based on some logic.
Cheap: We only need to use resources based on the current traffic. We can increase the number of nodes if we expect higher traffic. We can reduce the number of nodes in case of lower traffic. Most cloud providers provide auto-scaling. Auto-scaling services automatically add or remove nodes based on some rules.

Horizontal Scaling has the following disadvantages:

Stateless systems: Nodes should not contain any stateful data as the requests can go to any node. Stateful data includes user information, preferences, sessions, etc. Moving from a stateful system to a stateless system might be a bit difficult.
Different machines for DB, cache, etc: In the case of vertical scaling, we can keep the database, cache, etc on the same machine. With horizontal scaling, all the data stores need to be centralized outside the nodes.

Other Optimizations

We can do certain other optimizations in the application logic to scale our server.

Caching static resources

Caches are data stores that can store data so that future retrievals are faster. Caching is the technique of using cache to store data for faster data retrieval.

We can cache static resources like HTML, CSS, JS, and media files to avoid hits on our application server. Two major levels of caching static resources are:

Client Cache: Browsers can cache the static resources based on the cache policy we set on the response headers.
CDN (Content Delivery Network): CDN is a network of servers that stores resources at different locations. It serves requests from the ones that are closest to the user thereby reducing latency. We can store and retrieve the static resources from a CDN instead of the application servers.

Asynchronous Processing

There are certain scenarios where immediate processing is not required.

Example

When a user likes a tweet, the server can immediately respond with success. It can do so without actually sending the notification to the other person. It is fine if there is a delay of a few seconds in notifying. The notification flow can happen asynchronously.

Asynchronism helps reduce the latency of expensive requests. This is generally done using the concept of queues. Let's look at how this works:

The server takes the request and does some processing. It then puts the expensive processing tasks into a queue and responds with success.
Another set of servers (workers) listen to the queue for tasks that they can pick up.
Once any of the workers see a task:

they pick it up,
start doing the processing, and,
then signal after completing the task.

The server adding to the queue is also known as a publisher or producer. The workers are known as subscribers or consumers.

Tasks that do not need immediate processing can wait. Using a queue ensures that the user is not blocked because of such tasks.

Now that we have scaled our application servers, let's see if there are any other bottlenecks.

It's the database.

The database:

might not be able to handle the high traffic
is a single point of failure
might have increased latency for both 'reads' and 'writes'

Let's look at how to scale our database in the next article.

Next parts: