Design Pastebin or Bit.ly – System Design Interview Question

System design interview questions are commonly asked these days, and software engineers often struggle to answer them. Today we discuss how to design a Pastebin or Bit.ly service to understand its software design structure.

Bit.ly is a URL shortening service. In layman’s terms, when you provide it a URL, it gives you back a short URL, and when someone clicks that short URL, they are redirected to the original URL you provided.

Let’s Discuss the Use Cases and Constraints

To answer any system design question, you should follow a step-by-step structure that covers all the important points the interviewer is looking for. In this step-by-step guide, the first step is discussing the use cases and their constraints.

Use Cases:

  • The user enters a block of text (usually a URL) and gets a randomly generated link. You may discuss with the interviewer whether the generated link should expire after some time or exist indefinitely; in our case, we will discuss both.
  • The user enters the paste or Bit.ly URL and views the contents.
  • The user can be anonymous. Discuss with your interviewer whether users should be anonymous; in this case, we assume they are.
  • The service tracks analytics of the links, for example monthly visit stats.
  • The service deletes expired shortened URLs and pastes.
  • The service has very high availability.

Additional use cases that can be added after discussion with the interviewer:

  • The user registers for an account on the website via email and verifies the account.
  • The user can log in and edit their URLs and pastes.
  • The user can set the visibility of a paste.
  • The user can choose a custom short link.

Constraints and assumptions

After discussing use cases, the next step is to state the constraints and assumptions we are going to follow while answering the question.

State assumptions:

  • Traffic is not evenly distributed and is dynamic.
  • Following a short link should be fast.
  • Pastes are text only, nothing else.
  • Real-time page view or URL analytics are not required.
  • Expected users: more than 11 million.
  • Expected paste writes: around 10 million per month.
  • Expected paste or short URL reads: more than 100 million per month.
  • 10:1 read-to-write ratio.

Calculating Usage for the Complete System

You should discuss with the interviewer whether they want to go through the complete usage of this system in terms of back-of-the-envelope calculations.

  • Size per paste
    • 1 KB content per paste
    • shortlink – 7 bytes
    • expiration_length_in_minutes – 4 bytes
    • created_at – 5 bytes
    • paste_path – 255 bytes
    • total = ~1.27 KB
  • 12.7 GB of new paste content per month
    • 1.27 KB per paste * 10 million pastes per month
    • ~450 GB of new paste content in 3 years
    • 360 million shortlinks in 3 years
    • Assume most are new pastes instead of updates to existing ones
  • 4 paste writes per second on average
  • 40 read requests per second on average

To put the above calculations into a broader picture, here is a handy conversion guide:

  • 2.5 million seconds per month
  • 1 request per second = 2.5 million requests per month
  • 40 requests per second = 100 million requests per month
  • 400 requests per second = 1 billion requests per month
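These numbers can be sanity-checked with a few lines of Python; the constants below are simply the assumptions stated above:

# Sanity-check of the back-of-the-envelope numbers above.
SECONDS_PER_MONTH = 2.5e6      # ~30 days, rounded
PASTE_SIZE_KB = 1.27           # 1 KB content + ~0.27 KB metadata per paste
WRITES_PER_MONTH = 10_000_000
READS_PER_MONTH = 100_000_000

print(PASTE_SIZE_KB * WRITES_PER_MONTH / 1e6)  # ~12.7 GB of new content per month
print(WRITES_PER_MONTH / SECONDS_PER_MONTH)    # ~4 paste writes per second
print(READS_PER_MONTH / SECONDS_PER_MONTH)     # ~40 read requests per second
print(WRITES_PER_MONTH * 36 / 1e6)             # ~360 million shortlinks in 3 years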

Creating a High-Level Design for the Above System

You should always outline the high-level design once you have covered the important components, namely the use cases and constraints.

[Figure: High-level design for the short URL system]

As the high-level picture shows, a client communicates with the web server, which handles each user event.

The web server delegates writes and reads to the Write API and Read API, which in turn communicate with the SQL database and the object store.

All reads and writes are done through the Read API and Write API, which talk to the database or the object store. Similarly, an Analytics API reads analytics data for short URLs directly from the database.

Designing the Core Components

You should always try to dive into the details of each core component with the interviewer. This helps the interviewer gauge your knowledge of system design in general.

Use case: User enters a block of text (usually a URL) and gets a randomly generated link (in short format)

We could use a relational database as a large hash table, mapping the generated URL to a file server and path containing the paste file. Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.

As an alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between SQL and NoSQL. The following discussion uses the relational database approach.

  • The Client sends a create paste request to the Web Server, running as a reverse proxy
  • The Web Server forwards the request to the Write API server
  • The Write API server does the following:
    • Generates a unique URL
      • Checks if the URL is unique by looking at the SQL Database for a duplicate
      • If the URL is not unique, it generates another URL
      • If we supported a custom URL, we could use the user-supplied link (also checking it for duplicates)
    • Saves to the SQL Database pastes table
    • Saves the paste data to the Object Store
    • Returns the URL

Discuss with your interviewer how much code they expect you to write, or whether pseudocode alone is enough. There may be situations where the interviewer expects real code, so a minimal sketch of the write path is shown below.
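The following Python sketch illustrates the write path described above. Note that db, object_store, and generate_unique_shortlink are hypothetical stand-ins for the components in our design, not a real API:

def create_paste(paste_contents, expiration_length_in_minutes):
    """Sketch of the Write API server handling a create-paste request."""
    # Generate a unique short URL (see the Base 62 discussion below).
    shortlink = generate_unique_shortlink()
    # Save the paste contents to the Object Store (e.g. an S3-like service).
    paste_path = object_store.put(paste_contents)
    # Record the metadata in the SQL pastes table.
    db.execute(
        "INSERT INTO pastes (shortlink, expiration_length_in_minutes, "
        "created_at, paste_path) VALUES (?, ?, datetime('now'), ?)",
        (shortlink, expiration_length_in_minutes, paste_path),
    )
    return shortlink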

The paste table could have the following structure:

shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)

Setting the primary key on the shortlink column creates an index that the database uses to enforce uniqueness; as you may already know, a primary key is always unique and cannot be duplicated under any circumstances.

We’ll create an additional index on created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk 80x longer.

To generate the unique URL, we could use the following methods:

  • Take the MD5 hash of the user’s ip_address + timestamp because:
    • MD5 is a widely used hashing function that produces a 128-bit hash value.
    • MD5 is uniformly distributed.
    • Alternatively, we could take the MD5 hash of randomly generated data.
  • And Base 62 encode the MD5 hash because:
    • Base 62 encodes to [a-zA-Z0-9] which works well for URLs, eliminating the need for escaping special characters
    • There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
    • Base 64 is another popular encoding but provides issues for URLs because of the additional + and / characters
    • The following Base 62 encoding runs in O(k) time, where k is the number of digits (7 in our case):
def base_encode(num, base=62):
    # Repeatedly divide by the base, collecting remainders as digits.
    digits = []
    while num > 0:
        num, remainder = divmod(num, base)
        digits.append(remainder)
    digits.reverse()
    return digits
  • Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
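Putting it together, here is a runnable sketch. hashlib and time come from Python's standard library; the exact alphabet ordering and the URL_LENGTH of 7 are our assumptions:

import hashlib
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
URL_LENGTH = 7  # 62^7 possible values

def base62_encode(num):
    """Encode a non-negative integer as a Base 62 string."""
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)) or ALPHABET[0]

def generate_shortlink(ip_address):
    """MD5-hash ip_address + timestamp, then Base 62 encode the result."""
    seed = (ip_address + str(time.time())).encode()
    num = int(hashlib.md5(seed).hexdigest(), 16)
    return base62_encode(num)[:URL_LENGTH]

print(generate_shortlink("203.0.113.7"))  # e.g. 'x3Kp9Qa'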

We’ll use a public REST API:

$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
    "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste

Response:

{
    "shortlink": "foobar"
}

For internal communications, we could use Remote Procedure Calls.

Use case: User enters a paste’s URL and views the contents

  • The Client sends a get paste request to the Web Server
  • The Web Server forwards the request to the Read API server
  • The Read API server does the following:
    • Checks the SQL Database for the generated URL
      • If the URL is in the SQL Database, fetch the paste contents from the Object Store
      • Else, return an error message for the user
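In code, the read path might look like the following sketch, where db and object_store are the same hypothetical helpers used in the write path above:

def get_paste(shortlink):
    """Sketch of the Read API server handling a get-paste request."""
    row = db.fetch_one(
        "SELECT paste_path, created_at, expiration_length_in_minutes "
        "FROM pastes WHERE shortlink = ?",
        (shortlink,),
    )
    if row is None:
        return {"error": "paste not found"}
    # The SQL row only stores the path; the contents live in the Object Store.
    paste_contents = object_store.get(row["paste_path"])
    return {
        "paste_contents": paste_contents,
        "created_at": row["created_at"],
        "expiration_length_in_minutes": row["expiration_length_in_minutes"],
    }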

REST API:

$ curl https://pastebin.com/api/v1/paste?shortlink=foobar

Response:

{
    "paste_contents": "Hello World",
    "created_at": "YYYY-MM-DD HH:MM:SS",
    "expiration_length_in_minutes": "60"
}

Use case: Service tracks analytics of pages

Since real-time analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.

Again, discuss with your interviewer how much code they expect you to write here.

from mrjob.job import MRJob


class HitCounts(MRJob):

    def extract_url(self, line):
        """Extract the generated url from the log line."""
        ...

    def extract_year_month(self, line):
        """Return the year and month portions of the timestamp."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (2016-01, url0), 1
        (2016-01, url0), 1
        (2016-01, url1), 1
        """
        url = self.extract_url(line)
        period = self.extract_year_month(line)
        yield (period, url), 1

    def reducer(self, key, values):
        """Sum values for each key.

        (2016-01, url0), 2
        (2016-01, url1), 1
        """
        yield key, sum(values)
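To actually run the job, mrjob expects a main guard. Assuming the class above lives in a file called hit_counts.py:

if __name__ == '__main__':
    HitCounts.run()

It can then be run locally against the web server logs:

$ python hit_counts.py web_server.log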

Use case: Service deletes expired pastes automatically after a stipulated time frame

To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamps are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.

We could also add an automatic scanner that runs periodically, say every 30 days, to delete the expired URLs; this reduces the overall load on the database and keeps it optimized.
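Below is a sketch of that cleanup job, assuming for simplicity that created_at is stored as a Unix timestamp and using Python's standard sqlite3 module as a stand-in for the SQL Database:

import sqlite3
import time

def delete_expired_pastes(db_path):
    """Delete pastes whose expiration time has passed (sketch)."""
    now = int(time.time())
    with sqlite3.connect(db_path) as conn:
        # created_at plus expiration (converted to seconds) earlier than now means expired.
        conn.execute(
            "DELETE FROM pastes "
            "WHERE created_at + expiration_length_in_minutes * 60 < ?",
            (now,),
        )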

Scaling the Design to Handle Large Request Volumes

Scaling is one of the most important parts of a system design interview, as the interviewer likes to ask about scaling the design you just built: if there are millions or billions of users, will your system hold up, or will it break?

What techniques can we follow to make the design more scalable, what bottlenecks do we have to take care of, and can the system handle them?

[Figure: Scaled design for the Pastebin / Bit.ly system]

Important: Do not simply jump right into the final design from the initial design!

State that you would do this iteratively: 1) benchmark/load test, 2) profile for bottlenecks, 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat.

It’s important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? A CDN? Master-Slave Replicas? What are the alternatives and trade-offs for each?

We’ll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.

The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery. An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.

To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
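A read-through caching sketch of this idea follows; cache is a hypothetical Memcached- or Redis-like client, and the one-hour TTL is an arbitrary assumption:

def get_paste_cached(shortlink):
    """Serve popular pastes from the Memory Cache before touching the database."""
    paste = cache.get(shortlink)
    if paste is not None:
        return paste                   # cache hit: no database access
    paste = get_paste(shortlink)       # cache miss: falls back to a SQL Read Replica
    cache.set(shortlink, paste, 3600)  # expire after one hour (assumed TTL)
    return paste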

4 average paste writes per second (higher at peak) should be doable for a single SQL Write Master-Slave. We should also consider moving some data to a NoSQL Database.

If you liked this answer, please follow us on Facebook and Twitter. Let us know which questions and answers you want us to cover on this blog.

Want to read more interview-related questions? Check the Top Interview Questions category.

Note: All the code used here carries Creative Commons attribution and is provided under an open-source license.