HLD or High Level System Design of Apache Kafka Startup
Last Updated: 30 Mar, 2023
Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time with low latency. It can handle a constant inflow of data that is generated sequentially and incrementally by thousands of data sources.
Why use Kafka in the first place?
Let’s look at the problem that inspired Kafka at LinkedIn. The problem is simple: LinkedIn was receiving a lot of logging data, such as log messages, metrics, events, and other monitoring/observability data, from multiple services. They wanted to utilize this data in two ways:
- Have an online near-real-time system that can process and analyze this data.
- Have an offline system that can process this data over a longer period.
Most of the processing was done for analysis, for example, analyzing user behavior, how users use LinkedIn, etc.
Requirement Gathering
The problem is easy to understand, but the solution can seem pretty complex. This is because the problem itself has so many constraints and requirements. Here are some examples of the requirements that such a system needs:
- The system should be highly scalable. Popular products can generate tens or hundreds of TBs of events, metrics, and logs daily, which easily amounts to hundreds of thousands of messages per second. Handling such high throughput requires a distributed system that scales almost linearly.
- It should allow “producers” to send messages and “consumers” to subscribe to certain messages. This is important since the same message can have multiple consumers (like the online and offline systems we discussed), and messages are generally processed asynchronously.
- Consumers should also be able to decide how and when to consume messages. For example, in the problem we discussed, we’d want one consumer to consume messages as soon as possible and the other to do it every few hours.
- Messages can be immutable (there is no need to delete log data, after all), and transaction-like semantics and complex delivery guarantees aren’t important requirements.
Message Brokers vs Kafka
It might seem that message brokers such as RabbitMQ or ActiveMQ could solve the above problem, but they fall short. Let's see why:
- Message Batching: Since the consumer pulls a lot of messages, it doesn’t make sense to pull them one by one. Most of the time, you’d want to batch messages; otherwise, most of your time would be wasted on network calls. Since message brokers aren’t really meant to support such high throughput, they generally don’t provide good ways to batch messages.
- Different consumers with different consumption requirements: We discussed having two types of consumers, an online system that processes messages in real time and an offline system that might want to read the messages received in the past twelve or twenty-four hours. This pattern doesn’t work well with most message brokers or queues. Some message brokers, like RabbitMQ, use a push-based model in which the broker pushes messages to the consumer. This leaves the consumer less flexibility, since it cannot decide how and when to consume messages.
- Small and simple messages: Message sizes are generally larger in most message brokers. This isn’t a bug; it’s by design. Message brokers often support many features, like different options for routing messages, message guarantees, and the ability to acknowledge every message individually, which leads to large individual message headers. Large messages are fine as long as you don’t have a lot of them and you don’t have to store them, but storing a lot of them is precisely what we want to do in our system.
- Distributed high-throughput system: One of the most important requirements is very high throughput. We want to support hundreds of thousands of messages per second, even going up to millions per second. Running this system on a single node is infeasible. We need a distributed system that can support this throughput, and many message brokers cannot.
- Large queues: Support for large queue sizes varies between message brokers. It depends on the broker you are using and on your configuration, but reports of problems with very large broker queues are common.
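To make the batching argument concrete, here is a minimal sketch of a pull-based consumer, assuming a toy in-memory broker. `ToyBroker`, `fetch`, and `consume_all` are hypothetical names for illustration, not part of any real broker API. It counts how many simulated network calls are needed to read the same 1000 messages with and without batching.

```python
from typing import List


class ToyBroker:
    """In-memory stand-in for a broker holding one ordered list of messages."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages
        self.round_trips = 0  # number of fetch calls served so far

    def fetch(self, offset: int, max_messages: int) -> List[str]:
        """Return up to max_messages starting at offset (one simulated network call)."""
        self.round_trips += 1
        return self.messages[offset:offset + max_messages]


def consume_all(broker: ToyBroker, batch_size: int) -> int:
    """Pull messages in batches until the broker returns nothing; return the count read."""
    offset = 0
    while True:
        batch = broker.fetch(offset, batch_size)
        if not batch:
            break
        offset += len(batch)
    return offset


messages = [f"log line {i}" for i in range(1000)]

one_by_one = ToyBroker(messages)
consume_all(one_by_one, batch_size=1)

batched = ToyBroker(messages)
consume_all(batched, batch_size=100)

# Each run ends with one extra fetch that returns an empty batch.
print(one_by_one.round_trips)  # 1001
print(batched.round_trips)     # 11
```

Because the consumer chooses both the offset and the batch size, it also controls when it reads, which is the flexibility a push-based model does not give it.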
So, let's now look at what the architecture of Kafka should be, given the requirements above.
High-Level Design Architecture of Kafka
High-level design of Apache Kafka
Various Components of the Above Design
- Topics: A topic is simply a stream of messages. Producers send messages to topics, and consumers poll topics for messages.
- Consumers: A consumer is simply an application that wants to listen to a topic. It continuously polls the broker for new messages on the topic. With each polling request, the consumer specifies its position in the topic (the last message it received) and some other configurable parameters.
- Producers: Producers are applications that produce messages and publish them to a topic. Publishing a message is pretty simple: specify a topic, a message, an optional key, and optional metadata, and send it to the broker.
- Consumer groups:
  - Consumers are typically part of a consumer group. Instead of a single consumer listening to a topic, a consumer group generally listens to it. The group comprises multiple consumers, and any one of them will receive a given message.
  - Generally, a single consumer cannot process a large volume of messages on its own, so you need multiple consumers to share the work. That way, you can support a higher throughput of messages.
  - Recall the earlier example, where various services publish events to a Kafka topic. The events could be related to user or organization activity, such as a user searching for a company or a new job posting. Two types of consumers listen to this topic. One is a recommendation service that processes these events and updates the data in its database that drives future recommendations for the user. The other is a script that runs once every 24 hours to provide insights into how users use our platform.
  - The recommendation service listens to this events topic in real time. However, since we are getting many messages, a single consumer in the recommendation service cannot cope, so we need to add more consumers.
  - This is where consumer groups come in. Multiple consumers can be part of the same consumer group, and the messages get divided among those consumers.
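The recommendation-service and daily-script example above can be sketched with a toy in-memory topic. This is a simplified model, not the Kafka client API: `ToyTopic`, `publish`, and `poll` are hypothetical names, and the sketch tracks a single read position per consumer group, whereas real Kafka tracks positions per partition. The point it illustrates is that each group reads the same stored messages independently and at its own pace.

```python
from typing import Dict, List


class ToyTopic:
    """Toy topic: one append-only list plus a read position per consumer group."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.group_offsets: Dict[str, int] = {}  # group name -> next index to read

    def publish(self, message: str) -> None:
        """Producer side: append the message to the end of the topic."""
        self.log.append(message)

    def poll(self, group: str, max_messages: int = 10) -> List[str]:
        """Consumer side: return unread messages for this group and advance its position."""
        start = self.group_offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.group_offsets[group] = start + len(batch)
        return batch


events = ToyTopic()
events.publish("user_1 searched for a company")
events.publish("new job posting created")

# The real-time group polls right away and sees the two events so far.
realtime_first = events.poll("recommendation-service")

events.publish("user_2 viewed a profile")

# It polls again and sees only the new event.
realtime_second = events.poll("recommendation-service")

# The daily script polls much later and still sees all three events.
daily_batch = events.poll("daily-insights-script")

print(len(realtime_first), len(realtime_second), len(daily_batch))  # 2 1 3
```

Reading a message does not remove it from the topic, which is why the second group can still see everything the first group already consumed.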
Partitions in topics for better scale
Partitions
Having a close look at topics, we see that every topic is divided into a configurable number of 'partitions'. Every single message in a topic is sent to exactly one partition.
Depending on the configuration and the message, this can be either based on the message's key or in a round-robin fashion. Regardless, what’s important is that a message sent to a topic eventually goes into a single partition.
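The two partition-selection strategies described above can be sketched as follows. This is an illustrative model only: `ToyPartitioner` and `partition_for` are hypothetical names, and `zlib.crc32` is used here as a stand-in hash, which is an assumption and not necessarily the hash function Kafka itself uses.

```python
import itertools
import zlib
from typing import Iterator, Optional


class ToyPartitioner:
    """Pick a partition by key hash when a key is given, else round-robin."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        # Endless 0, 1, ..., n-1, 0, 1, ... sequence for keyless messages.
        self._round_robin: Iterator[int] = itertools.cycle(range(num_partitions))

    def partition_for(self, key: Optional[bytes]) -> int:
        if key is None:
            return next(self._round_robin)
        # Stand-in hash (assumption): the same key always maps to the same partition.
        return zlib.crc32(key) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=4)

keyless = [partitioner.partition_for(None) for _ in range(5)]
print(keyless)  # [0, 1, 2, 3, 0]

first = partitioner.partition_for(b"user-42")
second = partitioner.partition_for(b"user-42")
print(first == second)  # True
```

Keying by something like a user ID keeps all of that user's messages in one partition, so they stay in order relative to each other.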
Partitions aren’t very complex. Each one is an append-only log of messages. Think of a partition as a log file and of each message as a line in that file.
Consumers from a consumer group aren’t directly listening to topics. Instead, they listen to zero, one, or more partitions of the topic. Every consumer gets messages only from the partitions it listens to.
Since every consumer is assigned its own partitions when it joins the group, consumers don’t need to coordinate with each other about which messages have already been consumed. This also helps Kafka scale linearly, since adding more partitions/nodes doesn’t increase the work or communication between existing partitions/nodes. These partitions often live on different brokers running on different machines.
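As a rough illustration of how partitions might be spread over the consumers in a group, here is a minimal sketch that deals partitions out in round-robin order. `assign_partitions` is a hypothetical helper written for this example, and Kafka's actual assignment strategies are configurable and may differ from this one.

```python
from typing import Dict, List


def assign_partitions(consumers: List[str], num_partitions: int) -> Dict[str, List[int]]:
    """Deal partitions 0..num_partitions-1 out to consumers in round-robin order."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["consumer-a", "consumer-b", "consumer-c"]

four = assign_partitions(group, num_partitions=4)
print(four)  # {'consumer-a': [0, 3], 'consumer-b': [1], 'consumer-c': [2]}

two = assign_partitions(group, num_partitions=2)
print(two)   # {'consumer-a': [0], 'consumer-b': [1], 'consumer-c': []}
```

The second call shows the "zero partitions" case: when a group has more consumers than the topic has partitions, the extra consumers sit idle, so the partition count caps the useful parallelism of a group.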
Kafka storage layout
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.
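The append-to-last-segment behavior can be sketched with an in-memory model. `ToyPartitionLog` is a hypothetical class for illustration; it uses byte arrays instead of files and a tiny segment limit so the rollover is visible, and it ignores details such as message framing and indexes that a real broker would need.

```python
from typing import List


class ToyPartitionLog:
    """One partition modeled as a list of size-limited, append-only segments."""

    def __init__(self, max_segment_bytes: int) -> None:
        self.max_segment_bytes = max_segment_bytes
        self.segments: List[bytearray] = [bytearray()]  # the last one is the active segment

    def append(self, message: bytes) -> int:
        """Append to the last segment, rolling to a new one when it would overflow.

        Returns the index of the segment the message was written to.
        """
        active = self.segments[-1]
        if active and len(active) + len(message) > self.max_segment_bytes:
            active = bytearray()
            self.segments.append(active)
        active.extend(message)
        return len(self.segments) - 1


log = ToyPartitionLog(max_segment_bytes=16)
placed = [log.append(m) for m in (b"aaaaaaaa", b"bbbbbbbb", b"cccccccc")]

print(placed)             # [0, 0, 1]
print(len(log.segments))  # 2
```

Older segments are never modified once a new one is started, which keeps writes sequential and makes it cheap to drop old data a whole segment at a time.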