Introduction to Apache Kafka


A lot of effort in any software development project is wasted because of tight coupling between components. In spite of all precautions, we end up with components that should have been independent but got tightly coupled - making it impossible to change one without disturbing the other.

Hence, a lot of effort in software architecture and design goes into decoupling components - into developing tools and best practices that enable such decoupling. Using these tools and frameworks does not by itself ensure decoupling, but a sensible architect or developer can use them to reduce coupling between components.

What is Kafka?


Kafka is one such tool - a messaging system that enables us to separate components into message producers and consumers. In a well designed system, one microservice does its job and publishes the output to Kafka - with the hope that someone, somewhere will read and process that message.

Similarly, the consumer just reads messages off the Kafka bus and processes them, trusting that someone, somewhere has provided the required data. This does not mean relaxing security. But the components should not be coded to communicate with each other explicitly. Instead, they communicate with the Kafka broker, which takes care of passing the messages along.

Important Concepts


Kafka is a distributed streaming platform.

Sounds nice, doesn't it? But what exactly does it mean? In very simple words, it means that Kafka is an application that helps you handle a continuous flow of data. It is scalable, and can provide very high throughput by running on multiple servers at a time.

To understand Kafka, we need to understand a few important concepts:

  • Message
  • Producer
  • Consumer
  • Broker
  • Cluster
  • Topic
  • Partition
  • Offset
  • Consumer Group

Let's look at them one by one.

Message


The message is the core of all that we are doing - the data that we need to pass over Kafka from one component to another. Although Kafka is a streaming platform, data travels in chunks, and these chunks are called messages. A Kafka message could be formatted JSON or XML, a simple Unicode string, or anything else that the components want to exchange. But for Kafka, it is only an array of bytes.
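For example, the Java client library turns whatever we hand it into bytes through a serializer. Below is a minimal sketch of that idea, assuming the kafka-clients library is on the classpath; the topic name and JSON payload are purely illustrative.

import org.apache.kafka.common.serialization.StringSerializer;

// Kafka clients use a serializer to turn whatever we send into the byte array Kafka actually stores.
StringSerializer serializer = new StringSerializer();
byte[] payload = serializer.serialize("myTestTopic", "{\"greeting\": \"Hello Kafka\"}");
// 'payload' is all that Kafka ever sees - just bytes.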

Topic


Any message on Kafka belongs to a topic. Several topics can be created on a Kafka cluster. One can think of a topic as an independent queue - where messages are enqueued independently of the other topics.

Producer


As one would expect, a producer is the external entity that produces messages to be published on a Kafka topic. It invokes the Kafka API with the actual message and the target topic as parameters.

This adds a new message to the given topic.
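To give an idea of what that API call looks like, here is a minimal Java producer sketch. It assumes a broker running on localhost:9092 and a topic named myTestTopic - both are just placeholders for your own setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");          // broker address - adjust for your setup
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
// Publish one message on the topic; Kafka appends it and assigns it an offset.
producer.send(new ProducerRecord<>("myTestTopic", "Hello Kafka"));
producer.close();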

Consumer


Complementary to the producer, the consumer reads messages off a Kafka topic. The messages are available to the consumer in FIFO order.

The consumer can listen to, or poll, a given topic for new messages, and process each message as per the business logic involved.

The Kafka consumer only needs to know the topic on which the messages are published - nothing about the source of the messages, or about any other entity that might be reading the same messages.

Kafka takes care of managing all that for us.
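As a rough sketch of what such a consumer looks like in Java - the broker address, group id and topic name below are just placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-consumer-group");                // explained under Consumer Group below
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("myTestTopic"));  // only the topic name is needed

while (true) {
    // Poll the broker for new messages and apply the business logic to each one.
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Received: " + record.value());
    }
}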

Broker


Aptly named, the broker is the core component of Kafka - the one that takes care of passing messages from one side to the other. The producer and the consumer both interact with the broker to exchange messages across the network.

Cluster


As we saw, Kafka is a distributed platform. It is not restricted to a single server; it can be spread over several servers.

Such a distributed deployment is called a cluster. A Kafka cluster is simply a Kafka instance running across several servers.
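From a client's point of view, the only visible difference is that it can be pointed at several brokers instead of one. A small sketch of that configuration, with hypothetical host names:

import java.util.Properties;

Properties props = new Properties();
// The client needs only a few brokers to bootstrap; it discovers the rest of the cluster from them.
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");  // hypothetical host names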

Offset


As we saw, Kafka enqueues the messages it gets from the producer and keeps them safe for consumers who might ask for them. Kafka does not discard a message immediately after it is read; it has to keep it around for any other potential consumer that might want the same message. Of course, this retention duration can be configured.

But the point is that it is necessary to keep track of messages that are processed by a consumer. This is implemented using the concept of offsets. An offset is simply a sequence number assigned to a message on the topic. Thus, as soon as a producer publishes a message on the topic, it is assigned an offset. Kafka takes care of maintaining this offset counter.

When a consumer reads a message from the topic, it also notes the offset of the last message it read. The next fetch returns messages only if there are more messages after that offset.
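To make the idea concrete, the Java consumer exposes the offset of each record, and it can explicitly tell the broker which offsets it has processed. A small sketch that builds on the consumer above, assuming enable.auto.commit has been set to false in its properties:

// Inside the poll loop of the consumer sketch above (with enable.auto.commit=false):
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
for (ConsumerRecord<String, String> record : records) {
    // Every message carries the offset Kafka assigned to it when it was published.
    System.out.println("offset=" + record.offset() + " value=" + record.value());
}
// Tell the broker which offsets have been processed, so the next poll starts after them.
consumer.commitSync();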

Consumer Group


As the consumer reads from the Kafka topic, it could keep track of its last offset itself - without Kafka bothering about it. But that is not the best thing to do, as some logic that belongs to Kafka then enters the consumer. The consumer develops state. If that state is lost - if the consumer instance terminates or is restarted - the last offset is lost, and the new consumer instance starts reading all over again. This should be avoided.

Also, as the consumer application scales, we may have a distributed consumer, with several pods reading off the Kafka topic. We do not want two pods reading the same message, and processing it twice.

This problem is solved by the concept of a Kafka consumer group. When a set of consumer instances wants to share the work, they register under a common consumer group. The Kafka broker then takes care of allocating the messages to the individual instances of the consumer group.

So the consumer group as a whole has a last offset, and it is maintained by the Kafka broker itself. The next request for a message from any instance in the consumer group returns a new message based on the last offset of the group as a whole.

So we can be sure there is no duplicate processing or missed records.
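In the Java client, joining a consumer group is nothing more than setting a group id; every instance started with the same id shares the work. A sketch of the only change needed to the consumer above - the group name here is hypothetical:

// Every instance started with this same group.id becomes part of one consumer group.
// The broker assigns each partition to exactly one instance and tracks the group's offsets.
props.put("group.id", "order-processors");   // hypothetical group name
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("myTestTopic"));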

Zookeeper


As the name suggests, ZooKeeper is a kind of manager. ZooKeeper takes care of leadership election for Kafka broker and topic partition pairs. It manages service discovery for the Kafka brokers that form the cluster. It sends topology changes to Kafka, so each node in the cluster knows when a new broker joined, a broker died, a topic was removed or a topic was added, etc. ZooKeeper provides an in-sync view of the Kafka cluster configuration.

Partition


There are times when the messages are just too many to be processed sequentially, and we have to split the data into parallel streams. That is especially true when we have a distributed system. For this, Kafka gives us the concept of topic partitions.

A topic can be partitioned to allow multiple parallel streams. This does not mean breaking the FIFO order entirely: Kafka still lets you make sure that related messages go sequentially to the same partition. But when we have unrelated chunks of messages, FIFO order across all of them may not be required. That is where partitioning helps.

The consumers still fetch messages by topic and group id; the underlying partitions just work seamlessly.
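One common way to keep related messages in order with the Java producer is to give them the same key; Kafka hashes the key to pick the partition, so records sharing a key stay on one partition. A sketch reusing the producer from the earlier sketch, with hypothetical keys and values:

// Records with the same key always land in the same partition, so their order is preserved.
producer.send(new ProducerRecord<>("myTestTopic", "order-42", "order created"));
producer.send(new ProducerRecord<>("myTestTopic", "order-42", "order shipped"));
// A different key may go to a different partition and be processed in parallel.
producer.send(new ProducerRecord<>("myTestTopic", "order-99", "order created"));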

Installing Kafka


Enough theory. Let us now get our hands dirty by creating a Kafka instance. Kafka is developed in Java, so it is platform independent and installation is not difficult at all. We just need to download the required jar files, place them appropriately, and then use the appropriate commands to run it.

But things soon get complicated as the number of jar files and the complexity of the commands increase. To simplify this, Apache Kafka provides a single compressed file that contains everything we need. Just download it from the Apache server.

Download & Extract


If you prefer not to download it from the browser, you can use the command line:

wget http://www-us.apache.org/dist/kafka/2.4.0/kafka_2.13-2.4.0.tgz

Next, extract its contents. Again, if you are a Linux freak, you may not want to use WinZip to do the job. So here is the command line:

tar xzf kafka_2.13-2.4.0.tgz

Then, move the extracted folder to a suitable location:

mv kafka_2.13-2.4.0 /usr/local/kafka

JDK


Kafka is implemented in Java, so it needs a JDK to run. If you are on Windows, you will have to install the latest JDK by downloading it from the Oracle site. But if you are on Linux, you might want to use OpenJDK instead:

sudo apt install default-jdk

Install


If you check out the extracted folder, you will see something like this:

$ ls
bin  config  libs  LICENSE  logs  NOTICE  site-docs

The libs folder has all the jar files. The bin folder has all the scripts - bash scripts if you are on Linux, or bat scripts if you are on Windows. They contain all that you need to run the Kafka server.

Create Service on Linux


If you are on Linux, you can go a step further by installing Kafka as a systemd service. Use the configuration below.

zookeeper.service


Create a file /etc/systemd/system/zookeeper.service, and add the below content to it.

[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/usr/local/kafka/bin/zookeeper-server-start.sh /usr/local/kafka/config/zookeeper.properties
ExecStop=/usr/local/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target

kafka.service


Next, create another service file /etc/systemd/system/kafka.service - and add the below content.

[Unit]
Description=Apache Kafka Server
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service

[Service]
Type=simple
Environment="JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk-amd64"
ExecStart=/usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
ExecStop=/usr/local/kafka/bin/kafka-server-stop.sh

[Install]
WantedBy=multi-user.target

Reload Daemon


Finally, we reload the systemd daemon so that it picks up the new service definitions:

sudo systemctl daemon-reload

Working with Kafka


Now that we have Kafka installed and set up on our machine, we can try it out. We have to start the broker, create a topic, and set up a producer and a consumer. Finally, we send a message from the producer to the consumer. Let us see how this works.

Start Kafka


As we saw above, the bin folder has all the scripts required to work with Kafka; we just need to pick the appropriate one and run it. Since we created services for it on Linux, we can use these commands:

sudo systemctl start zookeeper
sudo systemctl start kafka

That is it!

Create a Topic


Next, we run the script for creating a topic.

cd /usr/local/kafka
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic myTestTopic

This creates a topic for us. Since it is a simple test topic, we have set the replication factor to 1. In a distributed enterprise deployment, the replication factor would be higher.
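The same thing can also be done programmatically. Here is a minimal Java sketch using the AdminClient API, assuming the broker is on localhost:9092 and the surrounding method is allowed to throw exceptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // Topic name, number of partitions, replication factor - same values as the CLI command above.
    NewTopic topic = new NewTopic("myTestTopic", 1, (short) 1);
    admin.createTopics(Collections.singletonList(topic)).all().get();
}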

Producer


Creating a producer is equally simple. Just run the script provided to us:

cd /usr/local/kafka
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic myTestTopic

This gives us a > prompt, where we can type the messages we want to send.

> Hello World
> Hello Kafka
>

The message is accepted by Kafka and saved for any consumer who might ask for it.

Consumer


Finally, we create a consumer that can read the messages from the producer:

cd /usr/local/kafka
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTestTopic --from-beginning

As we run this command, the consumer starts reading from the chosen topic, and we see this output:

Hello World
Hello Kafka

We can continue to type from the producer, and the messages will appear on the consumer. Cool?

Well, Kafka is not limited to sending text messages between two terminal windows. We can send huge volumes of messages across the globe, across multiple applications. That is what makes it interesting.