What is Apache Kafka?
The Apache Kafka project page defines Apache Kafka as a publish-subscribe messaging rethought as a distributed commit log. Kafka is a high-throughput distributed messaging system.
The Kafka project page is probably the best source of information for anyone interested in Kafka and its inner workings. However, the above definitions would be more helpful to readers who already understand what a commit log is, how messaging systems work and has a basic idea of distributed system.
In this post, my attempt would be to explain these concepts in a simple way and make it understandable to anyone interested the basic architecture and workings of Apache Kafka. So, lets get started.
What is a message?
A message in its literal sense would be some information that needs to go from a source to a destination. When this information is sent from the source and is received by the destination we state that the message has been delivered. For e.g. let us consider you and your friend use a long hollow pipe to talk to each other. When your friend speaks into it from one end, you would keep your ears to the other end to hear what he just said. In this way you will receive the messages he would have to convey.
However, one thing to note here is that you and your friend should synchronize the activity of speaking and hearing, i.e. if your friend is speaking into the pipe, you at the other end should be ready to listen. If you are not ready, the message is lost.
What is a message in the context of a messaging system?
In its simplest form a message is nothing but data. A system that can handle transmission of data from a source to a destination is called a messaging system. In the computing world, the source could be machine that is generating data (for e.g. a Web Server generates large number of logs) or a human user. In most cases you will see the volumes generated by systems and machines to be way larger than the ones generated by human beings.
In a messaging system you would want to ensure that the message is delivered to the target. Even if the target is not ready to receive it, there has to be a provision to hold it. To achieve this, the messaging system should provide a way of retaining the message in the medium of communication until the target is ready to receive it.
The messaging system, with the ability to send, hold and deliver messages seems to be a good system. However, in reality, the scenarios could be such that the messages may have to be delivered to more than one destination. This introduces the need to hold the messages until it is consumed by all the destinations. This brings forward one more characteristic of a messaging system, i.e. the source is only going to deliver the messages to the communication channel/medium (think of the pipe) and the destination will be responsible for reading the message from the medium. So now, the messages can wait on the channel and the different destinations will read the messages at their own pace.
As there could be many destinations, there could also be several sources that write to the medium.
With this understanding, lets give some technical labels to the different components in a messaging system.
The source or the different sources create messages. Lets call them publishers.
The destinations/targets are the one who read the message, in other words, they subscribe to the messages generated by the publishers. Lets call them subscribers.
There is one more component that we still need to mention. So far, we have been referring to it as the channel/medium to which publishers write data and from where subscribers read data. That is not an entirely correct way of referring to it. A more appropriate way to call it would be to refer to it as – the log. Why log?
Logs, in a general sense stand for a record of “what happened”. In our context the message can be classified as an event and the log is what stores that event. Publishers publish events to the log and the subscribers read these events from the log to know “what happened”. Following is a simple illustration of how a log appears.
As you can see above, the log is a sequence of numbered messages that are append only and ordered by time. The log, in the database world, is often referred to as a commit log. To understand how logs work and understand it in-depth is not the intention of this post. However, if you are interested in knowing more about logs, you should read the following blog post by Jay Kreps:
This is one of the best technical posts I have read. It is like a mini-course on distributed systems and is extremely well written. You should also check out his video here – http://youtu.be/aJuo_bLSW6s
In this post we have discussed the following:
- Messaging system
- Commit Log
In the next post, we will get a bit deeper into understanding the above terms with respect to Apache Kafka and how they all work together. We will also see how Apache Kafka maintains a distributed commit log across a cluster making it a reliable distributed messaging system.