Apache Kafka

What is Apache Kafka and how does it work?

In this lesson, we will see what Apache Kafka is and how it works, along with some of its most common use cases. Apache Kafka was originally developed at LinkedIn in 2010 and became a top-level Apache project in 2012. It has three main components: producer applications, which publish messages; broker nodes, which store them; and consumer applications, which consume them.

We will study many more Kafka concepts in the coming sections. Let's move ahead.

Apache Kafka Concepts

Before we dig deeper, we need to be comfortable with some core concepts in Apache Kafka. Here are the terms we should know, very briefly:

  * Topic: a named stream of messages to which producers write and from which consumers read.
  * Partition: a slice of a topic; each partition is an ordered, append-only commit log.
  * Offset: the position of a message within a partition, which a consumer uses to track its progress.
  * Producer: an application that publishes messages to a topic.
  * Consumer: an application that reads messages from a topic.
  * Broker: a Kafka server (node) that stores partitions and serves producers and consumers.

Topic Partitions

The concepts of topics, topic partitions, and offsets can also be made clear with an illustrative figure:

Topic partition and consumer offset in Apache Kafka

Apache Kafka as a Publish-Subscribe Messaging System

With Kafka, producer applications publish messages to a Kafka node (broker), not directly to a consumer. Consumer applications then consume those messages from the broker.

Kafka Producer and Consumer
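As a rough sketch of this publish path, here is what a minimal Java producer might look like. The broker address (localhost:9092) and the topic name (pageviews) are assumptions for illustration, not part of any real setup described here.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The message goes to a broker; no consumer is addressed directly.
            producer.send(new ProducerRecord<>("pageviews", "user-42", "viewed /home"));
        } // close() flushes any messages still buffered in the client
    }
}
```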

Since a single topic can receive a large volume of data at once, each topic is divided into partitions to keep Kafka horizontally scalable, and each partition can live on any node (broker) of the cluster.
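To make partition assignment concrete, here is a small sketch using the Java client (the topic name and key are invented). With the default partitioner, a record that carries a key is assigned a partition by hashing the key (roughly murmur2(key) % numPartitions), so all records with the same key land in the same partition:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningSketch {
    public static void main(String[] args) {
        // Keyed record: the default partitioner hashes the key, so every
        // record keyed "user-42" goes to the same partition, in order.
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("clicks", "user-42", "clicked signup");

        // A record can also name an explicit partition (here, partition 0).
        ProducerRecord<String, String> pinned =
                new ProducerRecord<>("clicks", 0, "user-42", "clicked signup");

        System.out.println(keyed);
        System.out.println(pinned);
    }
}
```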

Also, the Kafka broker does not keep records of which consumer has consumed how much data. It is each consumer's responsibility to keep track of what it has consumed. Because Kafka does not track acknowledgements and message positions for each consumer application, it can serve many more consumers with negligible impact on throughput. In production, many applications even follow a batch-consumer pattern, in which a consumer consumes all the messages in a queue at regular intervals.
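Here is a sketch of what this looks like on the consumer side, assuming a made-up topic (pageviews) and group id (usage-analytics). With enable.auto.commit turned off, the consumer itself decides when its position is recorded:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "usage-analytics");         // made-up group id
        props.put("enable.auto.commit", "false");         // we track progress ourselves
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("pageviews"));
            while (true) {
                // Pull a batch of records; the broker does not push them
                // and does not track per-consumer acknowledgements.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // The consumer records its own progress by committing offsets.
                consumer.commitSync();
            }
        }
    }
}
```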

Installation

To start using Apache Kafka, you must first install it on your machine. To do this, read Install Apache Kafka on Ubuntu.

Use Case: Website Usage Tracking

Kafka is an excellent tool for tracking activity on a website. The tracking data includes, but is not limited to, page views, searches, uploads, and any other actions users may take while browsing the site.

For example, when a new user registers on a website, we might track the order in which the new user explores the site's features, and whether the user sets up a profile first or leaps straight into the features. Whenever the user clicks a button, the metadata for that button is collected in a data packet and sent to the Kafka cluster, from where the application's analytics service can collect the data and produce useful insights. Divided into steps, the process looks like this:

  1. A user registers on the website and lands on the dashboard. The user tries to access a feature straight away by interacting with a button.
  2. The web application publishes a message with this metadata to a partition of the topic "click".
  3. The message is appended to the partition's commit log and the offset is incremented.
  4. The consumer can now pull the message from the Kafka broker and show website usage in real time, or show past data by resetting its offset to an earlier value (see the sketch after this list).
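Step 4 might look like the following sketch, which rewinds to the beginning of one partition of the "click" topic to replay past events. The broker address and the choice of partition 0 are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ClickReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign one partition directly and reset the offset to replay
            // past click events from the commit log.
            TopicPartition tp = new TopicPartition("click", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d click=%s%n", record.offset(), record.value());
            }
        }
    }
}
```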

Use Case: Message Queue

Apache Kafka is an excellent tool that can act as a replacement for message brokers like RabbitMQ. Asynchronous messaging helps decouple applications and creates a highly scalable system.

Just as with microservices, instead of building one large application, we can divide the application into multiple parts, each with a very specific responsibility. The different parts can then even be written in completely independent programming languages! Kafka has built-in partitioning, replication, and fault tolerance, which makes it a good large-scale message broker.
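One way this decoupling shows up in the client API (a sketch; the topic and service names are invented): each downstream service subscribes with its own group.id, so every group receives the full stream, while instances within one group split the partitions among themselves.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledConsumers {
    // Each service gets its own group.id, so each receives every message
    // on the topic independently of the others.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders")); // made-up topic
        return consumer;
    }

    public static void main(String[] args) {
        // Billing and email are independent parts of the application;
        // neither knows the other exists, yet both see the same events.
        try (KafkaConsumer<String, String> billing = consumerFor("billing-service");
             KafkaConsumer<String, String> email = consumerFor("email-service")) {
            // ... each service would poll() in its own process or thread ...
        }
    }
}
```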

Kafka is also increasingly seen as a very good log aggregation solution: it can collect log files from many servers and make them available to a central system. More generally, with Kafka you can publish any event that you want other parts of your application to know about.

Using Kafka at LinkedIn

It is interesting to note that Apache Kafka was earlier seen and used as a way to make data pipelines consistent and to ingest data into Hadoop. Kafka worked excellently when multiple data sources and destinations were present and providing a separate pipeline for each combination of source and destination was not feasible. LinkedIn's Kafka architect, Jay Kreps, describes this familiar problem well in a blog post:

My own involvement in this started around 2008 after we had shipped our key-value store. My next project was to try to get a working Hadoop setup going, and move some of our recommendation processes there. Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy prediction algorithms. So began a long slog.

Apache Kafka and Flume

If you compare the two on the basis of their functions, you'll find a lot of common features. Here are some of them:

  * Both are distributed systems designed to move large volumes of streaming and log data reliably.
  * Both are fault-tolerant and horizontally scalable.
  * Both are commonly used to ingest data into Hadoop.

Conclusion

In this lesson, we looked at many concepts related to Apache Kafka. Read more Kafka-based posts here.
