What is Apache Kafka?
Apache Kafka is a distributed streaming platform. This means that Kafka has the following capabilities:
Publishing and subscribing to streams of records, similar to how a message queue (e.g. RabbitMQ) works
Storing streams of records for a specified time. Kafka is not a replacement for a database or a logging platform; however, it can be a component of a logging platform
Processing streams of records. We can join two sources of related data to produce an output stream of the desired records
When do we use Kafka?
When we want to build a real-time data pipeline to share records reliably between applications.
For example, an e-commerce company streams order data to be used by different teams such as accounting and data science to extract meaningful insights or maintain records
When we want to build a real-time data streaming application that transforms or triggers an action.
For example, an e-commerce company that wants to join consumer search behaviour with buying decisions would join browsing records with selling records
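The join described above can be sketched in plain Python. This is only a toy illustration of the idea; a real pipeline would use Kafka Streams or a similar stream-processing library, and the record fields (`user_id`, `viewed`, `bought`) are invented for this example.

```python
# Toy sketch: join browsing records with selling records on a shared key.
# Plain Python for illustration only; not the Kafka Streams API.
browsing = [
    {"user_id": 1, "viewed": "laptop"},
    {"user_id": 2, "viewed": "phone"},
]
selling = [
    {"user_id": 1, "bought": "laptop"},
]

def join_streams(browsing, selling):
    """Join each browsing record with the matching sale, keyed by user_id."""
    sales_by_user = {r["user_id"]: r for r in selling}
    joined = []
    for view in browsing:
        sale = sales_by_user.get(view["user_id"])
        if sale is not None:
            joined.append({**view, **sale})
    return joined

print(join_streams(browsing, selling))
# [{'user_id': 1, 'viewed': 'laptop', 'bought': 'laptop'}]
```

User 2 browsed but never bought, so only user 1 appears in the joined output.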
Terminologies
Producer
As the name suggests, a producer is the application that emits the records.
Consumer
A consumer is the application that receives the records.
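The producer/consumer relationship can be modelled with a minimal in-memory sketch: a topic behaves like an append-only log, and a consumer reads from a chosen offset. The class and method names below are invented for illustration; they are not part of any Kafka client API.

```python
# Toy in-memory model of a topic: an append-only log where offsets
# are simply list indices. Illustrative only, not a Kafka API.
class InMemoryTopic:
    def __init__(self):
        self.records = []  # the append-only log

    def produce(self, record):
        """Producer side: append a record and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def consume(self, offset):
        """Consumer side: read all records from a given offset onward."""
        return self.records[offset:]

topic = InMemoryTopic()
topic.produce("order-1")
topic.produce("order-2")
print(topic.consume(0))  # ['order-1', 'order-2']
print(topic.consume(1))  # ['order-2']
```

Note that consuming does not remove records from the log: another consumer can read the same records again from any offset, which is a key difference from a traditional message queue.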
Broker
Brokers are systems (or in simpler terms, servers) responsible for storing the published data. A broker may host zero or more topics.
Kafka Cluster
A Kafka system having more than one broker is known as a cluster. The number of brokers in a cluster can be increased without any downtime. A cluster is used to persist and replicate data.
Topic
A topic is like a category where a producer publishes and stores records. A consumer subscribes to a particular topic to read messages.
Messages published to a topic are retained there for a configurable duration. Any topic can be subscribed to by any number of consumers.
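Time-based retention can be sketched as follows. Real Kafka applies retention per log segment via broker or topic settings such as `retention.ms`; the function below is just the idea in plain Python, with invented field names.

```python
# Toy sketch of time-based retention: records older than the configured
# retention window are purged. Real Kafka does this per log segment
# (see the retention.ms topic setting); this is illustrative only.
def purge_expired(records, now, retention_seconds):
    """Keep only records whose timestamp is within the retention window."""
    return [r for r in records if now - r["ts"] <= retention_seconds]

records = [
    {"ts": 100, "value": "old"},
    {"ts": 950, "value": "recent"},
]
print(purge_expired(records, now=1000, retention_seconds=100))
# [{'ts': 950, 'value': 'recent'}]
```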
Partition
Kafka topics are divided into a number of partitions. Any record written to a particular topic goes to a particular partition.
Each record within a partition is identified by a unique offset. Replication is implemented at the partition level; a redundant copy of a topic partition is called a replica.
The logic that decides the partition for a message is configurable. Partitioning helps in reading and writing data in parallel by spreading a topic's partitions over multiple brokers.
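The default-style partitioning logic hashes a record's key to a partition index, so every record with the same key lands in the same partition (preserving per-key ordering). Kafka's actual default partitioner uses the murmur2 hash; the sketch below uses Python's `zlib.crc32` just to show the idea deterministically.

```python
# Sketch of key-based partitioning: hash(key) % num_partitions.
# Kafka's real default partitioner uses murmur2; crc32 is used here
# only as a stand-in deterministic hash.
import zlib

def partition_for(key, num_partitions):
    """Map a record key to a partition index."""
    return zlib.crc32(key.encode()) % num_partitions

# The same key always routes to the same partition.
assert partition_for("user-42", 3) == partition_for("user-42", 3)
print(partition_for("user-42", 3))
```

Because the mapping depends on the partition count, adding partitions to an existing topic changes where keys land, which is one reason partition counts are usually chosen up front.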
Each partition has one replica acting as the leader and the others as followers. The leader handles reads and writes while the followers replicate the data. If the leader fails, one of the followers is elected as the new leader.
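Failover can be sketched as picking a surviving replica to promote. This is a deliberately naive model; real Kafka elects the new leader from the in-sync replica (ISR) set via the cluster controller, and the broker names below are invented.

```python
# Toy sketch of leader failover for one partition: if the current
# leader is among the failed brokers, promote the first live replica.
# Real Kafka restricts the choice to the in-sync replica (ISR) set.
def elect_leader(replicas, failed):
    """Return the first live replica as the new leader."""
    live = [r for r in replicas if r not in failed]
    if not live:
        raise RuntimeError("no live replicas: partition is offline")
    return live[0]

replicas = ["broker-1", "broker-2", "broker-3"]  # broker-1 is the leader
print(elect_leader(replicas, failed={"broker-1"}))  # broker-2
```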
Consumer Groups
Any consumer can read from any topic at any offset, whether the latest record or the very beginning. Consumers can join a group called a consumer group: a set of consumers subscribed to a particular topic. Kafka ensures that each record in a topic is read by only one consumer within a given consumer group.
Consumers pull records from the topic's partitions, and each consumer in a group is assigned a set of partitions to consume. Kafka can support many consumers with little overhead; by using consumers in a group, it parallelises reading and thus supports very high read throughput.
The number of consumers that can read records in parallel is limited by the number of partitions in a topic. Kafka works on a pull model: it sends data to a consumer only when the consumer requests it.
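Partition assignment within a group can be sketched with a simple round-robin split. Kafka ships several real assignment strategies (range, round-robin, sticky); this toy version just shows why the partition count caps parallelism, and the consumer names are invented.

```python
# Toy sketch of round-robin partition assignment in a consumer group.
# Consumers beyond the partition count receive nothing, which is why
# partitions limit read parallelism.
def assign_partitions(partitions, consumers):
    """Distribute partition ids round-robin across group members."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
print(assign_partitions([0, 1], ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}  -> c3 sits idle
```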
Follow the post for a step-by-step, hands-on tutorial on the Kafka producer and consumer.