As a full-stack developer and distributed systems engineer, I help companies build large-scale streaming data architectures using technologies like Apache Kafka.
In this comprehensive 3200+ word guide, you'll learn how to build a high-performance Kafka consumer in Go that can process streams of data at scale.
Overview of Apache Kafka
Apache Kafka is a distributed, partitioned, replicated commit log service. It was open-sourced by LinkedIn in 2011 and became a top-level Apache project in 2012.
What does this actually mean, though? At a high level, Kafka provides a durable, fault-tolerant, append-only messaging system. Here's a quick rundown of Kafka terminology and concepts:
Topics and Logs
Kafka organizes data streams into topics. Under the covers, each topic is broken into one or more partitions, which reside on disk as logs:
Topic 1
|
|-- Partition 0 (Log)
|-- Partition 1 (Log)
|-- Partition 2 (Log)
New records get appended to partitions in an immutable fashion. Kafka retains data for a configurable period, defaulting to 7 days.
Producers and Consumers
Clients communicate with Kafka through TCP connections to the cluster. Producers publish messages to topics while consumers subscribe to topic data feeds.
Messages
Messages consist of a key, value and metadata like timestamps. Kafka treats both the key and value as binary blobs – they have no schema enforced on the broker. Hence producers and consumers must agree on serialization formats.
Consumer Groups
When multiple consumers connect to a topic, they can form consumer groups. Group members divide topic partitions between themselves so each partition is consumed by exactly one member. This allows scaling out consumers horizontally.
Kafka Architecture Overview
Now that we understand the basic terminology, let's look at Kafka's architecture:

Brokers
At the heart of a Kafka deployment are brokers – Kafka cluster nodes that manage data persistence and replication. Brokers expose a TCP interface for producers/consumers as well as inter-broker communication.
Typical broker configurations use direct-attached storage like SSDs for performance, with SAS disks as secondary storage tiers. Brokers persist topic data locally but rely on ZooKeeper for cluster membership and coordination.
ZooKeeper
Apache ZooKeeper coordinates the Kafka cluster, storing configuration, broker metadata, consumer offsets (in older releases) and more. Through ZooKeeper, one broker is elected controller; it handles cluster leadership duties like partition leader election and broker failover.
Producers and Consumers
Clients communicate with brokers directly to publish and subscribe to topic data feeds. The brokers themselves are agnostic to the message payload format – it's up to clients to agree on serialization formats like JSON, Protobuf or Avro.
Now that we've covered Kafka basics, let's focus on the consumer clients.
Kafka Consumer Client Capabilities
Some key capabilities Kafka consumer clients provide:
- Subscribe to topics and partition feeds
- Track offset position in partitions
- Control polling duration with timeouts
- Batch fetch records for efficiency
- Commit offsets manually or automatically
- Rebalance partitions on consumer group changes
- Handle broker and network failures
- Coordinate with other members of a consumer group
Now let's go through how to leverage these capabilities within a Go application.
Golang Setup
For starters, make sure Go 1.11+ is installed (modules arrived in Go 1.11). Next, let's initialize a module:
mkdir consumer-app-go && cd consumer-app-go
go mod init github.com/myorg/consumer-app-go
This enables Go module support, so dependencies are tracked in go.mod and fetched into the local module cache.
For interacting with Kafka, we'll use the Confluent Kafka Golang client:
go get github.com/confluentinc/confluent-kafka-go/kafka
Let's now create a file consumer.go to hold our application code.
Kafka Consumer Configuration
Inside consumer.go, the first step is to configure our consumer. This sets up the initial bootstrap brokers as well as the consumer group this process joins:
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	// Consumer configuration
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "my-consumer",
	})
	if err != nil {
		fmt.Printf("Failed to create consumer: %v\n", err)
		return
	}
	defer c.Close()
}
In the config map, we primarily need to specify:
bootstrap.servers
This provides the Kafka brokers used to bootstrap initial connections and metadata. The client then fetches full cluster information from these servers.
group.id
Uniquely identifies the consumer group this process is joining. Running multiple processes with the same group.id lets them divide the partitions between themselves, enabling scale-out.
Some other useful configs include:
- auto.offset.reset: where to start consuming if no offsets have been committed yet
- enable.auto.commit: whether to commit offsets automatically in the background
- session.timeout.ms: how long the group waits for heartbeats before evicting a member
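Putting these settings together, an expanded configuration might look like the following (a sketch; the property names are real Confluent client/librdkafka settings, but the values are illustrative, not recommendations):

```go
c, err := kafka.NewConsumer(&kafka.ConfigMap{
	"bootstrap.servers":  "localhost:9092",
	"group.id":           "my-consumer",
	"auto.offset.reset":  "earliest", // start from the oldest record if no offset is committed
	"enable.auto.commit": false,      // we will commit offsets explicitly after processing
	"session.timeout.ms": 10000,      // evict a member after 10s without heartbeats
})
if err != nil {
	fmt.Printf("Failed to create consumer: %v\n", err)
	return
}
```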
Now we're ready to start consuming.
Kafka Consumer Topic Subscription
The consumer process must subscribe to one or more topics it will consume. This handles coordinating which topic partitions are assigned to the consumer process:
func main() {
	// ... consumer initialized as above ...

	topics := []string{"trades", "users"}
	if err := c.SubscribeTopics(topics, nil); err != nil {
		fmt.Printf("Failed to subscribe: %v\n", err)
		return
	}
}
By subscribing to topics, the consumer joins the group coordinating those topic feeds. Topic partitions are then dynamically assigned across group members to balance load.
Now that we've subscribed, messages will start flowing based on partition assignments.
Consuming Messages & Events
With subscriptions initialized, we can start the main consumer loop:
for {
	msg, err := c.ReadMessage(-1)
	if err == nil {
		fmt.Printf("Message: %s\n", string(msg.Value))
	} else {
		fmt.Printf("Error: %v\n", err)
	}
}
A few pointers on the message consumption process:
- ReadMessage fetches the next message from assigned partitions
- A timeout of -1 blocks indefinitely until a new message arrives
- The msg contains the key, value, timestamp and other metadata
- We print just the value payload for simplicity
This main loop allows the app to process a continual stream of messages off Kafka topics by:
- Pulling data over TCP sockets for assigned partitions
- Deserializing message batches
- Iterating over messages and processing
Other configurations like batch sizes and polling intervals allow tuning throughput.
Kafka Consumer Flow Control
An important capability provided by Kafka is consumer flow control. Buffering and backpressure give the pipeline a way to absorb bursts without overwhelming slow consumers:

Some key parameters that control message flow:
Fetch Batch Sizes
Consumers pull messages in batches for efficiency – a larger batch means less overhead per message. However, setting this too high can overwhelm consumers. 50-500 messages per request is common.
Prefetch Buffer
The consumer library buffers a configurable number of batches ahead of the application. This lets the consumer keep fetching while the application is momentarily slow, preventing backpressure from propagating upstream through the pipeline.
Poll Duration
When no new messages arrive, this timeout controls how long ReadMessage blocks, balancing latency against polling overhead.
Tuning for sustained high throughput requires balancing these configurations.
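In the Confluent Go client, these knobs surface as librdkafka configuration properties. A throughput-oriented sketch (the property names are genuine librdkafka settings; the values are illustrative starting points, not recommendations):

```go
c, err := kafka.NewConsumer(&kafka.ConfigMap{
	"bootstrap.servers":         "localhost:9092",
	"group.id":                  "my-consumer",
	"fetch.min.bytes":           1024,    // let the broker accumulate at least 1 KB per fetch
	"fetch.wait.max.ms":         500,     // ...but respond within 500 ms regardless
	"max.partition.fetch.bytes": 1048576, // cap each partition's batch at 1 MB
	"queued.min.messages":       10000,   // prefetch buffer the client keeps ahead of the app
})
```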
Now let's take a deeper look at delivery semantics.
Kafka Consumer Delivery Semantics
Kafka consumers commonly implement one of two delivery semantics:
At most once
Easy to implement – commit the offset as soon as a message is fetched, then process it. No duplication, but messages can be lost if the process crashes mid-way.
At least once
Ensures every message gets processed, but duplicates may occur. Requires committing only after successful processing, plus idempotent handling downstream.
The difference boils down to whether offset commits happen before or after the processing step.
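To make that ordering concrete, here is a minimal at-most-once loop: the offset is committed before processing, so a crash mid-processing drops the in-flight message rather than redelivering it (a sketch; process is a hypothetical handler and error handling is deliberately thin):

```go
for {
	msg, err := c.ReadMessage(-1)
	if err != nil {
		continue
	}

	// Commit first: once this succeeds the message will never be
	// redelivered, even if process() fails or the app crashes.
	if _, err := c.CommitMessage(msg); err != nil {
		continue
	}

	process(msg) // a failure here means the message is lost
}
```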
For stronger delivery guarantees, we'll look at implementing an at-least-once approach.
Implementing At Least Once Delivery
A consumer tracks two offsets per partition – its current read position and the last committed offset. The committed offset trails the position, marking the point up to which processing is guaranteed to have completed, even across crashes.

To implement at least once semantics:
- Fetch message
- Process message
- Commit message offset
With this ordering, messages since the last commit are redelivered after a crash, so duplicates may occur, but everything gets processed eventually. Combining this with idempotent processing provides end-to-end guarantees.
Let's look at an example in Go:
for {
	msg, err := consumer.ReadMessage(-1)
	if err != nil {
		continue
	}

	// Process msg
	process(msg)

	// Mark offset as successfully processed
	if _, err := consumer.CommitMessage(msg); err != nil {
		fmt.Printf("Commit failed: %v\n", err)
	}
}
This style commits only after successful processing, ensuring at-least-once delivery: the committed offset always trails the read position, so nothing is marked done until it has actually been handled.
Now that we understand the fundamentals of building a consumer, let's explore some more advanced capabilities.
Managing Consumer Group Rebalancing
As consumers join or leave a group, Kafka handles ownership transfers of partitions through rebalancing. Assignments get shifted between members leading to a graceful handoff:

Rebalancing enables:
- Adding consumers for scale
- Removing unhealthy consumers
- Redistributing partitions evenly across members
Graceful shutdowns can notify the group ahead of departure, and crashes are detected via missed heartbeats, so data keeps flowing either way.
The consumer app can supply a RebalanceCb callback to commit offsets and reposition its data streams as partitions are revoked and assigned. This makes the coordination seamless behind the scenes.
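With the Confluent Go client, the rebalance callback is passed as the second argument to SubscribeTopics. A minimal sketch, assuming a consumer c created as shown earlier, that logs transitions and commits outstanding offsets before giving up ownership:

```go
rebalanceCb := func(c *kafka.Consumer, ev kafka.Event) error {
	switch e := ev.(type) {
	case kafka.AssignedPartitions:
		log.Printf("partitions assigned: %v", e.Partitions)
		return c.Assign(e.Partitions)
	case kafka.RevokedPartitions:
		// Flush offsets for completed work before losing the partitions.
		if _, err := c.Commit(); err != nil {
			log.Printf("commit during rebalance: %v", err)
		}
		return c.Unassign()
	}
	return nil
}

err = c.SubscribeTopics([]string{"trades"}, rebalanceCb)
```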
Consumer Monitoring & Metrics
Observability into consumer group metrics provides invaluable insight:
Throughput
Number of messages passing through lets you size systems and identify bottlenecks.
End to End Lag
The delay between producer send time and processing time indicates how "caught up" a partition is. Rising lag indicates problems.
Consumer Heartbeats
Metrics on heartbeat send/fail rates show consumer health.
Here's an example dashboard:

Prometheus is a popular choice, storing numeric time series metrics generated from Kafka client instrumentation and logs. Grafana visualizes these metrics through dashboards.
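Lag can also be measured from inside the application by comparing the group's last committed offset against the broker's high watermark. A sketch using the Confluent Go client (Committed and QueryWatermarkOffsets are real client calls; the 5-second timeouts are arbitrary):

```go
func partitionLag(c *kafka.Consumer, topic string, partition int32) (int64, error) {
	tp := kafka.TopicPartition{Topic: &topic, Partition: partition}

	// Last offset this group committed for the partition.
	committed, err := c.Committed([]kafka.TopicPartition{tp}, 5000)
	if err != nil {
		return 0, err
	}

	// Highest offset currently available on the broker.
	_, high, err := c.QueryWatermarkOffsets(topic, partition, 5000)
	if err != nil {
		return 0, err
	}

	return high - int64(committed[0].Offset), nil
}
```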
Comparing Kafka Go Client Libraries
There are two popular Kafka client libraries for Go – Sarama and Confluent. Let's compare some key differences:
| Feature | Sarama | Confluent |
|---|---|---|
| Community Support | MIT licensed, community driven | Open source, commercially supported |
| Client Capabilities | Consumer groups + producers | Adds admin client, schemas, serializers |
| Error Handling | Returns errors | Handles retries automatically |
| Compatibility | Apache Kafka 0.8+ | Confluent Platform 5.0+ |
| Async Support | Async producers + consumers | Sync clients, async processing |
| Maturity | Since 2013, stable API | Since 2016, evolving API |
Generally, Confluent's Go client wraps the battle-tested C librdkafka library, focusing on usability and adding value like schema registry integration and admin capabilities.
Sarama provides a lower-level, pure-Go interface that talks to Kafka brokers directly. This makes it ideal when you want full control or need to avoid cgo.
Implementing Kafka Streams in Go
Let's explore a more advanced example that builds a stream processing application. We'll leverage a worker pool to distribute processing across goroutines:

package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
	"golang.org/x/sync/errgroup"
)

type Message struct {
	Key   string
	Value string
}

func main() {
	topic := "trades"

	consumer := CreateConsumer(topic)
	defer consumer.Close()

	workers := CreateWorkerPool(10)

	g := new(errgroup.Group)

	// Start consumer loop
	g.Go(func() error {
		for {
			msg, err := consumer.ReadMessage(-1)
			if err != nil {
				return err
			}
			workers.Process(&Message{
				Key:   string(msg.Key),
				Value: string(msg.Value),
			})
		}
	})

	// Block until all goroutines are done
	if err := g.Wait(); err != nil {
		log.Fatal(err)
	}
}

func CreateConsumer(topic string) *kafka.Consumer {
	// Create a consumer and subscribe to the topic (elided)
	return nil
}

func CreateWorkerPool(size int) *WorkerPool {
	// Start the worker goroutines (elided)
	return nil
}
By leveraging goroutines for concurrent processing behind the scenes, we build a scalable stream processing application. Additional workers allow handling higher throughput. Batching also improves efficiency greatly.
Kafka enables several streaming paradigms:
KSQL
Streaming SQL for data pipelines and materialized views.
Kafka Streams
Distributed, scalable library for building streaming apps.
Flink
Unified stream & batch processing at scale.
There are several ways to build on the foundations Kafka provides.
Kafka Consumer Performance & Sizing
When planning out hardware capacity, data ingest rates dictate storage and consumer sizing requirements.
Here are some ballpark Kafka benchmarks on modern cloud hardware profiles:
| Workload | Throughput | Drives | Instance Type |
|---|---|---|---|
| Non-Replicated Topics | 1.5 Gbps | 1 EBS gp3 | m5.8xlarge |
| 3x Replication Factor | 500 Mbps | 1 EBS gp3 | m5.8xlarge |
Some useful metrics for forecasts:
- 1 MB/s → ~86 GB/day
- 3 MB/s → ~259 GB/day
Analyzing maximum ingest rates and retention policies allows projections of storage capacity over time.
Generally, CPU is the initial bottleneck; only once additional processing like compression is added does IO bandwidth come under strain. Also consider downstream consumers when sizing – they are often the slowest link in the pipeline.
Overprovisioning brokers allows headroom for spikes or maintenance. Scale-out architectures easily add capacity as needed.
Conclusion
Kafka provides a battle-tested foundation for large-scale streaming data architectures, and Go is well suited to building high-performance applications like the consumer systems shown here.
In this guide we covered Kafka architecture basics, implementing a consumer group, delivery semantics, monitoring, stream processing examples and sizing considerations.
Getting the foundations of ingest pipelines right prevents major issues down the road once the business relies on streams of data. Investment in scalable, observable consumer apps pays dividends over time as complexity grows.
To learn more, check out the Kafka site and Confluent docs. The complete Go consumer app code is available on GitHub.


