I/D/E · Messaging

Consumer Groups

Summary

How multiple consumers coordinate to process partitions in parallel with fault tolerance, automatic rebalancing, and exactly-once guarantees

TL;DR

Consumer groups enable multiple consumer instances to work together to process partitions from a topic in parallel. Each partition is assigned to exactly one consumer within a group, providing parallel processing while maintaining ordering guarantees. Automatic rebalancing handles failures and scaling.

Visual Overview

Consumer Group Architecture
Topic: user-events (4 partitions)

  Part 0    Part 1    Part 2    Part 3  

                                  
                                  
                                  

 Consumer  Consumer  Consumer  Consumer 
    A         B         C         D     


Group: "analytics-processors"

KEY GUARANTEES:
 Each partition assigned to exactly ONE consumer in group
 Each consumer can handle multiple partitions
 Automatic rebalancing on member changes
 Fault tolerance through coordinator failover

REBALANCING SCENARIOS:

1. Consumer joins: 4 partitions  5 consumers (rebalance)
2. Consumer crashes: 4 partitions  3 consumers (rebalance)
3. Partition added: New partition needs assignment (rebalance)

Core Explanation

What is a Consumer Group?

A consumer group is a logical collection of consumer instances that work together to consume messages from a topic. The group provides:

  • Load distribution: Partitions spread across consumers
  • Fault tolerance: Failed consumers automatically replaced
  • Scaling: Add/remove consumers dynamically
  • Coordination: Group coordinator manages partition assignments

Partition Assignment Guarantee

The Golden Rule:

Each partition is assigned to exactly one consumer within a consumer group at any given time.

Partition Assignment Guarantee
VALID ASSIGNMENT (4 partitions, 3 consumers):
Consumer A: [Partition 0, Partition 1]
Consumer B: [Partition 2]
Consumer C: [Partition 3]
 Each partition assigned exactly once

INVALID ASSIGNMENT:
Consumer A: [Partition 0]
Consumer B: [Partition 0] ✕ Partition 0 assigned twice!

This guarantee ensures:

  • No duplicate processing within a group
  • Ordering maintained per partition
  • Clear ownership of each partition

How Partition Assignment Works

Assignment Strategies:

// 1. RANGE STRATEGY (default)
// Assigns consecutive partitions to consumers
Topic: user-events (6 partitions)
Consumer A: [0, 1]
Consumer B: [2, 3]
Consumer C: [4, 5]
// Pro: Simple, predictable
// Con: Uneven if partition count doesn't divide evenly

// 2. ROUND-ROBIN STRATEGY
// Distributes partitions one-by-one in round-robin
Topic: user-events (6 partitions)
Consumer A: [0, 3]
Consumer B: [1, 4]
Consumer C: [2, 5]
// Pro: Even distribution
// Con: Less predictable, more partition movement on rebalance

// 3. STICKY STRATEGY
// Minimizes partition movement during rebalance
// Keeps existing assignments when possible
// Pro: Reduces rebalancing overhead
// Con: Slightly more complex

Configuration:

Properties props = new Properties();
props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-processors");
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    "org.apache.kafka.clients.consumer.RangeAssignor");

Group Coordinator and Rebalancing

Coordinator Selection:

Coordinator Selection
Group ID: "analytics-processors"
              
  hash(groupId) % num_partitions(__consumer_offsets)
              
  Partition 23 in __consumer_offsets
              
  Broker 2 (leader of partition 23)
              
  Broker 2 becomes Group Coordinator

Rebalancing Protocol (Simplified):

Rebalancing Protocol
REBALANCING FLOW:

1. TRIGGER EVENT
  Consumer joins group
  Consumer leaves/crashes
  Consumer heartbeat timeout
  Partition count changes

2. COORDINATOR INITIATES REBALANCE
  Sends REBALANCE_IN_PROGRESS to all consumers
  Consumers stop processing, commit offsets

3. JOIN GROUP PHASE
  All consumers re-join group
  Send their supported partition assignment strategies
  Coordinator collects member info

4. ASSIGNMENT PHASE
  Coordinator runs assignment strategy
  Calculates new partition assignments
  Sends assignments to consumers

5. RESUME PROCESSING
  Consumers start consuming from new assignments

TOTAL REBALANCE TIME: ~500ms to several seconds

Scaling Patterns

Under-Subscribed (Fewer Consumers than Partitions):

Under-Subscribed Pattern
4 Partitions, 2 Consumers:
Consumer A: [P0, P1]
Consumer B: [P2, P3]

Throughput: 2x (2 parallel consumers)
Utilization: 100% (all consumers busy)

Fully-Subscribed (Equal Consumers and Partitions):

Fully-Subscribed Pattern
4 Partitions, 4 Consumers:
Consumer A: [P0]
Consumer B: [P1]
Consumer C: [P2]
Consumer D: [P3]

Throughput: 4x (4 parallel consumers)
Utilization: 100% (optimal)

Over-Subscribed (More Consumers than Partitions):

Over-Subscribed Pattern
4 Partitions, 6 Consumers:
Consumer A: [P0]
Consumer B: [P1]
Consumer C: [P2]
Consumer D: [P3]
Consumer E: []  ⚠️ IDLE
Consumer F: []  ⚠️ IDLE

Throughput: 4x (limited by partitions)
Utilization: 67% (2 consumers wasted)

✕ Cannot scale beyond partition count!

Multiple Consumer Groups

Independent Processing:

Multiple Consumer Groups
Topic: user-events (4 partitions)
      
       Group: "analytics" (processes all events)
           Consumer A: [P0, P1]
           Consumer B: [P2, P3]
      
       Group: "fraud-detection" (also processes all events)
            Consumer X: [P0, P1, P2, P3]

Each group independently consumes ALL messages.
Groups do NOT affect each other.

Use Case - Multiple Processing Pipelines:

Multiple Processing Pipelines
Topic: "user-actions"

Group 1: "real-time-analytics"
 Processes events for live dashboards

Group 2: "ml-feature-pipeline"
 Extracts features for ML models

Group 3: "audit-logger"
 Archives events for compliance

All three groups consume the SAME messages independently.

Tradeoffs

Advantages:

  • ✓ Horizontal scalability (add more consumers)
  • ✓ Automatic fault tolerance (consumer failures handled)
  • ✓ Load balancing across consumers
  • ✓ Multiple independent processing pipelines (multiple groups)

Disadvantages:

  • ✕ Rebalancing causes processing pause (stop-the-world)
  • ✕ Cannot scale beyond partition count
  • ✕ Partition assignment may be uneven
  • ✕ Rebalancing overhead on frequent consumer changes

Real Systems Using This

Kafka (Apache)

  • Implementation: Group coordinator per partition in __consumer_offsets
  • Scale: Thousands of consumer groups processing trillions of messages
  • Typical Setup: 10-50 consumers per group for high-throughput topics

Amazon Kinesis

  • Implementation: Kinesis Client Library (KCL) provides similar consumer group semantics
  • Scale: Auto-scaling consumer groups based on shard count
  • Typical Setup: 1 worker per shard, auto-scaling with shard splits/merges

Apache Pulsar

  • Implementation: Shared subscription model (similar to consumer groups)
  • Scale: Automatic load rebalancing without stop-the-world pauses
  • Typical Setup: Dynamic consumer scaling with minimal disruption

When to Use Consumer Groups

✓ Perfect Use Cases

High-Throughput Event Processing

High-Throughput Event Processing
Scenario: Processing 1M events/sec from user activity stream
Solution: Consumer group with 100 consumers (10K events/sec each)
Result: Linear scaling, automatic fault tolerance

Parallel Data Pipeline

Parallel Data Pipeline
Scenario: Real-time ETL from Kafka to data warehouse
Solution: Consumer group with partitions = number of available cores
Result: Maximize parallelism while maintaining ordering per partition

Multiple Processing Pipelines

Multiple Processing Pipelines Use Case
Scenario: Same events need processing by analytics, ML, and audit systems
Solution: Three separate consumer groups on same topic
Result: Independent processing without interfering with each other

✕ When NOT to Use

Need Broadcast to All Consumers

Broadcast Requirement
Problem: Every consumer must receive ALL messages
Issue: Consumer groups distribute messages (each gets subset)
Alternative: Use separate consumer groups or pub-sub pattern

Very Low Latency Requirements

Very Low Latency Requirements
Problem: Sub-millisecond latency critical
Issue: Rebalancing causes temporary processing pause
Alternative: Single consumer or fixed partition assignment

More Consumers than Partitions Long-Term

More Consumers than Partitions
Problem: Want to run 100 consumers with only 10 partitions
Issue: 90 consumers will be idle, wasting resources
Alternative: Increase partition count or reduce consumers

Interview Application

Common Interview Question 1

Q: “You have a topic with 10 partitions. If you deploy 15 consumers in the same consumer group, what happens?”

Strong Answer:

“Only 10 consumers will be active - one per partition. The remaining 5 consumers will be idle since each partition can only be assigned to one consumer in a group. This is inefficient. To utilize all 15 consumers, I’d either increase the partition count to 15+, or split the workload across multiple topics. If scaling further is anticipated, I’d over-provision partitions upfront since changing partition count requires topic recreation.”

Why this is good:

  • Shows understanding of partition assignment constraint
  • Identifies the inefficiency
  • Provides multiple solutions
  • Considers future scaling

Common Interview Question 2

Q: “What happens during a consumer group rebalance? How does it affect processing?”

Strong Answer:

“Rebalancing occurs when consumers join, leave, or crash. The process:

  1. Coordinator detects the change (heartbeat timeout or explicit notification)
  2. Sends REBALANCE_IN_PROGRESS to all group members
  3. Consumers stop processing and commit their offsets
  4. All consumers re-join the group
  5. Coordinator calculates new partition assignments using the configured strategy
  6. Consumers receive new assignments and resume processing

Impact: Processing pauses for ~500ms to several seconds. In production, we minimize rebalances by:

  • Using static membership (Kafka 2.3+) to avoid rebalances on restarts
  • Tuning session.timeout.ms and heartbeat.interval.ms
  • Using sticky assignor to minimize partition movement
  • Graceful shutdowns with proper leave group notifications”

Why this is good:

  • Detailed step-by-step understanding
  • Quantifies the impact
  • Shows production awareness
  • Provides optimization strategies

Red Flags to Avoid

  • ✕ Confusing consumer groups with partition replicas
  • ✕ Claiming you can assign same partition to multiple consumers in one group
  • ✕ Not knowing about rebalancing and its impact
  • ✕ Forgetting that consumer count cannot exceed partition count for effectiveness

Quick Self-Check

Before moving on, can you:

  • Explain consumer groups in 60 seconds?
  • Draw a diagram showing partition-to-consumer assignment?
  • Explain what triggers a rebalance?
  • Calculate optimal consumer count given partition count?
  • Identify when to use multiple consumer groups?
  • Explain the partition assignment guarantee?

See It In Action

Prerequisites

Used In Systems

  • Real-Time Analytics Pipelines - Consumer groups for parallel processing
  • Event-Driven Microservices - Multiple consumer groups per service

Explained In Detail

  • Kafka Architecture - Consumer Groups & Rebalancing section (30 minutes)
  • Deep dive into rebalancing protocols, partition assignment strategies, and coordinator mechanics

Next Recommended: Offset Management - Learn how consumers track their position in partitions

Production signal

Why this concept matters

Interview 90% of messaging interviews
Production LinkedIn, Uber, Netflix
Performance Billions of messages
Scale Hundreds of parallel workers