At PayPay Card, we operate a multi-tenant Kafka cluster that powers a variety of real-time and batch workloads. Recently, we faced a puzzling performance issue: every day, during our scheduled batch loads from AWS Glue, Kafka producer latency for unrelated workloads would spike from under 10ms to over 800ms. This post details our investigation, the root cause, and the optimizations that restored cluster-wide stability—without sacrificing throughput or reliability.

The Problem: Kafka Latency Spikes During Batch Loads

Each day, several 2GB+ CSV datasets are ingested from AWS Glue into Amazon MSK (Amazon Managed Streaming for Apache Kafka). The data lands in a Kafka topic with 30 partitions. During these loads, we observed severe degradation: all producers in the cluster experienced produce latency spikes, even though the total data volume was not unusually large.

What Was Happening Under the Hood?

  • Spark’s Default Behavior:
    • The CSV files are splittable and uncompressed, so Spark divides them into chunks for parallel processing.
    • With spark.sql.files.maxPartitionBytes = 128MB, a 2GB file yields ~16 tasks, each writing ~125MB to Kafka.
  • Kafka Producer Settings (before tuning):
    • No compression.
    • batch.size = 128KB, linger.ms = 0.
    • High concurrency from Spark led to many small, inefficient batches (often just 40–80KB).

This meant each Spark task produced thousands of small requests. For a single 2GB file, the cluster saw ~32,000 produce requests—each incurring network and thread overhead on the brokers.
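
For concreteness, here is a minimal PySpark sketch of this pre-tuning write path. It is an illustration, not our actual Glue job: the S3 path, topic name, and bootstrap servers are placeholders, and it assumes the spark-sql-kafka connector is available on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-kafka-before").getOrCreate()

# Splittable, uncompressed CSV: at the 128MB default for
# spark.sql.files.maxPartitionBytes, a 2GB file is read as ~16 tasks.
df = spark.read.option("header", "true").csv("s3://example-bucket/daily-load/")  # placeholder path

# Each task opens its own producer; with linger.ms = 0 and no compression,
# it flushes many small 40-80KB batches -> ~32,000 produce requests per file.
(df.select(F.to_json(F.struct(*df.columns)).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "b-1.example:9092")  # placeholder
   .option("topic", "daily-batch-topic")                   # placeholder
   .save())
```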

Impact on Kafka Brokers

The brokers’ request handler threads became saturated. Idle percentage dropped, CPU usage spiked, and request queues grew. The result: unrelated producers suffered latency spikes above 800ms, even though the cluster wasn’t handling an unusually high data volume.
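
These broker-side signals can be pulled from CloudWatch. The boto3 sketch below is a hedged example of that kind of query: the cluster name and broker ID are placeholders, and it assumes the cluster's monitoring level is high enough that MSK publishes RequestHandlerAvgIdlePercent alongside the CPU metrics.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

# Request handler idle % and user CPU for one broker over the last 3 hours.
for metric in ("RequestHandlerAvgIdlePercent", "CpuUser"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",
        MetricName=metric,
        Dimensions=[
            {"Name": "Cluster Name", "Value": "example-msk-cluster"},  # placeholder
            {"Name": "Broker ID", "Value": "1"},                       # placeholder
        ],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], round(point["Average"], 2))
```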


The Solution: Smarter Parallelism and Producer Tuning

We made several targeted changes to both Spark and Kafka producer configurations. Our guiding principle: reduce the number of requests, increase the efficiency of each request. A consolidated sketch of the tuned job follows the list below.

  • Controlled Parallelism in Spark
    • We set df.repartition(60), creating 60 tasks (2× the number of Kafka partitions).
    • Each task handled ~33MB, translating to ~267 batches at 128KB each.
    • This reduced the total number of requests from ~32,000 to ~16,000, with more consistent batch sizes.
  • Enabling Producer Compression
    • Enabled kafka.compression.type = zstd (with lz4 as fallback).
    • Achieved ~50% reduction in wire size for JSON payloads—2GB raw data became ~1GB over the network.
    • Broker CPU per request dropped, as less data needed to be copied, checksummed, and written to disk.
  • Batch Smoothing with Linger and Buffer
    • Set linger.ms = 10 to allow records to accumulate.
    • Kept batch.size = 128KB for efficiency.
    • Increased buffer.memory = 256MB to prevent stalls if Spark outpaced the network.
    • Result: average batch size doubled, halving the request volume again.
  • Reliability Without Sacrificing Throughput
    • Enabled acks = all and enable.idempotence = true for strong delivery guarantees.
    • Throughput optimizations did not compromise data correctness.
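
Putting the changes above together, the tuned job looks roughly like the sketch below. It uses the same placeholder path, topic, and bootstrap servers as the earlier example; the producer settings are passed through Spark's kafka.-prefixed options.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-kafka-after").getOrCreate()
df = spark.read.option("header", "true").csv("s3://example-bucket/daily-load/")  # placeholder path

(df.select(F.to_json(F.struct(*df.columns)).alias("value"))
   .repartition(60)  # 2x the topic's 30 partitions -> ~33MB per task
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "b-1.example:9092")   # placeholder
   .option("topic", "daily-batch-topic")                    # placeholder
   .option("kafka.compression.type", "zstd")                # ~50% smaller on the wire
   .option("kafka.linger.ms", "10")                         # let batches fill
   .option("kafka.batch.size", str(128 * 1024))             # 128KB batches
   .option("kafka.buffer.memory", str(256 * 1024 * 1024))   # 256MB producer buffer
   .option("kafka.acks", "all")                             # strong delivery guarantees
   .option("kafka.enable.idempotence", "true")
   .save())
```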

Results: Stable Latency, Happier Tenants

After these changes, Kafka producer latency for all workloads remained stable—even during peak batch loads. Broker CPU and request handler metrics normalized, and the cluster handled both batch and real-time workloads smoothly.

Key Takeaways

  • Batch size and request volume matter: Too many small requests can overwhelm brokers, even if total data volume is modest.
  • Tune parallelism to your partition count: More tasks aren’t always better—find the sweet spot for your workload and cluster.
  • Compression and batching are your friends: They reduce network and CPU load, benefiting all tenants.
  • Monitor, measure, iterate: CloudWatch and broker metrics were essential in diagnosing and validating our fixes.
