Dr. Vosk: Kafka DynamoDB Scaling to 1M EPS, Avoiding Bottlenecks


Achieving one million events per second with Kafka and DynamoDB isn't a matter of provisioning more resources; it's a battle against write amplification and consumer-side bottlenecks. Many architectures fail because they overlook the subtle but significant costs of enforcing data integrity patterns like idempotency under extreme load.

1. Architecting for 1M EPS

The basic setup uses Kafka as a buffer for incoming events and a distributed log. Subsequently, enrichment microservices retrieve these events, perform transformations like data aggregation, and persist the refined data in DynamoDB for analytical querying. The real challenge at 1M EPS isn't whether individual components *can* handle the load, but synchronizing them to prevent bottlenecks and ensure data integrity. Data distribution, consumer parallelism, and guaranteeing consistency under heavy use are the key concerns.

2. Sizing Partitions and Provisioning Writes

To handle 1M EPS, the load must be spread evenly from the start. A single Kafka partition's write throughput depends on message size, replication settings, and broker hardware; conservative benchmarks place it around 50 MB/s, with highly optimized setups reaching 100-200 MB/s. On typical cloud hardware with a replication factor of 3 and 1KB messages, the ceiling is usually set by the broker's network I/O and replication traffic, not raw disk speed, and pushing past it requires careful tuning of batch sizes and linger settings. For 1KB events, a good starting point for 1M EPS is a topic with 500-1000 partitions: this provides sufficient write parallelism and allows an equally sized consumer group to maximize parallel processing.
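The partition count can be sanity-checked with back-of-envelope arithmetic. The per-partition and per-consumer rates below are illustrative assumptions drawn from the figures above, not measurements; substitute your own benchmarks:

```python
import math

# Back-of-envelope partition sizing for 1M EPS of 1KB events.
# PARTITION_WRITE_MBPS and CONSUMER_EVENTS_PER_SEC are assumptions.
EVENTS_PER_SEC = 1_000_000
EVENT_SIZE_BYTES = 1_024
PARTITION_WRITE_MBPS = 50        # conservative single-partition write limit
CONSUMER_EVENTS_PER_SEC = 2_000  # assumed enrichment + DynamoDB write rate

ingress_mbps = EVENTS_PER_SEC * EVENT_SIZE_BYTES / 1_000_000  # ~1 GB/s
partitions_for_ingress = math.ceil(ingress_mbps / PARTITION_WRITE_MBPS)
partitions_for_consumers = math.ceil(EVENTS_PER_SEC / CONSUMER_EVENTS_PER_SEC)

# Consumer parallelism, not raw broker throughput, dominates the count.
partitions = max(partitions_for_ingress, partitions_for_consumers)
print(partitions)  # lands at the low end of the 500-1000 starting range
```

Note that raw broker throughput alone would justify only ~20 partitions; it is the much slower per-consumer processing rate that drives the count into the hundreds.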

On the DynamoDB side, performance depends on Write Capacity Units (WCUs) and partition key design. For a 1M EPS workload with events under 1KB, you'll need to provision around 1,000,000 WCUs, since one WCU covers one write per second for an item up to 1KB. More importantly, the table's primary key must prevent write hot-spotting: a low-cardinality partition key funnels a disproportionate share of writes to a single storage partition, causing throttling. Choose a high-cardinality partition key, like a `UserID` or `TransactionID`. If you don't have a natural key like that, create a synthetic one using a logical identifier plus a random or hashed suffix. This ensures writes are spread evenly.
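A synthetic sharded key can be sketched as follows. `NUM_SHARDS` is an assumed tuning knob, and hashing the event ID (rather than calling a random-number generator) keeps the suffix deterministic, so a retried event lands on the same shard:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count; size it to the expected peak write rate

def sharded_key(logical_id: str, event_id: str) -> str:
    """Build a synthetic partition key that spreads writes for one
    logical entity across NUM_SHARDS DynamoDB storage partitions.
    Deterministic: the same event always maps to the same shard."""
    shard = int(hashlib.sha256(event_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"{logical_id}#{shard}"
```

The trade-off is on the read side: reassembling all writes for one logical identifier requires `NUM_SHARDS` parallel queries, one per suffix.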

3. The High Cost of Strong Consistency

DynamoDB defaults to eventual consistency. You *can* get stronger guarantees, but they come at a cost. An eventually consistent read uses half the Read Capacity Units (RCUs) compared to a strongly consistent read for the same data size. So, choosing strong consistency effectively doubles your RCU costs. Similarly, a transactional write consumes two WCUs for every standard write's one WCU.
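These pricing rules are simple enough to encode directly. The helpers below follow DynamoDB's published capacity-unit arithmetic (1 WCU per 1KB written, 1 RCU per 4KB read with strong consistency, half that for eventual consistency):

```python
import math

def wcu_per_write(item_kb: float, transactional: bool = False) -> int:
    """WCUs consumed by one write: 1 per 1KB, doubled for transactions."""
    base = math.ceil(item_kb / 1.0)
    return base * (2 if transactional else 1)

def rcu_per_read(item_kb: float, strong: bool = False) -> float:
    """RCUs consumed by one read: 1 per 4KB strongly consistent,
    half that for an eventually consistent read."""
    base = math.ceil(item_kb / 4.0)
    return base if strong else base / 2

EPS = 1_000_000
print(wcu_per_write(1.0) * EPS)                       # standard: 1,000,000 WCU
print(wcu_per_write(1.0, transactional=True) * EPS)   # transactional: 2,000,000
```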

At 1M EPS, the extra capacity that strong consistency demands can quickly consume the budget allocated for regular event processing. Doubling the WCU requirement to 2M for transactional writes, or doubling RCU costs for verification reads, makes the system prohibitively expensive. The decision is clear: the main event processing path *must* be built around eventual consistency. Reserve strong consistency for infrequent tasks, like admin functions or reconciliation processes, where immediate data visibility is non-negotiable.

4. Enforcing Idempotency Without Killing Throughput

By default, Kafka guarantees at-least-once delivery. Given that, and the distributed nature of the consumers, idempotency isn't just a pattern; it's essential. The real trick is implementing it without creating a new bottleneck. One common problem is the "poison pill" message—a bad event that crashes a consumer over and over, blocking processing on that partition. You have to handle this by setting up a Dead Letter Queue (DLQ). This isolates the failing messages for later analysis, letting the consumer keep going.

To enforce idempotency at 1M EPS, the most reliable approach is a separate DynamoDB "ledger" table. This table uses the unique event identifier (`EventID`) as its partition key. The consumer first tries a conditional write to this table using the expression `attribute_not_exists(EventID)`. If the write succeeds, the event is new and gets processed. If it fails with a `ConditionalCheckFailedException`, the event is a duplicate and gets ignored. This conditional write is a single, atomic API call.
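The check-then-process flow looks like this. To keep the sketch self-contained, an in-memory class stands in for the DynamoDB ledger table; in production, `put_if_absent` would be a `put_item` call with `ConditionExpression='attribute_not_exists(EventID)'`, and the exception would be botocore's `ConditionalCheckFailedException`:

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for botocore's ConditionalCheckFailedException."""

class InMemoryLedger:
    """In-memory stand-in for the DynamoDB ledger table keyed on EventID."""
    def __init__(self):
        self._items = {}

    def put_if_absent(self, event_id: str) -> None:
        # Mirrors attribute_not_exists(EventID): atomic insert-or-fail.
        if event_id in self._items:
            raise ConditionalCheckFailed(event_id)
        self._items[event_id] = True

def process_once(ledger, event, handler) -> bool:
    """Return True if the event was processed, False if it was a duplicate."""
    try:
        ledger.put_if_absent(event["EventID"])
    except ConditionalCheckFailed:
        return False  # duplicate delivery: skip silently
    handler(event)
    return True
```

Writing to the ledger *before* processing means a consumer crash between the two steps drops the event rather than duplicating it; systems that cannot tolerate that reverse the order and accept rare duplicates instead.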

While this ledger-table pattern is robust, it's not without cost. It doubles the write operations, meaning your provisioned WCUs for the entire system will be roughly 2M instead of 1M, directly doubling that portion of your bill. Crucially, DynamoDB consumes write capacity for a conditional write even if the condition fails. To mitigate the impact of these failures, developers should now use the `ReturnValuesOnConditionCheckFailure` parameter, introduced in mid-2023. Setting this to `ALL_OLD` returns the item's current state within the exception, avoiding a costly follow-up read operation to diagnose the conflict, which was a necessary inefficiency in older designs.

While Kafka's Exactly-Once Semantics (EOS) offers an alternative, it introduces performance trade-offs. The transactional coordination required for EOS can increase message latency and significantly reduce a producer's maximum throughput, sometimes requiring multiple producer instances to compensate. For a 1M EPS firehose where low latency is critical, the consumer-side idempotency check via a DynamoDB ledger, despite its higher write cost, often provides a more scalable and operationally simpler path.
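For comparison, enabling EOS on the producer side is mostly a configuration change. The snippet below uses librdkafka-style keys as accepted by confluent-kafka-python; the broker address and `transactional.id` value are placeholders:

```python
# Producer configuration for Kafka Exactly-Once Semantics.
# Keys follow librdkafka naming (confluent-kafka-python).
eos_producer_config = {
    "bootstrap.servers": "broker1:9092",     # placeholder address
    "enable.idempotence": True,              # prerequisite for transactions
    "transactional.id": "enrichment-svc-1",  # one stable id per producer instance
    "acks": "all",                           # implied by idempotence
}
```

Each producer instance needs its own stable `transactional.id`, which is part of why compensating for EOS's throughput loss with more producers adds operational overhead.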

Therefore, a robust and scalable architecture for this workload converges on a design that combines high Kafka partition counts, horizontally scaled consumers, a DynamoDB primary key designed for writes with high cardinality, and a dedicated DynamoDB ledger table to enforce idempotency using conditional writes. This approach provides a resilient and cost-aware blueprint for operating at 1M EPS, balancing throughput with the non-negotiable requirement of data integrity.

Figure: Kafka and DynamoDB data flow.
Dr. Elena Vosk
specializes in large-scale distributed systems. Obsessed with the CAP theorem and data consistency.