Apache Kafka has become the backbone of real-time data streaming for countless organizations, powering everything from reimagined customer experiences at companies like Instacart, Airbnb, and Netflix to critical operational workflows. However, fully leveraging the power of Apache Kafka can be a complex and challenging journey. This post dives into seven common mistakes data practitioners make when working with open-source Kafka and provides actionable solutions to avoid them, helping you build a more stable and scalable Apache Kafka-based data streaming platform.

Challenges of Scaling Apache Kafka

While widely adopted, Apache Kafka presents significant challenges as you scale, including:

  • Operational Burden and Management Costs: Managing and operating Kafka is time-consuming and costly, especially as your streaming data platform grows.
  • Lack of built-in Enterprise Features: Essential security and governance features are not included out-of-the-box, requiring significant effort to build from scratch.
  • Limited pre-built Connectors: Connecting data sources often requires building custom connectors due to limited support for connectors for cloud-based data sources.
  • Limited integrated Stream Processing: Implementing real-time stream processing often requires integrating tools like Apache Flink and Spark Structured Streaming.
  • Resource-Intensive Failover Design: The complexity and risk associated with failover increase exponentially with the volume and variety of streaming data.

At a high level, organizations need to standardize best practices, improve data governance and reliability with tools like Karapace or Schema Registry, and plan for scalability. As with most open-source solutions, these extra capabilities need to be built or integrated in-house for Apache Kafka.

Mistake 1: Selecting the Wrong or Unmaintained Clients

One of the benefits of Kafka’s open-source nature is its vast client ecosystem. That means there are many clients available for Apache Kafka, but not all of them are well supported or actively maintained.

Solution

  • Maintain an Internal Supported List: Establish a standard for configurations and maintain a list of vetted and supported clients for developers.
  • Regularly Update Clients: Update Apache Kafka clients at least annually to benefit from performance improvements, bug fixes, and security patches (see the version-check sketch after this list).
  • Use Maintained Clients: Prioritize using well-maintained clients. Confluent provides supported clients in languages like Python, .NET, Java, Go, and JavaScript.
  • Check for Active Contributors and Regular Releases: Check the Kafka client has active contributors, regular releases, and is up-to-date with the latest Kafka protocol.
  • Utilize Centralized Monitoring: Use tools like Confluent Cloud Console to identify client versions and investigate clients reporting unknown versions as they may be unsupported.
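As a complement to centralized monitoring, here is a minimal sketch, assuming the Confluent Python client (confluent-kafka) is in use, of logging the installed client and librdkafka versions at application startup so outdated clients are easy to spot:

```python
# A minimal version-check sketch, assuming the confluent-kafka Python client.
import logging

import confluent_kafka

logging.basicConfig(level=logging.INFO)

# version() reports the Python client version, libversion() the bundled
# librdkafka version; logging both at startup makes stale clients visible
# in centralized log tooling.
client_version, _ = confluent_kafka.version()
librdkafka_version, _ = confluent_kafka.libversion()
logging.info("confluent-kafka %s (librdkafka %s)", client_version, librdkafka_version)
```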

Mistake 2: Incorrect Client Usage

Even with the right client, improper usage can lead to significant issues. Common mistakes include not understanding delivery guarantees, assuming producer.send is synchronous, overlooking how partitioning affects strict ordering, making blocking external calls inside consumer poll loops, and not retrying client errors correctly.

Solution

  • Educate Developers and Set Best Practices: Provide guidance to developers and establish best practices based on specific clients and use cases.
  • Use Client Guides and Documentation: Refer to client guides and documentation to understand proper usage and language-specific best practices.
  • Understand Keys and Partitions: Use identifiers like user_id or order_id as keys to guarantee messages with the same key are in the same partition, maintaining ordering.
  • Understand Message Delivery Guarantees: Kafka has three types of delivery guarantees. Check with downstream data consumers to determine the best option for each use case.
    • At Most Once: Messages are delivered once or not at all (fire and forget). This has the lowest latency but can result in data loss.
    • At Least Once: The producer waits for broker confirmation. If not received, the message is resent, potentially leading to duplicates.
    • Exactly Once: Retries are sent with idempotence enabled, so the broker detects and discards duplicates.
  • Embrace Retries on the Client Side: Implement proper retry logic in client code based on the chosen message delivery mechanism to handle transient errors automatically and gracefully (see the producer sketch after this list).
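To make asynchronous sends, keyed messages, and delivery callbacks concrete, here is a minimal producer sketch using the confluent-kafka Python client; the broker address, topic, and key are illustrative placeholders:

```python
# A minimal producer sketch with the confluent-kafka Python client.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # illustrative local broker
    "enable.idempotence": True,             # implies acks=all and safe retries
})

def on_delivery(err, msg):
    # produce() is asynchronous; success or failure only becomes known here.
    if err is not None:
        # Retriable errors are already retried by the client; anything that
        # surfaces here is a permanent failure the application must handle.
        print(f"Delivery failed for key {msg.key()}: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

# Keying by order_id keeps all events for one order in the same partition,
# preserving per-order ordering.
producer.produce("orders", key="order-42", value='{"status": "created"}',
                 on_delivery=on_delivery)
producer.poll(0)   # serve delivery callbacks
producer.flush()   # block until outstanding messages are delivered (or fail)
```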

Mistake 3: Not Tuning Client Configurations

Kafka’s strength lies in its tunability, but this also introduces complexity in configuration. You must fine-tune your Kafka setup based on business requirements and downstream applications.

Solution

  • Communicate Configuration Best Practices: Establish and communicate safe configuration best practices to developers, e.g., using internal wikis as a source of truth.
  • Understand and Establish Service Goals: Define service goals (throughput, latency, durability, and availability) based on application and business requirements and help developers understand how configurations affect these goals for their applications using Kafka.
    • Throughput and Durability Focused: Suitable for use cases like log aggregation or metrics collection that are not latency-sensitive, where data is typically retained for long-term or offline analysis served by batch processing jobs.
    • Latency Focused: Essential for real-time alerts, video streaming, or low-latency experiences in applications like multiplayer games and collaboration tools.
  • Establish Tailored Defaults: Avoid blindly using Kafka’s default configurations, which are designed for basic use cases but may not be optimal for specific workloads (two illustrative tuning profiles follow this list).
  • Involve Developers in Observability: Implement client monitoring and ensure development teams understand the metrics and can identify problems independently.
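To make the throughput-versus-latency trade-off concrete, here are two illustrative producer configuration profiles using confluent-kafka/librdkafka option names; the values are starting points to benchmark against your own workload, not recommended defaults:

```python
# Two illustrative producer profiles; tune the values against your own workload.
from confluent_kafka import Producer

throughput_profile = {
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 100,             # wait up to 100 ms to fill larger batches
    "batch.size": 262144,         # 256 KiB batches
    "compression.type": "lz4",    # fewer, smaller requests at the cost of CPU
    "acks": "all",                # favor durability
}

latency_profile = {
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 0,               # send as soon as possible
    "compression.type": "none",
    "acks": "1",                  # leader-only ack lowers latency, weakens durability
}

# Pick the profile that matches the service goal for the application.
producer = Producer(latency_profile)
```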

Mistake 4: Not Caring About Schemas from the Beginning

Overlooking schemas at the start and trying to introduce them later is a common mistake, even among experienced data professionals, because Apache Kafka itself is a schemaless streaming platform. It quickly becomes a problem, as retrofitting schemas onto existing topics and consumers is challenging.

Solution

  • Adopt Schema Registry from Day One: Start with a Schema Registry and familiarize your team with its functionality, which acts as a database for schemas and their associated topics (a minimal producer sketch using the registry follows this list).
  • Leverage Schemas as First-Class Citizens: Define schemas, agree on formats and evolution with producers and consumers, and enforce schema usage for topics shared across teams.
  • Adopt Tooling for Schema Management: Manage schemas using CI/CD and implement RBAC controls on the Schema Registry to prevent accidental or inadvertent registrations.
  • Understand the Benefits beyond Avoiding Errors: Schematized data enhances the efficiency and reliability of data processing for analytical use cases by providing clearly defined structures, ensuring consistency, and enabling robust data type checking across pipelines.
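Here is a minimal sketch of producing schematized records, assuming a Schema Registry at localhost:8081 and the confluent-kafka Python client with Avro support installed; the topic and schema are illustrative:

```python
# A minimal Avro-producing sketch; requires `pip install "confluent-kafka[avro]"`.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(registry, schema_str)

producer = Producer({"bootstrap.servers": "localhost:9092"})

# The serializer registers the schema under the topic's value subject and
# rejects records that do not match it.
value = serializer({"order_id": "order-42", "amount": 19.99},
                   SerializationContext("orders", MessageField.VALUE))
producer.produce("orders", key="order-42", value=value)
producer.flush()
```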

Mistake 5: Not Planning for Multi-Tenancy

As Kafka adoption grows, it often evolves into a shared platform used by multiple teams. However, Apache Kafka provides only basic primitives for isolating tenants, and many organizations fail to plan for multi-tenancy until problems appear.

Solution

  • Define Discrete Ownership by Topic: Assign ownership and responsibilities for each topic (e.g., who is responsible for monitoring, costs, and incident response) and document them, for example, in internal wikis.
  • Leverage Kafka’s Built-in Features:
    • Access Control Lists (ACLs): Define specific permissions per principal to control read and write access to topics. This limits cross-team usage of non-owned topics (see the ACL sketch after this list).
    • Quotas: Define limits for the resources a consumer team can use on the broker.
  • Prepare for Potential Spikes and Plan Capacity: Anticipate potential spikes in load and plan capacity accordingly. Determine whether limits should be hard or soft boundaries.
  • Automate Changes: Use CI/CD for managing controls like ACLs and quotas to reduce human errors and maintain an audit trail across your data streaming platform.
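As a sketch of the ACL point above, the following grants a consuming team’s principal read-only access to a single topic via the confluent-kafka Python AdminClient (available in recent client versions); the principal, topic, and broker address are illustrative, and quotas would be applied separately through broker configuration tooling:

```python
# A sketch of granting read-only topic access via ACLs; requires a cluster
# with an authorizer configured. Names are illustrative.
from confluent_kafka.admin import (AdminClient, AclBinding, AclOperation,
                                   AclPermissionType, ResourcePatternType,
                                   ResourceType)

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Allow the analytics team's principal to read the orders topic, nothing more.
read_orders = AclBinding(
    ResourceType.TOPIC, "orders", ResourcePatternType.LITERAL,
    "User:analytics-consumer", "*",
    AclOperation.READ, AclPermissionType.ALLOW,
)

futures = admin.create_acls([read_orders])
for binding, future in futures.items():
    future.result()  # raises if the ACL could not be created
    print(f"Created ACL: {binding}")
```

Managing such bindings through CI/CD (rather than ad hoc scripts) keeps an audit trail, as noted above.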

Mistake 6: Not Preparing for Scale

Many data practitioners fail to prepare for scale, often due to a lack of understanding of streaming data modeling or how Kafka stores and processes data under the hood. This leads to scalability issues when your streaming data platform grows in terms of data volume or producer applications.

Solution

  • Use a Meaningful Partition Key: Choose a partition key that makes sense within your use case, such as region_id for data coming from multiple regions.
  • Plan for Capacity Thoroughly: Consider the type of data (JSON, plain text, images, etc.), anticipated throughput, and potential peak times for each application.
  • Determine the Right Number of Partitions: The number of partitions is a key lever for scale, as it directly impacts throughput and consumer parallelism (a rough sizing sketch follows this list).
  • Plan for Failure: Regularly test how clients behave when a broker fails or stops in pre-production environments to refine best practices before they reach production applications.
  • Understand Downstream Consumer Processing Time: Analyze consumer processing time as lag can also be caused by slow consumer processing rather than Kafka itself.
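For partition sizing, here is a rough back-of-the-envelope sketch based on the common rule of thumb of dividing target throughput by measured per-partition producer and consumer throughput; all figures are illustrative and should come from your own benchmarks:

```python
# A rough partition-sizing sketch: take the maximum of
#   target throughput / per-partition producer throughput
#   target throughput / per-partition consumer throughput
# and add headroom for peaks and future growth.
import math

def estimate_partitions(target_mb_s: float,
                        producer_mb_s_per_partition: float,
                        consumer_mb_s_per_partition: float,
                        headroom: float = 1.5) -> int:
    """Return a partition count with headroom for peaks and growth."""
    needed = max(target_mb_s / producer_mb_s_per_partition,
                 target_mb_s / consumer_mb_s_per_partition)
    return math.ceil(needed * headroom)

# e.g. a 100 MB/s target, 10 MB/s per partition on the producer side and
# 5 MB/s per partition on the (slower) consumer side -> 30 partitions.
print(estimate_partitions(100, 10, 5))
```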

Mistake 7: Not Considering Day 2 Operations

The final common mistake is neglecting Day 2 operations, even if the initial setup is correct. You may have built the perfect streaming data platform, but still face issues maintaining it later on.

Solution

  • Learn About Observability Best Practices: Understand what metrics to monitor and how they work together to provide a comprehensive view of overall performance (a client-statistics sketch follows this list).
  • Establish Tools for Day 2 Operations: Utilize infrastructure-as-code tools like Ansible, Terraform, and Kubernetes operators to manage configurations, deployments, users, topics, performance, capacity, broker management, and data retention as requirements change.
  • Plan for Ongoing Maintenance and Upgrades: Have a solid plan for maintenance tasks and regular upgrades to ensure top-notch security and high performance.
  • Use an Application Performance Monitoring (APM) Tool: Helps monitor and alert on key metrics, collect and search client logs, and inform about upgrades and cluster imbalances.
  • Include Application Metrics in Observability: Ensure your observability setup includes not just Kafka metrics but also application metrics that capture interactions with downstream systems.
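As a sketch of client-side observability, the confluent-kafka Python client can emit librdkafka’s JSON statistics payload at a configurable interval via a statistics callback; where the metrics are forwarded (an APM tool, logs, or a metrics backend) is up to you:

```python
# A sketch of surfacing client-side metrics, assuming the confluent-kafka
# Python client; librdkafka emits a JSON statistics payload at the
# configured interval.
import json
from confluent_kafka import Producer

def emit_stats(stats_json: str):
    # Forward a few headline metrics to your observability pipeline; the full
    # payload also contains per-broker and per-partition detail.
    stats = json.loads(stats_json)
    print("outbound msg queue:", stats.get("msg_cnt"),
          "tx bytes:", stats.get("txmsg_bytes"))

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "statistics.interval.ms": 60000,  # emit stats once a minute
    "stats_cb": emit_stats,           # invoked from poll()/flush() calls
})
```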

By being aware of these common mistakes and implementing the suggested solutions, you can significantly improve your Apache Kafka deployments, ensuring they are robust, scalable, and highly performant – even under unexpected load peaks. Understanding these challenges and solutions is crucial for data professionals looking to leverage Apache Kafka effectively for their data streaming needs.

References

  1. Confluent Blog [ 5 Common Pitfalls When Using Apache Kafka (original) (archived) ]
  2. Confluent Resources - Online Talk [ 7 Common Apache Kafka Mistakes and How to Solve Them (original) (archived) ]
