LinkedIn Re-Architects Service Discovery: Replacing ZooKeeper with Kafka and xDS at Scale


Patrick Farry

In a recent LinkedIn Engineering Blog post, Bohan Yang describes the project to upgrade the company's legacy ZooKeeper-based service discovery platform. With thousands of microservices pushing the system toward imminent capacity limits, LinkedIn needed a more scalable architecture. The new system uses Apache Kafka for writes and the xDS protocol for reads, adopting an eventual consistency model and allowing non-Java clients to participate as first-class citizens. To ensure stability, the team implemented a "Dual Mode" strategy that enabled an incremental, zero-downtime migration.

The team identified critical scaling problems with the legacy Apache ZooKeeper-based system. Direct writes from app servers and direct reads/watches from clients meant that large application deployments caused massive write spikes and subsequent "read storms," leading to high latency and session timeouts. Additionally, since ZooKeeper enforces strong consistency (strict ordering), a backlog of read requests could block writes, causing healthy nodes to fail health checks. The team estimated that the existing system would reach its maximum capacity in 2025.
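The post does not include client code, but the read-storm dynamic is easy to picture with a conventional watch-based ZooKeeper client. The sketch below is illustrative only; the znode path and session timeout are assumptions, not LinkedIn's actual schema. Because ZooKeeper watches are one-shot, every child change prompts every subscribed client to re-read the full instance list and re-arm the watch, so one large deployment multiplies into thousands of simultaneous reads.

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: a conventional watch-based discovery client.
// The znode path and 30s session timeout are hypothetical.
public class ZkDiscoveryClient implements Watcher {
    private static final String SERVICE_PATH = "/services/my-service";
    private final ZooKeeper zk;

    public ZkDiscoveryClient(String connectString) throws Exception {
        // A read backlog that delays heartbeats can expire sessions like this one.
        this.zk = new ZooKeeper(connectString, 30_000, this);
    }

    public List<String> fetchInstances() throws Exception {
        // Registers a one-shot watch. After any child change, ZooKeeper fires
        // the watch and every subscribed client calls this again, re-reading
        // the full list -- the "read storm" during large deployments.
        return zk.getChildren(SERVICE_PATH, this);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                fetchInstances(); // re-read the instance list and re-arm the watch
            } catch (Exception e) {
                // Real clients would retry with backoff and handle session loss.
            }
        }
    }
}
```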

To address these shortcomings, the team designed a new architecture that moves from strong consistency to an eventual consistency model, trading immediacy for better performance, availability, and scalability. The new system separates the write path (via Kafka) from the read path (via an Observer service). The Service Discovery Observer consumes Kafka events to update its in-memory cache and pushes updates to clients via the xDS protocol, which is compatible with Envoy and gRPC. Because xDS is an open standard, LinkedIn can support clients in many languages beyond Java, and the choice opens the door to future integration with a service mesh (Envoy) and centralized load balancing.
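The blog post describes the architecture rather than the code, but the decoupled write path can be sketched as app servers publishing registration events to a Kafka topic that the Observer later consumes. The topic name, key scheme, and JSON payload below are assumptions for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative write path: an app server announces itself by publishing
// an event to Kafka instead of writing directly to ZooKeeper.
// Topic name, key format, and payload are hypothetical.
public class ServiceAnnouncer {
    private final KafkaProducer<String, String> producer;

    public ServiceAnnouncer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void announce(String serviceName, String instanceUri) {
        // Keying by service name keeps all events for one service on the
        // same partition, preserving per-service ordering for the Observer.
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "service-discovery-announcements",   // hypothetical topic
                serviceName,
                "{\"uri\":\"" + instanceUri + "\",\"status\":\"UP\"}");
        producer.send(record);
    }
}
```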

Post-upgrade benchmarks showed that a single Observer instance can maintain 40k client streams and process 10k updates per second. Observers operate independently per data center (fabric) but allow clients to connect to remote Observers for failover or cross-data center traffic.
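To make the Observer's role in those numbers concrete, here is a minimal consumer-side sketch, reusing the hypothetical topic and payload from the producer above: it materializes announcement events into an in-memory view of live instances. A production Observer would also parse instance status, evict stale entries, and fan each update out to its tens of thousands of xDS streams.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative Observer core: consume announcement events into an
// in-memory cache. Topic, group id, and payload are hypothetical.
public class ObserverCache {
    // service name -> set of raw instance events (a real Observer would
    // parse these and track UP/DOWN status per instance)
    private final Map<String, Set<String>> instancesByService =
            new ConcurrentHashMap<>();

    public void run(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", "sd-observer"); // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("service-discovery-announcements"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Key = service name, value = instance event JSON.
                    instancesByService
                            .computeIfAbsent(record.key(),
                                    k -> ConcurrentHashMap.newKeySet())
                            .add(record.value());
                    // Next step in the real system: push the updated endpoint
                    // set to subscribed clients over their xDS streams.
                }
            }
        }
    }
}
```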

The migration had to occur without interrupting billions of daily requests or requiring manual changes from thousands of app owners. The team implemented dual-read and dual-write mechanisms. For reads, clients subscribed to both ZooKeeper and the new Observer; ZooKeeper remained the source of truth for traffic routing during the pilot phase of a client migration, while background threads verified the accuracy of Observer data against ZooKeeper before switching traffic over. For writes, app servers announced their presence to both ZooKeeper and Kafka simultaneously. Automated cron jobs analyzed ZooKeeper watchers to identify "long-tail" legacy clients that were preventing ZooKeeper writes from being decommissioned.
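As a rough sketch of the dual-write phase, combining the real ZooKeeper and Kafka client APIs with the hypothetical ServiceAnnouncer above, an app server's registration step might touch both systems in order:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative dual-write during migration: the app server registers in
// both the legacy ZooKeeper tree and the new Kafka-based write path.
// Paths and the ServiceAnnouncer helper (sketched above) are hypothetical.
public class DualModeAnnouncer {
    private final ZooKeeper zk;
    private final ServiceAnnouncer kafkaAnnouncer;

    public DualModeAnnouncer(ZooKeeper zk, ServiceAnnouncer kafkaAnnouncer) {
        this.zk = zk;
        this.kafkaAnnouncer = kafkaAnnouncer;
    }

    public void register(String serviceName, String instanceUri) throws Exception {
        // Legacy path: an ephemeral znode that disappears if the session dies.
        // Assumes the parent /services/<name> znode already exists.
        zk.create("/services/" + serviceName + "/" + instanceUri,
                instanceUri.getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL);
        // New path: the Kafka event feeds the Observer so dual-read clients
        // can compare its data against ZooKeeper before switching traffic.
        kafkaAnnouncer.announce(serviceName, instanceUri);
    }
}
```

Writing the legacy path first preserves ZooKeeper as the source of truth during the pilot: if the ephemeral znode creation fails, the instance never appears in either system.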

Future Impact

After implementing the new service, data propagation latency improved significantly, dropping from P50 < 10s / P99 < 30s to P50 < 1s / P99 < 5s. The system now supports hundreds of thousands of app instances per data center with horizontal scalability via the Observer layer.
