2025-08-13
4 min read

Designing a Scalable Queue System – System Design Principles

Deep dive into building a resilient, observable, and cost‑efficient queueing platform. Covers producers/consumers, backpressure, retries, DLQs, idempotency, ordering, and monitoring.

Building a reliable queueing system is about more than just buffering messages. It’s about flow control, fault isolation, observability, and operational ergonomics.

Project Name and Context

  • Project: Scalable Queue System (Kafka/RabbitMQ/SQS-like design)
  • Context: Power asynchronous workloads (notifications, emails, video processing, webhooks)
  • Targets: High throughput, practical "effectively-once" processing (idempotent at-least-once), graceful failure modes

Functionality and Purpose

  • Durable message delivery with at‑least‑once semantics by default
  • Consumer groups for horizontal scaling
  • Backpressure so producers don’t overwhelm downstreams
  • Retries + DLQ to quarantine poison messages
  • Ordering guarantees where needed (partition keys)
  • Observability: lag, error rate, processing latency, saturation

Problem It Solves

  • Smooths traffic spikes; protects databases/APIs from overload
  • Decouples producers and consumers (team boundaries + release cadence)
  • Improves reliability with retry semantics and isolation
  • Enables async patterns for cost and performance wins

Tech Stack Used

  • Broker: Kafka or RabbitMQ (or managed: SQS + SNS)
  • Producers/Consumers: Node.js/TypeScript (worker processes)
  • Storage: Postgres for stateful workflows; Redis for rate limits
  • Infra: Docker, k8s (HPA/VPA), Grafana + Prometheus, OpenTelemetry

How I Built It (or Would Build It)

Topic/Queue Design

payments.events
  ├─ payments.created (partition by userId)
  ├─ payments.failed  (partition by paymentId)
  └─ payments.refund  (partition by userId)
  • Partition by a key that preserves local ordering where required
  • Split high‑volume vs low‑volume topics to isolate hotspots
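
To make the topic layout concrete, here is a provisioning sketch using kafkajs's admin client. The partition counts and retention value are illustrative assumptions, not tuned numbers: size the hot topic generously, keep low-volume topics small.

// scripts/createTopics.ts, a provisioning sketch (partition counts are assumptions)
import { Kafka } from "kafkajs";

const kafka = new Kafka({ brokers: [process.env.KAFKA!] });

async function createTopics() {
  const admin = kafka.admin();
  await admin.connect();
  await admin.createTopics({
    topics: [
      // High-volume topic gets more partitions to spread the hotspot
      { topic: "payments.created", numPartitions: 12, replicationFactor: 3 },
      { topic: "payments.failed", numPartitions: 3, replicationFactor: 3 },
      {
        topic: "payments.refund",
        numPartitions: 3,
        replicationFactor: 3,
        configEntries: [{ name: "retention.ms", value: "604800000" }], // 7 days
      },
    ],
  });
  await admin.disconnect();
}

createTopics().catch(console.error);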

Producer (Node.js)

// src/producer/paymentsProducer.ts
import { Kafka } from "kafkajs";

const kafka = new Kafka({ brokers: [process.env.KAFKA!] });
const producer = kafka.producer();

let connected = false;

export async function publishPaymentCreated(evt: {
  paymentId: string;
  userId: string;
  amount: number;
}) {
  // Connect lazily, once; reconnecting on every publish adds latency.
  if (!connected) {
    await producer.connect();
    connected = true;
  }
  await producer.send({
    topic: "payments.created",
    messages: [
      {
        key: evt.userId, // partition by userId to keep per-user ordering
        value: JSON.stringify(evt),
        headers: { "x-idempotency-key": evt.paymentId },
      },
    ],
  });
}
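
Usage is then a single call wherever a payment row is committed; the values below are placeholders. Note the dual-write gap here: the DB commit can succeed while the publish fails, which is exactly what the Outbox Pattern later in this post addresses.

// wherever payments are created, e.g. an API handler (values are placeholders)
import { publishPaymentCreated } from "./paymentsProducer";

export async function onPaymentCommitted() {
  await publishPaymentCreated({
    paymentId: "pay_123",
    userId: "user_42",
    amount: 1999, // smallest currency unit
  });
}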

Consumer with Backoff, Idempotency, and DLQ

// src/consumer/paymentsCreatedConsumer.ts
import { Kafka } from "kafkajs";
import { upsertPayment } from "../services/payments";
import { isProcessed, markProcessed } from "../services/idempotency";

const kafka = new Kafka({ brokers: [process.env.KAFKA!] });
const consumer = kafka.consumer({ groupId: "payments-workers" });
const producer = kafka.producer(); // routes failures to retry/DLQ topics

const MAX_ATTEMPTS = 3;

export async function run() {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: "payments.created", fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const evt = JSON.parse(message.value!.toString());
      const key = message.headers?.["x-idempotency-key"]?.toString();

      // At-least-once delivery means duplicates are expected after
      // rebalances and redeliveries; skip anything already processed.
      if (key && (await isProcessed(key))) return;

      try {
        await upsertPayment(evt);
        if (key) await markProcessed(key);
      } catch (err) {
        // Bump the attempt counter (carried in a header; the name is
        // illustrative) and republish to the next retry tier, or to the
        // DLQ once attempts are exhausted.
        const attempts =
          Number(message.headers?.["x-attempts"]?.toString() ?? "0") + 1;
        const topic =
          attempts <= MAX_ATTEMPTS
            ? `payments.created.retry.${attempts}`
            : "payments.created.dlq";
        await producer.send({
          topic,
          messages: [
            {
              key: message.key,
              value: message.value,
              headers: { ...message.headers, "x-attempts": String(attempts) },
            },
          ],
        });
        // Not rethrowing lets the offset commit; the retry topic owns the
        // message from here.
      }
    },
  });
}

Retry + Dead‑Letter Queue (DLQ)

payments.created
  └─ payments.created.retry.<1,2,3>
      └─ payments.created.dlq
  • Exponential backoff between retries
  • Cap attempts to 3–5 before DLQ
  • DLQ requires explicit operator action to reprocess
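
One way to implement the delay, assuming kafkajs >= 2.0's multi-topic subscribe and one consumer group owning all retry tiers: derive the attempt number from the topic name, sleep out the backoff, then reprocess. This is a minimal sketch; sleeping in eachMessage blocks the whole partition, so pausing the partition is the production-grade variant. Error routing to the next tier is omitted for brevity.

// src/consumer/paymentsRetryConsumer.ts, a backoff sketch (delays are assumptions)
import { Kafka } from "kafkajs";
import { upsertPayment } from "../services/payments";

const kafka = new Kafka({ brokers: [process.env.KAFKA!] });
const consumer = kafka.consumer({ groupId: "payments-retry-workers" });

const BASE_DELAY_MS = 5_000;
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

export async function runRetries() {
  await consumer.connect();
  // kafkajs >= 2.0 accepts an array of topics in one subscribe call
  await consumer.subscribe({
    topics: [
      "payments.created.retry.1",
      "payments.created.retry.2",
      "payments.created.retry.3",
    ],
  });

  await consumer.run({
    eachMessage: async ({ topic, message, heartbeat }) => {
      const attempt = Number(topic.split(".").pop()); // tier encodes attempt #
      // Exponential backoff: 5s, 10s, 20s for tiers 1..3. Long sleeps block
      // the partition; call heartbeat() so the broker keeps us in the group.
      await sleep(BASE_DELAY_MS * 2 ** (attempt - 1));
      await heartbeat();
      await upsertPayment(JSON.parse(message.value!.toString()));
      // Failures here would route to the next tier/DLQ exactly as in the
      // main consumer (omitted for brevity).
    },
  });
}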

Ordered vs Unordered Work

  • Use partition keys to keep order for a subset (e.g., per user)
  • For throughput, prefer unordered processing; use idempotent writes
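
For those idempotent writes, here is one possible shape for the isProcessed/markProcessed helpers the consumer imports, backed by Redis; the idem: prefix, TTL, and REDIS_URL env var are assumptions. Postgres unique constraints (mentioned below) are the stricter alternative when the mark must commit atomically with the write.

// src/services/idempotency.ts, one possible implementation (ioredis)
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL!); // env var name is assumed

// Keep keys at least as long as duplicates can plausibly arrive.
const TTL_SECONDS = 7 * 24 * 3600;

export async function isProcessed(key: string): Promise<boolean> {
  return (await redis.exists(`idem:${key}`)) === 1;
}

export async function markProcessed(key: string): Promise<void> {
  // NX keeps the mark race-safe when two consumers finish concurrently.
  await redis.set(`idem:${key}`, "1", "EX", TTL_SECONDS, "NX");
}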

Observability

  • Lag per consumer group (alert if > threshold)
  • Processing latency p50/p95/p99
  • Error rate and retry counts
  • Saturation (CPU, concurrency slots)
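
A sketch of the in-process side with prom-client (metric names are assumptions); consumer-group lag is usually scraped broker-side, e.g. via kafka_exporter, rather than from the workers themselves.

// src/metrics.ts, a prom-client sketch (metric names are assumptions)
import client from "prom-client";

export const processingLatency = new client.Histogram({
  name: "queue_processing_latency_seconds",
  help: "Time from message receipt to processing outcome",
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

export const processingErrors = new client.Counter({
  name: "queue_processing_errors_total",
  help: "Messages routed to retry/DLQ after a processing failure",
});

// Inside eachMessage:
//   const stop = processingLatency.startTimer();
//   try { ... } catch (err) { processingErrors.inc(); throw err; }
//   finally { stop(); }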

Unique Implementations or System Design

  • Idempotency via Redis set or Postgres unique constraints
  • Outbox Pattern to publish events from transactional DB commits (see the sketch after this list)
  • Poison message fencing using DLQ + quarantine window
  • Quota‑aware backpressure: producers read lag/credits before sending
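
A minimal sketch of the Outbox Pattern mentioned above, assuming an outbox table with (id, topic, key, payload jsonb, published_at) columns; table and column names are mine. The event row commits in the same transaction as the business write, and a poller publishes and marks rows afterward.

// src/services/outbox.ts, an Outbox Pattern sketch (schema is an assumption)
import { Pool } from "pg";
import { Kafka } from "kafkajs";

const pool = new Pool(); // reads PG* env vars
const producer = new Kafka({ brokers: [process.env.KAFKA!] }).producer();

// 1) Inside the business transaction: write the event to the outbox table
//    so the DB commit and the event are atomic.
export async function createPaymentWithOutbox(evt: {
  paymentId: string;
  userId: string;
  amount: number;
}) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(
      "INSERT INTO payments (id, user_id, amount) VALUES ($1, $2, $3)",
      [evt.paymentId, evt.userId, evt.amount]
    );
    await client.query(
      "INSERT INTO outbox (topic, key, payload) VALUES ($1, $2, $3)",
      ["payments.created", evt.userId, JSON.stringify(evt)]
    );
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  } finally {
    client.release();
  }
}

// 2) A poller (run on an interval) publishes unpublished rows, then marks them.
export async function drainOutbox() {
  await producer.connect();
  const { rows } = await pool.query(
    "SELECT id, topic, key, payload FROM outbox WHERE published_at IS NULL ORDER BY id LIMIT 100"
  );
  for (const row of rows) {
    await producer.send({
      topic: row.topic,
      messages: [{ key: row.key, value: JSON.stringify(row.payload) }],
    });
    await pool.query("UPDATE outbox SET published_at = now() WHERE id = $1", [row.id]);
  }
}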

Realistic Challenges or Tradeoffs

  • Exactly‑once is expensive; aim for idempotent at‑least‑once
  • Ordering reduces throughput (hot partitions); balance it against your partitioning strategy
  • Fan‑out topics increase cost; prefer event enrichment over duplication
  • Multi‑region adds latency; use per‑region topics + async replication

What I Learned or Improved

  • Designing for failure first (DLQ, retries, visibility)
  • Idempotency patterns that survive redeploys and crashes
  • Instrumentation as a feature (operators are users!)
  • Cost analysis: throughput vs retention vs storage IO