Built byPhoenix

© 2026 Phoenix

← Blog
Message QueuesEvent-DrivenDistributed SystemsArchitectureKafkaBackend

Queues and Event-Driven Architecture: When, Why, and How Not to Get Burned

Phoenix·April 20, 2026·15 min read

Queues and Event-Driven Architecture: When, Why, and How Not to Get Burned

The request that does too much

Picture a single POST /checkout handler. Inside it: validate the cart, charge the card, write the order to the database, generate a PDF invoice, send a confirmation email, push an event to the analytics pipeline, notify the warehouse, and sync the customer to the CRM. Eight things. Three of them are calls to third-party APIs you don't control.

This design has four problems, and they compound:

  • Latency. The user waits for all of it. Your p99 is the sum of every dependency's p99 — including the slow ones.
  • Fragility. If the email provider is having a bad afternoon, the whole checkout returns a 500 — even though the card was already charged. One non-critical dependency takes down the critical path.
  • It fails as a unit. The nightmare case is partial failure: the card is charged, then the process crashes before the order row is written. Now you owe a customer money and have no record of why.
  • Scaling. You scale the entire monolith to keep up with whichever step is slowest.

The fix is conceptually simple. Do the minimum synchronous work the user is actually waiting on — validate, charge, persist the order — and get everything else off the request path. That "everything else" goes onto a queue, and a separate worker drains it. The user gets their 200 in 80ms; the invoice and the email happen a beat later.

That's the entire pitch for message queues. The rest of this post is about doing it without quietly losing data.

When you actually need a queue

Don't reach for a broker because it's fashionable. Reach for one when you have at least one of these:

  • Decoupling. The producer shouldn't have to know who consumes a message, or how many consumers there are. Adding a fourth listener to "order placed" shouldn't touch the checkout code.
  • Load smoothing / backpressure. Traffic spikes. A queue absorbs the burst and lets consumers drain at a sustainable rate, so your database doesn't fall over at the Black Friday peak. The queue depth is your backpressure signal.
  • Async / background work. Anything the user doesn't need an immediate answer to: emails, thumbnails, exports, webhook delivery, re-indexing.
  • Failure isolation and retries. A flaky third-party API can be retried independently, on its own schedule, without failing the user's original request.
  • Fan-out. One event, many independent consumers. OrderPlaced feeds the email service, the analytics warehouse, the fraud check, and fulfilment — none of which know about each other.

If none of these apply, you probably don't need a queue. I'll come back to that.

Mental models: don't conflate them

This is where most teams get into trouble. "Queue," "log," and "pub/sub" are not synonyms, and picking the wrong one bakes pain into your architecture.

Traditional message queue (a work queue). RabbitMQ, AWS SQS, BullMQ on Redis. Messages are tasks. Multiple workers compete for them — each message goes to exactly one worker and is deleted once acknowledged. This is the model for background jobs. Think of it as a shared to-do list that many workers pull from.

Log. Kafka, Redis Streams. An append-only, ordered, durable sequence of records. Messages are not deleted when read — they stay for a retention window. Each consumer tracks its own offset, multiple independent consumer groups read the same log at their own pace, and you can rewind and replay from any point. Think of it as a journal you can re-read.

Pub/sub. This is a delivery pattern, not a storage model. A publisher emits to a topic and every current subscriber gets a copy. Classic pub/sub (plain Redis pub/sub, for instance) is fire-and-forget: if you weren't subscribed at that moment, you missed it. Kafka feels like "pub/sub with a memory" precisely because of retention.

The two storage models compare like this:

Work queue (SQS, RabbitMQ, BullMQ)Log (Kafka, Redis Streams)
Mental modelShared to-do listAppend-only journal
ConsumptionWorkers compete; one gets each messageEach consumer group reads every message
After readDeleted on ackRetained (by time or size)
ReplayNo — once it's gone, it's goneYes — rewind the offset
OrderingLimited or nonePer-partition
Best forBackground jobs, task distributionEvent streaming, audit trails, many readers

The classic mistakes: using Kafka as a task queue (painful — there's no per-message ack/redelivery and no easy "just delete this one poison message"), or using SQS as an event log (you can't replay, and a second consumer steals messages from the first instead of getting its own copy).

Delivery guarantees, and the "exactly-once" myth

There are three theoretical guarantees:

  • At-most-once. Deliver and forget. Fast, lossy. Fine for fire-and-forget metrics you can afford to drop.
  • At-least-once. Redeliver until acknowledged. The message will arrive — possibly more than once. This is what almost every real system runs on.
  • Exactly-once. Each message is processed once and only once.

Here's the uncomfortable truth: end-to-end exactly-once delivery across a network is, for practical purposes, marketing. Kafka's "exactly-once semantics" is real but narrow — it covers transactional reads and writes within the Kafka world (Kafka-to-Kafka). The moment your consumer charges a card or calls a third-party API, that guarantee evaporates, because the network can always drop your ack after the side effect happened, forcing a redelivery.

So you stop chasing exactly-once delivery and instead engineer exactly-once effect: idempotent consumers. Every consumer must be safe to run twice with the same message. There are two reliable ways to get there:

  • A dedupe key. Each message carries a stable id; you record processed ids and skip anything you've seen.
  • Idempotent writes. Design the write so that doing it twice is a no-op — an upsert on a unique key, or a conditional update.
ts
async function handlePayment(job) {  const { paymentId, orderId, amount } = job.data
  // ON CONFLICT DO NOTHING: 0 rows affected means we already saw this id  const result = await db.query(    'INSERT INTO processed_payments (payment_id) VALUES ($1) ON CONFLICT DO NOTHING',    [paymentId]  )
  if (result.rowCount === 0) {    return // duplicate delivery — safe to ack and move on  }
  // psp.charge is itself idempotent on the same key — belt and suspenders  await psp.charge({ idempotencyKey: paymentId, orderId, amount })}

A stable key plus a uniqueness constraint is how you turn at-least-once delivery into exactly-once effect. If you take one thing from this post: assume duplicates, and make them harmless.

Ordering is expensive — buy only what you need

Global, total ordering across a distributed queue means a single serialization point, which means you can't scale consumers horizontally. It's almost never what you actually need.

What you usually need is per-entity ordering: all events for account A processed in order, while account A and account B are processed fully in parallel.

Both major models give you exactly this via a key:

  • Kafka: the partition key. Same key goes to the same partition, and a partition is strictly ordered. Different keys spread across partitions for parallelism.
  • SQS FIFO: the MessageGroupId. Same group is ordered; different groups run concurrently.
ts
// Same key -> same partition -> ordered. Different keys run in parallel.await producer.send({  topic: 'account-events',  messages: [{ key: accountId, value: JSON.stringify(event) }],})

Pick a key with enough cardinality to spread load, but coarse enough to preserve the ordering you actually care about — usually the entity id (accountId, orderId). One gotcha: because ordering is per-partition, retries can reorder things. If message 1 fails and gets parked while message 2 succeeds, you've broken order. When order is sacred, you sometimes have to halt the whole partition on failure rather than skip ahead.

The dual-write problem and the Transactional Outbox

Here's a bug I've watched bite team after team. You want to save an order and publish an event:

ts
await db.orders.insert(order)               // 1. write to the databaseawait broker.publish('OrderPlaced', order)  // 2. publish to the broker// no shared transaction — a crash between these two lines is a lost event

These are two separate systems with no shared transaction. Four things can happen, and two of them are bad:

  1. Both succeed. Great.
  2. The DB write fails, so you never publish. Fine — consistent.
  3. The DB write succeeds, then the process crashes before publishing. The order exists, but no event is ever emitted. Consumers never hear about it. Lost event.
  4. You reorder to publish first, then the DB insert fails. Now there's an event for an order that doesn't exist. Phantom event.

This is the dual-write problem, and you cannot solve it by being careful. You cannot atomically write to your database and a message broker. Distributed transactions (two-phase commit) are the textbook answer, and they're operationally miserable — almost nobody runs them in anger.

The pattern that actually works is the Transactional Outbox. Instead of publishing inside your business transaction, you write the event to an outbox table in the same database transaction as your business data. One transaction, one database — genuinely atomic. A separate relay process then reads the outbox and publishes to the broker.

sql
CREATE TABLE outbox (  id            BIGSERIAL   PRIMARY KEY,  aggregate_id  TEXT        NOT NULL,   -- e.g. the order id  event_type    TEXT        NOT NULL,   -- e.g. 'OrderPlaced'  payload       JSONB       NOT NULL,  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),  published_at  TIMESTAMPTZ             -- NULL until the relay ships it);
-- a partial index keeps "find the unpublished rows" fastCREATE INDEX outbox_unpublished_idx  ON outbox (created_at)  WHERE published_at IS NULL;

The business write becomes one atomic transaction:

ts
// Either both rows commit, or neither does.await db.transaction(async (tx) => {  await tx.orders.insert(order)  await tx.outbox.insert({    aggregateId: order.id,    eventType: 'OrderPlaced',    payload: order,  })})

No lost events, no phantom events. Then a relay drains the table:

ts
// Relay: poll the outbox on a short interval (or use LISTEN/NOTIFY)const rows = await db.query(  'SELECT * FROM outbox WHERE published_at IS NULL ' +    'ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED')
for (const row of rows) {  // messageId gives the consumer a stable dedupe key  await broker.publish(row.event_type, row.payload, {    messageId: String(row.id),  })  await db.query('UPDATE outbox SET published_at = now() WHERE id = $1', [    row.id,  ])}

Note that the relay is itself at-least-once: it can crash after publishing but before the UPDATE, so the row gets published again on the next pass. That's fine — it's exactly why consumers must be idempotent, and why we pass a stable messageId. FOR UPDATE SKIP LOCKED lets you run several relay instances safely without them stepping on each other.

If you'd rather not run a polling relay at all, Change Data Capture tools like Debezium tail the database's write-ahead log and stream outbox inserts straight to Kafka. Same pattern, less of your code, more infrastructure to operate.

Retries, backoff, and dead-letter queues

At-least-once makes retries a first-class design concern, not an afterthought.

A poison message is one that will never succeed no matter how many times you retry — a malformed payload, a referenced row that's been deleted, a plain bug in the handler. Without a ceiling, it loops forever, burning CPU and head-of-line-blocking everything behind it.

You need three things:

  1. Max attempts — give up after N tries.
  2. Exponential backoff — don't hammer a struggling downstream. Wait 1s, 2s, 4s, 8s, and add jitter so a fleet of consumers doesn't retry in lockstep and create a thundering herd.
  3. A dead-letter queue (DLQ) — after the last attempt, move the message to a separate queue instead of dropping it. The DLQ is your inspection bench: alert on its depth, look at what landed there, fix the bug or the bad data, and replay.

Here's a BullMQ producer and worker wiring up all three:

ts
import { Queue, Worker } from 'bullmq'
const connection = { host: '127.0.0.1', port: 6379 }const checkoutQueue = new Queue('checkout', { connection })const deadLetter = new Queue('checkout-dlq', { connection })
// Producer: configure attempts + backoff when you enqueueawait checkoutQueue.add(  'send-receipt',  { orderId, email },  {    attempts: 5,    backoff: { type: 'exponential', delay: 1000 }, // 1s, 2s, 4s, 8s, 16s    removeOnComplete: true,    removeOnFail: false, // keep failed jobs so we can inspect them  })
// Consumer: a worker competing for jobs on the 'checkout' queueconst worker = new Worker(  'checkout',  async (job) => {    await sendReceiptEmail(job.data.orderId, job.data.email)  },  { connection, concurrency: 10 })
// After the FINAL attempt fails, park the job in a dead-letter queueworker.on('failed', async (job, err) => {  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {    await deadLetter.add('dead-letter', {      original: job.data,      failedReason: err.message,      attempts: job.attemptsMade,    })  }})

Tune attempts to the failure mode. A transient network blip deserves several retries; a 400 Bad Request from a third-party API will never succeed, so retrying is pointless — fail fast to the DLQ and stop wasting capacity.

Ack semantics and the visibility timeout

A subtle but critical detail: in an at-least-once system, a message is not deleted when it's delivered. It's deleted when it's acked. Between delivery and ack, the broker hides it from other consumers. SQS calls this the visibility timeout; other systems call it an ack deadline or a lease.

If your consumer takes longer than that timeout to finish, the broker assumes it died and redelivers the message to a different consumer — and now you're processing the same message twice, concurrently. This is one of the most common "wait, why did this run twice?!" production mysteries.

A few rules of thumb:

  • Set the visibility timeout comfortably above your p99 processing time.
  • For genuinely long jobs, periodically extend the lease (a heartbeat) rather than setting one enormous timeout up front.
  • Ack only after the work is durably done. Acking on receive quietly turns at-least-once into at-most-once, and you'll lose messages on the next crash.
  • And, once more: idempotency is the seatbelt for when redelivery happens anyway. It always eventually happens.

Events vs commands, and event-driven architecture

Once events are flowing, you can build your whole architecture around them. First, a distinction people constantly blur:

  • A command is a request to do something — ChargeCard, SendEmail. It's aimed at one handler and it can be rejected. Imperative, present tense.
  • An event is a statement that something already happened — CardCharged, OrderPlaced. It's named in the past tense, has no opinion about who's listening, and can't be rejected. It's history.

Work queues mostly carry commands; event logs mostly carry events. Getting the naming right (OrderPlaced, not PlaceOrder, for an event) keeps your system honest about who is coupled to whom.

Event-carried state transfer. A thin event like OrderPlaced { orderId } forces every consumer to call back to the order service for the details — a fan-in of synchronous calls that quietly rebuilds the coupling you were trying to escape. A fatter event like OrderPlaced { orderId, items, total, customer } carries enough state that consumers can act without a callback. The trade-off is payload size and staleness. I lean toward carrying the state consumers actually need.

Choreography vs orchestration. Two ways to run a multi-step workflow:

  • Choreography: each service listens for events and reacts, emitting its own. No central brain. OrderPlaced leads payment to charge and emit PaymentCaptured, which leads fulfilment to ship and emit OrderShipped. Loosely coupled — but the overall flow lives nowhere. It's emergent, and hard to see end to end.
  • Orchestration: a central coordinator (a saga or workflow engine) explicitly drives each step and runs compensating actions when one fails. Much easier to reason about and debug; the cost is reintroducing a central component.

I reach for choreography for simple fan-out, and orchestration the moment there's a multi-step transaction with rollback semantics. Trying to implement a complex saga purely through choreography is how you end up with a system nobody can explain on a whiteboard.

The real costs. Event-driven architecture is not free, and pretending otherwise is how projects get burned:

  • Eventual consistency. The read model lags the write. A user updates their profile and the search index catches up 200ms later. You have to design the UX around this — and explain it to product before, not after.
  • Debugging and tracing. One user action now fans out across services asynchronously. Without distributed tracing — propagate a trace/correlation id on every message — a failure becomes a needle in a haystack. At any real scale this is non-negotiable.
  • Schema evolution. Events are a contract, and in a log they're persisted — old events in their old shapes live forever. You must version events and evolve them compatibly: add optional fields, never repurpose or remove an existing one. A schema registry helps enforce it. Treat an event schema with the same seriousness as a public API, because to your consumers, that's exactly what it is.

When NOT to use a queue

Saying no is half of architecture.

  • You need an immediate, synchronous answer. Login, "is this username taken," a balance check before rendering a screen. The user is blocked on the result — a queue only adds latency and complexity.
  • Simple CRUD. A form that writes one row to one table. A broker buys you nothing here except a second system to operate, monitor, and debug at 3am.
  • You can't stomach the eventual-consistency UX and don't have the appetite to design around it.
  • Low volume, no spikes, no fan-out, no flaky third parties. If a direct call is reliable and fast enough, just make the direct call.

Queues trade synchronous simplicity for asynchronous resilience. That trade is a clear win for variable load, slow or fragile work, and fan-out. It's a tax everywhere else — every broker is one more thing to run, monitor, secure, and reason about. Add it when the problem demands it, not before.

Wrapping up

  • Get slow, fragile, fan-out work off the request path.
  • Pick the right model: a work queue for tasks, a log for replayable event streams. Don't conflate them.
  • Assume at-least-once and make every consumer idempotent — "exactly-once delivery" will not save you.
  • Buy only the ordering you need, keyed per entity.
  • Solve dual-writes with the Transactional Outbox, not with hope and careful ordering.
  • Treat retries, backoff, DLQs, and visibility timeouts as core design, not config you set once and forget.
  • Go event-driven with your eyes open: tracing, versioned schemas, and a real plan for eventual consistency.

Used well, a queue is how your system stays up when one piece is down. Used carelessly, it's a distributed way to lose data. The entire difference is in the details above.

Further Reading

← All postsShare on X
async function handlePayment(job) {  const { paymentId, orderId, amount } = job.data
  // ON CONFLICT DO NOTHING: 0 rows affected means we already saw this id  const result = await db.query(    'INSERT INTO processed_payments (payment_id) VALUES ($1) ON CONFLICT DO NOTHING',    [paymentId]  )
  if (result.rowCount === 0) {    return // duplicate delivery — safe to ack and move on  }
  // psp.charge is itself idempotent on the same key — belt and suspenders  await psp.charge({ idempotencyKey: paymentId, orderId, amount })}
// Same key -> same partition -> ordered. Different keys run in parallel.await producer.send({  topic: 'account-events',  messages: [{ key: accountId, value: JSON.stringify(event) }],})
await db.orders.insert(order)               // 1. write to the databaseawait broker.publish('OrderPlaced', order)  // 2. publish to the broker// no shared transaction — a crash between these two lines is a lost event
CREATE TABLE outbox (  id            BIGSERIAL   PRIMARY KEY,  aggregate_id  TEXT        NOT NULL,   -- e.g. the order id  event_type    TEXT        NOT NULL,   -- e.g. 'OrderPlaced'  payload       JSONB       NOT NULL,  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),  published_at  TIMESTAMPTZ             -- NULL until the relay ships it);
-- a partial index keeps "find the unpublished rows" fastCREATE INDEX outbox_unpublished_idx  ON outbox (created_at)  WHERE published_at IS NULL;
// Either both rows commit, or neither does.await db.transaction(async (tx) => {  await tx.orders.insert(order)  await tx.outbox.insert({    aggregateId: order.id,    eventType: 'OrderPlaced',    payload: order,  })})
// Relay: poll the outbox on a short interval (or use LISTEN/NOTIFY)const rows = await db.query(  'SELECT * FROM outbox WHERE published_at IS NULL ' +    'ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED')
for (const row of rows) {  // messageId gives the consumer a stable dedupe key  await broker.publish(row.event_type, row.payload, {    messageId: String(row.id),  })  await db.query('UPDATE outbox SET published_at = now() WHERE id = $1', [    row.id,  ])}
import { Queue, Worker } from 'bullmq'
const connection = { host: '127.0.0.1', port: 6379 }const checkoutQueue = new Queue('checkout', { connection })const deadLetter = new Queue('checkout-dlq', { connection })
// Producer: configure attempts + backoff when you enqueueawait checkoutQueue.add(  'send-receipt',  { orderId, email },  {    attempts: 5,    backoff: { type: 'exponential', delay: 1000 }, // 1s, 2s, 4s, 8s, 16s    removeOnComplete: true,    removeOnFail: false, // keep failed jobs so we can inspect them  })
// Consumer: a worker competing for jobs on the 'checkout' queueconst worker = new Worker(  'checkout',  async (job) => {    await sendReceiptEmail(job.data.orderId, job.data.email)  },  { connection, concurrency: 10 })
// After the FINAL attempt fails, park the job in a dead-letter queueworker.on('failed', async (job, err) => {  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {    await deadLetter.add('dead-letter', {      original: job.data,      failedReason: err.message,      attempts: job.attemptsMade,    })  }})