Cache Invalidation Is Still Hard

Phoenix · May 2, 2026 · 14 min read

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

Phil Karlton said that decades ago and it has only gotten truer. (The joke usually tacks on a third: off-by-one errors.) I have spent a good chunk of my career on fintech and high-throughput backends, and I can tell you the caching part is rarely what keeps me up at night. Reading a value out of Redis is easy. It is the invalidation — knowing when a cached value has gone stale and removing it at exactly the right moment — that quietly produces the worst production incidents.

Here is the thing nobody warns you about when you add your first cache: you are no longer maintaining one copy of the truth. You are maintaining two, and they will disagree. The entire discipline of caching is really the discipline of managing that disagreement.

Why we cache at all

Two reasons, basically:

  • Latency. A round trip to Postgres for a hot query might be 5–30 ms. A round trip to Redis is sub-millisecond, and an in-process cache is nanoseconds. When a single request fans out into dozens of lookups, those milliseconds compound fast.
  • Load. Your database is the most expensive, hardest-to-scale tier you own. A cache absorbs read traffic so the primary does not fall over at peak. I have watched a 95%+ hit rate turn a database that was redlining into one that was bored.
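
To put rough numbers on both effects, here is a back-of-the-envelope sketch. The helper names and the sample figures (1 ms cache, 20 ms database, 10,000 requests per second) are assumptions chosen for illustration, not measurements.

```typescript
// Hypothetical helper: expected read latency for a cache-aside lookup.
// A hit costs one cache round trip; a miss costs the cache round trip
// plus the database round trip.
function expectedLatencyMs(hitRate: number, cacheMs: number, dbMs: number): number {
  return hitRate * cacheMs + (1 - hitRate) * (cacheMs + dbMs)
}

// Hypothetical helper: reads per second that still reach the database.
function dbReadsPerSecond(hitRate: number, requestsPerSecond: number): number {
  return (1 - hitRate) * requestsPerSecond
}

// Assumed numbers, for illustration only.
const latency = expectedLatencyMs(0.95, 1, 20) // about 2 ms instead of 20 ms
const dbLoad = dbReadsPerSecond(0.95, 10000) // about 500 reads/s instead of 10,000
console.log(latency.toFixed(2), Math.round(dbLoad))
```

The second function is the one that matters at peak: at a 95% hit rate the database sees one read in twenty, and every point of hit rate you lose adds load back linearly.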

So caching buys you speed and shields your most precious resource. The bill comes due as a new question: how stale is too stale? That answer is a product decision disguised as an engineering one. A stock ticker and a user's display name have wildly different tolerances. Get the tolerance wrong and you either serve incorrect data (a correctness bug) or you cache nothing useful (a performance bug).

The patterns, and what they actually cost

People throw around "cache-aside" and "write-through" as though everyone agrees on the definitions. Here is how I use them.

Cache-aside (lazy loading). The application owns the cache. On read you check the cache; on a miss you load from the DB and populate the cache yourself. This is my default for almost everything, because it is simple and only ever caches data someone actually asked for.

ts
import Redis from 'ioredis'

const redis = new Redis(process.env.REDIS_URL)
const TTL_SECONDS = 60

async function getUser(userId: string): Promise<User> {
  const key = `user:${userId}`

  // 1. Try the cache first.
  const cached = await redis.get(key)
  if (cached !== null) {
    return JSON.parse(cached) as User
  }

  // 2. Miss: load from the source of truth.
  const user = await db.users.findById(userId)
  if (!user) {
    throw new NotFoundError(`user ${userId} not found`)
  }

  // 3. Populate the cache for next time, with a TTL.
  await redis.set(key, JSON.stringify(user), 'EX', TTL_SECONDS)
  return user
}

The write path is separate, and this is where it gets interesting:

ts
async function updateUserEmail(userId: string, email: string): Promise<void> {
  // 1. Write the source of truth first.
  await db.users.update(userId, { email })

  // 2. Then invalidate the cache. Delete, don't overwrite (more on that later).
  await redis.del(`user:${userId}`)
}

Read-through. Same read flow, but the cache layer (or a wrapper around it) handles the DB load on a miss, so the application just asks the cache and never touches the database directly. It is cache-aside with the loading logic pushed down behind an interface — cleaner call sites, less control.

Write-through. Writes go through the cache, which synchronously writes to the DB before returning. The cache stays consistent with the DB for the keys it holds, at the cost of write latency (you pay for two writes) and the fact that you cache things that may never be read.
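
Read-through and write-through are easiest to see side by side. Below is a minimal sketch, assuming a hypothetical `Backing` interface for the source of truth and an in-memory `Map` standing in for Redis so the example runs on its own. TTLs are left out for brevity; a real version would keep them.

```typescript
// Hypothetical interface for the source of truth (not a real library API).
interface Backing<T> {
  load(key: string): Promise<T | undefined>
  save(key: string, value: T): Promise<void>
}

// A cache that owns both the read path and the write path.
class ThroughCache<T> {
  private readonly entries = new Map<string, T>()
  private readonly backing: Backing<T>

  constructor(backing: Backing<T>) {
    this.backing = backing
  }

  // Read-through: callers only talk to the cache; it loads on a miss.
  async get(key: string): Promise<T | undefined> {
    const hit = this.entries.get(key)
    if (hit !== undefined) return hit

    const loaded = await this.backing.load(key)
    if (loaded !== undefined) this.entries.set(key, loaded)
    return loaded
  }

  // Write-through: persist first, then update the cached copy, and only
  // then return. The caller pays for both writes.
  async set(key: string, value: T): Promise<void> {
    await this.backing.save(key, value)
    this.entries.set(key, value)
  }
}
```

Call sites shrink to `cache.get(key)` and `cache.set(key, value)`, which is the "cleaner call sites, less control" trade: the application can no longer decide per call whether to skip the cache.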

Write-behind (write-back). Writes hit the cache and return immediately; the cache flushes to the DB asynchronously. Fast writes, great for absorbing bursts, but you are now risking data loss if the cache dies before it flushes. I reach for this rarely, and only with a durable queue behind it. In fintech, "we lost the write" is not a sentence you ever want to say.
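
A minimal sketch of the write-behind idea, to show where the risk sits. The class and method names are hypothetical, and in-memory `Map`s stand in for both the cache and the pending-write buffer.

```typescript
// Write-behind with coalescing: repeated writes to one key flush once.
class WriteBehindBuffer<T> {
  private readonly cache = new Map<string, T>()
  private readonly dirty = new Map<string, T>() // latest unflushed value per key

  // Fast path: update the cache, remember the key is dirty, return at once.
  write(key: string, value: T): void {
    this.cache.set(key, value)
    this.dirty.set(key, value)
  }

  read(key: string): T | undefined {
    return this.cache.get(key)
  }

  pendingCount(): number {
    return this.dirty.size
  }

  // Called on a timer or by a queue worker. Entries leave the dirty set only
  // after persist() resolves, so a failed flush is retried on the next call.
  async flush(persist: (batch: Array<[string, T]>) => Promise<void>): Promise<number> {
    const batch = Array.from(this.dirty.entries())
    if (batch.length === 0) return 0
    await persist(batch)
    for (const [key, value] of batch) {
      // Keep the key dirty if it was overwritten while the flush was in flight.
      if (this.dirty.get(key) === value) this.dirty.delete(key)
    }
    return batch.length
  }
}
```

Everything still sitting in `dirty` when the process dies is gone, which is exactly the data-loss window described above and the reason to back the buffer with a durable queue.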

Refresh-ahead. The cache proactively refreshes hot keys before they expire, based on access patterns. When it works, users never feel a miss. When it mispredicts, you waste work refreshing things nobody wanted. Useful for a small, predictably-hot set of keys.
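
The "when it mispredicts" part comes down to the refresh decision. Here is one way that decision might look, as a sketch: the `EntryMeta` shape and both thresholds (the last 20% of the TTL, at least 10 hits) are assumptions for illustration, not recommendations.

```typescript
// Hypothetical metadata kept alongside a cached entry.
interface EntryMeta {
  storedAtMs: number // when the value was written
  ttlMs: number // full lifetime of the entry
  hitsSinceStore: number // reads served since it was written
}

// Refresh early only when the entry is both nearly expired and actually hot.
// refreshWindow = 0.2 means "within the last 20% of the TTL"; minHits guards
// against wasting work on keys nobody reads.
function shouldRefreshAhead(
  meta: EntryMeta,
  nowMs: number,
  refreshWindow = 0.2,
  minHits = 10,
): boolean {
  const remainingMs = meta.storedAtMs + meta.ttlMs - nowMs
  if (remainingMs <= 0) return false // already expired: this is a plain miss
  const nearlyExpired = remainingMs <= meta.ttlMs * refreshWindow
  return nearlyExpired && meta.hitsSinceStore >= minHits
}
```

Raising `minHits` trades a few user-visible misses on lukewarm keys for less wasted refresh work.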

My rule of thumb: start with cache-aside plus a TTL. Reach for the others only when a specific number forces you to.

Invalidation strategies

There are fewer real options here than the blogosphere implies.

TTL / expiry. Every key gets a time-to-live; when it expires, the next read repopulates it. This is the most underrated strategy in the entire field. A short TTL makes the system self-healing: even if every other invalidation mechanism you built has a bug, the data is wrong for at most N seconds and then fixes itself. I put TTLs on everything, even when I also invalidate explicitly. The TTL is your backstop.

ts
// One atomic command: set value AND expiry together.
await redis.set('config:homepage', payload, 'EX', 30)

// SETEX is the older, equivalent form (note the order: key, ttl, value).
await redis.setex('config:homepage', 30, payload)

// Never do this: a crash between the two leaves a key with no TTL, forever.
await redis.set('config:homepage', payload)
await redis.expire('config:homepage', 30)

Explicit delete-on-write. When you write the DB, you delete the affected cache key. Precise and immediate — when it works. The catch is that you have to know every key affected by a write, which is harder than it sounds once you start caching derived or aggregated data.
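
One way to keep track of "every key affected by a write" is to record the dependencies when you populate each key. The sketch below is a hypothetical in-process registry; the class name and key names are made up, and in a multi-process deployment the registry itself would have to live somewhere shared.

```typescript
// Hypothetical registry: which cache keys were built from which entity.
class KeyDependencies {
  private readonly keysByEntity = new Map<string, Set<string>>()

  // Call when populating a cache key, listing every entity it was derived from.
  track(cacheKey: string, entities: string[]): void {
    for (const entity of entities) {
      const keys = this.keysByEntity.get(entity) ?? new Set<string>()
      keys.add(cacheKey)
      this.keysByEntity.set(entity, keys)
    }
  }

  // Call after writing an entity: returns every key to delete, then forgets them.
  keysToInvalidate(entity: string): string[] {
    const keys = this.keysByEntity.get(entity)
    this.keysByEntity.delete(entity)
    return keys ? Array.from(keys) : []
  }
}

const deps = new KeyDependencies()
deps.track('user:42', ['user:42'])
deps.track('team:7:members', ['user:42', 'user:43']) // an aggregate
deps.track('dashboard:7', ['user:42', 'team:7']) // a derived view

// A write to user 42 must clear the row's key and both derived keys.
const stale = deps.keysToInvalidate('user:42')
console.log(stale)
```

The registry is only as good as the `track` calls: one derived key populated without registering its inputs is a stale read waiting to happen, which is why the TTL stays on as a backstop.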

Write-through. Covered above. Invalidation is implicit because the cache is updated as part of the write itself.

Versioned / namespaced keys. This one is a superpower for invalidating groups of keys without scanning. Instead of deleting a thousand keys, you embed a version number in the key and bump it.

ts
// The version key acts as a generation counter for a whole group.
async function tenantCacheKey(tenantId: string, resource: string) {
  const version = (await redis.get(`tenant:${tenantId}:ver`)) ?? '0'
  return `tenant:${tenantId}:v${version}:${resource}`
}

// Invalidate EVERYTHING for a tenant in O(1): just bump the version.
async function invalidateTenant(tenantId: string) {
  await redis.incr(`tenant:${tenantId}:ver`)
  // Old keys are now orphaned and will expire on their own via TTL.
}

When you INCR the version, every old key is instantly orphaned — no KEYS, no SCAN, no blocking the server. The orphans expire on their own. I use this for "invalidate everything for tenant X" semantics constantly.

A quick warning: never run KEYS * against a production Redis to find keys to delete. It is O(N) and blocks the single-threaded server while it walks the entire keyspace. Versioned keys exist precisely so you never have to.

The cache-aside race that will bite you

Here is the bug I see most often, and it is subtle. The standard cache-aside write is "update the DB, then delete the cache key." It looks correct. It is not, under concurrency.

Picture two requests interleaving:

T1  Reader: GET user:42            ->  MISS
T2  Reader: SELECT ... FROM users  ->  reads the OLD value
T3  Writer: UPDATE users SET ...   ->  writes the NEW value
T4  Writer: DEL user:42            ->  cache is now empty
T5  Reader: SET user:42 = OLD      ->  stale value resurrected

The reader did its DB read at T2 (old value), got descheduled, and only got around to populating the cache at T5 — after the writer had already deleted the key. Now the cache holds the old value, the DB holds the new value, and nothing fixes it until the TTL expires. With no TTL, it is wrong forever.
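
The interleaving is easy to replay deterministically. In this sketch two in-memory `Map`s stand in for the database and the cache (an assumption made so the example runs without Redis), and the steps are executed in the T1 to T5 order from the timeline.

```typescript
// In-memory stand-ins so the interleaving can be replayed step by step.
const db = new Map<string, string>([['user:42', 'old@example.com']])
const cache = new Map<string, string>()

// T1  Reader: cache miss
const readerSawHit = cache.has('user:42')

// T2  Reader: reads the OLD value from the database, then stalls
const readerCopy = db.get('user:42')

// T3  Writer: writes the NEW value
db.set('user:42', 'new@example.com')

// T4  Writer: deletes the cache key (a no-op here, the key was never set)
cache.delete('user:42')

// T5  Reader: wakes up and populates the cache with its stale copy
if (!readerSawHit && readerCopy !== undefined) cache.set('user:42', readerCopy)

console.log(db.get('user:42')) // new@example.com
console.log(cache.get('user:42')) // old@example.com
```

Every individual step is correct on its own; only the ordering is wrong, which is why this bug survives code review.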

Mitigations, roughly in the order I reach for them:

  • Delete, don't update. Always delete the key on a write rather than writing the new value into it. Two concurrent writers that both set can land in either order; deletion forces the next read to reload from the source of truth and eliminates a whole separate class of write-write races.
  • Short TTL as a backstop. The race window leaves stale data, but a 30–60 s TTL caps the damage. For the overwhelming majority of systems this alone is "good enough." Be honest with yourself about whether you actually need better.
  • Delayed double-delete. Do the write, delete the key, then schedule a second delete a second or two later, long enough to clobber any stale value a slow reader repopulated mid-flight. It is a hack, but a well-known and effective one.
ts
async function updateUserEmail(userId: string, email: string): Promise<void> {
  const key = `user:${userId}`

  await db.users.update(userId, { email })
  await redis.del(key) // first delete, right after the write

  // Second delete a moment later, to clobber any stale value a slow
  // concurrent reader repopulated mid-write. A naive in-process timer is
  // shown here; in production put this on a durable delay queue so it
  // survives a restart.
  setTimeout(() => {
    // Best effort: swallow errors so a Redis blip cannot surface as an
    // unhandled rejection. The TTL remains the backstop.
    redis.del(key).catch(() => {})
  }, 1000)
}
  • Write-through / coordinated loads. If you genuinely cannot tolerate the window, route reads and writes through a single coordinated path (write-through, or single-flight loads with locking — see below). You trade simplicity for consistency.

I want to be blunt here: there is no perfectly-consistent, simple, and fast cache. The cache and the DB are two systems, and keeping two systems perfectly in sync is the distributed-systems problem. Anyone who tells you their Redis cache layer is "strongly consistent" is either running something much heavier under the hood or has not hit the race yet.

Cache stampede (the thundering herd)

You have one extremely hot key — say a homepage config read on every request. It has a TTL. The instant it expires, the thousand requests in flight all miss simultaneously and all fire the identical expensive query at the database. A database that was serving essentially zero load for this key suddenly takes a thousand copies of the same query in the same millisecond. That is a stampede, and it can knock over a database that was nowhere near capacity.

Three mitigations, and I usually combine them.

Single-flight / per-key locking. Only one request is allowed to recompute a given key; the rest wait for that one result. Here is a sketch using Redis SET NX as the lock:

ts
async function getWithSingleFlight<T>(
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const cached = await redis.get(key)
  if (cached !== null) return JSON.parse(cached) as T

  const lockKey = `lock:${key}`
  // SET ... NX = only one caller wins the lock. PX gives the lock its own
  // TTL so a crashed holder can't block everyone forever.
  const gotLock = await redis.set(lockKey, '1', 'PX', 5000, 'NX')

  if (!gotLock) {
    // Someone else is loading. Back off briefly, then read the cache again.
    await new Promise((r) => setTimeout(r, 50 + Math.random() * 50))
    return getWithSingleFlight(key, ttlSeconds, load)
  }

  try {
    const value = await load() // the ONE database hit for this key
    await redis.set(key, JSON.stringify(value), 'EX', ttlSeconds)
    return value
  } finally {
    // Caveat: if load() outlives the 5 s lock TTL, this DEL can remove a
    // lock that another caller now holds. Storing a per-holder token and
    // checking it at release closes that gap.
    await redis.del(lockKey)
  }
}
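
The Redis lock coordinates across processes. Inside a single process you can go one step cheaper and let concurrent callers share one in-flight promise, which is the idea behind Go's singleflight package. A minimal sketch, with hypothetical names:

```typescript
// In-flight loads keyed by cache key. Concurrent callers share one promise.
const inFlight = new Map<string, Promise<unknown>>()

function singleFlight<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key)
  if (existing !== undefined) return existing as Promise<T>

  // First caller: start the load and remember the promise until it settles.
  const started = load().finally(() => {
    inFlight.delete(key)
  })
  inFlight.set(key, started)
  return started
}
```

This only deduplicates within one Node process, so with N app servers the database can still see up to N concurrent loads per key. Layering it in front of the Redis lock keeps most waiters off Redis entirely.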

Jittered TTLs. If you set a fixed TTL on a batch of keys created together, they expire together and stampede together. Add randomness: instead of exactly 300 s, use 300 s plus or minus a random spread. This smears expirations across a window so no single instant sees a flood.

ts
// A fixed TTL makes keys born together die together. Spread them out.
function jitteredTtl(baseSeconds: number, spread = 0.2): number {
  const delta = baseSeconds * spread
  return Math.round(baseSeconds - delta + Math.random() * 2 * delta)
}

// 300s base, +/- 20% -> each key expires somewhere in 240..360s.
await redis.set(key, payload, 'EX', jitteredTtl(300))

Stale-while-revalidate. Track two notions of freshness: a soft expiry and a hard expiry. Past the soft expiry you serve the stale value immediately and kick off a background refresh; you only block on a recompute past the hard expiry. Users get a fast (slightly stale) response, and exactly one background task does the work. This is my favorite for read-heavy endpoints where a few seconds of staleness is invisible.

ts
type Entry<T> = { value: T; softExpireAt: number }

async function staleWhileRevalidate<T>(
  key: string,
  softTtlMs: number,
  hardTtlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const raw = await redis.get(key)

  if (raw !== null) {
    const entry = JSON.parse(raw) as Entry<T>
    if (Date.now() > entry.softExpireAt) {
      // Stale but usable: refresh in the background, return stale now.
      // (In production, wrap refresh() in the single-flight lock above so
      //  only ONE background refresh runs per key.)
      // Swallow errors: a failed background refresh must not become an
      // unhandled rejection. The stale value is served either way.
      refresh(key, softTtlMs, hardTtlSeconds, load).catch(() => {})
    }
    return entry.value
  }

  // Hard miss: we have to block and load.
  return refresh(key, softTtlMs, hardTtlSeconds, load)
}

async function refresh<T>(
  key: string,
  softTtlMs: number,
  hardTtlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const value = await load()
  const entry: Entry<T> = { value, softExpireAt: Date.now() + softTtlMs }
  // The hard TTL on the Redis key is the absolute backstop.
  await redis.set(key, JSON.stringify(entry), 'EX', hardTtlSeconds)
  return value
}

The other failure modes

Cache penetration. Cache-aside only caches hits. If someone repeatedly requests an ID that does not exist, every request misses the cache and pounds the DB — a great way to get scraped or attacked. The fix is negative caching: cache the "not found" result too, with a short TTL, so repeated lookups for a missing key are absorbed.

ts
const NOT_FOUND = '__null__'

async function getUserSafe(userId: string): Promise<User | null> {
  const key = `user:${userId}`
  const cached = await redis.get(key)

  if (cached === NOT_FOUND) return null // a cached miss
  if (cached !== null) return JSON.parse(cached) as User

  const user = await db.users.findById(userId)
  if (!user) {
    // Cache the absence, but briefly, since the row may appear later.
    await redis.set(key, NOT_FOUND, 'EX', 10)
    return null
  }

  await redis.set(key, JSON.stringify(user), 'EX', 60)
  return user
}

(For very high-volume penetration with random keys, a Bloom filter in front of the cache is the heavier-duty answer, but negative caching handles the common case.)
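
For readers who have not met one, here is a minimal Bloom filter sketch. The sizing (8192 bits, 4 hashes), the seeded FNV-1a-style hash, and the class itself are illustrative assumptions; a production setup would more likely use an existing library or a Redis module than hand-rolled code.

```typescript
// A minimal Bloom filter: no false negatives, some false positives.
class BloomFilter {
  private readonly bits: Uint8Array
  private readonly size: number
  private readonly hashCount: number

  constructor(size: number, hashCount: number) {
    this.size = size
    this.hashCount = hashCount
    this.bits = new Uint8Array(Math.ceil(size / 8))
  }

  // FNV-1a-style 32-bit hash, with a seed mixed into the starting value.
  private hash(value: string, seed: number): number {
    let h = (0x811c9dc5 ^ seed) >>> 0
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i)
      h = Math.imul(h, 0x01000193) >>> 0
    }
    return h
  }

  // Double hashing: position i = (h1 + i * h2) mod size.
  private positions(value: string): number[] {
    const h1 = this.hash(value, 0)
    const h2 = (this.hash(value, 0x9e3779b9) | 1) >>> 0 // odd, so never zero
    const out: number[] = []
    for (let i = 0; i < this.hashCount; i++) {
      out.push((h1 + i * h2) % this.size)
    }
    return out
  }

  add(value: string): void {
    for (const p of this.positions(value)) {
      this.bits[p >> 3] |= 1 << (p & 7)
    }
  }

  // false = definitely never added; true = possibly added.
  mightContain(value: string): boolean {
    return this.positions(value).every((p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0)
  }
}

// Load every known ID at startup (and on insert), then reject unknown IDs
// before they reach the cache or the database.
const knownIds = new BloomFilter(8192, 4)
for (const id of ['1', '2', '42']) knownIds.add(id)
console.log(knownIds.mightContain('42')) // true
```

A `false` answer is safe to short-circuit on; a `true` answer still has to go through the normal cache and database path, because it may be a false positive.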

Cache avalanche. This is the stampede's big brother: a large number of keys expire at nearly the same moment — usually because you populated them together (a cache warm-up, a deploy, a mass import) with identical TTLs — and the DB gets buried. The fix is the same jitter from above, applied at scale, plus never flushing your entire cache at once. Treat "all keys expire simultaneously" as the self-inflicted outage it is.

What not to cache. Caching is not free; each cached thing is a consistency liability. I do not cache: data that changes on nearly every read, anything where a stale read is genuinely dangerous (authorization decisions, an account balance at the moment of a transaction), and data that is already cheap to fetch. If a query is 0.5 ms and runs rarely, a cache just adds a second system that can be wrong.

Redis specifics worth knowing

The mechanics matter because they shape your design.

  • Setting TTLs. SET key val EX 60 sets the value and expiry in one atomic command; SETEX is the older equivalent. Prefer the single command so you never leave a key without a TTL after a crash between SET and EXPIRE.
  • Eviction policy. Redis is memory-bound, and maxmemory-policy decides what happens when it fills up. The big ones: allkeys-lru (evict least-recently-used across all keys, a sane default for a pure cache), allkeys-lfu (least-frequently-used, better when you have a stable hot set), and the volatile-* variants (only evict keys that have a TTL set). I run dedicated cache instances as allkeys-lru; the volatile-* policies bite you when you forget to set TTLs: once memory fills with no-TTL keys that Redis will not evict, writes start failing with OOM errors.
bash
# redis.conf: treat a cache instance as disposable memory.
# (Comments go on their own line; redis.conf does not accept trailing comments.)
maxmemory 4gb
# Evict least-recently-used across ALL keys.
maxmemory-policy allkeys-lru

# Inspect hit/miss + eviction from the server's own counters:
redis-cli INFO stats | grep -E 'keyspace_hits|keyspace_misses|evicted_keys'
  • Design for keys vanishing. This is the mental model that fixes most bugs: under maxmemory eviction, any key can disappear at any time, TTL or not. Your code must treat every cache read as "might be a miss" and degrade to the source of truth gracefully. A cache is an optimization, never a system of record. If your app breaks when a key is missing, you did not build a cache — you built a fragile primary database with extra steps.
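
"Degrade to the source of truth gracefully" can be enforced in one small wrapper. This is a sketch under assumptions: `CacheLike` is a hypothetical minimal interface (any client exposing async get/set would fit), and TTL handling is omitted to keep the shape visible.

```typescript
// Hypothetical minimal cache interface.
interface CacheLike {
  get(key: string): Promise<string | null>
  set(key: string, value: string): Promise<void>
}

// Treat every cache failure as a miss: the source of truth always answers.
async function readWithFallback<T>(
  cache: CacheLike,
  key: string,
  load: () => Promise<T>,
): Promise<T> {
  try {
    const raw = await cache.get(key)
    if (raw !== null) return JSON.parse(raw) as T
  } catch {
    // Cache down or value unparseable: fall through to the source of truth.
  }

  const value = await load()
  // Best-effort repopulation; a failed cache write must not fail the read.
  await cache.set(key, JSON.stringify(value)).catch(() => undefined)
  return value
}
```

With this in place an evicted key, a flushed instance, and a Redis outage all look the same to the caller: slower, but correct.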

You cannot trust a cache you cannot measure

The single highest-leverage thing you can add is hit-rate metrics. Count hits and misses, emit the ratio, and alarm on it.

ts
async function getUserMeasured(userId: string): Promise<User> {
  const key = `user:${userId}`
  const cached = await redis.get(key)

  if (cached !== null) {
    metrics.increment('cache.hit', { cache: 'user' }) // statsd/Datadog-style client
    return JSON.parse(cached) as User
  }

  metrics.increment('cache.miss', { cache: 'user' })
  const user = await db.users.findById(userId)
  if (!user) {
    // Don't cache a missing row as if it were a user.
    throw new NotFoundError(`user ${userId} not found`)
  }
  await redis.set(key, JSON.stringify(user), 'EX', 60)
  return user
}

// hit rate = hits / (hits + misses). Put it on a dashboard and alarm on it.

A hit rate that quietly drops from 96% to 70% is often the first sign of a real problem: a bad deploy that changed a key format, a TTL set too low, an eviction storm, or a new code path bypassing the cache entirely. Without the metric you discover it as a database-CPU page at 3 a.m. With it, you catch it on a dashboard. Also watch evicted_keys from INFO stats and your memory usage — a rising eviction rate means your working set no longer fits and your hit rate is about to fall off a cliff.
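
If you have no application-side counters yet, the server's own numbers are a reasonable start. The sketch below parses text in the `field:value` shape that INFO stats returns; the function name is hypothetical and the sample numbers are made up.

```typescript
// Parse "field:value" lines from INFO stats output and compute the hit rate.
function hitRateFromInfo(info: string): number | null {
  const counters = new Map<string, number>()
  for (const line of info.split(/\r?\n/)) {
    const sep = line.indexOf(':')
    if (sep <= 0 || line.startsWith('#')) continue
    const value = Number(line.slice(sep + 1))
    if (!Number.isNaN(value)) counters.set(line.slice(0, sep), value)
  }
  const hits = counters.get('keyspace_hits')
  const misses = counters.get('keyspace_misses')
  if (hits === undefined || misses === undefined || hits + misses === 0) return null
  return hits / (hits + misses)
}

// Sample text in the shape INFO stats uses; the numbers are invented.
const sample = '# Stats\r\nkeyspace_hits:9600\r\nkeyspace_misses:400\r\nevicted_keys:12\r\n'
console.log(hitRateFromInfo(sample)) // 0.96
```

These counters are cumulative since the server started, so for alerting you would compare two samples and compute the rate over the difference, otherwise a long uptime hides a recent drop.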

Wrapping up

Caching itself is a solved problem; invalidation is where the engineering lives. My honest, hard-won defaults:

  1. Cache-aside plus a TTL for almost everything. The TTL is non-negotiable — it is your backstop against every invalidation bug you have not found yet.
  2. Delete keys on write, and add a delayed second delete when the consistency window actually matters.
  3. Single-flight your hot keys and jitter your TTLs before you have a stampede, not after.
  4. Cache your negatives, design for keys disappearing, and measure your hit rate.

And accept the core truth: you are keeping two systems in sync, so perfect consistency and perfect performance cannot both be free. Pick your staleness budget deliberately, write it down, and let the TTL enforce it. That mindset has spared me far more incidents than any clever invalidation scheme ever did.

Further reading

  • Redis docs — Key eviction: the authoritative reference on the LRU/LFU approximations and the volatile-* vs allkeys-* modes (redis.io/docs).
  • The classic stampede write-ups: search for "thundering herd" plus "single flight" — the patterns here are decades old and worth reading in the original (the Go singleflight package and Facebook's "leases" memcache paper are good starting points).
  • Your own metrics. The most useful caching article you will ever read is your hit-rate dashboard during an incident.