Cache Invalidation Is Still Hard
"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton
Phil Karlton said that decades ago and it has only gotten truer. (The joke usually tacks on a third: off-by-one errors.) I have spent a good chunk of my career on fintech and high-throughput backends, and I can tell you the caching part is rarely what keeps me up at night. Reading a value out of Redis is easy. It is the invalidation — knowing when a cached value has gone stale and removing it at exactly the right moment — that quietly produces the worst production incidents.
Here is the thing nobody warns you about when you add your first cache: you are no longer maintaining one copy of the truth. You are maintaining two, and they will disagree. The entire discipline of caching is really the discipline of managing that disagreement.
Why we cache at all
Two reasons, basically:
- Latency. A round trip to Postgres for a hot query might be 5–30 ms. A round trip to Redis is sub-millisecond, and an in-process cache is nanoseconds. When a single request fans out into dozens of lookups, those milliseconds compound fast.
- Load. Your database is the most expensive, hardest-to-scale tier you own. A cache absorbs read traffic so the primary does not fall over at peak. I have watched a 95%+ hit rate turn a database that was redlining into one that was bored.
So caching buys you speed and shields your most precious resource. The bill comes due as a new question: how stale is too stale? That answer is a product decision disguised as an engineering one. A stock ticker and a user's display name have wildly different tolerances. Get the tolerance wrong and you either serve incorrect data (a correctness bug) or you cache nothing useful (a performance bug).
The patterns, and what they actually cost
People throw around "cache-aside" and "write-through" as though everyone agrees on the definitions. Here is how I use them.
Cache-aside (lazy loading). The application owns the cache. On read you check the cache; on a miss you load from the DB and populate the cache yourself. This is my default for almost everything, because it is simple and only ever caches data someone actually asked for.
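The write path is separate, and this is where it gets interesting: the application writes to the DB and then has to invalidate the cached copy itself, which is exactly where the race described later comes from.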
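To make the two paths concrete, here is a minimal sketch of cache-aside in Python. It assumes `r` is a redis-py-style client (anything with `get`, `set(..., ex=...)`, and `delete`), and `db.load_user` / `db.save_user` are hypothetical accessors standing in for your real data layer. The key format and the 60-second TTL are illustrative choices, not recommendations.

```python
import json

TTL_SECONDS = 60  # assumed staleness budget; tune per data type


def get_user(r, db, user_id):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    user = db.load_user(user_id)  # hypothetical DB accessor
    if user is not None:
        # SET with EX writes the value and its TTL in a single command.
        r.set(key, json.dumps(user), ex=TTL_SECONDS)
    return user


def update_user(r, db, user_id, fields):
    """Cache-aside write: update the source of truth, then drop the key."""
    db.save_user(user_id, fields)  # hypothetical DB accessor
    r.delete(f"user:{user_id}")
```

Note that the read path only populates the cache after a miss, so nothing is cached until someone asks for it, and a vanished key simply costs one extra DB read.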
Read-through. Same read flow, but the cache layer (or a wrapper around it) handles the DB load on a miss, so the application just asks the cache and never touches the database directly. It is cache-aside with the loading logic pushed down behind an interface — cleaner call sites, less control.
Write-through. Writes go through the cache, which synchronously writes to the DB before returning. The cache stays consistent with the DB for the keys it holds, at the cost of write latency (you pay for two writes) and the fact that you cache things that may never be read.
Write-behind (write-back). Writes hit the cache and return immediately; the cache flushes to the DB asynchronously. Fast writes, great for absorbing bursts, but you are now risking data loss if the cache dies before it flushes. I reach for this rarely, and only with a durable queue behind it. In fintech, "we lost the write" is not a sentence you ever want to say.
Refresh-ahead. The cache proactively refreshes hot keys before they expire, based on access patterns. When it works, users never feel a miss. When it mispredicts, you waste work refreshing things nobody wanted. Useful for a small, predictably-hot set of keys.
My rule of thumb: start with cache-aside plus a TTL. Reach for the others only when a specific number forces you to.
Invalidation strategies
There are fewer real options here than the blogosphere implies.
TTL / expiry. Every key gets a time-to-live; when it expires, the next read repopulates it. This is the most underrated strategy in the entire field. A short TTL makes the system self-healing: even if every other invalidation mechanism you built has a bug, the data is wrong for at most N seconds and then fixes itself. I put TTLs on everything, even when I also invalidate explicitly. The TTL is your backstop.
Explicit delete-on-write. When you write the DB, you delete the affected cache key. Precise and immediate — when it works. The catch is that you have to know every key affected by a write, which is harder than it sounds once you start caching derived or aggregated data.
Write-through. Covered above. Invalidation is implicit because the cache is updated as part of the write itself.
Versioned / namespaced keys. This one is a superpower for invalidating groups of keys without scanning. Instead of deleting a thousand keys, you embed a version number in the key and bump it.
When you INCR the version, every old key is instantly orphaned — no KEYS, no SCAN, no blocking the server. The orphans expire on their own. I use this for "invalidate everything for tenant X" semantics constantly.
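A rough sketch of the idea, assuming a redis-py-style client `r` and a key layout I made up for illustration (`tenant:{id}:ver` for the counter, `tenant:{id}:v{n}:{name}` for the data keys):

```python
def versioned_key(r, tenant_id, name):
    """Build a cache key that embeds the tenant's current version."""
    # A missing counter is treated as version 0.
    version = int(r.get(f"tenant:{tenant_id}:ver") or 0)
    return f"tenant:{tenant_id}:v{version}:{name}"


def invalidate_tenant(r, tenant_id):
    """Orphan every cached key for the tenant by bumping its version."""
    # INCR is atomic and creates the counter at 1 if it does not exist.
    return r.incr(f"tenant:{tenant_id}:ver")
```

The trade-off is one extra read of the version counter per lookup, and the data keys must be written with a TTL, otherwise the orphans never go away.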
A quick warning: never run KEYS * against a production Redis to find keys to delete. It is O(N) and blocks the single-threaded server while it walks the entire keyspace. Versioned keys exist precisely so you never have to.
The cache-aside race that will bite you
Here is the bug I see most often, and it is subtle. The standard cache-aside write is "update the DB, then delete the cache key." It looks correct. It is not, under concurrency.
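Picture two requests, a reader and a writer, interleaving:
- T1: the reader misses the cache.
- T2: the reader reads the old value from the DB.
- T3: the writer updates the DB to the new value.
- T4: the writer deletes the cache key.
- T5: the reader writes the old value it read at T2 into the cache.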
The reader did its DB read at T2 (old value), got descheduled, and only got around to populating the cache at T5 — after the writer had already deleted the key. Now the cache holds the old value, the DB holds the new value, and nothing fixes it until the TTL expires. With no TTL, it is wrong forever.
Mitigations, roughly in the order I reach for them:
- Delete, don't update. Always delete the key on a write rather than writing the new value into it. Two concurrent writers that both set can land in either order; deletion forces the next read to reload from the source of truth and eliminates a whole separate class of write-write races.
- Short TTL as a backstop. The race window leaves stale data, but a 30–60 s TTL caps the damage. For the overwhelming majority of systems this alone is "good enough." Be honest with yourself about whether you actually need better.
- Delayed double-delete. Delete the key, do the write, then schedule a second delete a second or two later — long enough to clobber any stale value a slow reader repopulated mid-flight. It is a hack, but a well-known and effective one.
- Write-through / coordinated loads. If you genuinely cannot tolerate the window, route reads and writes through a single coordinated path (write-through, or single-flight loads with locking — see below). You trade simplicity for consistency.
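The delayed double-delete from the list above can be sketched in a few lines. This version assumes a redis-py-style client `r` and a hypothetical `db_write` callable that performs the actual write. It uses an in-process `threading.Timer` for the second delete, which is my simplification: the timer is lost if the process dies, so a real system would more likely schedule the second delete through a durable queue.

```python
import threading


def write_with_double_delete(r, db_write, key, delay_seconds=1.0):
    """Delete the key, write the DB, then delete again after a delay."""
    r.delete(key)   # first delete, before the write
    db_write()      # hypothetical callable that updates the source of truth
    # Second delete clobbers any stale value a slow reader put back
    # while the write was in flight.
    timer = threading.Timer(delay_seconds, r.delete, args=(key,))
    timer.daemon = True
    timer.start()
    return timer
```

The delay has to be longer than your slowest plausible read-then-populate, which is why a second or two is the usual starting point.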
I want to be blunt here: there is no perfectly-consistent, simple, and fast cache. The cache and the DB are two systems, and keeping two systems perfectly in sync is the distributed-systems problem. Anyone who tells you their Redis cache layer is "strongly consistent" is either running something much heavier under the hood or has not hit the race yet.
Cache stampede (the thundering herd)
You have one extremely hot key — say a homepage config read on every request. It has a TTL. The instant it expires, the thousand requests in flight all miss simultaneously and all fire the identical expensive query at the database. A database that was serving essentially zero load for this key suddenly takes a thousand copies of the same query in the same millisecond. That is a stampede, and it can knock over a database that was nowhere near capacity.
Three mitigations, and I usually combine them.
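Single-flight / per-key locking. Only one request is allowed to recompute a given key; the rest wait for that one result. In Redis, SET with the NX option (set only if the key does not exist) plus an expiry works as the lock: exactly one caller acquires it, and the expiry frees it if that caller crashes.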
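A minimal sketch of that lock, assuming a redis-py-style client `r` whose `set` accepts `nx=` and `ex=`, and a hypothetical `recompute` callable that returns a string. The `lock:` prefix, the timeouts, and the fall-back-to-computing-directly behavior after `max_wait` are all my assumptions.

```python
import time
import uuid


def get_with_single_flight(r, key, recompute, ttl=300, lock_ttl=10,
                           wait_step=0.05, max_wait=5.0):
    """Return the cached value; on a miss, let only one caller recompute."""
    deadline = time.monotonic() + max_wait
    lock_key = f"lock:{key}"
    while True:
        value = r.get(key)
        if value is not None:
            return value
        token = uuid.uuid4().hex
        # SET NX succeeds for exactly one caller; EX stops a crashed
        # holder from blocking the key forever.
        if r.set(lock_key, token, nx=True, ex=lock_ttl):
            try:
                # Re-check: another caller may have filled the key
                # between our miss and our acquiring the lock.
                value = r.get(key)
                if value is None:
                    value = recompute()
                    r.set(key, value, ex=ttl)
                return value
            finally:
                # Release only our own lock. GET-then-DEL is not atomic;
                # a production version would do this in a Lua script.
                if r.get(lock_key) in (token, token.encode()):
                    r.delete(lock_key)
        if time.monotonic() >= deadline:
            # Assumed fallback: stop waiting and compute directly.
            return recompute()
        time.sleep(wait_step)
```

Waiters poll rather than subscribe, which is crude but keeps the sketch short. Keep `lock_ttl` comfortably above the worst-case recompute time, or a second caller will acquire the lock while the first is still working.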
Jittered TTLs. If you set a fixed TTL on a batch of keys created together, they expire together and stampede together. Add randomness: instead of exactly 300 s, use 300 s plus or minus a random spread. This smears expirations across a window so no single instant sees a flood.
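In code this is a one-liner. The helper below is a sketch with a made-up name and a 10% default spread, which is just a starting point I picked for illustration:

```python
import random


def jittered_ttl(base_seconds=300, spread=0.1):
    """Return base_seconds plus or minus a random fraction, at least 1."""
    delta = base_seconds * spread
    ttl = random.uniform(base_seconds - delta, base_seconds + delta)
    return max(1, round(ttl))
```

With the defaults, every call returns a whole number of seconds between 270 and 330, so a batch of keys written together expires over a one-minute window instead of in a single instant.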
Stale-while-revalidate. Track two notions of freshness: a soft expiry and a hard expiry. Past the soft expiry you serve the stale value immediately and kick off a background refresh; you only block on a recompute past the hard expiry. Users get a fast (slightly stale) response, and exactly one background task does the work. This is my favorite for read-heavy endpoints where a few seconds of staleness is invisible.
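Here is a small in-process sketch of the soft/hard split. Everything about it is illustrative: the class name, the `loader` callable (key in, fresh value out), and the injectable `clock` are my own scaffolding, and a Redis-backed version would store the soft deadline alongside the value instead.

```python
import threading
import time


class SwrCache:
    """In-process stale-while-revalidate cache (a sketch, not production code)."""

    def __init__(self, loader, soft_ttl=30.0, hard_ttl=300.0,
                 clock=time.monotonic):
        self._loader = loader        # hypothetical: key -> fresh value
        self._soft, self._hard = soft_ttl, hard_ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}           # key -> (value, stored_at)
        self._refreshing = set()     # keys with a refresh in flight

    def _store(self, key, value):
        with self._lock:
            self._entries[key] = (value, self._clock())
            self._refreshing.discard(key)

    def _refresh(self, key):
        try:
            self._store(key, self._loader(key))
        except Exception:
            with self._lock:
                self._refreshing.discard(key)  # allow a later retry

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            age = self._clock() - entry[1] if entry else None
            if entry and age < self._soft:
                return entry[0]                # fresh
            if entry and age < self._hard:
                if key not in self._refreshing:
                    # Start at most one background refresh per key.
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,),
                                     daemon=True).start()
                return entry[0]                # stale, but served immediately
        # Missing or past the hard expiry: block on a recompute.
        value = self._loader(key)
        self._store(key, value)
        return value
```

The blocking path past the hard expiry is not single-flighted here, so for a truly hot key you would still want to combine this with per-key locking.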
The other failure modes
Cache penetration. Cache-aside only caches hits. If someone repeatedly requests an ID that does not exist, every request misses the cache and pounds the DB — a great way to get scraped or attacked. The fix is negative caching: cache the "not found" result too, with a short TTL, so repeated lookups for a missing key are absorbed.
(For very high-volume penetration with random keys, a Bloom filter in front of the cache is the heavier-duty answer, but negative caching handles the common case.)
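Negative caching is a small change to the cache-aside read. The sketch below assumes a redis-py-style client `r`, a hypothetical `db.load_product` accessor that returns `None` for a missing row, and a sentinel string that I am assuming can never be a real payload:

```python
import json

NOT_FOUND = "__not_found__"  # sentinel; assumed never to be a real payload


def get_product(r, db, product_id, ttl=300, negative_ttl=30):
    """Cache-aside read that also caches 'not found' for a short time."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached in (NOT_FOUND, NOT_FOUND.encode()):
        return None                      # known-missing: skip the DB
    if cached is not None:
        return json.loads(cached)
    row = db.load_product(product_id)    # hypothetical DB accessor
    if row is None:
        # Short TTL so a newly created row becomes visible quickly.
        r.set(key, NOT_FOUND, ex=negative_ttl)
        return None
    r.set(key, json.dumps(row), ex=ttl)
    return row
```

If the row can be created later, the create path should delete the key as well, so the negative entry does not hide the new row until its TTL runs out.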
Cache avalanche. This is the stampede's big brother: a large number of keys expire at nearly the same moment — usually because you populated them together (a cache warm-up, a deploy, a mass import) with identical TTLs — and the DB gets buried. The fix is the same jitter from above, applied at scale, plus never flushing your entire cache at once. Treat "all keys expire simultaneously" as the self-inflicted outage it is.
What not to cache. Caching is not free; each cached thing is a consistency liability. I do not cache: data that changes on nearly every read, anything where a stale read is genuinely dangerous (authorization decisions, an account balance at the moment of a transaction), and data that is already cheap to fetch. If a query is 0.5 ms and runs rarely, a cache just adds a second system that can be wrong.
Redis specifics worth knowing
The mechanics matter because they shape your design.
- Setting TTLs. SET key val EX 60 sets the value and expiry in one atomic command; SETEX is the older equivalent. Prefer the single command so you never leave a key without a TTL after a crash between SET and EXPIRE.
- Eviction policy. Redis is often also memory-bound, and maxmemory-policy decides what happens when it fills up. The big ones: allkeys-lru (evict least-recently-used across all keys, a sane default for a pure cache), allkeys-lfu (least-frequently-used, better when you have a stable hot set), and the volatile-* variants (only evict keys that have a TTL set). I run dedicated cache instances as allkeys-lru; the volatile-* policies bite you when you forget to set a TTL and Redis OOMs because it refused to evict your no-TTL keys.
- Design for keys vanishing. This is the mental model that fixes most bugs: under maxmemory eviction, any key can disappear at any time, TTL or not. Your code must treat every cache read as "might be a miss" and degrade to the source of truth gracefully. A cache is an optimization, never a system of record. If your app breaks when a key is missing, you did not build a cache; you built a fragile primary database with extra steps.
You cannot trust a cache you cannot measure
The single highest-leverage thing you can add is hit-rate metrics. Count hits and misses, emit the ratio, and alarm on it.
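The counting itself is trivial. A bare-bones sketch, with a class name of my own invention and no particular metrics library assumed (in practice you would forward these counts to whatever you already use for dashboards and alerts):

```python
import threading


class CacheStats:
    """Thread-safe hit/miss counters with a derived hit rate."""

    def __init__(self):
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        """Call once per cache read: True for a hit, False for a miss."""
        with self._lock:
            if hit:
                self.hits += 1
            else:
                self.misses += 1

    def hit_rate(self):
        """Return hits / (hits + misses), or None before any reads."""
        with self._lock:
            total = self.hits + self.misses
            return self.hits / total if total else None
```

At the call site this is one line after the cache read, for example `stats.record(cached is not None)`.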
A hit rate that quietly drops from 96% to 70% is often the first sign of a real problem: a bad deploy that changed a key format, a TTL set too low, an eviction storm, or a new code path bypassing the cache entirely. Without the metric you discover it as a database-CPU page at 3 a.m. With it, you catch it on a dashboard. Also watch evicted_keys from INFO stats and your memory usage — a rising eviction rate means your working set no longer fits and your hit rate is about to fall off a cliff.
Wrapping up
Caching itself is a solved problem; invalidation is where the engineering lives. My honest, hard-won defaults:
- Cache-aside plus a TTL for almost everything. The TTL is non-negotiable — it is your backstop against every invalidation bug you have not found yet.
- Delete keys on write, and add a delayed second delete when the consistency window actually matters.
- Single-flight your hot keys and jitter your TTLs before you have a stampede, not after.
- Cache your negatives, design for keys disappearing, and measure your hit rate.
And accept the core truth: you are keeping two systems in sync, so perfect consistency and perfect performance cannot both be free. Pick your staleness budget deliberately, write it down, and let the TTL enforce it. That mindset has spared me far more incidents than any clever invalidation scheme ever did.
Further reading
- Redis docs, Key eviction: the authoritative reference on the LRU/LFU approximations and the volatile-* vs allkeys-* modes (redis.io/docs).
- The classic stampede write-ups: search for "thundering herd" plus "single flight". The patterns here are decades old and worth reading in the original (the Go singleflight package and Facebook's "leases" memcache paper are good starting points).
- Your own metrics. The most useful caching article you will ever read is your hit-rate dashboard during an incident.