Case File · INV-0053
Ten minutes of 500s during a LinkedIn spike — with zero cache to fall back on
A LinkedIn post drove a wave of new visitors to SimUser AI's site. The CMS buckled under the load and started returning 503s. With no cache layer and no fallback, the site turned each of those failures into a 500 for ten straight minutes.
The Architecture That Failed
Every page request fetched content directly from the CMS at runtime. No cache sat between the Lambda function and the CMS — making the CMS a single point of failure.
no fallback — every request → 500
Traffic vs. 5xx Errors
Request volume and error count rose in lockstep. The CMS couldn't absorb the spike — each new visitor hit a 500 instead of a page.
{"level":"error","msg":"CMS API request failed","endpoint":"https://cms.simuser.ai/api/pages/get-started","status":503,"elapsedMs":4200}
{"level":"warn","msg":"CMS unavailable — no fallback cache found for path","path":"/get-started"}
{"level":"error","msg":"Rendering error page","statusCode":500,"reason":"CMS_UNAVAILABLE","path":"/get-started"}
The Cache That Was Never Warm
A DynamoDB scan confirmed the root cause: 80 cache records, every one with revalidatedAt: 1, a placeholder timestamp at the very start of the Unix epoch that marks an entry as "never revalidated since deploy".
DynamoDB · WebsiteRevalidationTable
80 rows
Critical Finding
All 80 cache entries for the affected page show revalidatedAt: 1 — the cache was set at deploy time and never warmed. When the CMS went down, there was nothing to serve.
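The check behind this finding can be sketched as a simple filter over scanned rows. This is a minimal illustration, not the actual tooling used in the investigation: the record shape, the `findColdEntries` helper, and the sample rows are assumptions. Only the `revalidatedAt: 1` sentinel comes from the scan described above.

```typescript
// Hypothetical shape of one row in the revalidation table. Only the
// `revalidatedAt` field is taken from the case file; `path` is assumed.
interface CacheRecord {
  path: string;
  revalidatedAt: number;
}

// Sentinel written at deploy time, meaning "never revalidated".
const NEVER_REVALIDATED = 1;

// Returns the records that have never been revalidated since deploy.
function findColdEntries(records: CacheRecord[]): CacheRecord[] {
  return records.filter((r) => r.revalidatedAt === NEVER_REVALIDATED);
}

// Illustrative rows only, not real data from the incident.
const sample: CacheRecord[] = [
  { path: "/get-started", revalidatedAt: 1 },
  { path: "/example-warm-page", revalidatedAt: 1718000000 },
];

const cold = findColdEntries(sample);
console.log(`${cold.length} of ${sample.length} entries never warmed`);
```

Running a check like this after each deploy would flag a cold cache before a traffic spike does.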
Before and After: Caching Strategy
The fix was structural — migrate from SSR-on-every-request to ISR with a 5-minute revalidation window and a try/catch fallback to stale cache.
Before
SSR — fetch on every request, no cache
After
ISR + fallback — stale cache beats a 500
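The idea behind "stale cache beats a 500" can be sketched independently of any framework. The wrapper below is a minimal in-memory illustration, not the Next.js ISR mechanism itself. `Fetcher`, `withStaleFallback`, and the `flakyCms` stub are hypothetical names introduced here.

```typescript
// Minimal stale-on-error wrapper: remember the last good value and
// serve it when the upstream call fails. In-memory only; a real
// deployment would keep the cache outside the function instance.
type Fetcher<T> = () => Promise<T>;

function withStaleFallback<T>(fetcher: Fetcher<T>): Fetcher<T> {
  let lastGood: { value: T } | undefined;

  return async () => {
    try {
      const value = await fetcher();
      lastGood = { value }; // refresh the cache on every success
      return value;
    } catch (err) {
      if (lastGood) return lastGood.value; // stale beats an error page
      throw err; // nothing cached yet, so the failure must surface
    }
  };
}

// Hypothetical upstream that succeeds once, then fails like the CMS did.
let calls = 0;
const flakyCms: Fetcher<string> = async () => {
  calls += 1;
  if (calls > 1) throw new Error("CMS returned 503");
  return "Get Started page content";
};

const getContent = withStaleFallback(flakyCms);
```

The wrapper still throws when nothing has ever been cached, which is the situation the cold revalidation table left the site in. A fallback only helps once the cache has been warmed at least once.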
Business Impact
Estimated leads lost
~40
Based on average conversion rate and traffic volume during the 10-minute window. Each visitor who hit a 500 on the Get Started page was a potential trial signup.
Estimate: 400 visitors during the 10-minute outage × 10% signup rate ≈ 40 leads
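The arithmetic behind the ~40 figure can be written out as a short sketch. It assumes the 400 visitors already cover the whole 10-minute window, which is the reading consistent with the ~40 total. The outage length sets the visitor count and is not a separate multiplier. The function and variable names are illustrative.

```typescript
// Figures from the case file: about 400 visitors during the outage
// and a 10% signup rate. Assumption: 400 is the total for the full
// 10-minute window, not a per-minute rate.
const visitorsDuringOutage = 400;
const signupRate = 0.1;

// Expected signups lost = visitors who saw an error × signup rate.
function estimateLostLeads(visitors: number, rate: number): number {
  return Math.round(visitors * rate);
}

const lostLeads = estimateLostLeads(visitorsDuringOutage, signupRate);
console.log(`Estimated leads lost: ~${lostLeads}`); // prints "Estimated leads lost: ~40"
```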
Root Cause & Fix
The page used pure SSR with a synchronous CMS fetch on every request. The fix: migrate to ISR with revalidate: 300 and wrap the CMS fetch in a try/catch with a 3-second timeout, so a failed revalidation is logged and visitors keep getting the last cached page instead of a 500.
// pages/get-started (SSR): fetch on every request
export default async function GetStartedPage() {
  const content = await fetch('https://cms.simuser.ai/api/pages/get-started')
    .then(r => r.json());
  // CMS 503 → this throws → Next renders 500 to the user
  return <PageContent data={content} />;
}

// ISR: cached for 5 min, tolerant to CMS outages
export const revalidate = 300; // 5 minutes

export default async function GetStartedPage() {
  const content = await fetchWithFallback();
  return <PageContent data={content} />;
}

async function fetchWithFallback() {
  try {
    const res = await fetch('https://cms.simuser.ai/api/pages/get-started', {
      next: { revalidate: 300 },
      signal: AbortSignal.timeout(3000), // give up after 3 seconds
    });
    if (!res.ok) throw new Error(`CMS returned ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error({ msg: 'CMS fetch failed; serving stale cache', err });
    // Rethrow so the revalidation fails: Next.js then keeps serving the last
    // successfully generated page. Returning null would cache an empty page.
    throw err;
  }
}
Stop hunting for the root cause under pressure
— let CauseFlow find it.
CauseFlow correlates DynamoDB records, Lambda logs, and CloudWatch metrics automatically — no manual log triage, no post-mortem archaeology.