
Case File · INV-0053

Ten minutes of 500s during a LinkedIn spike — with zero cache to fall back on

A LinkedIn post drove a wave of new visitors to SimUser AI's site. The CMS buckled under load and returned 503s. With no cache layer and no fallback, every request hit a live 500 — for ten straight minutes.

6 min read · High · Leads lost during spike · Fixed by CauseFlow in 5 min

The Architecture That Failed

Every page request fetched content directly from the CMS at runtime. No cache sat between the Lambda function and the CMS — making the CMS a single point of failure.

Visitors → CDN → Next.js Lambda → CMS

No fallback: every request → 500

Traffic vs. 5xx Errors

Request volume and error count rose in lockstep. The CMS couldn't absorb the spike — each new visitor hit a 500 instead of a page.

Chart: requests / min and 5xx errors over time, with markers for the LinkedIn post and the CMS recovery.
Lambda WebsiteServer — CloudWatch Logs
{"level":"error","msg":"CMS API request failed","endpoint":"https://cms.simuser.ai/api/pages/get-started","status":503,"elapsedMs":4200}
{"level":"warn","msg":"CMS unavailable — no fallback cache found for path","path":"/get-started"}
{"level":"error","msg":"Rendering error page","statusCode":500,"reason":"CMS_UNAVAILABLE","path":"/get-started"}
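The log lines above are structured JSON, so the failure chain can be tallied mechanically. The following is a minimal sketch, assuming each log event arrives as one JSON string; the `LogLine` shape and the `countErrorPagesByReason` helper are hypothetical names introduced for illustration, not part of the site's code.

```typescript
// Hypothetical shape covering only the fields seen in the log excerpt.
interface LogLine {
  level: string;
  msg: string;
  status?: number;
  statusCode?: number;
  reason?: string;
  path?: string;
}

// Count rendered 500 pages, grouped by the "reason" field.
function countErrorPagesByReason(rawLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const raw of rawLines) {
    let line: LogLine;
    try {
      line = JSON.parse(raw) as LogLine;
    } catch {
      continue; // skip anything that is not valid JSON
    }
    if (line.statusCode === 500 && line.reason) {
      counts.set(line.reason, (counts.get(line.reason) ?? 0) + 1);
    }
  }
  return counts;
}

// Two of the lines from the excerpt, plus one non-JSON line to show skipping.
const sampleLines: string[] = [
  '{"level":"error","msg":"CMS API request failed","endpoint":"https://cms.simuser.ai/api/pages/get-started","status":503,"elapsedMs":4200}',
  '{"level":"error","msg":"Rendering error page","statusCode":500,"reason":"CMS_UNAVAILABLE","path":"/get-started"}',
  'START RequestId: not-json',
];

console.log(countErrorPagesByReason(sampleLines).get("CMS_UNAVAILABLE")); // 1
```

Grouping by `reason` rather than by message text keeps the tally stable if the wording of the log messages changes.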

The Cache That Was Never Warm

A DynamoDB scan confirmed the root cause: 80 cache records, every one with revalidatedAt: 1, a placeholder timestamp (effectively the Unix epoch) meaning "never revalidated since deploy".

DynamoDB · WebsiteRevalidationTable

80 rows
path                      revalidatedAt
/get-started?v=001        1
/pt-br/get-started?v=002  1
/get-started?v=003        1
/pt-br/get-started?v=004  1
/get-started?v=005        1
/pt-br/get-started?v=006  1
/get-started?v=007        1
/pt-br/get-started?v=008  1
/get-started?v=009        1
/pt-br/get-started?v=010  1
/get-started?v=011        1
/pt-br/get-started?v=012  1
/get-started?v=013        1
/pt-br/get-started?v=014  1
/get-started?v=015        1
/pt-br/get-started?v=016  1
/get-started?v=017        1
/pt-br/get-started?v=018  1
/get-started?v=019        1
/pt-br/get-started?v=020  1
/get-started?v=021        1
/pt-br/get-started?v=022  1
/get-started?v=023        1
/pt-br/get-started?v=024  1
/get-started?v=025        1
/pt-br/get-started?v=026  1
/get-started?v=027        1
/pt-br/get-started?v=028  1
/get-started?v=029        1
/pt-br/get-started?v=030  1
/get-started?v=031        1
/pt-br/get-started?v=032  1
/get-started?v=033        1
/pt-br/get-started?v=034  1
/get-started?v=035        1
/pt-br/get-started?v=036  1
/get-started?v=037        1
/pt-br/get-started?v=038  1
/get-started?v=039        1
/pt-br/get-started?v=040  1
/get-started?v=041        1
/pt-br/get-started?v=042  1
/get-started?v=043        1
/pt-br/get-started?v=044  1
/get-started?v=045        1
/pt-br/get-started?v=046  1
/get-started?v=047        1
/pt-br/get-started?v=048  1
/get-started?v=049        1
/pt-br/get-started?v=050  1
/get-started?v=051        1
/pt-br/get-started?v=052  1
/get-started?v=053        1
/pt-br/get-started?v=054  1
/get-started?v=055        1
/pt-br/get-started?v=056  1
/get-started?v=057        1
/pt-br/get-started?v=058  1
/get-started?v=059        1
/pt-br/get-started?v=060  1
/get-started?v=061        1
/pt-br/get-started?v=062  1
/get-started?v=063        1
/pt-br/get-started?v=064  1
/get-started?v=065        1
/pt-br/get-started?v=066  1
/get-started?v=067        1
/pt-br/get-started?v=068  1
/get-started?v=069        1
/pt-br/get-started?v=070  1
/get-started?v=071        1
/pt-br/get-started?v=072  1
/get-started?v=073        1
/pt-br/get-started?v=074  1
/get-started?v=075        1
/pt-br/get-started?v=076  1
/get-started?v=077        1
/pt-br/get-started?v=078  1
/get-started?v=079        1
/pt-br/get-started?v=080  1

Critical Finding

All 80 cache entries for the affected page show revalidatedAt: 1 — the cache was set at deploy time and never warmed. When the CMS went down, there was nothing to serve.
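The "never warmed" check can be expressed as a small predicate over the scanned records. This is a minimal sketch, assuming the records have already been read from the table into memory; `RevalidationRecord`, `findColdPaths`, and `isCacheCold` are hypothetical names, and treating 1 as the deploy-time placeholder is an assumption taken from the finding above.

```typescript
// Hypothetical record shape, modeled on the two columns shown in the table.
interface RevalidationRecord {
  path: string;
  revalidatedAt: number;
}

// Assumption: 1 is the placeholder value written at deploy time.
const NEVER_REVALIDATED = 1;

// Paths whose cache entry has not been revalidated since deploy.
function findColdPaths(records: RevalidationRecord[]): string[] {
  return records
    .filter((r) => r.revalidatedAt <= NEVER_REVALIDATED)
    .map((r) => r.path);
}

// True when there is at least one record and every record is cold.
function isCacheCold(records: RevalidationRecord[]): boolean {
  return records.length > 0 && findColdPaths(records).length === records.length;
}

const sampleRecords: RevalidationRecord[] = [
  { path: "/get-started?v=001", revalidatedAt: 1 },
  { path: "/pt-br/get-started?v=002", revalidatedAt: 1 },
];

console.log(isCacheCold(sampleRecords)); // true
```

A check like this could run after each deploy to flag pages that would have nothing to serve if the CMS went down.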

Before and After: Caching Strategy

The fix was structural — migrate from SSR-on-every-request to ISR with a 5-minute revalidation window and a try/catch fallback to stale cache.

Before

SSR — fetch on every request, no cache

Lambda (SSR) → no cache → CMS (503): single point of failure

After

ISR + fallback — stale cache beats a 500

Lambda (ISR) → ISR Cache (revalidate: 300, stale fallback)
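The "stale cache beats a 500" idea can be illustrated outside Next.js with a small in-memory wrapper. This is a simplified sketch of the serve-stale-on-error pattern, not how Next.js implements ISR: `StaleFallbackCache` and its parameters are hypothetical, and it refreshes inline rather than in the background.

```typescript
type Fetcher<T> = () => Promise<T>;

interface CacheEntry<T> {
  value: T;
  storedAt: number; // milliseconds since epoch
}

class StaleFallbackCache<T> {
  private entry: CacheEntry<T> | undefined;
  private readonly fetcher: Fetcher<T>;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(fetcher: Fetcher<T>, ttlMs: number, now: () => number = () => Date.now()) {
    this.fetcher = fetcher;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async get(): Promise<T> {
    const entry = this.entry;
    // Within the revalidation window: serve the cached value directly.
    if (entry && this.now() - entry.storedAt < this.ttlMs) {
      return entry.value;
    }
    try {
      const value = await this.fetcher();
      this.entry = { value, storedAt: this.now() };
      return value;
    } catch (err) {
      // Upstream failed: a stale value is better than an error page.
      if (entry) return entry.value;
      // Cold cache and upstream down: nothing to fall back on.
      throw err;
    }
  }
}
```

The last branch is the situation in this incident: with no warm entry, a failed fetch has nothing to fall back on, so the error still reaches the visitor.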

Business Impact

Estimated leads lost

~40

Based on average conversion rate and traffic volume during the 10-minute window. Each visitor who hit a 500 on the Get Started page was a potential trial signup.

Estimate: 400 visitors during the 10-minute outage × 10% signup rate ≈ 40 leads
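The ~40 figure follows from multiplying the visitors who arrived during the outage window by the signup rate; the outage duration defines the window rather than acting as a further multiplier. A minimal sketch of that arithmetic, with `estimateLostLeads` as a hypothetical helper name:

```typescript
// visitorsDuringOutage: visitors who hit the page while it returned 500s.
// signupRate: fraction of visitors who normally sign up (0.1 = 10%).
function estimateLostLeads(visitorsDuringOutage: number, signupRate: number): number {
  return Math.round(visitorsDuringOutage * signupRate);
}

console.log(estimateLostLeads(400, 0.1)); // 40
```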

Root Cause & Fix

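The page used pure SSR with a blocking CMS fetch on every request, so any CMS failure surfaced directly as a 500. The fix: migrate to ISR with revalidate: 300 and wrap the CMS fetch in a try/catch with a timeout and a status check, so a CMS failure leaves visitors on stale content instead of a 500.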

Before: SSR, CMS failure → 500 (typescript)
// pages/get-started — SSR: fetch on every request
export default async function GetStartedPage() {
  const content = await fetch('https://cms.simuser.ai/api/pages/get-started')
    .then(r => r.json());
  // CMS 503 → this throws → Next renders 500 to the user
  return <PageContent data={content} />;
}
After: ISR + fallback, CMS failure → stale cache (typescript)
// ISR — cached for 5 min, tolerant to CMS outages
export const revalidate = 300; // 5 minutes

export default async function GetStartedPage() {
  const content = await fetchWithFallback();
  return <PageContent data={content} />;
}

async function fetchWithFallback() {
  try {
    const res = await fetch('https://cms.simuser.ai/api/pages/get-started', {
      next: { revalidate: 300 },
      signal: AbortSignal.timeout(3000),
    });
    if (!res.ok) throw new Error(`CMS returned ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error({ msg: 'CMS fetch failed — serving stale cache', err });
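    // Rethrow rather than return null: when a revalidation render throws,
    // Next.js keeps serving the last successfully generated page and retries
    // on a later request. Returning null would render PageContent with no data.
    throw err;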
  }
}

Stop hunting for the root cause under pressure
— let CauseFlow find it.

CauseFlow correlates DynamoDB records, Lambda logs, and CloudWatch metrics automatically — no manual log triage, no post-mortem archaeology.