Case File · INV-0053

Ten minutes of 500s during a LinkedIn spike — with zero cache to fall back on

A LinkedIn post drove a wave of new visitors to SimUser AI's site. The CMS buckled under load and returned 503s. With no warm cache and no fallback, the site turned every one of those failures into a 500 for ten straight minutes.

6 min read · High · Leads lost during spike · Fixed by CauseFlow in 5 min

The Architecture That Failed

Every page request fetched content directly from the CMS at runtime. No cache sat between the Lambda function and the CMS — making the CMS a single point of failure.

Visitors → CDN → Next.js Lambda → CMS (no fallback: every request → 500)

Traffic vs. 5xx Errors

Request volume and error count rose in lockstep. The CMS couldn't absorb the spike — each new visitor hit a 500 instead of a page.

Legend: Requests / min · 5xx errors

Lambda WebsiteServer — CloudWatch Logs

{"level":"error","msg":"CMS API request failed","endpoint":"https://cms.simuser.ai/api/pages/get-started","status":503,"elapsedMs":4200}
{"level":"warn","msg":"CMS unavailable — no fallback cache found for path","path":"/get-started"}
{"level":"error","msg":"Rendering error page","statusCode":500,"reason":"CMS_UNAVAILABLE","path":"/get-started"}
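The log lines above are structured JSON, so the failure count per path can be tallied mechanically. The following is a minimal sketch, assuming the logs have already been exported as one JSON object per line; the `LogLine` type and the `countCmsFailures` name are hypothetical, and only fields visible in the excerpt are used.

```typescript
// Hypothetical shape, limited to the fields visible in the log excerpt.
interface LogLine {
  level?: string;
  msg?: string;
  path?: string;
  reason?: string;
  statusCode?: number;
}

// Count rendered 500s whose reason is CMS_UNAVAILABLE, grouped by path.
function countCmsFailures(rawLines: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const raw of rawLines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const line = parsed as LogLine;
    if (line.statusCode === 500 && line.reason === "CMS_UNAVAILABLE" && line.path) {
      counts[line.path] = (counts[line.path] ?? 0) + 1;
    }
  }
  return counts;
}
```

Fed the three lines from the excerpt, only the last one matches, so the result would hold a single entry for `/get-started`.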

The Cache That Was Never Warm

A DynamoDB scan confirmed the root cause: 80 cache records, every one with revalidatedAt: 1 — the Unix epoch sentinel for "never revalidated since deploy".

DynamoDB · WebsiteRevalidationTable

80 rows

path                       revalidatedAt
/get-started?v=001         1
/pt-br/get-started?v=002   1
/get-started?v=003         1
/pt-br/get-started?v=004   1
/get-started?v=005         1
/pt-br/get-started?v=006   1
/get-started?v=007         1
/pt-br/get-started?v=008   1
/get-started?v=009         1
/pt-br/get-started?v=010   1
/get-started?v=011         1
/pt-br/get-started?v=012   1
/get-started?v=013         1
/pt-br/get-started?v=014   1
/get-started?v=015         1
/pt-br/get-started?v=016   1
/get-started?v=017         1
/pt-br/get-started?v=018   1
/get-started?v=019         1
/pt-br/get-started?v=020   1
/get-started?v=021         1
/pt-br/get-started?v=022   1
/get-started?v=023         1
/pt-br/get-started?v=024   1
/get-started?v=025         1
/pt-br/get-started?v=026   1
/get-started?v=027         1
/pt-br/get-started?v=028   1
/get-started?v=029         1
/pt-br/get-started?v=030   1
/get-started?v=031         1
/pt-br/get-started?v=032   1
/get-started?v=033         1
/pt-br/get-started?v=034   1
/get-started?v=035         1
/pt-br/get-started?v=036   1
/get-started?v=037         1
/pt-br/get-started?v=038   1
/get-started?v=039         1
/pt-br/get-started?v=040   1
/get-started?v=041         1
/pt-br/get-started?v=042   1
/get-started?v=043         1
/pt-br/get-started?v=044   1
/get-started?v=045         1
/pt-br/get-started?v=046   1
/get-started?v=047         1
/pt-br/get-started?v=048   1
/get-started?v=049         1
/pt-br/get-started?v=050   1
/get-started?v=051         1
/pt-br/get-started?v=052   1
/get-started?v=053         1
/pt-br/get-started?v=054   1
/get-started?v=055         1
/pt-br/get-started?v=056   1
/get-started?v=057         1
/pt-br/get-started?v=058   1
/get-started?v=059         1
/pt-br/get-started?v=060   1
/get-started?v=061         1
/pt-br/get-started?v=062   1
/get-started?v=063         1
/pt-br/get-started?v=064   1
/get-started?v=065         1
/pt-br/get-started?v=066   1
/get-started?v=067         1
/pt-br/get-started?v=068   1
/get-started?v=069         1
/pt-br/get-started?v=070   1
/get-started?v=071         1
/pt-br/get-started?v=072   1
/get-started?v=073         1
/pt-br/get-started?v=074   1
/get-started?v=075         1
/pt-br/get-started?v=076   1
/get-started?v=077         1
/pt-br/get-started?v=078   1
/get-started?v=079         1
/pt-br/get-started?v=080   1

Critical Finding

All 80 cache entries for the affected page show revalidatedAt: 1 — the cache was set at deploy time and never warmed. When the CMS went down, there was nothing to serve.
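The check behind this finding can be expressed as a small predicate over the scanned rows. This is a sketch only: the `RevalidationRecord` type and both function names are hypothetical, the rows are assumed to have been loaded already (for example from a DynamoDB scan), and the sentinel value `1` is taken from the table above.

```typescript
// Hypothetical record shape mirroring the two columns shown in the table.
interface RevalidationRecord {
  path: string;
  revalidatedAt: number;
}

// Sentinel reported in the scan: the entry has not been revalidated since deploy.
const NEVER_REVALIDATED = 1;

// Return the records that still carry the sentinel.
function findNeverRevalidated(records: RevalidationRecord[]): RevalidationRecord[] {
  return records.filter((r) => r.revalidatedAt <= NEVER_REVALIDATED);
}

// The cache is "cold" when there are records and none has ever been revalidated.
function isCacheCold(records: RevalidationRecord[]): boolean {
  return records.length > 0 && findNeverRevalidated(records).length === records.length;
}

const sample: RevalidationRecord[] = [
  { path: "/get-started?v=001", revalidatedAt: 1 },
  { path: "/pt-br/get-started?v=002", revalidatedAt: 1 },
];
console.log(isCacheCold(sample)); // true
```

A check like this could run as a post-deploy alarm, so a cold cache is noticed before a traffic spike rather than during one.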

Before and After: Caching Strategy

The fix was structural — migrate from SSR-on-every-request to ISR with a 5-minute revalidation window and a try/catch fallback to stale cache.

Before

SSR — fetch on every request, no cache

After

ISR + fallback — stale cache beats a 500
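The idea that a stale copy beats a 500 is not specific to Next.js. As a generic illustration (not how Next.js implements ISR internally), the sketch below wraps any async loader in a TTL cache that falls back to the last good value when the origin fails. The `withStaleFallback` name, its parameters, and the injectable clock are all assumptions made for this example.

```typescript
// Hypothetical helper: wraps an async loader with a TTL cache that
// serves the last good value when the loader fails.
type Loader<T> = () => Promise<T>;

function withStaleFallback<T>(
  load: Loader<T>,
  ttlMs: number,
  now: () => number = () => Date.now(), // injectable clock, useful in tests
): Loader<T> {
  let cached: { value: T; fetchedAt: number } | undefined;

  return async () => {
    // Fresh enough: serve from cache without touching the origin.
    if (cached && now() - cached.fetchedAt < ttlMs) {
      return cached.value;
    }
    try {
      const value = await load();
      cached = { value, fetchedAt: now() };
      return value;
    } catch (err) {
      // Origin failed: a stale copy beats an error page.
      if (cached) {
        return cached.value;
      }
      throw err; // cold cache: nothing to fall back on
    }
  };
}
```

The last branch is the one that matters for this incident: with a cold cache there is nothing to serve, so the error still propagates. A fallback only helps once at least one successful load has happened.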

Business Impact

Estimated leads lost

~40

Based on average conversion rate and traffic volume during the 10-minute window. Each visitor who hit a 500 on the Get Started page was a potential trial signup.

Estimate: 400 visitors during the 10-minute outage × 10% signup rate ≈ 40 leads
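The arithmetic behind the ~40 figure can be written out as a worked example. It reads the inputs as roughly 400 visitors over the whole 10-minute window and a 10% signup rate, so the outage duration defines the window rather than multiplying the result. The function name is hypothetical.

```typescript
// Worked estimate: visitors who hit a 500 during the outage, times the
// share of them that would normally have signed up.
function estimateLostLeads(visitorsDuringOutage: number, signupRate: number): number {
  return Math.round(visitorsDuringOutage * signupRate);
}

console.log(estimateLostLeads(400, 0.10)); // 40
```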

Root Cause & Fix

The page used pure SSR with a blocking CMS fetch on every request. The fix: migrate to ISR with revalidate: 300 and a try/catch fallback that serves stale content instead of a 500.

Before · SSR: CMS failure → 500 (TypeScript)

// pages/get-started — SSR: fetch on every request
export default async function GetStartedPage() {
  const content = await fetch('https://cms.simuser.ai/api/pages/get-started')
    .then(r => r.json());
  // No res.ok check and no fallback: a CMS 503 becomes a render failure → Next renders 500 to the user
  return <PageContent data={content} />;
}

After · ISR + fallback: CMS failure → stale cache (TypeScript)

// ISR: cached for 5 min, tolerant of CMS outages
export const revalidate = 300; // 5 minutes

export default async function GetStartedPage() {
  const content = await fetchWithFallback();
  return <PageContent data={content} />;
}

async function fetchWithFallback() {
  try {
    const res = await fetch('https://cms.simuser.ai/api/pages/get-started', {
      next: { revalidate: 300 },
      signal: AbortSignal.timeout(3000), // give up on a slow CMS after 3 s
    });
    if (!res.ok) throw new Error(`CMS returned ${res.status}`);
    return await res.json();
  } catch (err) {
    console.error({ msg: 'CMS fetch failed, keeping last good ISR page', err });
    // Rethrow rather than return null: when a regeneration throws, Next.js keeps
    // serving the last successfully generated page. Returning null would render
    // (and cache) an empty page. With no page generated yet, this is still a 500.
    throw err;
  }
}
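Stale content can only be served if a page was generated at least once, which is exactly what was missing here. One way to close that gap is a post-deploy warm-up that requests each path once. The sketch below is an assumption, not part of the original fix: `warmPaths` and the `Get` type are hypothetical names, and the HTTP call is injected so the logic can be exercised without a network.

```typescript
// Hypothetical post-deploy warm-up: request each path once so there is a
// generated page to fall back on. `get` is injected to keep the sketch testable.
type Get = (url: string) => Promise<{ ok: boolean; status: number }>;

async function warmPaths(baseUrl: string, paths: string[], get: Get): Promise<string[]> {
  const failed: string[] = [];
  for (const path of paths) {
    try {
      const res = await get(baseUrl + path);
      if (!res.ok) failed.push(path); // page did not render, so nothing was cached
    } catch {
      failed.push(path); // network error or timeout
    }
  }
  return failed; // paths that still have no warm entry
}
```

In a deploy pipeline, the global `fetch` could be passed as `get`, with the site's own base URL and the list of critical paths such as `/get-started` and `/pt-br/get-started`. A non-empty result would be a reason to fail or flag the deploy.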

Stop hunting for the root cause under pressure
— let CauseFlow find it.

CauseFlow correlates DynamoDB records, Lambda logs, and CloudWatch metrics automatically — no manual log triage, no post-mortem archaeology.
