Your AI's Knowledge Base Is Lying to You (A Technical Post-Mortem)

Workforce Wave

April 17, 2026 · 6 min read
#architecture #knowledge-base #vector-db

A dental practice changes its Saturday hours. They update the footer on their website. Their AI receptionist continues telling callers the old hours for the next three weeks.

This is the canonical KB staleness failure. It feels like a content management problem — someone should have updated the KB when the website changed. But if you look at the engineering underneath, it's much harder than that. The AI's knowledge base was technically "fresh" by every metric the system tracked. The page loaded fine. The content hash hadn't changed. The re-crawl ran on schedule. The KB said it was up to date.

It was lying.

Why Hash Comparison Isn't Enough

The naive approach to KB freshness is hash-based: store a hash of the crawled page content, re-crawl periodically, compare hashes, re-index if they differ. This works for the case where an entire page is replaced. It fails in several realistic scenarios.
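Here's that loop in sketch form; the fetch callable is a stand-in for whatever HTTP client the crawler uses, and the function names are illustrative:

```python
import hashlib

def content_hash(html: str) -> str:
    """Whole-page hash: any byte-level difference produces a new digest."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def needs_reindex(url: str, stored_hash: str, fetch) -> bool:
    """The naive freshness check: re-index when the whole-page hash moved.

    Note what this can't tell you: where in the page the change happened,
    or whether content rendered by client-side JavaScript changed at all.
    """
    return content_hash(fetch(url)) != stored_hash
```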

The footer problem. A dental practice's Saturday hours live in a shared footer component that appears on every page. When the hours change, every page hash changes, but the change is identical across all of them. A hash-comparison system re-indexes every document when it should re-index one section. More importantly, if the footer isn't in the main content area, it may never have been chunked and indexed on the original crawl. The hash changes, the re-index runs, and the footer is excluded from the new chunks just as it was from the old ones.

The false freshness problem. Some pages load the same HTML shell every time but populate it with data from client-side JavaScript. The crawl sees the shell. The hash of the shell never changes. The dynamically loaded content, which includes things like hours and pricing, is never captured. The KB thinks it has fresh data on business hours. It has a stable, perpetually fresh snapshot of a loading spinner.

Semantic drift without textual change. "We accept most major insurance plans" means something different in 2024 than it did in 2022. The text hasn't changed. The hash hasn't changed. The business reality has drifted away from the content. Hash comparison has no mechanism to detect this.

The Chunking Problem at Update Time

Even when you correctly detect that content has changed, updating a vector KB is not as simple as replacing the old vectors with new ones.

The typical chunking strategy splits documents into overlapping windows of N tokens, embeds each chunk, and stores the vectors. When the source document changes, you need to invalidate the old vectors and insert new ones. The problem: you don't know which vectors correspond to which part of the document unless you stored that mapping carefully.
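To make the update problem concrete, here's a sketch of windowed chunking that keeps that mapping, using character offsets in place of token offsets to stay dependency-free (all names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    start: int   # character offset of this chunk in the source document
    end: int
    text: str

def chunk_document(doc_id: str, text: str,
                   window: int = 1000, overlap: int = 200) -> list[Chunk]:
    """Overlapping fixed-size windows, each tagged with its source range.

    Storing (start, end) next to every vector is what makes targeted
    invalidation possible later: a diff over the source text maps directly
    onto the chunk IDs that cover the changed range.
    """
    chunks, n = [], 0
    step = window - overlap
    for start in range(0, max(len(text), 1), step):
        end = min(start + window, len(text))
        chunks.append(Chunk(f"{doc_id}_chunk_{n}", doc_id, start, end,
                            text[start:end]))
        n += 1
        if end == len(text):
            break
    return chunks
```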

Naive chunking produces chunk IDs like doc_abc123_chunk_7. When the document changes, you delete all chunks with the doc_abc123 prefix and re-index the whole document. This works, but it has a cost: you're re-embedding an entire document when only one sentence changed, and you're creating a window where old vectors are deleted but new ones aren't indexed yet. That window is the gap problem.
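In sketch form, against a generic vector index (the delete/upsert API shape is hypothetical; chunk_document is the chunker from above):

```python
def naive_reindex(index, embed, doc_id: str, new_text: str) -> None:
    """Delete by document prefix, then re-insert everything.

    Every chunk gets re-embedded even if one sentence changed, and between
    the delete and the final upsert, queries that would have matched the
    old chunks return nothing: the gap problem.
    """
    index.delete(filter={"doc_id": doc_id})         # old vectors gone here
    for chunk in chunk_document(doc_id, new_text):  # gap is open until done
        index.upsert(id=chunk.chunk_id,
                     vector=embed(chunk.text),
                     metadata={"doc_id": chunk.doc_id,
                               "start": chunk.start,
                               "end": chunk.end})
```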

A worse failure mode: the document changes in a way that shifts chunk boundaries. Chunk 7 used to contain the Saturday hours. After an edit to the preceding paragraph, the Saturday hours are now in chunk 8. You re-embed correctly. But if any downstream system cached chunk 7's content, it's now stale in a way that won't be caught by a re-hash of chunk 7 (which no longer exists as the same logical unit).
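The shift is easy to reproduce with the sketch chunker above: a 500-character edit ahead of the fact moves it into the next chunk, while the old chunk ID lives on pointing at different text.

```python
old_text = "x" * 7000 + " Saturday hours: 9am-3pm. " + "y" * 2000
new_text = "z" * 500 + old_text   # an edit above the fact shifts every offset

def chunk_holding(text: str, needle: str = "Saturday") -> str:
    return next(c.chunk_id for c in chunk_document("doc_abc123", text)
                if needle in c.text)

print(chunk_holding(old_text))  # doc_abc123_chunk_8
print(chunk_holding(new_text))  # doc_abc123_chunk_9: same fact, new chunk ID
```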

Vector Database Invalidation Without Knowing What's Stale

Vector DBs aren't designed for targeted invalidation. You can't query "which vectors are semantically similar to the fact that Saturday hours are 9am–3pm and therefore need to be reconsidered." You can delete by ID, filter by metadata, or delete everything and re-index. None of these are great.

The problem compounds when you don't know what changed. You know the page hash changed. You might not know that the change was specifically to the hours listed in the footer. So you can't target the invalidation. You re-index the whole document and hope the new chunks cover the changed fact correctly.

This is the core of why KB staleness is an engineering problem, not a content management one: the vector representation doesn't preserve a clean mapping back to the source facts it encodes. You can't surgically update a belief — you can only replace chunks.

Workforce Wave KB Sync's Three-Layer Diff Approach

Workforce Wave's KB Sync runs on a configurable schedule (default: every 6 hours) and uses a three-layer comparison rather than a single hash check.

Layer 1: Structural diff
  - DOM structure comparison (not just text content)
  - Detects: element additions/removals, footer/header changes, nav changes
  - Flags: structural_change, new_page, deleted_page

Layer 2: Semantic diff
  - Embedding distance between old and new content windows
  - Threshold: cosine distance > 0.15 triggers re-chunk for that window
  - Detects: meaning-level changes the structural diff misses (same DOM, reworded content)
  - More expensive; runs only on pages that passed Layer 1 unchanged

Layer 3: Fact extraction diff
  - Runs NER + structured extraction on high-signal fields:
    hours, phone, address, pricing, staff names, services
  - Compares structured output, not raw text
  - Detects: "9-5" → "9-3" even if the surrounding sentence is identical

Layer 1 is cheap and catches most changes. Layer 2 runs only on pages that Layer 1 passes unchanged; it catches rewordings and other meaning-level edits that leave the DOM structure intact. Layer 3 is targeted extraction that runs on specific field types regardless of whether Layers 1 or 2 fired.
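Here's a condensed sketch of the three layers wired together. The 0.15 threshold is the one from the spec above; the DOM signature, embedding helper, and fact extractor are all stand-ins:

```python
import hashlib
from html.parser import HTMLParser

import numpy as np

class _TagCollector(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def dom_signature(html: str) -> str:
    """Layer 1 primitive: hash of the element-tag sequence. Text edits
    leave it alone; added or removed elements (a new footer row, a
    deleted section) change it."""
    parser = _TagCollector()
    parser.feed(html)
    return hashlib.sha256("/".join(parser.tags).encode()).hexdigest()

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))

def detect_changes(page, stored, embed, extract_facts) -> set[str]:
    """Return the set of change signals for one page. `embed` and
    `extract_facts` stand in for the embedding model and the
    NER/structured-extraction step."""
    signals = set()

    # Layer 1: structural diff. Cheap; runs on every page.
    if dom_signature(page.html) != stored.dom_signature:
        signals.add("structural_change")

    # Layer 2: semantic diff. Runs only on pages Layer 1 passed unchanged.
    if not signals and cosine_distance(embed(page.text), stored.embedding) > 0.15:
        signals.add("semantic_change")

    # Layer 3: fact extraction. Runs on high-signal fields regardless.
    if extract_facts(page.text) != stored.facts:
        signals.add("fact_change")

    return signals
```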

When a change is detected, KB Sync doesn't delete and re-index the whole document. It re-chunks only the affected section (using the stored chunk-to-source-range mapping), invalidates the specific vector IDs that covered the changed range, and inserts the new chunks. The gap window is minimized; the rest of the document's vectors remain valid.
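Sketched against the chunk ranges stored earlier (the diff step that produces the changed source range is elided, and the @v version suffix is an illustrative device to keep new IDs from colliding with the ones being retired):

```python
def targeted_update(index, embed, doc_id: str, version: int,
                    old_chunks: list[Chunk], new_text: str,
                    changed: tuple[int, int]) -> None:
    """Replace only the vectors whose source range overlaps the change.

    New chunks are upserted before the stale IDs are deleted, so the gap
    window shrinks to the instant between the two calls. A production
    version would also re-align section boundaries when the edit changes
    the document's length.
    """
    lo, hi = changed
    stale = [c for c in old_chunks if c.start < hi and c.end > lo]
    if not stale:
        return
    section_lo = min(c.start for c in stale)
    section_hi = min(max(c.end for c in stale), len(new_text))
    # Version suffix avoids ID collisions with the chunks being retired.
    fresh = chunk_document(f"{doc_id}@v{version}",
                           new_text[section_lo:section_hi])
    for chunk in fresh:
        index.upsert(id=chunk.chunk_id,
                     vector=embed(chunk.text),
                     metadata={"doc_id": doc_id,
                               "start": section_lo + chunk.start,
                               "end": section_lo + chunk.end})
    index.delete(ids=[c.chunk_id for c in stale])
```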

Handling the False Freshness Problem

For JavaScript-rendered pages, Workforce Wave runs a headless Chromium pass during the initial crawl and stores both the static HTML hash and the rendered content hash. If the static hash is stable but the rendered hash changes, that's a signal that client-side content has been updated — even though a naive HTTP crawl would see no change.

This catches the dynamic-hours-in-JavaScript case. It's more expensive to run, so it's applied selectively to pages that contain high-signal fields (hours, pricing, contact info) as identified by the initial crawl's fact extraction.
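A sketch of the dual-hash pass, with Playwright standing in for whatever headless driver actually runs the render:

```python
import hashlib

import requests
from playwright.sync_api import sync_playwright

def _digest(s: str) -> str:
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def dual_hash(url: str) -> tuple[str, str]:
    """Return (static_hash, rendered_hash) for one page."""
    static_html = requests.get(url, timeout=30).text
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let client-side JS populate
        rendered_html = page.content()
        browser.close()
    return _digest(static_html), _digest(rendered_html)

def client_side_update(old: tuple[str, str], new: tuple[str, str]) -> bool:
    """Stable shell + moving rendered content: the false-freshness signature."""
    return old[0] == new[0] and old[1] != new[1]
```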

The Numbers

In production across WFW agents, the three-layer approach catches approximately 94% of meaningful content changes within 8 hours of the change being made. The remaining 6% falls mostly into two buckets: JavaScript-heavy sites where the headless pass is blocked by bot detection, and structured-data changes whose authoritative source is a third-party system not reflected on the website at all (e.g., practice management software with hours that differ from the site).

That last case is the genuinely unsolvable one from the crawl side. If the authoritative source of business hours is a PMS that doesn't publish to the website, the KB will always be behind until the practice either updates their site or configures a direct PMS integration. The engineering can get you to 94%. The last 6% requires a different data source.


Next in this series: The Closed-Loop AI: How Call Data Improves the Agent That Made the Call — the feedback pipeline architecture behind WFW's continuous optimization system.
