
How we pin LLM model versions for reproducible GEO audits

Two audits of the same domain, three weeks apart, return different GEO scores. Did the site improve — or did the model change? Without version pinning, you cannot tell the difference.

This is post five in the GEO methodology series. Post one identified five measurement sins in AI-readiness scoring. Sin two — unpinned LLM models — is the one most tools silently commit, and the one with the highest impact on score reproducibility. This post documents exactly how we solved it.

The problem: model routes change without you knowing

Our first implementation of the GEO cold recall check used Cloudflare Workers AI with the model slug `@cf/meta/llama-3.1-8b-instruct`. This slug is a named route, not a pinned version. Cloudflare upgrades the model behind the slug on their own schedule. When they do, every audit after that point reflects a different model's world-knowledge — with no record of which version ran.

The same problem exists on OpenRouter's free tier. The slug `meta-llama/llama-3.1-8b-instruct:free` is routed across inference providers in real time. OpenRouter selects the provider based on availability and load. The model weights may differ between providers even for the same nominal version.

The practical effect: a brand that was correctly recalled in April may appear as "Unknown" in May — not because their GEO presence changed, but because a model checkpoint was rotated. Without pinning, you are measuring model churn, not brand visibility.

What pinning actually requires

Pinning has two parts. First, the model: record which exact model (including provider and checkpoint) ran for each invocation. Second, the prompt: record which version of the prompt template produced the query. Both can change independently and both affect the result.

The model slug you send is a request, not a guarantee. The actual model that ran is in the API response. OpenRouter returns the resolved model in `response.model`. That is the field worth storing — not the slug you sent.
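
For illustration, the resolved model arrives alongside the generated text. The shape below follows OpenRouter's chat-completions response format; the values are hypothetical:

// Request asked for the slug 'meta-llama/llama-3.1-8b-instruct:free'.
// Schematic response (illustrative values, not real output):
const exampleResponse = {
  id: 'gen-abc123',
  model: 'meta-llama/llama-3.1-8b-instruct',   // the model that actually ran
  choices: [{ message: { role: 'assistant', content: '...' } }],
};
// exampleResponse.model is the value worth persisting, not the slug in the request.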

The GeoLlmClient implementation

We extracted all LLM call logic into a single GeoLlmClient class. Every audit that touches an LLM — currently `geo_cold_recall` and `category_share_of_voice` — goes through this client. Nothing else reaches an LLM directly.

// functions/_lib/geo-llm-client.ts

export const DEFAULT_GEO_LLM_MODEL = 'meta-llama/llama-3.1-8b-instruct:free';

export interface LlmResponse {
  text: string;
  model: string;       // slug we sent
  model_pin: string;   // actual model from API response — what to store
  prompt_version: string;
  latency_ms: number;
}

export class GeoLlmClient {
  private readonly apiKey: string;
  private readonly model: string;
  private readonly promptVersion: string;

  // Options mirror how the client is constructed in auditor.ts (shown below).
  constructor(opts: { apiKey: string; promptVersion: string; model?: string }) {
    this.apiKey = opts.apiKey;
    this.promptVersion = opts.promptVersion;
    this.model = opts.model ?? DEFAULT_GEO_LLM_MODEL;
  }

  async complete(prompt: string, systemPrompt?: string): Promise<LlmResponse> {
    const start = Date.now();
    // ... fetch to OpenRouter with this.apiKey, this.model, prompt, systemPrompt ...
    const data = await resp.json() as OpenRouterResponse;
    return {
      text: data.choices[0]?.message?.content?.trim() ?? '',
      model: this.model,
      model_pin: data.model ?? this.model,  // use response, fall back to request
      prompt_version: this.promptVersion,
      latency_ms: Date.now() - start,
    };
  }
}

The key field is `model_pin: data.model ?? this.model`. OpenRouter returns the actual resolved model string in `data.model`. That might be `meta-llama/llama-3.1-8b-instruct` with a provider suffix, or a version-specific identifier. We store that, not the slug we requested.

Versioning the prompt

Model pinning is necessary but not sufficient. The same model can produce different results from a rephrased question. We version every prompt template explicitly.

// auditor.ts
const GEO_RECALL_PROMPT_VERSION = 'v1.0' as const;

// The client is constructed with this version at audit start:
const llmClient = new GeoLlmClient({
  apiKey: env.OPENROUTER_API_KEY,
  promptVersion: GEO_RECALL_PROMPT_VERSION,
  model: DEFAULT_GEO_LLM_MODEL,
});

When we change the cold recall prompt — for example, to reduce hallucination rate, or to handle new brand categories — we bump the version to `v1.1`. Historical audits retain `v1.0` in their stored evidence. You can compare scores across versions and know exactly what changed.

Surfacing uncertainty: the `indicative` confidence flag

Even with perfect model and prompt pinning, a single LLM call is still a draw from a probability distribution. N=1 is not a measurement — it is an observation. We make this explicit in the audit result via a `confidence` field on each check.

export type CheckConfidence = 'measured' | 'indicative';

// Deterministic checks (HTTP fetch, parse, regex) — no stochastic variance:
// robots_txt, llms_txt_present, mcp_server_card, ... → confidence: 'measured' (default)

// LLM-derived checks — stochastic, N=1:
// geo_cold_recall, category_share_of_voice → confidence: 'indicative'
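
Wired into a check result, the flag might look like this. The real result type isn't shown in this post, so treat every field except `confidence` as illustrative:

// Hypothetical check result shape; only CheckConfidence comes from the type above.
const coldRecallResult = {
  id: 'geo_cold_recall',
  status: 'fail',
  confidence: 'indicative' as CheckConfidence,
  model_pin: 'meta-llama/llama-3.1-8b-instruct',
  prompt_version: 'v1.0',
};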

In the audit result UI, `indicative` checks render with a tilde prefix (`~indicative`) and a tooltip explaining what the label means: "This check uses a single LLM query. Results may vary between runs." The model version that produced the result appears inline.

This is not a disclaimer to hide weak methodology. It is a type system for measurement quality. A user who sees `geo_cold_recall: fail (indicative, Llama 3.1 8B)` has more information than one who sees `geo_cold_recall: fail` with no context.

The evidence vault

The model pin and prompt version attached to a check result are only a summary. The full audit trail lives in the `evidence` table in D1 — one row per LLM invocation.

-- D1 migration 0004: evidence vault
CREATE TABLE IF NOT EXISTS evidence (
  id            TEXT PRIMARY KEY,
  audit_id      TEXT NOT NULL,
  signal_id     TEXT NOT NULL,  -- 'geo_cold_recall' | 'category_share_of_voice'
  prompt_version TEXT NOT NULL, -- 'v1.0'
  model_sha     TEXT NOT NULL,  -- actual resolved model from OpenRouter response
  response_full TEXT NOT NULL,  -- full LLM response, no truncation
  latency_ms    INTEGER,
  ts            INTEGER NOT NULL,
  n             INTEGER NOT NULL DEFAULT 1
);

Every cold recall and share-of-voice invocation stores the full response with no truncation. Earlier versions truncated at 300 characters — enough for the classifier to work, but not enough to audit later. The evidence vault retains the complete text so we can re-classify historical responses if our classifier improves, or investigate anomalies.
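
Here is a minimal sketch of what one of those writes looks like from a Worker, assuming a D1 binding and reusing the `LlmResponse` fields from the client above (the shipped insert helper isn't shown in this post):

// Hedged sketch: the binding and helper wiring are assumptions; column names match migration 0004.
async function storeEvidence(db: D1Database, auditId: string, signalId: string, llm: LlmResponse) {
  await db.prepare(
    `INSERT INTO evidence (id, audit_id, signal_id, prompt_version, model_sha, response_full, latency_ms, ts, n)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, 1)`
  ).bind(
    crypto.randomUUID(),
    auditId,
    signalId,            // 'geo_cold_recall' or 'category_share_of_voice'
    llm.prompt_version,
    llm.model_pin,       // resolved model from the response, not the requested slug
    llm.text,            // full response text, no truncation
    llm.latency_ms,
    Date.now()
  ).run();
}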

Wilson confidence intervals: moving beyond N=1

With the model and prompt pinned and full evidence preserved, we added the infrastructure for multi-run confidence intervals using the Wilson score interval.

// functions/_lib/stats.ts
export function wilsonCI(successes: number, n: number, z = 1.96): [number, number] {
  if (n === 0) return [0, 1];
  const p = successes / n;
  const centre = (p + z * z / (2 * n)) / (1 + z * z / n);
  const half = (z / (1 + z * z / n)) * Math.sqrt(p * (1 - p) / n + z * z / (4 * n * n));
  return [Math.max(0, centre - half), Math.min(1, centre + half)];
}

The `checkGeoRecall` and `checkCategoryShareOfVoice` functions both accept an `nRuns` parameter. When `nRuns > 1`, the audit runs the same prompt multiple times, stores each invocation in the evidence vault, and computes Wilson CI bounds. The result includes `ci_lower` and `ci_upper` as decimal probabilities.
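
Here is a hedged sketch of that multi-run path. `classifyRecall` stands in for the real recall classifier, which isn't shown in this post:

// Illustrative only; the shipped checkGeoRecall wiring differs.
async function recallWithCI(client: GeoLlmClient, prompt: string, nRuns: number) {
  let successes = 0;
  for (let i = 0; i < nRuns; i++) {
    const res = await client.complete(prompt);
    if (classifyRecall(res.text)) successes++;   // true when the brand is correctly recalled
    // ...each res is also written to the evidence vault here...
  }
  const [ci_lower, ci_upper] = wilsonCI(successes, nRuns);
  return { successes, n: nRuns, ci_lower, ci_upper };
}

With 4 recalls out of 5 runs, the interval at z = 1.96 comes out to roughly [0.38, 0.96]. That is still wide, which is exactly why the bounds are worth reporting alongside the point estimate.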

The default is `nRuns: 1` — consistent with every previous audit in the corpus. We will increase this for premium audits once usage patterns are established. The infrastructure is live; the multi-run mode is opt-in.

What this means for score comparisons over time

Every audit now records four things that previous audits did not: `model_pin` (the actual resolved model, not the requested slug), `prompt_version` (the template version that generated the query), `n` (number of runs), and `response_full` (the complete untruncated LLM response).

When a future audit of the same domain shows a different GEO score, you can query the evidence table and compare `model_sha`, `prompt_version`, and the raw responses side by side. A score change driven by model churn looks different from one driven by actual brand visibility change: the former shows identical responses with different models; the latter shows changed responses with the same model.
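
As a sketch, the side-by-side pull from D1 could look like this; the binding name and audit ids are placeholders, and the column names match migration 0004:

// Hedged example, not the shipped query.
const { results } = await env.DB.prepare(
  `SELECT audit_id, prompt_version, model_sha, n, response_full, ts
     FROM evidence
    WHERE signal_id = ?1 AND audit_id IN (?2, ?3)
    ORDER BY ts ASC`
).bind('geo_cold_recall', aprilAuditId, mayAuditId).all();
// Identical response_full with different model_sha → model churn.
// Changed response_full with the same model_sha → a real visibility change.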

Post six in this series covers the render gap — the portion of a site that AI crawlers cannot see at all, and what it means for the checks that rely on static HTTP fetches rather than rendered DOM.
