The Teckel Judge

At the heart of Teckel AI is our proprietary auditing engine, the Teckel Judge. This specialized evaluation system provides rigorous, automated analysis of every response your AI generates, ensuring clear and consistent quality measurement.

How It Works

When a trace is sent to Teckel AI, it's queued for processing by the unified Teckel Judge evaluation system. The Judge analyzes each response in two stages, providing comprehensive insight into every user question:

Stage 1: Completeness Scoring

We score how well your AI system answered each user question on a 0-100% scale, apply issue tags to flag problems such as confusion or low-confidence answers, and provide brief feedback when anything specific was missing from the response.
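
As a rough illustration, a Stage 1 result could look like the record below. The field names are hypothetical, chosen for this sketch rather than taken from Teckel AI's actual response schema.

    # Hypothetical shape of a Stage 1 result -- the field names are
    # illustrative, not Teckel AI's actual schema.
    stage1_result = {
        "completeness": 0.92,              # answer relevance, 0.0-1.0 (0-100%)
        "issue_tags": ["low_confidence"],  # detected issues such as confusion
        "feedback": "The response omits the reset link's expiry window.",
    }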

Stage 2: Advanced Analytics (Accuracy, Precision, Freshness)

Our advanced analytics run on a sampled subset of traces. For each sampled trace, we perform document-to-claim attribution to measure the accuracy of the AI's answer, and we score retrieval precision to determine whether relevant documentation is being referenced. Optionally, we track document age to flag answers that rely on old, potentially stale information.

The Core Evaluation Metrics

The Teckel Judge evaluates three key quality dimensions:

1. Completeness

Completeness measures how well the AI's response addresses the user's specific question. It checks whether the AI stayed on topic and directly answered what was asked.

Calculation: AI evaluation of answer relevance to the original question

  • 1.0: The response directly and comprehensively addresses the user's question.
  • 0.8-0.99: The response is mostly relevant with minor tangential content.
  • 0.6-0.79: The response partially addresses the question but includes some irrelevant information.
  • Below 0.6: The response is largely off-topic or doesn't address the core question.
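
Because completeness is an AI evaluation rather than a counting formula, a minimal sketch of an LLM-as-judge scorer may help build intuition. This is not Teckel's actual implementation; the judge parameter stands in for any chat-completion client, and the prompt wording is illustrative.

    from typing import Callable

    def score_completeness(question: str, answer: str,
                           judge: Callable[[str], str]) -> float:
        """Ask an LLM judge to rate answer relevance on a 0.0-1.0 scale.

        `judge` is any function that sends a prompt to a chat model and
        returns its text reply -- a stand-in, not Teckel's actual client.
        """
        prompt = (
            "Rate how completely the ANSWER addresses the QUESTION, from "
            "0.0 (off-topic) to 1.0 (direct and comprehensive). "
            "Reply with only the number.\n\n"
            f"QUESTION: {question}\nANSWER: {answer}"
        )
        return float(judge(prompt).strip())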

2. Accuracy

Accuracy measures how well the response's factual claims are supported by the source documents. Each AI response is broken into discrete statements, and we calculate the ratio of supported claims to total claims.

Calculation: Supported Claims ÷ Total Claims Extracted

  • 1.0: All factual claims are fully supported by the source documents.
  • 0.8-0.99: Most claims are supported, with only minor unsupported details.
  • 0.6-0.79: Several claims lack proper support or contain minor inaccuracies.
  • Below 0.6: The response contains significant unsupported claims or factual errors.
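
The ratio itself is simple once each claim has a support verdict. The sketch below assumes claim extraction and support checking have already happened upstream; the empty-response behavior is an assumption for the sketch, not documented behavior.

    def accuracy_score(claim_verdicts: list[bool]) -> float:
        """Supported Claims / Total Claims Extracted.

        `claim_verdicts` holds one boolean per extracted claim: True if
        the claim is supported by a source document.
        """
        if not claim_verdicts:
            return 1.0  # assumption: no factual claims, nothing to contradict
        return sum(claim_verdicts) / len(claim_verdicts)

    print(accuracy_score([True, True, False]))  # 2 of 3 supported -> ~0.67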

3. Precision

Precision evaluates how relevant retrieved document chunks are to the user's question. This metric reveals whether your RAG system finds the most useful information for each query.

Calculation: Relevant Chunks ÷ Total Chunks Retrieved

  • 1.0: All retrieved chunks are directly relevant and useful for answering the question.
  • 0.8-0.99: Most chunks are relevant, with only minor irrelevant content retrieved.
  • 0.6-0.79: A moderate amount of irrelevant information was included.
  • Below 0.6: Many retrieved chunks are not relevant to the user's question.
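
Precision follows the same pattern, with one relevance verdict per retrieved chunk. In practice each per-chunk judgment would come from an evaluator model; the zero-retrieval behavior below is likewise an assumption made for the sketch.

    def precision_score(chunk_relevance: list[bool]) -> float:
        """Relevant Chunks / Total Chunks Retrieved.

        `chunk_relevance` holds one boolean per retrieved chunk: True if
        the chunk is relevant to the user's question.
        """
        if not chunk_relevance:
            return 0.0  # assumption: retrieving nothing scores zero
        return sum(chunk_relevance) / len(chunk_relevance)

    print(precision_score([True, True, True, False]))  # 3 of 4 relevant -> 0.75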

Claims-Based Analysis

Our unique approach breaks down every AI response into verifiable claims, providing unprecedented transparency.

What's a Claim?

A claim is any factual statement the AI makes. For example:

AI Response: "You can reset your password by clicking the profile icon and selecting 'Security Settings'. The reset link expires in 24 hours."

Claims Extracted:

  1. Password reset accessed via profile icon
  2. Reset option found in Security Settings
  3. Reset link has 24-hour expiration
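
In code, the claims extracted from the example above might be carried as simple records. The structure, and the support verdicts filled in here, are illustrative rather than Teckel AI's actual schema.

    from dataclasses import dataclass

    @dataclass
    class Claim:
        """One verifiable statement extracted from an AI response
        (an illustrative structure, not Teckel AI's schema)."""
        text: str
        supported: bool  # set later by document-to-claim attribution

    claims = [
        Claim("Password reset accessed via profile icon", supported=True),
        Claim("Reset option found in Security Settings", supported=True),
        Claim("Reset link has 24-hour expiration", supported=False),
    ]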

Why Claims Matter

Claims-based feedback makes accuracy scoring more grounded because, as sketched after this list, it pinpoints:

  • WHICH parts are wrong
  • WHY they're unsupported
  • WHAT documentation is missing
  • HOW to fix it
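
A per-claim verdict could carry exactly those four pieces of information. The record below is hypothetical, with field names chosen to mirror the questions above.

    # Hypothetical per-claim feedback record -- the field names are
    # illustrative, mirroring the four questions above.
    claim_feedback = {
        "claim": "Reset link has 24-hour expiration",              # WHICH part is wrong
        "reason": "No source document mentions an expiry window",  # WHY it's unsupported
        "missing": "Password-reset policy documentation",          # WHAT documentation is missing
        "fix": "Document the reset link lifetime, then re-verify", # HOW to fix it
    }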

Freshness Tracking

In addition to the three core metrics, Teckel AI tracks the freshness of your source documents when you provide the document_last_updated field in your trace data. This critical metric helps you understand when your AI might be relying on outdated information.

Important: Freshness scoring requires the document_last_updated field to be included in your source documents. Without this field, freshness cannot be calculated. We highly recommend including this timestamp for all documents to enable information age monitoring.

This metric helps you proactively identify documents that may need review or updating to maintain response quality.
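
A minimal sketch of supplying the timestamp and deriving document age follows. Only the document_last_updated field is named on this page; every other field in the payload is illustrative.

    from datetime import datetime, timezone

    # A source document as it might appear in a trace payload. Only
    # `document_last_updated` is documented above; the rest is illustrative.
    source_document = {
        "title": "Password Reset Guide",
        "content": "You can reset your password by clicking ...",
        "document_last_updated": "2024-01-15T09:30:00Z",
    }

    def document_age_days(doc: dict) -> float:
        """Days since the document was last updated."""
        updated = datetime.fromisoformat(
            doc["document_last_updated"].replace("Z", "+00:00"))
        return (datetime.now(timezone.utc) - updated).total_seconds() / 86400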

Actionable Feedback

Beyond trace-level scoring, our system provides detailed feedback that documentation teams can act on, including per-document analysis, tracked topics, and more.