What Eval Frameworks Measure

I started looking at this because I was running evaluations on some fine-tuned models and couldn't figure out why RAGAS and DeepEval were giving me different scores on what felt like the same question. The more I read, the more I noticed that the frameworks weren't just implementing the same ideas differently. They actually disagree on some fundamental things. This post is my attempt to get that straight.

Before looking at each framework, there is one mechanism worth understanding because every design choice in these systems is, in some way, a response to its limitations.

Every framework in this post uses an LLM to evaluate another LLM's output. A separate model call judges whether a response is faithful, relevant, or correct. The reason this became standard is that Zheng et al. found GPT-4 agrees with human evaluators more than 80% of the time, which is roughly the rate humans agree with each other.¹ That finding made automated eval mainstream.

The 2026 picture is more complicated. JudgeBiasBench found that frontier models exceed 50% error rates on adversarial bias benchmarks.² The headline agreement rate holds on average but breaks on edge cases, which is precisely where reliable evaluation matters most.

Four bias types have been documented and measured.³ Style bias is the most underappreciated. Judges prefer markdown formatting over identical plain prose at a magnitude of 0.76 to 0.92 across all tested models. That surprised me. Most of what I had read focused on position bias, which it turns out is nearly gone in modern frontier models (at or below 0.04). The old advice to always swap the order of candidates in pairwise comparisons can actually hurt performance on adversarial benchmarks, reducing accuracy by 2.5 to 7.0 percentage points.³

Verbosity bias is more nuanced than the simple "longer is better" framing suggests. Current models penalize filler content but correctly reward genuine completeness, achieving 0.92 to 1.00 accuracy on truncation pairs.³ Self-preference bias is real but varies by model. Gemini Flash shows strong self-preference; Gemini Pro actually penalizes its own outputs.³

One mitigation helps consistently: forcing chain-of-thought reasoning adds 1.5 to 13.0 percentage points on LLMBar benchmarks.³ Several others also help. Binary verdicts outperform 0-100 scores, with ChainPoll reporting a 23% accuracy improvement.⁴ Using a judge from a different provider than the generator, running 3-5 judge ensemble calls, and normalizing formatting before evaluation all reduce systematic bias. None of these is built into any framework by default. Each has to be implemented manually on top of whatever tooling you choose.
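To make the manual part concrete, here is a minimal sketch of three of those mitigations combined: formatting normalization, binary verdicts, and a small judge ensemble. The `Judge` signature, `normalize_formatting`, and `ensemble_verdict` are my own hypothetical names, not part of any framework covered here, and the markdown stripping is deliberately crude.

```python
import re
from typing import Callable, Sequence

# Hypothetical judge signature: (question, answer) -> binary verdict.
# In practice each judge would wrap a model call, ideally from a
# different provider than the generator.
Judge = Callable[[str, str], bool]


def normalize_formatting(text: str) -> str:
    """Strip common markdown markers so styling cannot sway the judge."""
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)     # bullets
    text = re.sub(r"\*\*|`", "", text)                                     # bold, inline code
    return re.sub(r"\s+", " ", text).strip()                               # collapse whitespace


def ensemble_verdict(question: str, answer: str, judges: Sequence[Judge]) -> bool:
    """Return True only if a strict majority of judges accept the answer."""
    if not judges:
        raise ValueError("at least one judge is required")
    cleaned = normalize_formatting(answer)
    votes = [judge(question, cleaned) for judge in judges]
    # A tie (possible with an even number of judges) counts as a rejection.
    return sum(votes) * 2 > len(votes)
```

An odd number of judges (the 3-5 range above) avoids ties. Chain-of-thought forcing would live inside each judge's prompt, which this sketch leaves out.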

With that in mind, here is what each framework actually does.

RAGAS

RAGAS is built for reference-free RAG evaluation. Its explicit design goal is to measure without needing human-labeled answers.⁵ The organizing principle is what the documentation calls Single-Aspect Focus: one metric measures one dimension, no composites.⁶

The flagship metric is Faithfulness. The computation has three steps, taken directly from the source.⁷

Step 1 is statement extraction. The LLM receives the question and the answer and is asked to break each answer sentence into atomic, pronoun-free statements. Output is a list of strings.

Step 2 is NLI verification. Each statement is checked against the retrieved context. The judge returns a binary verdict: 1 if the statement can be directly inferred from context, 0 if not.

Step 3 is scoring:

Faithfulness = supported_statements / total_statements

Returns NaN if no statements are extracted.
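The scoring step is simple enough to sketch in plain Python. This is not the RAGAS source, only the ratio described above; `faithfulness_score` is a hypothetical name, and the verdicts are assumed to come from the judge in Step 2.

```python
import math
from typing import Sequence


def faithfulness_score(verdicts: Sequence[int]) -> float:
    """Fraction of extracted statements the judge marked as supported.

    verdicts holds one entry per statement: 1 if it can be inferred
    from the retrieved context, 0 if not.
    """
    if not verdicts:
        # No statements extracted, so the ratio is undefined.
        return math.nan
    return sum(verdicts) / len(verdicts)


# Three statements, two supported by the context.
score = faithfulness_score([1, 1, 0])
```

Because the score is a plain ratio, one unsupported statement costs more in a short answer than in a long one.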

Answer Correctness is a reference-based metric. It combines semantic similarity with claim overlap:

Correctness = 0.5 × SemanticSim(answer, GT) + 0.5 × (matched_claims / GT_claims)

from ragas import evaluate
from ragas.metrics import faithfulness
from datasets import Dataset

data = {
    "user_input": ["Where was Einstein born?"],
    "response": ["Einstein was born in Germany on 20th March 1879."],
    "retrieved_contexts": [["Albert Einstein (born 14 March 1879) was a German-born physicist..."]]
}
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness])
print(result)

What that returns:⁸

{'faithfulness': 0.5000}
# Statement 1: "Einstein born in Germany" → verdict: 1
# Statement 2: "born 20th March 1879" → verdict: 0
# Reason: "Context says 14th March, not 20th March"

The framework caught a hallucinated date. That is exactly what it is designed to catch. What it cannot catch is whether the retrieved context itself was wrong.

DeepEval

DeepEval treats LLM outputs as software. The framing is explicit: test them like you test code, with thresholds, pass/fail outcomes, and CI/CD integration.⁹

Answer Relevancy works like RAGAS Faithfulness but in a different direction. The LLM extracts all statements from the actual output, then classifies each as relevant or not relevant to the input.¹⁰

Answer Relevancy = relevant_statements / total_statements

Hallucination metric checks the actual output against each item in the contexts list for contradictions.¹¹

Hallucination = contradicted_contexts / total_contexts

The difference from RAGAS Faithfulness matters more than it looks. RAGAS extracts atomic claims and checks each one against the context. DeepEval checks the whole output against each whole context for contradictions. Same goal, different computation path, and they will disagree on the same input.
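A small worked example shows how far apart the two numbers can land. The verdicts below are assigned by hand for illustration, not produced by either library, and both function names are hypothetical.

```python
def ragas_style_faithfulness(claim_supported: list[bool]) -> float:
    """Supported atomic claims divided by total claims (higher is better)."""
    return sum(claim_supported) / len(claim_supported)


def deepeval_style_hallucination(context_contradicted: list[bool]) -> float:
    """Contradicted contexts divided by total contexts (lower is better)."""
    return sum(context_contradicted) / len(context_contradicted)


# One answer, decomposed into four atomic claims; only the last one
# conflicts with the single retrieved context.
claims = [True, True, True, False]
# Seen from the context side, that one context is contradicted.
contexts = [True]

faithfulness = ragas_style_faithfulness(claims)         # 3 of 4 claims supported
hallucination = deepeval_style_hallucination(contexts)  # 1 of 1 contexts contradicted
```

Under these assumed verdicts the same answer is 0.75 faithful by the claim-level ratio and 1.0 hallucinated by the context-level ratio, which is the worst possible value on that scale.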

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is Paris?",
    actual_output="Paris is a city in France. It is the capital.",
    context=["Paris is the capital and largest city of France."]
)

# Both metrics call an LLM judge; threshold sets the pass/fail cutoff.
relevancy = AnswerRelevancyMetric(threshold=0.5, model="gpt-4", include_reason=True)
hallucination = HallucinationMetric(threshold=0.5, model="gpt-4", include_reason=True)
evaluate(test_cases=[test_case], metrics=[relevancy, hallucination])

Terminal output (structure from documentation; exact rendering illustrative¹⁰,¹¹):

Running deepeval test run...

✓  test_answer_relevancy  PASSED  (score: 0.92, threshold: 0.5)
   Reason: All 3 statements directly address the input question.

✗  test_hallucination  FAILED  (score: 0.66, threshold: 0.5)
   Reason: 1 of 3 contexts contradicted by actual_output.

The reasoning trace from the HanaLoop source shows what the internal evaluation looks like:¹²

Statement 1: Paris is a city in France     → Truthful
Statement 2: Paris is the capital          → Truthful
Statement 3: Paris has area 105.4 km²      → Partially truthful
             Reason: actual area is 105 km², slight discrepancy

For smaller local models, the `evaluation_template` parameter accepts a custom prompt override for cases where the default prompts' instruction-following assumptions break down.¹⁰

deepeval set-ollama --model=<model_name>
deepeval test run test_example.py

The threshold-as-gate approach is the distinctive design choice here. Whether a score of 0.7 should ever block a deploy is a question DeepEval answers with "yes, if you set it that way." Other frameworks answer it differently.
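The gate idea itself fits in a few lines. This is a framework-independent sketch, not DeepEval's implementation: `gate` and `exit_code` are hypothetical helpers, and every metric is assumed to be higher-is-better.

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics whose score falls below its threshold.

    A metric missing from scores is treated as 0.0, so it fails the gate.
    """
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


def exit_code(failures: list[str]) -> int:
    """Map gate failures to a process exit code a CI job can act on."""
    return 1 if failures else 0


scores = {"answer_relevancy": 0.92, "faithfulness": 0.70}

# The same 0.70 passes a lenient gate and blocks a strict one.
lenient = gate(scores, {"answer_relevancy": 0.5, "faithfulness": 0.5})
strict = gate(scores, {"answer_relevancy": 0.5, "faithfulness": 0.8})
```

The score never changes between the two runs. Only the threshold does, which is why the choice of threshold carries the whole decision.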

LangSmith

LangSmith's explicit position is that evaluation only matters when it closes a loop between offline testing and live production. The documentation distinguishes between testing and evaluation, and makes the case that teams conflate them.¹³

On the threshold question, it states directly that blocking deployment on fuzzy scores creates workflow bottlenecks, and that a faithfulness score of 0.7 may be excellent for one use case and inadequate for another.¹³

There are no pre-baked formulas. Metrics are rubric-driven, defined through the openevals library.¹⁴

Prompt	What it measures	Needs reference?
`CORRECTNESS_PROMPT`	Output vs. input + reference answer	Yes
`HALLUCINATION_PROMPT`	Output vs. retrieved context	No
`GROUNDEDNESS_PROMPT`	Agreement with context	No
`RETRIEVAL_RELEVANCE_PROMPT`	Retrieved context vs. input	No

Deterministic fallbacks are available: exact match, Levenshtein distance, cosine embedding similarity.¹⁴
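For reference, here is what those three deterministic checks compute, written in plain Python. These are my own minimal versions for illustration, not the openevals implementations, and the cosine function assumes the embeddings have already been produced by some model.

```python
import math
from typing import Sequence


def exact_match(output: str, reference: str) -> bool:
    """True only when the two strings are identical."""
    return output == reference


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # Convention chosen here: a zero vector has similarity 0.0.
    return dot / norm if norm else 0.0
```

None of these involve a judge, so they are repeatable and free of judge bias, at the cost of missing paraphrases that an LLM judge would accept.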

from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini"
)
result = evaluator(
    inputs={"question": "Where was Einstein born?"},
    outputs={"answer": "Einstein was born in Germany."},
    reference_outputs={"answer": "Einstein was born in Ulm, Germany."}
)

Output when the evaluator runs as an experiment over a dataset:¹⁴

Dataset: correctness_test_dataset
Experiment: test_correctness-acbd1234

+----------------+----------------+------------------+
|                | correctness    | error            |
+================+================+==================+
| example_1      | 1.0000         | 0.0000           |
+----------------+----------------+------------------+
| example_2      | 0.0000         | 0.0000           |
+----------------+----------------+------------------+

LangSmith has the most developed story for agent evaluation. Trajectory match supports strict, unordered, subset, and superset modes. run_multiturn_simulation() uses an LLM to simulate a user and test an agent end-to-end. Thread-level evaluation scores task completion, user outcome, and agent trajectory as a whole conversation rather than per-turn.¹³,¹⁴
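The four trajectory modes are easier to compare side by side. The sketch below reduces a trajectory to a list of tool names and reflects my reading of the mode names, not the library's code. `trajectory_match` is a hypothetical function, and the real evaluators work on full message histories rather than bare names.

```python
from collections import Counter


def trajectory_match(actual: list[str], reference: list[str], mode: str = "strict") -> bool:
    """Compare an agent's tool-call sequence against a reference sequence."""
    if mode == "strict":
        # Same calls in the same order.
        return actual == reference
    made, expected = Counter(actual), Counter(reference)
    if mode == "unordered":
        # Same calls, any order.
        return made == expected
    if mode == "subset":
        # The agent made no calls beyond those in the reference.
        return all(made[name] <= expected[name] for name in made)
    if mode == "superset":
        # The agent made at least every call in the reference.
        return all(expected[name] <= made[name] for name in expected)
    raise ValueError(f"unknown mode: {mode}")


reference = ["search", "lookup"]
reordered = ["lookup", "search"]
partial = ["search"]
extended = ["search", "lookup", "summarize"]
```

With these inputs, `reordered` fails strict but passes unordered, `partial` passes subset only, and `extended` passes superset only.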

There is an irony I noticed here. LangSmith explicitly documents judge failure modes, recommends boolean output, pairwise comparisons, and human calibration. Its own judge-based metrics still carry the same underlying vulnerabilities. Documenting the problem and solving it are not the same thing.

LlamaIndex

LlamaIndex's premise is that retrieval quality determines response quality. Evaluation belongs at the query engine level, called directly on query responses.¹⁵

FaithfulnessEvaluator checks whether the answer is supported by the source nodes. Output is binary: 1 (faithful) or 0 (not faithful), plus text reasoning.¹⁶

CorrectnessEvaluator scores 1-5 against a reference answer.¹⁶

1: not relevant to the query
2-3: relevant but contains mistakes
4-5: relevant and correct

Default passing threshold: 4.0.

from llama_index.core.evaluation import FaithfulnessEvaluator

# Assumes `query_engine` was built earlier, e.g. index.as_query_engine()
evaluator = FaithfulnessEvaluator()  # uses the default LLM from Settings as judge
response = query_engine.query("What is Paris?")
result = evaluator.evaluate_response(response=response)
# result.passing → True
# result.feedback → "Response is supported by source documents."

Concrete output from the source:¹⁶

Response                              | Source              | Eval Result | Reasoning
New York City got its name when it    | The city came under | Pass        | YES
came under British control in 1664... | British control...  |             |

LlamaIndex uses GPT-4 directly for pass/fail judgments without additional bias mitigation. For single-turn RAG at the query engine level, it is straightforward to use. It is not designed for agent workflows.

W&B Weave

Weave is built around test-driven development for LLMs. The philosophy: define failure cases first, measure against them systematically, track every input, output, and intermediate state with version control.¹⁷

Apart from one built-in scorer, MultiTaskBinaryClassificationF1, there are no pre-built metrics.¹⁸ Scoring functions are user-defined Python.

import asyncio
import weave

weave.init("my-eval-project")

class MyModel(weave.Model):
    @weave.op()
    def predict(self, question: str) -> str:
        # `llm` is a placeholder for whatever client generates the answer
        return llm.generate(question)

@weave.op()
def correctness_scorer(output: str, target: str) -> dict:
    # Exact string match against the dataset's "target" column
    return {"match": output.strip() == target.strip()}

dataset = weave.Dataset(rows=[
    {"question": "Where was Einstein born?", "target": "Ulm, Germany"},
])

evaluation = weave.Evaluation(dataset=dataset, scorers=[correctness_scorer])
# evaluate() is a coroutine; in a notebook, use `await evaluation.evaluate(MyModel())` instead
asyncio.run(evaluation.evaluate(MyModel()))

There is no terminal table by design.¹⁹ Console prints:

Evaluating... → https://wandb.ai/your-org/your-project/weave/evaluations/...

All results live in the browser UI: row-level input/output pairs, aggregate scores, multi-trial comparison. Weave's emphasis is on tracking changes across iterations rather than providing a fixed metric set.

Where They Agree and Disagree

Looking at all five together, six fault lines come up that I think matter in practice.

Does eval need ground truth?

Framework	Position
RAGAS	No -- built explicitly for reference-free eval⁵
Weave	Yes -- curate known-correct targets first¹⁸
LangSmith	Both -- reference-based offline, reference-free for live production¹³
LlamaIndex	Split -- Faithfulness is reference-free, Correctness requires reference¹⁶
DeepEval	Both supported⁹

If no labeled data exists, RAGAS and LangSmith online eval are the practical options. If you are fine-tuning with known outputs, Weave and DeepEval give harder measurement guarantees.

How much does the judge get trusted?

Framework	Position
RAGAS	Trust it -- LLM metrics are closer to human eval than traditional NLP⁶
LlamaIndex	Trust it -- uses GPT-4 for direct pass/fail¹⁶
DeepEval	Trust it, with controls -- `evaluation_template` override for weak judges¹⁰
LangSmith	Skeptical -- documents position bias, verbosity bias, inconsistency; recommends boolean output, pairwise evaluation with position swap, human calibration¹³

RAGAS relies on an LLM judge for exactly the kind of judgment (faithfulness to context) where judge bias applies, and it does not independently mitigate that bias. That is not a criticism of RAGAS specifically. It describes the whole space.

Gate or trend?

Framework	Position
DeepEval	Gate -- `threshold=0.5`, fail CI/CD if below⁹
LangSmith	Trend -- blocking deployment on fuzzy scores creates workflow bottlenecks¹³
Weave	TDD -- define failures first, track improvement across iterations¹⁷
RAGAS	Benchmark -- offline measurement during experimentation⁶

This is the sharpest real disagreement. DeepEval and LangSmith are not just different implementations. They have an explicit philosophical conflict about whether a 0.7 faithfulness score should ever block a deploy.

Atomic or holistic?

Framework	Position
RAGAS	Strictly atomic -- Single-Aspect Focus, one metric one dimension⁶
DeepEval	Component-level -- retrieval and generation measured separately⁹
LlamaIndex	Component-level -- FaithfulnessEvaluator and CorrectnessEvaluator are independent¹⁶
LangSmith	Holistic for agents -- single-turn metrics miss multi-step cascade failures; thread-level evaluation is the real signal¹³

Where in the lifecycle?

Framework	Primary home
DeepEval	Pre-deploy, CI/CD
RAGAS	Dev, experimentation
LlamaIndex	Dev, at query engine level
Weave	Full lifecycle, strongest at experiment tracking
LangSmith	Both explicitly -- the offline/online loop is the product¹³

Agent evaluation?

Framework	Support level
LangSmith	Deepest -- trajectory match (strict/unordered/subset/superset), multiturn simulator, thread-level judge¹⁴
RAGAS	`MultiTurnMetric` base class exists, not primary focus⁶
DeepEval	Single-turn RAG optimized, limited agent support⁹
Weave	Single-turn primary, version tracking across runs¹⁷
LlamaIndex	Query engine level, single-turn¹⁶

The Problem None of Them Fully Solve

Every framework uses an LLM to judge an LLM. None gives a complete answer to what happens when the judge is wrong.

The 80% agreement figure from Zheng et al. is the average case.¹ JudgeBiasBench found frontier models exceed 50% error rates on adversarial edge cases.² Edge cases are exactly where you need the judge to be reliable.

Compound errors make this worse in agent workflows. A judge evaluating individual steps can miss a hallucinated number in step 1 causing a cascade failure in step 3. Single-step accuracy of 95% can still produce a critical incident.⁴
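The arithmetic behind that claim is worth seeing. This sketch makes a simplifying assumption of my own: each step is judged correctly with the same probability, independently of the others. Real agent runs are messier, so read the numbers as an illustration rather than a measurement.

```python
def chain_success_probability(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is judged correctly,
    assuming independent steps with identical accuracy."""
    return step_accuracy ** steps


# How quickly 95% per-step accuracy erodes as the chain grows.
for steps in (1, 3, 5, 10, 20):
    probability = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps: {probability:.3f}")
```

At ten steps the chance of a fully clean run is already down to roughly 60%, and at twenty it falls below 40%.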

LangSmith gets closest to naming the structural problem. Its documentation flags position bias, verbosity bias, and inconsistency, and recommends mitigations. But its own judge-based metrics still carry the same vulnerabilities. Identifying the problem and solving it are different things, and no framework has closed that gap.

The 2026 research-backed mitigations are not built into any of these tools by default: chain-of-thought forcing, binary verdicts over continuous scores, ensemble judges, formatting normalization before evaluation, cross-provider judge selection.³ All of these require manual implementation.

This is not a criticism of any particular framework. It is a structural constraint of using LLM-based evaluation at this stage. The judge is part of the system. When the system fails, the judge may be why, and nothing in the current tooling makes that automatically visible.

The thing I keep coming back to is that these frameworks are measuring different things even when they appear to be measuring the same thing. RAGAS Faithfulness and DeepEval Hallucination both sound like "did the model make something up." But RAGAS counts supported atomic claims while DeepEval counts contradicted contexts. They will disagree on the same output.

That is not a bug in either framework. It is a reflection of the fact that "hallucination" is not a single well-defined thing. Neither is "faithfulness" or "correctness." Each framework made explicit design choices about what those words mean and how to measure them, and those choices reflect different theories about what evaluation is for. RAGAS thinks evaluation means decomposing claims and checking each one. DeepEval thinks it means setting thresholds and shipping software that passes them. LangSmith thinks it means closing a feedback loop between production behavior and offline testing. Weave thinks it means specifying failures before you have data and tracking progress against them.

All of those are defensible. None of them is obviously right for every use case. I'm not sure the frameworks are in competition so much as they are answering slightly different questions while using similar vocabulary.