Evaluating AI-Generated Meeting Summaries with LLM-as-a-Judge

At TestDevLab, we have extensive experience in evaluating the quality of digital communication, from audio and video streams to the networks that carry them. Recently, however, we encountered a new kind of quality challenge. Nearly every meeting platform and productivity suite now ships with an AI note-taker that automatically produces meeting summaries, action items, and transcripts. The output always looks confident: well-formatted, fluent, and professional. But does it actually reflect what happened in the meeting?

When teams start relying on these notes to track decisions, deadlines, and ownership, the cost of a wrong summary is real. A missed action item quietly disappears. An invented deadline creates work nobody agreed to. And because the text reads so smoothly, these errors are easy to trust and hard to spot. To address this challenge, we built AIMSE (Artificial Intelligence Meeting Summary Evaluator) — a custom tool that scores AI-generated meeting summaries, and even transcripts, against configurable quality metrics using an LLM-as-a-judge approach.

In this blog article, we will walk you through why AI-generated meeting summaries are so difficult to evaluate, which quality dimensions actually matter, and how our tool AIMSE turns subjective judgments into structured, repeatable, and explainable quality scores.

Why meeting summaries are hard to evaluate

Before we delve deeper into our solution, let's first cover why this is not a trivial measurement problem.

Unlike a network packet or a video frame, a meeting summary has no single correct answer. Two experienced people summarizing the same meeting will produce different documents — different wording, different ordering, different levels of detail — and both can be perfectly good. Any evaluation method built on exact comparison falls apart immediately.

Classic automated text metrics do not help much either. Most of them work by counting word overlap between the generated text and a reference text. That rewards summaries that copy the reference's wording and punishes summaries that paraphrase well. Worse, overlap-based scores are blind to meaning: a summary can share plenty of vocabulary with the reference and still miss the key decision, attribute an action item to the wrong person, or invent a date that was never mentioned.
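
To make the overlap problem concrete, here is a minimal sketch of a unigram-recall score, a simplified stand-in for overlap-based metrics rather than any specific library's implementation. The sentences and function names are invented for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(candidate: str, reference: str) -> float:
    """Share of reference words that also appear in the candidate (clipped counts)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())


# Hypothetical reference note and two candidate summaries.
reference = "Anna will send the budget report to finance by Friday."
wrong_facts = "Ben will send the budget report to finance by Monday."
paraphrase = (
    "Anna agreed to deliver the budget document to the finance team "
    "before the week ends."
)

score_wrong = unigram_recall(wrong_facts, reference)
score_paraphrase = unigram_recall(paraphrase, reference)

print(f"wrong owner and deadline: {score_wrong:.2f}")
print(f"faithful paraphrase:      {score_paraphrase:.2f}")
```

In this toy case, the summary with the wrong owner and the wrong deadline scores higher than the faithful paraphrase, because it happens to reuse more of the reference's words.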

That last failure mode, hallucination, is the most dangerous one. A hallucinated detail is fluent, plausible, and wrong, and the only way to catch it is to check the summary against what was actually said.

Finally, in many real-world cases, there is no reference material at all. Sometimes all you have is the generated summary itself, with no verified transcript and no human-written notes to compare against. A practical evaluation method has to work with whatever material exists.

What makes a good meeting summary?

To score anything, you first need to define what "good" means. Through our work on meeting summarization quality, we converged on a set of quality dimensions, each of which is a separate question asked of the document. The exact set is fully configurable per project, but a typical evaluation covers four groups.

Content fidelity

Does the summary tell the truth? This covers factual accuracy — whether the document correctly reflects the main topics, decisions, and agreements from the meeting — and hallucination detection, which looks specifically for statements that present something that was never mentioned or never existed in the meeting.

Coverage

Does the summary tell the whole truth? Completeness checks that all major points made it in: decisions made, actions assigned, deadlines set, and the important discussions behind them. A closely related dimension verifies that action items are explicitly listed together with who owns them and by when.

Readability

Can the reader actually use it? Here we evaluate conciseness (brief and to the point, without irrelevant detail), clarity (straightforward language a reader can follow), and structure (logically organized, often mirroring the meeting agenda).

Usefulness

Does it serve the people who depend on it? This group covers relevance (only information that matters to the meeting's objectives and stakeholders), context (enough background that someone who did not attend can still understand the decisions), and objectivity (whether the summary presents information neutrally).

Why one number is not enough

Each of these dimensions has limitations when used alone. A summary can be perfectly fluent and dangerously incomplete, or rigorously accurate but unreadable. A single combined score hides exactly the information a product team needs: what went wrong. That is why our tool evaluates every dimension separately, with its own definition and its own score, and only aggregates them at the very end. The result is a diagnosis, not just a grade.

Scoring summaries with our custom-built tool

Let's now discuss the tool we built for the evaluation itself. AIMSE is developed in Python and is built around an LLM-as-a-judge approach. Specifically, a capable large language model is handed the material under test together with a precise definition of one quality metric, and is asked to grade the document against that definition. Given a clear rubric, a strong model can apply nuanced, human-like judgment — but at machine speed, on every summary, every time.

Metrics are prompts

The core design decision in AIMSE is that every metric is a plain-text prompt file. Each file describes exactly what the judge should assess, what good and bad look like for that dimension, and how the score should be assigned. The tool can load a single metric file or entire folders of them, with the file name becoming the metric name in the results.

This makes the tool extremely flexible. The dimensions described above are a starting point, not a fixed list. Metrics can be renamed, removed, or replaced, and entirely new ones can be written for domain-specific needs (for example, "were all regulatory topics from the agenda captured?") without touching a single line of code. The judge's system prompt is just as configurable, so the entire evaluation persona can be tailored per project.
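
As a rough illustration of the idea, loading metrics from files could look like the sketch below. This is not AIMSE's actual code: the `load_metrics` function, the `.txt` extension, and the sample file contents are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def load_metrics(source: Path, suffix: str = ".txt") -> dict[str, str]:
    """Map each metric name (the file name without extension) to its prompt text.

    `source` may be a single prompt file or a folder that is searched recursively.
    The ".txt" suffix is an assumption; the article only says "plain-text".
    """
    files = [source] if source.is_file() else sorted(source.rglob(f"*{suffix}"))
    return {path.stem: path.read_text(encoding="utf-8").strip() for path in files}


# Demo with two hypothetical metric files written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "factual_accuracy.txt").write_text(
        "Assess whether the summary correctly reflects the decisions made.",
        encoding="utf-8",
    )
    (folder / "regulatory_coverage.txt").write_text(
        "Were all regulatory topics from the agenda captured?",
        encoding="utf-8",
    )
    metrics = load_metrics(folder)

print(sorted(metrics))
```

With this layout, adding a new metric means dropping one more text file into the folder, and the code stays unchanged.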

Three levels of reference

Borrowing a concept from our audio and video quality work, AIMSE supports three levels of reference, from no-reference to full-reference evaluation, depending on what material is available:

Standalone (no-reference) evaluation. Only the generated summary is provided. The judge evaluates the document's internal qualities — clarity, structure, conciseness, and self-consistency. This is the most flexible mode and works even when no other record of the meeting exists.
Transcript-referenced evaluation. A ground-truth transcript (a verified record of what was actually said) is provided alongside the summary. Accuracy, completeness, and hallucination detection become verifiable, because every claim in the summary can be checked against the source.
Full-reference evaluation. In addition to the transcript, the judge receives "golden" meeting notes written by people, which are treated as the ground truth, and optionally separate ground-truth lists of action items and decisions. This is the strictest comparison: how close does the machine-generated summary get to expert human output?
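
The three levels above can be sketched as a simple selection based on which material is supplied. The class and function names below are hypothetical and are not taken from AIMSE. The sketch also leaves out the optional ground-truth lists of action items and decisions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReferenceLevel(Enum):
    STANDALONE = "standalone"
    TRANSCRIPT = "transcript-referenced"
    FULL = "full-reference"


@dataclass
class Material:
    summary: str
    transcript: Optional[str] = None
    golden_notes: Optional[str] = None


def reference_level(material: Material) -> ReferenceLevel:
    """Pick the evaluation level from the material that is available."""
    if material.golden_notes and not material.transcript:
        # The article describes golden notes only "in addition to the transcript",
        # so this sketch rejects the combination instead of guessing.
        raise ValueError("golden notes without a transcript are not covered here")
    if material.golden_notes:
        return ReferenceLevel.FULL
    if material.transcript:
        return ReferenceLevel.TRANSCRIPT
    return ReferenceLevel.STANDALONE


def build_judge_input(material: Material, metric_prompt: str) -> str:
    """Assemble the text handed to the judge; the section titles are illustrative."""
    sections = [
        ("Metric definition", metric_prompt),
        ("Summary under test", material.summary),
    ]
    if material.transcript:
        sections.append(("Ground-truth transcript", material.transcript))
    if material.golden_notes:
        sections.append(("Golden notes", material.golden_notes))
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


only_summary = Material(summary="The team approved the budget.")
with_transcript = Material(summary="The team approved the budget.", transcript="...")
print(reference_level(only_summary).value)
print(reference_level(with_transcript).value)
```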

The same engine also works one level lower in the pipeline. With a dedicated set of prompts, AIMSE can evaluate generated transcripts against ground-truth transcripts, extending the approach from summarization quality to speech-to-text quality.

Scores with reasoning attached

For every metric, the judge returns a structured JSON result containing a score on a 0–100 scale, a comment explaining the reasoning behind it, and an example (the specific passage that triggered the judgment). Alongside these, the tool records which model produced the verdict, when, and at what token cost, and preserves the raw model response for auditing in case of failures or unexpected outcomes.
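
A minimal sketch of how such a response could be validated before it is stored is shown below. The field names `score`, `comment`, and `example` follow the description above. The `Verdict` class and `parse_verdict` function are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    score: float
    comment: str
    example: str


def parse_verdict(raw_response: str) -> Verdict:
    """Parse the judge's JSON reply and reject anything outside the 0-100 scale."""
    data = json.loads(raw_response)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"score", "comment", "example"} - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    score = data["score"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not numeric: {score!r}")
    if not 0 <= score <= 100:
        raise ValueError(f"score outside the 0-100 scale: {score!r}")
    return Verdict(float(score), str(data["comment"]), str(data["example"]))


# Hypothetical raw reply from the judge model.
raw = '{"score": 85, "comment": "Follow-up date is missing.", "example": "..."}'
verdict = parse_verdict(raw)
print(verdict.score, verdict.comment)
```

Keeping the raw response next to the parsed verdict, as the tool does, makes it possible to inspect replies that fail this kind of validation.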

This is what makes the output actionable. A bare score only tells you that something is wrong, while the comment and example tell you what is wrong and where. When a hallucination metric returns a low score, the report points to the exact invented sentence.

Consistency through repetition

LLM judges are powerful, but they are not perfectly deterministic. This means the same question can occasionally yield a slightly different score. AIMSE addresses this with batch evaluation: the same evaluation request can be submitted multiple times per metric through the model provider's batch API, producing a distribution of scores rather than a single sample. Each repetition is saved as its own indexed result file, making it easy to check judge stability and build statistical confidence in the final numbers. As a bonus, batch processing runs at reduced cost compared to real-time API calls, which matters when evaluating large test sets.
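
Once the repeated results are on disk, their scores can be summarized with a few lines of standard-library Python. The sketch below shows one possible way to do this and is not a feature of the tool. The function name, the sample scores, and the stability threshold of 5 points are assumptions.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float], max_stdev: float = 5.0) -> dict:
    """Summarize repeated judge scores for a single metric.

    `max_stdev` is an arbitrary, illustrative threshold for calling a metric unstable.
    """
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "runs": len(scores),
        "mean": mean(scores),
        "stdev": spread,
        "min": min(scores),
        "max": max(scores),
        "unstable": spread > max_stdev,
    }


# Hypothetical scores from five repetitions of two different metrics.
stable = summarize_runs([85, 88, 82, 86, 84])
noisy = summarize_runs([90, 60, 85, 70, 95])

for name, stats in (("stable metric", stable), ("noisy metric", noisy)):
    print(f"{name}: mean={stats['mean']:.1f} stdev={stats['stdev']:.1f} "
          f"unstable={stats['unstable']}")
```

A wide spread usually says more about the metric prompt than about the summary. It often means the rubric leaves too much room for interpretation.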

Cloud or fully local

Meeting content is often confidential, so where the evaluation runs matters as much as how. AIMSE supports two interchangeable engines: a commercial LLM API in the cloud, or a local inference server running open-weight models on our own hardware. The prompts, evaluation logic, and report format are identical in both modes. In local mode, however, the meeting data never leaves the machine. For clients with strict data-handling policies, this option alone can be decisive.

The overall verdict

Once all individual metrics have been evaluated, AIMSE runs a final overall evaluation stage. The per-metric results (scores, comments, and examples, stripped of technical metadata) are fed back to the judge model with a dedicated prompt, producing a closing qualitative assessment of the document as a whole. The headline number itself, however, is not left to the model. The overall score is computed deterministically as a weighted average of the individual metric scores, with configurable weights so that, for example, hallucinations can count more heavily than stylistic polish.
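
The weighted average itself is simple arithmetic: the sum of each score multiplied by its weight, divided by the sum of the weights. A minimal sketch follows, with made-up scores and weights. The default weight of 1.0 for metrics without an explicit weight is an assumption.

```python
def overall_score(
    scores: dict[str, float],
    weights: dict[str, float],
    default_weight: float = 1.0,
) -> float:
    """Weighted average of per-metric scores; unlisted metrics get `default_weight`."""
    total_weight = sum(weights.get(name, default_weight) for name in scores)
    if total_weight <= 0:
        raise ValueError("the weights must sum to a positive number")
    weighted_sum = sum(
        score * weights.get(name, default_weight) for name, score in scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical per-metric scores and weights.
scores = {"hallucination": 60, "factual_accuracy": 85, "clarity": 95}
weights = {"hallucination": 3.0, "factual_accuracy": 2.0}  # clarity falls back to 1.0

result = overall_score(scores, weights)
print(f"weighted: {result:.1f}, unweighted: {sum(scores.values()) / len(scores):.1f}")
```

Here the low hallucination score pulls the weighted result below the plain average of 80, which is the intended effect of weighting fidelity above style.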

The final report

Every metric produces its own JSON result file, ready to be consumed by humans and pipelines alike:

{
  "metric_name": "factual_accuracy",
  "score": 85,
  "comment": "The summary accurately reflects the main decisions, but the agreed follow-up date is missing.",
  "example": "The notes state the budget review is 'ongoing', while a final figure was approved in the meeting.",
  "model_used": "...",
  "usage": { "...": "token statistics for cost tracking" }
}

Because the output is structured and machine-readable, the results plug directly into broader quality workflows. Clients use this kind of data to compare AI note-taking across different meeting platforms, to benchmark their own summarization pipeline against human-written golden notes, and to regression-test summary quality after every model or prompt change, catching degradations before users do. And since metrics are just prompts, the same report structure works for any custom dimension a project requires.

In essence, this mirrors how we approach audio and video quality evaluation: no single metric tells the whole story, so we combine several complementary ones — each with its own strengths — into a final report that is both robust and explainable.

Key takeaways and future improvements

Generic text-similarity metrics were too shallow to capture meaning, hallucinations, or missing action items, and purely manual review does not scale to the volume and release cadence of modern AI note-taking features. The task required combining quality engineering discipline with prompt design: turning vague notions like "a good summary" into explicit, testable definitions that an LLM judge can apply consistently.

Our most important conclusion is that a single quality score is insufficient. Summary quality is multi-dimensional, and only a per-metric evaluation — with reasoning and evidence attached to every score — gives teams something they can act on. Making every metric a prompt proved equally valuable: evaluation criteria can evolve as fast as the products being tested.

There are also improvements on our roadmap. Evaluation of large test sets currently runs metric by metric, so asynchronous and parallel processing is a natural next step for throughput. We also plan to automate the statistical side of repeated judgments (aggregating score distributions and flagging unstable metrics) and to further calibrate the LLM judge against panels of human raters. Finally, while the tool already runs against both cloud APIs and local open-weight models, additional model providers can be supported, and a visual reporting layer on top of the JSON results would give a better overview.

FAQ

Most common questions

What is an AI meeting summary evaluation, and why is it needed?

AI meeting summary evaluation is the process of measuring how well an automatically generated summary reflects what actually happened in a meeting. It is needed because AI note-takers are now built into most meeting platforms, and teams rely on their output to track decisions, deadlines, and ownership. Since these summaries are fluent and professional-looking regardless of whether they are correct, errors such as missed action items or hallucinated details are easy to trust and hard to spot without systematic evaluation.

How does the LLM-as-a-judge approach score a summary?

The summary, any available reference materials, and the definition of one quality metric are sent to a large language model that acts as the judge. The metric definition is a prompt that describes exactly what to assess and how to score it. The judge returns a structured result containing a 0–100 score, a comment explaining the reasoning, and an example passage supporting the verdict. Each metric is judged independently, and a configurable weighted average of the individual scores produces the overall result.

Which quality metrics does the tool evaluate?

A typical evaluation covers content fidelity (factual accuracy and hallucination detection), coverage (completeness and action items with owners and deadlines), readability (conciseness, clarity, and structure), and usefulness (relevance, context for non-attendees, and objectivity). Because every metric is a plain-text prompt file, this set is fully customizable. Metrics can be added, removed, renamed, or tailored to project-specific requirements without modifying the tool itself.

Can a summary be evaluated without a reference transcript or golden notes?

Yes. The tool supports three levels of reference. In standalone mode, only the generated summary is evaluated, focusing on its internal qualities. If a ground-truth transcript is available, accuracy and hallucination checks become verifiable against what was actually said. In full-reference mode, human-written golden notes — and optionally ground-truth action items and decisions — serve as the benchmark, enabling the strictest comparison between machine and expert human output.

How does the tool handle reliability and confidential meeting content?

Reliability is addressed through repetition: the same evaluation can be submitted multiple times per metric via batch processing, producing a distribution of scores that reveals how stable the judge is, at reduced API cost. Confidentiality is addressed through deployment choice: the identical evaluation pipeline can run against a cloud LLM API or against open-weight models on a local inference server. In the local setup, meeting data never leaves the machine.

Need structured quality evaluation for your AI-generated content? See AIMSE in action.

If your product generates meeting summaries or transcripts, or your team depends on tools that do, our evaluation pipeline delivers objective, explainable quality scores across the metrics that matter to you.
