The single most common gap in audio and video quality testing is this: your platform performs well in controlled conditions and poorly in the ones your users actually inhabit. A user joins from a café in Berlin on hotel Wi-Fi. Another dials in from São Paulo over 4G. Bandwidth drops, packet loss creeps in, and suddenly the experience collapses. Pixelated video, choppy audio, delays that make natural conversation impossible. The underlying technology may be comparable to competitors. The experience is not. That gap drives churn, tanks app store ratings, and hands users to platforms that simply feel more reliable. The fix is not more internal testing. It is structured, independent benchmarking under the unpredictable conditions your users actually face.
This article breaks down why internal testing misses the problem, what a rigorous benchmarking methodology looks like, and how Whereby—a browser-based video conferencing platform—used independent benchmarking with TestDevLab to validate their platform’s resilience and build a framework for ongoing quality monitoring.
TL;DR
30-second summary
Why does internal testing fail to surface the audio and video quality problems that real users encounter—and what does rigorous independent benchmarking actually require?
- Internal testing happens under favorable conditions—stable networks, known devices, predictable scenarios—while real users encounter congested home networks, 4G handoffs, ISP throttling, and mid-call bandwidth drops that no internal test suite reliably replicates.
- Even modest network impairments—2–5% packet loss, jitter spikes of 50–200ms—cause frame rate drops, audio artifacts, and delays that make natural conversation impossible, and these moderate degradation scenarios are far more common than complete network failure yet far less commonly tested.
- Perceptual quality metrics—POLQA and ViSQOL for audio, VMAF for video, and proprietary models like TestDevLab's ASQ-ViT and VQTDL—produce results that correlate with what users actually hear and see, unlike raw technical statistics that describe server output rather than end-user experience.
- Network adaptation behavior is where competitive differences between platforms are most measurable: how quickly a platform recovers when bandwidth drops, whether its adaptive bitrate algorithm maintains quality or over-corrects, and whether it prioritizes audio over video when resources are constrained.
- A single benchmarking study establishes a baseline, but the real competitive advantage comes from making it repeatable: re-running key scenarios after major releases, codec changes, and expansion into markets with different network profiles, with automated execution integrated into CI/CD pipelines.
Bottom line: For video conferencing platforms, the gap between lab-quality performance and real-world user experience is where churn happens, app store ratings fall, and users move to platforms that simply feel more reliable. Independent benchmarking under realistic network conditions is the only testing approach that measures it accurately.
Why does internal testing miss real-world audio and video quality problems?
Internal testing tends to happen under favorable conditions. The network is stable. The devices are known. Test scenarios are predictable. Real users, however, operate on congested home networks, switch between Wi-Fi and mobile data mid-call, use older hardware, and encounter ISP throttling during peak hours—none of which feature reliably in an internal test suite.
Even modest impairments—2–5% packet loss, jitter spikes of 50–200ms—can dramatically degrade user experience. Frame rates drop, video freezes, audio artifacts appear, and delays make natural conversation difficult. There is also a blind spot problem: internal teams have deep knowledge of how their platform should behave, which can introduce assumptions that substitute for measurement. Independent benchmarking removes that bias and produces results that carry genuine weight with stakeholders, investors, and enterprise buyers who require evidence rather than assurances.
What makes audio and video quality benchmarking difficult to get right?
Benchmarking under realistic conditions is technically demanding, and getting it wrong produces false confidence, which is worse than no benchmarking at all. Four challenges define the discipline:
1. Simulating the right network conditions
Simple bandwidth throttling is not enough. Real users encounter fluctuating packet loss, changing latency, jitter spikes, and mid-call bandwidth drops. Static tests at fixed bitrates reveal only partial behavior. What matters is how the platform adapts when conditions shift suddenly, which is exactly where quality differences between platforms become visible.
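As an illustration, the sketch below drives a simple impairment scenario with the Linux traffic-control utility (tc/netem): moderate loss and jitter, a mid-call bandwidth drop, then recovery. It assumes a Linux test host with root access and the iproute2 `tc` tool; the interface name and parameter values are illustrative, not a prescribed profile.

```python
"""Minimal sketch of applying fluctuating network impairments with Linux tc/netem.

Assumes a Linux test host with root access; interface name and values are illustrative.
"""
import subprocess
import time

IFACE = "eth0"  # network interface carrying the test call (assumption)

def apply_impairment(loss_pct: float, delay_ms: int, jitter_ms: int, rate_kbit: int) -> None:
    """Replace the root qdisc with a netem profile: loss, delay/jitter, and a rate cap."""
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
                    "loss", f"{loss_pct}%",
                    "delay", f"{delay_ms}ms", f"{jitter_ms}ms",
                    "rate", f"{rate_kbit}kbit"], check=True)

def clear_impairment() -> None:
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=False)

# Example scenario: moderate impairment, a mid-call bandwidth drop, then recovery.
scenario = [
    (3.0, 100, 50, 2500),   # 3% loss, 100ms +/- 50ms delay, 2.5 Mbit/s
    (5.0, 200, 80, 600),    # mid-call drop: heavier loss and a 600 kbit/s cap
    (1.0, 50, 20, 4000),    # conditions improve: observe how quickly quality recovers
]

if __name__ == "__main__":
    try:
        for loss, delay, jitter, rate in scenario:
            apply_impairment(loss, delay, jitter, rate)
            time.sleep(60)  # hold each condition for a minute while the call runs
    finally:
        clear_impairment()
```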
2. Using measurements that reflect human perception
Raw technical statistics do not tell you what users experience. Perceptual quality algorithms, whether standardized tools like POLQA and VMAF or proprietary models such as TestDevLab’s ASQ-ViT for audio and VQTDL for video, produce repeatable, comparable results that correlate with what a person actually hears and sees. This is where specialized tooling becomes essential.
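As a hedged example of the standardized end of that tooling, the sketch below scores a received recording against its reference with VMAF via ffmpeg's libvmaf filter. It assumes an ffmpeg build compiled with libvmaf; file names are placeholders, and POLQA, ViSQOL, ASQ-ViT, and VQTDL each require their own tooling and are not shown.

```python
"""Minimal sketch of scoring a received recording against its reference with VMAF.

Assumes an ffmpeg build compiled with libvmaf; file names are placeholders.
Input order and JSON log layout depend on the ffmpeg/libvmaf version in use.
"""
import json
import subprocess

def vmaf_score(distorted: str, reference: str, log_path: str = "vmaf.json") -> float:
    """Run ffmpeg's libvmaf filter and return the pooled mean VMAF score."""
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",
    ], check=True)
    with open(log_path) as f:
        result = json.load(f)
    return result["pooled_metrics"]["vmaf"]["mean"]

if __name__ == "__main__":
    score = vmaf_score("received_capture.mp4", "reference_source.mp4")
    print(f"Mean VMAF: {score:.1f}")  # 0-100 scale; higher tracks better perceived quality
```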
3. Measuring end-to-end, not just server output
Knowing that your server is sending video at 30 fps tells you nothing about what the user sees after that signal has been encoded, transmitted through an impaired network, decoded, and rendered on their device. The full pipeline is where quality gets lost, and where it must be measured.
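A minimal way to illustrate receive-side measurement is to compare frames captured from the rendered output against the reference that was sent. The sketch below uses SSIM as a simple stand-in for a full perceptual model and assumes the two recordings are already frame-aligned, which real setups achieve by embedding timestamps or markers in the source video.

```python
"""Minimal sketch of full-reference comparison at the receiving end.

Assumes the rendered output was screen-captured on the receiver and is frame-aligned
with the reference; SSIM stands in for a full perceptual model.
"""
import cv2
from skimage.metrics import structural_similarity as ssim

def per_frame_ssim(reference_path: str, captured_path: str) -> list[float]:
    ref = cv2.VideoCapture(reference_path)
    cap = cv2.VideoCapture(captured_path)
    scores = []
    while True:
        ok_r, ref_frame = ref.read()
        ok_c, cap_frame = cap.read()
        if not (ok_r and ok_c):
            break
        # Compare what was sent with what was actually rendered on the device.
        ref_gray = cv2.cvtColor(ref_frame, cv2.COLOR_BGR2GRAY)
        cap_gray = cv2.cvtColor(cap_frame, cv2.COLOR_BGR2GRAY)
        cap_gray = cv2.resize(cap_gray, (ref_gray.shape[1], ref_gray.shape[0]))
        scores.append(ssim(ref_gray, cap_gray, data_range=255))
    ref.release()
    cap.release()
    return scores
```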
4. Accounting for system performance under load
Audio and video processing—noise suppression, echo cancellation, background blur—is computationally intensive. CPU or memory spikes under certain conditions cause additional quality degradation that network-only tests will not capture, particularly on lower-end devices that many users rely on.
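A lightweight way to capture this dimension is to sample system load on the client for the duration of each test call and store it next to the quality scores. The sketch below uses the psutil package and whole-system counters; per-process sampling of the client under test and GPU counters via vendor tooling would be natural extensions.

```python
"""Minimal sketch of sampling system load alongside a quality test run.

Assumes the `psutil` package; records whole-system CPU and memory usage to CSV.
"""
import csv
import time

import psutil

def sample_system_load(duration_s: int = 300, interval_s: float = 1.0,
                       out_path: str = "system_load.csv") -> None:
    """Record CPU and memory usage once per interval for the length of a call."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["elapsed_s", "cpu_pct", "mem_pct"])
        start = time.time()
        while time.time() - start < duration_s:
            cpu = psutil.cpu_percent(interval=interval_s)  # blocks for one interval
            mem = psutil.virtual_memory().percent
            writer.writerow([round(time.time() - start, 1), cpu, mem])
```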
What metrics actually matter in an audio and video quality benchmark?
A comprehensive AV benchmarking study covers four dimensions. Each connects directly to a business outcome.
- Audio quality and delay. Perceptual quality scores (POLQA, ViSQOL, or TestDevLab’s proprietary ASQ-ViT), end-to-end audio delay, and artifact indicators such as clipping, echo, or noise suppression failures. Audio is the most critical channel — users tolerate blurry video far more readily than distorted or delayed audio.
- Video quality, smoothness, and latency. Perceived video clarity via VMAF or TestDevLab’s VQTDL (a deep learning, no-reference video quality model), frame rate stability, video delay, and visual artifacts including blockiness, blurring, and color distortion.
- Network responsiveness and adaptation. How quickly the platform adapts when bandwidth drops, how smoothly it recovers when conditions improve, and whether its adaptive bitrate algorithm maintains the highest possible quality or over-corrects. This is where platform engineering intelligence becomes measurable — and where competitive differences are starkest (a measurement sketch follows this list).
- System performance. CPU, GPU, and memory usage tracked alongside quality metrics, catching resource-driven degradation that network tests miss and that disproportionately affects lower-specification hardware.
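Adaptation and recovery behavior, the third dimension above, can be reduced to a concrete number. The sketch below computes time-to-recover from timestamped bitrate samples once an impairment is lifted; the sample data, baseline, and 90% threshold are illustrative assumptions.

```python
"""Minimal sketch of quantifying recovery time from timestamped quality samples.

`samples` is assumed to be (seconds, received_bitrate_kbps) pairs collected during
a test run; the impairment-lift time and recovery threshold are illustrative.
"""

def recovery_time(samples: list[tuple[float, float]],
                  impairment_lifted_at: float,
                  baseline_kbps: float,
                  threshold: float = 0.9) -> float | None:
    """Seconds from the moment conditions improve until bitrate holds at >= 90% of baseline."""
    target = baseline_kbps * threshold
    for t, kbps in samples:
        if t >= impairment_lifted_at and kbps >= target:
            return t - impairment_lifted_at
    return None  # never recovered within the session: a finding in itself

# Example: bandwidth cap removed at t=120s; platform back above 90% of baseline at t=131s.
samples = [(118, 450), (122, 700), (126, 1400), (131, 2300), (140, 2400)]
print(recovery_time(samples, impairment_lifted_at=120, baseline_kbps=2500))  # -> 11
```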
What does a well-designed audio and video benchmarking study look like?
Whether the work is done internally or with a specialist partner, five principles define a study worth trusting.
- Structured, repeatable test sessions. Clearly defined network parameters, device configurations, and call durations ensure results are comparable across tests, across time, and across platforms in competitive studies (a scenario-matrix sketch follows this list).
- Realistic network profiles, not just worst-case scenarios. Moderate packet loss (2–5%), varying latency (50–200ms), and mid-call bandwidth fluctuations are far more common than complete network failure, and they are where subtle quality differences become visible.
- Objective metrics combined with behavioral analysis. Numbers explain what happened; behavioral analysis explains why it matters. How quickly does quality recover after a bandwidth drop? Does the platform prioritize audio over video when resources are constrained?
- Multiple iterations per scenario. A minimum of three iterations per scenario provides the statistical confidence needed to act on results.
- A controlled, standardized testing environment. TestDevLab runs benchmarking studies inside ViQuLab, its purpose-built testing environment that isolates every session from external disturbance and ensures consistent data collection across the full study.
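To show what "structured and repeatable" can look like in code, the sketch below expands network profiles, devices, and iterations into an ordered test plan. The profile names, parameter values, and device list are illustrative assumptions, not TestDevLab's actual configuration; only the three-iteration minimum mirrors the principle above.

```python
"""Minimal sketch of a structured, repeatable scenario matrix.

Profile names, parameter values, and devices are illustrative assumptions.
"""
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class NetworkProfile:
    name: str
    loss_pct: float
    latency_ms: int
    jitter_ms: int
    bandwidth_kbit: int

PROFILES = [
    NetworkProfile("moderate_loss", 3.0, 100, 30, 2500),
    NetworkProfile("high_latency", 1.0, 200, 50, 2500),
    NetworkProfile("bandwidth_drop", 2.0, 80, 20, 600),
]
DEVICES = ["macbook_chrome", "win_laptop_edge", "android_midrange_chrome"]
ITERATIONS = 3          # minimum for statistical confidence
CALL_DURATION_S = 300   # fixed call length so runs stay comparable

def build_test_plan() -> list[dict]:
    """Expand profiles x devices x iterations into an ordered, repeatable run list."""
    return [
        {"profile": p.name, "device": d, "iteration": i + 1, "duration_s": CALL_DURATION_S}
        for p, d, i in product(PROFILES, DEVICES, range(ITERATIONS))
    ]

print(len(build_test_plan()))  # 3 profiles x 3 devices x 3 iterations = 27 sessions
```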
How did Whereby use independent benchmarking to validate and improve its platform?
Whereby is a European browser-based video conferencing platform and API provider used across hybrid work, healthcare, and education. The company engaged TestDevLab to design and execute a benchmarking study simulating realistic network constraints across three dimensions: audio experience (clarity, consistency, and end-to-end delay), video quality and smoothness (perceived clarity, motion fluidity, and latency), and network responsiveness (bitrate adaptation and recovery). Read the complete methodology and findings in the Whereby audio and video quality benchmarking case study.
The engagement produced three outcomes: independent validation of platform resilience that could be shared with partners and stakeholders; actionable insights identifying optimization opportunities the internal team had not surfaced; and a data-driven monitoring framework to ensure quality does not regress as the platform evolves. After several years of collaboration, TestDevLab has become a trusted long-term partner running benchmarks on a rolling basis as Whereby’s platform develops.
“We partnered with TestDevLab to benchmark our audio and video performance across the different network conditions our users face every day. We wanted independent experts who could stress-test us objectively and either validate or push back on the metrics we track ourselves… After a few years of working together, TestDevLab has become a partner we trust.” — Adela Prisacaru, Managing Director, Product & Engineering, Whereby
Should audio and video benchmarking be a one-time study or an ongoing process?
A single benchmarking study is valuable as a baseline. The real competitive advantage comes from making it repeatable. User expectations increase every year. Codecs evolve, browsers update, and the platform changes with every release. The most effective approach is to establish a baseline through initial benchmarking and use it as a reference point for ongoing monitoring, re-running key scenarios after major releases, codec changes, or market expansion into regions with different network profiles.
Automation plays a central role: automated execution allows large volumes of tests to run consistently and integrates benchmarking into CI/CD pipelines, catching regressions before they reach users. Competitive benchmarking adds another layer, showing precisely where investment will close gaps or extend leads, which is what TestDevLab delivers through its competitive intelligence services.
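As a hedged illustration of that CI/CD integration, the sketch below compares fresh benchmark scores against a stored baseline and fails the pipeline on regression. The file names, score format, and 5% tolerance are assumptions made for the example, not TestDevLab's tooling.

```python
"""Minimal sketch of a CI regression gate against a stored benchmark baseline.

`baseline.json` and `latest.json` are assumed score files keyed by scenario
(e.g. mean VMAF or audio-quality scores); the 5% tolerance is illustrative.
"""
import json
import sys

TOLERANCE = 0.05  # fail the pipeline if any scenario drops more than 5% below baseline

def check_regressions(baseline_path: str = "baseline.json",
                      latest_path: str = "latest.json") -> list[str]:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(latest_path) as f:
        latest = json.load(f)
    failures = []
    for scenario, base_score in baseline.items():
        new_score = latest.get(scenario)
        if new_score is None or new_score < base_score * (1 - TOLERANCE):
            failures.append(f"{scenario}: baseline={base_score}, latest={new_score}")
    return failures

if __name__ == "__main__":
    failed = check_regressions()
    for line in failed:
        print("REGRESSION:", line)
    sys.exit(1 if failed else 0)  # non-zero exit fails the CI job before release
```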
How does TestDevLab approach audio and video quality benchmarking?
TestDevLab has been doing audio and video quality testing for over a decade, working with communications platforms, conferencing providers, VoIP operators, and video API companies. The capabilities brought to every engagement include:
- Proprietary quality assessment algorithms — VQTDL (deep learning video quality) and ASQ-ViT (audio quality), plus Video Quality Box, a SaaS platform for processing video, audio, and network data across metrics including VQTDL, VMAF, ViSQOL, POLQA, FPS, SSIM, delays, and more.
- ViQuLab — a standardized, isolated testing environment purpose-built for AV quality benchmarking, ensuring reliable and repeatable results across every session.
- Deep real-time communication expertise — covering WebRTC, VoIP, WebSocket, conferencing, streaming, and video calling across desktop and mobile.
- Flexible engagement models — from one-time benchmarks and ongoing regression testing to full competitive analysis showing exactly where the platform stands against the field.
Key takeaways
The gap between lab performance and real-world user experience is not a marginal problem for video conferencing platforms — it is where competitive positioning is actually determined. When a platform performs well under controlled conditions and degrades under the network impairment that real users encounter daily, the difference does not show up in internal test reports. It shows up in churn rates, app store ratings, and the quiet preference shift toward platforms that simply feel more reliable, regardless of whether their underlying technology is superior.
Closing that gap requires a testing approach that is structurally different from internal QA: independent, perceptually grounded, and designed around the network conditions users actually inhabit rather than the conditions that make a platform look its best. The Whereby engagement illustrates what that looks like in practice: benchmarking across realistic impairment profiles, using perceptual quality metrics that correlate with what users hear and see, and producing both independent validation that could be shared with partners and stakeholders and actionable optimization insights the internal team had not surfaced. The ongoing engagement model, re-running key scenarios as the platform evolves, reflects what effective quality monitoring actually requires: not a point-in-time snapshot, but a repeatable process that catches regressions before users do.
For any communications platform competing in a market where user experience is the primary differentiator, independent AV benchmarking is not a validation exercise. It is a competitive intelligence function, one that makes the gap between your platform and the field measurable, actionable, and closable before it becomes visible in the metrics that matter commercially.
FAQ
Most common questions
Why does internal audio and video quality testing fail to surface the problems real users encounter?
Internal testing tends to happen under favorable conditions — stable networks, known devices, and predictable test scenarios that do not reflect the network variability real users encounter. Beyond network conditions, there is a blind spot problem: internal teams have deep knowledge of how their platform should behave, which introduces assumptions that substitute for measurement. Independent benchmarking removes that bias and produces results that carry genuine evidential weight with stakeholders, enterprise buyers, and investors who require data rather than assurances.
What network conditions should an audio and video benchmarking study simulate?
Moderate impairment scenarios—2–5% packet loss, varying latency of 50–200ms, jitter spikes, and mid-call bandwidth fluctuations—are far more common in real usage than complete network failure and far more revealing about platform quality. Simple bandwidth throttling at fixed bitrates is insufficient because it does not test how the platform adapts when conditions shift suddenly, which is exactly where quality differences between platforms become visible. A well-designed study tests network adaptation and recovery behavior, not just baseline performance under static conditions.
What is the difference between raw technical metrics and perceptual quality metrics in AV benchmarking?
Raw technical statistics, like bitrate, FPS, and packet loss percentage, describe what the server is sending. They tell you nothing about what the user hears and sees after the signal has been encoded, transmitted through an impaired network, decoded, and rendered on their device. Perceptual quality algorithms, like POLQA and ViSQOL for audio, VMAF for video, and proprietary models like TestDevLab's ASQ-ViT and VQTDL, produce scores that correlate with human perception of quality, making them the appropriate measurement standard for any benchmarking study designed to predict user experience rather than network behavior.
Should audio and video benchmarking be a one-time study or an ongoing process?
A single study establishes a valuable baseline, but the real competitive advantage comes from making it repeatable. User expectations increase every year, codecs evolve, browsers update, and the platform changes with every release, each of which can introduce quality regressions that one-time benchmarking will not catch. The most effective approach re-runs key scenarios after major releases, codec changes, and market expansion into regions with different network profiles, with automated execution integrated into CI/CD pipelines to catch regressions before they reach users.
What does system performance monitoring add to an AV quality benchmarking study?
Audio and video processing—noise suppression, echo cancellation, background blur—is computationally intensive, and CPU or memory spikes under load cause quality degradation that network-only tests do not capture. Tracking CPU, GPU, and memory usage alongside perceptual quality metrics reveals resource-driven degradation that disproportionately affects lower-specification hardware, the devices that a significant portion of real users rely on. Without system performance monitoring, a benchmarking study can produce clean results on high-end test devices while missing the failure modes that users on mainstream hardware encounter in production.
Does your conferencing platform perform as well under real-world network conditions as it does in your lab?
TestDevLab designs and executes independent audio and video quality benchmarking for communications platforms, using proprietary perceptual quality models, purpose-built testing environments, and realistic network impairment profiles that surface the experience gaps internal testing systematically misses.