How We Extract and Evaluate RTP Audio Stream Quality

At TestDevLab we have extensive experience in audio quality testing, working with various metrics and processes to provide accurate results. Recently, however, we encountered a new challenge in this field. Namely, in most cases, audio is recorded as a file that the user can listen to, but what happens when the conversation is already over and all you have are the PCAP network traces? Thankfully, even if audio is not meant to be listened to directly, network traces still contain all of the packets and streams that combine into the final audio signal. To address this challenge, we built a new tool that is able to extract and convert these network streams into actual audio files, which we can then evaluate and analyze to gather objective quality data.

In this blog article, we will walk you through how the tool identifies and extracts RTP audio streams from PCAP traces, reconstructs them into playable audio files, and evaluates their quality using three no-reference metrics: NISQA, SpeechMOS, and our own in-house metric, Barko AQTDL.

TL;DR

30-second summary

How do you evaluate the quality of audio that was never recorded as a file when all you have are raw network traces?

PCAP files contain everything needed to reconstruct audio, but no existing tools combined extraction, reconstruction, and quality evaluation in a single automated pipeline. Tools like Wireshark can extract RTP streams but don't support RTP Timestamp playback timing in their command-line version. This gap required building a custom solution from scratch.
Accurate RTP stream identification requires combining three filters. SSRC alone is insufficient because it only needs to be unique within a single RTP session. Combining SSRC with source and destination IP addresses and ports produces a reliable filter for isolating individual streams within complex network traces containing multiple sessions.
Reconstructing audio accurately requires simulating RTP Timestamp playback timing. Simply concatenating packet payloads produces audio without silences, which misrepresents real call quality. By calculating timestamp differences between packets, where a gap above 160 samples for G.711 at 8 kHz with 20 ms packets indicates silence, the tool reconstructs audio that accurately reflects both speech content and silence intervals.
Single MOS scores are insufficient for accurate quality assessment. Standard MOS scoring alone doesn't always identify what caused degradation. The tool uses three complementary no-reference metrics, NISQA, SpeechMOS, and TestDevLab's own Barko AQTDL, to provide a more robust, multi-dimensional quality picture across different scenarios and codec types.
Barko AQTDL adds an AI-driven layer that existing metrics cannot replicate. By combining NISQA and SpeechMOS scores with LLM-based perceptual evaluation using a customisable prompt, Barko AQTDL produces a fine-tuned final MOS score that can be adapted to specific scenarios, from narrowband telephony to WebRTC, without retraining the underlying models.

Bottom line: Evaluating audio quality from PCAP traces requires deep expertise in both network trace analysis and audio signal processing. By combining custom RTP extraction logic with three complementary no-reference quality metrics, including an AI-driven metric that can be tailored to specific use cases, TestDevLab's tool provides accurate, automated audio quality evaluation even when no reference recording exists.

What is in a PCAP file?

Before we delve deeper into our solution, let’s first cover some of the basics about what it is we are working with.

A PCAP trace is not a recording format like MP3 or WAV, but a data file that contains captured network packets. Instead of one continuous audio stream, it holds many small records, one for each captured packet.

In most VoIP cases, PCAP files specifically contain RTP over UDP packets. The most important parts of these packets are the RTP header, which carries various metadata such as packet sequence number, payload type, timestamp and stream identifier, as well as the payload part of the packet, which contains the actual encoded media, in this case, audio frames.
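To make the header layout concrete, here is a minimal sketch that parses the fixed 12-byte RTP header defined in RFC 3550 from a raw UDP payload. The names `RtpHeader` and `parse_rtp_header` are illustrative and not taken from our tool, and the example packet is hand-built rather than captured.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    header_len: int  # byte offset at which the encoded audio starts


def parse_rtp_header(data: bytes) -> RtpHeader:
    """Parse the RTP header (RFC 3550) at the start of a UDP payload."""
    if len(data) < 12:
        raise ValueError("too short to be an RTP packet")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    header_len = 12 + 4 * (first & 0x0F)  # skip optional CSRC entries
    if first & 0x10:  # extension bit set: skip the header extension as well
        if len(data) < header_len + 4:
            raise ValueError("truncated RTP header extension")
        (ext_words,) = struct.unpack("!H", data[header_len + 2:header_len + 4])
        header_len += 4 + 4 * ext_words
    return RtpHeader(first >> 6, second & 0x7F, sequence, timestamp, ssrc, header_len)


# Hand-built example packet: version 2, payload type 0, 160 bytes of audio.
packet = bytes([0x80, 0x00]) + struct.pack("!HII", 7, 1120, 0x1234ABCD) + b"\xff" * 160
header = parse_rtp_header(packet)
audio = packet[header.header_len:]
```

Everything after `header_len` is the encoded media, which is the part that later gets decoded into audio.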

Identifying audio streams

A network trace can contain multiple audio streams, each containing its own audio data. To correctly extract each one, we must identify it using the metadata from the RTP header, as well as information from both the IP and UDP headers.

SSRC

One of the main identifiers we can use is the Synchronization Source (SSRC), a 32-bit numeric identifier, usually chosen randomly, that is carried in the RTP header and identifies a single stream of RTP media. However, the SSRC only has to be unique within a given RTP session. This means that if a network trace contains multiple RTP sessions, the same SSRC may appear in different streams.

SRC/DST IP

Source and Destination IP addresses are additional filters to use for specific stream identification. As one of the most basic packet parameters located in the IP header, they identify the network endpoints and help further narrow down which RTP stream to find.

SRC/DST Port

Source and Destination ports are transport-layer identifiers that tell the system which application process on each host should handle a given packet. They are among the most commonly used fields for filtering specific RTP streams.

Combining filters

Each of these parameters has limitations when used alone. Filtering only by IP address or port can produce ambiguous results, and as mentioned above, even the SSRC is sometimes not enough to isolate each individual RTP stream. Used together, however, these fields form a reliable filter for pinpointing individual RTP streams within a network trace.
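A combined filter can be thought of as a composite key. The sketch below groups already-parsed packet records by SSRC plus the IP and port pairs. The record layout (plain dicts) and the names `StreamKey` and `group_by_stream` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class StreamKey(NamedTuple):
    ssrc: int
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def group_by_stream(records):
    """Group packet records (dicts) by SSRC plus the IP/port 4-tuple."""
    streams = defaultdict(list)
    for rec in records:
        key = StreamKey(rec["ssrc"], rec["src_ip"], rec["src_port"],
                        rec["dst_ip"], rec["dst_port"])
        streams[key].append(rec)
    return dict(streams)


# Two calls that happen to reuse the same SSRC but involve different endpoints.
records = [
    {"ssrc": 0x11, "src_ip": "10.0.0.1", "src_port": 4000,
     "dst_ip": "10.0.0.2", "dst_port": 5000, "seq": 1},
    {"ssrc": 0x11, "src_ip": "10.0.0.3", "src_port": 4000,
     "dst_ip": "10.0.0.4", "dst_port": 5000, "seq": 1},
    {"ssrc": 0x11, "src_ip": "10.0.0.1", "src_port": 4000,
     "dst_ip": "10.0.0.2", "dst_port": 5000, "seq": 2},
]
streams = group_by_stream(records)
```

Filtering on the SSRC alone would merge all three records into one stream, whereas the composite key keeps the two calls apart.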

Rebuilding audio from RTP streams using our custom-built tool

Let’s now discuss the tool we built for audio extraction from RTP streams. It is able to extract and convert G.711 μ-law and A-law RTP stream packets into full audio media files, which are then evaluated using no-reference audio quality metrics to get a final quality score.

The tool itself is developed in Python, as it provides rapid and simple development, and strong library support.

Extracting packets

The first step when building this new tool was correct packet extraction. As mentioned above, we had to identify each unique RTP stream found in the network trace by filtering them out with SSRC, SRC/DST addresses and ports.

For this step we used Scapy, a powerful interactive packet manipulation library that can decode packets across a wide range of protocols and provides all the tools needed to filter out the required information.
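As a rough illustration of this step, and not our tool's actual code, the sketch below reads a capture with Scapy and yields the fields needed for stream identification. It assumes IPv4 traffic and Scapy's RTP layer field names (`payload_type`, `sequence`, `timestamp`, `sourcesync`). The helper names `looks_like_rtp` and `iter_g711_packets` are hypothetical.

```python
G711_PAYLOAD_TYPES = {0: "ulaw", 8: "alaw"}  # static RTP payload types (RFC 3551)


def looks_like_rtp(udp_payload: bytes) -> bool:
    """Cheap pre-check: at least a 12-byte header and RTP version 2."""
    return len(udp_payload) >= 12 and udp_payload[0] >> 6 == 2


def iter_g711_packets(pcap_path):
    """Yield one dict per G.711 RTP packet found in the capture."""
    # Imported lazily so the helper above stays usable without Scapy installed.
    from scapy.all import IP, UDP, PcapReader
    from scapy.layers.rtp import RTP

    with PcapReader(pcap_path) as reader:
        for pkt in reader:
            if IP not in pkt or UDP not in pkt:
                continue
            raw = bytes(pkt[UDP].payload)
            if not looks_like_rtp(raw):
                continue
            # Scapy does not know that this UDP payload is RTP, so decode it explicitly.
            rtp = RTP(raw)
            if rtp.payload_type not in G711_PAYLOAD_TYPES:
                continue
            yield {
                "ssrc": rtp.sourcesync,
                "src_ip": pkt[IP].src, "src_port": pkt[UDP].sport,
                "dst_ip": pkt[IP].dst, "dst_port": pkt[UDP].dport,
                "seq": rtp.sequence, "timestamp": rtp.timestamp,
                "codec": G711_PAYLOAD_TYPES[rtp.payload_type],
                "payload": bytes(rtp.payload),
            }
```

The version check is only a heuristic, because other UDP traffic can start with the same bits. That is one more reason to combine it with the payload type and the address filters.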

Building a payload

From the extracted packets, we can now build a full payload for conversion to audio. For the first iteration of this tool, we focused on well-established industry codecs: G.711 μ-law and A-law. G.711 is a narrowband audio codec originally designed for use in telephony and still in use to this day. It is public and royalty-free, with both the μ-law and A-law variants operating at an 8 kHz sampling rate and a 64 kbit/s bitrate. Both variants have fixed RTP payload types, 0 for μ-law and 8 for A-law, which makes it easier to filter out the required packets before building the audio media payload.

Packet timestamps are another critical factor. If we simply concatenated the filtered packet payloads, the resulting audio would never contain any of the silences that often occur due to various network conditions. To handle this, we chose to simulate RTP Timestamp playback timing as it is implemented in Wireshark, the open-source network protocol analyzer. This timing is calculated from the difference between the RTP timestamps of consecutive packets. For G.711 this is simple to work out, because the RTP timestamp advances by one unit per audio sample: a typical VoIP packet carries 20 ms of audio, which at a sampling rate of 8 kHz equals 160 samples per packet. Any timestamp difference above 160 therefore indicates additional silence between the two packets, which can be inserted accordingly.

With this logic in place, we iterate through all of the filtered packets to reconstruct a full payload that accurately reflects both audio content and silence intervals.
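The timestamp logic can be sketched as follows. This is a simplified illustration rather than our tool's implementation. It assumes packets are already in playback order and relies on the fact that in G.711 one payload byte is one sample and one RTP timestamp unit. The silence byte values (0xFF for μ-law, 0xD5 for A-law) are the code words that decode to approximately zero amplitude.

```python
ULAW_SILENCE = 0xFF
ALAW_SILENCE = 0xD5
TS_MODULO = 2 ** 32  # RTP timestamps are 32-bit and wrap around


def rebuild_payload(packets, silence_byte=ULAW_SILENCE):
    """Concatenate G.711 payloads, filling timestamp gaps with silence.

    `packets` is an iterable of (rtp_timestamp, payload_bytes) in playback order.
    """
    out = bytearray()
    expected_ts = None
    for timestamp, payload in packets:
        if expected_ts is not None:
            gap = (timestamp - expected_ts) % TS_MODULO
            # A gap in the upper half of the range means the packet is late
            # or duplicated, so no silence is inserted for it.
            if 0 < gap < TS_MODULO // 2:
                out.extend(bytes([silence_byte]) * gap)
        out.extend(payload)
        expected_ts = (timestamp + len(payload)) % TS_MODULO
    return bytes(out)


# Three 20 ms packets; the third arrives 320 timestamp units after the second,
# so 160 samples (20 ms) of silence are missing between them.
packets = [
    (0, b"\x01" * 160),
    (160, b"\x02" * 160),
    (480, b"\x03" * 160),
]
payload = rebuild_payload(packets)
```

Here the gap is measured from the end of the previous packet (its timestamp plus its length), which for 20 ms packets is equivalent to checking for a timestamp difference above 160.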

Converting to audio media

The final step in rebuilding audio from the network traces is to convert the decoded payload chunks into actual media files that can be used in quality evaluation. For this we use audioop, a Python standard-library module for low-level processing of raw PCM audio data, which is well suited to G.711 μ-law and A-law payloads. Note that audioop was deprecated in Python 3.11 and removed from the standard library in Python 3.13, so newer interpreters need a replacement. Using the built-in wave module, the decoded PCM data can then be written to a playable WAV audio file.
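To show what this conversion involves without depending on `audioop` (which is no longer available from Python 3.13 onwards), here is a sketch that swaps in a pure-Python G.711 decoder and writes the result with the `wave` module. The function names are illustrative and this is not the code our tool uses.

```python
import struct
import wave


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF
    magnitude = ((((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)) - 0x84
    return -magnitude if u & 0x80 else magnitude


def alaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 A-law byte to a signed 16-bit PCM sample."""
    a = byte ^ 0x55
    exponent, mantissa = (a >> 4) & 0x07, a & 0x0F
    if exponent == 0:
        magnitude = (mantissa << 4) + 8
    else:
        magnitude = ((mantissa << 4) + 0x108) << (exponent - 1)
    return magnitude if a & 0x80 else -magnitude


def decode_g711(payload: bytes, codec: str = "ulaw"):
    """Decode a whole payload; `codec` is "ulaw" or "alaw"."""
    decoder = ulaw_to_pcm16 if codec == "ulaw" else alaw_to_pcm16
    return [decoder(b) for b in payload]


def write_wav(target, samples, sample_rate=8000):
    """Write 16-bit mono PCM samples to a path or file-like object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


pcm = decode_g711(b"\xff\x00\x80", codec="ulaw")
```

On interpreters that still ship `audioop`, `audioop.ulaw2lin(payload, 2)` and `audioop.alaw2lin(payload, 2)` perform the same decoding in a single call.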

Figure: Steps for converting network traces into audio media

Audio evaluation

The next goal is to gather audio quality data for the extracted samples. In most test cases, we use full-reference audio metrics, such as POLQA or ViSQOL. These metrics provide more consistent and accurate results but, as the name implies, they require a reference signal for evaluation. No-reference metrics might not have the same consistency as full-reference metrics, but they provide much more flexibility when it comes to evaluation.

For this tool, we decided to use no-reference metrics, as this allows us to use it in various cases where we do not have a clear reference audio and still provide quality insights.

To provide results that are as accurate as possible, we use multiple audio quality metrics to better understand any quality issues or challenges encountered. These metrics are NISQA, SpeechMOS and our own custom, in-house and AI-based no-reference metric, Barko AQTDL.

NISQA

NISQA (Non-Intrusive Speech Quality Assessment) is a deep learning model for speech quality prediction. As a data-driven no-reference metric, it addresses the most common issue of full-reference metrics, the need for a reference signal, and tries to align more closely with human perception than simple signal-level metrics. The NISQA Corpus used for training includes more than 14,000 speech samples from both simulated and live audio recordings.

NISQA provides five quality indicators: MOS, Noisiness, Coloration, Discontinuity and Loudness. MOS is the main indicator of quality degradation (on a scale from 1 to 5); however, it does not always give insight into what caused the degradation. That is why NISQA includes the four additional dimensions. For training the NISQA weights, each speech file was labeled with both a subjective MOS value and scores for each of the other four dimensions. A deep neural network was then trained on these ratings, resulting in a model that can be used in various scenarios and still provide meaningful evaluation.

The overall structure for how NISQA evaluates audio is as follows:

1. A Mel-spectrogram is calculated from the input signal.

2. The spectrogram is then split into overlapping segments.

3. A framewise neural network takes these segments as inputs and uses them to compute features that are suitable for speech quality prediction.

4. These features are calculated on a frame basis and result in a sequence of framewise features.

5. Time dependencies within the feature sequence are modelled for each frame.

6. Finally, features are aggregated over time in a pooling layer and used to estimate a single MOS value.
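Step 2 of the list above can be illustrated in isolation. The sketch below splits a sequence of spectrogram frames into overlapping windows. The `width` and `hop` values are illustrative only and are not claimed to match NISQA's actual configuration, and the function name is hypothetical.

```python
def split_into_segments(frames, width, hop):
    """Return overlapping windows of `width` frames, advancing `hop` frames each time."""
    if width <= 0 or hop <= 0:
        raise ValueError("width and hop must be positive")
    return [frames[start:start + width]
            for start in range(0, len(frames) - width + 1, hop)]


# Ten dummy "frames" stand in for the columns of a Mel-spectrogram.
frames = list(range(10))
segments = split_into_segments(frames, width=4, hop=2)
```

Because the hop is smaller than the width, neighbouring segments share frames, which gives the framewise network some temporal context around each position.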

SpeechMOS

SpeechMOS is a quality evaluation metric that predicts the Mean Opinion Score of an audio sample, on a scale from 1 to 5, using a neural network trained on human perceptual ratings. It is a PyTorch-based tool that provides easier access to neural speech quality predictors, in this case UTMOS.

UTMOS is a neural network designed to predict human MOS values. It uses a system based on ensemble learning of strong and weak learners. Strong learners are neural network models that are able to estimate frame-level scores. Weak learners are more general models that use basic, non-neural machine-learning methods to predict scores from speech features. By combining the two, UTMOS captures more in-depth information through the strong learners while gaining generalization from the weak learners. This allows it to work well not only on the kind of speech it was mostly trained on, but also on samples that are outside its training domain.
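For orientation, a call to SpeechMOS might look like the sketch below. It assumes the `torch.hub` entry point `utmos22_strong` from the `tarepan/SpeechMOS` repository as documented in that project's README, so the tag and entry name should be checked against the current release. `read_wav_mono` and `predict_utmos` are hypothetical helper names, and the WAV reader only handles 16-bit mono PCM.

```python
import struct
import wave


def read_wav_mono(source):
    """Read a 16-bit mono WAV (path or file-like) into floats in [-1.0, 1.0)."""
    with wave.open(source, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in samples], rate


def predict_utmos(source):
    """Return the predicted MOS for one WAV file (downloads the model on first use)."""
    # Imported lazily: PyTorch is a heavy optional dependency.
    import torch

    samples, rate = read_wav_mono(source)
    # ASSUMPTION: repository tag and entry name as given in the SpeechMOS README.
    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                               trust_repo=True)
    with torch.no_grad():
        score = predictor(torch.tensor(samples).unsqueeze(0), rate)
    return float(score[0])
```

The predictor takes a batch of waveforms plus the sampling rate, which is why the single file is wrapped in a batch dimension with `unsqueeze(0)`.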

Barko AQTDL

Finally, for even more accurate results, we developed a new metric, Barko AQTDL. It is a custom no-reference metric developed by TestDevLab that combines the NISQA and SpeechMOS scores with AI analysis to provide a final quality MOS score on a scale from 1 to 5.

The metric works by sending the audio sample to an AI model together with a specific prompt. The prompt asks the model to rate the perception, pleasantness and overall quality of the given audio, and includes a description of each of these parameters. The AI looks specifically for human speech in the audio, meaning that samples that contain only call signals, extended silences, or other synthetic sound elements will receive lower scores.

Once these AI scores have been gathered, the average of the previously calculated NISQA and SpeechMOS scores is taken, and the AI score is applied to it as a scaling factor to produce the final, more fine-tuned quality score.
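The exact formula is not spelled out here, so the following is only a hypothetical sketch of how such a combination could look. It assumes the AI score is on the same 1 to 5 scale and is turned into a factor by dividing by 5, with the result clamped to the MOS range. The function name `combine_scores` and both assumptions are illustrative, not the actual Barko AQTDL implementation.

```python
def combine_scores(nisqa_mos: float, speechmos_mos: float, ai_score: float) -> float:
    """Scale the mean of two MOS predictions by an AI-derived factor.

    ASSUMPTION: `ai_score` is on a 1-5 scale and maps linearly to a factor
    in (0, 1]; the real metric may derive its factor differently.
    """
    base = (nisqa_mos + speechmos_mos) / 2
    factor = ai_score / 5.0
    return min(5.0, max(1.0, base * factor))


clean_speech = combine_scores(4.0, 3.0, ai_score=5.0)
mostly_silence = combine_scores(4.0, 4.0, ai_score=2.5)
```

With this shape, a sample that the two neural metrics rate highly is still pulled down when the AI evaluation finds little usable human speech in it.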

Final report

In the end, our custom-built tool uses three different metrics for quality evaluation. In essence, this process is similar to how we evaluate full-reference video quality data, where we evaluate not only VMAF but also PSNR and SSIM. Each metric has its own strengths and weaknesses, so having multiple data points makes the tool more robust and yields more thorough results. The same is true here, especially with Barko AQTDL, which allows the user to change the AI prompt to better suit their case. For example, NISQA and SpeechMOS are more suitable for wideband or WebRTC recordings, while Barko AQTDL, with its customizable AI prompt, can be tailored to narrowband or traditional telephony scenarios.

Figure: Steps in the audio evaluation process using various metrics

Key takeaways and future improvements

Before building this tool, we found no existing solutions capable of providing the exact functionality we required. While Wireshark can analyze and extract RTP streams in the format we require, its GUI does not lend itself well to automated processing pipelines, and its command-line version, tshark, does not have the functionality to extract audio with RTP Timestamp playback timing.

This task required deep expertise in both network trace and audio signal processing to build a tool that combines RTP stream extraction, conversion and perceptual quality modeling.

Regarding quality evaluation, we concluded that standard MOS scoring alone was insufficient. It did not always provide an accurate assessment of audio quality. However, by introducing multiple no-reference metrics, with an additional custom AI-driven, LLM-based metric, we could provide much more consistent and accurate scores, even in varied scenarios.

There are also some improvements that could be made. Currently, the tool only supports G.711 packets. Although G.711 is still quite popular, especially in traditional VoIP applications, it is an aging codec, and newer codecs have taken its place in many areas, such as Opus for real-time communication in WebRTC and other online meeting scenarios, or AAC, which is the standard for streaming. However, not all codecs are public, and some could require additional licensing to implement.

Processing time can also increase sharply for very large traces that contain many streams. A future improvement should focus on asynchronous multi-stream analysis to further optimize the tool.
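One possible shape for such an improvement, sketched here purely as an idea and not as something the tool currently does, is to hand each extracted stream to a worker pool. `evaluate_stream` is a hypothetical stand-in for the real decode-and-score step and simply returns the stream duration.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_stream(stream_id, payload):
    """Hypothetical stand-in for the per-stream decode-and-score step."""
    return stream_id, len(payload) / 8000.0  # duration in seconds at 8 kHz


def evaluate_all(streams, max_workers=4):
    """Evaluate every stream concurrently and collect results by stream id."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(evaluate_stream, stream_id, payload)
                   for stream_id, payload in streams.items()]
        return dict(future.result() for future in futures)


results = evaluate_all({"a": b"\xff" * 8000, "b": b"\xff" * 16000})
```

A thread pool suits work dominated by I/O or external model calls. For CPU-bound decoding, swapping in `ProcessPoolExecutor` would likely be the better fit.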

With regard to quality evaluation, the multi-metric approach already provides better and more accurate outcomes. Beyond that, both the NISQA and SpeechMOS models can be fine-tuned. This would take considerable work and additional data, but in very specific cases improved models could provide an even bigger benefit.

FAQ

Most common questions

What is RTP audio stream quality evaluation and why is it needed?

RTP audio stream quality evaluation is the process of assessing the perceptual quality of audio delivered over a network using Real-time Transport Protocol, without requiring a recorded audio file. It's needed when the only available data is a PCAP network trace from a completed call or session. While PCAP files don't contain audio recordings, they contain all the RTP packets that combine into the final audio signal, making reconstruction and quality evaluation possible with the right tooling.

How are RTP audio streams identified within a PCAP file?

Three filters are combined to reliably isolate individual RTP streams. The Synchronization Source identifier (SSRC) uniquely identifies a stream within a single RTP session but is insufficient alone when a trace contains multiple sessions. Source and destination IP addresses narrow identification to specific network endpoints. Source and destination ports identify which application process on each host handled the packets. Used together, these three filters produce a reliable method for pinpointing individual streams within complex network traces.

How is audio reconstructed from RTP packet payloads?

Packet payloads are extracted and combined, but accurate reconstruction requires simulating RTP Timestamp playback timing rather than simply concatenating payloads. For G.711 at 8 kHz, each packet typically carries 20 ms of audio, which is 160 samples. Any timestamp difference above 160 between consecutive packets indicates silence, which is inserted accordingly. This produces audio that accurately reflects both speech content and the silence intervals that occurred during the original call, rather than a compressed version with all silences removed.

What no-reference audio quality metrics does the tool use and why three?

The tool uses NISQA, SpeechMOS, and Barko AQTDL. Each metric has different strengths: NISQA provides five quality dimensions including noisiness, coloration, discontinuity, and loudness alongside MOS; SpeechMOS uses ensemble learning to generalise across both in-domain and out-of-domain speech; and Barko AQTDL combines both scores with AI-driven perceptual evaluation using a customisable prompt. Using three metrics mirrors the approach taken in video quality testing, where VMAF, PSNR, and SSIM are combined, because no single metric captures all dimensions of quality degradation reliably.

What makes Barko AQTDL different from existing no-reference audio metrics?

Barko AQTDL is TestDevLab's own AI-driven no-reference metric that uses an LLM to evaluate the perception, pleasantness, and overall quality of an audio sample, then uses that score as a scaling factor applied to the average of the NISQA and SpeechMOS scores. Unlike NISQA and SpeechMOS, its evaluation prompt is customisable, allowing it to be tailored to specific scenarios such as narrowband telephony or WebRTC without retraining the model. It also specifically targets human speech, meaning samples containing only call signals, extended silences, or synthetic audio receive appropriately lower scores.

No reference recording, no problem. We extract, reconstruct, and evaluate audio quality from network traces, so you know exactly what your users heard.
