Advancements in Audio and Video Quality Testing at TestDevLab

The year 2020 was very busy for our video and audio quality testing team. Due to COVID-19, many people around the world started working from home, and their communication now relies heavily on chat and video conferencing platforms. The demand for digital entertainment has also increased drastically. For example, over-the-top (OTT) streaming, a streaming media service offered directly to viewers via the Internet, increased worldwide by an average of 50%.

Increase in OTT streaming during the COVID-19 pandemic (1 of 2)

Increase in OTT streaming during the COVID-19 pandemic (2 of 2)

Vendors quickly jumped on the bandwagon by adding new features to existing products or by creating entirely new ones from scratch. And, of course, it is important to test those changes thoroughly.

At the beginning of 2020, we had two audio and video testing setups (a mix of hardware, software, and the engineers looking after them); this number has now increased to almost twenty. Every setup is customized to suit different product needs, be it VoIP calls, conferencing, live broadcasting, video on demand, or audio streaming. We have also worked extensively to increase the number of metrics we provide to our clients.

Audio/video quality testing metrics we already had in our portfolio:

Frame rate (FPS) – a metric that shows how fluid the video is;
Image quality (BRISQUE) – a no-reference machine learning algorithm designed to rate image quality objectively;
Video delay (latency) – the time between the moment the video stream is sent by the ‘Sender’ and the moment it is received by the ‘Viewer’;
Audio quality (POLQA) – a standard for predicting speech quality by analyzing the speech signal;
Audio delay – the time between the moment the audio stream is sent by the ‘Sender’ and the moment it is received by the other party.

Four major updates to our audio/video quality testing services:

Implemented industry-standard full-reference image quality assessments: VMAF, PSNR, and SSIM.
Successfully added video performance metrics: stall and freeze detection, audio and video synchronization, and resolution detection.
Started to run tests on physical devices in India, Brazil, the USA, Russia, and China. In total, around 150 locations are available for testing.
Updated network setup to decrease interference and incorporate bandwidth limitation, packet loss, and latency in one scenario.

1. Industry-standard full-reference image quality assessments

First, what is a video quality assessment algorithm? Essentially, it is a mathematical model that attempts to predict how humans would rate a video. Because it runs much faster than manual rating, the process can be automated.

Full-reference objective video quality algorithms we have added to assess videos:

VMAF – a recent addition to the family of perceptual video quality assessment algorithms, developed by Netflix and quickly becoming an industry standard. It combines several elementary metrics with different weights. The machine-learning model is trained and tested on opinion scores obtained through a subjective experiment. The downside is that it needs significantly more computational power than the other metrics. Read more about VMAF.

PSNR – a long-standing measure of image distortion, still cited by key players like Facebook and Netflix and used heavily to compare codecs. It is computed from the mean squared error between the pixels of the reference and degraded images, that is, from the error introduced by compression. The disadvantage is that PSNR compares raw pixel values, which do not correlate well with perceived picture quality because of the complex, highly non-linear behavior of the human visual system.
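
To make the PSNR definition concrete, here is a minimal sketch in plain Python. It assumes 8-bit pixel values flattened into a list; the function name `psnr` and the sample data are illustrative only and are not taken from our tooling.

```python
import math


def psnr(reference, degraded, max_value=255):
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between corresponding pixels.
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no distortion at all
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel rows from a reference frame and a received frame.
reference_pixels = [52, 55, 61, 59, 79, 61, 76, 61]
degraded_pixels = [54, 55, 60, 59, 77, 63, 76, 60]
print(round(psnr(reference_pixels, degraded_pixels), 2))
```

Higher values mean less distortion. Because the calculation only looks at pixel differences, two images with the same PSNR can still look very different to a viewer.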

SSIM – a perception-based model that treats image degradation as a perceived change in structural information, while also incorporating important perceptual phenomena. These include luminance masking and contrast masking; for example, distortions become less visible in regions of the image with significant activity or “texture”. Instead of comparing individual pixel values, this metric compares the image structures that humans perceive. Read more about SSIM.
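
The sketch below illustrates the SSIM formula in plain Python. It is a deliberate simplification: it computes a single global score over the whole pixel sequence, whereas practical SSIM implementations compute the score over small local windows and average the results. The function name `global_ssim` and the sample data are hypothetical; the constants 0.01 and 0.03 are the commonly used defaults.

```python
def global_ssim(x, y, max_value=255):
    """Single-window SSIM over two equally sized pixel sequences (simplified)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance capture contrast and structure.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Small constants stabilize the division when means or variances are near zero.
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


original = [52, 55, 61, 59, 79, 61, 76, 61]
brighter = [value + 20 for value in original]  # same structure, shifted luminance
print(global_ssim(original, original))
print(global_ssim(original, brighter))
```

Identical inputs score 1 (up to floating-point rounding), and any difference in luminance, contrast, or structure pulls the score below 1.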

Our full-reference image quality assessment process

Our Evaluation video example:

A full-reference image quality assessment differs from a no-reference assessment (BRISQUE) in that we need the original, undegraded file to compare with the video received during the stream. This might seem easy: if we use a media file saved on our machine to test broadcasting, then we do have the original. In reality, not only might the video quality degrade, but the received video is also likely to contain freezes, buffering, and dropped frames. Once that happens, we cannot compare it directly to the original, because the frames no longer align. Luckily, there is a solution: using QR codes, we can recreate the original video so that its frames match those of the degraded one.
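
The alignment idea can be sketched as follows. This is an illustration under an assumption that the text above does not spell out: each frame of the test video carries a QR code that encodes its frame number. The function name `rebuild_reference` and the sample data are hypothetical, and decoding the QR codes themselves is not shown.

```python
def rebuild_reference(reference_frames, decoded_indices):
    """Return reference frames arranged to match the received frames one to one.

    decoded_indices[i] is the frame number read from the QR code in received
    frame i, or None if the code could not be read.
    """
    aligned = []
    for index in decoded_indices:
        if index is None or not 0 <= index < len(reference_frames):
            aligned.append(None)  # unreadable frame: leave it out of the comparison
        else:
            aligned.append(reference_frames[index])
    return aligned


reference = ["f0", "f1", "f2", "f3", "f4"]
# Frame 1 was shown three times (a freeze); frames 2 and 3 were dropped.
received_ids = [0, 1, 1, 1, 4]
print(rebuild_reference(reference, received_ids))
# ['f0', 'f1', 'f1', 'f1', 'f4']
```

With the reference rearranged this way, each received frame can be scored against the exact original frame it shows, even after freezes and dropped frames.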

2. Additional video performance metrics

This is what one of those test videos looks like; we have highlighted which parts of the video are used for each metric.

Example of metrics used for video testing

Stall – we register a stall when the audio stops and the video freezes. Both are detected using the FFmpeg ‘silencedetect’ and ‘freezedetect’ filters. When we detect both a video freeze and audio silence for more than 200 ms (200 ms is the boundary at which a freeze becomes visible to the user), we record it as a stall. In addition to the video, the media we use for testing contains an audio stream with music and a voice countdown from 10 to 0. So whenever there is rebuffering and the audio stops, we detect the silence and record it as a stall.

Freeze – we register a freeze when, for an amount of time that we specify, the user sees the same frame during video playback while the audio continues. This is detected using the FFmpeg ‘freezedetect’ filter.
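
The stall rule can be sketched as an interval intersection. The example assumes the freeze and silence intervals have already been extracted from the FFmpeg output as sorted (start, end) pairs in seconds; the parsing step is not shown, and the function names and sample values are hypothetical.

```python
STALL_THRESHOLD_S = 0.2  # 200 ms: the boundary at which a freeze becomes visible


def intersect(a, b):
    """Intersect two sorted lists of (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def find_stalls(freezes, silences, threshold=STALL_THRESHOLD_S):
    """A stall is a period longer than the threshold that is both frozen and silent."""
    return [(s, e) for s, e in intersect(freezes, silences) if e - s > threshold]


freezes = [(2.0, 3.5), (8.0, 8.1), (12.0, 14.0)]
silences = [(2.25, 3.0), (8.0, 8.1), (20.0, 21.0)]
print(find_stalls(freezes, silences))
# [(2.25, 3.0)]
```

In this sample, the overlap at 8.0 s is too short to count, and the freeze at 12.0 s has no matching silence, so it would be reported as a freeze rather than a stall.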

In addition, for both stalls and freezes we measure:

Stall/freeze time;
Time between stalls/freezes;
Total stall/freeze time.
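
The three measurements listed above can be derived from a list of detected events. This is a minimal sketch that assumes each stall or freeze is a (start, end) pair in seconds, sorted by start time; the function name `summarize` and the sample events are hypothetical.

```python
def summarize(events):
    """Compute per-event durations, gaps between events, and the total time."""
    durations = [end - start for start, end in events]
    # Gap between the end of one event and the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(events, events[1:])]
    return {"durations": durations, "gaps": gaps, "total": sum(durations)}


stalls = [(2.0, 3.0), (10.0, 10.5), (30.0, 32.0)]
print(summarize(stalls))
# {'durations': [1.0, 0.5, 2.0], 'gaps': [7.0, 19.5], 'total': 3.5}
```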

Stalls and freezes strongly influence user experience, so it is important for stakeholders to find the right balance between image quality, latency, and buffer size.

Audio/video synchronization – customers expect it by default at all times, and it is a vital property to check for broadcasting solutions. We test it manually. Because our test video has both an audio countdown and a visual countdown in the corner, we can tell whether the two are in sync by watching the video. We validate the data we get from processing the files by watching the actual degraded videos, and during those spot checks we also note whether synchronization was maintained.

Resolution detection – when VOD or broadcast quality is set to Auto, what video resolution is actually sent to the viewer? We asked ourselves that question and solved it by adding multiple QR codes of different sizes. As shown, there are 9 of them; our algorithm can read all 9 in 4K resolution, 8 in 1080p, and so on.

Here are the results we got from processing our video with manually chosen resolutions on YouTube. During playback of the 1080p video, the script could read 8 QR codes; the eighth code corresponds exactly to 1080p.
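
The lookup from readable QR codes to resolution could be sketched like this. Only the two mappings stated above (9 codes for 4K, 8 codes for 1080p) are filled in; the remaining entries are not given here, so they are left out rather than guessed. Taking the most common count across frames is an assumption made for this illustration, and the names are hypothetical.

```python
from collections import Counter

# Only the two mappings stated in the text; other counts are not specified.
QR_COUNT_TO_RESOLUTION = {9: "4K", 8: "1080p"}


def detect_resolution(codes_read_per_frame, mapping=QR_COUNT_TO_RESOLUTION):
    """Map the most common number of readable QR codes to a resolution label."""
    if not codes_read_per_frame:
        return "unknown"
    most_common_count, _ = Counter(codes_read_per_frame).most_common(1)[0]
    return mapping.get(most_common_count, "unknown")


print(detect_resolution([8, 8, 9, 8, 8]))  # 1080p
print(detect_resolution([5, 5, 5]))        # unknown
```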

3. Running tests on physical devices in other countries

In our projects, we use a service that allows end-to-end testing and monitoring with thousands of real devices deployed in 100+ locations on real carrier and WiFi networks worldwide. The service supports:

iOS and Android;
all testing platforms, including Appium;
audio and video QoE metrics (although our tests have shown that statistics from HeadSpin should be used only as additional data, as they are not always accurate).

It is essential to make sure your application performs well in the countries where the majority of your users are, or in emerging markets such as India and Latin America, especially if you have servers in those countries or nearby.

It is important to note that this service is not for everyone, as it is quite pricey. The good news is that we have developed our own solution to simulate the network conditions common in the countries where your users are based.

4. Updated network setup

This is part of our network infrastructure for the audio and video laboratory.

Things we have implemented:

Access points with virtual routers. We were using 3 different types of routers for one laboratory, so to provide for 20 testers we would have needed to expand to up to 60 routers. Instead, we use Ubiquiti access points, on which 8 routers can be virtualized;
Started to use ‘tcset’ commands to limit network traffic, which allows parameters like bandwidth limitation, packet loss, and latency to be applied to a specific IP with one script;
Started to use ‘rsniffer’ commands to capture traffic;
Added extra Internet providers for redundancy.
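
As an illustration of applying bandwidth limitation, latency, and packet loss to one IP in a single step, the sketch below builds a ‘tcset’ command line. It assumes the ‘tcset’ tool from the open-source tcconfig package and its documented `--rate`, `--delay`, `--loss`, and `--network` options; check these against the version you have installed. The interface name, IP address, and values are hypothetical, and the command is only printed, not executed.

```python
import shlex


def build_tcset_command(device, target_ip, rate=None, delay=None, loss=None):
    """Build a tcset argument list that shapes traffic towards one IP address."""
    command = ["tcset", device, "--network", target_ip]
    if rate:
        command += ["--rate", rate]    # bandwidth limit, e.g. "1Mbps"
    if delay:
        command += ["--delay", delay]  # added latency, e.g. "100ms"
    if loss:
        command += ["--loss", loss]    # packet loss, e.g. "2%"
    return command


cmd = build_tcset_command("eth0", "192.168.1.20", rate="1Mbps", delay="100ms", loss="2%")
print(shlex.join(cmd))
```

Building the arguments as a list keeps one scenario in one place and makes it easy to pass to `subprocess.run` on the machine that routes the test traffic.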

Future plans:

Continue providing our clients with valuable insights on video and audio performance on mobile, desktop, and web platforms;
Follow trends, new technologies, and solutions in audio and video assessment worldwide;
Train our team to deepen its expertise in the tools we use.

Summary

Audio and video communication has become a crucial aspect of our lives, and with this boom in video technology, user expectations are rising. Be it video conferencing with a stable screen sharing function, video on demand without stalls for users in India, VoIP communication on a very busy, limited airport network, music streaming, or live gaming broadcasts, implementing recent technology advancements in the industry makes it possible to give users the experience they expect. To get a clear picture of product performance, testing is a necessity.

Our engineers do not simply deliver raw data; we analyze in depth the different aspects of video and audio performance, such as image and audio quality, latency, audio and video synchronization, and the number of stalls and freezes, and draw conclusions from the findings. We also validate the data, so you can be confident in making the right decisions to improve your product and ship the best possible experience to your clients worldwide.