Full-Reference Quality Metrics: VMAF, PSNR and SSIM
One of the most important parts of analyzing video applications is video quality. There are many image quality algorithms that help assess the quality of a given image, but they can all be divided into three main categories: no-reference, reduced-reference, and full-reference algorithms. The main difference between these types of assessment is the information they require. No-reference algorithms evaluate the degraded video on its own, relying on a previously created dataset; reduced-reference algorithms aim to predict the visual quality of distorted images with only partial information about the reference images; and full-reference algorithms need both the degraded and the original video for comparison.
In this article, we will focus on full-reference algorithms and look at three different full-reference quality assessment metrics—VMAF, PSNR and SSIM.
VMAF
Video Multi-Method Assessment Fusion (VMAF) is an open source full-reference video quality assessment algorithm developed by Netflix in 2016. It was created primarily because Netflix's wide and diverse catalog of content was not served well by any existing image quality assessment algorithm. While testing, Netflix discovered that many single elementary metrics were accurate for one specific case but inaccurate for another.
For example, the video quality of something like a nature documentary can subjectively look very different from the video quality of an animated show, even though the metrics would report no major differences.
A solution to this problem was to combine multiple elementary metrics using a machine-learning algorithm that assesses each individual metric and fuses them into a final score. This final metric retains the strengths of each separate metric and delivers a more accurate result regardless of the type of media being analyzed.
VMAF is a fusion of 38 elementary features. The three most important features and the only ones that are used in the pre-trained models are Visual Information Fidelity (VIF), Additive Distortion Metric (ADM) and motion.
VIF is a full-reference image quality metric derived from quantifying two mutual information quantities. The first is the mutual information between the input and the output of the human visual system (HVS) channel when no distortion is present, referred to as the reference image information. The second is the mutual information between the input of the distortion channel and the output of the HVS channel, known as the distorted image information.
ADM was previously named the Detail Loss Metric (DLM). The original work combines DLM with an Additive Impairment Measure (AIM); however, VMAF only uses the DLM part as an elementary feature. DLM is an image quality metric that measures both the loss of details that affect the visibility of the content and the impairments that can distract the viewer from it. The motion feature, on the other hand, estimates how much motion a video contains by comparing adjacent frames: the greater the difference between adjacent frames, the more motion there is in the video.
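To illustrate the idea behind the motion feature, here is a minimal sketch in Python, assuming the luma planes of two adjacent frames are available as NumPy arrays (and that NumPy and SciPy are installed). It is not libvmaf's actual implementation, whose filtering and normalization differ, but it captures the principle: low-pass filter the luma to suppress noise, then average the absolute frame difference.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_score(prev_luma: np.ndarray, curr_luma: np.ndarray) -> float:
    """Rough sketch of a motion feature: mean absolute difference between
    the low-pass filtered luma planes of two adjacent frames.
    Larger values indicate more motion between the frames."""
    # Low-pass filter each luma plane so sensor noise does not
    # masquerade as motion (sigma chosen arbitrarily for this sketch).
    prev_smooth = gaussian_filter(prev_luma.astype(np.float64), sigma=1.5)
    curr_smooth = gaussian_filter(curr_luma.astype(np.float64), sigma=1.5)
    return float(np.mean(np.abs(curr_smooth - prev_smooth)))
```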
An important part of VMAF is the range of models that can be used. Though there is the option to build custom models, the open source VMAF project provides several ready-made models that can be chosen based on the content you want to analyze.
The main difference between the models is the video resolution they are intended for. For example, evaluating a 4K video with a model intended for 1080p content will produce higher scores than it should. Another factor in how these models are created is viewing distance. The default 1080p model assumes a viewing distance of 3H (three times the height of the device screen). For lower-resolution models, such as 720p or 480p, the viewing distance should be increased accordingly (these models are best suited to analyzing videos from, for example, mobile devices).
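In practice, VMAF is often computed through FFmpeg's libvmaf filter. The sketch below, with hypothetical file names, shows one way to invoke it from Python. It assumes an ffmpeg build compiled with libvmaf support, and the exact filter option syntax (including how a model is selected) varies between ffmpeg and libvmaf versions, so treat it as a starting point rather than a canonical command.

```python
import subprocess

# Compute VMAF between a degraded clip and its reference using
# ffmpeg's libvmaf filter. With recent builds the first input is
# treated as the distorted clip and the second as the reference;
# check the documentation for your version, as the expected input
# order has changed historically.
cmd = [
    "ffmpeg",
    "-i", "distorted.mp4",   # degraded clip (hypothetical file name)
    "-i", "reference.mp4",   # pristine reference (hypothetical file name)
    # The default model targets 1080p viewing; a different model
    # (e.g. the 4K one) can be selected via the filter's model
    # option, whose syntax differs across libvmaf versions.
    "-lavfi", "libvmaf=log_fmt=json:log_path=vmaf.json",
    "-f", "null", "-",
]
subprocess.run(cmd, check=True)  # per-frame and pooled scores land in vmaf.json
```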
VMAF is evaluated on a linear scale from 0 to 100, where 0 is the lowest possible score and 100 the highest. Keep in mind that even if you compare a video to itself, VMAF does not guarantee a perfect score of 100; however, the score should come very close, such as 99.
PSNR
The peak signal-to-noise ratio (PSNR) is a non-linear (logarithmic) full-reference metric that compares the pixel values of the original reference image to those of the degraded image. PSNR is a long-established image quality metric, most commonly used to compare the compression performance of different codecs, for example in image compression.
To calculate PSNR, the mean squared error (MSE) must be calculated first. The lower the MSE, the lower the error and the higher the resulting PSNR. The underlying idea is that the higher the PSNR score, the better the degraded image has been reconstructed relative to the reference image, which in turn means that the algorithm used for reconstruction is better.
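For reference, for an m×n image with maximum possible pixel value MAX_I (255 for 8-bit content), the two quantities are defined as:

$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(R(i,j) - D(i,j)\bigr)^2$$

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX_I^{\,2}}{\mathrm{MSE}}\right)$$

where R is the reference image and D is the degraded image.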
PSNR can also be calculated for different color spaces, such as RGB or YUV. At TestDevLab, we most commonly calculate PSNR for brightness (Y, the luma component), as that is the information people are most sensitive to. As mentioned earlier, PSNR is a non-linear metric; it is measured in decibels, generally on a scale from 0 to 60 dB.
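As a concrete illustration, here is a small Python sketch that computes PSNR on the luma channel of two same-sized RGB frames. The BT.601 luma weights and the 8-bit value range are assumptions made for this example; it is a generic implementation, not our exact tooling.

```python
import numpy as np

def luma_psnr(reference_rgb: np.ndarray, degraded_rgb: np.ndarray) -> float:
    """PSNR in dB on the Y (luma) channel of two same-sized 8-bit RGB images."""
    # BT.601 luma weights; assumes 8-bit RGB input with values in [0, 255].
    weights = np.array([0.299, 0.587, 0.114])
    ref_y = reference_rgb.astype(np.float64) @ weights
    deg_y = degraded_rgb.astype(np.float64) @ weights
    mse = np.mean((ref_y - deg_y) ** 2)
    if mse == 0:
        return float("inf")  # identical luma planes: PSNR is unbounded
    return 10 * np.log10(255.0 ** 2 / mse)
```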
However, PSNR does have its drawbacks. While it is a great metric for comparing different codecs and methods of image compression, PSNR scores do not always correlate with perceived quality. One of the most common cases where PSNR fails to represent the perceived quality of an image is blurriness. For example, both of the images below have an average PSNR score of 19 even though, subjectively, the image on the left is far more distinguishable, while much less detail is visible in the blurry image on the right.
While problems like these do not occur every time, they highlight what was previously mentioned about VMAF and how it uses multiple elementary metrics to compensate for the weaknesses of individual metrics. It is also why we at TestDevLab use one more full-reference metric: SSIM.
SSIM
The structural similarity index measure (SSIM) is a non-linear full-reference metric that compares the luminance, contrast, and structure of the original and degraded images. SSIM was first introduced in 2004 as a new way to assess image quality. Instead of measuring the absolute errors between the reference and degraded pixels (as PSNR does), SSIM measures the structural elements of the pixels. In other words, SSIM compares the properties (luminance, contrast, and structure) of the pixels, while PSNR only checks the absolute error between them.
The first feature SSIM measures is luminance: the luminance of each signal is compared, and an average value over all the pixels is obtained. Next is contrast, which is estimated using the standard deviation of both the degraded and the reference signal; the two estimates are then compared. Last comes the structure comparison: both the reference and the degraded signals are normalized by their own standard deviations, and the two normalized signals are compared. The three features are then combined to form the structural similarity index.
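For reference, the commonly used combined form of the index from the original 2004 paper is:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where μx and μy are the mean luminances of the two signals, σx and σy their standard deviations (contrast), σxy their covariance (structure), and C1 and C2 are small constants that stabilize the division.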
SSIM is measured on a scale from 0 to 1, where the closer the score is to 1, the more similar the degraded image is to the reference image. As mentioned above, SSIM is a non-linear metric: results from 0.97 to 1 show minimal degradation, results from 0.95 to 0.97 represent low degradation, and results below these ranges indicate medium and heavy degradation.
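If you want to experiment with SSIM yourself, scikit-image ships a ready-made implementation. A small usage sketch, assuming scikit-image and NumPy are installed:

```python
import numpy as np
from skimage import data
from skimage.metrics import structural_similarity

reference = data.camera()                   # built-in 8-bit grayscale test image
rng = np.random.default_rng(0)
noise = rng.normal(0, 10, reference.shape)  # mild additive noise
degraded = np.clip(reference + noise, 0, 255).astype(np.uint8)

score = structural_similarity(reference, degraded, data_range=255)
print(f"SSIM: {score:.3f}")  # 1.0 would mean the two images are identical
```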
SSIM is very sensitive to any kind of structural change, such as stretching, rotation, or similar distortions, and it is also strongly affected by blockiness and blurriness. However, just as with PSNR, there are certain types of degradation that SSIM does not represent accurately compared to perceived image quality. Its effectiveness at recognizing structural changes can also work against it: a slight spatial shift of an image can produce a very low SSIM score even though the subjective image quality is practically identical to the reference. SSIM is also not well suited to evaluating changes in image hue and similar factors. For example, the image on the left below has an SSIM score of 0.99, while the image on the right, which has completely different colors and is nowhere near the reference image, has an SSIM score of 0.97, which is still very high.
Challenges and solutions
There are many challenges when it comes to using these full-reference metrics. As mentioned above, one of the greatest is the accuracy of a given metric. That is why we offer multiple full-reference metrics: even if there are isolated cases where PSNR or SSIM results do not match perceived image quality, multiple metrics provide us with more data to analyze.
Another challenge of full-reference metrics is that in order to compare the degraded and reference images, both must have the same dimensions. Here at TestDevLab we solve this issue with a custom spatial alignment solution based on a template-matching method, which transforms the comparable images into the same dimensions. We also use QR markers to align the frames of the reference video and the degraded video.
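As a rough illustration of the template-matching idea (not our actual alignment pipeline), the following Python sketch uses OpenCV's normalized cross-correlation to estimate where a degraded frame sits inside a reference frame. It assumes both frames are single-channel 8-bit NumPy arrays of the same dtype, with the reference at least as large as the degraded frame.

```python
import cv2
import numpy as np

def find_offset(reference_gray: np.ndarray, degraded_gray: np.ndarray) -> tuple[int, int]:
    """Estimate the (x, y) offset of the degraded frame within the
    reference frame using normalized cross-correlation."""
    # Use a central patch of the degraded frame as the template so it
    # is guaranteed to be smaller than the reference image.
    h, w = degraded_gray.shape
    patch = degraded_gray[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
    result = cv2.matchTemplate(reference_gray, patch, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(result)  # best-match top-left corner
    # Subtract the patch's origin within the degraded frame to recover
    # the offset of the full frame.
    return max_loc[0] - w // 4, max_loc[1] - h // 4
```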
One last challenge of gathering full-reference data is that the degraded video must be perfectly straight and free of external distortions. As mentioned previously, even the slightest differences in the scale or position of an image can heavily affect full-reference scores. Currently, the only way to ensure that the degraded video can be precisely aligned with the reference video is to capture it with screen recording software. This means that, at this time, it is not possible to gather full-reference data for mobile tests, as they use an external camera to film the degraded video. Even if you could somehow align the degraded video pixel-perfectly with the reference video, the external camera could still introduce unexpected transformations in the image that would heavily decrease the scores.
Key takeaways
Full-reference quality metrics provide a wealth of information, but it is important to know how to analyze them. VMAF is quickly becoming a leading industry standard, and its approach of fusing multiple elementary features allows it to cover a very wide range of use cases. That has not made metrics like PSNR and SSIM irrelevant, though: they are still widely used and can provide very accurate data depending on your use case. Therefore, when testing your audio and video solution, make sure to use the full-reference quality metrics that give you the most accurate insights.
Do you have an application with audio and video capabilities? We can help you test it using a variety of full-reference quality metrics. Drop us a line and let’s discuss your project.