TestDevLab’s Approach to Subjective MOS Video Quality Evaluation


Subjective MOS (Mean Opinion Score) evaluation is a fundamental method for assessing the perceived quality of video and images. It involves gathering human judgments to quantify quality, providing valuable insights into how viewers experience visual content.

While objective metrics provide a convenient and automated way to assess video or image quality, they often fail to capture the nuances of human perception.

Subjective evaluation, particularly using MOS, directly measures how viewers perceive quality. This approach is crucial because objective metrics, while useful for identifying certain types of distortions, don't always correlate well with the complex way humans process and interpret visual information. Ultimately, MOS provides a more accurate representation of the viewing experience, making it indispensable for applications where perceived quality is paramount.

This guide outlines practices that we, at TestDevLab, use to conduct subjective MOS evaluations, drawing primarily from ITU-T P.910, with additional information from other relevant ITU recommendations.

Why MOS instead of objective metrics?

In most cases, an objective video quality metric such as VMAF or VQTDL is the more practical choice: it is quicker to run, it is not affected by the mood or loss of focus of participants, and it does not require 15 people with prior training. However, there are situations where MOS simply makes more sense as the main video quality metric, or can at least serve as a secondary one.

Why our clients choose MOS as the main video quality metric

1. Real-time evaluation without interference. Clients need the ability to assess video quality directly from real-time playback within the app, capturing the true user experience. This approach eliminates the need for screen recordings or video extraction, which can interfere with the integrity of the evaluation.

2. No dependence on a reference source. Most objective video quality algorithms require a source video for comparison or training. However, such reference files are not always available, particularly in dynamic environments like live broadcasts captured with a smartphone’s back camera. These videos often feature constantly changing scenes without a predefined source, making traditional objective methods impractical.

3. Evaluating organic, user-generated content. In many cases, the videos that need to be evaluated were not created for quality testing purposes. While human viewers can easily judge how much a drop in video quality actually matters, objective algorithms may overpenalize technical issues such as blurriness, poor lighting, or motion blur from fast-moving objects. For such content, subjective MOS tends to provide more reliable and contextually appropriate insights.

4. Customizable and goal-oriented evaluation. Subjective MOS offers flexibility, as it is not bound by a rigid evaluation model. Instead, it can be tailored to specific project goals, emphasizing certain aspects of video quality depending on what matters most. This adaptability ensures more relevant and actionable results.

When to choose MOS as the secondary video quality metric

1. Initial assessment in the absence of established metrics. When no specific requirements or target metrics have been defined, MOS can serve as an effective secondary metric during the initial testing phase. This approach helps to gain preliminary insights into the app's performance and user experience, enabling informed recommendations for selecting appropriate metrics and testing strategies moving forward.

2. Determining the need for additional metrics post-MOS evaluation. Following a MOS session, it becomes clearer whether objective video quality metrics alone suffice or if supplementary measures, such as fluidity detection (freeze, stall, jitter) or advanced video quality indicators (blurriness, color accuracy), are necessary to provide a comprehensive evaluation.

3. Leveraging combined objective and subjective evaluations. Integrating both objective and subjective video quality metrics allows for valuable correlations between quantifiable data and user perceptions. This dual approach can lead to meaningful conclusions or prompt further investigation, recognizing that each testing scenario is unique and that MOS frequently enhances the video quality assessment process.

4. Addressing discrepancies between objective scores and user feedback. Objective video quality scores do not always align with user experiences. For example, a product may achieve top VMAF scores yet receive feedback indicating that competitors offer superior quality. In such cases, conducting MOS evaluations can reveal that perceived quality extends beyond a single objective metric. While some users prioritize pixel clarity, others may value factors like color saturation more highly. Deeper analysis—considering differences between client and competitor products, device types, user accounts, and other variables—often uncovers nuances missed by automated data processing.

The graph below shows the evaluation of 30 videos from various short-form video apps, using VMAF and MOS scores. The videos are arranged in ascending order from the lowest to the highest quality. The results show a correlation between the two metrics, with some notable outliers. These outliers are further investigated, and the reasons behind the discrepancies in scores are analyzed.

VMAF vs. MOS correlation
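
The kind of correlation-and-outlier check described above can be sketched in a few lines of Python. This is an illustrative example only: the score arrays are made-up placeholders, not the data behind the graph, and it assumes numpy and scipy are available.

```python
# Illustrative sketch: correlating per-clip VMAF and MOS scores and flagging
# the clips whose MOS deviates most from a simple linear fit, i.e. the
# outliers worth investigating manually. Scores below are placeholders.
import numpy as np
from scipy import stats

vmaf = np.array([47, 55, 62, 71, 78, 84, 88, 93], dtype=float)  # objective, 0-100
mos = np.array([2.4, 2.9, 3.1, 3.6, 3.9, 3.3, 4.2, 4.0])        # subjective, 1-5

r, _ = stats.pearsonr(vmaf, mos)
rho, _ = stats.spearmanr(vmaf, mos)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")

# Fit MOS ~ a * VMAF + b and rank clips by how far they fall from the fit.
a, b = np.polyfit(vmaf, mos, 1)
residuals = mos - (a * vmaf + b)
to_investigate = np.argsort(-np.abs(residuals))[:2]
print("Clips to investigate:", to_investigate.tolist())
```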

In the following example, we can see the same setup and the same conditions, but two different apps on two different devices.

The image on the left is more saturated and vivid, but its elements are slightly blurred. The image on the right has duller colors, but the sharpness and detail are much more pronounced.

In this case, there is no definitive answer as to which example is higher quality. Some viewers might prefer the saturated colors, while others prioritize pixel clarity and detail.

This is where subjective video evaluation becomes valuable. For instance, even when assessing the same video, scores can vary significantly between participants, allowing us to explore the reasons behind these differences and better explain the results.

Key ITU recommendations

Before we go into the evaluation event, take a look at some of the key ITU recommendations you should consider.

  • ITU-T P.910: Subjective video quality assessment methods for multimedia applications
  • ITU-R BT.500-13: Methodology for the subjective assessment of the quality of television pictures
  • ITU-T P.800: Methods for subjective determination of transmission quality
  • ITU-T P.808: Subjective evaluation of speech quality with a crowdsourcing approach
  • ITU-T P.1401: Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models

MOS evaluation event

Now let’s walk through the stages of a MOS evaluation event, from preparation to reporting.

1. Process guidelines

The initial phase involves careful planning and design.

First, it's crucial to define the test objective by clearly stating the evaluation's purpose, such as comparing video codecs or assessing the impact of packet loss. Identify the specific quality attributes to be evaluated, including sharpness, color fidelity, motion smoothness, and the absence of artifacts.

Next, select test materials that are representative of the intended application, varied in content, and of high quality. Consider using standardized test sequences.

Then, determine the test conditions. This includes specifying the viewing environment by controlling lighting, using a neutral background, and setting the appropriate viewing distance. The display should be calibrated to ensure accurate reproduction.

Choose a suitable test methodology based on the objective of the evaluation event as well as available resources. Some methods are more complex and time-consuming than others. 

The most common methods are outlined below:

  • ACR (Absolute Category Rating): Viewers rate each video independently. This is good for general quality assessment.
  • DSIS (Double Stimulus Impairment Scale): Viewers see the original and the degraded video, then rate how annoying the degradation is. This is good for evaluating the visibility of impairments.
  • DSCQS (Double Stimulus Continuous Quality Scale): Viewers continuously compare the original and degraded videos side by side and rate the quality. This is good for both quality and impairment evaluation.
  • Pair Comparison: Viewers are shown videos in pairs and say which one they prefer. This is good for ranking videos.
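
For reference, ACR uses the five-level category scale from ITU-T P.910 (5 Excellent, 4 Good, 3 Fair, 2 Poor, 1 Bad). The minimal Python sketch below, using our own hypothetical helper name rather than any standard API, shows how per-viewer ACR ratings for a single clip collapse into a MOS.

```python
# The five-point ACR scale from ITU-T P.910, plus a tiny helper (our own
# naming, not a standard library call) that averages one clip's ratings.
ACR_SCALE = {5: "Excellent", 4: "Good", 3: "Fair", 2: "Poor", 1: "Bad"}

def mos(ratings: list[int]) -> float:
    """Average the ACR ratings that all viewers gave to a single clip."""
    if any(r not in ACR_SCALE for r in ratings):
        raise ValueError("ACR ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)

print(mos([4, 5, 3, 4, 4]))  # five viewers -> MOS of 4.0
```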

Finally, recruit representative subjects, considering factors like age and visual acuity, screening for impairments, and determining the required number of subjects for statistical significance.
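
One way to reason about panel size is to work backwards from the confidence-interval width you are willing to accept. The sketch below is a statistical rule of thumb, not an ITU formula, and it assumes ratings are roughly normally distributed and that a pilot session gives an estimate of their standard deviation.

```python
# Rule-of-thumb sketch (not an ITU requirement) for choosing a panel size:
# given the rating spread seen in a pilot session, how many subjects are
# needed so the 95% confidence interval is no wider than +/- half_width?
import math

def subjects_needed(pilot_std: float, half_width: float, z: float = 1.96) -> int:
    """Solve n from  half_width = z * std / sqrt(n)."""
    return math.ceil((z * pilot_std / half_width) ** 2)

# Example: ratings spread of 0.8 on the 5-point scale, target CI of +/- 0.4.
print(subjects_needed(0.8, 0.4))  # -> 16 subjects
```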

2. Test environment setup

Setting up the test environment is essential for minimizing bias and ensuring consistent results. A controlled environment is crucial, where external distractions and interruptions are minimized, and a comfortable, consistent viewing experience is ensured.

Furthermore, display calibration is necessary: the device should be calibrated with appropriate calibration equipment and software to ensure accurate color reproduction and luminance levels.

3. Test execution

The test execution phase involves several key steps to ensure accurate and reliable data collection.

  • Subject instruction provides a clear and concise explanation of the test's purpose, the rating scale, and the overall procedure. Practice trials should be included to familiarize subjects with the test interface.
  • Stimulus presentation must adhere to the pre-defined order and timing, with careful attention to accurate synchronization of audio and video, if applicable.
  • Data collection requires recording subjects' ratings accurately, along with any relevant metadata such as subject ID and viewing conditions. Collecting any subjective comments or feedback from the subjects is also important.

Finally, if using crowdsourcing, it's crucial to implement measures to ensure data quality. This includes providing clear and detailed instructions, incorporating training and qualification tasks, monitoring subject performance to identify potential outliers, and considering the use of attention checks.
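
As an illustration of such quality checks, the sketch below shows two common screening ideas: "gold" clips with a known expected rating, and flagging raters whose scores correlate weakly with the rest of the panel. All names and threshold values are assumptions for the example, not an implementation of ITU-T P.808.

```python
# Hedged sketch of two crowdsourcing quality checks: gold-clip attention
# checks and leave-one-out agreement screening. Thresholds are illustrative.
import numpy as np

def fails_gold_checks(rater: dict, gold: dict, tolerance: int = 1) -> bool:
    """True if the rater misses any gold clip by more than `tolerance` points."""
    return any(abs(rater[clip] - expected) > tolerance
               for clip, expected in gold.items())

def low_agreement_raters(scores: np.ndarray, min_r: float = 0.5) -> list[int]:
    """scores: raters x clips matrix. Flags raters whose ratings correlate
    weakly with the mean of all other raters on the same clips.
    (Constant ratings produce NaN here, which also warrants review.)"""
    flagged = []
    for i in range(scores.shape[0]):
        others_mean = np.delete(scores, i, axis=0).mean(axis=0)
        r = np.corrcoef(scores[i], others_mean)[0, 1]
        if not r >= min_r:
            flagged.append(i)
    return flagged
```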

3.1. Video impairment examples

While objective metrics can sometimes quantify common video impairments, subjective evaluation, particularly MOS, reveals their actual impact on perceived quality. Here are some common examples:

Images showing different types of distortions in video quality.

By understanding these and other artifacts, we can better design subjective tests and interpret the results, leading to improved video and image quality.

4. Data analysis

The analysis of the collected data involves several steps (a short code sketch follows the list below).

  1. Calculate the MOS for each test condition by averaging the ratings provided by all subjects for that condition, and calculate confidence intervals to indicate the statistical reliability of the MOS values.
  2. Perform appropriate statistical tests to determine if there are statistically significant differences in perceived quality between different test conditions, as well as analyze the distribution of the data and identify any potential outliers.
  3. If comparing subjective results with objective quality metrics, calculate correlation coefficients, such as Pearson and Spearman, to assess the relationship between subjective and objective scores, perform regression analysis to model the relationship, and evaluate the prediction accuracy of the objective metric.
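
Here is a minimal sketch of the first two steps, assuming the ratings for each test condition are stored as plain arrays and that scipy is available. The numbers, and the choice of a Wilcoxon signed-rank test for paired ratings, are illustrative rather than prescriptive.

```python
# Minimal sketch: MOS with a 95% confidence interval per condition, plus a
# significance test between two conditions. Ratings below are made up.
import numpy as np
from scipy import stats

condition_a = np.array([4, 5, 4, 3, 4, 5, 4, 4, 3, 4, 5, 4, 4, 4, 3])
condition_b = np.array([3, 3, 4, 2, 3, 4, 3, 3, 2, 3, 4, 3, 3, 3, 2])

def mos_with_ci(ratings, confidence=0.95):
    """Return the MOS and the half-width of its confidence interval."""
    mos = ratings.mean()
    sem = stats.sem(ratings)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mos, half_width

for name, ratings in (("A", condition_a), ("B", condition_b)):
    m, h = mos_with_ci(ratings)
    print(f"Condition {name}: MOS = {m:.2f} +/- {h:.2f}")

# The same 15 viewers rated both conditions here, so a paired test applies;
# an unpaired test (e.g. Mann-Whitney U) would be used for separate panels.
statistic, p_value = stats.wilcoxon(condition_a, condition_b)
print(f"Wilcoxon signed-rank p = {p_value:.4f}")
```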

Common metrics used in the analysis include:

  • MOS Score: The average of all collected scores.
  • Confidence interval: The range within which the true MOS is expected to lie, with a given level of confidence (usually 95%).
  • Standard deviation: A measure of the variability or dispersion of scores around the MOS.
  • Confidence interval limits: The lower and upper bounds of the confidence interval.
  • MOS Score with excluded outliers: The average score calculated after removing outliers, identified using the interquartile range (IQR) method (see the sketch after this list).
  • Correlation with objective video quality metrics: Analysis of how MOS relates to other quantitative quality measures.
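
The IQR-based exclusion mentioned above can be sketched as follows; the scores and the conventional 1.5 x IQR cut-off are assumptions for the example. Correlation with objective metrics can be computed as in the VMAF/MOS sketch earlier in the article.

```python
# Illustrative IQR-based outlier exclusion for the ratings of one clip.
import numpy as np

scores = np.array([4, 4, 5, 3, 4, 1, 4, 5, 4, 4, 3, 5])  # all ratings for one clip

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
kept = (scores >= q1 - 1.5 * iqr) & (scores <= q3 + 1.5 * iqr)

print(f"MOS (all scores):        {scores.mean():.2f}")
print(f"MOS (outliers excluded): {scores[kept].mean():.2f}")
```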

5. Reporting

The final stage is reporting the findings. The report should document the entire test procedure in detail, including the test objective, test materials, test conditions, test methodology, subject selection criteria, and data analysis methods.

The results should be presented clearly and concisely, including MOS values for each test condition, confidence intervals, results of statistical analyses, and visual aids like graphs and tables.

The report should also include a discussion of the findings, covering the interpretation of the results in the context of the test objective, a comparison of the performance of different test conditions, any limitations of the study, and the overall conclusions and recommendations.

Presentation of the results

After the evaluation event concludes, the results are collected, and the average score for each sample is calculated to determine the Mean Opinion Score (MOS). If necessary, individual scores may be reviewed to identify and remove unreliable participants; however, this should be done only in exceptional cases, as every score potentially reflects a valid user experience. Using the collected MOS scores, further analysis is conducted to identify trends, detect issues, highlight areas for improvement, and explore correlations with other video quality metrics.

The confidence interval of the statistical distribution of assessment scores should accompany the MOS. Typically, a 95% confidence level is used, and an unusually wide interval can signal unreliable or divergent participant scores.

Evaluation reports must include detailed information about the evaluation environment, such as the materials assessed, the number of participants, and both the original and adjusted scores—before and after the exclusion of any participants.

Common issues

When conducting subjective evaluations, several common challenges can affect the accuracy and reliability of scoring. Understanding these issues helps ensure more consistent and meaningful results.

1. Tendency to centralize scores

There is often a tendency to keep scores centralized, with participants evaluating all samples within a narrow scoring range, even when the quality varies significantly. This may result from participants’ inexperience, fear of making mistakes (despite there being no wrong answers in subjective evaluations), or other psychological factors. To address this, it is essential to provide comprehensive training that includes examples spanning the full quality spectrum. Allowing participants to practice scoring these examples helps familiarize them with using the entire scoring range effectively.

2. Influence of peer scores

Participants may feel tempted to peek at others’ scores, either to validate their own opinions or due to apprehension about being incorrect. To minimize this behavior, it is best to withhold visibility of other participants’ responses and average scores until the evaluation process is complete.

3. Absence of reference examples

In some cases, participants may not be provided with reference materials such as examples of different quality levels or source videos. Under these circumstances, it is advisable to involve experienced evaluators who possess a solid understanding of key quality indicators and can recognize artifacts that signal a drop in quality.

Additional considerations

In addition to the guidelines provided, here are some other elements to consider when conducting the MOS evaluation event:

  • Subject fatigue. Keep test sessions short to minimize subject fatigue, which can affect the reliability of the results.
  • Training. Provide adequate training to subjects before the actual test to ensure they understand the procedure and the rating scale.
  • Reproducibility. Document the test procedure thoroughly to allow for replication of the study.
  • Context. Consider the context in which the video or images will be viewed in real-world scenarios.
  • Audio-visual interaction. When assessing video quality, especially for content like movies or TV shows, consider the influence of audio quality on the overall perceived quality.

TestDevLab’s experience

Our work in video and image quality assessment spans a range of activities, including metric development, subjective evaluation, and user experience analysis.

1. R&D activities

We are actively involved in the development, training, and validation of video evaluation metrics. This includes benchmarking new metrics against subjective data, developing no-reference metrics for diverse content, and conducting validation experiments. We also create subjective evaluation datasets to train machine learning models and use subjective insights to improve automated assessment systems.

2. Subjective MOS as a stand-alone metric

We conduct MOS sessions to gather user feedback on video quality under controlled conditions. Our work involves assessing video performance across varying network conditions, analyzing video quality consistency over time, performing comparative studies across different conditions, evaluating the trade-offs between video quality and compression efficiency, and ensuring consistent video quality across devices and user groups.

3. Subjective evaluation as an additional user experience indicator

We use subjective evaluations to identify user experience and video-related performance issues. This includes identifying factors contributing to viewer dissatisfaction, evaluating the impact of color grading and dynamic range, and detecting and quantifying video artifacts.

Final thoughts

Subjective MOS (Mean Opinion Score) evaluation is a powerful tool for understanding human perception of video and image quality. By following rigorous methodologies and adhering to ITU recommendations, we can obtain reliable and meaningful results that inform the development and optimization of video technologies, ultimately enhancing user experiences.

While there is no definitive rule for choosing a video quality evaluation method, the ability to capture reliable data on user-perceived quality offers a significant advantage. Although subjective MOS evaluations may not always be the first choice for analyzing large volumes of videos—especially when quick, precise, and objective results are required—there are situations where MOS provides a more realistic and in-depth analysis of video quality.

The main challenge lies in ensuring that participants possess the necessary experience and that the evaluation environment meets strict requirements. At TestDevLab, we're committed to conducting training sessions that prepare participants for video quality evaluations and to providing the appropriate equipment and controlled environments needed for accurate assessments.

Wondering how your video really looks to users? We’ve got you covered. Contact us to learn more about our video quality testing services, and let’s talk about how we can improve your viewers’ experience.
