Testing Voice Processing Features in Communication Apps


We live in a time when almost everything is available online, so people can shop, study, or work from anywhere with an internet connection. The use of audio and video communication has grown rapidly over the last decade, and the pace is only picking up, especially in the circumstances at the time of writing, when everyone has to stay safe and spend most of their time at home. To illustrate how much the use of voice communication apps has increased, here are some statistics:

  • There were more than 300 million daily “Zoom” users in the second quarter of 2020 and more than 200 million daily “Microsoft Teams” users in the same period.
  • The usage of “Microsoft Teams” has grown by 894% since the COVID-19 lockdown began a year ago. That beats the growth of “Zoom,” which achieved 667% growth in the same period.
  • Interest in the following Google searches peaked or increased in India, the USA, and the UK in 2020 compared to the previous four years:
      • “How to remove background noise from audio”
      • “How to remove echo from audio”
      • “How to adjust microphone volume”
Figure 1. Interest statistics for Google search “How to adjust the microphone volume”.

Because communication apps have become an irreplaceable part of our lives, voice communication must stay clear and pleasant even under poor network and environmental conditions, so that no information is lost and productivity stays high.

A lot of innovation has been happening in the voice processing field. Examples include Microsoft’s AI-based background noise suppression, which uses a trained neural network to recognize background noise, and Polycom Acoustic Fence technology, which blocks not only random background noise outside the virtual “fence” but also the voices of other people who are not supposed to be heard. Many other advanced voice processing solutions exist as well. It is also worth mentioning that companies such as LiveSensus are working on no-reference audio evaluation algorithms, and anyone can help their research by filling out their survey.

In the audio industry, more and more voice modding tools (such as VoiceMod) are also becoming available and are popular with gamers and streamers. Frequency response, latency, and CPU usage could be useful metrics for testing such features.

To ensure that online communication quality is as good as possible, it is necessary to test audio and video quality, but it is just as important to test voice processing features. Good voice processing features can sway consumers towards a product instead of its competitors. That preference makes sense: the voice is the most vital part of a call, and if it cannot be heard, no information is passed.


This article looks at three of the most critical voice processing features: noise suppression, echo cancellation, and auto volume adjustment.

Most popular voice communication apps have at least one of these features built in, but not every app lets the user adjust and control them. For example, “Zoom” has all of the mentioned features built in, and its desktop app lets the user adjust all of them in the settings.

Figure 2. Zoom audio settings.

“Microsoft Teams” also has all of the mentioned features built in, but its desktop version only lets the user control noise suppression from the settings.

Not being able to control these features is a significant problem for many users of high-quality equipment, who have been complaining about it. Suppose a user has a high-quality audio setup that produces audio with minimal noise. They might want the transmitted sound to stay as close as possible to the original audio, but they are denied this option. It is also an issue for users who wish to share their gaming experience with game sound and music enabled, because that audio gets corrupted or cut out if built-in noise suppression cannot be turned off.

Even though these voice processing features seem familiar and essential, not all products have them. Implementing these features or improving existing ones, and giving users the ability to control them, might therefore drastically improve an application’s quality.


Noise suppression

What is it?

In scientific terms, noise suppression is the process of removing noise from a signal, which can be useful in both audio and image processing. In the context of voice processing, it means suppressing the noise from your surroundings before it reaches the person you are on a call with, and the noise from their surroundings before it reaches you, ultimately leaving only the voices of the speakers.

A total absence of noise is impossible, even in perfect conditions, so it has to be dealt with by processing the audio with hardware solutions, software solutions, or both. No solution is perfect, however, which is why testing is crucial.

How does it affect audio quality?

You might have to be on a work-related call while your colleagues are on their own calls or just chatting. Or maybe you are on a call at home, and your family or pets are being loud. Or perhaps you are at the mall or another busy public space, but you have to take an incoming call. The scenarios are endless, and background noise is a real problem because you cannot always deal with it yourself. It dramatically reduces the clarity of voice communication, making it hard for you to hear others and for others to hear you, which leads to misunderstandings and loss of information.

How do we test it?

  1. Using an audio setup on the sender side, we play an audio file with noise, or with a mix of noise and voice, to check that the voice is not suppressed along with the noise.
  2. On the receiver side, we mute the microphone, because some features reduce incoming audio during speech and can affect the results.
  3. During the call, we record the processed audio on the receiver side.
  4. After the call, we compare the receiver-side recording with the original audio file that was played from the sender side, using waveform graphs of the audio tracks.
  5. We analyze how long it takes for the noise suppression to start taking effect (delay).
  6. We analyze the amount of suppression applied by comparing the volume of the receiver-side recording with the volume of the original noise file (in dB).
  7. We analyze the consistency of the suppression throughout the receiver-side recording.
  8. We also listen to the recordings to further determine the amount and quality of suppression.

Figure 3 shows a simplified example of noise suppression, where the gray graph is the noise before suppression and the blue graph is the noise after suppression. Visual materials like this, and the volume data gathered to generate them, make it possible to analyze the amount, consistency, and delay of the suppression.

Figure 3. Noise before and after suppression.
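The delay, amount, and consistency measurements described in the test steps can be approximated in a few lines of Python. The sketch below is only an illustration, not the tooling used for the tests in this article. It assumes that the original and recorded tracks are already time-aligned, share one sample rate, and are loaded as NumPy arrays. The function names, the 20 ms frame size, and the 6 dB onset threshold are hypothetical choices made for the example.

```python
import numpy as np

def frame_levels_db(signal, sample_rate, frame_ms=20):
    """Return the RMS level of each frame in dB relative to full scale."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = np.asarray(signal[:n_frames * frame_len], dtype=float)
    frames = frames.reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # Floor the RMS so that digital silence does not produce log10(0).
    return 20 * np.log10(np.maximum(rms, 1e-10))

def suppression_report(original, recorded, sample_rate,
                       frame_ms=20, threshold_db=6.0):
    """Estimate onset delay, amount, and consistency of noise suppression.

    Assumes `original` and `recorded` are time-aligned noise tracks.
    """
    orig_db = frame_levels_db(original, sample_rate, frame_ms)
    rec_db = frame_levels_db(recorded, sample_rate, frame_ms)
    n = min(len(orig_db), len(rec_db))
    reduction = orig_db[:n] - rec_db[:n]  # positive means quieter recording

    # Onset: first frame in which the reduction reaches the threshold.
    active = np.flatnonzero(reduction >= threshold_db)
    if active.size == 0:
        onset_s, steady = None, reduction
    else:
        onset_s = active[0] * frame_ms / 1000
        steady = reduction[active[0]:]

    return {
        "onset_s": onset_s,                             # delay
        "mean_reduction_db": float(np.mean(steady)),    # amount
        "std_reduction_db": float(np.std(steady)),      # consistency
    }
```

A low standard deviation after the onset suggests consistent suppression, while a large one points to suppression that pumps or drops out. Listening to the recordings remains necessary, since level statistics alone do not capture artifacts.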

Echo cancellation

What is it?

Echo is sound caused by the reflection of sound waves from a surface back to the listener. In voice processing, echo is sound that travels back to the sender through the receiver with a delay and potential distortion, and echo cancellation tries to remove it. In short, if echo is present in a voice call, you can hear yourself when you are not supposed to. Using a speakerphone instead of a headset, a slow internet connection, electromagnetic interference, or damaged equipment can all cause echo.

The concept of echo is a bit more complicated than the other voice processing features mentioned in this article; therefore, a simple diagram of how echo is canceled is presented in Figure 4.

Figure 4. Echo cancellation process diagram.

As you can see in the diagram, the echo gets canceled at the receiver side before the audio gets sent back to the sender side. Echo gets canceled by recognizing the originally transmitted audio signal that re-appears with some delay in the transmitted or received signal and then subtracting it from the transmitted or received signal. This can be accomplished either by implementing a signal processor (hardware) or some echo cancellation algorithm (software).
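To make the "recognize and subtract" idea more concrete, here is a minimal software sketch. The article does not say which algorithm any particular app uses, so the example swaps in a textbook technique, a normalized least-mean-squares (NLMS) adaptive filter, purely for illustration. The function name, the number of taps, and the step size are assumptions, and a production echo canceller would also need double-talk detection and residual echo suppression.

```python
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=64, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the far-end echo from the mic signal.

    far_end: audio received from the other party (played by the speaker)
    mic:     audio captured by the microphone (may contain the echo)
    Returns the mic signal with the estimated echo removed.
    """
    far_end = np.asarray(far_end, dtype=float)
    mic = np.asarray(mic, dtype=float)
    weights = np.zeros(taps)                 # estimated echo path
    out = np.zeros(len(mic))
    # Pad so that the first samples have a full history window.
    padded = np.concatenate([np.zeros(taps - 1), far_end])

    for n in range(len(mic)):
        # x[k] is the far-end sample from k samples ago.
        x = padded[n:n + taps][::-1]
        echo_estimate = weights @ x
        error = mic[n] - echo_estimate       # echo-cancelled output sample
        out[n] = error
        # NLMS update: step size normalized by the input energy.
        weights += mu * error * x / (x @ x + eps)
    return out
```

The filter learns how the far-end signal re-appears in the microphone signal (delay and attenuation) and subtracts that estimate sample by sample. The number of taps bounds the longest echo delay the sketch can model.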

How does it affect audio quality?

Echo can be one of the most frustrating and disruptive factors in a voice call. With background noise, you have a better chance of controlling the amount of unwanted sound by moving to a quieter place or dealing with the source of the noise. With echo, you may not be able to control what your setup generates if your audio setup is too limited. Most people who use audio communication regularly have probably experienced this issue and know how unpleasant it can be. Echo makes it very hard to understand what the person or group you are talking to is saying, because you might hear yourself over the other person, and it can be confusing for all call participants. Echo and its improper cancellation can introduce delay, distortion, and unexpected volume changes, and can drastically decrease voice communication quality.

How do we test it?

  1. For result consistency, we play a constant audio file from the sender side using an audio setup.
  2. On the receiver side, we either play silent audio (not muted) or play a constant audio file to see how the incoming audio reacts with outgoing audio (how the echo interacts with the audio that the receiver is sending back to the sender).
  3. During a call, we record:
    1. the outgoing audio on the sender side (naturally, it would be the audio recorded by the microphone)
    2. the incoming audio on the receiver side (the audio that is coming out of the receiver device speaker)
    3. the outgoing audio on the receiver side (the audio that is recorded by the receiver device microphone)
    4. the incoming audio on the sender side (the audio that is coming out of the sender device speaker)
  4. After all the necessary audio files have been recorded in a call:
    1. We listen to each file to assess the echo level and sound quality.
    2. We compare each file with the original audio file played on the sender side to see which parts and how much of the audio became echo, using visual representations of the audio tracks for more straightforward analysis.
    3. We measure the delay between sent and received audio – the delay of echo.
    4. The maximum volume of each audio file is measured to see how the volume changes during voice processing.

We also use multiple audio output and input device combinations to cover most scenarios where echo could occur during testing.
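The echo delay and maximum volume measurements from the steps above can be sketched as follows. This is a hedged illustration rather than the exact method used for the tests in this article: it assumes that both tracks are mono NumPy arrays with the same sample rate, it estimates the delay from the peak of the cross-correlation, and the function names are hypothetical.

```python
import numpy as np

def estimate_delay_seconds(reference, recording, sample_rate):
    """Estimate how far `recording` lags behind `reference`, in seconds."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(recording, dtype=float)
    ref = ref - ref.mean()
    rec = rec - rec.mean()
    # Index i of the full cross-correlation corresponds to
    # a lag of i - (len(ref) - 1) samples.
    corr = np.correlate(rec, ref, mode="full")
    lag = int(np.argmax(np.abs(corr))) - (len(ref) - 1)
    return lag / sample_rate

def peak_dbfs(signal):
    """Return the maximum absolute sample value in dB relative to full scale."""
    peak = np.max(np.abs(np.asarray(signal, dtype=float)))
    return 20 * np.log10(max(peak, 1e-10))
```

Comparing `peak_dbfs` of the echo recording with that of the original file gives a rough figure for how much the echo was attenuated. Direct cross-correlation is slow on long recordings, so an FFT-based correlation would be a reasonable substitute there.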

Figure 5 shows the visual track of the original voice audio before any voice processing is applied. This is what the audio should look like in the sender-side input, the receiver-side output, and part of the receiver-side input recordings before any processing is applied to the voice.

Figure 5. Visual track of the original audio of a voice.

In Figure 6, the visual track of the voice that is actually sent from the sender device is displayed. It is clearly visible that some parts of the audio have already been processed and lost.

Figure 6. Visual track of the audio sent from the sender device.

The graph in Figure 7 represents the audio sent back to the sender after being processed on the receiver side. Because no audio was coming from the receiver side, this graph should normally be a flat (empty) line. However, the voice coming from the sender device was not processed appropriately, so most of the sent audio was let through as a considerable echo.

Figure 7. Visual track of the audio received from the receiver device – the echo.

Auto volume adjustment

What is it?

Auto volume adjustment is the process of leveling out differences in volume. This feature can be applied to both incoming and outgoing audio, but in voice processing it is most often applied to the input audio. Auto input volume adjustment can either raise the volume of a voice that is being drowned out by noise so that it can be heard over the noise, or simply keep a voice that is too quiet or too loud steady at an optimal volume.

How does it affect audio quality?

It is not uncommon to hear differences in volume in a voice call, as every person might have different microphone settings, a different microphone quality, a distinct voice (some louder, some quieter), and a different environment. To prevent unpleasant differences in volume from becoming an issue, the user should be able to turn on auto input volume adjustment, or it should be turned on by default. Without auto adjustment, significant differences in volume can cause audio clipping when the voice is too loud or loss of information when it is too quiet, and fine-tuning the volume settings of each speaker manually can take a lot of time or may not be possible at all.

How do we test it?

  1. We play an audio file that contains a voice at three separate volume levels: low, optimal, and high. We also use various combinations of frequency and interval for the volume changes to see how quickly and how well the auto adjustments work.
  2. During the call, we record the processed audio on the receiver side.
  3. After the recordings are obtained, we compare the recorded audio volume with the original audio using visual tracks to analyze the quality and consistency of the volume adjustments.
  4. We also analyze the decibel values to calculate the average amount of adjustments made.

A simplified visualization of auto volume adjustments for a voice at three levels of volume (too low, optimal, and too high) is presented in Figure 8. As you can see, the lower volume voice should be increased in volume, the optimal volume voice should not be changed, and the higher volume voice should be decreased in volume.

Figure 8. Voice after auto volume adjustments.
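The decibel comparison from the test steps above can be sketched like this. It is a simplified example under stated assumptions: the original and recorded tracks are time-aligned NumPy arrays, the sample ranges of the low, optimal, and high segments are known in advance, and the function names and the segment dictionary layout are hypothetical.

```python
import numpy as np

def rms_db(signal):
    """Return the RMS level of a signal in dB relative to full scale."""
    rms = np.sqrt(np.mean(np.asarray(signal, dtype=float) ** 2))
    return 20 * np.log10(max(rms, 1e-10))

def gain_per_segment(original, processed, segments):
    """Return the gain (in dB) applied to each named segment.

    segments maps a label to a (start_sample, end_sample) pair.
    Positive values mean the segment was made louder.
    """
    return {
        name: rms_db(processed[start:end]) - rms_db(original[start:end])
        for name, (start, end) in segments.items()
    }

def mean_abs_adjustment_db(gains):
    """Average size of the adjustments, ignoring their direction."""
    return float(np.mean([abs(g) for g in gains.values()]))
```

For a well-behaved feature, the gain for the low segment would be positive, the gain for the optimal segment close to zero, and the gain for the high segment negative, which matches the behavior shown in Figure 8.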

Summary

The use of online voice communication apps has grown rapidly in the last few years. This trend does not look like it will plateau any time soon, as more and more companies, educational institutions, and individual daily users are moving completely from standard cell phone and landline communication to cheaper and more convenient online communication methods. This also means that competition in the online voice communication app market is growing. To become a customer’s first pick, companies must therefore ensure that their app is of the highest possible quality. To do that, audio and video quality should be tested thoroughly, including all the voice processing features: if these features are not checked, the audio quality cannot reach its full potential. To further improve the process, the testing scenarios mentioned above can also be automated so that tests run faster, more consistently, and with a larger number of data points.

Don’t hesitate to contact us for help if you are struggling with this in your project.
