How We Test Dominant Speaker Detection

Dominant speaker detection is a useful feature in video calling apps because it allows the app to automatically focus on the person who is speaking at any given moment. This can improve the overall user experience by making it easier for participants to follow the conversation, especially in large group calls where multiple people may be speaking at the same time. On the other hand, poor detection can cause confusion during the call, as users may find it more difficult to identify and focus on the main speaker.

In our audio and video testing department, we developed a process that allows us to test how quickly and effectively dominant speakers are detected, as well as a way to explore what your app does in challenging and unexpected scenarios. Let’s break down and explain the process we use to test dominant speaker detection.

Logic and implementation

The first part of implementing the process is understanding how to test the functionality. In our case, the way to do that is to have one dedicated receiver (someone who does not speak but only listens and checks which user they see on screen) and multiple speakers (who are the ones that the receiver will—or will not—see on their screen).

The next step is figuring out how to detect different users on the receiver screen. One way to do this is to send single-colored content for each participant (in our PoC phase, that content was sticky notes placed in front of the camera). For instance, when User 1 is detected, the screen is red, and when User 2 is detected, the screen is blue. We can then record the receiver screen and easily decode the timeline of speakers. However, as we improved our process, we moved to unique markers for two reasons. First, markers take up less space, so we can use the remaining space to send realistic content; if the content is static, the app or service could detect that and change its behavior. Second, we already use markers to test many other features, so we have a reliable way of recognizing them.

In our PoC project, we tested by sending an audio sample that was split into two channels, left (L) and right (R). The left channel went into the microphone of User 1, and the right channel went into the microphone of User 2. This approach had some problems, which we won’t go into in detail here. The end result was that, instead of using a sample played independently from the videos, we embedded the audio sample into the media we send.

This approach brought another challenge: how do we sync the audio between users? The solution was simple: we added another dimension to the markers we were sending. Instead of only encoding which user is which, we also encode what the user is doing by coloring the markers. Each marker color represents a different state of the sender (speaking, silent, sending noise, and so on). For example:

  • A black marker means that the user is currently speaking;
  • A blue marker means that the user is silent;
  • A red marker means that the user is sending noise.

This way, we know what the active user is doing whenever we see them on screen. When the marker changes, we can match that change against the reference video to work out where in the test case we are.
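To make the marker idea concrete, here is a minimal Python sketch of how a detected marker could be turned into a "who and what" observation, and how a marker change could be looked up in a reference timeline. The color mapping comes from the list above; everything else (the `Activity` enum, `decode_marker`, `transition_times`, and the shape of the reference timeline) is a hypothetical illustration, not our actual tooling.

```python
from enum import Enum
from typing import List, NamedTuple, Optional, Tuple


class Activity(Enum):
    SPEAKING = "speaking"
    SILENT = "silent"
    NOISE = "noise"


# Color-to-activity mapping taken from the list above.
MARKER_COLORS = {
    "black": Activity.SPEAKING,
    "blue": Activity.SILENT,
    "red": Activity.NOISE,
}


class Observation(NamedTuple):
    user_id: int        # which sender's marker is visible on the receiver
    activity: Activity  # what that sender is doing, according to marker color


def decode_marker(user_id: int, color: str) -> Optional[Observation]:
    """Turn a detected (marker id, color) pair into an observation.

    Returns None for an unknown color, e.g. a frame where the marker
    could not be read reliably.
    """
    activity = MARKER_COLORS.get(color.lower())
    if activity is None:
        return None
    return Observation(user_id, activity)


def transition_times(
    reference: List[Tuple[float, Activity]],
    before: Activity,
    after: Activity,
) -> List[float]:
    """Find where a marker change (before -> after) occurs in the reference.

    `reference` is an assumed format: a time-sorted list of
    (start_seconds, activity) entries for one sender's video.
    """
    return [
        t
        for (_, prev), (t, cur) in zip(reference, reference[1:])
        if prev == before and cur == after
    ]
```

When a sender's marker flips from blue to black on the receiver recording, `transition_times(reference, Activity.SILENT, Activity.SPEAKING)` lists the candidate positions in that sender's reference video, which is enough to align the recording with the test case.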

Later in the post, we will share an example of what these markers look like in practice.

To sum up our implementation process:

  • We have one receiver user who we record and analyze.
  • We have two (or more) sender users who have unique markers on their sent video that can be automatically detected on the receiver screen.
  • We have separate audio samples that each sender is sending.
  • These audio samples correspond to marker color in the video (speaking, silent, etc.).
  • Each user speaking can have different speaking patterns to create scenarios we want to test.

Creating scenarios

Before we mix the audio and video, we have to decide which use cases we want to test. Even with two users, there are numerous scenarios to test: silence, speech, heavy noise, low noise, speech over noise, and anything else you can think of. For the results to make sense in the end, we split the test into separate scenarios. Here are some examples of possible base scenarios:

Possible base scenarios for dominant speaker detection

It is important to note that not all of these scenarios have a clearly defined expected result. For example, if two speakers are talking over each other completely, showing either one would be acceptable. However, we would expect the preferred behavior to be switching between the users periodically, so that the receiver can see who is involved in the exchange. Our approach lets us detect how different applications handle such cases.

For scenarios where only one user is speaking, the expected result is clear—we want to put the speaker on the screen as quickly as possible.

Once we have decided which scenarios to test and created the mapping for the video and audio (when each user is speaking, silent, sending noise, or doing anything else), we can make the audio and video files for each user and mix them together.
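As an illustration, the mapping for a scenario can be written down as a small data structure from which the expected on-screen user is derived. The sketch below is in Python and assumes a hypothetical format: each sender has a list of `(start_s, end_s, activity)` segments, and the expected result is only defined when exactly one sender is speaking. The names `activity_at` and `expected_user` are ours for this example.

```python
from typing import Dict, List, Optional, Tuple

# (start_s, end_s, activity); the activity labels are illustrative.
Segment = Tuple[float, float, str]


def activity_at(segments: List[Segment], t: float) -> str:
    """Return what a sender is doing at time t (silent if nothing is mapped)."""
    for start, end, activity in segments:
        if start <= t < end:
            return activity
    return "silent"


def expected_user(scenario: Dict[int, List[Segment]], t: float) -> Optional[int]:
    """Return the user expected on screen at time t.

    Only scenarios with exactly one speaker have a clear expected result,
    so this returns None when nobody or more than one sender is speaking.
    """
    speakers = [
        user
        for user, segments in scenario.items()
        if activity_at(segments, t) == "speaking"
    ]
    return speakers[0] if len(speakers) == 1 else None


# Example: two users speaking one right after the other.
scenario = {
    1: [(0.0, 3.0, "speaking"), (3.0, 6.0, "silent")],
    2: [(0.0, 3.0, "silent"), (3.0, 6.0, "speaking")],
}
```

With this scenario, `expected_user(scenario, 1.0)` returns User 1 and `expected_user(scenario, 4.0)` returns User 2. The same mapping can drive both the marker colors in the video and the audio pattern for each sender.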

Testing process and results

If we are only testing the dominant speaker part of the application, without trying to gather any other metrics during the test, then the testing process is simple. We use a virtual camera (for desktop) or display filming (for mobile) to send audio and video from multiple senders, and we make a video (and optionally audio) recording on the receiver. The one thing that needs to be as precise as possible is the start: the media for all senders must be launched in sync so that the scenarios line up.

The output we get is data from each frame where we see:

  • Which user is shown on the receiver screen
  • What that user is actually doing
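Per-frame data is easier to read once consecutive identical frames are merged into segments. Here is a minimal Python sketch, assuming each recorded frame has already been decoded into a `(user_id, activity)` pair and that the recording has a constant frame rate; `collapse_frames` is a hypothetical helper name.

```python
from itertools import groupby
from typing import List, Tuple

Frame = Tuple[int, str]                    # (user_id, activity) for one frame
Segment = Tuple[float, float, int, str]    # (start_s, end_s, user_id, activity)


def collapse_frames(frames: List[Frame], fps: float) -> List[Segment]:
    """Merge runs of identical frames into timed segments."""
    segments = []
    index = 0  # index of the first frame in the current run
    for (user_id, activity), run in groupby(frames):
        length = sum(1 for _ in run)
        segments.append((index / fps, (index + length) / fps, user_id, activity))
        index += length
    return segments
```

For example, three frames of User 1 speaking followed by two frames of User 2 speaking at 10 fps collapse into two segments, covering 0.0 to 0.3 seconds and 0.3 to 0.5 seconds.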

Let’s see this process in action. Here is an example covering a test case where two users speak one right after the other. First, we need sound patterns for each user.

Sound pattern for User 1
Sound pattern for User 2

The expected results for this test look like this (the orange line represents which user is expected to be shown on screen):

Expected results for test case

And the actual test results for this particular test case look like this:


Actual results for test case

Here is a more detailed explanation of the graph above:

  • The red line is a sound pattern of User 1
  • The purple line is a sound pattern of User 2
  • The orange line shows which user is expected to be displayed on screen
  • The green line shows which user is actually displayed on screen

The positive Y-axis value depicts User 1, while the negative Y-axis value depicts User 2. The X-axis shows the time in the test.

Looking at this data and comparing the expected and actual results, we can see how much time it takes for the dominant user to be detected. Namely, it takes around 0.5 seconds for this app to switch to the new dominant speaker and the results are quite stable.
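The comparison between the expected and actual lines can also be done programmatically. The Python sketch below measures, for each expected switch, how long it takes until the expected user first appears on the receiver. The input formats and the function name `switch_delays` are assumptions made for this example, and the numbers are illustrative values, not measurements from the test above.

```python
from typing import List, Optional, Tuple


def switch_delays(
    expected: List[Tuple[float, int]],
    actual: List[Tuple[float, int]],
) -> List[Optional[float]]:
    """Measure the detection delay for each expected speaker switch.

    expected: time-sorted (time_s, user) entries, one per expected switch.
    actual:   time-sorted (time_s, user) entries, one per recorded frame.
    Returns one delay in seconds per expected switch, or None if the
    expected user never appeared before the next expected switch.
    """
    delays: List[Optional[float]] = []
    for i, (t_switch, user) in enumerate(expected):
        # The window for this switch ends where the next one begins.
        t_next = expected[i + 1][0] if i + 1 < len(expected) else float("inf")
        delay = None
        for t, shown in actual:
            if t_switch <= t < t_next and shown == user:
                delay = t - t_switch
                break
        delays.append(delay)
    return delays


# Illustrative data: User 2 is expected from 3.0 s but first shown at 3.25 s.
expected = [(0.0, 1), (3.0, 2)]
actual = [(0.0, 1), (1.0, 1), (2.0, 1), (3.0, 1), (3.25, 2), (4.0, 2)]
delays = switch_delays(expected, actual)
```

A `None` entry is worth flagging separately, since it means the app never showed the expected speaker during that part of the scenario.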


This is just a short overview of how we can test the dominant speaker detection on any app and with as many users as needed, display results in a clear way, and extract data that can be helpful for developers to improve their dominant speaker detection logic.

If you want to learn more about our audio and video testing services or would like to test the dominant speaker detection feature on your video conferencing application, contact us and find out how we can help.