AI Testing
AI testing is a specialized area of software quality assurance focused on ensuring that systems that utilize artificial intelligence, like chatbots or media generation applications, function as intended.

Differences between AI Testing and Traditional Software Testing
Traditional software testing and the growing field of AI testing both involve verifying software behavior. They differ in that AI testing involves more unpredictability: models are trained on specific datasets and expected outcomes, which makes their decision-making process less transparent. For example, when identifying a button, a model relies on patterns it has learned, which may not always align with real-world variations.
Traditional software testing involves systems whose behavior is deterministic and predictable. Given the same input, the system will always produce the same output. The software's logic is based on explicitly defined rules and instructions, making it straightforward to predict and verify behavior, as well as to identify the root cause of a defect.
AI testing deals with systems that can be unpredictable. For the same input, a model might produce outputs that vary slightly, or even significantly. Often these failures occur because the model is biased toward specific data points: it has more information about one area and less about another, which can lead to incorrect outputs and makes it harder to pinpoint the root cause of errors.
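The contrast above can be sketched in test code. This is a minimal illustration, not a real test suite: the deterministic function and the stubbed chatbot are hypothetical stand-ins. The key idea is that a traditional test asserts an exact output, while an AI test asserts properties of the output, because the exact wording may vary between runs.

```python
import random

# Traditional system: deterministic, explicitly defined rules.
def add_tax(price: float, rate: float = 0.21) -> float:
    return round(price * (1 + rate), 2)

# AI-style system: simulated nondeterminism. A real model may phrase
# the same answer differently on every call; we stub that here.
def fake_chatbot(question: str, rng: random.Random) -> str:
    templates = [
        "Your order ships in 3 days.",
        "The order will be shipped within 3 days.",
        "Expect shipping in about 3 days.",
    ]
    return rng.choice(templates)

# Traditional test: exact output match is possible.
assert add_tax(100.0) == 121.0

# AI test: assert on properties of the output, not exact strings.
rng = random.Random(42)
for _ in range(10):
    answer = fake_chatbot("When does my order ship?", rng)
    assert "3 days" in answer   # the key fact must be present
    assert len(answer) < 200    # the response stays concise
```

In practice the property checks on the right side of this split can also be semantic (keyword presence, similarity scores, human review), since string equality is rarely meaningful for model output.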
Different types of AI models
Before even starting with testing, it is important to identify what type of AI models you’re working with so that you know what key areas need testing.
AI models come in different types, the most popular currently being Large Language Models (LLMs) like ChatGPT, which specialize in handling text-based data. Another type is Computer Vision Models (CVMs) like YOLO (You Only Look Once). At TestDevLab we have created CV_POM, a tool that uses computer vision to detect and interact with objects on a page; since it works by capturing an image and searching for elements within that image, it can also be applied to desktop use. We have also built Barko Agent, a chat-style LLM system designed to retrieve information from company documentation, which additionally supports agents capable of executing a wide range of Python functions, including integrations with tools like Selenium and Jira.
For example, when asked a question like "What are my Jira tickets?", the system determines the appropriate steps, selects and runs the relevant functions, and returns the correct response.
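The question-to-function flow described above can be sketched as a tool-dispatch loop. This is a hypothetical illustration, not the actual Barko Agent implementation: the LLM planning step and the Jira lookup are stubbed, and all names are invented for the example.

```python
from typing import Callable

def get_jira_tickets(user: str) -> list[str]:
    # Stub: a real agent would call the Jira API here.
    return [f"{user}: PROJ-101", f"{user}: PROJ-204"]

# Registry of Python functions the agent is allowed to execute.
TOOLS: dict[str, Callable] = {"get_jira_tickets": get_jira_tickets}

def fake_llm_plan(question: str) -> dict:
    # Stub: a real LLM would select the tool and its arguments.
    if "jira" in question.lower():
        return {"tool": "get_jira_tickets", "args": {"user": "alice"}}
    return {"tool": None, "args": {}}

def agent(question: str):
    plan = fake_llm_plan(question)
    if plan["tool"] is None:
        return "I can't help with that."
    return TOOLS[plan["tool"]](**plan["args"])

# The agent picks the relevant function, runs it, and returns the result.
tickets = agent("What are my Jira tickets?")
assert tickets == ["alice: PROJ-101", "alice: PROJ-204"]
```

Testing such a system means checking both the planning step (was the right function chosen?) and the execution step (did the function return the right data?), since either can fail independently.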
With regard to functionality, we can categorize the models as follows:
- Generative - models that generate any type of data. The most common are text-to-text models, such as ChatGPT, but there are also text-to-image models like DALL-E and text-to-audio models like Udio.
- Reasoning - models that aim to answer questions logically by reasoning through each step of the process. These are especially useful in scientific and engineering fields, where it is common to face problems that do not already have a known answer. Popular reasoning models include OpenAI's o series, such as o1 and o3.
- Predictive - models that process data and provide a forecast for events. A common type is weather prediction models such as DeepMind’s GraphCast.
- Retrieval - models that specialize in fetching relevant information. Essentially an AI-powered search engine, like OpenAI's SearchGPT.
- Interactive - models that handle real-time AI-human interactions, similar to chatbots but with minimal response delay. An example is OpenAI's Voice Assistant.
- Multimodal - models that can handle multiple types of data (multimodality). For example, OpenAI's GPT-4o can handle text, images, and audio.
This list does not cover every type of model, but it shows how different models can be, and why no single holistic test strategy fits every type.
Models can even be multifunctional, as there are no strict rules on differentiation. For example, while the model o1 focuses on reasoning, it is still technically a generative text-to-text model. The key is to be able to tell what the main purpose of the model is, as not every text-to-text model is made as a reasoning model.
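One way to act on this categorization is to map each category to its primary test focus areas, and combine them for multifunctional models. The mapping below is illustrative only; the focus areas are examples, not a standard or exhaustive taxonomy.

```python
# Illustrative mapping from model category to primary test focus areas.
TEST_FOCUS = {
    "generative":  ["output quality", "hallucinations", "harmful content"],
    "reasoning":   ["step validity", "final-answer correctness"],
    "predictive":  ["forecast accuracy vs. ground truth", "model drift"],
    "retrieval":   ["result relevance", "source attribution", "freshness"],
    "interactive": ["response latency", "turn-taking", "interruptions"],
    "multimodal":  ["cross-modal consistency", "per-modality quality"],
}

def plan_tests(categories: list[str]) -> list[str]:
    # Multifunctional models combine the focus areas of every category
    # they belong to, with the main purpose listed first.
    seen, plan = set(), []
    for cat in categories:
        for area in TEST_FOCUS.get(cat, []):
            if area not in seen:
                seen.add(area)
                plan.append(area)
    return plan

# An o1-style model: reasoning first, but still generative text-to-text.
plan = plan_tests(["reasoning", "generative"])
assert plan[0] == "step validity"
assert "hallucinations" in plan
```

Ordering the categories by the model's main purpose keeps the test plan focused on what the model is actually for, which reflects the point above: not every text-to-text model is made as a reasoning model.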
Defining the Scope and Initial Context of the AI model
Usually, AI services are specialized for certain applications. For example, two models can both be generative text-to-text models, but one is focused on generating movie concepts, while the other focuses on generating math quiz questions. The scope can be defined during training through the choice of data, by fine-tuning an already trained model with specific data, or by placing limitations on a trained model to narrow its scope to the intended purpose.
Initial context is information or data provided to a model that should be taken into account in every response. This is usually where the model's guardrails are defined, as well as instructions on how to present the response. For example, an initial context may state that every response must start with a formal and professional greeting.
Product owners limit the AI model to prevent users from using it in unintended ways, for example, using a customer service AI to generate a thesis on marine biology. Identifying the limitations on model output is critical before testing begins, as some general tests are not applicable to models with a limited scope.
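Tests for initial context and guardrails can be sketched as follows. The model here is a stub standing in for a live service, and the greeting rule and scope refusal are assumptions taken from the examples above; in a real suite these calls would go to the deployed model.

```python
# Required opening per the (assumed) initial context rule.
GREETING = "Good day,"

def fake_support_bot(question: str) -> str:
    # Stub for a customer-service model whose initial context enforces
    # a formal greeting and refuses out-of-scope requests.
    if "thesis" in question.lower():
        return f"{GREETING} I can only help with questions about our products."
    return f"{GREETING} your refund has been processed."

# Guardrail test: an out-of-scope request is declined, not fulfilled.
off_topic = fake_support_bot("Write a thesis on marine biology.")
assert "only help" in off_topic

# Initial-context test: every response starts with the formal greeting.
for q in ["Where is my refund?", "Write a thesis on marine biology."]:
    assert fake_support_bot(q).startswith(GREETING)
```

Checks like these are also the ones to drop or adapt when a model's scope is narrow: a general "can it write an essay?" test is simply not applicable to a locked-down customer service bot.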