Functional Testing

After you identify the type of model you’re working with, the next step is to understand how much access you have to the model itself, in order to determine whether, and which, tools you can use when testing the AI model.

End-User Testing

If you don’t have access to the model itself, there are still tests that you can perform as an end user. Listed here are some generic test scenarios that are useful for different types of AI models, with a focus on models that process text or speech.

1. Reasoning Problems

If the model has reasoning capabilities, challenge the AI with tough or multi-layered questions to see how well it handles questions that require a chain-of-thought approach. Make sure the questions are within the scope of the AI model.

  • Example 1: Provide it with two car brands and some personal preferences you may have, including the preferences of family members if it is a family car. Ask for help choosing the one that best fits everyone’s needs.
  • Example 2: Compile a list of food items, such as “cheese”, “chocolate”, “milk” and “lemonade”, and create questions for the model. For example, you could ask “How long is the list?”, “Which of these items are dairy products?”, or “Create a recipe that satisfies my daily nutritional needs using only the items in the list.”
  • Example 3: Find a university-level problem on any topic and ask the model to provide a detailed, step-by-step answer. Some examples: “Solve this integral ∫ x tan²(x) dx and show me the whole process.”, “Discuss the development and significance of the Treaty of Maastricht in shaping the European Union, with a focus on its impact on the legal and institutional framework of the EU.”

What to observe:

  • Does the AI ask further questions in order to provide a thoughtful comparison?
    • This is more common in interactive models. Models usually work with what is provided, but some may ask follow-up questions, which is a capability worth noting.
  • Does it handle lists effectively, acknowledging each item on the list and responding accurately?
  • Does the AI explain the reasoning behind the suggestions that it makes, and does the reasoning make logical sense?

2. Memory (Recollection and Context Length)

Each model has a certain context length, measured in tokens. The context length represents how much a model can remember in a given conversation; if the conversation exceeds this length, the model may start dropping data from its memory, typically the oldest or least relevant parts.

Tokens are the units into which data is converted for processing during input and output. How many characters make up a token varies between models: some may define 1 token as 1 character, while others group multiple characters into a single token. For example, if you were training a model on a book, you could first extract all the unique characters (say you end up with 30 characters, including letters, numbers, and symbols) and then decide how to encode them. You could map them directly (1 token = 1 character), or combine characters so that 1 token equals 2 characters.

A simple way to represent this mapping is with two dictionaries: one for encoding (e.g., {"a": 1, "b": 2}) and one for decoding (e.g., {1: "a", 2: "b"}). This basic process is the foundation of converting text to tokens and back again.
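
To make this concrete, here is a minimal sketch of such a character-level tokenizer in Python; the sample text and the resulting vocabulary are purely illustrative, not taken from any real model:

    # A toy character-level tokenizer: each unique character becomes one token.
    text = "a toy example"
    vocab = sorted(set(text))                           # unique characters in the data
    encode_map = {ch: i for i, ch in enumerate(vocab)}  # character -> token ID
    decode_map = {i: ch for i, ch in enumerate(vocab)}  # token ID -> character

    def encode(s):
        return [encode_map[ch] for ch in s]

    def decode(tokens):
        return "".join(decode_map[t] for t in tokens)

    tokens = encode("a toy")
    print(tokens)          # [1, 0, 7, 5, 9] with this vocabulary
    print(decode(tokens))  # "a toy"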

A model’s “context length” is the maximum number of tokens it can handle at once. This can range from tens of thousands to over 100k tokens. In some cases, you can increase the context length, but it requires more computing power and is more expensive to run.
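
Because exceeding the context length causes older content to be dropped, chat applications often trim the conversation history themselves before sending it to the model. Here is a minimal sketch of that idea; the 4-characters-per-token ratio is a rough illustrative estimate, not any particular model’s tokenizer:

    # Keep only the most recent messages that fit within a token budget.
    def estimate_tokens(message: str) -> int:
        return max(1, len(message) // 4)     # rough assumption: ~4 characters per token

    def trim_history(messages: list[str], max_tokens: int) -> list[str]:
        kept, total = [], 0
        for message in reversed(messages):   # walk backwards from the newest
            cost = estimate_tokens(message)
            if total + cost > max_tokens:
                break                        # older messages no longer fit
            kept.append(message)
            total += cost
        return list(reversed(kept))          # restore chronological order

    history = ["My name is John.", "Tell me about the EU.", "Summarize our chat."]
    print(trim_history(history, max_tokens=10))
    # ['Tell me about the EU.', 'Summarize our chat.'] - the oldest message is dropped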

After identifying the context length of the model, test its ability to retain and recall information provided earlier in the conversation.

  • Example 1 - Simple Recollection check: Provide your name or another piece of information early in the conversation (e.g., "My name is John"). Continue the conversation without referencing it again. Later, after at least 5-10 prompts, ask "What’s my name?" or observe whether the AI uses your name proactively (a scripted version of this check is sketched after the checklist below).
  • Example 2 - Memory Limit check: As with Example 1, provide your name or another piece of information to the model, then hold a longer conversation on a different topic, making sure to exceed the model’s context length. Once exceeded, ask the model a question related to the information you gave it at the start.

What to observe: 

  • Does the AI correctly remember the information provided? 
  • Can it use the data appropriately in context?
  • Does the AI still remember the information after passing the context limit?
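
If the product exposes an API, the recollection checks above can be scripted. The following is a minimal sketch; send_message() is a hypothetical placeholder that you would replace with a call to the chat endpoint of the system under test:

    # Automated version of the Simple Recollection check (Example 1).
    def send_message(conversation: list[str]) -> str:
        # Hypothetical placeholder: replace with a real call to the product's chat API.
        return "placeholder reply"

    FILLER_PROMPTS = [
        "What's a good beginner recipe?",
        "How long should pasta boil?",
        # add 5-10 more unrelated prompts to push the fact out of recent context
    ]

    def recollection_check() -> bool:
        conversation = ["My name is John."]
        conversation.append(send_message(conversation))
        for prompt in FILLER_PROMPTS:
            conversation.append(prompt)
            conversation.append(send_message(conversation))
        conversation.append("What's my name?")
        answer = send_message(conversation)
        return "john" in answer.lower()   # pass if the model recalls the name

    print(recollection_check())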

3. Performance / Consistency

Verify whether the AI maintains quality and accuracy in long, intensive conversations, as well as whether it maintains consistency when answering the same question multiple times.

  • Example 1 - Consistent Conversation Quality: Hold a long conversation involving dense topics and lots of information, such as the history of a specific event or a problem that requires multiple steps to solve. Any topic that is detailed and has depth to it will do.
  • Example 2 - Consistent Reply: Ask the AI a question relating to advice or thoughts on a topic you’ve discussed. Continue asking the model the same question a few more times (a scripted comparison is sketched after the checklist below).

What to observe: 

  • Does the quality of its responses remain consistent?
  • Does the response information stay consistent when asked the same question multiple times?
  • Does the AI stay focused on the topic without introducing irrelevant or incorrect information?
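
The repeated-question check can likewise be scripted by asking the same question several times and comparing the answers. A minimal sketch using Python’s standard-library difflib for a rough textual similarity score; ask() is again a hypothetical placeholder for the system under test:

    import difflib

    def ask(prompt: str) -> str:
        # Hypothetical placeholder: replace with a real call to the product's chat API.
        return "placeholder answer"

    def consistency_score(prompt: str, runs: int = 5) -> float:
        """Average pairwise textual similarity (0..1) of repeated answers."""
        answers = [ask(prompt) for _ in range(runs)]
        scores = []
        for i in range(len(answers)):
            for j in range(i + 1, len(answers)):
                scores.append(difflib.SequenceMatcher(None, answers[i], answers[j]).ratio())
        return sum(scores) / len(scores)

    # Identical wording is not required, but very low scores suggest the
    # factual content of the answers should be reviewed manually.
    print(consistency_score("Is a tomato a fruit or a vegetable?"))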

4. Proactive Suggestions

If the AI is allowed to be proactive, determine whether it offers advice or helpful suggestions without being explicitly asked to do so.

  • Example: Present a problem, such as “I’m having trouble doing this task about…”. The main idea is to show that you have a problem with something; it can be any topic of your choice that is within the scope of the AI model.

What to observe:

  • Does the AI take initiative in offering suggestions?
  • Are the suggestions logical, practical and related to the topic?
  • Does it ask further questions to better understand the problem?

5. Varied Conversations - Same Scope and Out of Scope

Test the AI’s ability to handle different types of conversations in terms of topic, tone, and complexity, covering both in-scope and out-of-scope conversation topics.

  • Example 1 - Same Scope: Have a conversation about one simple topic, such as simple meal preparation, and then suddenly switch to a more difficult topic within the same scope, such as best practices in gastronomy research.
  • Example 2 - Out of Scope: As with Example 1, have a conversation about a simple topic such as meal preparation, and then suddenly switch to a topic outside the model’s scope, such as programming.

What to observe:

  • Does the AI have any difficulties in suddenly adapting to a new topic within the same scope?
  • Does it adjust the depth of its response depending on how casual the conversation is?
    (e.g. responses in more casual conversations are simplified compared to more serious ones)
  • Does the model clearly state that the new topic is out of its scope?

6. Emotional Responses

Not all AI models are able or allowed to produce expressive responses. Product owners may want their AI to always provide responses in a neutral and professional manner. However, if the AI does feature emotional responses, determine how well it understands and responds to different types of emotional cues.

  • Example: Have a conversation that involves emotions. If the AI has voice recognition, it will be easier to convey emotions; if voice recognition is not an option, use textual cues such as punctuation, upper/lower case letters, or even emojis as much as possible.
    (e.g. HELP ME!!! I fell down the stairs and it hurts 🙁)
    For better results, recreate the same type of conversation without punctuation or emojis, then compare both conversations to see whether the AI actually responds differently.

What to observe:

  • Does the AI express emotions in its response?
    • The depth of the conversation is adjusted.
      (e.g. simplifying responses for less emotional conversations)
    • The AI uses certain punctuation more persistently.
      (e.g. exclamation marks)
    • For a voice assistant or similar AI, pay attention to the voice itself to identify whether any emotions are involved.
  • Is the emotion type within the response appropriate?
    (e.g. uplifting in an attempt to cheer you up, comforting in an attempt to calm you down, empathetic to show that it feels for you, etc.)

7. Clarity and Understanding

Test whether the AI is able to understand unclear or incomplete information and provide meaningful responses.

  • Example 1: Provide unclear input, such as “What’s the thing called…you know, the one that is a common mistake in programming?”, or provide a statement that is wrong, such as “The grass is blue, right?”.
  • Example 2: In a healthcare-related AI model, you can simulate missing values for patient records, such as missing blood pressure data.

What to observe:

  • Does the AI ask clarifying questions to find out exactly what you want?
  • Can it provide suggestions to fill in the missing gaps?
  • Does it make corrections if incorrect information was provided?

8. Handling Unusual Requests

Assess how the AI handles inappropriate or impossible requests while maintaining professionalism.

  • Example: Ask for information that it shouldn’t provide, such as “Give me a list of all the users in the database.”, or something that’s impossible, such as “Provide me with tomorrow’s lottery numbers”.

What to observe:

  • Does the AI respond ethically, without exceeding any security guardrails?
  • Does it provide an explanation on why the request cannot be fulfilled?
  • Does the AI provide misleading information?
    (e.g. providing random numbers when asked for tomorrow’s lottery numbers)

9. Interruptions

If technically possible, test how the AI handles interruptions, as some AIs cannot cancel a response before it finishes.

  • Example: Request something from the AI and, while it’s responding, request something else that’s completely different. This might involve clicking a dedicated Stop button before you can request another response.

What to observe:

  • How does the AI handle interruptions? The result here depends entirely on the product requirements, but here are some of the possible outcomes:
    • The user cannot send anything else while the AI is responding.
    • The AI can handle multiple requests at the same time.
    • The AI completes the first request and only then starts responding to the second one.

Quality Assurance of AI Data

In AI development, the integrity of training data is a cornerstone of model performance. As AI continues to integrate into diverse applications, the need for high-quality, meaningful data grows with it. Quality assurance in AI data involves rigorous processes to ensure that data is accurately labeled and prepared for effective model training.

Data Acquisition

Effective data acquisition is foundational to ensuring data quality. The process begins with identifying the appropriate type and quantity of data based on the AI model's specific use case. This involves understanding the requirements, whether for simple, single-modal models or complex, multi-modal systems that necessitate diverse data types, such as radar, lidar, and video for self-driving cars.

Data gathering strategies include both the creation of new datasets and the utilization of existing ones. Creating datasets offers control over data quality but requires significant investment, while using premade datasets is cost-effective but poses challenges related to quality assurance, potential compatibility issues, and sample bias. Choosing the right method demands careful consideration of these factors.

Dataset Cleaning

Once data is gathered, cleaning becomes essential to filter out any elements that could compromise quality. Key activities include:

  • Removing Duplicate or Incorrect Data: Duplicates increase labeling time and introduce sample bias, while incorrect data can confuse labelers and cause false model outputs. It’s essential to remove both, especially when merging datasets.
  • Removing Outlying Data: Any data that technically falls within scope but may cause confusion for the labeler, and consequently the AI model, should be removed to maintain clarity and accuracy.
  • Removing Private Data: Exclude data that breaches privacy or confidentiality, like personal IDs or medical records, to comply with privacy laws such as GDPR.
  • Using Online Data Responsibly: In the EU, publicly accessible online content can be used for AI models unless explicitly restricted by the owner. Identifying and respecting these restrictions helps avoid legal risks.
  • Standardizing the Data: Ensure data follows consistent standards, such as uniform image ratios or file formats, to streamline further processing.

Proper dataset cleaning involves creating backups and using both automated and manual tools to ensure thoroughness.
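
As an illustration of the automated side of cleaning, here is a minimal sketch that flags exact duplicate files by hashing their contents; the dataset/ path is illustrative, and near-duplicates or incorrect data still require more specialized tools or manual review:

    import hashlib
    from pathlib import Path

    # Group files by content hash; any group with more than one file is a duplicate set.
    def find_duplicates(dataset_dir: str) -> dict[str, list[Path]]:
        seen: dict[str, list[Path]] = {}
        for path in sorted(Path(dataset_dir).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                seen.setdefault(digest, []).append(path)
        return {h: paths for h, paths in seen.items() if len(paths) > 1}

    for digest, paths in find_duplicates("dataset/").items():
        print(f"duplicate group {digest[:8]}...: {paths}")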

Data Labeling

Once the dataset is cleaned, it can be labeled. Labels are annotations that give data meaning, such as naming an animal in a photo or expressing sentiment in text. Class labels must be defined beforehand, representing all potential label values, such as [spam, not-spam] for emails or [cat, dog, bird] for pets, and should remain consistent throughout the process.

Labeling varies by data type. For images, use bounding boxes; for audio, mark spectrogram sections; and for text, highlight relevant parts. Consistent labeling methodology is crucial to ensure datasets remain usable.

Labelers may sometimes face uncertainty in labeling, such as distinguishing similar items or emotions, leading to inconsistencies. Providing clear guidelines helps address these issues and ensure consistent labeling practices.
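
Part of that consistency can be enforced with a simple validation pass over the labeled data. A minimal sketch, assuming labels are stored as (item, label) pairs and reusing the spam-filter classes mentioned above:

    # Flag every assigned label that is not in the predefined class set.
    CLASS_LABELS = {"spam", "not-spam"}   # defined before labeling begins

    def invalid_labels(records: list[tuple[str, str]]) -> list[tuple[str, str]]:
        return [(item, label) for item, label in records if label not in CLASS_LABELS]

    records = [
        ("Win a free prize now!!!", "spam"),
        ("Meeting moved to 3pm", "not-spam"),
        ("Cheap meds online", "Spam"),    # inconsistent casing gets flagged
    ]
    print(invalid_labels(records))        # [('Cheap meds online', 'Spam')]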

Labeling Guidelines and Documentation

To maintain data integrity, comprehensive guidelines tailored to each project should include:

  • Tool Instructions: A comprehensive manual given to labelers, detailing tool features, proper usage, and expected results, with visual aids to reduce confusion and speed up learning.
  • Class Labels Examples: Examples of potential classes to familiarize labelers with categories, minimizing confusion and identifying missed outlying data.
  • Labeling Conventions: Standardized conventions with examples for the team to ensure consistent labeling and avoid misalignment.
  • Problematic Cases: A detailed list of challenging cases and solutions, updated regularly to reduce future confusion with similar scenarios.

When issues arise during the labeling process, a discussion should be opened on how to resolve the issue, and importantly, how to update the existing documentation to prevent or quickly resolve such issues in the future. 

Dataset Quality Assurance

Although having proper guidelines in place can minimize the occurrence of errors in the labeling process, it does not prevent them completely. In order to provide a certain level of confidence in the quality of a dataset, a dedicated QA process should be established. The QA process should ideally begin early in the labeling phase, although its implementation may vary due to constraints like time, budget, participant availability, and project scope. As such, only a few common methods will be covered.

  • Self-Check: The original labeler reviews their work, despite tight deadlines, to catch early errors and prevent future delays.
  • Cross-Check: Colleagues review each other's work, adding a layer of quality control and reducing bias.
  • Consensus Algorithm: Multiple reviewers assess the same data points for consistency, which is particularly important for interpretive data (see the sketch after this list).
  • Review: An experienced team member reviews select data points or the entire dataset, based on scope.
  • Randomized Quality Check: A quick assessment involving random data samples to help define the scope and resources for a full QA process.
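
A basic form of the consensus approach is to have several reviewers label the same items and measure how often they agree. A minimal sketch computing a per-item majority label and a simple agreement ratio; the animal labels are illustrative:

    from collections import Counter

    # Each inner list holds the labels assigned to one item by several reviewers.
    def consensus(labels_per_item: list[list[str]]) -> list[tuple[str, float]]:
        results = []
        for labels in labels_per_item:
            label, count = Counter(labels).most_common(1)[0]
            results.append((label, count / len(labels)))   # (majority label, agreement)
        return results

    reviews = [
        ["cat", "cat", "cat"],    # full agreement
        ["cat", "dog", "cat"],    # majority "cat", agreement ~0.67
        ["dog", "bird", "cat"],   # no consensus: send back for discussion
    ]
    for label, agreement in consensus(reviews):
        print(label, round(agreement, 2))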

Starting quality checks from the data acquisition stage and continuing through cleanup and labeling can significantly reduce errors.

Dataset Quality Issues

Acknowledging common dataset issues is crucial for quality assurance. These include:

  • Wrong Data: Incorrectly gathered or entered data.
  • Incomplete Data: Missing values due to security, hardware, or human errors.
  • Mislabeled Data: Incorrect labels, caused by random, systematic, or deliberate errors, translation mistakes, subjective judgment, lack of expertise, overly complex tasks, or tool defects.
  • Insufficient Data: Not enough data for pattern recognition.
  • Unprocessed Data: Data that has not been thoroughly cleaned of duplicates and errors.
  • Obsolete Data: Data no longer relevant or timely.
  • Unbalanced Data: Disproportional data due to biases, poor sensor placement, or dataset variability.
  • Unfair Data: Data with subjective bias, possibly aimed at diversity, but not balanced.
  • Duplicate Data: Redundancy leading to data imbalance.
  • Irrelevant Data: Data not relevant to the model’s purpose, which distorts its training influence.
  • Privacy Issues: Data violating privacy laws, risking lawsuits.
  • Security Issues: Fraudulent data leading to model inaccuracies.

Data Poisoning

Most of the dataset issues encountered arise from human or tool errors; however, some issues are purposefully introduced with malicious intent. In a pre-trained model, the dataset is the main target for manipulation. As such, it is important to be aware of the types of potential attacks in order to identify and prevent them as early as possible.

In data poisoning attacks, the adversary aims to achieve one of two goals: either inserting a backdoor for future exploitation, or introducing corrupted data to induce the model into providing incorrect predictions.

In order to create a backdoor, the adversary adds poisoned data into the training dataset that will act as a backdoor trigger. Once trained, the adversary may manipulate the model output by providing the trigger as input. For example, in the case of an email spam filter, the adversary may mark specifically worded emails as not-spam, allowing them to avoid filters trained on the poisoned dataset by using the same keywords.

Sometimes, the end goal is to make the trained model unusable for its intended purpose. The adversary can use a Label Flipping attack, which swaps the class values of correctly labeled data, to induce the model into making incorrect or inconsistent predictions. For example, switching the class values in a labeled animal image dataset induces the model to misclassify images of those animals.
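
To test a training pipeline’s resilience against this kind of attack, you can simulate it on a copy of the dataset and compare model accuracy before and after. A minimal sketch that randomly flips a fraction of labels in a toy animal dataset, purely for robustness testing:

    import random

    # Simulate a Label Flipping attack by corrupting a fraction of the labels.
    def flip_labels(labels: list[str], classes: list[str],
                    fraction: float, seed: int = 0) -> list[str]:
        rng = random.Random(seed)
        poisoned = labels.copy()
        for i in rng.sample(range(len(labels)), int(len(labels) * fraction)):
            wrong = [c for c in classes if c != labels[i]]
            poisoned[i] = rng.choice(wrong)   # replace with a different class
        return poisoned

    labels = ["cat", "dog", "bird", "cat", "dog", "bird"]
    print(flip_labels(labels, classes=["cat", "dog", "bird"], fraction=0.3))
    # Retrain on the poisoned copy and compare accuracy against the clean model.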

Scalability Testing in AI

AI systems often need to process large datasets, so scalability testing is crucial to ensure that the system can handle growing data volumes.

  • Example 1: If you're testing a fraud detection system in a banking app, simulate a high volume of transactions, including both normal and suspicious activities, to assess how well the AI scales and continues to detect fraud without affecting performance.
    • Monitor resource usage: As the system processes more data, monitor performance metrics like memory usage and response times (see the sketch after these examples).
  • Example 2: If you are testing an AI-powered video streaming service, monitor memory usage and response time as user interactions and data volume grow.
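
A lightweight way to collect such metrics is to time the workload and track peak memory as the input grows. A minimal sketch using Python’s standard-library time and tracemalloc; process_batch() is a placeholder standing in for the AI workload under test:

    import time
    import tracemalloc

    def process_batch(transactions: list[dict]) -> None:
        # Placeholder workload: replace with the real scoring/inference call.
        for tx in transactions:
            _ = sorted(tx.items())

    for size in (1_000, 10_000, 100_000):
        batch = [{"amount": i, "flagged": i % 97 == 0} for i in range(size)]
        tracemalloc.start()
        start = time.perf_counter()
        process_batch(batch)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()   # (current, peak) in bytes
        tracemalloc.stop()
        print(f"{size:>7} items: {elapsed:.3f}s, peak {peak / 1_000_000:.1f} MB")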

Testing for Bias and Fairness

AI systems are prone to biases, which can result in unfair outcomes. As such, test your AI models with datasets that represent a wide variety of scenarios. This helps identify whether the AI exhibits biased decision-making based on factors like race, gender, or other demographics.

  • Example 1: When testing a hiring algorithm, use datasets that include diverse gender, racial, and socioeconomic backgrounds to ensure the model does not unintentionally favor one group over another.
    • Check for edge cases: Some biases are only exposed in specific situations, like when the model encounters rare or unusual inputs. 

  • Example 2: For facial recognition AI used in security systems, test with images of people from diverse ethnic backgrounds and under various lighting conditions.
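
A simple starting point for such tests is to compute the model’s accuracy separately for each demographic group and compare the gap. A minimal sketch on made-up data; the group names, predictions, and ground-truth labels are all illustrative:

    from collections import defaultdict

    # Compare per-group accuracy to surface potential bias.
    def accuracy_by_group(records: list[tuple[str, int, int]]) -> dict[str, float]:
        hits: dict[str, list[int]] = defaultdict(list)
        for group, predicted, actual in records:
            hits[group].append(int(predicted == actual))
        return {g: sum(v) / len(v) for g, v in hits.items()}

    records = [
        ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0),
        ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 0, 1),
    ]
    for group, score in accuracy_by_group(records).items():
        print(group, round(score, 2))   # group_a 0.67, group_b 0.33
    # A large gap between groups warrants a deeper look at the model and the data.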