Blog/Quality Assurance

Using GPTs and LLMs for Software Test Automation


Since ChatGPT launched in late 2022, large language models have quickly moved from research labs into daily business use. Now, the question isn’t whether to explore AI, but how to use it effectively. Industries such as healthcare (e.g., Thirona, ScreenPoint Medical, Envision Pharma Group), manufacturing (e.g., Siemens, IBM, NVIDIA), and retail are already applying these tools to solve real-world problems.

Artificial intelligence has been transforming how we build, deliver, and maintain software for some time, even before the current boom. AI already plays an important role in software testing and is reshaping testing processes, and large language models (LLMs) like GPT are a significant part of that. For QA engineers and software testers, it's worth taking a closer look at what these models can actually do and how they can help with daily work.

In best practice, LLMs are used to enhance, not replace, existing capabilities. These models can analyze requirements, generate test cases, suggest edge scenarios, interpret logs, and assist with automation scripting. When integrated thoughtfully, they accelerate repetitive tasks, help identify blind spots, and allow teams to focus on user-specific scenarios and exploratory testing.

In this blog, we'll cover best practices for integrating LLMs and GPTs into your software test automation, along with the most productive and common use cases. We'll also evaluate the benefits and challenges of using AI in your testing.

TL;DR


Large language models are reshaping software testing by automating repetitive tasks, expanding test coverage, and accelerating script creation—but only when applied with the right structure. The real gains come from pairing AI capabilities with strong human oversight, quality inputs, and deliberate implementation. Teams that approach LLM integration strategically see measurable returns; those that don't risk false positives, security gaps, and eroding QA expertise.

  • Prompt quality determines the value of every AI-generated test. Vague inputs produce generic outputs — precise, context-rich prompts are what make LLMs genuinely useful.
  • RAG and fine-tuning: choosing the right framework for your team. Retrieval-augmented generation fits most teams better than fine-tuning, delivering project-specific results without the resource overhead.
  • Where LLMs deliver the clearest, most immediate testing ROI. Test case generation, log analysis, and regression prioritization offer the fastest and most measurable impact on QA efficiency.
  • Why hallucinations and data sensitivity can't be treated as edge cases. AI-generated tests require mandatory human review. Confident-sounding but incorrect outputs are a known and recurring failure mode.
  • Human expertise remains the foundation AI-assisted testing is built on. Domain knowledge, risk-based strategy, and user-focused testing are capabilities LLMs cannot replicate or replace.

Integrating LLMs & GPTs into automated testing

When integrating LLMs into your automated testing workflows, success depends largely on how thoughtfully you approach implementation. The technology itself is powerful, but without proper structure and an understanding of its mechanics, AI-driven quality assurance will run into challenges. Let's examine the key considerations that will determine whether your AI-assisted testing delivers meaningful value.

Input relevance and quality

The quality of your inputs directly determines the quality of the AI's outputs. The more detail and clarity you provide, the less time you'll spend fixing the results. Useful inputs include user stories, acceptance criteria, past test data, bug reports, requirement documentation, and other contextual sources.

Prompt engineering basics for testing contexts

Beyond providing the aforementioned relevant inputs, how you structure these prompts also matters significantly. Effective prompt engineering for testing involves being explicit about what you need. 

Rather than asking "generate tests for login," a well-engineered prompt might specify: 

Example of prompt for testing contexts
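A hypothetical prompt in that spirit (every specific below, from the lockout threshold to the session timeout, is illustrative rather than taken from a real project) might read:

```text
You are testing the login feature of a B2B web application.
Context: users authenticate with email and password; accounts lock
after 5 failed attempts; sessions expire after 30 minutes.
Constraints: must follow OWASP session-management guidance; no CAPTCHA
is present in this release.
Task: generate 15 test cases covering positive, negative, boundary,
and security scenarios.
Format: a table with ID, title, preconditions, steps, expected result.
Scope: the UI login flow only; exclude SSO and password reset.
```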

As we see in this prompt example, the best prompts include:

  • Context: what the feature does and why it exists.
  • Constraints: technical limitations, compliance requirements, or business rules.
  • Expected format: plain descriptions, code, syntax, or other formats.
  • Scope boundaries: what should and shouldn't be tested.

The difference between a generic prompt and a well-engineered one can mean the difference between generating 10 basic test cases and generating 50 comprehensive scenarios that actually catch bugs. 

Utilizing frameworks

Retrieval-augmented generation (RAG) 

Retrieval-augmented generation, or RAG, is one of the most effective ways to make LLMs more practical for testing. Instead of relying only on the LLM's built-in knowledge, RAG enables the model to search through your provided resources, such as documentation, past bug reports, test libraries, and code repositories, before creating test cases. 

For testing purposes, this means the LLM can:

  • Pull information from your API documentation to create relevant integration tests.
  • Look at previous bugs to suggest tests that prevent those issues from happening again.
  • Follow your team's naming patterns and coding style when generating test scripts.
  • Read acceptance criteria directly from tools like Jira or others to ensure tests match requirements.

When your testing assistant can access your documentation, Jira tickets, and GitHub repositories through RAG, it generates test cases that actually fit your project, rather than generic tests that need further modification.
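To make the flow concrete, here is a minimal sketch of the RAG pattern. The retrieval step below uses naive keyword overlap purely for illustration; production pipelines use vector embeddings, but the assemble-context-then-prompt structure is the same, and all function names are our own.

```python
def retrieve(query, documents, k=2):
    """Rank project documents by naive keyword overlap with the query.
    Real RAG systems use embedding similarity; the flow is identical."""
    words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_test_prompt(requirement, documents):
    """Assemble a prompt that grounds the LLM in retrieved project context."""
    context = "\n---\n".join(retrieve(requirement, documents))
    return (f"Project context:\n{context}\n\n"
            f"Requirement: {requirement}\n"
            "Generate test cases grounded only in the context above.")
```

The returned string is what you would send to the model; because the relevant API documentation or bug report travels inside the prompt, the generated tests reflect your project rather than generic training data.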

Fine-tuning

Fine-tuning takes it further by actually training the model on your specific testing patterns. Fine-tuning can be valuable when you have a large collection of high-quality test cases and want the LLM to naturally write tests in your team's style, using your preferred tools and terminology.

However, fine-tuning is resource-intensive and needs specialized skills to set up and maintain. For most teams, RAG with good prompts delivers better results with less effort and cost. Fine-tuning only makes sense if you have thousands of test examples, machine learning expertise, and a clear business case.

API integration patterns

Effective integration involves connecting LLMs to your existing testing setup through well-crafted API patterns. The most common methods include:

  • On-demand generation: triggering test case creation through API calls whenever requirements are updated or new features are added. This works well in CI/CD pipelines, where tests are automatically generated as part of the development workflow.
  • Batch processing: periodically analyzing accumulated requirements, user stories, or code changes to create comprehensive test suites. This is useful during maintenance cycles or when reviewing test coverage across entire modules.
  • Interactive assistance: giving testers API-connected tools that provide real-time suggestions, autocomplete for test scenarios, or immediate edge case identification. 

No matter which pattern you select, make sure your integration includes proper error handling, timeout management, and fallback options. LLM APIs can be slower than traditional services and may sometimes fail or produce unexpected results. 
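The error-handling advice above can be captured in a small wrapper, sketched here under the assumption that your LLM call is any function that may raise a timeout or connection error; the function names and the cached-suite fallback are illustrative, not a specific SDK's API.

```python
import time

def generate_with_fallback(generate, retries=3, backoff=0.5, fallback=None):
    """Call an LLM-backed generator with exponential backoff between
    attempts; fall back to a safe default (e.g. a cached test suite)
    if the API keeps failing."""
    for attempt in range(retries):
        try:
            return generate()
        except (TimeoutError, ConnectionError):
            if attempt < retries - 1:
                time.sleep(backoff * (2 ** attempt))
    return fallback() if fallback else []
```

Wrapping whatever client your team uses this way keeps a flaky or slow LLM endpoint from blocking the pipeline: the build proceeds with the fallback suite and the failure is visible rather than fatal.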

The key to successful LLM integration in testing is to see it as an assistive layer that enhances existing processes rather than fully replacing human judgment and established workflows. When implemented with focus on input quality, prompt design, and suitable frameworks, LLMs are powerful tools that expand your team's testing capabilities.

What aspects of software testing should you use GPTs and LLMs for? 


While LLMs and GPTs offer broad capabilities, certain use cases stand out for delivering immediate, measurable impact. Below is a more precise list of those use cases, which should help you determine where LLMs fit into your test automation.

Generating test cases from requirements

One of the simplest use cases is converting requirements into test cases. When you provide the LLM with a user story or specification, it can generate comprehensive test cases covering positive paths, negative paths, boundary conditions, and edge cases.

For example:

Given the requirement “Users can reset their password by entering their email address”, the LLM could generate tests covering email validation, reset-link delivery, error handling for invalid formats, link expiration, and so on.

This can save hours of manual work and often highlights scenarios that would otherwise be overlooked.
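The output of that prompt might resemble the pytest-style tests below. The `request_password_reset` function is a hypothetical stand-in for the system under test, included only so the example is self-contained; in a real project the generated tests would call your own application code.

```python
import re

# Hypothetical stand-in for the system under test.
def request_password_reset(email):
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email or ""):
        return {"status": 400, "error": "invalid email format"}
    return {"status": 200, "reset_link_sent": True}

# Tests of the kind an LLM typically derives from that one-line requirement:
def test_valid_email_sends_reset_link():
    assert request_password_reset("user@example.com")["reset_link_sent"]

def test_malformed_email_returns_error():
    assert request_password_reset("not-an-email")["status"] == 400

def test_empty_email_returns_error():
    assert request_password_reset("")["status"] == 400
```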

Test data generation and edge case discovery

Creating realistic, varied test data is time-consuming yet critical for productive testing. LLMs can generate diverse, realistic datasets that match your requirements in a fraction of the time.

Beyond simple datasets, the models are also valuable for discovering edge cases. They can suggest scenarios outside typical testing, like names with special characters, timezone differences, and other inputs that are difficult to enumerate by hand. With LLM help, you can catch bugs that would otherwise slip through standard testing.
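As a flavor of what such generation looks like, here is a small, hand-picked sample of the name inputs an LLM can enumerate far faster and more exhaustively than a tester writing them by hand; the specific entries are illustrative.

```python
def edge_case_names():
    """Name inputs that routinely break naive validation."""
    return [
        "O'Brien",                    # apostrophe
        "Anne-Marie",                 # hyphen
        "José García",                # accented characters
        "李雷",                        # non-Latin script
        "a" * 256,                    # length boundary
        "   ",                        # whitespace only
        "<script>alert(1)</script>",  # injection attempt
    ]
```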

Automated script creation and maintenance

Based on your test case descriptions, LLMs can generate test automation code in frameworks like Selenium, Playwright, Cypress, and so on. Moreover, when UI elements change, rather than manually updating hundreds of selectors, LLMs can assist with script maintenance if you provide them with before-and-after HTML snippets. 

Understandably, some teams worry about generated scripts becoming brittle or drifting out of sync with incremental product changes. To address this, it's best practice to set up a regular review cadence: for example, schedule a weekly diff-based audit of your AI-generated automation scripts, comparing new script versions against previous ones and validating any significant changes, especially in critical flows. This habit catches unexpected updates before they cause issues and builds trust in AI-assisted maintenance. Over time, such systematic reviews turn maintenance skepticism into confident, evidence-driven progress.
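A diff-based audit like the one described above can be a short script. This sketch flags changed lines that touch critical flows so they get mandatory human review; the marker keywords and function name are illustrative.

```python
import difflib

def audit_script_diff(old_lines, new_lines, critical=("login", "checkout")):
    """Flag added/removed lines in an AI-maintained test script that
    touch critical flows, so a human reviews them before they ship."""
    flagged = []
    for line in difflib.unified_diff(old_lines, new_lines, lineterm=""):
        # Skip the "---"/"+++" file headers; keep real +/- change lines.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            if any(marker in line.lower() for marker in critical):
                flagged.append(line)
    return flagged
```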

Log analysis and debugging assistance

When automated tests fail, tracking down root causes often means combing through lengthy logs and stack traces. With LLMs, much of this analysis happens automatically, saving considerable debugging time. By passing error logs, stack traces, or failed test outputs directly to a model, you get a plain-language breakdown of what went wrong, which component was affected, and where to start looking.

This is especially valuable in complex, multi-service environments where a single failure can produce hundreds of lines of output. Rather than spending an hour analyzing logs manually, a tester can get a focused summary and a suggested course of action in minutes (or even seconds).
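In practice it helps to pre-filter the log before sending it to the model, so the prompt stays small and focused. A minimal sketch of that pre-filtering step (the patterns and function name are our own):

```python
import re

def summarize_failure(log_text, context_lines=2):
    """Extract the first error line plus surrounding context lines;
    this trimmed excerpt is what you'd hand to an LLM for analysis."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if re.search(r"ERROR|Exception|Traceback|FAILED", line):
            start = max(0, i - context_lines)
            return "\n".join(lines[start:i + context_lines + 1])
    return ""
```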

Regression test prioritization

As test suites grow, running every test on every build becomes impractical. LLMs can prioritize which regression tests to run based on code changes, affected modules, and past failure patterns. By analyzing a diff or a list of modified files, the model can identify which parts of the system are most likely to be affected and which tests should run first. Regression test prioritization keeps CI/CD pipelines lean and fast, tightening feedback loops without compromising confidence in the build. Over time, this kind of intelligent prioritization helps teams stay agile even as the codebase continues to scale.
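The core idea reduces to a ranking step. This deterministic sketch stands in for the ranking an LLM infers from a diff: tests mapped to changed modules run first, the rest follow; the mapping and names are hypothetical.

```python
def prioritize_tests(changed_files, test_map):
    """Order tests so those covering changed modules run first.
    `test_map` maps a test name to the modules it exercises."""
    hot = [t for t, mods in test_map.items()
           if any(m in changed_files for m in mods)]
    cold = [t for t in test_map if t not in hot]
    return hot + cold
```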

API testing and validation

Regarding API testing, LLMs can generate request payloads, build test scenarios for different endpoints, and verify responses against expected schemas. Given an OpenAPI specification, an LLM can develop complete test suites to check expected behaviors, error handling, and edge cases. They can also recommend security tests, including authentication bypass attempts and rate-limiting checks.
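To illustrate the spec-to-scenarios step, here is a toy expansion over a minimal OpenAPI-style dict: one happy-path case and one unauthorized case per operation. An LLM does the same at far greater breadth, adding schema violations, boundary payloads, and rate-limit probes; the dict shape shown is simplified.

```python
def scenarios_from_spec(spec):
    """Expand a minimal OpenAPI-style dict into happy-path and
    unauthorized request scenarios for each path/method pair."""
    cases = []
    for path, methods in spec.get("paths", {}).items():
        for method in methods:
            cases.append({"method": method.upper(), "path": path,
                          "auth": True, "expect": 200})
            cases.append({"method": method.upper(), "path": path,
                          "auth": False, "expect": 401})
    return cases
```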

Documentation generation

Testing documentation often lags behind implementation. LLMs can generate or update test documentation from existing code, offering clear explanations of each test’s purpose. This aids onboarding and maintains clarity as test suites grow.

Across these examples, LLMs are used to complement human testers. They take on repetitive, pattern‑driven tasks, allowing testers to focus on strategy, exploratory work, and validation. The technology works best when it removes friction and creates more time for high‑value testing activities.

Not sure where AI fits into your testing process?

Our AI-augmented testing services help you identify the right entry points and implement them without trial-and-error.

Benefits of using GPTs and LLMs for software testing

Beyond the usual “replacing manual, repetitive tasks” mentioned whenever the benefits of AI come up, there are also gains in speed, efficiency, coverage, costs, and more. Weighing whether these benefits are worth your investment will prepare you to set realistic expectations.

1. Improved speed and efficiency 

One of the most visible benefits is the reduction in time spent on repetitive testing tasks. Manually writing test cases for a mid-sized feature can take skilled engineers hours; with LLM assistance, the same process can be done in minutes. The impact on test case creation time is well documented. As a Capgemini study notes: “Confidence in AI’s commercial viability is growing, with 40% of organizations expecting positive ROI within one to three years.”

With the repetitive tasks taken care of, engineers can devote more of their time to exploratory or user-specific testing.

2. Broader test coverage

No matter how experienced engineers are, the variety of edge cases they can imagine is limited. Assumptions, habits, and past experience all tend to narrow the scope of the tests they write.

LLMs approach requirements without assumptions, systematically generating scenarios. This broader coverage ultimately turns into fewer production bugs. Teams using AI-assisted test generation consistently report discovering scenarios that were previously invisible to their QA process.

3. Reduced cost over time

Although implementing AI into QA processes comes with an upfront investment, the long-term cost benefits significantly outweigh it. Fewer bugs mean fewer emergencies, less downtime, and reduced risk overall. Moreover, maintaining automation scripts, usually a costly and time-consuming chore, becomes considerably cheaper and faster with LLMs.

Challenges and limitations of using GPTs and LLMs in software testing

As with the benefits, it's vital to have a realistic understanding of the limitations of using LLMs in testing. Knowing the downsides prepares you for what's coming and makes your processes less prone to emergencies.

1. Hallucinations and false positives

One of the most common issues with LLMs is hallucination: confident-sounding yet entirely false output. In our context, this means they may generate test cases for functionality that doesn't exist, or produce automation scripts that look valid yet fail on execution.

Because of this, as discussed previously, skilled human review is non-negotiable if you're planning to use LLMs or other AI in your projects.

2. Security and data sensitivity

Feeding proprietary code, customer data, or internal documentation into public LLM APIs carries significant security risks. Security and privacy breaches, including the leakage of sensitive information, are recognized failure modes of LLM-powered systems.

Before integrating LLMs into your workflow, establish clear policies on what data can and cannot be shared with external AI services. If you’re handling sensitive data, self-hosted or private LLM deployments, like BarkoAgent for optimizing internal document searches, may be necessary.

3. Skill atrophy and reduced critical thinking

It’s easy to get carried away with time-saving tools, especially when they’re trained on your specific requirements and data. That creates the risk of over-reliance on LLMs in testing, and with it the gradual erosion of hands-on skill and critical judgment.

Over time, letting your work be done for you can stall skill development and growth. Keeping in mind that LLMs should complement, not replace, testers’ work is what guarantees you stay current with the most relevant tech and use it to your advantage.

| Benefits of using LLMs & GPTs | Drawbacks of using LLMs & GPTs |
| --- | --- |
| Improved speed and efficiency | Hallucinations and false positives |
| Broader test coverage | Security and data sensitivity |
| Reduced cost over time | Skill atrophy and reduced critical thinking |

Why human QA engineers and human-driven testing still matter

Looking at both the benefits and drawbacks, we can conclude that AI cannot completely replace human QA engineers. Although it is great for broadening coverage and automating time-consuming tasks, teams that have implemented AI into their processes still rely on several human-driven practices.

Companies that use LLMs and GPTs also use:

  1. Structured, solid QA processes. LLM and GPT outputs can be inconsistent and inaccurate; rigorous testing frameworks, validation pipelines, and human review checkpoints are essential to catching errors before they reach end users.
  2. Risk-based test strategies. Not everything requires the same level of scrutiny. Experienced teams prioritize testing high-stakes outputs, where an AI hallucination or failure could cause real damage: financial, legal, reputational, regulatory, and beyond.
  3. Industry experience. Without deep knowledge of your domain, LLM outputs are as good as a guess. Sector expertise is what separates effective AI integration from poor AI integration. 
  4. The human factor. Most testing that touches on accessibility, user experience, user interfaces, and intuitiveness is directly tied to the human factor. Simply put, all technology (software, hardware, and everything in between) is built to be used by humans. So it only makes sense to test it that way.

All of these aspects are directly related to extensive human expertise, which AI cannot function without.

Final prompts

Using GPTs and LLMs for testing has proven effective, though it won't guarantee immediate success. The quality of the AI's outputs is determined by the quality of your inputs. The context, documentation, bug reports, and existing test cases you provide will shape how useful the integration turns out to be.

Moreover, it’s important not to get lost in the possibilities. Start with one or two use cases, measure the impact, and build from there. The teams seeing the greatest returns aren't those who've replaced their QA processes with AI—they're the ones who've been deliberate about where and how they've introduced it.

FAQ

Most common questions

What can LLMs actually do in software testing?

They generate test cases, create automation scripts, analyze logs, prioritize regression tests, and identify edge cases from requirements.

Do LLMs replace human QA engineers?

No. They handle repetitive, pattern-driven tasks, while human testers focus on strategy, exploratory testing, and critical validation.

What is the biggest risk of using LLMs in testing?

Hallucinations. AI-generated outputs can appear correct but be entirely wrong, making skilled human review non-negotiable.

Is RAG or fine-tuning better for test automation?

RAG is the better fit for most teams, offering project-specific results with far less cost, effort, and technical expertise required.

How should teams start integrating LLMs into their QA process?

Start with one or two focused use cases, measure the impact carefully, and expand only once clear value has been demonstrated.

Is your team getting the most out of AI-assisted testing?

We combine expert QA engineers with AI-augmented testing to give you faster coverage, fewer blind spots, and results you can actually trust, without the overhead of figuring it out alone.


Save your team from late-night firefighting

Stop scrambling for fixes. Prevent unexpected bugs and keep your releases smooth with our comprehensive QA services.

Explore our services