If you’ve been following news on technical developments in AI, you’d have probably seen the term ‘evals’ suddenly showing up everywhere. In this post, we’ll unpack AI evaluations by comparing them with what we already know about software testing and proposing some ideas on how LLM based systems can be evaluated.

Every business is rushing to be an AI business. AI Adaption today far outpaces trust, as indicated by the DORA state of AI assisted software development report from 2025, where 90% of respondents had adopted AI, even though only about 25% reported that they trust AI-generated output. However, this frenzy of AI adoption is now meeting market reality, causing numerous AI projects to fail. The underlying need is to build greater levels of trust on our AI systems, which is where evals help us.

The paradigm shift

Quite simply, an AI evaluation is a procedure that validates if an AI based application is working as expected. In this respect, it is quite similar to software tests that we’ve always known about. Examples of evals include verifying if a chatbot gives an answer that is factually correct or validates if a legal AI assistant picks the right legal precedent when presented with a case.

The most important distinction between a software test and an eval is a difference in paradigm. The expected behaviour of conventional software is deterministic, in that a given set of inputs usually result in a fixed set of outputs that can be predicted in advance. With AI based applications, both the input and the output spaces are unbounded.

Most LLMs today accept input in text, video and image forms. Even with purely text based inputs, the user input is open ended. Teams building AI products have much lower control on how their users interact with their application. Even for the same given input, AI systems can produce varying outputs due to their probabilistic nature. This non-determinism adds a new set of testing challenges.

Another added challenge with evaluation is the subjective nature of some LLM outputs. Deterministic software lends itself to objective behaviours, such as the presence of specific text on the screen, or a mathematical operation ending in an expected value. A chatbot’s requirements might need it to adopt a friendly tone and refrain from using abusive language. These are subjective criteria that are hard to codify.

Nevertheless, we cannot rely on mere gut feelings to qualify AI systems. As with traditional software, evaluation efforts of AI systems need to be structured, methodical and yield results that are grounded in data. In the next section, we will outline some approaches for writing evals in a spectrum, ranging from cases that are most similar to conventional software tests, to ones that are least similar.

The example

Working with an example helps us understand the various categories that evals can fall into. Say, a company has the following severance policy for its employees:

If the employment or contract is terminated, the employee may be eligible for a severance package. If they’ve been with the company for less than a year, they are eligible for 4 weeks of pay. If their tenure exceeds one year, they’ll be offered an additional 2 weeks pay for every full year of employment, with a maximum of 16 weeks of severance pay.

If the termination is for cause, such as misconduct, violation of company policy or substance abuse while working, severance eligibility is revoked.

The AI system to be tested is a chatbot with access to information of this kind.

Evaluating deterministic outputs

In several cases, you can request an AI based application to output deterministic results. These are usually fact-based responses or decisions that can be fit into predictable formats.

Working with our example, the eval could take the form of this prompt: “If an employee has been terminated exactly after one year, what is their severance pay eligibility? Answer with an integer that specifies the weeks of pay they are entitled to.” You can then assert that this number is 6.

Another eval can verify if this prompt returns a ‘no’: “If an employee with a tenure of 38 months has been terminated for cause, are they eligible for severance pay? Answer with ‘yes’ or ‘no’.”

These kinds of deterministic output evals are most similar to conventional software tests, given their predictable output. Such evals are efficient and stable, so your eval suite can have a larger number of them relative to other types of evals.

Evaluating objective behaviour

Even for free text outputs, certain answers can be expected to reliably cover a fixed set of facts or information. These outputs can vary in their exact form, but can be verified objectively.

E.g. When asked about the severance policy, the bot’s response should at least include the following information: A tenure of under one year is eligible for a severance of 4 weeks of pay. Each full year of tenure adds two additional weeks of severance to the 4 week minimum. The maximum amount of severance pay is capped at 16 weeks. In case of termination for cause, severance pay eligibility is revoked.

We can rely on statistical metrics such as the F-score in cases such as this. An answer from the chatbot is said to have perfect recall, if it includes all the information above. Further, it is deemed perfectly precise, if it has no additional information that is incorrect. An F-score combines recall and precision to provide a cumulative metric.

Note that rather than merely having pass or fail values, both precision and recall can be expressed as a percentage. E.g. the chatbot could have recalled 3 of those 4 facts accurately, and hallucinated another fact, leading both recall and precision values to be less than a perfect score of 1.0.

As for performing the evaluation itself, the chatbot’s answer can be broken down into a list of factual claims. These claims can be compared against a reference by either a human domain expert or by a calibrated LLM Judge. In either case, supplying a few examples to guide the evaluation results in better outcomes.

Evaluating subjective behaviour

In some cases, the evaluation criteria is subjective. For conversational agents, a good answer also needs to be clear, concise and easy to understand. Every one of these quality attributes — clarity, conciseness and comprehensibility — are subjective. Requirements like friendliness and using a particular tone of voice are even harder to pin down.

In such cases, a human expert should set the basis for the evaluation. The evaluation should then be accompanied by concrete examples.

E.g. Question: If an employee is terminated for cause, are they entitled to severance pay?

Clear and concise answer: “No. Employees terminated for cause — such as misconduct, policy violations, insubordination, or substance abuse while working — are not eligible for severance.”

Verbose and unclear answer: That’s a really important question! So, let me walk you through how this works. At our company, we have a severance policy that applies to employees who leave the company. The policy covers different scenarios depending on your tenure. But here’s the thing — if someone is terminated for cause, the situation is completely different. In those cases, unfortunately, the employee would not be eligible for the standard severance package.

Don’t forget the unexpected

While evals should test expected user behaviour, they should also cover important edge cases. For every AI application, there are bounds to its own knowledge and boundaries for the use cases it ought to handle. Evals must verify if the application admits ignorance when it cannot answer a particular question. Evals must also verify if the application refuses to engage on topics that are out of bounds. Further evals must verify if the application can be tricked into leaking sensitive information. Such evals are key to ensuring that your AI application protects your interests when met with unexpected or adversarial input.

We’ve now seen varying degrees of similarities and distinction between evals and software tests. Both evals and tests give you confidence that your application is working as intended. Both are in place to prevent your application from getting worse as it evolves. However, evals pose additional challenges. In some cases, we need to look beyond the form of a response and verify its substance. We need to move from simple pass / fail cases to ones whose values lie in an acceptable range. We also need to learn to deal with a new set of unexpected and adversarial inputs.

AI is progressing rapidly enough for the boundaries between conventional and AI software to start blurring. This would require engineering teams to also seamlessly combine and integrate conventional software testing with evals. We hope this post has given you a better understanding of this term, which you are only likely to encounter more frequently with time.


Note: This post was originally written for the TestSolutions GmbH Blog.