One of the most promising use cases for LLMs goes by the term RAG - Retrieval-Augmented Generation. It involves giving an LLM access to some data to be used as the context for a conversation - for example, pointing an LLM at your company’s documentation and having your colleagues ask questions about it.

Given how bad the search functionality on most document management systems is (think Confluence search), the LLM has a low bar to clear. And yes, based on my testing, the LLM does clear this bar in some ways, but it falls short in others.

The test setup

I have an old blog where I used to write daily posts for more than 5 years; it has a total of 1,866 posts. I decided to feed this data to an LLM and ask questions about it.

I used ChatGPT 4.1 for this exploration. The setup was surprisingly fast: all I had to do was export the XML file containing all my blog posts from WordPress and feed it to the LLM. It took the LLM less than 5 minutes to parse this data automatically and start answering questions. I was impressed!
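As an aside, the WordPress export is a standard WXR (RSS-style) XML file, so it is easy to inspect or parse by hand as well. Here is a minimal sketch in Python of what that parsing step might look like - purely an illustration, not what ChatGPT does internally; the namespace version and the file name are assumptions based on the standard export format.

    # Minimal sketch: extracting published posts from a WordPress WXR export.
    # The file name "blog-export.xml" is hypothetical; namespace URIs follow the standard WXR format.
    import xml.etree.ElementTree as ET

    NS = {
        "content": "http://purl.org/rss/1.0/modules/content/",
        "wp": "http://wordpress.org/export/1.2/",  # the version suffix may differ per export
    }

    def load_posts(path: str) -> list[dict]:
        posts = []
        for item in ET.parse(path).getroot().iter("item"):
            post_type = item.findtext("wp:post_type", default="", namespaces=NS)
            status = item.findtext("wp:status", default="", namespaces=NS)
            if post_type != "post" or status != "publish":
                continue  # skip pages, attachments and drafts
            posts.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "body": item.findtext("content:encoded", default="", namespaces=NS),
            })
        return posts

    if __name__ == "__main__":
        print(f"Parsed {len(load_posts('blog-export.xml'))} posts")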

I invested a total of about 16 hours in testing this setup.

Start with the expectations

When I started testing the setup, I quickly realized that instead of subjecting it to random tests, I needed to specify what I expected from it. Where LLMs differ from conventional software is that they are designed to be inconsistent - a lot depends on your context, which is directly informed by your expectations. Therefore, for exploratory testing of an LLM, it helps to start with your expectations. I wrote down a long list of expectations I had. Some of these included:

  • The bot always links the relevant blog posts in its answers to the questions I ask, if such posts exist
  • When posts related to a specific topic don’t exist, the bot returns no results
  • The bot merely retrieves insights or content from the blog, without adding anything on its own

I then transformed these expectations into custom instructions:

  • Act as a conversational interface for answering questions based on the content of the blog in your knowledge base.
  • Always link the relevant blog posts in your answer to any question I ask.
  • When posts related to a specific topic don’t exist, return no results.
  • Merely retrieve insights or content from the blog, without adding anything of your own.

Doing this gave me a much better basis for testing the LLM and its capabilities. Further, an implicit expectation I had was that the LLM would be better than my existing alternative - the search feature.
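For anyone who wants to reproduce a similar setup outside the ChatGPT UI, here is a rough sketch of how those instructions could be wired up as a system prompt around a retrieval step, using the OpenAI Python SDK. The retriever here is a naive keyword-overlap placeholder (not what ChatGPT uses), load_posts is the hypothetical parser from the earlier sketch, and the model name is an assumption - adjust all of these to your own stack.

    # Rough sketch: the custom instructions as a system prompt around a naive retriever.
    # Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    SYSTEM_PROMPT = (
        "Act as a conversational interface for answering questions based on the blog posts provided. "
        "Always link the relevant blog posts in your answer to any question I ask. "
        "When posts related to a specific topic don't exist, return no results. "
        "Merely retrieve insights or content from the blog, without adding anything of your own."
    )

    def naive_retrieve(posts: list[dict], question: str, k: int = 5) -> list[dict]:
        # Placeholder retriever: rank posts by crude word overlap with the question.
        terms = set(question.lower().split())
        return sorted(posts, key=lambda p: -len(terms & set(p["body"].lower().split())))[:k]

    def ask(client: OpenAI, posts: list[dict], question: str) -> str:
        context = "\n\n".join(
            f"{p['title']} ({p['link']}):\n{p['body'][:2000]}"
            for p in naive_retrieve(posts, question)
        )
        response = client.chat.completions.create(
            model="gpt-4.1",  # assumption; use whichever model you have access to
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Blog posts:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content

The point of the sketch is only to show where the expectations end up: as explicit instructions around the model that can be tested against, rather than assumptions in the tester’s head.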

Where it succeeded

LLMs today are great at dealing with unstructured data input. This is why the setup was so quick: all I had to do was throw an XML dump at it, which it managed to parse accurately in record time. This is where building a RAG model has an advantage over more involved alternatives such as model fine-tuning.

Where an LLM really leaves the search feature in the dust is in dealing with abstractions. For example, I wrote several blog posts on the topic of perseverance without actually using the word. These posts spoke about habits, consistency, persisting in the face of obstacles, and so on. If I merely searched for the term ‘perseverance’, I would have missed all these posts. With an LLM, I was able to retrieve some of them.

It is also good, to some extent, at identifying the theme of a post. For instance, when I asked the LLM for posts related to quality, it did well at surfacing the posts where quality was the main theme. A search engine, on the other hand, returns all posts where the term ‘quality’ appears, which isn’t very helpful.

The LLM was also good at sticking to the blog’s content while answering questions. When I asked it to retrieve posts on the topic of “disco dancing” or “self medication”, topics which I had not written about, it didn’t return any posts. This behaviour isn’t watertight - for example, it hallucinated a blog post when I asked: “What should I do to find a parking spot quickly?” But even this hallucination stayed within bounds.

Where it breaks

LLMs don’t have an accurate model of the world, and this came through when I asked the LLM to retrieve poems from my blog. My blog had at least 8 poems, of which the LLM consistently retrieved only two. On digging deeper, I found that I had tagged those two posts as ‘poems’. An LLM has no understanding of what a poem is. Further, it is unaware of its own lack of understanding. When I asked it if it was confident that it had retrieved all the poems, it stated that it was 95% confident.

An LLM also doesn’t understand the term ‘confidence’. I asked it “Give me advice on how I should break a bad habit that I have.”, and it retrieved about 5 relevant posts. I then asked “Are these all the posts?”, and it said, “The system’s confidence is high (about 90%) that the five posts listed are the main posts specifically focused on breaking bad habits in the blog.” When I then asked “Can you look again and see if you’ve missed some posts?”, it responded with 12 other posts.

I also found that the LLMs themselves are not rigorously tested by OpenAI and the other companies that make them. Given the wild competition and hype in the space, their motto is very much “move fast, break things”. This is apparent when you test them thoroughly. Two different chat instances of the same LLM can behave with wild inconsistency. While testing, one instance had, for some reason, become corrupted: it started hallucinating wildly and returning links that didn’t exist, which reeks of bad error handling. When I started another chat instance, these problems went away.

Other observations

The use of GenAI carries cognitive risks. Its conversational interface seduces us into liking it, even when we have better alternatives. Its confidence instills trust in us, even when that trust is misplaced. Its kind manners and sweet words manipulate us into dealing emotionally with a machine that needs to be dealt with coldly and rationally. I spun this observation out into a separate article.

It will prove to be a challenge (and an opportunity) to perform automated testing on GenAI. This is because automated testing, at every level, is rooted in consistency - the one thing we cannot expect from GenAI. I am looking forward to diving into this problem.
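One direction I intend to explore is property-style checks: instead of asserting on an exact answer, run the same question several times and assert invariants that must hold on every run - for example, “every link the bot cites points to a real post”. A minimal sketch of such a harness follows; ask_bot is a hypothetical stand-in for whatever interface is under test, and the question and run count are arbitrary.

    # Sketch of a property-style check for a non-deterministic RAG bot.
    # `ask_bot` is a hypothetical callable representing the system under test.
    import re

    URL_PATTERN = re.compile(r"https?://\S+")

    def links_are_real(answer: str, known_urls: set[str]) -> bool:
        # Invariant: every URL the bot cites must be a real post URL.
        return all(url.rstrip(".,)") in known_urls for url in URL_PATTERN.findall(answer))

    def test_no_hallucinated_links(ask_bot, known_urls: set[str], runs: int = 10) -> None:
        question = "Which posts did I write about breaking bad habits?"
        failures = []
        for i in range(runs):
            answer = ask_bot(question)  # non-deterministic: a fresh answer each run
            if not links_are_real(answer, known_urls):
                failures.append(i)
        # The invariant must hold on every run, even though the wording varies between runs.
        assert not failures, f"{len(failures)}/{runs} runs cited links that don't exist"

This doesn’t make the output deterministic, but it turns “the bot shouldn’t hallucinate links” into something a machine can check repeatedly.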

We often use GenAI for tasks where we lack the expertise to do them ourselves. For such tasks, it is difficult to evaluate the technology, given that very lack of expertise. Testing an LLM for a use case where I had a clear understanding of what was expected helped me evaluate its capabilities and establish their limits. Before doing this, I had the impression that LLMs were ‘magically’ solving all manner of problems I threw at them. Testing an LLM feels much like understanding how a magic trick works: when you peel back the layers, the illusion comes apart. I urge everybody to do this to avoid the risk of mistakenly using LLMs for tasks that they aren’t suitable for.

All the leading models in the market were pretty much on par for my particular use case. I subjected ChatGPT 4.1, Gemini Pro 2.5 and Claude Sonnet 4.0 to the same set of test cases, and the results were more or less the same.

Conclusion

My main takeaway - impressed as I was with LLMs and their RAG capabilities, they still don’t inspire enough confidence for me to put them directly in front of paying customers. Instead, I would recommend using a RAG model internally and tweaking it until it becomes more reliable. These models are difficult to test in an automated fashion - before they are put in front of human customers, they need to be thoroughly tested by human testers. Even with 16 hours of testing, I realized that I had merely scratched the surface.

This experience of testing an LLM proved to be quite humbling. It is the first time I have tested software that is inconsistent by design. Thus far, our testing practices have been established on a bedrock of consistency. While testing LLMs, I felt the ground shifting beneath my feet. Testing them effectively will require us to go back to first principles. We need to build the necessary skills in critical thinking, designing experiments, and using a disciplined, scientific approach to validate them. This experiment has motivated me to deepen these skills.

Here is an ongoing list of experiments that Michael Bolton and James Bach have conducted on LLMs. When I say that I wish to adopt a more disciplined and scientific approach, it is this kind of work that I aspire to.


Thanks to Neil Fernandes & Shankar Krishnamurthy for their feedback on earlier drafts.