
In Michael Taylor’s work as a prompt engineer, he’s found that many of the issues he encounters in managing AI tools—such as their inconsistency, tendency to make things up, and lack of creativity—are ones he also struggled with when managing people. It’s all about giving these tools the right context to do the job, just like with humans. In the latest piece in his series Also True for Humans, Michael explores retrieval-augmented generation (RAG), in which you first search your documents to pass relevant context to the LLM so it can generate a more accurate answer.—Kate Lee
Was this newsletter forwarded to you? Sign up to get it in your inbox.
In 2012, on a trip from London to Havana, Cuba, I got into a heated debate with friends: Who is the best-selling author of all time—well, at least of modern times? (Sorry, Shakespeare.)
I thought it was Stephenie Meyer, author of the vampire romance series Twilight, but my friends disagreed. One thought it was Harry Potter writer J.K. Rowling, and another was convinced it was Arthur Conan Doyle, the author of the Sherlock Holmes books. There was no mobile internet in Cuba at the time, so we couldn’t look up the answer. Instead, we made plausible-sounding arguments for hours.
When we got home to London, we looked up the answer on Wikipedia and learned the truth: While Meyer had been the best-selling author the past few years, first place belonged to the detective novelist Agatha Christie, with 2–4 billion copies sold. By contrast, Meyer has sold about 100 million. None of us were right, but our answers were within the realm of believability, if not possibility.
ChatGPT does the same thing: It gives plausible-sounding answers, or “hallucinations,” when it doesn’t have all the facts. In fact, everything ChatGPT tells you is made up. Large language models don’t really know the answer to any question. They’re just giving you the answer that’s statistically most likely based on their training data. ChatGPT is often close enough, however, that many people feel comfortable using it without fact-checking everything it says. (You should absolutely fact-check ChatGPT.)
But there’s a simple solution that artificial intelligence developers have built into their LLMs to solve the hallucination problem and ultimately make these systems more accurate: retrieval-augmented generation, or RAG, in which the LLM you’re using first does a vector search (more on that later) to find relevant information, which is then inserted into the prompt as context for the AI model to consider. If you can breathe better accuracy into AI, it can go from complex probabilistic guesswork to something much more reliable and helpful.
In the AI gold rush, startups such as Pinecone, Glean, Chroma, Weaviate, and Qdrant raised hundreds of millions of dollars selling RAG technology to AI developers. It’s also being built directly into popular generative AI applications. OpenAI’s custom GPTs—the third-party versions of ChatGPT that OpenAI hosts—can use RAG to draw on the documents that users upload, searching for relevant snippets to include as context in the prompt. Once paired with RAG, AI systems are essentially taking open-book exams when they communicate with you. The answer is somewhere in the book—on Google, in your PDFs, in your chatbot message history. They just have to look for it.
Unsurprisingly, LLMs are better at finding an answer in a long document than they are at guessing the answer without context. With a little help, they don’t need to make something up, and you can get accurate answers based on sources of data that you trust. I’ll show you how RAG can make AI more trustworthy by teaching it to look up answers instead of guessing.
Source: Screenshot from ChatGPT.
What if every concept had a postal address?
AI models don’t really see words. They use numbers to represent them. Think of these numbers as a postal address or map coordinates, with the most similar concepts clustered closer together on a graph. When you upload documents to a RAG-enabled AI system, it searches to find relevant “chunks” of information and hands them over so the application can answer more accurately. To find the most relevant chunks in a document, it effectively plots them on a graph and looks for the ones that are closest to your prompt.
Imagine a simple AI model that has three variables: age, gender, and royalty. The three-dimensional space of the graph is called latent space—every word, phrase, or concept you query has a location in this space. In the 3D graph below, find the point where the word woman is located. Treat gender as a spectrum: As you change that number from 10 to 0, you arrive at where the man point sits on the graph. Alternatively, if you move along the age spectrum from man, you’ll get to boy, because a boy is just a young man. Finally, if you increase the, er, royalness of the boy, you get to prince, which is a royal boy.
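To make those coordinates concrete, here’s a toy sketch in Python with invented values on the (age, gender, royalty) axes; the numbers are made up for illustration, not real model output.

```python
# Toy three-dimensional "latent space" with axes (age, gender, royalty).
# The coordinates are invented for illustration only.
points = {
    "woman":  (30, 10, 0),
    "man":    (30,  0, 0),   # same age, gender dial turned from 10 down to 0
    "boy":    ( 8,  0, 0),   # slide the man point along the age axis
    "prince": ( 8,  0, 10),  # turn up the royalty dial on the boy
}
```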
Source: Carnegie Mellon University.
These models don’t have just three simple dimensions but thousands, meant to capture all the nuances, traits, and characteristics of every known concept in the world. Here’s what the vector for the word unicorn looks like, as determined by OpenAI’s embedding model text-embedding-ada-002, which returns a vector for any text you send it. (This has been shortened from 1,536 numbers.)
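If you want to generate such a vector yourself, the call looks roughly like this with OpenAI’s Python SDK (a minimal sketch, assuming the openai package is installed and an OPENAI_API_KEY is set in your environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the embedding model for the vector that represents "unicorn"
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="unicorn",
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions
print(vector[:5])   # the first few coordinates
```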
We may not be able to visualize a graph in 1,536 dimensions, but we can intuit that concepts that are similar to each other will have sets of numbers that are close by (i.e., they would be neighbors in latent space). If you have the location of one concept, you can see where on the map it sits and who its neighbors are. Every time you do a RAG search, your prompt gets turned into a vector, and a vector database surfaces the chunks of your documents that are closest to it. These get returned and fed into the prompt as context for the LLM. Putting only the relevant parts of the documents into the prompt is a lot more efficient than stuffing full documents in at once, so the LLM can process your query faster and more cheaply.
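“Closest” here usually means something like cosine similarity between vectors. Here’s a minimal sketch of that nearest-neighbor lookup, with tiny made-up three-dimensional vectors standing in for real 1,536-dimension embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Return how closely two vectors point in the same direction (1.0 = identical)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-d vectors standing in for real embeddings of document chunks
chunks = {
    "Agatha Christie has sold an estimated 2-4 billion copies.": [0.9, 0.1, 0.2],
    "The wheel was used for pottery long before transport.":     [0.1, 0.8, 0.3],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "Who is the best-selling author?"

# The chunk whose vector sits closest to the query gets returned as context
best_chunk = max(chunks, key=lambda text: cosine_similarity(chunks[text], query_vector))
print(best_chunk)
```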
Finding a needle in a haystack
LLMs aren’t great at everything, but the latest models do an adequate job of finding correct answers in lengthy prompts. When researchers run so-called needle-in-a-haystack tests, in which they hide a question’s answer in the middle of a long document, OpenAI’s GPT-4o model is 90–100 percent accurate, handling up to about 96,000 words at a time.
Source: Mark Erdmann on X.
However, if you have a lot of documents, it’s not ideal to stuff them all into a prompt. It could be costly and slow, and it might confuse the model with too much irrelevant information. It’s like requiring your students to reread the whole book before answering each question on an open-book test.
Source: Screenshot of SOP Guide, a custom GPT.
To illustrate this point and show how vector search works, let’s take the humble company handbook. Simply put, no employee wants to read a company handbook, but it contains all the rules, policies, and procedures that you can’t ignore. In that spirit, we might feed the handbook to a large language model and ask it questions. We can’t stuff the whole document into the prompt every time—it takes more time and costs more money to process long prompts (so the tool provider will limit your requests). Instead, we have to split the tome up by page, paragraph, or section so that we can use RAG to pull out only the most relevant text snippets, or chunks.
Because the search is based on similarity of concepts, this chunking strategy makes a real difference. If you cut off the text mid-sentence or mid-paragraph, the chunk may lose its meaning. For example, a sentence from Harry Potter might be 80 percent about Quidditch, but the broader paragraph might be only 50 percent about Quidditch, and the page less than 10 percent about Quidditch. If your chunks come from individual sentences, the LLM might not know the context in which Quidditch is being discussed. If your chunks are full pages, the vector search might miss that chunk entirely because the Quidditch signal is too diluted for its vector to be a close match. AI engineers tune their chunking strategy for different documents to find the right trade-off between context and coherence.
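One simple strategy is to slide a fixed-size window over the document with a little overlap, so no sentence is stranded without its surroundings. Here’s a generic sketch of that idea (not the exact strategy used in the example below):

```python
def chunk_text(text, max_chars=800, overlap=100):
    """Split text into overlapping character windows.

    Bigger chunks keep more context but dilute the topic of any one sentence;
    smaller chunks match queries more precisely but can lose their meaning.
    """
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap so ideas aren't cut off mid-thought
    return chunks
```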
Below is an example I custom-coded using Meta’s FAISS vector store, an open-source library you can use to do RAG on your documents. I uploaded an employee handbook for a fictitious company called Unicorn Enterprises. The fictional employee searches the handbook with the question, “Do we get free unicorn rides?” The LLM finds one chunk from the handbook (we set k=1, which tells it to retrieve only one chunk) that discusses the unicorn petting zoo, but it doesn’t mention unicorn rides. The paragraph is cut off at the words “our culture” at the end of the text. If I had pulled in more than one chunk of text, or split the text on a different section of the page, I might have gotten the right answer.
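For the curious, here’s a compressed sketch of this kind of setup using the faiss and openai Python packages. It isn’t the author’s actual notebook; the handbook file name and the chunking are placeholders.

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Turn a list of strings into a float32 matrix of embeddings."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in response.data], dtype="float32")

# Split the handbook into chunks (here, one chunk per blank-line-separated section)
handbook = open("unicorn_enterprises_handbook.txt").read()
chunks = [c.strip() for c in handbook.split("\n\n") if c.strip()]

# Build a FAISS index over the chunk embeddings
vectors = embed(chunks)
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Retrieve the single closest chunk (k=1) for the user's question
query = "Do we get free unicorn rides?"
_, ids = index.search(embed([query]), 1)
context = chunks[ids[0][0]]

# Ground the LLM's answer in the retrieved context
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```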
Source: Screenshot of a Python Jupyter Notebook courtesy of the author.
When it does manage to get the right chunk of text, it answers without hallucination because it’s grounded in the context we got from RAG. If a user asks about wearing costumes at work, that information is perfectly captured in one chunk of text, and the response generated by the LLM gives the right answer: “You can wear a costume to work on Fridays for the ‘Wear Your Favorite Mythical Creature Costume’ day at Unicorn Enterprises.”
Source: Screenshot of a Python Jupyter Notebook courtesy of the author.
RAG as the solution to the hallucination problem
RAG was first proposed by Meta’s AI team in 2020, and although vector-based search had been around for decades, the idea of dynamically putting the results of a vector search into the prompt to improve factual answers was inspired. Meta’s setup at the time encoded a user query, “Where was Barack Obama born?”, into a vector and searched Wikipedia to retrieve the fact “Barack Obama was born in Hawaii” before generating an answer using that context. Without RAG, we would be relying on what the LLM learned from its training data (pretty much the whole internet), which might produce more hallucinations than a trusted source.
Source: Arxiv.
Human evaluators judged this RAG system’s output more factual than a previous system’s 42.7 percent of the time on Jeopardy! questions. For fact-verification tasks, RAG scored 72.5 percent accuracy on three-way classification—i.e., looking at the supporting evidence for a claim and determining whether the evidence supports it, refutes it, or there isn’t enough information either way—coming within about 4.3 percentage points of the specialized state-of-the-art systems of the time.
As LLMs such as Anthropic’s Claude (200,000-token context window), OpenAI’s GPT-4o (128,000 tokens), and Google’s Gemini 1.5 Pro (2 million tokens) have become able to handle longer contexts (aka bigger prompts), AI practitioners have questioned whether RAG is still as useful when you can fit a whole series of books or videos in the prompt. However, the cost and latency of processing large prompts are still significant, and there is evidence that performance degrades when too much irrelevant information is passed to the LLM. Databricks found that all but the largest models regularly fail to use relevant information that appears lower down in the prompt.
Source: Databricks.
Solving the discovery problem
Some of history’s greatest inventions and innovations came from the simple act of combining two or more ways of doing something. Wheels were used to make pottery for 1,500 to 2,000 years before we realized their value for chariots and wagons. Then we had to wait another 4,000 years before somebody had the bright idea to put wheels on luggage. AI is not yet good at inventing completely novel concepts like a new scientific theory, but it is great at synthesis—combining and integrating multiple ideas from a wider corpus of information.
In most large companies, there is usually someone who has the answer you need, but it’s too time-consuming, costly, or frustrating to locate that person or find the right document. It’s often faster to figure out how to do something yourself, so there is a lot of duplication of effort. (Even smaller companies can have this problem, as I found when I ran a 50-person marketing agency.)
Whenever I saw the same question three or more times, I would carve out some time to write an SOP, or Standard Operating Procedure. I put them in a Google Drive folder that anybody in the company could search if they wanted to know how to do a task. But with hundreds of SOPs in the folder, often the team couldn’t find the one they needed and would either message me (hundreds of Slack messages a day!) or, worse, waste time figuring things out themselves.
If I were running an agency today, I’d upload those SOPs into a custom GPT, which could find the right document with a vector search. Instead of asking their manager, “How do we do this task?”, employees could ask the custom GPT, which has access to all of the company’s SOPs. You could take things further and turn every employee training call into an SOP: Record the call with an AI service such as Grain or Otter.ai, and pass the transcript to ChatGPT using Zapier to write an SOP for you. Suddenly you have an ever-growing, automatically updated company knowledge base with a user-friendly chatbot interface.
Source: A screenshot of Zapier.
Surprisingly, for all the companies selling vector databases for RAG applications on the B2B side, there has been limited progress on the consumer side in making the user experience seamless. OpenAI’s custom GPTs are relatively hard to set up and not commonly used within organizations. I can’t connect a Google Drive folder (or even a Microsoft OneDrive folder from OpenAI’s lead investor, Microsoft) to give a custom GPT access to my SOPs. Coding tools like Cursor have made great progress using RAG internally: Whenever I’m writing code in the editor, I can use the @ symbol to pull the documentation of a specific coding library into my prompt. But the best implementation of RAG for non-technical users I’ve seen is Google’s NotebookLM, where you can upload documents and ask questions, getting citations back that point to the specific references within your documents.
Source: A screenshot of Google NotebookLM.
As we figure out the right user experience for interacting with RAG systems, AI systems will become more useful, connecting seemingly disparate ideas at the right time to help us do tasks.
These systems can be used to give LLMs a form of memory, searching past interactions or previously uploaded documents for useful information. This memory will work much like our own, in that thinking about one idea might unlock random connections to other ideas we thought we had forgotten. As context windows expand and more of our lives get digitized for the benefit of our AI assistants, they’ll get more trustworthy and be more useful as creative partners. Imagine writing a book and being able to ask ChatGPT what articles you had read in the past five years that were related to the current paragraph you’re working on. The implications of having an AI assistant that never forgets have yet to be figured out.
Michael Taylor is a freelance prompt engineer, the creator of the top prompt engineering course on Udemy, and the coauthor of Prompt Engineering for Generative AI.
To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.
We also build AI tools for readers like you. Automate repeat writing with Spiral. Organize files automatically with Sparkle. Write something great with Lex.