
In Michael Taylor’s work as a prompt engineer, he’s found that many of the issues he encounters in managing AI tools—such as their inconsistency, tendency to make things up, and lack of creativity—are ones he used to struggle with when he ran a marketing agency. It’s all about giving these tools the right context to do the job, just like with humans. This piece is the latest in his series Also True for Humans, about managing AIs like you'd manage people. Michael explores few-shot learning, providing examples of the task you want the LLM to do. You probably wouldn’t hire a human without showing them how to do the work, so put a few examples in the prompt to make your AI tools do their best work for you.—Kate Lee
When I went to put sunscreen on my 5-year-old daughter on vacation last week, she said to me, “You have to start with the nose, because that’s the most likely to burn.” That’s something I taught her, but it’s also something my father taught me. At that moment, I had a flashback to when I was a kid at the beach: “You have to start with the nose,” my father told me.
In imitating my father’s method of sunscreen application, I’ve passed it down to my daughter, who imitated me. Hopefully all my future descendants will avoid burned noses, too.
Imitation is one of the great superpowers of the human species. Trial and error is an expensive and dangerous way to learn. As advertising executive Rory Sutherland says, “An organism that had to learn everything from first principles would eat a hell of a lot of poisonous berries.”
Babies can imitate their parents as early as eight months old by clapping, waving, or sticking out their tongues. Children are so hardwired to imitate that they will copy actions even though they serve no obvious function, a characteristic that separates us from most other animals. Copying has been so extraordinarily helpful to our species that we evolved to max it out.
But now, there is a new master imitator on the block: generative AI. Give ChatGPT a handful of examples of the task you’re asking it to do, and it will learn from those examples and do a better job than it would otherwise.
Providing examples is so powerful that researchers testing new AI models report a model’s performance on tasks separately based on the number of examples given:
- Zero-shot: The model is given instructions only, with no examples of the task.
- One-shot: The model is given a single example to learn from in order to generate the response.
- Few-shot: The model is given a small number of examples, usually under 10.
- Many-shot: The model is given a larger number of examples, often dozens or hundreds.
This technique is so effective that the ability to complete a task zero-shot at all is considered an impressive feat by AI researchers. Only the largest of large language models are capable of completing tasks without examples, and even then it’s worth putting in the work to find good examples to add to the prompt.
In fact, I can’t think of a single prompt I worked on that ended up getting used in a product that didn’t have at least one example in the prompt—this technique is that powerful and prevalent. I’ll show you how to get the most out of your prompts by using examples.
Showing AI how the job is done
If I were trying to classify social media posts about my product as neutral, negative, or positive, I could write a simple zero-shot prompt.
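A minimal sketch of such a prompt, with an invented example post:

```
Classify the sentiment of the following social media post about our product as neutral, negative, or positive. Respond with one word.

Post: "Ordered these sneakers two weeks ago and they still haven't shipped. Never again."

Sentiment:
```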
This would do a decent job out of the box, with ChatGPT responding “negative.” However, it might not be reliable or accurate enough across hundreds of harder cases, particularly if I want responses in a certain style that aligns with my organization’s preferences. Thankfully, it’s often easier to show what you want with examples than it is to describe how to act in every situation.

A few-shot prompt adds a few examples of the task directly into the prompt so the LLM can generalize how to complete it, inferring what to do in cases it hasn’t seen yet. It helps if the examples are diverse; otherwise, you might constrain the creativity of the responses you get: The LLM will follow your examples too closely and ignore good ideas that don’t match the pattern. Usually two or three examples covering the most common scenarios are enough to get the prompt working the way you want.
Here’s what a few-shot prompt might look like:
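(The posts and labels here are invented for illustration; yours would come from real data.)

```
Classify the sentiment of the following social media posts about our product as neutral, negative, or positive. Respond with one word.

Post: "Just got my pair in the mail. Best purchase I've made all year!"
Sentiment: positive

Post: "Does anyone know if these run true to size?"
Sentiment: neutral

Post: "The sole started peeling off after a week. Really disappointed."
Sentiment: negative

Post: "Ordered these sneakers two weeks ago and they still haven't shipped. Never again."
Sentiment:
```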
Adding few-shot examples to a prompt is referred to as “in-context learning,” and it’s a remarkably efficient and effective technique. Training your own generative AI model is unfathomably expensive, and even fine-tuning an existing model from OpenAI, Google, or Anthropic costs money and requires technical ability. Anybody, however, can add examples to their prompt and get a significant immediate boost in performance.

These few-shot examples can go directly in the prompt you send to ChatGPT or, if you’re using the API as a developer, in the system message. Another pattern for developers is inserting few-shot examples as historical messages, tricking the LLM into thinking it has already responded to the user in the correct way in the past. Then, when you pass the final prompt for the actual task you want the LLM to do, it learns from what it thinks are its past messages with the user.
A few-shot prompt template might look like this:
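(A sketch in Python with the OpenAI SDK; the posts and labels are placeholders, and the fake user and assistant turns are the historical-messages trick described above.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

messages = [
    # The system message states the task.
    {
        "role": "system",
        "content": "Classify the sentiment of social media posts about our "
        "product as neutral, negative, or positive. Respond with one word.",
    },
    # Few-shot examples inserted as fake history: each user/assistant pair
    # makes the model think it has already answered this way before.
    {"role": "user", "content": 'Post: "Best purchase I have made all year!"'},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": 'Post: "Do these run true to size?"'},
    {"role": "assistant", "content": "neutral"},
    # The final message is the actual post we want classified.
    {"role": "user", "content": 'Post: "Two weeks and my order still has not shipped."'},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)  # expected: negative
```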
Andrej Karpathy, an OpenAI co-founder and the former senior director of AI at Tesla, calls this ability to program a response from an AI model using only the prompt "Software 3.0." We’ve moved from programmers manually coding algorithms (Software 1.0) through curating datasets for machine learning (Software 2.0) to “curating prompts to make the meta learner ‘get’ the task it's supposed to be doing” (Software 3.0). The advantage of this new paradigm is that anyone can “program” an LLM to do a task by providing examples of how to do that task in plain English in the prompt.

Source: X/Andrej Karpathy.

The proof that few-shot learning works
While the foundations for few-shot learning were laid by earlier transformer models, OpenAI’s GPT-3 was the first model to exhibit in-context learning at a scale and effectiveness that caught widespread attention. The technique allowed a single model to potentially perform a wide variety of tasks without task-specific training. The original GPT-3 paper published a chart showing few-shot prompting as an emergent behavior: The larger the LLM, the better it was at learning from examples in the prompt. Few-shot examples didn’t seem to matter much for the two smaller models evaluated, at 1.3 billion and 13 billion parameters respectively, but for the 175-billion-parameter GPT-3, they had a sizable impact on the accuracy of completed tasks.
Source: arXiv.

The researchers tested few-shot learning across a range of tasks and in all cases found that adding some examples to the prompt had a major impact:
- On a set of challenging language understanding tasks, the score improved from 58.2 percent with no examples, to 68.9 percent with one example, to 71.8 percent with 32 examples.
- On a general knowledge quiz, the score improved from about 64 percent with no examples, to 68 percent with one example, to 71 percent with a few examples.
- On a task requiring understanding of context, the score jumped from 76 percent with no examples, dipped slightly to 72.5 percent with one example (the fill-in-the-blank method used is not effective one-shot, according to the authors, because the models require several examples to recognize the pattern), then rose to 86 percent with a few examples.
- On math problems, the improvement was huge. For simple addition, the score went from 77 percent with no examples to nearly perfect (99.6 percent) with one example, and 100 percent (perfect score) when given a few examples.
A word of warning: Studies have shown that the results of few-shot learning can be sensitive in some cases to the ordering of the examples in the prompt, as well as to how they are formatted. LLMs tend to pay more attention to examples that appear at the beginning or end of the prompt, or that are common in the pre-training data. This tendency echoes primacy bias (people remember and give more importance to information they encounter first), recency effect (people recall and prioritize the most recent information they've received), and availability bias (people overestimate the likelihood or importance of things that are more readily available in memory or easier to recall) in humans. LLMs are trained on human output, so it makes sense that they would replicate our biases. The best practice is to test your examples in different orders and formats until you find an arrangement that reliably improves performance.
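If you’re building with the API, a rough sketch of that kind of test might look like the following; it assumes you have a small labeled test set to score against (the one below is a placeholder):

```python
from itertools import permutations

from openai import OpenAI

client = OpenAI()

# Few-shot examples whose ordering we want to test.
EXAMPLES = [
    ('Post: "Best purchase I have made all year!"', "positive"),
    ('Post: "Do these run true to size?"', "neutral"),
    ('Post: "The sole peeled off after a week."', "negative"),
]

# Placeholder test set: replace with real labeled posts from your data.
TEST_SET = [
    ('Post: "Arrived on time and fits great."', "positive"),
    ('Post: "Customer service never replied to me."', "negative"),
]

def build_prompt(ordered_examples, post):
    shots = "\n\n".join(f"{text}\nSentiment: {label}" for text, label in ordered_examples)
    return (
        "Classify the sentiment of the post as neutral, negative, or positive. "
        "Respond with one word.\n\n" + shots + f"\n\n{post}\nSentiment:"
    )

def accuracy(ordered_examples):
    correct = 0
    for post, expected in TEST_SET:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": build_prompt(ordered_examples, post)}],
        )
        correct += response.choices[0].message.content.strip().lower() == expected
    return correct / len(TEST_SET)

# Try every ordering of the examples and report how each performs.
for order in permutations(EXAMPLES):
    print([label for _, label in order], accuracy(order))
```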
Monkey see, monkey do
Our ability to learn from just a few examples is a cornerstone of human intelligence, so it's no surprise that AI researchers have drawn inspiration from it. Humans are natural imitators, and learning by observing and copying others is a crucial part of our cognitive development.
Our brains are wired to form quick prototypes or schemas based on limited exposure. This ability allows us to generalize from a small number of examples to a broader category or concept. For instance, a child who sees a dog for the first time quickly forms a mental prototype of what a "dog" is. When they encounter other four-legged animals, they can rapidly categorize them as "dog-like" or "not dog-like" based on this prototype. Few-shot learning works similarly in AI: A model uses a small set of examples to understand and generalize a new task or concept, even when the specific case it faces is not among the examples provided.
Both humans and AI benefit from prior knowledge when engaging in few-shot learning. In cognitive psychology, this is often referred to as "transfer learning." We use our existing knowledge and skills to make sense of new information and tasks more quickly. For example, a person who knows how to play the guitar might pick up the ukulele much faster than someone with no musical background. Similarly, in AI, large language models like GPT-4o—OpenAI’s current top model—use their vast pre-training as a foundation for quickly adapting to new tasks with just a few examples.
While this is a powerful technique in text prompts, the same thing applies to images, audio, and video. When generating an AI image, uploading an example image tends to be the quickest way to match a specific art style or composition. Google’s MusicLM model can generate music from hearing you hum a tune. Tesla’s full self-driving functionality was trained by observing billions of miles of human driving. The robotics company Figure is training humanoid robots to make coffee or fold laundry by watching demonstrations by humans.
Perhaps most of us will eventually be employed to train robots on tasks they’re struggling with. It’s already what I’m spending most of my time doing as a prompt engineer: identifying the best examples of a task to give the LLM a fighting chance at completing it.
What would Steve Jobs name this product?
The clearest demonstration of the impact of few-shot examples is in the task of brainstorming new product names. (If you want to follow along, here is some code to run the prompts I’m about to share.) Let’s say you invented a shoe that fits any foot size and wanted to use AI to come up with a name. You are a huge Steve Jobs fan—you wear the black turtleneck and everything. Here is a prompt you could use:
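(A sketch; the exact wording can vary.)

```
You are Steve Jobs, the visionary behind Apple's most iconic product launches.

Brainstorm five names for this new product: a shoe that adjusts to fit any foot size.
```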
When I ran this through GPT-4o, the names it generated were fairly creative. The problem is that they’re just not very Jobsian—you’d expect to see something like iFit, in keeping with the iPod, iPad, iPhone naming convention.

This is where few-shot examples come into play. You already told the LLM you want Jobs-style names, but it didn’t listen. Furthermore, you have a very subjective preference, in that you’re hoping to see names that begin with the prefix i-. To make this happen, I wrote up three other made-up products and made all of the example names start with i- (the test sketch after the results below shows what such a prompt looks like).

The resulting prompt returns product names that always begin with i-, no matter how many times you run it, whether it’s 10, 30, or more than 100 times. Even though I never specifically mentioned my preference for names beginning with i-, the LLM inferred it from my examples. In practice, your preferences are normally not that simple and obvious, but the mechanism works the same.

The downside of few-shot prompting is that I had to write all three examples myself, which takes time and effort. For many use cases, you might benefit from working directly with a domain expert to write or annotate examples, but their time is usually both limited and expensive. And now our prompt is almost three times as long, so our OpenAI bill at the end of the month for our product name generator would be three times as much (the company charges based on the length of the prompt). We’ve also constrained the creativity of the model: It will forgo any of the good names it might have generated before that don’t start with i-.

With ChatGPT you don’t pay extra for long prompts, but when building an AI application using the OpenAI API, processing longer prompts takes more time and costs more. Therefore it’s important to test how many examples you need to add to the prompt in order to achieve the quality of output you need. In this case, I tested zero-shot against one-shot, two-shot, and three-shot, and found that the percentage of names starting with i- went from 21 percent to 100 percent with just one example:
- Zero-shot: 21.11 percent
- One-shot: 100.00 percent
- Two-shot: 100.00 percent
- Three-shot: 100.00 percent
I tested each of these prompts 30 times, so I’m reasonably confident in the result. I can drop the second and third examples and just go one-shot, for significant cost savings from a shorter prompt. It’s always worth testing, because in other, more complex scenarios with a wider range of more subtle preferences, you may need 10 or even 100 examples before the LLM gets what you mean.
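For reference, here’s a minimal sketch of this kind of test in Python with the OpenAI SDK. The few-shot example products are invented, and the i- check is deliberately crude:

```python
from openai import OpenAI

client = OpenAI()

# Invented few-shot examples: made-up products whose names all start with i-.
EXAMPLES = [
    ("a mug that keeps coffee at the perfect temperature", "iBrew"),
    ("a pillow that adapts to how you sleep", "iRest"),
    ("a wallet that tracks your spending", "iSpend"),
]

def build_prompt(n_shots):
    parts = ["You are Steve Jobs. Suggest a name for each new product."]
    for description, name in EXAMPLES[:n_shots]:
        parts.append(f"Product: {description}\nName: {name}")
    parts.append("Product: a shoe that adjusts to fit any foot size\nName:")
    return "\n\n".join(parts)

def percent_i_names(n_shots, runs=30):
    hits = 0
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": build_prompt(n_shots)}],
        )
        name = response.choices[0].message.content.strip()
        hits += name.startswith("i")  # crude: counts anything starting with i
    return 100 * hits / runs

for n in range(4):  # zero-shot through three-shot
    print(f"{n}-shot: {percent_i_names(n):.2f} percent")
```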
There’s a trade-off between reliability and creativity
Few-shot prompting is a powerful way to teach an LLM to do a good job, but it’s not a magic bullet. Adding too many examples that are too similar to each other to the prompt can constrain the creativity of the model’s responses. (Not every product Steve Jobs made started with i-, after all, so our results from the last experiment aren’t entirely optimal.) You still need to do your own testing on a task-by-task basis to figure out where the sweet spot is between reliability and creativity.
That’s assuming your examples are even any good! Providing the LLM with bad examples of the task will teach it to do a bad job itself. Working with a domain expert to write a few good examples is worth it, particularly when your LLM will use those examples for thousands of task completions once you get the prompt working. It’s also a good idea to check your examples for spelling mistakes and conflicting information; an example that contradicts your instructions undermines the consistency you added it for in the first place.
The other consideration is whether you should make the few-shot examples dynamic, tailoring the examples to the specific user query. In our product name generator, we could insert relevant ecommerce brand names as examples when brainstorming names for footwear, and B2B SaaS names when dealing with an idea in the technology space. Developers often accomplish this by using RAG, or Retrieval Augmented Generation, where a vector search (searching by similarity) is done on your documents before inserting the results into the prompt as few-shot examples. However you do it, providing the right examples in your prompt promises a sizable gain in performance.
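Here’s a rough sketch of that pattern, using OpenAI embeddings for the similarity search. A real system would precompute the vectors and store them in a vector database rather than the in-memory list shown here:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# A tiny in-memory store of past naming examples across verticals.
EXAMPLE_STORE = [
    "Product: a shoe that adjusts to fit any foot size\nName: iFit",
    "Product: invoicing software for freelancers\nName: LedgerFlow",
    "Product: a subscription box of artisan coffee\nName: RoastClub",
]

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def most_similar_examples(query, k=2):
    # Embed the store and the query, then rank by cosine similarity.
    vectors = embed(EXAMPLE_STORE + [query])
    store, q = vectors[:-1], vectors[-1]
    scores = store @ q / (np.linalg.norm(store, axis=1) * np.linalg.norm(q))
    return [EXAMPLE_STORE[i] for i in np.argsort(scores)[::-1][:k]]

query = "Product: a B2B analytics dashboard\nName:"
shots = "\n\n".join(most_similar_examples(query))
prompt = "Suggest a name for each new product.\n\n" + shots + "\n\n" + query

response = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
```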
Whether it’s applying sunscreen or naming new products, both humans and AI models are more effective when you give them something to imitate. No matter who—or what—you’re working with, you should lead by example.
Michael Taylor is a freelance prompt engineer, the creator of the top prompt engineering course on Udemy, and the coauthor of Prompt Engineering for Generative AI. He previously built Ladder, a 50-person marketing agency based out of New York and London.
To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.