
TL;DR: Today we’re releasing a new episode of our podcast AI & I. I go in depth with Jared Zoneraich, the cofounder and CEO of PromptLayer, a platform designed to streamline prompt engineering workflows for teams. We get into how the field of prompt engineering is evolving, and the role non-technical domain experts will play in the next big changes in AI. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.
Prompt engineering isn’t just about telling AI to solve your problems—it’s about knowing which ones to solve.
Yet there’s a mismatch between the people who can identify the right problems—experts with deep domain knowledge—and the technical infrastructure required to develop and refine prompts. Jared Zoneraich, the cofounder and CEO of prompt engineering platform PromptLayer, is bridging that gap with a platform that lets non-technical experts manage, deploy, and evaluate prompts quickly.
The role of human prompt engineers, however, has become a topic of controversy: some argue that AI can optimize prompts better than humans can, while others suggest that more capable LLMs eliminate the need for meticulously crafted prompts altogether. I spent an hour talking to Jared about why he believes prompt engineering isn’t becoming obsolete. He also tells me everything he’s learned about writing a good prompt and what the future of AI tools looks like. Here is a link to the episode transcript.
This is a must-watch for prompt engineers, people interested in building with AI systems, or anyone who wants to generate predictably good responses from LLMs.
Watch on X or YouTube, or listen on Spotify or Apple Podcasts.
If you want a quick summary, here’s a taste for paying subscribers:
Is prompt engineering dead?
According to Jared, the debate around whether more powerful LLMs are making prompt engineering irrelevant misses a crucial point. He argues that there are “irreducible” elements of a problem that AI cannot independently infer without being guided in the right direction—and prompt engineering is about defining the “exact scope” of the problem to be solved. He takes the example of an AI secretary designed to book flights for him. For a long-haul flight to Japan, there are many decisions to be made: “Do I want an aisle seat? Do I want a window seat? Do I rather book a non-stop [flight] over a business class with a stop?” These choices represent the irreducible part of the problem. “If you have an amazing AGI that can solve any problem, the hard part is, what do you even tell it to solve?” he says.
I play devil’s advocate, asking Jared why human prompt engineers are necessary if an AI system can be put in a loop where users rate the model’s responses and the AI improves based on their feedback. Jared argues that there will be intense competition among companies using data-driven approaches like this to improve their AI products, and the real “differentiation” will come from the “domain expertise you can bake into the application.”
At PromptLayer, prompt engineering is about “putting domain knowledge into your LLM system,” and Jared says that “whether you have to say ‘please’ and ‘thank you’ to the AI will probably go away, but you still need to iterate on the core source code.” He believes that prompt engineering centers on the questions, “How do you close the feedback loop? How do you iterate as quickly as possible?” to which there are multiple answers because “there is no one way to gather data and come to a conclusion.”
The rise of the non-technical prompt engineer
One of PromptLayer’s most exciting applications, according to Jared, is making prompt engineering accessible to non-technical people. He notes that companies “are not going to win in the age of generative AI by hiring the best machine learning engineers,” but rather by “working with domain experts” who can “define the specifications” of the problem they aim to solve. For example, one of PromptLayer’s early clients was a parenting app whose prompt engineer—a teacher with 15 years of experience and no technical skills—brought deep domain expertise while guiding AI responses to parents’ questions.
Here’s what Jared has learned about making good prompts and improving them over time:
- Focus on mapping inputs to outputs. According to Jared, prompt engineering is about consistently evaluating your prompts. “The best prompt engineers treat [the LLM] as a black box and say…‘Let's not think about how it works, all I want to think about is, how do I map the inputs to the outputs I want?’”
- Speak the LLM’s language. Aligning your prompts with the language that the LLM has been trained on is key. While coding, for example, Jared says that he loves using function calling, a programming concept where an external function is automatically invoked based on user intent, “even for things that are not functions, because implicitly that's the language that [the model] knows and…you're conveying much more information…than you would be by writing.”
- Broaden the horizons of prompt engineering. Jared adds that a model’s response is shaped by factors beyond the literal text of the prompt, including “What is the combination of prompts you're using?” and “Are you breaking down the prompts?”
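Jared’s point about function calling can be sketched in plain Python: the model is given a JSON schema describing an available function, and its structured output is dispatched to real code. The schema, handler, and model output below are hypothetical illustrations of the pattern, not any specific provider’s API.

```python
import json

# Hypothetical tool schema, in the JSON style most chat APIs use.
# Describing a task this way "speaks the LLM's language," even when
# the "function" is really just structured output you want back.
BOOK_FLIGHT_TOOL = {
    "name": "book_flight",
    "description": "Book a flight for the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "destination": {"type": "string"},
            "seat": {"type": "string", "enum": ["aisle", "window"]},
            "nonstop": {"type": "boolean"},
        },
        "required": ["destination"],
    },
}

def dispatch(tool_call: str, handlers: dict):
    """Route a model's JSON tool call to the matching Python function."""
    call = json.loads(tool_call)
    return handlers[call["name"]](**call["arguments"])

# Stubbed handler standing in for real booking logic.
handlers = {
    "book_flight": lambda destination, seat="aisle", nonstop=True:
        f"Booked {'nonstop ' if nonstop else ''}{seat}-seat flight to {destination}"
}

# A model trained on function calling would emit something like this:
model_output = '{"name": "book_flight", "arguments": {"destination": "Tokyo", "seat": "window"}}'
print(dispatch(model_output, handlers))  # → Booked nonstop window-seat flight to Tokyo
```

The schema constrains the irreducible decisions Jared mentions (aisle or window, nonstop or not) into fields the model must fill in, which is exactly what makes the format information-dense.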
The core elements of prompt engineering
Jared identifies three fundamental “primitives,” or building blocks, of prompt engineering: prompts, evaluations, and datasets. These are the best practices he recommends for each:
Prompts are tailored instructions created by the user to guide models toward completing tasks.
- Specialize prompts for tasks. According to Jared, building a workflow that routes users to a pre-built prompt based on their query is better than having a general-purpose prompt to answer all queries. Calling this the “prompt router approach,” he says that “individual prompts to do one and only one thing…work much more of the time and have much [fewer] failure cases.”
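The “prompt router approach” can be sketched as a two-stage pipeline: a cheap classifier picks one narrow, single-purpose prompt rather than asking one general prompt to do everything. The keyword matcher below is a stub standing in for what would, in practice, be a small LLM classification call; the prompt templates are illustrative assumptions.

```python
# Each prompt does one and only one thing.
PROMPTS = {
    "refund": "You are a refunds agent. Process this refund request: {query}",
    "booking": "You are a booking agent. Handle this booking request: {query}",
    "general": "You are a helpful support agent. Answer: {query}",
}

def route(query: str) -> str:
    """Stub classifier; in practice this would be a small LLM call."""
    q = query.lower()
    if "refund" in q:
        return "refund"
    if "book" in q or "flight" in q:
        return "booking"
    return "general"

def build_prompt(query: str) -> str:
    """Route the query, then fill in the one specialized template."""
    return PROMPTS[route(query)].format(query=query)

print(build_prompt("I want a refund for my order"))
```

Because each template handles a single case, a failure in one branch can be debugged and iterated on without touching the others.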
Evaluations measure the performance of prompts by comparing their output to established benchmarks or user-defined criteria.
- Benchmark against historical data. A good initial step to evaluate a new model is to run it on previously collected data to see how it compares to older versions. At PromptLayer, Jared says they have their users “create a back test based on their last 1,000 or 10,000 prompt-response pairs and run the new prompt using that data to see how much it changes.”
- Choose the right metric. Jared notes that the next step of running an eval depends on the use case of the model being evaluated—specifically, whether or not it has “ground truth,” or a correct result that serves as a reference point to evaluate the AI system against.
- If you have a ground truth, he says you can “build an eval that gives you a real score” by “anchor[ing] it on real metrics.”
- If you don’t have a ground truth (as with a task like generating AI summaries), it’s admittedly more “complicated,” and he recommends “having human graders read it” or synthesizing the “heuristics” you would measure the outcome against and building a metric that mimics them. According to Jared, the hard part is “understanding what your brain does” when it decides whether something is good, and breaking that down into individual heuristics.
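Both cases Jared describes can be sketched in a few lines: when ground truth exists, score outputs against it directly; when it doesn’t, encode the heuristics a human grader would apply. The metrics below are illustrative assumptions, not PromptLayer’s implementation.

```python
def exact_match_score(outputs, ground_truth):
    """Ground-truth eval: fraction of outputs matching the known-correct answer."""
    pairs = list(zip(outputs, ground_truth))
    return sum(o.strip() == g.strip() for o, g in pairs) / len(pairs)

def heuristic_score(summary: str, source: str) -> float:
    """No ground truth: mimic what a human grader checks in a summary."""
    checks = [
        len(summary) < len(source),                 # actually shorter than the source
        summary.endswith((".", "!", "?")),          # ends in a complete sentence
        any(w in source for w in summary.split()),  # overlaps the source at all
    ]
    return sum(checks) / len(checks)

print(exact_match_score(["Paris", "Berlin"], ["Paris", "Rome"]))  # → 0.5
```

The hard part, as Jared says, is the second function: each check is one decomposed piece of “what your brain does” when it judges an output.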
Datasets provide reference data that ground prompt engineering.
- Create reference data. Jared recommends building ground truth datasets, even potentially bootstrapping datasets by synthetically generating them. “If you don't have the back test data, you're gonna want to focus on building ground truth datasets…[because] then you’re sailing [and] prompt engineering is kind of easy.”
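Bootstrapping a ground-truth dataset and back-testing a new prompt against it can be sketched like this; the dataset contents and the stub standing in for the new prompt plus model are illustrative assumptions.

```python
def synthesize_dataset(cases):
    """Bootstrap ground truth: pair inputs with outputs a trusted source
    (a stronger model, or a domain expert) has signed off on."""
    return [(inp, expected) for inp, expected in cases]

def back_test(run_prompt, dataset):
    """Re-run a new prompt over historical pairs; report how often output changes."""
    changed = sum(run_prompt(inp) != expected for inp, expected in dataset)
    return changed / len(dataset)

dataset = synthesize_dataset([("2+2", "4"), ("capital of France", "Paris")])

# Stub standing in for the new prompt + model under test.
new_prompt = lambda inp: {"2+2": "4", "capital of France": "Lyon"}[inp]
print(back_test(new_prompt, dataset))  # → 0.5
```

Once the dataset exists, every prompt change gets a number instead of a vibe check, which is the sense in which “then you’re sailing.”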
The future of prompt engineering—and AI more generally
I asked Jared if he thinks AI tools in the future will be specialized—where tools will vary depending on the type of user query—or will continue to maintain their general-purpose nature. He answered that for the end user, it would probably lean toward the latter: “Look at ChatGPT’s evolution, you had to select which tools you wanted and which plugins you wanted…and [OpenAI] quickly moved to a world where ChatGPT will choose whatever tool they want you to use” for a specific query. Jared adds that from a technical perspective of someone building these applications, it’s “hard to say” because it depends on variables like “what you are building, what are your trade-offs, what’s your latency?”
Beyond prompt engineering, these are Jared’s thoughts on the new types of software and art that LLMs are enabling:
- Build custom AI for yourself. According to Jared, the ease with which one can build something with an LLM unlocks a new class of software that a person builds for themselves to solve a specific need of their own. “People call it ‘single-use software’ that you're not really going to sell to other people, but it's easy enough to make.”
- AI and human art will find their place. Jared believes that AI-native art and art made by humans will coexist, occupying different niches. “I think we’ll have a lot of AI music, and a lot will be ‘junk food,’ meaning a lot of people will consume it and love it…but there'll still be the organic, farm-to-table musicians where a human makes it and it's just going to solve [for] different things.”
You can check out the episode on X, Spotify, Apple Podcasts, or YouTube. Links and timestamps are below:
- Watch on X
- Watch on YouTube
- Listen on Spotify (make sure to follow to help us rank!)
- Listen on Apple Podcasts
Timestamps:
- Introduction: 00:01:08
- Jared’s hot AGI take: 00:09:54
- An inside look at how PromptLayer works: 00:11:49
- How AI startups can build defensibility by working with domain experts: 00:15:44
- Everything Jared has learned about prompt engineering: 00:25:39
- Best practices for evals: 00:29:46
- Jared’s take on o1: 00:32:42
- How AI is enabling custom software just for you: 00:39:07
- The gnarliest prompt Jared has ever run into: 00:42:02
- Who the next generation of non-technical prompt engineers are: 00:46:39
What do you use AI for? Have you found any interesting or surprising use cases? We want to hear from you—and we might even interview you. Reply here to talk to me!
Miss an episode? Catch up on my recent conversations with star podcaster Dwarkesh Patel, LinkedIn cofounder Reid Hoffman, a16z Podcast host Steph Smith, economist Tyler Cowen, writer and entrepreneur David Perell, founder and newsletter operator Ben Tossell, and others, and learn how they use AI to think, create, and relate.
If you’re enjoying my work, here are a few things I recommend:
- Subscribe to Every
- Follow me on X
- Subscribe to Every’s YouTube channel
Thanks to Rhea Purohit for editorial support.
Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.
We also build AI tools for readers like you. Automate repeat writing with Spiral. Organize files automatically with Sparkle. Write something great with Lex.