Where Copilots Work

Where copilots can be built today

AI-based copilots are quite useful with out-of-the-box technology in situations where small pieces of lightly transformed boilerplate text provide a lot of value.

This is particularly true in areas where:

Text can be checked for accuracy quickly with little user effort
The cost of inaccuracies is low
Relevant text can be found reliably with embeddings search

GitHub Copilot is a great example of this. Other examples include grant writing, contract writing, tax prep, many types of email replies, RFP responses, medical recommendations to doctors, and more.

If you’re building a copilot (or thinking of building one), I’ve put together a little checklist for you to go through in order to figure out whether it will be possible to get good results with today’s technology.

Can you build a copilot for it? A checklist.

If you want to build a copilot for a specific domain using today’s technology, here’s the list of things you need to check off:

Is there a corpus of relevant text completions to be used by this copilot?
Can relevant text for completions be found reliably with embeddings search over this text corpus?
Can those pieces of text, without more context needed, be lightly transformed and inserted as an accurate completion?
Can completions be checked for accuracy with little to no user effort?

Is there a corpus of relevant text completions to be used by this copilot?

You want your copilot to be smart and not make things up. It should have access to some source of knowledge that it can bring to the user when they need it. Ideally, this source of knowledge is accurate, up to date, and maybe even personal to the user—for example, it might include all of their emails or their company’s internal wiki.

If you have this, you’re ready to go to the next step.

Can relevant text for completions be found reliably with embeddings search over this text corpus?

Once you have a knowledge base for your copilot to use, you need the copilot to be able to accurately identify chunks of that knowledge base to return to the user when they need it. For example, in my copilot for thought demo, I needed my copilot to find quotes from my Readwise that were relevant to whatever I was currently writing.

The standard way to do this is to use embeddings search. An embedding is a condensed mathematical representation of a piece of text. Just as latitude and longitude tell you how close two cities are on a map, embeddings tell you how close two text chunks are in meaning. If you want to know whether two pieces of text are similar, calculate their embeddings and compare them. Text chunks with embeddings that are “closer” together are more similar.
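To make "closer" concrete, here is a minimal sketch of one common way to compare two embeddings, cosine similarity. The three-number vectors are made up for illustration (an assumption); real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

# Made-up 3-number "embeddings" (assumption, for illustration only).
dogs = [0.9, 0.1, 0.0]
puppies = [0.8, 0.2, 0.1]
taxes = [0.0, 0.1, 0.9]

print(cosine_similarity(dogs, puppies))  # close to 1: similar
print(cosine_similarity(dogs, taxes))    # close to 0: unrelated
```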

Embeddings are useful because when a user is typing something that the copilot wants to autocomplete, it can just look through its knowledge base to find pieces of text that are “close” to whatever the user is typing.
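As a rough sketch of that lookup (not the code behind my demo), the search can be as simple as scoring every chunk against what the user is typing and keeping the best few. The knowledge base, the vectors, and the function name top_k are all hypothetical; a real system would get vectors from an embedding model and would likely use a vector index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, knowledge_base, k=2):
    """Return the k texts whose vectors are closest to query_vector.

    knowledge_base is a list of (text, vector) pairs.
    """
    ranked = sorted(
        knowledge_base,
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy knowledge base with made-up vectors (assumption).
knowledge_base = [
    ("quote about habits", [0.9, 0.1, 0.0]),
    ("quote about routines", [0.8, 0.3, 0.1]),
    ("quote about tax law", [0.0, 0.1, 0.9]),
]
what_the_user_is_typing = [0.85, 0.2, 0.05]
print(top_k(what_the_user_is_typing, knowledge_base, k=2))
```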

But embeddings search isn’t perfect, and it’s where a lot of copilot use cases fail for now. The quality of your copilot is bounded by your ability to find relevant chunks of information in your knowledge base. If you’re not getting relevant results, completion accuracy will suffer.

If you can get relevant results, then you can go to the next step.

Can those pieces of text, without more context needed, be lightly transformed and inserted as an accurate completion?

Once you can find the most relevant pieces of information in your knowledge base from embeddings search, your copilot is going to need to intelligently package them up as a completion for the user.

This works best if the retrieved passages only need to be slightly transformed before they can be suggested. For example, you’ll often want to rephrase a passage so that it carries the same information but completes the user's sentence. This kind of transformation is easy to do with GPT-3; more advanced transformations are harder.
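One way to ask for that kind of light transformation is to put the retrieved passages and the user's draft into a single prompt. The sketch below only builds the prompt string; the wording of the instructions and the function name build_prompt are assumptions, and you would still need to send the result to whatever model you use.

```python
def build_prompt(user_text, passages):
    """Assemble a completion prompt from retrieved passages and the user's draft."""
    sources = "\n".join(f"- {p}" for p in passages)
    return (
        "Continue the draft below. Use only facts from the source passages, "
        "rephrased so they flow from the draft's last sentence.\n\n"
        f"Source passages:\n{sources}\n\n"
        f"Draft:\n{user_text}"
    )

prompt = build_prompt(
    "Habits form because",
    ["A habit loop has a cue, a routine, and a reward."],
)
print(prompt)
```

Ending the prompt with the draft itself nudges a completion-style model to pick up exactly where the user left off.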

Can completions be checked for accuracy with little to no user effort?

Once your copilot suggests a completion, it works best if the user knows whether or not the completion is accurate without a lot of work. If the user has to spend a lot of time figuring out if the completion is accurate or not, they’ll just ignore it.

This is one of the big levers for copilots. If completions are easy to check, your copilot can return a lot of wrong answers because it doesn’t cost the user much to consider them. I think this is part of why GitHub Copilot is successful: You can just run the code to see if it’s right, so the computer generates the code and then checks it for you.

Other use cases that require more user input will require correspondingly higher rates of accuracy for the user to feel motivated enough to check.
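For code, "the computer checks it for you" can be taken literally. The sketch below is a toy illustration of that idea, not how GitHub Copilot works: it runs each candidate completion and keeps the first one that passes a quick check. The function names and candidates are hypothetical, and exec runs arbitrary code, so anything real would need a sandbox.

```python
def passes_check(candidate_source, check):
    """Run candidate code, then apply check to the names it defined.

    WARNING: exec runs arbitrary code. A real system would need a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        return bool(check(namespace))
    except Exception:
        return False  # crashes and syntax errors count as wrong answers

def first_passing(candidates, check):
    """Return the first candidate that passes the check, or None."""
    for source in candidates:
        if passes_check(source, check):
            return source
    return None

candidates = [
    "def add(a, b):\n    return a - b",  # runs, but gives the wrong answer
    "def add(a, b) return a + b",         # does not even parse
    "def add(a, b):\n    return a + b",  # correct
]
check = lambda names: names["add"](2, 3) == 5
print(first_passing(candidates, check))
```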

What might change this list?

The limits of a copilot are the limits of the AI’s context window. The context window is the total number of tokens the AI can handle at once: the tokens you feed in as the prompt plus the tokens it gives back as a completion.

Because context windows are limited, you have to use embeddings search to find little pieces of information you can feed to your AI for it to generate a copilot completion. This means that while context windows are still small, the quality of your copilot is bounded by the quality of your embeddings search.

GPT-3’s current context window is 4,096 tokens, which is about 3,000 words. OpenAI is rumored to be releasing a version of its models with a 32K-token context window soon, roughly 8x the current size. This, I think, would be a giant step change in the quality of the responses that are returned for copilot use cases.

You’d be able to return far more information for the AI to reason over and turn into a usable response, which would have a direct impact on accuracy.
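To get a feel for what the bigger window buys you, here is a back-of-the-envelope sketch that packs ranked chunks into a token budget. It estimates tokens from word counts using the 4,096-tokens-to-3,000-words ratio above, which is only an approximation; the 256 tokens reserved for the completion and the 300-word chunk size are assumptions.

```python
import math

CONTEXT_WINDOW = 4096          # tokens, the figure quoted above
RESERVED_FOR_COMPLETION = 256  # assumption: room left for the model's answer
WORDS_PER_TOKEN = 3000 / 4096  # from "4,096 tokens is about 3,000 words"

def estimate_tokens(text):
    """Rough token estimate from a word count (not a real tokenizer)."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

def pack_chunks(ranked_chunks, user_text, window=CONTEXT_WINDOW,
                reserved=RESERVED_FOR_COMPLETION):
    """Take chunks in ranked order until the next one would not fit."""
    budget = window - reserved - estimate_tokens(user_text)
    chosen = []
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        chosen.append(chunk)
        budget -= cost
    return chosen

# 100 placeholder chunks of 300 words each, already ranked by relevance.
chunks = [" ".join(["word"] * 300) for _ in range(100)]
draft = "ten words of draft text the user has typed already"

print(len(pack_chunks(chunks, draft)))                # fits in 4,096 tokens
print(len(pack_chunks(chunks, draft, window=32768)))  # fits in a 32K window
```

Under these assumptions the small window holds 9 chunks and the 32K window holds 79, so the model would see far more of what the search returned.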

The other big limiters here are inference cost, inference speed, embedding cost, and access to usable data. I expect cost to go down and speed to go up significantly enough that I’m not worried about them as true bottlenecks. But access to usable data is a big deal.

Right now, I’m using Readwise as my data source. But my copilot’s completions would be a lot better if it had access to the books that I am pulling from. A book averages on the order of 80,000 tokens. So in order to increase the quality of my responses, I need to figure out how to make that data available to the AI, and also clean it so that it’s easy to find relevant passages.
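At 80,000 tokens, a book is roughly 20 times the size of a 4,096-token window, so it has to be cut into pieces before it can be searched. A common approach is overlapping chunks, sketched below. This version counts words rather than tokens, and the chunk size and overlap are arbitrary assumptions you would want to tune.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of chunk_size words that overlap by overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks

# A stand-in "book" of 500 numbered words.
book = " ".join(f"w{i}" for i in range(500))
pieces = chunk_words(book)
print(len(pieces), pieces[1].split()[0])
```

The overlap means a passage that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.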

Advice for builders

If you’re building or investing in this space, my recommendations for creating better copilot experiences are as follows:

Tighten your feedback loops

You can think of a copilot completion as a sequential chain:

Get user input
Query for relevant documents
Prompt model with documents
Return a result

As you’re developing a copilot experience, you’ll want to be able to iterate as quickly as possible on each of these parts of the chain, with as little code as possible. I recommend building tools to help you do this quickly.
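One way to keep each link easy to swap is to treat the chain as three plain functions plus a small runner. This is a sketch, not the tool I built: CopilotChain and every stage function here are hypothetical stand-ins, and the "model" just uppercases its prompt so the example runs without an API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass(frozen=True)
class CopilotChain:
    retrieve: Callable[[str], List[str]]             # query for relevant documents
    build_prompt: Callable[[str, List[str]], str]    # prompt model with documents
    complete: Callable[[str], str]                   # return a result

    def run(self, user_input: str) -> str:
        documents = self.retrieve(user_input)
        prompt = self.build_prompt(user_input, documents)
        return self.complete(prompt)

NOTES = ["habits need a cue", "taxes are due in april"]

def keyword_retrieve(user_input):
    """Stand-in retriever: keep notes that share a word with the input."""
    words = set(user_input.lower().split())
    return [note for note in NOTES if words & set(note.split())]

def simple_prompt(user_input, documents):
    return f"Notes: {'; '.join(documents)}\nDraft: {user_input}"

def echo_model(prompt):
    """Stand-in for a model call."""
    return prompt.upper()

chain = CopilotChain(keyword_retrieve, simple_prompt, echo_model)
print(chain.run("My habits"))

# Swap one stage without touching the others.
first_line_only = replace(chain, complete=lambda prompt: prompt.splitlines()[0])
print(first_line_only.run("My habits"))
```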

As I was building my notes copilot, I built a little UI to visualize and quickly swap out each part of the chain.

This worked well for me, but you should explore your own solutions.

Get creative with embeddings search

For now, the quality of your completions is limited by the quality of your embeddings search, so I’d recommend spending time on improving it.

There are many ways to enhance embeddings search to help you get more relevant documents. For example, check out HyDE for a creative solution to this problem from the query side. Or, try using GPT-3 to summarize the data in your knowledge base to make it easier for embeddings to find a usable text chunk.
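The core idea of HyDE is to have a model write a hypothetical answer to the query, then search with the embedding of that answer instead of the raw query. Here is a toy sketch of that flow. Both generate and embed are hypothetical stand-ins passed in as arguments: the "embedding" is a word count over a tiny vocabulary and the "model" returns a canned passage, so the example shows the shape of the technique rather than its real-world quality.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def hyde_search(query, knowledge_base, generate, embed, k=1):
    """Search with the embedding of a generated answer, not the raw query."""
    hypothetical_doc = generate(query)   # 1. model drafts a plausible answer
    query_vector = embed(hypothetical_doc)  # 2. embed the draft
    ranked = sorted(                     # 3. rank real documents against it
        knowledge_base,
        key=lambda doc: cosine_similarity(query_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]

# Toy stand-ins (assumptions): a word-count "embedding" and a canned "model".
VOCAB = ["habit", "cue", "routine", "reward", "tax", "deduction"]

def toy_embed(text):
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def fake_generate(query):
    return "checking your phone is a habit triggered by a cue and kept alive by a reward"

knowledge_base = [
    "a deduction lowers your tax bill",
    "a habit is a cue a routine and a reward",
]
query = "why do I keep checking my phone"
print(hyde_search(query, knowledge_base, fake_generate, toy_embed))
```

Notice that the raw query shares no vocabulary with either document, so searching on it directly would be a coin flip; the generated passage is what bridges the gap.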

Decrease cost of checking for accuracy

The other big lever for creating a good experience with existing technology is to lower the cost to the user when completions are inaccurate. An easy one: Before you display anything to the user, use GPT-3 to check whether it thinks the completion is any good. If not, don’t display it.
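The gate itself is only a few lines. In the sketch below, judge is a placeholder for the model call that scores a completion; the toy version just measures how many of the completion's words appear in the retrieved context. The 0.7 threshold and both function names are assumptions.

```python
def filter_completion(completion, context, judge, threshold=0.7):
    """Return the completion if the judge scores it high enough, else None."""
    score = judge(completion, context)
    return completion if score >= threshold else None

def toy_judge(completion, context):
    """Stand-in for a model call: share of completion words found in the context."""
    words = completion.lower().split()
    if not words:
        return 0.0
    context_words = set(" ".join(context).lower().split())
    supported = sum(1 for word in words if word in context_words)
    return supported / len(words)

context = ["the meeting moved to friday at noon"]
print(filter_completion("the meeting moved to friday", context, toy_judge))
print(filter_completion("the meeting was cancelled by legal", context, toy_judge))
```

The first completion is fully supported by the context and gets shown; the second scores well under the threshold, so the user never sees it.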

But there are lots of other ways to do this. For example, make sure completions are quite short. Another example: Make sure that all of the context information the user would need to check for accuracy is included in the completion—so they don’t have to do research or think too hard.

Wrapping up

This is the ideal copilot in my mind:

Every time you touch your keyboard it brings to bear your entire archive of notes, and everything you’ve ever read, to help you complete your next sentence.

It would help you make connections between ideas, bring up pieces of supporting evidence, and suggest quotes to use. It might also bring up writers you love who disagree with the point you’re making—so you could change your mind, or sharpen your argument in response to theirs.

Ideally, it would do this in a fashion that’s seamless, highly accurate, and easily checked. In other words, usually if it completes something it’s making a good point, and it’s easy for you to tell if the point is good or not, without lots of extra effort.

This is far from the reality today. If we want to advance these kinds of tools beyond just being interesting demos, we’re going to have to build them ourselves.

I hope this post pushes a few of you in that direction. I’ll keep you posted as I keep discovering more.