Data and Compute Are the Ultimate Flywheel

What’s valuable—and why—in an AI-driven world

The 2010s were the decade of the data explosion. The world began to create, log, and use more data than ever before, and the big winners in the tech industry—whether social media companies like Meta or e-commerce giants like Amazon—made the most of it.

But the dawning AI era is changing the playing field. Data and compute have created a flywheel—driven by language models—that generates more digital information than ever before. This shifts where value sits in software ecosystems, and presents key opportunities for large incumbents and new startups.

Let’s explore the sweeping downstream implications for the future.

Memory is destiny: the data explosion

I taught myself to code on an IBM PC clone running MS-DOS on an 8086. This was in the late 1980s; the computer was cheap, functional and remarkably ugly, but it worked. And it had what was—for its time—a princely amount of storage: 30 whole megabytes of hard disk space.

A decade later, working my first job as a programmer, storage was easier to come by, but it was never out of my mind. I cared about memory leaks and access times and efficiency. Creating a clever data structure to hold yield curves for rapid derivatives pricing was one of my prouder achievements at the time. We didn't store everything, only what we needed to.

Another decade later, I didn't care. At Quandl, the fintech data startup I founded, we saved everything. Not just the data assets that formed the core of our business, with all their updates and vintages and versions, but also all our usage logs, and API records, and customer reports, and website patterns. Everything.

What happened? The cost of storage collapsed.

It turns out it wasn't just Quandl that stored all its data; it was everyone, everywhere, all at once. Driven by falling hardware costs and—equally important—by business models that made said hardware easy to access (shoutout to Amazon S3!), data became a critical competitive advantage.

A majority of the business models that “won” the last decade are downstream of the data explosion:

  • the entire content-adtech-social ecosystem
  • the entire ecommerce-delivery-logistics ecosystem
  • the infrastructure required to support these ecosystems
  • the devtools to build that infrastructure

How so?

Consider advertising, still the largest economic engine of the internet. Facebook, Reddit, YouTube, Instagram, TikTok, and Twitter all rely on the exact same loop, linking users, content and advertisers:

Companies are able to offer users unlimited free storage for photos, videos, blog posts, updates, and music; the content attracts more users, who attract advertisers, who subsidize the platform, and the cycle continues. None of this is possible without cheap, plentiful storage.

Or consider delivery and logistics. It's easy to take for granted today, but the operational chops required to power same-day e-commerce, or ride-sharing, or global supply chains, are simply staggering. The famous Amazon and Uber flywheels are just special cases of a data learning loop:

These flywheels require up-to-date knowledge of customer locations, purchase and travel habits, store inventories and route geographies, driver and car availability, and a whole lot more. Again, none of this is possible without cheap, plentiful data.

Incidentally, these flywheels are not just powered by data; they generate new data in turn. The data explosion is not just an explosion; it's a genuine chain reaction. Data begets data.

Falling data costs and rising software costs

All of these data-driven, flywheel-based business models are, essentially, software. That's no surprise—data and software are two sides of the same coin.

The software used to optimize businesses is useless without data to apply itself to. And data is worthless without software to interpret and act on it. More generally, tools are useless without materials to work on; materials have no value until tools work on them.

Economists call these “perfectly complementary inputs”: you need both to generate the desired output, and you can't substitute one for the other. An immediate consequence is that if the price of one input falls by a lot—perhaps due to a positive productivity shock—then the price of the other is almost certain to rise.

Imagine you're a tailor. To sew clothes, you need needles and fabric, and you're limited only by the quantity of each of those you can afford. (Needles and fabric are complementary inputs to the process of sewing.)

Now imagine that the price of needles plummets, due to a new needle-manufacturing technology. Your response? Use the savings to buy more fabric, and thus sew more clothes. But if every tailor does this, then the price of fabric will rise. Tailors and consumers are both better off and the total quantity of clothing produced goes up, but the relative value of needles and fabric has changed.
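To make the tailor's arithmetic concrete, here is a minimal sketch in Python of a "perfect complements" production function, where output is capped by whichever input is scarcest. The prices, budget, and input ratio are invented purely for illustration.

```python
# Toy model of perfectly complementary inputs (all numbers invented).
# Assume one garment requires exactly 1 needle and 2 units of fabric.

def sew(budget: float, p_needle: float, p_fabric: float):
    """Spend the whole budget in the fixed 1:2 ratio that wastes neither input."""
    cost_per_garment = 1 * p_needle + 2 * p_fabric
    garments = budget / cost_per_garment
    return garments, garments * 1, garments * 2   # (garments, needles, fabric)

print(sew(100.0, p_needle=4.0, p_fabric=3.0))  # baseline: 10 garments, 20 units of fabric
print(sew(100.0, p_needle=0.5, p_fabric=3.0))  # cheap needles: ~15.4 garments, ~30.8 units of fabric
```

Cheaper needles mean more garments, but also more money chasing fabric, which is exactly the pressure that pushes fabric's relative price up.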

For the last decade-plus, the quantity of data in the world has been exploding and its price, therefore, implicitly declining. Software has been the relatively scarce input, and its price has increased. You can see this in everything from the salaries of software engineers to the market cap of top software companies. Software ate the world, with a huge assist from cheap, plentiful data.

And then came GPT, and everything changed.

The compute revolution made data a lot more valuable

GPT is a child of the data explosion. The flood of new data, generated not just by users but also by content farms and click factories and link bots and overzealous SEO agencies, necessitated the invention of new techniques to handle it. In 2017, a team of researchers at Google wrote “Attention Is All You Need,” the paper that introduced the Transformer architecture underlying pretty much every modern generative AI model.

Although large language models were invented to manage the data spun off by these “content wars,” they’re going to be used for a lot more than just search. GPT is prima facie a massive productivity boost for software. Technologists talk about the 10x programmer—the genius who can write high-quality code 10 times faster than anybody else. But thanks to GPT, every programmer has the potential to be 10x more productive than the baseline from just two years ago.

We are about to see the effects. Move over data explosion; say hello to the compute explosion.

The first and perhaps most obvious consequence of the compute revolution is that data just got a whole lot more valuable. This naturally benefits companies that already own data. 

But what’s valuable in an AI world is subtly different from what was valuable in the past. Some companies with unique data assets will be able to monetize those assets more effectively. BloombergGPT is my favorite example: it’s trained on decades of high-quality financial data that few others have. To quote a (regrettably but understandably anonymous) senior executive in the fin-data industry: “Bloomberg just bought themselves a 20-year lease of life with this.”

Other companies will realize that they are sitting on latent data assets—data whose value was unrecognized, or at any rate unmonetized. Not anymore! Reddit, for one, is a treasure trove of high-quality human-generated content, surfaced by a hugely effective moderation and upvoting system. But now you have to pay for it.

You don't need huge content archives or expensive training runs to get meaningful results. Techniques like LoRA (low-rank adaptation) let you adapt large base models to your own proprietary data at relatively low cost. As a result, even small, custom datasets can hold a lot of value.
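For the technically inclined, here is a minimal sketch of what that might look like with the Hugging Face peft library. The base model, target modules, and hyperparameters below are illustrative assumptions, not a recipe.

```python
# Illustrative sketch only: attach LoRA adapters to a frozen base model,
# then fine-tune the adapters on a proprietary corpus.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"        # hypothetical choice of base model
model = AutoModelForCausalLM.from_pretrained(base)

config = LoraConfig(
    r=8,                                  # rank of the low-rank update: small = cheap
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of the base weights
# ...then train on your own data with a standard fine-tuning loop.
```

The point is the cost structure: you train a small adapter on your own data while the billions of base-model weights stay frozen.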

Quantity has a quality that’s all its own, but when it comes to training data, the converse is also true. “Data quality scales better than data size”: above a certain corpus size, the ROI from improving quality almost always outweighs that from increasing coverage. This suggests that golden data—data of exceptional quality for a given use case—is, well, golden.

The future of the data business

The increasing value of data has some downstream implications. In a previous essay, I wrote about the economics of data assets: "The gold-rush metaphor may be over-used, but it’s still valid. Prospecting is a lottery; picks-and-shovels have the best risk-reward; jewelers make a decent living; and a handful of gold-mine owners become fabulously rich."

The ecosystem

The very best data assets, reshaped for AI use cases, are the new gold mines. But there are terrific opportunities for picks and shovels specifically designed around the increased salience of data in an AI-first world. These tools will be able to:

  • build new data assets for AI;
  • connect existing data assets to AI infra;
  • extract latent data using AI;
  • monetize data assets of every sort.

More generally, the entire data stack needs to be refactored, such that generative models become first-class consumers as well as producers of data. Dozens of companies are emerging to do precisely this, from low-level infra providers like Pinecone and Chroma, to high-level content engines like Jasper and Regie, to glue layers like LangChain, and everything in between.

Apart from tooling, there's an entire commercial ecosystem waiting to be built around data in the age of AI. Pricing and usage models, compliance and data rights, a new generation of data marketplaces—everything needs to be updated. No more “content without consent”; even gold-rush towns need their sheriffs.

The flywheel of data and compute

The second major consequence of AI is that the quantity of both data and compute in the world is going to increase dramatically. There’s both flywheel acceleration—data feeds the compute explosion, and compute feeds the data explosion—and a direct effect. After all, generative models don't just consume data; they produce it.

Right now the output is mostly ephemeral. But that's already changing, as ever more business processes begin to incorporate generative components.

What does this imply for data? We’re entering a world of unlimited content. Some of it is legit, but much of it isn’t—spam bots and engagement farmers, deep fakes and psyops, hallucinations and artifacts. 

The confidence chain

Confronted with this infinite buffet, how do you maintain a healthy information diet? The answer is the confidence chain—a series of proofs, only as strong as its weakest link:

  • Who created this data or content? 
  • Can you prove that they created it? 
  • Can you prove they are who they say they are? 
  • Is what they created “good,” and does it match what you need or want?

In other words, signatures, provenance, identity, quality, curation. The first three are closely linked: where does something really come from, and can you prove it? After all, “on the internet, nobody knows you’re an AI.” Technologically, this space remains largely undefined, and therefore interesting. (The irony is that it’s an almost perfect use case for zero-knowledge crypto—and crypto has been utterly supplanted in the public imagination by AI.)
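As a toy illustration of those first three links (signature, provenance, identity), here is a minimal sketch using Ed25519 keys from Python's cryptography library. Real provenance standards such as C2PA are far richer, but the primitive is the same: the creator signs the content, publishes a public key, and anyone can verify both.

```python
# Toy provenance check: a creator signs their content; anyone holding the
# public key can verify authorship and that the bytes are unchanged.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

creator_key = Ed25519PrivateKey.generate()
public_key = creator_key.public_key()

content = b"An essay actually written by its claimed author."
signature = creator_key.sign(content)

try:
    public_key.verify(signature, content)   # raises InvalidSignature if forged or altered
    print("verified: this content matches the creator's signature")
except InvalidSignature:
    print("warning: signature does not match this content")
```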

The last two—quality and curation—are also closely linked. Curation is how quality gets surfaced, and we’re already seeing the emergence of trust hierarchies that achieve this. Right now, my best guess at an order is something like this:

filter bubbles > friends > domain experts ~= influencers > second-degree connections > institutions > anonymous experts ~= AI ~= random strangers > obvious trolls

But it's not at all clear where the final order will shake out, and I wouldn’t be surprised to see “curated AI” join and move up the list.

Compute all the things

Data becomes more valuable; ecosystems need to be retooled; the data explosion will accelerate; and trust chains will emerge. What about compute itself?

Just like the default behavior for data flipped from “conserve memory” to “save everything”, the default behavior for software is going to flip to “compute everything.”

What does it mean to compute all the things? Agents, agents, everywhere—AI pilots and co-pilots, AI research and logistics assistants, AI interlocutors and tutors, and a plethora of AI productivity apps. We used to talk about human-in-the-loop to improve software processes; increasingly, we're going to see software-in-the-loop to streamline human processes. 

Some of these help on the data/content generation side; they’re productivity tools. Others help on the data/content consumption side; they’re custom curators, tuned to your personal matching preferences.

New abundance, new scarcity

The 19th-century British economist William Jevons observed a paradox in the coal industry. Even though steam engines became more efficient over time—using less coal per unit of energy produced—the total amount of coal consumed did not decline; instead, it increased. Efficiency lowered the price of coal energy, leading to more demand for that energy from society at large.
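A stylized version of the arithmetic, with numbers invented purely for illustration: efficiency doubles, the price of coal-derived energy falls, demand for that energy more than doubles, and total coal consumption rises.

```python
# Stylized Jevons arithmetic (all numbers invented for illustration).
coal_per_unit_before, coal_per_unit_after = 2.0, 1.0   # efficiency doubles
energy_used_before, energy_used_after = 100.0, 250.0   # cheaper energy -> much more demand

print(coal_per_unit_before * energy_used_before)   # 200.0 tons of coal before
print(coal_per_unit_after * energy_used_after)     # 250.0 tons of coal after
```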

Something very similar is happening with the data-software complex. It’s not just that data and software reinforce each other in a productivity flywheel. It’s not just that generative models produce and consume data and code. It’s that the price of “informed computation” has fallen, and the consequence is that there will be a lot more informed computation in the world.

A possibly lucrative question to ask is: where are the new areas of scarcity? The hardware that powers compute and data is an obvious candidate. As compute and data grow exponentially, the hardware underneath them cannot keep up.

Persistent and recurring chip shortages are a symptom of this; my hypothesis is that this is not a problem of insufficient supply but one of effectively insatiable demand, driven by the Jevons effect.


Another candidate for scarcity is energy. Training large models consumes a huge amount of energy, but at least it's confined to a few firms. Once you add in accelerating flywheels, the compute explosion, and agents everywhere, though, the quantities become vast. We haven't felt the pinch yet because of recent improvements in energy infra—solar efficiency, battery storage, and fracking—and there's hope that informed compute will help maintain those learning curves.

Watch out for artificial scarcity. Society may benefit from abundance, but individuals and corporations have different incentives. They may seek to constrain or capture the gains from ubiquitous, cheap, and powerful data and computation.

Finally and most provocatively, what happens to human beings in this brave new world? Are we a scarce and valuable resource, and, if so, why? For that nebulous entity we call “creativity,” or for our ability to accomplish physical tasks? Will AI augment human capacity or automate it away? I believe in abundance and I’m optimistic; the only way to find out is to go exploring.


Abraham Thomas is an angel investor based in Toronto. Previously he co-founded Quandl, a successful venture-backed tech startup. He writes occasional essays on data, investing, and startups in his newsletter, Pivotal, where this piece originally appeared.


