“Once you stop learning, you start dying.” —Albert Einstein
The pace of improvement in artificial intelligence today is breathtaking.
An exciting new paradigm—reasoning models based on inference-time compute—has emerged in recent months, unlocking a whole new horizon for AI capabilities.
The feeling of a building crescendo is in the air. AGI seems to be on everyone’s lips.
“Systems that start to point to AGI are coming into view,” wrote OpenAI CEO Sam Altman last month. “The economic growth in front of us looks astonishing, and we can now imagine a world where we cure all diseases and can fully realize our creative potential.”
Or, as Anthropic CEO Dario Amodei put it recently: “What I’ve seen inside Anthropic and out over the last few months has led me to believe that we’re on track for human-level AI systems that surpass humans in every task within 2–3 years.”
Yet today’s AI continues to lack one basic capability that any intelligent system should have.
Many industry participants do not even recognize that this shortcoming exists, because the current approach to building AI systems has become so universal and entrenched. But until it is addressed, true human-level AI will remain elusive.
What is this missing capability? The ability to continue learning.
What do we mean by this?
Today’s AI systems go through two distinct phases: training and inference.
First, during training, an AI model is shown a bunch of data from which it learns about the world. Then, during inference, the model is put into use: it generates outputs and completes tasks based on what it learned during training.
All of an AI’s learning happens during the training phase. After training is complete, the AI model’s weights become static. Though the AI is exposed to all sorts of new data and experiences once it is deployed in the world, it does not learn from this new data.
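To make this concrete, here is a minimal PyTorch sketch of the two phases (the toy model and random data are purely illustrative): weights change during training, then freeze at inference.

```python
import torch
import torch.nn as nn

# A toy model standing in for any deployed AI system.
model = nn.Linear(16, 2)

# Training phase: weights are updated by gradient descent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(100):
    x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference phase: weights are frozen; the model only produces outputs.
model.eval()
with torch.no_grad():  # no gradients flow, so no learning occurs
    prediction = model(torch.randn(1, 16)).argmax(dim=-1)
# Every new input the deployed model sees is processed and then
# forgotten; nothing it encounters changes its weights.
```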
In order for an AI model to gain new knowledge, it typically must be trained again from scratch. In the case of today’s most powerful AI models, each new training run can take months and cost hundreds of millions of dollars.
Take a moment to reflect on how peculiar—and suboptimal—this is. Today’s AI systems do not learn as they go. They cannot incorporate new information on the fly in order to continuously improve themselves or adapt to changing circumstances.
In this sense, artificial intelligence remains quite unlike, and less capable than, human intelligence. Human cognition is not divided into separate “training” and “inference” phases. Rather, humans continuously learn, incorporating new information and understanding in real-time. (One could say that humans are constantly and simultaneously doing both training and inference.)
What if we could eliminate the kludgy, rigid distinction in AI between training and inference, enabling AI systems to continuously learn the way that humans do?
This basic concept goes by many different names in the AI literature: continual learning, lifelong learning, incremental learning, online learning.
It has long been a goal of AI researchers—and has long remained out of reach.
Another term has emerged recently to describe the same idea: “test-time training.”
As Perplexity CEO Aravind Srinivas said recently: “Test-Time Compute is currently just inference with chain of thought. We haven’t started doing test-time-training – where model updates weights to go figure out new things or ingest a ton of new context, without losing generality and raw IQ. Going to be amazing when that happens.”
Fundamental research problems remain to be solved before continual learning is ready for primetime. But startups and research labs are making exciting progress on this front as we speak. The advent of continual learning will have profound implications for the world of AI.
Workarounds and Half-Solutions
It is worth noting that a handful of workarounds exist to mitigate AI’s current inability to learn continuously. Three in particular are worth mentioning. While each of these can help, none fully solve the problem.
The first is model fine-tuning. Once an AI model has been pretrained, it can subsequently be fine-tuned on a smaller amount of new data in order to incrementally update its knowledge base.
In principle, fine-tuning a model on an ongoing basis could be one way to enable an AI system to incorporate new learnings as it goes.
However, periodically fine-tuning a model is still fundamentally a batch-based rather than a continuous approach; it does not unlock true on-the-fly learning.
And while fine-tuning a model is less resource-intensive than pretraining it from scratch, it is still complex, time-consuming and expensive, making it impractical to do too frequently.
Perhaps most importantly, fine-tuning only works well if the new data does not stray too far from the original training data. If the data distribution shifts dramatically—for instance, if a model is presented with a totally new task or environment that is unlike anything it has encountered before—then fine-tuning can fall prey to the foundational challenge of catastrophic forgetting (discussed in more detail below).
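For a sense of the mechanics, here is a simplified sketch of periodic fine-tuning with a toy model (not any lab's actual pipeline). Note that learning happens only when the function is invoked, in discrete batches:

```python
import torch
import torch.nn as nn

def fine_tune(model, new_batches, epochs=3, lr=1e-4):
    """Fine-tune an already-trained model on accumulated new data.

    Learning happens only when this function is called, not as the
    new data actually arrives: a batch process, not a continuous one.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in new_batches:
            loss = nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Hypothetical usage: data accumulates continuously, but the model is
# only updated at scheduled intervals (say, once a month).
model = nn.Linear(16, 2)
new_batches = [(torch.randn(32, 16), torch.randint(0, 2, (32,)))]
model = fine_tune(model, new_batches)
```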
The second workaround is to combine some form of retrieval with some form of external memory: for instance, retrieval-augmented generation (RAG) paired with a dynamically updated vector database.
Such AI systems can store new learnings on an ongoing basis in a database that sits outside the model and then pull information from that database when needed. This can be another way for an AI model to continuously incorporate new information.
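A bare-bones sketch of the pattern, with a NumPy array standing in for a real vector database and random vectors standing in for a real embedding model:

```python
import numpy as np

class VectorMemory:
    """A toy external memory: store embeddings, retrieve by similarity."""
    def __init__(self, dim=8):
        self.vectors = np.empty((0, dim))
        self.texts = []

    def add(self, text, embedding):
        # New learnings are written to the store, not to model weights.
        self.vectors = np.vstack([self.vectors, embedding])
        self.texts.append(text)

    def retrieve(self, query_embedding, k=3):
        # Rank stored memories by cosine similarity to the query.
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query_embedding)
        scores = self.vectors @ query_embedding / np.maximum(norms, 1e-9)
        return [self.texts[i] for i in np.argsort(scores)[::-1][:k]]

# A real system would use a learned embedding model; random vectors
# stand in here for illustration.
memory = VectorMemory()
memory.add("User prefers concise answers.", np.random.randn(8))
retrieved = memory.retrieve(np.random.randn(8))
prompt = "\n".join(retrieved) + "\nUser question: ..."  # retrieved context is prepended
```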
But this approach does not scale well. The more new learnings an AI system accumulates, the more unwieldy it becomes to store and retrieve all of this new information in an efficient way using an external database. Latency, computational cost, retrieval accuracy and system complexity all limit the usefulness of this approach.
A final way to mitigate AI’s inability to learn continuously is in-context learning.
AI models have a remarkable ability to update their behavior and knowledge based on information presented to them in a prompt and included within their current context window. The model’s weights do not change; rather, the prompt itself is the source of learning. This is referred to as in-context learning. It is in-context learning that, for example, makes possible the practice of “prompt engineering.”
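A simple few-shot prompt illustrates the mechanism; `call_model` below is a hypothetical stand-in for any LLM API:

```python
# The "learning" lives entirely in the prompt; the weights never change.
examples = [
    ("The movie was wonderful.", "positive"),
    ("I want my money back.", "negative"),
]

def build_prompt(new_input):
    shots = "\n".join(f"Review: {text}\nSentiment: {label}"
                      for text, label in examples)
    return f"{shots}\nReview: {new_input}\nSentiment:"

prompt = build_prompt("Best purchase I've made all year.")
# response = call_model(prompt)  # hypothetical LLM call; the model infers
#                                # the task from the examples in context
```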
In-context learning is elegant and efficient. It is also, however, ephemeral.
As soon as the information is no longer in the context window, the new learnings are gone: for instance, when a different user starts a session with the same AI model, or when the same user starts a new session with the model the next day. Because the model’s weights have not changed, its new knowledge does not persist over time. This severely limits the usefulness of in-context learning in enabling true continual learning.
Moats, Moats, Moats
One important reason why continual learning represents such a tantalizing possibility: it could create durable moats for the next generation of AI applications.
How would this work?
Today, OpenAI’s GPT-4o is the same model for everyone who uses it. It doesn’t change based on its history with you (although ChatGPT, the product, does incorporate some elements of persistent memory).
This makes it frictionless for users to switch between OpenAI, Anthropic, Google, DeepSeek and so on. Any of these companies’ models will give you more or less the same response to a given prompt, whether you’ve had thousands of previous interactions with it or you are trying it for the first time.
Little wonder that the conventional wisdom today is that AI models inevitably commoditize.
In a continual learning regime, by contrast, the more a user uses a model, the more personalized the model becomes. As you work with a model day in and day out, the model becomes more tailored to your context, your use cases, your preferences, your environment. Its weights literally change as it learns about you and about the things that matter to you. It gets to know you.
Imagine how much more compelling a personal AI agent would be if it reliably adapted to your particular needs and idiosyncrasies in real-time, thereby building an enduring relationship with you.
(For a dramatized illustration of what continual learning might look like—and how different this would be from today’s AI—think of the Samantha character in the 2013 film Her.)
The impact of continual learning will be enormous in both consumer and enterprise settings.
A lawyer using a legal AI application will find that, after a few months of using the application, it has a much deeper understanding than it did at the outset about the lawyer’s roster of clients, how she engages with different colleagues, how she likes to craft legal arguments, when she chooses to push back on clients versus acquiesce to their preferences, and so forth. A recruiter will find that, the more he uses an AI product, the more intuitively it understands which candidates he tends to prioritize, how he likes to conduct screening interviews, how he writes job descriptions, how he engages in compensation negotiations, and so on. Ditto for AI products for accountants, for doctors, for software engineers, for product designers, for salespeople, for writers, and beyond.
Continual learning will enable AI to become personalized in a way that it has never been before. This will make AI products sticky in a way that they have never been before.
After you’ve worked with it for a while, your AI model will be very different from someone else’s version or from the off-the-shelf version of the same model. Its weights will have adapted to you. This will make it painful and inconvenient to switch to a competing product, in the same way that it is painful and inconvenient to replace a well-trained, high-performing employee with someone who is brand new.
Venture capitalists like to obsess over “moats”—durable sources of competitive advantage for companies.
It remains an open question what the most important new moats will be in the era of AI, particularly at the application layer.
A long-standing narrative about moats in AI relates to proprietary data. According to this narrative, the more user data an AI product collects, the better and more differentiated the product becomes as it learns from that data, and the deeper the moat therefore gets. This story makes intuitive sense and is widely repeated today.
However, the extent to which collecting additional user data has actually led to product differentiation and moats in AI remains limited to date—precisely because AI systems do not actually learn and adapt continuously based on new data. How much lock-in do you as a user experience today with Perplexity versus ChatGPT versus Claude as a result of user-level personalization in those products?
Continual learning will change this. It will, for the first time, unleash AI’s full potential to power hyperpersonalized and hypersticky AI products. It will create a whole new kind of moat for the AI era.
Continual Learning’s Achilles’ Heel
The potential upsides of continual learning are enormous. It would unlock whole new capabilities and market opportunities for AI.
The idea of continual learning is not new. AI researchers have been talking about it for decades.
So: why are today’s AI systems still not capable of learning continuously?
One fundamental obstacle stands in the way of building AI systems that can learn continuously—an issue known as catastrophic forgetting. Catastrophic forgetting is simple to explain and fiendishly difficult to solve.
In a nutshell, catastrophic forgetting refers to neural networks’ tendency to overwrite and lose old knowledge when they add new knowledge.
Concretely, imagine an AI model whose weights have been optimized to complete task A. It is then exposed to new data related to completing task B. The central premise of continual learning is that the model’s weights can update dynamically in order to learn to solve task B. Absent any safeguards, however, updating the weights for task B overwrites the very weights that encoded task A, and the model’s performance on task A degrades, often severely.
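The effect is easy to reproduce. The toy PyTorch experiment below (both tasks are synthetic, invented for illustration) trains one network on task A and then on task B with no safeguards; accuracy on task A typically collapses:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def make_task(shift):
    # Each synthetic task has its own input distribution and labeling rule.
    x = torch.randn(512, 10) + shift
    y = (x.sum(dim=1) > shift * 10).long()
    return x, y

def train(x, y, steps=200):
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def accuracy(x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

xa, ya = make_task(0.0)
xb, yb = make_task(3.0)
train(xa, ya)
print("task A accuracy after training on A:", accuracy(xa, ya))  # high
train(xb, yb)  # the same weights are overwritten to fit task B
print("task A accuracy after training on B:", accuracy(xa, ya))  # typically far lower
```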
Humans do not suffer from catastrophic forgetting. Learning how to drive a car, for instance, does not cause us to forget how to do math. Somehow, the human brain manages to incorporate new learnings on an ongoing basis without sacrificing existing knowledge. As with much relating to the human brain, we don’t understand exactly how it does this. For decades, AI researchers have sought to recreate this ability in artificial neural networks—without much success.
The entire field of continual learning can be understood first and foremost as an attempt to solve catastrophic forgetting.
The core challenge here is to find the right balance between stability and plasticity. Increasing one inevitably jeopardizes the other. As a neural network becomes more stable and less changeable, it is in less danger of forgetting existing learnings, but it is also less capable of incorporating new learnings. Conversely, a highly plastic neural network may be well positioned to integrate new learnings from new data, but it does so at the expense of the knowledge that its weights had previously encoded.
Existing approaches to continual learning can be grouped into three main categories, each of which seeks to address catastrophic forgetting by striking the right balance between stability and plasticity.
The first category is known as replay, or rehearsal. The basic idea behind replay-based methods is to hold on to and revisit samples of old data on an ongoing basis while learning from new data, in order to prevent the loss of older learnings.
The most straightforward way to accomplish this is to store representative data points from previous tasks in a “memory buffer” and then to intersperse those old data with new data when learning new things. A more complex alternative is to train a generative model that can produce synthetic data that approximates the old data and then use that model’s output to “replay” previous knowledge, without needing to actually store earlier data points.
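A minimal sketch of the memory-buffer variant, assuming a reservoir-sampling buffer and a conventional gradient-based training step:

```python
import random

class ReplayBuffer:
    """A reservoir-sampling memory buffer for rehearsal."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.seen = 0
        self.data = []

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Reservoir sampling: every example ever seen keeps an
            # equal chance of remaining in the fixed-size buffer.
            i = random.randrange(self.seen)
            if i < self.capacity:
                self.data[i] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

# Hypothetical training loop: each new example is learned alongside a
# batch of replayed old examples, so gradient updates keep reflecting
# earlier data as well as the latest data.
buffer = ReplayBuffer()
new_stream = [(f"x{i}", i % 2) for i in range(100)]  # stand-in data stream
for example in new_stream:
    buffer.add(example)
    batch = [example] + buffer.sample(31)
    # train_step(model, batch)  # an ordinary gradient update (not shown)
```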
The core shortcoming of replay-based continual learning methods is that they do not scale well (for a similar reason as RAG-based methods, described above). The more data a continual learning system is exposed to over time, the less practicable it is to hold on to and “replay” all of that previous data in a compact way.
The second main approach to continual learning is regularization. Regularization-based methods seek to mitigate catastrophic forgetting by introducing constraints into the learning process that protect existing knowledge: for example, by identifying model weights that are particularly important for existing knowledge and slowing the rate at which those weights can change, while enabling other parts of the neural network to update more freely.
Influential algorithms that fall into this category include Elastic Weight Consolidation (out of DeepMind), Synaptic Intelligence (out of Stanford) and Learning Without Forgetting (out of the University of Illinois).
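To give a flavor of how such constraints work, here is a simplified penalty in the spirit of elastic weight consolidation; the per-weight importance values below are placeholders (the actual EWC algorithm estimates them from the Fisher information of the old task):

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """A simplified elastic-weight-consolidation-style penalty.

    `fisher[name]` estimates how important each weight was for the old
    task; `old_params[name]` holds the weights learned on that task.
    Moving an important weight away from its old value is penalized.
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Hypothetical usage while training on a new task:
model = torch.nn.Linear(4, 2)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder importances
# total_loss = new_task_loss + ewc_penalty(model, fisher, old_params)
# Important weights stay put (stability); unimportant ones remain free
# to change (plasticity).
```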
Regularization-based methods can work well under certain circumstances. They break down, though, when the environment shifts too dramatically—i.e., when the new data looks totally unlike the old data—because their learning constraints prevent them from fully adapting. In short: too much stability, not enough plasticity.
The third approach to continual learning is architectural.
The first two approaches assume a fixed neural network architecture and aim to assimilate new learnings by updating and optimizing one shared set of weights. Architectural methods, by contrast, solve the problem of incremental learning by allocating different components of an AI model’s architecture to different realms of knowledge. This often includes dynamically growing the neural network by adding new neurons, layers or subnetworks in response to new learnings.
One prominent example of an architectural approach to continual learning is Progressive Neural Networks, which came out of DeepMind in 2016.
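In heavily simplified form, the column-per-task idea looks something like the sketch below; note that it omits the lateral connections the actual Progressive Neural Networks architecture uses to transfer knowledge between columns:

```python
import torch
import torch.nn as nn

class ProgressiveNet(nn.Module):
    """A heavily simplified column-per-task network."""
    def __init__(self, in_dim=10, hidden=32, out_dim=2):
        super().__init__()
        self.dims = (in_dim, hidden, out_dim)
        self.columns = nn.ModuleList()

    def add_column(self):
        # Freeze every existing column so old knowledge is untouched...
        for col in self.columns:
            for p in col.parameters():
                p.requires_grad = False
        # ...then grow the network with a fresh, trainable column.
        in_dim, hidden, out_dim = self.dims
        self.columns.append(nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)))

    def forward(self, x, task_id):
        return self.columns[task_id](x)

net = ProgressiveNet()
net.add_column()  # column 0 is trained on task A
net.add_column()  # column 1 is trained on task B; column 0 stays frozen
out = net(torch.randn(5, 10), task_id=1)
```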
Devoting different parts of a model’s architecture to different kinds of knowledge helps mitigate catastrophic forgetting because new learnings can be incorporated while leaving existing parameters untouched. A major downside, though, is again scalability: if the neural network grows whenever it adds new knowledge, it will eventually become intractably large and complex.
While replay-based, regularization-based and architecture-based approaches to continual learning have all shown some promise over the years, none of these methods work well enough to enable continual learning at any scale in real-world settings today.