HANNAH BATES: Welcome to HBR On Strategy—case studies and conversations with the world’s top business and management experts, hand-selected to help you unlock new ways of doing business.
In the business world, leaders usually rely on experience or intuition to make decisions, and scientific inquiry is reserved for those who don’t have a clue. Harvard Business School professor Stefan Thomke says that misconception is a problem. An experiment might not sound as bold or exciting as using gut instincts to make decisions, but it’s far more reliable.
In this 2020 episode of HBR IdeaCast, Thomke explains why businesses should embrace testing, how leaders can get comfortable with the risks involved, and what happens when companies commit to a culture of experimentation. He starts with a powerful example of an experiment that paid off—big time.
STEFAN THOMKE: Well, first of all, it can generate a tremendous amount of value. Let me give you an example: Microsoft’s Bing, which is a search engine. An employee working at Bing came up with an idea for how to display its ads. The manager didn’t think much of it, and they kind of shelved it. But the employee insisted.
At some point the employee decided just to launch an experiment, to run a controlled test. And when he ran the test, that little change, a few days of work, generated more than $100 million of additional revenue in that year alone. And of course, more revenue going forward. It was, in fact, the most successful experiment that was run at Bing.
So, what made the difference? Well the difference was that the employee had the power essentially or the authority to run the experiment, to launch it and to test it. It’s the test that actually told you what works and doesn’t work and —
CURT NICKISCH: And not the manager.
STEFAN THOMKE: And not the manager. The problem is in a lot of innovation, especially sort of when you’re trying to predict customer behavior, we get it wrong most of the time. And so rather than trying to follow our intuition or our opinions, why not just run the test and let the test tell us what works and doesn’t work?
CURT NICKISCH: And what’s the answer to that? Why aren’t people doing it?
STEFAN THOMKE: Well, there are lots of reasons why people are not doing it, especially at scale. Some people are running simple experiments, but they refer to an experiment as something like a trial: “We’re trying something.” That’s not really an experiment in the scientific sense. And they don’t do many of those because they don’t have the infrastructure to run many tests. They may not have the tools to do so. It may be too expensive to run them. And then they may decide that, listen, we run a test and we get some results and then nobody listens to us anyway.
CURT NICKISCH: Right. Do managers overestimate the downside to experiments and underestimate the upside?
STEFAN THOMKE: I think sometimes they are overly concerned about the risk of running the experiment. For good reasons – you have a lot of traffic. You may not want to launch something that results in a loss of customers visiting your website, for example.
CURT NICKISCH: Right, if it goes down.
STEFAN THOMKE: If it goes down and you don’t have good stopping rules, kill switches and things like that in place. And then there may be risk aversion; it’s also stepping into the unknown. And quite honestly, it takes humility to admit that I just don’t know. Imagine walking into a meeting where we’re launching this thing and everybody has some hypothesis about what the outcome’s going to look like, and telling everybody, listen, quite honestly, I don’t know what’s going to happen, so let’s just find out.
CURT NICKISCH: Even though I get paid more and I’m in charge, I don’t know either.
STEFAN THOMKE: Exactly. And the higher up you go, the more you get paid. The more senior you get, the more you get paid to make tough decisions. And you want to be the decision maker. So you’ve got to create an organization that ticks a little differently to do this sort of thing.
By the way, it’s not just the online world. Companies are running experiments in the physical world too, and even there we have to make big decisions, sometimes very expensive decisions, and it’s the experiments that can, in fact, adjudicate whether we want to do something or not.
Kohl’s – you know, big retailer and so forth. So Kohl’s hires a consulting company and the consulting company basically does a cost analysis and they go to senior management and tell them, listen, we figured out that you can save a lot of money if you open your stores an hour later. Now here you are. You’re running this company and you have to make a decision. Should we do that? Calculating the cost savings is easy. Because you can pretty quickly figure this out. But the big question is, what’s actually going to happen to our revenue? Are customers going to buy less if we open an hour later? So how do you make these kinds of decisions?
We can analyze and analyze, but we won’t know until we actually do it, until we run the test. And in this case they did. They ran controlled experiments in which they set up these tests, opening an hour later, and lo and behold, the result at the end was that it didn’t make much difference.
CURT NICKISCH: Just so we’re on the same page, how do you go about setting up an experiment? Are there playbooks for this?
STEFAN THOMKE: Well, first of all, there are tools. A lot of the companies that I describe in the book built their own infrastructure, built their own tools, because when they got started many years ago, the tools weren’t around. So you look at an Amazon or a Microsoft, a Netflix, a Booking.com. I mean, you go through them and there’s about a dozen or so. They decided to do it themselves.
CURT NICKISCH: So they just, they knew that they had questions they wanted to answer and they just figured out a way to do it.
STEFAN THOMKE: They figured this was going to give them a competitive advantage if they could go out and just test a lot, and they knew that they often get it wrong. And so they started investing in infrastructure. At a place like Microsoft, for example, you have a very, very large group that basically runs the infrastructure. The last time I checked, it was something like 85 or 90 people who are just doing infrastructure.
But the good thing that happened a few years ago is that there are now third-party tools as well that can do this, both in the online space and in the brick-and-mortar space, which do a lot of the heavy lifting for you: a lot of the statistical stuff and so forth. And so it’s gotten a lot easier than if you had wanted to start, say, five or 10 years ago.
CURT NICKISCH: Developing a culture for this is probably a little bit different?
STEFAN THOMKE: I think it may be potentially harder than getting the tools and building the tools because now we’re dealing with behaviors, with beliefs, with norms and all sorts of things.
CURT NICKISCH: How does this show up in companies if the culture for experimentation is not working? What do you actually see and observe?
STEFAN THOMKE: Well, the classic example is they start running experiments. We run the experiment. We hand over the results to the group that asked us to run the experiment, and then nothing happens. Or they will start to challenge the experiment: something must have gone wrong.
I remember a story where an angry person actually called one of the tool vendors in this space and complained about the tool being wrong. The person had run an experiment, and the experiment showed that if you give customers less choice in his setting, you get better performance. And that was just counterintuitive, because everything that he believed up to this point was that you should give people more choices.
And so he was really disturbed by the finding and so he called them and complained that there’s a flaw in the tool. Something in the tool must be wrong because the result doesn’t match the experience that he’s had and he’s been doing this for a long time. And so, you run into that sort of thing.
CURT NICKISCH: Which kind of underlines your point that experiments bring new insights that you just can’t develop on your own.
STEFAN THOMKE: Correct. There’s a company called Booking.com, which most of us use. In fact, it’s the biggest accommodations platform in the world. More than 1.5 million room nights are booked on the platform each day. It’s what we call a two-sided platform. It’s got suppliers on one side, which are hotel operators, for example. And of course, it’s got customers like us on the other side.
And Booking.com runs a massive number of experiments. My estimate, and they told me I’m probably on the low side, is over 30,000 experiments a year. And it’s a really fascinating company. It’s also a highly successful company. Their gross profits are in the high 90s percent. And they don’t really have any assets. They don’t really own any accommodations. And it’s a super competitive industry too.
And so how do they get away with this? The answer is they run a lot of experiments. And they created an experimentation culture where running experiments is almost like breathing. You kind of do it every single day. I mean, Curt, you have to think about the numbers here. Even if my number is low, they’re running more than 100 new experiments a day. You have to have an organization that can even come up with so many hypotheses.
CURT NICKISCH: I mean you mentioned the number of transactions that Booking.com does in a day. How key is that to being able to run experiments? Does that also work for places that just don’t have data like that?
STEFAN THOMKE: Yes, it also works for places that have a lot less traffic. The underlying math changes; what you have to do algorithmically is very different. If you have very large sample sizes, a lot of traffic for example, you can really fine-tune. You can make very, very small changes and still pick up whether that change actually causes something to happen. As your sample size shrinks, you’re going to have to go for bigger changes. We call it the power of an experiment, statistical power. You have to power an experiment. And so I recommend that smaller companies run experiments that are a little bigger.
Now, something else also happens, and this actually happened at IBM. When they started to do this, they realized that they had way too many websites. So yes, they had very little traffic on some of these websites, but they didn’t need all the websites. So that actually led to a process of consolidation. They said, listen, we don’t really need all these things, so we’ll consolidate and get more traffic on fewer websites, which then allows us to run more experiments.
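[Editor’s note: Thomke’s point about statistical power can be illustrated with a rough calculation. The sketch below is not from the episode. It uses a standard normal-approximation formula for comparing two conversion rates, and the function name `visitors_per_arm`, the 5% baseline conversion rate, and the default significance level and power are all assumptions chosen for illustration.]

```python
import math
from statistics import NormalDist


def visitors_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH arm of a two-arm test.

    Hypothetical helper: uses the normal approximation for a two-sided
    comparison of two proportions (control vs. variant).
    """
    p1 = baseline_rate                        # control conversion rate
    p2 = baseline_rate * (1 + relative_lift)  # variant rate we hope to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed 5% baseline conversion rate; compare small and large changes.
for lift in (0.01, 0.05, 0.20):
    needed = visitors_per_arm(0.05, lift)
    print(f"{lift:.0%} relative lift: about {needed:,} visitors per arm")
```

Under these assumptions, the traffic required grows roughly with the inverse square of the change you want to detect, which is why a site with less traffic is better served by testing bigger changes.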
CURT NICKISCH: I wonder if there are companies or industries outside of consumer facing tech, or outside of scientific, or pharmaceutical companies where experimentation really feels foreign?
STEFAN THOMKE: Well, I mean, the classic companies, I think, are in the creative industries, where the assumption is that everything is driven by creatives. Look at entertainment, for example. And look at what Netflix has done. Netflix kind of flipped it around: they operate in a creative industry, but they are completely experimentation driven. And I think it was a big wake-up call for the entertainment industry, because when you go in and you use Netflix, you are part of their ecosystem, their experimentation ecosystem. They run a massive number of tests because they want to find out what works and what doesn’t work. By the way, running the test and getting a result doesn’t mean that you have to blindly follow the result, because sometimes there are good strategic reasons why you may not want to implement what the test tells you.
CURT NICKISCH: Right. Or there are tradeoffs to whatever benefits —
STEFAN THOMKE: Or tradeoffs, for example, or there may be a contractual violation or something like that. But what that test does is it actually adds transparency to the decision. So you cannot pretend that we’re doing this because it’s good for the customer, or good for the viewer. It adds clarity to that. We understand from the tests what’s good for the viewer, but there may be other reasons why we may not want to do it. And adding that transparency to what you’re doing, I think, is a big value, and it allows a company like Netflix to operate in a creative industry with a testing approach.
I don’t want to diminish the value of creative talent because creative talent is really important, but that doesn’t create certainty in terms of decision making. To me the creative talent and the intuition is an important part of experimentation, because it allows us to create hypotheses. You have to ask yourself Curt, where do these hypotheses come from?
CURT NICKISCH: Yes, they’re from people. People asking questions or have some ideas, yeah.
STEFAN THOMKE: Absolutely. So what I’m saying is that all these experiments have hypotheses that came out of product groups, and it’s the people who come up with these hypotheses. And so where do they get the ideas? Well, it’s intuition sometimes. It’s insights, you know, customer surprises: things that they thought were true, and then they observe something that doesn’t quite fit what they know. It’s usability labs.
I mean, these companies all still run qualitative research. They do all the kinds of things that other companies do, but they do it to generate hypotheses, which are then rigorously tested, whereas other organizations generate the hypotheses and go directly from hypothesis to launch.
CURT NICKISCH: Right. Based on whoever is the best public speaker, or makes the best case in a meeting rather than —
STEFAN THOMKE: Yeah, there’s a word for that in the community: “hippos.”
CURT NICKISCH: Hippos?
STEFAN THOMKE: Yes. Highest paid person’s opinion. Hippos. And we all know that hippos are very dangerous animals.
CURT NICKISCH: I think a lot of executives are probably also not used to knowing how much experimentation to do. How do you know what to experiment on and how do you know what to let be?
STEFAN THOMKE: Yes. You have to empower people to make that decision. And the reality is, right now, I think most organizations test too little. So I don’t think you should be too worried about testing too much. Yes, there is probably a point at which you test too much, because you need an organization that can absorb all that knowledge, all those findings that are generated by all these tests. That’s true. And we need to think about that. But I don’t think that’s a problem in most organizations right now. Right now they’re not doing enough.
CURT NICKISCH: If you’re bringing this into a company do you try to do this companywide? Do you try to start with a team or a division and scale it up from there?
STEFAN THOMKE: So there are different ways to organize your experimentation teams. There are three models that I describe in the book. One model is really more a centralized approach. I basically have like a center, a group that’s responsible for experiments and they’re like a service organization, where you can come from a business unit, you can commission an experiment and they’ll run it for you and they give you the results.
CURT NICKISCH: Oh that’s interesting.
STEFAN THOMKE: That’s one model. And a lot of companies start out that way. Because they are kind of a little uncertain how this is all going to work out and they may not believe that the company’s ready to do this at large scale.
CURT NICKISCH: It probably simplifies training and it lets them dip their toe in without really having to —
STEFAN THOMKE: Exactly. And you have a few experts and they kind of make sure that people don’t do foolish things. Then another form is to be completely decentralized. So now we’re basically shifting the autonomy to people and allowing pretty much anybody to run experiments, and we don’t centralize it anymore.
And of course there you have to trust people. You have to know that they’re actually capable of doing this, and it’s a way, of course, to rapidly scale things. But what happens there is, when you spread all your experts around, all the way through the company, they get very busy and you kind of lose the focus on building capabilities, because you always need to get better and better. And so there’s no coordinated approach to this. Everybody kind of does their own thing.
So what companies have found is they go from centralized to decentralized because they want to scale things, but then they realize that they need a more coordinated approach, and they create something which I call a center of excellence. The center of excellence is kind of a hybrid model, where you have a core group that is responsible for developing experimentation capabilities, knowing what tools to use and pushing the envelope.
But at the same time you take people out of