The world fails all the time. Even a well-run company can miss quarterly earnings. Accidents can happen inside factories. Customer data may leak. Computer servers crash. Obviously, what separates top-performing companies from the rest is their ability to learn from past mistakes. They avoid making the same mistake twice.
But here’s the thing: an organization is not like a person. Unless it’s a catastrophe, the daily lessons don’t travel beyond your immediate team members. People in other parts of the company won’t receive your learning. So we must resist storytelling that makes us feel good. We need to build systems that help us see reality as it is. Or else the entire organization will keep repeating the same mistakes.
These are the differences between a feel-good story and reality:
| Feel-good story | Reality |
| --- | --- |
| Human error is the cause of failure. “Someone messed it up!” | Human errors are caused by the systemic vulnerabilities deep inside the organization. “With our current setup, incidents are bound to happen no matter who’s in charge.” |
| Saying what people should have done is a satisfying way to move on. “They should have acted differently, my god!” | Saying what people should have done doesn’t explain why it made sense for them to do what they did. “That was a reasonable assumption the team made at the time. Yes, the outcome was bad, and the assumption proved wrong. Still, the decision made was a high-quality one, given the kind of information people had.” |
| Telling people to be more careful will make the problem go away. “All right, let’s pay more attention to this next time.” | Only by constantly seeking out vulnerabilities can a company reduce errors. “We will start with this new drill monthly.” |
Once we accept the realities as they are, then we’ll be able to take steps to expose the vulnerabilities lurking in the dark. We need to make the invisible visible. Two steps are involved. One is done through technology; the other, through culture.
You Can’t Improve What You Can’t See
In the world of IT, a service outage is the worst nightmare. When the website is down, revenue is lost. But you don’t often hear about crashes happening at Google, Netflix, Amazon, or Facebook. These websites, despite their vast functionality, rarely go down. That’s because of painstaking work, and telemetry helps a great deal.
Telemetry is like a cardiac event monitor that records heart activity. Such a device identifies abnormal heart rhythms. It records symptoms and warns you about them so you can seek help at a hospital before the worst arrives.
Tech companies apply the same principle to achieve smooth functioning of their apps. Software developers constantly add telemetry to programs they write. They create enough telemetry to monitor their software in action. Etsy, an online marketplace for all things artisan, is one example.
Back in 2011, Etsy was already tracking some 200,000 production metrics. Those metrics covered every layer of the software stack: application features, application health, database, operating system, storage, networking, and security. By 2018, tracking had gone up to 800,000 metrics.
Tracking everything is key to moving fast without risking the whole system going down. Etsy, Netflix, Amazon, and Facebook don’t just collect business data like user sign-ups or churn rate. Their telemetry also collects data on application latency and transaction time. They then match it against data on infrastructure, such as disk space and network bandwidth. They also look at feature updates already scheduled in the pipeline and the real-time crash reports coming from the app on your iPhone.
They do all this because no programmer would dare to roll out new features quickly if there’s constant worry over system crashes. Good telemetry allows engineers to see if things are working as intended. The best telemetry illuminates the entire operation so that everyone can see how their actions are affecting other parts of the system.
This rapid feedback loop about the entire system is crucial. It ensures that problems are detected early and thus corrected quickly. It also prevents the same problems from occurring again in the future. That’s why pervasive telemetry boosts innovation. Only when people feel safe about their own actions will they then innovate more.
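To make the idea of early detection concrete, here is a minimal sketch of one way a single metric could be watched. It is not how Etsy or any other company named here does it. The class name `MetricMonitor`, the metric name `checkout_latency_ms`, the window size, and the threshold of three standard deviations are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class MetricMonitor:
    """Tracks one metric and flags samples far from its recent baseline.

    Hypothetical helper for illustration; not taken from any real system.
    """

    def __init__(self, name, window=30, threshold=3.0):
        self.name = name
        self.threshold = threshold           # allowed standard deviations
        self.samples = deque(maxlen=window)  # rolling window of recent values

    def record(self, value):
        """Store a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 5:  # wait for a few points before judging
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            if spread > 0 and abs(value - baseline) > self.threshold * spread:
                anomalous = True
        self.samples.append(value)
        return anomalous


# Made-up latency readings in milliseconds; the last one is a spike.
monitor = MetricMonitor("checkout_latency_ms")
readings = [120, 118, 125, 122, 119, 121, 123, 480]
alerts = [value for value in readings if monitor.record(value)]
print(alerts)
```

In this sketch only the final reading is flagged, because it sits far outside the spread of the earlier ones. A real setup would likely keep flagged samples out of the baseline and route alerts to the people who own the service, but the principle is the same: the system notices before a customer does.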
How Transparency Supercharges Learning
Of course, an organization can’t truly see what’s going on unless information is shared transparently. But to share data across silos is also to eliminate the culture of corporate secrets.
Booking.com, one of the world’s leading travel aggregators, runs hundreds, if not thousands, of concurrent experiments at any one time. Daily changes can be as small as the color or placement of a button, or they can be about different color headlines for online ads. But they are all subject to A/B testing.
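For readers who want to see what sits underneath an A/B test, the comparison can be sketched as a two-proportion z-test. This is a common textbook approach, not a description of Booking.com’s own method. The function name and the visitor and conversion counts below are invented for illustration.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Assumes both groups are large and that the pooled rate is
    strictly between 0 and 1 (otherwise the standard error is zero).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical numbers: variant B is the new button color.
z, p = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be chance alone. The point for the organization is that the verdict comes from the data, not from whoever argues loudest in the meeting.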
With more than 1.5 million room nights booked on its platform each day, Booking.com saves all the experiments, successes and failures alike, on its IT platform. They are all searchable by anyone in the company. Every engineer gets access to all experimental protocols and data, regardless of which division they are from.
That’s why at places like Netflix or Google or Booking.com, production metrics are served as web pages generated by a centralized server. The data generated by telemetry needs to be easy to get and sufficiently centralized that the metrics are highly visible to anyone. This is how smart companies identify production vulnerabilities.
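The idea of a central record that anyone can search can be sketched in a few lines. This is only an illustration under stated assumptions: the `Experiment` fields, the `ExperimentRegistry` class, and the sample entries are all hypothetical and do not describe any company’s actual platform.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One experiment record; the fields are illustrative assumptions."""
    name: str
    team: str
    hypothesis: str
    outcome: str  # "success", "failure", or "inconclusive"
    tags: list = field(default_factory=list)


class ExperimentRegistry:
    """A single shared store that keeps failures as well as successes."""

    def __init__(self):
        self._experiments = []

    def add(self, experiment):
        self._experiments.append(experiment)

    def search(self, keyword):
        """Return every experiment mentioning the keyword, whatever the team."""
        keyword = keyword.lower()
        return [
            e for e in self._experiments
            if keyword in e.name.lower()
            or keyword in e.hypothesis.lower()
            or any(keyword in tag.lower() for tag in e.tags)
        ]


registry = ExperimentRegistry()
registry.add(Experiment("green-cta", "checkout", "A green button lifts bookings",
                        "failure", ["button", "color"]))
registry.add(Experiment("sticky-cta", "mobile", "A sticky button lifts bookings",
                        "success", ["button", "placement"]))
registry.add(Experiment("ad-headline", "marketing", "Blue headlines lift clicks",
                        "inconclusive", ["ads"]))

for match in registry.search("button"):
    print(match.team, match.name, match.outcome)
```

Notice that the search returns the failed experiment alongside the successful one, and that it cuts across teams. That is the design choice that matters: a colleague in another division can find out what was already tried before repeating it.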
This is, of course, not just about technology. This is about the culture an organization is willing to accept. So we ask ourselves:
1. At your company, how much communication is done using email or PowerPoint or a “sanitized” Excel spreadsheet? To what extent can the senders “control” the narrative? Alternatively, how many of the recommendations adopted from managers are based on verifiable data that others can easily retrieve?
Coincidentally, Jeff Bezos has long outlawed PowerPoint presentations at Amazon to stop executives from “bluffing their way through the meeting.”
2. Can you encourage a culture of transparency within your team by agreeing to share data sources or contacts with one another? Within a team or a department, data points should be shared with as many people as possible, and all team members should be encouraged to try out their own analyses. Why? Because sunlight is the best disinfectant.
You don’t need high tech to see the realities clearly. As I have argued before, even virtual meetings are about to turn the art of management into a scalable science.
What’s needed is a culture of transparency that forbids information hoarding, or worse, letting problems hide in the dark.