Across the span of my career, I’ve noticed new buzzwords entering the collective lexicon every decade or so. The idea of ‘cyber-resilience’ has gained momentum: it describes how organizations should protect their IT systems to avoid costly downtime and limit disruption to critical services. A recent report from Splunk estimates that downtime costs Global 2000 companies US$400 billion annually, or about 9 percent of profits. While larger enterprises may have the resources to recover from the significant losses an IT outage can cause, small to mid-sized companies often do not. Even beyond cybersecurity, operational resilience is key for organizations that want to stand the test of time.
Although more industries are embracing the concept of resilience in their IT systems, the approach must continue to evolve in response to the changing threat landscape. It’s no longer enough to deploy technology or best practices that help secure the IT infrastructure. Instead, organizations must develop a comprehensive, proactive approach that places observability at the foundation of a resilient, layered IT system.
A Refresher on Observability
Although the term is defined in slightly different ways, I consider observability to be the capacity to gather insights, analytics, and actionable information from both real-time and historical metrics, logs, and trace data. A modern observability function should be able to collect these insights using multi-domain data correlation, machine learning (ML), and AIOps. The ultimate goal is the clearest possible picture of your IT systems.
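To make those three data types concrete, here is a minimal sketch using the open-source OpenTelemetry Python API. The article doesn’t prescribe a particular tool, so treat this as one common choice rather than a recommendation; the service, span, and metric names are hypothetical. Without an exporter configured, the calls are harmless no-ops, but they show how a single request can emit a correlated trace, metric, and log line:

```python
# A minimal sketch of the three telemetry types: traces, metrics, and logs.
# Names like "checkout-service" and "handle_request" are placeholders.
import logging

from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout-service")    # trace data
meter = metrics.get_meter("checkout-service")    # metrics
logger = logging.getLogger("checkout-service")   # logs

request_counter = meter.create_counter(
    "http.requests", description="Total HTTP requests handled"
)

def handle_request(order_id: str) -> None:
    # Each request produces a span, a metric increment, and a log line,
    # all tagged with the same context so they can be correlated later.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        request_counter.add(1)
        logger.info("processed order %s", order_id)

handle_request("A-1001")
```

The value comes from correlation: when all three signals describe the same request, an operator can move from a spiking metric to the exact trace and log lines behind it.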
Observability and Today’s IT Landscape
So, what does this have to do with today’s IT threats? For many businesses, the current IT environment is more complex than it has ever been. The world’s growing dependence on digital solutions and workflows has made IT environments larger than ever, and we’ve come a long way from the massive migration to the cloud during the height of the COVID-19 era. In a June 2024 International Data Corporation (IDC) report, about 80% of respondents said their companies were planning some level of repatriation, or moving workloads from public clouds back to on-premises data centers, within a year. This suggests many companies are now deploying a hybrid or multi-cloud strategy, which makes it harder to monitor each area of an IT environment.
Organizations’ IT environments are also leveraging AI more than ever before. A McKinsey survey from March 2025 indicates that 71% of companies use generative AI in at least one business function, up from 65% in 2024. In other words, companies are running more automated workflows in their IT environments than ever.
While the growing scale, complexity, and automation of IT systems is a boon for innovation, it arrives just as the threat landscape is evolving.
More AI tools have been democratized, lowering the barriers to entry for today’s threat actors. Phishing scams and social engineering have become increasingly sophisticated, putting more organizations at risk of unauthorized access to their systems, which in turn raises the odds of system downtime, lost revenue, or even damage to the brand’s reputation. Data from a recent SolarWinds public sector survey shows that public sector organizations are concerned about both external threats and internal security practices: 58% of respondents expressed concern about cybersecurity mistakes made by ‘untrained insiders.’
Every additional entry point makes your IT environment more vulnerable. You need an observability approach that can respond quickly and help mitigate breaches, because that is what builds the resilient systems today’s businesses need.
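What might "responding quickly" look like in practice? Below is a minimal, self-contained Python sketch of one simple detection pattern: flagging a metric sample (say, failed logins per minute) that deviates sharply from recent history. The window size and z-score threshold are illustrative assumptions, not tuned recommendations:

```python
# Flag a metric sample that deviates sharply from recent history.
from collections import deque
from statistics import mean, stdev

WINDOW = 60        # how many recent samples to keep
THRESHOLD = 3.0    # z-score above which we raise an alert

history: deque[float] = deque(maxlen=WINDOW)

def check_sample(value: float) -> bool:
    """Return True if this sample looks anomalous against recent history."""
    anomalous = False
    if len(history) >= 10:  # need enough data to judge
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(value - mu) / sigma > THRESHOLD:
            anomalous = True
    history.append(value)
    return anomalous

# Example: a sudden spike in failed-login counts trips the alert.
for sample in [5, 6, 4, 5, 7, 5, 6, 4, 5, 6, 90]:
    if check_sample(sample):
        print(f"ALERT: anomalous value {sample}")
```

Real platforms use far more sophisticated ML-driven baselining, but the principle is the same: the faster an abnormal signal is surfaced, the smaller the window an attacker or outage has to do damage.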
The Right Approach to Observability and Cyber-Resilience
When done correctly, your approach to observability should look like a well-run hospital. Think about the halls of a busy emergency room (ER). When a patient comes into the ER, it’s not enough for the doctors and nurses to diagnose the issue. They must respond quickly and accurately: triage whether the patient needs to be seen immediately, determine if an operating room is available, and assess how many personnel are required for treatment. The way an ER works, quickly and with purpose, exemplifies a resilient system that can handle each problem as it arrives.
Some organizations’ approach to observability is disjointed, with one observability solution used to diagnose unusual activity and another used to address it. This is like someone with a common cold visiting the ER, being diagnosed with pneumonia, and then being sent to a hospital two blocks away for treatment. Hampering resiliency further, an organization may run multiple, disconnected observability tools across its on-premises and cloud environments, compounding the confusion.
By taking a comprehensive approach to observability instead, you can limit the mean time to remediate (MTTR) and quickly improve the health of your IT system. The right observability solution will integrate with both on-prem data centers and cloud services, along with the remediation capabilities necessary to resolve IT issues. This also helps prevent silos in incident remediation, which can lead to an uncoordinated response and a worse outcome from an attack.
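As a sketch of what that connected approach could look like, the following Python shows one pipeline that triages and remediates alerts from both on-prem and cloud sources. Every function name here is a hypothetical stand-in for whatever containment, remediation, and ticketing hooks your own tooling provides:

```python
# A hedged sketch of a single detect-triage-remediate pipeline.
# All function names below are illustrative stubs, not a real product API.
from dataclasses import dataclass

@dataclass
class Alert:
    source: str    # e.g. "on-prem" or "cloud"; both feed one pipeline
    service: str
    severity: int  # 1 (low) .. 5 (critical)

def isolate_host(service: str) -> None:
    print(f"[contain] isolating {service}")                 # stub

def restart_service(service: str) -> None:
    print(f"[remediate] restarting {service}")              # stub

def open_incident(alert: Alert) -> None:
    print(f"[track] incident opened for {alert.service}")   # stub

def handle(alert: Alert) -> None:
    # Containment, remediation, and tracking live in one workflow,
    # so the response stays coordinated instead of split across tools.
    if alert.severity >= 4:
        isolate_host(alert.service)
    restart_service(alert.service)
    open_incident(alert)

# ER-style triage: handle the most severe alerts first, regardless of
# whether they originated on-prem or in the cloud.
alerts = [Alert("cloud", "auth-api", 5), Alert("on-prem", "db-01", 2)]
for alert in sorted(alerts, key=lambda a: a.severity, reverse=True):
    handle(alert)
```

The point of the sketch is the shape, not the stubs: detection, triage, and remediation share one data model and one queue, so nothing falls into the gap between disconnected tools.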
Proper observability also works hand in hand with best practices such as multi-factor authentication, encryption, and employee training to mitigate phishing emails. When you establish a comprehensive observability function, you can quickly identify and address system issues, minimizing the time it takes to recover from operational disruptions. The true test of cyber and operational resilience is how quickly you can recover and how far you can limit the impact of an incident.