IT Operations: How to Better Survive in the Trenches

Traditionally, IT operations teams are responsible for ‘keeping the lights on’ in an IT organization. This sounds simple, but the reality is harsh, with much complexity behind the scenes. Furthermore, digital transformation trends are quickly changing the IT operations staff’s responsibility from ‘keeping the lights on’ to ‘keeping the business competitive.’ 

IT operations personnel are now not only responsible for uptime, but also for the performance and quality of digital services provided by and to the business. To a large extent, maintaining available and high-performing digital services is precisely what it means to be digitally transformed.

I’ve spent my fair share of time as an MSP team lead, and on the operations floor in large IT organizations. The job of an enterprise IT ops pro is full of uncertainty. Let’s look at a typical day in the life of an IT operations person, and how she addresses common challenges like: 

  • Segregated monitoring and alerting tools causing confusion and unnecessary delays in troubleshooting.
  • Resolving a critical issue quickly through creative investigations that go beyond analyzing alert data.
  • Legacy processes, such as ITIL, working against the kind of open collaboration required to fix issues in the DevOps era.

Starting the Day with a Critical Application Outage

Karen is a Senior Network Analyst (L4 IT Ops rep) who works for a large global financial organization. She is considered a subject matter expert (SME) in network load balancing, network firewalls, and application delivery. She is driving to the office when she gets a call informing her that one of the company’s major banking applications is down.

Every minute of downtime affects the bottom line of the business. She finds parking and rushes to her desk, only to find hundreds of alert emails queued in her inbox. The alerts are coming from an application monitoring tool she can’t access (more on that later). 

The L1 Ops rep walks to Karen’s desk in a distressed state. Due to the criticality of the app, the outage caused the various monitoring and logging tools to generate hundreds of incidents, all of which were assigned to Karen. She spends considerable time looking through the incidents with no end in sight. Karen then logs in to each of her designated monitoring tools (network connectivity, bandwidth analysis, load balancer uptime, and firewall uptime), none of which indicates any issues.

Yet the application is still down, so Karen decides that the best course of action is to ignore the alert flood and the monitoring metrics and tackle the problem head-on. She starts troubleshooting every link in the application chain, confirming that the firewall ports are open and that the load balancer is configured correctly. She crawls through dozens of long log files, and finally, five hours later, discovers that the application servers behind the load balancer are unresponsive: bingo, the culprit has been identified.
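Karen’s link-by-link check can be sketched as a simple reachability probe. The following is a minimal Python sketch, not her actual procedure; the host names, ports, and chain layout are invented for illustration:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical application chain; host names and ports are illustrative,
# not taken from the incident described above.
CHAIN = [
    ("fw-edge-01.bank.example", 443),      # firewall-facing endpoint
    ("lb-vip-01.bank.example", 443),       # load balancer virtual IP
    ("app-server-01.bank.example", 8443),  # backend app server
    ("app-server-02.bank.example", 8443),  # backend app server
]

def check_chain(chain=CHAIN) -> dict:
    """Probe every link in the chain and report which ones are unreachable."""
    return {f"{host}:{port}": tcp_reachable(host, port) for host, port in chain}
```

Run against each link, a probe like this would have surfaced the unresponsive app servers behind the load balancer in seconds rather than hours.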

Root Cause Found, but More Stalls Ahead

Next, Karen contacts the application team. The person responsible for the application is out of the office, so the application managers schedule a war room call for two hours later. Karen joins the call from home, along with 12 other individuals, most of whom she’s never worked with in her role. 

The manager starts the call by reviewing the issue from every angle. Karen, however, knows that the issue was caused by two application servers. After a 30-minute discussion, Karen shares her screen, proving that the app servers are at fault. After further investigation, the application team discovers that an approved change, executed the night before, had changed the application’s TCP port: a critical error on the application team’s part.

Later investigations show that an APM (Application Performance Monitoring) tool generated a relevant alert and an incident that could have helped solve the issue much more quickly. The alert was missed by the application team, and, adding to the misery, the IT ops team didn’t have access to the APM system. Karen had no way of gathering telemetry (or confirming its absence) from the APM tool directly.

A Day Later, The Fix Is Applied

The application team requests approval for an emergency change so they can fix the application configuration file and restart the servers. The repair takes less than 10 minutes, but the application has been down for almost 24 hours. It is now 10 PM on Monday. Karen is exhausted, having worked a 14-hour day with no breaks. How does the business measure the value of the time Karen spent resolving this outage? While her manager applauds her analytical skills, it wasn’t the best use of her specialized skill set and definitely not how she should have spent her day (and night).

Does This Sound Familiar?

I’m sure the story above resonates with IT operations professionals; it is unfortunate that similar occurrences are common. Here are some takeaways:

The segregated monitoring and alerting tools did not provide operational value. That’s because the alerts and metrics were not centralized where all the appropriate stakeholders could view them, and were not mapped to the business service (in this case, the banking application).

Just because a tool generates alerts and incidents doesn’t mean it helps the user locate the root cause.

A flood of uncorrelated alerts and incidents makes matters worse. Many ops pros spend a lot of time looking at irrelevant data, sifting through the noise by eye. Karen quickly decided to go to the source, the application that was down, but not all IT ops people will do that.

Legacy processes (such as ITIL) are designed to restrain users from making abrupt changes by adding layers of process red tape. The flip side is that this prevents the ops person from fixing issues quickly when they arise. Karen did not have access to the application monitoring tool, nor was she allowed to communicate directly with the application team; she needed a manager to schedule a war room call. This hierarchy created costly delays, turning a five-to-10-minute fix into an all-day outage!

Creating a Better Path for IT Ops Pros 

Too many enterprise IT operations teams are living in the past, relying on disconnected tools and antiquated processes that don’t map well to the pace of change and complexity in modern IT environments. Applications will span on-premises infrastructure and multiple public clouds for the foreseeable future. Coupled with the growing volume of event data and the rising velocity of deployments, complexity will keep growing, and with it the risk to user productivity and customer experience.

Here’s an action plan for 2020 to better manage IT performance and enable IT Ops teams to be more productive: 

It’s time to seriously consider machine-learning alert and event correlation platforms. It is no longer humanly possible for ops staff to sift through the flood of alarm data. Machine-learning alert correlation products are maturing and providing tangible value to IT organizations.
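The core correlation idea can be illustrated with a deliberately simple sketch: group alerts whose messages share most of their tokens, so a flood of near-duplicate alerts collapses into a handful of clusters. Real products use far richer signals (topology, time windows, learned models); the threshold and alert texts below are invented for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity between two alert messages."""
    return len(a & b) / len(a | b) if a | b else 0.0

def correlate(alerts: list, threshold: float = 0.5) -> list:
    """Greedily cluster alerts whose token sets overlap above a threshold."""
    clusters = []  # list of (representative token set, member alerts)
    for alert in alerts:
        tokens = set(alert.lower().split())
        for rep_tokens, members in clusters:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(alert)
                break
        else:
            clusters.append((tokens, [alert]))
    return [members for _, members in clusters]

# Illustrative alert flood: the two health-check alerts collapse into one group.
alerts = [
    "app-server-01 health check failed",
    "app-server-02 health check failed",
    "load balancer pool degraded",
]
```

With even this naive grouping, an operator triages a few clusters instead of hundreds of individual incidents.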

It’s also time to restructure legacy processes designed for mostly static infrastructure and applications. Today’s application agility requires training IT ops staffers so that they can intuitively identify business risk and cooperate fluidly to keep digital services in an optimal state.

Finally, it’s time to reconsider the traditional siloed approach to IT ops monitoring and alerting. Keeping observability data in separate buckets provides little value unless it can be correlated with the respective business services.
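One lightweight way to express that correlation is to tag each alert-emitting component with the business service it supports, then group incoming alerts by service instead of by tool. A minimal sketch, with a hypothetical host-to-service map (all names below are invented):

```python
from collections import defaultdict

# Hypothetical mapping from infrastructure components to business services.
SERVICE_MAP = {
    "lb-vip-01": "Online Banking",
    "app-server-01": "Online Banking",
    "app-server-02": "Online Banking",
    "mail-gw-01": "Corporate Email",
}

def group_by_service(alerts: list) -> dict:
    """Bucket alerts by the business service their source host supports."""
    by_service = defaultdict(list)
    for alert in alerts:
        service = SERVICE_MAP.get(alert["host"], "Unmapped")
        by_service[service].append(alert)
    return dict(by_service)

# Illustrative alerts from different tools, unified by business service.
alerts = [
    {"host": "app-server-01", "msg": "health check failed"},
    {"host": "app-server-02", "msg": "health check failed"},
    {"host": "mail-gw-01", "msg": "queue backlog growing"},
]
```

Grouped this way, an outage immediately reads as "Online Banking is impacted," which is the view Karen and the application team both lacked.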

In taking these steps, we can create a new IT operations practice that supports, and even enhances, the elusive digital transformation that almost every company today would like to achieve.

Wael Altaqi is a Solutions Consultant at OpsRamp