Guest Column | March 9, 2020

In The Trenches With IT Operations

By Wael Altaqi, OpsRamp


Traditionally, IT operators are responsible for ‘keeping the lights on’ in an IT organization. This sounds simple, but the reality is harsh, with much complexity behind the scenes. Furthermore, digital transformation trends are quickly changing the IT Operations responsibility from ‘keeping the lights on’ to ‘keeping the business competitive’. IT operators are now not only responsible for uptime, but also for the performance and quality of digital services provided by and to the business. To a large extent, maintaining available and high-performing digital services is precisely what it means to be digitally transformed.

I’ve spent my fair share of time as an MSP team lead, and on the operations floor in large IT organizations. The job of an enterprise IT operator is full of uncertainty. Let’s look at a typical day in the life of an IT operator, and how she addresses common challenges like:

  • Segregated monitoring and alerting tools that cause confusion and unnecessary delays in troubleshooting;
  • Resolving a critical issue quickly, which demands creative investigation that goes beyond analyzing alert data;
  • Legacy processes, such as those rooted in ITIL, that work against the open collaboration required to fix issues in the DevOps era.

Starting The Day With A Critical Application Outage

Karen is a Senior Network Analyst (L4 IT Operator) who works for a large global financial organization. She is considered a subject matter expert (SME) in network load balancing, network firewalls, and application delivery. She is driving to the office when she gets a call informing her that a major banking application at her company is down. Every minute of downtime affects the bottom line of the business. She parks and rushes to her desk, only to find hundreds of alert emails queued in her inbox. The alerts are coming from an application monitoring tool she can’t access (more on that later).

The L1 operator walks to Karen’s desk in a distressed state. Because of the app’s criticality, the outage has caused the various monitoring and logging tools to generate hundreds of incidents, all of which are assigned to Karen. She spends considerable time looking through the incidents with no end in sight. She then logs on to her designated monitoring tools for network connectivity, bandwidth analysis, load balancing, and firewall uptime; none of them indicates any issue.

Yet the application is still down, so Karen decides that the best course of action is to ignore the alert flood and the monitoring metrics and tackle the problem head-on. She starts troubleshooting every link in the application chain, confirming that the firewall ports are open and that the load balancer is configured correctly. She crawls through dozens of long log files and, five hours later, finally discovers that the application servers behind the load balancer are unresponsive: bingo, the culprit is identified.
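
For illustration only, here is a minimal sketch of the kind of end-to-end reachability sweep that can shorten this hunt by testing every tier of the delivery chain at once. The host names, ports, and tier labels below are hypothetical placeholders, not details from Karen’s environment, and the sweep stands in for whatever scripted checks an operator might run.

    # Hypothetical sketch: a quick TCP reachability sweep across an application
    # delivery chain. Hosts, ports, and tier labels are illustrative placeholders.
    import socket

    CHECKS = [
        ("Public VIP (through firewall)", "banking-vip.example.com", 443),
        ("Load balancer admin interface", "lb01.example.com", 443),
        ("App server 1 (behind the LB)", "app01.example.com", 8443),
        ("App server 2 (behind the LB)", "app02.example.com", 8443),
    ]

    def tcp_check(host, port, timeout=3):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for label, host, port in CHECKS:
        status = "OK" if tcp_check(host, port) else "UNREACHABLE"
        print(f"{label:32} {host}:{port} -> {status}")

A sweep like this could have pointed at the unresponsive app servers in minutes rather than hours.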

Root Cause Found: Now More Stalls

Next, Karen contacts the application team. The person responsible for the application is out of the office, so the application managers schedule a war room call two hours later. Karen joins the call from home, along with 12 other individuals, most of whom she has never worked with in her role.

The manager starts the call by tackling all angles of the issue. Karen, however, already knows that two application servers are the cause. After a 30-minute discussion, she shares her screen and proves it. Further investigation by the application team reveals that an approved change executed the night before had altered the application’s TCP port: a critical error on the application team’s part.

A later review shows that an APM (Application Performance Monitoring) tool had generated a relevant alert and incident that could have helped resolve the issue much more quickly. The application team missed the alert and, adding to the misery, the IT Ops team didn’t have access to the APM system, so Karen had no way of gathering that telemetry (or confirming its absence) directly.

A Day Later, The Fix Is Applied

The application team requests approval for an emergency change so they can fix the application configuration file and restart the servers. The repair takes less than 10 minutes, but the application has been down for almost 24 hours. It is now 10 PM on Monday, and Karen is exhausted, having worked a 14-hour day with no breaks. How does the business measure the value of the time Karen spent resolving this outage? While her manager applauded her analytical skills, it wasn’t the best use of her specialized skillset and definitely not how she should have spent her day (and night).

Does this sound familiar?

I’m sure the story above resonates with IT Operations professionals; unfortunately, similar occurrences are all too common.

Here Are Some Takeaways:

  • The segregated monitoring and alerting tools did not provide operational value, because the alerts and metrics were neither centralized for all the appropriate stakeholders nor mapped to the affected business service: in this case, the banking application.
  • Just because a tool generates alerts and incidents, it doesn’t necessarily help the user locate the root cause.
  • A flood of uncorrelated alerts and incidents makes matters worse. Many operators spend hours manually sifting through noise and irrelevant data. Karen quickly decided to go straight to the source, the application that was down, but not every IT operator will do that.
  • Legacy processes (such as ITIL) are designed to guard against abrupt changes by adding layers of process red tape. The flip side is that this red tape prevents operators from fixing issues quickly when they arise. Karen did not have access to the application monitoring tool, nor was she allowed to communicate directly with the application team; she needed a manager to schedule a war room call. This hierarchy created costly delays that turned a five-to-10-minute fix into an all-day outage!

Creating A Better Path For IT Operators

Too many enterprise IT operations teams are living in the past, with disconnected tools and antiquated processes that don’t map well to the pace of change and complexity in modern IT environments. Applications are going to span on-premises infrastructure and multiple public clouds for the foreseeable future. Coupled with the growing volume of event data and the rising velocity of deployments, that hybrid footprint will keep adding complexity, and with it, greater risk to user productivity and customer experience.

Here’s an action plan for 2020 to better manage IT performance and enable IT Ops teams to be more productive:

  1. It’s time to seriously consider machine learning-driven alert and event correlation platforms. It is no longer humanly possible for operators to sift through the flood of alarm data, and these products are maturing and delivering tangible value to IT organizations (a simplified sketch of the underlying correlation idea follows this list).
  2. It’s also time to restructure legacy processes designed for mostly static infrastructure and applications. Today’s application agility requires training IT operators to intuitively identify business risk and collaborate fluidly to keep digital services in an optimal state.
  3. Finally, it’s time to reconsider the traditional siloed approach to IT Ops monitoring and alerting. Keeping observability data in separate buckets provides little value unless it can be correlated to the respective business services.
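
To make the correlation idea in points 1 and 3 concrete, here is a minimal, rule-based sketch that folds alerts hitting the same business service within a short time window into a single incident. Commercial platforms use machine learning and far richer signals; the alert fields, services, and records below are assumptions for illustration only.

    # Hypothetical sketch: rule-based alert correlation by business service and
    # time window. Field names, services, and alert records are illustrative.
    from datetime import datetime, timedelta

    alerts = [
        {"time": datetime(2020, 3, 9, 8, 1),  "source": "APM",       "service": "banking-app", "msg": "HTTP 5xx spike"},
        {"time": datetime(2020, 3, 9, 8, 2),  "source": "LoadBal",   "service": "banking-app", "msg": "All backends down"},
        {"time": datetime(2020, 3, 9, 8, 3),  "source": "Synthetic", "service": "banking-app", "msg": "Login check failed"},
        {"time": datetime(2020, 3, 9, 9, 30), "source": "Network",   "service": "trading-app", "msg": "Link flap"},
    ]

    WINDOW = timedelta(minutes=15)

    def correlate(alerts, window=WINDOW):
        """Fold alerts on the same business service within `window` into one incident."""
        incidents = []
        for alert in sorted(alerts, key=lambda a: a["time"]):
            for inc in incidents:
                if inc["service"] == alert["service"] and alert["time"] - inc["last_seen"] <= window:
                    inc["alerts"].append(alert)
                    inc["last_seen"] = alert["time"]
                    break
            else:
                incidents.append({"service": alert["service"], "last_seen": alert["time"], "alerts": [alert]})
        return incidents

    for inc in correlate(alerts):
        print(f"{inc['service']}: {len(inc['alerts'])} alerts -> 1 incident")

Grouping by business service is what turns hundreds of raw alerts into a handful of incidents that an operator like Karen can actually act on.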

By taking these three steps, we can create a new IT operations practice that supports and even enhances the elusive digital transformation that almost every company today would like to achieve.

About The Author

Wael Altaqi is a Solutions Consultant at OpsRamp.