Communication, Collaboration & Orchestration: 10 Vital Steps to IT Alerting Automation


No matter how robust your IT infrastructure, systems, and change and testing processes are, it is almost impossible to prevent every application slowdown, major service outage or security breach. These incidents can be costly: Ponemon has put potential losses at almost $9,000 per minute. Human error is often to blame, causing four out of five data breaches in the UK, according to the ICO, which shows that even the most robust system can fail if mismanaged.

The challenge for companies is to plan for these major IT incidents before they occur and, when they do happen, to respond with the velocity, control and communications that today’s digital world requires, thereby avoiding huge financial costs from lost revenues or damage to brand reputation.

A fast response is vital because, as bad as an initial failure may be, the repercussions only become more serious when the situation continues and an outage makes it impossible to give customers or employees accurate updates.

A better approach is to define procedures for the proper recovery of workflows and communicate them to all stakeholders. Alongside this, companies should automate the alerting and response processes for when something goes wrong. Here are 10 important steps that can help minimize the damage in such a scenario:

  1. Develop standard processes for the shutting down and restoration of servers, network gear and their power supplies, including each step to be performed, its expected result, and how to return to a previous safe condition if a change produces an unexpected result.
  2. Require approval by management for any changes to these processes.
  3. Create advisories within the workflow about what warning signs (such as noises or error messages) might signal various failures.
  4. Develop safeguards (such as required approval by a second person) for any action taken during the restoration process that could disrupt business-critical systems.
  5. Conduct periodic tests of the configuration and status of backup power systems and the switches that move the power load to backup sources.
  6. Conduct periodic audits to ensure that any new or upgraded hardware is provided for in the backup and restoration plan, and is properly configured for the failover of power and connectivity.
  7. Require periodic tests of the recovery of servers and databases and the rapid updating of production data.
  8. Share the approved processes with all outside service providers and internal stakeholders to ensure they are followed.
  9. Automate response workflows for when something does go wrong (e.g. when an outage is detected) to ensure timely communication of information to the right resolvers and to stakeholders such as customers.
  10. Keep audit trails to verify which resolvers received which information, whether they confirmed receipt of the information, and whether they took the required action.
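
Steps 9 and 10 can be pictured together in a few lines of Python. What follows is only a minimal sketch under stated assumptions, not a description of any particular alerting product: the names `AuditEntry`, `AuditTrail` and `notify_all` are hypothetical, and the `send` argument stands in for whatever transport (SMS, email, push notification) an organisation actually uses. It records who was sent what, whether they confirmed receipt, and what action they reported.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class AuditEntry:
    """One audit-trail row: a single alert sent to a single recipient."""
    recipient: str
    message: str
    sent_at: datetime
    acknowledged_at: Optional[datetime] = None
    action_taken: Optional[str] = None


@dataclass
class AuditTrail:
    """Hypothetical in-memory audit trail (a real one would be persisted)."""
    entries: List[AuditEntry] = field(default_factory=list)

    def record_sent(self, recipient: str, message: str) -> AuditEntry:
        entry = AuditEntry(recipient, message, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def record_ack(self, recipient: str, action_taken: Optional[str] = None) -> bool:
        """Mark the oldest unacknowledged alert for this recipient as confirmed."""
        for entry in self.entries:
            if entry.recipient == recipient and entry.acknowledged_at is None:
                entry.acknowledged_at = datetime.now(timezone.utc)
                entry.action_taken = action_taken
                return True
        return False  # nothing outstanding for this recipient

    def unacknowledged(self) -> List[str]:
        """Recipients who were alerted but have not yet confirmed receipt."""
        return [e.recipient for e in self.entries if e.acknowledged_at is None]


def notify_all(
    message: str,
    recipients: List[str],
    send: Callable[[str, str], None],
    trail: AuditTrail,
) -> None:
    """Send the same alert to every recipient and log each delivery attempt."""
    for recipient in recipients:
        send(recipient, message)  # assumed transport, supplied by the caller
        trail.record_sent(recipient, message)


if __name__ == "__main__":
    trail = AuditTrail()
    notify_all(
        "Outage detected: checkout service",
        ["dba-on-call", "network-on-call"],
        lambda who, msg: print(f"to {who}: {msg}"),
        trail,
    )
    trail.record_ack("dba-on-call", action_taken="failed over to replica")
    print(trail.unacknowledged())  # ['network-on-call']
```

Keeping the transport as a plain callable means the same audit logic can be exercised in a test without sending real messages.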

This sort of approach highlights the importance of communication as part of a response plan to a crisis. The ‘call us and we’ll fix it’ model no longer works. Support really needs to be far more proactive, and that means less work that involves ‘firefighting’ or ‘keeping the lights on’. We are living in a time when ‘instant’ is expected. We would not use Google or Bing if it meant typing in a question and waiting 10 or 15 minutes for an answer.

We are constantly told that the new-age worker is always on the go, and this is true, but it is important to remember that people also spend a good deal of time in meetings, take holidays, and even sleep! All of these things must be accounted for as part of your IT response, and traditional methods alone can’t be trusted to provide the speed and accuracy that maintaining modern IT systems requires. For example, a 24-hour online clothing store caters to a non-stop influx of customers. Any downtime, even at 3am, needs to be quickly addressed and resolved.
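
One way to account for shifts and holidays is to let the alerting system pick a recipient from a rota instead of from a fixed contact list. The sketch below is illustrative only: `Responder` and `pick_responder` are hypothetical names, shifts are assumed to be simple daily time windows, and a holiday is assumed to cover the whole calendar day.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time
from typing import List, Optional, Set


@dataclass
class Responder:
    """Hypothetical rota entry: a person, their daily shift and their days off."""
    name: str
    shift_start: time
    shift_end: time
    holidays: Set[date] = field(default_factory=set)

    def is_available(self, now: datetime) -> bool:
        if now.date() in self.holidays:
            return False
        current = now.time()
        if self.shift_start <= self.shift_end:
            # Ordinary daytime shift, e.g. 09:00 to 17:00.
            return self.shift_start <= current < self.shift_end
        # Overnight shift that wraps past midnight, e.g. 22:00 to 06:00.
        return current >= self.shift_start or current < self.shift_end


def pick_responder(rota: List[Responder], now: datetime) -> Optional[Responder]:
    """Return the first responder in rota order who is on shift right now."""
    for responder in rota:
        if responder.is_available(now):
            return responder
    return None  # a gap in cover: nobody is on shift at this time


if __name__ == "__main__":
    rota = [
        Responder("day engineer", time(9, 0), time(17, 0)),
        Responder("night engineer", time(22, 0), time(6, 0)),
    ]
    three_am = datetime(2024, 1, 10, 3, 0)
    print(pick_responder(rota, three_am).name)  # night engineer
```

Returning `None` when nobody is on shift makes gaps in cover visible, so they can be found during planning rather than during a 3am outage.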

Asynchronous communication isn’t suitable for situations requiring rapid response. Businesses need to know beyond any doubt that the other party has received and is responding to their message. What’s more, these messages need to be sent right away and responses received as soon as possible. As the velocity of business increases, ever more rapid responses are required. It is simply not enough to leave a voicemail in the midst of a critical situation and trust that it will be heard and acted on quickly.

Instead, businesses should deploy an automated system that can send alerts, process responses and report whether critical stakeholders have responded. Given the threats businesses face day-to-day and the speed that is required to address them, relying on manual action is no longer acceptable. By taking into account the 10 steps laid out above, companies can ensure that their policies and systems are robust enough to handle the modern threats of today’s cyber-landscape.
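
The core of such a system is a loop that keeps escalating until somebody confirms receipt. The following is a minimal sketch under assumptions, not a reference implementation: `alert_until_acknowledged` is a hypothetical function, the escalation chain is a plain list of contact names, and `send_and_wait` stands in for a real delivery channel that sends the alert and reports whether an acknowledgement arrived within the timeout.

```python
from typing import Callable, List, Optional, Tuple


def alert_until_acknowledged(
    chain: List[str],
    send_and_wait: Callable[[str, float], bool],
    ack_timeout_s: float = 300.0,
) -> Tuple[Optional[str], List[str]]:
    """Work through an escalation chain until one contact acknowledges.

    send_and_wait(contact, timeout) is an assumed callable that delivers the
    alert and returns True only if the contact confirmed within the timeout.
    Returns (who acknowledged, or None) plus everyone who was tried, in order.
    """
    tried: List[str] = []
    for contact in chain:
        tried.append(contact)
        if send_and_wait(contact, ack_timeout_s):
            return contact, tried  # stop escalating once someone has responded
    return None, tried  # chain exhausted without any acknowledgement


if __name__ == "__main__":
    # Simulated channel: only the duty manager picks up.
    def fake_channel(contact: str, timeout: float) -> bool:
        return contact == "duty manager"

    chain = ["primary on-call", "secondary on-call", "duty manager"]
    print(alert_until_acknowledged(chain, fake_channel))
    # ('duty manager', ['primary on-call', 'secondary on-call', 'duty manager'])
```

Unlike a voicemail, the caller always learns the outcome: either a named person has acknowledged the alert, or the whole chain was tried and nobody responded.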
