
Outages: What To Do When They Happen

9/09/2020 06:49

Florin R. Ferrs

Florin is a technology writer based in Silicon Valley.


Outages are the bane of existence for IT managers, network administrators, and MSPs the world over. Some IT professionals think of outages as an inevitable part of their job, somewhat akin to crashes in Formula One racing. You can prepare for them and design technology to make them survivable, but you’ll never be able to eliminate them permanently.

This “s***t happens” attitude towards outages among some veteran IT professionals might be considered appropriate, healthy even, particularly when you consider the fast pace at which IT tools, products, and services are being developed and deployed these days. This rapid growth in the IT world pretty much guarantees that outages will happen, whether locally, on the ISP side, or on cloud servers.

The inevitability of outages is even accepted as par for the course by tech giants like Microsoft, whose ubiquitous cloud-based office suite, Office 365, doesn’t even offer a 100% uptime guarantee.

The Trick Is To Be Prepared

The reality for IT pros is that the software and services they implement will, statistically, have an outage once or even twice a year. So how can you prepare your IT team for the outages they will inevitably face?

Some IT professionals see the almost ubiquitous implementation of cloud-based servers as one of the solutions to their outage problems. And on the surface, it may make some sense. No more worrying about ants or mice literally eating through the cables of your in-house servers, or a fire at the local server farm. But cloud-based servers, while extremely efficient, also face outage problems: witness Azure’s infamous APAC outage earlier this year, and the many large enterprises reporting that Amazon’s cloud services can be ‘spotty’.

Just Another Day At The Office

The inevitability of outages is practically embedded in the code of our proverbial veteran IT managers and sysadmins. And their ‘s***t happens’ attitude may even be a good thing, as long as steps are taken to prevent outages in the first place and to also mitigate the damage when they do occur.

However, this hardened attitude towards outages is no comfort for less-experienced system administrators, who are often thrown to the wolves, or are new to the job and are suddenly sucked into the black hole of a full corporate outage.

Consider this desperately candid cry for help from one such overwhelmed sysadmin on Reddit:

“I’m currently hiding in the server room because there is an ISP outage and I’m too afraid to tell everyone that I can’t fix anything yet!”

Sound familiar? Most IT professionals have been in a similar situation. If you haven’t yet, then it’s practically guaranteed that you will in the near future. For this very reason, we’ve put together a quick list of best practices that, if implemented, could save you from having to hide in your server room the next time there’s an outage!

What To Do When Outages Happen (& How To Prevent Them)

OK, so you’ve been notified of an outage. What should you do? First of all, take a deep breath, and whatever you do, don’t try to hide in the server room. That’s the first place they will go looking for you!

1-Monitor For Server Outages

Be proactive rather than reactive. Have systems in place that will give you and your team a heads-up when things are starting to go wrong. These tools can help prevent major issues from occurring in the first place.
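As a rough illustration of what such a heads-up check might look like, here is a minimal Python sketch that tries to open a TCP connection to each service you care about and reports the ones that don't answer. The host names and ports are hypothetical placeholders, and a real monitoring tool does far more than this (retries, alert routing, history).

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def failed_checks(targets: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of the targets that did not answer."""
    return [name for name, (host, port) in targets.items()
            if not is_reachable(host, port)]


if __name__ == "__main__":
    # Hypothetical targets; replace with your own servers and ports.
    targets = {
        "file server": ("fileserver.example.internal", 445),
        "mail gateway": ("mail.example.internal", 25),
    }
    for name in failed_checks(targets):
        print(f"ALERT: {name} is not answering")
```

Run on a schedule (cron, Task Scheduler), a script like this only tells you that a port stopped answering, not why, so treat it as an early-warning signal rather than a diagnosis.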

2-Deploy A "Smart Monitor" For All Your Servers

These should be set up to trigger an alarm only under certain conditions:

-The server rebooted during specified business hours.

-The server booted into Safe Mode (SM) or Directory Services Restore Mode (DSRM).

-A daily check finds that the server is configured to boot into SM or DSRM at the next reboot.
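The three conditions above can be sketched as a small decision function. This is only an illustration: the ServerState fields, the 08:00 to 18:00 weekday business hours, and the mode names are assumptions made for the example, and collecting the actual boot information from a server is left to whatever monitoring agent you use.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class ServerState:
    """Hypothetical snapshot of a server, as a monitoring agent might report it."""
    last_boot: datetime      # when the server last started
    boot_mode: str           # "normal", "safe_mode" or "dsrm"
    next_boot_mode: str      # mode configured for the next reboot


# Assumed business hours: weekdays, 08:00 to 18:00.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)
SPECIAL_MODES = {"safe_mode", "dsrm"}


def alarm_reasons(state: ServerState, since: datetime) -> list[str]:
    """Return the reasons to raise an alarm; an empty list means stay quiet."""
    reasons = []

    # Condition 1: a reboot since the last check, during business hours.
    rebooted = state.last_boot >= since
    weekday = state.last_boot.weekday() < 5
    in_hours = BUSINESS_START <= state.last_boot.time() < BUSINESS_END
    if rebooted and weekday and in_hours:
        reasons.append("rebooted during business hours")

    # Condition 2: the server is currently running in SM or DSRM.
    if state.boot_mode in SPECIAL_MODES:
        reasons.append(f"booted into {state.boot_mode}")

    # Condition 3 (daily check): SM or DSRM is configured for the next reboot.
    if state.next_boot_mode in SPECIAL_MODES:
        reasons.append(f"next reboot set to {state.next_boot_mode}")

    return reasons
```

Returning a list of reasons rather than a single yes/no keeps the alarm message specific, which matters when the person reading it is already under pressure.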

3-Maintain Existing Hardware & Reliable Cloud Solutions

Hardware is like an automobile: it needs constant maintenance. Set up checkpoints and scheduled tasks to review your systems regularly and make sure everything is operating at full capacity. If you don’t manage your own hardware, be sure to review your cloud provider’s best practices for managing updates.
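One example of a scheduled review task is a disk-space check. The sketch below uses Python's standard shutil.disk_usage; the 85% threshold and the list of paths are assumptions for illustration, and disk space is only one of many things worth reviewing.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the volume holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def volumes_over_threshold(paths, threshold_percent: float = 85.0):
    """Return (path, percent used) for every volume above the threshold."""
    over = []
    for path in paths:
        percent = disk_usage_percent(path)
        if percent > threshold_percent:
            over.append((path, round(percent, 1)))
    return over


if __name__ == "__main__":
    # Hypothetical list of volumes to review; adjust to your own systems.
    for path, percent in volumes_over_threshold(["/"]):
        print(f"WARNING: {path} is {percent}% full")
```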

4-Get Proactive During The First Stages Of The Outage

Take care of the things that are actually under your control, like changing DNS settings or restarting the routers. If you have backup hardware or services, pull them online to see if that gets you out of the outage.

If the outage is external, contact your ISP, hosting provider, or cloud server host, make sure that there is a ticket open, and keep in touch with them for updates.
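Deciding whether an outage is under your control or external can be reduced to a few quick checks. The following sketch is one possible heuristic, not an established procedure: it assumes you have already tested whether your gateway answers, whether an outside address is reachable by IP, and whether DNS resolves, and the category names are made up for this example.

```python
# Suggested first steps per category, drawn from the advice above.
NEXT_STEPS = {
    "local": "Restart the router or bring backup hardware online.",
    "external": "Contact your ISP or provider, open a ticket, and ask for updates.",
    "dns": "Switch to alternative DNS settings.",
    "none": "Connectivity looks fine; look at the affected service itself.",
}


def classify_outage(gateway_ok: bool, internet_ok: bool, dns_ok: bool) -> str:
    """Classify an outage from three basic connectivity checks.

    gateway_ok:  the local router/gateway answers
    internet_ok: an outside address is reachable by IP
    dns_ok:      host names resolve
    """
    if not gateway_ok:
        return "local"       # the problem is inside your own network
    if not internet_ok:
        return "external"    # local network is fine, upstream is not
    if not dns_ok:
        return "dns"         # traffic flows, but names do not resolve
    return "none"


if __name__ == "__main__":
    category = classify_outage(gateway_ok=True, internet_ok=False, dns_ok=False)
    print(category, "->", NEXT_STEPS[category])
```

The order of the checks matters: there is no point blaming DNS or the ISP if the local gateway itself is down.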

5-Create Emergency Access Accounts

As we mentioned before, ubiquitous cloud services like Office 365 don’t actually guarantee 100% uptime, and with the issues that Azure has had lately, it’s wise to create two or more emergency access accounts. These should be cloud-only accounts that are not federated or synchronized from your on-premises environment. They become very helpful if an outage also takes out your local cell phone coverage, VPN, or VoIP setup, which would effectively block two-step verification and, with it, access to your cloud servers.

6-Suggest Installing A Second ISP Line For Failover Redundancy

Here’s where your people skills as an IT manager will come in handy with your higher-ups, as some bean counters may balk at the extra expense. Make a business case for having this redundancy versus going through an outage. Some ISPs offer backup via their mobile plan; this can be a relatively easy solution.
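The business case usually comes down to simple arithmetic: what outages are expected to cost in a year versus what the second line costs. Here is a back-of-the-envelope sketch. Every figure in the usage example is a hypothetical placeholder (only the "once or twice a year" frequency comes from this article), so plug in your own numbers.

```python
def expected_annual_outage_cost(outages_per_year: float,
                                hours_per_outage: float,
                                cost_per_hour: float) -> float:
    """Expected yearly cost of downtime without a backup line."""
    return outages_per_year * hours_per_outage * cost_per_hour


def net_annual_benefit(outages_per_year: float,
                       hours_per_outage: float,
                       cost_per_hour: float,
                       backup_monthly_cost: float,
                       share_of_outages_covered: float) -> float:
    """Downtime cost avoided by the backup line, minus what the line costs.

    share_of_outages_covered is the fraction (0 to 1) of outages that a
    second ISP line would actually ride through, e.g. ISP-side failures.
    """
    avoided = expected_annual_outage_cost(
        outages_per_year, hours_per_outage, cost_per_hour
    ) * share_of_outages_covered
    return avoided - 12 * backup_monthly_cost


if __name__ == "__main__":
    # Hypothetical figures: 2 outages a year, 4 hours each, 1,500 per hour of
    # downtime, a backup line at 100 per month that covers half of outages.
    benefit = net_annual_benefit(2, 4, 1500.0, 100.0, 0.5)
    print(f"Net annual benefit of the second line: {benefit:.2f}")
```

A positive result means the line pays for itself on expected downtime alone, before counting harder-to-price costs like lost customer goodwill.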

7-Internal Communications

Step one: Don’t hide in your server room! Outages are the time when honest communications win the day. Get ahead of the problem: tell the staff that there’s a problem, give them an estimated time to recovery, and let them know that your team is all hands on deck working to resolve the outage.

8-Customer Communications

Create a workable and honest SLA (Service Level Agreement) with your customers covering expected response and completion times. Have a plan to get reliable and transparent communication out to your customers. A speedy response and open communication will help mitigate most of the fallout you get when your customers are getting bad service. Transparency wins the day.
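Once an SLA defines a completion target, tracking tickets against it is straightforward. The sketch below is a minimal illustration: the status names and the "at risk" rule (less than a quarter of the target time remaining) are assumptions for this example, not part of any standard or of a particular PSA product.

```python
from datetime import datetime, timedelta
from typing import Optional


def sla_status(opened: datetime,
               now: datetime,
               target: timedelta,
               completed: Optional[datetime] = None) -> str:
    """Return 'met', 'breached', 'at risk' or 'on track' for one ticket."""
    deadline = opened + target

    # A closed ticket is judged by when it was completed.
    if completed is not None:
        return "met" if completed <= deadline else "breached"

    # An open ticket is judged by how much time is left.
    if now > deadline:
        return "breached"
    if deadline - now <= target * 0.25:
        return "at risk"
    return "on track"


if __name__ == "__main__":
    opened = datetime(2020, 9, 9, 6, 49)
    # Hypothetical 4-hour completion target, checked 3.5 hours in.
    print(sla_status(opened, opened + timedelta(hours=3, minutes=30),
                     timedelta(hours=4)))
```

Flagging tickets as "at risk" before they breach gives you time to warn the customer, which fits the point about getting ahead of the problem.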

9-Public Communications

For major outages, like the infamous APAC outage and others like it, it becomes imperative for IT professionals to control the narrative on social media. Have a plan in place with your digital marketing team to respond to customers on social media and get ahead of the game via honest communication.

10-Use NinjaRMM + SherpaDesk

The powerful combination of an RMM (NinjaRMM) integrated with a full-service PSA (SherpaDesk) is the ideal dynamic duo to help you deal with outages. It enables you to set up server alerts, knowledge-base articles with outage procedures, and automated tickets for when servers go down. SherpaDesk becomes your control center from which to manage and share outage emergency tactics and procedures among your team.

In Conclusion

Outages happen. Smart IT professionals do all they can to prevent them, but since a lot of it will be out of your control, the best course of action is to have a solid plan beforehand and to make sure that your team is well trained and knows what to do during an outage.

Just remember, if all else fails, there’s always a very easy solution: the fake virus attack.