The key to keeping IT systems operational in the face of hardware failure is to reduce risk and have a solid plan B. That means building in redundancy in case something goes wrong, predicting when infrastructure might fail, and working to prevent it. Here are a few steps you can take to avoid downtime and minimize the impact of outages.
Regularly Test Server Backups
When a server does go down, you can limit the damage by getting it back online promptly. If you can’t, you need to be able to restore from a backup. Regularly check both your physical and virtual backups, and run test restores to confirm that the process actually works. A backup is only useful if you can get your data back.
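One way to make a restore test concrete is to compare checksums of the original files against the restored copy. The following is a minimal sketch in Python, not a replacement for your backup tool's own verification. The function names (`build_manifest`, `verify_restore`) are hypothetical, and it assumes the backup has already been restored to a scratch directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source_manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return the problems found in the restored copy; an empty list means it matches."""
    restored = build_manifest(restored_root)
    problems = []
    for name, checksum in source_manifest.items():
        if name not in restored:
            problems.append(f"missing: {name}")
        elif restored[name] != checksum:
            problems.append(f"corrupted: {name}")
    return problems
```

In practice you would build the manifest when the backup is taken, store it alongside the backup, and run `verify_restore` after each test restore. Any non-empty result is a reason to investigate before you need the backup for real.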
Check On Your Facilities
Threats to your infrastructure exist in both the physical and digital worlds. Water damage, fires, people spilling things, or furry intruders chewing cables can all cause big problems. Perform weekly physical checks to make sure everything is in good condition. Look for obvious threats to your hardware, such as missing plastic enclosures, cables that someone could trip over, blocked airflow, damaged equipment, or facility problems like an overheating server room.
Monitor Your Devices
Monitoring a server’s health lets you spot warning signs before it fails. This is where a network monitoring solution comes in handy: it can alert you to unusual events, such as high CPU or memory usage or a server that suddenly reboots itself. That gives you an extra layer of protection.
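To illustrate the kind of rule a monitoring tool applies, here is a small Python sketch that flags high CPU or memory usage and detects an unexpected reboot by noticing that uptime has gone backwards between two samples. The `HealthSample` type, the `find_warnings` function, and the 90% limits are all assumptions made for the example; real products collect these values for you and let you configure the thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthSample:
    host: str
    cpu_percent: float
    memory_percent: float
    uptime_seconds: float


def find_warnings(
    previous: Optional[HealthSample],
    current: HealthSample,
    cpu_limit: float = 90.0,     # assumed threshold, tune to your baseline
    memory_limit: float = 90.0,  # assumed threshold, tune to your baseline
) -> list[str]:
    """Compare the latest sample against limits and the previous sample."""
    warnings = []
    if current.cpu_percent > cpu_limit:
        warnings.append(f"{current.host}: CPU at {current.cpu_percent:.0f}%")
    if current.memory_percent > memory_limit:
        warnings.append(f"{current.host}: memory at {current.memory_percent:.0f}%")
    # Uptime only grows while a machine stays up, so a drop means it restarted.
    if previous is not None and current.uptime_seconds < previous.uptime_seconds:
        warnings.append(f"{current.host}: rebooted since last check")
    return warnings
```

Passing the previous sample in explicitly keeps the function free of hidden state, which makes the rule easy to test and reason about.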
You should also periodically check other devices, such as switches, workstations, and firewalls, to make sure everything is operating as it should and no settings are incorrect. You can automate these checks with network inventory and network configuration management software.
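At its core, an automated settings check compares each device's current configuration with a known-good baseline and reports any drift. The sketch below shows that idea in Python, assuming the settings have already been exported as simple key/value pairs. The `find_drift` function and the setting names in the usage example are hypothetical.

```python
def find_drift(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """List every setting that differs between the baseline and the device."""
    findings = []
    for key in sorted(baseline.keys() | current.keys()):
        if key not in current:
            findings.append(f"{key}: missing (expected {baseline[key]!r})")
        elif key not in baseline:
            findings.append(f"{key}: unexpected setting {current[key]!r}")
        elif baseline[key] != current[key]:
            findings.append(f"{key}: expected {baseline[key]!r}, found {current[key]!r}")
    return findings


# Example with made-up settings for a switch.
baseline = {"ntp_server": "10.0.0.5", "ssh": "enabled", "telnet": "disabled"}
current = {"ntp_server": "10.0.0.5", "ssh": "enabled", "telnet": "enabled"}
for finding in find_drift(baseline, current):
    print(finding)
```

Dedicated configuration management software adds the parts this sketch leaves out, such as collecting the settings from each device and keeping a history of changes.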
Regularly Update Devices
To keep your devices healthy, stable, and secure, you need to install updates and patches for your operating systems, hardware, and applications. You can do this on a regular basis for your workstations. Server maintenance, however, usually requires some downtime that affects many users, so take special precautions to keep the disruption small.
Try to schedule your maintenance for a time that will best minimize the impact on users. This will usually be outside of work hours. Let people know about the planned outage well in advance, and send a reminder just before. An email or calendar invite works well.
If you’re updating multiple servers, automate patching with a tool like WSUS to minimize downtime. If you use network monitoring software, planned maintenance can bombard your IT team with alerts as servers go offline. Use software that lets you mute alerts for a specified time period so the team isn’t hit with false alarms.
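The muting logic itself is simple: before sending an alert, check whether the affected server is inside a scheduled maintenance window. Here is a minimal Python sketch of that check. The `MaintenanceWindow` type and `should_notify` function are hypothetical names for illustration, and most monitoring products provide this as a built-in setting rather than something you write yourself.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    host: str
    start: datetime
    end: datetime

    def covers(self, host: str, moment: datetime) -> bool:
        """True if the given host is under maintenance at the given moment."""
        return host == self.host and self.start <= moment < self.end


def should_notify(host: str, moment: datetime, windows: list[MaintenanceWindow]) -> bool:
    """Suppress alerts only for hosts inside one of their own windows."""
    return not any(window.covers(host, moment) for window in windows)


# Example: mute alerts for one server during a two-hour Saturday patch window.
windows = [
    MaintenanceWindow("db01", datetime(2024, 6, 1, 22, 0), datetime(2024, 6, 2, 0, 0)),
]
print(should_notify("db01", datetime(2024, 6, 1, 23, 0), windows))   # False
print(should_notify("web01", datetime(2024, 6, 1, 23, 0), windows))  # True
```

Scoping each window to a single host matters: an unrelated server that fails during the maintenance period should still raise an alert.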
Image Credits: Christian Wiediger