Server Outage - Prevention and Immediate Measures
A server outage on a personal website is mostly a nuisance. For a business, however, it can quickly lead to significant financial losses and damage reputation and customer trust. It is therefore essential to monitor availability continuously through server monitoring and to take countermeasures immediately when a disruption occurs. Otherwise, valuable time can be lost between the outage and the first response, and a limited incident can grow into one with serious consequences. Prompt action usually reduces both the duration and the impact of an outage, for example by quickly redirecting the website to a temporary emergency instance.
For businesses above a certain size, a comprehensive security concept is mandatory. Its purpose is to avoid server risks such as an outage wherever possible, to limit the consequences if one occurs, and to coordinate the countermeasures. Given the multitude of possible causes, it covers not only a plan for acute crisis management but also, where possible, an alternative IT structure as a backup. This makes it possible to respond to external factors such as a DDoS attack, hardware or power supply failures, and force majeure, as well as to internal factors such as configuration and programming errors or inadequate maintenance.
Table of Contents:
Causes of a Server Outage
Risk Minimisation
Costs of a Server Outage
Countermeasures on Multiple Levels
Liability and Compensation
Initial Countermeasures in the Event of an Acute Server Outage
Causes of a Server Outage
Despite various security measures, server outages are more the rule than the exception, as shown by a 2013 study commissioned by HP Germany. The study surveyed approximately 300 medium-sized businesses with 200 to 4,999 employees on the frequency of and reasons for server outages. More than three quarters of the participants reported at least one outage in critical areas within the previous 12 months. Possible triggers include:
- Attacks such as a DDoS attack
- Hardware failures including CPU, hard drives, or expansion cards
- Bugs in the software
- Network issues caused by routers, switches, security servers, or cabling
- Human error
- Targeted cybercrime or espionage such as spear phishing, social engineering, or data theft through man-in-the-middle attacks
- Infiltration of critical areas by viruses, worms, ransomware, or trojans
- Accidents like fire
- Failures of external services, such as power outages
- Internal or external sabotage through manipulation of SCADA systems
- Exploitation of security vulnerabilities to penetrate the network
- Issues with the operating system (Windows Blue Screen, Linux Kernel Panic)
The sheer number of affected companies shows how difficult it is to achieve complete security, even in isolated areas. Even fully isolating critical systems does not eliminate the risk that external factors such as accidents, loss of power supply, or sudden hardware failures interrupt continuous operation and 24/7/365 availability.
Risk Minimisation
Many of the potential scenarios can be avoided entirely, or their likelihood reduced to a very low level, by implementing appropriate security measures. The cost-benefit ratio, meaning the relationship between effort and expected impact, must always be taken into account. Legal and psychological aspects also play an important role. Fully monitoring an employee is technologically feasible, but it rarely promises success and is permissible and sensible only in cases of acute suspicion. Legislation and data protection restrict such monitoring, and it also undermines trust, internal cooperation, the work atmosphere, and creativity, which can significantly reduce productivity and the willingness to innovate.
Costs of a Server Outage
In every industry and for every server type, a server outage immediately causes high costs, arising both from the interruption of operations and from the measures needed to resolve the situation and mitigate its consequences. A real-time backup can be run as a secondary data backup with relatively little effort and preserves ongoing processes and incoming information, for example in an e-commerce system, without loss. In the event of an outage, however, it only ensures data integrity; it does not eliminate the expenses for personnel, system restoration, and root cause analysis. Depending on the size of the company, these costs can rise significantly if countermeasures to maintain productivity are not taken promptly. According to the HP Germany study, the average cost of a server outage was €25,000 per hour, rising to €40,000 or more for medium-sized companies with more than 1,000 employees.
Statistically, repairing a server outage took about 3.8 hours. The resulting damage therefore ranges from about €90,000 to €150,000 per incident and adds up to about €380,000 per year at an average annual downtime of 12 to 16 hours. The manufacturing industry and the interconnected structures of a Smart Factory are particularly affected, because with Just-in-Time production an outage can temporarily halt the entire process chain. Even within an intelligent process chain, a local server outage can be compensated for by redistributing resources only to a limited extent. The consequences of an incident therefore include both short-term and long-term financial costs that go far beyond the direct repair. These include:
- Personnel costs for actively eliminating the damage
- Components or servers as spare parts
- Loss of revenue due to website unavailability
- Data reconstruction
- Restructuring, monitoring, and restarting processes
- Production or logistics interruption
- Communication with affected existing and potential customers
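The per-incident and yearly figures quoted above can be reproduced with a few lines of arithmetic. The following is only a rough sketch: the hourly rates and the repair time come from the study cited in the text, while the function name and the assumption of four outages per year (15.2 hours, inside the 12 to 16 hour range) are illustrative.

```python
# Figures quoted in the text (HP Germany study).
HOURLY_COST_AVERAGE_EUR = 25_000  # average cost per hour of outage
HOURLY_COST_LARGE_EUR = 40_000    # companies with more than 1,000 employees
MEAN_REPAIR_HOURS = 3.8           # statistical repair time per outage


def outage_cost(hours: float, hourly_cost_eur: float) -> float:
    """Return the estimated cost in euros of an outage lasting `hours`."""
    return hours * hourly_cost_eur


per_incident_average = outage_cost(MEAN_REPAIR_HOURS, HOURLY_COST_AVERAGE_EUR)
per_incident_large = outage_cost(MEAN_REPAIR_HOURS, HOURLY_COST_LARGE_EUR)

# Assumption (not from the study): four outages per year.
OUTAGES_PER_YEAR = 4
yearly_average = outage_cost(OUTAGES_PER_YEAR * MEAN_REPAIR_HOURS,
                             HOURLY_COST_AVERAGE_EUR)

print(f"Per incident (average): EUR {per_incident_average:,.0f}")  # EUR 95,000
print(f"Per incident (large):   EUR {per_incident_large:,.0f}")    # EUR 152,000
print(f"Per year (average):     EUR {yearly_average:,.0f}")        # EUR 380,000
```

The computed values of 95,000 and 152,000 match the rounded range given in the text. These are direct downtime costs only; the items in the list above, such as data reconstruction or customer communication, come on top.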
Countermeasures on Multiple Levels
Although numerous efficient measures exist, a server outage caused by accidents, technical or human error, or targeted attacks cannot be ruled out completely. Comprehensive protection requires complex and costly management and is usually worthwhile only for critical IT infrastructure such as energy and water supply, public safety institutions, and telecommunications. In most cases, real-time website monitoring combined with an emergency plan is sufficient. Countermeasures vary depending on internal and external influences and on the scenarios to be covered. They include:
- 24/7/365 Real-time server monitoring with automatic alerts for issues
- Operating system security
- Firewall and additional filters for detecting a DDoS attack
- Verified hardware with a low failure probability (availability above 99 percent)
- Regular modernisation of the IT infrastructure
- Flexible clusters of multiple servers with hot swap
- Redundant networks and IT structures
- Ongoing data backup through mirrored backups at different locations
- Automated standby systems as a secondary instance in emergencies
- Physical countermeasures such as fire protection and access controls
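The first item in the list, real-time monitoring with automatic alerts, can be sketched with the Python standard library alone. This is a minimal illustration and not a replacement for a monitoring service: the names `is_reachable`, `OutageDetector`, and `monitor`, the threshold of three consecutive failures, and the 60-second interval are all assumptions made for the example.

```python
import time
import urllib.request
from typing import Optional


def is_reachable(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), and timeouts all derive from OSError.
        return False


class OutageDetector:
    """Signal an alert only after several consecutive failed checks."""

    def __init__(self, failures_before_alert: int = 3) -> None:
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.consecutive_failures == self.failures_before_alert


def monitor(url: str, detector: OutageDetector,
            interval_seconds: float = 60.0,
            max_checks: Optional[int] = None) -> None:
    """Poll `url` and print an alert when the detector fires."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if detector.record(is_reachable(url)):
            # Placeholder: a real setup would send an e-mail, SMS, or page.
            print(f"ALERT: {url} failed "
                  f"{detector.consecutive_failures} checks in a row")
        checks += 1
        time.sleep(interval_seconds)


# Example call (hypothetical URL):
# monitor("https://www.example.com/", OutageDetector())
```

Requiring several failed checks in a row avoids alerts for a single dropped request, at the cost of a slightly later notification. A check like this should run from a machine outside the monitored infrastructure, otherwise it fails together with the server it watches.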
Due to the variety of possible scenarios for a server outage, maintaining an additional, less powerful IT infrastructure for emergencies is recommended, especially for small and medium-sized enterprises. This can range from a simple web server for communication and for operating an emergency page (e.g. announcing technical work or maintenance) to redundant process chains with alternative embedded devices and secondary subnets in production. The switch from the primary to the secondary IT structure is usually automated and takes place in real time as soon as server monitoring detects a server outage, a critical state, or a partial failure.
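The automated switch described here boils down to a routing decision based on the state reported by monitoring. The sketch below shows one possible form of that decision. The state names follow the text, but the `ServerState` enum, the `select_target` function, the returned labels, and the rule that any state other than healthy triggers the switch are assumptions for illustration.

```python
from enum import Enum


class ServerState(Enum):
    HEALTHY = "healthy"
    PARTIAL_FAILURE = "partial failure"
    CRITICAL = "critical state"
    DOWN = "outage"


def select_target(primary: ServerState, secondary: ServerState) -> str:
    """Decide which instance should receive traffic.

    Assumption: any state other than HEALTHY counts as a reason to
    switch away from an instance.
    """
    if primary is ServerState.HEALTHY:
        return "primary"
    if secondary is ServerState.HEALTHY:
        return "secondary"
    # Neither instance is usable: fall back to a static emergency page.
    return "emergency-page"


print(select_target(ServerState.HEALTHY, ServerState.HEALTHY))   # primary
print(select_target(ServerState.CRITICAL, ServerState.HEALTHY))  # secondary
print(select_target(ServerState.DOWN, ServerState.DOWN))         # emergency-page
```

In practice the decision would then be applied through a load balancer, a DNS change, or a cluster manager. Which mechanism fits depends on the infrastructure and is not specified in the text.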
Liability and Compensation
With an in-house infrastructure, external providers can rarely be held responsible for a server outage. Exceptions are, for example, proven faulty or grossly negligent behaviour under a regular support contract, or hardware that does not meet the conditions guaranteed by the manufacturer. A web host typically guarantees its customers an average availability of 99 percent or higher per calendar year, depending on the Service Level Agreement. Over 365 days, 99 percent corresponds to a permitted downtime of up to 87.6 hours (1 percent of 8,760 hours). Because the guarantee refers to an average, the downtime can be longer for individual customers without entitling them to compensation. For this reason, budget web hosting providers are recommended only to a limited extent for commercial applications. When high, continuous availability is required, your own IT infrastructure consisting of multiple geographically separated Dedicated Servers or Virtual Private Servers with continuous server monitoring is preferable to a package solution.
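The relationship between a guaranteed availability and the downtime it permits is simple arithmetic and useful when comparing Service Level Agreements. The helper below is a small sketch. The function name is hypothetical, and the availability levels other than 99 percent are added only for comparison.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a 365-day year


def max_downtime_hours(availability_percent: float,
                       period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime permitted by a given average availability."""
    return period_hours * (100.0 - availability_percent) / 100.0


# 99 percent is the guarantee mentioned in the text.
print(f"{max_downtime_hours(99.0):.1f} hours per year")  # 87.6 hours per year

# Higher availability levels, shown for comparison only.
for level in (99.5, 99.9, 99.99):
    print(f"{level} percent: {max_downtime_hours(level):.2f} hours per year")
```

The result is an upper bound on the average across the period. It says nothing about how the downtime is distributed, so a single long outage can still fall within the guarantee.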
Initial Countermeasures in the Event of an Acute Server Outage
The impact of a server outage can be reduced significantly if the operator takes measures to limit the damage immediately after the outage. These measures take absolute priority over analysing or reconstructing the incident, for example through digital forensics. Data backups on separate systems therefore play a crucial role: they should be readily available for emergencies and able to be activated automatically or manually without delay when the primary IT infrastructure fails. Compensation from the web hosting provider, a data centre, or the hardware manufacturer is generally not to be expected. Safeguarding against a server outage is therefore the operator's responsibility unless contracts with external service providers explicitly shift it. If the existing server or hosting account does not offer sufficient reliability, consider changing providers. A selection of Managed Servers can be found in our comparison.
Tip: Find out what you can do if your hosting provider is no longer reachable.