This week I was reading an article about a data center that burned down not long ago, and how the fire impacted the many enterprises that use it to offer services to their clients. Some people might wonder: what if a hospital had its data, or even worse, its whole system running in that data center? How could it continue helping people without its system, its data, or both?
I personally didn’t pay it much mind, because I know these kinds of institutions have backup plans and backup servers, and they can keep functioning with minimal issues. What caught my attention was how differently the various companies using this data center were affected.
I classified them into three main categories; although some companies don’t quite fit these categories, most do.
- The first kind are the companies that couldn’t continue providing a service and lost all of their customers’ data.
- The second are the companies that experienced some downtime and are partially back up, with some data loss.
- Finally, there are the companies that continued business as usual (at least on the surface).
So, what is the difference between these three types of companies? Why were some affected so badly while others just continued business as usual? In most cases, the answer is high availability. The companies that weren’t affected had a system designed and running with high availability, while the ones with some downtime and data loss probably had some sort of partial redundancy, or simply didn’t need the system to be running 100% of the time. Finally, the companies that lost pretty much their whole system might have had their “high availability” deployed in the same data center, or no contingency plan at all.
High availability is a characteristic of a system designed to keep operating at a specified performance target, usually measured as uptime. So, how do we achieve this? The short answer: by designing and building redundant systems.
Quite easy, right? Well… not exactly. Many people think it is as simple as adding redundancy to some components, but it is a little more complex than that. The professionals who design these systems must plan the redundancies and how the redundant components work with each other. For example:
- Does the backup start working only when the main component fails, or do both work in parallel, with the remaining one taking the whole workload if the other stops?
- How will a component failure be detected?
- What should happen when a component fails?
- Are the redundancies in the same place, or are they geographically separated to prevent downtime in case of a disaster in the area where the data center is?
These are just some of the many factors to consider.
There are three principles that help achieve high availability:
- Elimination of single points of failure. This is achieved by adding redundancies so that if any component fails, another takes its place and the system continues working as expected.
- Reliable crossover. The crossover point (how the system switches from the failed component to the backup) is itself, in many cases, a point of failure. So a reliable system must provide a reliable crossover.
- Detection of failures as they occur. Detecting failures is as important as, if not more important than, the redundancies and the crossover, because a failure is the trigger that activates the redundancies.
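The three principles above can be sketched in a few lines of Python. This is only a hypothetical, minimal illustration: the `Component` class, `probe` method, and `select_active` function are invented names, and a real system would probe over the network, debounce flapping components, and replicate state to the standby.

```python
class Component:
    """Hypothetical service endpoint with a health probe."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def probe(self):
        # Detection of failures: in a real system this would be a
        # network health check or heartbeat, not a flag.
        return self.healthy


def select_active(primary, standby):
    """Reliable crossover: prefer the primary, fail over to the
    standby only when the primary's probe reports a failure."""
    if primary.probe():
        return primary
    if standby.probe():
        return standby
    # Both redundant components are down: no single point of
    # failure was eliminated here, so the system is unavailable.
    raise RuntimeError("no healthy component available")


primary = Component("db-primary")
standby = Component("db-standby")

print(select_active(primary, standby).name)  # → db-primary

primary.healthy = False  # simulated failure, caught by the probe
print(select_active(primary, standby).name)  # → db-standby
```

The interesting design decision is hidden in `select_active`: whether the crossover is automatic (as here) or requires a human decision, and how to avoid the crossover logic itself becoming a single point of failure.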
The best example of high availability is the set of systems on an airplane. Every single function has multiple sensors monitoring it and at least two backups in case a system fails. The reason there are two backups for every sensor is that if one starts giving false readings, you can check all three readings and determine which one is giving wrong information.
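The three-sensor idea can be shown with a simple agreement check. This is only a sketch (the `consensus` function and the tolerance value are invented for illustration): with three redundant readings, the one that disagrees with the other two can be voted out.

```python
def consensus(readings, tolerance=2.0):
    """Given three redundant sensor readings, return the average of
    the two that agree within `tolerance`, discarding the outlier."""
    a, b, c = readings
    if abs(a - b) <= tolerance:
        return (a + b) / 2  # c may be the faulty sensor
    if abs(a - c) <= tolerance:
        return (a + c) / 2  # b may be the faulty sensor
    if abs(b - c) <= tolerance:
        return (b + c) / 2  # a may be the faulty sensor
    raise ValueError("no two sensors agree")


# Three altitude sensors; the third one is drifting badly.
print(consensus([10_000.0, 10_001.0, 12_500.0]))  # → 10000.5
```

Note why two sensors are not enough: if they disagree, you cannot tell which one is lying. The third reading breaks the tie.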
How is availability measured? There are two main metrics: the mean time to recover (MTTR) and the mean time between failures (MTBF). For this article I’m going to focus on mean time to recover, as it is the one you hear about most often: the average time it takes to completely recover from a product or system failure. Availability itself is usually expressed as the percentage of time in a year the system is available; for example, 99.9% availability means that over a whole year your system can be unavailable for about 526 minutes, or 8.77 hours.
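The 526-minute figure is easy to reproduce. Here is a quick back-of-the-envelope calculation (using a 365.25-day year, which is why the numbers land near 526 minutes rather than a round value):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ≈ 525,960 minutes


def downtime_budget(availability_pct):
    """Minutes per year a system may be down at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_budget(pct)
    print(f"{pct}% availability -> {minutes:,.0f} min/yr ({minutes / 60:.2f} h)")
```

Each extra “nine” cuts the allowed downtime by a factor of ten, which is why 99.99% and beyond usually demands the geographic redundancy discussed above.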
There are many causes of unplanned unavailability; they can be grouped into three main categories:
- System failures (capacity)
- Data and media errors (hard drives and human error)
- Site outages (natural disasters, etc.)
When you plan for high availability there are two main levels: system-level planning, which includes capacity planning and redundancy planning, and failure-protection planning, which covers external failures like a power outage or a natural disaster.
It is extremely important for most companies, not just software or services companies, to have high availability in their systems, or at the very least a contingency plan. For example, if a car manufacturer has to stop its assembly lines because a machine failed and there was no redundancy or contingency plan, the loss is not just the products that weren’t produced and sold as scheduled, but also the idle time of the employees, possibly a fine for not delivering on time, and in many cases the loss of reputation for not being able to deliver as expected.
At this point you might be thinking: but what does Triangular have to do with high availability?
Triangular can be a powerful ally for big conglomerates as well as small companies to plan for high availability at a low cost. If some of your equipment needs emergency maintenance, you can use the Triangular platform to continue your company’s processes while losing as little time and production as possible, in many cases minimizing the consequences of the emergency maintenance and preventing monetary and reputation losses.
High availability is a must for any company, especially, but not limited to, companies that provide services over the internet. It might seem complex at first glance; however, the benefits of high availability far outweigh the cost and complexity of its implementation.