I was lucky enough to attend a two-day High Availability Architecture training course given by Médéric Morel of the French IT company SQLI. Médéric has also contributed to an excellent book on the subject (sorry, it’s in French). Let me put it plainly: it was one of the best IT training sessions I have ever attended. Based on what I learned during those two days, this series of posts aims to introduce the following aspects of HA architecture:
- Availability
- Scalability
- Performance
- Technology Trends
The basic principle when designing this type of architecture: there is no single right solution. There are only good practices that suit your business and your budget. It’s up to you to choose among the existing options and strike the right balance between development and operating costs on one side and business needs on the other.
Only the Availability subject is addressed in this post. Sequels will be published to address the remaining ones.
Availability
It is critical to define the availability target (i.e. the fraction of time the system is available to the user) at the very early stages of the project. This is really a business input that the IT architect needs to address, bearing in mind that the cost/availability curve is roughly linear up to the 99.99% range, at which point the cost explodes while attempting to reach the last fraction of a percent.
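To make an availability target concrete for the business, it can help to translate it into allowed downtime per year. The following is a minimal sketch, assuming a 365-day year; the function name `downtime_hours_per_year` is only illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def downtime_hours_per_year(availability: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability ratio."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a ratio between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# Print the downtime budget for a few common targets.
for target in (0.99, 0.999, 0.9999):
    print(f"{target * 100:.2f}% -> {downtime_hours_per_year(target):.2f} hours/year")
```

A 99% target leaves roughly 87.6 hours of downtime per year, while 99.99% leaves under one hour, which gives a feel for why the last steps are so expensive.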
The main issue with multi-tier applications is that piling up servers decreases the computed availability of the whole system. The overall availability ratio is the product of the availabilities of all the components. For instance, a platform with 2 servers chained one behind the other, each with a 99% availability rate, has an availability of 0.99 × 0.99 = 0.9801, i.e. about 98%.
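The product rule above can be sketched in a few lines. This assumes the components fail independently and that the system is down as soon as any one of them is down; `series_availability` is an illustrative name.

```python
from math import prod  # available in Python 3.8+


def series_availability(components: list[float]) -> float:
    """Availability of components chained in series: the product of each ratio."""
    return prod(components)


# Two tiers at 99% each, as in the example above.
two_tiers = series_availability([0.99, 0.99])
print(f"{two_tiers:.4f}")

# Adding a third 99% tier lowers the overall figure further.
three_tiers = series_availability([0.99, 0.99, 0.99])
print(f"{three_tiers:.4f}")
```

Each additional tier can only lower the result, since every factor is at most 1.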
The natural solution to increase availability would then be to get the checkbook out and set up servers in parallel to ensure high availability. But there are wiser things to do first.
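To see why parallel servers are tempting, here is a rough sketch of the usual redundancy calculation. It is not taken from the training material: it assumes that failures are independent and that failover is perfect, so the group is down only when every server is down at once. Real setups rarely meet both assumptions.

```python
from math import prod  # available in Python 3.8+


def parallel_availability(servers: list[float]) -> float:
    """Availability of redundant servers: 1 minus the chance that all are down."""
    return 1.0 - prod(1.0 - a for a in servers)


# Two redundant 99% servers under the idealized assumptions above.
print(f"{parallel_availability([0.99, 0.99]):.4f}")
```

Under these idealized assumptions, two 99% servers in parallel reach about 99.99%, at roughly twice the hardware cost.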
Sequence of actions
Fact: only 20% of system outages come from hardware problems, while 40% come from human mistakes (operations team) and the remaining 40% from application problems. As a result, the recommended course of action is as follows:
1. Improve integration and operations practices
2. Improve Monitoring
3. Scale the system
4. Set up a disaster recovery procedure
The first action to take to improve system availability is to set up standards and normalize processes at all levels of the operations teams. This avoids outages caused by human mistakes, and it improves detection, diagnosis and reaction time when the system has problems, which reduces the time during which customers can’t access the service. Hence the success of ITIL methods.
The second action is to closely monitor the system at the five defined monitoring levels: availability, response time, application details, business details and user experience.
There are many solutions available that address these monitoring requirements at each level. For most non-critical applications, a good old status page, like Apache’s, pinged by a robot on a regular basis is quite often good enough.
Now that you know how to improve your system’s availability, the next post will focus on how to scale your system.
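As an illustration of such a robot, here is a minimal sketch using only the Python standard library. The URL, the polling interval and the function names are hypothetical, and the `opener` parameter exists only so that the check can be exercised without a real server.

```python
import time
import urllib.request


def is_up(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the status page answers with HTTP 200 within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return False


def poll(url: str, interval_seconds: float = 60.0, checks: int = 3) -> None:
    """Check the status page a fixed number of times and print each result."""
    for _ in range(checks):
        state = "UP" if is_up(url) else "DOWN"
        print(f"{time.strftime('%H:%M:%S')} {url} {state}")
        time.sleep(interval_seconds)
```

Calling `poll("http://localhost/server-status")` would check a hypothetical status page once a minute. A real robot would also alert someone when the state changes rather than just printing it.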