With the IT landscape changing rapidly and systems growing more complex, failure at some point is inevitable. This is why many well-trained teams and resources are dedicated to failure mitigation and recovery. As more features are added and the system handles more requests, the probability of complications rises. That does not mean you should stop making changes or adding features. You can minimize the chance of failure by identifying the most suitable time to change the system, working out how much room the system has for errors, deciding the level of testing to employ, and more. The right KPIs and metrics can make a big difference in keeping your system stable and at peak performance. One of the best methods for deciding when to modify your system is error budgeting.
An error budget is the amount of downtime that a system or set of systems may incur without violating the contractual terms of the SLA. For example, if your service level agreement states that your system will have 99.95% uptime, or that 99.95% of queries will succeed, then the error budget allows the system to fail 0.05% of the time. The underlying philosophy is that it is not possible to keep your systems up 100% of the time.
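To make the arithmetic concrete, here is a minimal Python sketch that converts an availability target into allowed downtime. The function name `error_budget` and the 30-day window are illustrative assumptions, not part of any SLA or standard tool.

```python
from datetime import timedelta


def error_budget(target_percent: float, window: timedelta) -> timedelta:
    """Return the downtime allowed by an availability target over a window.

    `target_percent` is the promised uptime, e.g. 99.95.
    `window` is the period the target applies to (assumed here, not given by the SLA text).
    """
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in the range (0, 100]")
    allowed_fraction = (100 - target_percent) / 100  # 99.95 -> 0.0005
    return window * allowed_fraction


if __name__ == "__main__":
    budget = error_budget(99.95, timedelta(days=30))
    print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

With a 99.95% target, 0.05% of a 30-day window comes to roughly 21.6 minutes of allowed downtime.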
Error budgets help keep overall system performance at an optimal level and guard against long, serious downtimes that would drastically affect it. By accepting that your system may fail, an error budget creates room for service improvements that have lower success requirements.
The process that system experts follow when working with an error budget is:
- Negotiate the error budget based on the SLA
- Measure the actual uptime against the set error budget
- Set benchmarks or expectations for downtime in the rest of the quarter based on the difference obtained above
- Keep pushing new releases until the error budget is used up
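The measuring step in this process can be sketched in a few lines of Python. This is only an illustration: the `BudgetStatus` class, the `budget_status` function, and the 90-day quarter are hypothetical choices, and real downtime figures would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    allowed_minutes: float  # total downtime the target permits in the window
    used_minutes: float     # downtime actually observed so far

    @property
    def remaining_minutes(self) -> float:
        """Downtime still available before the target is violated."""
        return self.allowed_minutes - self.used_minutes

    @property
    def exhausted(self) -> bool:
        """True once observed downtime has used up the whole budget."""
        return self.remaining_minutes <= 0


def budget_status(target_percent: float, window_days: int,
                  downtime_minutes: float) -> BudgetStatus:
    """Compare observed downtime with the budget implied by the target."""
    window_minutes = window_days * 24 * 60
    allowed = window_minutes * (100 - target_percent) / 100
    return BudgetStatus(allowed_minutes=allowed, used_minutes=downtime_minutes)


if __name__ == "__main__":
    status = budget_status(99.95, window_days=90, downtime_minutes=20.0)
    print(f"Remaining: {status.remaining_minutes:.1f} minutes")
```

For a 99.95% target over a 90-day quarter the budget is about 64.8 minutes, so 20 minutes of downtime so far leaves about 44.8 minutes for the rest of the quarter.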
Benefits of Error Budgeting
- Error budgeting lets you decide how much change or risk your system can tolerate.
- Error budgeting helps regulate deployments and maintain release frequency. Release velocity is controlled by ensuring that the SLOs are met before a new release is planned.
- By measuring how much of the error budget has been consumed, you can control the rate of deployment, slowing down or speeding up according to the budget that remains.
- Error budgets allow the SRE and Dev teams to work in sync toward the overall improvement of the service.
- It allows the product development team to take risks and increase push velocity when ample error budget remains.
- It helps determine the right time to increase the pace of development or testing. When the error budget diminishes or dries up, push velocity can be reduced and more time given to testing.
- The error budget can be adjusted to give the development team more room to push new features, or tightened if the impact on service availability becomes too high.
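The idea of speeding up or slowing down releases according to the remaining budget could be expressed as a simple policy function. The sketch below is an assumption-laden example: the `ReleasePace` names and the 50% "ample" threshold are hypothetical, and each team would negotiate its own cut-offs.

```python
from enum import Enum


class ReleasePace(Enum):
    SPEED_UP = "speed up"
    SLOW_DOWN = "slow down"
    FREEZE = "freeze"


def release_pace(remaining_fraction: float,
                 ample_threshold: float = 0.5) -> ReleasePace:
    """Suggest a release pace from the fraction of error budget still unused.

    `remaining_fraction` is 1.0 when nothing has been spent and 0.0 (or less)
    when the budget is gone. `ample_threshold` is an illustrative cut-off.
    """
    if remaining_fraction <= 0:
        return ReleasePace.FREEZE      # budget dried up: stop pushing, focus on testing
    if remaining_fraction >= ample_threshold:
        return ReleasePace.SPEED_UP    # ample budget left: room to take risks
    return ReleasePace.SLOW_DOWN       # budget diminishing: reduce push velocity


if __name__ == "__main__":
    for fraction in (0.8, 0.2, 0.0):
        print(fraction, release_pace(fraction).value)
```

Keeping the threshold as a parameter reflects the point above that the budget policy can be loosened or tightened as the impact on availability changes.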