KPI Categories

Infrastructure operations should align to the following general business priorities:

  • Agility – The measure of adaptability. How well can your systems embrace change?
  • Security – How protected and up-to-date a system is. Are you able to respond to threats?
  • Reliability – All about the stability of the ‘as a service’ experience. Is your system’s foundation rock-solid?
  • Consistency – The ability to scale. Are you able to produce predictable results?
  • Repeatability – The core of automation. Can you reuse code and keep conformity?
  • Delivery – The ultimate litmus test. How quickly can you go from concept to execution?

How to use this information

The KPIs presented here are intended as a menu of measures for discussion. Businesses should identify one top-priority area (see the categories above and the KPI tables below) and focus their efforts there, rather than selecting a broad set of objectives.

Outside of measurement, the list below may help operational teams reposition themselves by being more realistic about their performance in key areas. We’ve found that many teams accept poor results from automation because they believe that the current state is as good as it can be. The list is designed to expose overlooked areas of improvement that can have a disproportionate impact.

Listen to CEO Rob Hirschfeld give advice on choosing and improving KPIs.

| Agility KPIs | Target | Typical | Why Important? |
| --- | --- | --- | --- |
| Lead Time for Changes (LTTC) | 1 day | 15 days | DORA metric. Fast change times enable faster iteration and learning. In production, they enable faster patching and updates. |
| Average System Age | 30 days | 120 days | Faster turnover rates represent confidence in automation and reduced customization. |
| # of operations platforms | 1 | 10 | Each platform represents specialized skills and management interfaces. |
| % vendor-locked infra | 20% | 80% | Reduced vendor risk, improved portability, easier talent acquisition. |
| % re-creatable environments | 90% | 5% | Improved fault tolerance, improved dev/test/prod movement, improved reusability. |
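
As a rough illustration of how the first two Agility KPIs might be computed, the sketch below assumes you can export a list of (commit time, production deploy time) pairs and a list of provisioning timestamps. Those record shapes and the function names are hypothetical, and using the median for lead time is one common choice rather than a rule stated here.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time_for_changes(changes):
    """Median time from commit to production deploy.

    `changes` is a list of (committed_at, deployed_at) datetime pairs.
    This record shape is an assumption made for illustration.
    """
    seconds = [(deployed - committed).total_seconds()
               for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


def average_system_age(provisioned_at, now):
    """Mean age of systems, measured from their last (re)provisioning."""
    ages = [(now - t).total_seconds() for t in provisioned_at]
    return timedelta(seconds=sum(ages) / len(ages))


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    # Hypothetical sample data: three changes and two systems.
    changes = [(now - timedelta(days=d), now) for d in (1, 2, 3)]
    systems = [now - timedelta(days=10), now - timedelta(days=30)]
    print("LTTC:", lead_time_for_changes(changes))
    print("Average system age:", average_system_age(systems, now))
```

Comparing the results against the Target and Typical columns gives a quick sense of where a team sits.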

| Security KPIs | Target | Typical | Why Important? |
| --- | --- | --- | --- |
| Patches lag behind | 1 minor | 2 major | Being behind creates risk since maintainers do not typically backport fixes between major versions. It can take significant effort to upgrade between major versions. |
| % exceptions from standard | 5% | 50% | Any exception from standards requires additional management and creates exposure risk. |
| Time since patch | 1 week | 12 weeks | Patches are incremental, low-risk updates that often address security issues. |
| % manual effort to remediate | 5% | 80% | Requiring manual effort when fixing (remediating) issues is toil, pulls people off other priorities, and dramatically extends the time to fix known issues. |

| Reliability KPIs | Target | Typical | Why Important? |
| --- | --- | --- | --- |
| Mean Time to Recovery (MTTR) | 1 hour | 8 hours | DORA metric. System outages have both direct (cannot work) and indirect (work deferred) impacts on business delivery. |
| ½ life of automation | 18 months | 3 months | Automation that frequently requires updates and fixes is less likely to be used. |
| % Idempotence | 90% | 20% | Automation that changes behavior or makes unexpected changes creates risk for the operator and requires human attention and monitoring. |
| Reliability of 2nd run | 99% | 75% | Similar to idempotence but easier to measure. Rerunning automation should complete or fail cleanly, and never harm a system. |
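
The idempotence and second-run KPIs can be made concrete with a small test harness. The sketch below is only an illustration: the step functions, the convention that a step returns True when it changed something, and the definition of a "clean" second run (it completes without raising and reports no change) are all assumptions, not part of any specific tool.

```python
def ensure_line(lines, wanted):
    """Idempotent step: append `wanted` only if it is missing.

    Returns True when the step changed `lines`.
    """
    if wanted in lines:
        return False
    lines.append(wanted)
    return True


def append_line(lines, wanted):
    """Non-idempotent step: always appends, so every run changes state."""
    lines.append(wanted)
    return True


def second_run_is_clean(step, make_state, *args):
    """Run `step` twice against fresh state and inspect the second run."""
    state = make_state()
    step(state, *args)  # first run: allowed to change the system
    try:
        changed = step(state, *args)  # second run: should be a no-op
    except Exception:
        return False
    return not changed


def second_run_reliability(checks):
    """Percent of (step, make_state, args) checks with a clean second run."""
    clean = sum(1 for step, make_state, args in checks
                if second_run_is_clean(step, make_state, *args))
    return 100.0 * clean / len(checks)


if __name__ == "__main__":
    checks = [
        (ensure_line, list, ("PermitRootLogin no",)),
        (append_line, list, ("PermitRootLogin no",)),
    ]
    print("Second-run reliability (%):", second_run_reliability(checks))
```

Running every automation step through a check like this in test makes the KPI a measured number rather than an impression.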

| Consistency KPIs | Target | Typical | Why Important? |
| --- | --- | --- | --- |
| Change Failure Rate (CFR) | 10% | 25% | DORA metric. Teams must be confident that they can change systems smoothly, or they will resist making changes or batch them. |
| Workflow success rate | 99% | 80% | To build a strong foundation, operators need to be confident that automated requests will complete. Failures typically require toil or manual effort to correct. |
| Last System Scan freshness (CMDB) | 7 days | 90 days | Lack of system awareness makes it difficult to plan and execute change within the system. Key in evaluating drift. |
| Prod variance from test | 5% | 50% | Production sites should closely match test and staging environment(s). |
| Multi-site variance | 5% | 30% | Production sites should be tightly managed to use the same components even if they use different vendors. |
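
One possible way to put numbers on Change Failure Rate and the two variance KPIs is sketched below. It assumes change outcomes are available as a list of booleans and that each environment can be summarized as a mapping of component name to version. Both shapes, and measuring variance as the share of differing components, are illustrative assumptions rather than a definition given here.

```python
def change_failure_rate(outcomes):
    """Percent of changes that failed.

    `outcomes` is a list of booleans, True meaning the change succeeded.
    """
    failures = sum(1 for ok in outcomes if not ok)
    return 100.0 * failures / len(outcomes)


def variance_percent(reference, other):
    """Percent of components that differ between two environments.

    Each environment is a dict of component name -> version. A component
    counts as differing when its version differs or it exists on one side only.
    """
    names = set(reference) | set(other)
    differing = sum(1 for n in names if reference.get(n) != other.get(n))
    return 100.0 * differing / len(names)


if __name__ == "__main__":
    # Hypothetical inventories for a test and a production environment.
    test_env = {"kernel": "5.15", "openssl": "3.0.2", "nginx": "1.24", "python": "3.11"}
    prod_env = {"kernel": "5.15", "openssl": "3.0.2", "nginx": "1.22", "python": "3.11"}
    print("CFR (%):", change_failure_rate([True] * 9 + [False]))
    print("Prod variance from test (%):", variance_percent(test_env, prod_env))
```

The same `variance_percent` helper could compare two production sites to estimate multi-site variance.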

| Repeatability KPIs | Target | Typical | Why Important? |
| --- | --- | --- | --- |
| % custom automation (per team) | 5% | 90% | Custom automation cannot be holistically maintained when errors or changes are discovered. This causes toil, hidden security risks, and personal lock-in, and makes it difficult to promote standards. |
| Effort to Reset | 1 hour | 40 hours | Resets are toil; fast resets free teams to iterate and test before committing changes. |
| % conforming day 1 | 99% | 50% | Systems should start in a conforming state when delivered. |
| % conforming day 90 | 95% | 25% | Automated processes should ensure conformance is maintained. |

| Delivery KPIs | Target | Typical | Why Important? |
| --- | --- | --- | --- |
| Delivery Backlog | 1 day | 8 weeks | Slow delivery of infrastructure encourages internal customers to bypass operations, creating silos and technical debt for compliance and management. |
| Avg Time to verify | 10 min | 3 days | Many operational errors arise when systems do not match expectations. Being able to verify systems before changes and also on an ongoing horizon (drift) is critical to consistent operations. |
| Avg Time to secure | 10 min | 1 day | Bringing up systems, changing systems, or zero-day vulnerabilities can leave systems exposed. It is important to ensure that the window of exposure is as short as possible. |
| Configuration issues caught in Test | 90% | 10% | Being able to prevent the escape of critical issues into production dramatically improves system resilience. It also helps teams be more confident in system automation and changes. |
| Delivery to In Use Time | 1 hour | 45 days | Beyond simply not getting value from an idle asset, putting systems into use quickly allows teams to better coordinate onboarding and to quickly find and resolve issues with new infrastructure while attention is focused on the delivery. |

Customer Journey

Walk through the steps and stages of the integration and use of Digital Rebar.

Case Studies

Read through stories about how, by using Digital Rebar, RackN customers improved target metrics by up to 10x.