5 essential NOC Metrics to reach high uptime and detect potential outages

Tue, 19 Sep 2023 00:00:00 +0000

My most recent role, spanning 2.5 years, centered on designing and adopting an Incident Management Framework as part of the Program Management org. The work was driven by two primary objectives:

Reach and maintain 99.99% uptime for our APIs and SDKs.
Ensure engineering is always the first-hand source of information about any potential outage that could result in downtime.
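
To put the first objective in perspective, here is a minimal sketch of how little downtime such a target leaves. The function name and the 30-day and 365-day windows are my own illustrative assumptions, not part of any tooling we used.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime allowed in a period at a given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Compare the 99.99% target with 99.98% over assumed 30-day and 365-day windows.
for target in (99.99, 99.98):
    monthly = downtime_budget_minutes(target, 30)
    yearly = downtime_budget_minutes(target, 365)
    print(f"{target}% uptime: {monthly:.2f} min per 30 days, {yearly:.2f} min per 365 days")
```

At 99.99% the budget is roughly 4.3 minutes per 30 days, which is why detecting incidents early matters so much.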

In our early days, we lacked a comprehensive alerting and monitoring system. Establishing the Network Operations Center (NOC) team was our strategic move to build a robust one and take charge of Incident Management. We not only reached 99.98% uptime but also raised the share of incidents we spotted before our merchants did from 60% to 95% and higher.

Grafana on Marat Kiniabulatov | Agile Coach, OKR, PMO
