Non-Functional Requirements

Reliability & Availability

A system should be resilient (fault-tolerant) and performant under expected load

Strategies

design for failure and trigger failures deliberately, e.g. kill processes without warning
consider hardware faults such as power blackouts and hard disk crashes; add redundancy as necessary
consider software faults such as
- processes that slow down or that return corrupted responses
- cascading failures, where a fault in one component triggers faults in other components
measure/monitor the system to identify faults
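
A minimal sketch of "design for failure", assuming faults are injected in-process rather than by killing real processes. The names `InjectedFault`, `flaky`, and `call_with_retries` are hypothetical and not taken from any library; the point is only that a caller can be exercised against deliberate failures.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose to simulate a component failure (hypothetical name)."""


def flaky(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("deliberately injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times; re-raise if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts:
                raise
```

Passing a seeded `random.Random` as `rng` keeps a failing run reproducible, which helps when a test uncovers a fault that was not handled.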

Scalability

A system should be able to handle load increases. Load is described with load parameters, for example:

Queries per second (QPS) to a web server
Ratio of reads to writes in a DB
Cache hit/miss rate
Number of simultaneous users in a realtime system
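
The load parameters listed above can be derived from raw counters. A rough sketch, assuming the counters are collected over a fixed window; `LoadSample` and the numbers in `sample` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadSample:
    """Raw counters collected over one observation window (hypothetical shape)."""
    window_seconds: float
    requests: int
    db_reads: int
    db_writes: int
    cache_hits: int
    cache_misses: int


def queries_per_second(s: LoadSample) -> float:
    return s.requests / s.window_seconds


def read_write_ratio(s: LoadSample) -> float:
    # Reads per write; raises ZeroDivisionError if the window had no writes.
    return s.db_reads / s.db_writes


def cache_hit_rate(s: LoadSample) -> float:
    # Fraction of cache lookups that were served from the cache.
    return s.cache_hits / (s.cache_hits + s.cache_misses)


# Illustrative numbers only.
sample = LoadSample(window_seconds=60, requests=12_000, db_reads=9_000,
                    db_writes=1_000, cache_hits=7_200, cache_misses=1_800)
```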

Handling load

scaling up (vertical scaling), simple
scaling out (horizontal scaling), complex
manual scale, for predictable systems, simple
elastic scale, add resources as load increases, for unpredictable systems, complex
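
One way to picture elastic scaling is a rule that turns the current load into an instance count. This is only a sketch: `desired_instances`, the per-instance capacity, and the bounds are assumptions, and real autoscalers add smoothing and cool-down periods.

```python
import math


def desired_instances(current_qps, qps_per_instance,
                      min_instances=1, max_instances=20):
    """Return the instance count needed for current_qps, clamped to the bounds."""
    needed = math.ceil(current_qps / qps_per_instance)
    return max(min_instances, min(max_instances, needed))
```

With manual scaling a person would run the same calculation ahead of time and provision a fixed number of instances.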

Performance

throughput: number of requests processed per second
latency: time to handle the request
response time: latency + network/queue delays

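For response time we use percentiles rather than averages: gather the response times for a set of requests over a period of time, sort them from fastest to slowest, and read the value at a given rank. The common metrics are p50 (the median), p95, p99 and p999 (the 99.9th percentile); they are often used in SLAs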
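
A small sketch of the percentile calculation, assuming the nearest-rank definition (monitoring tools may interpolate instead, so their numbers can differ slightly). The function name `percentile` and the sample values are illustrative.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)                   # fastest to slowest
    rank = math.ceil(p * len(ordered) / 100)    # 1-based rank
    return ordered[rank - 1]


# Made-up response times in milliseconds, with one slow outlier.
response_times_ms = [12, 15, 11, 250, 14, 13, 16, 18, 17, 19]
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
```

Here the median stays close to the typical request while p95 picks up the outlier, which is why high percentiles are used to describe the slowest requests users actually see.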

When a request involves parallel calls to multiple services, its response time is determined by the slowest call, so a single slow service slows down the whole request
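
The effect of parallel calls can be simulated with `asyncio`. In this sketch the service names and delays are invented, and `asyncio.sleep` stands in for real network calls.

```python
import asyncio
import time


async def call_service(name, delay_seconds):
    """Stand-in for a remote call; the sleep simulates the response time."""
    await asyncio.sleep(delay_seconds)
    return name


async def fan_out(delays):
    """Call every service concurrently and measure the total elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(call_service(name, delay) for name, delay in delays.items())
    )
    return results, time.perf_counter() - start


# Hypothetical services and delays, in seconds.
delays = {"users": 0.05, "orders": 0.10, "recommendations": 0.20}
results, elapsed = asyncio.run(fan_out(delays))
print(f"slowest call: {max(delays.values()):.2f}s, elapsed: {elapsed:.2f}s")
```

The elapsed time should track the slowest call rather than the sum of all calls, apart from a small scheduling overhead.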

Durability

Data should not be lost once the system has accepted it
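
On a single machine, one building block for durability is to acknowledge a write only after it has been forced to disk. A minimal sketch, assuming an append-only log file; `durable_append` and the file name are hypothetical. It does not protect against losing the disk itself, which is where replication comes in.

```python
import os
import tempfile


def durable_append(path, record: bytes) -> bool:
    """Append record to path and return only after it was flushed to disk."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()               # push Python's buffer to the OS
        os.fsync(f.fileno())    # ask the OS to write its cache to the device
    return True                 # only now is it safe to acknowledge the write


# Example: write one record to a throwaway log file.
log_path = os.path.join(tempfile.mkdtemp(), "orders.log")
acknowledged = durable_append(log_path, b"order-1")
```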

Monitoring & metrics collection

Capture metrics about the data going in/out of the system
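
A toy sketch of capturing in/out metrics, assuming a simple in-process counter; a real system would export these to a monitoring backend. The `metrics` registry, the metric names, and the `handle` function are all made up for illustration.

```python
from collections import Counter

# Hypothetical in-process registry of counters.
metrics = Counter()


def handle(request: bytes) -> bytes:
    """Toy handler that records request/response counts and byte volumes."""
    metrics["requests_in"] += 1
    metrics["bytes_in"] += len(request)
    response = request.upper()  # placeholder for the real work
    metrics["responses_out"] += 1
    metrics["bytes_out"] += len(response)
    return response


handle(b"hello")
handle(b"metrics")
```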