
Adopting SLI/SLO for improving reliability - Part 3: Service application cases

Hello. I'm Ki-cheol Cheon, a site reliability engineer (SRE) on the Service Reliability team. Our SRE team is constantly working to provide users with reliable and trustworthy services. Concretely, we're using service level indicator (SLI) and service level objective (SLO) measurements to evaluate the quality and reliability of LINE messaging core services, and we're putting effort into engineering areas such as demand forecasting, performance testing, automation, and strengthening observability. We also help teams communicate and collaborate smoothly.

This article follows Adopting SLI/SLO for improving reliability - Part 1: Introduction and reasons for adoption and Adopting SLI/SLO for improving reliability - Part 2: Platform implementation cases. Here, I'll share our experience defining and applying SLI and SLO across the LINE app's messaging, channels, and authentication services.

Mindset for implementing SLI/SLO

Implementing SLI/SLO is more than just picking metrics — it's really about redefining how you understand a service. To carry out this process successfully, here are the mindsets you'll need.

Understanding and exploring the service

The first thing to do when implementing SLI/SLO is to identify what services and features you provide to users. From there you can list the user journeys (the flows users follow through the service) and pick out the critical user journeys (CUJs): the features users use most or cannot do without. Understanding how the service connects to business objectives also lets you align SLOs with organizational goals.

Communication and collaboration

SLI/SLO are metrics and goals that should be implemented and managed collaboratively across departments and teams. They shouldn't be defined and implemented by one person or a single team — the goals only become valuable when all stakeholders agree on them. The organization that understands the service best should define the CUJ, the infrastructure team should build an environment that can measure and manage large-scale metrics, and SRE should implement the tools and methods to measure SLIs so SLOs can be used to monitor and manage service reliability.

Here are the roles each organization typically plays.

roles performed by each organization

It's important to create a culture where all service stakeholders understand the importance of SLI/SLO and share responsibility for achieving them. Only then can SLOs be used as indicators to confirm that users can use services reliably during operation or when new services or features are released.

SLI/SLO implementation methods

So how do you actually implement SLI/SLO in practice? Let's go through the steps.

CUJ analysis

The first step is to define the CUJ — the key services and features — by considering user experience and business goals.

CUJ analysis step in the four-step SLI/SLO implementation process

List the services and features you provide, then ask questions like the following to refine your CUJ.

What features do users use most in our service?
What features are essential for users?

For example, you might define core functions such as the LINE service sign-up flow, which is the app's starting point, or the basic messaging send/receive function. Other targets might include authentication and encryption from a security perspective, LINE Login, or profile information features that let users use other services via LINE.

Critical user journey (CUJ)
Focus on selecting the key features and services, taking user experience and business objectives into account.

Of course, all services and features are important and need to be reliable, but the key at this stage is to select the most critical targets from the user's perspective.

SLI implementation

Once you've defined CUJs, the next step is to define SLIs.

SLI implementation step in the four-step SLI/SLO process

Defining an SLI means deciding, for each CUJ, which metric to measure and how to measure it. Start with two questions.

Where will you measure it?
- Choose the location that best captures the metric for each CUJ, such as the gateway, frontend, or backend.
Which API will you use to measure it?
- Select representative APIs for each CUJ so you don't need to compute metrics in a complex way.

Next, define the SLI criterion that separates success from failure. Generally, you'll measure response time and response success rate. Define each criterion as follows.

Response time: decide which percentile of overall requests you'll use as the threshold, and what response time will be considered successful for that percentile.
Response success rate: decide what the success rate must be over the measurement period.

For example, you might define that the response time for the 99.9th percentile of message send requests should be under 500 ms, and that 99.999% of all requests should receive a successful response.

example of SLI definitions
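To make the message-send example concrete, here is a minimal Python sketch of checking one measurement window against such a definition. The `SliSpec` fields, the `evaluate_sli` function, and the `sendMessage` API name are hypothetical illustrations, not the tooling described in this series. The sketch reads "the 99.9th percentile is under 500 ms" as "at least 99.9% of requests finished in under 500 ms", and it uses exact fractions so that targets with many nines are not distorted by floating-point rounding.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SliSpec:
    """Hypothetical SLI definition for one CUJ (field names are illustrative)."""
    cuj: str                       # e.g. "message send"
    measured_at: str               # e.g. "gateway", "frontend", "backend"
    api: str                       # representative API chosen for the CUJ
    latency_percentile: Fraction   # in percent, e.g. Fraction("99.9")
    latency_threshold_ms: int      # e.g. 500
    min_success_percent: Fraction  # in percent, e.g. Fraction("99.999")


def evaluate_sli(spec, latencies_ms, successes):
    """Return (latency_ok, success_ok) for one non-empty measurement window."""
    if not latencies_ms or not successes:
        raise ValueError("the measurement window must contain requests")
    # Count requests that finished strictly under the latency threshold.
    fast = sum(1 for ms in latencies_ms if ms < spec.latency_threshold_ms)
    ok = sum(1 for success in successes if success)
    latency_ok = Fraction(fast, len(latencies_ms)) * 100 >= spec.latency_percentile
    success_ok = Fraction(ok, len(successes)) * 100 >= spec.min_success_percent
    return latency_ok, success_ok


spec = SliSpec("message send", "gateway", "sendMessage",
               Fraction("99.9"), 500, Fraction("99.999"))
# 999 of 1,000 requests are fast, and 99,999 of 100,000 succeed.
print(evaluate_sli(spec, [120] * 999 + [900], [True] * 99999 + [False]))
# (True, True)
```

Keeping the definition in a small, explicit structure like this is one way to make the measurement location, the API, and the success criteria visible to every stakeholder in one place.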

The key in this step is to define the measurement location and target, and a clear criterion that separates success from failure. You must clearly define the success and failure thresholds so you can measure whether an SLO has been met. If there are too many variables to measure or you can't clearly define success and failure, you can exclude that CUJ from measurement. If measurement is complex, we recommend creating separate metrics specifically for the SLI.

SLO target setting

After defining SLIs, you define SLOs.

SLO target setting step in the four-step SLI/SLO process

In this step you define the SLO: the percentage of time during the measurement period that the service must meet the SLI. The remainder is the error budget, the amount of time the service is allowed to miss the SLI. For example, over a 28-day window (40,320 minutes), a 99.9% SLO means the service must meet the SLI for roughly 40,280 minutes (40,320 × 0.999 ≈ 40,279.7), which leaves an error budget of about 40 minutes (40,320 × 0.001 ≈ 40.3).
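The arithmetic in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `error_budget_minutes` is hypothetical, and the SLO is passed as a string so it can be converted to an exact fraction.

```python
from fractions import Fraction


def error_budget_minutes(window_days, slo_percent):
    """Minutes in the window during which the SLI may be missed."""
    window_minutes = window_days * 24 * 60
    allowed_miss = 1 - Fraction(slo_percent) / 100
    return float(window_minutes * allowed_miss)


# A 28-day window is 40,320 minutes; each extra nine shrinks the budget tenfold.
for slo in ("99.9", "99.99"):
    print(slo, error_budget_minutes(28, slo))
# 99.9 40.32
# 99.99 4.032
```

Seeing the budget in minutes rather than percentages makes it easier to judge whether a target is realistic for the team that has to meet it.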

Service level objective (SLO)

The objectives or specific ranges the service needs to achieve, based on the measured SLIs

example of SLO target setting

Set realistic targets that you can actually meet. If you set an SLO too high you'll need to spend more to achieve it, and if you set it too low you'll erode service quality and harm the user experience. Balance reliability and cost when you set SLOs.

Visualization

Finally, build dashboards that visualize the SLIs and SLOs so all stakeholders can quickly see the current SLO status.

visualization step in the four-step SLI/SLO process

Implement a dashboard that summarizes the SLO and the error budget so you can see the overall status at a glance. If you add detailed dashboards for each CUJ you can inspect SLI metrics in more depth.

visualization example

Keep dashboards simple so stakeholders can quickly grasp the overall situation. Use numbers and color to simplify information — for example, show metrics that reliably meet targets in green, items that need attention in orange, and items that have failed to meet targets in red.
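As a rough illustration of the color rule, the sketch below maps an achieved value and its target to a status color. The function name `status_color` and the 0.05 percentage-point margin that separates "reliably meets the target" from "needs attention" are assumptions made for this example; the article does not state the thresholds used on the actual dashboards.

```python
from fractions import Fraction


def status_color(achieved_percent, target_percent, margin_percent="0.05"):
    """Return a dashboard color for one metric (thresholds are illustrative)."""
    achieved = Fraction(achieved_percent)
    target = Fraction(target_percent)
    if achieved < target:
        return "red"      # failed to meet the target
    if achieved < target + Fraction(margin_percent):
        return "orange"   # meets the target, but with little headroom
    return "green"        # comfortably above the target


print(status_color("99.98", "99.9"))  # green
print(status_color("99.92", "99.9"))  # orange
print(status_color("99.85", "99.9"))  # red
```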

SLI/SLO utilization

Now let's look at how to use SLI and SLO.

Quantitative indicators to understand reliability and stability

Instead of saying a service is "slow" or "unstable", use SLIs to describe its state in quantitative terms. For example, you can say "the service response time exceeds the SLI threshold of 400 ms, causing user inconvenience."

graph showing service response time exceeding the SLI threshold of 400 ms

In other cases you might say "our service response success rate is below 99.99%, which is causing user disruption." Dashboards also let you investigate what issues occurred during the periods when the success rate fell short.

graph showing service response success rate below 99.99%
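One way to find the periods worth investigating is to group consecutive below-target samples. The sketch below assumes per-minute samples given as `(label, successful_requests, total_requests)` tuples with at least one request each; the function name `breach_periods` and the sample data are hypothetical.

```python
from fractions import Fraction


def breach_periods(samples, min_success_percent):
    """Group consecutive below-target samples into (start, end) label pairs."""
    target = Fraction(min_success_percent)
    periods = []
    start = last = None
    for label, ok, total in samples:
        if Fraction(ok, total) * 100 < target:
            if start is None:
                start = label      # a new below-target period begins
            last = label
        elif start is not None:
            periods.append((start, last))
            start = None
    if start is not None:          # the series ended while still below target
        periods.append((start, last))
    return periods


samples = [
    ("10:00", 100000, 100000),
    ("10:01", 99980, 100000),
    ("10:02", 99985, 100000),
    ("10:03", 99995, 100000),
    ("10:04", 99900, 100000),
]
print(breach_periods(samples, "99.99"))
# [('10:01', '10:02'), ('10:04', '10:04')]
```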

Using quantitative numbers and dashboards makes it easier to monitor and analyze service status.

Use as a basis for resource allocation

SLOs let you check whether targets are being met, and the error budget tells you how much more unreliability you can currently tolerate. Together they help you estimate how much effort to invest in preventive and corrective work to keep meeting the SLOs.

If a service is meeting its SLO and has a generous error budget, you can consider investing more aggressively in new feature releases. For example, if the SLO is 99.9% and the service is achieving 99.98%, it has used only 0.02 of the 0.1 percentage points it is allowed to miss, so about 80% of the error budget remains. In that case you can deem the service stable and allocate resources to shipping new features or shortening release cycles.

dashboard example showing SLO 99.98% and error budget 80%

If the error budget is running low, for example when only 10% remains, prioritize allocating resources to stabilizing the service and improving reliability.

dashboard example showing SLO 99.98% but error budget only 10%
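The relationship between the SLO, the achieved value, and the remaining error budget can be sketched as follows. The function names and the 20% cutoff in `suggested_focus` are assumptions for illustration only; the article gives 80% and 10% as examples and does not prescribe a fixed threshold. The sketch also assumes an SLO below 100%, since a 100% target leaves no budget to divide.

```python
from fractions import Fraction


def budget_remaining(slo_percent, achieved_percent):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = 100 - Fraction(slo_percent)     # how much the service may miss
    spent = 100 - Fraction(achieved_percent)  # how much it has missed so far
    return float(1 - spent / allowed)


def suggested_focus(remaining, stabilize_below=0.2):
    """Illustrative policy: stabilize when little budget is left."""
    return "stabilize" if remaining < stabilize_below else "ship features"


remaining = budget_remaining("99.9", "99.98")
print(remaining, suggested_focus(remaining))  # 0.8 ship features
print(suggested_focus(0.1))                   # stabilize
```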

Use during on-call response

We feed error budget status alerts into our on-call workflows so teams can quickly understand the state of a service and respond to issues. We also review SLO status in regular meetings and use it to plan preventive work that keeps services on target.

Here are examples of error budget alert messages.

example error budget alert message 1

example error budget alert message 2
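A simple way to drive alerts like these is to notify once each time the remaining budget drops to or below a threshold. The sketch below is a hypothetical illustration: the function name, the thresholds, and the message wording are assumptions and do not reproduce the alert format shown above.

```python
def budget_alerts(service, remaining_series, thresholds=(0.5, 0.2, 0.0)):
    """Return one message per threshold, the first time it is reached."""
    pending = sorted(thresholds, reverse=True)  # highest threshold first
    messages = []
    for remaining in remaining_series:
        # A sharp drop can cross several thresholds in a single step.
        while pending and remaining <= pending[0]:
            level = pending.pop(0)
            messages.append(
                f"[{service}] error budget at {remaining:.0%} "
                f"(crossed {level:.0%} threshold)"
            )
    return messages


for message in budget_alerts("messaging", [0.9, 0.6, 0.45, 0.4, 0.1]):
    print(message)
# [messaging] error budget at 45% (crossed 50% threshold)
# [messaging] error budget at 10% (crossed 20% threshold)
```

Alerting only on the first crossing keeps on-call noise down, because a budget that hovers just under a threshold does not trigger a new message for every sample.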

Closing thoughts

Reliability and stability are core values for any service. At the same time, improving user convenience and delivering new features are critical business goals. I believe SLOs can help balance service reliability with business objectives by providing quantitative targets.

SLOs help balance service reliability with business demands

I hope this article helps those who are interested in or planning to adopt SLI and SLO. I'll close by sharing a short statement that summarizes my view of the SRE role. Thanks for reading.

Core tenets of SRE

Focus on providing stability and reliability to our customers, and strive for continuous improvement and innovation.