Tail Latency Might Matter More Than You Think

Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently. Things like: "my service mostly responds in around 10ms, but sometimes takes around 100ms". There are many causes of tail latency in the world, including contention, garbage collection, packet loss, host failure, and weird stuff operating systems do in the background. It's tempting to look at the 99.9th percentile, and feel that it doesn't matter. After all, 999 of 1000 calls are seeing lower latency than that.

Unfortunately, it's not that simple. One reason is that modern architectures (like microservices and SoA) tend to have a lot of components, so one user interaction can translate into many, many service calls. A common pattern in these systems is that there's some frontend, which could be a service, some Javascript, or an app, which calls a number of backend services to do what it needs to do. Those services then call other services, and so on. This forms two kinds of interactions: parallel fan-out, where a service calls many backends in parallel and waits for them all to complete, and serial chains, where one service calls another, which calls another, and so on.

[Figure: service call graph showing parallel fan-out and serial chains]

These patterns make tail latency more important than you may think.

To understand why, let's do a simple numerical experiment. Let's simplify the world so that all services respond with the same latency, and that latency follows a very simple bimodal distribution: 99% of the time with a mean of 10ms (normally distributed with a standard deviation of 2ms), and 1% of the time with a mean of 100ms (and a standard deviation of 10ms). In the real world, service latencies are almost always multi-modal like this, but typically not just a mixture of normal distributions (but that doesn't matter here).
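To make that concrete, here's a minimal sketch (in Python with NumPy; the names and the random seed are mine) of sampling from this toy distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_latency(rng, n):
    # 1% of calls land in the slow mode.
    slow = rng.random(n) < 0.01
    fast_draws = rng.normal(10, 2, n)    # primary mode: mean 10ms, SD 2ms
    slow_draws = rng.normal(100, 10, n)  # tail mode: mean 100ms, SD 10ms
    return np.where(slow, slow_draws, fast_draws)

samples = sample_latency(rng, 1_000_000)
for p in (50, 99, 99.9):
    print(f"p{p}: {np.percentile(samples, p):.1f}ms")
print(f"mean: {samples.mean():.1f}ms")
```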

Parallel Calls

First, let's consider parallel calls. The logic here is simple: we call N services in parallel, and wait for the slowest one. Intuition suggests that as N increases, it becomes more and more likely that we'll wait for a ~100ms slow call. With N=1, that'll happen around 1% of the time. With N=10, around 10% of the time. In this simple model, that basic intuition is right.
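A quick simulation bears this out (again just a sketch of the model above; the 50ms threshold is an arbitrary line between the two modes):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_latency(shape):
    slow = rng.random(shape) < 0.01
    return np.where(slow, rng.normal(100, 10, shape), rng.normal(10, 2, shape))

trials = 100_000
for n in (1, 2, 5, 10, 50):
    # The fan-out's latency is the max over its N parallel calls.
    worst = sample_latency((trials, n)).max(axis=1)
    hit_tail = (worst > 50).mean()  # 50ms cleanly separates the two modes
    print(f"N={n:3d}: P(wait for a slow call) ~ {hit_tail:.1%} "
          f"(analytic: {1 - 0.99 ** n:.1%})")
```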

The tail mode, which used to be quite rare, starts to dominate as N increases. What was a rare occurrence is now normal. Nearly everybody is having a bad time.

Serial Chains

Serial chains are a little bit more interesting. In this model, services call services, down a chain. The final latency is the sum of all of the service latencies down the chain, and so there are a lot more cases to think about: 1 slow service, 2 slow services, etc. That means that we can expect the overall shape of the distribution to change as N increases. Thanks to the central limit theorem we could work out what that looks like as N gets large, but the journey there is interesting too.

Here, we simulate the effect of chain length on latency in two different worlds: a Tail world, which has the bimodal distribution described above, and a No Tail world, which has only the primary distribution around 10ms.
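Here's a sketch of that simulation (my own reconstruction, not the code behind the original experiment):

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 100_000

def tail_world(shape):
    slow = rng.random(shape) < 0.01
    return np.where(slow, rng.normal(100, 10, shape), rng.normal(10, 2, shape))

def no_tail_world(shape):
    return rng.normal(10, 2, shape)

for n in (1, 2, 5, 10, 20):
    # End-to-end latency is the sum of the latencies down the chain.
    with_tail = tail_world((trials, n)).sum(axis=1)
    without = no_tail_world((trials, n)).sum(axis=1)
    print(f"N={n:2d}: Tail p99={np.percentile(with_tail, 99):7.1f}ms, "
          f"No Tail p99={np.percentile(without, 99):6.1f}ms")
```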

Again, the tail latency becomes more prominent here. That relatively rare tail increases the variance of the distribution we're converging on by a factor of more than 20. That's a huge difference, caused by something that didn't seem too important to start with.
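You can check that variance claim directly with the standard formula for the variance of a mixture (a back-of-envelope calculation under the assumptions above):

```python
# Variance of a two-component mixture: Var = sum(w_i * (sd_i^2 + mu_i^2)) - mu^2
w  = [0.99, 0.01]
mu = [10.0, 100.0]
sd = [2.0, 10.0]

mean = sum(wi * mi for wi, mi in zip(w, mu))  # 10.9ms
var = sum(wi * (si ** 2 + mi ** 2) for wi, si, mi in zip(w, sd, mu)) - mean ** 2
print(f"Tail world variance:    {var:.1f} ms^2")        # ~85.2
print(f"No Tail world variance: {sd[0] ** 2:.1f} ms^2") # 4.0
print(f"ratio: {var / sd[0] ** 2:.1f}x")                # ~21x
```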

Choosing Summary Statistics

One way this should influence your thinking is in how you choose which latency statistics to monitor. The truth is that no summary statistic is going to give you the full picture. Looking at histograms is cool, but tends to miss the time component. You could look at some kind of windowed histogram heat map, but probably won't. Instead, make sure you're aware of the high percentiles of your service's latency, and consider monitoring the end-to-end latency experience of common customer or client use-cases.

Trimmed means, winsorized means, truncated means, interquartile ranges, and other statistics which trim off some of the tail of the distribution seem to be gaining in popularity. There's a lot to like about the trimmed mean and friends, but cutting off the right tail will cause you to miss effects where that tail is very important, and may become dominant depending on how clients call your service.
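To see how easily trimming hides this particular tail, here's a small example (it assumes SciPy for trim_mean; cutting 2.5% off each end swallows the whole 1% tail mode):

```python
import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(42)
n = 1_000_000
slow = rng.random(n) < 0.01
lat = np.where(slow, rng.normal(100, 10, n), rng.normal(10, 2, n))

print(f"mean:         {lat.mean():.2f}ms")                # pulled up by the tail
print(f"trimmed mean: {trim_mean(lat, 0.025):.2f}ms")     # the tail vanishes
print(f"p99.9:        {np.percentile(lat, 99.9):.2f}ms")  # the tail is plainly visible
```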

I continue to believe that if you're going to measure just one thing, make it the mean. However, you probably want to measure more than one thing.
