This blog post explains how Meta’s Kernel team debugged and fixed an issue in which CPU overload caused request completion times to spike.
Our storage stack recently hit a problem where network processing would use 100% of a CPU and yet not be able to send out enough traffic to meet the application’s needs. We, the Kernel team at Meta, traced this problem to a particular Network Interface Controller (NIC) design choice and to excessive latencies causing TCP to spread out transmissions into small frames instead of using NIC segmentation offload to its full potential. As a result, we decided to raise the floor on TSO frame size by increasing the net.ipv4.tcp_min_tso_segs value using sysctl. Long term, starting with Linux 5.18, TCP will automatically allow full TCP segmentation offload (TSO) bursts for in-datacenter connections.
Our storage team notified us about request completion times skyrocketing. It mostly affected machines with a particular NIC. Affected hosts are configured to use 8 NIC queue sets, so 8 CPUs can perform the packet processing. Looking at affected machines, we noticed that a portion of the CPUs handling network processing was exclusively running the kernel’s software interrupts (softirqs) and stuck at 100%, while the other cores dealt with networking just fine, using only 20-40% of CPU for softirqs.
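(As an aside, the time each CPU spends in the NET_RX softirq can be approximated with a small bpftrace sketch along these lines; this was not part of the original tooling, and the map names are arbitrary. Vector 3 is NET_RX_SOFTIRQ.)

tracepoint:irq:softirq_entry
/args->vec == 3/                 /* 3 == NET_RX_SOFTIRQ */
{
    @start[cpu] = nsecs;
}

tracepoint:irq:softirq_exit
/args->vec == 3 && @start[cpu]/
{
    /* accumulate nanoseconds each CPU spends in NET_RX softirq */
    @net_rx_ns[cpu] = sum(nsecs - @start[cpu]);
    delete(@start[cpu]);
}

interval:s:10 { exit(); }        /* print the per-CPU totals after 10 seconds */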
This overload state is sticky and can last for hours, unlike normal network processing spikes, which usually resolve as soon as traffic goes down. Looking at atop, we observed that the overload starts when traffic spikes (which seems to happen once an hour for this app), at which point all CPUs process networking at 100%. The problem is that some remain at 100% for a long time (usually just one CPU, although we saw multiple CPUs stuck as well).
We also observed packets waiting to be sent on the affected cores (backlogged in the Linux Qdisc), leading to drops at the Qdisc level.
The problem would generally get fixed if we touched any of the NIC settings. NIC setting changes, however, also change the Rx hashing key as a side effect in our case, which changes the TCP flow-to-CPU mapping. So we were not sure whether it was the reshuffling of flows or the NIC reset that fixed the situation.
Since the problem is a CPU overload, our first instinct was to increase the number of NIC queues and spread the load across more CPUs. This proved not to be sufficient—a few days after the increase, we found machines with more rings that exhibited the same problem. We turned our attention to the fact that the machines most affected by the problem had packets stuck in the egress Qdisc, even though the transmit (Tx) rate was nowhere near the line rate.
To explain this, we need to look at the design of the queues used for sending packets between the host and the NIC. Each network processing core uses a set of queues. In very old NICs there would be only two queues in a set—one for receive (Rx) and one for Tx. Modern NICs create separate queues for sending notifications back to the host (notifications meaning “packet was sent” and “packet was received”). This allows the host to learn about these events by reading from its local memory, or hopefully even the L3 cache, rather than reading NIC registers, which would be very slow.
NICs may either have a separate queue for Rx and Tx events (image on the left) or use just a single notification queue (image on the right):
In the configuration with a single completion queue, Tx completions can get stuck behind Rx traffic, which takes more time to process. Under normal scenarios, this doesn’t matter much because the CPU can drain the completion queue and process all events. However, when the CPU is 100% busy, it’s not keeping up with the completions and can process only a portion of them. As a result, even though all the Tx packets have already been sent, the CPU will not know about it until it gets to the corresponding notification, and notifications are spread out through the completion queue, interleaved with Rx packets.
In other words, a single completion queue imposes a form of “fairness” between Rx and Tx. This is a problem for this particular storage workload, which wants to send much more traffic than it receives. It’s a generally accepted best practice in NIC/network stack design to prioritize Tx.
Configuring separate completion queues for this particular NIC design comes at the cost of halving the number of total available queues.
To reproduce the problem easily, we created an experiment—a dozen machines with the number of queues constrained to just 4. The storage team had also invested in creating a new metric tracking the highest softirq load across cores. This allowed us to easily catch machines with single cores stuck at 100% busy even if the application metric did not go dangerously high. Unfortunately, we proved that even with a well-configured NIC, the TCP misbehavior could still be reproduced.
Looking at the traffic stats of the overloaded queue (blue line) vs. other queues, we observed:
The overloaded core sends much smaller TSO super-frames. It sends somewhat more traffic in terms of bytes (~33% more) than other queues, but it does so using 7 times the number of packets. Since network processing cost is primarily a function of the number of frames, not bytes, we now know why the core is at 100%—it generates smaller packets.
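For illustration, per-queue Tx packet and byte counts (and the distribution of segments per TSO frame) can be compared with a bpftrace sketch in the same spirit as the examples in the appendix; the tracepoint fields are standard, while the queue numbering is host-specific and the map names are arbitrary:

tracepoint:net:net_dev_start_xmit
{
    /* per Tx queue: packets, bytes, and segments per TSO frame */
    @pkts[args->queue_mapping] = count();
    @bytes[args->queue_mapping] = sum(args->len);
    @segs[args->queue_mapping] = hist(args->gso_segs);
}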
To understand why the core produces tiny TSO frames rather than sending full 64kB chunks of data down to the NIC, we need to understand a bit about TCP in the data center.
Traditional TCP aims at achieving a steady state called “ACK clocking:”
Let’s assume the application has a lot of data to send and network latency is relatively high. It’s the job of the congestion control algorithm to figure out “how many packets to put in the network” (i.e., the congestion window or SND_CWND in TCP parlance), and then send a new packet every time an ACK for a previous packet comes back. ACKs are the “go” signal for the next send, hence “ACK clocking.”
Unfortunately, in reality, the ACK signal is not very accurate. ACKs often get “compressed” to consume less uplink bandwidth in the Internet wilderness. In the data center, the workloads are bursty (RPC rather than a continuous stream) and the network’s latencies are far lower than the software processing noise. Modern TCP tries to estimate the correct send rate and send out the traffic at that rate without relying on ACK timing. Maintaining a good, well-paced send rate was a cornerstone of the Bottleneck Bandwidth and Round-trip propagation time (BBR) algorithm, but today it’s used for all Linux TCP traffic to a greater or lesser extent.
We are back to the storage workload’s problem. Here is the histogram of pacing rates TCP has computed for flows terminating on good cores vs. the bad core (in bytes per second):
@[good cores]:
[2M, 4M)            1954 |@@@@                                                |
[4M, 8M)            4188 |@@@@@@@@                                            |
[8M, 16M)           4887 |@@@@@@@@@@                                          |
[16M, 32M)          6778 |@@@@@@@@@@@@@@                                      |
[32M, 64M)          8128 |@@@@@@@@@@@@@@@@                                    |
[64M, 128M)        12713 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[128M, 256M)       25025 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256M, 512M)        9010 |@@@@@@@@@@@@@@@@@@                                  |
[512M, 1G)          5649 |@@@@@@@@@@@                                         |
[1G, 2G)            5418 |@@@@@@@@@@@                                         |
[2G, 4G)           11199 |@@@@@@@@@@@@@@@@@@@@@@@                             |
@[bad cores]:
[16K, 32K)          5671 |@@@                                                 |
[32K, 64K)          5273 |@@@                                                 |
[64K, 128K)        17296 |@@@@@@@@@@@                                         |
[128K, 256K)       54935 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[256K, 512K)       60336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[512K, 1M)         79204 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)           33225 |@@@@@@@@@@@@@@@@@@@@@                               |
[2M, 4M)           13774 |@@@@@@@@@                                           |
[4M, 8M)            8349 |@@@@@                                               |
[8M, 16M)           4222 |@@                                                  |
The pacing rates for flows on the bad core are abysmal.
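For reference, a histogram like the one above can be gathered with a bpftrace sketch similar to the appendix examples; the probe point (tcp_sendmsg) and the hard-coded queue number 3 for the overloaded queue are assumptions to adapt per host:

kprobe:tcp_sendmsg
{
    $sk = (struct sock *)reg("di");    /* first argument on x86-64 */

    /* key: does this flow map to the overloaded Rx queue? */
    @[$sk->__sk_common.skc_rx_queue_mapping == 3] = hist($sk->sk_pacing_rate);
}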
The Linux TCP stack tries to send at least one packet per millisecond, which means that to send a TCP Segmentation Offload (TSO) frame with more than 2 MSS worth of data, we need a pacing rate of at least 3MB/s (with a ~1,460-byte MSS, 2 MSS per millisecond works out to roughly 3MB/s). The pacing rate is computed as the Congestion Window (CWND) over the Round-Trip Time (RTT), so let’s look at the RTT (in usec):
@good
[128, 256)          2272 |@@                                                  |
[256, 512)           195 |                                                    |
[512, 1K)          17187 |@@@@@@@@@@@@@@@@@                                   |
[1K, 2K)           50020 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)           19803 |@@@@@@@@@@@@@@@@@@@@                                |
[4K, 8K)           12951 |@@@@@@@@@@@@@                                       |
[8K, 16K)           7616 |@@@@@@@                                             |
[16K, 32K)          5089 |@@@@@                                               |
[32K, 64K)          2765 |@@                                                  |
[64K, 128K)         1173 |@                                                   |
[128K, 256K)          96 |                                                    |
@bad
[512K, 1M)         82845 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1M, 2M)           51511 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
The RTT confirms our hunch that CPU overload leads to excessive queuing. The extent of the queuing is mind-blowing. 1M usecs is 1 second of queuing delay in a data center network. However, upon reflection, when we consider the relatively low rate of packets sent and received, it seems entirely plausible.
We see 2k-4k TSO frames stuck in the Qdisc. With a sending rate of only tens of thousands of frames per second (the per-queue rate is inversely proportional to the number of queues), packets can spend 100ms+ just waiting to be sent—for example, 3,000 queued frames drained at 30,000 frames per second is 100ms of delay. The situation is more dire on the receive path. Our NICs are configured with relatively long Rx and Tx queues to absorb traffic bursts or temporary CPU overloads—there are a few thousand entries on the Rx rings. Again, that is hundreds of milliseconds worth of packets at the relatively low packet rates at which these servers operate.
This is a textbook example of the bufferbloat problem that has been discussed at length in the last decade. The queue should absorb bursts, but not stay filled.
This section describes a complete tangent, so if you’re getting tired, skip to the next section. The solution to bufferbloat is well known—smaller queues or, better still, actively managing the length of the queue. CoDel is the poster child of queue management that prevents bufferbloat.
NICs don’t provide us with much in the way of active queue management (AQM), not even simple Random Early Detection (RED). And even if the industry’s newest NICs did, they would be applying AQM at the wrong point—when packets are put onto the queue rather than when they leave it.
We can, however, execute the policy after we pick the packet up from the queue. In a sense, that may be closer to CoDel’s intent. We can drop or mark the packets quite efficiently thanks to XDP, Linux’s BPF hook that lets us filter raw buffers coming from the NIC before any processing has been invested in them.
To implement CoDel, we need the sojourn time of the packets. Using timing information to control packet flow and prevent network congestion is a very hot topic being worked on by multiple teams at Meta. Here we want to know how long the packet stayed in the host queue. NICs have their own clock that can tell us when a packet arrived from the wire with nanosecond precision (as vendors promise). You may have heard of these clocks in the context of PTP (Precision Time Protocol), another important project at Meta. To get the sojourn time, we have to compare the NIC arrival time to the current time when the CPU processes the packet.
There are two challenges to solve here: the NIC’s timestamp has to be converted to the CPU’s clock domain, and the XDP program cannot read the hardware timestamp by itself. We can solve both by attaching a few extra kprobes that compute the delay:
delay = bpf_ktime_get_ns() - convert(skb->hwtstamp);
The kprobe saves the packet delay to a per-queue BPF map, which is then read by the XDP logic.
The above logic works pretty nicely, but it doesn’t solve the problem. The following table displays the output on a machine that is not overloaded—the control app prints stats every second. There are only two queues configured for this test. The first two lines of the output are per-queue stats, followed by a “total” line that shows packets per second on the entire system, then a report from the next cycle, and so on.
XDP-cpumap  CPU      pps  drop  expt  lo    NAPI   busy  q(ms)    ACK   CE  nECT
XDP-RX        5  241,611     0     0  10   4,289  3,055   0.06      0    0     0
XDP-RX       11  215,923     0     0  23   7,166  2,347   0.07      0    0     0
total            457,535     0
XDP-RX        5  228,791     0     0  36   6,177  2,582   0.06      0    0     0
XDP-RX       11  245,263     0     0   5   5,932  2,923   0.08      0    0     0
total            474,055     0
XDP-RX        5  273,172     0     0   4   2,653  3,779   0.30     75  776     0
XDP-RX       11  255,099     0     0   9   3,767  3,277   0.42    197  514     0
total            528,272     0
XDP-RX        5  259,929     0     0  13   4,340  3,293   1.53      0    0     0
XDP-RX       11  212,942     0     0  13   6,876  2,491   0.07      0    0     0
total            472,872     0
XDP-cpumap  CPU      pps  drop  expt  lo    NAPI   busy  q(ms)    ACK   CE  nECT
XDP-RX        5  208,368     0     0  15   5,824  2,373   0.08      0    0     0
XDP-RX       11  213,279     0     0  13   7,622  2,138   0.06      0    0     0
total            421,648     0
XDP-RX        5  195,566     0     0   4   6,518  2,059   0.18      0    0     0
XDP-RX       11  182,328     0     0  20   9,716  1,643   0.05      0    0     0
total            377,894     0
XDP-RX        5  194,545     0     0  13   3,417  2,423   0.07      0    0     0
XDP-RX       11  195,263     0     0  21   7,508  1,958   0.09      0    0     0
total            389,809     0
XDP-RX        5  215,466     0     0   6   5,484  2,435   0.09      0    0     0
XDP-RX       11  219,894     0     0  14   7,608  2,404   0.05      0    0     0
In the above table, NAPI is the number of NAPI loops that managed to drain the completion queue, and busy is the number of NAPI loops that read 64 packets (the NAPI budget) and returned to the NAPI scheduler even though the completion queue was not fully drained. We can see that with just 2 queues enabled on the system, our wimpy storage CPU had many completions to process, and busy gets reported quite a lot. The softirq CPU load for the networking cores was at 70% here.
We can see CoDel in action in the cycle with the traffic spike (528k packets received in total). Somewhere during that period, latency rose above our target of 10ms and packets got marked (the CE column counts ECN-marked packets; non-ECN traffic would be dropped and show up under drop instead, but there was no CUBIC traffic during the bad period).
After waiting a bit, the machine with just 2 queues gets overloaded, and we have one CPU (CPU 11 in this case) spinning at 100%:
XDP-cpumap  CPU      pps  drop  expt  lo    NAPI  busy  q(ms)    ACK    CE  nECT
XDP-RX        5   35,900     0     0  40  13,676   117   0.07      0     0     0
XDP-RX       11   63,773    71     0   0       0   996  96.01  55081  7261  1430
total             99,673    71
XDP-RX        5   31,213     0     0  54  12,567   124   0.02      0     0     0
XDP-RX       11   54,624    65     0   0       0   853  77.22  47038  6327  1259
total             85,837    65
XDP-RX        5   28,804     0     0  36  12,896   110   0.02      0     0     0
XDP-RX       11   56,384    61     0   0       0   881  82.95  48663  6477  1244
total             85,189    61
XDP-RX        5   21,727     0     0  42  13,005     0   0.04      0     0     0
XDP-RX       11   51,827    61     0   0       0   809  72.17  43349  5767  1185
total             73,554    61
Because the CPU is spinning, the NAPI column (completion queue fully drained) is 0, the queue is “standing,” and all NAPI loops end as busy. XDP CoDel calculated the queue delay to oscillate around 80ms. Keep in mind that the max delay of a standing queue is inversely proportional to the per-queue load, so with 16 queues, by rough math, this number could be 8 times larger.
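Whether NAPI polls run out of budget can also be observed without our custom counters, via the generic napi:napi_poll tracepoint. A small bpftrace sketch (64 is the default budget; the map name is arbitrary):

tracepoint:napi:napi_poll
{
    /* work == budget (64 by default) means the completion queue was not drained */
    @work[str(args->dev_name)] = lhist(args->work, 0, 64, 8);
}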
Our XDP CoDel implementation prevents the queue from building up and does help quite a bit in removing Rx queue overflow drops.
The XDP experiment was a partial success: it did not fix the overload, but it revealed why the queue is stable—most of the packets coming to the machine are ACKs. Recall the ACK clocking section—each ACK we receive causes a new transmission to be scheduled and, subsequently, a new ACK to take a spot in the Rx queue.
In other words, the reason that the queue is ”standing” and doesn’t get drained is most likely because the workload was Tx heavy. Unfortunately, we can’t do much to fix the standing queue full of ACKs using AQM like CoDel because congestion control algorithms do not directly control the ACKs.
Back from the queue tangents and to the important question—why are we sending tiny packets instead of full TSO frames? Full TSO frames are cheaper to process at the sender and generate fewer ACKs.
My naive understanding of TCP congestion control suggests that the increase in queueing delay should be compensated for by an increase in the congestion window. It happens, but not to an extent large enough to compensate for the latency. Here is the histogram of SND_CWND:
@[CUBIC, good cores]:
[4, 8)               218 |@@                                                  |
[8, 16)             4136 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@[CUBIC, bad core]:
[4, 8)               444 |@@@@@@                                              |
[8, 16)             3452 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)            1995 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[32, 64)            3417 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128)           2603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[128, 256)          2229 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[256, 512)          1454 |@@@@@@@@@@@@@@@@@@@@@                               |
@[DCTCP, good cores]:
[4, 8)              2868 |@@@@                                                |
[8, 16)            35148 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)           16200 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[32, 64)           12664 |@@@@@@@@@@@@@@@@@@                                  |
[64, 128)           8250 |@@@@@@@@@@@@                                        |
[128, 256)          4295 |@@@@@@                                              |
[256, 512)          2509 |@@@                                                 |
[512, 1K)           1001 |@                                                   |
[1K, 2K)             638 |                                                    |
@[DCTCP, bad core]:
[4, 8)              4455 |@@@@@                                               |
[8, 16)            43640 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)           28767 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[32, 64)           35249 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
[64, 128)          12776 |@@@@@@@@@@@@@@@                                     |
[128, 256)          8932 |@@@@@@@@@@                                          |
[256, 512)          2648 |@@@                                                 |
[512, 1K)           1634 |@                                                   |
The windows do grow, but nowhere near enough for the pacing rate (i.e., SND_CWND divided by RTT) to reach efficient TSO use. They rarely reach 100 packets, which only covers about 25ms of latency if we want to send TSO frames of just 6kB. This is likely because the congestion control algorithm is conservative, trying to avoid causing bufferbloat in the network. What might also play a role is that these storage flows are relatively short-lived—they seem to terminate after transferring only 10MB of data.
As mentioned, raising the congestion window too aggressively in response to latency can itself induce bufferbloat. Tuning the congestion control algorithms could be an option, but it requires careful research and testing.
We need to counteract Linux’s attempts to smooth out the traffic too much, and for that we have the tcp_min_tso_segs variable, which we can control using sysctl:
tcp_min_tso_segs - INTEGER
Minimal number of segments per TSO frame. Since linux-3.12, TCP does an automatic sizing of TSO frames, depending on flow rate, instead of filling 64Kbytes packets. For specific usages, it's possible to force TCP to build big TSO frames. Note that TCP stack might split too big TSO packets if available window is too small.
Default: 2
Note that the tcp_min_tso_segs parameter does not override SND_CWND. If the sender is CWND-constrained, it will still generate small frames. Luckily, in our tests on overloaded storage boxes, there’s usually enough headroom to generate a larger frame.
The TCP stack has other parameters that make the pacing rate more aggressive. However, raising the floor of the TSO packet size seems like a good fit for our problem. In the experiment with NICs configured to use only 4 queues, setting tcp_min_tso_segs to 8 appeared to be sufficient.
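For example, to apply the setting from the experiment (the sysctl.d file name below is just an illustration):

sysctl -w net.ipv4.tcp_min_tso_segs=8

# to persist across reboots:
echo "net.ipv4.tcp_min_tso_segs = 8" > /etc/sysctl.d/90-tcp-min-tso-segs.conf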
After we reported the problem upstream, the Linux TCP maintainer shared a suggested patch. Instead of increasing min_tso globally, the patch forces min_tso to be high if the estimated min_rtt is below 500us, effectively attempting 64kB sends for all in-datacenter traffic, CWND allowing. This fix will be part of the Linux 5.18 release.
Like in most of our investigations, we depended heavily on bpftrace to gather data. Not all of the scripts were saved (or worth saving), but here are a couple of examples to give the reader an idea.
To collect the histogram of CWND for currently sent TSO frames, comparing normal queues against the overloaded queue (queue 3 in this example), we used:
tracepoint:net:net_dev_start_xmit
{
    if (args->gso_type != 16) { return; } /* skip non-TSO */

    $skb = (struct sk_buff *)args->skbaddr;
    $tp = (struct tcp_sock *)$skb->sk;
    $icsk = (struct inet_connection_sock *)$tp;

    @[$icsk->icsk_ca_ops, $skb->sk->__sk_common.skc_rx_queue_mapping == 3] =
        hist($tp->snd_cwnd);
}
The following is a slightly more complex example, which gets the kprobe parameters from registers and histograms the congestion window on ingress, keyed by the ECN marking of the packet. Using a kfunc instead of a kprobe does not require jumping through the register-casting hoops, but it would occasionally take a very long time to attach due to the CPU livelock in this investigation. In this case, using kprobes was quicker:
kprobe:tcp_filter
{
    $skb = (struct sk_buff *)reg("si");
    $sk = (struct sock *)reg("di");
    $tp = (struct tcp_sock *)$sk;
    $ip = (struct ipv6hdr *)($skb->head + $skb->network_header);
    $th = (struct tcphdr *)($skb->head + $skb->transport_header);
    $ecn = ($ip->flow_lbl[0] & 0x30) >> 4;

    @[cpu, $ecn] = hist($tp->snd_cwnd);
}
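For comparison, here is a sketch of what the kfunc (fentry) flavor of the same probe might look like—typed arguments, no register casting, with the attach-time caveat mentioned above:

kfunc:tcp_filter
{
    $tp = (struct tcp_sock *)args->sk;
    $skb = args->skb;
    $ip = (struct ipv6hdr *)($skb->head + $skb->network_header);
    $ecn = ($ip->flow_lbl[0] & 0x30) >> 4;

    @[cpu, $ecn] = hist($tp->snd_cwnd);
}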