02-FEB-2011: Why is there packet loss?

Is the Internets dying?

I work at a datacenter. Really, I work at half a datacenter: it spans two physical locations, the one I work at and another on the coast. Because the two halves are so interdependent, reliable connectivity between the sites is crucial for us to actually perform our hosting duties.

Since our datacenters are so far apart, we don't have anything as simple as a cable connecting us. Instead, we lease bandwidth on someone else's network to provide connectivity between the two sites. We are on a budget, and leasing dedicated bandwidth is expensive, so while we have an OC-12 at both sites connected to our provider's network, the provider only guarantees that we can sustain 45 Mbps -- everything above that is best-effort. We didn't know about this limitation until afterwards.

Recently we started an effort to provide "Continuity of Operations," which involved making the data stored at each site available at the other in case of a catastrophic failure. Our continuity-of-operations plan required that we be able to bring the systems back to their pre-failure state, and as quickly as possible. To accomplish this we decided that the best way to get the data to the opposite site and keep it up to date was over the network.

We set this up, let it run one evening, and began noticing increased latency and packet loss from all hosts at the two sites. Looking at the network graphs, we saw that we were only doing 500 Mbps across the WAN link (OC-12, 622 Mbps). We began to question the network group about the packet loss, and they admitted that we really only had 45 Mbps of guaranteed bandwidth between the two sites. They said they were using QoS, so this shouldn't be a problem.

I looked at their Cisco QoS configuration and noticed that it did not specify any sort of bandwidth limit. I pointed out that we weren't reaching our interface capacity, but they seemed convinced that their QoS configuration was being effective. It did set some DSCP values, but all of our traffic between the two sites was being marked with the same value, so it is unlikely to have been useful.

I attempted to explain the issue to them in the following email:


Bob,

Here's a little justification, explanation, and plan of action for our experiment with Quality of Service.

Late last week, when the SAN team began to use unused bandwidth (while not exceeding our link capacity), we experienced packet loss between the datacenters. Packet loss is caused by one of two things:

  1. A device or transit (e.g. cable or repeater) malfunctioning;
  2. A queue being full (or nearly full):
    1. Either an interface's outbound queue;
    2. A device's global queue; or
    3. Random Early Detection (RED) signaling that a queue (one of the above) is nearing fullness

Given the general reliability of modern network devices, and the fact that the packet loss stopped once we reduced the amount of traffic we were transmitting across the network, I think we can eliminate a device malfunction as a cause we should attempt to address.

This leaves us with a queue being full or nearing fullness. This queue may be on a device inside our network (e.g., the firewall) or within the WAN network (e.g., a router or switch on our provider's network).

Let me expound upon three points:

  1. Why packet loss slows down a TCP stream;
  2. Why queues being full (or nearly full, driving up RED drop probabilities) and therefore dropping packets is our problem; and
  3. Why queues being nearly full, increasing latency and thereby slowing TCP connections, is not our problem.

First, the Transmission Control Protocol (TCP) is a network protocol that (among other characteristics) guarantees delivery and ordering of packets within a socket stream. In TCP, packets transmitted by one side are acknowledged by the other side: the receiver transmits a packet back to the sender indicating which packets it has received in a given "acknowledgment window" (a range of bytes). If the receiver has not received enough packets to construct the entire range of bytes that the acknowledgment window covers, it will not send this acknowledgment. If the sender does not receive the acknowledgment within a defined amount of time (various algorithms exist, but we will just assume 2 * AverageRoundTripTime) -- either because the acknowledgment packet itself was lost, or because it was never sent because packet loss left a gap in the receiver's acknowledgment window -- the sender will resend all of the packets in the unacknowledged window.
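
To make the retransmission mechanics concrete, here is a toy sketch (illustrative only, not our production code) of a sender that tracks unacknowledged segments and resends the outstanding window when the simplified 2 * AverageRoundTripTime timer from above expires:

    import time

    AVG_RTT = 0.100       # assumed average round-trip time, in seconds
    RTO = 2 * AVG_RTT     # the simplified retransmission timeout from above

    class ToySender:
        """Tracks unacknowledged segments; resends them when the timer fires."""

        def __init__(self):
            self.unacked = {}    # sequence number -> time the segment was sent

        def send(self, seq):
            self.unacked[seq] = time.monotonic()
            # ... put the segment on the wire here ...

        def ack(self, seq):
            # a cumulative acknowledgment clears everything up to seq
            self.unacked = {s: t for s, t in self.unacked.items() if s > seq}

        def check_timers(self):
            now = time.monotonic()
            for seq in sorted(s for s, t in self.unacked.items() if now - t > RTO):
                # resend and restart the timer; each retransmission costs at
                # least one more round trip before it can be acknowledged
                self.unacked[seq] = now

Every retransmission in check_timers above is an extra round trip during which no new data moves; the paragraphs below quantify what that costs.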

Because TCP guarantees ordering of packets even over media that do not (e.g., Ethernet), the receiving side of a TCP socket must keep a buffer in which to re-order incoming packets. This buffer is naturally of finite size. Thus, if packets have been lost, the receiver must wait for them to be retransmitted before it can deliver any later packets it is holding to the system's socket and evict them from the buffer. The size of the receiver's buffer therefore defines how many packets can be outstanding/unacknowledged at a given time -- the TCP window size.
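
A back-of-the-envelope calculation shows how this buffer bounds throughput: the sender can have at most one window of data in flight per round trip. The window size and round-trip time below are assumptions chosen for illustration:

    # Throughput of a single TCP stream is bounded by window / RTT.
    window_bytes = 64 * 1024      # a common default receive buffer size
    rtt_seconds = 0.100           # assumed round trip between the sites

    max_throughput_bps = window_bytes * 8 / rtt_seconds
    print(max_throughput_bps / 1e6)    # ~5.2 Mbps for a single stream

    # Conversely, the window needed to fill a 622 Mbps OC-12 at 100 ms:
    needed_window_bytes = 622e6 / 8 * rtt_seconds
    print(needed_window_bytes / 1e6)   # ~7.8 MBytes of receive buffer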

Once the sender has sent enough packets to fill the TCP window without receiving an acknowledgment for a contiguous range starting from the last acknowledged window, it will stop transmitting until further data is acknowledged and buffer space becomes available. If we presume an AverageRoundTripTime of 100 ms, the sender will wait 200 ms for an acknowledgment; in the presence of packet loss it will then retransmit the missing segment, which takes an additional 100 ms to arrive and be acknowledged. During that 300 ms interval, a link capable of carrying 100 Mbps could have transmitted nearly 4 MBytes (3.75 MBytes) of data. It is likely that the receiver's buffer will have been exhausted well before then and the sender will have stopped transmitting. With transmission stopped, the average throughput of the link drops significantly.
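
Working that arithmetic out explicitly, with the same assumed numbers as above:

    link_bps = 100e6        # the example link rate from above
    rtt = 0.100             # assumed average round trip, in seconds
    timeout = 2 * rtt       # simplified retransmission timeout
    stall = timeout + rtt   # wait out the timeout, then one more round
                            # trip for the retransmit to be acknowledged

    bytes_lost_to_stall = link_bps * stall / 8
    print(bytes_lost_to_stall / 1e6)   # 3.75 MBytes the link could have carried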

In addition, in the presence of packet loss TCP decreases the window size and requires more frequent acknowledgments, which are themselves limited by the latency.

Second, devices which engage in "store and forward" must use some data structure to hold the packets that remain to be sent out a given interface. This data structure is typically a queue, since it limits the number of packets transmitted out of order while maximizing the number of packets the device can manage. Some network devices have multiple queues per interface and process them in a pre-determined order (e.g., when de-queuing a packet, check the highest-priority queue first; if it is empty, try the next, and so on). These queues are naturally of finite size, and if they are filled faster than they are emptied, eventually no more packets can be queued.
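
A strict-priority de-queue loop of the kind described above might look like the following sketch (a simplification, not any particular vendor's implementation):

    from collections import deque

    # One queue per priority class, highest priority first.
    queues = [deque(), deque(), deque()]

    def enqueue(packet, priority):
        """Classify the packet into its queue; a real device would
        tail-drop here if the queue had reached its depth limit."""
        queues[priority].append(packet)

    def dequeue():
        """Check the highest-priority queue first and fall through to
        the next only when it is empty."""
        for q in queues:
            if q:
                return q.popleft()
        return None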

What happens when an interface's queue is full is simple: packets that would have been added to it are dropped ("tail-drop"). Most modern network devices also implement "Random Early Detection" (RED), dropping packets from a non-empty queue (with a probability that increases the fuller the queue gets) to prevent the queue from becoming completely full and forcing tail-drop (since tail-drop can lead to massive failure with TCP).
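
The classic RED drop decision is roughly the following (a simplified sketch; real implementations track an exponentially weighted moving average of queue depth, and the thresholds here are arbitrary):

    import random

    MIN_TH = 20     # below this average depth, never drop
    MAX_TH = 80     # at or above this depth, drop every arriving packet
    MAX_P = 0.10    # drop probability as the average approaches MAX_TH

    def red_should_drop(avg_queue_depth):
        if avg_queue_depth < MIN_TH:
            return False
        if avg_queue_depth >= MAX_TH:
            return True    # effectively tail-drop
        # drop with linearly increasing probability between the thresholds
        p = MAX_P * (avg_queue_depth - MIN_TH) / (MAX_TH - MIN_TH)
        return random.random() < p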

This packet loss (whether from tail-drop or from RED) slows TCP connections down significantly, allowing the packets in the queue to drain. We were seeing roughly 15% packet loss over the WAN link during the periods when we were heavily utilizing the network path between the two datacenters.
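
How hard does 15% loss hit a single stream? The well-known Mathis et al. approximation, throughput <= MSS / (RTT * sqrt(loss)), gives a rough ceiling; the round-trip time and segment size here are assumptions:

    from math import sqrt

    mss_bytes = 1460    # typical Ethernet-derived maximum segment size
    rtt = 0.100         # assumed round trip between the datacenters
    loss = 0.15         # roughly the loss we observed on the WAN link

    ceiling_bps = (mss_bytes * 8) / (rtt * sqrt(loss))
    print(ceiling_bps / 1e6)    # ~0.3 Mbps -- the stream collapses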

Third, inserting packets into a queue and waiting for their turn to be de-queued takes time, and that time is added latency. Increased latency raises the average utilization of the receiver's TCP receive buffer, and hence of the TCP window. But as long as the latency does not grow to the point where that buffer sits full waiting for acknowledgments, it will not significantly impact the throughput of a TCP socket. We did not note exceedingly high latency during the periods when we were heavily utilizing the network path between the two datacenters.
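
Compare that with the collapse caused by loss: a window-limited stream degrades only in proportion to the round-trip time. Again, the numbers are assumptions for illustration:

    window_bytes = 64 * 1024              # assumed receive buffer size
    for rtt in (0.100, 0.120, 0.150):     # queuing adds latency
        print(rtt, window_bytes * 8 / rtt / 1e6)
    # 100 ms -> ~5.2 Mbps, 120 ms -> ~4.4 Mbps, 150 ms -> ~3.5 Mbps:
    # a gentle slowdown, nothing like the collapse packet loss causes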

Given that this is an issue of a queue filling up somewhere and causing packet loss, it would be beneficial for us to control the queue that is dropping packets, so that we can control WHAT packets are being dropped. Since the packet loss may be happening on network devices outside our control (e.g., within our provider's network), we have limited options for influencing which queue fills and drops:

  1. Differentiated Services Code Point (DSCP) marking; and
  2. Enacting queues on our end that are de-queued no faster than the rate we can sustain between endpoints with acceptable packet loss

I will expound upon the benefits and costs of the two approaches:

  1. Differentiated Services Code Point (DSCP) marking allows us to attempt to influence which queue a packet is assigned to on interfaces that support multiple queues (a marking sketch follows below). This is the simplest approach; however, it has several drawbacks:
    1. Not all network interfaces along the path may support multiple queues, or, if they do, DSCP;
    2. DSCP markings may be ignored or re-written within the provider's network; and
    3. DSCP is very coarse: beyond the default, there are only 13 classes that packets can be placed in (twelve Assured Forwarding classes plus Expedited Forwarding)
  2. If DSCP marking is unavailable, ignored, or insufficient, a more complicated approach must be taken to manage the priority in which packets are dropped in favor of others. One possible solution that I support exploring (again, only if DSCP marking proves ineffective) is Quality of Service (QoS) and Traffic Shaping (TS) on a system that sits in-line with our WAN routers. Such a device could be set up with an arbitrary number of queues, of arbitrary depth, that we control, and we would also control which packets are inserted into them. Combined with traffic shaping, this gives us complete flexibility over which packets are dropped. Traffic shaping is required to ensure that we do not transmit faster than the remote queues are currently being de-queued -- if we did not shape the traffic, the remote queues would still fill and drop packets regardless of which queue the packets began in on our side. To shape the traffic, we need to know the rate at which we can currently transmit from one system to another across the network path -- or, more precisely, whether we have exceeded the capacity of the path. The same device doing QoS and traffic shaping could make that determination: by passively measuring how many TCP retransmits are occurring over the link, it can decide whether the effective rate is too high for a given set of queues. (Sketches of both pieces follow below.)
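
To make the two halves concrete, here are two small sketches. First, DSCP marking from an end host is, on Linux, a matter of setting the TOS byte on a socket; AF21 below is an arbitrary example class, not a recommendation:

    import socket

    AF21 = 18          # example DSCP codepoint, chosen arbitrarily
    tos = AF21 << 2    # DSCP occupies the upper six bits of the TOS byte

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    # every packet this socket sends now carries the AF21 codepoint, which
    # devices along the path may honor, ignore, or rewrite

Second, the shaping half reduces to a token bucket: a packet is released only once enough tokens have accumulated at the target rate. This sketch assumes we already know that rate; the 45 Mbps is our guaranteed rate from above:

    import time

    class TokenBucket:
        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0    # refill rate, in bytes per second
            self.capacity = burst_bytes   # largest burst released at once
            self.tokens = burst_bytes
            self.stamp = time.monotonic()

        def try_send(self, packet_len):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.stamp) * self.rate)
            self.stamp = now
            if self.tokens >= packet_len:
                self.tokens -= packet_len
                return True    # release the packet
            return False       # hold it in a queue we control, rather than
                               # letting a remote queue drop it

    shaper = TokenBucket(rate_bps=45e6, burst_bytes=64 * 1024)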

I propose the following plan of action:

  1. Determine which hosts, subnets, or other definable characteristics can have lower priority;
  2. Attempt to place these hosts and subnets into appropriate DSCP classes;
  3. Perform some testing with DSCP to see if it is effective;
  4. If it is effective, and we are able to classify all of our traffic into DSCP classes, declare victory.
  5. If it is not effective, or we are unable to classify all of our traffic into appropriate DSCP classes, we should investigate QoS/TS:
    1. We should first come up with a tool to passively monitor the number of TCP retransmits, to determine whether we are currently sending down a network path faster than some device in that path can handle (a starting sketch follows this list).
    2. When we are ready to test that tool, we can create a monitor port on one of the WAN routers to mirror outbound traffic from the WAN link, and then attempt to saturate the network path.
    3. Once the tool is able to determine the network path's current limit (this may change over time, depending on what other demands are being placed on the queues of the remote network devices), we can test QoS and traffic shaping by manually configuring the WAN routers' QoS rate between the given networks (e.g., DC2 and DC1) to the rate the tool determined.
    4. Once that is done, we can implement a queuing hierarchy and define traffic shaping requirements on a box that will sit in-line.
    5. Once we have defined our queues (#5.d.), defined which packets will be inserted into them (#1.), and have a tool that continuously determines the throughput of a given network path (#5.a.), we can put the pieces together: the tool will set the traffic shaping parameters on a box sitting in-line with our WAN routers and our WAN link, ensuring we do not overflow remote queues.
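
As a starting point for the monitoring tool in step #5.a., note that Linux already counts retransmitted segments. The following is a minimal sketch that watches that host-wide counter; a real tool would need per-path accounting, e.g. by classifying the traffic mirrored in step #5.b.:

    import time

    def retrans_segs():
        """Read the host-wide TcpRetransSegs counter from /proc/net/snmp."""
        with open("/proc/net/snmp") as f:
            tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
        header, values = tcp_lines    # the first Tcp: line names the fields
        return int(values[header.index("RetransSegs")])

    prev = retrans_segs()
    while True:
        time.sleep(10)
        cur = retrans_segs()
        print("retransmitted segments in the last 10s:", cur - prev)
        # if this climbs as we raise the shaping rate, we have exceeded
        # the path's current capacity and should back the rate off
        prev = cur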


Ultimately, we decided the best thing to do was to do nothing and hope for the best.

Hooray.