02-FEB-2011: Why is there packet loss?

Is the Internets dying?

Bob,

Here's a little justification, explanation, and plan of action for our experiment with Quality of Service.

Late last week, when the SAN team began to use unused bandwidth (while not exceeding our link capacity), we experienced packet loss between the datacenters. Packet loss is caused by one of two things:

  1. A device or transit medium (e.g., cable or repeater) malfunctioning; or
  2. A queue being full (or nearly full):
    1. An interface's outbound queue;
    2. A device's global queue; or
    3. Random Early Detection (RED) signaling that a queue (one of the above) is nearing fullness

Given the general reliability of modern network devices, and the fact that the packet loss stopped once we reduced the amount of traffic we were transmitting across the network, I think we can eliminate a device malfunction as a cause that we should attempt to address.

This leaves us with a queue being full or nearing fullness. This queue may be on a device inside our network (e.g., the firewall) or within the WAN network (e.g., a router or switch on our provider's network).

First, let me expound upon:

  1. Why packet loss slows down a TCP stream;
  2. Why queues being full (or nearly full, and thus raising RED drop probabilities) and thereby causing packet loss is our problem; and
  3. Why queues being nearly full, increasing latency and thereby slowing TCP connections, is not our problem.

The following addresses those points:

  1. Transmission Control Protocol (TCP) is a network protocol that (among other characteristics) guarantees delivery and ordering of packets within a socket stream. In TCP, packets transmitted by one side are acknowledged by the other side. This acknowledgment is done by transmitting a packet to the sender indicating which packets were received in a given "acknowledgment window" (a range of bytes). If the receiver does not receive enough packets to construct the entire range of bytes that the acknowledgment window covers, it will not send this acknowledgment. If the sender does not receive the acknowledgment within a defined amount of time (various algorithms exist, but we will just assume 2 * AverageRoundTripTime) -- either because the acknowledgment packet itself was lost, or because it was never sent because packet loss left a gap in the receiver's acknowledgment window -- the sender will resend all of the packets in the unacknowledged window. Because TCP guarantees ordering of packets even over media that do not (e.g., Ethernet), the receiving side of a TCP socket must keep a buffer available to re-order incoming packets. This buffer is naturally of finite size. Thus, if packets have been lost, we must wait for them to be retransmitted before any later packets we hold can be emitted to the socket and evicted from the buffer. The size of the receiver's buffer therefore defines how many packets can be outstanding (unacknowledged) at a given time; this is the TCP window size. Once the sender has sent enough packets to fill the TCP window without an acknowledgment covering a contiguous range starting from the last acknowledged byte, it will stop transmitting until further data is acknowledged and buffer space becomes available. If we presume that AverageRoundTripTime is 100 ms, the sender will wait 200 ms for an acknowledgment; in the presence of packet loss it will then retransmit the missing segment, which will take an additional 100 ms to arrive and be acknowledged. During that 300 ms interval, a link capable of carrying 100 Mbps could have transmitted roughly 3.75 MBytes of data, but it is likely that the receiver's buffer will have been exhausted well before then and the sender will have stopped transmitting. With transmission stopped, the average throughput over the link decreases significantly. In addition, in the presence of packet loss TCP decreases the window size and requires more frequent acknowledgments, which are limited by the latency. (A back-of-the-envelope sketch of this stall arithmetic follows the list.)

  2. Devices that engage in "store and forward" must use some data structure to track which packets remain to be sent out a given interface. This data structure is typically a queue, since a queue limits the number of out-of-order packets transmitted out the interface while maximizing the number of packets the device can manage. Some network devices have multiple queues for a given interface and process them in a pre-determined order (e.g., when de-queuing a packet, check the highest-priority queue first; if it is empty, try the next, and so on). These queues are naturally of finite size, and if they are being filled faster than they are being emptied, eventually no more packets can be queued. What happens when an interface's queue is full is simple: packets that would have been added to it are dropped ("tail-drop"). Most modern network devices also implement Random Early Detection (RED), in which packets are dropped from a non-empty queue (with a probability that increases as the queue fills) to prevent the queue from becoming full and forcing tail-drop, since tail-drop can lead to massive failure with TCP. This packet loss (whether from tail-drop or from RED) slows TCP connections down significantly, allowing the packets in the queue to be processed. We were seeing increased packet loss (~15%) over the WAN link during periods when we were heavily utilizing the network path between the two datacenters. (A sketch of the RED drop calculation follows the list.)
  3. In the process of inserting packets into a queue and waiting for their turn to de-queue, time passes. This time manifests as increased latency. Increased latency increases the average utilization of the receiver's TCP receive buffer, and thus of the TCP window. As long as the latency does not grow to the point where that buffer fills while waiting for acknowledgments, it will not significantly impact the throughput of a TCP socket. We did not observe exceedingly high latency during the period when we were heavily utilizing the network path between the two datacenters. (A sketch of the window/RTT throughput cap follows the list.)
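
To put numbers on the stall described in point 1, here is a quick back-of-the-envelope calculation (expressed in Python purely for illustration) using the figures above: a 100 Mbps link, a 100 ms AverageRoundTripTime, and the simplified 2 * AverageRoundTripTime timeout:

    LINK_RATE_BPS = 100e6     # 100 Mbps link
    RTT_S = 0.100             # 100 ms average round-trip time
    RTO_S = 2 * RTT_S         # simplified retransmission timeout
    STALL_S = RTO_S + RTT_S   # timeout plus the retransmit round trip

    # Capacity the link could have carried while the sender was stalled.
    wasted_bytes = LINK_RATE_BPS * STALL_S / 8
    print("%d ms stall wastes %.2f MBytes of link capacity"
          % (STALL_S * 1000, wasted_bytes / 1e6))   # ~3.75 MBytes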
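
To make the RED behavior in point 2 concrete, here is a minimal sketch of the classic drop-probability calculation. The thresholds are illustrative placeholders, not values from any of our actual devices:

    import random

    def red_should_drop(avg_queue_len, min_th=20, max_th=60, max_p=0.10):
        # Classic RED drop decision (thresholds in packets, illustrative).
        # Below min_th nothing is dropped; between min_th and max_th the
        # drop probability ramps linearly up to max_p; at or above max_th
        # the queue is effectively full and we are into tail-drop.
        if avg_queue_len < min_th:
            return False
        if avg_queue_len >= max_th:
            return True
        drop_p = max_p * (avg_queue_len - min_th) / (max_th - min_th)
        return random.random() < drop_p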
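
For point 3, the reason moderate queueing latency alone does not hurt us falls out of the window arithmetic: a single TCP socket's throughput is capped at roughly window / RTT. A sketch, with an assumed-for-illustration 1 MiB receive buffer:

    def max_tcp_throughput_mbps(window_bytes, rtt_s):
        # Upper bound on one TCP socket's throughput: window / RTT.
        return window_bytes * 8 / rtt_s / 1e6

    # At LAN-like RTTs the cap is far above what our links can carry;
    # at the 100 ms RTT from the example above it falls to ~84 Mbps,
    # which is where latency alone would start to matter.
    for rtt_ms in (5, 10, 100):
        print("RTT %3d ms -> cap %8.1f Mbps"
              % (rtt_ms, max_tcp_throughput_mbps(1024 * 1024, rtt_ms / 1000.0)))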

Given that this is an issue with a queue somewhere being full and causing packet loss, it would be beneficial for us to control the queue that is dropping packets, so that we can control WHAT packets are being dropped. Given that the packet loss may be happening on network devices outside our control (e.g., within our provider's network), we have limited options for controlling which queues fill and which packets are dropped:

  1. Differentiated Services Code Point (DSCP) Marking; and
  2. Enact queues on our end that are dequeued no faster than the rate we can sustain with acceptable packet loss between endpoints

I will expound upon the benefits and costs of the two approaches:

  1. Differentiated Services Code Point (DSCP) marking allows us to attempt to influence which queue a packet is assigned to on interfaces that support multiple queues. This is the simplest approach (a minimal sketch of setting a mark from an application follows this list); however, it has several drawbacks:
    1. Not all network interfaces along the path may support multiple queues, or, if they do, DSCP;
    2. DSCP markings may be ignored or re-written within the provider's network; and
    3. DSCP is very coarse: beyond the default, there are only 13 classes that packets can be placed in (the 12 Assured Forwarding code points plus Expedited Forwarding)
  2. If DSCP marking is unavailable, ignored, or insufficient, a more complicated approach must be taken to manage the priority in which packets are dropped in favor of others. One possible solution that I support exploring (again, only if DSCP marking proves ineffective) is using Quality of Service (QoS) and Traffic Shaping (TS) on a system that sits in-line with our WAN routers. This device could be set up with an arbitrary number of queues, of arbitrary depth, that we control, and we would also control which packets are inserted into which queue. Combined with traffic shaping, this would give us nearly unlimited flexibility in controlling which packets are dropped. Traffic shaping is required to ensure that we do not transmit at a rate exceeding the current de-queue rate of the remote queues -- if we did not shape the traffic, the remote queues would still become full and drop packets, regardless of which queue the packets began in on our side. In order to shape the traffic, we need to know at what rate we can currently transmit from one system to another across the network path; or, more precisely, we need to know whether or not we have exceeded the capacity of the path. The same device that performs the QoS and traffic shaping could also determine whether we are exceeding the sustainable rate for a given set of queues: it could measure packet loss passively, by counting the TCP retransmits occurring over the link, and decide whether the effective rate is too high. (Sketches of both the shaping and the passive measurement follow.)
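
As an illustration of how simple the marking itself is, on Linux a DSCP value can be set from an application with a single socket option. This is only a sketch, and AF31 is an arbitrary example class, not a recommendation:

    import socket

    # The DSCP value occupies the upper 6 bits of the IP TOS byte, so the
    # code point is shifted left by 2. AF31 (0x1A) is an arbitrary example.
    DSCP_AF31 = 0x1A

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_AF31 << 2)
    # Traffic sent on this socket now carries the AF31 mark, which devices
    # along the path may honor, ignore, or rewrite.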
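
To make the second approach concrete, the standard building block for the shaping half is a token bucket: packets are released only while byte "tokens" are available, which caps the sustained transmit rate. A minimal sketch; the rate and burst values are placeholders, and in practice they would be tuned from the loss measurement below:

    import time

    class TokenBucket(object):
        # Token-bucket shaper: a sustained rate with a bounded burst.
        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0    # refill rate, bytes per second
            self.capacity = burst_bytes   # maximum burst size
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def try_send(self, packet_len):
            # Return True if a packet of packet_len bytes may go out now.
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_len:
                self.tokens -= packet_len
                return True
            return False                  # caller queues (or drops) it

    # Example: shape to 80 Mbps with a 64 KByte burst allowance.
    shaper = TokenBucket(rate_bps=80e6, burst_bytes=64 * 1024)

The measurement half, counting TCP retransmits passively, reduces to watching sequence numbers repeat. A sketch of the heuristic only (the packet-capture plumbing is omitted):

    def count_retransmits(segments):
        # segments: (seq, payload_len) tuples for one flow direction.
        # A data segment falling entirely below the highest byte already
        # seen is counted as a retransmit -- a simple heuristic, not a
        # complete implementation.
        highest_seen = 0
        retransmits = 0
        for seq, length in segments:
            if length and seq + length <= highest_seen:
                retransmits += 1
            highest_seen = max(highest_seen, seq + length)
        return retransmits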