Network Working Group                                        V. Jacobson
Request for Comments: 1072                                           LBL
                                                               R. Braden
                                                                     ISI
                                                            October 1988


                  TCP Extensions for Long-Delay Paths


Status of This Memo

   This memo proposes a set of extensions to the TCP protocol to provide
   efficient operation over a path with a high bandwidth*delay product.
   These extensions are not proposed as an Internet standard at this
   time. Instead, they are intended as a basis for further
   experimentation and research on transport protocol performance.
   Distribution of this memo is unlimited.

1. INTRODUCTION

   Recent work on TCP performance has shown that TCP can work well over
   a variety of Internet paths, ranging from 800 Mbit/sec I/O channels
   to 300 bit/sec dial-up modems [Jacobson88]. However, there is still
   a fundamental TCP performance bottleneck for one transmission regime:
   paths with high bandwidth and long round-trip delays. The
   significant parameter is the product of bandwidth (bits per second)
   and round-trip delay (RTT in seconds); this product is the number of
   bits it takes to "fill the pipe", i.e., the amount of unacknowledged
   data that TCP must handle in order to keep the pipeline full. TCP
   performance problems arise when this product is large, e.g.,
   significantly exceeds 10**5 bits. We will refer to an Internet path
   operating in this region as a "long, fat pipe", and a network
   containing this path as an "LFN" (pronounced "elephan(t)").

   High-capacity packet satellite channels (e.g., DARPA's Wideband Net)
   are LFN's. For example, a T1-speed satellite channel has a
   bandwidth*delay product of 10**6 bits or more; this corresponds to
   100 outstanding TCP segments of 1200 bytes each! Proposed future
   terrestrial fiber-optical paths will also fall into the LFN class;
   for example, a cross-country delay of 30 ms at a DS3 bandwidth
   (45Mbps) also exceeds 10**6 bits.

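   The bandwidth*delay arithmetic above can be checked with a short
   sketch. The function names are illustrative only and are not part of
   this memo; the DS3 figures (45 Mbps, 30 ms) and the 1200-byte segment
   size are the ones quoted in the text.

```python
def bdp_bits(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth*delay product in bits, using integer arithmetic."""
    return bandwidth_bps * rtt_ms // 1000


def segments_to_fill(pipe_bits: int, segment_bytes: int) -> int:
    """Number of whole segments needed to cover pipe_bits (rounded up)."""
    segment_bits = segment_bytes * 8
    return -(-pipe_bits // segment_bits)


# DS3 path from the text: 45 Mbps with a 30 ms cross-country delay.
print(bdp_bits(45_000_000, 30))          # 1350000, i.e. above 10**6 bits

# A 10**6-bit pipe filled with 1200-byte segments.
print(segments_to_fill(10**6, 1200))     # 105, roughly the 100 cited
```

   Integer milliseconds are used here only to keep the example free of
   floating-point rounding.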
   Clever algorithms alone will not give us good TCP performance over
   LFN's; it will be necessary to actually extend the protocol. This
   RFC proposes a set of TCP extensions for this purpose.

   There are three fundamental problems with the current TCP over LFN
   paths:

   (1) Window Size Limitation

        The TCP header uses a 16 bit field to report the receive window
        size to the sender. Therefore, the largest window that can be
        used is 2**16 = 65K bytes. (In practice, some TCP
        implementations will "break" for windows exceeding 2**15,
        because of their failure to do unsigned arithmetic).

        To circumvent this problem, we propose a new TCP option to allow
        windows larger than 2**16. This option will define an implicit
        scale factor, to be used to multiply the window size value found
        in a TCP header to obtain the true window size.

   (2) Cumulative Acknowledgments

        Any packet losses in an LFN can have a catastrophic effect on
        throughput. This effect is exaggerated by the simple cumulative
        acknowledgment of TCP. Whenever a segment is lost, the
        transmitting TCP will (eventually) time out and retransmit the
        missing segment. However, the sending TCP has no information
        about segments that may have reached the receiver and been
        queued because they were not at the left window edge, so it may
        be forced to retransmit these segments unnecessarily.

        We propose a TCP extension to implement selective
        acknowledgments. By sending selective acknowledgments, the
        receiver of data can inform the sender about all segments that
        have arrived successfully, so the sender need retransmit only
        the segments that have actually been lost.

        Selective acknowledgments have been included in a number of
        experimental Internet protocols -- VMTP [Cheriton88], NETBLT
        [Clark87], and RDP [Velten84]. There is some empirical evidence
        in favor of selective acknowledgments -- simple experiments with
        RDP have shown that disabling the selective acknowledgment
        facility greatly increases the number of retransmitted segments
        over a lossy, high-delay Internet path [Partridge87]. A
        simulation study of a simple form of selective acknowledgments
        added to the ISO transport protocol TP4 also showed promise of
        performance improvement [NBS85].

   (3) Round Trip Timing

        TCP implements reliable data delivery by measuring the RTT,
        i.e., the time interval between sending a segment and receiving
        an acknowledgment for it, and retransmitting any segments that
        are not acknowledged within some small multiple of the average
        RTT. Experience has shown that accurate, current RTT estimates
        are necessary to adapt to changing traffic conditions and,
        without them, a busy network is subject to an instability known
        as "congestion collapse" [Nagle84].

        In part because TCP segments may be repacketized upon
        retransmission, and in part because of complications due to the
        cumulative TCP acknowledgment, measuring a segment's RTT may
        involve a non-trivial amount of computation in some
        implementations. To minimize this computation, some
        implementations time only one segment per window. While this
        yields an adequate approximation to the RTT for small windows
        (e.g., a 4 to 8 segment Arpanet window), for an LFN (e.g., 100
        segment Wideband Network windows) it results in an unacceptably
        poor RTT estimate.

        In the presence of errors, the problem becomes worse. Zhang
        [Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is
        not possible to accumulate reliable RTT estimates if
        retransmitted segments are included in the estimate. Since a
        full window of data will have been transmitted prior to a
        retransmission, all of the segments in that window will have to
        be ACKed before the next RTT sample can be taken. This means at
        least an additional window's worth of time between RTT
        measurements and, as the error rate approaches one per window of
        data (e.g., 10**-6 errors per bit for the Wideband Net), it
        becomes effectively impossible to obtain an RTT measurement.

        We propose a TCP "echo" option that allows each segment to carry
        its own timestamp. This will allow every segment, including
        retransmissions, to be timed at negligible computational cost.

   In designing new TCP options, we must pay careful attention to
   interoperability with existing implementations. The only TCP option
   defined to date is an "initial option", i.e., it may appear only on a
   SYN segment. It is likely that most implementations will properly
   ignore any options in the SYN segment that they do not understand, so
   new initial options should not cause a problem. On the other hand,
   we fear that receiving unexpected non-initial options may cause some
   TCP's to crash.

   Therefore, in each of the extensions we propose, non-initial options
   may be sent only if an exchange of initial options has indicated that
   both sides understand the extension. This approach will also allow a
   TCP to determine when the connection opens how big a TCP header it
   will be sending.

2. TCP WINDOW SCALE OPTION

   The obvious way to implement a window scale factor would be to define
   a new TCP option that could be included in any segment specifying a
   window. The receiver would include it in every acknowledgment
   segment, and the sender would interpret it. Unfortunately, this
   simple approach would not work. The sender must reliably know the
   receiver's current scale factor, but a TCP option in an
   acknowledgment segment will not be delivered reliably (unless the
   ACK happens to be piggy-backed on data).

   However, SYN segments are always sent reliably, suggesting that each
   side may communicate its window scale factor in an initial TCP
   option. This approach has a disadvantage: the scale must be
   established when the connection is opened, and cannot be changed
   thereafter. However, other alternatives would be much more
   complicated, and we therefore propose a new initial option called
   Window Scale.

   2.1 Window Scale Option

      This three-byte option may be sent in a SYN segment by a TCP (1)
      to indicate that it is prepared to do both send and receive window
      scaling, and (2) to communicate a scale factor to be applied to
      its receive window. The scale factor is encoded logarithmically,
      as a power of 2 (presumably to be implemented by binary shifts).

      Note: the window in the SYN segment itself is never scaled.

      TCP Window Scale Option:

         Kind: 3

            +---------+---------+---------+
            | Kind=3  |Length=3 |shift.cnt|
            +---------+---------+---------+

      Here shift.cnt is the number of bits by which the receiver right-
      shifts the true receive-window value, to scale it into a 16-bit
      value to be sent in the TCP header (this scaling is explained
      below). The value shift.cnt may be zero (offering to scale, while
      applying a scale factor of 1 to the receive window).

      This option is an offer, not a promise; both sides must send
      Window Scale options in their SYN segments to enable window
      scaling in either direction.

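      As a minimal sketch of the three-byte layout above, the option
      can be built and parsed as follows. The function and constant
      names are hypothetical and chosen for illustration; only the Kind
      and Length values come from this memo.

```python
WINDOW_SCALE_KIND = 3     # Kind value defined in this memo
WINDOW_SCALE_LENGTH = 3   # Kind + Length + shift.cnt


def encode_window_scale(shift_cnt: int) -> bytes:
    """Build the Window Scale option carried in a SYN segment."""
    if not 0 <= shift_cnt <= 255:
        raise ValueError("shift.cnt must fit in one byte")
    return bytes([WINDOW_SCALE_KIND, WINDOW_SCALE_LENGTH, shift_cnt])


def decode_window_scale(option: bytes) -> int:
    """Return shift.cnt from a received Window Scale option."""
    if (len(option) != WINDOW_SCALE_LENGTH
            or option[0] != WINDOW_SCALE_KIND
            or option[1] != WINDOW_SCALE_LENGTH):
        raise ValueError("not a Window Scale option")
    return option[2]


print(encode_window_scale(7).hex())   # 030307
```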
   2.2 Using the Window Scale Option

      A model implementation of window scaling is as follows, using the
      notation of RFC-793 [Postel81]:

      *    The send-window (SND.WND) and receive-window (RCV.WND) sizes
           in the connection state block and in all sequence space
           calculations are expanded from 16 to 32 bits.

      *    Two window shift counts are added to the connection state:
           snd.scale and rcv.scale. These are shift counts to be
           applied to the incoming and outgoing windows, respectively.
           The precise algorithm is shown below.

      *    All outgoing SYN segments are sent with the Window Scale
           option, containing a value shift.cnt = R that the TCP would
           like to use for its receive window.

      *    Snd.scale and rcv.scale are initialized to zero, and are
           changed only during processing of a received SYN segment. If
           the SYN segment contains a Window Scale option with shift.cnt
           = S, set snd.scale to S and set rcv.scale to R; otherwise,
           both snd.scale and rcv.scale are left at zero.

      *    The window field (SEG.WND) in the header of every incoming
           segment, with the exception of SYN segments, will be left-
           shifted by snd.scale bits before updating SND.WND:

              SND.WND = SEG.WND << snd.scale

           (assuming the other conditions of RFC793 are met, and using
           the "C" notation "<<" for left-shift).

      *    The window field (SEG.WND) of every outgoing segment, with
           the exception of SYN segments, will have been right-shifted
           by rcv.scale bits:

              SEG.WND = RCV.WND >> rcv.scale.

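      The model implementation above can be sketched as a small piece
      of connection state. This is an illustration under stated
      assumptions, not a definitive implementation: the class and
      method names are hypothetical, and the caller is assumed to skip
      both scaling methods for SYN segments, whose window is never
      scaled.

```python
class WindowScaling:
    """Per-connection window scale state (names are illustrative)."""

    def __init__(self, desired_shift: int):
        self.desired_shift = desired_shift  # R, sent in our own SYN
        self.snd_scale = 0                  # applied to incoming SEG.WND
        self.rcv_scale = 0                  # applied to outgoing RCV.WND

    def on_syn_received(self, peer_shift):
        """peer_shift is S from the peer's SYN, or None if it sent no option."""
        if peer_shift is not None:
            self.snd_scale = peer_shift
            self.rcv_scale = self.desired_shift
        # Otherwise both shift counts stay at zero: no scaling.

    def snd_wnd_from(self, seg_wnd: int) -> int:
        """SND.WND = SEG.WND << snd.scale (non-SYN segments only)."""
        return seg_wnd << self.snd_scale

    def seg_wnd_from(self, rcv_wnd: int) -> int:
        """SEG.WND = RCV.WND >> rcv.scale (non-SYN segments only)."""
        return rcv_wnd >> self.rcv_scale


conn = WindowScaling(desired_shift=3)
conn.on_syn_received(peer_shift=2)
print(conn.snd_wnd_from(0xFFFF))   # 262140
print(conn.seg_wnd_from(80000))    # 10000
```

      If the peer's SYN carries no Window Scale option, both shift
      counts remain zero and the two methods return their arguments
      unchanged, which is the interoperability behaviour the text
      describes.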
      TCP determines if a data segment is "old" or "new" by testing if
      its sequence number is within 2**31 bytes of the left edge of the
      window. If not, the data is "old" and discarded. To ensure that
      new data is never mistakenly considered old and vice-versa, the
      left edge of the sender's window has to be at least 2**31 away
      from the right edge of the receiver's window. Similarly with the
      sender's right edge and receiver's left edge. Since the right and
      left edges of either the sender's or receiver's window differ by
      the window size, and since the sender and receiver windows can be
      out of phase by at most the window size, the above constraints
      imply that 2 * the max window size must be less than 2**31, or

           max window < 2**30

      Since the max window is 2**S (where S is the scaling shift count)
      times at most 2**16 - 1 (the maximum unscaled window), the maximum
      window is guaranteed to be < 2**30 if S <= 14. Thus, the shift
      count must be limited to 14. (This allows windows of 2**30 = 1
      Gbyte.) If a Window Scale option is received with a shift.cnt
      value exceeding 14, the TCP should log the error but use 14
      instead of the specified value.

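      The bound on the shift count can be verified numerically. The
      sketch below is illustrative only; the helper names are
      assumptions, and Python's standard logging module stands in for
      whatever error log a real TCP would use.

```python
import logging

MAX_SHIFT = 14                   # limit derived in the text
MAX_UNSCALED_WINDOW = 2**16 - 1  # largest value of the 16-bit window field


def max_window(shift: int) -> int:
    """Largest true window expressible with the given shift count."""
    return MAX_UNSCALED_WINDOW << shift


def effective_shift(received_shift: int) -> int:
    """Clamp a received shift.cnt to 14, logging when it was too large."""
    if received_shift > MAX_SHIFT:
        logging.warning("shift.cnt %d exceeds %d; using %d",
                        received_shift, MAX_SHIFT, MAX_SHIFT)
        return MAX_SHIFT
    return received_shift


print(max_window(14) < 2**30)   # True
print(max_window(15) < 2**30)   # False
print(effective_shift(20))      # 14
```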
3. TCP SELECTIVE ACKNOWLEDGMENT OPTIONS

   To minimize the impact on the TCP protocol, the selective
   acknowledgment extension uses the form of two new TCP options. The
   first is an enabling option, "SACK-permitted", that may be sent in a
   SYN segment to indicate that the SACK option may be used once the
   connection is established. The other is the SACK option itself,
   which may be sent over an established connection once permission has
   been given by SACK-permitted.

   The SACK option is to be included in a segment sent from a TCP that
   is receiving data to the TCP that is sending that data; we will refer
   to these TCP's as the data receiver and the data sender,
   respectively. We will consider a particular simplex data flow; any
   data flowing in the reverse direction over the same connection can be
   treated independently.

   3.1 SACK-Permitted Option

      This two-byte option may be sent in a SYN by a TCP that has been
      extended to receive (and presumably process) the SACK option once
      the connection has opened.

      TCP Sack-Permitted Option:

         Kind: 4

            +---------+---------+
            | Kind=4  | Length=2|
            +---------+---------+

   3.2 SACK Option

      The SACK option is to be used to convey extended acknowledgment
      information over an established connection. Specifically, it is
      to be sent by a data receiver to inform the data transmitter of
      non-contiguous blocks of data that have been received and queued.
      The data receiver is awaiting the receipt of data in later
      retransmissions to fill the gaps in sequence space between these
      blocks. At that time, the data receiver will acknowledge the data
      normally by advancing the left window edge in the Acknowledgment
      Number field of the TCP header.

      It is important to understand that the SACK option will not change
      the meaning of the Acknowledgment Number field, whose value will
      still specify the left window edge, i.e., one byte beyond the last
      sequence number of fully-received data. The SACK option is
      advisory; if it is ignored, TCP acknowledgments will continue to
      function as specified in the protocol.

      However, SACK will provide additional information that the data
      transmitter can use to optimize retransmissions. The TCP data
      receiver may include the SACK option in an acknowledgment segment
      whenever it has data that is queued and unacknowledged. Of
      course, the SACK option may be sent only when the TCP has received
      the SACK-permitted option in the SYN segment for that connection.

      TCP SACK Option:

         Kind: 5

         Length: Variable

            +--------+--------+--------+--------+--------+--------+...---+
            | Kind=5 | Length | Relative Origin |   Block Size    |      |
            +--------+--------+--------+--------+--------+--------+...---+

      This option contains a list of the blocks of contiguous sequence
      space occupied by data that has been received and queued within
      the window. Each block is contiguous and isolated; that is, the
      octets just below the block,

           Acknowledgment Number + Relative Origin - 1,

      and just above the block,

           Acknowledgment Number + Relative Origin + Block Size,

      have not been received.

      Each contiguous block of data queued at the receiver is defined in
      the SACK option by two 16-bit integers:

      *    Relative Origin

           This is the first sequence number of this block, relative to
           the Acknowledgment Number field in the TCP header (i.e.,
           relative to the data receiver's left window edge).

      *    Block Size

           This is the size in octets of this block of contiguous data.

      A SACK option that specifies n blocks will have a length of 4*n+2
      octets, so the 44 bytes available for TCP options can specify a
      maximum of 10 blocks. Of course, if other TCP options are
      introduced, they will compete for the 44 bytes, and the limit of
      10 may be reduced in particular segments.

      There is no requirement on the order in which blocks can appear in
      a single SACK option.

         Note: requiring that the blocks be ordered would allow a
         slightly more efficient algorithm in the transmitter; however,
         this does not seem to be an important optimization.

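      The SACK option layout and the 4*n+2 length rule can be
      illustrated with a short encoder. This is a sketch with
      hypothetical names; it assumes network byte order for the two
      16-bit fields and takes the available option space as a
      parameter, using the 44-byte figure quoted in the text.

```python
import struct

SACK_KIND = 5   # Kind value defined in this memo


def encode_sack(blocks) -> bytes:
    """Encode (relative_origin, block_size) pairs as one SACK option."""
    length = 4 * len(blocks) + 2
    option = struct.pack("!BB", SACK_KIND, length)
    for origin, size in blocks:
        option += struct.pack("!HH", origin, size)   # two 16-bit integers
    return option


def max_sack_blocks(option_space: int) -> int:
    """Largest n such that 4*n + 2 fits in option_space octets."""
    return (option_space - 2) // 4


print(encode_sack([(500, 3500)]).hex())   # 050601f40dac
print(max_sack_blocks(44))                # 10
```

      Passing a smaller option_space shows how other options competing
      for the same bytes reduce the number of blocks that fit.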
   3.3 SACK with Window Scaling

      If window scaling is in effect, then 16 bits may not be sufficient
      for the SACK option fields that define the origin and length of a
      block. There are two possible ways to handle this:

      (1)  Expand the SACK origin and length fields to 24 or 32 bits.

      (2)  Scale the SACK fields by the same factor as the window.

      The first alternative would significantly reduce the number of
      blocks possible in a SACK option; therefore, we have chosen the
      second alternative, scaling the SACK information as well as the
      window.

      Scaling the SACK information introduces some loss of precision,
      since a SACK option must report queued data blocks whose origins
      and lengths are multiples of the window scale factor, i.e., of
      2**rcv.scale. These reported blocks must be equal to or smaller
      than the actual blocks of queued data.

      Specifically, suppose that the receiver has a contiguous block of
      queued data that occupies sequence numbers L, L+1, ... L+N-1
      (measured relative to the Acknowledgment Number), and that the
      window scale factor is S = 2**rcv.scale. Then the corresponding
      block that will be reported in a SACK option will be:

           Relative Origin = int((L+S-1)/S)

           Block Size = int((L+N)/S) - (Relative Origin)

      where the function int(x) returns the greatest integer contained
      in x.

      The resulting loss of precision is not a serious problem for the
      sender. If the data-sending TCP keeps track of the boundaries of
      all segments in its retransmission queue, it will generally be
      able to infer from the imprecise SACK data which full segments
      don't need to be retransmitted. This will fail only if S is
      larger than the maximum segment size, in which case some segments
      may be retransmitted unnecessarily. If the sending TCP does not
      keep track of transmitted segment boundaries, the imprecision of
      the scaled SACK quantities will only result in retransmitting a
      small amount of unneeded sequence space. On the average, the data
      sender will unnecessarily retransmit J*S bytes of the sequence
      space for each SACK received; here J is the number of blocks
      reported in the SACK, and S = 2**snd.scale is the scale factor as
      seen by the data sender.

   3.4 SACK Option Examples

      Assume the left window edge is 5000 and that the data transmitter
      sends a burst of 8 segments, each containing 500 data bytes.
      Unless specified otherwise, we assume that the scale factor S = 1.

         Case 1: The first 4 segments are received but the last 4 are
         dropped.

            The data receiver will return a normal TCP ACK segment
            acknowledging sequence number 7000, with no SACK option.

         Case 2: The first segment is dropped but the remaining 7 are
         received.

            The data receiver will return a TCP ACK segment that
            acknowledges sequence number 5000 and contains a SACK option
            specifying one block of queued data:

                 Relative Origin = 500;  Block Size = 3500

         Case 3: The 2nd, 4th, 6th, and 8th (last) segments are
         dropped.

            The data receiver will return a TCP ACK segment that
            acknowledges sequence number 5500 and contains a SACK option
            specifying the 3 blocks:

                 Relative Origin =  500;  Block Size = 500
                 Relative Origin = 1500;  Block Size = 500
                 Relative Origin = 2500;  Block Size = 500

         Case 4: Same as Case 3, except Scale Factor S = 16.

            The SACK option would specify the 3 scaled blocks:

                 Relative Origin =  32;  Block Size = 30
                 Relative Origin =  94;  Block Size = 31
                 Relative Origin = 157;  Block Size = 30

            These three reported blocks cover relative sequence numbers
            512 through 991, 1504 through 1999, and 2512 through 2991,
            respectively.

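      The scaled values in Case 4 can be reproduced from the Relative
      Origin and Block Size formulas of Section 3.3. The sketch below
      uses hypothetical function names and treats the scale argument as
      the multiplicative scale factor S (16 in Case 4), with block
      positions measured relative to the Acknowledgment Number; a block
      too small to contain any whole scaled unit is reported as None,
      which is an assumption of this sketch rather than something the
      memo specifies.

```python
def scaled_sack_block(rel_start: int, nbytes: int, scale: int):
    """Apply the Section 3.3 formulas; return (origin, size) or None."""
    origin = (rel_start + scale - 1) // scale        # int((L+S-1)/S)
    size = (rel_start + nbytes) // scale - origin    # int((L+N)/S) - origin
    if size <= 0:
        return None   # nothing representable at this scale (assumption)
    return origin, size


def covered_range(origin: int, size: int, scale: int):
    """First and last relative sequence numbers a scaled block reports."""
    return origin * scale, (origin + size) * scale - 1


# The three queued blocks of Case 3, rescaled with S = 16 as in Case 4.
for rel_start in (500, 1500, 2500):
    block = scaled_sack_block(rel_start, 500, 16)
    print(block, covered_range(block[0], block[1], 16))
```

      With scale = 1 the same function returns the unscaled values of
      Case 3, since both divisions become exact.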
   3.5 Generating the SACK Option

      Let us assume that the data receiver maintains a queue of valid
      segments that it has neither passed to the user nor acknowledged
      because of earlier missing data, and that this queue is ordered by
      starting sequence number. Computation of the SACK option can be
      done with one pass down this queue. Segments that occupy
      contiguous sequence space are aggregated into a single SACK block,
      and each gap in the sequence space (except a gap that is
      terminated by the right window edge) triggers the start of a new
      SACK block. If this algorithm defines more than 10 blocks, only
      the first 10 can be included in the option.

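      The single pass described above might look like the following
      sketch. The function name and the (start, length) representation
      of queued segments are assumptions made for illustration, and
      sequence-number wraparound and window scaling are ignored for
      brevity.

```python
def sack_blocks(ack_number: int, queued_segments, max_blocks: int = 10):
    """One pass over queued (start_seq, length) pairs sorted by start_seq.

    Returns a list of (relative_origin, block_size) pairs.
    """
    blocks = []   # each entry is [start, end) in absolute sequence space
    for start, length in queued_segments:
        end = start + length
        if blocks and start <= blocks[-1][1]:
            # Contiguous with (or overlapping) the current block: extend it.
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            # A gap starts a new block, up to the option's capacity.
            if len(blocks) == max_blocks:
                break
            blocks.append([start, end])
    return [(s - ack_number, e - s) for s, e in blocks]


# Case 3 of Section 3.4: segments 3, 5 and 7 are queued, ACK is 5500.
print(sack_blocks(5500, [(6000, 500), (7000, 500), (8000, 500)]))
# [(500, 500), (1500, 500), (2500, 500)]
```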
   3.6 Interpreting the SACK Option

      The data transmitter is assumed to have a retransmission queue
      that contains the segments that have been transmitted but not yet
      acknowledged, in sequence-number order. If the data transmitter
      performs re-packetization before retransmission, the block
      boundaries in a SACK option that it receives may not fall on
      boundaries of segments in the retransmission queue; however, this
      does not pose a serious difficulty for the transmitter.

      Let us suppose that for each segment in the retransmission queue
      there is a (new) flag bit "ACK'd", to be used to indicate that
      this particular segment has been entirely acknowledged. When a
      segment is first transmitted, it will be entered into the
      retransmission queue with its ACK'd bit off. If the ACK'd bit is
      subsequently turned on (as the result of processing a received
      SACK option), the data transmitter will skip this segment during
      any later retransmission. However, the segment will not be
      dequeued and its buffer freed until the left window edge is
      advanced over it.

      When an acknowledgment segment arrives containing a SACK option,
      the data transmitter will turn on the ACK'd bits for segments that
      have been selectively acknowledged. More specifically, for each
      block in the SACK option, the data transmitter will turn on the
      ACK'd flags for all segments in the retransmission queue that are
      wholly contained within that block. This requires straightforward
      sequence number comparisons.

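      A minimal sketch of this processing follows, assuming an
      unscaled SACK option and ignoring sequence-number wraparound.
      The QueuedSegment type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedSegment:
    seq: int             # first sequence number of the segment
    length: int          # number of data octets
    acked: bool = False  # the "ACK'd" flag bit described in the text


def apply_sack(queue, ack_number: int, blocks) -> None:
    """Turn on ACK'd for segments wholly contained in any SACK block."""
    for origin, size in blocks:
        lo = ack_number + origin
        hi = lo + size
        for seg in queue:
            if lo <= seg.seq and seg.seq + seg.length <= hi:
                seg.acked = True


def to_retransmit(queue):
    """Segments still eligible for retransmission (ACK'd bit off)."""
    return [seg for seg in queue if not seg.acked]


# Case 3 of Section 3.4: ACK 5500 with three 500-octet SACK blocks.
queue = [QueuedSegment(5500 + 500 * i, 500) for i in range(7)]
apply_sack(queue, 5500, [(500, 500), (1500, 500), (2500, 500)])
print([seg.seq for seg in to_retransmit(queue)])   # [5500, 6500, 7500, 8500]
```

      Segments flagged ACK'd stay in the queue, matching the rule that
      buffers are freed only when the left window edge advances over
      them.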
4. TCP ECHO OPTIONS

   A simple method for measuring the RTT of a segment would be: the
   sender places a timestamp in the segment and the receiver returns
   that timestamp in the corresponding ACK segment. When the ACK segment
   arrives at the sender, the difference between the current time and
   the timestamp is the RTT. To implement this timing method, the
   receiver must simply reflect or echo selected data (the timestamp)
   from the sender's segments. This idea is the basis of the "TCP Echo"
   and "TCP Echo Reply" options.

   4.1 TCP Echo and TCP Echo Reply Options

      TCP Echo Option:

         Kind: 6

         Length: 6

            +--------+--------+--------+--------+--------+--------+
            | Kind=6 | Length |   4 bytes of info to be echoed    |
            +--------+--------+--------+--------+--------+--------+

   This option carries four bytes of information that the receiving TCP
   may send back in a subsequent TCP Echo Reply option (see below). A
   TCP may send the TCP Echo option in any segment, but only if a TCP
   Echo option was received in a SYN segment for the connection.

   When the TCP echo option is used for RTT measurement, it will be
   included in data segments, and the four information bytes will define
   the time at which the data segment was transmitted in any format
   convenient to the sender.

      TCP Echo Reply Option:

         Kind: 7

         Length: 6

            +--------+--------+--------+--------+--------+--------+
            | Kind=7 | Length |      4 bytes of echoed info       |
            +--------+--------+--------+--------+--------+--------+

   A TCP that receives a TCP Echo option containing four information
   bytes will return these same bytes in a TCP Echo Reply option.

   This TCP Echo Reply option must be returned in the next segment
   (e.g., an ACK segment) that is sent. If more than one Echo option is
   received before a reply segment is sent, the TCP must choose only one
   of the options to echo, ignoring the others; specifically, it must
   choose the newest segment with the oldest sequence number (see next
   section).

   To use the TCP Echo and Echo Reply options, a TCP must send a TCP
   Echo option in its own SYN segment and receive a TCP Echo option in a
   SYN segment from the other TCP. A TCP that does not implement the
   TCP Echo or Echo Reply options must simply ignore any TCP Echo
   options it receives. However, a TCP should not receive one of these
   options in a non-SYN segment unless it included a TCP Echo option in
   its own SYN segment.

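   The two six-byte layouts can be illustrated as follows. This is a
   sketch with hypothetical function names; it treats the four
   information bytes as an unsigned 32-bit integer in network byte
   order, which is one convenient choice, since the memo leaves the
   format of the echoed information to the sender.

```python
import struct

ECHO_KIND = 6        # Kind values defined in this memo
ECHO_REPLY_KIND = 7
ECHO_LENGTH = 6      # Kind + Length + 4 information bytes


def encode_echo(info: int) -> bytes:
    """Build a TCP Echo option carrying a 32-bit value (e.g. a timestamp)."""
    return struct.pack("!BBI", ECHO_KIND, ECHO_LENGTH, info & 0xFFFFFFFF)


def echo_reply_for(echo_option: bytes) -> bytes:
    """Build the Echo Reply that returns the same four bytes."""
    kind, length, info = struct.unpack("!BBI", echo_option)
    if kind != ECHO_KIND or length != ECHO_LENGTH:
        raise ValueError("not a TCP Echo option")
    return struct.pack("!BBI", ECHO_REPLY_KIND, ECHO_LENGTH, info)


echo = encode_echo(0x01020304)
print(echo.hex())                   # 060601020304
print(echo_reply_for(echo).hex())   # 070601020304
```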
   4.2 Using the Echo Options

   If we wish to use the Echo/Echo Reply options for RTT measurement, we
   have to define what the receiver does when there is not a one-to-one
   correspondence between data and ACK segments. Assuming that we want
   to minimize the state kept in the receiver (i.e., the number of
   unprocessed Echo options), we can plan on a receiver remembering the
   information value from at most one Echo between ACKs. There are
   three situations to consider:

   (A)  Delayed ACKs.

        Many TCP's acknowledge only every Kth segment out of a group of
        segments arriving within a short time interval; this policy is
        known generally as "delayed ACK's". The data-sender TCP must
        measure the effective RTT, including the additional time due to
        delayed ACK's, or else it will retransmit unnecessarily. Thus,
        when delayed ACK's are in use, the receiver should reply with
        the Echo option information from the earliest unacknowledged
        segment.

   (B)  A hole in the sequence space (segment(s) have been lost).

        The sender will continue sending until the window is filled, and
        we may be generating ACKs as these out-of-order segments arrive
        (e.g., for the SACK information or to aid "fast retransmit").
        An Echo Reply option will tell the sender the RTT of some
        recently sent segment (since the ACK can only contain the
        sequence number of the hole, the sender may not be able to
        determine which segment, but that doesn't matter). If the loss
        was due to congestion, these RTTs may be particularly valuable
        to the sender since they reflect the network characteristics
        immediately after the congestion.

   (C)  A filled hole in the sequence space.

        The segment that fills the hole represents the most recent
        measurement of the network characteristics. On the other hand,
        an RTT computed from an earlier segment would probably include
        the sender's retransmit time-out, badly biasing the sender's
        average RTT estimate.

   Case (A) suggests the receiver should remember and return the Echo
   option information from the oldest unacknowledged segment. Cases (B)
   and (C) suggest that the option should come from the most recent
   unacknowledged segment. An algorithm that covers all three cases is
   for the receiver to return the Echo option information from the
   newest segment with the oldest sequence number, as specified earlier.

   A model implementation of these options is as follows.

   (1)  Receiver Implementation

        A 32-bit slot for Echo option data, rcv.echodata, is added to
        the receiver connection state, together with a flag,
        rcv.echopresent, that indicates whether there is anything in the
        slot. When the receiver generates a segment, it checks
        rcv.echopresent and, if it is set, adds an echo-reply option
        containing rcv.echodata to the outgoing segment then clears
        rcv.echopresent.

        If an incoming segment is in the window and contains an echo
        option, the receiver checks rcv.echopresent. If it isn't set,
        the value of the echo option is copied to rcv.echodata and
        rcv.echopresent is set. If rcv.echopresent is already set, the
        receiver checks whether the segment is at the left edge of the
        window. If so, the segment's echo option value is copied to
        rcv.echodata (this is situation (C) above). Otherwise, the
        segment's echo option is ignored.

   (2)  Sender Implementation

        The sender's connection state has a single flag bit,
        snd.echoallowed, added. If snd.echoallowed is set or if the
        segment contains a SYN, the sender is free to add a TCP Echo
        option (presumably containing the current time in some units
        convenient to the sender) to every outgoing segment.

        Snd.echoallowed should be set if a SYN is received with a TCP
        Echo option (presumably, a host that implements the option will
        attempt to use it to time the SYN segment).

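   The receiver side of the model implementation can be sketched as
   below. The class and method names are hypothetical. Whether an
   arriving segment is "in the window" and "at the left edge of the
   window" is passed in by the caller as booleans; this sketch takes no
   position on how a real TCP computes them.

```python
class EchoReceiver:
    """Holds rcv.echodata and rcv.echopresent for one connection."""

    def __init__(self):
        self.echodata = 0         # 32-bit slot for Echo option data
        self.echopresent = False  # is there anything in the slot?

    def on_segment(self, echo_value, in_window: bool, at_left_edge: bool):
        """Process an incoming segment; echo_value is None if no Echo option."""
        if echo_value is None or not in_window:
            return
        if not self.echopresent:
            self.echodata = echo_value
            self.echopresent = True
        elif at_left_edge:
            # Situation (C): the segment that fills the hole wins.
            self.echodata = echo_value
        # Otherwise the segment's echo option is ignored.

    def take_reply(self):
        """Value for an Echo Reply in the next outgoing segment, or None."""
        if not self.echopresent:
            return None
        self.echopresent = False
        return self.echodata


rcv = EchoReceiver()
rcv.on_segment(300, in_window=True, at_left_edge=False)  # out of order
rcv.on_segment(400, in_window=True, at_left_edge=True)   # fills the hole
print(rcv.take_reply())   # 400
print(rcv.take_reply())   # None
```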
5. CONCLUSIONS AND ACKNOWLEDGMENTS

   We have proposed five new TCP options for scaled windows, selective
   acknowledgments, and round-trip timing, in order to provide efficient
   operation over large-bandwidth*delay-product paths. These extensions
   are designed to provide compatible interworking with TCP's that do not
   implement the extensions.

   The Window Scale option was originally suggested by Mike St. Johns of
   USAF/DCA. The present form of the option was suggested by Mike Karels
   of UC Berkeley in response to a more cumbersome scheme proposed by Van
   Jacobson. Gerd Beling of FGAN (West Germany) contributed the initial
   definition of the SACK option.

   All three extensions have evolved through discussion with the
   End-to-End Task Force, and the authors are grateful to the other
   members of the Task Force for their advice and encouragement.

6. REFERENCES

   [Cheriton88]  Cheriton, D., "VMTP: Versatile Message Transaction
      Protocol", RFC 1045, Stanford University, February 1988.

   [Jain86]  Jain, R., "Divergence of Timeout Algorithms for Packet
      Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm.,
      Scottsdale, Arizona, March 1986.

   [Karn87]  Karn, P. and C. Partridge, "Estimating Round-Trip Times
      in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT,
      August 1987.

   [Clark87]  Clark, D., Lambert, M., and L. Zhang, "NETBLT: A Bulk
      Data Transfer Protocol", RFC 998, MIT, March 1987.

   [Nagle84]  Nagle, J., "Congestion Control in IP/TCP
      Internetworks", RFC 896, FACC, January 1984.

   [NBS85]  Colella, R., Aronoff, R., and K. Mills, "Performance
      Improvements for ISO Transport", Ninth Data Comm Symposium,
      published in ACM SIGCOMM Comp Comm Review, vol. 15, no. 5,
      September 1985.

   [Partridge87]  Partridge, C., "Private Communication", February
      1987.

   [Postel81]  Postel, J., "Transmission Control Protocol - DARPA
      Internet Program Protocol Specification", RFC 793, DARPA,
      September 1981.

   [Velten84]  Velten, D., Hinden, R., and J. Sax, "Reliable Data
      Protocol", RFC 908, BBN, July 1984.

   [Jacobson88]  Jacobson, V., "Congestion Avoidance and Control", to
      be presented at SIGCOMM '88, Stanford, CA., August 1988.

   [Zhang86]  Zhang, L., "Why TCP Timers Don't Work Well", Proc.
      SIGCOMM '86, Stowe, Vt., August 1986.