Porting PicoTCP WIP
This commit is contained in:
893
kernel/picotcp/RFC/rfc1072.txt
Normal file
893
kernel/picotcp/RFC/rfc1072.txt
Normal file
@ -0,0 +1,893 @@
|
||||
Network Working Group V. Jacobson
|
||||
Request for Comments: 1072 LBL
|
||||
R. Braden
|
||||
ISI
|
||||
October 1988
|
||||
|
||||
|
||||
TCP Extensions for Long-Delay Paths
|
||||
|
||||
|
||||
Status of This Memo
|
||||
|
||||
This memo proposes a set of extensions to the TCP protocol to provide
|
||||
efficient operation over a path with a high bandwidth*delay product.
|
||||
These extensions are not proposed as an Internet standard at this
|
||||
time. Instead, they are intended as a basis for further
|
||||
experimentation and research on transport protocol performance.
|
||||
Distribution of this memo is unlimited.
|
||||
|
||||
1. INTRODUCTION
|
||||
|
||||
Recent work on TCP performance has shown that TCP can work well over
|
||||
a variety of Internet paths, ranging from 800 Mbit/sec I/O channels
|
||||
to 300 bit/sec dial-up modems [Jacobson88]. However, there is still
|
||||
a fundamental TCP performance bottleneck for one transmission regime:
|
||||
paths with high bandwidth and long round-trip delays. The
|
||||
significant parameter is the product of bandwidth (bits per second)
|
||||
and round-trip delay (RTT in seconds); this product is the number of
|
||||
bits it takes to "fill the pipe", i.e., the amount of unacknowledged
|
||||
data that TCP must handle in order to keep the pipeline full. TCP
|
||||
performance problems arise when this product is large, e.g.,
|
||||
significantly exceeds 10**5 bits. We will refer to an Internet path
|
||||
operating in this region as a "long, fat pipe", and a network
|
||||
containing this path as an "LFN" (pronounced "elephan(t)").
|
||||
|
||||
High-capacity packet satellite channels (e.g., DARPA's Wideband Net)
|
||||
are LFN's. For example, a T1-speed satellite channel has a
|
||||
bandwidth*delay product of 10**6 bits or more; this corresponds to
|
||||
100 outstanding TCP segments of 1200 bytes each! Proposed future
|
||||
terrestrial fiber-optical paths will also fall into the LFN class;
|
||||
for example, a cross-country delay of 30 ms at a DS3 bandwidth
|
||||
(45Mbps) also exceeds 10**6 bits.
|
||||
|
||||
Clever algorithms alone will not give us good TCP performance over
|
||||
LFN's; it will be necessary to actually extend the protocol. This
|
||||
RFC proposes a set of TCP extensions for this purpose.
|
||||
|
||||
There are three fundamental problems with the current TCP over LFN
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 1]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
paths:
|
||||
|
||||
|
||||
(1) Window Size Limitation
|
||||
|
||||
The TCP header uses a 16 bit field to report the receive window
|
||||
size to the sender. Therefore, the largest window that can be
|
||||
used is 2**16 = 65K bytes. (In practice, some TCP
|
||||
implementations will "break" for windows exceeding 2**15,
|
||||
because of their failure to do unsigned arithmetic).
|
||||
|
||||
To circumvent this problem, we propose a new TCP option to allow
|
||||
windows larger than 2**16. This option will define an implicit
|
||||
scale factor, to be used to multiply the window size value found
|
||||
in a TCP header to obtain the true window size.
|
||||
|
||||
|
||||
(2) Cumulative Acknowledgments
|
||||
|
||||
Any packet losses in an LFN can have a catastrophic effect on
|
||||
throughput. This effect is exaggerated by the simple cumulative
|
||||
acknowledgment of TCP. Whenever a segment is lost, the
|
||||
transmitting TCP will (eventually) time out and retransmit the
|
||||
missing segment. However, the sending TCP has no information
|
||||
about segments that may have reached the receiver and been
|
||||
queued because they were not at the left window edge, so it may
|
||||
be forced to retransmit these segments unnecessarily.
|
||||
|
||||
We propose a TCP extension to implement selective
|
||||
acknowledgements. By sending selective acknowledgments, the
|
||||
receiver of data can inform the sender about all segments that
|
||||
have arrived successfully, so the sender need retransmit only
|
||||
the segments that have actually been lost.
|
||||
|
||||
Selective acknowledgments have been included in a number of
|
||||
experimental Internet protocols -- VMTP [Cheriton88], NETBLT
|
||||
[Clark87], and RDP [Velten84]. There is some empirical evidence
|
||||
in favor of selective acknowledgments -- simple experiments with
|
||||
RDP have shown that disabling the selective acknowlegment
|
||||
facility greatly increases the number of retransmitted segments
|
||||
over a lossy, high-delay Internet path [Partridge87]. A
|
||||
simulation study of a simple form of selective acknowledgments
|
||||
added to the ISO transport protocol TP4 also showed promise of
|
||||
performance improvement [NBS85].
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 2]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
(3) Round Trip Timing
|
||||
|
||||
TCP implements reliable data delivery by measuring the RTT,
|
||||
i.e., the time interval between sending a segment and receiving
|
||||
an acknowledgment for it, and retransmitting any segments that
|
||||
are not acknowledged within some small multiple of the average
|
||||
RTT. Experience has shown that accurate, current RTT estimates
|
||||
are necessary to adapt to changing traffic conditions and,
|
||||
without them, a busy network is subject to an instability known
|
||||
as "congestion collapse" [Nagle84].
|
||||
|
||||
In part because TCP segments may be repacketized upon
|
||||
retransmission, and in part because of complications due to the
|
||||
cumulative TCP acknowledgement, measuring a segments's RTT may
|
||||
involve a non-trivial amount of computation in some
|
||||
implementations. To minimize this computation, some
|
||||
implementations time only one segment per window. While this
|
||||
yields an adequate approximation to the RTT for small windows
|
||||
(e.g., a 4 to 8 segment Arpanet window), for an LFN (e.g., 100
|
||||
segment Wideband Network windows) it results in an unacceptably
|
||||
poor RTT estimate.
|
||||
|
||||
In the presence of errors, the problem becomes worse. Zhang
|
||||
[Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is
|
||||
not possible to accumulate reliable RTT estimates if
|
||||
retransmitted segments are included in the estimate. Since a
|
||||
full window of data will have been transmitted prior to a
|
||||
retransmission, all of the segments in that window will have to
|
||||
be ACKed before the next RTT sample can be taken. This means at
|
||||
least an additional window's worth of time between RTT
|
||||
measurements and, as the error rate approaches one per window of
|
||||
data (e.g., 10**-6 errors per bit for the Wideband Net), it
|
||||
becomes effectively impossible to obtain an RTT measurement.
|
||||
|
||||
We propose a TCP "echo" option that allows each segment to carry
|
||||
its own timestamp. This will allow every segment, including
|
||||
retransmissions, to be timed at negligible computational cost.
|
||||
|
||||
|
||||
In designing new TCP options, we must pay careful attention to
|
||||
interoperability with existing implementations. The only TCP option
|
||||
defined to date is an "initial option", i.e., it may appear only on a
|
||||
SYN segment. It is likely that most implementations will properly
|
||||
ignore any options in the SYN segment that they do not understand, so
|
||||
new initial options should not cause a problem. On the other hand,
|
||||
we fear that receiving unexpected non-initial options may cause some
|
||||
TCP's to crash.
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 3]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
Therefore, in each of the extensions we propose, non-initial options
|
||||
may be sent only if an exchange of initial options has indicated that
|
||||
both sides understand the extension. This approach will also allow a
|
||||
TCP to determine when the connection opens how big a TCP header it
|
||||
will be sending.
|
||||
|
||||
2. TCP WINDOW SCALE OPTION
|
||||
|
||||
The obvious way to implement a window scale factor would be to define
|
||||
a new TCP option that could be included in any segment specifying a
|
||||
window. The receiver would include it in every acknowledgment
|
||||
segment, and the sender would interpret it. Unfortunately, this
|
||||
simple approach would not work. The sender must reliably know the
|
||||
receiver's current scale factor, but a TCP option in an
|
||||
acknowledgement segment will not be delivered reliably (unless the
|
||||
ACK happens to be piggy-backed on data).
|
||||
|
||||
However, SYN segments are always sent reliably, suggesting that each
|
||||
side may communicate its window scale factor in an initial TCP
|
||||
option. This approach has a disadvantage: the scale must be
|
||||
established when the connection is opened, and cannot be changed
|
||||
thereafter. However, other alternatives would be much more
|
||||
complicated, and we therefore propose a new initial option called
|
||||
Window Scale.
|
||||
|
||||
2.1 Window Scale Option
|
||||
|
||||
This three-byte option may be sent in a SYN segment by a TCP (1)
|
||||
to indicate that it is prepared to do both send and receive window
|
||||
scaling, and (2) to communicate a scale factor to be applied to
|
||||
its receive window. The scale factor is encoded logarithmically,
|
||||
as a power of 2 (presumably to be implemented by binary shifts).
|
||||
|
||||
Note: the window in the SYN segment itself is never scaled.
|
||||
|
||||
TCP Window Scale Option:
|
||||
|
||||
Kind: 3
|
||||
|
||||
+---------+---------+---------+
|
||||
| Kind=3 |Length=3 |shift.cnt|
|
||||
+---------+---------+---------+
|
||||
|
||||
Here shift.cnt is the number of bits by which the receiver right-
|
||||
shifts the true receive-window value, to scale it into a 16-bit
|
||||
value to be sent in TCP header (this scaling is explained below).
|
||||
The value shift.cnt may be zero (offering to scale, while applying
|
||||
a scale factor of 1 to the receive window).
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 4]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
This option is an offer, not a promise; both sides must send
|
||||
Window Scale options in their SYN segments to enable window
|
||||
scaling in either direction.
|
||||
|
||||
2.2 Using the Window Scale Option
|
||||
|
||||
A model implementation of window scaling is as follows, using the
|
||||
notation of RFC-793 [Postel81]:
|
||||
|
||||
* The send-window (SND.WND) and receive-window (RCV.WND) sizes
|
||||
in the connection state block and in all sequence space
|
||||
calculations are expanded from 16 to 32 bits.
|
||||
|
||||
* Two window shift counts are added to the connection state:
|
||||
snd.scale and rcv.scale. These are shift counts to be
|
||||
applied to the incoming and outgoing windows, respectively.
|
||||
The precise algorithm is shown below.
|
||||
|
||||
* All outgoing SYN segments are sent with the Window Scale
|
||||
option, containing a value shift.cnt = R that the TCP would
|
||||
like to use for its receive window.
|
||||
|
||||
* Snd.scale and rcv.scale are initialized to zero, and are
|
||||
changed only during processing of a received SYN segment. If
|
||||
the SYN segment contains a Window Scale option with shift.cnt
|
||||
= S, set snd.scale to S and set rcv.scale to R; otherwise,
|
||||
both snd.scale and rcv.scale are left at zero.
|
||||
|
||||
* The window field (SEG.WND) in the header of every incoming
|
||||
segment, with the exception of SYN segments, will be left-
|
||||
shifted by snd.scale bits before updating SND.WND:
|
||||
|
||||
SND.WND = SEG.WND << snd.scale
|
||||
|
||||
(assuming the other conditions of RFC793 are met, and using
|
||||
the "C" notation "<<" for left-shift).
|
||||
|
||||
* The window field (SEG.WND) of every outgoing segment, with
|
||||
the exception of SYN segments, will have been right-shifted
|
||||
by rcv.scale bits:
|
||||
|
||||
SEG.WND = RCV.WND >> rcv.scale.
|
||||
|
||||
|
||||
TCP determines if a data segment is "old" or "new" by testing if
|
||||
its sequence number is within 2**31 bytes of the left edge of the
|
||||
window. If not, the data is "old" and discarded. To insure that
|
||||
new data is never mistakenly considered old and vice-versa, the
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 5]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
left edge of the sender's window has to be at least 2**31 away
|
||||
from the right edge of the receiver's window. Similarly with the
|
||||
sender's right edge and receiver's left edge. Since the right and
|
||||
left edges of either the sender's or receiver's window differ by
|
||||
the window size, and since the sender and receiver windows can be
|
||||
out of phase by at most the window size, the above constraints
|
||||
imply that 2 * the max window size must be less than 2**31, or
|
||||
|
||||
max window < 2**30
|
||||
|
||||
Since the max window is 2**S (where S is the scaling shift count)
|
||||
times at most 2**16 - 1 (the maximum unscaled window), the maximum
|
||||
window is guaranteed to be < 2*30 if S <= 14. Thus, the shift
|
||||
count must be limited to 14. (This allows windows of 2**30 = 1
|
||||
Gbyte.) If a Window Scale option is received with a shift.cnt
|
||||
value exceeding 14, the TCP should log the error but use 14
|
||||
instead of the specified value.
|
||||
|
||||
|
||||
3. TCP SELECTIVE ACKNOWLEDGMENT OPTIONS
|
||||
|
||||
To minimize the impact on the TCP protocol, the selective
|
||||
acknowledgment extension uses the form of two new TCP options. The
|
||||
first is an enabling option, "SACK-permitted", that may be sent in a
|
||||
SYN segment to indicate that the the SACK option may be used once the
|
||||
connection is established. The other is the SACK option itself,
|
||||
which may be sent over an established connection once permission has
|
||||
been given by SACK-permitted.
|
||||
|
||||
The SACK option is to be included in a segment sent from a TCP that
|
||||
is receiving data to the TCP that is sending that data; we will refer
|
||||
to these TCP's as the data receiver and the data sender,
|
||||
respectively. We will consider a particular simplex data flow; any
|
||||
data flowing in the reverse direction over the same connection can be
|
||||
treated independently.
|
||||
|
||||
3.1 SACK-Permitted Option
|
||||
|
||||
This two-byte option may be sent in a SYN by a TCP that has been
|
||||
extended to receive (and presumably process) the SACK option once
|
||||
the connection has opened.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 6]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
TCP Sack-Permitted Option:
|
||||
|
||||
Kind: 4
|
||||
|
||||
+---------+---------+
|
||||
| Kind=4 | Length=2|
|
||||
+---------+---------+
|
||||
|
||||
3.2 SACK Option
|
||||
|
||||
The SACK option is to be used to convey extended acknowledgment
|
||||
information over an established connection. Specifically, it is
|
||||
to be sent by a data receiver to inform the data transmitter of
|
||||
non-contiguous blocks of data that have been received and queued.
|
||||
The data receiver is awaiting the receipt of data in later
|
||||
retransmissions to fill the gaps in sequence space between these
|
||||
blocks. At that time, the data receiver will acknowledge the data
|
||||
normally by advancing the left window edge in the Acknowledgment
|
||||
Number field of the TCP header.
|
||||
|
||||
It is important to understand that the SACK option will not change
|
||||
the meaning of the Acknowledgment Number field, whose value will
|
||||
still specify the left window edge, i.e., one byte beyond the last
|
||||
sequence number of fully-received data. The SACK option is
|
||||
advisory; if it is ignored, TCP acknowledgments will continue to
|
||||
function as specified in the protocol.
|
||||
|
||||
However, SACK will provide additional information that the data
|
||||
transmitter can use to optimize retransmissions. The TCP data
|
||||
receiver may include the SACK option in an acknowledgment segment
|
||||
whenever it has data that is queued and unacknowledged. Of
|
||||
course, the SACK option may be sent only when the TCP has received
|
||||
the SACK-permitted option in the SYN segment for that connection.
|
||||
|
||||
TCP SACK Option:
|
||||
|
||||
Kind: 5
|
||||
|
||||
Length: Variable
|
||||
|
||||
|
||||
+--------+--------+--------+--------+--------+--------+...---+
|
||||
| Kind=5 | Length | Relative Origin | Block Size | |
|
||||
+--------+--------+--------+--------+--------+--------+...---+
|
||||
|
||||
|
||||
This option contains a list of the blocks of contiguous sequence
|
||||
space occupied by data that has been received and queued within
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 7]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
the window. Each block is contiguous and isolated; that is, the
|
||||
octets just below the block,
|
||||
|
||||
Acknowledgment Number + Relative Origin -1,
|
||||
|
||||
and just above the block,
|
||||
|
||||
Acknowledgment Number + Relative Origin + Block Size,
|
||||
|
||||
have not been received.
|
||||
|
||||
Each contiguous block of data queued at the receiver is defined in
|
||||
the SACK option by two 16-bit integers:
|
||||
|
||||
|
||||
* Relative Origin
|
||||
|
||||
This is the first sequence number of this block, relative to
|
||||
the Acknowledgment Number field in the TCP header (i.e.,
|
||||
relative to the data receiver's left window edge).
|
||||
|
||||
|
||||
* Block Size
|
||||
|
||||
This is the size in octets of this block of contiguous data.
|
||||
|
||||
|
||||
A SACK option that specifies n blocks will have a length of 4*n+2
|
||||
octets, so the 44 bytes available for TCP options can specify a
|
||||
maximum of 10 blocks. Of course, if other TCP options are
|
||||
introduced, they will compete for the 44 bytes, and the limit of
|
||||
10 may be reduced in particular segments.
|
||||
|
||||
There is no requirement on the order in which blocks can appear in
|
||||
a single SACK option.
|
||||
|
||||
Note: requiring that the blocks be ordered would allow a
|
||||
slightly more efficient algorithm in the transmitter; however,
|
||||
this does not seem to be an important optimization.
|
||||
|
||||
3.3 SACK with Window Scaling
|
||||
|
||||
If window scaling is in effect, then 16 bits may not be sufficient
|
||||
for the SACK option fields that define the origin and length of a
|
||||
block. There are two possible ways to handle this:
|
||||
|
||||
(1) Expand the SACK origin and length fields to 24 or 32 bits.
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 8]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
(2) Scale the SACK fields by the same factor as the window.
|
||||
|
||||
|
||||
The first alternative would significantly reduce the number of
|
||||
blocks possible in a SACK option; therefore, we have chosen the
|
||||
second alternative, scaling the SACK information as well as the
|
||||
window.
|
||||
|
||||
Scaling the SACK information introduces some loss of precision,
|
||||
since a SACK option must report queued data blocks whose origins
|
||||
and lengths are multiples of the window scale factor rcv.scale.
|
||||
These reported blocks must be equal to or smaller than the actual
|
||||
blocks of queued data.
|
||||
|
||||
Specifically, suppose that the receiver has a contiguous block of
|
||||
queued data that occupies sequence numbers L, L+1, ... L+N-1, and
|
||||
that the window scale factor is S = rcv.scale. Then the
|
||||
corresponding block that will be reported in a SACK option will
|
||||
be:
|
||||
|
||||
Relative Origin = int((L+S-1)/S)
|
||||
|
||||
Block Size = int((L+N)/S) - (Relative Origin)
|
||||
|
||||
where the function int(x) returns the greatest integer contained
|
||||
in x.
|
||||
|
||||
The resulting loss of precision is not a serious problem for the
|
||||
sender. If the data-sending TCP keeps track of the boundaries of
|
||||
all segments in its retransmission queue, it will generally be
|
||||
able to infer from the imprecise SACK data which full segments
|
||||
don't need to be retransmitted. This will fail only if S is
|
||||
larger than the maximum segment size, in which case some segments
|
||||
may be retransmitted unnecessarily. If the sending TCP does not
|
||||
keep track of transmitted segment boundaries, the imprecision of
|
||||
the scaled SACK quantities will only result in retransmitting a
|
||||
small amount of unneeded sequence space. On the average, the data
|
||||
sender will unnecessarily retransmit J*S bytes of the sequence
|
||||
space for each SACK received; here J is the number of blocks
|
||||
reported in the SACK, and S = snd.scale.
|
||||
|
||||
3.4 SACK Option Examples
|
||||
|
||||
Assume the left window edge is 5000 and that the data transmitter
|
||||
sends a burst of 8 segments, each containing 500 data bytes.
|
||||
Unless specified otherwise, we assume that the scale factor S = 1.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 9]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
Case 1: The first 4 segments are received but the last 4 are
|
||||
dropped.
|
||||
|
||||
The data receiver will return a normal TCP ACK segment
|
||||
acknowledging sequence number 7000, with no SACK option.
|
||||
|
||||
|
||||
Case 2: The first segment is dropped but the remaining 7 are
|
||||
received.
|
||||
|
||||
The data receiver will return a TCP ACK segment that
|
||||
acknowledges sequence number 5000 and contains a SACK option
|
||||
specifying one block of queued data:
|
||||
|
||||
Relative Origin = 500; Block Size = 3500
|
||||
|
||||
|
||||
Case 3: The 2nd, 4th, 6th, and 8th (last) segments are
|
||||
dropped.
|
||||
|
||||
The data receiver will return a TCP ACK segment that
|
||||
acknowledges sequence number 5500 and contains a SACK option
|
||||
specifying the 3 blocks:
|
||||
|
||||
Relative Origin = 500; Block Size = 500
|
||||
Relative Origin = 1500; Block Size = 500
|
||||
Relative Origin = 2500; Block Size = 500
|
||||
|
||||
|
||||
Case 4: Same as Case 3, except Scale Factor S = 16.
|
||||
|
||||
The SACK option would specify the 3 scaled blocks:
|
||||
|
||||
Relative Origin = 32; Block Size = 30
|
||||
Relative Origin = 94; Block Size = 31
|
||||
Relative Origin = 157; Block Size = 30
|
||||
|
||||
These three reported blocks have sequence numbers 512 through
|
||||
991, 1504 through 1999, and 2512 through 2992, respectively.
|
||||
|
||||
|
||||
3.5 Generating the SACK Option
|
||||
|
||||
Let us assume that the data receiver maintains a queue of valid
|
||||
segments that it has neither passed to the user nor acknowledged
|
||||
because of earlier missing data, and that this queue is ordered by
|
||||
starting sequence number. Computation of the SACK option can be
|
||||
done with one pass down this queue. Segments that occupy
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 10]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
contiguous sequence space are aggregated into a single SACK block,
|
||||
and each gap in the sequence space (except a gap that is
|
||||
terminated by the right window edge) triggers the start of a new
|
||||
SACK block. If this algorithm defines more than 10 blocks, only
|
||||
the first 10 can be included in the option.
|
||||
|
||||
3.6 Interpreting the SACK Option
|
||||
|
||||
The data transmitter is assumed to have a retransmission queue
|
||||
that contains the segments that have been transmitted but not yet
|
||||
acknowledged, in sequence-number order. If the data transmitter
|
||||
performs re-packetization before retransmission, the block
|
||||
boundaries in a SACK option that it receives may not fall on
|
||||
boundaries of segments in the retransmission queue; however, this
|
||||
does not pose a serious difficulty for the transmitter.
|
||||
|
||||
Let us suppose that for each segment in the retransmission queue
|
||||
there is a (new) flag bit "ACK'd", to be used to indicate that
|
||||
this particular segment has been entirely acknowledged. When a
|
||||
segment is first transmitted, it will be entered into the
|
||||
retransmission queue with its ACK'd bit off. If the ACK'd bit is
|
||||
subsequently turned on (as the result of processing a received
|
||||
SACK option), the data transmitter will skip this segment during
|
||||
any later retransmission. However, the segment will not be
|
||||
dequeued and its buffer freed until the left window edge is
|
||||
advanced over it.
|
||||
|
||||
When an acknowledgment segment arrives containing a SACK option,
|
||||
the data transmitter will turn on the ACK'd bits for segments that
|
||||
have been selectively acknowleged. More specifically, for each
|
||||
block in the SACK option, the data transmitter will turn on the
|
||||
ACK'd flags for all segments in the retransmission queue that are
|
||||
wholly contained within that block. This requires straightforward
|
||||
sequence number comparisons.
|
||||
|
||||
|
||||
4. TCP ECHO OPTIONS
|
||||
|
||||
A simple method for measuring the RTT of a segment would be: the
|
||||
sender places a timestamp in the segment and the receiver returns
|
||||
that timestamp in the corresponding ACK segment. When the ACK segment
|
||||
arrives at the sender, the difference between the current time and
|
||||
the timestamp is the RTT. To implement this timing method, the
|
||||
receiver must simply reflect or echo selected data (the timestamp)
|
||||
from the sender's segments. This idea is the basis of the "TCP Echo"
|
||||
and "TCP Echo Reply" options.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 11]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
4.1 TCP Echo and TCP Echo Reply Options
|
||||
|
||||
TCP Echo Option:
|
||||
|
||||
Kind: 6
|
||||
|
||||
Length: 6
|
||||
|
||||
+--------+--------+--------+--------+--------+--------+
|
||||
| Kind=6 | Length | 4 bytes of info to be echoed |
|
||||
+--------+--------+--------+--------+--------+--------+
|
||||
|
||||
This option carries four bytes of information that the receiving TCP
|
||||
may send back in a subsequent TCP Echo Reply option (see below). A
|
||||
TCP may send the TCP Echo option in any segment, but only if a TCP
|
||||
Echo option was received in a SYN segment for the connection.
|
||||
|
||||
When the TCP echo option is used for RTT measurement, it will be
|
||||
included in data segments, and the four information bytes will define
|
||||
the time at which the data segment was transmitted in any format
|
||||
convenient to the sender.
|
||||
|
||||
TCP Echo Reply Option:
|
||||
|
||||
Kind: 7
|
||||
|
||||
Length: 6
|
||||
|
||||
+--------+--------+--------+--------+--------+--------+
|
||||
| Kind=7 | Length | 4 bytes of echoed info |
|
||||
+--------+--------+--------+--------+--------+--------+
|
||||
|
||||
|
||||
A TCP that receives a TCP Echo option containing four information
|
||||
bytes will return these same bytes in a TCP Echo Reply option.
|
||||
|
||||
This TCP Echo Reply option must be returned in the next segment
|
||||
(e.g., an ACK segment) that is sent. If more than one Echo option is
|
||||
received before a reply segment is sent, the TCP must choose only one
|
||||
of the options to echo, ignoring the others; specifically, it must
|
||||
choose the newest segment with the oldest sequence number (see next
|
||||
section.)
|
||||
|
||||
To use the TCP Echo and Echo Reply options, a TCP must send a TCP
|
||||
Echo option in its own SYN segment and receive a TCP Echo option in a
|
||||
SYN segment from the other TCP. A TCP that does not implement the
|
||||
TCP Echo or Echo Reply options must simply ignore any TCP Echo
|
||||
options it receives. However, a TCP should not receive one of these
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 12]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
options in a non-SYN segment unless it included a TCP Echo option in
|
||||
its own SYN segment.
|
||||
|
||||
4.2 Using the Echo Options
|
||||
|
||||
If we wish to use the Echo/Echo Reply options for RTT measurement, we
|
||||
have to define what the receiver does when there is not a one-to-one
|
||||
correspondence between data and ACK segments. Assuming that we want
|
||||
to minimize the state kept in the receiver (i.e., the number of
|
||||
unprocessed Echo options), we can plan on a receiver remembering the
|
||||
information value from at most one Echo between ACKs. There are
|
||||
three situations to consider:
|
||||
|
||||
(A) Delayed ACKs.
|
||||
|
||||
Many TCP's acknowledge only every Kth segment out of a group of
|
||||
segments arriving within a short time interval; this policy is
|
||||
known generally as "delayed ACK's". The data-sender TCP must
|
||||
measure the effective RTT, including the additional time due to
|
||||
delayed ACK's, or else it will retransmit unnecessarily. Thus,
|
||||
when delayed ACK's are in use, the receiver should reply with
|
||||
the Echo option information from the earliest unacknowledged
|
||||
segment.
|
||||
|
||||
(B) A hole in the sequence space (segment(s) have been lost).
|
||||
|
||||
The sender will continue sending until the window is filled, and
|
||||
we may be generating ACKs as these out-of-order segments arrive
|
||||
(e.g., for the SACK information or to aid "fast retransmit").
|
||||
An Echo Reply option will tell the sender the RTT of some
|
||||
recently sent segment (since the ACK can only contain the
|
||||
sequence number of the hole, the sender may not be able to
|
||||
determine which segment, but that doesn't matter). If the loss
|
||||
was due to congestion, these RTTs may be particularly valuable
|
||||
to the sender since they reflect the network characteristics
|
||||
immediately after the congestion.
|
||||
|
||||
(C) A filled hole in the sequence space.
|
||||
|
||||
The segment that fills the hole represents the most recent
|
||||
measurement of the network characteristics. On the other hand,
|
||||
an RTT computed from an earlier segment would probably include
|
||||
the sender's retransmit time-out, badly biasing the sender's
|
||||
average RTT estimate.
|
||||
|
||||
|
||||
Case (A) suggests the receiver should remember and return the Echo
|
||||
option information from the oldest unacknowledged segment. Cases (B)
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 13]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
and (C) suggest that the option should come from the most recent
|
||||
unacknowledged segment. An algorithm that covers all three cases is
|
||||
for the receiver to return the Echo option information from the
|
||||
newest segment with the oldest sequence number, as specified earlier.
|
||||
|
||||
A model implementation of these options is as follows.
|
||||
|
||||
|
||||
(1) Receiver Implementation
|
||||
|
||||
A 32-bit slot for Echo option data, rcv.echodata, is added to
|
||||
the receiver connection state, together with a flag,
|
||||
rcv.echopresent, that indicates whether there is anything in the
|
||||
slot. When the receiver generates a segment, it checks
|
||||
rcv.echopresent and, if it is set, adds an echo-reply option
|
||||
containing rcv.echodata to the outgoing segment then clears
|
||||
rcv.echopresent.
|
||||
|
||||
If an incoming segment is in the window and contains an echo
|
||||
option, the receiver checks rcv.echopresent. If it isn't set,
|
||||
the value of the echo option is copied to rcv.echodata and
|
||||
rcv.echopresent is set. If rcv.echopresent is already set, the
|
||||
receiver checks whether the segment is at the left edge of the
|
||||
window. If so, the segment's echo option value is copied to
|
||||
rcv.echodata (this is situation (C) above). Otherwise, the
|
||||
segment's echo option is ignored.
|
||||
|
||||
|
||||
(2) Sender Implementation
|
||||
|
||||
The sender's connection state has a single flag bit,
|
||||
snd.echoallowed, added. If snd.echoallowed is set or if the
|
||||
segment contains a SYN, the sender is free to add a TCP Echo
|
||||
option (presumably containing the current time in some units
|
||||
convenient to the sender) to every outgoing segment.
|
||||
|
||||
Snd.echoallowed should be set if a SYN is received with a TCP
|
||||
Echo option (presumably, a host that implements the option will
|
||||
attempt to use it to time the SYN segment).
|
||||
|
||||
|
||||
5. CONCLUSIONS AND ACKNOWLEDGMENTS
|
||||
|
||||
We have proposed five new TCP options for scaled windows, selective
|
||||
acknowledgments, and round-trip timing, in order to provide efficient
|
||||
operation over large-bandwidth*delay-product paths. These extensions
|
||||
are designed to provide compatible interworking with TCP's that do not
|
||||
implement the extensions.
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 14]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
The Window Scale option was originally suggested by Mike St. Johns of
|
||||
USAF/DCA. The present form of the option was suggested by Mike Karels
|
||||
of UC Berkeley in response to a more cumbersome scheme proposed by Van
|
||||
Jacobson. Gerd Beling of FGAN (West Germany) contributed the initial
|
||||
definition of the SACK option.
|
||||
|
||||
All three options have evolved through discussion with the End-to-End
|
||||
Task Force, and the authors are grateful to the other members of the
|
||||
Task Force for their advice and encouragement.
|
||||
|
||||
6. REFERENCES
|
||||
|
||||
[Cheriton88] Cheriton, D., "VMTP: Versatile Message Transaction
|
||||
Protocol", RFC 1045, Stanford University, February 1988.
|
||||
|
||||
[Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet
|
||||
Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm.,
|
||||
Scottsdale, Arizona, March 1986.
|
||||
|
||||
[Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times
|
||||
in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT,
|
||||
August 1987.
|
||||
|
||||
[Clark87] Clark, D., Lambert, M., and L. Zhang, "NETBLT: A Bulk
|
||||
Data Transfer Protocol", RFC 998, MIT, March 1987.
|
||||
|
||||
[Nagle84] Nagle, J., "Congestion Control in IP/TCP
|
||||
Internetworks", RFC 896, FACC, January 1984.
|
||||
|
||||
[NBS85] Colella, R., Aronoff, R., and K. Mills, "Performance
|
||||
Improvements for ISO Transport", Ninth Data Comm Symposium,
|
||||
published in ACM SIGCOMM Comp Comm Review, vol. 15, no. 5,
|
||||
September 1985.
|
||||
|
||||
[Partridge87] Partridge, C., "Private Communication", February
|
||||
1987.
|
||||
|
||||
[Postel81] Postel, J., "Transmission Control Protocol - DARPA
|
||||
Internet Program Protocol Specification", RFC 793, DARPA,
|
||||
September 1981.
|
||||
|
||||
[Velten84] Velten, D., Hinden, R., and J. Sax, "Reliable Data
|
||||
Protocol", RFC 908, BBN, July 1984.
|
||||
|
||||
[Jacobson88] Jacobson, V., "Congestion Avoidance and Control", to
|
||||
be presented at SIGCOMM '88, Stanford, CA., August 1988.
|
||||
|
||||
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc.
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 15]
|
||||
|
||||
RFC 1072 TCP Extensions for Long-Delay Paths October 1988
|
||||
|
||||
|
||||
SIGCOMM '86, Stowe, Vt., August 1986.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Jacobson & Braden [Page 16]
|
||||
|
||||
Reference in New Issue
Block a user