2075 lines
83 KiB
Plaintext
2075 lines
83 KiB
Plaintext
|
||
|
||
|
||
|
||
|
||
|
||
Network Working Group V. Jacobson
|
||
Request for Comments: 1323 LBL
|
||
Obsoletes: RFC 1072, RFC 1185 R. Braden
|
||
ISI
|
||
D. Borman
|
||
Cray Research
|
||
May 1992
|
||
|
||
|
||
TCP Extensions for High Performance
|
||
|
||
Status of This Memo
|
||
|
||
This RFC specifies an IAB standards track protocol for the Internet
|
||
community, and requests discussion and suggestions for improvements.
|
||
Please refer to the current edition of the "IAB Official Protocol
|
||
Standards" for the standardization state and status of this protocol.
|
||
Distribution of this memo is unlimited.
|
||
|
||
Abstract
|
||
|
||
This memo presents a set of TCP extensions to improve performance
|
||
over large bandwidth*delay product paths and to provide reliable
|
||
operation over very high-speed paths. It defines new TCP options for
|
||
scaled windows and timestamps, which are designed to provide
|
||
compatible interworking with TCP's that do not implement the
|
||
extensions. The timestamps are used for two distinct mechanisms:
|
||
RTTM (Round Trip Time Measurement) and PAWS (Protect Against Wrapped
|
||
Sequences). Selective acknowledgments are not included in this memo.
|
||
|
||
This memo combines and supersedes RFC-1072 and RFC-1185, adding
|
||
additional clarification and more detailed specification. Appendix C
|
||
summarizes the changes from the earlier RFCs.
|
||
|
||
TABLE OF CONTENTS
|
||
|
||
1. Introduction ................................................. 2
|
||
2. TCP Window Scale Option ...................................... 8
|
||
3. RTTM -- Round-Trip Time Measurement .......................... 11
|
||
4. PAWS -- Protect Against Wrapped Sequence Numbers ............. 17
|
||
5. Conclusions and Acknowledgments .............................. 25
|
||
6. References ................................................... 25
|
||
APPENDIX A: Implementation Suggestions ........................... 27
|
||
APPENDIX B: Duplicates from Earlier Connection Incarnations ...... 27
|
||
APPENDIX C: Changes from RFC-1072, RFC-1185 ...................... 30
|
||
APPENDIX D: Summary of Notation .................................. 31
|
||
APPENDIX E: Event Processing ..................................... 32
|
||
Security Considerations .......................................... 37
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 1]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
Authors' Addresses ............................................... 37
|
||
|
||
1. INTRODUCTION
|
||
|
||
The TCP protocol [Postel81] was designed to operate reliably over
|
||
almost any transmission medium regardless of transmission rate,
|
||
delay, corruption, duplication, or reordering of segments.
|
||
Production TCP implementations currently adapt to transfer rates in
|
||
the range of 100 bps to 10**7 bps and round-trip delays in the range
|
||
1 ms to 100 seconds. Recent work on TCP performance has shown that
|
||
TCP can work well over a variety of Internet paths, ranging from 800
|
||
Mbit/sec I/O channels to 300 bit/sec dial-up modems [Jacobson88a].
|
||
|
||
The introduction of fiber optics is resulting in ever-higher
|
||
transmission speeds, and the fastest paths are moving out of the
|
||
domain for which TCP was originally engineered. This memo defines a
|
||
set of modest extensions to TCP to extend the domain of its
|
||
application to match this increasing network capability. It is based
|
||
upon and obsoletes RFC-1072 [Jacobson88b] and RFC-1185 [Jacobson90b].
|
||
|
||
There is no one-line answer to the question: "How fast can TCP go?".
|
||
There are two separate kinds of issues, performance and reliability,
|
||
and each depends upon different parameters. We discuss each in turn.
|
||
|
||
1.1 TCP Performance
|
||
|
||
TCP performance depends not upon the transfer rate itself, but
|
||
rather upon the product of the transfer rate and the round-trip
|
||
delay. This "bandwidth*delay product" measures the amount of data
|
||
that would "fill the pipe"; it is the buffer space required at
|
||
sender and receiver to obtain maximum throughput on the TCP
|
||
connection over the path, i.e., the amount of unacknowledged data
|
||
that TCP must handle in order to keep the pipeline full. TCP
|
||
performance problems arise when the bandwidth*delay product is
|
||
large. We refer to an Internet path operating in this region as a
|
||
"long, fat pipe", and a network containing this path as an "LFN"
|
||
(pronounced "elephan(t)").
|
||
|
||
High-capacity packet satellite channels (e.g., DARPA's Wideband
|
||
Net) are LFN's. For example, a DS1-speed satellite channel has a
|
||
bandwidth*delay product of 10**6 bits or more; this corresponds to
|
||
100 outstanding TCP segments of 1200 bytes each. Terrestrial
|
||
fiber-optical paths will also fall into the LFN class; for
|
||
example, a cross-country delay of 30 ms at a DS3 bandwidth
|
||
(45Mbps) also exceeds 10**6 bits.
|
||
|
||
There are three fundamental performance problems with the current
|
||
TCP over LFN paths:
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 2]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
(1) Window Size Limit
|
||
|
||
The TCP header uses a 16 bit field to report the receive
|
||
window size to the sender. Therefore, the largest window
|
||
that can be used is 2**16 = 65K bytes.
|
||
|
||
To circumvent this problem, Section 2 of this memo defines a
|
||
new TCP option, "Window Scale", to allow windows larger than
|
||
2**16. This option defines an implicit scale factor, which
|
||
is used to multiply the window size value found in a TCP
|
||
header to obtain the true window size.
|
||
|
||
(2) Recovery from Losses
|
||
|
||
Packet losses in an LFN can have a catastrophic effect on
|
||
throughput. Until recently, properly-operating TCP
|
||
implementations would cause the data pipeline to drain with
|
||
every packet loss, and require a slow-start action to
|
||
recover. Recently, the Fast Retransmit and Fast Recovery
|
||
algorithms [Jacobson90c] have been introduced. Their
|
||
combined effect is to recover from one packet loss per
|
||
window, without draining the pipeline. However, more than
|
||
one packet loss per window typically results in a
|
||
retransmission timeout and the resulting pipeline drain and
|
||
slow start.
|
||
|
||
Expanding the window size to match the capacity of an LFN
|
||
results in a corresponding increase of the probability of
|
||
more than one packet per window being dropped. This could
|
||
have a devastating effect upon the throughput of TCP over an
|
||
LFN. In addition, if a congestion control mechanism based
|
||
upon some form of random dropping were introduced into
|
||
gateways, randomly spaced packet drops would become common,
|
||
possible increasing the probability of dropping more than one
|
||
packet per window.
|
||
|
||
To generalize the Fast Retransmit/Fast Recovery mechanism to
|
||
handle multiple packets dropped per window, selective
|
||
acknowledgments are required. Unlike the normal cumulative
|
||
acknowledgments of TCP, selective acknowledgments give the
|
||
sender a complete picture of which segments are queued at the
|
||
receiver and which have not yet arrived. Some evidence in
|
||
favor of selective acknowledgments has been published
|
||
[NBS85], and selective acknowledgments have been included in
|
||
a number of experimental Internet protocols -- VMTP
|
||
[Cheriton88], NETBLT [Clark87], and RDP [Velten84], and
|
||
proposed for OSI TP4 [NBS85]. However, in the non-LFN
|
||
regime, selective acknowledgments reduce the number of
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 3]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
packets retransmitted but do not otherwise improve
|
||
performance, making their complexity of questionable value.
|
||
However, selective acknowledgments are expected to become
|
||
much more important in the LFN regime.
|
||
|
||
RFC-1072 defined a new TCP "SACK" option to send a selective
|
||
acknowledgment. However, there are important technical
|
||
issues to be worked out concerning both the format and
|
||
semantics of the SACK option. Therefore, SACK has been
|
||
omitted from this package of extensions. It is hoped that
|
||
SACK can "catch up" during the standardization process.
|
||
|
||
(3) Round-Trip Measurement
|
||
|
||
TCP implements reliable data delivery by retransmitting
|
||
segments that are not acknowledged within some retransmission
|
||
timeout (RTO) interval. Accurate dynamic determination of an
|
||
appropriate RTO is essential to TCP performance. RTO is
|
||
determined by estimating the mean and variance of the
|
||
measured round-trip time (RTT), i.e., the time interval
|
||
between sending a segment and receiving an acknowledgment for
|
||
it [Jacobson88a].
|
||
|
||
Section 4 introduces a new TCP option, "Timestamps", and then
|
||
defines a mechanism using this option that allows nearly
|
||
every segment, including retransmissions, to be timed at
|
||
negligible computational cost. We use the mnemonic RTTM
|
||
(Round Trip Time Measurement) for this mechanism, to
|
||
distinguish it from other uses of the Timestamps option.
|
||
|
||
|
||
1.2 TCP Reliability
|
||
|
||
Now we turn from performance to reliability. High transfer rate
|
||
enters TCP performance through the bandwidth*delay product.
|
||
However, high transfer rate alone can threaten TCP reliability by
|
||
violating the assumptions behind the TCP mechanism for duplicate
|
||
detection and sequencing.
|
||
|
||
An especially serious kind of error may result from an accidental
|
||
reuse of TCP sequence numbers in data segments. Suppose that an
|
||
"old duplicate segment", e.g., a duplicate data segment that was
|
||
delayed in Internet queues, is delivered to the receiver at the
|
||
wrong moment, so that its sequence numbers falls somewhere within
|
||
the current window. There would be no checksum failure to warn of
|
||
the error, and the result could be an undetected corruption of the
|
||
data. Reception of an old duplicate ACK segment at the
|
||
transmitter could be only slightly less serious: it is likely to
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 4]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
lock up the connection so that no further progress can be made,
|
||
forcing an RST on the connection.
|
||
|
||
TCP reliability depends upon the existence of a bound on the
|
||
lifetime of a segment: the "Maximum Segment Lifetime" or MSL. An
|
||
MSL is generally required by any reliable transport protocol,
|
||
since every sequence number field must be finite, and therefore
|
||
any sequence number may eventually be reused. In the Internet
|
||
protocol suite, the MSL bound is enforced by an IP-layer
|
||
mechanism, the "Time-to-Live" or TTL field.
|
||
|
||
Duplication of sequence numbers might happen in either of two
|
||
ways:
|
||
|
||
(1) Sequence number wrap-around on the current connection
|
||
|
||
A TCP sequence number contains 32 bits. At a high enough
|
||
transfer rate, the 32-bit sequence space may be "wrapped"
|
||
(cycled) within the time that a segment is delayed in queues.
|
||
|
||
(2) Earlier incarnation of the connection
|
||
|
||
Suppose that a connection terminates, either by a proper
|
||
close sequence or due to a host crash, and the same
|
||
connection (i.e., using the same pair of sockets) is
|
||
immediately reopened. A delayed segment from the terminated
|
||
connection could fall within the current window for the new
|
||
incarnation and be accepted as valid.
|
||
|
||
Duplicates from earlier incarnations, Case (2), are avoided by
|
||
enforcing the current fixed MSL of the TCP spec, as explained in
|
||
Section 5.3 and Appendix B. However, case (1), avoiding the
|
||
reuse of sequence numbers within the same connection, requires an
|
||
MSL bound that depends upon the transfer rate, and at high enough
|
||
rates, a new mechanism is required.
|
||
|
||
More specifically, if the maximum effective bandwidth at which TCP
|
||
is able to transmit over a particular path is B bytes per second,
|
||
then the following constraint must be satisfied for error-free
|
||
operation:
|
||
|
||
2**31 / B > MSL (secs) [1]
|
||
|
||
The following table shows the value for Twrap = 2**31/B in
|
||
seconds, for some important values of the bandwidth B:
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 5]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
Network B*8 B Twrap
|
||
bits/sec bytes/sec secs
|
||
_______ _______ ______ ______
|
||
|
||
ARPANET 56kbps 7KBps 3*10**5 (~3.6 days)
|
||
|
||
DS1 1.5Mbps 190KBps 10**4 (~3 hours)
|
||
|
||
Ethernet 10Mbps 1.25MBps 1700 (~30 mins)
|
||
|
||
DS3 45Mbps 5.6MBps 380
|
||
|
||
FDDI 100Mbps 12.5MBps 170
|
||
|
||
Gigabit 1Gbps 125MBps 17
|
||
|
||
|
||
It is clear that wrap-around of the sequence space is not a
|
||
problem for 56kbps packet switching or even 10Mbps Ethernets. On
|
||
the other hand, at DS3 and FDDI speeds, Twrap is comparable to the
|
||
2 minute MSL assumed by the TCP specification [Postel81]. Moving
|
||
towards gigabit speeds, Twrap becomes too small for reliable
|
||
enforcement by the Internet TTL mechanism.
|
||
|
||
The 16-bit window field of TCP limits the effective bandwidth B to
|
||
2**16/RTT, where RTT is the round-trip time in seconds
|
||
[McKenzie89]. If the RTT is large enough, this limits B to a
|
||
value that meets the constraint [1] for a large MSL value. For
|
||
example, consider a transcontinental backbone with an RTT of 60ms
|
||
(set by the laws of physics). With the bandwidth*delay product
|
||
limited to 64KB by the TCP window size, B is then limited to
|
||
1.1MBps, no matter how high the theoretical transfer rate of the
|
||
path. This corresponds to cycling the sequence number space in
|
||
Twrap= 2000 secs, which is safe in today's Internet.
|
||
|
||
It is important to understand that the culprit is not the larger
|
||
window but rather the high bandwidth. For example, consider a
|
||
(very large) FDDI LAN with a diameter of 10km. Using the speed of
|
||
light, we can compute the RTT across the ring as
|
||
(2*10**4)/(3*10**8) = 67 microseconds, and the delay*bandwidth
|
||
product is then 833 bytes. A TCP connection across this LAN using
|
||
a window of only 833 bytes will run at the full 100mbps and can
|
||
wrap the sequence space in about 3 minutes, very close to the MSL
|
||
of TCP. Thus, high speed alone can cause a reliability problem
|
||
with sequence number wrap-around, even without extended windows.
|
||
|
||
Watson's Delta-T protocol [Watson81] includes network-layer
|
||
mechanisms for precise enforcement of an MSL. In contrast, the IP
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 6]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
mechanism for MSL enforcement is loosely defined and even more
|
||
loosely implemented in the Internet. Therefore, it is unwise to
|
||
depend upon active enforcement of MSL for TCP connections, and it
|
||
is unrealistic to imagine setting MSL's smaller than the current
|
||
values (e.g., 120 seconds specified for TCP).
|
||
|
||
A possible fix for the problem of cycling the sequence space would
|
||
be to increase the size of the TCP sequence number field. For
|
||
example, the sequence number field (and also the acknowledgment
|
||
field) could be expanded to 64 bits. This could be done either by
|
||
changing the TCP header or by means of an additional option.
|
||
|
||
Section 5 presents a different mechanism, which we call PAWS
|
||
(Protect Against Wrapped Sequence numbers), to extend TCP
|
||
reliability to transfer rates well beyond the foreseeable upper
|
||
limit of network bandwidths. PAWS uses the TCP Timestamps option
|
||
defined in Section 4 to protect against old duplicates from the
|
||
same connection.
|
||
|
||
1.3 Using TCP options
|
||
|
||
The extensions defined in this memo all use new TCP options. We
|
||
must address two possible issues concerning the use of TCP
|
||
options: (1) compatibility and (2) overhead.
|
||
|
||
We must pay careful attention to compatibility, i.e., to
|
||
interoperation with existing implementations. The only TCP option
|
||
defined previously, MSS, may appear only on a SYN segment. Every
|
||
implementation should (and we expect that most will) ignore
|
||
unknown options on SYN segments. However, some buggy TCP
|
||
implementation might be crashed by the first appearance of an
|
||
option on a non-SYN segment. Therefore, for each of the
|
||
extensions defined below, TCP options will be sent on non-SYN
|
||
segments only when an exchange of options on the SYN segments has
|
||
indicated that both sides understand the extension. Furthermore,
|
||
an extension option will be sent in a <SYN,ACK> segment only if
|
||
the corresponding option was received in the initial <SYN>
|
||
segment.
|
||
|
||
A question may be raised about the bandwidth and processing
|
||
overhead for TCP options. Those options that occur on SYN
|
||
segments are not likely to cause a performance concern. Opening a
|
||
TCP connection requires execution of significant special-case
|
||
code, and the processing of options is unlikely to increase that
|
||
cost significantly.
|
||
|
||
On the other hand, a Timestamps option may appear in any data or
|
||
ACK segment, adding 12 bytes to the 20-byte TCP header. We
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 7]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
believe that the bandwidth saved by reducing unnecessary
|
||
retransmissions will more than pay for the extra header bandwidth.
|
||
|
||
There is also an issue about the processing overhead for parsing
|
||
the variable byte-aligned format of options, particularly with a
|
||
RISC-architecture CPU. To meet this concern, Appendix A contains
|
||
a recommended layout of the options in TCP headers to achieve
|
||
reasonable data field alignment. In the spirit of Header
|
||
Prediction, a TCP can quickly test for this layout and if it is
|
||
verified then use a fast path. Hosts that use this canonical
|
||
layout will effectively use the options as a set of fixed-format
|
||
fields appended to the TCP header. However, to retain the
|
||
philosophical and protocol framework of TCP options, a TCP must be
|
||
prepared to parse an arbitrary options field, albeit with less
|
||
efficiency.
|
||
|
||
Finally, we observe that most of the mechanisms defined in this
|
||
memo are important for LFN's and/or very high-speed networks. For
|
||
low-speed networks, it might be a performance optimization to NOT
|
||
use these mechanisms. A TCP vendor concerned about optimal
|
||
performance over low-speed paths might consider turning these
|
||
extensions off for low-speed paths, or allow a user or
|
||
installation manager to disable them.
|
||
|
||
|
||
2. TCP WINDOW SCALE OPTION
|
||
|
||
2.1 Introduction
|
||
|
||
The window scale extension expands the definition of the TCP
|
||
window to 32 bits and then uses a scale factor to carry this 32-
|
||
bit value in the 16-bit Window field of the TCP header (SEG.WND in
|
||
RFC-793). The scale factor is carried in a new TCP option, Window
|
||
Scale. This option is sent only in a SYN segment (a segment with
|
||
the SYN bit on), hence the window scale is fixed in each direction
|
||
when a connection is opened. (Another design choice would be to
|
||
specify the window scale in every TCP segment. It would be
|
||
incorrect to send a window scale option only when the scale factor
|
||
changed, since a TCP option in an acknowledgement segment will not
|
||
be delivered reliably (unless the ACK happens to be piggy-backed
|
||
on data in the other direction). Fixing the scale when the
|
||
connection is opened has the advantage of lower overhead but the
|
||
disadvantage that the scale factor cannot be changed during the
|
||
connection.)
|
||
|
||
The maximum receive window, and therefore the scale factor, is
|
||
determined by the maximum receive buffer space. In a typical
|
||
modern implementation, this maximum buffer space is set by default
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 8]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
but can be overridden by a user program before a TCP connection is
|
||
opened. This determines the scale factor, and therefore no new
|
||
user interface is needed for window scaling.
|
||
|
||
2.2 Window Scale Option
|
||
|
||
The three-byte Window Scale option may be sent in a SYN segment by
|
||
a TCP. It has two purposes: (1) indicate that the TCP is prepared
|
||
to do both send and receive window scaling, and (2) communicate a
|
||
scale factor to be applied to its receive window. Thus, a TCP
|
||
that is prepared to scale windows should send the option, even if
|
||
its own scale factor is 1. The scale factor is limited to a power
|
||
of two and encoded logarithmically, so it may be implemented by
|
||
binary shift operations.
|
||
|
||
|
||
TCP Window Scale Option (WSopt):
|
||
|
||
Kind: 3 Length: 3 bytes
|
||
|
||
+---------+---------+---------+
|
||
| Kind=3 |Length=3 |shift.cnt|
|
||
+---------+---------+---------+
|
||
|
||
|
||
This option is an offer, not a promise; both sides must send
|
||
Window Scale options in their SYN segments to enable window
|
||
scaling in either direction. If window scaling is enabled,
|
||
then the TCP that sent this option will right-shift its true
|
||
receive-window values by 'shift.cnt' bits for transmission in
|
||
SEG.WND. The value 'shift.cnt' may be zero (offering to scale,
|
||
while applying a scale factor of 1 to the receive window).
|
||
|
||
This option may be sent in an initial <SYN> segment (i.e., a
|
||
segment with the SYN bit on and the ACK bit off). It may also
|
||
be sent in a <SYN,ACK> segment, but only if a Window Scale op-
|
||
tion was received in the initial <SYN> segment. A Window Scale
|
||
option in a segment without a SYN bit should be ignored.
|
||
|
||
The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment
|
||
itself is never scaled.
|
||
|
||
2.3 Using the Window Scale Option
|
||
|
||
A model implementation of window scaling is as follows, using the
|
||
notation of RFC-793 [Postel81]:
|
||
|
||
* All windows are treated as 32-bit quantities for storage in
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 9]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
the connection control block and for local calculations.
|
||
This includes the send-window (SND.WND) and the receive-
|
||
window (RCV.WND) values, as well as the congestion window.
|
||
|
||
* The connection state is augmented by two window shift counts,
|
||
Snd.Wind.Scale and Rcv.Wind.Scale, to be applied to the
|
||
incoming and outgoing window fields, respectively.
|
||
|
||
* If a TCP receives a <SYN> segment containing a Window Scale
|
||
option, it sends its own Window Scale option in the <SYN,ACK>
|
||
segment.
|
||
|
||
* The Window Scale option is sent with shift.cnt = R, where R
|
||
is the value that the TCP would like to use for its receive
|
||
window.
|
||
|
||
* Upon receiving a SYN segment with a Window Scale option
|
||
containing shift.cnt = S, a TCP sets Snd.Wind.Scale to S and
|
||
sets Rcv.Wind.Scale to R; otherwise, it sets both
|
||
Snd.Wind.Scale and Rcv.Wind.Scale to zero.
|
||
|
||
* The window field (SEG.WND) in the header of every incoming
|
||
segment, with the exception of SYN segments, is left-shifted
|
||
by Snd.Wind.Scale bits before updating SND.WND:
|
||
|
||
SND.WND = SEG.WND << Snd.Wind.Scale
|
||
|
||
(assuming the other conditions of RFC793 are met, and using
|
||
the "C" notation "<<" for left-shift).
|
||
|
||
* The window field (SEG.WND) of every outgoing segment, with
|
||
the exception of SYN segments, is right-shifted by
|
||
Rcv.Wind.Scale bits:
|
||
|
||
SEG.WND = RCV.WND >> Rcv.Wind.Scale.
|
||
|
||
|
||
TCP determines if a data segment is "old" or "new" by testing
|
||
whether its sequence number is within 2**31 bytes of the left edge
|
||
of the window, and if it is not, discarding the data as "old". To
|
||
insure that new data is never mistakenly considered old and vice-
|
||
versa, the left edge of the sender's window has to be at most
|
||
2**31 away from the right edge of the receiver's window.
|
||
Similarly with the sender's right edge and receiver's left edge.
|
||
Since the right and left edges of either the sender's or
|
||
receiver's window differ by the window size, and since the sender
|
||
and receiver windows can be out of phase by at most the window
|
||
size, the above constraints imply that 2 * the max window size
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 10]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
must be less than 2**31, or
|
||
|
||
max window < 2**30
|
||
|
||
Since the max window is 2**S (where S is the scaling shift count)
|
||
times at most 2**16 - 1 (the maximum unscaled window), the maximum
|
||
window is guaranteed to be < 2*30 if S <= 14. Thus, the shift
|
||
count must be limited to 14 (which allows windows of 2**30 = 1
|
||
Gbyte). If a Window Scale option is received with a shift.cnt
|
||
value exceeding 14, the TCP should log the error but use 14
|
||
instead of the specified value.
|
||
|
||
The scale factor applies only to the Window field as transmitted
|
||
in the TCP header; each TCP using extended windows will maintain
|
||
the window values locally as 32-bit numbers. For example, the
|
||
"congestion window" computed by Slow Start and Congestion
|
||
Avoidance is not affected by the scale factor, so window scaling
|
||
will not introduce quantization into the congestion window.
|
||
|
||
3. RTTM: ROUND-TRIP TIME MEASUREMENT
|
||
|
||
3.1 Introduction
|
||
|
||
Accurate and current RTT estimates are necessary to adapt to
|
||
changing traffic conditions and to avoid an instability known as
|
||
"congestion collapse" [Nagle84] in a busy network. However,
|
||
accurate measurement of RTT may be difficult both in theory and in
|
||
implementation.
|
||
|
||
Many TCP implementations base their RTT measurements upon a sample
|
||
of only one packet per window. While this yields an adequate
|
||
approximation to the RTT for small windows, it results in an
|
||
unacceptably poor RTT estimate for an LFN. If we look at RTT
|
||
estimation as a signal processing problem (which it is), a data
|
||
signal at some frequency, the packet rate, is being sampled at a
|
||
lower frequency, the window rate. This lower sampling frequency
|
||
violates Nyquist's criteria and may therefore introduce "aliasing"
|
||
artifacts into the estimated RTT [Hamming77].
|
||
|
||
A good RTT estimator with a conservative retransmission timeout
|
||
calculation can tolerate aliasing when the sampling frequency is
|
||
"close" to the data frequency. For example, with a window of 8
|
||
packets, the sample rate is 1/8 the data frequency -- less than an
|
||
order of magnitude different. However, when the window is tens or
|
||
hundreds of packets, the RTT estimator may be seriously in error,
|
||
resulting in spurious retransmissions.
|
||
|
||
If there are dropped packets, the problem becomes worse. Zhang
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 11]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
[Zhang86], Jain [Jain86] and Karn [Karn87] have shown that it is
|
||
not possible to accumulate reliable RTT estimates if retransmitted
|
||
segments are included in the estimate. Since a full window of
|
||
data will have been transmitted prior to a retransmission, all of
|
||
the segments in that window will have to be ACKed before the next
|
||
RTT sample can be taken. This means at least an additional
|
||
window's worth of time between RTT measurements and, as the error
|
||
rate approaches one per window of data (e.g., 10**-6 errors per
|
||
bit for the Wideband satellite network), it becomes effectively
|
||
impossible to obtain a valid RTT measurement.
|
||
|
||
A solution to these problems, which actually simplifies the sender
|
||
substantially, is as follows: using TCP options, the sender places
|
||
a timestamp in each data segment, and the receiver reflects these
|
||
timestamps back in ACK segments. Then a single subtract gives the
|
||
sender an accurate RTT measurement for every ACK segment (which
|
||
will correspond to every other data segment, with a sensible
|
||
receiver). We call this the RTTM (Round-Trip Time Measurement)
|
||
mechanism.
|
||
|
||
It is vitally important to use the RTTM mechanism with big
|
||
windows; otherwise, the door is opened to some dangerous
|
||
instabilities due to aliasing. Furthermore, the option is
|
||
probably useful for all TCP's, since it simplifies the sender.
|
||
|
||
3.2 TCP Timestamps Option
|
||
|
||
TCP is a symmetric protocol, allowing data to be sent at any time
|
||
in either direction, and therefore timestamp echoing may occur in
|
||
either direction. For simplicity and symmetry, we specify that
|
||
timestamps always be sent and echoed in both directions. For
|
||
efficiency, we combine the timestamp and timestamp reply fields
|
||
into a single TCP Timestamps Option.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 12]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
TCP Timestamps Option (TSopt):
|
||
|
||
Kind: 8
|
||
|
||
Length: 10 bytes
|
||
|
||
+-------+-------+---------------------+---------------------+
|
||
|Kind=8 | 10 | TS Value (TSval) |TS Echo Reply (TSecr)|
|
||
+-------+-------+---------------------+---------------------+
|
||
1 1 4 4
|
||
|
||
The Timestamps option carries two four-byte timestamp fields.
|
||
The Timestamp Value field (TSval) contains the current value of
|
||
the timestamp clock of the TCP sending the option.
|
||
|
||
The Timestamp Echo Reply field (TSecr) is only valid if the ACK
|
||
bit is set in the TCP header; if it is valid, it echos a times-
|
||
tamp value that was sent by the remote TCP in the TSval field
|
||
of a Timestamps option. When TSecr is not valid, its value
|
||
must be zero. The TSecr value will generally be from the most
|
||
recent Timestamp option that was received; however, there are
|
||
exceptions that are explained below.
|
||
|
||
A TCP may send the Timestamps option (TSopt) in an initial
|
||
<SYN> segment (i.e., segment containing a SYN bit and no ACK
|
||
bit), and may send a TSopt in other segments only if it re-
|
||
ceived a TSopt in the initial <SYN> segment for the connection.
|
||
|
||
3.3 The RTTM Mechanism
|
||
|
||
The timestamp value to be sent in TSval is to be obtained from a
|
||
(virtual) clock that we call the "timestamp clock". Its values
|
||
must be at least approximately proportional to real time, in order
|
||
to measure actual RTT.
|
||
|
||
The following example illustrates a one-way data flow with
|
||
segments arriving in sequence without loss. Here A, B, C...
|
||
represent data blocks occupying successive blocks of sequence
|
||
numbers, and ACK(A),... represent the corresponding cumulative
|
||
acknowledgments. The two timestamp fields of the Timestamps
|
||
option are shown symbolically as <TSval= x,TSecr=y>. Each TSecr
|
||
field contains the value most recently received in a TSval field.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 13]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
|
||
TCP A TCP B
|
||
|
||
<A,TSval=1,TSecr=120> ------>
|
||
|
||
<---- <ACK(A),TSval=127,TSecr=1>
|
||
|
||
<B,TSval=5,TSecr=127> ------>
|
||
|
||
<---- <ACK(B),TSval=131,TSecr=5>
|
||
|
||
. . . . . . . . . . . . . . . . . . . . . .
|
||
|
||
<C,TSval=65,TSecr=131> ------>
|
||
|
||
<---- <ACK(C),TSval=191,TSecr=65>
|
||
|
||
(etc)
|
||
|
||
|
||
The dotted line marks a pause (60 time units long) in which A had
|
||
nothing to send. Note that this pause inflates the RTT which B
|
||
could infer from receiving TSecr=131 in data segment C. Thus, in
|
||
one-way data flows, RTTM in the reverse direction measures a value
|
||
that is inflated by gaps in sending data. However, the following
|
||
rule prevents a resulting inflation of the measured RTT:
|
||
|
||
A TSecr value received in a segment is used to update the
|
||
averaged RTT measurement only if the segment acknowledges
|
||
some new data, i.e., only if it advances the left edge of the
|
||
send window.
|
||
|
||
Since TCP B is not sending data, the data segment C does not
|
||
acknowledge any new data when it arrives at B. Thus, the inflated
|
||
RTTM measurement is not used to update B's RTTM measurement.
|
||
|
||
3.4 Which Timestamp to Echo
|
||
|
||
If more than one Timestamps option is received before a reply
|
||
segment is sent, the TCP must choose only one of the TSvals to
|
||
echo, ignoring the others. To minimize the state kept in the
|
||
receiver (i.e., the number of unprocessed TSvals), the receiver
|
||
should be required to retain at most one timestamp in the
|
||
connection control block.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 14]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
There are three situations to consider:
|
||
|
||
(A) Delayed ACKs.
|
||
|
||
Many TCP's acknowledge only every Kth segment out of a group
|
||
of segments arriving within a short time interval; this
|
||
policy is known generally as "delayed ACKs". The data-sender
|
||
TCP must measure the effective RTT, including the additional
|
||
time due to delayed ACKs, or else it will retransmit
|
||
unnecessarily. Thus, when delayed ACKs are in use, the
|
||
receiver should reply with the TSval field from the earliest
|
||
unacknowledged segment.
|
||
|
||
(B) A hole in the sequence space (segment(s) have been lost).
|
||
|
||
The sender will continue sending until the window is filled,
|
||
and the receiver may be generating ACKs as these out-of-order
|
||
segments arrive (e.g., to aid "fast retransmit").
|
||
|
||
The lost segment is probably a sign of congestion, and in
|
||
that situation the sender should be conservative about
|
||
retransmission. Furthermore, it is better to overestimate
|
||
than underestimate the RTT. An ACK for an out-of-order
|
||
segment should therefore contain the timestamp from the most
|
||
recent segment that advanced the window.
|
||
|
||
The same situation occurs if segments are re-ordered by the
|
||
network.
|
||
|
||
(C) A filled hole in the sequence space.
|
||
|
||
The segment that fills the hole represents the most recent
|
||
measurement of the network characteristics. On the other
|
||
hand, an RTT computed from an earlier segment would probably
|
||
include the sender's retransmit time-out, badly biasing the
|
||
sender's average RTT estimate. Thus, the timestamp from the
|
||
latest segment (which filled the hole) must be echoed.
|
||
|
||
An algorithm that covers all three cases is described in the
|
||
following rules for Timestamps option processing on a synchronized
|
||
connection:
|
||
|
||
(1) The connection state is augmented with two 32-bit slots:
|
||
TS.Recent holds a timestamp to be echoed in TSecr whenever a
|
||
segment is sent, and Last.ACK.sent holds the ACK field from
|
||
the last segment sent. Last.ACK.sent will equal RCV.NXT
|
||
except when ACKs have been delayed.
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 15]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
(2) If Last.ACK.sent falls within the range of sequence numbers
|
||
of an incoming segment:
|
||
|
||
SEG.SEQ <= Last.ACK.sent < SEG.SEQ + SEG.LEN
|
||
|
||
then the TSval from the segment is copied to TS.Recent;
|
||
otherwise, the TSval is ignored.
|
||
|
||
(3) When a TSopt is sent, its TSecr field is set to the current
|
||
TS.Recent value.
|
||
|
||
The following examples illustrate these rules. Here A, B, C...
|
||
represent data segments occupying successive blocks of sequence
|
||
numbers, and ACK(A),... represent the corresponding
|
||
acknowledgment segments. Note that ACK(A) has the same sequence
|
||
number as B. We show only one direction of timestamp echoing, for
|
||
clarity.
|
||
|
||
|
||
o Packets arrive in sequence, and some of the ACKs are delayed.
|
||
|
||
By Case (A), the timestamp from the oldest unacknowledged
|
||
segment is echoed.
|
||
|
||
TS.Recent
|
||
<A, TSval=1> ------------------->
|
||
1
|
||
<B, TSval=2> ------------------->
|
||
1
|
||
<C, TSval=3> ------------------->
|
||
1
|
||
<---- <ACK(C), TSecr=1>
|
||
(etc)
|
||
|
||
o Packets arrive out of order, and every packet is
|
||
acknowledged.
|
||
|
||
By Case (B), the timestamp from the last segment that
|
||
advanced the left window edge is echoed, until the missing
|
||
segment arrives; it is echoed according to Case (C). The
|
||
same sequence would occur if segments B and D were lost and
|
||
retransmitted..
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 16]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
TS.Recent
|
||
<A, TSval=1> ------------------->
|
||
1
|
||
<---- <ACK(A), TSecr=1>
|
||
1
|
||
<C, TSval=3> ------------------->
|
||
1
|
||
<---- <ACK(A), TSecr=1>
|
||
1
|
||
<B, TSval=2> ------------------->
|
||
2
|
||
<---- <ACK(C), TSecr=2>
|
||
2
|
||
<E, TSval=5> ------------------->
|
||
2
|
||
<---- <ACK(C), TSecr=2>
|
||
2
|
||
<D, TSval=4> ------------------->
|
||
4
|
||
<---- <ACK(E), TSecr=4>
|
||
(etc)
|
||
|
||
|
||
|
||
|
||
4. PAWS: PROTECT AGAINST WRAPPED SEQUENCE NUMBERS
|
||
|
||
4.1 Introduction
|
||
|
||
Section 4.2 describes a simple mechanism to reject old duplicate
|
||
segments that might corrupt an open TCP connection; we call this
|
||
mechanism PAWS (Protect Against Wrapped Sequence numbers). PAWS
|
||
operates within a single TCP connection, using state that is saved
|
||
in the connection control block. Section 4.3 and Appendix C
|
||
discuss the implications of the PAWS mechanism for avoiding old
|
||
duplicates from previous incarnations of the same connection.
|
||
|
||
4.2 The PAWS Mechanism
|
||
|
||
PAWS uses the same TCP Timestamps option as the RTTM mechanism
|
||
described earlier, and assumes that every received TCP segment
|
||
(including data and ACK segments) contains a timestamp SEG.TSval
|
||
whose values are monotone non-decreasing in time. The basic idea
|
||
is that a segment can be discarded as an old duplicate if it is
|
||
received with a timestamp SEG.TSval less than some timestamp
|
||
recently received on this connection.
|
||
|
||
In both the PAWS and the RTTM mechanism, the "timestamps" are 32-
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 17]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
bit unsigned integers in a modular 32-bit space. Thus, "less
|
||
than" is defined the same way it is for TCP sequence numbers, and
|
||
the same implementation techniques apply. If s and t are
|
||
timestamp values, s < t if 0 < (t - s) < 2**31, computed in
|
||
unsigned 32-bit arithmetic.
|
||
|
||
The choice of incoming timestamps to be saved for this comparison
|
||
must guarantee a value that is monotone increasing. For example,
|
||
we might save the timestamp from the segment that last advanced
|
||
the left edge of the receive window, i.e., the most recent in-
|
||
sequence segment. Instead, we choose the value TS.Recent
|
||
introduced in Section 3.4 for the RTTM mechanism, since using a
|
||
common value for both PAWS and RTTM simplifies the implementation
|
||
of both. As Section 3.4 explained, TS.Recent differs from the
|
||
timestamp from the last in-sequence segment only in the case of
|
||
delayed ACKs, and therefore by less than one window. Either
|
||
choice will therefore protect against sequence number wrap-around.
|
||
|
||
RTTM was specified in a symmetrical manner, so that TSval
|
||
timestamps are carried in both data and ACK segments and are
|
||
echoed in TSecr fields carried in returning ACK or data segments.
|
||
PAWS submits all incoming segments to the same test, and therefore
|
||
protects against duplicate ACK segments as well as data segments.
|
||
(An alternative un-symmetric algorithm would protect against old
|
||
duplicate ACKs: the sender of data would reject incoming ACK
|
||
segments whose TSecr values were less than the TSecr saved from
|
||
the last segment whose ACK field advanced the left edge of the
|
||
send window. This algorithm was deemed to lack economy of
|
||
mechanism and symmetry.)
|
||
|
||
TSval timestamps sent on {SYN} and {SYN,ACK} segments are used to
|
||
initialize PAWS. PAWS protects against old duplicate non-SYN
|
||
segments, and duplicate SYN segments received while there is a
|
||
synchronized connection. Duplicate {SYN} and {SYN,ACK} segments
|
||
received when there is no connection will be discarded by the
|
||
normal 3-way handshake and sequence number checks of TCP.
|
||
|
||
It is recommended that RST segments NOT carry timestamps, and that
|
||
RST segments be acceptable regardless of their timestamp. Old
|
||
duplicate RST segments should be exceedingly unlikely, and their
|
||
cleanup function should take precedence over timestamps.
|
||
|
||
4.2.1 Basic PAWS Algorithm
|
||
|
||
The PAWS algorithm requires the following processing to be
|
||
performed on all incoming segments for a synchronized
|
||
connection:
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 18]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
R1) If there is a Timestamps option in the arriving segment
|
||
and SEG.TSval < TS.Recent and if TS.Recent is valid (see
|
||
later discussion), then treat the arriving segment as not
|
||
acceptable:
|
||
|
||
Send an acknowledgement in reply as specified in
|
||
RFC-793 page 69 and drop the segment.
|
||
|
||
Note: it is necessary to send an ACK segment in order
|
||
to retain TCP's mechanisms for detecting and
|
||
recovering from half-open connections. For example,
|
||
see Figure 10 of RFC-793.
|
||
|
||
R2) If the segment is outside the window, reject it (normal
|
||
TCP processing)
|
||
|
||
R3) If an arriving segment satisfies: SEG.SEQ <= Last.ACK.sent
|
||
(see Section 3.4), then record its timestamp in TS.Recent.
|
||
|
||
R4) If an arriving segment is in-sequence (i.e., at the left
|
||
window edge), then accept it normally.
|
||
|
||
R5) Otherwise, treat the segment as a normal in-window, out-
|
||
of-sequence TCP segment (e.g., queue it for later delivery
|
||
to the user).
|
||
|
||
Steps R2, R4, and R5 are the normal TCP processing steps
|
||
specified by RFC-793.
|
||
|
||
It is important to note that the timestamp is checked only when
|
||
a segment first arrives at the receiver, regardless of whether
|
||
it is in-sequence or it must be queued for later delivery.
|
||
Consider the following example.
|
||
|
||
Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has
|
||
been sent, where the letter indicates the sequence number
|
||
and the digit represents the timestamp. Suppose also that
|
||
segment B.1 has been lost. The timestamp in TS.TStamp is
|
||
1 (from A.1), so C.1, ..., Z.1 are considered acceptable
|
||
and are queued. When B is retransmitted as segment B.2
|
||
(using the latest timestamp), it fills the hole and causes
|
||
all the segments through Z to be acknowledged and passed
|
||
to the user. The timestamps of the queued segments are
|
||
*not* inspected again at this time, since they have
|
||
already been accepted. When B.2 is accepted, TS.Stamp is
|
||
set to 2.
|
||
|
||
This rule allows reasonable performance under loss. A full
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 19]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
window of data is in transit at all times, and after a loss a
|
||
full window less one packet will show up out-of-sequence to be
|
||
queued at the receiver (e.g., up to ~2**30 bytes of data); the
|
||
timestamp option must not result in discarding this data.
|
||
|
||
In certain unlikely circumstances, the algorithm of rules R1-R4
|
||
could lead to discarding some segments unnecessarily, as shown
|
||
in the following example:
|
||
|
||
Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have
|
||
been sent in sequence and that segment B.1 has been lost.
|
||
Furthermore, suppose delivery of some of C.1, ... Z.1 is
|
||
delayed until AFTER the retransmission B.2 arrives at the
|
||
receiver. These delayed segments will be discarded
|
||
unnecessarily when they do arrive, since their timestamps
|
||
are now out of date.
|
||
|
||
This case is very unlikely to occur. If the retransmission was
|
||
triggered by a timeout, some of the segments C.1, ... Z.1 must
|
||
have been delayed longer than the RTO time. This is presumably
|
||
an unlikely event, or there would be many spurious timeouts and
|
||
retransmissions. If B's retransmission was triggered by the
|
||
"fast retransmit" algorithm, i.e., by duplicate ACKs, then the
|
||
queued segments that caused these ACKs must have been received
|
||
already.
|
||
|
||
Even if a segment were delayed past the RTO, the Fast
|
||
Retransmit mechanism [Jacobson90c] will cause the delayed
|
||
packets to be retransmitted at the same time as B.2, avoiding
|
||
an extra RTT and therefore causing a very small performance
|
||
penalty.
|
||
|
||
We know of no case with a significant probability of occurrence
|
||
in which timestamps will cause performance degradation by
|
||
unnecessarily discarding segments.
|
||
|
||
4.2.2 Timestamp Clock
|
||
|
||
It is important to understand that the PAWS algorithm does not
|
||
require clock synchronization between sender and receiver. The
|
||
sender's timestamp clock is used to stamp the segments, and the
|
||
sender uses the echoed timestamp to measure RTT's. However,
|
||
the receiver treats the timestamp as simply a monotone-
|
||
increasing serial number, without any necessary connection to
|
||
its clock. From the receiver's viewpoint, the timestamp is
|
||
acting as a logical extension of the high-order bits of the
|
||
sequence number.
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 20]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
The receiver algorithm does place some requirements on the
|
||
frequency of the timestamp clock.
|
||
|
||
(a) The timestamp clock must not be "too slow".
|
||
|
||
It must tick at least once for each 2**31 bytes sent. In
|
||
fact, in order to be useful to the sender for round trip
|
||
timing, the clock should tick at least once per window's
|
||
worth of data, and even with the RFC-1072 window
|
||
extension, 2**31 bytes must be at least two windows.
|
||
|
||
To make this more quantitative, any clock faster than 1
|
||
tick/sec will reject old duplicate segments for link
|
||
speeds of ~8 Gbps. A 1ms timestamp clock will work at
|
||
link speeds up to 8 Tbps (8*10**12) bps!
|
||
|
||
(b) The timestamp clock must not be "too fast".
|
||
|
||
Its recycling time must be greater than MSL seconds.
|
||
Since the clock (timestamp) is 32 bits and the worst-case
|
||
MSL is 255 seconds, the maximum acceptable clock frequency
|
||
is one tick every 59 ns.
|
||
|
||
However, it is desirable to establish a much longer
|
||
recycle period, in order to handle outdated timestamps on
|
||
idle connections (see Section 4.2.3), and to relax the MSL
|
||
requirement for preventing sequence number wrap-around.
|
||
With a 1 ms timestamp clock, the 32-bit timestamp will
|
||
wrap its sign bit in 24.8 days. Thus, it will reject old
|
||
duplicates on the same connection if MSL is 24.8 days or
|
||
less. This appears to be a very safe figure; an MSL of
|
||
24.8 days or longer can probably be assumed by the gateway
|
||
system without requiring precise MSL enforcement by the
|
||
TTL value in the IP layer.
|
||
|
||
Based upon these considerations, we choose a timestamp clock
|
||
frequency in the range 1 ms to 1 sec per tick. This range also
|
||
matches the requirements of the RTTM mechanism, which does not
|
||
need much more resolution than the granularity of the
|
||
retransmit timer, e.g., tens or hundreds of milliseconds.
|
||
|
||
The PAWS mechanism also puts a strong monotonicity requirement
|
||
on the sender's timestamp clock. The method of implementation
|
||
of the timestamp clock to meet this requirement depends upon
|
||
the system hardware and software.
|
||
|
||
* Some hosts have a hardware clock that is guaranteed to be
|
||
monotonic between hardware resets.
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 21]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
* A clock interrupt may be used to simply increment a binary
|
||
integer by 1 periodically.
|
||
|
||
* The timestamp clock may be derived from a system clock
|
||
that is subject to being abruptly changed, by adding a
|
||
variable offset value. This offset is initialized to
|
||
zero. When a new timestamp clock value is needed, the
|
||
offset can be adjusted as necessary to make the new value
|
||
equal to or larger than the previous value (which was
|
||
saved for this purpose).
|
||
|
||
|
||
4.2.3 Outdated Timestamps
|
||
|
||
If a connection remains idle long enough for the timestamp
|
||
clock of the other TCP to wrap its sign bit, then the value
|
||
saved in TS.Recent will become too old; as a result, the PAWS
|
||
mechanism will cause all subsequent segments to be rejected,
|
||
freezing the connection (until the timestamp clock wraps its
|
||
sign bit again).
|
||
|
||
With the chosen range of timestamp clock frequencies (1 sec to
|
||
1 ms), the time to wrap the sign bit will be between 24.8 days
|
||
and 24800 days. A TCP connection that is idle for more than 24
|
||
days and then comes to life is exceedingly unusual. However,
|
||
it is undesirable in principle to place any limitation on TCP
|
||
connection lifetimes.
|
||
|
||
We therefore require that an implementation of PAWS include a
|
||
mechanism to "invalidate" the TS.Recent value when a connection
|
||
is idle for more than 24 days. (An alternative solution to the
|
||
problem of outdated timestamps would be to send keepalive
|
||
segments at a very low rate, but still more often than the
|
||
wrap-around time for timestamps, e.g., once a day. This would
|
||
impose negligible overhead. However, the TCP specification has
|
||
never included keepalives, so the solution based upon
|
||
invalidation was chosen.)
|
||
|
||
Note that a TCP does not know the frequency, and therefore, the
|
||
wraparound time, of the other TCP, so it must assume the worst.
|
||
The validity of TS.Recent needs to be checked only if the basic
|
||
PAWS timestamp check fails, i.e., only if SEG.TSval <
|
||
TS.Recent. If TS.Recent is found to be invalid, then the
|
||
segment is accepted, regardless of the failure of the timestamp
|
||
check, and rule R3 updates TS.Recent with the TSval from the
|
||
new segment.
|
||
|
||
To detect how long the connection has been idle, the TCP may
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 22]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
update a clock or timestamp value associated with the
|
||
connection whenever TS.Recent is updated, for example. The
|
||
details will be implementation-dependent.
|
||
|
||
4.2.4 Header Prediction
|
||
|
||
"Header prediction" [Jacobson90a] is a high-performance
|
||
transport protocol implementation technique that is most
|
||
important for high-speed links. This technique optimizes the
|
||
code for the most common case, receiving a segment correctly
|
||
and in order. Using header prediction, the receiver asks the
|
||
question, "Is this segment the next in sequence?" This
|
||
question can be answered in fewer machine instructions than the
|
||
question, "Is this segment within the window?"
|
||
|
||
Adding header prediction to our timestamp procedure leads to
|
||
the following recommended sequence for processing an arriving
|
||
TCP segment:
|
||
|
||
H1) Check timestamp (same as step R1 above)
|
||
|
||
H2) Do header prediction: if segment is next in sequence and
|
||
if there are no special conditions requiring additional
|
||
processing, accept the segment, record its timestamp, and
|
||
skip H3.
|
||
|
||
H3) Process the segment normally, as specified in RFC-793.
|
||
This includes dropping segments that are outside the win-
|
||
dow and possibly sending acknowledgments, and queueing
|
||
in-window, out-of-sequence segments.
|
||
|
||
Another possibility would be to interchange steps H1 and H2,
|
||
i.e., to perform the header prediction step H2 FIRST, and
|
||
perform H1 and H3 only when header prediction fails. This
|
||
could be a performance improvement, since the timestamp check
|
||
in step H1 is very unlikely to fail, and it requires interval
|
||
arithmetic on a finite field, a relatively expensive operation.
|
||
To perform this check on every single segment is contrary to
|
||
the philosophy of header prediction. We believe that this
|
||
change might reduce CPU time for TCP protocol processing by up
|
||
to 5-10% on high-speed networks.
|
||
|
||
However, putting H2 first would create a hazard: a segment from
|
||
2**32 bytes in the past might arrive at exactly the wrong time
|
||
and be accepted mistakenly by the header-prediction step. The
|
||
following reasoning has been introduced [Jacobson90b] to show
|
||
that the probability of this failure is negligible.
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 23]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
If all segments are equally likely to show up as old
|
||
duplicates, then the probability of an old duplicate
|
||
exactly matching the left window edge is the maximum
|
||
segment size (MSS) divided by the size of the sequence
|
||
space. This ratio must be less than 2**-16, since MSS
|
||
must be < 2**16; for example, it will be (2**12)/(2**32) =
|
||
2**-20 for an FDDI link. However, the older a segment is,
|
||
the less likely it is to be retained in the Internet, and
|
||
under any reasonable model of segment lifetime the
|
||
probability of an old duplicate exactly at the left window
|
||
edge must be much smaller than 2**-16.
|
||
|
||
The 16 bit TCP checksum also allows a basic unreliability
|
||
of one part in 2**16. A protocol mechanism whose
|
||
reliability exceeds the reliability of the TCP checksum
|
||
should be considered "good enough", i.e., it won't
|
||
contribute significantly to the overall error rate. We
|
||
therefore believe we can ignore the problem of an old
|
||
duplicate being accepted by doing header prediction before
|
||
checking the timestamp.
|
||
|
||
However, this probabilistic argument is not universally
|
||
accepted, and the consensus at present is that the performance
|
||
gain does not justify the hazard in the general case. It is
|
||
therefore recommended that H2 follow H1.
|
||
|
||
4.3. Duplicates from Earlier Incarnations of Connection
|
||
|
||
The PAWS mechanism protects against errors due to sequence number
|
||
wrap-around on high-speed connection. Segments from an earlier
|
||
incarnation of the same connection are also a potential cause of
|
||
old duplicate errors. In both cases, the TCP mechanisms to
|
||
prevent such errors depend upon the enforcement of a maximum
|
||
segment lifetime (MSL) by the Internet (IP) layer (see Appendix of
|
||
RFC-1185 for a detailed discussion). Unlike the case of sequence
|
||
space wrap-around, the MSL required to prevent old duplicate
|
||
errors from earlier incarnations does not depend upon the transfer
|
||
rate. If the IP layer enforces the recommended 2 minute MSL of
|
||
TCP, and if the TCP rules are followed, TCP connections will be
|
||
safe from earlier incarnations, no matter how high the network
|
||
speed. Thus, the PAWS mechanism is not required for this case.
|
||
|
||
We may still ask whether the PAWS mechanism can provide additional
|
||
security against old duplicates from earlier connections, allowing
|
||
us to relax the enforcement of MSL by the IP layer. Appendix B
|
||
explores this question, showing that further assumptions and/or
|
||
mechanisms are required, beyond those of PAWS. This is not part
|
||
of the current extension.
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 24]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
5. CONCLUSIONS AND ACKNOWLEDGMENTS
|
||
|
||
This memo presented a set of extensions to TCP to provide efficient
|
||
operation over large-bandwidth*delay-product paths and reliable
|
||
operation over very high-speed paths. These extensions are designed
|
||
to provide compatible interworking with TCP's that do not implement
|
||
the extensions.
|
||
|
||
These mechanisms are implemented using new TCP options for scaled
|
||
windows and timestamps. The timestamps are used for two distinct
|
||
mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect
|
||
Against Wrapped Sequences).
|
||
|
||
The Window Scale option was originally suggested by Mike St. Johns of
|
||
USAF/DCA. The present form of the option was suggested by Mike
|
||
Karels of UC Berkeley in response to a more cumbersome scheme defined
|
||
by Van Jacobson. Lixia Zhang helped formulate the PAWS mechanism
|
||
description in RFC-1185.
|
||
|
||
Finally, much of this work originated as the result of discussions
|
||
within the End-to-End Task Force on the theoretical limitations of
|
||
transport protocols in general and TCP in particular. More recently,
|
||
task force members and other on the end2end-interest list have made
|
||
valuable contributions by pointing out flaws in the algorithms and
|
||
the documentation. The authors are grateful for all these
|
||
contributions.
|
||
|
||
6. REFERENCES
|
||
|
||
[Clark87] Clark, D., Lambert, M., and L. Zhang, "NETBLT: A Bulk
|
||
Data Transfer Protocol", RFC 998, MIT, March 1987.
|
||
|
||
[Garlick77] Garlick, L., R. Rom, and J. Postel, "Issues in
|
||
Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop
|
||
on Distributed Data Management and Computer Networks, May 1977.
|
||
|
||
[Hamming77] Hamming, R., "Digital Filters", ISBN 0-13-212571-4,
|
||
Prentice Hall, Englewood Cliffs, N.J., 1977.
|
||
|
||
[Cheriton88] Cheriton, D., "VMTP: Versatile Message Transaction
|
||
Protocol", RFC 1045, Stanford University, February 1988.
|
||
|
||
[Jacobson88a] Jacobson, V., "Congestion Avoidance and Control",
|
||
SIGCOMM '88, Stanford, CA., August 1988.
|
||
|
||
[Jacobson88b] Jacobson, V., and R. Braden, "TCP Extensions for
|
||
Long-Delay Paths", RFC-1072, LBL and USC/Information Sciences
|
||
Institute, October 1988.
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 25]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
[Jacobson90a] Jacobson, V., "4BSD Header Prediction", ACM
|
||
Computer Communication Review, April 1990.
|
||
|
||
[Jacobson90b] Jacobson, V., Braden, R., and Zhang, L., "TCP
|
||
Extension for High-Speed Paths", RFC-1185, LBL and USC/Information
|
||
Sciences Institute, October 1990.
|
||
|
||
[Jacobson90c] Jacobson, V., "Modified TCP congestion avoidance
|
||
algorithm", Message to end2end-interest mailing list, April 1990.
|
||
|
||
[Jain86] Jain, R., "Divergence of Timeout Algorithms for Packet
|
||
Retransmissions", Proc. Fifth Phoenix Conf. on Comp. and Comm.,
|
||
Scottsdale, Arizona, March 1986.
|
||
|
||
[Karn87] Karn, P. and C. Partridge, "Estimating Round-Trip Times
|
||
in Reliable Transport Protocols", Proc. SIGCOMM '87, Stowe, VT,
|
||
August 1987.
|
||
|
||
[McKenzie89] McKenzie, A., "A Problem with the TCP Big Window
|
||
Option", RFC 1110, BBN STC, August 1989.
|
||
|
||
[Nagle84] Nagle, J., "Congestion Control in IP/TCP
|
||
Internetworks", RFC 896, FACC, January 1984.
|
||
|
||
[NBS85] Colella, R., Aronoff, R., and K. Mills, "Performance
|
||
Improvements for ISO Transport", Ninth Data Comm Symposium,
|
||
published in ACM SIGCOMM Comp Comm Review, vol. 15, no. 5,
|
||
September 1985.
|
||
|
||
[Postel81] Postel, J., "Transmission Control Protocol - DARPA
|
||
Internet Program Protocol Specification", RFC 793, DARPA,
|
||
September 1981.
|
||
|
||
[Velten84] Velten, D., Hinden, R., and J. Sax, "Reliable Data
|
||
Protocol", RFC 908, BBN, July 1984.
|
||
|
||
[Watson81] Watson, R., "Timer-based Mechanisms in Reliable
|
||
Transport Protocol Connection Management", Computer Networks, Vol.
|
||
5, 1981.
|
||
|
||
[Zhang86] Zhang, L., "Why TCP Timers Don't Work Well", Proc.
|
||
SIGCOMM '86, Stowe, Vt., August 1986.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 26]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
APPENDIX A: IMPLEMENTATION SUGGESTIONS
|
||
|
||
The following layouts are recommended for sending options on non-SYN
|
||
segments, to achieve maximum feasible alignment of 32-bit and 64-bit
|
||
machines.
|
||
|
||
|
||
+--------+--------+--------+--------+
|
||
| NOP | NOP | TSopt | 10 |
|
||
+--------+--------+--------+--------+
|
||
| TSval timestamp |
|
||
+--------+--------+--------+--------+
|
||
| TSecr timestamp |
|
||
+--------+--------+--------+--------+
|
||
|
||
|
||
APPENDIX B: DUPLICATES FROM EARLIER CONNECTION INCARNATIONS
|
||
|
||
There are two cases to be considered: (1) a system crashing (and
|
||
losing connection state) and restarting, and (2) the same connection
|
||
being closed and reopened without a loss of host state. These will
|
||
be described in the following two sections.
|
||
|
||
B.1 System Crash with Loss of State
|
||
|
||
TCP's quiet time of one MSL upon system startup handles the loss
|
||
of connection state in a system crash/restart. For an
|
||
explanation, see for example "When to Keep Quiet" in the TCP
|
||
protocol specification [Postel81]. The MSL that is required here
|
||
does not depend upon the transfer speed. The current TCP MSL of 2
|
||
minutes seems acceptable as an operational compromise, as many
|
||
host systems take this long to boot after a crash.
|
||
|
||
However, the timestamp option may be used to ease the MSL
|
||
requirements (or to provide additional security against data
|
||
corruption). If timestamps are being used and if the timestamp
|
||
clock can be guaranteed to be monotonic over a system
|
||
crash/restart, i.e., if the first value of the sender's timestamp
|
||
clock after a crash/restart can be guaranteed to be greater than
|
||
the last value before the restart, then a quiet time will be
|
||
unnecessary.
|
||
|
||
To dispense totally with the quiet time would require that the
|
||
host clock be synchronized to a time source that is stable over
|
||
the crash/restart period, with an accuracy of one timestamp clock
|
||
tick or better. We can back off from this strict requirement to
|
||
take advantage of approximate clock synchronization. Suppose that
|
||
the clock is always re-synchronized to within N timestamp clock
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 27]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
ticks and that booting (extended with a quiet time, if necessary)
|
||
takes more than N ticks. This will guarantee monotonicity of the
|
||
timestamps, which can then be used to reject old duplicates even
|
||
without an enforced MSL.
|
||
|
||
B.2 Closing and Reopening a Connection
|
||
|
||
When a TCP connection is closed, a delay of 2*MSL in TIME-WAIT
|
||
state ties up the socket pair for 4 minutes (see Section 3.5 of
|
||
[Postel81]. Applications built upon TCP that close one connection
|
||
and open a new one (e.g., an FTP data transfer connection using
|
||
Stream mode) must choose a new socket pair each time. The TIME-
|
||
WAIT delay serves two different purposes:
|
||
|
||
(a) Implement the full-duplex reliable close handshake of TCP.
|
||
|
||
The proper time to delay the final close step is not really
|
||
related to the MSL; it depends instead upon the RTO for the
|
||
FIN segments and therefore upon the RTT of the path. (It
|
||
could be argued that the side that is sending a FIN knows
|
||
what degree of reliability it needs, and therefore it should
|
||
be able to determine the length of the TIME-WAIT delay for
|
||
the FIN's recipient. This could be accomplished with an
|
||
appropriate TCP option in FIN segments.)
|
||
|
||
Although there is no formal upper-bound on RTT, common
|
||
network engineering practice makes an RTT greater than 1
|
||
minute very unlikely. Thus, the 4 minute delay in TIME-WAIT
|
||
state works satisfactorily to provide a reliable full-duplex
|
||
TCP close. Note again that this is independent of MSL
|
||
enforcement and network speed.
|
||
|
||
The TIME-WAIT state could cause an indirect performance
|
||
problem if an application needed to repeatedly close one
|
||
connection and open another at a very high frequency, since
|
||
the number of available TCP ports on a host is less than
|
||
2**16. However, high network speeds are not the major
|
||
contributor to this problem; the RTT is the limiting factor
|
||
in how quickly connections can be opened and closed.
|
||
Therefore, this problem will be no worse at high transfer
|
||
speeds.
|
||
|
||
(b) Allow old duplicate segments to expire.
|
||
|
||
To replace this function of TIME-WAIT state, a mechanism
|
||
would have to operate across connections. PAWS is defined
|
||
strictly within a single connection; the last timestamp is
|
||
TS.Recent is kept in the connection control block, and
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 28]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
discarded when a connection is closed.
|
||
|
||
An additional mechanism could be added to the TCP, a per-host
|
||
cache of the last timestamp received from any connection.
|
||
This value could then be used in the PAWS mechanism to reject
|
||
old duplicate segments from earlier incarnations of the
|
||
connection, if the timestamp clock can be guaranteed to have
|
||
ticked at least once since the old connection was open. This
|
||
would require that the TIME-WAIT delay plus the RTT together
|
||
must be at least one tick of the sender's timestamp clock.
|
||
Such an extension is not part of the proposal of this RFC.
|
||
|
||
Note that this is a variant on the mechanism proposed by
|
||
Garlick, Rom, and Postel [Garlick77], which required each
|
||
host to maintain connection records containing the highest
|
||
sequence numbers on every connection. Using timestamps
|
||
instead, it is only necessary to keep one quantity per remote
|
||
host, regardless of the number of simultaneous connections to
|
||
that host.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 29]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
APPENDIX C: CHANGES FROM RFC-1072, RFC-1185
|
||
|
||
The protocol extensions defined in this document differ in several
|
||
important ways from those defined in RFC-1072 and RFC-1185.
|
||
|
||
(a) SACK has been deferred to a later memo.
|
||
|
||
(b) The detailed rules for sending timestamp replies (see Section
|
||
3.4) differ in important ways. The earlier rules could result
|
||
in an under-estimate of the RTT in certain cases (packets
|
||
dropped or out of order).
|
||
|
||
(c) The same value TS.Recent is now shared by the two distinct
|
||
mechanisms RTTM and PAWS. This simplification became possible
|
||
because of change (b).
|
||
|
||
(d) An ambiguity in RFC-1185 was resolved in favor of putting
|
||
timestamps on ACK as well as data segments. This supports the
|
||
symmetry of the underlying TCP protocol.
|
||
|
||
(e) The echo and echo reply options of RFC-1072 were combined into a
|
||
single Timestamps option, to reflect the symmetry and to
|
||
simplify processing.
|
||
|
||
(f) The problem of outdated timestamps on long-idle connections,
|
||
discussed in Section 4.2.2, was realized and resolved.
|
||
|
||
(g) RFC-1185 recommended that header prediction take precedence over
|
||
the timestamp check. Based upon some scepticism about the
|
||
probabilistic arguments given in Section 4.2.4, it was decided
|
||
to recommend that the timestamp check be performed first.
|
||
|
||
(h) The spec was modified so that the extended options will be sent
|
||
on <SYN,ACK> segments only when they are received in the
|
||
corresponding <SYN> segments. This provides the most
|
||
conservative possible conditions for interoperation with
|
||
implementations without the extensions.
|
||
|
||
In addition to these substantive changes, the present RFC attempts to
|
||
specify the algorithms unambiguously by presenting modifications to
|
||
the Event Processing rules of RFC-793; see Appendix E.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 30]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
APPENDIX D: SUMMARY OF NOTATION
|
||
|
||
The following notation has been used in this document.
|
||
|
||
Options
|
||
|
||
WSopt: TCP Window Scale Option
|
||
TSopt: TCP Timestamps Option
|
||
|
||
Option Fields
|
||
|
||
shift.cnt: Window scale byte in WSopt.
|
||
TSval: 32-bit Timestamp Value field in TSopt.
|
||
TSecr: 32-bit Timestamp Reply field in TSopt.
|
||
|
||
Option Fields in Current Segment
|
||
|
||
SEG.TSval: TSval field from TSopt in current segment.
|
||
SEG.TSecr: TSecr field from TSopt in current segment.
|
||
SEG.WSopt: 8-bit value in WSopt
|
||
|
||
Clock Values
|
||
|
||
my.TSclock: Local source of 32-bit timestamp values
|
||
my.TSclock.rate: Period of my.TSclock (1 ms to 1 sec).
|
||
|
||
Per-Connection State Variables
|
||
|
||
TS.Recent: Latest received Timestamp
|
||
Last.ACK.sent: Last ACK field sent
|
||
|
||
Snd.TS.OK: 1-bit flag
|
||
Snd.WS.OK: 1-bit flag
|
||
|
||
Rcv.Wind.Scale: Receive window scale power
|
||
Snd.Wind.Scale: Send window scale power
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 31]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
APPENDIX E: EVENT PROCESSING
|
||
|
||
|
||
Event Processing
|
||
|
||
OPEN Call
|
||
|
||
...
|
||
An initial send sequence number (ISS) is selected. Send a SYN
|
||
segment of the form:
|
||
|
||
<SEQ=ISS><CTL=SYN><TSval=my.TSclock><WSopt=Rcv.Wind.Scale>
|
||
|
||
...
|
||
|
||
SEND Call
|
||
|
||
CLOSED STATE (i.e., TCB does not exist)
|
||
|
||
...
|
||
|
||
LISTEN STATE
|
||
|
||
If the foreign socket is specified, then change the connection
|
||
from passive to active, select an ISS. Send a SYN segment
|
||
containing the options: <TSval=my.TSclock> and
|
||
<WSopt=Rcv.Wind.Scale>. Set SND.UNA to ISS, SND.NXT to ISS+1.
|
||
Enter SYN-SENT state. ...
|
||
|
||
SYN-SENT STATE
|
||
SYN-RECEIVED STATE
|
||
|
||
...
|
||
|
||
ESTABLISHED STATE
|
||
CLOSE-WAIT STATE
|
||
|
||
Segmentize the buffer and send it with a piggybacked
|
||
acknowledgment (acknowledgment value = RCV.NXT). ...
|
||
|
||
If the urgent flag is set ...
|
||
|
||
If the Snd.TS.OK flag is set, then include the TCP Timestamps
|
||
option <TSval=my.TSclock,TSecr=TS.Recent> in each data segment.
|
||
|
||
Scale the receive window for transmission in the segment header:
|
||
|
||
SEG.WND = (SND.WND >> Rcv.Wind.Scale).
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 32]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
SEGMENT ARRIVES
|
||
|
||
...
|
||
|
||
If the state is LISTEN then
|
||
|
||
first check for an RST
|
||
|
||
...
|
||
|
||
second check for an ACK
|
||
|
||
...
|
||
|
||
third check for a SYN
|
||
|
||
if the SYN bit is set, check the security. If the ...
|
||
|
||
...
|
||
|
||
If the SEG.PRC is less than the TCB.PRC then continue.
|
||
|
||
Check for a Window Scale option (WSopt); if one is found, save
|
||
SEG.WSopt in Snd.Wind.Scale and set Snd.WS.OK flag on.
|
||
Otherwise, set both Snd.Wind.Scale and Rcv.Wind.Scale to zero
|
||
and clear Snd.WS.OK flag.
|
||
|
||
Check for a TSopt option; if one is found, save SEG.TSval in the
|
||
variable TS.Recent and turn on the Snd.TS.OK bit.
|
||
|
||
Set RCV.NXT to SEG.SEQ+1, IRS is set to SEG.SEQ and any other
|
||
control or text should be queued for processing later. ISS
|
||
should be selected and a SYN segment sent of the form:
|
||
|
||
<SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>
|
||
|
||
If the Snd.WS.OK bit is on, include a WSopt option
|
||
<WSopt=Rcv.Wind.Scale> in this segment. If the Snd.TS.OK bit is
|
||
on, include a TSopt <TSval=my.TSclock,TSecr=TS.Recent> in this
|
||
segment. Last.ACK.sent is set to RCV.NXT.
|
||
|
||
SND.NXT is set to ISS+1 and SND.UNA to ISS. The connection
|
||
state should be changed to SYN-RECEIVED. Note that any other
|
||
incoming control or data (combined with SYN) will be processed
|
||
in the SYN-RECEIVED state, but processing of SYN and ACK should
|
||
not be repeated. If the listen was not fully specified (i.e.,
|
||
the foreign socket was not fully specified), then the
|
||
unspecified fields should be filled in now.
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 33]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
fourth other text or control
|
||
|
||
...
|
||
|
||
If the state is SYN-SENT then
|
||
|
||
first check the ACK bit
|
||
|
||
...
|
||
|
||
fourth check the SYN bit
|
||
|
||
...
|
||
|
||
If the SYN bit is on and the security/compartment and precedence
|
||
are acceptable then, RCV.NXT is set to SEG.SEQ+1, IRS is set to
|
||
SEG.SEQ, and any acknowledgements on the retransmission queue
|
||
which are thereby acknowledged should be removed.
|
||
|
||
Check for a Window Scale option (WSopt); if is found, save
|
||
SEG.WSopt in Snd.Wind.Scale; otherwise, set both Snd.Wind.Scale
|
||
and Rcv.Wind.Scale to zero.
|
||
|
||
Check for a TSopt option; if one is found, save SEG.TSval in
|
||
variable TS.Recent and turn on the Snd.TS.OK bit in the
|
||
connection control block. If the ACK bit is set, use my.TSclock
|
||
- SEG.TSecr as the initial RTT estimate.
|
||
|
||
If SND.UNA > ISS (our SYN has been ACKed), change the connection
|
||
state to ESTABLISHED, form an ACK segment:
|
||
|
||
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
|
||
|
||
and send it. If the Snd.Echo.OK bit is on, include a TSopt
|
||
option <TSval=my.TSclock,TSecr=TS.Recent> in this ACK segment.
|
||
Last.ACK.sent is set to RCV.NXT.
|
||
|
||
Data or controls which were queued for transmission may be
|
||
included. If there are other controls or text in the segment
|
||
then continue processing at the sixth step below where the URG
|
||
bit is checked, otherwise return.
|
||
|
||
Otherwise enter SYN-RECEIVED, form a SYN,ACK segment:
|
||
|
||
<SEQ=ISS><ACK=RCV.NXT><CTL=SYN,ACK>
|
||
|
||
and send it. If the Snd.Echo.OK bit is on, include a TSopt
|
||
option <TSval=my.TSclock,TSecr=TS.Recent> in this segment. If
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 34]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
the Snd.WS.OK bit is on, include a WSopt option
|
||
<WSopt=Rcv.Wind.Scale> in this segment. Last.ACK.sent is set to
|
||
RCV.NXT.
|
||
|
||
If there are other controls or text in the segment, queue them
|
||
for processing after the ESTABLISHED state has been reached,
|
||
return.
|
||
|
||
fifth, if neither of the SYN or RST bits is set then drop the
|
||
segment and return.
|
||
|
||
|
||
Otherwise,
|
||
|
||
First, check sequence number
|
||
|
||
SYN-RECEIVED STATE
|
||
ESTABLISHED STATE
|
||
FIN-WAIT-1 STATE
|
||
FIN-WAIT-2 STATE
|
||
CLOSE-WAIT STATE
|
||
CLOSING STATE
|
||
LAST-ACK STATE
|
||
TIME-WAIT STATE
|
||
|
||
Segments are processed in sequence. Initial tests on arrival
|
||
are used to discard old duplicates, but further processing is
|
||
done in SEG.SEQ order. If a segment's contents straddle the
|
||
boundary between old and new, only the new parts should be
|
||
processed.
|
||
|
||
Rescale the received window field:
|
||
|
||
TrueWindow = SEG.WND << Snd.Wind.Scale,
|
||
|
||
and use "TrueWindow" in place of SEG.WND in the following steps.
|
||
|
||
Check whether the segment contains a Timestamps option and bit
|
||
Snd.TS.OK is on. If so:
|
||
|
||
If SEG.TSval < TS.Recent, then test whether connection has
|
||
been idle less than 24 days; if both are true, then the
|
||
segment is not acceptable; follow steps below for an
|
||
unacceptable segment.
|
||
|
||
If SEG.SEQ is equal to Last.ACK.sent, then save SEG.ECopt in
|
||
variable TS.Recent.
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 35]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
There are four cases for the acceptability test for an incoming
|
||
segment:
|
||
|
||
...
|
||
|
||
If an incoming segment is not acceptable, an acknowledgment
|
||
should be sent in reply (unless the RST bit is set, if so drop
|
||
the segment and return):
|
||
|
||
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
|
||
|
||
Last.ACK.sent is set to SEG.ACK of the acknowledgment. If the
|
||
Snd.Echo.OK bit is on, include the Timestamps option
|
||
<TSval=my.TSclock,TSecr=TS.Recent> in this ACK segment. Set
|
||
Last.ACK.sent to SEG.ACK and send the ACK segment. After
|
||
sending the acknowledgment, drop the unacceptable segment and
|
||
return.
|
||
|
||
...
|
||
|
||
fifth check the ACK field.
|
||
|
||
if the ACK bit is off drop the segment and return.
|
||
|
||
if the ACK bit is on
|
||
|
||
...
|
||
|
||
ESTABLISHED STATE
|
||
|
||
If SND.UNA < SEG.ACK =< SND.NXT then, set SND.UNA <- SEG.ACK.
|
||
Also compute a new estimate of round-trip time. If Snd.TS.OK
|
||
bit is on, use my.TSclock - SEG.TSecr; otherwise use the
|
||
elapsed time since the first segment in the retransmission
|
||
queue was sent. Any segments on the retransmission queue
|
||
which are thereby entirely acknowledged...
|
||
|
||
...
|
||
|
||
Seventh, process the segment text.
|
||
|
||
ESTABLISHED STATE
|
||
FIN-WAIT-1 STATE
|
||
FIN-WAIT-2 STATE
|
||
|
||
...
|
||
|
||
Send an acknowledgment of the form:
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 36]
|
||
|
||
RFC 1323 TCP Extensions for High Performance May 1992
|
||
|
||
|
||
<SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
|
||
|
||
If the Snd.TS.OK bit is on, include Timestamps option
|
||
<TSval=my.TSclock,TSecr=TS.Recent> in this ACK segment. Set
|
||
Last.ACK.sent to SEG.ACK of the acknowledgment, and send it.
|
||
This acknowledgment should be piggy-backed on a segment being
|
||
transmitted if possible without incurring undue delay.
|
||
|
||
|
||
...
|
||
|
||
|
||
Security Considerations
|
||
|
||
Security issues are not discussed in this memo.
|
||
|
||
Authors' Addresses
|
||
|
||
Van Jacobson
|
||
University of California
|
||
Lawrence Berkeley Laboratory
|
||
Mail Stop 46A
|
||
Berkeley, CA 94720
|
||
|
||
Phone: (415) 486-6411
|
||
EMail: van@CSAM.LBL.GOV
|
||
|
||
|
||
Bob Braden
|
||
University of Southern California
|
||
Information Sciences Institute
|
||
4676 Admiralty Way
|
||
Marina del Rey, CA 90292
|
||
|
||
Phone: (310) 822-1511
|
||
EMail: Braden@ISI.EDU
|
||
|
||
|
||
Dave Borman
|
||
Cray Research
|
||
655-E Lone Oak Drive
|
||
Eagan, MN 55121
|
||
|
||
Phone: (612) 683-5571
|
||
Email: dab@cray.com
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Jacobson, Braden, & Borman [Page 37]
|
||
|