969ff3bbb38b6622800a1a4bd38404e3701193de |
|
08-Feb-2014 |
JP Abgrall <jpa@google.com> |
tcp: add a sysctl to config the tcp_default_init_rwnd The default initial rwnd is hardcoded to 10. Now we allow it to be controlled via /proc/sys/net/ipv4/tcp_default_init_rwnd which limits the values from 3 to 100 This is somewhat needed because ipv6 routes are autoconfigured by the kernel. See "An Argument for Increasing TCP's Initial Congestion Window" in https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf Change-Id: I386b2a9d62de0ebe05c1ebe1b4bd91b314af5c54 Signed-off-by: JP Abgrall <jpa@google.com> Conflicts: net/ipv4/sysctl_net_ipv4.c net/ipv4/tcp_input.c
|
547669d483e5783d722772af1483fa474da7caf9 |
|
23-May-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: xps: fix reordering issues commit 3853b5841c01a ("xps: Improvements in TX queue selection") introduced ooo_okay flag, but the condition to set it is slightly wrong. In our traces, we have seen ACK packets being received out of order, and RST packets sent in response. We should test if we have any packets still in host queue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6a5dc9e598fe90160fee7de098fa319665f5253e |
|
29-Apr-2013 |
Eric Dumazet <edumazet@google.com> |
net: Add MIB counters for checksum errors Add MIB counters for checksum errors in IP layer, and TCP/UDP/ICMP layers, to help diagnose problems. $ nstat -a | grep Csum IcmpInCsumErrors 72 0.0 TcpInCsumErrors 382 0.0 UdpInCsumErrors 463221 0.0 Icmp6InCsumErrors 75 0.0 Udp6InCsumErrors 173442 0.0 IpExtInCsumErrors 10884 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e280af026a5662ffd57c4e623b822df1f7f47ff |
|
18-Apr-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: introduce TCPSpuriousRtxHostQueues SNMP counter Host queues (Qdisc + NIC) can hold packets so long that TCP can eventually retransmit a packet before the first transmit even left the host. Its not clear right now if we could avoid this in the first place : - We could arm RTO timer not at the time we enqueue packets, but at the time we TX complete them (tcp_wfree()) - Cancel the sending of the new copy of the packet if prior one is still in queue. This patch adds instrumentation so that we can at least see how often this problem happens. TCPSpuriousRtxHostQueues SNMP counter is incremented every time we detect the fast clone is not yet freed in tcp_transmit_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d6a4a10411764cf1c3a5dad4f06c5ebe5194488b |
|
12-Apr-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: GSO should be TSQ friendly I noticed that TSQ (TCP Small queues) was less effective when TSO is turned off, and GSO is on. If BQL is not enabled, TSQ has then no effect. It turns out the GSO engine frees the original gso_skb at the time the fragments are generated and queued to the NIC. We should instead call the tcp_wfree() destructor for the last fragment, to keep the flow control as intended in TSQ. This effectively limits the number of queued packets on qdisc + NIC layers. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
50bceae9bd3569d56744882f3012734d48a1d413 |
|
11-Apr-2013 |
Thomas Graf <tgraf@suug.ch> |
tcp: Reallocate headroom if it would overflow csum_start If a TCP retransmission gets partially ACKed and collapsed multiple times it is possible for the headroom to grow beyond 64K which will overflow the 16bit skb->csum_start which is based on the start of the headroom. It has been observed rarely in the wild with IPoIB due to the 64K MTU. Verify if the acking and collapsing resulted in a headroom exceeding what csum_start can cover and reallocate the headroom if so. A big thank you to Jim Foraker <foraker1@llnl.gov> and the team at LLNL for helping out with the investigation and testing. Reported-by: Jim Foraker <foraker1@llnl.gov> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ca10b9e9a8ca7342ee07065289cbe74ac128c169 |
|
08-Apr-2013 |
Eric Dumazet <edumazet@google.com> |
selinux: add a skb_owned_by() hook Commit 90ba9b1986b5ac (tcp: tcp_make_synack() can use alloc_skb()) broke certain SELinux/NetLabel configurations by no longer correctly assigning the sock to the outgoing SYNACK packet. Cost of atomic operations on the LISTEN socket is quite big, and we would like it to happen only if really needed. This patch introduces a new security_ops->skb_owned_by() method, that is a void operation unless selinux is active. Reported-by: Miroslav Vadkerti <mvadkert@redhat.com> Diagnosed-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-security-module@vger.kernel.org Acked-by: James Morris <james.l.morris@oracle.com> Tested-by: Paul Moore <pmoore@redhat.com> Acked-by: Paul Moore <pmoore@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f4541d60a449afd40448b06496dcd510f505928e |
|
21-Mar-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: preserve ACK clocking in TSO A long standing problem with TSO is the fact that tcp_tso_should_defer() rearms the deferred timer, while it should not. Current code leads to following bad bursty behavior : 20:11:24.484333 IP A > B: . 297161:316921(19760) ack 1 win 119 20:11:24.484337 IP B > A: . ack 263721 win 1117 20:11:24.485086 IP B > A: . ack 265241 win 1117 20:11:24.485925 IP B > A: . ack 266761 win 1117 20:11:24.486759 IP B > A: . ack 268281 win 1117 20:11:24.487594 IP B > A: . ack 269801 win 1117 20:11:24.488430 IP B > A: . ack 271321 win 1117 20:11:24.489267 IP B > A: . ack 272841 win 1117 20:11:24.490104 IP B > A: . ack 274361 win 1117 20:11:24.490939 IP B > A: . ack 275881 win 1117 20:11:24.491775 IP B > A: . ack 277401 win 1117 20:11:24.491784 IP A > B: . 316921:332881(15960) ack 1 win 119 20:11:24.492620 IP B > A: . ack 278921 win 1117 20:11:24.493448 IP B > A: . ack 280441 win 1117 20:11:24.494286 IP B > A: . ack 281961 win 1117 20:11:24.495122 IP B > A: . ack 283481 win 1117 20:11:24.495958 IP B > A: . ack 285001 win 1117 20:11:24.496791 IP B > A: . ack 286521 win 1117 20:11:24.497628 IP B > A: . ack 288041 win 1117 20:11:24.498459 IP B > A: . ack 289561 win 1117 20:11:24.499296 IP B > A: . ack 291081 win 1117 20:11:24.500133 IP B > A: . ack 292601 win 1117 20:11:24.500970 IP B > A: . ack 294121 win 1117 20:11:24.501388 IP B > A: . ack 295641 win 1117 20:11:24.501398 IP A > B: . 332881:351881(19000) ack 1 win 119 While the expected behavior is more like : 20:19:49.259620 IP A > B: . 197601:202161(4560) ack 1 win 119 20:19:49.260446 IP B > A: . ack 154281 win 1212 20:19:49.261282 IP B > A: . ack 155801 win 1212 20:19:49.262125 IP B > A: . ack 157321 win 1212 20:19:49.262136 IP A > B: . 202161:206721(4560) ack 1 win 119 20:19:49.262958 IP B > A: . ack 158841 win 1212 20:19:49.263795 IP B > A: . ack 160361 win 1212 20:19:49.264628 IP B > A: . ack 161881 win 1212 20:19:49.264637 IP A > B: . 206721:211281(4560) ack 1 win 119 20:19:49.265465 IP B > A: . ack 163401 win 1212 20:19:49.265886 IP B > A: . ack 164921 win 1212 20:19:49.266722 IP B > A: . ack 166441 win 1212 20:19:49.266732 IP A > B: . 211281:215841(4560) ack 1 win 119 20:19:49.267559 IP B > A: . ack 167961 win 1212 20:19:49.268394 IP B > A: . ack 169481 win 1212 20:19:49.269232 IP B > A: . ack 171001 win 1212 20:19:49.269241 IP A > B: . 215841:221161(5320) ack 1 win 119 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Van Jacobson <vanj@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9b44190dc114c1720b34975b5bfc65aece112ced |
|
20-Mar-2013 |
Yuchung Cheng <ycheng@google.com> |
tcp: refactor F-RTO The patch series refactor the F-RTO feature (RFC4138/5682). This is to simplify the loss recovery processing. Existing F-RTO was developed during the experimental stage (RFC4138) and has many experimental features. It takes a separate code path from the traditional timeout processing by overloading CA_Disorder instead of using CA_Loss state. This complicates CA_Disorder state handling because it's also used for handling dubious ACKs and undos. While the algorithm in the RFC does not change the congestion control, the implementation intercepts congestion control in various places (e.g., frto_cwnd in tcp_ack()). The new code implements newer F-RTO RFC5682 using CA_Loss processing path. F-RTO becomes a small extension in the timeout processing and interfaces with congestion control and Eifel undo modules. It lets congestion control (module) determines how many to send independently. F-RTO only chooses what to send in order to detect spurious retranmission. If timeout is found spurious it invokes existing Eifel undo algorithms like DSACK or TCP timestamp based detection. The first patch removes all F-RTO code except the sysctl_tcp_frto is left for the new implementation. Since CA_EVENT_FRTO is removed, TCP westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1a2c6181c4a1922021b4d7df373bba612c3e5f04 |
|
17-Mar-2013 |
Christoph Paasch <christoph.paasch@uclouvain.be> |
tcp: Remove TCPCT TCPCT uses option-number 253, reserved for experimental use and should not be used in production environments. Further, TCPCT does not fully implement RFC 6013. As a nice side-effect, removing TCPCT increases TCP's performance for very short flows: Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests for files of 1KB size. before this patch: average (among 7 runs) of 20845.5 Requests/Second after: average (among 7 runs) of 21403.6 Requests/Second Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
16fad69cfe4adbbfa813de516757b87bcae36d93 |
|
14-Mar-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: fix skb_availroom() Chrome OS team reported a crash on a Pixel ChromeBook in TCP stack : https://code.google.com/p/chromium/issues/detail?id=182056 commit a21d45726acac (tcp: avoid order-1 allocations on wifi and tx path) did a poor choice adding an 'avail_size' field to skb, while what we really needed was a 'reserved_tailroom' one. It would have avoided commit 22b4a4f22da (tcp: fix retransmit of partially acked frames) and this commit. Crash occurs because skb_split() is not aware of the 'avail_size' management (and should not be aware) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mukesh Agrawal <quiche@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9b717a8d245075ffb8e95a2dfb4ee97ce4747457 |
|
11-Mar-2013 |
Nandita Dukkipati <nanditad@google.com> |
tcp: TLP loss detection. This is the second of the TLP patch series; it augments the basic TLP algorithm with a loss detection scheme. This patch implements a mechanism for loss detection when a Tail loss probe retransmission plugs a hole thereby masking packet loss from the sender. The loss detection algorithm relies on counting TLP dupacks as outlined in Sec. 3 of: http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01 The basic idea is: Sender keeps track of TLP "episode" upon retransmission of a TLP packet. An episode ends when the sender receives an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the episode. We want to make sure that before the episode ends the sender receives a "TLP dupack", indicating that the TLP retransmission was unnecessary, so there was no loss/hole that needed plugging. If the sender gets no TLP dupack before the end of the episode, then it reduces ssthresh and the congestion window, because the TLP packet arriving at the receiver probably plugged a hole. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6ba8a3b19e764b6a65e4030ab0999be50c291e6c |
|
11-Mar-2013 |
Nandita Dukkipati <nanditad@google.com> |
tcp: Tail loss probe (TLP) This patch series implement the Tail loss probe (TLP) algorithm described in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The first patch implements the basic algorithm. TLP's goal is to reduce tail latency of short transactions. It achieves this by converting retransmission timeouts (RTOs) occuring due to tail losses (losses at end of transactions) into fast recovery. TLP transmits one packet in two round-trips when a connection is in Open state and isn't receiving any ACKs. The transmitted packet, aka loss probe, can be either new or a retransmission. When there is tail loss, the ACK from a loss probe triggers FACK/early-retransmit based fast recovery, thus avoiding a costly RTO. In the absence of loss, there is no change in the connection state. PTO stands for probe timeout. It is a timer event indicating that an ACK is overdue and triggers a loss probe packet. The PTO value is set to max(2*SRTT, 10ms) and is adjusted to account for delayed ACK timer when there is only one oustanding packet. TLP Algorithm On transmission of new data in Open state: -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms). -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms) -> PTO = min(PTO, RTO) Conditions for scheduling PTO: -> Connection is in Open state. -> Connection is either cwnd limited or no new data to send. -> Number of probes per tail loss episode is limited to one. -> Connection is SACK enabled. When PTO fires: new_segment_exists: -> transmit new segment. -> packets_out++. cwnd remains same. no_new_packet: -> retransmit the last segment. Its ACK triggers FACK or early retransmit based recovery. ACK path: -> rearm RTO at start of ACK processing. -> reschedule PTO if need be. In addition, the patch includes a small variation to the Early Retransmit (ER) algorithm, such that ER and TLP together can in principle recover any N-degree of tail loss through fast recovery. TLP is controlled by the same sysctl as ER, tcp_early_retrans sysctl. tcp_early_retrans==0; disables TLP and ER. ==1; enables RFC5827 ER. ==2; delayed ER. ==3; TLP and delayed ER. [DEFAULT] ==4; TLP only. The TLP patch series have been extensively tested on Google Web servers. It is most effective for short Web trasactions, where it reduced RTOs by 15% and improved HTTP response time (average by 6%, 99th percentile by 10%). The transmitted probes account for <0.5% of the overall transmissions. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1b63edd6ecc55c3a61b40297b49e2323783bddfd |
|
22-Feb-2013 |
Yuchung Cheng <ycheng@google.com> |
tcp: fix SYN-data space mis-accounting In fast open the sender unncessarily reduces the space available for data in SYN by 12 bytes. This is because in the sender incorrectly reserves space for TS option twice in tcp_send_syn_data(): tcp_mtu_to_mss() already accounts for TS option space. But it further reserves MAX_TCP_OPTION_SPACE when computing the payload space. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
14bbd6a565e1bcdc240d44687edb93f721cfdf99 |
|
14-Feb-2013 |
Pravin B Shelar <pshelar@nicira.com> |
net: Add skb_unclone() helper function. This function will be used in next GRE_GSO patch. This patch does not change any functionality. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Eric Dumazet <edumazet@google.com>
|
c9af6db4c11ccc6c3e7f19bbc15d54023956f97c |
|
11-Feb-2013 |
Pravin B Shelar <pshelar@nicira.com> |
net: Fix possible wrong checksum generation. Patch cef401de7be8c4e (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ee684b6f2830047d19877e5547989740f18b1a5d |
|
11-Feb-2013 |
Andrey Vagin <avagin@openvz.org> |
tcp: send packets with a socket timestamp A socket timestamp is a sum of the global tcp_time_stamp and a per-socket offset. A socket offset is added in places where externally visible tcp timestamp option is parsed/initialized. Connections in the SYN_RECV state are not supported, global tcp_time_stamp is used for them, because repair mode doesn't support this state. In a future it can be implemented by the similar way as for TIME_WAIT sockets. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cef401de7be8c4e155c6746bfccf721a4fa5fab9 |
|
25-Jan-2013 |
Eric Dumazet <edumazet@google.com> |
net: fix possible wrong checksum generation Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5d134f1c1f36166e8a738de92c4d2f4c262ff91b |
|
05-Jan-2013 |
Hannes Frederic Sowa <hannes@stressinduktion.org> |
tcp: make sysctl_tcp_ecn namespace aware As per suggestion from Eric Dumazet this patch makes tcp_ecn sysctl namespace aware. The reason behind this patch is to ease the testing of ecn problems on the internet and allows applications to tune their own use of ecn. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
93b174ad71b08e504c2cf6e8a58ecce778b77a40 |
|
06-Dec-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: bug fix Fast Open client retransmission If SYN-ACK partially acks SYN-data, the client retransmits the remaining data by tcp_retransmit_skb(). This increments lost recovery state variables like tp->retrans_out in Open state. If loss recovery happens before the retransmission is acked, it triggers the WARN_ON check in tcp_fastretrans_alert(). For example: the client sends SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends another 4 data packets and get 3 dupacks. Since the retransmission is not caused by network drop it should not update the recovery state variables. Further the server may return a smaller MSS than the cached MSS used for SYN-data, so the retranmission needs a loop. Otherwise some data will not be retransmitted until timeout or other loss recovery events. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2b9164771efe191c4ef266ae53c8c05ab92dd115 |
|
22-Nov-2012 |
Andrey Vagin <avagin@openvz.org> |
ipv6: adapt connect for repair move This is work the same as for ipv4. All other hacks about tcp repair are in common code for ipv4 and ipv6, so this patch is enough for repairing ipv6 connections. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ec34232575083fd0f43d3a101e8ebb041b203761 |
|
15-Nov-2012 |
Andrew Vagin <avagin@openvz.org> |
tcp: fix retransmission in repair mode Currently if a socket was repaired with a few packet in a write queue, a kernel bug may be triggered: kernel BUG at net/ipv4/tcp_output.c:2330! RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610 According to the initial realization v3.4-rc2-963-gc0e88ff, all skb-s should look like already posted. This patch fixes code according with this sentence. Here are three points, which were not done in the initial patch: 1. A tcp send head should not be changed 2. Initialize TSO state of a skb 3. Reset the retransmission time This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet passes the ussual way, but isn't sent to network. This patch solves all described problems and handles tcp_sendpages. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
684bad1107571d35610a674c61b3544efb5a5b13 |
|
02-Sep-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: use PRR to reduce cwin in CWR state Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state, in addition to Recovery state. Retire the current rate-halving in CWR. When losses are detected via ACKs in CWR state, the sender enters Recovery state but the cwnd reduction continues and does not restart. Rename and refactor cwnd reduction functions since both CWR and Recovery use the same algorithm: tcp_init_cwnd_reduction() is new and initiates reduction state variables. tcp_cwnd_reduction() is previously tcp_update_cwnd_in_recovery(). tcp_ends_cwnd_reduction() is previously tcp_complete_cwr(). The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(), and the cwnd moderation inside tcp_enter_cwr() are removed. The unused parameter, flag, in tcp_cwnd_reduction() is also removed. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8336886f786fdacbc19b719c1f7ea91eb70706d4 |
|
31-Aug-2012 |
Jerry Chu <hkchu@google.com> |
tcp: TCP Fast Open Server - support TFO listeners This patch builds on top of the previous patch to add the support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before 3WHS finishes 5. supporting TCP_FASTOPEN socket option 6. modifying tcp_check_req() to use to check a TFO socket as well as request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that TFO server side will not retransmit anything other than SYN-ACK until the 3WHS is completed. The patch also contains an important function "reqsk_fastopen_remove()" to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for the detail. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
144d56e91044181ec0ef67aeca91e9a8b5718348 |
|
20-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: fix possible socket refcount problem Commit 6f458dfb40 (tcp: improve latencies of timer triggered events) added bug leading to following trace : [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.131726] [ 2866.132188] ========================= [ 2866.132281] [ BUG: held lock freed! ] [ 2866.132281] 3.6.0-rc1+ #622 Not tainted [ 2866.132281] ------------------------- [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there! [ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] 4 locks held by kworker/0:1/652: [ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f [ 2866.132281] [ 2866.132281] stack backtrace: [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622 [ 2866.132281] Call Trace: [ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159 [ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a [ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e [ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56 [ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f [ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f [ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85 [ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148 [ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9 [ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e [ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30 [ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6 [ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad [ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4 [ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f [ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1 [ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56 [ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9 [ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26 [ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4 [ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70 [ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4 [ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1 [ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a [ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6 [ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5 [ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4 [ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5 [ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f [ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43 [ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80 [ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0 [ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61 [ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1 [ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e [ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225 [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34 [ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f [ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f [ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225 [ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317 [ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f [ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2 [ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10 [ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13 [ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a [ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13 [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.309689] ============================================================================= [ 2866.310254] BUG TCP (Not tainted): Object already free [ 2866.310254] ----------------------------------------------------------------------------- [ 2866.310254] The bug comes from the fact that timer set in sk_reset_timer() can run before we actually do the sock_hold(). socket refcount reaches zero and we free the socket too soon. timer handler is not allowed to reduce socket refcnt if socket is owned by the user, or we need to change sk_reset_timer() implementation. We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED was used instead of TCP_DELACK_TIMER_DEFERRED. For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED, even if not fired from a timer. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8e7dfbc8d1ea9ca9058aa641a8fe795ebca320e2 |
|
04-Aug-2012 |
Silviu-Mihai Popescu <silviupopescu1990@gmail.com> |
tcp_output: fix sparse warning for tcp_wfree Fix sparse warning: * symbol 'tcp_wfree' was not declared. Should it be static? Signed-off-by: Silviu-Mihai Popescu <silviupopescu1990@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1485348d2424e1131ea42efc033cbd9366462b01 |
|
30-Jul-2012 |
Ben Hutchings <bhutchings@solarflare.com> |
tcp: Apply device TSO segment limit earlier Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to limit the size of TSO skbs. This avoids the need to fall back to software GSO for local TCP senders. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
99a1dec70d5acbd8c6b3928cdebb4a2d1da676c8 |
|
01-Aug-2012 |
Mel Gorman <mgorman@suse.de> |
net: introduce sk_gfp_atomic() to allow addition of GFP flags depending on the individual socket Introduce sk_gfp_atomic(), this function allows to inject sock specific flags to each sock related allocation. It is only used on allocation paths that may be required for writing pages back to network storage. [davem@davemloft.net: Use sk_gfp_atomic only when necessary] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
563d34d05786263893ba4a1042eb9b9374127cf5 |
|
23-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: dont drop MTU reduction indications ICMP messages generated in output path if frame length is bigger than mtu are actually lost because socket is owned by user (doing the xmit) One example is the ipgre_tunnel_xmit() calling icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on sockets hitting dst_allfrag). Problem of such fix is that it relied on retransmit timers, so short tcp sessions paid a too big latency increase price. This patch uses the tcp_release_cb() infrastructure so that MTU reduction messages (ICMP messages) are not lost, and no extra delay is added in TCP transmits. Reported-by: Maciej Żenczykowski <maze@google.com> Diagnosed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6f458dfb409272082c9bfa412f77ff2fc21c626f |
|
20-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: improve latencies of timer triggered events Modern TCP stack highly depends on tcp_write_timer() having a small latency, but current implementation doesn't exactly meet the expectations. When a timer fires but finds the socket is owned by the user, it rearms itself for an additional delay hoping next run will be more successful. tcp_write_timer() for example uses a 50ms delay for next try, and it defeats many attempts to get predictable TCP behavior in term of latencies. Use the recently introduced tcp_release_cb(), so that the user owning the socket will call various handlers right before socket release. This will permit us to post a followup patch to address the tcp_tso_should_defer() syndrome (some deferred packets have to wait RTO timer to be transmitted, while cwnd should allow us to send them sooner) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: John Heffner <johnwheffner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
67da22d23fa6f3324e03bcd0580b914b2e4afbf3 |
|
19-Jul-2012 |
Yuchung Cheng <ycheng@google.com> |
net-tcp: Fast Open client - cookie-less mode In trusted networks, e.g., intranet, data-center, the client does not need to use Fast Open cookie to mitigate DoS attacks. In cookie-less mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless of cookie availability. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aab4874355679c70f93993cf3b3fd74643b9ac33 |
|
19-Jul-2012 |
Yuchung Cheng <ycheng@google.com> |
net-tcp: Fast Open client - detecting SYN-data drops On paths with firewalls dropping SYN with data or experimental TCP options, Fast Open connections will have experience SYN timeout and bad performance. The solution is to track such incidents in the cookie cache and disables Fast Open temporarily. Since only the original SYN includes data and/or Fast Open option, the SYN-ACK has some tell-tale sign (tcp_rcv_fastopen_synack()) to detect such drops. If a path has recurring Fast Open SYN drops, Fast Open is disabled for 2^(recurring_losses) minutes starting from four minutes up to roughly one and half day. sendmsg with MSG_FASTOPEN flag will succeed but it behaves as connect() then write(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
783237e8daf13481ee234997cbbbb823872ac388 |
|
19-Jul-2012 |
Yuchung Cheng <ycheng@google.com> |
net-tcp: Fast Open client - sending SYN-data This patch implements sending SYN-data in tcp_connect(). The data is from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch). The length of the cookie in tcp_fastopen_req, init'd to 0, controls the type of the SYN. If the cookie is not cached (len==0), the host sends data-less SYN with Fast Open cookie request option to solicit a cookie from the remote. If cookie is not available (len > 0), the host sends a SYN-data with Fast Open cookie option. If cookie length is negative, the SYN will not include any Fast Open option (for fall back operations). To deal with middleboxes that may drop SYN with data or experimental TCP option, the SYN-data is only sent once. SYN retransmits do not include data or Fast Open options. The connection will fall back to regular TCP handshake. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2100c8d2d9db23c0a09901a782bb4e3b21bee298 |
|
19-Jul-2012 |
Yuchung Cheng <ycheng@google.com> |
net-tcp: Fast Open base This patch impelements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experiment option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16 bits magic number 0xF989. This enables global experiments without clashing the scarce(2) experimental options available for TCP. When the draft status becomes standard (maybe), the client should switch to the new option number assigned while the server supports both numbers for transistion. 2. The new sysctl tcp_fastopen 3. A place holder init function Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d01cb20711e3c2df41677ee270d6bdeff24e9902 |
|
13-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: add LAST_ACK as a valid state for TSQ Socket state LAST_ACK should allow TSQ to send additional frames, or else we rely on incoming ACKS or timers to send them. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
46d3ceabd8d98ed0ad10f20c595ca784e34786c5 |
|
11-Jul-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: TCP Small Queues This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4aea39c11c610e411768649fdc04777903ebfe07 |
|
03-Jun-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: tcp_make_synack() consumes dst parameter tcp_make_synack() clones the dst, and callers release it. We can avoid two atomic operations per SYNACK if tcp_make_synack() consumes dst instead of cloning it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
90ba9b1986b5ac4b2d184575847147ea7c4280a2 |
|
03-Jun-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: tcp_make_synack() can use alloc_skb() There is no value using sock_wmalloc() in tcp_make_synack(). A listener socket only sends SYNACK packets, they are not queued in a socket queue, only in Qdisc and device layers, so the number of in flight packets is limited in these layers. We used sock_wmalloc() with the %force parameter set to 1 to ignore socket limits anyway. This patch removes two atomic operations per SYNACK packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a2a385d627e1549da4b43a8b3dfe370589766e1c |
|
17-May-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: bool conversions bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
91df42bedccb919902c7cf7eb876c982ae7f1b1d |
|
15-May-2012 |
Joe Perches <joe@perches.com> |
net: ipv4 and ipv6: Convert printk(KERN_DEBUG to pr_debug Use the current debugging style and enable dynamic_debug. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e87cc4728f0e2fb663e592a1141742b1d6c63256 |
|
13-May-2012 |
Joe Perches <joe@perches.com> |
net: Convert net_ratelimit uses to net_<level>_ratelimited Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
750ea2bafa55aaed208b2583470ecd7122225634 |
|
02-May-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: early retransmit: delayed fast retransmit Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
67469601406c12ced3db9956aeb0ef0854e2952f |
|
24-Apr-2012 |
Eric Dumazet <edumazet@google.com> |
ipv6: RTAX_FEATURE_ALLFRAG causes inefficient TCP segment sizing Quoting Tore Anderson from : https://bugzilla.kernel.org/show_bug.cgi?id=42572 When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment size does not take into account the size of the IPv6 Fragmentation header that needs to be included in outbound packets, causing every transmitted TCP segment to be fragmented across two IPv6 packets, the latter of which will only contain 8 bytes of actual payload. RTAX_FEATURE_ALLFRAG is typically set on a route in response to receving a ICMPv6 Packet Too Big message indicating a Path MTU of less than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6 PTBs with MTU < 1280 are still valid, in particular when an IPv6 packet is sent to an IPv4 destination through a stateless translator. Any ICMPv4 Need To Fragment packets originated from the IPv4 part of the path will be translated to ICMPv6 PTB which may then indicate an MTU of less than 1280. The Linux kernel refuses to reduce the effective MTU to anything below 1280 bytes, instead it sets it to exactly 1280 bytes, and RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header), instead of 1232 (additionally taking into account the 8 bytes required by the IPv6 Fragmentation extension header). This in turn results in rather inefficient transmission, as every transmitted TCP segment now is split in two fragments containing 1232+8 bytes of payload. After this patch, all the outgoing packets that includes a Fragmentation header all are "atomic" or "non-fragmented" fragments, i.e., they both have Offset=0 and More Fragments=0. With help from David S. Miller Reported-by: Tore Anderson <tore@fud.no> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Tested-by: Tore Anderson <tore@fud.no> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c0e88ff0f256958401778ff692da4b8891acb5a9 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Repair socket queues Reading queues under repair mode is done with recvmsg call. The queue-under-repair set by TCP_REPAIR_QUEUE option is used to determine which queue should be read. Thus both send and receive queue can be read with this. Caller must pass the MSG_PEEK flag. Writing to queues is done with sendmsg call and yet again -- the repair-queue option can be used to push data into the receive queue. When putting an skb into receive queue a zero tcp header is appented to its head to address the tcp_hdr(skb)->syn and the ->fin checks by the (after repair) tcp_recvmsg. These flags flags are both set to zero and that's why. The fin cannot be met in the queue while reading the source socket, since the repair only works for closed/established sockets and queueing fin packet always changes its state. The syn in the queue denotes that the respective skb's seq is "off-by-one" as compared to the actual payload lenght. Thus, at the rcv queue refill we can just drop this flag and set the skb's sequences to precice values. When the repair mode is turned off, the write queue seqs are updated so that the whole queue is considered to be 'already sent, waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the protocol POV the send queue looks like it was sent, but the data between the write_seq and snd_nxt is lost in the network. This helps to avoid another sockoption for setting the snd_nxt sequence. Leaving the whole queue in a 'not yet sent' state (as it will be after sendmsg-s) will not allow to receive any acks from the peer since the ack_seq will be after the snd_nxt. Thus even the ack for the window probe will be dropped and the connection will be 'locked' with the zero peer window. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ee9952831cfd0bbe834f4a26489d7dce74582e37 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Initial repair mode This includes (according the the previous description): * TCP_REPAIR sockoption This one just puts the socket in/out of the repair mode. Allowed for CAP_NET_ADMIN and for closed/establised sockets only. When repair mode is turned off and the socket happens to be in the established state the window probe is sent to the peer to 'unlock' the connection. * TCP_REPAIR_QUEUE sockoption This one sets the queue which we're about to repair. The 'no-queue' is set by default. * TCP_QUEUE_SEQ socoption Sets the write_seq/rcv_nxt of a selected repaired queue. Allowed for TCP_CLOSE-d sockets only. When the socket changes its state the other seq-s are changed by the kernel according to the protocol rules (most of the existing code is actually reused). * Ability to forcibly bind a socket to a port The sk->sk_reuse is set to SK_FORCE_REUSE. * Immediate connect modification The connect syscall initializes the connection, then directly jumps to the code which finalizes it. * Silent close modification The close just aborts the connection (similar to SO_LINGER with 0 time) but without sending any FIN/RST-s to peer. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
370816aef0c5436c2adbec3966038f36ca326933 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Move code around This is just the preparation patch, which makes the needed for TCP repair code ready for use. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
22b4a4f22da4b39c6f7f679fd35f3d35c91bf851 |
|
18-Apr-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: fix retransmit of partially acked frames Alexander Beregalov reported skb_over_panic errors and provided stack trace. I occurs commit a21d45726aca (tcp: avoid order-1 allocations on wifi and tx path) added a regression, when a retransmit is done after a partial ACK. tcp_retransmit_skb() tries to aggregate several frames if the first one has enough available room to hold the following ones payload. This is controlled by /proc/sys/net/ipv4/tcp_retrans_collapse tunable (default : enabled) Problem is we must make sure _pskb_trim_head() doesnt fool skb_availroom() when pulling some bytes from skb (this pull is done when receiver ACK part of the frame). Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Cc: Marc MERLIN <marc@merlins.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
95c961747284a6b83a5e2d81240e214b0fa3464d |
|
15-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a21d45726acacc963d8baddf74607d9b74e2b723 |
|
10-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: avoid order-1 allocations on wifi and tx path Marc Merlin reported many order-1 allocations failures in TX path on its wireless setup, that dont make any sense with MTU=1500 network, and non SG capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and used as a convention skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non SG capable devices) Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048) Later, mac80211 layer need to add some bytes at the tail of skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is available has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and use a new skb field, avail_size to hold the data payload limit. This way, order-0 allocations done by TCP stack can leave more than 2 KB of tailroom and no more allocation is performed in mac80211 layer (or any layer needing some tailroom) avail_size is unioned with mark/dropcount, since mark will be set later in IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5b35e1e6e9ca651e6b291c96d1106043c9af314a |
|
28-Jan-2012 |
Neal Cardwell <ncardwell@google.com> |
tcp: fix tcp_trim_head() to adjust segment count with skb MSS This commit fixes tcp_trim_head() to recalculate the number of segments in the skb with the skb's existing MSS, so trimming the head causes the skb segment count to be monotonically non-increasing - it should stay the same or go down, but not increase. Previously tcp_trim_head() used the current MSS of the connection. But if there was a decrease in MSS between original transmission and ACK (e.g. due to PMTUD), this could cause tcp_trim_head() to counter-intuitively increase the segment count when trimming bytes off the head of an skb. This violated assumptions in tcp_tso_acked() that tcp_trim_head() only decreases the packet count, so that packets_acked in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to pass u32 pkts_acked values as large as 0xffffffff to ca_ops->pkts_acked(). As an aside, if tcp_trim_head() had really wanted the skb to reflect the current MSS, it should have called tcp_set_skb_tso_segs() unconditionally, since a decrease in MSS would mean that a single-packet skb should now be sliced into multiple segments. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
09e9b813d34d9a09d64a64580a9959d8bae1f4f5 |
|
25-Jan-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: add LINUX_MIB_TCPRETRANSFAIL counter It might be useful to get a counter of failed tcp_retransmit_skb() calls. Reported-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
180d8cd942ce336b2c869d324855c40c5db478ad |
|
11-Dec-2011 |
Glauber Costa <glommer@parallels.com> |
foundations of per-cgroup memory pressure controlling. This patch replaces all uses of struct sock fields' memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem to acessor macros. Those macros can either receive a socket argument, or a mem_cgroup argument, depending on the context they live in. Since we're only doing a macro wrapping here, no performance impact at all is expected in the case where we don't have cgroups disabled. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4fa48bf3c75069d636fc8830743c929a062e80dc |
|
04-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: fix tcp_trim_head() commit f07d960df3 (tcp: avoid frag allocation for small frames) breaked assumption in tcp stack that skb is either linear (skb->data_len == 0), or fully fragged (skb->data_len == skb->len) tcp_trim_head() made this assumption, we must fix it. Thanks to Vijay for providing a very detailed explanation. Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
117632e64d2a5f464e491fe221d7169a3814a77b |
|
03-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: take care of misalignments We discovered that TCP stack could retransmit misaligned skbs if a malicious peer acknowledged sub MSS frame. This currently can happen only if output interface is non SG enabled : If SG is enabled, tcp builds headless skbs (all payload is included in fragments), so the tcp trimming process only removes parts of skb fragments, header stay aligned. Some arches cant handle misalignments, so force a head reallocation and shrink headroom to MAX_TCP_HEADER. Dont care about misaligments on x86 and PPC (or other arches setting NET_IP_ALIGN to 0) This patch introduces __pskb_copy() which can specify the headroom of new head, and pskb_copy() becomes a wrapper on top of __pskb_copy() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6b5a5c0dbb11dcff4e1b0f1ef87a723197948ed4 |
|
21-Nov-2011 |
Neal Cardwell <ncardwell@google.com> |
tcp: do not scale TSO segment size with reordering degree Since 2005 (c1b4a7e69576d65efc31a8cea0714173c2841244) tcp_tso_should_defer has been using tcp_max_burst() as a target limit for deciding how large to make outgoing TSO packets when not using sysctl_tcp_tso_win_divisor. But since 2008 (dd9e0dda66ba38a2ddd1405ac279894260dc5c36) tcp_max_burst() returns the reordering degree. We should not have tcp_tso_should_defer attempt to build larger segments just because there is more reordering. This commit splits the notion of deferral size used in TSO from the notion of burst size used in cwnd moderation, and returns the TSO deferral limit to its original value. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6d67e9beb68bc6712b042c9723cecc87dc8e3f4c |
|
05-Nov-2011 |
Feng King <kinwin2008@gmail.com> |
tcp: Fix comments for Nagle algorithm TCP_NODELAY is weaker than TCP_CORK, when TCP_CORK was set, small segments will always pass Nagle test regardless of TCP_NODELAY option. Signed-off-by: Feng King <kinwin2008@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cf533ea53ebfae41be15b103d78e7ebec30b9969 |
|
21-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: add const qualifiers where possible Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporary write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9e903e085262ffbf1fc44a17ac06058aca03524a |
|
18-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add skb frag size accessors To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4de075e0438ba54b8f42cbbc1263d404229dc997 |
|
27-Sep-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: rename tcp_skb_cb flags Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and maintenance. Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a262f0cdf1f2916ea918dc329492abb5323d9a6c |
|
21-Aug-2011 |
Nandita Dukkipati <nanditad@google.com> |
Proportional Rate Reduction for TCP. This patch implements Proportional Rate Reduction (PRR) for TCP. PRR is an algorithm that determines TCP's sending rate in fast recovery. PRR avoids excessive window reductions and aims for the actual congestion window size at the end of recovery to be as close as possible to the window determined by the congestion control algorithm. PRR also improves accuracy of the amount of data sent during loss recovery. The patch implements the recommended flavor of PRR called PRR-SSRB (Proportional rate reduction with slow start reduction bound) and replaces the existing rate halving algorithm. PRR improves upon the existing Linux fast recovery under a number of conditions including: 1) burst losses where the losses implicitly reduce the amount of outstanding data (pipe) below the ssthresh value selected by the congestion control algorithm and, 2) losses near the end of short flows where application runs out of data to send. As an example, with the existing rate halving implementation a single loss event can cause a connection carrying short Web transactions to go into the slow start mode after the recovery. This is because during recovery Linux pulls the congestion window down to packets_in_flight+1 on every ACK. A short Web response often runs out of new data to send and its pipe reduces to zero by the end of recovery when all its packets are drained from the network. Subsequent HTTP responses using the same connection will have to slow start to raise cwnd to ssthresh. PRR on the other hand aims for the cwnd to be as close as possible to ssthresh by the end of recovery. A description of PRR and a discussion of its performance can be found at the following links: - IETF Draft: http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01 - IETF Slides: http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf - Paper to appear in Internet Measurements Conference (IMC) 2011: Improving TCP Loss Recovery Nandita Dukkipati, Matt Mathis, Yuchung Cheng Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aff65da0f1be5daec44231972b6b5fc45bfa7a58 |
|
23-Aug-2011 |
Ian Campbell <Ian.Campbell@citrix.com> |
net: ipv4: convert to SKB frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
d9d8da805dcb503ef8ee49918a94d49085060f23 |
|
07-May-2011 |
David S. Miller <davem@davemloft.net> |
inet: Pass flowi to ->queue_xmit(). This allows us to acquire the exact route keying information from the protocol, however that might be managed. It handles all of the possibilities, from the simplest case of storing the key in inet->cork.fl to the more complex setup SCTP has where individual transports determine the flow. Signed-off-by: David S. Miller <davem@davemloft.net>
|
2fceec13375e5d98ef033c6b0ee03943fc460950 |
|
02-Apr-2011 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: len check is unnecessarily devastating, change to WARN_ON All callers are prepared for alloc failures anyway, so this error can safely be boomeranged to the callers domain without super bad consequences. ...At worst the connection might go into a state where each RTO tries to (unsuccessfully) re-fragment with such a mis-sized value and eventually dies. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
25985edcedea6396277003854657b5f3cb31a628 |
|
31-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
c24f691b56107feeba076616982093ee2d3c8fb5 |
|
07-Feb-2011 |
Yuchung Cheng <ycheng@google.com> |
tcp: undo_retrans counter fixes Fix a bug that undo_retrans is incorrectly decremented when undo_marker is not set or undo_retrans is already 0. This happens when sender receives more DSACK ACKs than packets retransmitted during the current undo phase. This may also happen when sender receives DSACK after the undo operation is completed or cancelled. Fix another bug that undo_retrans is incorrectly incremented when sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case is rare but not impossible. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
356f039822b8d802138f7121c80d2a9286976dbd |
|
20-Dec-2010 |
Nandita Dukkipati <nanditad@google.com> |
TCP: increase default initial receive window. This patch changes the default initial receive window to 10 mss (defined constant). The default window is limited to the maximum of 10*1460 and 2*mss (when mss > 1460). draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends increasing TCP's initial congestion window to 10 mss or about 15KB. Leading up to this proposal were several large-scale live Internet experiments with an initial congestion window of 10 mss (IW10), where we showed that the average latency of HTTP responses improved by approximately 10%. This was accompanied by a slight increase in retransmission rate (0.5%), most of which is coming from applications opening multiple simultaneous connections. To understand the extreme worst case scenarios, and fairness issues (IW10 versus IW3), we further conducted controlled testbed experiments. We came away finding minimal negative impact even under low link bandwidths (dial-ups) and small buffers. These results are extremely encouraging to adopting IW10. However, an initial congestion window of 10 mss is useless unless a TCP receiver advertises an initial receive window of at least 10 mss. Fortunately, in the large-scale Internet experiments we found that most widely used operating systems advertised large initial receive windows of 64KB, allowing us to experiment with a wide range of initial congestion windows. Linux systems were among the few exceptions that advertised a small receive window of 6KB. The purpose of this patch is to fix this shortcoming. References: 1. A comprehensive list of all IW10 references to date. http://code.google.com/speed/protocols/tcpm-IW10.html 2. Paper describing results from large-scale Internet experiments with IW10. http://ccr.sigcomm.org/drupal/?q=node/621 3. Controlled testbed experiments under worst case scenarios and a fairness study. http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf 4. Raw test data from testbed experiments (Linux senders/receivers) with initial congestion and receive windows of both 10 mss. http://research.csc.ncsu.edu/netsrv/?q=content/iw10 5. Internet-Draft. Increasing TCP's Initial Window. https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/ Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0dbaee3b37e118a96bb7b8eb0d9bbaeeb46264be |
|
13-Dec-2010 |
David S. Miller <davem@davemloft.net> |
net: Abstract default ADVMSS behind an accessor. Make all RTAX_ADVMSS metric accesses go through a new helper function, dst_metric_advmss(). Leave the actual default metric as "zero" in the real metric slot, and compute the actual default value dynamically via a new dst_ops AF specific callback. For stacked IPSEC routes, we use the advmss of the path which preserves existing behavior. Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates advmss on pmtu updates. This inconsistency in advmss handling results in more raw metric accesses than I wish we ended up with. Signed-off-by: David S. Miller <davem@davemloft.net>
|
f19872575ff7819a3723154657a497d9bca66b33 |
|
07-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: protect sysctl_tcp_cookie_size reads Make sure sysctl_tcp_cookie_size is read once in tcp_cookie_size_check(), or we might return an illegal value to caller if sysctl_tcp_cookie_size is changed by another cpu. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: William Allen Simpson <william.allen.simpson@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ad9f4f50fe9288bbe65b7dfd76d8820afac6a24c |
|
07-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: avoid a possible divide by zero sysctl_tcp_tso_win_divisor might be set to zero while one cpu runs in tcp_tso_should_defer(). Make sure we dont allow a divide by zero by reading sysctl_tcp_tso_win_divisor exactly once. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b1afde60f2b9ee8444fba4e012dc99a3b28d224d |
|
03-Dec-2010 |
Nandita Dukkipati <nanditad@google.com> |
tcp: Bug fix in initialization of receive window. The bug has to do with boundary checks on the initial receive window. If the initial receive window falls between init_cwnd and the receive window specified by the user, the initial window is incorrectly brought down to init_cwnd. The correct behavior is to allow it to remain unchanged. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
97b1ce25e8fc27f74703537ec09d4996c7a6e38a |
|
01-Dec-2010 |
Shan Wei <shanwei@cn.fujitsu.com> |
tcp: use TCP_BASE_MSS to set basic mss value TCP_BASE_MSS is defined, but not used. commit 5d424d5a introduce this macro, so use it to initial sysctl_tcp_base_mss. commit 5d424d5a674f782d0659a3b66d951f412901faee Author: John Heffner <jheffner@psc.edu> Date: Mon Mar 20 17:53:41 2006 -0800 [TCP]: MTU probing Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3853b5841c01a3f492fe137afaad9c209e5162c6 |
|
21-Nov-2010 |
Tom Herbert <therbert@google.com> |
xps: Improvements in TX queue selection In dev_pick_tx, don't do work in calculating queue index or setting the index in the sock unless the device has more than one queue. This allows the sock to be set only with a queue index of a multi-queue device which is desirable if device are stacked like in a tunnel. We also allow the mapping of a socket to queue to be changed. To maintain in order packet transmission a flag (ooo_okay) has been added to the sk_buff structure. If a transport layer sets this flag on a packet, the transmit queue can be changed for the socket. Presumably, the transport would set this if there was no possbility of creating OOO packets (for instance, there are no packets in flight for the socket). This patch includes the modification in TCP output for setting this flag. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ee58681195bf243bafc44ca53f3c24429d096cce |
|
16-Nov-2010 |
Eric Paris <eparis@redhat.com> |
network: tcp_connect should return certain errors up the stack The current tcp_connect code completely ignores errors from sending an skb. This makes sense in many situations (like -ENOBUFFS) but I want to be able to immediately fail connections if they are denied by the SELinux netfilter hook. Netfilter does not normally return ECONNREFUSED when it drops a packet so we respect that error code as a final and fatal error that can not be recovered. Based-on-patch-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b595076a180a56d1bb170e6eceda6eb9d76f4cd3 |
|
01-Nov-2010 |
Uwe Kleine-König <u.kleine-koenig@pengutronix.de> |
tree-wide: fix comment/printk typos "gadget", "through", "command", "maintain", "maintain", "controller", "address", "between", "initiali[zs]e", "instead", "function", "select", "already", "equal", "access", "management", "hierarchy", "registration", "interest", "relative", "memory", "offset", "already", Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
a02cec2155fbea457eca8881870fd2de1a4c4c76 |
|
22-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0705c6f0e2d39333645bf77cf1efb94526ff1f82 |
|
01-Sep-2010 |
Gerrit Renker <gerrit@erg.abdn.ac.uk> |
tcp: update also tcp_output with regard to RFC 5681 Thanks to Ilpo Jarvinen, this updates also the initial window setting for tcp_output with regard to RFC 5681. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e88c64f0a42575e01c7ace903d0570bc0b7fcf85 |
|
19-Aug-2010 |
Hagen Paul Pfeifer <hagen@jauu.net> |
tcp: allow effective reduction of TCP's rcv-buffer via setsockopt Via setsockopt it is possible to reduce the socket RX buffer (SO_RCVBUF). TCP method to select the initial window and window scaling option in tcp_select_initial_window() currently misbehaves and do not consider a reduced RX socket buffer via setsockopt. Even though the server's RX buffer is reduced via setsockopt() to 256 byte (Initial Window 384 byte => 256 * 2 - (256 * 2 / 4)) the window scale option is still 7: 192.168.1.38.40676 > 78.47.222.210.5001: Flags [S], seq 2577214362, win 5840, options [mss 1460,sackOK,TS val 338417 ecr 0,nop,wscale 0], length 0 78.47.222.210.5001 > 192.168.1.38.40676: Flags [S.], seq 1570631029, ack 2577214363, win 384, options [mss 1452,sackOK,TS val 2435248895 ecr 338417,nop,wscale 7], length 0 192.168.1.38.40676 > 78.47.222.210.5001: Flags [.], ack 1, win 5840, options [nop,nop,TS val 338421 ecr 2435248895], length 0 Within tcp_select_initial_window() the original space argument - a representation of the rx buffer size - is expanded during tcp_select_initial_window(). Only sysctl_tcp_rmem[2], sysctl_rmem_max and window_clamp are considered to calculate the initial window. This patch adjust the window_clamp argument if the user explicitly reduce the receive buffer. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Cc: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
45e77d314585869dfe43c82679f7e08c9b35b898 |
|
19-Jul-2010 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix crash in tcp_xmit_retransmit_queue It can happen that there are no packets in queue while calling tcp_xmit_retransmit_queue(). tcp_write_queue_head() then returns NULL and that gets deref'ed to get sacked into a local var. There is no work to do if no packets are outstanding so we just exit early. This oops was introduced by 08ebd1721ab8fd (tcp: remove tp->lost_out guard to make joining diff nicer). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de> Tested-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4bc2f18ba4f22a90ab593c0a580fc9a19c4777b6 |
|
09-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net/ipv4: EXPORT_SYMBOL cleanups CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c4ead4c595cd54cf7b1a184d4f888ce0d7cce7d4 |
|
24-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tso_fragment() might avoid GFP_ATOMIC We can pass a gfp argument to tso_fragment() and avoid GFP_ATOMIC allocations sometimes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a3433f35a55f7604742cae620c6dc6edfc70db6a |
|
12-Jun-2010 |
Changli Gao <xiaosuo@gmail.com> |
tcp: unify tcp flag macros unify tcp flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH, TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. TCBCB_FLAG_* are replaced with the corresponding TCPHDR_*. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/tcp.h | 24 ++++++------- net/ipv4/tcp.c | 8 ++-- net/ipv4/tcp_input.c | 2 - net/ipv4/tcp_output.c | 59 ++++++++++++++++----------------- net/netfilter/nf_conntrack_proto_tcp.c | 32 ++++++----------- net/netfilter/xt_TCPMSS.c | 4 -- 6 files changed, 58 insertions(+), 71 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
|
de213e5eedecdfb1b1eea7e6be28bc64cac5c078 |
|
18-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_synack_options() fix Commit 33ad798c924b4a (tcp: options clean up) introduced a problem if MD5+SACK+timestamps were used in initial SYN message. Some stacks (old linux for example) try to negotiate MD5+SACK+TSTAMP sessions, but since 40 bytes of tcp options space are not enough to store all the bits needed, we chose to disable timestamps in this case. We send a SYN-ACK _without_ timestamp option, but socket has timestamps enabled and all further outgoing messages contain a TS block, all with the initial timestamp of the remote peer. Fix is to really disable timestamps option for the whole session. Reported-by: Bijay Singh <Bijay.Singh@guavus.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a465419b1febb603821f924805529cff89cafeed |
|
16-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Introduce sk_route_nocaps TCP-MD5 sessions have intermittent failures, when route cache is invalidated. ip_queue_xmit() has to find a new route, calls sk_setup_caps(sk, &rt->u.dst), destroying the sk->sk_route_caps &= ~NETIF_F_GSO_MASK that MD5 desperately try to make all over its way (from tcp_transmit_skb() for example) So we send few bad packets, and everything is fine when tcp_transmit_skb() is called again for this socket. Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a socket field, sk_route_nocaps, containing bits to mask on sk_route_caps. Reported-by: Bhaskar Dutta <bhaskie@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa2ea0586d9dbe56a334d835a43b45e8c2104e77 |
|
22-Apr-2010 |
Tom Herbert <therbert@google.com> |
tcp: fix outsegs stat for TSO segments Account for TSO segments of an skb in TCP_MIB_OUTSEGS counter. Without doing this, the counter can be off by orders of magnitude from the actual number of segments sent. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0eae88f31ca2b88911ce843452054139e028771f |
|
21-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Fix various endianness glitches Sparse can help us find endianness bugs, but we need to make some cleanups to be able to more easily spot real bugs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4e15ed4d930297c127d280ca1d0c785be870def4 |
|
15-Apr-2010 |
Shan Wei <shanwei@cn.fujitsu.com> |
net: replace ipfragok with skb->local_df As Herbert Xu said: we should be able to simply replace ipfragok with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function) has droped ipfragok and set local_df value properly. The patch kills the ipfragok parameter of .queue_xmit(). Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2e8e18ef52e7dd1af0a3bd1f7d990a1d0b249586 |
|
08-Apr-2010 |
David S. Miller <davem@davemloft.net> |
tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb Back in commit 04a0551c87363f100b04d28d7a15a632b70e18e7 ("loopback: Drop obsolete ip_summed setting") we stopped setting CHECKSUM_UNNECESSARY in the loopback xmit. This is because such a setting was a lie since it implies that the checksum field of the packet is properly filled in. Instead what happens normally is that CHECKSUM_PARTIAL is set and skb->csum is calculated as needed. But this was only happening for TCP data packets (via the skb->ip_summed assignment done in tcp_sendmsg()). It doesn't happen for non-data packets like ACKs etc. Fix this by setting skb->ip_summed in the common non-data packet constructor. It already is setting skb->csum to zero. But this reminds us that we still have things like ip_output.c's ip_dev_loopback_xmit() which sets skb->ip_summed to the value CHECKSUM_UNNECESSARY, which Herbert's patch teaches us is not valid. So we'll have to address that at some point too. Signed-off-by: David S. Miller <davem@davemloft.net>
|
bb29624614c2afe2873ee8ee97cf09df42701694 |
|
11-Apr-2010 |
Herbert Xu <herbert@gondor.apana.org.au> |
inet: Remove unused send_check length argument inet: Remove unused send_check length argument This patch removes the unused length argument from the send_check function in struct inet_connection_sock_af_ops. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Yinghai <yinghai.lu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ae4e8d63b5619d4d95f1d2bfa2b836caa6e62d06 |
|
11-Apr-2010 |
David S. Miller <davem@davemloft.net> |
Revert "tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb" This reverts commit 2626419ad5be1a054d350786b684b41d23de1538. It causes regressions for people with IGB cards. Connection requests don't complete etc. The true cause of the issue is still not known, but we should sort this out in net-next-2.6 not net-2.6 Signed-off-by: David S. Miller <davem@davemloft.net>
|
2626419ad5be1a054d350786b684b41d23de1538 |
|
08-Apr-2010 |
David S. Miller <davem@davemloft.net> |
tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb Back in commit 04a0551c87363f100b04d28d7a15a632b70e18e7 ("loopback: Drop obsolete ip_summed setting") we stopped setting CHECKSUM_UNNECESSARY in the loopback xmit. This is because such a setting was a lie since it implies that the checksum field of the packet is properly filled in. Instead what happens normally is that CHECKSUM_PARTIAL is set and skb->csum is calculated as needed. But this was only happening for TCP data packets (via the skb->ip_summed assignment done in tcp_sendmsg()). It doesn't happen for non-data packets like ACKs etc. Fix this by setting skb->ip_summed in the common non-data packet constructor. It already is setting skb->csum to zero. But this reminds us that we still have things like ip_output.c's ip_dev_loopback_xmit() which sets skb->ip_summed to the value CHECKSUM_UNNECESSARY, which Herbert's patch teaches us is not valid. So we'll have to address that at some point too. Signed-off-by: David S. Miller <davem@davemloft.net>
|
5a0e3ad6af8660be21ca98a971cd00f331318c05 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
28b2774a0d5852236dab77a4147b8b88548110f1 |
|
08-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: Fix tcp_make_synack() Commit 4957faad (TCPCT part 1g: Responder Cookie => Initiator), part of TCP_COOKIE_TRANSACTION implementation, forgot to correctly size synack skb in case user data must be included. Many thanks to Mika Pentillä for spotting this error. Reported-by: Penttillä Mika <mika.penttila@ixonos.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
31d12926e37291970dd4f6e9940df3897766a81d |
|
15-Dec-2009 |
laurent chavey <chavey@google.com> |
net: Add rtnetlink init_rcvwnd to set the TCP initial receive window Add rtnetlink init_rcvwnd to set the TCP initial receive window size advertised by passive and active TCP connections. The current Linux TCP implementation limits the advertised TCP initial receive window to the one prescribed by slow start. For short lived TCP connections used for transaction type of traffic (i.e. http requests), bounding the advertised TCP initial receive window results in increased latency to complete the transaction. Support for setting initial congestion window is already supported using rtnetlink init_cwnd, but the feature is useless without the ability to set a larger TCP initial receive window. The rtnetlink init_rcvwnd allows increasing the TCP initial receive window, allowing TCP connection to advertise larger TCP receive window than the ones bounded by slow start. Signed-off-by: Laurent Chavey <chavey@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
12d50c46dc0f7fd2e625c4befaa5fa5740a7a594 |
|
08-Dec-2009 |
Krishna Kumar <krkumar2@in.ibm.com> |
tcp: Remove check in __tcp_push_pending_frames tcp_push checks tcp_send_head and calls __tcp_push_pending_frames, which again checks tcp_send_head, and this unnecessary check is done for every other caller of __tcp_push_pending_frames. Remove tcp_send_head check in __tcp_push_pending_frames and add the check to tcp_push_pending_frames. Other functions call __tcp_push_pending_frames only when tcp_send_head would evaluate to true. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bb5b7c11263dbbe78253cd05945a6bf8f55add8e |
|
16-Dec-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. It creates a regression, triggering badness for SYN_RECV sockets, for example: [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293 [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000 [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32) [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000 [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000 This is likely caused by the change in the 'estab' parameter passed to tcp_parse_options() when invoked by the functions in net/ipv4/tcp_minisocks.c But even if that is fixed, the ->conn_request() changes made in this patch series is fundamentally wrong. They try to use the listening socket's 'dst' to probe the route settings. The listening socket doesn't even have a route, and you can't get the right route (the child request one) until much later after we setup all of the state, and it must be done by hand. This stuff really isn't ready, so the best thing to do is a full revert. This reverts the following commits: f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d 022c3f7d82f0f1c68018696f2f027b87b9bb45c2 1aba721eba1d84a2defce45b950272cee1e6c72a cda42ebd67ee5fdf09d7057b5a4584d36fe8a335 345cda2fd695534be5a4494f1b59da9daed33663 dc343475ed062e13fc260acccaab91d7d80fd5b2 05eaade2782fb0c90d3034fd7a7d5a16266182bb 6a2a2d6bf8581216e08be15fcb563cfd6c430e1e Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6b09ccada2e4ab8ee25a7c3ac72fcd1bc7101c4 |
|
03-Dec-2009 |
David S. Miller <davem@davemloft.net> |
tcp: sysctl_tcp_cookie_size needs to be exported to modules. Otherwise: ERROR: "sysctl_tcp_cookie_size" [net/ipv6/ipv6.ko] undefined! make[1]: *** [__modpost] Error 1 Signed-off-by: David S. Miller <davem@davemloft.net>
|
f9a2e69e8bb77fe6c8e7da6380ebd4330c00a29e |
|
03-Dec-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Fix warning on 64-bit. net/ipv4/tcp_output.c: In function ‘tcp_make_synack’: net/ipv4/tcp_output.c:2488: warning: cast from pointer to integer of different size Signed-off-by: David S. Miller <davem@davemloft.net>
|
4957faade11b3a278c3b3cade3411ddc20afa791 |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1g: Responder Cookie => Initiator Parse incoming TCP_COOKIE option(s). Calculate <SYN,ACK> TCP_COOKIE option. Send optional <SYN,ACK> data. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1f: Initiator Cookie => Responder Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
bd0388ae77075026d6a9f9eb6026dfd1d52ce0e9 |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1f: Initiator Cookie => Responder Calculate and format <SYN> TCP_COOKIE option. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
519855c508b9a17878c0977a3cdefc09b59b30df |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS Define sysctl (tcp_cookie_size) to turn on and off the cookie option default globally, instead of a compiled configuration option. Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant data values, retrieving variable cookie values, and other facilities. Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h, near its corresponding struct tcp_options_received (prior to changes). This is a straightforward re-implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 These functions will also be used in subsequent patches that implement additional features. Requires: net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6b4d11367519bc71729c09d05a126b133c755be |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1a: add request_values parameter for sending SYNACK Add optional function parameters associated with sending SYNACK. These parameters are not needed after sending SYNACK, and are not used for retransmission. Avoids extending struct tcp_request_sock, and avoids allocating kernel memory. Also affects DCCP as it uses common struct request_sock_ops, but this parameter is currently reserved for future use. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9d4fb27db90043cd2640e4bc778f9c755d3c17c1 |
|
23-Nov-2009 |
Joe Perches <joe@perches.com> |
net/ipv4: Move && and || to end of previous line On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote: > It should be of the form: > if (x && > y) > > or: > if (x && y) > > Fix patches, rather than complaints, for existing cases where things > do not follow this pattern are certainly welcome. Also collapsed some multiple tabs to single space. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
345cda2fd695534be5a4494f1b59da9daed33663 |
|
28-Oct-2009 |
Gilad Ben-Yossef <gilad@codefidence.com> |
Allow to turn off TCP window scale opt per route Add and use no window scale bit in the features field. Note that this is not the same as setting a window scale of 0 as would happen with window limit on route. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cda42ebd67ee5fdf09d7057b5a4584d36fe8a335 |
|
28-Oct-2009 |
Gilad Ben-Yossef <gilad@codefidence.com> |
Allow disabling TCP timestamp options per route Implement querying and acting upon the no timestamp bit in the feature field. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1aba721eba1d84a2defce45b950272cee1e6c72a |
|
28-Oct-2009 |
Gilad Ben-Yossef <gilad@codefidence.com> |
Add the no SACK route option feature Implement querying and acting upon the no sack bit in the features field. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c720c7e8383aff1cb219bddf474ed89d850336e3 |
|
15-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
inet: rename some inet_sock fields In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
89e95a613c8a045ce0c5b992ba19f10613f6ab2f |
|
01-Oct-2009 |
Ori Finkelman <ori@comsleep.com> |
IPv4 TCP fails to send window scale option when window scale is zero Acknowledge TCP window scale support by inserting the proper option in SYN/ACK and SYN headers even if our window scale is zero. This fixes the following observed behavior: 1. Client sends a SYN with TCP window scaling option and non zero window scale value to a Linux box. 2. Linux box notes large receive window from client. 3. Linux decides on a zero value of window scale for its part. 4. Due to compare against requested window scale size option, Linux does not to send windows scale TCP option header on SYN/ACK at all. With the following result: Client box thinks TCP window scaling is not supported, since SYN/ACK had no TCP window scale option, while Linux thinks that TCP window scaling is supported (and scale might be non zero), since SYN had TCP window scale option and we have a mismatched idea between the client and server regarding window sizes. Probably it also fixes up the following bug (not observed in practice): 1. Linux box opens TCP connection to some server. 2. Linux decides on zero value of window scale. 3. Due to compare against computed window scale size option, Linux does not to set windows scale TCP option header on SYN. With the expected result that the server OS does not use window scale option due to not receiving such an option in the SYN headers, leading to suboptimal performance. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Signed-off-by: Ori Finkelman <ori@comsleep.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa1330766c49199bdab4d4a9096d98b072df9044 |
|
03-Sep-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
tcp: replace hard coded GFP_KERNEL with sk_allocation This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
67edfef78639573e9b01c26295a935349aab6fa3 |
|
22-Jul-2009 |
Andi Kleen <andi@firstfloor.org> |
TCP: Add comments to (near) all functions in tcp_output.c v3 While looking for something else I spent some time adding one liner comments to the tcp_output.c functions that didn't have any. That makes the comments more consistent. I hope I documented everything right. No code changes. v2: Incorporated feedback from Ilpo. v3: Change style of one liner comments, add a few more comments. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e3afe7b75ed8f809c1473ea9b39267487c187ccb |
|
16-Jul-2009 |
John Dykstra <john.dykstra1@gmail.com> |
tcp: Fix MD5 signature checking on IPv4 mapped sockets Fix MD5 signature checking so that an IPv4 active open to an IPv6 socket can succeed. In particular, use the correct address family's signature generation function for the SYN/ACK. Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8e5b9dda99cc86bdbd822935fcc37c5808e271b3 |
|
28-Jun-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Stop non-TSO packets morphing into TSO If a socket starts out on a non-TSO route, and then switches to a TSO route, then the tail on the tx queue can morph into a TSO packet, causing mischief because the rest of the stack does not expect a partially linear TSO packet. This patch fixes this by ensuring that skb->ip_summed is set to CHECKSUM_PARTIAL before declaring a packet as TSO. Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
adf30907d63893e4208dfe3f5c88ae12bc2f25d5 |
|
02-Jun-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: skb->dst accessors Define three accessors to get/set dst attached to a skb struct dst_entry *skb_dst(const struct sk_buff *skb) void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
255cac91c3c9ce7dca7713b93ab03c75b7902e0e |
|
04-May-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: extend ECN sysctl to allow server-side only ECN This should be very safe compared with full enabled, so I see no reason why it shouldn't be done right away. As ECN can only be negotiated if the SYN sending party is also supporting it, somebody in the loop probably knows what he/she is doing. If SYN does not ask for ECN, the server side SYN-ACK is identical to what it is without ECN. Thus it's quite safe. The chosen value is safe w.r.t to existing configs which choose to currently set manually either 0 or 1 but silently upgrades those who have not explicitly requested ECN off. Whether to just enable both sides comes up time to time but unless that gets done now we can at least make the servers aware of ECN already. As there are some known problems to occur if ECN is enabled, it's currently questionable whether there's any real gain from enabling clients as servers mostly won't support it anyway (so we'd hit just the negative sides). After enabling the servers and getting that deployed, the client end enable really has some potential gain too. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
52cf3cc8acea52ecb93ef1dddb4ef2ae4e35c319 |
|
18-Apr-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix mid-wq adjustment helper Just noticed while doing some new work that the recent mid-wq adjustment logic will misbehave when FACK is not in use (happens either due sysctl'ed off or auto-detected reordering) because I forgot the relevant TCPCB tagbit. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9eb9362e569062e2f841b7a023e5fcde10ed63b4 |
|
02-Apr-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: miscounts due to tcp_fragment pcount reset It seems that trivial reset of pcount to one was not sufficient in tcp_retransmit_skb. Multiple counters experience a positive miscount when skb's pcount gets lowered without the necessary adjustments (depending on skb's sacked bits which exactly), at worst a packets_out miscount can crash at RTO if the write queue is empty! Triggering this requires mss change, so bidir tcp or mtu probe or like. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Tested-by: Uwe Bugla <uwe.bugla@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
797108d134a91afca9fa59c572336b279bc66afb |
|
02-Apr-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: add helper for counter tweaking due mid-wq change We need full-scale adjustment to fix a TCP miscount in the next patch, so just move it into a helper and call for that from the other places. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0c54b85f2828128274f319a1eb3ce7f604fe2a53 |
|
14-Mar-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: simplify tcp_current_mss There's very little need for most of the callsites to get tp->xmit_goal_size updated. That will cost us divide as is, so slice the function in two. Also, the only users of the tp->xmit_goal_size are directly behind tcp_current_mss(), so there's no need to store that variable into tcp_sock at all! The drop of xmit_goal_size currently leaves 16-bit hole and some reorganization would again be necessary to change that (but I'm aiming to fill that hole with u16 xmit_goal_size_segs to cache the results of the remaining divide to get that tso on regression). Bring xmit_goal_size parts into tcp.c Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5861f8e58dd84fc34b691c2e8d4824dea68c360e |
|
14-Mar-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: remove pointless .dsack/.num_sacks code In the pure assignment case, the earlier zeroing is still in effect. David S. Miller raised concerns if the ifs are there to avoid dirtying cachelines. I came to these conclusions: > We'll be dirty it anyway (now that I check), the first "real" statement > in tcp_rcv_established is: > > tp->rx_opt.saw_tstamp = 0; > > ...that'll land on the same dword. :-/ > > I suppose the blocks are there just because they had more complexity > inside when they had to calculate the eff_sacks too (maybe it would > have been better to just remove them in that drop-patch so you would > have had less head-ache :-)). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ee7537b63a28b42b22e48842dfeedc66d96b71f1 |
|
03-Mar-2009 |
Hantzis Fotis <xantzis@ceid.upatras.gr> |
tcp: tcp_init_wl / tcp_update_wl argument cleanup The above functions from include/net/tcp.h have been defined with an argument that they never use. The argument is 'u32 ack' which is never used inside the function body, and thus it can be removed. The rest of the patch involves the necessary changes to the function callers of the above two functions. Signed-off-by: Hantzis Fotis <xantzis@ceid.upatras.gr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9ce01461028d595a6f1cd724fbd7a0dd70464fe4 |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: get rid of two unnecessary u16s in TCP skb flags copying I guess these fields were one day 16-bit in the struct but nowadays they're just using 8 bits anyway. This is just a precaution, didn't result any change in my case but who knows what all those varying gcc versions & options do. I've been told that 16-bit is not so nice with some cpus. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cabeccbd172cc305f4383f5a4808ae254745275f |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: kill eff_sacks "cache", the sole user can calculate itself Also fixes insignificant bug that would cause sending of stale SACK block (would occur in some corner cases). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6c7d0857905f1d642cb8dbadae6794bfa1dff30 |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: drop unnecessary local var in collapse Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
02276f3c962fd408fa9d441251067845f948bfcf |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix corner case issue in segmentation during rexmitting If cur_mss grew very recently so that the previously G/TSOed skb now fits well into a single segment it would get send up in parts unless we calculate # of segments again. This corner-case could happen eg. after mtu probe completes or less than previously sack blocks are required for the opposite direction. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d3d2ae454501a4dec360995649e1b002a2ad90c5 |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: Don't clear hints when tcp_fragmenting 1) We didn't remove any skbs, so no need to handle stale refs. 2) scoreboard_skb_hint is trivial, no timestamps were changed so no need to clear that one 3) lost_skb_hint needs tweaking similar to that of tcp_sacktag_one(). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
62ad27619cbcf23fb8581ae72f3806c1d90a861d |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: deferring in middle of queue makes very little sense If skb can be sent right away, we certainly should do that if it's in the middle of the queue because it won't get more data into it. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ac11ba753f3aa839292c1a3b6c971c637ad2e839 |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: don't backtrack to sacked skbs Backtracking to sacked skbs is a horrible performance killer since the hint cannot be advanced successfully past them... ...And it's totally unnecessary too. In theory this is 2.6.27..28 regression but I doubt anybody can make .28 to have worse performance because of other TCP improvements. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7691367d71fd77ab668ff3b6edb4340cecddc805 |
|
22-Feb-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Always set urgent pointer if it's beyond snd_nxt Our TCP stack does not set the urgent flag if the urgent pointer does not fit in 16 bits, i.e., if it is more than 64K from the sequence number of a packet. This behaviour is different from the BSDs, and clearly contradicts the purpose of urgent mode, which is to send the notification (though not necessarily the associated data) as soon as possible. Our current behaviour may in fact delay the urgent notification indefinitely if the receiver window does not open up. Simply matching BSD however may break legacy applications which incorrectly rely on the out-of-band delivery of urgent data, and conversely the in-band delivery of non-urgent data. Alexey Kuznetsov suggested a safe solution of following BSD only if the urgent pointer itself has not yet been transmitted. This way we guarantee that when the remote end sees the packet with non-urgent data marked as urgent due to wrap-around we would have advanced the urgent pointer beyond, either to the actual urgent data or to an as-yet untransmitted packet. The only potential downside is that applications on the remote end may see multiple SIGURG notifications. However, this would occur anyway with other TCP stacks. More importantly, the outcome of such a duplicate notification is likely to be harmless since the signal itself does not carry any information other than the fact that we're in urgent mode. Thanks to Ilpo Järvinen for fixing a critical bug in this and Jeff Chua for reporting that bug. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5209921cf15452cbe43097afce11d2846630cb51 |
|
19-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: remove obsoleted comment about different passes This is obsolete since the passes got combined. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a23f4bbd8d27ac8ddc5d71ace1f91bb503f0469a |
|
06-Feb-2009 |
David S. Miller <davem@davemloft.net> |
Revert "tcp: Always set urgent pointer if it's beyond snd_nxt" This reverts commit 64ff3b938ec6782e6585a83d5459b98b0c3f6eb8. Jeff Chua reports that it breaks rlogin for him. Signed-off-by: David S. Miller <davem@davemloft.net>
|
64ff3b938ec6782e6585a83d5459b98b0c3f6eb8 |
|
26-Dec-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Always set urgent pointer if it's beyond snd_nxt Our TCP stack does not set the urgent flag if the urgent pointer does not fit in 16 bits, i.e., if it is more than 64K from the sequence number of a packet. This behaviour is different from the BSDs, and clearly contradicts the purpose of urgent mode, which is to send the notification (though not necessarily the associated data) as soon as possible. Our current behaviour may in fact delay the urgent notification indefinitely if the receiver window does not open up. Simply matching BSD however may break legacy applications which incorrectly rely on the out-of-band delivery of urgent data, and conversely the in-band delivery of non-urgent data. Alexey Kuznetsov suggested a safe solution of following BSD only if the urgent pointer itself has not yet been transmitted. This way we guarantee that when the remote end sees the packet with non-urgent data marked as urgent due to wrap-around we would have advanced the urgent pointer beyond, either to the actual urgent data or to an as-yet untransmitted packet. The only potential downside is that applications on the remote end may see multiple SIGURG notifications. However, this would occur anyway with other TCP stacks. More importantly, the outcome of such a duplicate notification is likely to be harmless since the signal itself does not carry any information other than the fact that we're in urgent mode. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a2acde07711f7d8b19928245c555bce60f91482a |
|
06-Dec-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix tso_should_defer in 64bit Since jiffies is unsigned long, the types get expanded into that and after long enough time the difference will therefore always be > 1 (and that probably happens near boot as well as iirc the first jiffies wrap is scheduler close after boot to find out problems related to that early). This was originally noted by Bill Fink in Dec'07 but nobody never ended fixing it. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d5dd9175bc12015ea4d2c1a9b6b15dfa645a3db9 |
|
06-Dec-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: use tcp_write_xmit also in tcp_push_one tcp_minshall_update is not significant difference since it only checks for not full-sized skb which is BUG'ed on the push_one path anyway. tcp_snd_test is tcp_nagle_test+tcp_cwnd_test+tcp_snd_wnd_test, just the order changed slightly. net/ipv4/tcp_output.c: tcp_snd_test | -89 tcp_mss_split_point | -91 tcp_may_send_now | +53 tcp_cwnd_validate | -98 tso_fragment | -239 __tcp_push_pending_frames | -1340 tcp_push_one | -146 7 functions changed, 53 bytes added, 2003 bytes removed, diff: -1950 net/ipv4/tcp_output.c: tcp_write_xmit | +1772 1 function changed, 1772 bytes added, diff: +1772 tcp_output.o.new: 8 functions changed, 1825 bytes added, 2003 bytes removed, diff: -178 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
726e07a8a38168266ac95d87736f9501a2d9e7b2 |
|
06-Dec-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: move some parts from tcp_write_xmit Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f8269a495a1924f8b023532dd3e77423432db810 |
|
04-Dec-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: make urg+gso work for real this time I should have noticed this earlier... :-) The previous solution to URG+GSO/TSO will cause SACK block tcp_fragment to do zig-zig patterns, or even worse, a steep downward slope into packet counting because each skb pcount would be truncated to pcount of 2 and then the following fragments of the later portion would restore the window again. Basically this reverts "tcp: Do not use TSO/GSO when there is urgent data" (33cf71cee1). It also removes some unnecessary code from tcp_current_mss that didn't work as intented either (could be that something was changed down the road, or it might have been broken since the dawn of time) because it only works once urg is already written while this bug shows up starting from ~64k before the urg point. The retransmissions already are split to mss sized chunks, so only new data sending paths need splitting in case they have a segment otherwise suitable for gso/tso. The actually check can be improved to be more narrow but since this is late -rc already, I'll postpone thinking the more fine-grained things. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e1aa680fa40e7492260a09cb57d94002245cc8fe |
|
25-Nov-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: move tcp_simple_retransmit to tcp_input Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4a17fc3add594fcc1c778e93a95b6ecf47f630e5 |
|
25-Nov-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: collapse more than two on retransmission I always had thought that collapsing up to two at a time was intentional decision to avoid excessive processing if 1 byte sized skbs are to be combined for a full mtu, and consecutive retransmissions would make the size of the retransmittee double each round anyway, but some recent discussion made me to understand that was not the case. Thus make collapse work more and wait less. It would be possible to take advantage of the shifting machinery (added in the later patch) in the case of paged data but that can be implemented on top of this change. tcp_skb_is_last check is now provided by the loop. I tested a bit (ss-after-idle-off, fill 4096x4096B xfer, 10s sleep + 4096 x 1byte writes while dropping them for some a while with netem): . 16774097:16775545(1448) ack 1 win 46 . 16775545:16776993(1448) ack 1 win 46 . ack 16759617 win 2399 P 16776993:16777217(224) ack 1 win 46 . ack 16762513 win 2399 . ack 16765409 win 2399 . ack 16768305 win 2399 . ack 16771201 win 2399 . ack 16774097 win 2399 . ack 16776993 win 2399 . ack 16777217 win 2399 P 16777217:16777257(40) ack 1 win 46 . ack 16777257 win 2399 P 16777257:16778705(1448) ack 1 win 46 P 16778705:16780153(1448) ack 1 win 46 FP 16780153:16781313(1160) ack 1 win 46 . ack 16778705 win 2399 . ack 16780153 win 2399 F 1:1(0) ack 16781314 win 2399 While without drop-all period I get this: . 16773585:16775033(1448) ack 1 win 46 . ack 16764897 win 9367 . ack 16767793 win 9367 . ack 16770689 win 9367 . ack 16773585 win 9367 . 16775033:16776481(1448) ack 1 win 46 P 16776481:16777217(736) ack 1 win 46 . ack 16776481 win 9367 . ack 16777217 win 9367 P 16777217:16777218(1) ack 1 win 46 P 16777218:16777219(1) ack 1 win 46 P 16777219:16777220(1) ack 1 win 46 ... P 16777247:16777248(1) ack 1 win 46 . ack 16777218 win 9367 . ack 16777219 win 9367 ... . ack 16777233 win 9367 . ack 16777248 win 9367 P 16777248:16778696(1448) ack 1 win 46 P 16778696:16780144(1448) ack 1 win 46 FP 16780144:16781313(1169) ack 1 win 46 . ack 16780144 win 9367 F 1:1(0) ack 16781314 win 9367 The window seems to be 30-40 segments, which were successfully combined into: P 16777217:16777257(40) ack 1 win 46 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
33cf71cee14743185305c61625c4544885055733 |
|
22-Nov-2008 |
Petr Tesarik <ptesarik@suse.cz> |
tcp: Do not use TSO/GSO when there is urgent data This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12014 Since most (if not all) implementations of TSO and even the in-kernel software GSO do not update the urgent pointer when splitting a large segment, it is necessary to turn off TSO/GSO for all outgoing traffic with the URG pointer set. Looking at tcp_current_mss (and the preceding comment) I even think this was the original intention. However, this approach is insufficient, because TSO/GSO is turned off only for newly created frames, not for frames which were already pending at the arrival of a message with MSG_OOB set. These frames were created when TSO/GSO was enabled, so they may be large, and they will have the urgent pointer set in tcp_transmit_skb(). With this patch, such large packets will be fragmented again before going to the transmit routine. As a side note, at least the following NICs are known to screw up the urgent pointer in the TCP header when doing TSO: Intel 82566MM (PCI ID 8086:1049) Intel 82566DC (PCI ID 8086:104b) Intel 82541GI (PCI ID 8086:1076) Broadcom NetXtreme II BCM5708 (PCI ID 14e4:164c) Signed-off-by: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
09cb105ea78d5644570d52085e2149f784575872 |
|
03-Nov-2008 |
Jianjun Kong <jianjun@zeuux.org> |
net: clean up net/ipv4/ip_sockglue.c tcp_output.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8b5f12d04b2e93842f3dda01f029842047bf3f81 |
|
27-Oct-2008 |
Florian Westphal <fw@strlen.de> |
syncookies: fix inclusion of tcp options in syn-ack David Miller noticed that commit 33ad798c924b4a1afad3593f2796d465040aadd5 '(tcp: options clean up') did not move the req->cookie_ts check. This essentially disabled commit 4dfc2817025965a2fc78a18c50f540736a6b5c24 '[Syncookies]: Add support for TCP options via timestamps.'. This restores the original logic. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
fd6149d332973bafa50f03ddb0ea9513e67f4517 |
|
23-Oct-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: Restore ordering of TCP options for the sake of inter-operability This is not our bug! Sadly some devices cannot cope with the change of TCP option ordering which was a result of the recent rewrite of the option code (not that there was some particular reason steming from the rewrite for the reordering) though any ordering of TCP options is perfectly legal. Thus we restore the original ordering to allow interoperability with/through such broken devices and add some warning about this trap. Since the reordering just happened without any particular reason, this change shouldn't cost us anything. There are already couple of known failure reports (within close proximity of the last release), so the problem might be more wide-spread than a single device. And other reports which may be due to the same problem though the symptoms were less obvious. Analysis of one of the case revealed (with very high probability) that sack capability cannot be negotiated as the first option (SYN never got a response). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Aldo Maggi <sentiniate@tiscali.it> Tested-by: Aldo Maggi <sentiniate@tiscali.it> Signed-off-by: David S. Miller <davem@davemloft.net>
|
75e3d8db531b462b875c1adb13eeb6b0be7374c0 |
|
22-Oct-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: should use number of sack blocks instead of -1 While looking for the recent "sack issue" I also read all eff_sacks usage that was played around by some relevant commit. I found out that there's another thing that is asking for a fix (unrelated to the "sack issue" though). This feature has probably very little significance in practice. Opposite direction timeout with bidirectional tcp comes to me as the most likely scenario though there might be other cases as well related to non-data segments we send (e.g., response to the opposite direction segment). Also some ACK losses or option space wasted for other purposes is necessary to prevent the earlier SACK feedback getting to the sender. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
33f5f57eeb0c6386fdd85f9c690dc8d700ba7928 |
|
07-Oct-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: kill pointless urg_mode It all started from me noticing that this urgent check in tcp_clean_rtx_queue is unnecessarily inside the loop. Then I took a longer look to it and found out that the users of urg_mode can trivially do without, well almost, there was one gotcha. Bonus: those funny people who use urg with >= 2^31 write_seq - snd_una could now rejoice too (that's the only purpose for the between being there, otherwise a simple compare would have done the thing). Not that I assume that the rest of the tcp code happily lives with such mind-boggling numbers :-). Alas, it turned out to be impossible to set wmem to such numbers anyway, yes I really tried a big sendfile after setting some wmem but nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf... So I hacked a bit variable to long and found out that it seems to work... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a3116ac5c216fc3c145906a46df9ce542ff7dcf2 |
|
01-Oct-2008 |
KOVACS Krisztian <hidden@sch.bme.hu> |
tcp: Port redirection support for TCP Current TCP code relies on the local port of the listening socket being the same as the destination address of the incoming connection. Port redirection used by many transparent proxying techniques obviously breaks this, so we have to store the original destination port address. This patch extends struct inet_request_sock and stores the incoming destination port value there. It also modifies the handshake code to use that value as the source port when sending reply packets. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
77d40a0952b16e020ce07c4cf9fb22024448275b |
|
23-Sep-2008 |
David S. Miller <davem@davemloft.net> |
tcp: Fix order of tests in tcp_retransmit_skb() tcp_write_queue_next() must only be made if we know that tcp_skb_is_last() evaluates to false. Signed-off-by: David S. Miller <davem@davemloft.net>
|
f5fff5dc8a7a3f395b0525c02ba92c95d42b7390 |
|
21-Sep-2008 |
Tom Quetchenbach <virtualphtn@gmail.com> |
tcp: advertise MSS requested by user I'm trying to use the TCP_MAXSEG option to setsockopt() to set the MSS for both sides of a bidirectional connection. man tcp says: "If this option is set before connection establishment, it also changes the MSS value announced to the other end in the initial packet." However, the kernel only uses the MTU/route cache to set the advertised MSS. That means if I set the MSS to, say, 500 before calling connect(), I will send at most 500-byte packets, but I will still receive 1500-byte packets in reply. This is a bug, either in the kernel or the documentation. This patch (applies to latest net-2.6) reduces the advertised value to that requested by the user as long as setsockopt() is called before connect() or accept(). This seems like the behavior that one would expect as well as that which is documented. I've tried to make sure that things that depend on the advertised MSS are set correctly. Signed-off-by: Tom Quetchenbach <virtualphtn@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
618d9f25548ba6fc3a9cd2ce5cd56f4f015b0635 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: back retransmit_high when it over-estimated If lost skb is sacked, we might have nothing to retransmit as high as the retransmit_high is pointing to, so place it lower to avoid unnecessary walking. This is mainly for the case where high L'ed skbs gets sacked. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ef9da47c7cc64d69526331f315e76b5680d4048f |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: don't clear retransmit_skb_hint when not necessary Most importantly avoid doing it with cumulative ACK. Not clearing means that we no longer need n^2 processing in resolution of each fast recovery. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f0ceb0ed86b4792a4ed9d3438f5f7572e48f9803 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: remove retransmit_skb_hint clearing from failure This doesn't much sense here afaict, probably never has. Since fragmenting and collapsing deal the hints by themselves, there should be very little reason for the rexmit loop to do that. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e1c54c2a405494281e0639aacc90db03b50ae77 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: reorganize retransmit code loops Both loops are quite similar, so they can be combined with little effort. As a result, forward_skb_hint becomes obsolete as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
08ebd1721ab8fd362e90ae17b461c07b23fa2824 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: remove tp->lost_out guard to make joining diff nicer The validity of the retransmit_high must then be ensured if no L'ed skb exits! This makes a minor change to behavior, we now have to iterate the head to find out that the loop terminates. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
61eb55f4db7eaf5fb2d5ec12981a8cda755bb0e1 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: Reorganize skb tagbit checks Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
34638570b58290e8cb875fb24dcbe836ffeb6cb8 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: remove obsolete validity concern Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b5afe7bc71a1689376c9b547376d17568469f3b3 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: add tcp_can_forward_retransmit Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
006f582c73f4eda35e06fd323193c3df43fb3459 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: convert retransmit_cnt_hint to seqno Main benefit in this is that we can then freely point the retransmit_skb_hint to anywhere we want to because there's no longer need to know what would be the count changes involve, and since this is really used only as a terminator, unnecessary work is one time walk at most, and if some retransmissions are necessary after that point later on, the walk is not full waste of time anyway. Since retransmit_high must be kept valid, all lost markers must ensure that. Now I also have learned how those "holes" in the rexmittable skbs can appear, mtu probe does them. So I removed the misleading comment as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
64edc2736e23994e0334b70c5ff08dc33e2ebbd9 |
|
21-Sep-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: Partial hint clearing has again become meaningless Ie., the difference between partial and all clearing doesn't exists anymore since the SACK optimizations got dropped by an sacktag rewrite. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7982d5e1b350acb96aa156916c44c25ef87bb809 |
|
27-Aug-2008 |
Philip Love <love_phil@emc.com> |
tcp: fix tcp header size miscalculation when window scale is unused The size of the TCP header is miscalculated when the window scale ends up being 0. Additionally, this can be induced by sending a SYN to a passive open port with a window scale option with value 0. Signed-off-by: Philip Love <love_phil@emc.com> Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b32d13102d39ed411d152a7ffcc5f66d5b3b1b49 |
|
22-Jul-2008 |
David S. Miller <davem@davemloft.net> |
tcp: Fix bitmask test in tcp_syn_options() As reported by Alexey Dobriyan: CHECK net/ipv4/tcp_output.c net/ipv4/tcp_output.c:475:7: warning: dubious: !x & y And sparse is damn right! if (unlikely(!OPTION_TS & opts->options)) ^^^ size += TCPOLEN_SACKPERM_ALIGNED; OPTION_TS is (1 << 1), so condition will never trigger. Signed-off-by: David S. Miller <davem@davemloft.net>
|
33ad798c924b4a1afad3593f2796d465040aadd5 |
|
19-Jul-2008 |
Adam Langley <agl@imperialviolet.org> |
tcp: options clean up This should fix the following bugs: * Connections with MD5 signatures produce invalid packets whenever SACK options are included * MD5 signatures are counted twice in the MSS calculations Behaviour changes: * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK This is because we can't fit any SACK blocks in a packet with MD5 + TS options. There was discussion about disabling SACK rather than TS in order to fit in better with old, buggy kernels, but that was deemed to be unnecessary. * SYNs with MD5 don't include a TS option See above. Additionally, it removes a bunch of duplicated logic for calculating options, which should help avoid these sort of issues in the future. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
49a72dfb8814c2d65bd9f8c9c6daf6395a1ec58d |
|
19-Jul-2008 |
Adam Langley <agl@imperialviolet.org> |
tcp: Fix MD5 signatures for non-linear skbs Currently, the MD5 code assumes that the SKBs are linear and, in the case that they aren't, happily goes off and hashes off the end of the SKB and into random memory. Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy Polyakov. Also includes a couple of missed route_caps from Stephen's patch in [2]. [1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2 [2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2 Signed-off-by: Adam Langley <agl@imperialviolet.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
de0744af1fe2d0a3d428f6af0f2fe1f6179b1a9c |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4e6734447dbc7a0a85e09616821c0782d9fb1141 |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
81cc8a75d944fa39fc333c2c329c8e8b3c62cada |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to TCP_INC_STATS Fortunately (almost) all the TCP code has a sock to get the net from :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
40b215e594b65a3488576c9d24b367548e18902a |
|
03-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
tcp: de-bloat a bit with factoring NET_INC_STATS_BH out There are some places in TCP that select one MIB index to bump snmp statistics like this: if (<something>) NET_INC_STATS_BH(<some_id>); else if (<something_else>) NET_INC_STATS_BH(<some_other_id>); ... else NET_INC_STATS_BH(<default_id>); or in a more tricky but still similar way. On the other hand, this NET_INC_STATS_BH is a camouflaged increment of percpu variable, which is not that small. Factoring those cases out de-bloats 235 bytes on non-preemptible i386 config and drives parts of the code into 80 columns. add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235) function old new delta tcp_fastretrans_alert 1437 1424 -13 tcp_dsack_set 137 124 -13 tcp_xmit_retransmit_queue 690 676 -14 tcp_try_undo_recovery 283 265 -18 tcp_sacktag_write_queue 1550 1515 -35 tcp_update_reordering 162 106 -56 tcp_retransmit_timer 990 904 -86 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0b040829952d84bf2a62526f0e24b624e0699447 |
|
11-Jun-2008 |
Adrian Bunk <bunk@kernel.org> |
net: remove CVS keywords This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
076fb7223357769c39f3ddf900bba6752369c76a |
|
16-Apr-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
tcp md5sig: Remove redundant protocol argument. Protocol is always TCP, so remove useless protocol argument. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
26af65cbeb2467a486ae4fc7242c94e470c67c50 |
|
05-Jun-2008 |
Sridhar Samudrala <sri@us.ibm.com> |
tcp: Increment OUTRSTS in tcp_send_active_reset() TCP "resets sent" counter is not incremented when a TCP Reset is sent via tcp_send_active_reset(). Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7d227cd235c809c36c847d6a597956ad9e9d2bae |
|
22-May-2008 |
Sridhar Samudrala <sri@us.ibm.com> |
tcp: TCP connection times out if ICMP frag needed is delayed We are seeing an issue with TCP in handling an ICMP frag needed message that is received after net.ipv4.tcp_retries1 retransmits. The default value of retries1 is 3. So if the path mtu changes and ICMP frag needed is lost for the first 3 retransmits or if it gets delayed until 3 retransmits are done, TCP doesn't update MSS correctly and continues to retransmit the orginal message until it timesout after tcp_retries2 retransmits. I am seeing this issue even with the latest 2.6.25.4 kernel. In tcp_retransmit_timer(), when retransmits counter exceeds tcp_retries1 value, the dst cache entry of the socket is reset. At this time, if we receive an ICMP frag needed message, the dst entry gets updated with the new MTU, but the TCP sockets dst_cache entry remains NULL. So the next time when we try to retransmit after the ICMP frag needed is received, tcp_retransmit_skb() gets called. Here the cur_mss value is calculated at the start of the routine with a NULL sk_dst_cache. Instead we should call tcp_current_mss after the rebuild_header that caches the dst entry with the updated mtu. Also the rebuild_header should be called before tcp_fragment so that skb is fragmented if the mss goes down. Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
17515408a15fa51c553e67c415502e785145cd7f |
|
16-Apr-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Remove superflushious skb == write_queue_tail() check Needed can only be more strict than what was checked by the earlier common case check for non-tail skbs, thus cwnd_len <= needed will never match in that case anyway. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4dfc2817025965a2fc78a18c50f540736a6b5c24 |
|
10-Apr-2008 |
Florian Westphal <fw@strlen.de> |
[Syncookies]: Add support for TCP options via timestamps. Allow the use of SACK and window scaling when syncookies are used and the client supports tcp timestamps. Options are encoded into the timestamp sent in the syn-ack and restored from the timestamp echo when the ack is received. Based on earlier work by Glenn Griffin. This patch avoids increasing the size of structs by encoding TCP options into the least significant bits of the timestamp and by not using any 'timestamp offset'. The downside is that the timestamp sent in the packet after the synack will increase by several seconds. changes since v1: don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c and have ipv6/syncookies.c use it. Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if () Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
882bebaaca4bb1484078d44ef011f918c0e1e14e |
|
08-Apr-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: tcp_simple_retransmit can cause S+L This fixes Bugzilla #10384 tcp_simple_retransmit does L increment without any checking whatsoever for overflowing S+L when Reno is in use. The simplest scenario I can currently think of is rather complex in practice (there might be some more straightforward cases though). Ie., if mss is reduced during mtu probing, it may end up marking everything lost and if some duplicate ACKs arrived prior to that sacked_out will be non-zero as well, leading to S+L > packets_out, tcp_clean_rtx_queue on the next cumulative ACK or tcp_fastretrans_alert on the next duplicate ACK will fix the S counter. More straightforward (but questionable) solution would be to just call tcp_reset_reno_sack() in tcp_simple_retransmit but it would negatively impact the probe's retransmission, ie., the retransmissions would not occur if some duplicate ACKs had arrived. So I had to add reno sacked_out reseting to CA_Loss state when the first cumulative ACK arrives (this stale sacked_out might actually be the explanation for the reports of left_out overflows in kernel prior to 2.6.23 and S+L overflow reports of 2.6.24). However, this alone won't be enough to fix kernel before 2.6.24 because it is building on top of the commit 1b6d427bb7e ([TCP]: Reduce sacked_out with reno when purging write_queue) to keep the sacked_out from overflowing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
82cc1a7a56872056af0ead6c7d695aa223f36695 |
|
21-Mar-2008 |
Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> |
[NET]: Add per-connection option to set max TSO frame size Update: My mailer ate one of Jarek's feedback mails... Fixed the parameter in netif_set_gso_max_size() to be u32, not u16. Fixed the whitespace issue due to a patch import botch. Changed the types from u32 to unsigned int to be more consistent with other variables in the area. Also brought the patch up to the latest net-2.6.26 tree. Update: Made gso_max_size container 32 bits, not 16. Moved the location of gso_max_size within netdev to be less hotpath. Made more consistent names between the sock and netdev layers, and added a define for the max GSO size. Update: Respun for net-2.6.26 tree. Update: changed max_gso_frame_size and sk_gso_max_size from signed to unsigned - thanks Stephen! This patch adds the ability for device drivers to control the size of the TSO frames being sent to them, per TCP connection. By setting the netdevice's gso_max_size value, the socket layer will set the GSO frame size based on that value. This will propogate into the TCP layer, and send TSO's of that size to the hardware. This can be desirable to help tune the bursty nature of TSO on a per-adapter basis, where one may have 1 GbE and 10 GbE devices coexisting in a system, one running multiqueue and the other not, etc. This can also be desirable for devices that cannot support full 64 KB TSO's, but still want to benefit from some level of segmentation offloading. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
607bfbf2d55dd1cfe5368b41c2a81a8c9ccf4723 |
|
21-Mar-2008 |
Patrick McHardy <kaber@trash.net> |
[TCP]: Fix shrinking windows with window scaling When selecting a new window, tcp_select_window() tries not to shrink the offered window by using the maximum of the remaining offered window size and the newly calculated window size. The newly calculated window size is always a multiple of the window scaling factor, the remaining window size however might not be since it depends on rcv_wup/rcv_nxt. This means we're effectively shrinking the window when scaling it down. The dump below shows the problem (scaling factor 2^7): - Window size of 557 (71296) is advertised, up to 3111907257: IP 172.2.2.3.33000 > 172.2.2.2.33000: . ack 3111835961 win 557 <...> - New window size of 514 (65792) is advertised, up to 3111907217, 40 bytes below the last end: IP 172.2.2.3.33000 > 172.2.2.2.33000: . 3113575668:3113577116(1448) ack 3111841425 win 514 <...> The number 40 results from downscaling the remaining window: 3111907257 - 3111841425 = 65832 65832 / 2^7 = 514 65832 % 2^7 = 40 If the sender uses up the entire window before it is shrunk, this can have chaotic effects on the connection. When sending ACKs, tcp_acceptable_seq() will notice that the window has been shrunk since tcp_wnd_end() is before tp->snd_nxt, which makes it choose tcp_wnd_end() as sequence number. This will fail the receivers checks in tcp_sequence() however since it is before it's tp->rcv_wup, making it respond with a dupack. If both sides are in this condition, this leads to a constant flood of ACKs until the connection times out. Make sure the window is never shrunk by aligning the remaining window to the window scaling factor. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5ea3a7480606cef06321cd85bc5113c72d2c7c68 |
|
12-Mar-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Prevent sending past receiver window with TSO (at last skb) With TSO it was possible to send past the receiver window when the skb to be sent was the last in the write queue while the receiver window is the limiting factor. One can notice that there's a loophole in the tcp_mss_split_point that lacked a receiver window check for the tcp_write_queue_tail() if also cwnd was smaller than the full skb. Noticed by Thomas Gleixner <tglx@linutronix.de> in form of "Treason uncloaked! Peer ... shrinks window .... Repaired." messages (the peer didn't actually shrink its window as the message suggests, we had just sent something past it without a permission to do so). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c6aefafb7ec620911d46174eed514f9df639e5a4 |
|
08-Feb-2008 |
Glenn Griffin <ggriffin.kernel@gmail.com> |
[TCP]: Add IPv6 support to TCP SYN cookies Updated to incorporate Eric's suggestion of using a per cpu buffer rather than allocating on the stack. Just a two line change, but will resend in it's entirety. Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
30a50cc56679f56d7b866b1c29fd05802606fc5a |
|
01-Feb-2008 |
Adrian Bunk <bunk@kernel.org> |
[TCP]: Unexport sysctl_tcp_tso_win_divisor This patch removes the no longer used EXPORT_SYMBOL(sysctl_tcp_tso_win_divisor). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e870a8efcddaaa3da7e180b6ae21239fb96aa2bb |
|
04-Jan-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Perform setting of common control fields in one place In case of segments which are purely for control without any data (SYN/ACK/FIN/RST), many fields are set to common values in multiple places. i386 results: $ gcc --version gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13) $ codiff tcp_output.o.old tcp_output.o.new net/ipv4/tcp_output.c: tcp_xmit_probe_skb | -48 tcp_send_ack | -56 tcp_retransmit_skb | -79 tcp_connect | -43 tcp_send_active_reset | -35 tcp_make_synack | -42 tcp_send_fin | -48 7 functions changed, 351 bytes removed net/ipv4/tcp_output.c: tcp_init_nondata_skb | +90 1 function changed, 90 bytes added tcp_output.o.mid: 8 functions changed, 90 bytes added, 351 bytes removed, diff: -261 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
19773b4923ed0e21e3289361dba5e69e1ce6e00b |
|
04-Jan-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Urgent parameter effect can be simplified. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d436d68630a74ba3c898ff1b53591ddc4eb7f2bf |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Remove unnecessary local variable Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
409d22b470532cb92b91b9aeb7257357a176b849 |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Code duplication removal, added tcp_bound_to_half_wnd() Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
056834d9f6f6eaf4cc7268569e53acab957aac27 |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: cleanup tcp_{in,out}put.c style These were manually selected from indent's results which as is are too noisy to be of any use without human reason. In addition, some extra newlines between function and its comment were removed too. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
058dc3342b71ffb3531c4f9df7c35f943f392b8d |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: reduce tcp_output's indentation levels a bit Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4828e7f49a402930e8b3e72de695c8d37e0f98ee |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Remove TCPCB_URG & TCPCB_AT_TAIL as unnecessary The snd_up check should be enough. I suspect this has been there to provide a minor optimization in clean_rtx_queue which used to have a small if (!->sacked) block which could skip snd_up check among the other work. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
90840defabbd819180c7528e12d550776848f833 |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Introduce tcp_wnd_end() to reduce line lengths Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
66f5fe624fa5f1d4574d2dd2bc0c72a17a92079c |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Rename update_send_head & include related increment to it There's very little need to have the packets_out incrementing in a separate function. Also name the combined function appropriately. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3ab224be6d69de912ee21302745ea45a99274dbc |
|
31-Dec-2007 |
Hideo Aoki <haoki@redhat.com> |
[NET] CORE: Introducing new memory accounting interface. This patch introduces new memory accounting functions for each network protocol. Most of them are renamed from memory accounting functions for stream protocols. At the same time, some stream memory accounting functions are removed since other functions do same thing. Renaming: sk_stream_free_skb() -> sk_wmem_free_skb() __sk_stream_mem_reclaim() -> __sk_mem_reclaim() sk_stream_mem_reclaim() -> sk_mem_reclaim() sk_stream_mem_schedule -> __sk_mem_schedule() sk_stream_pages() -> sk_mem_pages() sk_stream_rmem_schedule() -> sk_rmem_schedule() sk_stream_wmem_schedule() -> sk_wmem_schedule() sk_charge_skb() -> sk_mem_charge() Removeing sk_stream_rfree(): consolidates into sock_rfree() sk_stream_set_owner_r(): consolidates into skb_set_owner_r() sk_stream_mem_schedule() The following functions are added. sk_has_account(): check if the protocol supports accounting sk_mem_uncharge(): do the opposite of sk_mem_charge() In addition, to achieve consolidation, updating sk_wmem_queued is removed from sk_mem_charge(). Next, to consolidate memory accounting functions, this patch adds memory accounting calls to network core functions. Moreover, present memory accounting call is renamed to new accounting call. Finally we replace present memory accounting calls with new interface in TCP and SCTP. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e3a4803aa06cd7bc2cfc1d04289df4f6027640a |
|
25-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Force TSO splits to MSS boundaries If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on odd boundary by the callers of tcp_window_allows. We try really hard to avoid unnecessary modulos. Therefore the old caller side check "if (skb->len < limit)" was too wide as well because limit is not bound in any way to skb->len and can cause spurious testing for trimming in the middle of the queue while we only wanted that to happen at the tail of the queue. A simple additional caller side check for tcp_write_queue_tail would likely have resulted 2 x modulos because the limit would have to be first calculated from window, however, doing that unnecessary modulo is not mandatory. After a minor change to the algorithm, simply determine first if the modulo is needed at all and at that point immediately decide also from which value it should be calculated from. This approach also kills some duplicated code. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b92edbe0b8a36a833c16b0cbafb6e899b81ffc08 |
|
21-Dec-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[TCP] Avoid two divides in tcp_output.c Because 'free_space' variable in __tcp_select_window() is signed, expression (free_space / 2) forces compiler to emit an integer divide. This can be changed to a plain right shift, less expensive. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bd515c3e48ececd774eb3128e81b669dbbd32637 |
|
21-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Fix TSO deferring I'd say that most of what tcp_tso_should_defer had in between there was dead code because of this. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6859d49475d4f32abe640372117e4b687906e6b6 |
|
01-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Abstract tp->highest_sack accessing & point to next skb Pointing to the next skb is necessary to avoid referencing already SACKed skbs which will soon be on a separate list. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
234b68607006f3721679e900809ccb99e8bfb10c |
|
01-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Add tcp_for_write_queue_from_safe and use it in mtu_probe Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d67c58e9ae80ea577785111534e49d3ca757ec50 |
|
01-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Remove local variable and use packets_in_flight directly Lines won't be that long and it's compiler's job to optimize them. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
50c4817e9919132639be0adc387b509e04a9ed0a |
|
01-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: MTUprobe: prepare skb fields earlier They better be valid when call to write_queue functions is made once things that follow are going in. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
df97c708d5e6eebdd9ded1fa588eae09acf53793 |
|
29-Nov-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Eliminate unused argument from sk_stream_alloc_pskb The 3rd argument is always zero (according to grep :) Eliminate it and merge the function with sk_stream_alloc_skb. This saves 44 more bytes, and together with the previous patch we have: add/remove: 1/0 grow/shrink: 0/8 up/down: 183/-751 (-568) function old new delta sk_stream_alloc_skb - 183 +183 ip_rt_init 529 525 -4 arp_ignore 112 107 -5 __inet_lookup_listener 284 274 -10 tcp_sendmsg 2583 2481 -102 tcp_sendpage 1449 1300 -149 tso_fragment 417 258 -159 tcp_fragment 1149 988 -161 __tcp_push_pending_frames 1998 1837 -161 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8512430e554a84275669f78f86dce18566d5cf7a |
|
26-Nov-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Move FRTO checks out from write queue abstraction funcs Better place exists in update_send_head (other non-queue related adjustments are done there as well) which is the only caller of tcp_advance_send_head (now that the bogus call from mtu_probe is gone). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
68f8353b480e5f2e136c38a511abdbb88eaa8ce2 |
|
16-Nov-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Rewrite SACK block processing & sack_recv_cache use Key points of this patch are: - In case new SACK information is advance only type, no skb processing below previously discovered highest point is done - Optimize cases below highest point too since there's no need to always go up to highest point (which is very likely still present in that SACK), this is not entirely true though because I'm dropping the fastpath_skb_hint which could previously optimize those cases even better. Whether that's significant, I'm not too sure. Currently it will provide skipping by walking. Combined with RB-tree, all skipping would become fast too regardless of window size (can be done incrementally later). Previously a number of cases in TCP SACK processing fails to take advantage of costly stored information in sack_recv_cache, most importantly, expected events such as cumulative ACK and new hole ACKs. Processing on such ACKs result in rather long walks building up latencies (which easily gets nasty when window is huge). Those latencies are often completely unnecessary compared with the amount of _new_ information received, usually for cumulative ACK there's no new information at all, yet TCP walks whole queue unnecessary potentially taking a number of costly cache misses on the way, etc.! Since the inclusion of highest_sack, there's a lot information that is very likely redundant (SACK fastpath hint stuff, fackets_out, highest_sack), though there's no ultimate guarantee that they'll remain the same whole the time (in all unearthly scenarios). Take advantage of this knowledge here and drop fastpath hint and use direct access to highest SACKed skb as a replacement. Effectively "special cased" fastpath is dropped. This change adds some complexity to introduce better coveraged "fastpath", though the added complexity should make TCP behave more cache friendly. The current ACK's SACK blocks are compared against each cached block individially and only ranges that are new are then scanned by the high constant walk. For other parts of write queue, even when in previously known part of the SACK blocks, a faster skip function is used (if necessary at all). In addition, whenever possible, TCP fast-forwards to highest_sack skb that was made available by an earlier patch. In typical case, no other things but this fast-forward and mandatory markings after that occur making the access pattern quite similar to the former fastpath "special case". DSACKs are special case that must always be walked. The local to recv_sack_cache copying could be more intelligent w.r.t DSACKs which are likely to be there only once but that is left to a separate patch. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a47e5a988a575e64c8c9bae65a1dfe3dca7cba32 |
|
16-Nov-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Convert highest_sack to sk_buff to allow direct access It is going to replace the sack fastpath hint quite soon... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4e67d876ce07471e02be571038d5435a825f0215 |
|
05-Dec-2007 |
Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: NAGLE_PUSH seems to be a wrong way around The comment in tcp_nagle_test suggests that. This bug is very very old, even 2.4.0 seems to have it. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7f9c33e515353ea91afc62341161fead19e78567 |
|
23-Nov-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP] MTUprobe: Cleanup send queue check (no need to loop) The original code has striking complexity to perform a query which can be reduced to a very simple compare. FIN seqno may be included to write_seq but it should not make any significant difference here compared to skb->len which was used previously. One won't end up there with SYN still queued. Use of write_seq check guarantees that there's a valid skb in send_head so I removed the extra check. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: John Heffner <jheffner@psc.edu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
91cc17c0e5e5ada156a8d5787a2509d263ea6bbf |
|
23-Nov-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: MTUprobe: receiver window & data available checks fixed It seems that the checked range for receiver window check should begin from the first rather than from the last skb that is going to be included to the probe. And that can be achieved without reference to skbs at all, snd_nxt and write_seq provides the correct seqno already. Plus, it SHOULD account packets that are necessary to trigger fast retransmit [RFC4821]. Location of snd_wnd < probe_size/size_needed check is bogus because it will cause the other if() match as well (due to snd_nxt >= snd_una invariant). Removed dead obvious comment. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
6e42141009ff18297fe19d19296738b742f861db |
|
20-Nov-2007 |
Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> |
[TCP] MTUprobe: fix potential sk_send_head corruption When the abstraction functions got added, conversion here was made incorrectly. As a result, the skb may end up pointing to skb which got included to the probe skb and then was freed. For it to trigger, however, skb_transmit must fail sending as well. Signed-off-by: Ilpo J�rvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b08d6cb22c777c8c91c16d8e3b8aafc93c98cbd9 |
|
12-Oct-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Limit processing lost_retrans loop to work-to-do cases This addition of lost_retrans_low to tcp_sock might be unnecessary, it's not clear how often lost_retrans worker is executed when there wasn't work to do. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
29d0a309d11bac9e57af914d0d6a35cde0080861 |
|
08-Oct-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Fix two off-by-one errors in fackets_out adjusting logic 1) Passing wrong skb to tcp_adjust_fackets_out could corrupt fastpath_cnt_hint as tcp_skb_pcount(next_skb) is not included to it if hint points exactly to the next_skb (it's lagging behind, see sacktag). 2) When fastpath_skb_hint is put backwards to avoid dangling skb reference, the skb's pcount must also be removed from count (not included like above). Reported by Cedric Le Goater <legoater@free.fr> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dc86967b54aaf64fb053cce83c05a4476d48583b |
|
02-Oct-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: No fackets_out/highest_sack tuning when SACK isn't enabled This was found due to bug report from Cedric Le Goater though it turned this turned out to be unrelated bug. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a6963a6b3d2d3921b60f45e0cf18be26495d5ad1 |
|
26-Sep-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Re-place highest_sack check to a more robust position I previously added checking to position that is rather poor as state has already been adjusted quite a bit. Re-placing it above all state changes should be more robust though the return should never ever get executed regardless of its place :-). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b76892051cf1c04d95872838e70146f65e3b9d75 |
|
20-Sep-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Avoid clearing sacktag hint in trivial situations There's no reason to clear the sacktag skb hint when small part of the rexmit queue changes. Account changes (if any) instead when fragmenting/collapsing. RTO/FRTO do not touch SACKED_ACKED bits so no need to discard SACK tag hint at all. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5af4ec236f7c98f3671fb26731457a172d85e0e6 |
|
20-Sep-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: clear_all_retrans_hints prefixed by tcp_ In addition, fix its function comment spacing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
|
91fed7a15c9222af29a653ecb0ee72cff178fdd8 |
|
09-Oct-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Make fackets_out accurate Substraction for fackets_out is unconditional when snd_una advances, thus there's no need to do it inside the loop. Just make sure correct bounds are honored. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0dde7b5404a3d52dcd9ce66d46197f6c3ca97dda |
|
20-Sep-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Maintain highest_sack accurately to the highest skb In general, it should not be necessary to call tcp_fragment for already SACKed skbs, but it's better to be safe than sorry. And indeed, it can be called from sacktag when a DSACK arrives or some ACK (with SACK) reordering occurs (sacktag could be made to avoid the call in the latter case though I'm not sure if it's worth of the trouble and added complexity to cover such marginal case). The collapse case has return for SACKED_ACKED case earlier, so just WARN_ON if internal inconsistency is detected for some reason. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
356f89e12e301376f26795643f3b5931c81c9cd5 |
|
25-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[NET] Cleanup: DIV_ROUND_UP Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6ff03ac355cc6c10f7b1f44dd466d41213acebca |
|
25-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: tcp_packets_out_inc to tcp_output.c (no callers elsewhere) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e9144bd8da80f3136b23c615609798e371e885ac |
|
25-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Remove unnecessary wrapper tcp_packets_out_dec Makes caller side more obvious, there's no need to have a wrapper for this oneliner! Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e60402d0a909ca2e6e2fbdf9ed004ef0fae36d33 |
|
09-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Move sack_ok access to obviously named funcs & cleanup Previously code had IsReno/IsFack defined as macros that were local to tcp_input.c though sack_ok field has user elsewhere too for the same purpose. This changes them to static inlines as preferred according the current coding style and unifies the access to sack_ok across multiple files. Magic bitops of sack_ok for FACK and DSACK are also abstracted to functions with appropriate names. Note: - One sack_ok = 1 remains but that's self explanary, i.e., it enables sack - Couple of !IsReno cases are changed to tcp_is_sack - There were no users for IsDSack => I dropped it Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
005903bc3a0e8473fef809e8775db52dcd3cde63 |
|
09-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Left out sync->verify (the new meaning of it) & definify Left_out was dropped a while ago, thus leaving verifying consistency of the "left out" as only task for the function in question. Thus make it's name more appropriate. In addition, it is intentionally converted to #define instead of static inline because the location of the invariant failure is the most important thing to have if this ever triggers. I think it would have been helpful e.g. in this case where the location of the failure point had to be based on some quesswork: http://lkml.org/lkml/2007/5/2/464 ...Luckily the guesswork seems to have proved to be correct. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b5860bbac7be1381626f3dc8a0cb970a60fcefb4 |
|
09-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Tighten tcp_sock's belt, drop left_out It is easily calculable when needed and user are not that many after all. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
af610b4ca19f513a50d47ea93ed57241383c8081 |
|
14-Jun-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Add tcp_dec_pcount_approx int variant Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bdf1ee5d3bd38d0c44bd7baa74e07adcbe4ceab1 |
|
27-May-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it No other users exist for tcp_ecn.h. Very few things remain in tcp.h, for most TCP ECN functions callers reside within a single .c file and can be placed there. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
539d243fdd7900fa5a544c7c154dc3ddf627e840 |
|
27-May-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Access to highest_sack obsoletes forward_cnt_hint In addition, added a reference about the purpose of the loop. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9c681b43fae1e402e39d157feaa5df454b9e4f1f |
|
19-Jul-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
d0410051164bbbc597e15f068b53c06a954ae0d4 |
|
03-Jul-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: SACK fastpath did override adjusted fackets_out Do same adjustment to SACK fastpath counters provided that they're valid. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e63340ae6b6205fef26b40a75673d1c9c0c8bb90 |
|
08-May-2007 |
Randy Dunlap <randy.dunlap@oracle.com> |
header cleaning: don't include smp_lock.h when not used Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
d551e4541dd60ae53459f77a971f2d6043431f5f |
|
30-Apr-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP] FRTO: RFC4138 allows Nagle override when new data must be sent This is a corner case where less than MSS sized new data thingie is awaiting in the send queue. For F-RTO to work correctly, a new data segment must be sent at certain point or F-RTO cannot be used at all. RFC4138 allows overriding of Nagle at that point. Implementation uses frto_counter states 2 and 3 to distinguish when Nagle override is needed. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
65bb723c9502b7ba0a3aad13bdac8832e213ba74 |
|
29-Apr-2007 |
Gerrit Renker <gerrit@erg.abdn.ac.uk> |
[TCP]: Update references in two old comments This updates references to drafts in comments which must be about 10 years old. Internet draft draft-ietf-tcpimpl-prob-03.txt expired in 1998 and was replaced by RFC 2525 in March 1999. Section 3.10 of the draft maps almost identically into section 2.17 of RFC 2525: both are entitled "Failure to RST on close with data pending", the differences in text body amount to a typo and minor sentence change. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
164891aadf1721fca4dce473bb0e0998181537c6 |
|
24-Apr-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[TCP]: Congestion control API update. Do some simple changes to make congestion control API faster/cleaner. * use ktime_t rather than timeval * merge rtt sampling into existing ack callback this means one indirect call versus two per ack. * use flags bits to store options/settings Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9e412ba7632f71259a53085665d4983b78257b7c |
|
21-Apr-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Sed magic converts func(sk, tp, ...) -> func(sk, ...) This is (mostly) automated change using magic: sed -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N' -e 's|struct sock \*sk,[\n\t ]*struct tcp_sock \*tp\([^{]*\n{\n\)| struct sock \*sk\1\tstruct tcp_sock *tp = tcp_sk(sk);\n|g' -e 's|struct sock \*sk, struct tcp_sock \*tp| struct sock \*sk|g' -e 's|sk, tp\([^-]\)|sk\1|g' Fixed four unused variable (tp) warnings that were introduced. In addition, manually added newlines after local variables and tweaked function arguments positioning. $ gcc --version gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1) ... $ codiff -fV built-in.o.old built-in.o.new net/ipv4/route.c: rt_cache_flush | +14 1 function changed, 14 bytes added net/ipv4/tcp.c: tcp_setsockopt | -5 tcp_sendpage | -25 tcp_sendmsg | -16 3 functions changed, 46 bytes removed net/ipv4/tcp_input.c: tcp_try_undo_recovery | +3 tcp_try_undo_dsack | +2 tcp_mark_head_lost | -12 tcp_ack | -15 tcp_event_data_recv | -32 tcp_rcv_state_process | -10 tcp_rcv_established | +1 7 functions changed, 6 bytes added, 69 bytes removed, diff: -63 net/ipv4/tcp_output.c: update_send_head | -9 tcp_transmit_skb | +19 tcp_cwnd_validate | +1 tcp_write_wakeup | -17 __tcp_push_pending_frames | -25 tcp_push_one | -8 tcp_send_fin | -4 7 functions changed, 20 bytes added, 63 bytes removed, diff: -43 built-in.o.new: 18 functions changed, 40 bytes added, 178 bytes removed, diff: -138 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1a4e2d093fd5f3eaf8cffc04a1b803f8b0ddef6d |
|
31-Mar-2007 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[SK_BUFF]: Some more conversions to skb_copy_from_linear_data Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
|
27a884dc3cb63b93c2b3b643f5b31eed5f8a4d26 |
|
20-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Convert skb->tail to sk_buff_data_t So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes on 64bit architectures, allowing us to combine the 4 bytes hole left by the layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4 64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN... :-) Many calculations that previously required that skb->{transport,network, mac}_header be first converted to a pointer now can be done directly, being meaningful as offsets or pointers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa8223c7bb0b05183e1737881ed21827aa5b9e73 |
|
11-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2de979bd7da9c8b39cc0aabb0ab5aa1516d929eb |
|
09-Mar-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[TCP]: whitespace cleanup Add whitespace around keywords. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
fe067e8ab5e0dc5ca3c54634924c628da92090b4 |
|
07-Mar-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Abstract out all write queue operations. This allows the write queue implementation to be changed, for example, to one which allows fast interval searching. Signed-off-by: David S. Miller <davem@davemloft.net>
|
3cfe3baaf07c9e40a75f9a70662de56df1c246a8 |
|
27-Feb-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Add two new spurious RTO responses to FRTO New sysctl tcp_frto_response is added to select amongst these responses: - Rate halving based; reuses CA_CWR state (default) - Very conservative; used to be the only one available (=1) - Undo cwr; undoes ssthresh and cwnd reductions (=2) The response with rate halving requires a new parameter to tcp_enter_cwr because FRTO has already reduced ssthresh and doing a second reduction there has to be prevented. In addition, to keep things nice on 80 cols screen, a local variable was added. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
15d33c070ddde99f7368e6b17b71d22c866f97d9 |
|
09-Apr-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: slow_start_after_idle should influence cwnd validation too For the cases that slow_start_after_idle are meant to deal with, it is almost a certainty that the congestion window tests will think the connection is application limited and we'll thus decrease the cwnd there too. This defeats the whole point of setting slow_start_after_idle to zero. So test it there too. We do not cancel out the entire tcp_cwnd_validate() function so that if the sysctl is changed we still have the validation state maintained. Signed-off-by: David S. Miller <davem@davemloft.net>
|
84565070e442583ec67fb08a5962c80203e491c3 |
|
02-Apr-2007 |
John Heffner <jheffner@psc.edu> |
[TCP]: Do receiver-side SWS avoidance for rcvbuf < MSS. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
600ff0c24bb71482e7f0da948a931d5c5d72838a |
|
13-Feb-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Prevent pseudo garbage in SYN's advertized window TCP may advertize up to 16-bits window in SYN packets (no window scaling allowed). At the same time, TCP may have rcv_wnd (32-bits) that does not fit to 16-bits without window scaling resulting in pseudo garbage into advertized window from the low-order bits of rcv_wnd. This can happen at least when mss <= (1<<wscale) (see tcp_select_initial_window). This patch fixes the handling of SYN advertized windows (compile tested only). In worst case (which is unlikely to occur though), the receiver advertized window could be just couple of bytes. I'm not sure that such situation would be handled very well at all by the receiver!? Fortunately, the situation normalizes after the first non-SYN ACK is received because it has the correct, scaled window. Alternatively, tcp_select_initial_window could be changed to prevent too large rcv_wnd in the first place. [ tcp_make_synack() has the same bug, and I've added a fix for that to this patch -DaveM ] Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e905a9edab7f4f14f9213b52234e4a346c690911 |
|
09-Feb-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
104439a8876a98eac1b6593907a3c7bc51e362fe |
|
06-Feb-2007 |
John Heffner <jheffner@psc.edu> |
[TCP]: Don't apply FIN exception to full TSO segments. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e89862f4c5b3c4ac9afcd8cb1365d2f1e16ddc3b |
|
26-Jan-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Restore SKB socket owner setting in tcp_transmit_skb(). Revert 931731123a103cfb3f70ac4b7abfc71d94ba1f03 We can't elide the skb_set_owner_w() here because things like certain netfilter targets (such as owner MATCH) need a socket to be set on the SKB for correct operation. Thanks to Jan Engelhardt and other netfilter list members for pointing this out. Signed-off-by: David S. Miller <davem@davemloft.net>
|
52d570aabe921663a987b2e4bae2bdc411cee480 |
|
24-Jan-2007 |
Jarek Poplawski <jarkao2@o2.pl> |
[TCP]: rare bad TCP checksum with 2.6.19 The patch "Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE" changed to unconditional copying of ip_summed field from collapsed skb. This patch reverts this change. The majority of substantial work including heavy testing and diagnosing by: Michael Tokarev <mjt@tls.msk.ru> Possible reasons pointed by: Herbert Xu and Patrick McHardy. Signed-off-by: Jarek Poplawski <jarkao2@o2.pl> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cfb6eeb4c860592edd123fdea908d23c6ad1c7dc |
|
15-Nov-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[TCP]: MD5 Signature Option (RFC2385) support. Based on implementation by Rick Payne. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b9df3cb8cf9a96e63dfdcd3056a9cbc71f2459e7 |
|
14-Nov-2006 |
Gerrit Renker <gerrit@erg.abdn.ac.uk> |
[TCP/DCCP]: Introduce net_xmit_eval Throughout the TCP/DCCP (and tunnelling) code, it often happens that the return code of a transmit function needs to be tested against NET_XMIT_CN which is a value that does not indicate a strict error condition. This patch uses a macro for these recurring situations which is consistent with the already existing macro net_xmit_errno, saving on duplicated code. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
931731123a103cfb3f70ac4b7abfc71d94ba1f03 |
|
10-Nov-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Don't set SKB owner in tcp_transmit_skb(). The data itself is already charged to the SKB, doing the skb_set_owner_w() just generates a lot of noise and extra atomics we don't really need. Lmbench improvements on lat_tcp are minimal: before: TCP latency using localhost: 23.2701 microseconds TCP latency using localhost: 23.1994 microseconds TCP latency using localhost: 23.2257 microseconds after: TCP latency using localhost: 22.8380 microseconds TCP latency using localhost: 22.9465 microseconds TCP latency using localhost: 22.8462 microseconds Signed-off-by: David S. Miller <davem@davemloft.net>
|
ae8064ac32d07f609114d73928cdef803be87134 |
|
19-Oct-2006 |
John Heffner <jheffner@psc.edu> |
[TCP]: Bound TSO defer time This patch limits the amount of time you will defer sending a TSO segment to less than two clock ticks, or the time between two acks, whichever is longer. On slow links, deferring causes significant bursts. See attached plots, which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue for (a) non-TSO, (b) currnet TSO, and (c) patched TSO. This burstiness causes significant jitter, tends to overflow queues early (bad for short queues), and makes delay-based congestion control more difficult. Deferring by a couple clock ticks I believe will have a relatively small impact on performance. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
496c98dff8e353880299168d36fa082d6fba5237 |
|
11-Oct-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET]: Use hton{l,s}() for non-initializers. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
df7a3b07c28e7b547d8252a237299e32880b505d |
|
28-Sep-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[TCP] net/ipv4/tcp_output.c: trivial annotations Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ab32ea5d8a760e7dd4339634e95d7be24ee5b842 |
|
22-Sep-2006 |
Brian Haley <brian.haley@hp.com> |
[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly. Couldn't actually measure any performance increase while testing (.3% I consider noise), but seems like the right thing to do. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
84fa7933a33f806bbbaae6775e87459b1ec584c0 |
|
30-Aug-2006 |
Patrick McHardy <kaber@trash.net> |
[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose checksum still needs to be completed) and CHECKSUM_COMPLETE (for incoming packets, device supplied full checksum). Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
316c1592bea94ead75301cb764523661fbbcc1ca |
|
22-Aug-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Limit window scaling if window is clamped. This small change allows for easy per-route workarounds for broken hosts or middleboxes that are not compliant with TCP standards for window scaling. Rather than having to turn off window scaling globally. This patch allows reducing or disabling window scaling if window clamp is present. Example: Mark Lord reported a problem with 2.6.17 kernel being unable to access http://www.everymac.com # ip route add 216.145.246.23/32 via 10.8.0.1 window 65535 Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bd37a088596ccdb2b2dd3299e25e333bca7a9a34 |
|
08-Aug-2006 |
Wei Yongjun <yjwei@nanjing-fnst.com> |
[TCP]: SNMPv2 tcpOutSegs counter error Do not count retransmitted segments. Signed-off-by: Wei Yongjun <yjwei@nanjing-fnst.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bcd76111178ebccedd46a9b3eaff65c78e5a70af |
|
30-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Generalise TSO-specific bits from skb_setup_caps This patch generalises the TSO-specific bits from sk_setup_caps by adding the sk_gso_type member to struct sock. This makes sk_setup_caps generic so that it can be used by TCPv6 or UFO. The only catch is that whoever uses this must provide a GSO implementation for their protocol which I think is a fair deal :) For now UFO continues to live without a GSO implementation which is OK since it doesn't use the sock caps field at the moment. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b0da8537037f337103348f239ad901477e907aa8 |
|
29-Jun-2006 |
Michael Chan <mchan@broadcom.com> |
[NET]: Add ECN support for TSO In the current TSO implementation, NETIF_F_TSO and ECN cannot be turned on together in a TCP connection. The problem is that most hardware that supports TSO does not handle CWR correctly if it is set in the TSO packet. Correct handling requires CWR to be set in the first packet only if it is set in the TSO header. This patch adds the ability to turn on NETIF_F_TSO and ECN using GSO if necessary to handle TSO packets with CWR set. Hardware that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev-> features flag. All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set. If the output device does not have the NETIF_F_TSO_ECN feature set, GSO will split the packet up correctly with CWR only set in the first segment. With help from Herbert Xu <herbert@gondor.apana.org.au>. Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock flag is completely removed. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7967168cefdbc63bf332d6b1548eca7cd65ebbcc |
|
22-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Merge TSO/UFO fields in sk_buff Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet. This function has been subsumed by the new gso_type field. This is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just have to and the gso_type field and the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
35089bb203f44e33b6bbb6c4de0b0708f9a48921 |
|
14-Jun-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Add tcp_slow_start_after_idle sysctl. A lot of people have asked for a way to disable tcp_cwnd_restart(), and it seems reasonable to add a sysctl to do that. Signed-off-by: David S. Miller <davem@davemloft.net>
|
f291196979ca80cdef199ca2b55e2758e8c23a0d |
|
06-Jun-2006 |
Herbert Xu ~{PmVHI~} <herbert@gondor.apana.org.au> |
[TCP]: Avoid skb_pull if possible when trimming head Trimming the head of an skb by calling skb_pull can cause the packet to become unaligned if the length pulled is odd. Since the length is entirely arbitrary for a FIN packet carrying data, this is actually quite common. Unaligned data is not the end of the world, but we should avoid it if it's easily done. In this case it is trivial. Since we're discarding all of the head data it doesn't matter whether we move skb->data forward or back. However, it is still possible to have unaligned skb->data in general. So network drivers should be prepared to handle it instead of crashing. This patch also adds an unlikely marking on len < headlen since partial ACKs on head data are extremely rare in the wild. As the return value of __pskb_trim_head is no longer ever NULL that has been removed. Signed-off-by: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
83de47cd0c5738105f40e65191b0761dfa7431ac |
|
29-Apr-2006 |
Hua Zhong <hzhong@gmail.com> |
[TCP]: Fix unlikely usage in tcp_transmit_skb() The following unlikely should be replaced by likely because the condition happens every time unless there is a hard error to transmit a packet. Signed-off-by: Hua Zhong <hzhong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b60b49ea6a3e1f8dcaf4148dad0daab61ab766d2 |
|
20-Apr-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Account skb overhead in tcp_fragment Make sure that we get the full sizeof(struct sk_buff) plus the data size accounted for in skb->truesize. This will create invariants that will allow adding assertion checks on skb->truesize. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ef5cb9738b488140eb6c3f32fffab08f39a4905e |
|
18-Apr-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix truesize underflow There is a problem with the TSO packet trimming code. The cause of this lies in the tcp_fragment() function. When we allocate a fragment for a completely non-linear packet the truesize is calculated for a payload length of zero. This means that truesize could in fact be less than the real payload length. When that happens the TSO packet trimming can cause truesize to become negative. This in turn can cause sk_forward_alloc to be -n * PAGE_SIZE which would trigger the warning. I've copied the code DaveM used in tso_fragment which should work here. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6c97e72a162648eaf7c401cfc139493cefa6bed2 |
|
12-Apr-2006 |
Adrian Bunk <bunk@stusta.de> |
[IPV4]: Possible cleanups. This patch contains the following possible cleanups: - make the following needlessly global function static: - arp.c: arp_rcv() - remove the following unused EXPORT_SYMBOL's: - devinet.c: devinet_ioctl - fib_frontend.c: ip_rt_ioctl - inet_hashtables.c: inet_bind_bucket_create - inet_hashtables.c: inet_bind_hash - tcp_input.c: sysctl_tcp_abc - tcp_ipv4.c: sysctl_tcp_tw_reuse - tcp_output.c: sysctl_tcp_mtu_probing - tcp_output.c: sysctl_tcp_base_mss Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
15d99e02babae8bc20b836917ace07d93e318149 |
|
21-Mar-2006 |
Rick Jones <rick.jones2@hp.com> |
[TCP]: sysctl to allow TCP window > 32767 sans wscale Back in the dark ages, we had to be conservative and only allow 15-bit window fields if the window scale option was not negotiated. Some ancient stacks used a signed 16-bit quantity for the window field of the TCP header and would get confused. Those days are long gone, so we can use the full 16-bits by default now. There is a sysctl added so that we can still interact with such old stacks Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e7b13685f9a06949ea3070c97c0f0085a08cd37 |
|
21-Mar-2006 |
John Heffner <jheffner@psc.edu> |
[TCP] mtu probing: move tcp-specific data out of inet_connection_sock This moves some TCP-specific MTU probing state out of inet_connection_sock back to tcp_sock. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5d424d5a674f782d0659a3b66d951f412901faee |
|
21-Mar-2006 |
John Heffner <jheffner@psc.edu> |
[TCP]: MTU probing Implementation of packetization layer path mtu discovery for TCP, based on the internet-draft currently found at <http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-05.txt>. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ba244fe9005323452428fee4b4b7d0c70a06b627 |
|
12-Mar-2006 |
David S. Miller <davem@davemloft.net> |
[TCP]: Fix tcp_tso_should_defer() when limit>=65536 That's >= a full sized TSO frame, so we should always return 0 in that case. Based upon a report and initial patch from Lachlan Andrew, final patch suggested by Herbert Xu. Signed-off-by: David S. Miller <davem@davemloft.net>
|
40efc6fa179f440a008333ea98f701bc35a1f97f |
|
04-Jan-2006 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: less inline's TCP inline usage cleanup: * get rid of inline in several places * replace __inline__ with inline where possible * move functions used in one file out of tcp.h * let compiler decide on used once cases On x86_64: text data bss dec hex filename 3594701 648348 567400 4810449 4966d1 vmlinux.orig 3593133 648580 567400 4809113 496199 vmlinux On sparc64: text data bss dec hex filename 2538278 406152 530392 3474822 350586 vmlinux.ORIG 2536382 406384 530392 3473158 34ff06 vmlinux Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d83d8461f902c672bc1bd8fbc6a94e19f092da97 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[IP_SOCKGLUE]: Remove most of the tcp specific calls As DCCP needs to be called in the same spots. Now we have a member in inet_sock (is_icsk), set at sock creation time from struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and DCCP) to see if a struct sock instance is a inet_connection_sock for places like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if sk_type was SOCK_STREAM, that is insufficient because we now use the same code for DCCP, that has sk_type SOCK_DCCP. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8292a17a399ffb7c5c8b083db4ad994e090055f7 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops And move it to struct inet_connection_sock. DCCP will use it in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dfb4b9dceb35c567a595ae5e9d035cfda044a103 |
|
07-Dec-2005 |
David S. Miller <davem@davemloft.net> |
[TCP] Vegas: timestamp before clone We have to store the congestion control timestamp on the SKB before we clone it, not after. Else we get no timestamping information at all. tcp_transmit_skb() has been reworked so that we can do the timestamp still in one spot, instead of at all the call sites. Problem discovered, and initial fix, from Tom Young <tyo@ee.unimelb.edu.au>. Signed-off-by: David S. Miller <davem@davemloft.net>
|
6a438bbe68c7013a42d9c5aee5a40d7dafdbe6ec |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: speed up SACK processing Use "hints" to speed up the SACK processing. Various forms of this have been used by TCP developers (Web100, STCP, BIC) to avoid the 2x linear search of outstanding segments. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
caa20d9abe810be2ede9612b6c9db6ce7d6edf80 |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: spelling fixes Minor spelling fixes for TCP code. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f4805eded7d38c4e42bf473dc5eb2f34853beb06 |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: fix congestion window update when using TSO deferal TCP peformance with TSO over networks with delay is awful. On a 100Mbit link with 150ms delay, we get 4Mbits/sec with TSO and 50Mbits/sec without TSO. The problem is with TSO, we intentionally do not keep the maximum number of packets in flight to fill the window, we hold out to until we can send a MSS chunk. But, we also don't update the congestion window unless we have filled, as per RFC2861. This patch replaces the check for the congestion window being full with something smarter that accounts for TSO. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b2cc99f04c5a732c793519aca61a20f719b50db4 |
|
20-Oct-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP] Allow len == skb->len in tcp_fragment It is legitimate to call tcp_fragment with len == skb->len since that is done for FIN packets and the FIN flag counts as one byte. So we should only check for the len > skb->len case. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
046d20b73960b7a2474b6d5e920d54c3fd7c23fe |
|
13-Oct-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Ratelimit debugging warning. Better safe than sorry. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9ff5c59ce278c37bca22fbf98076d199bcaf9845 |
|
13-Oct-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Add code to help track down "BUG at net/ipv4/tcp_output.c:438!" This is the second report of this bug. Unfortunately the first reporter hasn't been able to reproduce it since to provide more debugging info. So let's apply this patch for 2.6.14 to 1) Make this non-fatal. 2) Provide the info we need to track it down. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dd0fc66fb33cd610bc1a5db8a5e232d34879b4d7 |
|
07-Oct-2005 |
Al Viro <viro@ftp.linux.org.uk> |
[PATCH] gfp flags annotations - part 1 - added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
01ff367e62f0474e4d39aa5812cbe2a30d96e1e9 |
|
30-Sep-2005 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Revert 6b251858d377196b8cea20e65cae60f584a42735 But retain the comment fix. Alexey Kuznetsov has explained the situation as follows: -------------------- I think the fix is incorrect. Look, the RFC function init_cwnd(mss) is not continuous: f.e. for mss=1095 it needs initial window 1095*4, but for mss=1096 it is 1096*3. We do not know exactly what mss sender used for calculations. If we advertised 1096 (and calculate initial window 3*1096), the sender could limit it to some value < 1096 and then it will need window his_mss*4 > 3*1096 to send initial burst. See? So, the honest function for inital rcv_wnd derived from tcp_init_cwnd() is: init_rcv_wnd(mss)= min { init_cwnd(mss1)*mss1 for mss1 <= mss } It is something sort of: if (mss < 1096) return mss*4; if (mss < 1096*2) return 1096*4; return mss*2; (I just scrablled a graph of piece of paper, it is difficult to see or to explain without this) I selected it differently giving more window than it is strictly required. Initial receive window must be large enough to allow sender following to the rfc (or just setting initial cwnd to 2) to send initial burst. But besides that it is arbitrary, so I decided to give slack space of one segment. Actually, the logic was: If mss is low/normal (<=ethernet), set window to receive more than initial burst allowed by rfc under the worst conditions i.e. mss*4. This gives slack space of 1 segment for ethernet frames. For msses slighlty more than ethernet frame, take 3. Try to give slack space of 1 frame again. If mss is huge, force 2*mss. No slack space. Value 1460*3 is really confusing. Minimal one is 1096*2, but besides that it is an arbitrary value. It was meant to be ~4096. 1460*3 is just the magic number from RFC, 1460*3 = 1095*4 is the magic :-), so that I guess hands typed this themselves. -------------------- Signed-off-by: David S. Miller <davem@davemloft.net>
|
6b251858d377196b8cea20e65cae60f584a42735 |
|
29-Sep-2005 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Fix init_cwnd calculations in tcp_select_initial_window() Match it up to what RFC2414 really specifies. Noticed by Rick Jones. Signed-off-by: David S. Miller <davem@davemloft.net>
|
83ca28befc43e93849e79c564cda10e39d983e75 |
|
23-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Adjust Reno SACK estimate in tcp_fragment Since the introduction of TSO pcount a year ago, it has been possible for tcp_fragment() to cause packets_out to decrease. Prior to that, tcp_retrans_try_collapse() was the only way for that to happen on the retransmission path. When this happens with Reno, it is possible for sasked_out to become invalid because it is only an estimate and not tied to any particular packet on the retransmission queue. Therefore we need to adjust sacked_out as well as left_out in the Reno case. The following patch does exactly that. This bug is pretty difficult to trigger in practice though since you need a SACKless peer with a retransmission that occurs just as the cached MTU value expires. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e14c3caf605dfd29bd1aac3097e39db94afc9f07 |
|
20-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Handle SACK'd packets properly in tcp_fragment(). The problem is that we're now calling tcp_fragment() in a context where the packets might be marked as SACKED_ACKED or SACKED_RETRANS. This was not possible before as you never retransmitted packets that are so marked. Because of this, we need to adjust sacked_out and retrans_out in tcp_fragment(). This is exactly what the following patch does. We also need to preserve the SACKED_ACKED/SACKED_RETRANS marking if they exist. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3c05d92ed49f644d1f5a960fa48637d63b946016 |
|
15-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Compute in_sacked properly when we split up a TSO frame. The problem is that the SACK fragmenting code may incorrectly call tcp_fragment() with a length larger than the skb->len. This happens when the skb on the transmit queue completely falls to the LHS of the SACK. And add a BUG() check to tcp_fragment() so we can spot this kind of error more quickly in the future. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e130af5dab2abbf01c5d92ec5ac05912cf3d9aa7 |
|
11-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix double adjustment of tp->{lost,left}_out in tcp_fragment(). There is an extra left_out/lost_out adjustment in tcp_fragment which means that the lost_out accounting is always wrong. This patch removes that chunk of code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cf0b450cd5176b68ac7d5bbe68aeae6bb6a5a4b8 |
|
09-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix off by one in tcp_fragment() "already sent" test. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6475be16fd9b3c6746ca4d18959246b13c669ea8 |
|
02-Sep-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Keep TSO enabled even during loss events. All we need to do is resegment the queue so that we record SACK information accurately. The edges of the SACK blocks guide our resegmenting decisions. With help from Herbert Xu. Signed-off-by: David S. Miller <davem@davemloft.net>
|
d179cd12928443f3ec29cfbc3567439644bd0afc |
|
17-Aug-2005 |
David S. Miller <davem@davemloft.net> |
[NET]: Implement SKB fast cloning. Protocols that make extensive use of SKB cloning, for example TCP, eat at least 2 allocations per packet sent as a result. To cut the kmalloc() count in half, we implement a pre-allocation scheme wherein we allocate 2 sk_buff objects in advance, then use a simple reference count to free up the memory at the correct time. Based upon an initial patch by Thomas Graf and suggestions from Herbert Xu. Signed-off-by: David S. Miller <davem@davemloft.net>
|
a61bbcf28a8cb0ba56f8193d512f7222e711a294 |
|
15-Aug-2005 |
Patrick McHardy <kaber@trash.net> |
[NET]: Store skb->timestamp as offset to a base timestamp Reduces skb size by 8 bytes on 64-bit. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6687e988d9aeaccad6774e6a8304f681f3ec0a03 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Move TCP congestion avoidance members to icsk This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(), minimal renaming/moving done in this changeset to ease review. Most of it is just changes of struct tcp_sock * to struct sock * parameters. With this we move to a state closer to two interesting goals: 1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used for any INET transport protocol that has struct inet_hashinfo and are derived from struct inet_connection_sock. Keeps the userspace API, that will just not display DCCP sockets, while newer versions of tools can support DCCP. 2. INET generic transport pluggable Congestion Avoidance infrastructure, using the current TCP CA infrastructure with DCCP. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3f421baa4720b708022f8bcc52a61e5cd6f10bf8 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Just move the inet_connection_sock function from tcp sources Completing the previous changeset, this also generalises tcp_v4_synq_add, renaming it to inet_csk_reqsk_queue_hash_add, already geing used in the DCCP tree, which I plan to merge RSN. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
463c84b97f24010a67cd871746d6a7e4c925a5f9 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Introduce inet_connection_sock This creates struct inet_connection_sock, moving members out of struct tcp_sock that are shareable with other INET connection oriented protocols, such as DCCP, that in my private tree already uses most of these members. The functions that operate on these members were renamed, using a inet_csk_ prefix while not being moved yet to a new file, so as to ease the review of these changes. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8728b834b226ffcf2c94a58530090e292af2a7bf |
|
10-Aug-2005 |
David S. Miller <davem@davemloft.net> |
[NET]: Kill skb->list Remove the "list" member of struct sk_buff, as it is entirely redundant. All SKB list removal callers know which list the SKB is on, so storing this in sk_buff does nothing other than taking up some space. Two tricky bits were SCTP, which I took care of, and two ATM drivers which Francois Romieu <romieu@fr.zoreil.com> fixed up. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
|
14869c388673e8db3348ab3706fa6485d0f0cf95 |
|
23-Aug-2005 |
Dmitry Yusupov <dima@neterion.com> |
[TCP]: Do TSO deferral even if tail SKB can go out now. If the tail SKB fits into the window, it is still benefitical to defer until the goal percentage of the window is available. This give the application time to feed more data into the send queue and thus results in larger TSO frames going out. Patch from Dmitry Yusupov <dima@neterion.com>. Signed-off-by: David S. Miller <davem@davemloft.net>
|
35d59efd105b3b7c1b5878dcc9d1749f41f9740f |
|
17-Aug-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix bug #5070: kernel BUG at net/ipv4/tcp_output.c:864 1) We send out a normal sized packet with TSO on to start off. 2) ICMP is received indicating a smaller MTU. 3) We send the current sk_send_head which needs to be fragmented since it was created before the ICMP event. The first fragment is then sent out. At this point the remaining fragment is allocated by tcp_fragment. However, its size is padded to fit the L1 cache-line size therefore creating tail-room up to 124 bytes long. This fragment will also be sitting at sk_send_head. 4) tcp_sendmsg is called again and it stores data in the tail-room of of the fragment. 5) tcp_push_one is called by tcp_sendmsg which then calls tso_fragment since the packet as a whole exceeds the MTU. At this point we have a packet that has data in the head area being fed to tso_fragment which bombs out. My take on this is that we shouldn't ever call tcp_fragment on a TSO socket for a packet that is yet to be transmitted since this creates a packet on sk_send_head that cannot be extended. So here is a patch to change it so that tso_fragment is always used in this case. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c8ac37746489f05a32a958b048f29ae45487e81e |
|
17-Aug-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix bug #5070: kernel BUG at net/ipv4/tcp_output.c:864 1) We send out a normal sized packet with TSO on to start off. 2) ICMP is received indicating a smaller MTU. 3) We send the current sk_send_head which needs to be fragmented since it was created before the ICMP event. The first fragment is then sent out. At this point the remaining fragment is allocated by tcp_fragment. However, its size is padded to fit the L1 cache-line size therefore creating tail-room up to 124 bytes long. This fragment will also be sitting at sk_send_head. 4) tcp_sendmsg is called again and it stores data in the tail-room of of the fragment. 5) tcp_push_one is called by tcp_sendmsg which then calls tso_fragment since the packet as a whole exceeds the MTU. At this point we have a packet that has data in the head area being fed to tso_fragment which bombs out. My take on this is that we shouldn't ever call tcp_fragment on a TSO socket for a packet that is yet to be transmitted since this creates a packet on sk_send_head that cannot be extended. So here is a patch to change it so that tso_fragment is always used in this case. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b5da623ae9be680ea59f268eeb339f0acb2d88c4 |
|
11-Aug-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Adjust {p,f}ackets_out correctly in tcp_retransmit_skb() Well I've only found one potential cause for the assertion failure in tcp_mark_head_lost. First of all, this can only occur if cnt > 1 since tp->packets_out is never zero here. If it did hit zero we'd have much bigger problems. So cnt is equal to fackets_out - reordering. Normally fackets_out is less than packets_out. The only reason I've found that might cause fackets_out to exceed packets_out is if tcp_fragment is called from tcp_retransmit_skb with a TSO skb and the current MSS is greater than the MSS stored in the TSO skb. This might occur as the result of an expiring dst entry. In that case, packets_out may decrease (line 1380-1381 in tcp_output.c). However, fackets_out is unchanged which means that it may in fact exceed packets_out. Previously tcp_retrans_try_collapse was the only place where packets_out can go down and it takes care of this by decrementing fackets_out. So we should make sure that fackets_out is reduced by an appropriate amount here as well. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b68e9f857271189bd7a59b74c99890de9195b0e1 |
|
05-Aug-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[PATCH] tcp: fix TSO cwnd caching bug tcp_write_xmit caches the cwnd value indirectly in cwnd_quota. When tcp_transmit_skb reduces the cwnd because of tcp_enter_cwr, the cached value becomes invalid. This patch ensures that the cwnd value is always reread after each tcp_transmit_skb call. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
846998ae87a80b0fd45b4cf5cf001a159d746f27 |
|
05-Aug-2005 |
David S. Miller <davem@davemloft.net> |
[PATCH] tcp: fix TSO sizing bugs MSS changes can be lost since we preemptively initialize the tso_segs count for an SKB before we %100 commit to sending it out. So, by the time we send it out, the tso_size information can be stale due to PMTU events. This mucks up all of the logic in our send engine, and can even result in the BUG() triggering in tcp_tso_should_defer(). Another problem we have is that we're storing the tp->mss_cache, not the SACK block normalized MSS, as the tso_size. That's wrong too. Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
86a76caf8705e3524e15f343f3c4806939a06dc8 |
|
08-Jul-2005 |
Victor Fusco <victor@cetuc.puc-rio.br> |
[NET]: Fix sparse warnings From: Victor Fusco <victor@cetuc.puc-rio.br> Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
908a75c17a9e5a888347c2c1d3572203d1b1c7db |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Never TSO defer under periods of congestion. Congestion window recover after loss depends upon the fact that if we have a full MSS sized frame at the head of the send queue, we will send it. TSO deferral can defeat the ACK clocking necessary to exit cleanly from recovery. Signed-off-by: David S. Miller <davem@davemloft.net>
|
c1b4a7e69576d65efc31a8cea0714173c2841244 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Move to new TSO segmenting scheme. Make TSO segment transmit size decisions at send time not earlier. The basic scheme is that we try to build as large a TSO frame as possible when pulling in the user data, but the size of the TSO frame output to the card is determined at transmit time. This is guided by tp->xmit_size_goal. It is always set to a multiple of MSS and tells sendmsg/sendpage how large an SKB to try and build. Later, tcp_write_xmit() and tcp_push_one() chop up the packet if necessary and conditions warrant. These routines can also decide to "defer" in order to wait for more ACKs to arrive and thus allow larger TSO frames to be emitted. A general observation is that TSO elongates the pipe, thus requiring a larger congestion window and larger buffering especially at the sender side. Therefore, it is important that applications 1) get a large enough socket send buffer (this is accomplished by our dynamic send buffer expansion code) 2) do large enough writes. Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa93466bdfd901b926e033801f0b82b3eaa67be2 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Eliminate redundant computations in tcp_write_xmit(). tcp_snd_test() is run for every packet output by a single call to tcp_write_xmit(), but this is not necessary. For one, the congestion window space needs to only be calculated one time, then used throughout the duration of the loop. This cleanup also makes experimenting with different TSO packetization schemes much easier. Signed-off-by: David S. Miller <davem@davemloft.net>
|
7f4dd0a9438c73cbb1c240ece31390cf2c57294e |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Break out tcp_snd_test() into it's constituent parts. tcp_snd_test() does several different things, use inline functions to express this more clearly. 1) It initializes the TSO count of SKB, if necessary. 2) It performs the Nagle test. 3) It makes sure the congestion window is adhered to. 4) It makes sure SKB fits into the send window. This cleanup also sets things up so that things like the available packets in the congestion window does not need to be calculated multiple times by packet sending loops such as tcp_write_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
55c97f3e990c1ff63957c64f6cb10711a09fd70e |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Fix __tcp_push_pending_frames() 'nonagle' handling. 'nonagle' should be passed to the tcp_snd_test() function as 'TCP_NAGLE_PUSH' if we are checking an SKB not at the tail of the write_queue. This is because Nagle does not apply to such frames since we cannot possibly tack more data onto them. However, while doing this __tcp_push_pending_frames() makes all of the packets in the write_queue use this modified 'nonagle' value. Fix the bug and simplify this function by just calling tcp_write_xmit() directly if sk_send_head is non-NULL. As a result, we can now make tcp_data_snd_check() just call tcp_push_pending_frames() instead of the specialized __tcp_data_snd_check(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
a2e2a59c93cc8ba39caa9011c2573f429e40ccd9 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Fix redundant calculations of tcp_current_mss() tcp_write_xmit() uses tcp_current_mss(), but some of it's callers, namely __tcp_push_pending_frames(), already has this value available already. While we're here, fix the "cur_mss" argument to be "unsigned int" instead of plain "unsigned". Signed-off-by: David S. Miller <davem@davemloft.net>
|
92df7b518dcb113de8bc2494e3cd275ad887f12b |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: tcp_write_xmit() tabbing cleanup Put the main basic block of work at the top-level of tabbing, and mark the TCP_CLOSE test with unlikely(). Signed-off-by: David S. Miller <davem@davemloft.net>
|
a762a9800752f05fa8768bb0ac35d0e7f1bcfe7f |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Kill extra cwnd validate in __tcp_push_pending_frames(). The tcp_cwnd_validate() function should only be invoked if we actually send some frames, yet __tcp_push_pending_frames() will always invoke it. tcp_write_xmit() does the call for us, so the call here can simply be removed. Also, tcp_write_xmit() can be marked static. Signed-off-by: David S. Miller <davem@davemloft.net>
|
f44b527177d57ed382bfd93e1b55232465f6d058 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Add missing skb_header_release() call to tcp_fragment(). When we add any new packet to the TCP socket write queue, we must call skb_header_release() on it in order for the TSO sharing checks in the drivers to work. Signed-off-by: David S. Miller <davem@davemloft.net>
|
84d3e7b9573291a1ea845bdd51b74bb484597661 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Move __tcp_data_snd_check into tcp_output.c It reimplements portions of tcp_snd_check(), so it we move it to tcp_output.c we can consolidate it's logic much easier in a later change. Signed-off-by: David S. Miller <davem@davemloft.net>
|
f6302d1d78f77c2d4c8bd32b0afc2df7fdf5f281 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Move send test logic out of net/tcp.h This just moves the code into tcp_output.c, no code logic changes are made by this patch. Using this as a baseline, we can begin to untangle the mess of comparisons for the Nagle test et al. We will also be able to reduce all of the redundant computation that occurs when outputting data packets. Signed-off-by: David S. Miller <davem@davemloft.net>
|
fc6415bcb0f58f03adb910e56d7e1df6368794e0 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Fix quick-ack decrementing with TSO. On each packet output, we call tcp_dec_quickack_mode() if the ACK flag is set. It drops tp->ack.quick until it hits zero, at which time we deflate the ATO value. When doing TSO, we are emitting multiple packets with ACK set, so we should decrement tp->ack.quick that many segments. Note that, unlike this case, tcp_enter_cwr() should not take the tcp_skb_pcount(skb) into consideration. That function, one time, readjusts tp->snd_cwnd and moves into TCP_CA_CWR state. Signed-off-by: David S. Miller <davem@davemloft.net>
|
317a76f9a44b437d6301718f4e5d08bd93f98da7 |
|
23-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Add pluggable congestion control algorithm infrastructure. Allow TCP to have multiple pluggable congestion control algorithms. Algorithms are defined by a set of operations and can be built in or modules. The legacy "new RENO" algorithm is used as a starting point and fallback. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
60236fdd08b2169045a3bbfc5ffe1576e6c3c17b |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Rename open_request to request_sock Ok, this one just renames some stuff to have a better namespace and to dissassociate it from TCP: struct open_request -> struct request_sock tcp_openreq_alloc -> reqsk_alloc tcp_openreq_free -> reqsk_free tcp_openreq_fastfree -> __reqsk_free With this most of the infrastructure closely resembles a struct sock methods subset. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2e6599cb899ba4b133f42cbf9d2b1883d2dc583a |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Generalise TCP's struct open_request minisock infrastructure Kept this first changeset minimal, without changing existing names to ease peer review. Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn has two new members: ->slab, that replaces tcp_openreq_cachep ->obj_size, to inform the size of the openreq descendant for a specific protocol The protocol specific fields in struct open_request were moved to a class hierarchy, with the things that are common to all connection oriented PF_INET protocols in struct inet_request_sock, the TCP ones in tcp_request_sock, that is an inet_request_sock, that is an open_request. I.e. this uses the same approach used for the struct sock class hierarchy, with sk_prot indicating if the protocol wants to use the open_request infrastructure by filling in sk_prot->rsk_prot with an or_calltable. Results? Performance is improved and TCP v4 now uses only 64 bytes per open request minisock, down from 96 without this patch :-) Next changeset will rename some of the structs, fields and functions mentioned above, struct or_calltable is way unclear, better name it struct request_sock_ops, s/struct open_request/struct request_sock/g, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
02c30a84e6298b6b20a56f0896ac80b47839e134 |
|
06-May-2005 |
Jesper Juhl <juhl-lkml@dif.dk> |
[PATCH] update Ross Biro bouncing email address Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
d5ac99a648b8c61d0c7f1c32a8ab7f1dca0123d2 |
|
25-Apr-2005 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: skb pcount with MTU discovery The problem is that when doing MTU discovery, the too-large segments in the write queue will be calculated as having a pcount of >1. When tcp_write_xmit() is trying to send, tcp_snd_test() fails the cwnd test when pcount > cwnd. The segments are eventually transmitted one at a time by keepalive, but this can take a long time. This patch checks if TSO is enabled when setting pcount. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 |
|
17-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|