cd75eff64dae8856afbf6ef0f0ca3c145465d8e0 |
|
29-Apr-2013 |
Yuchung Cheng <ycheng@google.com> |
tcp: reset timer after any SYNACK retransmit Linux immediately returns SYNACK on (spurious) SYN retransmits, but keeps the SYNACK timer running independently. Thus the timer may fire right after the SYNACK retransmit and causes a SYN-SYNACK cross-fire burst. Adopt the fast retransmit/recovery idea in established state by re-arming the SYNACK timer after the fast (SYNACK) retransmit. The timer may fire late up to 500ms due to the current SYNACK timer wheel, but it's OK to be conservative when network is congested. Eric's new listener design should address this issue. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9b44190dc114c1720b34975b5bfc65aece112ced |
|
20-Mar-2013 |
Yuchung Cheng <ycheng@google.com> |
tcp: refactor F-RTO The patch series refactor the F-RTO feature (RFC4138/5682). This is to simplify the loss recovery processing. Existing F-RTO was developed during the experimental stage (RFC4138) and has many experimental features. It takes a separate code path from the traditional timeout processing by overloading CA_Disorder instead of using CA_Loss state. This complicates CA_Disorder state handling because it's also used for handling dubious ACKs and undos. While the algorithm in the RFC does not change the congestion control, the implementation intercepts congestion control in various places (e.g., frto_cwnd in tcp_ack()). The new code implements newer F-RTO RFC5682 using CA_Loss processing path. F-RTO becomes a small extension in the timeout processing and interfaces with congestion control and Eifel undo modules. It lets congestion control (module) determines how many to send independently. F-RTO only chooses what to send in order to detect spurious retranmission. If timeout is found spurious it invokes existing Eifel undo algorithms like DSACK or TCP timestamp based detection. The first patch removes all F-RTO code except the sysctl_tcp_frto is left for the new implementation. Since CA_EVENT_FRTO is removed, TCP westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1a2c6181c4a1922021b4d7df373bba612c3e5f04 |
|
17-Mar-2013 |
Christoph Paasch <christoph.paasch@uclouvain.be> |
tcp: Remove TCPCT TCPCT uses option-number 253, reserved for experimental use and should not be used in production environments. Further, TCPCT does not fully implement RFC 6013. As a nice side-effect, removing TCPCT increases TCP's performance for very short flows: Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests for files of 1KB size. before this patch: average (among 7 runs) of 20845.5 Requests/Second after: average (among 7 runs) of 21403.6 Requests/Second Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9b717a8d245075ffb8e95a2dfb4ee97ce4747457 |
|
11-Mar-2013 |
Nandita Dukkipati <nanditad@google.com> |
tcp: TLP loss detection. This is the second of the TLP patch series; it augments the basic TLP algorithm with a loss detection scheme. This patch implements a mechanism for loss detection when a Tail loss probe retransmission plugs a hole thereby masking packet loss from the sender. The loss detection algorithm relies on counting TLP dupacks as outlined in Sec. 3 of: http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01 The basic idea is: Sender keeps track of TLP "episode" upon retransmission of a TLP packet. An episode ends when the sender receives an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the episode. We want to make sure that before the episode ends the sender receives a "TLP dupack", indicating that the TLP retransmission was unnecessary, so there was no loss/hole that needed plugging. If the sender gets no TLP dupack before the end of the episode, then it reduces ssthresh and the congestion window, because the TLP packet arriving at the receiver probably plugged a hole. Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ee684b6f2830047d19877e5547989740f18b1a5d |
|
11-Feb-2013 |
Andrey Vagin <avagin@openvz.org> |
tcp: send packets with a socket timestamp A socket timestamp is a sum of the global tcp_time_stamp and a per-socket offset. A socket offset is added in places where externally visible tcp timestamp option is parsed/initialized. Connections in the SYN_RECV state are not supported, global tcp_time_stamp is used for them, because repair mode doesn't support this state. In a future it can be implemented by the similar way as for TIME_WAIT sockets. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ceaa1fef65a7c2e017b260b879b310dd24888083 |
|
11-Feb-2013 |
Andrey Vagin <avagin@openvz.org> |
tcp: adding a per-socket timestamp offset This functionality is used for restoring tcp sockets. A tcp timestamp depends on how long a system has been running, so it's differ for each host. The solution is to set a per-socket offset. A per-socket offset for a TIME_WAIT socket is inherited from a proper tcp socket. tcp_request_sock doesn't have a timestamp offset, because the repair mode for them are not implemented. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ca2eb5679f8ddffff60156af42595df44a315ef0 |
|
05-Feb-2013 |
Stephen Hemminger <stephen@networkplumber.org> |
tcp: remove Appropriate Byte Count support TCP Appropriate Byte Count was added by me, but later disabled. There is no point in maintaining it since it is a potential source of bugs and Linux already implements other better window protection heuristics. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6c022a4fa2d2d9ca9d0a7ac3b05ad988f39fc30 |
|
28-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: better retrans tracking for defer-accept For passive TCP connections using TCP_DEFER_ACCEPT facility, we incorrectly increment req->retrans each time timeout triggers while no SYNACK is sent. SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for which we received the ACK from client). Only the last SYNACK is sent so that we can receive again an ACK from client, to move the req into accept queue. We plan to change this later to avoid the useless retransmit (and potential problem as this SYNACK could be lost) TCP_INFO later gives wrong information to user, claiming imaginary retransmits. Decouple req->retrans field into two independent fields : num_retrans : number of retransmit num_timeout : number of timeouts num_timeout is the counter that is incremented at each timeout, regardless of actual SYNACK being sent or not, and used to compute the exponential timeout. Introduce inet_rtx_syn_ack() helper to increment num_retrans only if ->rtx_syn_ack() succeeded. Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans when we re-send a SYNACK in answer to a (retransmitted) SYN. Prior to this patch, we were not counting these retransmits. Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS only if a synack packet was successfully queued. Reported-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Elliott Hughes <enh@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6f73601efb35c7003f5c58c2bc6fd08f3652169c |
|
19-Oct-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: add SYN/data info to TCP_INFO Add a bit TCPI_OPT_SYN_DATA (32) to the socket option TCP_INFO:tcpi_options. It's set if the data in SYN (sent or received) is acked by SYN-ACK. Server or client application can use this information to check Fast Open success rate. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
072539880156fafd86c753070e598c23e8e74586 |
|
22-Sep-2012 |
Neal Cardwell <ncardwell@google.com> |
tcp: TCP Fast Open Server - note timestamps and retransmits for SYNACK RTT Previously, when using TCP Fast Open a server would return from tcp_check_req() before updating snt_synack based on TCP timestamp echo replies and whether or not we've retransmitted the SYNACK. The result was that (a) for TFO connections using timestamps we used an incorrect baseline SYNACK send time (tcp_time_stamp of SYNACK send instead of rcv_tsecr), and (b) for TFO connections that do not have TCP timestamps but retransmit the SYNACK we took a SYNACK RTT sample when we should not take a sample. This fix merely moves the snt_synack update logic a bit earlier in the function, so that connections using TCP Fast Open will properly do these updates when the ACK for the SYNACK arrives. Moving this snt_synack update logic means that with TCP_DEFER_ACCEPT enabled we do a few instructions of wasted work on each bare ACK, but that seems OK. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4308fc58dceda40fac8a8d9c05b7cfba0a6bbed3 |
|
19-Sep-2012 |
Alan Cox <alan@linux.intel.com> |
tcp: Document use of undefined variable. Both tcp_timewait_state_process and tcp_check_req use the same basic construct of struct tcp_options received tmp_opt; tmp_opt.saw_tstamp = 0; then call tcp_parse_options However if they are fed a frame containing a TCP_SACK then tbe code behaviour is undefined because opt_rx->sack_ok is undefined data. This ought to be documented if it is intentional. Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8336886f786fdacbc19b719c1f7ea91eb70706d4 |
|
31-Aug-2012 |
Jerry Chu <hkchu@google.com> |
tcp: TCP Fast Open Server - support TFO listeners This patch builds on top of the previous patch to add the support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before 3WHS finishes 5. supporting TCP_FASTOPEN socket option 6. modifying tcp_check_req() to use to check a TFO socket as well as request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that TFO server side will not retransmit anything other than SYN-ACK until the 3WHS is completed. The patch also contains an important function "reqsk_fastopen_remove()" to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for the detail. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
fae6ef87faeb8853896920c68ee703d715799d28 |
|
19-Aug-2012 |
Neal Cardwell <ncardwell@google.com> |
net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child() This commit removes the sk_rx_dst_set calls from tcp_create_openreq_child(), because at that point the icsk_af_ops field of ipv6_mapped TCP sockets has not been set to its proper final value. Instead, to make sure we get the right sk_rx_dst_set variant appropriate for the address family of the new connection, we have tcp_v{4,6}_syn_recv_sock() directly call the appropriate function shortly after the call to tcp_create_openreq_child() returns. This also moves inet6_sk_rx_dst_set() to avoid a forward declaration with the new approach. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Artem Savkov <artem.savkov@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5d299f3d3c8a2fbc732b1bf03af36333ccec3130 |
|
06-Aug-2012 |
Eric Dumazet <edumazet@google.com> |
net: ipv6: fix TCP early demux IPv6 needs a cookie in dst_check() call. We need to add rx_dst_cookie and provide a family independent sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
404e0a8b6a55d5e1cd138c6deb1bca9abdf75d8c |
|
30-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
net: ipv4: fix RCU races on dst refcounts commit c6cffba4ffa2 (ipv4: Fix input route performance regression.) added various fatal races with dst refcounts. crashes happen on tcp workloads if routes are added/deleted at the same time. The dst_free() calls from free_fib_info_rcu() are clearly racy. We need instead regular dst refcounting (dst_release()) and make sure dst_release() is aware of RCU grace periods : Add DST_RCU_FREE flag so that dst_release() respects an RCU grace period before dst destruction for cached dst Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero() to make sure we dont increase a zero refcount (On a dst currently waiting an rcu grace period before destruction) rt_cache_route() must take a reference on the new cached route, and release it if was not able to install it. With this patch, my machines survive various benchmarks. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
505fbcf035c245a1a42cd80184feecf61ee868dc |
|
27-Jul-2012 |
Eric Dumazet <edumazet@google.com> |
ipv4: fix TCP early demux commit 92101b3b2e317 (ipv4: Prepare for change of rt->rt_iif encoding.) invalidated TCP early demux, because rx_dst_ifindex is not properly initialized and checked. Also remove the use of inet_iif(skb) in favor or skb->skb_iif Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2100c8d2d9db23c0a09901a782bb4e3b21bee298 |
|
19-Jul-2012 |
Yuchung Cheng <ycheng@google.com> |
net-tcp: Fast Open base This patch impelements the common code for both the client and server. 1. TCP Fast Open option processing. Since Fast Open does not have an option number assigned by IANA yet, it shares the experiment option code 254 by implementing draft-ietf-tcpm-experimental-options with a 16 bits magic number 0xF989. This enables global experiments without clashing the scarce(2) experimental options available for TCP. When the draft status becomes standard (maybe), the client should switch to the new option number assigned while the server supports both numbers for transistion. 2. The new sysctl tcp_fastopen 3. A place holder init function Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
46d3ceabd8d98ed0ad10f20c595ca784e34786c5 |
|
11-Jul-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: TCP Small Queues This introduce TSQ (TCP Small Queues) TSQ goal is to reduce number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2 : It can help to reduce latencies of high prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, and with or without TSO/GSO. Without reduction of nominal bandwidth, we have reduction of buffering per bulk sender : < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in qdisc by a single netperf session, and both side socket autotuning no longer use 4 Mbytes. As skb destructor cannot restart xmit itself ( as qdisc lock might be taken at this point ), we delegate the work to a tasklet. We use one tasklest per cpu for performance reasons. If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b6242b9b45e84ef71c59002cd128c3197938cb2f |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
tcp: Remove tw->tw_peer No longer used. Signed-off-by: David S. Miller <davem@davemloft.net>
|
81166dd6fa8eb780b2132d32fbc77eb6ac04e44e |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
tcp: Move timestamps from inetpeer to metrics cache. With help from Lin Ming. Signed-off-by: David S. Miller <davem@davemloft.net>
|
41063e9dd11956f2d285e12e4342e1d232ba0ea2 |
|
20-Jun-2012 |
David S. Miller <davem@davemloft.net> |
ipv4: Early TCP socket demux. Input packet processing for local sockets involves two major demuxes. One for the route and one for the socket. But we can optimize this down to one demux for certain kinds of local sockets. Currently we only do this for established TCP sockets, but it could at least in theory be expanded to other kinds of connections. If a TCP socket is established then it's identity is fully specified. This means that whatever input route was used during the three-way handshake must work equally well for the rest of the connection since the keys will not change. Once we move to established state, we cache the receive packet's input route to use later. Like the existing cached route in sk->sk_dst_cache used for output packets, we have to check for route invalidations using dst->obsolete and dst->ops->check(). Early demux occurs outside of a socket locked section, so when a route invalidation occurs we defer the fixup of sk->sk_rx_dst until we are actually inside of established state packet processing and thus have the socket locked. Signed-off-by: David S. Miller <davem@davemloft.net>
|
2397849baa7c44c242e5d5142d5d16d1e7ed53d0 |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
[PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. Since it's guarenteed that we will access the inetpeer if we're trying to do timewait recycling and TCP options were enabled on the connection, just cache the peer in the timewait socket. In the future, inetpeer lookups will be context dependent (per routing realm), and this helps facilitate that as well. Signed-off-by: David S. Miller <davem@davemloft.net>
|
4670fd819e7f47392c7c6fc6168ea2857c66d163 |
|
09-Jun-2012 |
David S. Miller <davem@davemloft.net> |
tcp: Get rid of inetpeer special cases. The get_peer method TCP uses is full of special cases that make no sense accommodating, and it also gets in the way of doing more reasonable things here. First of all, if the socket doesn't have a usable cached route, there is no sense in trying to optimize timewait recycling. Likewise for the case where we have IP options, such as SRR enabled, that make the IP header destination address (and thus the destination address of the route key) differ from that of the connection's destination address. Just return a NULL peer in these cases, and thus we're also able to get rid of the clumsy inetpeer release logic. Signed-off-by: David S. Miller <davem@davemloft.net>
|
a2a385d627e1549da4b43a8b3dfe370589766e1c |
|
17-May-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: bool conversions bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
eed530b6c67624db3f2cf477bac7c4d005d8f7ba |
|
02-May-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: early retransmit This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a915da9b69273815527ccb3789421cb7027b545b |
|
31-Jan-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: md5: rcu conversion In order to be able to support proper RST messages for TCP MD5 flows, we need to allow access to MD5 keys without locking listener socket. This conversion is a nice cleanup, and shrinks size of timewait sockets by 80 bytes. IPv6 code reuses generic code found in IPv4 instead of duplicating it. Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Shawn Lu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dfd56b8b38fff3586f36232db58e1e9f7885a605 |
|
10-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: use IS_ENABLED(CONFIG_IPV6) Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d8a6e65f8b6b6b0142ebab578472906d89d63657 |
|
30-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: inherit listener congestion control for passive cnx Rick Jones reported that TCP_CONGESTION sockopt performed on a listener was ignored for its children sockets : right after accept() the congestion control for new socket is the system default one. This seems an oversight of the initial design (quoted from Stephen) Based on prior investigation and patch from Rick. Reported-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Yuchung Cheng <ycheng@google.com> Tested-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4e3fd7a06dc20b2d8ec6892233ad2012968fe7b6 |
|
21-Nov-2011 |
Alexey Dobriyan <adobriyan@gmail.com> |
net: remove ipv6_addr_copy() C assignment can handle struct in6_addr copying. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e56c57d0d3fdbbdf583d3af96bfb803b8dfa713e |
|
08-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: rename sk_clone to sk_clone_lock Make clear that sk_clone() and inet_csk_clone() return a locked socket. Add _lock() prefix and kerneldoc. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b903d324bee2627036d024dceed73b3c96558795 |
|
27-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT commit 66b13d99d96a (ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT) fixed IPv4 only. This part is for the IPv6 side, adding a tclass param to ip6_xmit() We alias tw_tclass and tw_tos, if socket family is INET6. [ if sockets is ipv4-mapped, only IP_TOS socket option is used to fill TOS field, TCLASS is not taken into account ] Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cf533ea53ebfae41be15b103d78e7ebec30b9969 |
|
21-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: add const qualifiers where possible Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow compiler to optimize code further. For example, is it legal to temporary write a null cksum into tcphdr in tcp_md5_hash_header() ? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
58af19e387d8821927e49be3f467da5e6a0aa8fd |
|
18-Oct-2011 |
KOVACS Krisztian <hidden@balabit.hu> |
tproxy: copy transparent flag when creating a time wait The transparent socket option setting was not copied to the time wait socket when an inet socket was being replaced by a time wait socket. This broke the --transparent option of the socket match and may have caused that FIN packets belonging to sockets in FIN_WAIT2 or TIME_WAIT state were being dropped by the packet filter. Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9ad7c049f0f79c418e293b1b68cf10d68f54fcdb |
|
08-Jun-2011 |
Jerry Chu <hkchu@google.com> |
tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side This patch lowers the default initRTO from 3secs to 1sec per RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet has been retransmitted, AND the TCP timestamp option is not on. It also adds support to take RTT sample during 3WHS on the passive open side, just like its active open counterpart, and uses it, if valid, to seed the initRTO for the data transmission phase. The patch also resets ssthresh to its initial default at the beginning of the data transmission phase, and reduces cwnd to 1 if there has been MORE THAN ONE retransmission during 3WHS per RFC5681. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
67631510a318d5a930055fe927607f483716e100 |
|
08-Dec-2010 |
Tom Herbert <therbert@google.com> |
tcp: Replace time wait bucket msg by counter Rather than printing the message to the log, use a mib counter to keep track of the count of occurences of time wait bucket overflow. Reduces spam in logs. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ccb7c410ddc054b8c1ae780319bc98ae092d3854 |
|
02-Dec-2010 |
David S. Miller <davem@davemloft.net> |
timewait_sock: Create and use getpeer op. The only thing AF-specific about remembering the timestamp for a time-wait TCP socket is getting the peer. Abstract that behind a new timewait_sock_ops vector. Support for real IPV6 sockets is not filled in yet, but curiously this makes timewait recycling start to work for v4-mapped ipv6 sockets. Signed-off-by: David S. Miller <davem@davemloft.net>
|
3f419d2d487821093ee46e898b5f8747f9edc9cd |
|
29-Nov-2010 |
David S. Miller <davem@davemloft.net> |
inet: Turn ->remember_stamp into ->get_peer in connection AF ops. Then we can make a completely generic tcp_remember_stamp() that uses ->get_peer() as a helper, minimizing the AF specific code and minimizing the eventual code duplication when we implement the ipv6 side of TW recycling. Signed-off-by: David S. Miller <davem@davemloft.net>
|
a02cec2155fbea457eca8881870fd2de1a4c4c76 |
|
22-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4bc2f18ba4f22a90ab593c0a580fc9a19c4777b6 |
|
09-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net/ipv4: EXPORT_SYMBOL cleanups CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5a0e3ad6af8660be21ca98a971cd00f331318c05 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
907cdda5205b012eec7513f66713749b293188c9 |
|
19-Mar-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: Add SNMP counter for DEFER_ACCEPT Its currently hard to diagnose when ACK frames are dropped because an application set TCP_DEFER_ACCEPT on its listening socket. See http://bugzilla.kernel.org/show_bug.cgi?id=15507 This patch adds a SNMP value, named TCPDeferAcceptDrop netstat -s | grep TCPDeferAcceptDrop TCPDeferAcceptDrop: 0 This counter is incremented every time we drop a pure ACK frame received by a socket in SYN_RECV state because its SYNACK retrans count is lower than defer_accept value. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a3a858ff18a72a8d388e31ab0d98f7e944841a62 |
|
04-Mar-2010 |
Zhu Yi <yi.zhu@intel.com> |
net: backlog functions rename sk_add_backlog -> __sk_add_backlog sk_add_backlog_limited -> sk_add_backlog Signed-off-by: Zhu Yi <yi.zhu@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bb5b7c11263dbbe78253cd05945a6bf8f55add8e |
|
16-Dec-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. It creates a regression, triggering badness for SYN_RECV sockets, for example: [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293 [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000 [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32) [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000 [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000 This is likely caused by the change in the 'estab' parameter passed to tcp_parse_options() when invoked by the functions in net/ipv4/tcp_minisocks.c But even if that is fixed, the ->conn_request() changes made in this patch series is fundamentally wrong. They try to use the listening socket's 'dst' to probe the route settings. The listening socket doesn't even have a route, and you can't get the right route (the child request one) until much later after we setup all of the state, and it must be done by hand. This stuff really isn't ready, so the best thing to do is a full revert. This reverts the following commits: f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d 022c3f7d82f0f1c68018696f2f027b87b9bb45c2 1aba721eba1d84a2defce45b950272cee1e6c72a cda42ebd67ee5fdf09d7057b5a4584d36fe8a335 345cda2fd695534be5a4494f1b59da9daed33663 dc343475ed062e13fc260acccaab91d7d80fd5b2 05eaade2782fb0c90d3034fd7a7d5a16266182bb 6a2a2d6bf8581216e08be15fcb563cfd6c430e1e Signed-off-by: David S. Miller <davem@davemloft.net>
|
4957faade11b3a278c3b3cade3411ddc20afa791 |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1g: Responder Cookie => Initiator Parse incoming TCP_COOKIE option(s). Calculate <SYN,ACK> TCP_COOKIE option. Send optional <SYN,ACK> data. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1f: Initiator Cookie => Responder Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
435cf559f02ea3a3159eb316f97dc88bdebe9432 |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1d: define TCP cookie option, extend existing struct's Data structures are carefully composed to require minimal additions. For example, the struct tcp_options_received cookie_plus variable fits between existing 16-bit and 8-bit variables, requiring no additional space (taking alignment into consideration). There are no additions to tcp_request_sock, and only 1 pointer in tcp_sock. This is a significantly revised implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 The principle difference is using a TCP option to carry the cookie nonce, instead of a user configured offset in the data. This is more flexible and less subject to user configuration error. Such a cookie option has been suggested for many years, and is also useful without SYN data, allowing several related concepts to use the same extension option. "Re: SYN floods (was: does history repeat itself?)", September 9, 1996. http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html "Re: what a new TCP header might look like", May 12, 1998. ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail These functions will also be used in subsequent patches that implement additional features. Requires: TCPCT part 1a: add request_values parameter for sending SYNACK TCPCT part 1b: generate Responder Cookie secret TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6b4d11367519bc71729c09d05a126b133c755be |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1a: add request_values parameter for sending SYNACK Add optional function parameters associated with sending SYNACK. These parameters are not needed after sending SYNACK, and are not used for retransmission. Avoids extending struct tcp_request_sock, and avoids allocating kernel memory. Also affects DCCP as it uses common struct request_sock_ops, but this parameter is currently reserved for future use. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e994b7c901ded7200b525a707c6da71f2cf6d4bb |
|
21-Nov-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTL That's extremely non-intuitive, noticed by William Allen Simpson. And let's make the default be on, it's been suggested by a lot of people so we'll give it a try. Signed-off-by: David S. Miller <davem@davemloft.net>
|
bee7ca9ec03a26676ea2b1c28dc4039348eff3e1 |
|
10-Nov-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED Define two symbols needed in both kernel and user space. Remove old (somewhat incorrect) kernel variant that wasn't used in most cases. Default should apply to both RMSS and SMSS (RFC2581). Replace numeric constants with defined symbols. Stand-alone patch, originally developed for TCPCT. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
05eaade2782fb0c90d3034fd7a7d5a16266182bb |
|
05-Nov-2009 |
Gilad Ben-Yossef <gilad@codefidence.com> |
tcp: Do not call IPv4 specific func in tcp_check_req Calling IPv4 specific inet_csk_route_req in tcp_check_req is a bad idea and crashes machine on IPv6 connections, as reported by Valdis Kletnieks Also, all we are really interested in is the timestamp option in the header, so calling tcp_parse_options() with the "estab" set to false flag is an overkill as it tries to parse half a dozen other TCP options. We know whether timestamp should be enabled or not using data from request_sock. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Tested-by: Valdis.Kletnieks@vt.edu Signed-off-by: David S. Miller <davem@davemloft.net>
|
022c3f7d82f0f1c68018696f2f027b87b9bb45c2 |
|
28-Oct-2009 |
Gilad Ben-Yossef <gilad@codefidence.com> |
Allow tcp_parse_options to consult dst entry We need tcp_parse_options to be aware of dst_entry to take into account per dst_entry TCP options settings Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Sigend-off-by: Ori Finkelman <ori@comsleep.com> Sigend-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d |
|
28-Oct-2009 |
Gilad Ben-Yossef <gilad@codefidence.com> |
Only parse time stamp TCP option in time wait sock Since we only use tcp_parse_options here to check for the exietence of TCP timestamp option in the header, it is better to call with the "established" flag on. Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com> Signed-off-by: Ori Finkelman <ori@comsleep.com> Signed-off-by: Yony Amit <yony@comsleep.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d1b99ba41d6c5aa1ed2fc634323449dd656899e9 |
|
19-Oct-2009 |
Julian Anastasov <ja@ssi.bg> |
tcp: accept socket after TCP_DEFER_ACCEPT period Willy Tarreau and many other folks in recent years were concerned what happens when the TCP_DEFER_ACCEPT period expires for clients which sent ACK packet. They prefer clients that actively resend ACK on our SYN-ACK retransmissions to be converted from open requests to sockets and queued to the listener for accepting after the deferring period is finished. Then application server can decide to wait longer for data or to properly terminate the connection with FIN if read() returns EAGAIN which is an indication for accepting after the deferring period. This change still can have side effects for applications that expect always to see data on the accepted socket. Others can be prepared to work in both modes (with or without TCP_DEFER_ACCEPT period) and their data processing can ignore the read=EAGAIN notification and to allocate resources for clients which proved to have no data to send during the deferring period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not as a timeout) to wait for data will notice clients that didn't send data for 3 seconds but that still resend ACKs. Thanks to Willy Tarreau for the initial idea and to Eric Dumazet for the review and testing the change. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a1a2ad9151c26d92e5c733a33d52108f5d3a5b57 |
|
20-Oct-2009 |
David S. Miller <davem@davemloft.net> |
Revert "tcp: fix tcp_defer_accept to consider the timeout" This reverts commit 6d01a026b7d3009a418326bdcf313503a314f1ea. Julian Anastasov, Willy Tarreau and Eric Dumazet have come up with a more correct way to deal with this. Signed-off-by: David S. Miller <davem@davemloft.net>
|
6d01a026b7d3009a418326bdcf313503a314f1ea |
|
13-Oct-2009 |
Willy Tarreau <w@1wt.eu> |
tcp: fix tcp_defer_accept to consider the timeout I was trying to use TCP_DEFER_ACCEPT and noticed that if the client does not talk, the connection is never accepted and remains in SYN_RECV state until the retransmits expire, where it finally is deleted. This is bad when some firewall such as netfilter sits between the client and the server because the firewall sees the connection in ESTABLISHED state while the server will finally silently drop it without sending an RST. This behaviour contradicts the man page which says it should wait only for some time : TCP_DEFER_ACCEPT (since Linux 2.4) Allows a listener to be awakened only when data arrives on the socket. Takes an integer value (seconds), this can bound the maximum number of attempts TCP will make to complete the connection. This option should not be used in code intended to be portable. Also, looking at ipv4/tcp.c, a retransmit counter is correctly computed : case TCP_DEFER_ACCEPT: icsk->icsk_accept_queue.rskq_defer_accept = 0; if (val > 0) { /* Translate value in seconds to number of * retransmits */ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 && val > ((TCP_TIMEOUT_INIT / HZ) << icsk->icsk_accept_queue.rskq_defer_accept)) icsk->icsk_accept_queue.rskq_defer_accept++; icsk->icsk_accept_queue.rskq_defer_accept++; } break; ==> rskq_defer_accept is used as a counter of retransmits. But in tcp_minisocks.c, this counter is only checked. And in fact, I have found no location which updates it. So I think that what was intended was to decrease it in tcp_minisocks whenever it is checked, which the trivial patch below does. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
657e9649e745b06675aa5063c84430986cdc3afa |
|
16-Sep-2009 |
Robert Varga <nite@hq.alert.sk> |
tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG() I have recently came across a preemption imbalance detected by: <4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101? <0>------------[ cut here ]------------ <2>kernel BUG at /usr/src/linux/kernel/timer.c:664! <0>invalid opcode: 0000 [1] PREEMPT SMP with ffffffff80644630 being inet_twdr_hangman(). This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a bit, so I looked at what might have caused it. One thing that struck me as strange is tcp_twsk_destructor(), as it calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well, as far as I can tell. Signed-off-by: Robert Varga <nite@hq.alert.sk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0b6a05c1dbebe8c616e2e5b0f52b7a01fd792911 |
|
15-Sep-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix ssthresh u16 leftover It was once upon time so that snd_sthresh was a 16-bit quantity. ...That has not been true for long period of time. I run across some ancient compares which still seem to trust such legacy. Put all that magic into a single place, I hopefully found all of them. Compile tested, though linking of allyesconfig is ridiculous nowadays it seems. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa1330766c49199bdab4d4a9096d98b072df9044 |
|
03-Sep-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
tcp: replace hard coded GFP_KERNEL with sk_allocation This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9a7030b76ab3c2b23b1629c49b6df33fb5aed305 |
|
19-Aug-2009 |
John Dykstra <john.dykstra1@gmail.com> |
tcp: Remove redundant copy of MD5 authentication key Remove the copy of the MD5 authentication key from tcp_check_req(). This key has already been copied by tcp_v4_syn_recv_sock() or tcp_v6_syn_recv_sock(). Signed-off-by: John Dykstra <john.dykstra1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1ac530b3553e0b4dc1e18a32bed57cfa84cd57cb |
|
25-Jun-2009 |
Wei Yongjun <yjwei@cn.fujitsu.com> |
tcp: missing check ACK flag of received segment in FIN-WAIT-2 state RFC0793 defined that in FIN-WAIT-2 state if the ACK bit is off drop the segment and return[Page 72]. But this check is missing in function tcp_timewait_state_process(). This cause the segment with FIN flag but no ACK has two diffent action: Case 1: Node A Node B <------------- FIN,ACK (enter FIN-WAIT-1) ACK -------------> (enter FIN-WAIT-2) FIN -------------> discard (move sk to tw list) Case 2: Node A Node B <------------- FIN,ACK (enter FIN-WAIT-1) ACK -------------> (enter FIN-WAIT-2) (move sk to tw list) FIN -------------> <------------- ACK This patch fixed the problem. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c887e6d2d9aee56ee7c9f2af4cec3a5efdcc4c72 |
|
14-Mar-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: consolidate paws check Wow, it was quite tricky to merge that stream of negations but I think I finally got it right: check & replace_ts_recent: (s32)(rcv_tsval - ts_recent) >= 0 => 0 (s32)(ts_recent - rcv_tsval) <= 0 => 0 discard: (s32)(ts_recent - rcv_tsval) > TCP_PAWS_WINDOW => 1 (s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW => 0 I toggled the return values of tcp_paws_check around since the old encoding added yet-another negation making tracking of truth-values really complicated. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ee7537b63a28b42b22e48842dfeedc66d96b71f1 |
|
03-Mar-2009 |
Hantzis Fotis <xantzis@ceid.upatras.gr> |
tcp: tcp_init_wl / tcp_update_wl argument cleanup The above functions from include/net/tcp.h have been defined with an argument that they never use. The argument is 'u32 ack' which is never used inside the function body, and thus it can be removed. The rest of the patch involves the necessary changes to the function callers of the above two functions. Signed-off-by: Hantzis Fotis <xantzis@ceid.upatras.gr> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cabeccbd172cc305f4383f5a4808ae254745275f |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: kill eff_sacks "cache", the sole user can calculate itself Also fixes insignificant bug that would cause sending of stale SACK block (would occur in some corner cases). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5a5f3a8db9d70c90e9d55b46e02b2d8deb1c2c2e |
|
03-Nov-2008 |
Jianjun Kong <jianjun@zeuux.org> |
net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c xfrm4_policy.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
33f5f57eeb0c6386fdd85f9c690dc8d700ba7928 |
|
07-Oct-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: kill pointless urg_mode It all started from me noticing that this urgent check in tcp_clean_rtx_queue is unnecessarily inside the loop. Then I took a longer look to it and found out that the users of urg_mode can trivially do without, well almost, there was one gotcha. Bonus: those funny people who use urg with >= 2^31 write_seq - snd_una could now rejoice too (that's the only purpose for the between being there, otherwise a simple compare would have done the thing). Not that I assume that the rest of the tcp code happily lives with such mind-boggling numbers :-). Alas, it turned out to be impossible to set wmem to such numbers anyway, yes I really tried a big sendfile after setting some wmem but nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf... So I hacked a bit variable to long and found out that it seems to work... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2aaab9a0ccfd2ccf1c957cc2120da8d5593955c5 |
|
08-Aug-2008 |
Adam Langley <agl@imperialviolet.org> |
tcp: (whitespace only) fix confusing indentation The indentation in part of tcp_minisocks makes it look like one of the if statements is much more important than it actually is. Signed-off-by: Adam Langley <agl@imperialviolet.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6edafaaf6f5e70ef1e620ff01bd6bacebe1e0718 |
|
07-Aug-2008 |
Gui Jianfeng <guijianfeng@cn.fujitsu.com> |
tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup If the following packet flow happen, kernel will panic. MathineA MathineB SYN ----------------------> SYN+ACK <---------------------- ACK(bad seq) ----------------------> When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr)) is finally called by tcp_v4_reqsk_send_ack(), but the first parameter(skb->sk) is NULL at that moment, so kernel panic happens. This patch fixes this bug. OOPS output is as following: [ 302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 [ 302.817075] Oops: 0000 [#1] SMP [ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan] [ 302.849946] [ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5) [ 302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0 [ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42 [ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046 [ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54 [ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 [ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000) [ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400 [ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0 [ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634 [ 302.900140] Call Trace: [ 302.902392] [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35 [ 302.907060] [<c05d28f8>] tcp_check_req+0x156/0x372 [ 302.910082] [<c04250a3>] printk+0x14/0x18 [ 302.912868] [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf [ 302.917423] [<c05d26be>] tcp_v4_rcv+0x563/0x5b9 [ 302.920453] [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183 [ 302.923865] [<c05bb10a>] ip_rcv_finish+0x286/0x2a3 [ 302.928569] [<c059e438>] dev_alloc_skb+0x11/0x25 [ 302.931563] [<c05a211f>] netif_receive_skb+0x2d6/0x33a [ 302.934914] [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32] [ 302.938735] [<c05a3b48>] net_rx_action+0x5c/0xfe [ 302.941792] [<c042856b>] __do_softirq+0x5d/0xc1 [ 302.944788] [<c042850e>] __do_softirq+0x0/0xc1 [ 302.948999] [<c040564b>] do_softirq+0x55/0x88 [ 302.951870] [<c04501b1>] handle_fasteoi_irq+0x0/0xa4 [ 302.954986] [<c04284da>] irq_exit+0x35/0x69 [ 302.959081] [<c0405717>] do_IRQ+0x99/0xae [ 302.961896] [<c040422b>] common_interrupt+0x23/0x28 [ 302.966279] [<c040819d>] default_idle+0x2a/0x3d [ 302.969212] [<c0402552>] cpu_idle+0xb2/0xd2 [ 302.972169] ======================= [ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6 [ 303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54 [ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
de0744af1fe2d0a3d428f6af0f2fe1f6179b1a9c |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
63231bddf6514778792d3784f63822473d250fc0 |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to TCP_INC_STATS_BH Same as before - the sock is always there to get the net from, but there are also some places with the net already saved on the stack. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ec0a196626bd12e0ba108d7daa6d95a4fb25c2c5 |
|
13-Jun-2008 |
David S. Miller <davem@davemloft.net> |
tcp: Revert 'process defer accept as established' changes. This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 ("tcp: Fix slab corruption with ipv6 and tcp6fuzz"). This change causes several problems, first reported by Ingo Molnar as a distcc-over-loopback regression where connections were getting stuck. Ilpo Järvinen first spotted the locking problems. The new function added by this code, tcp_defer_accept_check(), only has the child socket locked, yet it is modifying state of the parent listening socket. Fixing that is non-trivial at best, because we can't simply just grab the parent listening socket lock at this point, because it would create an ABBA deadlock. The normal ordering is parent listening socket --> child socket, but this code path would require the reverse lock ordering. Next is a problem noticed by Vitaliy Gusev, he noted: ---------------------------------------- >--- a/net/ipv4/tcp_timer.c >+++ b/net/ipv4/tcp_timer.c >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data) > goto death; > } > >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) { >+ tcp_send_active_reset(sk, GFP_ATOMIC); >+ goto death; Here socket sk is not attached to listening socket's request queue. tcp_done() will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should release this sk) as socket is not DEAD. Therefore socket sk will be lost for freeing. ---------------------------------------- Finally, Alexey Kuznetsov argues that there might not even be any real value or advantage to these new semantics even if we fix all of the bugs: ---------------------------------------- Hiding from accept() sockets with only out-of-order data only is the only thing which is impossible with old approach. Is this really so valuable? My opinion: no, this is nothing but a new loophole to consume memory without control. ---------------------------------------- So revert this thing for now. Signed-off-by: David S. Miller <davem@davemloft.net>
|
0b040829952d84bf2a62526f0e24b624e0699447 |
|
11-Jun-2008 |
Adrian Bunk <bunk@kernel.org> |
net: remove CVS keywords This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 |
|
22-Mar-2008 |
Patrick McManus <mcmanus@ducksong.com> |
[TCP]: TCP_DEFER_ACCEPT updates - process as established Change TCP_DEFER_ACCEPT implementation so that it transitions a connection to ESTABLISHED after handshake is complete instead of leaving it in SYN-RECV until some data arrvies. Place connection in accept queue when first data packet arrives from slow path. Benefits: - established connection is now reset if it never makes it to the accept queue - diagnostic state of established matches with the packet traces showing completed handshake - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be enforced with reasonable accuracy instead of rounding up to next exponential back-off of syn-ack retry. Signed-off-by: Patrick McManus <mcmanus@ducksong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c6aefafb7ec620911d46174eed514f9df639e5a4 |
|
08-Feb-2008 |
Glenn Griffin <ggriffin.kernel@gmail.com> |
[TCP]: Add IPv6 support to TCP SYN cookies Updated to incorporate Eric's suggestion of using a per cpu buffer rather than allocating on the stack. Just a two line change, but will resend in it's entirety. Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com> Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
fd80eb942ad9761f241c9b287b3b9a342b20690d |
|
29-Feb-2008 |
Denis V. Lunev <den@openvz.org> |
[INET]: Remove struct dst_entry *dst from request_sock_ops.rtx_syn_ack. It looks like dst parameter is used in this API due to historical reasons. Actually, it is really used in the direct call to tcp_v4_send_synack only. So, create a wrapper for tcp_v4_send_synack and remove dst from rtx_syn_ack. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e60402d0a909ca2e6e2fbdf9ed004ef0fae36d33 |
|
09-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Move sack_ok access to obviously named funcs & cleanup Previously code had IsReno/IsFack defined as macros that were local to tcp_input.c though sack_ok field has user elsewhere too for the same purpose. This changes them to static inlines as preferred according the current coding style and unifies the access to sack_ok across multiple files. Magic bitops of sack_ok for FACK and DSACK are also abstracted to functions with appropriate names. Note: - One sack_ok = 1 remains but that's self explanary, i.e., it enables sack - Couple of !IsReno cases are changed to tcp_is_sack - There were no users for IsDSack => I dropped it Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b5860bbac7be1381626f3dc8a0cb970a60fcefb4 |
|
09-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Tighten tcp_sock's belt, drop left_out It is easily calculable when needed and user are not that many after all. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bdf1ee5d3bd38d0c44bd7baa74e07adcbe4ceab1 |
|
27-May-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it No other users exist for tcp_ecn.h. Very few things remain in tcp.h, for most TCP ECN functions callers reside within a single .c file and can be placed there. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa8223c7bb0b05183e1737881ed21827aa5b9e73 |
|
11-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2de979bd7da9c8b39cc0aabb0ab5aa1516d929eb |
|
09-Mar-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[TCP]: whitespace cleanup Add whitespace around keywords. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9d729f72dca9406025bcfa9c1f660d71d9ef0ff5 |
|
05-Mar-2007 |
James Morris <jmorris@namei.org> |
[NET]: Convert xtime.tv_sec to get_seconds() Where appropriate, convert references to xtime.tv_sec to the get_seconds() helper function. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
54287cc178cf85dbae0decec8b4dc190bff757ad |
|
22-Feb-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[TCP]: Keep copied_seq, rcv_wup and rcv_next together. I noticed in oprofile study a cache miss in tcp_rcv_established() to read copied_seq. ffffffff80400a80 <tcp_rcv_established>: /* tcp_rcv_established total: 4034293 2.0400 */ 55493 0.0281 :ffffffff80400bc9: mov 0x4c8(%r12),%eax copied_seq 543103 0.2746 :ffffffff80400bd1: cmp 0x3e0(%r12),%eax rcv_nxt if (tp->copied_seq == tp->rcv_nxt && len - tcp_header_len <= tp->ucopy.len) { In this function, the cache line 0x4c0 -> 0x500 is used only for this reading 'copied_seq' field. rcv_wup and copied_seq should be next to rcv_nxt field, to lower number of active cache lines in hot paths. (tcp_rcv_established(), tcp_poll(), ...) As you suggested, I changed tcp_create_openreq_child() so that these fields are changed together, to avoid adding a new store buffer stall. Patch is 64bit friendly (no new hole because of alignment constraints) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a9948a7e15015e7f2cb602190322b8ebb00c54c8 |
|
28-Feb-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[TCP]: Fix minisock tcp_create_openreq_child() typo. On 2/28/07, KOVACS Krisztian <hidden@balabit.hu> wrote: > > Hi, > > While reading TCP minisock code I've found this suspiciously looking > code fragment: > > - 8< - > struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, struct sk_buff *skb) > { > struct sock *newsk = inet_csk_clone(sk, req, GFP_ATOMIC); > > if (newsk != NULL) { > const struct inet_request_sock *ireq = inet_rsk(req); > struct tcp_request_sock *treq = tcp_rsk(req); > struct inet_connection_sock *newicsk = inet_csk(sk); > struct tcp_sock *newtp; > - 8< - > > The above code initializes newicsk to inet_csk(sk), isn't that supposed > to be inet_csk(newsk)? As far as I can tell this might leave > icsk_ack.last_seg_size zero even if we do have received data. Good catch! David, please apply the attached patch. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e905a9edab7f4f14f9213b52234e4a346c690911 |
|
09-Feb-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c67862403e28dd67f38e896736f063d8dd522688 |
|
17-Nov-2006 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[TCP] minisocks: Use kmemdup and LIMIT_NETDEBUG Code diff stats: [acme@newtoy net-2.6.20]$ codiff /tmp/tcp_minisocks.o.before /tmp/tcp_minisocks.o.after /pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv4/tcp_minisocks.c: tcp_check_req | -44 1 function changed, 44 bytes removed [acme@newtoy net-2.6.20]$ Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
714e85be3557222bc25f69c252326207c900a7db |
|
15-Nov-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[IPV6]: Assorted trivial endianness annotations. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a928630a2ffeaf6aa9a6b78456935b6ab1be3066 |
|
15-Nov-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Fix some warning when MD5 is disabled. Just some mis-placed ifdefs: net/ipv4/tcp_minisocks.c: In function ‘tcp_twsk_destructor’: net/ipv4/tcp_minisocks.c:364: warning: unused variable ‘twsk’ net/ipv6/tcp_ipv6.c:1846: warning: ‘tcp_sock_ipv6_specific’ defined but not used net/ipv6/tcp_ipv6.c:1877: warning: ‘tcp_sock_ipv6_mapped_specific’ defined but not used Signed-off-by: David S. Miller <davem@davemloft.net>
|
cfb6eeb4c860592edd123fdea908d23c6ad1c7dc |
|
15-Nov-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[TCP]: MD5 Signature Option (RFC2385) support. Based on implementation by Rick Payne. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
65f27f38446e1976cc98fd3004b110fedcddd189 |
|
22-Nov-2006 |
David Howells <dhowells@redhat.com> |
WorkStruct: Pass the work_struct pointer instead of context data Pass the work_struct pointer to the work function rather than context data. The work function can use container_of() to work out the data. For the cases where the container of the work_struct may go away the moment the pending bit is cleared, it is made possible to defer the release of the structure by deferring the clearing of the pending bit. To make this work, an extra flag is introduced into the management side of the work_struct. This governs auto-release of the structure upon execution. Ordinarily, the work queue executor would release the work_struct for further scheduling or deallocation by clearing the pending bit prior to jumping to the work function. This means that, unless the driver makes some guarantee itself that the work_struct won't go away, the work function may not access anything else in the work_struct or its container lest they be deallocated.. This is a problem if the auxiliary data is taken away (as done by the last patch). However, if the pending bit is *not* cleared before jumping to the work function, then the work function *may* access the work_struct and its container with no problems. But then the work function must itself release the work_struct by calling work_release(). In most cases, automatic release is fine, so this is the default. Special initiators exist for the non-auto-release case (ending in _NAR). Signed-Off-By: David Howells <dhowells@redhat.com>
|
ab32ea5d8a760e7dd4339634e95d7be24ee5b842 |
|
22-Sep-2006 |
Brian Haley <brian.haley@hp.com> |
[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly. Couldn't actually measure any performance increase while testing (.3% I consider noise), but seems like the right thing to do. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3687b1dc6fe83a500ba4d3235704594f6a111a2d |
|
31-Jul-2006 |
Wei Yongjun <yjwei@nanjing-fnst.com> |
[TCP]: SNMPv2 tcpAttemptFails counter error Refer to RFC2012, tcpAttemptFails is defined as following: tcpAttemptFails OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The number of times TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state." ::= { tcp 7 } When I lookup into RFC793, I found that the state change should occured under following condition: 1. SYN-SENT -> CLOSED a) Received ACK,RST segment when SYN-SENT state. 2. SYN-RCVD -> CLOSED b) Received SYN segment when SYN-RCVD state(came from LISTEN). c) Received RST segment when SYN-RCVD state(came from SYN-SENT). d) Received SYN segment when SYN-RCVD state(came from SYN-SENT). 3. SYN-RCVD -> LISTEN e) Received RST segment when SYN-RCVD state(came from LISTEN). In my test, those direct state transition can not be counted to tcpAttemptFails. Signed-off-by: Wei Yongjun <yjwei@nanjing-fnst.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e4d919188554a77c798a267e098059bc9aa39726 |
|
03-Jul-2006 |
Ingo Molnar <mingo@elte.hu> |
[PATCH] lockdep: locking init debugging improvement Locking init improvement: - introduce and use __SPIN_LOCK_UNLOCKED for array initializations, to pass in the name string of locks, used by debugging Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
6ab3d5624e172c553004ecc862bfeac16d9d68b7 |
|
30-Jun-2006 |
Jörn Engel <joern@wohnheim.fh-wedel.de> |
Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
b0da8537037f337103348f239ad901477e907aa8 |
|
29-Jun-2006 |
Michael Chan <mchan@broadcom.com> |
[NET]: Add ECN support for TSO In the current TSO implementation, NETIF_F_TSO and ECN cannot be turned on together in a TCP connection. The problem is that most hardware that supports TSO does not handle CWR correctly if it is set in the TSO packet. Correct handling requires CWR to be set in the first packet only if it is set in the TSO header. This patch adds the ability to turn on NETIF_F_TSO and ECN using GSO if necessary to handle TSO packets with CWR set. Hardware that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev-> features flag. All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set. If the output device does not have the NETIF_F_TSO_ECN feature set, GSO will split the packet up correctly with CWR only set in the first segment. With help from Herbert Xu <herbert@gondor.apana.org.au>. Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock flag is completely removed. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0fa1a53e1f055a6c790f40e7728f42a825b29248 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[IPV6]: Introduce inet6_timewait_sock Out of tcp6_timewait_sock, that now is just an aggregation of inet_timewait_sock and inet6_timewait_sock, using tw_ipv6_offset in struct inet_timewait_sock, that is common to the IPv6 transport protocols that use timewait sockets, like DCCP and TCP. tw_ipv6_offset plays the struct inet_sock pinfo6 role, i.e. for the generic code to find the IPv6 area in a timewait sock. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8292a17a399ffb7c5c8b083db4ad994e090055f7 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops And move it to struct inet_connection_sock. DCCP will use it in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
caa20d9abe810be2ede9612b6c9db6ce7d6edf80 |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: spelling fixes Minor spelling fixes for TCP code. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9772efb970780aeed488c19d8b4afd46c3b484af |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Appropriate Byte Count support This is an updated version of the RFC3465 ABC patch originally for Linux 2.6.11-rc4 by Yee-Ting Li. ABC is a way of counting bytes ack'd rather than packets when updating congestion control. The orignal ABC described in the RFC applied to a Reno style algorithm. For advanced congestion control there is little change after leaving slow start. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7957aed72b104f7528da484f292442c6f3a131a1 |
|
21-Sep-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Set default congestion control correctly for incoming connections. Patch from Joel Sing to fix the default congestion control algorithm for incoming connections. If a new congestion control handler is added (via module), it should become the default for new connections. Instead, the incoming connections use reno. The cause is incorrect initialisation causes the tcp_init_congestion_control() function to return after the initial if test fails. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Acked-by: Ian McDonald <imcdnzl@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6687e988d9aeaccad6774e6a8304f681f3ec0a03 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Move TCP congestion avoidance members to icsk This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(), minimal renaming/moving done in this changeset to ease review. Most of it is just changes of struct tcp_sock * to struct sock * parameters. With this we move to a state closer to two interesting goals: 1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used for any INET transport protocol that has struct inet_hashinfo and are derived from struct inet_connection_sock. Keeps the userspace API, that will just not display DCCP sockets, while newer versions of tools can support DCCP. 2. INET generic transport pluggable Congestion Avoidance infrastructure, using the current TCP CA infrastructure with DCCP. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
696ab2d3bffc746fb8cf3712f066d42b9886aeed |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[TIMEWAIT]: Move inet_timewait_death_row routines to net/ipv4/inet_timewait_sock.c Also export the ones that will be used in the next changeset, when DCCP uses this infrastructure. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
295ff7edb8f72b77d524759266f7524deae379b3 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[TIMEWAIT]: Introduce inet_timewait_death_row That groups all of the tables and variables associated to the TCP timewait schedulling/recycling/killing code, that now can be isolated from the TCP specific code and used by other transport protocols, such as DCCP. Next changeset will move this code to net/ipv4/inet_timewait_sock.c Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
295f7324ff8d9ea58b4d3ec93b1aaa1d80e048a9 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[ICSK]: Introduce reqsk_queue_prune from code in tcp_synack_timer With this we're very close to getting all of the current TCP refactorings in my dccp-2.6 tree merged, next changeset will export some functions needed by the current DCCP code and then dccp-2.6.git will be born! Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9f1d2604c71498579609b1532fedc5a89276bb00 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[ICSK]: Introduce inet_csk_clone Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
463c84b97f24010a67cd871746d6a7e4c925a5f9 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Introduce inet_connection_sock This creates struct inet_connection_sock, moving members out of struct tcp_sock that are shareable with other INET connection oriented protocols, such as DCCP, that in my private tree already uses most of these members. The functions that operate on these members were renamed, using a inet_csk_ prefix while not being moved yet to a new file, so as to ease the review of these changes. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
87d11ceb9deb7a3f13fdee6e89d9bb6be7d27a71 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[SOCK]: Introduce sk_clone Out of tcp_create_openreq_child, will be used in dccp_create_openreq_child, and is a nice sock function anyway. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c676270bcd25015b978722ec0352c330dcc87883 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET_TWSK]: Introduce inet_twsk_alloc With the parts of tcp_time_wait that are not TCP specific, tcp_time_wait uses it and so will dccp_time_wait. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e48c414ee61f4ac8d5cff2973e66a7cbc8a93aa5 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise the TCP sock ID lookup routines And also some TIME_WAIT functions. [acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size /tmp/before.size: 282955 13122 9312 305389 4a8ed net/ipv4/built-in.o /tmp/after.size: 281566 13122 9312 304000 4a380 net/ipv4/built-in.o [acme@toy net-2.6.14]$ I kept them still inlined, will uninline at some point to see what would be the performance difference. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8feaf0c0a5488b3d898a9c207eb6678f44ba3f26 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise tcp_tw_bucket, aka TIME_WAIT sockets This paves the way to generalise the rest of the sock ID lookup routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro kernels (where IPv6 is always built as a module): [root@qemu ~]# grep tw_sock /proc/slabinfo tw_sock_TCPv6 0 0 128 31 1 tw_sock_TCP 0 0 96 41 1 [root@qemu ~]# Now if a protocol wants to use the TIME_WAIT generic infrastructure it only has to set the sk_prot->twsk_obj_size field with the size of its inet_timewait_sock derived sock and proto_register will create sk_prot->twsk_slab, for now its only for INET sockets, but we can introduce timewait_sock later if some non INET transport protocolo wants to use this stuff. Next changesets will take advantage of this new infrastructure to generalise even more TCP code. [acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size /tmp/before.size: 188646 11764 5068 205478 322a6 net/ipv4/built-in.o /tmp/after.size: 188144 11764 5068 204976 320b0 net/ipv4/built-in.o [acme@toy net-2.6.14]$ Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1 (qemu host)). Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6e04e02165a7209a71db553b7bc48d68421e5ebf |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Move tcp_port_rover to inet_hashinfo Also expose all of the tcp_hashinfo members, i.e. killing those tcp_ehash, etc macros, this will more clearly expose already generic functions and some that need just a bit of work to become generic, as we'll see in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a55ebcc4c4532107ad9eee1c9bb698ab5f12c00f |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Move bind_hash from tcp_sk to inet_sk This should really be in a inet_connection_sock, but I'm leaving it for a later optimization, when some more fields common to INET transport protocols now in tcp_sk or inet_sk will be chunked out into inet_connection_sock, for now its better to concentrate on getting the changes in the core merged to leave the DCCP tree with only DCCP specific code. Next changesets will take advantage of this move to generalise things like tcp_bind_hash, tcp_put_port, tcp_inherit_port, making the later receive a inet_hashinfo parameter, and even __tcp_tw_hashdance, etc in the future, when tcp_tw_bucket gets transformed into the struct timewait_sock hierarchy. tcp_destroy_sock also is eligible as soon as tcp_orphan_count gets moved to sk_prot. A cascade of incremental changes will ultimately make the tcp_lookup functions be fully generic. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0f7ff9274e72fd254fbd1ab117bbc1db6e7cdb34 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Just rename the TCP hashtable functions/structs to inet_ This is to break down the complexity of the series of patches, making it very clear that this one just does: 1. renames tcp_ prefixed hashtable functions and data structures that were already mostly generic to inet_ to share it with DCCP and other INET transport protocols. 2. Removes not used functions (__tb_head & tb_head) 3. Removes some leftover prototypes in the headers (tcp_bucket_unlock & tcp_v4_build_header) Next changesets will move tcp_sk(sk)->bind_hash to inet_sock so that we can make functions such as tcp_inherit_port, __tcp_inherit_port, tcp_v4_get_port, __tcp_put_port, generic and get others like tcp_destroy_sock closer to generic (tcp_orphan_count will go to sk->sk_prot to allow this). Eventually most of these functions will be used passing the transport protocol inet_hashinfo structure. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6cbb0df788b90777a7ed0f9d8261260353f48076 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[SOCK]: Introduce sk_setup_caps From tcp_v4_setup_caps, that always is preceded by a call to __sk_dst_set, so coalesce this sequence into sk_setup_caps, removing one call to a TCP function in the IP layer. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6848976b721eeb5551cd94673faafeef78d9f35 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Cleanup INET_REFCNT_DEBUG code Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
317a76f9a44b437d6301718f4e5d08bd93f98da7 |
|
23-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Add pluggable congestion control algorithm infrastructure. Allow TCP to have multiple pluggable congestion control algorithms. Algorithms are defined by a set of operations and can be built in or modules. The legacy "new RENO" algorithm is used as a starting point and fallback. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e87506fcc734647c7b2497eee4eb81e785c857a |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Generalise tcp_listen_opt This chunks out the accept_queue and tcp_listen_opt code and moves them to net/core/request_sock.c and include/net/request_sock.h, to make it useful for other transport protocols, DCCP being the first one to use it. Next patches will rename tcp_listen_opt to accept_sock and remove the inline tcp functions that just call a reqsk_queue_ function. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
60236fdd08b2169045a3bbfc5ffe1576e6c3c17b |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Rename open_request to request_sock Ok, this one just renames some stuff to have a better namespace and to dissassociate it from TCP: struct open_request -> struct request_sock tcp_openreq_alloc -> reqsk_alloc tcp_openreq_free -> reqsk_free tcp_openreq_fastfree -> __reqsk_free With this most of the infrastructure closely resembles a struct sock methods subset. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2e6599cb899ba4b133f42cbf9d2b1883d2dc583a |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Generalise TCP's struct open_request minisock infrastructure Kept this first changeset minimal, without changing existing names to ease peer review. Basicaly tcp_openreq_alloc now receives the or_calltable, that in turn has two new members: ->slab, that replaces tcp_openreq_cachep ->obj_size, to inform the size of the openreq descendant for a specific protocol The protocol specific fields in struct open_request were moved to a class hierarchy, with the things that are common to all connection oriented PF_INET protocols in struct inet_request_sock, the TCP ones in tcp_request_sock, that is an inet_request_sock, that is an open_request. I.e. this uses the same approach used for the struct sock class hierarchy, with sk_prot indicating if the protocol wants to use the open_request infrastructure by filling in sk_prot->rsk_prot with an or_calltable. Results? Performance is improved and TCP v4 now uses only 64 bytes per open request minisock, down from 96 without this patch :-) Next changeset will rename some of the structs, fields and functions mentioned above, struct or_calltable is way unclear, better name it struct request_sock_ops, s/struct open_request/struct request_sock/g, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
02c30a84e6298b6b20a56f0896ac80b47839e134 |
|
06-May-2005 |
Jesper Juhl <juhl-lkml@dif.dk> |
[PATCH] update Ross Biro bouncing email address Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 |
|
17-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|