3ac97f2411ad66f872bd0fe401ad94af5ff4d7d3 |
|
03-Sep-2015 |
Dmitry Torokhov <dtor@google.com> |
net: fix crash in tcp_nuke_addr() When iterating through sockets, we need to skip sockets in the TIME_WAIT state, as they use the lightweight inet_timewait_sock structure, which does not have an sk_lock member; if we try to lock them we'll crash like this: [ 89.376383] BUG: spinlock lockup suspected on CPU#0, netd/431 [ 89.382139] lock: 0xffffffc039d05070, .magic: 66d30606, .owner: /-1682098992, .owner_cpu: 0 [ 89.390598] CPU: 0 PID: 431 Comm: netd Tainted: G U W 3.18.0 #5 [ 89.397389] Hardware name: Google Tegra210 Smaug Rev 1+ (DT) [ 89.403049] Call trace: [ 89.405501] [<ffffffc0002072b4>] dump_backtrace+0x0/0x10c [ 89.410918] [<ffffffc0002073d0>] show_stack+0x10/0x1c [ 89.415971] [<ffffffc000a88608>] dump_stack+0x74/0x94 [ 89.421018] [<ffffffc000257e8c>] spin_dump+0x78/0x88 [ 89.425984] [<ffffffc0002580d8>] do_raw_spin_lock+0xfc/0x158 [ 89.431666] [<ffffffc000a90090>] _raw_spin_lock+0x34/0x44 [ 89.437059] [<ffffffc0009509a8>] tcp_nuke_addr+0x1fc/0x29c [ 89.442548] [<ffffffc0009735f4>] devinet_ioctl+0x288/0x680 [ 89.448053] [<ffffffc000975004>] inet_ioctl+0xc4/0xf4 [ 89.453103] [<ffffffc0008baedc>] sock_do_ioctl+0x2c/0x5c [ 89.458408] [<ffffffc0008bbb54>] sock_ioctl+0x210/0x230 [ 89.463633] [<ffffffc000317088>] do_vfs_ioctl+0x4ac/0x590 [ 89.469049] [<ffffffc0003171c8>] SyS_ioctl+0x5c/0x88 (or with a NULL pointer dereference if lockdep is still working). Change-Id: I07c70d9a60b125b1070ff05c4eec27daee1a3e90 Signed-off-by: Dmitry Torokhov <dtor@google.com>
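A minimal sketch of the guard this fix adds, assuming the Android-only tcp_nuke_addr() ehash walk described above (names follow the commit text, not mainline):

    struct sock *sk;
    struct hlist_nulls_node *node;

    sk_nulls_for_each(sk, node, &head->chain) {
            if (sk->sk_state == TCP_TIME_WAIT)
                    continue;   /* inet_timewait_sock: no sk_lock to take */
            bh_lock_sock(sk);
            /* ... inspect and abort the full socket safely ... */
            bh_unlock_sock(sk);
    }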
|
9db825899c5c68b03d784efaf42620e0a12da8ee |
|
10-Jul-2015 |
Dmitry Torokhov <dtor@google.com> |
net: fix iterating over hashtable in tcp_nuke_addr() The actual size of the tcp hashinfo table is tcp_hashinfo.ehash_mask + 1, so we need to adjust the loop accordingly to get the sockets hashed into the last bucket. Change-Id: I796b3c7b4a1a7fa35fba9e5192a4a403eb6e17de Signed-off-by: Dmitry Torokhov <dtor@google.com>
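Illustrative loop bound, assuming the tcp_nuke_addr() walk over tcp_hashinfo: the table holds ehash_mask + 1 buckets, so the bound must be inclusive:

    unsigned int bucket;

    for (bucket = 0; bucket <= tcp_hashinfo.ehash_mask; bucket++) {
            struct inet_ehash_bucket *head = &tcp_hashinfo.ehash[bucket];
            /* walk head->chain; a '< ehash_mask' bound would miss the last bucket */
    }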
|
f6dc94292c5f911044a6ee7d29398656800d4ca2 |
|
28-Mar-2014 |
John Stultz <john.stultz@linaro.org> |
net: kuid/kguid build fixes Small build fixes for xt_quota2 and ipv4 changes Change-Id: Ib098768040c8875887b2081c3165a6c83b37e180 Signed-off-by: John Stultz <john.stultz@linaro.org>
|
7114169926d1e1958cc014c537b1bc6910a77527 |
|
07-Jan-2009 |
Mike Chan <mike@android.com> |
misc: uidstat: Adding uid stat driver to collect network statistics. Signed-off-by: Mike Chan <mike@android.com>
|
c7c3ec4903d32c60423ee013d96e94602f66042c |
|
12-May-2008 |
Robert Love <rlove@google.com> |
net: socket ioctl to reset connections matching local address Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets bound to the same local address. This is useful in situations with dynamic IPs, to kill stuck connections. Signed-off-by: Brian Swetland <swetland@google.com> net: fix tcp_v4_nuke_addr Signed-off-by: Dima Zavin <dima@android.com> net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke. We can't hold the lock while calling tcp_done(), so we drop it before calling. We then have to start at the top of the chain again. Signed-off-by: Dima Zavin <dima@android.com> net: ipv4: Fix race in tcp_v4_nuke_addr(). To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock across tcp_done() calls. This fixed the deadlock, but introduced a race where the socket could die or change state. Fix: Before unlocking the hash table, we grab a reference to the socket. We can then unlock the hash table without risk of the socket going away. We then lock the socket, which is safe because it is pinned. We can then call tcp_done() without recursive deadlock and without the race. Upon return, we unlock the socket and then unpin it, killing it. Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045 Signed-off-by: Robert Love <rlove@google.com> Acked-by: Dima Zavin <dima@android.com> ipv4: disable bottom halves around call to tcp_done(). Signed-off-by: Robert Love <rlove@google.com> Signed-off-by: Colin Cross <ccross@android.com> ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr When sk_error_report is called, it wakes up the user-space thread, which then calls tcp_close. When tcp_close is interrupted by the tcp_v4_nuke_addr ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON. This patch moves the call to sk_error_report inside the bh_lock_sock, which matches the locking used in tcp_v4_err. Signed-off-by: Colin Cross <ccross@android.com>
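A sketch of the pin/relock dance the stacked fixes describe, reconstructed from the commit text only (hypothetical fragment; 'lock' is the ehash chain lock):

    sock_hold(sk);          /* pin the socket */
    spin_unlock_bh(lock);   /* drop hash lock: tcp_done() would recurse on it */

    local_bh_disable();
    bh_lock_sock(sk);
    if (!sock_flag(sk, SOCK_DEAD)) {
            sk->sk_err = ETIMEDOUT;
            sk->sk_error_report(sk);   /* wake userspace inside bh_lock_sock */
            tcp_done(sk);
    }
    bh_unlock_sock(sk);
    local_bh_enable();
    sock_put(sk);           /* unpin; the socket may now be freed */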
|
f4713a3dfad045d46afcb9c2a7d0bba288920ed4 |
|
26-Nov-2014 |
Willem de Bruijn <willemb@google.com> |
net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets. If the socket is of family AF_INET6, call ipv6_recv_error instead of ip_recv_error. This change is more complex than a single branch due to the loadable ipv6 module. It reuses a pre-existing indirect function call from ping. The ping code is safe to call, because it is part of the core ipv6 module and always present when AF_INET6 sockets are active. Fixes: 4ed2d765 (net-timestamp: TCP timestamping) Signed-off-by: Willem de Bruijn <willemb@google.com> ---- It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6) to ip_recv_error. Signed-off-by: David S. Miller <davem@davemloft.net>
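A sketch of the family dispatch the commit describes, with the pingv6_ops indirection covering the loadable ipv6 module (names per the commit text; a sketch, not the exact diff):

    static int recv_error(struct sock *sk, struct msghdr *msg, int len,
                          int *addr_len)
    {
            if (sk->sk_family == AF_INET6)
                    return pingv6_ops.ipv6_recv_error(sk, msg, len, addr_len);
            return ip_recv_error(sk, msg, len, addr_len);
    }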
|
349ce993ac706869d553a1816426d3a4bfda02b1 |
|
23-Oct-2014 |
Eric Dumazet <edumazet@google.com> |
tcp: md5: do not use alloc_percpu() The percpu tcp_md5sig_pool contains memory blobs that ultimately go through sg_set_buf(). -> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf)); This requires that the whole area be in a physically contiguous portion of memory, and that @buf not be backed by vmalloc(). Given that alloc_percpu() can use vmalloc() areas, this does not fit the requirements. Replace alloc_percpu() with a static DEFINE_PER_CPU(); as tcp_md5sig_pool is small anyway, there is no gain in allocating it dynamically. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 765cf9976e93 ("tcp: md5: remove one indirection level in tcp_md5sig_pool") Reported-by: Crestez Dan Leonard <cdleonard@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
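The shape of the change, per the commit text: a static per-cpu definition is never vmalloc()-backed, so virt_to_page() in sg_set_buf() stays valid:

    /* before: the dynamic per-cpu area may land in vmalloc space */
    /* pool = alloc_percpu(struct tcp_md5sig_pool); */

    /* after: statically allocated, physically contiguous per-cpu area */
    static DEFINE_PER_CPU(struct tcp_md5sig_pool, tcp_md5sig_pool);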
|
b248230c34970a6c1c17c591d63b464e8d2cfc33 |
|
29-Sep-2014 |
Yuchung Cheng <ycheng@google.com> |
tcp: abort orphan sockets stalling on zero window probes Currently we have two different policies for orphan sockets that repeatedly stall on zero window ACKs. If a socket gets a zero window ACK when it is transmitting data, the RTO is used to probe the window. The socket is aborted after roughly tcp_orphan_retries() retries (as in tcp_write_timeout()). But if the socket was idle when it received the zero window ACK, and later wants to send more data, we use the probe timer to probe the window. If the receiver always returns zero window ACKs, icsk_probes_out keeps getting reset in tcp_ack() and the orphan socket can stall forever until the system reaches the orphan limit (as commented in tcp_probe_timer()). This opens up a simple attack to create lots of hanging orphan sockets to burn the memory and the CPU, as demonstrated in the recent netdev post "TCP connection will hang in FIN_WAIT1 after closing if zero window is advertised." http://www.spinics.net/lists/netdev/msg296539.html This patch follows the design of the RTO-based probe: we abort an orphan socket stalling on zero window when the probe timer reaches both the maximum backoff and the maximum RTO. For example, a 100ms RTT connection will time out after roughly 153 seconds (0.3 + 0.6 + .... + 76.8) if the receiver keeps the window shut. If the orphan socket passes this check, but the system already has too many orphans (as in tcp_out_of_resources()), we still abort it but we'll also send an RST packet as the connection may still be active. In addition, we change TCP_USER_TIMEOUT to cover (live or dead) sockets stalled on zero-window probes. This changes the semantics of TCP_USER_TIMEOUT slightly, because it previously applied only when the socket had pending transmission. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
|
47d7a88c188f06ffaea3a539f84fe10cb4e77787 |
|
01-Oct-2014 |
Fabian Frederick <fabf@skynet.be> |
tcp: add __init to tcp_init_mem tcp_init_mem is only called by __init tcp_init. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
55d8694fa82c9b5858ae5a78a210353961f908f9 |
|
26-Sep-2014 |
Florian Westphal <fw@strlen.de> |
net: tcp: assign tcp cong_ops when tcp sk is created Split assignment and initialization from one into two functions. This is required by followup patches that add the Datacenter TCP (DCTCP) congestion control algorithm - we need to be able to determine if the connection is moderated by DCTCP before the 3WHS has finished. As we walk the available congestion control list during the assignment, we are always guaranteed to have Reno present, as it is fixed and compiled in. Therefore, since we're doing the early assignment, we don't have a real use for the Reno alias tcp_init_congestion_ops anymore and can thus remove it. Actual use of the congestion control operations is made after the 3WHS has finished; in some cases, however, we can access get_info() via diag if implemented, therefore we need to zero out the private area for those modules. Joint work with Daniel Borkmann and Glenn Judd. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cd7d8498c9a5d510c64db38d9f4f4fbc41790f09 |
|
24-Sep-2014 |
Eric Dumazet <edumazet@google.com> |
tcp: change tcp_skb_pcount() location Our goal is to access no more than one cache line per skb in a write or receive queue when doing the various walks. After recent TCP_SKB_CB() reorganizations, it is almost done. The last part is tcp_skb_pcount(), which currently uses skb_shinfo(skb)->gso_segs; this is a terrible choice, because it needs 3 cache lines in the current kernel (skb->head, skb->end, and shinfo->gso_segs are all in 3 different cache lines, far from skb->cb). This very simple patch reuses space currently taken by tcp_tw_isn only in the input path, as tcp_skb_pcount is only needed for skbs stored in the write queue. This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible. This also speeds up all sack processing in general. This speeds up tcp_sendmsg() because it no longer has to access/dirty shinfo. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3f334078567245429540e6461c81c749fce87f70 |
|
31-Dec-2013 |
Dan Williams <dan.j.williams@intel.com> |
net: make tcp_cleanup_rbuf private net_dma was the only external user so this can become local to tcp.c again. Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
7bced397510ab569d31de4c70b39e13355046387 |
|
30-Dec-2013 |
Dan Williams <dan.j.williams@intel.com> |
net_dma: simple removal Per commit "77873803363c net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
f4a775d14489a801a5b8b0540e23ab82e2703091 |
|
23-Sep-2014 |
Eric Dumazet <edumazet@google.com> |
net: introduce __skb_header_release() While profiling the TCP stack, I noticed one useless atomic operation in tcp_sendmsg(), caused by skb_header_release(). It turns out all current skb_header_release() users have a fresh skb that no other user can see, so we can avoid one atomic operation. Introduce __skb_header_release() to clearly document this. This gave me a 1.5% improvement on the TCP_RR workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
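A conceptual sketch: the new variant performs the same bookkeeping as skb_header_release() but without the atomic operation, which is legal only while the skb is private to its creator:

    struct sk_buff *skb = alloc_skb(size, GFP_KERNEL);

    if (skb) {
            /* skb is fresh and invisible to any other context */
            __skb_header_release(skb);      /* non-atomic, cheaper */
    }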
|
e11ecddf5128011c936cc5360780190cbc901fdc |
|
15-Sep-2014 |
Eric Dumazet <edumazet@google.com> |
tcp: use TCP_SKB_CB(skb)->tcp_flags in input path The TCP input path does not currently use TCP_SKB_CB(skb)->tcp_flags, which is only used in the output path. tcp_recvmsg() looks at tcp_hdr(skb)->syn for every skb found in the receive queue, and it's unfortunate because this bit is located in a cache line right before the payload. We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags. This patch does so, and avoids the cache line miss in tcp_recvmsg(). Following patches will - allow a segment with FIN being coalesced in tcp_try_coalesce() - simplify tcp_collapse() by not copying the headers. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
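Example of the resulting access pattern, e.g. the SYN accounting in tcp_recvmsg() (a sketch of the commit's intent):

    /* before: touches the TCP header's cache line, next to the payload */
    if (tcp_hdr(skb)->syn)
            offset--;

    /* after: reads the flags copied into the skb control block at input */
    if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN)
            offset--;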
|
908c7f1949cb7cc6e92ba8f18f2998e87e265b8e |
|
08-Sep-2014 |
Tejun Heo <tj@kernel.org> |
percpu_counter: add @gfp to percpu_counter_init() The percpu allocator now supports an allocation mask. Add @gfp to percpu_counter_init() so that !GFP_KERNEL allocation masks can be used with percpu_counters too. We could have left percpu_counter_init() alone and added percpu_counter_init_gfp(); however, the number of users isn't that high, and introducing _gfp variants to all percpu data structures would be quite ugly, so let's just do the conversion. This is the one with the most users. Other percpu data structures are a lot easier to convert. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: "David S. Miller" <davem@davemloft.net> Cc: x86@kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org>
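Shape of a converted call site (illustrative counter name; the gfp mask is now explicit where GFP_KERNEL was previously implied):

    struct percpu_counter sockets_allocated;

    int err = percpu_counter_init(&sockets_allocated, 0, GFP_KERNEL);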
|
903ceff7ca7b4d80c083a80ee5163b74e9fa359f |
|
17-Aug-2014 |
Christoph Lameter <cl@linux.com> |
net: Replace get_cpu_var through this_cpu_ptr Replace uses of get_cpu_var for address calculation with this_cpu_ptr. Cc: netdev@vger.kernel.org Cc: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
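The conversion pattern, sketched with hypothetical per-cpu names: get_cpu_var() also disables preemption, while this_cpu_ptr() only computes the address, so the caller must already be non-preemptible:

    /* before */
    struct my_pcpu *p = &get_cpu_var(my_var);
    /* ... use p ... */
    put_cpu_var(my_var);

    /* after, in an already non-preemptible context */
    struct my_pcpu *q = this_cpu_ptr(&my_var);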
|
9d186cac7ffb1831e9f34cb4a3a8b22abb9dd9d4 |
|
13-Aug-2014 |
Andrey Vagin <avagin@openvz.org> |
tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) We don't know the right timestamp for repaired skb-s. Wrong RTT estimations aren't good, because some congestion modules depend heavily on them. This patch adds the TCPCB_REPAIRED flag, which is included in TCPCB_RETRANS. Thanks to Eric for the advice on how to fix this issue. This patch fixes the warning: [ 879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380() [ 879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1 [ 879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 879.568177] 0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2 [ 879.568776] 0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80 [ 879.569386] ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc [ 879.569982] Call Trace: [ 879.570264] [<ffffffff817aa2d2>] dump_stack+0x4d/0x66 [ 879.570599] [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0 [ 879.570935] [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20 [ 879.571292] [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380 [ 879.571614] [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710 [ 879.571958] [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370 [ 879.572315] [<ffffffff81657459>] release_sock+0x89/0x1d0 [ 879.572642] [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860 [ 879.573000] [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80 [ 879.573352] [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40 [ 879.573678] [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20 [ 879.574031] [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0 [ 879.574393] [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b [ 879.574730] ---[ end trace a17cbc38eb8c5c00 ]--- v2: move the setting of skb->when for repaired skb-s into tcp_write_xmit, where it's set for other skb-s. Fixes: 431a91242d8d ("tcp: timestamp SYN+DATA messages") Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution") Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f066e2b091a50f0b76ade87250065d65996b93dd |
|
06-Aug-2014 |
Willem de Bruijn <willemb@google.com> |
net-timestamp: cumulative tcp timestamping fixes A set of small fixes pointed out just after the merge: - make tcp_tx_timestamp static - make tcp_gso_tstamp static - use before() to compare TCP seqno, instead of cast to u64 - add tstamp to tx_flags in GSO, instead of overwrite tx_flags - record skb_shinfo(skb)->tskey for all timestamps, also HW. - optimization in tcp_tx_timestamp: call sock_tx_timestamp only if a tstamp option is set. Signed-off-by: Willem de Bruijn <willemb@google.com> Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping") Signed-off-by: David S. Miller <davem@davemloft.net>
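For reference, before() is the standard wraparound-safe sequence comparison in include/net/tcp.h, which this fix uses instead of casts to u64:

    static inline bool before(__u32 seq1, __u32 seq2)
    {
            return (__s32)(seq1 - seq2) < 0;
    }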
|
4ed2d765dfaccff5ebdac68e2064b59125033a3b |
|
05-Aug-2014 |
Willem de Bruijn <willemb@google.com> |
net-timestamp: TCP timestamping TCP timestamping extends SO_TIMESTAMPING to bytestreams. Bytestreams do not have a 1:1 relationship between send() buffers and network packets. The feature interprets a send call on a bytestream as a request for a timestamp for the last byte in that send() buffer. The choice corresponds to a request for a timestamp when all bytes in the buffer have been sent. That assumption depends on in-order kernel transmission. This is the common case. That said, it is possible to construct a traffic shaping tree that would result in reordering. The guarantee is strong, then, but not ironclad. This implementation supports send and sendpages (splice). GSO replaces one large packet with multiple smaller packets. This patch also copies the option into the correct smaller packet. This patch does not yet support timestamping on data in an initial TCP Fast Open SYN, because that takes a very different data path. If ID generation in ee_data is enabled, bytestream timestamps return a byte offset, instead of the packet counter for datagrams. The implementation supports a single timestamp per packet. It silently replaces requests for previous timestamps. To avoid missing tstamps, flush the tcp queue by disabling Nagle, cork and autocork. Missing tstamps can be detected by offset when the ee_data ID is enabled. Implementation details: - On GSO, the timestamping code can be included in the main loop. I moved it into its own loop to reduce the impact on the common case to a single branch. - To avoid leaking the absolute seqno to userspace, the offset returned in ee_data must always be relative. It is an offset between an skb and sk field. The first is always set (also for GSO & ACK). The second must also never be uninitialized. Only allow the ID option on sockets in the ESTABLISHED state, for which the seqno is available. Never reset it to zero (instead, move it to the current seqno when reenabling the option). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5924f17a8a30c2ae18d034a86ee7581b34accef6 |
|
28-Jun-2014 |
Christoph Paasch <christoph.paasch@uclouvain.be> |
tcp: Fix divide by zero when pushing during tcp-repair When in repair-mode and TCP_RECV_QUEUE is set, we end up calling tcp_push with mss_now being 0. If data is in the send-queue and tcp_set_skb_tso_segs gets called, we crash because it will divide by mss_now: [ 347.151939] divide error: 0000 [#1] SMP [ 347.152907] Modules linked in: [ 347.152907] CPU: 1 PID: 1123 Comm: packetdrill Not tainted 3.16.0-rc2 #4 [ 347.152907] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 [ 347.152907] task: f5b88540 ti: f3c82000 task.ti: f3c82000 [ 347.152907] EIP: 0060:[<c1601359>] EFLAGS: 00210246 CPU: 1 [ 347.152907] EIP is at tcp_set_skb_tso_segs+0x49/0xa0 [ 347.152907] EAX: 00000b67 EBX: f5acd080 ECX: 00000000 EDX: 00000000 [ 347.152907] ESI: f5a28f40 EDI: f3c88f00 EBP: f3c83d10 ESP: f3c83d00 [ 347.152907] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 347.152907] CR0: 80050033 CR2: 083158b0 CR3: 35146000 CR4: 000006b0 [ 347.152907] Stack: [ 347.152907] c167f9d9 f5acd080 000005b4 00000002 f3c83d20 c16013e6 f3c88f00 f5acd080 [ 347.152907] f3c83da0 c1603b5a f3c83d38 c10a0188 00000000 00000000 f3c83d84 c10acc85 [ 347.152907] c1ad5ec0 00000000 00000000 c1ad679c 010003e0 00000000 00000000 f3c88fc8 [ 347.152907] Call Trace: [ 347.152907] [<c167f9d9>] ? apic_timer_interrupt+0x2d/0x34 [ 347.152907] [<c16013e6>] tcp_init_tso_segs+0x36/0x50 [ 347.152907] [<c1603b5a>] tcp_write_xmit+0x7a/0xbf0 [ 347.152907] [<c10a0188>] ? up+0x28/0x40 [ 347.152907] [<c10acc85>] ? console_unlock+0x295/0x480 [ 347.152907] [<c10ad24f>] ? vprintk_emit+0x1ef/0x4b0 [ 347.152907] [<c1605716>] __tcp_push_pending_frames+0x36/0xd0 [ 347.152907] [<c15f4860>] tcp_push+0xf0/0x120 [ 347.152907] [<c15f7641>] tcp_sendmsg+0xf1/0xbf0 [ 347.152907] [<c116d920>] ? kmem_cache_free+0xf0/0x120 [ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40 [ 347.152907] [<c106a682>] ? __sigqueue_free+0x32/0x40 [ 347.152907] [<c114f0f0>] ? do_wp_page+0x3e0/0x850 [ 347.152907] [<c161c36a>] inet_sendmsg+0x4a/0xb0 [ 347.152907] [<c1150269>] ? handle_mm_fault+0x709/0xfb0 [ 347.152907] [<c15a006b>] sock_aio_write+0xbb/0xd0 [ 347.152907] [<c1180b79>] do_sync_write+0x69/0xa0 [ 347.152907] [<c1181023>] vfs_write+0x123/0x160 [ 347.152907] [<c1181d55>] SyS_write+0x55/0xb0 [ 347.152907] [<c167f0d8>] sysenter_do_call+0x12/0x28 This can easily be reproduced with the following packetdrill-script (the "magic" with netem, sk_pacing and limit_output_bytes is done to prevent the kernel from pushing all segments, because hitting the limit without doing this is not so easy with packetdrill): 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1460> +0 > S. 0:0(0) ack 1 <mss 1460> +0.1 < . 1:1(0) ack 1 win 65000 +0 accept(3, ..., ...) = 4 // This forces that not all segments of the snd-queue will be pushed +0 `tc qdisc add dev tun0 root netem delay 10ms` +0 `sysctl -w net.ipv4.tcp_limit_output_bytes=2` +0 setsockopt(4, SOL_SOCKET, 47, [2], 4) = 0 +0 write(4,...,10000) = 10000 +0 write(4,...,10000) = 10000 // Set tcp-repair stuff, particularly TCP_RECV_QUEUE +0 setsockopt(4, SOL_TCP, 19, [1], 4) = 0 +0 setsockopt(4, SOL_TCP, 20, [1], 4) = 0 // This now will make the write push the remaining segments +0 setsockopt(4, SOL_SOCKET, 47, [20000], 4) = 0 +0 `sysctl -w net.ipv4.tcp_limit_output_bytes=130000` // Now we will crash +0 write(4,...,1000) = 1000 This happens since ec3423257508 (tcp: fix retransmission in repair mode). 
Prior to that, the call to tcp_push was prevented by a check for tp->repair. The patch fixes it, by adding the new goto-label out_nopush. When exiting tcp_sendmsg and a push is not required, which is the case for tp->repair, we go to this label. When repairing and calling send() with TCP_RECV_QUEUE, the data is actually put in the receive-queue. So, no push is required because no data has been added to the send-queue. Cc: Andrew Vagin <avagin@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Fixes: ec3423257508 (tcp: fix retransmission in repair mode) Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Acked-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1536e2857bd38e3bcd19963fd6b3c3287b4747c4 |
|
16-Apr-2014 |
Kenjiro Nakayama <nakayamakenjiro@gmail.com> |
tcp: Add a TCP_FASTOPEN socket option to get a max backlog on its listener This patch adds a TCP_FASTOPEN socket option to getsockopt() to retrieve the max backlog of a listener. Signed-off-by: Kenjiro Nakayama <nakayamakenjiro@gmail.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
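Userspace sketch of the new getsockopt() call (assumes a listening socket that already had TCP_FASTOPEN set):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    void show_fastopen_backlog(int listen_fd)
    {
            int qlen = 0;
            socklen_t len = sizeof(qlen);

            if (getsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, &len) == 0)
                    printf("fastopen backlog: %d\n", qlen);
    }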
|
740b0f1841f6e39085b711d41db9ffb07198682b |
|
26-Feb-2014 |
Eric Dumazet <edumazet@google.com> |
tcp: switch rtt estimations to usec resolution Upcoming congestion controls for TCP require usec resolution for RTT estimations. Millisecond resolution is simply not enough these days. FQ/pacing in DC environments also require this change for finer control and removal of bimodal behavior due to the current hack in tcp_update_pacing_rate() for 'small rtt' TCP_CONG_RTT_STAMP is no longer needed. As Julian Anastasov pointed out, we need to keep user compatibility : tcp_metrics used to export RTT and RTTVAR in msec resolution, so we added RTT_US and RTTVAR_US. An iproute2 patch is needed to use the new attributes if provided by the kernel. In this example ss command displays a srtt of 32 usecs (10Gbit link) lpk51:~# ./ss -i dst lpk52 Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port tcp ESTAB 0 1 10.246.11.51:42959 10.246.11.52:64614 cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559 Updated iproute2 ip command displays : lpk51:~# ./ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51 Old binary displays : lpk51:~# ip tcp_metrics | grep 10.246.11.52 10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51 With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Larry Brakmo <brakmo@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f5ddcbbb40aa0ba7fbfe22355d287603dbeeaaac |
|
20-Feb-2014 |
Eric Dumazet <edumazet@google.com> |
net-tcp: fastopen: fix high order allocations This patch fixes two bugs in fastopen : 1) The tcp_sendmsg(..., @size) argument was ignored. Code was relying on user not fooling the kernel with iovec mismatches 2) When MTU is about 64KB, tcp_send_syn_data() attempts order-5 allocations, which are likely to fail when memory gets fragmented. Fixes: 783237e8daf13 ("net-tcp: Fast Open client - sending SYN-data") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
977cb0ecf82eb6d15562573c31edebf90db35163 |
|
13-Feb-2014 |
Eric Dumazet <edumazet@google.com> |
tcp: add pacing_rate information into tcp_info Add two new fields to struct tcp_info, to report sk_pacing_rate and sk_max_pacing_rate to monitoring applications, as ss from iproute2. User exported fields are 64bit, even if kernel is currently using 32bit fields. lpaa5:~# ss -i .. skmem:(r0,rb357120,t0,tb2097152,f1584,w1980880,o0,bl0) ts sack cubic wscale:6,6 rto:400 rtt:0.875/0.75 mss:1448 cwnd:1 ssthresh:12 send 13.2Mbps pacing_rate 3336.2Mbps unacked:15 retrans:1/5448 lost:15 rcv_space:29200 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
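Userspace sketch reading the two new 64-bit fields (struct tcp_info taken from the kernel uapi header, since glibc's copy may lag):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>          /* struct tcp_info with tcpi_pacing_rate */

    void show_pacing(int fd)
    {
            struct tcp_info info;
            socklen_t len = sizeof(info);

            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0)
                    printf("pacing_rate %llu, max_pacing_rate %llu (bytes/sec)\n",
                           (unsigned long long)info.tcpi_pacing_rate,
                           (unsigned long long)info.tcpi_max_pacing_rate);
    }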
|
b10bd54c05a6c3c96549f4642d7de23c99b9853b |
|
09-Feb-2014 |
Jesper Juhl <jj@chaosbits.net> |
tcp: correct code comment stating 3 min timeout for FIN_WAIT2, we only do 1 min As far as I can tell we have used a default of 60 seconds for FIN_WAIT2 timeout for ages (since 2.x times??). In any case, the timeout these days is 60 seconds, so the 3 min comment is wrong (and cost me a few minutes of my life when I was debugging a FIN_WAIT2 related problem in a userspace application and checked the kernel source for details). Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1774e9f3e5c8b38de3b3bc8bd0eacd280f655baf |
|
19-Nov-2013 |
Peter Zijlstra <peterz@infradead.org> |
sched, net: Clean up preempt_enable_no_resched() abuse The only valid use of preempt_enable_no_resched() is if the very next line is schedule(), or if we know preemption cannot actually be enabled by that statement because additional preempt_count references are known to be held. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: rjw@rjwysocki.net Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: rui.zhang@intel.com Cc: jacob.jun.pan@linux.intel.com Cc: Mike Galbraith <bitbucket@online.de> Cc: hpa@zytor.com Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: lenb@kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20131119151338.GF3694@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
996b175e39ed42ec2aa0c63b4a03cc500aa6269f |
|
06-Jan-2014 |
Eric Dumazet <edumazet@google.com> |
tcp: out_of_order_queue do not use its lock The TCP out_of_order_queue lock is not used, as queue manipulation happens with the socket lock held, and we therefore use the lockless skb queue routines (such as __skb_queue_head()). We can use __skb_queue_head_init() instead of skb_queue_head_init() to make this more consistent. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a181ceb501b31b4bf8812a5c84c716cc31d82c2d |
|
17-Dec-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: autocork should not hold first packet in write queue Willem noticed a TCP_RR regression caused by TCP autocorking on a Mellanox test bed. MLX4_EN_TX_COAL_TIME is 16 us, which can be right above the RTT between hosts. We can receive an ACK for a packet still in the NIC TX ring buffer or in a softnet completion queue. Fix this by always pushing the skb if it is at the head of the write queue. Also, as TX completion is lockless, it's safer to perform the sk_wmem_alloc test after setting TSQ_THROTTLED. erd:~# MIB="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY" erd:~# ./netperf -H remote -t TCP_RR -- -o $MIB | tail -n 1 (repeat 3 times) Before patch : 18,1049.87,41004,39631,6295.47 17,239.52,40804,48,2912.79 18,348.40,40877,54,3573.39 After patch : 18,22.84,4606,38,16.39 17,21.56,2871,36,13.51 17,22.46,2705,37,11.83 Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: f54b311142a9 ("tcp: auto corking") Signed-off-by: David S. Miller <davem@davemloft.net>
|
f54b311142a92ea2e42598e347b84e1655caf8e3 |
|
06-Dec-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: auto corking With the introduction of TCP Small Queues, TSO auto sizing, and TCP pacing, we can implement Automatic Corking in the kernel, to help applications doing small write()/sendmsg() to TCP sockets. Idea is to change tcp_push() to check if the current skb payload is under skb optimal size (a multiple of MSS bytes) If under 'size_goal', and at least one packet is still in Qdisc or NIC TX queues, set the TCP Small Queue Throttled bit, so that the push will be delayed up to TX completion time. This delay might allow the application to coalesce more bytes in the skb in following write()/sendmsg()/sendfile() system calls. The exact duration of the delay is depending on the dynamics of the system, and might be zero if no packet for this flow is actually held in Qdisc or NIC TX ring. Using FQ/pacing is a way to increase the probability of autocorking being triggered. Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control this feature and default it to 1 (enabled) Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking This counter is incremented every time we detected skb was under used and its flush was deferred. Tested: Interesting effects when using line buffered commands under ssh. Excellent performance results in term of cpu usage and total throughput. lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128 9410.39 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128': 35209.439626 task-clock # 2.901 CPUs utilized 2,294 context-switches # 0.065 K/sec 101 CPU-migrations # 0.003 K/sec 4,079 page-faults # 0.116 K/sec 97,923,241,298 cycles # 2.781 GHz [83.31%] 51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%] 25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%] 102,225,978,536 instructions # 1.04 insns per cycle # 0.51 stalled cycles per insn [83.38%] 18,657,696,819 branches # 529.906 M/sec [83.29%] 91,679,646 branch-misses # 0.49% of all branches [83.40%] 12.136204899 seconds time elapsed lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128 6624.89 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128': 40045.864494 task-clock # 3.301 CPUs utilized 171 context-switches # 0.004 K/sec 53 CPU-migrations # 0.001 K/sec 4,080 page-faults # 0.102 K/sec 111,340,458,645 cycles # 2.780 GHz [83.34%] 61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%] 29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%] 108,654,349,355 instructions # 0.98 insns per cycle # 0.57 stalled cycles per insn [83.34%] 19,552,170,748 branches # 488.244 M/sec [83.34%] 157,875,417 branch-misses # 0.81% of all branches [83.34%] 12.130267788 seconds time elapsed Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
98e09386c0ef4dfd48af7ba60ff908f0d525cdee |
|
13-Nov-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: tsq: restore minimal amount of queueing After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also removes the use of this sysctl while building skbs stored in the write queue, as TSO autosizing does the right thing anyway. Users with well-behaving NICs and a correct qdisc (like sch_fq) can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sujith Manoharan <sujith@msujith.org> Reported-by: Arnaud Ebalard <arno@natisbad.org> Tested-by: Sujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
27bf69708394ca270f9ec53875aa7309c86dfc1d |
|
16-Oct-2013 |
Vinod Koul <vinod.koul@intel.com> |
net: use DMA_COMPLETE for dma completion status Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
|
a4fe34bf902b8f709c635ab37f1f39de0b86cff2 |
|
20-Oct-2013 |
Eric W. Biederman <ebiederm@xmission.com> |
tcp_memcontrol: Remove the per netns control. The code that is implemented is per memory cgroup not per netns, and having per netns bits is just confusing. Remove the per netns bits to make it easier to see what is really going on. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
05dbc7b59481ca891bbcfe6799a562d48159fbf7 |
|
03-Oct-2013 |
Eric Dumazet <edumazet@google.com> |
tcp/dccp: remove twchain TCP listener refactoring, part 3 : Our goal is to hash SYN_RECV sockets into main ehash for fast lookup, and parallel SYN processing. Current inet_ehash_bucket contains two chains, one for ESTABLISH (and friend states) sockets, another for TIME_WAIT sockets only. As the hash table is sized to get at most one socket per bucket, it makes little sense to have separate twchain, as it makes the lookup slightly more complicated, and doubles hash table memory usage. If we make sure all socket types have the lookup keys at the same offsets, we can use a generic and faster lookup. It turns out TIME_WAIT and ESTABLISHED sockets already have common lookup fields for IPv4. [ INET_TW_MATCH() is no longer needed ] I'll provide a follow-up to factorize IPv6 lookup as well, to remove INET6_TW_MATCH() This way, SYN_RECV pseudo sockets will be supported the same. A new sock_gen_put() helper is added, doing either a sock_put() or inet_twsk_put() [ and will support SYN_RECV later ]. Note this helper should only be called in real slow path, when rcu lookup found a socket that was moved to another identity (freed/reused immediately), but could eventually be used in other contexts, like sock_edemux() Before patch : dmesg | grep "TCP established" TCP established hash table entries: 524288 (order: 11, 8388608 bytes) After patch : TCP established hash table entries: 524288 (order: 10, 4194304 bytes) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e2e5c4c07caf810d7849658dca42f598b3938e21 |
|
05-Sep-2013 |
Dave Jones <davej@redhat.com> |
tcp: Add missing braces to do_tcp_setsockopt Signed-off-by: Dave Jones <davej@fedoraproject.org> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
95bd09eb27507691520d39ee1044d6ad831c1168 |
|
27-Aug-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: TSO packets automatic sizing After hearing many people over the past years complain about TSO being bursty or even buggy, we are proud to present automatic sizing of TSO packets. One part of the problem is that tcp_tso_should_defer() uses a heuristic relying on upcoming ACKs instead of a timer, but more generally, having big TSO packets makes little sense at low rates, as it tends to create micro bursts on the network, and the general consensus is to reduce the buffering amount. This patch introduces a per-socket sk_pacing_rate that approximates the current sending rate and allows us to size the TSO packets so that we try to send one packet every ms. This field could be set by other transports. The patch has no impact on high-speed flows, where having large TSO packets makes sense to reach line rate. For other flows, this helps better packet scheduling and ACK clocking. This patch increases performance of TCP flows in lossy environments. A new sysctl (tcp_min_tso_segs) is added to specify the minimal size of a TSO packet (default being 2). A follow-up patch will provide a new packet scheduler (FQ), using sk_pacing_rate as an input to perform optional per-flow pacing. This explains why we chose to set sk_pacing_rate to twice the current rate, allowing 'slow start' ramp up. sk_pacing_rate = 2 * cwnd * mss / srtt v2: Neal Cardwell reported suspect deferring of the last two segments on an initial write of 10 MSS; I had to change tcp_tso_should_defer() to take into account tp->xmit_size_goal_segs. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Van Jacobson <vanj@google.com> Cc: Tom Herbert <therbert@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7ed5c5ae96d23da22de95e1c7a239537acd378b1 |
|
16-Aug-2013 |
Andrey Vagin <avagin@openvz.org> |
tcp: set timestamps for restored skb-s When the repair mode is turned off, the write queue seqs are updated so that the whole queue is considered to be 'already sent'. The "when" field must be set for such skb-s. It's used in tcp_rearm_rto, for example. If the "when" field isn't set, the retransmit timeout can be calculated incorrectly, and a TCP connection can stall for two minutes (TCP_RTO_MAX). Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c0155b2da4cd6583b1b729451249ca346e1c05a2 |
|
31-Jul-2013 |
Dmitry Popov <dp@highloadlab.com> |
tcp: Remove unused tcpct declarations and comments Remove declaration, 4 defines and confusing comment that are no longer used since 1a2c6181c4 ("tcp: Remove TCPCT"). Signed-off-by: Dmitry Popov <dp@highloadlab.com> Acked-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c9bee3b7fdecb0c1d070c7b54113b3bdfb9a3d36 |
|
23-Jul-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: TCP_NOTSENT_LOWAT socket option Idea of this patch is to add optional limitation of number of unsent bytes in TCP sockets, to reduce usage of kernel memory. TCP receiver might announce a big window, and TCP sender autotuning might allow a large amount of bytes in write queue, but this has little performance impact if a large part of this buffering is wasted : Write queue needs to be large only to deal with large BDP, not necessarily to cope with scheduling delays (incoming ACKS make room for the application to queue more bytes) For most workloads, using a value of 128 KB or less is OK to give applications enough time to react to POLLOUT events in time (or being awaken in a blocking sendmsg()) This patch adds two ways to set the limit : 1) Per socket option TCP_NOTSENT_LOWAT 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not using TCP_NOTSENT_LOWAT socket option (or setting a zero value) Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect. This changes poll()/select()/epoll() to report POLLOUT only if number of unsent bytes is below tp->nosent_lowat Note this might increase number of sendmsg()/sendfile() calls when using non blocking sockets, and increase number of context switches for blocking sockets. Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is defined as : Specify the minimum number of bytes in the buffer until the socket layer will pass the data to the protocol) Tested: netperf sessions, and watching /proc/net/protocols "memory" column for TCP With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458) lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y Using 128KB has no bad effect on the throughput or cpu usage of a single flow, although there is an increase of context switches. A bonus is that we hold socket lock for a shorter amount of time and should improve latencies of ACK processing. lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 412,514 context-switches 200.034645535 seconds time elapsed lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3 OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand Size Size Size (sec) Util Util Util Util Demand Demand Units Final Final % Method % Method 1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3': 2,675,818 context-switches 200.029651391 seconds time elapsed Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-By: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
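Userspace sketch: opt a socket into the 128 KB limit the commit suggests is enough for timely POLLOUT wakeups:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_NOTSENT_LOWAT
    #define TCP_NOTSENT_LOWAT 25    /* uapi value, for older libc headers */
    #endif

    int set_notsent_lowat(int fd)
    {
            unsigned int lowat = 128 * 1024;    /* cap unsent bytes at 128 KB */

            return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                              &lowat, sizeof(lowat));
    }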
|
64dc61306ce7da370833289739e2f52dfc6b37ba |
|
23-Jul-2013 |
Eric Dumazet <edumazet@google.com> |
net: add sk_stream_is_writeable() helper Several call sites hardcode the following condition : sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) Let's use a helper, because TCP_NOTSENT_LOWAT support will change this condition for TCP sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
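Per the commit text, the helper simply names the existing condition (a sketch of its likely definition):

    static inline bool sk_stream_is_writeable(const struct sock *sk)
    {
            return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
    }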
|
076bb0c82a44fbe46fe2c8527a5b5b64b69f679d |
|
10-Jul-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: rename include/net/ll_poll.h to include/net/busy_poll.h Rename the file and correct all the places where it is included. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cbf55001b2ddb814329735641be5d29b08c82b08 |
|
08-Jul-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
net: rename low latency sockets functions to busy poll Rename functions in include/net/ll_poll.h to busy wait. Clarify documentation about expected power use increase. Rename POLL_LL to POLL_BUSY_LOOP. Add need_resched() testing to poll/select busy loops. Note that in select and poll, can_busy_poll is dynamic and is updated continuously to reflect the existence of supported sockets with valid queue information. Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d30e383bb856f614ddb5bbbb5a7d3f86240e41ec |
|
10-Jun-2013 |
Eliezer Tamir <eliezer.tamir@linux.intel.com> |
tcp: add low latency socket poll support. Adds low latency socket poll support for TCP. In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id from the skb to the sk. In tcp_recvmsg(), when there is no data in the socket we busy-poll. This is a good example of how to add busy-poll support to more protocols. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
28850dc7c71da9d0c0e39246e9ff6913f41f8d0a |
|
07-Jun-2013 |
Daniel Borkmann <dborkman@redhat.com> |
net: tcp: move GRO/GSO functions to tcp_offload It would be good to make things explicit and move those functions to a new file called tcp_offload.c, thus making this similar to tcpv6_offload.c. While moving all related functions into tcp_offload.c, we can also make some of them static, since they are only used there. Also, add an explicit registration function. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5ee98591577aa63dbb9e78a0d142abc86b9063d0 |
|
07-Jun-2013 |
Daniel Borkmann <dborkman@redhat.com> |
net: minor: tcp: use tcp_skb_mss helper in tcp_tso_segment We have the minimal inline helper tcp_skb_mss to access skb_shinfo(skb)->gso_size, so also use it here to get mss. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c3f1dbaf6e281642848b78fe101764170c15f168 |
|
31-May-2013 |
David Majnemer <majnemer@google.com> |
net: Update RFS target at poll for tcp/udp The current state of affairs is that read()/write() will set up RFS (Receive Flow Steering) for internet protocol sockets while poll()/epoll() does not. When poll() gets called with a TCP or UDP socket, we should update the flow target. This permits RFS (if enabled) to select the appropriate CPU for subsequent incoming packets. Note: Only connected UDP sockets can benefit from RFS. Signed-off-by: David Majnemer <majnemer@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paul Turner <pjt@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f7c0c2ae843b74f8dba55820cb0a3de19c976703 |
|
28-May-2013 |
Simon Horman <horms@verge.net.au> |
ipv4: Correct comparisons and calculations using skb->tail and skb->transport_header This corrects a regression introduced by "net: Use 16bits for *_headers fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In that case skb->tail will be a pointer, whereas skb->transport_header will be an offset from head. This is corrected by using wrappers that ensure that comparisons and calculations are always made using pointers. Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0d89d2035fe063461a5ddb609b2c12e7fb006e44 |
|
23-May-2013 |
Simon Horman <horms@verge.net.au> |
MPLS: Add limited GSO support In the case where a non-MPLS packet is received and an MPLS stack is added, it may well be the case that the original skb is GSO but the NIC used for transmit does not support GSO of MPLS packets. The aim of this code is to provide GSO in software for MPLS packets whose skbs are GSO. SKB Usage: When an implementation adds an MPLS stack to a non-MPLS packet it should do the following to skb metadata: * Set skb->inner_protocol to the old non-MPLS ethertype of the packet. skb->inner_protocol is added by this patch. * Set skb->protocol to the new MPLS ethertype of the packet. * Set skb->network_header to correspond to the end of the L3 header, including the MPLS label stack. I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to kernel" which adds MPLS support to the kernel datapath of Open vSwitch. That patch sets the above requirements in datapath/actions.c:push_mpls() and was used to exercise this code. The datapath patch is against the Open vSwitch tree but it is intended that it be added to the Open vSwitch code present in the mainline Linux kernel at some point. Features: I believe that the approach that I have taken is at least partially consistent with the handling of other protocols. Jesse, I understand that you have some ideas here. I am more than happy to change my implementation. This patch adds dev->mpls_features which may be used by devices to advertise features supported for MPLS packets. A new NETIF_F_MPLS_GSO feature is added for devices which support hardware MPLS GSO offload. Currently no devices support this and MPLS GSO always falls back to software. Alternate Implementation: One possible alternate implementation is to teach netif_skb_features() and skb_network_protocol() about MPLS, in a similar way to their understanding of VLANs. I believe this would avoid the need for net/mpls/mpls_gso.c and in particular the calls to __skb_push() and __skb_pull() in mpls_gso_segment(). I have decided on the implementation in this patch as it should not introduce any overhead in the case where mpls_gso is not compiled into the kernel or inserted as a module. MPLS GSO suggested by Jesse Gross. Based in part on "v4 GRE: Add TCP segmentation offload for GRE" by Pravin B Shelar. Cc: Jesse Gross <jesse@nicira.com> Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
71cea17ed39fdf1c0634f530ddc6a2c2fc601c2b |
|
20-May-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: md5: remove spinlock usage in fast path The TCP md5 code uses per-cpu variables but protects access to them with a shared spinlock, which is a contention point. [ tcp_md5sig_pool_lock is locked twice per incoming packet ] This patch makes things much simpler by allocating crypto structures once, the first time a socket needs md5 keys, and never deallocating them, as they are really small. The next step would be to allow crypto allocations to be done in a NUMA-aware way. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6ff50cd55545d922f5c62776fe1feb38a9846168 |
|
15-May-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: gso: do not generate out of order packets The GSO TCP handler has the following issues : 1) ooo_okay from the original GSO packet is duplicated to all segments 2) segments (but the last one) are orphaned, so the transmit path cannot get the transmit queue number from the socket. This happens if GSO segmentation is done before a stacked device, for example. The result is that we can send packets from a given TCP flow to different TX queues (if using multiqueue NICs). This generates OOO problems and spurious SACKs & retransmits. Fix this by keeping the socket pointer set for all segments. This means that every segment must also have a destructor, and the original gso skb truesize must be split on all segments, to keep precise sk->sk_wmem_alloc accounting. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
54d27fcb338bd9c42d1dfc5a39e18f6f9d373c2e |
|
13-May-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: fix tcp_md5_hash_skb_data() TCP md5 communications fail [1] for some devices, because the sg/crypto code assumes page offsets are below PAGE_SIZE. This was discovered using the mlx4 driver [2], but I suspect loopback might trigger the same bug now that we use order-3 pages in tcp_sendmsg(). [1] The failure gives the following messages: huh, entered softirq 3 NET_RX ffffffff806ad230 preempt_count 00000100, exited with 00000101? [2] The mlx4 driver uses order-2 pages to allocate RX frags. Reported-by: Matt Schnall <mischnal@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Bernhard Beck <bbeck@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bece1b9708434b6fb90b029affc228fc21688404 |
|
13-Apr-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: tcp_tso_segment() small optimization We can move the th->check computation out of the loop, as the compiler doesn't know that all skbs initially share the same TCP header after skb_segment(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d6a4a10411764cf1c3a5dad4f06c5ebe5194488b |
|
12-Apr-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: GSO should be TSQ friendly I noticed that TSQ (TCP Small queues) was less effective when TSO is turned off, and GSO is on. If BQL is not enabled, TSQ has then no effect. It turns out the GSO engine frees the original gso_skb at the time the fragments are generated and queued to the NIC. We should instead call the tcp_wfree() destructor for the last fragment, to keep the flow control as intended in TSQ. This effectively limits the number of queued packets on qdisc + NIC layers. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1a2c6181c4a1922021b4d7df373bba612c3e5f04 |
|
17-Mar-2013 |
Christoph Paasch <christoph.paasch@uclouvain.be> |
tcp: Remove TCPCT TCPCT uses option number 253, which is reserved for experimental use and should not be used in production environments. Further, TCPCT does not fully implement RFC 6013. As a nice side effect, removing TCPCT increases TCP's performance for very short flows: Doing an apache-benchmark with -c 100 -n 100000, sending HTTP requests for files of 1KB size. before this patch: average (among 7 runs) of 20845.5 Requests/Second after: average (among 7 runs) of 21403.6 Requests/Second Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
|
16fad69cfe4adbbfa813de516757b87bcae36d93 |
|
14-Mar-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: fix skb_availroom() The Chrome OS team reported a crash on a Pixel ChromeBook in the TCP stack : https://code.google.com/p/chromium/issues/detail?id=182056 commit a21d45726acac (tcp: avoid order-1 allocations on wifi and tx path) made a poor choice in adding an 'avail_size' field to the skb, while what we really needed was a 'reserved_tailroom' one. It would have avoided commit 22b4a4f22da (tcp: fix retransmit of partially acked frames) and this commit. The crash occurs because skb_split() is not aware of the 'avail_size' management (and should not be aware) Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mukesh Agrawal <quiche@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
731362674580cb0c696cd1b1a03d8461a10cf90a |
|
07-Mar-2013 |
Pravin B Shelar <pshelar@nicira.com> |
tunneling: Add generic Tunnel segmentation. Adds generic tunneling offloading support for IPv4-UDP based tunnels. A GSO type is added to request this offload for an skb. A netdev feature, NETIF_F_UDP_TUNNEL, is added for hardware-offloaded udp-tunnel support. Currently no device supports this feature, so software offload is used. This can be used by tunneling protocols like VXLAN. CC: Jesse Gross <jesse@nicira.com> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
68c331631143f5f039baac99a650e0b9e1ea02b6 |
|
14-Feb-2013 |
Pravin B Shelar <pshelar@nicira.com> |
v4 GRE: Add TCP segmentation offload for GRE The following patch adds a GRE protocol offload handler so that skb_gso_segment() can segment GRE packets. An SKB GSO CB is added to keep track of the total header length so that skb_segment can push the entire header, e.g. in the case of GRE, skb_segment needs to push the inner and outer headers to every segment. A new NETIF_F_GRE_GSO feature is added for devices which support HW GRE TSO offload. Currently no device supports it, therefore GRE GSO always falls back to software GSO. [ Compute pkt_len before ip_local_out() invocation. -DaveM ] Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c9af6db4c11ccc6c3e7f19bbc15d54023956f97c |
|
11-Feb-2013 |
Pravin B Shelar <pshelar@nicira.com> |
net: Fix possible wrong checksum generation. Patch cef401de7be8c4e (net: fix possible wrong checksum generation) fixed wrong checksum calculation, but it broke TSO by defining a new GSO type without a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. The following patch fixes both TSO and the wrong checksum. This patch uses the same logic that Eric Dumazet used. The patch introduces a new flag, SKBTX_SHARED_FRAG, set if at least one frag can be modified by the user, but the SKBTX_SHARED_FRAG flag is kept in the skb shared info tx_flags rather than in gso_type. tx_flags is better than gso_type, since we can have an skb with a shared frag that is not a GSO packet. This does not link SHARED_FRAG to GSO, so there is no need to define a netdev feature for it. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
93be6ce0e91b6a94783e012b1857a347a5e6e9f2 |
|
11-Feb-2013 |
Andrey Vagin <avagin@openvz.org> |
tcp: set and get per-socket timestamp A timestamp can be set only if a socket is in repair mode. This patch adds a new socket option, TCP_TIMESTAMP, which allows getting and setting the current tcp timestamp. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
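A minimal user-space sketch of how a checkpoint tool might use this option (the TCP_TIMESTAMP value of 24 comes from linux/tcp.h; the helper name and the assumption that the destination socket is already in repair mode are illustrative):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_TIMESTAMP
    #define TCP_TIMESTAMP 24            /* from linux/tcp.h */
    #endif

    /* Hypothetical helper: copy the current tcp timestamp from one
     * socket to another; the destination must be in repair mode. */
    int copy_tcp_timestamp(int src_fd, int dst_fd)
    {
            unsigned int ts;
            socklen_t len = sizeof(ts);

            if (getsockopt(src_fd, IPPROTO_TCP, TCP_TIMESTAMP, &ts, &len) < 0)
                    return -1;
            return setsockopt(dst_fd, IPPROTO_TCP, TCP_TIMESTAMP,
                              &ts, sizeof(ts));
    }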
|
ceaa1fef65a7c2e017b260b879b310dd24888083 |
|
11-Feb-2013 |
Andrey Vagin <avagin@openvz.org> |
tcp: adding a per-socket timestamp offset This functionality is used for restoring tcp sockets. A tcp timestamp depends on how long a system has been running, so it differs for each host. The solution is to set a per-socket offset. A per-socket offset for a TIME_WAIT socket is inherited from a proper tcp socket. tcp_request_sock doesn't have a timestamp offset, because repair mode for them is not implemented. Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ca2eb5679f8ddffff60156af42595df44a315ef0 |
|
05-Feb-2013 |
Stephen Hemminger <stephen@networkplumber.org> |
tcp: remove Appropriate Byte Count support TCP Appropriate Byte Count was added by me, but later disabled. There is no point in maintaining it since it is a potential source of bugs and Linux already implements other better window protection heuristics. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cef401de7be8c4e155c6746bfccf721a4fa5fab9 |
|
25-Jan-2013 |
Eric Dumazet <edumazet@google.com> |
net: fix possible wrong checksum generation Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
50c3a487d507568359ccbce1b15bcdfc01edf7ef |
|
22-Jan-2013 |
YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> |
ipv4: Use IS_ERR_OR_NULL(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f26845b43c75d3f32f98d194c1327b5b1e6b3fb0 |
|
09-Jan-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: fix splice() and tcp collapsing interaction Under unusual circumstances, TCP collapse can split a big GRO TCP packet while it's being used in a splice(socket->pipe) operation. skb_splice_bits() releases the socket lock before calling splice_to_pipe(). [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc() [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]- [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13 To fix this problem, we must eat skbs in tcp_recv_skb(). Remove the inline keyword from the tcp_recv_skb() definition since it has three call sites. Reported-by: Christian Becker <c.becker@traviangames.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ff905b1e4aad8ccbbb0d42f7137f19482742ff07 |
|
10-Jan-2013 |
Eric Dumazet <edumazet@google.com> |
tcp: splice: fix an infinite loop in tcp_read_sock() commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers) added a regression. [ 83.843570] INFO: rcu_sched self-detected stall on CPU [ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132) [ 83.844582] Task dump for CPU 6: [ 83.844584] netperf R running task 0 8966 8952 0x0000000c [ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000 [ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10 [ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8 [ 83.844594] Call Trace: [ 83.844596] [<ffffffff81088679>] ? vprintk_emit+0x1c9/0x4c0 [ 83.844601] [<ffffffff815ad449>] ? schedule+0x29/0x70 [ 83.844606] [<ffffffff81537bd2>] ? tcp_splice_data_recv+0x42/0x50 [ 83.844610] [<ffffffff8153beaa>] ? tcp_read_sock+0xda/0x260 [ 83.844613] [<ffffffff81537b90>] ? tcp_prequeue_process+0xb0/0xb0 [ 83.844615] [<ffffffff8153c0f0>] ? tcp_splice_read+0xc0/0x250 [ 83.844618] [<ffffffff814dc0c2>] ? sock_splice_read+0x22/0x30 [ 83.844622] [<ffffffff811b820b>] ? do_splice_to+0x7b/0xa0 [ 83.844627] [<ffffffff811ba4bc>] ? sys_splice+0x59c/0x5d0 [ 83.844630] [<ffffffff8119745b>] ? putname+0x2b/0x40 [ 83.844633] [<ffffffff8118bcb4>] ? do_sys_open+0x174/0x1e0 [ 83.844636] [<ffffffff815b6202>] ? system_call_fastpath+0x16/0x1b If recv_actor() returns 0, we should stop immediately, because looping won't give us a chance to drain the pipe. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e239345f642e6a255d0ba1e3d92c2f9ec5a44fbe |
|
08-Nov-2012 |
Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> |
dmaengine: remove dma_async_memcpy_complete() macro Just use dma_async_is_tx_complete() directly. Cc: Vinod Koul <vinod.koul@intel.com> Cc: Tomasz Figa <t.figa@samsung.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Dan Williams <djbw@fb.com>
|
b9ee86830f34737a08deead93872a79a37419a13 |
|
08-Nov-2012 |
Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> |
dmaengine: remove dma_async_memcpy_pending() macro Just use dma_async_issue_pending() directly. Cc: Vinod Koul <vinod.koul@intel.com> Cc: Tomasz Figa <t.figa@samsung.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Dan Williams <djbw@fb.com>
|
02275a2ee7c0ea475b6f4a6428f5df592bc9d30b |
|
02-Dec-2012 |
Willy Tarreau <w@1wt.eu> |
tcp: don't abort splice() after small transfers TCP coalescing added a regression in splice(socket->pipe) performance for some workloads, because of the way tcp_read_sock() is implemented. The reason for this is the break when (offset + 1 != skb->len). As we released the socket lock, this condition is possible if the TCP stack added a fragment to the skb, which can happen with TCP coalescing. So let's go back to the beginning of the loop when this happens, to give a chance to splice more frags per system call. Doing so fixes the issue and makes GRO 10% faster than LRO on CPU-bound splice() workloads, instead of the opposite. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
64022d0b4e93ea432e95db55a72b8a1c5775f3c0 |
|
01-Dec-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: fix crashes in do_tcp_sendpages() Recent network changes allowed high order pages being used for skb fragments. This uncovered a bug in do_tcp_sendpages() which was assuming its caller provided an array of order-0 page pointers. We only have to deal with a single page in this function, and its order is irrelevant. Reported-by: Willy Tarreau <w@1wt.eu> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
fd90b29d757827ab12d6669292612308ec249532 |
|
30-Nov-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: change default tcp hash size As time passed, available memory increased faster than the number of concurrent tcp sockets. As a result, a machine with 4GB of ram gets a hash table with 524288 slots, using 8388608 bytes of memory. Let's change that by a 16x factor (one slot per 128 KB of ram). Even if a small machine needs a _lot_ of sockets, tcp lookups are now very efficient, using one cache line per socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
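Concretely, at one slot per 128 KB, that same 4GB machine now gets 4 GB / 128 KB = 32768 slots; at the 16 bytes per slot implied by the numbers above (8388608 / 524288), that is 32768 * 16 = 524288 bytes instead of 8 MB.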
|
52e804c6dfaa5df1e4b0e290357b82ad4e4cda2c |
|
16-Nov-2012 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Allow userns root to control ipv4 Allow an unprivileged user who has created a user namespace, and then created a network namespace, to effectively use the new network namespace, by reducing capable(CAP_NET_ADMIN) and capable(CAP_NET_RAW) calls to ns_capable(net->user_ns, CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls. Settings that merely control a single network device are allowed. Either the network device is a logical network device where restrictions make no difference, or the network device is a hardware NIC that has been explicitly moved from the initial network namespace. In general, policy and network stack state changes are allowed while resource control is left unchanged. Allow creating raw sockets. Allow the SIOCSARP ioctl to control the arp cache. Allow the SIOCSIFFLAG ioctl to allow setting network device flags. Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address. Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address. Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address. Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask. Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes. Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for adding, changing and deleting gre tunnels. Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for adding, changing and deleting ipip tunnels. Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for adding, changing and deleting ipsec virtual tunnel interfaces. Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC, MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing sockets. Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and arbitrary ip options. Allow setting the IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option. Allow setting the IP_TRANSPARENT ipv4 socket option. Allow setting the TCP_REPAIR socket option. Allow setting the TCP_CONGESTION socket option. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
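The shape of the conversion, as a hedged kernel-style sketch (the function name here is illustrative, not the upstream code):

    /* Before: capable(CAP_NET_ADMIN) -- only global root passes.
     * After: the creator of the user namespace owning this net
     * namespace also passes. */
    static int example_perm_check(struct net *net)
    {
            if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
                    return -EPERM;
            return 0;
    }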
|
ec34232575083fd0f43d3a101e8ebb041b203761 |
|
15-Nov-2012 |
Andrew Vagin <avagin@openvz.org> |
tcp: fix retransmission in repair mode Currently if a socket was repaired with a few packets in its write queue, a kernel bug may be triggered: kernel BUG at net/ipv4/tcp_output.c:2330! RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610 According to the initial implementation (v3.4-rc2-963-gc0e88ff), all skbs should look as if they had already been sent. This patch fixes the code accordingly. Here are three points which were not done in the initial patch: 1. A tcp send head should not be changed 2. Initialize the TSO state of an skb 3. Reset the retransmission time This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet passes through the usual path, but isn't sent to the network. This patch solves all the described problems and handles tcp_sendpages. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6f73601efb35c7003f5c58c2bc6fd08f3652169c |
|
19-Oct-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: add SYN/data info to TCP_INFO Add a bit TCPI_OPT_SYN_DATA (32) to the socket option TCP_INFO:tcpi_options. It's set if the data in SYN (sent or received) is acked by SYN-ACK. Server or client application can use this information to check Fast Open success rate. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
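A hedged user-space sketch of the check described above (TCPI_OPT_SYN_DATA is 32 in linux/tcp.h; older libc headers may not define it, hence the fallback):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCPI_OPT_SYN_DATA
    #define TCPI_OPT_SYN_DATA 32        /* from linux/tcp.h */
    #endif

    /* Returns 1 if the data sent/received in the SYN was acked. */
    int syn_data_acked(int fd)
    {
            struct tcp_info info;
            socklen_t len = sizeof(info);

            if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
                    return -1;
            return !!(info.tcpi_options & TCPI_OPT_SYN_DATA);
    }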
|
0e71c55c9e3eb32de08e17152ec739582afc9400 |
|
21-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: speedup SIOCINQ ioctl SIOCINQ can use the lock_sock_fast() version to avoid double acquisition of socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a3374c42aa5f7237e87ff3b0622018636b0c847e |
|
18-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: fix FIONREAD/SIOCINQ tcp_ioctl() tries to take into account whether the tcp socket received a FIN, to report the correct number of bytes in the receive queue. But it's flaky, because if the application ate the last skb, we return 1 instead of 0. The correct way to detect that a FIN was received is to test SOCK_DONE. Reported-by: Elliot Hughes <enh@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
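The user-space side of this, as a small sketch; after the fix, the returned count is 0 (not 1) once the application has consumed everything up to the FIN:

    #include <sys/ioctl.h>

    /* Bytes currently readable from the TCP receive queue. */
    int readable_bytes(int fd)
    {
            int n = 0;

            return ioctl(fd, FIONREAD, &n) < 0 ? -1 : n;
    }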
|
5640f7685831e088fe6c2e1f863a6805962f8e81 |
|
24-Sep-2012 |
Eric Dumazet <edumazet@google.com> |
net: use a per task frag allocator We currently use a per-socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. This is done to increase the probability of coalescing small write()s into single segments in skbs still in the write queue (not yet sent). But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page. It's also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the page allocator more often than wanted. This patch adds a per-task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure (up to 32768 bytes per frag, that's order-3 pages on x86). This increases TCP stream performance by 20% on the loopback device, but also benefits other network devices, since 8x fewer frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with an IOMMU enabled. It's possible some SG-enabled hardware can't cope with bigger fragments, but its ndo_start_xmit() should already handle this, splitting a fragment into sub-fragments, since some arches have PAGE_SIZE=65536. Successfully tested on various ethernet devices (ixgbe, igb, bnx2x, tg3, mellanox mlx4). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
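A hedged kernel-style sketch of the try-big-then-fall-back refill logic described above (the function name and exact gfp flags are illustrative, not the upstream code):

    /* Try order-3 first (32KB on 4K-page arches), stepping down under
     * memory pressure; never warn or retry hard on the big attempts. */
    static bool task_frag_refill_sketch(struct page_frag *pf, gfp_t gfp)
    {
            int order;

            for (order = 3; order > 0; order--) {
                    pf->page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN |
                                           __GFP_NORETRY, order);
                    if (pf->page)
                            break;
            }
            if (!pf->page) {
                    pf->page = alloc_page(gfp);     /* order-0 last resort */
                    if (!pf->page)
                            return false;
            }
            pf->size = PAGE_SIZE << order;
            pf->offset = 0;
            return true;
    }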
|
bc26ccd8fc756749de95606d28314efd0ce5aec3 |
|
19-Sep-2012 |
Andrey Vagin <avagin@openvz.org> |
tcp: restore rcv_wscale in a repair mode (v2) rcv_wscale is a parameter symmetric with snd_wscale. Both of these parameters are set during the connection handshake. Without this value a remote window size cannot be interpreted correctly, because a value from a packet should be shifted by rcv_wscale. And one more thing: wscale_ok should be set too. This patch doesn't break backward compatibility. If someone uses it in the old scheme, the rcv window will be restored with the same bug (rcv_wscale = 0). v2: Save backward compatibility on big-endian systems. Before, the first two bytes were snd_wscale and the next two bytes were rcv_wscale. Now snd_wscale is opt_val & 0xFFFF and rcv_wscale is opt_val >> 16. This approach is independent of byte ordering. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> CC: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bb68b64724a4fd6b93d83b39aeffa4aadb2562fc |
|
18-Sep-2012 |
Christoph Paasch <christoph.paasch@uclouvain.be> |
ipv4: Don't add TCP-code in inet_sock_destruct Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Acked-by: H.K. Jerry Chu <hkchu@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
15c041759bfcd9ab0a4e43f1c16e2644977d0467 |
|
14-Sep-2012 |
Michal Kubeček <mkubecek@suse.cz> |
tcp: flush DMA queue before sk_wait_data if rcv_wnd is zero If the recv() syscall is called for a TCP socket such that - IOAT DMA is used - the MSG_WAITALL flag is used - the requested length is bigger than sk_rcvbuf - enough data has already arrived to bring rcv_wnd to zero then, when tcp_recvmsg() gets to calling sk_wait_data(), the receive window can still be zero while sk_async_wait_queue exhausts enough space to keep it zero. As this queue isn't cleaned until the tcp_service_net_dma() call, sk_wait_data() cannot receive any data and blocks forever. If a zero receive window and a non-empty sk_async_wait_queue are detected before calling sk_wait_data(), process the queue first. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8336886f786fdacbc19b719c1f7ea91eb70706d4 |
|
31-Aug-2012 |
Jerry Chu <hkchu@google.com> |
tcp: TCP Fast Open Server - support TFO listeners This patch builds on top of the previous patch to add support for TFO listeners. This includes - 1. allocating, properly initializing, and managing the per-listener fastopen_queue structure when TFO is enabled 2. changes to the inet_csk_accept code to support TFO. E.g., the request_sock can no longer be freed upon accept(), not until the 3WHS finishes 3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg() if it's a TFO socket 4. properly closing a TFO listener, and a TFO socket before the 3WHS finishes 5. supporting the TCP_FASTOPEN socket option 6. modifying tcp_check_req() to check a TFO socket as well as a request_sock 7. supporting TCP's TFO cookie option 8. adding a new SYN-ACK retransmit handler to use the timer directly off the TFO socket rather than the listener socket. Note that the TFO server side will not retransmit anything other than the SYN-ACK until the 3WHS is completed. The patch also contains an important function, "reqsk_fastopen_remove()", to manage the somewhat complex relation between a listener, its request_sock, and the corresponding child socket. See the comment above the function for details. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
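A hedged user-space sketch of enabling the TCP_FASTOPEN option (item 5 above) on a listener; the value 23 comes from linux/tcp.h, and the queue length shown is an arbitrary example:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_FASTOPEN
    #define TCP_FASTOPEN 23             /* from linux/tcp.h */
    #endif

    /* qlen bounds the fastopen_queue of connections that sent data in
     * the SYN but have not yet completed the 3WHS. */
    int enable_tfo(int listen_fd, int qlen)
    {
            return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                              &qlen, sizeof(qlen));
    }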
|
1485348d2424e1131ea42efc033cbd9366462b01 |
|
30-Jul-2012 |
Ben Hutchings <bhutchings@solarflare.com> |
tcp: Apply device TSO segment limit earlier Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to limit the size of TSO skbs. This avoids the need to fall back to software GSO for local TCP senders. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
42493570100b91ef663c4c6f0c0fdab238f9d3c2 |
|
27-Jul-2012 |
Hangbin Liu <liuhangbin@gmail.com> |
tcp: Add TCP_USER_TIMEOUT negative value check TCP_USER_TIMEOUT is a TCP-level socket option that takes an unsigned int. But patch "tcp: Add TCP_USER_TIMEOUT socket option" (dca43c75) didn't check for negative values. If a user assigns -1 to it, the option is set successfully and the socket waits for 4294967295 milliseconds. This patch adds a negative value check to avoid this issue. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
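A hedged sketch of the safe usage pattern; TCP_USER_TIMEOUT takes an unsigned int of milliseconds (value 18 in linux/tcp.h), so the caller should reject negatives just as the patched kernel now does:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_USER_TIMEOUT
    #define TCP_USER_TIMEOUT 18         /* from linux/tcp.h */
    #endif

    int set_user_timeout_ms(int fd, long ms)
    {
            unsigned int val;

            if (ms < 0)
                    return -1;          /* would otherwise become ~49.7 days */
            val = (unsigned int)ms;
            return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                              &val, sizeof(val));
    }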
|
cf60af03ca4e71134206809ea892e49b92a88896 |
|
19-Jul-2012 |
Yuchung Cheng <ycheng@google.com> |
net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2) and write(2). The application should replace connect() with it to send data in the opening SYN packet. For a blocking socket, sendmsg() blocks until all the data are buffered locally and the handshake is completed, like a connect() call. It returns errno values similar to connect()'s if the TCP handshake fails. For a non-blocking socket, it returns the number of bytes queued (and transmitted in the SYN-data packet) if a cookie is available. If a cookie is not available, it transmits a data-less SYN packet with the Fast Open cookie request option and returns -EINPROGRESS like connect(). Using MSG_FASTOPEN on a connecting or connected socket results in errno values similar to those of repeated connect() calls. Therefore the application should only use this flag on new sockets. The buffer size of sendmsg() is independent of the MSS of the connection. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
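A hedged client-side sketch of the combo call described above (MSG_FASTOPEN is 0x20000000 in linux/socket.h; the helper name is illustrative):

    #include <sys/socket.h>
    #include <netinet/in.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000     /* from linux/socket.h */
    #endif

    /* Replaces connect()+write(): data may ride in the SYN itself. */
    ssize_t tfo_connect_send(int fd, const void *buf, size_t len,
                             const struct sockaddr *dst, socklen_t dlen)
    {
            return sendto(fd, buf, len, MSG_FASTOPEN, dst, dlen);
    }

On a non-blocking socket the return value distinguishes the two cases in the message above: bytes queued when a cookie was on hand, or -1 with EINPROGRESS when a cookie request had to go out first.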
|
46d3ceabd8d98ed0ad10f20c595ca784e34786c5 |
|
11-Jul-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: TCP Small Queues This introduces TSQ (TCP Small Queues). The TSQ goal is to reduce the number of TCP packets in xmit queues (qdisc & device queues), to reduce RTT and cwnd bias, part of the bufferbloat problem. sk->sk_wmem_alloc is not allowed to grow above a given limit, allowing no more than ~128KB [1] per tcp socket in the qdisc/dev layers at a given time. TSO packets are sized/capped to half the limit, so that we have two TSO packets in flight, allowing better bandwidth use. As a side effect, setting the limit to 40000 automatically reduces the standard gso max limit (65536) to 40000/2: it can help to reduce latencies of high-prio packets, having smaller TSO packets. This means we divert sock_wfree() to a tcp_wfree() handler, to queue/send following frames when skb_orphan() [2] is called for the already queued skbs. Results on my dev machines (tg3/ixgbe nics) are really impressive, using standard pfifo_fast, with or without TSO/GSO. Without reduction of nominal bandwidth, we have a reduction of buffering per bulk sender: < 1ms on Gbit (instead of 50ms with TSO) < 8ms on 100Mbit (instead of 132 ms) I no longer have 4 MBytes backlogged in the qdisc by a single netperf session, and both sides' socket autotuning no longer uses 4 MBytes. As the skb destructor cannot restart xmit itself (the qdisc lock might be taken at this point), we delegate the work to a tasklet. We use one tasklet per cpu for performance reasons. If the tasklet finds a socket owned by the user, it sets the TSQ_OWNED flag. This flag is tested in a new protocol method called from release_sock(), to eventually send new segments. [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable [2] skb_orphan() is usually called at TX completion time, but some drivers call it in their start_xmit() handler. These drivers should at least use BQL, or else a single TCP session can still fill the whole NIC TX ring, since TSQ will have no effect. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ae86b9e3846f6fc5509dee721f2bdba1db8ab96a |
|
10-Jul-2012 |
Ben Hutchings <bhutchings@solarflare.com> |
net: Fix non-kernel-doc comments with kernel-doc start marker Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
51c5d0c4b169bf762f09e0d5b283a7f0b2a45739 |
|
10-Jul-2012 |
David S. Miller <davem@davemloft.net> |
tcp: Maintain dynamic metrics in local cache. Maintain a local hash table of TCP dynamic metrics blobs. Computed TCP metrics are no longer maintained in the route metrics. The table uses RCU and an extremely simple hash so that it has low latency and low overhead. A simple hash is legitimate because we only make metrics blobs for fully established connections. The default hash table sizes, metric timeouts, and hash chain length limit certainly could use some tweaking. But the basic design seems sound. With help from Eric Dumazet and Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net>
|
31fe62b9586643953f0c0c37a6357dafc69034e2 |
|
23-May-2012 |
Tim Bird <tim.bird@am.sony.com> |
mm: add a low limit to alloc_large_system_hash UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. In some low-memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and realloc a bigger memory area. As we cannot easily free the old hash table, we leak it, and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
413c27d8697751f72d2d6cf289140a8e060a8032 |
|
19-May-2012 |
Eldad Zack <eldad@fogrefinery.com> |
net/ipv4: replace simple_strtoul with kstrtoul Replace simple_strtoul with kstrtoul in three similar occurrences, all setup handlers: * route.c: set_rhash_entries * tcp.c: set_thash_entries * udp.c: set_uhash_entries Also check if the conversion failed. Signed-off-by: Eldad Zack <eldad@fogrefinery.com> Signed-off-by: David S. Miller <davem@davemloft.net>
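The converted pattern, as a hedged kernel-style sketch (the real handlers in route.c/tcp.c/udp.c differ in detail; the names here are illustrative):

    static unsigned long thash_entries_sketch;

    static int __init set_thash_entries_sketch(char *str)
    {
            if (!str)
                    return 0;
            /* kstrtoul returns 0 on success; a failed conversion is now
             * noticed instead of silently parsed as simple_strtoul did */
            if (kstrtoul(str, 0, &thash_entries_sketch))
                    return 0;
            return 1;
    }
    __setup("thash_entries=", set_thash_entries_sketch);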
|
bad115cfe5b509043b684d3a007ab54b80090aa1 |
|
17-May-2012 |
Willy Tarreau <w@1wt.eu> |
tcp: do_tcp_sendpages() must try to push data out on oom conditions Since recent changes on TCP splicing (starting with commits 2f533844 "tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp: tcp_sendpages() should call tcp_push() once"), I started seeing massive stalls when forwarding traffic between two sockets using splice() when pipe buffers were larger than socket buffers. Latest changes (net: netdev_alloc_skb() use build_skb()) made the problem even more apparent. The reason seems to be that if do_tcp_sendpages() fails on an out-of-memory condition without being able to send at least one byte, tcp_push() is not called and the buffers cannot be flushed. After applying the attached patch, I cannot reproduce the stalls at all and the data rate is perfectly stable and steady under any condition which previously caused the problem to be permanent. The issue seems to have been there since before the kernel migrated to git, which makes me think that the stalls I occasionally experienced with tux during stress-tests years ago were probably related to the same issue. This issue was first encountered on 3.0.31 and 3.2.17, so please backport to -stable. Signed-off-by: Willy Tarreau <w@1wt.eu> Acked-by: Eric Dumazet <edumazet@google.com> Cc: <stable@vger.kernel.org>
|
a2a385d627e1549da4b43a8b3dfe370589766e1c |
|
17-May-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: bool conversions bool conversions where possible. __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dc6b9b78234fecdc6d2ca5e1629185718202bcf5 |
|
17-May-2012 |
Eric Dumazet <edumazet@google.com> |
net: include/net/sock.h cleanup bool/const conversions where possible __inline__ -> inline space cleanups Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e87cc4728f0e2fb663e592a1141742b1d6c63256 |
|
13-May-2012 |
Joe Perches <joe@perches.com> |
net: Convert net_ratelimit uses to net_<level>_ratelimited Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
292e8d8c853889140ed77b7b37c66979b13080ae |
|
10-May-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Move rcvq sending to tcp_input.c It actually works on the input queue and will use its read mem routines, thus it's better to have it in the tcp_input.c file. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b081f85c2977b1cbb6e635d53d9512f1ef985972 |
|
02-May-2012 |
Eric Dumazet <edumazet@google.com> |
net: implement tcp coalescing in tcp_queue_rcv() Extend tcp coalescing by implementing it from tcp_queue_rcv(), the main receive function when the application is not blocked in recvmsg(). The tcp_queue_rcv() function is moved a bit to allow calling it from tcp_data_queue(). This gives good results, especially if GRO could not kick in, and if the skb head is a fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b49960a05e32121d29316cfdf653894b88ac9190 |
|
02-May-2012 |
Eric Dumazet <edumazet@google.com> |
tcp: change tcp_adv_win_scale and tcp_rmem[2] The tcp_adv_win_scale default value is 2, meaning we expect a good-citizen skb to have a skb->len / skb->truesize ratio of 75% (3/4). In 2.6 kernels we (mis)accounted a typical MSS=1460 frame as: 1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392. So these skbs were considered not bloated. With recent truesize fixes, a typical MSS=1460 frame truesize is now the more precise: 2048 + 256 = 2304. But 2304 * 3/4 = 1728. So these skbs are not good citizens anymore, because 1460 < 1728. (GRO can escape this problem because it builds skbs with a too-low truesize.) This also means tcp advertises a too-optimistic window for a given allocated rcvspace: when receiving frames, sk_rmem_alloc can hit the sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often, especially when the application is slow to drain its receive queue, or in case of losses (netperf is fast, scp is slow). This is a major latency source. We should adjust the len/truesize ratio to 50% instead of 75%. This patch: 1) changes the tcp_adv_win_scale default to 1 instead of 2 2) increases the tcp_rmem[2] limit from 4MB to 6MB to take into account better truesize tracking and to allow the autotuned tcp receive window to reach the same value as before. Note that the same amount of kernel memory is consumed compared to 2.6 kernels. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
eed530b6c67624db3f2cf477bac7c4d005d8f7ba |
|
02-May-2012 |
Yuchung Cheng <ycheng@google.com> |
tcp: early retransmit This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces the DUPACK threshold (dupthresh) if fewer than 4 packets are outstanding, to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enters false recovery and degrades performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large-scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP", IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are that THIN_DUPACK does not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge the THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
de248a75c35e0208294cf304b112916254b69184 |
|
26-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp repair: Fix unaligned access when repairing options (v2) Don't pick __u8/__u16 values directly from raw pointers, but instead use an array of structures of code:value pairs. This is OK, since the buffer we take options from is not skb memory, but a user-to-kernel one. For those options which don't require any value now, require it to be zero (for potential future extension of this API). v2: Changed tcp_repair_opt to use two __u32-s, as spotted by David Laight. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
38ba0a65faf451dd46c7860b4fade84c0b8e444f |
|
23-Apr-2012 |
Eric Dumazet <edumazet@google.com> |
net: skb_can_coalesce returns a boolean Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
900f65d361d333c949ef76a828343075f4fdf523 |
|
19-Apr-2012 |
Neal Cardwell <ncardwell@google.com> |
tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() This commit moves the (substantial) common code shared between tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family independent function, tcp_init_sock(). Centralizing this functionality should help avoid drift issues, e.g. where the IPv4 side is updated without a corresponding update to IPv6. There was already some drift: IPv4 initialized snd_cwnd to TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to 2 (in this case it should not matter, since snd_cwnd is also initialized in tcp_init_metrics(), but the general risks and maintenance overhead remain). When diffing the old and new code, note that new tcp_init_sock() function uses the order of steps from the tcp_v4_init_sock() implementation (the order is slightly different in tcp_v6_init_sock()). Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b139ba4e90dccbf4cd4efb112af96a5c9e0b098c |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Repair connection-time negotiated parameters There are options which are set up on a socket while performing the TCP handshake. We need to resurrect them on a socket while repairing. A new sockoption accepts a buffer and parses it. The buffer should be a CODE:VALUE sequence of bytes, where CODE is a standard option code and VALUE is the respective value. Only 4 options should be handled on a repaired socket. To read 3 out of 4 of these options the TCP_INFO sockoption can be used. The ability to get the last one (the mss_clamp) was added by the previous patch. Now the restore. Three of these options -- timestamp_ok, mss_clamp and snd_wscale -- are just restored on a socket. The sack_ok flag has two issues. First, whether or not to do sacks at all. This flag is just read and set back. No other sack info is saved or restored, since according to the standard and the code, dropping all sacked segments is OK; the sender will resubmit them again, so after the repair we will probably experience a pause in the connection. Next, the fack bit. It's just set back on a socket if the respective sysctl is set. No collected stats about the packet flow are preserved. As far as I can see (please correct me if I'm wrong) the fack-based congestion algorithm survives dropping all of the stats and repairs itself eventually, probably losing performance for that period. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
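A hedged user-space sketch of the CODE:VALUE buffer (TCP_REPAIR_OPTIONS is 22 in linux/tcp.h, and the pairs are two __u32 fields per the later "unaligned access" fix; the wscale packing follows the rcv_wscale restore patch above; option codes 2 and 3 are the standard MSS and window-scale codes):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_REPAIR_OPTIONS
    #define TCP_REPAIR_OPTIONS 22       /* from linux/tcp.h */
    #endif

    struct repair_opt {                 /* mirrors struct tcp_repair_opt */
            unsigned int opt_code;
            unsigned int opt_val;
    };

    int restore_negotiated(int fd, unsigned int mss,
                           unsigned int snd_ws, unsigned int rcv_ws)
    {
            struct repair_opt opts[] = {
                    { 2, mss },                     /* TCPOPT_MSS */
                    { 3, snd_ws | (rcv_ws << 16) }, /* TCPOPT_WINDOW */
            };

            /* the socket must already be in repair mode */
            return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_OPTIONS,
                              opts, sizeof(opts));
    }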
|
5e6a3ce6573f0c519d1ff57df60e3877bb2d3151 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Report mss_clamp with TCP_MAXSEG option in repair mode The mss_clamp is the only connection-time negotiated option which cannot be obtained from user space. Make the TCP_MAXSEG sockopt report it in repair mode. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c0e88ff0f256958401778ff692da4b8891acb5a9 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Repair socket queues Reading queues under repair mode is done with a recvmsg call. The queue-under-repair, set by the TCP_REPAIR_QUEUE option, is used to determine which queue should be read. Thus both the send and receive queues can be read with this. The caller must pass the MSG_PEEK flag. Writing to queues is done with a sendmsg call and, yet again, the repair-queue option can be used to push data into the receive queue. When putting an skb into the receive queue, a zero tcp header is appended to its head to address the tcp_hdr(skb)->syn and ->fin checks by the (after-repair) tcp_recvmsg; these flags are both read as zero, which is the point. The fin cannot be met in the queue while reading the source socket, since repair only works for closed/established sockets and queueing a fin packet always changes its state. The syn in the queue denotes that the respective skb's seq is "off-by-one" as compared to the actual payload length. Thus, at the rcv queue refill we can just drop this flag and set the skb's sequences to precise values. When repair mode is turned off, the write queue seqs are updated so that the whole queue is considered to be 'already sent, waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the protocol POV the send queue looks like it was sent, but the data between write_seq and snd_nxt is lost in the network. This helps to avoid another sockoption for setting the snd_nxt sequence. Leaving the whole queue in a 'not yet sent' state (as it would be after sendmsg-s) would not allow receiving any acks from the peer, since the ack_seq would be after snd_nxt. Thus even the ack for the window probe would be dropped and the connection would be 'locked' with a zero peer window. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
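A hedged sketch of the read side (queue id 2 is TCP_SEND_QUEUE in linux/tcp.h; MSG_PEEK is mandatory, per the text above; MSG_DONTWAIT is an assumption for a dump loop):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_REPAIR_QUEUE
    #define TCP_REPAIR_QUEUE 20         /* from linux/tcp.h */
    #endif

    ssize_t peek_send_queue(int fd, void *buf, size_t len)
    {
            int q = 2;                  /* TCP_SEND_QUEUE */

            if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE,
                           &q, sizeof(q)) < 0)
                    return -1;
            return recv(fd, buf, len, MSG_PEEK | MSG_DONTWAIT);
    }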
|
ee9952831cfd0bbe834f4a26489d7dce74582e37 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Initial repair mode This includes (according to the previous description): * TCP_REPAIR sockoption This one just puts the socket in/out of the repair mode. Allowed for CAP_NET_ADMIN and for closed/established sockets only. When repair mode is turned off and the socket happens to be in the established state the window probe is sent to the peer to 'unlock' the connection. * TCP_REPAIR_QUEUE sockoption This one sets the queue which we're about to repair. The 'no-queue' is set by default. * TCP_QUEUE_SEQ sockoption Sets the write_seq/rcv_nxt of a selected repaired queue. Allowed for TCP_CLOSE-d sockets only. When the socket changes its state the other seq-s are changed by the kernel according to the protocol rules (most of the existing code is actually reused). * Ability to forcibly bind a socket to a port The sk->sk_reuse is set to SK_FORCE_REUSE. * Immediate connect modification The connect syscall initializes the connection, then directly jumps to the code which finalizes it. * Silent close modification The close just aborts the connection (similar to SO_LINGER with 0 time) but without sending any FIN/RST-s to the peer. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
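Putting the pieces together, a hedged sketch of the restore flow these options enable (option values from linux/tcp.h; error handling mostly elided, helper name illustrative):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_REPAIR
    #define TCP_REPAIR 19               /* from linux/tcp.h */
    #define TCP_REPAIR_QUEUE 20
    #define TCP_QUEUE_SEQ 21
    #endif

    int restore_send_seq(int fd, unsigned int write_seq)
    {
            int on = 1, off = 0, q = 2; /* 2 == TCP_SEND_QUEUE */

            /* needs CAP_NET_ADMIN; socket must be closed or established */
            if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
                    return -1;
            setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
            /* only valid while the socket is still TCP_CLOSE */
            setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ,
                       &write_seq, sizeof(write_seq));
            /* ... bind/connect/refill queues here ... */
            return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
    }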
|
370816aef0c5436c2adbec3966038f36ca326933 |
|
19-Apr-2012 |
Pavel Emelyanov <xemul@parallels.com> |
tcp: Move code around This is just the preparation patch, which makes the needed for TCP repair code ready for use. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
95c961747284a6b83a5e2d81240e214b0fa3464d |
|
15-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a21d45726acacc963d8baddf74607d9b74e2b723 |
|
10-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: avoid order-1 allocations on wifi and tx path Marc Merlin reported many order-1 allocation failures in the TX path on his wireless setup, which don't make any sense with an MTU=1500 network and non-SG-capable hardware. After investigation, it turns out TCP uses sk_stream_alloc_skb() and, as a convention, used skb_tailroom(skb) to know how many bytes of data payload could be put in this skb (for non-SG-capable devices). Note: these skbs used kmalloc-4096 (MTU=1500 + MAX_HEADER + sizeof(struct skb_shared_info) being above 2048). Later, the mac80211 layer needs to add some bytes at the tail of the skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and, since no more tailroom is available, has to call pskb_expand_head() and request order-1 allocations. This patch changes sk_stream_alloc_skb() so that only sk->sk_prot->max_header bytes of headroom are reserved, and uses a new skb field, avail_size, to hold the data payload limit. This way, order-0 allocations done by the TCP stack can leave more than 2 KB of tailroom and no further allocation is performed in the mac80211 layer (or any layer needing some tailroom). avail_size is unioned with mark/dropcount, since mark will be set later in the IP stack for output packets. Therefore, skb size is unchanged. Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5fb84b1428b271f8767e0eb3fcd7231896edfaa4 |
|
10-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: restore correct limit Commit c43b874d5d714f (tcp: properly initialize tcp memory limits) tried to fix a regression added in commits 4acb4190 & 3dc43e3, but still got it wrong. The result is that machines with a low amount of memory have a too-small tcp_rmem[2] value and slow tcp receives: the per-socket limit is 1/1024 of memory instead of 1/128 as in old kernels, so the rcv window is capped to small values. Fix this to match the comment and the previous behavior. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 |
|
05-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_sendpages() should call tcp_push() once commit 2f533844242 (tcp: allow splice() to build full TSO packets) added a regression for splice() calls using SPLICE_F_MORE. We need to call tcp_push() at the end of the last page processed in tcp_sendpages(), or else transmits can be deferred and future sends stall. Add a new internal flag, MSG_SENDPAGE_NOTLAST, acting like MSG_MORE, but with different semantics. For all sendpage() providers, it's a transparent change. Only sock_sendpage() and tcp_sendpages() differentiate the two flags provided by pipe_to_sendpage() Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a2bd1140a264b561e38d99e656cd843c2d840e86 |
|
05-Apr-2012 |
Dave Jiang <dave.jiang@intel.com> |
netdma: adding alignment check for NETDMA ops This is the fallout from adding a memcpy alignment workaround for certain IOATDMA hardware. NetDMA will only use a DMA engine that can handle byte-aligned ops. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
2f53384424251c06038ae612e56231b96ab610ee |
|
03-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: allow splice() to build full TSO packets vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096). The call to tcp_push() at the end of do_tcp_sendpages() forces an immediate xmit when the pipe is not already filled, and tso_fragment() tries to split these skbs into MSS multiples. 4096 bytes are usually split into an skb with 2 MSS and a remaining sub-MSS skb (assuming MTU=1500). This makes slow start suboptimal because many small frames are sent to the qdisc/driver layers instead of big ones (constrained by cwnd and packets in flight, of course). In fact, applications using sendmsg() (adding an additional memory copy) instead of vmsplice()/splice()/sendfile() are a bit faster because of this anomaly, especially if serving small files in environments with large initial [c]wnd. Call tcp_push() only if MSG_MORE is not set in the flags parameter. This bit is automatically provided by splice() internals except for the last page, or on all pages if the user specified the SPLICE_F_MORE splice() flag. In some workloads, this can reduce the number of sent logical packets by an order of magnitude, making zero-copy TCP actually faster than one-copy :) Reported-by: Tom Herbert <therbert@google.com> Cc: Nandita Dukkipati <nanditad@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: H.K. Jerry Chu <hkchu@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Mahesh Bandewar <maheshb@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
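A hedged user-space sketch of the file->pipe->socket pattern discussed here; passing SPLICE_F_MORE on all but the final chunk is what lets TCP defer the push and build full TSO packets (the helper name is illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t splice_chunk(int file_fd, int pipe_rd, int pipe_wr,
                         int sock_fd, size_t chunk, int last)
    {
            unsigned int fl = SPLICE_F_MOVE | (last ? 0 : SPLICE_F_MORE);
            ssize_t n = splice(file_fd, NULL, pipe_wr, NULL, chunk, fl);

            if (n <= 0)
                    return n;
            /* drain the pipe into the socket with the same MORE hint */
            return splice(pipe_rd, NULL, sock_fd, NULL, (size_t)n, fl);
    }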
|
afd465030acb4098abcb6b965a5aebc7ea2209e0 |
|
12-Mar-2012 |
Joe Perches <joe@perches.com> |
net: ipv4: Standardize prefixes for message logging Add #define pr_fmt(fmt) as appropriate. Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files. Standardize on "UDPLite: " for appropriate uses. Some prefixes were previously "UDPLITE: " and "UDP-Lite: ". Add KBUILD_MODNAME ": " to icmp and gre. Remove embedded prefixes as appropriate. Add missing "\n" to pr_info in gre.c. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
058bd4d2a4ff0aaa4a5381c67e776729d840c785 |
|
11-Mar-2012 |
Joe Perches <joe@perches.com> |
net: Convert printks to pr_<level> Use a more current kernel messaging style. Convert a printk block to print_hex_dump. Coalesce formats, align arguments. Use %s, __func__ instead of embedding function names. Some messages that were prefixed with <foo>_close are now prefixed with <foo>_fini. Some ah4 and esp messages are now not prefixed with "ip ". The intent of this patch is to later add something like #define pr_fmt(fmt) "IPv4: " fmt. to standardize the output messages. Text size is trivially reduced. (x86-32 allyesconfig) $ size net/ipv4/built-in.o* text data bss dec hex filename 887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new 887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
074b85175a43a23fdbde60f55feea636e0bf0f85 |
|
08-Feb-2012 |
Dimitri Sivanich <sivanich@sgi.com> |
vfs: fix panic in __d_lookup() with high dentry hashtable counts When the number of dentry cache hash table entries gets too high (2147483648 entries), as happens by default on a 16TB system, use of a signed integer in the dcache_init() initialization loop prevents the dentry_hashtable from getting initialized, causing a panic in __d_lookup(). Fix this in dcache_init() and similar areas. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
c43b874d5d714f271b80d4c3f49e05d0cbf51ed2 |
|
02-Feb-2012 |
Jason Wang <jasowang@redhat.com> |
tcp: properly initialize tcp memory limits Commit 4acb4190 tries to fix the use of an uninitialized value introduced by commit 3dc43e3, but it would make the per-socket memory limits too small. This patch fixes this and also removes the redundant code introduced in 4acb4190. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
efcdbf24fd5daa88060869e51ed49f68b7ac8708 |
|
30-Jan-2012 |
Arun Sharma <asharma@fb.com> |
net: Disambiguate kernel message Some of our machines were reporting: TCP: too many of orphaned sockets even when the number of orphaned sockets was well below the limit. We print a different message depending on whether we're out of TCP memory or there are too many orphaned sockets. Also move the check out of line and cleanup the messages that were printed. Signed-off-by: Arun Sharma <asharma@fb.com> Suggested-by: Mohan Srinivasan <mohan@fb.com> Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: David Miller <davem@davemloft.net> Cc: Glauber Costa <glommer@parallels.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4acb41903b2f99f3dffd4c3df9acc84ca5942cb2 |
|
30-Jan-2012 |
Glauber Costa <glommer@parallels.com> |
net/tcp: Fix tcp memory limits initialization when !CONFIG_SYSCTL sysctl_tcp_mem() initialization was moved to sysctl_tcp_ipv4.c in commit 3dc43e3e4d0b52197d3205214fe8f162f9e0c334, since it became a per-ns value. That code, however, will never run when CONFIG_SYSCTL is disabled, leading to bogus values on those fields - causing hung TCP sockets. This patch fixes it by keeping initialization code in tcp_init(). It will be overwritten by the first net namespace init if CONFIG_SYSCTL is compiled in, and does the right thing if it is compiled out. It is also named properly as tcp_init_mem(), to properly signal its non-sysctl side effect on TCP limits. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: David S. Miller <davem@davemloft.net> Link: http://lkml.kernel.org/r/4F22D05A.8030604@parallels.com [ renamed the function, tidied up the changelog a bit ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3dc43e3e4d0b52197d3205214fe8f162f9e0c334 |
|
11-Dec-2011 |
Glauber Costa <glommer@parallels.com> |
per-netns ipv4 sysctl_tcp_mem This patch allows each namespace to independently set up its levels for tcp memory pressure thresholds. This patch alone does not buy much: we need to make these values per group of processes somehow. This is achieved in the patches that follow in this patchset. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0a5912db7b4f9c3ff3bd0dbb67f36484a3b21a35 |
|
05-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: remove TCP_OFF and TCP_PAGE macros As mentioned by Joe Perches, TCP_OFF() and TCP_PAGE() macros are useless. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
761965eab38d2cbc59c36e355c59609e3a04705a |
|
04-Dec-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_sendmsg() page recycling If our TCP_PAGE(sk) is not shared (page_count() == 1), we can set page offset to 0. This permits better filling of the pages on small to medium tcp writes. "tbench 16" results on my dev server (2x4x2 machine) : Before : 3072 MB/s After : 3146 MB/s (2.4 % gain) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f07d960df33c5aef8f513efce0fd201f962f94a1 |
|
28-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: avoid frag allocation for small frames tcp_sendmsg() uses the select_size() helper to choose skb head size when a new skb must be allocated. If GSO is enabled for the socket, the current strategy is to force all payload data to be outside of headroom, in PAGE fragments. This strategy is not welcome for small packets, as it wastes memory. Experiments show that best results are obtained when using 2048 bytes for the skb head (this includes the skb overhead and various headers). This patch provides better len/truesize ratios for packets sent to the loopback device, and reduces memory needs for in-flight loopback packets, particularly on arches with big pages. If a sender sends many 1-byte packets to an unresponsive application, receiver rmem_alloc will grow faster and will stop queuing these packets sooner, or will collapse its receive queue to free excess memory. netperf -t TCP_RR results are improved by ~4%, and many workloads are improved as well (tbench, mysql...) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
690e99c4ba73fc18643b38fa032022b8758ad4d3 |
|
28-Nov-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_sendmsg() wrong access to sk_route_caps Now that sk_route_caps is u64, it's dangerous to use an integer to store the result of an AND operation. It won't work if NETIF_F_SG is moved to the upper part of the u64. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
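A small userspace illustration of the truncation hazard (the flag value here is hypothetical; NETIF_F_SG does not actually sit in the upper 32 bits):

    #include <stdint.h>
    #include <stdio.h>

    #define FLAG_HI (1ULL << 40)    /* hypothetical feature bit in the upper half */

    int main(void)
    {
        uint64_t caps = FLAG_HI;
        int sg_bad = caps & FLAG_HI;      /* narrowed to 32 bits: 0 on common ABIs */
        int sg_ok = !!(caps & FLAG_HI);   /* collapse to 0/1 before narrowing: 1 */
        printf("bad=%d ok=%d\n", sg_bad, sg_ok);
        return 0;
    }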
|
c8f44affb7244f2ac3e703cab13d55ede27621bb |
|
15-Nov-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: introduce and use netdev_features_t for device features sets v2: add a couple of missing conversions in drivers; split unexporting netdev_fix_features(); implemented %pNF; convert sock::sk_route_(no?)caps Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
78d81d15b74246c7cedf84894434890b33da3907 |
|
24-Oct-2011 |
Flavio Leitner <fbl@redhat.com> |
TCP: remove TCP_DEBUG It was enabled by default and the messages guarded by the define are useful. Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ca35a0ef85e8ed6df6d5ab01fb6c3530cca0c469 |
|
24-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: md5: don't write skb head in tcp_md5_hash_header() tcp_md5_hash_header() writes a temporary zero value into the skb header, which might confuse other users of this area. Since tcphdr is small (20 bytes), copy it into a temporary variable and make the change in the copy. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cf533ea53ebfae41be15b103d78e7ebec30b9969 |
|
21-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: add const qualifiers where possible Adding const qualifiers to pointers can ease code review, and spot some bugs. It might allow the compiler to optimize code further. For example, is it legal to temporarily write a null cksum into the tcphdr in tcp_md5_hash_header()? I am afraid a sniffer could catch the temporary null value... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9e903e085262ffbf1fc44a17ac06058aca03524a |
|
18-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add skb frag size accessors To ease skb->truesize sanitization, it's better to be able to localize all references to skb frag sizes. Define accessors: skb_frag_size() to fetch the frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
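The accessor shapes, roughly as introduced (a sketch, assuming the frag size lives in a `size` member of skb_frag_t, as it did at the time):

    static inline unsigned int skb_frag_size(const skb_frag_t *frag)
    {
            return frag->size;
    }

    static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
    {
            frag->size = size;
    }

    static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
    {
            frag->size += delta;
    }

Callers then never touch the field directly, which is what makes a later truesize audit (or a field rename) tractable.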
|
b5c5693bb723a019deac3cd1345f3e7233c8a67e |
|
03-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: report ECN_SEEN in tcp_info Allows ss command (iproute2) to display "ecnseen" if at least one packet with ECT(0) or ECT(1) or ECN was received by this socket. "ecn" means ECN was negotiated at session establishment (TCP level) "ecnseen" means we received at least one packet with ECT fields set (IP level) ss -i ... ESTAB 0 0 192.168.20.110:22 192.168.20.144:38016 ino:5950 sk:f178e400 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210 rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4de075e0438ba54b8f42cbbc1263d404229dc997 |
|
27-Sep-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: rename tcp_skb_cb flags Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and maintenance. Its content is a combination of FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
765cf9976e937f1cfe9159bf4534967c8bf8eb6d |
|
12-Sep-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: md5: remove one indirection level in tcp_md5sig_pool tcp_md5sig_pool is currently an 'array' (a percpu object) of pointers to struct tcp_md5sig_pool. Only the pointers are NUMA aware, but objects themselves are all allocated on a single node. Remove this extra indirection to get proper percpu memory (NUMA aware) and make code simpler. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aff65da0f1be5daec44231972b6b5fc45bfa7a58 |
|
23-Aug-2011 |
Ian Campbell <Ian.Campbell@citrix.com> |
net: ipv4: convert to SKB frag APIs Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
f03d78db65085609938fdb686238867e65003181 |
|
07-Jul-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: refine {udp|tcp|sctp}_mem limits Current tcp/udp/sctp global memory limits do not take hugepage allocations into account, and allow 50% of RAM to be used by buffers of a single protocol [ not counting space used by sockets / inodes ...] Let's use nr_free_buffer_pages() and allow a default of 1/8 of kernel RAM per protocol, and a minimum of 128 pages. Sysadmins of heavy-duty machines probably need to tweak the limits anyway. References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032 Reported-by: starlight <starlight@binnacle.cx> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
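A sketch of the sizing rule described above, as it would look in tcp_init(); the 1/8 and 128-page figures are from the commit, while the exact pressure/hard ratios are illustrative:

    unsigned long limit = nr_free_buffer_pages() / 8;  /* 1/8 of kernel RAM */
    limit = max(limit, 128UL);                         /* floor of 128 pages */
    sysctl_tcp_mem[0] = limit / 4 * 3;                 /* pressure starts below the limit */
    sysctl_tcp_mem[1] = limit;
    sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;         /* hard limit */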
|
c6e1a0d12ca7b4f22c58e55a16beacfb7d3d8462 |
|
05-Apr-2011 |
Tom Herbert <therbert@google.com> |
net: Allow no-cache copy from user on transmit This patch uses __copy_from_user_nocache on transmit to bypass the data cache for a performance improvement. skb_add_data_nocache and skb_copy_to_page_nocache can be called by sendmsg functions to use this feature; initial support is in tcp_sendmsg. This functionality is configurable per device using ethtool. Presumably, this feature would only be useful when the driver does not touch the data. The feature is turned on by default if a device indicates that it does some form of checksum offload; it is off by default for devices that do no checksum offload or indicate no checksum is necessary. For the former case copy-checksum is probably done anyway, in the latter case the device is likely loopback in which case the no-cache copy is probably not beneficial. This patch was tested using 200 instances of netperf TCP_RR with 1400 byte requests and one byte replies. Platform is a 16 core AMD x86. No-cache copy disabled: 672703 tps, 97.13% utilization 50/90/99% latency: 244.31 484.205 1028.41 No-cache copy enabled: 702113 tps, 96.16% utilization, 50/90/99% latency 238.56 467.56 956.955 Using 14000 byte request and response sizes demonstrates the effects more dramatically: No-cache copy disabled: 79571 tps, 34.34% utilization 50/90/95% latency 1584.46 2319.59 5001.76 No-cache copy enabled: 83856 tps, 34.81% utilization 50/90/95% latency 2508.42 2622.62 2735.88 Note especially the effect on the latency tail (95th percentile). This seems to provide a nice performance improvement and is consistent in the tests I ran. Presumably, this would provide the greatest benefits in the presence of an application workload stressing the cache and a lot of transmit data happening. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2f4e1b3970973bbb57cc3a3b9d67e67c1c648c37 |
|
09-Mar-2011 |
Mario Schuknecht <m.schuknecht@dresearch.de> |
tcp: ioctl type SIOCOUTQNSD returns amount of data not sent In contrast to SIOCOUTQ, which returns the amount of data sent but not yet acknowledged plus data not yet sent, this ioctl returns only the data not sent. For various methods of live streaming bitrate control it may be helpful to know how much data in the tcp outqueue has not been sent yet. Signed-off-by: Mario Schuknecht <m.schuknecht@dresearch.de> Signed-off-by: Steffen Sledz <sledz@dresearch.de> Signed-off-by: David S. Miller <davem@davemloft.net>
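A userspace usage sketch, assuming SIOCOUTQNSD is exported via linux/sockios.h on a kernel carrying this patch:

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/sockios.h>      /* SIOCOUTQNSD */

    int print_notsent(int fd)
    {
            int notsent = 0;
            if (ioctl(fd, SIOCOUTQNSD, &notsent) < 0)   /* bytes queued but not yet sent */
                    return -1;
            printf("not sent yet: %d bytes\n", notsent);
            return 0;
    }

A streaming sender can poll this value and lower its encode bitrate when the unsent backlog grows.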
|
089c34827e52346f0303d1e6a7b744c1f4da3095 |
|
19-Feb-2011 |
Shan Wei <shanwei@cn.fujitsu.com> |
tcp: Remove debug macro of TCP_CHECK_TIMER Now, TCP_CHECK_TIMER is not used for debugging; it does nothing. And it has been there for several years, maybe 6 years. Remove it to keep the code clearer. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
04ed3e741d0f133e02bed7fa5c98edba128f90e7 |
|
25-Jan-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurrences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c39508d6f118308355468314ff414644115a07f3 |
|
24-Nov-2010 |
David S. Miller <davem@davemloft.net> |
tcp: Make TCP_MAXSEG minimum more correct. Use TCP_MIN_MSS instead of constant 64. Reported-by: Min Zhang <mzhang@mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7a1abd08d52fdeddb3e9a5a33f2f15cc6a5674d2 |
|
11-Nov-2010 |
David S. Miller <davem@davemloft.net> |
tcp: Increase TCP_MAXSEG socket option minimum. As noted by Steve Chen, since commit f5fff5dc8a7a3f395b0525c02ba92c95d42b7390 ("tcp: advertise MSS requested by user") we can end up with a situation where tcp_select_initial_window() does a divide by a zero (or even negative) mss value. The problem is that sometimes we effectively subtract TCPOLEN_TSTAMP_ALIGNED and/or TCPOLEN_MD5SIG_ALIGNED from the mss. Fix this by increasing the minimum from 8 to 64. Reported-by: Steve Chen <schen@mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8d987e5c75107ca7515fa19e857cfa24aab6ec8f |
|
10-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: avoid limits overflow Robin Holt tried to boot a 16TB machine and found some limits were reached: sysctl_tcp_mem[2], sysctl_udp_mem[2]. We can switch the infrastructure to use "long" instead of "int", now that atomic_long_t primitives are available for free. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Robin Holt <holt@sgi.com> Reviewed-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2af6fd8b18ceed416c9dfa675287c765aabf7d43 |
|
30-Oct-2010 |
Joe Perches <joe@perches.com> |
net/ipv4/tcp.c: Update WARN uses Coalesce long formats. Align arguments. Remove KERN_<level>. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
01db403cf99f739f86903314a489fb420e0e254f |
|
28-Sep-2010 |
David S. Miller <davem@davemloft.net> |
tcp: Fix >4GB writes on 64-bit. Fixes kernel bugzilla #16603 tcp_sendmsg() truncates iov_len to an 'int', which causes a 4GB write to write zero bytes, for example. There is also the problem higher up of how verify_iovec() works. It wants to prevent the total length from looking like an error return value. However it does this using 'int', but syscalls return 'long' (and thus signed 64-bit on 64-bit machines). So it could trigger false-positives on 64-bit as written. So fix it to use 'long'. Reported-by: Olaf Bonorden <bono@onlinehome.de> Reported-by: Daniel Büse <dbuese@gmx.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a4d258036ed9b2a1811c3670c6099203a0f284a0 |
|
21-Sep-2010 |
Tom Marshall <tdm.code@gmail.com> |
tcp: Fix race in tcp_poll If a RST comes in immediately after checking sk->sk_err, tcp_poll will return POLLIN but not POLLOUT. Fix this by checking sk->sk_err at the end of tcp_poll. Additionally, ensure the correct order of operations on SMP machines with memory barriers. Signed-off-by: Tom Marshall <tdm.code@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dca43c75e7e545694a9dd6288553f55c53e2a3a3 |
|
27-Aug-2010 |
Jerry Chu <hkchu@google.com> |
tcp: Add TCP_USER_TIMEOUT socket option. This patch provides "user timeout" support as described in RFC793. The socket option is also needed for the local half of RFC5482 "TCP User Timeout Option". TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int, when > 0, to specify the maximum amount of time in ms that transmitted data may remain unacknowledged before TCP will forcefully close the corresponding connection and return ETIMEDOUT to the application. If 0 is given, TCP will continue to use the system default. Increasing the user timeouts allows a TCP connection to survive extended periods without end-to-end connectivity. Decreasing the user timeouts allows applications to "fail fast" if so desired. Otherwise it may take up to 20 minutes with the current system defaults in a normal WAN environment. The socket option can be set during any state of a TCP connection, but is only effective during the synchronized states of a connection (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK). Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will overtake keepalive to determine when to close a connection due to keepalive failure. The option does not change in any way when TCP retransmits a packet, nor when a keepalive probe will be sent. This option, like many others, will be inherited by an acceptor from its listener. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
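A minimal usage sketch; the 30-second value is just an example, and the fallback define guards against userspace headers that predate the option:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_USER_TIMEOUT
    #define TCP_USER_TIMEOUT 18     /* value from linux/tcp.h */
    #endif

    int set_user_timeout(int fd)
    {
            unsigned int ms = 30000;        /* 0 restores the system default */
            return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &ms, sizeof(ms));
    }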
|
d84ba638e4ba3c40023ff997aa5e8d3ed002af36 |
|
24-Aug-2010 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
tcp: select(writefds) don't hang up when a peer closes the connection This issue comes from the Ruby language community. The test program below hangs up only when run on Linux. % uname -mrsv Linux 2.6.26-2-486 #1 Sat Dec 26 08:37:39 UTC 2009 i686 % ruby -rsocket -ve ' BasicSocket.do_not_reverse_lookup = true serv = TCPServer.open("127.0.0.1", 0) s1 = TCPSocket.open("127.0.0.1", serv.addr[1]) s2 = serv.accept s2.close s1.write("a") rescue p $! s1.write("a") rescue p $! Thread.new { s1.write("a") }.join' ruby 1.9.3dev (2010-07-06 trunk 28554) [i686-linux] #<Errno::EPIPE: Broken pipe> [Hang Here] FreeBSD, Solaris, and Mac don't, because Ruby's write() method calls select() internally, and tcp_poll has a bug. SUS defines 'ready for writing' for select() as follows. | A descriptor shall be considered ready for writing when a call to an output | function with O_NONBLOCK clear would not block, whether or not the function | would transfer data successfully. That said, the EPIPE situation is clearly one of 'ready for writing'. We don't have a read-side issue because tcp_poll() already has read-side shutdown care. | if (sk->sk_shutdown & RCV_SHUTDOWN) | mask |= POLLIN | POLLRDNORM | POLLRDHUP; So, let's insert the same logic on the write side. - reference url http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/31065 http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/31068 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c5ed63d66f24fd4f7089b5a6e087b0ce7202aa8e |
|
26-Aug-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: fix three tcp sysctls tuning As discovered by Anton Blanchard, the current code to autotune tcp_death_row.sysctl_max_tw_buckets, sysctl_tcp_max_orphans and sysctl_max_syn_backlog makes little sense. The bigger a page is, the smaller tcp_max_orphans is: 4096 on a 512GB machine in Anton's case. (tcp_hashinfo.bhash_size * sizeof(struct inet_bind_hashbucket)) is much bigger if spinlock debugging is on. It's wrong to select bigger limits in this case (where kernel structures are also bigger). bhash_size max is 65536, and we get this value even for small machines. A better ground is to use the size of the ehash table; this also makes the code shorter and more obvious. Based on a patch from Anton, and another from David. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ad1af0fedba14f82b240a03fe20eb9b2fdbd0357 |
|
25-Aug-2010 |
David S. Miller <davem@davemloft.net> |
tcp: Combat per-cpu skew in orphan tests. As reported by Anton Blanchard when we use percpu_counter_read_positive() to make our orphan socket limit checks, the check can be off by up to num_cpus_online() * batch (which is 32 by default) which on a 128 cpu machine can be as large as the default orphan limit itself. Fix this by doing the full expensive sum check if the optimized check triggers. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
|
3c0fef0b7d36e5f8d3ea3731a8228102274e3c23 |
|
30-Jul-2010 |
Josh Hunt <johunt@akamai.com> |
net: Add getsockopt support for TCP thin-streams Initial TCP thin-stream commit did not add getsockopt support for the new socket options: TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK. This adds support for them. Signed-off-by: Josh Hunt <johunt@akamai.com> Tested-by: Andreas Petlund <apetlund@simula.no> Acked-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>
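A sketch of setting one of the thin-stream options and reading it back, which is what this commit makes possible; the fallback defines are for older userspace headers:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_THIN_LINEAR_TIMEOUTS
    #define TCP_THIN_LINEAR_TIMEOUTS 16     /* values from linux/tcp.h */
    #endif
    #ifndef TCP_THIN_DUPACK
    #define TCP_THIN_DUPACK 17
    #endif

    int enable_thin_dupack(int fd)
    {
            int one = 1, val = 0;
            socklen_t len = sizeof(val);
            if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK, &one, sizeof(one)) < 0)
                    return -1;
            /* reading the option back is the new capability */
            return getsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK, &val, &len);
    }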
|
a3bdb549e30e7a263f7a589747c40e9c50110315 |
|
29-Jul-2010 |
Dmitry Popov <dp@highloadlab.com> |
tcp: cookie transactions setsockopt memory leak There is a bug in do_tcp_setsockopt(net/ipv4/tcp.c), TCP_COOKIE_TRANSACTIONS case. In some cases (when tp->cookie_values == NULL) new tcp_cookie_values structure can be allocated (at cvp), but not bound to tp->cookie_values. So a memory leak occurs. Signed-off-by: Dmitry Popov <dp@highloadlab.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3a047bf87b1b6f69c62ab9fb28072c639cb7e2fa |
|
12-Jul-2010 |
Changli Gao <xiaosuo@gmail.com> |
rfs: call sock_rps_record_flow() in tcp_splice_read() Call sock_rps_record_flow() in tcp_splice_read(), so applications using splice(2) or sendfile(2) can utilize RFS. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- net/ipv4/tcp.c | 1 + 1 file changed, 1 insertion(+) Signed-off-by: David S. Miller <davem@davemloft.net>
|
7ba42910073f8432934d61a6c08b1023c408fb62 |
|
10-Jul-2010 |
Changli Gao <xiaosuo@gmail.com> |
inet, inet6: make tcp_sendmsg() and tcp_sendpage() go through inet_sendmsg() and inet_sendpage() A new boolean flag no_autobind is added to struct proto to avoid the autobind calls when the protocol is TCP. Then sock_rps_record_flow() is called in TCP's sendmsg() and sendpage() paths. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/inet_common.h | 4 ++++ include/net/sock.h | 1 + include/net/tcp.h | 8 ++++---- net/ipv4/af_inet.c | 15 +++++++++------ net/ipv4/tcp.c | 11 +++++------ net/ipv4/tcp_ipv4.c | 3 +++ net/ipv6/af_inet6.c | 8 ++++---- net/ipv6/tcp_ipv6.c | 3 +++ 8 files changed, 33 insertions(+), 20 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
|
4bc2f18ba4f22a90ab593c0a580fc9a19c4777b6 |
|
09-Jul-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net/ipv4: EXPORT_SYMBOL cleanups CodingStyle cleanups EXPORT_SYMBOL should immediately follow the symbol declaration. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7a9b2d59507d85569b8a456688ef40cf2ac73e48 |
|
24-Jun-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: use this_cpu_ptr() use this_cpu_ptr(p) instead of per_cpu_ptr(p, smp_processor_id()) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
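The conversion pattern, as a sketch with a hypothetical percpu object:

    struct my_stats __percpu *stats;    /* hypothetical percpu allocation */

    /* before */
    struct my_stats *p = per_cpu_ptr(stats, smp_processor_id());
    /* after: same pointer, and the arch can use a segment-relative access */
    struct my_stats *q = this_cpu_ptr(stats);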
|
565b7b2d2e632b5792879c0c9cccdd9eecd31195 |
|
25-Jun-2010 |
Konstantin Khorenko <khorenko@openvz.org> |
tcp: do not send reset to already closed sockets i've found that tcp_close() can be called for an already closed socket, but still sends reset in this case (tcp_send_active_reset()) which seems to be incorrect. Moreover, a packet with reset is sent with different source port as original port number has been already cleared on socket. Besides that incrementing stat counter for LINUX_MIB_TCPABORTONCLOSE also does not look correct in this case. Initially this issue was found on 2.6.18-x RHEL5 kernel, but the same seems to be true for the current mainstream kernel (checked on 2.6.35-rc3). Please, correct me if i missed something. How that happens: 1) the server receives a packet for socket in TCP_CLOSE_WAIT state that triggers a tcp_reset(): Call Trace: <IRQ> [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8 [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08 [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43 [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259 [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4 [<ffffffff8003843c>] ip_rcv+0x64c/0x69f [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa [<ffffffff80032eca>] process_backlog+0x90/0xec [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1 [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0 [<ffffffff8006334c>] call_softirq+0x1c/0x28 [<ffffffff80070476>] do_softirq+0x2c/0x85 [<ffffffff80070441>] do_IRQ+0x149/0x152 [<ffffffff80062665>] ret_from_intr+0x0/0xa <EOI> [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303 [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303 [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc [<ffffffff80066487>] thread_return+0x89/0x174 [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c [<ffffffff80062e39>] error_exit+0x0/0x84 tcp_rcv_state_process() ... // (sk_state == TCP_CLOSE_WAIT here) ... /* step 2: check RST bit */ if(th->rst) { tcp_reset(sk); goto discard; } ... --------------------------------- tcp_rcv_state_process tcp_reset tcp_done tcp_set_state(sk, TCP_CLOSE); inet_put_port __inet_put_port inet_sk(sk)->num = 0; sk->sk_shutdown = SHUTDOWN_MASK; 2) After that the process (socket owner) tries to write something to that socket and "inet_autobind" sets a _new_ (which differs from the original!) 
port number for the socket: Call Trace: [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f [<ffffffff80257180>] inet_csk_get_port+0x216/0x268 [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f [<ffffffff80049140>] inet_sendmsg+0x27/0x57 [<ffffffff8003a9d9>] do_sock_write+0xae/0xea [<ffffffff80226ac7>] sock_writev+0xdc/0xf6 [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe [<ffffffff8001fb49>] __pollwait+0x0/0xdd [<ffffffff8008d533>] default_wake_function+0x0/0xe [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e [<ffffffff800f0b49>] do_readv_writev+0x163/0x274 [<ffffffff80066538>] thread_return+0x13a/0x174 [<ffffffff800145d8>] tcp_poll+0x0/0x1c9 [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3 [<ffffffff800f0dd0>] sys_writev+0x49/0xe4 [<ffffffff800622dd>] tracesys+0xd5/0xe0 3) sendmsg fails at last with -EPIPE (=> 'write' returns -EPIPE in userspace): F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3 Call Trace: [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87 [<ffffffff80033300>] release_sock+0x10/0xae [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7 [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f [<ffffffff8003a9d9>] do_sock_write+0xae/0xea [<ffffffff80226ac7>] sock_writev+0xdc/0xf6 [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe [<ffffffff8001fb49>] __pollwait+0x0/0xdd [<ffffffff8008d533>] default_wake_function+0x0/0xe [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e [<ffffffff800f0b49>] do_readv_writev+0x163/0x274 [<ffffffff80066538>] thread_return+0x13a/0x174 [<ffffffff800145d8>] tcp_poll+0x0/0x1c9 [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3 [<ffffffff800f0dd0>] sys_writev+0x49/0xe4 [<ffffffff800622dd>] tracesys+0xd5/0xe0 tcp_sendmsg() ... /* Wait for a connection to finish. */ if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) { int old_state = sk->sk_state; if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) { if (f_d && (err == -EPIPE)) { printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, " "sk_err=%d, sk_shutdown=%d\n", sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state, sk->sk_err, sk->sk_shutdown); dump_stack(); } goto out_err; } } ... 4) Then the process (socket owner) understands that it's time to close that socket and does that (and thus triggers sending reset packet): Call Trace: ... 
[<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6 [<ffffffff80034698>] ip_output+0x351/0x384 [<ffffffff80251ae9>] dst_output+0x0/0xe [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2 [<ffffffff80095700>] vprintk+0x21/0x33 [<ffffffff800070f0>] check_poison_obj+0x2e/0x206 [<ffffffff80013587>] poison_obj+0x36/0x45 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d [<ffffffff80023481>] dbg_redzone1+0x1c/0x25 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8 [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786 [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d [<ffffffff80258ff1>] tcp_close+0x39a/0x960 [<ffffffff8026be12>] inet_release+0x69/0x80 [<ffffffff80059b31>] sock_release+0x4f/0xcf [<ffffffff80059d4c>] sock_close+0x2c/0x30 [<ffffffff800133c9>] __fput+0xac/0x197 [<ffffffff800252bc>] filp_close+0x59/0x61 [<ffffffff8001eff6>] sys_close+0x85/0xc7 [<ffffffff800622dd>] tracesys+0xd5/0xe0 So, in brief: * a received packet for socket in TCP_CLOSE_WAIT state triggers tcp_reset() which clears inet_sk(sk)->num and put socket into TCP_CLOSE state * an attempt to write to that socket forces inet_autobind() to get a new port (but the write itself fails with -EPIPE) * tcp_close() called for socket in TCP_CLOSE state sends an active reset via socket with newly allocated port This adds an additional check in tcp_close() for already closed sockets. We do not want to send anything to closed sockets. Signed-off-by: Konstantin Khorenko <khorenko@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a3433f35a55f7604742cae620c6dc6edfc70db6a |
|
12-Jun-2010 |
Changli Gao <xiaosuo@gmail.com> |
tcp: unify tcp flag macros Unify the tcp flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH, TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. TCPCB_FLAG_* are replaced with the corresponding TCPHDR_*. Signed-off-by: Changli Gao <xiaosuo@gmail.com> ---- include/net/tcp.h | 24 ++++++------- net/ipv4/tcp.c | 8 ++-- net/ipv4/tcp_input.c | 2 - net/ipv4/tcp_output.c | 59 ++++++++++++++++----------------- net/netfilter/nf_conntrack_proto_tcp.c | 32 ++++++----------- net/netfilter/xt_TCPMSS.c | 4 -- 6 files changed, 58 insertions(+), 71 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
|
d7fd1b5747fff3bde92777bcaf705d98ae6f8b6f |
|
17-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_md5_hash_skb_data() frag_list handling tcp_md5_hash_skb_data() should handle skb->frag_list, and eventually recurse. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ccbd6a5a4f76e821ed36f69fdaf59817c3a7f18e |
|
14-May-2010 |
Joe Perches <joe@perches.com> |
net: Remove unnecessary semicolons after switch statements Also added an explicit break; to avoid a fallthrough in net/ipv4/tcp_input.c Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
35790c0421121364883a167bab8a2e37e1f67f78 |
|
16-May-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: fix MD5 (RFC2385) support TCP MD5 support uses percpu data for temporary storage. It currently disables preemption so that the same storage cannot be reclaimed by another thread on the same cpu. We also have to make sure a softirq handler won't try to use the same context. Various bug reports demonstrated corruptions. The fix is to disable both preemption and BH. Reported-by: Bhaskar Dutta <bhaskie@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6c37e5de456987f5bc80879afde05aa120784095 |
|
26-Apr-2010 |
Flavio Leitner <fleitner@redhat.com> |
TCP: avoid sending keepalive probes if receiving data RFC 1122 says the following: ... Keep-alive packets MUST only be sent when no data or acknowledgement packets have been received for the connection within an interval. ... The acknowledgement packet resets the keepalive timer but the data packet doesn't. This patch fixes it by also checking the timestamp of the last received data packet when the keepalive timer expires. Signed-off-by: Flavio Leitner <fleitner@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0eae88f31ca2b88911ce843452054139e028771f |
|
21-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: Fix various endianness glitches Sparse can help us find endianness bugs, but we need to make some cleanups to be able to more easily spot real bugs. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa395145165cb06a0d0885221bbe0ce4a564391d |
|
20-Apr-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: sk_sleep() helper Define a new function to return the waitqueue of a "struct sock". static inline wait_queue_head_t *sk_sleep(struct sock *sk) { return sk->sk_sleep; } Change all read occurrences of sk_sleep to a call to this function. Needed for a future RCU conversion; sk_sleep won't be a field directly available. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
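The mechanical conversion this enables, e.g. in a wakeup path (a sketch of the before/after shape):

    /* before: direct field access */
    if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
            wake_up_interruptible(sk->sk_sleep);
    /* after: reads go through the helper, ready for the RCU conversion */
    if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
            wake_up_interruptible(sk_sleep(sk));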
|
baff42ab1494528907bf4d5870359e31711746ae |
|
30-Mar-2010 |
Steven J. Magnani <steve@digidescorp.com> |
net: Fix oops from tcp_collapse() when using splice() tcp_read_sock() can have its recv_actor eat skbs without immediately advancing copied_seq. This can cause a panic in tcp_collapse() if it is called as a result of the recv_actor dropping the socket lock. A userspace program that splices data from a socket to either another socket or to a file can trigger this bug. Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5a0e3ad6af8660be21ca98a971cd00f331318c05 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. 
Given that I had only a couple of failures from the tests in step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers, which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
73852e8151b7d7a529fbe019ab6d2d0c02d8f3f2 |
|
16-Mar-2010 |
Steven J. Magnani <steve@digidescorp.com> |
NET_DMA: free skbs periodically Under NET_DMA, data transfer can grind to a halt when userland issues a large read on a socket with a high RCVLOWAT (i.e., 512 KB for both). This appears to be because the NET_DMA design queues up lots of memcpy operations, but doesn't issue or wait for them (and thus free the associated skbs) until it is time for tcp_recvmsg() to return. The socket hangs when its TCP window goes to zero before enough data is available to satisfy the read. Periodically issue asynchronous memcpy operations, and free skbs for ones that have completed, to prevent sockets from going into zero-window mode. Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b634f87522dff87712df8bda2a6c9061954d552a |
|
19-Mar-2010 |
Alexandra Kossovsky <Alexandra.Kossovsky@oktetlabs.ru> |
tcp: Fix OOB POLLIN avoidance. From: Alexandra.Kossovsky@oktetlabs.ru Fixes kernel bugzilla #15541 Signed-off-by: David S. Miller <davem@davemloft.net>
|
7e38017557bc0b87434d184f8804cadb102bb903 |
|
18-Feb-2010 |
Andreas Petlund <apetlund@simula.no> |
net: TCP thin dupack This patch enables fast retransmissions after one dupACK for TCP if the stream is identified as thin. This will reduce latencies for thin streams that are not able to trigger fast retransmissions due to high packet interarrival time. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin. Signed-off-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>
|
36e31b0af58728071e8023cf8e20c5166b700717 |
|
18-Feb-2010 |
Andreas Petlund <apetlund@simula.no> |
net: TCP thin linear timeouts This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin. A maximum of 6 linear timeouts is tried before exponential backoff is resumed. Signed-off-by: Andreas Petlund <apetlund@simula.no> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7d720c3e4f0c4fc152a6bf17e24244a3c85412d2 |
|
16-Feb-2010 |
Tejun Heo <tj@kernel.org> |
percpu: add __percpu sparse annotations to net Add __percpu sparse annotations to net. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. The macro and type tricks around snmp stats make things a bit interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All snmp_mib_*() users which used to cast the argument to (void **) are updated to cast it to (void __percpu **). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
def87cf42069a6d4fd42a2ede8f19c620a292568 |
|
10-Dec-2009 |
Krishna Kumar <krkumar2@in.ibm.com> |
tcp: Slightly optimize tcp_sendmsg Slightly optimize tcp_sendmsg since NETIF_F_SG is used many times iteratively in the loop. The only other modification is to change: } else if (i == MAX_SKB_FRAGS || (!i && !(sk->sk_route_caps & NETIF_F_SG))) { to: } else if (i == MAX_SKB_FRAGS || !sg) { The reason why this change is correct: this code (other than the MAX_SKB_FRAGS case) executes only due to the else part of: "if (skb_tailroom(skb) > 0) {" - i.e. there was no space in the skb to put the data inline. Hence SG being false is a sufficient condition, and there is no way a fragment can be added to the skb. Changelog: - Added the above explanation for the change Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
afeca340c078e17ca233b3c68c3c3a70c56bfe1d |
|
10-Dec-2009 |
Krishna Kumar <krkumar2@in.ibm.com> |
tcp: Remove unrequired operations in tcp_push() Changelog: Removed a temporary skb variable from tcp_push() Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3dc789320e1b310cb505dcd94512c279abcd5e1c |
|
09-Dec-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Remove runtime check that can never be true. GCC even warns about it, as reported by Andrew Morton: net/ipv4/tcp.c: In function 'do_tcp_getsockopt': net/ipv4/tcp.c:2544: warning: comparison is always false due to limited range of data type Signed-off-by: David S. Miller <davem@davemloft.net>
|
e56fb50f2b7958b931c8a2fc0966061b3f3c8f3a |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1e: implement socket option TCP_COOKIE_TRANSACTIONS Provide per socket control of the TCP cookie option and SYN/SYNACK data. This is a straightforward re-implementation of an earlier (year-old) patch that no longer applies cleanly, with permission of the original author (Adam Langley): http://thread.gmane.org/gmane.linux.network/102586 The principal difference is using a TCP option to carry the cookie nonce, instead of a user configured offset in the data. Allocations have been rearranged to avoid requiring GFP_ATOMIC. Requires: net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS TCPCT part 1d: define TCP cookie option, extend existing struct's Signed-off-by: William.Allen.Simpson@gmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
|
da5c78c82629a167794436e4306b4cf1faddea90 |
|
02-Dec-2009 |
William Allen Simpson <william.allen.simpson@gmail.com> |
TCPCT part 1b: generate Responder Cookie secret Define (missing) hash message size for SHA1. Define hashing size constants specific to TCP cookies. Add new function: tcp_cookie_generator(). Maintain global secret values for tcp_cookie_generator(). This is a significantly revised implementation of earlier (15-year-old) Photuris [RFC-2522] code for the KA9Q cooperative multitasking platform. Linux RCU technique appears to be well-suited to this application, though neither of the circular queue items are freed. These functions will also be used in subsequent patches that implement additional features. Signed-off-by: William.Allen.Simpson@gmail.com Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1fdf475aa141a669af8db6ccc7015f0b725087de |
|
30-Nov-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: tcp_disconnect() should clear window_clamp NFS can reuse its TCP socket after calling tcp_disconnect(). We noticed window scaling was not negotiated in SYN packet of next connection request. Fix is to clear tp->window_clamp in tcp_disconnect(). Reported-by: Krzysztof Oledzki <ole@ans.pl> Tested-by: Krzysztof Oledzki <ole@ans.pl> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d792c1006fe92448217b71513d3955868358271d |
|
13-Nov-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: provide more information on the tcp receive_queue bugs The addition of rcv_nxt allows us to discern whether it was the skb that was out of place or tp->copied. Also catch fancy combinations of flags if necessary (sadly we might miss the actual causer's flags as it might have already returned). Btw, we perhaps would want to forward copied_seq somewhere, or otherwise we might have some nice loop with WARN stuff within, but where to do that safely I don't know at this stage until more is known (it is not made significantly worse by this patch). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c62f4c453ab4b0240ab857bfd089da2c01ad91e7 |
|
23-Oct-2009 |
Arjan van de Ven <arjan@linux.intel.com> |
net: use WARN() for the WARN_ON in commit b6b39e8f3fbbb Commit b6b39e8f3fbbb (tcp: Try to catch MSG_PEEK bug) added a printk() to the WARN_ON() that's in tcp.c. This patch changes this combination to WARN(); the advantage of WARN() is that the printk message shows up inside the message, so that kerneloops.org will collect the message. In addition, this gets rid of an extra if() statement. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b6b39e8f3fbbb31001b836afec87bcaf4811a7bf |
|
19-Oct-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Try to catch MSG_PEEK bug This patch tries to print out more information when we hit the MSG_PEEK bug in tcp_recvmsg. It's been around since at least 2005 and it's about time that we finally fix it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b103cf34382f26ff48a87931b83f13b177b47c1a |
|
19-Oct-2009 |
Julian Anastasov <ja@ssi.bg> |
tcp: fix TCP_DEFER_ACCEPT retrans calculation Fix TCP_DEFER_ACCEPT conversion between seconds and retransmission to match the TCP SYN-ACK retransmission periods because the time is converted to such retransmissions. The old algorithm selects one more retransmission in some cases. Allow up to 255 retransmissions. Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c720c7e8383aff1cb219bddf474ed89d850336e3 |
|
15-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
inet: rename some inet_sock fields In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. The goal is to transfer fields used at lookup time into the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by the rx path). This patch adds the inet_ prefix to the daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f373b53b5fe67aa4a6f28f921a529cc90f88e79b |
|
09-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: replace ehash_size by ehash_mask Storing the mask (size - 1) instead of the size allows the fast path to be a bit faster. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
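Why storing the mask helps on the fast path (a sketch; assumes a power-of-two table, as the ehash is):

    /* before: size stored, mask recomputed on every lookup */
    bucket = hash & (tcp_hashinfo.ehash_size - 1);
    /* after: mask stored directly, one load and one AND */
    bucket = hash & tcp_hashinfo.ehash_mask;

Note the flip side that the crash fix earlier in this log had to deal with: iterations over the table must now run to ehash_mask inclusive (i.e., ehash_mask + 1 buckets).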
|
42324c62704365d6a3e89138dea55909d2f26afe |
|
02-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: splice() from tcp to pipe should take into account O_NONBLOCK tcp_splice_read() doesn't take into account the socket's O_NONBLOCK flag. Before this patch: splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); causes a random endless block (if pipe is full) and splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK); will return 0 immediately if the TCP buffer is empty. User applications have no way to instruct splice() that the socket should be in blocking mode but the pipe in nonblock mode. Many projects cannot use splice(tcp -> pipe) because of this flaw. http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html Linus introduced SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d (splice: add SPLICE_F_NONBLOCK flag) It doesn't make the splice itself necessarily nonblocking (because the actual file descriptors that are spliced from/to may block unless they have the O_NONBLOCK flag set), but it makes the splice pipe operations nonblocking. Linus' intention was clear: let SPLICE_F_NONBLOCK control the splice pipe mode only. This patch instructs tcp_splice_read() to use the underlying file's O_NONBLOCK flag, as other socket operations do. Users will then call: splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); to block on data coming from the socket (if the file is in blocking mode), and not block on pipe output (to avoid deadlock). The first version of this patch was submitted by Octavian Purdila. Reported-by: Volker Lendecke <vl@samba.org> Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4fdb78d3093a347456e108b77d56d493d29071b2 |
|
02-Oct-2009 |
Andrew Morton <akpm@linux-foundation.org> |
net/ipv4/tcp.c: fix min() type mismatch warning net/ipv4/tcp.c: In function 'do_tcp_setsockopt': net/ipv4/tcp.c:2050: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b7058842c940ad2c08dd829b21e5c92ebe3b8758 |
|
01-Oct-2009 |
David S. Miller <davem@davemloft.net> |
net: Make setsockopt() optlen be unsigned. This provides safety against negative optlen at the type level instead of depending upon (sometimes non-trivial) checks against this sprinkled all over the place, in each and every implementation. Based upon work done by Arjan van de Ven and feedback from Linus Torvalds. Signed-off-by: David S. Miller <davem@davemloft.net>
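A sketch of the class of bug the unsigned type forecloses (a hypothetical per-option handler, not code from this patch):

    #include <string.h>

    int do_opt(const char *optval, int optlen)      /* signed length, pre-patch style */
    {
            char tmp[64];
            if (optlen > (int)sizeof(tmp))          /* -1 passes this upper-bound check... */
                    return -1;
            memcpy(tmp, optval, optlen);            /* ...then wraps to SIZE_MAX here */
            return 0;
    }

With "unsigned int optlen", a userspace -1 arrives as UINT_MAX and the same bound check rejects it, so every implementation is safe by construction rather than by remembered checks.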
|
4481374ce88ba8f460c8b89f2572027bd27057d0 |
|
22-Sep-2009 |
Jan Beulich <JBeulich@novell.com> |
mm: replace various uses of num_physpages by totalram_pages Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
0b6a05c1dbebe8c616e2e5b0f52b7a01fd792911 |
|
15-Sep-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix ssthresh u16 leftover Once upon a time snd_ssthresh was a 16-bit quantity. ...That has not been true for a long period of time. I ran across some ancient compares which still seem to trust such legacy. Put all that magic into a single place; I hopefully found all of them. Compile tested, though linking of allyesconfig is ridiculous nowadays it seems. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa1330766c49199bdab4d4a9096d98b072df9044 |
|
03-Sep-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
tcp: replace hard coded GFP_KERNEL with sk_allocation This fixed a lockdep warning which appeared when doing stress memory tests over NFS: inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock mount_root => nfs_root_data => tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim David raised a concern that if the allocation fails in tcp_send_fin(), and it's GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting for the allocation to succeed. But fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks weird, but it is no worse than the implicit sleep inside GFP_KERNEL. Both could loop endlessly under memory pressure. CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> CC: David S. Miller <davem@davemloft.net> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
df19a6267705456f463871ae2aabc44299909d2a |
|
29-Aug-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
tcp: keepalive cleanups Introduce keepalive_probes(tp) helper, and use it, like keepalive_time_when(tp) and keepalive_intvl_when(tp) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
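The new helper presumably mirrors its siblings' shape (a sketch; the sysctl fallback is inferred from the pattern of keepalive_time_when()/keepalive_intvl_when()):

    static inline int keepalive_probes(const struct tcp_sock *tp)
    {
            /* per-socket setting if present, else the system-wide sysctl */
            return tp->keepalive_probes ? : sysctl_tcp_keepalive_probes;
    }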
|
a57de0b4336e48db2811a2030bb68dba8dd09d88 |
|
08-Jul-2009 |
Jiri Olsa <jolsa@redhat.com> |
net: adding memory barrier to the poll and receive callbacks Adding a memory barrier after the poll_wait function, paired with the receive callbacks. Adding functions sock_poll_wait and sk_has_sleeper to wrap the memory barrier. Without the memory barrier, the following race can happen. The race fires when the following code paths meet, and the tp->rcv_nxt and __add_wait_queue updates stay in CPU caches. CPU1 CPU2 sys_select receive packet ... ... __add_wait_queue update tp->rcv_nxt ... ... tp->rcv_nxt check sock_def_readable ... { schedule ... if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible(sk->sk_sleep) ... } If there was no cache the code would work ok, since the wait_queue and rcv_nxt are opposite to each other. Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already passed the tp->rcv_nxt check and sleeps, or will get the new value for tp->rcv_nxt and will return with a new data mask. In both cases the process (CPU1) is being added to the wait queue, so the waitqueue_active (CPU2) call cannot miss and will wake up CPU1. The bad case is when the __add_wait_queue changes done by CPU1 stay in its cache, and so does the tp->rcv_nxt update on CPU2's side. CPU1 will then end up calling schedule and sleep forever if there is no more data on the socket. Calls to poll_wait in the following modules were omitted: net/bluetooth/af_bluetooth.c net/irda/af_irda.c net/irda/irnet/irnet_ppp.c net/mac80211/rc80211_pid_debugfs.c net/phonet/socket.c net/rds/af_rds.c net/rfkill/core.c net/sunrpc/cache.c net/sunrpc/rpc_pipe.c net/tipc/socket.c Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
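The shape of the wrapper the commit names (a sketch close to the described sk_has_sleeper(); the barrier orders the sk_sleep read against the caller's preceding socket-state write, pairing with the barrier hidden behind sock_poll_wait()):

    static inline int sk_has_sleeper(struct sock *sk)
    {
            /* full barrier: the rcv_nxt (or similar state) write above must be
             * visible before we read sk_sleep, pairing with sock_poll_wait() */
            smp_mb();
            return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
    }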
|
6828b92bd21acd65113dfe0541f19f5df0d9668f |
|
28-Jun-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Do not tack on TSO data to non-TSO packet If a socket starts out on a non-TSO route, and then switches to a TSO route, then we will tack on data to the tail of the tx queue even if it started out life as non-TSO. This is suboptimal because all of it will then be copied and checksummed unnecessarily. This patch fixes this by ensuring that skb->ip_summed is set to CHECKSUM_PARTIAL before appending extra data beyond the MSS. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
915219441d566f1da0caa0e262be49b666159e17 |
|
29-May-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Use SKB queue and list helpers instead of doing it by-hand. Signed-off-by: David S. Miller <davem@davemloft.net>
|
a2a804cddfe65f18f903985e8a8d04c7c9eec354 |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Do not check flush when comparing options for GRO There is no need to repeatedly check flush when comparing TCP options for GRO as it will be false 99% of the time where it matters. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a5b1cf288d4200506ab62fbb86cc81ace948a306 |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Avoid unnecessary comparison after skb_gro_header For the overwhelming majority of cases, skb_gro_header's return value cannot be NULL. Yet we must check it because of its current form. This patch splits it up into multiple functions in order to avoid this. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
30a3ae30c775e2723f86ef70746ad3cb4404a4c9 |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Optimise len/mss comparison Instead of checking len > mss || len == 0, we can accomplish both by checking (len - 1) >= mss using the unsigned wraparound. At nearly a million times a second, this might just help. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
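The identity at work, checkable in userspace: for unsigned len, (len - 1) >= mss is equivalent to len == 0 || len > mss, since len == 0 wraps to UINT_MAX (a standalone sketch, not the kernel code):

    #include <assert.h>

    static int needs_flush(unsigned int len, unsigned int mss)
    {
            return (len - 1) >= mss;        /* len == 0 wraps to UINT_MAX */
    }

    int main(void)
    {
            assert(needs_flush(0, 1448));           /* zero length */
            assert(needs_flush(1449, 1448));        /* larger than mss */
            assert(!needs_flush(1448, 1448));       /* exactly mss: no flush */
            return 0;
    }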
|
4a9a2968a17eae42ef5dffca8b37534c864e30cc |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Remove unnecessary window comparisons for GRO The window has already been checked as part of the flag word so there is no need to check it explicitly. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
745898eaf0eb7a04a56dec1188d9148259510863 |
|
26-May-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Optimise GRO port comparisons Instead of doing two 16-bit operations for the source/destination ports, we can do one 32-bit operation to take care of both. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
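A hedged sketch of the combined compare; it relies on source and dest being adjacent __be16 fields in struct tcphdr, so one aligned 32-bit load covers both ports (the label is hypothetical):

    struct tcphdr *th = tcp_hdr(skb);
    struct tcphdr *th2 = tcp_hdr(p);

    /* XOR of the combined word is nonzero iff either port differs. */
    if (*(u32 *)&th->source ^ *(u32 *)&th2->source)
            goto not_same_flow;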
|
775273131810caa41dfc7f9e552ea5d8508caf40 |
|
10-May-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: fix MSG_PEEK race check Commit 518a09ef11 (tcp: Fix recvmsg MSG_PEEK influence of blocking behavior) lets the loop run longer than the race check previously expected, so we need to be more careful with this check and consider the work we have been doing. I tried my best to deal with the urg hole madness too, which happens here: if (!sock_flag(sk, SOCK_URGINLINE)) { ++*seq; ... by using an additional offset of one, but I certainly have very little interest in testing that part. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Tested-by: Frans Pop <elendil@planet.nl> Tested-by: Ian Zimmermann <itz@buug.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a0a69a0106dab8d20596f97f6674bed3b394d1ee |
|
17-Apr-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Fix use after free in tcp_gro_receive After calling skb_gro_receive skb->len can no longer be relied on since if the skb was merged using frags, then its pages will have been removed and the length reduced. This caused tcp_gro_receive to prematurely end merging which resulted in suboptimal performance with ixgbe. The fix is to store skb->len on the stack. Reported-by: Mark Wagner <mwagner@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
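A hedged sketch of the fix's shape (simplified; in the real GRO path the length comes from skb_gro_len() and the surrounding merge logic is larger):

    /* skb_gro_receive() may merge skb's pages into the held packet and
     * shrink skb->len, so capture the length on the stack first. */
    unsigned int len = skb->len;

    if (skb_gro_receive(&head, skb))
            goto out;  /* hypothetical error path */

    /* Decide on further merging from the pre-merge length. */
    flush = len > mss;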
|
377f0a08e4cb56658d878d22c3aed4716e283c6b |
|
31-Mar-2009 |
Rami Rosen <ramirose@gmail.com> |
ipv4: remove unused parameter from tcp_recv_urg(). Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
beedad923ad6237f03265fdf86eb8a1b50d14ae9 |
|
19-Mar-2009 |
Rami Rosen <ramirose@gmail.com> |
tcp: remove parameter from tcp_recv_urg(). This patch removes an unused parameter (addr_len) from tcp_recv_urg() method in net/ipv4/tcp.c. Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
afece1c6587010cc81d1a43045c855774e8234a3 |
|
14-Mar-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: make sure xmit goal size never becomes zero It's not too likely to happen; it would basically require crafted packets (they must hit the max guard in tcp_bound_to_half_wnd()). It seems that nothing that bad would happen, as there are tcp_mems and the congestion window that prevent a runaway at some point from hurting all too much (I'm not that sure what all those zero-sized segments we would generate would do in the write queue, though). Preventing it regardless is certainly the best way to go. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2a3a041c4e2c1685e668b280c121a5a40a029a03 |
|
14-Mar-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: cache result of earlier divides when mss-aligning things The result is very unlikely to change often, so we hardly need to divide again after doing that once for a connection. Yet, if a divide still becomes necessary, we detect that and do the right thing and again settle for the non-divide state. Takes the u16 space which was previously taken by the plain xmit_size_goal. This should take care of part of the tso vs non-tso difference we found earlier. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0c54b85f2828128274f319a1eb3ce7f604fe2a53 |
|
14-Mar-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: simplify tcp_current_mss There's very little need for most of the callsites to get tp->xmit_size_goal updated. That will cost us a divide as is, so slice the function in two. Also, the only users of tp->xmit_size_goal are directly behind tcp_current_mss(), so there's no need to store that variable in tcp_sock at all! The drop of xmit_size_goal currently leaves a 16-bit hole and some reorganization would again be necessary to change that (but I'm aiming to fill that hole with u16 xmit_size_goal_segs to cache the results of the remaining divide to get that tso regression fixed). Bring the xmit_size_goal parts into tcp.c. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Evgeniy Polyakov <zbr@ioremap.net> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0d6a775e27d975e5f9ea8e2911216d84face50ca |
|
28-Feb-2009 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: in sendmsg/pages open code the real goto target copied was assigned zero right before the goto, so if (copied) cannot ever be true. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa6320d336971171df1d13c1c284facf10804881 |
|
08-Feb-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Optimise TCP packet reception As this function can be called more than half a million times for 10GbE, it's important to optimise it as much as we can. This patch converts logical ops to bit ops, as well as open-coding memcmp, to exploit alignment properties. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
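A hedged sketch of the open-coded memcmp (illustrative; TCP headers are 32-bit aligned, so the option area can be compared one word at a time, accumulating any mismatch with branch-free bit ops):

    unsigned int thlen = th->doff * 4;
    unsigned int i;
    int flush = 0;

    /* OR accumulates a nonzero XOR from any differing word, with no
     * branch per comparison. */
    for (i = sizeof(*th); i < thlen; i += 4)
            flush |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);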
|
86911732d3996a9da07914b280621450111bb6da |
|
29-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Avoid copying headers of unmerged packets Unfortunately simplicity isn't always the best. The fraginfo interface turned out to be suboptimal. The problem was quite obvious. For every packet, we have to copy the headers from the frags structure into skb->head, even though for 99% of the packets this part is immediately thrown away after the merge. LRO didn't have this problem because it directly read the headers from the frags structure. This patch attempts to address this by creating an interface that allows GRO to access the headers in the first frag without having to copy it. Because all drivers that use frags place the headers in the first frag this optimisation should be enough. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9fa5fdf291c9b58b1cb8b4bb2a0ee57efa21d635 |
|
27-Jan-2009 |
Dimitris Michailidis <dm@chelsio.com> |
tcp: Fix length tcp_splice_data_recv passes to skb_splice_bits. tcp_splice_data_recv has two lengths to consider: the len parameter it gets from tcp_read_sock, which specifies the amount of data in the skb, and rd_desc->count, which is the amount of data the splice caller still wants. Currently it passes just the latter to skb_splice_bits, which then splices min(rd_desc->count, skb->len - offset) bytes. Most of the time this is fine, except when the skb contains urgent data. In that case len goes only up to the urgent byte and is less than skb->len - offset. By ignoring len tcp_splice_data_recv may a) splice data tcp_read_sock told it not to, b) return to tcp_read_sock a value > len. Now, tcp_read_sock doesn't handle used > len and leaves the socket in a bad state (both sk_receive_queue and copied_seq are bad at that point) resulting in duplicated data and corruption. Fix by passing min(rd_desc->count, len) to skb_splice_bits. Signed-off-by: Dimitris Michailidis <dm@chelsio.com> Acked-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
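The fix amounts to clamping the splice length to both limits; roughly (simplified from net/ipv4/tcp.c):

    static int tcp_splice_data_recv(read_descriptor_t *rd_desc,
                                    struct sk_buff *skb,
                                    unsigned int offset, size_t len)
    {
            struct tcp_splice_state *tss = rd_desc->arg.data;
            int ret;

            /* Splice no more than the skb holds at this offset (len)
             * and no more than the caller still wants (rd_desc->count). */
            ret = skb_splice_bits(skb, offset, tss->pipe,
                                  min(rd_desc->count, len), tss->flags);
            if (ret > 0)
                    rd_desc->count -= ret;
            return ret;
    }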
|
4e704ee3c2cd38748ca59d835435d6a7e7f6f613 |
|
15-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gso: Ensure that the packet is long enough When we get a GSO packet from an untrusted source, we need to ensure that it is sufficiently long so that we don't end up crashing. Based on discovery and patch by Ian Campbell. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
33966dd0e2f68f26943cd9ee93ec6abbc6547a8e |
|
14-Jan-2009 |
Willy Tarreau <w@1wt.eu> |
tcp: splice as many packets as possible at once As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not optimal. It processes at most one segment per call. This results in low performance and very high overhead due to syscall rate when splicing from interfaces which do not support LRO. Willy provided a patch inside tcp_splice_read(), but a better fix is to let tcp_read_sock() process as many segments as possible, so that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less often. With this change, splice() behaves like tcp_recvmsg(), being able to consume many skbs in one system call. With typical 1460 bytes of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return 16*1460 = 23360 bytes. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
684f2176015b313ab59cecf574117969cf638f28 |
|
08-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp6: Add GRO support This patch adds GRO support for TCP over IPv6. The code is exactly the same as the IPv4 version except for the pseudo-header checksum computation. Note that I've removed the unused tcphdr argument from tcp_v6_check rather than invent a bogus value for GRO. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f67b45999205164958de4ec0658d51fa4bee066d |
|
06-Jan-2009 |
Dan Williams <dan.j.williams@intel.com> |
net_dma: convert to dma_find_channel Use the general-purpose channel allocation provided by dmaengine. Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
6f49a57aa5a0c6d4e4e27c85f7af6c83325a12d1 |
|
06-Jan-2009 |
Dan Williams <dan.j.williams@intel.com> |
dmaengine: up-level reference counting to the module level Simply, if a client wants any dmaengine channel then prevent all dmaengine modules from being removed. Once the clients are done, re-enable module removal. Why? Beyond reducing complication:
1/ Tracking reference counts per-transaction in an efficient manner, as is currently done, requires a complicated scheme to avoid cache-line bouncing effects.
2/ Per-transaction ref-counting gives the false impression that a dma-driver can be gracefully removed ahead of its user (net, md, or dma-slave)
3/ None of the in-tree dma-drivers talk to hot-pluggable hardware, but if such an engine were built one day we still would not need to notify clients of remove events. The driver can simply return NULL to a ->prep() request, something that is much easier for a client to handle.
Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
7945cc6464a4db0caf6dfacdfe05806051c4cb7b |
|
05-Jan-2009 |
David S. Miller <davem@davemloft.net> |
tcp: Kill extraneous SPLICE_F_NONBLOCK checks. In splice TCP receive, the SPLICE_F_NONBLOCK flag is used to compute the "timeo" value. So checking it again inside of the main receive loop to trigger -EAGAIN processing is entirely unnecessary. Noticed by Jarek P. and Lennert Buytenhek. Signed-off-by: David S. Miller <davem@davemloft.net>
|
4f7d54f59bc470f0aaa932f747a95232d7ebf8b1 |
|
05-Jan-2009 |
Lennert Buytenhek <buytenh@marvell.com> |
tcp: don't mask EOF and socket errors on nonblocking splice receive Currently, setting SPLICE_F_NONBLOCK on splice from a TCP socket results in masking of EOF (RDHUP) and error conditions on the socket by an -EAGAIN return. Move the NONBLOCK check in tcp_splice_read() to be after the EOF and error checks to fix this. Signed-off-by: Lennert Buytenhek <buytenh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b530256d2e0f1a75fab31f9821129fff1bb49faa |
|
05-Jan-2009 |
Herbert Xu <herbert@gondor.apana.org.au> |
gro: Use gso_size to store MSS In order to allow GRO packets without frag_list at all, we need to store the MSS in the packet itself. The obvious place is gso_size. The only thing to watch out for is if the packet ends up not being GRO then we need to clear gso_size before pushing the packet into the stack. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
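A hedged sketch of the bookkeeping this implies in the GRO receive path (names such as 'merged' are approximations, not the kernel's):

    /* While the packet sits on the GRO list, gso_size holds the MSS. */
    skb_shinfo(skb)->gso_size = mss;

    /* If the skb is later handed to the stack without having been
     * merged, the borrowed field must be cleared so the skb is not
     * mistaken for a real GSO packet. */
    if (!merged)
            skb_shinfo(skb)->gso_size = 0;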
|
eb4dea5853046727bfbb579f0c9a8cae7369f7c6 |
|
30-Dec-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
net: Fix percpu counters deadlock When we converted the protocol atomic counters such as the orphan count and the total socket count deadlocks were introduced due to the mismatch in BH status of the spots that used the percpu counter operations. Based on the diagnosis and patch by Peter Zijlstra, this patch fixes these issues by disabling BH where we may be in process context. Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bf296b125b21b8d558ceb6ec30bb4eba2730cd6b |
|
16-Dec-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
tcp: Add GRO support This patch adds the TCP-specific portion of GRO. The criterion for merging is extremely strict (the TCP header must match exactly apart from the checksum) so as to allow refragmentation. Otherwise this is pretty much identical to LRO, except that we support the merging of ECN packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dd24c00191d5e4a1ae896aafe33c6b8095ab4bd1 |
|
26-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: Use a percpu_counter for orphan_count Instead of using one atomic_t per protocol, use a percpu_counter for "orphan_count", to reduce cache line contention on heavy duty network servers. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1748376b6626acf59c24e9592ac67b3fe2a0e026 |
|
26-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: Use a percpu_counter for sockets_allocated Instead of using one atomic_t per protocol, use a percpu_counter for "sockets_allocated", to reduce cache line contention on heavy duty network servers. Note: we revert commit 248969ae31e1b3276fc4399d67ce29a5d81e6fd9 (net: af_unix can make unix_nr_socks visbile in /proc), since it is no longer used after the sock_prot_inuse_add() addition. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3ab5aee7fe840b5b1b35a8d1ac11c3de5281e611 |
|
17-Nov-2008 |
Eric Dumazet <dada1@cosmosbay.com> |
net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls RCU was added to UDP lookups, using a fast infrastructure: - sockets kmem_cache use SLAB_DESTROY_BY_RCU and don't pay the price of call_rcu() at freeing time. - hlist_nulls permits using few memory barriers. This patch uses the same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks, will even speed up this case. __inet_lookup_established() is pretty fast now that we don't have to dirty a contended cache line (read_lock/read_unlock). Only the established and timewait hashtables are converted to RCU (the bind table and listen table are still using traditional locking). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
518a09ef11f8454f4676125d47c3e775b300c6a5 |
|
05-Nov-2008 |
David S. Miller <davem@davemloft.net> |
tcp: Fix recvmsg MSG_PEEK influence of blocking behavior. Vito Caputo noticed that tcp_recvmsg() returns immediately from partial reads when MSG_PEEK is used. In particular, this means that SO_RCVLOWAT is not respected. Simply remove the test. And this matches the behavior of several other systems, including BSD. Signed-off-by: David S. Miller <davem@davemloft.net>
|
5a5f3a8db9d70c90e9d55b46e02b2d8deb1c2c2e |
|
03-Nov-2008 |
Jianjun Kong <jianjun@zeuux.org> |
net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c xfrm4_policy.c Signed-off-by: Jianjun Kong <jianjun@zeuux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
33f5f57eeb0c6386fdd85f9c690dc8d700ba7928 |
|
07-Oct-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
tcp: kill pointless urg_mode It all started from me noticing that this urgent check in tcp_clean_rtx_queue is unnecessarily inside the loop. Then I took a longer look to it and found out that the users of urg_mode can trivially do without, well almost, there was one gotcha. Bonus: those funny people who use urg with >= 2^31 write_seq - snd_una could now rejoice too (that's the only purpose for the between being there, otherwise a simple compare would have done the thing). Not that I assume that the rest of the tcp code happily lives with such mind-boggling numbers :-). Alas, it turned out to be impossible to set wmem to such numbers anyway, yes I really tried a big sendfile after setting some wmem but nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf... So I hacked a bit variable to long and found out that it seems to work... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c57943a1c96214ee68f3890bb6772841ffbfd606 |
|
07-Oct-2008 |
Peter Zijlstra <a.p.zijlstra@chello.nl> |
net: wrap sk->sk_backlog_rcv() Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the generic sk_backlog_rcv behaviour. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
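The wrapper itself is tiny; roughly as added to include/net/sock.h:

    static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
    {
            /* Single choke point for backlog processing: generic
             * behaviour can now be layered here rather than at every
             * call site that used to invoke the pointer directly. */
            return sk->sk_backlog_rcv(sk, skb);
    }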
|
c7004482e8dcb7c3c72666395cfa98a216a4fb70 |
|
06-Oct-2008 |
David S. Miller <davem@davemloft.net> |
tcp: Respect SO_RCVLOWAT in tcp_poll(). Based upon a report by Vito Caputo. Signed-off-by: David S. Miller <davem@davemloft.net>
|
547b792cac0a038b9dbf958d3c120df3740b5572 |
|
26-Jul-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
net: convert BUG_TRAP to generic WARN_ON Removes a legacy reinvent-the-wheel type thing. The generic machinery integrates much better with automated debugging aids such as kerneloops.org (and others), and is unambiguous due to better naming. Non-intuitively, BUG_TRAP() is actually equal to WARN_ON() rather than BUG_ON(), though some might actually be promoted to BUG_ON(); I left that for the future. I could make at least one BUILD_BUG_ON conversion. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
49a72dfb8814c2d65bd9f8c9c6daf6395a1ec58d |
|
19-Jul-2008 |
Adam Langley <agl@imperialviolet.org> |
tcp: Fix MD5 signatures for non-linear skbs Currently, the MD5 code assumes that the SKBs are linear and, in the case that they aren't, happily goes off and hashes off the end of the SKB and into random memory. Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy Polyakov. Also includes a couple of missed route_caps from Stephen's patch in [2]. [1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2 [2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2 Signed-off-by: Adam Langley <agl@imperialviolet.org> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
57ef42d59d1c1d79be59fc3c6380ae14234e38c3 |
|
18-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: put tcp statistics on struct net Proc temporarily uses stats from init_net. BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ed88098e25d77bef3b2ad8c9d8e2ebf454d9ccbf |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_ADD_STATS_USER Done with NET_XXX_STATS macros :) To be continued... Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6f67c817fcfd94f5ca0f14b114b7fa25c0210c8b |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS_USER Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
de0744af1fe2d0a3d428f6af0f2fe1f6179b1a9c |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS_BH Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4e6734447dbc7a0a85e09616821c0782d9fb1141 |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to NET_INC_STATS Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5c52ba170f8167511bdb65b981f4582100c40675 |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
sock: add net to prot->enter_memory_pressure callback The tcp_enter_memory_pressure calls NET_INC_STATS, but has nowhere to get the net from. I decided to add an sk argument, not the net itself, only to factor all the required sock_net(sk) calls inside the enter_memory_pressure callback itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
74688e487a407a33d42879957b478601aca616b8 |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to TCP_DEC_STATS Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
63231bddf6514778792d3784f63822473d250fc0 |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to TCP_INC_STATS_BH Same as before - the sock is always there to get the net from, but there are also some places with the net already saved on the stack. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
81cc8a75d944fa39fc333c2c329c8e8b3c62cada |
|
17-Jul-2008 |
Pavel Emelyanov <xemul@openvz.org> |
mib: add net to TCP_INC_STATS Fortunately (almost) all the TCP code has a sock to get the net from :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
70efce27fc3d54271519244dc5e47da4ed711dd4 |
|
17-Jul-2008 |
Will Newton <will.newton@gmail.com> |
net/ipv4/tcp.c: Fix use of PULLHUP instead of POLLHUP in comments. Change PULLHUP to POLLHUP in tcp_poll comments and clean up another comment for grammar and coding style. Signed-off-by: Will Newton <will.newton@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
374e7b59498ce0785b3727794b351221528a5159 |
|
03-Jul-2008 |
Octavian Purdila <opurdila@ixiacom.com> |
tcp: fix a size_t < 0 comparison in tcp_read_sock <used> should be of type int (not size_t) since recv_actor can return negative values and it is also used in a < 0 comparison. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
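A standalone demo of the pitfall (illustrative only):

    #include <stdio.h>

    /* A recv_actor-style callback may return a negative error code. */
    static int recv_actor(void)
    {
            return -12; /* e.g. -ENOMEM */
    }

    int main(void)
    {
            size_t used_bad = recv_actor(); /* wraps to a huge value */
            int    used_ok  = recv_actor();

            if (used_bad < 0)               /* always false: unsigned */
                    puts("never printed");
            if (used_ok < 0)
                    puts("error detected"); /* this branch runs */
            return 0;
    }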
|
81b23b4a7acd9b37a269c62d02479d4f645dd20a |
|
03-Jul-2008 |
Andrew Morton <akpm@linux-foundation.org> |
tcp: net/ipv4/tcp.c needs linux/scatterlist.h alpha: net/ipv4/tcp.c: In function 'tcp_calc_md5_hash': net/ipv4/tcp.c:2479: error: implicit declaration of function 'sg_init_table' net/ipv4/tcp.c:2482: error: implicit declaration of function 'sg_set_buf' net/ipv4/tcp.c:2507: error: implicit declaration of function 'sg_mark_end' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
57413ebc4e0f1e471a3b4db4aff9a85c083d090e |
|
28-Jun-2008 |
Miquel van Smoorenburg <miquels@cistron.nl> |
tcp: calculate tcp_mem based on low memory instead of all memory The tcp_mem array, which contains limits on the total amount of memory used by TCP sockets, is calculated based on nr_all_pages. On a 32-bit x86 system, we should base this on the number of lowmem pages. Signed-off-by: Miquel van Smoorenburg <miquels@cistron.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ec0a196626bd12e0ba108d7daa6d95a4fb25c2c5 |
|
13-Jun-2008 |
David S. Miller <davem@davemloft.net> |
tcp: Revert 'process defer accept as established' changes. This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 ("tcp: Fix slab corruption with ipv6 and tcp6fuzz"). This change causes several problems, first reported by Ingo Molnar as a distcc-over-loopback regression where connections were getting stuck. Ilpo Järvinen first spotted the locking problems. The new function added by this code, tcp_defer_accept_check(), only has the child socket locked, yet it is modifying state of the parent listening socket. Fixing that is non-trivial at best, because we can't simply just grab the parent listening socket lock at this point, because it would create an ABBA deadlock. The normal ordering is parent listening socket --> child socket, but this code path would require the reverse lock ordering. Next is a problem noticed by Vitaliy Gusev, he noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
>		goto death;
>	}
>
>+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+		tcp_send_active_reset(sk, GFP_ATOMIC);
>+		goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done() will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should release this sk) as socket is not DEAD. Therefore socket sk will be lost for freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any real value or advantage to these new semantics even if we fix all of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only is the only thing which is impossible with old approach. Is this really so valuable? My opinion: no, this is nothing but a new loophole to consume memory without control.
----------------------------------------

So revert this thing for now. Signed-off-by: David S. Miller <davem@davemloft.net>
|
0b040829952d84bf2a62526f0e24b624e0699447 |
|
11-Jun-2008 |
Adrian Bunk <bunk@kernel.org> |
net: remove CVS keywords This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8d26d76dd4a4c87ef037a44a42a0608ffc730199 |
|
17-Apr-2008 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
tcp md5sig: Share most of hash calculation bits between IPv4 and IPv6. We can share most of the hash calculation code because the only difference between IPv4 and IPv6 is their pseudo headers. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
|
293ad60401da621b8b329abbe8c388edb25f658a |
|
05-Jun-2008 |
Octavian Purdila <opurdila@ixiacom.com> |
tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits. skb_splice_bits temporarily drops the socket lock while iterating over the socket queue in order to break a reverse locking condition which happens with sendfile. This, however, opens a window of opportunity for tcp_collapse() to aggregate skbs and thus potentially free the current skb used in skb_splice_bits and tcp_read_sock. This patch fixes the problem by (re-)getting the same "logical skb" after the lock has been temporarily dropped. Based on idea and initial patch from Evgeniy Polyakov. Signed-off-by: Octavian Purdila <opurdila@ixiacom.com> Acked-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1f29b0584dcb695589407a11bb7533fe21fa47c4 |
|
21-Apr-2008 |
Satoru SATOH <satoru.satoh@gmail.com> |
tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c This is a trivial fix to correct a function name in a comment in net/ipv4/tcp.c. Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
69d1506731168d6845a76a303b2c45f7c05f3f2c |
|
22-Mar-2008 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Let skbs grow over a page on fast peers While testing the virtio-net driver on KVM with TSO I noticed that TSO performance with a 1500 MTU is significantly worse compared to the performance of non-TSO with a 16436 MTU. The packet dump shows that most of the packets sent are smaller than a page. Looking at the code this actually is quite obvious, as it always stops extending the packet if it's the first packet yet to be sent and if it's larger than the MSS. Since each extension is bound by the page size, this means that (given a 1500 MTU) we're very unlikely to construct packets greater than a page, provided that the receiver and the path is fast enough so that packets can always be sent immediately. The fix is also quite obvious. The push calls inside the loop are just an optimisation so that we don't end up doing all the sending at the end of the loop. Therefore there is no specific reason why it has to do so at MSS boundaries. For TSO, the most natural extension of this optimisation is to do the pushing once the skb exceeds the TSO size goal. This is what the patch does and testing with KVM shows that the TSO performance with a 1500 MTU easily surpasses that of a 16436 MTU and indeed the packet sizes sent are generally larger than 16436. I don't see any obvious downsides for slower peers or connections, but it would be prudent to test this extensively to ensure that those cases don't regress. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 |
|
22-Mar-2008 |
Patrick McManus <mcmanus@ducksong.com> |
[TCP]: TCP_DEFER_ACCEPT updates - process as established Change the TCP_DEFER_ACCEPT implementation so that it transitions a connection to ESTABLISHED after the handshake is complete instead of leaving it in SYN-RECV until some data arrives. Place the connection in the accept queue when the first data packet arrives from the slow path. Benefits: - established connection is now reset if it never makes it to the accept queue - diagnostic state of established matches with the packet traces showing completed handshake - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be enforced with reasonable accuracy instead of rounding up to the next exponential back-off of syn-ack retry. Signed-off-by: Patrick McManus <mcmanus@ducksong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ab1e0a13d70299e792fd0527cefd070c1405fa5b |
|
03-Feb-2008 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SOCK] proto: Add hashinfo member to struct proto This way we can remove TCP and DCCP specific versions of sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port sk->sk_prot->hash: inet_hash is directly used, only v6 need a specific version to deal with mapped sockets sk->sk_prot->unhash: both v4 and v6 use inet_hash directly struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so that inet_csk_get_port can find the per family routine. Now only the lookup routines receive as a parameter a struct inet_hashtable. With this we further reuse code, reducing the difference among INET transport protocols. Eventually work has to be done on UDP and SCTP to make them share this infrastructure and get as a bonus inet_diag interfaces so that iproute can be used with these protocols. net-2.6/net/ipv4/inet_hashtables.c: struct proto | +8 struct inet_connection_sock_af_ops | +8 2 structs changed __inet_hash_nolisten | +18 __inet_hash | -210 inet_put_port | +8 inet_bind_bucket_create | +1 __inet_hash_connect | -8 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191 net-2.6/net/core/sock.c: proto_seq_show | +3 1 function changed, 3 bytes added, diff: +3 net-2.6/net/ipv4/inet_connection_sock.c: inet_csk_get_port | +15 1 function changed, 15 bytes added, diff: +15 net-2.6/net/ipv4/tcp.c: tcp_set_state | -7 1 function changed, 7 bytes removed, diff: -7 net-2.6/net/ipv4/tcp_ipv4.c: tcp_v4_get_port | -31 tcp_v4_hash | -48 tcp_v4_destroy_sock | -7 tcp_v4_syn_recv_sock | -2 tcp_unhash | -179 5 functions changed, 267 bytes removed, diff: -267 net-2.6/net/ipv6/inet6_hashtables.c: __inet6_hash | +8 1 function changed, 8 bytes added, diff: +8 net-2.6/net/ipv4/inet_hashtables.c: inet_unhash | +190 inet_hash | +242 2 functions changed, 432 bytes added, diff: +432 vmlinux: 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7 /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c: tcp_v6_get_port | -31 tcp_v6_hash | -7 tcp_v6_syn_recv_sock | -9 3 functions changed, 47 bytes removed, diff: -47 /home/acme/git/net-2.6/net/dccp/proto.c: dccp_destroy_sock | -7 dccp_unhash | -179 dccp_hash | -49 dccp_set_state | -7 dccp_done | +1 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241 /home/acme/git/net-2.6/net/dccp/ipv4.c: dccp_v4_get_port | -31 dccp_v4_request_recv_sock | -2 2 functions changed, 33 bytes removed, diff: -33 /home/acme/git/net-2.6/net/dccp/ipv6.c: dccp_v6_get_port | -31 dccp_v6_hash | -7 dccp_v6_request_recv_sock | +5 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
490d5046930276aae50dd16942649bfc626056f7 |
|
12-Jan-2008 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Uninline tcp_set_state net/ipv4/tcp.c: tcp_close_state | -226 tcp_done | -145 tcp_close | -564 tcp_disconnect | -141 4 functions changed, 1076 bytes removed, diff: -1076 net/ipv4/tcp_input.c: tcp_fin | -86 tcp_rcv_state_process | -164 2 functions changed, 250 bytes removed, diff: -250 net/ipv4/tcp_ipv4.c: tcp_v4_connect | -209 1 function changed, 209 bytes removed, diff: -209 net/ipv4/arp.c: arp_ignore | +5 1 function changed, 5 bytes added, diff: +5 net/ipv6/tcp_ipv6.c: tcp_v6_connect | -158 1 function changed, 158 bytes removed, diff: -158 net/sunrpc/xprtsock.c: xs_sendpages | -2 1 function changed, 2 bytes removed, diff: -2 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | +7 1 function changed, 7 bytes added, diff: +7 net/ipv4/tcp.c: tcp_set_state | +238 1 function changed, 238 bytes added, diff: +238 built-in.o: 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 I've no explanation why some unrelated changes seem to occur consistently as well (arp_ignore, ccid3_update_send_interval; I checked the arp_ignore asm and it seems to be due to some reordering of operations causing some extra opcodes to be generated). Still, the benefits are pretty obvious from the codiff's results. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4828e7f49a402930e8b3e72de695c8d37e0f98ee |
|
31-Dec-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Remove TCPCB_URG & TCPCB_AT_TAIL as unnecessary The snd_up check should be enough. I suspect this has been there to provide a minor optimization in clean_rtx_queue which used to have a small if (!->sacked) block which could skip snd_up check among the other work. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3ab224be6d69de912ee21302745ea45a99274dbc |
|
31-Dec-2007 |
Hideo Aoki <haoki@redhat.com> |
[NET] CORE: Introducing new memory accounting interface. This patch introduces new memory accounting functions for each network protocol. Most of them are renamed from the memory accounting functions for stream protocols. At the same time, some stream memory accounting functions are removed since other functions do the same thing. Renaming:
  sk_stream_free_skb()      -> sk_wmem_free_skb()
  __sk_stream_mem_reclaim() -> __sk_mem_reclaim()
  sk_stream_mem_reclaim()   -> sk_mem_reclaim()
  sk_stream_mem_schedule()  -> __sk_mem_schedule()
  sk_stream_pages()         -> sk_mem_pages()
  sk_stream_rmem_schedule() -> sk_rmem_schedule()
  sk_stream_wmem_schedule() -> sk_wmem_schedule()
  sk_charge_skb()           -> sk_mem_charge()
Removing:
  sk_stream_rfree(): consolidated into sock_rfree()
  sk_stream_set_owner_r(): consolidated into skb_set_owner_r()
  sk_stream_mem_schedule()
The following functions are added:
  sk_has_account(): check if the protocol supports accounting
  sk_mem_uncharge(): do the opposite of sk_mem_charge()
In addition, to achieve consolidation, updating sk_wmem_queued is removed from sk_mem_charge(). Next, to consolidate memory accounting functions, this patch adds memory accounting calls to network core functions. Moreover, the present memory accounting calls are renamed to the new accounting calls. Finally we replace the present memory accounting calls with the new interface in TCP and SCTP. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1f9e636ea21bd648a5cbd2f744a1d39a5e183b20 |
|
11-Dec-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[TCP]: Use BUILD_BUG_ON for tcp_skb_cb size checking The sizeof(struct tcp_skb_cb) should not be less than the sizeof(skb->cb). This is checked in net/ipv4/tcp.c, but this check can be made more gracefully. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
df97c708d5e6eebdd9ded1fa588eae09acf53793 |
|
29-Nov-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Eliminate unused argument from sk_stream_alloc_pskb The 3rd argument is always zero (according to grep :) Eliminate it and merge the function with sk_stream_alloc_skb. This saves 44 more bytes, and together with the previous patch we have: add/remove: 1/0 grow/shrink: 0/8 up/down: 183/-751 (-568) function old new delta sk_stream_alloc_skb - 183 +183 ip_rt_init 529 525 -4 arp_ignore 112 107 -5 __inet_lookup_listener 284 274 -10 tcp_sendmsg 2583 2481 -102 tcp_sendpage 1449 1300 -149 tso_fragment 417 258 -159 tcp_fragment 1149 988 -161 __tcp_push_pending_frames 1998 1837 -161 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f561d0f27d6283c49359bb96048f8ac3728c812c |
|
29-Nov-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Uninline the sk_stream_alloc_pskb This function seems too big for inlining. Indeed, it saves half-a-kilo when uninlined: add/remove: 1/0 grow/shrink: 0/7 up/down: 195/-719 (-524) function old new delta sk_stream_alloc_pskb - 195 +195 ip_rt_init 529 525 -4 __inet_lookup_listener 284 274 -10 tcp_sendmsg 2583 2486 -97 tcp_sendpage 1449 1305 -144 tso_fragment 417 267 -150 tcp_fragment 1149 992 -157 __tcp_push_pending_frames 1998 1841 -157 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6ff7751d06f63131830102ffa0c9a38b313f100e |
|
07-Nov-2007 |
Adrian Bunk <bunk@kernel.org> |
[TCP]: Make tcp_splice_data_recv() static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9c55e01c0cc835818475a6ce8c4d684df9949ac8 |
|
07-Nov-2007 |
Jens Axboe <jens.axboe@oracle.com> |
[TCP]: Splice receive support. Support for network splice receive. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
230140cffa7feae90ad50bf259db1fa07674f3a7 |
|
07-Nov-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[INET]: Remove per bucket rwlock in tcp/dccp ehash table. As done two years ago on the IP route cache table (commit 22c047ccbc68fa8f3fa57f0e8f906479a062c426), we can avoid using one lock per hash bucket for the huge TCP/DCCP hash tables. On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for little performance difference. (we hit a different cache line for the rwlock, but then the bucket cache line has a better sharing factor among cpus, since we dirty it less often). For netstat or ss commands that want a full scan of the hash table, we perform fewer memory accesses. Using a 'small' table of hashed rwlocks should be more than enough to provide correct SMP concurrency between different buckets, without using too much memory. Sizing of this table depends on num_possible_cpus() and various CONFIG settings. This patch provides some locking abstraction that may ease future work using a different model for the TCP/DCCP table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0ccfe61803ad24f1c0fe5e1f5ce840ff0f3d9660 |
|
30-Oct-2007 |
Jean Delvare <jdelvare@suse.de> |
[TCP]: Saner thash_entries default with much memory. On systems with a very large amount of memory, the heuristics in alloc_large_system_hash() result in a very large TCP established hash table: 16 million entries for a 128 GB ia64 system. This makes reading from /proc/net/tcp pretty slow (well over a second) and as a result netstat is slow on these machines. I know that /proc/net/tcp is deprecated in favor of tcp_diag, however at the moment netstat only knows of the former. I am skeptical that such a large TCP established hash is often needed. Just because a system has a lot of memory doesn't imply that it will have several million concurrent TCP connections. Thus I believe that we should put an arbitrary high limit on the size of the TCP established hash by default. Users who really need a bigger hash can always use the thash_entries boot parameter to get more. I propose 2 million entries as the arbitrary high limit. This makes /proc/net/tcp reasonably fast on the system in question (0.2 s) while still being large enough for me to be confident that network performance won't suffer. This is just one way to limit the hash size; there are others, and I am not familiar enough with the TCP code to decide which is best. Thus, I would welcome the proposals of alternatives. [ 2 million is still too large, thus I've modified the limit in the change to be '512 * 1024'. -DaveM ] Signed-off-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ba25f9dcc4ea6e30839fcab5a5516f2176d5bfed |
|
19-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
Use helpers to obtain task pid in printks The task_struct->pid member is going to be deprecated, so start using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in the kernel. The first thing to start with is the pid, printed to dmesg - in this case we may safely use task_pid_nr(). Besides, printks produce more (much more) than a half of all the explicit pid usage. [akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
5ee3afba88f5a79d0bff07ddd87af45919259f91 |
|
18-Sep-2007 |
Rick Jones <rick.jones2@hp.com> |
[TCP]: Return useful listenq info in tcp_info and INET_DIAG_INFO. Return some useful information such as the maximum listen backlog and the current listen backlog in the tcp_info structure and INET_DIAG_INFO. Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
172589ccdde41b59861c92c4a971b95514ef24e3 |
|
29-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[NET]: DIV_ROUND_UP cleanup (part two) Hopefully captured all single statement cases under net/. I'm not too sure if there is some policy about #includes that are "guaranteed" (ie., in the current tree) to be available through some other #included header, so I just added linux/kernel.h to each changed file that didn't #include it previously. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
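For reference, the helper being rolled out (as defined in include/linux/kernel.h) and the kind of open-coded expression it replaces (the PAGE_SIZE example is illustrative):

    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    /* before: */ npages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    /* after:  */ npages = DIV_ROUND_UP(size, PAGE_SIZE);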
|
e60402d0a909ca2e6e2fbdf9ed004ef0fae36d33 |
|
09-Aug-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Move sack_ok access to obviously named funcs & cleanup Previously the code had IsReno/IsFack defined as macros that were local to tcp_input.c, though the sack_ok field has users elsewhere too for the same purpose. This changes them to static inlines, as preferred by the current coding style, and unifies the access to sack_ok across multiple files. Magic bitops of sack_ok for FACK and DSACK are also abstracted into functions with appropriate names. Note:
- One sack_ok = 1 remains, but that's self-explanatory, i.e., it enables sack
- A couple of !IsReno cases are changed to tcp_is_sack
- There were no users for IsDSack => I dropped it
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3516ffb0fef710749daf288c0fe146503e0cf9d4 |
|
03-Aug-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Invoke tcp_sendmsg() directly, do not use inet_sendmsg(). As discovered by Evgeniy Polyakov, if we try to sendmsg after a connection reset, we can do incredibly stupid things. The core issue is that inet_sendmsg() tries to autobind the socket, but we should never do that for TCP. Instead we should just go straight into TCP's sendmsg() code which will do all of the necessary state and pending socket error checks. TCP's sendpage already directly vectors to tcp_sendpage(), so this merely brings sendmsg() in line with that. Signed-off-by: David S. Miller <davem@davemloft.net>
|
20c2df83d25c6a95affe6157a4c9cac4cf5ffaac |
|
20-Jul-2007 |
Paul Mundt <lethal@linux-sh.org> |
mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
|
e00c5d8b4d800b95b72b3f072e1d55d7c7034702 |
|
08-Mar-2007 |
Andrew Morton <akpm@osdl.org> |
I/OAT: warning fix net/ipv4/tcp.c: In function 'tcp_recvmsg': net/ipv4/tcp.c:1111: warning: unused variable 'available' Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Leech <christopher.leech@intel.com>
|
2b1244a43be97f504494b557a7f7a65fe0d00dba |
|
08-Mar-2007 |
Chris Leech <christopher.leech@intel.com> |
I/OAT: Only offload copies for TCP when there will be a context switch The performance wins come with having the DMA copy engine doing the copies in parallel with the context switch. If there is enough data ready on the socket at recv time just use a regular copy. Signed-off-by: Chris Leech <christopher.leech@intel.com>
|
ddb61a57bb6df673986e6476407f97d28b02031f |
|
24-Jun-2007 |
Jens Axboe <jens.axboe@oracle.com> |
[TCP] tcp_read_sock: Allow recv_actor() to return a negative error value. tcp_read_sock() currently assumes that the recv_actor() only returns the number of bytes copied. For network splice receive, we may have to return an error in some cases. So allow the actor to return a negative error value. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3f196eb519a419bf83ecc22753943fd0a0de4f8f |
|
01-Jun-2007 |
Mark Glines <mark@glines.org> |
[TCP]: Use default 32768-61000 outgoing port range in all cases. This diff changes the default port range used for outgoing connections, from "use 32768-61000 in most cases, but use N-4999 on small boxes (where N is a multiple of 1024, depending on just *how* small the box is)" to just "use 32768-61000 in all cases". I don't believe there are any drawbacks to this change, and it keeps outgoing connection ports farther away from the mess of IANA-registered ports. Signed-off-by: Mark Glines <mark@glines.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e4fd5da39f99d5921dda1fe3d93652fbd925fbfd |
|
29-May-2007 |
Pavel Emelianov <xemul@openvz.org> |
[TCP]: Consolidate checking for tcp orphan count being too big. tcp_out_of_resources() and tcp_close() perform the same checking of number of orphan sockets. Move this code into common place. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
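A hedged sketch of what the consolidated check might look like (names and exact thresholds are approximations, not the verbatim helper):

    /* Too many orphans, or orphans holding real send buffer while TCP
     * is over its hard memory limit: time to give up on the socket. */
    static inline int tcp_too_many_orphans(struct sock *sk, int num)
    {
            return num > sysctl_tcp_max_orphans ||
                   (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
                    atomic_read(&tcp_memory_allocated) > sysctl_tcp_mem[2]);
    }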
|
e63340ae6b6205fef26b40a75673d1c9c0c8bb90 |
|
08-May-2007 |
Randy Dunlap <randy.dunlap@oracle.com> |
header cleaning: don't include smp_lock.h when not used Remove includes of <linux/smp_lock.h> where it is not used/needed. Suggested by Al Viro. Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc, sparc64, and arm (all 59 defconfigs). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
b40b4f79ce789e9e28d382c85006f62be2725282 |
|
04-May-2007 |
Srinivas Aji <Aji_Srinivas@emc.com> |
[TCP]: zero out rx_opt in tcp_disconnect() When the server drops its connection, NFS client reconnects using the same socket after disconnecting. If the new connection's SYN,ACK doesn't contain the TCP timestamp option and the old connection's did, tp->tcp_header_len is recomputed assuming no timestamp header but tp->rx_opt.tstamp_ok remains set. Then tcp_build_and_update_options() adds in a timestamp option past the end of the allocated TCP header, overwriting TCP data, or when the data is in skb_shinfo(skb)->frags[], overwriting skb_shinfo(skb) causing a crash soon after. (The issue was debugged from such a crash.) Similarly, wscale_ok and sack_ok also get set based on the SYN,ACK packet but not reset on disconnect, since they are zeroed out at initialization. The patch zeroes out the entire tp->rx_opt struct in tcp_disconnect() to avoid this sort of problem. Signed-off-by: Srinivas Aji <Aji_Srinivas@emc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
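The fix itself is essentially a one-liner in tcp_disconnect(), per the description above; roughly:

    /* Forget everything negotiated on the previous incarnation of the
     * connection (tstamp_ok, wscale_ok, sack_ok, ...), so the next
     * SYN,ACK determines the options from scratch. */
    memset(&tp->rx_opt, 0, sizeof(tp->rx_opt));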
|
65bb723c9502b7ba0a3aad13bdac8832e213ba74 |
|
29-Apr-2007 |
Gerrit Renker <gerrit@erg.abdn.ac.uk> |
[TCP]: Update references in two old comments This updates references to drafts in comments which must be about 10 years old. Internet draft draft-ietf-tcpimpl-prob-03.txt expired in 1998 and was replaced by RFC 2525 in March 1999. Section 3.10 of the draft maps almost identically into section 2.17 of RFC 2525: both are entitled "Failure to RST on close with data pending", the differences in text body amount to a typo and minor sentence change. Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9e412ba7632f71259a53085665d4983b78257b7c |
|
21-Apr-2007 |
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> |
[TCP]: Sed magic converts func(sk, tp, ...) -> func(sk, ...) This is (mostly) automated change using magic: sed -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N' -e '/struct sock \*sk/ N' -e 's|struct sock \*sk,[\n\t ]*struct tcp_sock \*tp\([^{]*\n{\n\)| struct sock \*sk\1\tstruct tcp_sock *tp = tcp_sk(sk);\n|g' -e 's|struct sock \*sk, struct tcp_sock \*tp| struct sock \*sk|g' -e 's|sk, tp\([^-]\)|sk\1|g' Fixed four unused variable (tp) warnings that were introduced. In addition, manually added newlines after local variables and tweaked function arguments positioning. $ gcc --version gcc (GCC) 4.1.1 20060525 (Red Hat 4.1.1-1) ... $ codiff -fV built-in.o.old built-in.o.new net/ipv4/route.c: rt_cache_flush | +14 1 function changed, 14 bytes added net/ipv4/tcp.c: tcp_setsockopt | -5 tcp_sendpage | -25 tcp_sendmsg | -16 3 functions changed, 46 bytes removed net/ipv4/tcp_input.c: tcp_try_undo_recovery | +3 tcp_try_undo_dsack | +2 tcp_mark_head_lost | -12 tcp_ack | -15 tcp_event_data_recv | -32 tcp_rcv_state_process | -10 tcp_rcv_established | +1 7 functions changed, 6 bytes added, 69 bytes removed, diff: -63 net/ipv4/tcp_output.c: update_send_head | -9 tcp_transmit_skb | +19 tcp_cwnd_validate | +1 tcp_write_wakeup | -17 __tcp_push_pending_frames | -25 tcp_push_one | -8 tcp_send_fin | -4 7 functions changed, 20 bytes added, 63 bytes removed, diff: -43 built-in.o.new: 18 functions changed, 40 bytes added, 178 bytes removed, diff: -138 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4ac02bab77438b484a5cf855a002fb6a1d592894 |
|
21-Apr-2007 |
Andi Kleen <ak@suse.de> |
[TCP]: Uninline tcp_done(). The function is quite big and has several call sites and nothing to collapse by compiler optimization on inlining. Besides it's nicer to read in a .c file. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3ff50b7997fe06cd5d276b229967bb52d6b3b6c1 |
|
21-Apr-2007 |
Stephen Hemminger <shemminger@linux-foundation.org> |
[NET]: cleanup extra semicolons Spring cleaning time... There seems to be a lot of places in the network code that have extra bogus semicolons after conditionals. Most commonly is a bogus semicolon after: switch() { } Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
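A small hypothetical illustration of the pattern being removed:

    switch (sk->sk_state) {
    case TCP_CLOSE:
            break;
    default:
            break;
    };  /* <- bogus extra semicolon after the switch body */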
|
4103f8cd5c1f260d674a7b426ed221812de54d47 |
|
27-Mar-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[TCP]: tcp_memory_pressure and tcp_socket are __read_mostly candidates tcp_memory_pressure and tcp_socket currently share a cache line with tcp_memory_allocated, tcp_sockets_allocated. (Very hot cache line) It makes sense to declare these variables as __read_mostly, to avoid false sharing on SMP.

    ffffffff8081d9c0 B tcp_orphan_count
    ffffffff8081d9c4 B tcp_memory_allocated
    ffffffff8081d9c8 B tcp_sockets_allocated
    ffffffff8081d9cc B tcp_memory_pressure
    ffffffff8081d9d0 b tcp_md5sig_users
    ffffffff8081d9d8 b tcp_md5sig_pool
    ffffffff8081d9e0 b warntime.31570
    ffffffff8081d9e8 b tcp_socket

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
27a884dc3cb63b93c2b3b643f5b31eed5f8a4d26 |
|
20-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Convert skb->tail to sk_buff_data_t So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes on 64bit architectures, allowing us to combine the 4 bytes hole left by the layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4 64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN... :-) Many calculations that previously required that skb->{transport,network, mac}_header be first converted to a pointer now can be done directly, being meaningful as offsets or pointers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9c70220b73908f64792422a2c39c593c4792f2c5 |
|
26-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_transport_header(skb) For the places where we need a pointer to the transport header, it is still legal to touch skb->h.raw directly if just adding to, subtracting from or setting it to another layer header. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa8223c7bb0b05183e1737881ed21827aa5b9e73 |
|
11-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
fe067e8ab5e0dc5ca3c54634924c628da92090b4 |
|
07-Mar-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Abstract out all write queue operations. This allows the write queue implementation to be changed, for example, to one which allows fast interval searching. Signed-off-by: David S. Miller <davem@davemloft.net>
|
53cdcc04c1e85d4e423b2822b66149b6f2e52c2c |
|
16-Mar-2007 |
John Heffner <jheffner@psc.edu> |
[TCP]: Fix tcp_mem[] initialization. Change tcp_mem initialization function. The fraction of total memory is now a continuous function of memory size, and independent of page size. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2c4f6219aca5939b57596278ea8b014275d4917b |
|
21-Feb-2007 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Fix MD5 signature pool locking. The locking calls assumed that these code paths were only invoked in software interrupt context, but that isn't true. Therefore we need to use spin_{lock,unlock}_bh() throughout. Signed-off-by: David S. Miller <davem@davemloft.net>
|
e905a9edab7f4f14f9213b52234e4a346c690911 |
|
09-Feb-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] IPV4: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dbca9b2750e3b1ee6f56a616160ccfc12e8b161f |
|
08-Feb-2007 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: change layout of ehash table The ehash table layout is currently this one: the first half of the table is used by sockets not in TIME_WAIT state, the second half by sockets in TIME_WAIT state. This is non-optimal because, for a given hash or socket, the two chain heads are located in separate cache lines. Moreover, the locks of the second half are never used. If instead of this halving we use two list heads in inet_ehash_bucket, we can probably avoid one cache miss and reduce RAM usage, particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_LOCK_ALLOC settings). So we still halve the table, but we keep related chains together to speed up lookups and socket state changes. In this patch I did not try to align struct inet_ehash_bucket, but a future patch could try to make this structure have a convenient size (a power of two or a multiple of L1_CACHE_SIZE). I guess rwlock will just vanish as soon as RCU is plugged into ehash :), so maybe we don't need to scratch our heads to align the bucket... Note: in case struct inet_ehash_bucket is not a power of two, we could probably change alloc_large_system_hash() (in case it uses __get_free_pages()) to free the unused space. It currently allocates a big zone, but the last quarter of it could be freed. Again, this should be a temporary 'problem'. Patch tested on IPv4 TCP only, but should be OK for IPv6 and DCCP. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
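The resulting bucket, sketched from the description above (the second list head holding the TIME_WAIT sockets):

    struct inet_ehash_bucket {
        rwlock_t          lock;
        struct hlist_head chain;    /* established sockets for this hash */
        struct hlist_head twchain;  /* TIME_WAIT sockets, same hash value */
    };

Both chains for a given hash value now share one bucket (and one lock), so a lookup that misses the established chain finds the TIME_WAIT chain close by instead of in a distant half of the table.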
|
6931ba7cef3991fbb970997d33e24139ccdc3c2c |
|
14-Dec-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Fix oops caused by __tcp_put_md5sig_pool() It should call tcp_free_md5sig_pool() not __tcp_free_md5sig_pool() so that it does proper refcounting. Signed-off-by: David S. Miller <davem@davemloft.net>
|
f5b99bcdddfb2338227faad3489c24907f37ee8e |
|
01-Dec-2006 |
Adrian Bunk <bunk@stusta.de> |
[NET]: Possible cleanups. This patch contains the following possible cleanups: - make the following needlessly global functions static: - ipv4/tcp.c: __tcp_alloc_md5sig_pool() - ipv4/tcp_ipv4.c: tcp_v4_reqsk_md5_lookup() - ipv4/udplite.c: udplite_rcv() - ipv4/udplite.c: udplite_err() - make the following needlessly global structs static: - ipv4/tcp_ipv4.c: tcp_request_sock_ipv4_ops - ipv4/tcp_ipv4.c: tcp_sock_ipv4_specific - ipv6/tcp_ipv6.c: tcp_request_sock_ipv6_ops - net/ipv{4,6}/udplite.c: remove inlines from static functions (gcc should know best when to inline them) Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
352d48008b6f3e02d8ce77868432e329dd921cb1 |
|
17-Nov-2006 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[TCP]: Tidy up skb_entail Heck, it even saves us a few bytes: [acme@newtoy net-2.6.20]$ codiff -f /tmp/tcp.o.before ../OUTPUT/qemu/net-2.6.20/net/ipv4/tcp.o /pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv4/tcp.c: tcp_sendpage | -7 tcp_sendmsg | -5 2 functions changed, 12 bytes removed [acme@newtoy net-2.6.20]$ Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
d3bc23e7ee9db8023dff5a86bb3b0069ed018789 |
|
15-Nov-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[NET]: Annotate callers of csum_fold() in net/* Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cfb6eeb4c860592edd123fdea908d23c6ad1c7dc |
|
15-Nov-2006 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[TCP]: MD5 Signature Option (RFC2385) support. Based on implementation by Rick Payne. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
52bf376c63eebe72e862a1a6e713976b038c3f50 |
|
15-Nov-2006 |
John Heffner <jheffner@psc.edu> |
[TCP]: Fix up sysctl_tcp_mem initialization. Fix up tcp_mem initial settings to take into account the size of the hash entries (different on SMP and non-SMP systems). Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9e950efa20dc8037c27509666cba6999da9368e8 |
|
07-Nov-2006 |
John Heffner <jheffner@psc.edu> |
[TCP]: Don't use highmem in tcp hash size calculation. This patch removes consideration of high memory when determining TCP hash table sizes. Taking into account high memory results in tcp_mem values that are too large. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1ef9696c909060ccdae3ade245ca88692b49285b |
|
19-Sep-2006 |
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> |
[TCP]: Send ACKs every 2nd received segment. It does not affect either mss-sized connections (obviously) or connections controlled by Nagle (because there is only one small segment in flight). The idea is to record the fact that a small segment arrives on a connection, where one small segment has already been received and is still not ACKed. In this case an ACK is forced after tcp_recvmsg() drains the receive buffer. In other words, it is a "soft" every-2nd-segment ACK, which is enough to preserve ACK clock even when ABC is enabled. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e5d679f33900c71d1a76ba07c5b04055abd34480 |
|
27-Aug-2006 |
Alexey Dobriyan <adobriyan@gmail.com> |
[NET]: Use SLAB_PANIC Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ab32ea5d8a760e7dd4339634e95d7be24ee5b842 |
|
22-Sep-2006 |
Brian Haley <brian.haley@hp.com> |
[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly. Couldn't actually measure any performance increase while testing (.3% I consider noise), but seems like the right thing to do. Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
84fa7933a33f806bbbaae6775e87459b1ec584c0 |
|
30-Aug-2006 |
Patrick McHardy <kaber@trash.net> |
[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose checksum still needs to be completed) and CHECKSUM_COMPLETE (for incoming packets, device supplied full checksum). Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
29bbd72d6ee1dbf2d9f00d022f8e999aa528fb3a |
|
03-Aug-2006 |
Alexey Dobriyan <adobriyan@gmail.com> |
[NET]: Fix more per-cpu typos Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
52499afe40387524e9f46ef9ce4695efccdd2ed9 |
|
01-Aug-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Process linger2 timeout consistently. Based upon guidance from Alexey Kuznetsov. When linger2 is active, we check to see if the fin_wait2 timeout is longer than the timewait. If it is, we schedule the keepalive timer for the difference between the timewait timeout and the fin_wait2 timeout. When this orphan socket is seen by tcp_keepalive_timer() it will try to transform this fin_wait2 socket into a fin_wait2 mini-socket, again if linger2 is active. Not all paths were setting this initial keepalive timer correctly. The tcp input path was doing it correctly, but tcp_close() wasn't, potentially making the socket linger longer than it really needs to. Signed-off-by: David S. Miller <davem@davemloft.net>
|
bbcf467dab42ea3c85f368df346c82af2fbba665 |
|
04-Jul-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Verify gso_type too in gso_segment We don't want nasty Xen guests to pass a TCPv6 packet in with gso_type set to TCPv4 or even UDP (or a packet that's both TCP and UDP). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
bcd76111178ebccedd46a9b3eaff65c78e5a70af |
|
30-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Generalise TSO-specific bits from sk_setup_caps This patch generalises the TSO-specific bits from sk_setup_caps by adding the sk_gso_type member to struct sock. This makes sk_setup_caps generic so that it can be used by TCPv6 or UFO. The only catch is that whoever uses this must provide a GSO implementation for their protocol which I think is a fair deal :) For now UFO continues to live without a GSO implementation which is OK since it doesn't use the sock caps field at the moment. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
adcfc7d0b4d7bc3c7edac6fdde9f3ae510bd6054 |
|
30-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[IPV6]: Added GSO support for TCPv6 This patch adds GSO support for IPv6 and TCPv6. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3820c3f3e41786322c0bb225b9c77b8deff869d1 |
|
30-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Reset gso_segs if packet is dodgy I wasn't paranoid enough in verifying GSO information. A bogus gso_segs could upset drivers as much as a bogus header would. Let's reset it in the per-protocol gso_segment functions. I didn't verify gso_size because that can be verified by the source of the dodgy packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6ab3d5624e172c553004ecc862bfeac16d9d68b7 |
|
30-Jun-2006 |
Jörn Engel <joern@wohnheim.fh-wedel.de> |
Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
576a30eb6453439b3c37ba24455ac7090c247b5a |
|
27-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Added GSO header verification When GSO packets come from an untrusted source (e.g., a Xen guest domain), we need to verify the header integrity before passing it to the hardware. Since the first step in GSO is to verify the header, we can reuse that code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this bit set can only be fed directly to devices with the corresponding bit NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb is fed to the GSO engine which will allow the packet to be sent to the hardware if it passes the header check. This patch changes the sg flag to a full features flag. The same method can be used to implement TSO ECN support. We simply have to mark packets with CWR set with SKB_GSO_ECN so that only hardware with a corresponding NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment the packet, or segment the first MTU and pass the rest to the hardware for further segmentation. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0718bcc09b3597c51e87f265c72135a4928d3c0b |
|
26-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Fix CHECKSUM_HW GSO problems. Fix checksum problems in the GSO code path for CHECKSUM_HW packets. The ipv4 TCP pseudo header checksum has to be adjusted for GSO segmented packets. The adjustment is needed because the length field in the pseudo-header changes. However, because we have the inequality oldlen > newlen, we know that delta = (u16)~oldlen + newlen is still a 16-bit quantity. This also means that htonl(delta) + th->check still fits in 32 bits. Therefore we don't have to use csum_add for these operations. This is based on a patch by Michael Chan <mchan@broadcom.com>. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
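A hypothetical standalone helper mirroring the arithmetic described above (not the exact kernel code, which operates on the network-byte-order fields in place):

    #include <stdint.h>

    static uint16_t gso_adjust_check(uint16_t check, uint16_t oldlen, uint16_t newlen)
    {
        /* oldlen > newlen, so the delta fits in 16 bits and the sum in 32 */
        uint32_t sum = check + (uint16_t)~oldlen + newlen;

        sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);   /* second fold catches the last carry */
        return (uint16_t)sum;
    }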
|
f4c50d990dcf11a296679dc05de3873783236711 |
|
22-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Add software TSOv4 This patch adds the GSO implementation for IPv4 TCP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7967168cefdbc63bf332d6b1548eca7cd65ebbcc |
|
22-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Merge TSO/UFO fields in sk_buff Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet; that role has been subsumed by the new gso_type field, which is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just AND the gso_type field with the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
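The "can this device handle this skb" test then reduces to a mask-and-compare, roughly like this sketch (16 being the era's NETIF_F_GSO_SHIFT):

    static inline int net_gso_ok(int features, int gso_type)
    {
        int feature = gso_type << 16;   /* NETIF_F_GSO_SHIFT */

        /* the device must advertise every feature bit the skb requires */
        return (features & feature) == feature;
    }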
|
8648b3053bff39a7ee4c711d74268079c928a657 |
|
18-Jun-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUM The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM identically so we test for them in quite a few places. For the sake of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two. We also test the disjunct of NETIF_F_IP_CSUM and the other two in various places, for that purpose I've added NETIF_F_ALL_CSUM. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
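As described, the two helpers are plain disjunctions (a sketch of the macros this adds):

    #define NETIF_F_GEN_CSUM   (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
    #define NETIF_F_ALL_CSUM   (NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM)

so a test like (dev->features & (NETIF_F_IP_CSUM | NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)) collapses to (dev->features & NETIF_F_ALL_CSUM).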
|
1a2449a87bb7606113b1aa1a9d3c3e78ef189a1c |
|
24-May-2006 |
Chris Leech <christopher.leech@intel.com> |
[I/OAT]: TCP recv offload to I/OAT Locks down user pages and sets up for DMA in tcp_recvmsg, then calls dma_async_try_early_copy in tcp_v4_do_rcv Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
624d1164730d58a494cc5aa4afa37d02c41e83a7 |
|
24-May-2006 |
Chris Leech <christopher.leech@intel.com> |
[I/OAT]: Make sk_eat_skb I/OAT aware. Add an extra argument to sk_eat_skb, and make it move early copied packets to the async_wait_queue instead of freeing them. Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e4b4992b8007c6b62ec143cbbb292f98813ca11 |
|
24-May-2006 |
Chris Leech <christopher.leech@intel.com> |
[I/OAT]: Rename cleanup_rbuf to tcp_cleanup_rbuf and make non-static Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT Signed-off-by: Chris Leech <christopher.leech@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
75c2d9077c63ac21488129cc23561d4f4fd0f5e5 |
|
04-May-2006 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix sock_orphan deadlock Calling sock_orphan inside bh_lock_sock in tcp_close can lead to deadlocks. For example, the inet_diag code holds sk_callback_lock without disabling BH. If an inbound packet arrives during that admittedly tiny window, it will cause a deadlock on bh_lock_sock. Another possible path would be through sock_wfree if the network device driver frees the tx skb in process context with BH enabled. We can fix this by moving sock_orphan out of bh_lock_sock. The tricky bit is to work out when we need to destroy the socket ourselves and when it has already been destroyed by someone else. By moving sock_orphan before the release_sock we can solve this problem. This is because as long as we own the socket lock its state cannot change. So we simply record the socket state before the release_sock and then check the state again after we regain the socket lock. If the socket state has transitioned to TCP_CLOSE in the meantime, we know that the socket has been destroyed. Otherwise the socket is still ours to keep. Note that I've also moved the increment on the orphan count forward. This may look like a problem as we're increasing it even if the socket is just about to be destroyed where it'll be decreased again. However, this simply enlarges a window that already exists. This also changes the orphan count test by one. Considering what the orphan count is meant to do this is no big deal. This problem was discovered by Ingo Molnar using his lock validator. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
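Schematically, the tail of tcp_close() ends up as in this simplified sketch (error handling and unrelated steps omitted; only the ordering described above matters):

    sock_orphan(sk);                 /* safe: we still own the socket lock */
    atomic_inc(sk->sk_prot->orphan_count);

    state = sk->sk_state;            /* record state before dropping the lock */
    sock_hold(sk);
    release_sock(sk);

    local_bh_disable();
    bh_lock_sock(sk);

    /* if someone destroyed the socket while the lock was dropped,
     * it will have transitioned to TCP_CLOSE in the meantime */
    if (state != TCP_CLOSE && sk->sk_state == TCP_CLOSE)
        goto out;                    /* not ours to destroy any more */
    /* ... otherwise proceed to tear the socket down ourselves ... */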
|
f348d70a324e15afc701a494f32ec468abb7d1eb |
|
25-Mar-2006 |
Davide Libenzi <davidel@xmailserver.org> |
[PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications Implement the half-closed devices notification, by adding a new POLLRDHUP (and its alias EPOLLRDHUP) bit to the existing poll/select sets. Since changing the existing POLLHUP handling, which does not correctly report half-closed devices, was considered too risky, this implementation leaves the current POLLHUP reporting unchanged and simply adds a new bit that is set in the few places where it makes sense. The same thing was discussed and conceptually agreed quite some time ago: http://lkml.org/lkml/2003/7/12/116 Since this new event bit is added to the existing Linux poll infrastructure, even the existing poll/select system calls will be able to use it. As far as the existing POLLHUP handling goes, the patch leaves it as is. The pollrdhup-2.6.16.rc5-0.10.diff defines POLLRDHUP for all the existing archs and sets the bit in the six relevant files. The other attached diff is the simple change required to sys/epoll.h to add the EPOLLRDHUP definition. There is "a stupid program" to test POLLRDHUP delivery here: http://www.xmailserver.org/pollrdhup-test.c It tests poll(2), but since the delivery is the same, epoll(2) will work equally well. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
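A minimal user-space sketch of consuming the new bit (assuming POLLRDHUP is visible via _GNU_SOURCE with sufficiently recent headers):

    #define _GNU_SOURCE             /* for POLLRDHUP */
    #include <poll.h>
    #include <stdio.h>

    static int wait_for_half_close(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLRDHUP };

        if (poll(&pfd, 1, -1) < 0)
            return -1;
        if (pfd.revents & POLLRDHUP)
            printf("peer shut down its write side; drain reads, then EOF\n");
        return 0;
    }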
|
b8059eadf9f4dc24bd72da71daa832a9a9899fb4 |
|
25-Mar-2006 |
David S. Miller <davem@sunset.davemloft.net> |
[TCP]: Mark tcp_*mem[] __read_mostly. Suggested by Stephen Hemminger. Signed-off-by: David S. Miller <davem@davemloft.net>
|
7b4f4b5ebceab67ce440a61081a69f0265e17c2a |
|
25-Mar-2006 |
John Heffner <jheffner@psc.edu> |
[TCP]: Set default max buffers from memory pool size This patch sets the maximum TCP buffer sizes (available to automatic buffer tuning, not to setsockopt) based on the TCP memory pool size. The maximum sndbuf and rcvbuf each will be up to 4 MB, but no more than 1/128 of the memory pressure threshold. Signed-off-by: John Heffner <jheffner@psc.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
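The sizing rule reads roughly like this sketch of the tcp_init() logic (tcp_mem[1], the pressure threshold, is in pages; shifting by PAGE_SHIFT - 7 converts to bytes and divides by 128):

    unsigned long limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
    unsigned long max_share = min(4UL * 1024 * 1024, limit);   /* cap at 4 MB */

    sysctl_tcp_wmem[2] = max(64UL * 1024, max_share);
    sysctl_tcp_rmem[2] = max(87380UL, max_share);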
|
543d9cfeec4d58ad3fd974db5531b06b6b95deb4 |
|
21-Mar-2006 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[NET]: Indentation & other cleanups related to compat_[gs]etsockopt cset No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs to just after the function exported, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dec73ff0293d59076d1fd8f4a264898ecfc457ec |
|
21-Mar-2006 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK] compat: Introduce inet_csk_compat_[gs]etsockopt Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3fdadf7d27e3fbcf72930941884387d1f4936f04 |
|
21-Mar-2006 |
Dmitry Mishin <dim@openvz.org> |
[NET]: {get|set}sockopt compatibility layer This patch extends {get|set}sockopt compatibility layer in order to move protocol specific parts to their place and avoid huge universal net/compat.c file in the future. Signed-off-by: Dmitry Mishin <dim@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d83d8461f902c672bc1bd8fbc6a94e19f092da97 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[IP_SOCKGLUE]: Remove most of the tcp specific calls As DCCP needs to be called in the same spots. Now we have a member in inet_sock (is_icsk), set at sock creation time from struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and DCCP) to see if a struct sock instance is an inet_connection_sock for places like the ones in ip_sockglue.c (v4 and v6) where we previously checked whether sk_type was SOCK_STREAM, which is insufficient because we now use the same code for DCCP, which has sk_type SOCK_DCCP. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8292a17a399ffb7c5c8b083db4ad994e090055f7 |
|
14-Dec-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops And move it to struct inet_connection_sock. DCCP will use it in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9b5b5cff9a6655dbb6d2e2be365bb95eec3950eb |
|
30-Nov-2005 |
Arjan van de Ven <arjan@infradead.org> |
[NET]: Add const markers to various variables. the patch below marks various variables const in net/; the goal is to move them to the .rodata section so that they can't false-share cachelines with things that get written to, as well as potentially helping gcc a bit with optimisations. (these were found using a gcc patch to warn about such variables) Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
18955cfcb2a5d75a08e0cb297f13ccfb6904de48 |
|
30-Nov-2005 |
Mike Stroyan <mike.stroyan@hp.com> |
[IPV4] tcp/route: Another look at hash table sizes The tcp_ehash hash table gets too big on systems with really big memory. It is worse on systems with pages larger than 4KB. It wastes memory that could be better used. It also makes the netstat command slow because reading /proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table. The default value should not be larger for larger page sizes. It seems that the effect of page size is an unintended error dating back a long time. I also wonder if the default value really should be a larger fraction of memory for systems with more memory. While systems with really big ram can afford more space for hash tables, it is not clear to me that they benefit from increasing the allocation ratio for this table. The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and mm/page_alloc.c:alloc_large_system_hash. tcp_init calls alloc_large_system_hash passing parameters- bucketsize=sizeof(struct tcp_ehash_bucket) numentries=thash_entries scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT) limit=0 On i386, PAGE_SHIFT is 12 for a page size of 4K On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K The num_physpages test above makes the allocation take a larger fraction of the total memory on systems with larger memory. The threshold size for an i386 system is 512MB. For an ia64 system with 16KB pages the threshold is 2GB. For smaller memory systems- On i386, scale = (27 - 12) = 15 On ia64, scale = (27 - 14) = 13 For larger memory systems- On i386, scale = (25 - 12) = 13 On ia64, scale = (25 - 14) = 11 For the rest of this discussion, I'll just track the larger memory case. The default behavior has numentries=thash_entries=0, so the allocated size is determined by either scale or by the default limit of 1/16 of total memory. In alloc_large_system_hash- | numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages; | numentries += (1UL << (20 - PAGE_SHIFT)) - 1; | numentries >>= 20 - PAGE_SHIFT; | numentries <<= 20 - PAGE_SHIFT; At this point, numentries is pages for all of memory, rounded up to the nearest megabyte boundary. | /* limit to 1 bucket per 2^scale bytes of low memory */ | if (scale > PAGE_SHIFT) | numentries >>= (scale - PAGE_SHIFT); | else | numentries <<= (PAGE_SHIFT - scale); On i386, numentries >>= (13 - 12), so numentries is 1/8192 of bytes of total memory. On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of bytes of total memory. | log2qty = long_log2(numentries); | | do { | size = bucketsize << log2qty; bucketsize is 16, so size is 16 times numentries, rounded down to a power of two. On i386, size is 1/512 of bytes of total memory. On ia64, size is 1/128 of bytes of total memory. For smaller systems the results are On i386, size is 1/2048 of bytes of total memory. On ia64, size is 1/512 of bytes of total memory. The large page effect can be removed by just replacing the use of PAGE_SHIFT with a constant of 12 in the calls to alloc_large_system_hash. That makes them more like the other uses of that function from fs/inode.c and fs/dcache.c Signed-off-by: David S. Miller <davem@davemloft.net>
|
caa20d9abe810be2ede9612b6c9db6ce7d6edf80 |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: spelling fixes Minor spelling fixes for TCP code. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9772efb970780aeed488c19d8b4afd46c3b484af |
|
11-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Appropriate Byte Count support This is an updated version of the RFC3465 ABC patch originally for Linux 2.6.11-rc4 by Yee-Ting Li. ABC is a way of counting bytes ack'd rather than packets when updating congestion control. The original ABC described in the RFC applied to a Reno style algorithm. For advanced congestion control there is little change after leaving slow start. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6df716340da3a6fdd33d73d7ed4c6f7590ca1c42 |
|
04-Nov-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP/DCCP]: Randomize port selection This patch randomizes the port selected on bind() for connections, making certain security attacks harder. It should also be faster in most cases because there is no need for a global lock. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
|
fb5f5e6e0cebd574be737334671d1aa8f170d5f3 |
|
06-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix TCP_OFF() bug check introduced by previous change. The TCP_OFF assignment at the bottom of that if block can indeed set TCP_OFF without setting TCP_PAGE. Since there is not much to be gained from avoiding this situation, we might as well just zap the offset. The following patch should fix it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ef015786152adaff5a6a8bf0c8ea2f70cee8059d |
|
02-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[TCP]: Fix sk_forward_alloc underflow in tcp_sendmsg I've finally found a potential cause of the sk_forward_alloc underflows that people have been reporting sporadically. When tcp_sendmsg tacks on extra bits to an existing TCP_PAGE we don't check sk_forward_alloc even though a large amount of time may have elapsed since we allocated the page. In the meantime someone could've come along and liberated packets and reclaimed sk_forward_alloc memory. This patch makes tcp_sendmsg check sk_forward_alloc every time as we do in do_tcp_sendpages. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d80d99d643090c3cf2b1f9fb3fadd1256f7e384f |
|
02-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: Add sk_stream_wmem_schedule This patch introduces sk_stream_wmem_schedule as a short-hand for the sk_forward_alloc checking on egress. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
ba89966c1984513f4f2cc0a6c182266be44ddd03 |
|
26-Aug-2005 |
Eric Dumazet <dada1@cosmosbay.com> |
[NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointers This patch puts mostly read only data in the right section (read_mostly), to help sharing of these data between CPUS without memory ping pongs. On one of my production machine, tcp_statistics was sitting in a heavily modified cache line, so *every* SNMP update had to force a reload. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
dc40c7bc76054f5e4382835ca2bafb895b993a8a |
|
24-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Generalise tcp_listen_poll Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6687e988d9aeaccad6774e6a8304f681f3ec0a03 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[ICSK]: Move TCP congestion avoidance members to icsk This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(), minimal renaming/moving done in this changeset to ease review. Most of it is just changes of struct tcp_sock * to struct sock * parameters. With this we move to a state closer to two interesting goals: 1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used for any INET transport protocol that has struct inet_hashinfo and are derived from struct inet_connection_sock. Keeps the userspace API, that will just not display DCCP sockets, while newer versions of tools can support DCCP. 2. INET generic transport pluggable Congestion Avoidance infrastructure, using the current TCP CA infrastructure with DCCP. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
295ff7edb8f72b77d524759266f7524deae379b3 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@mandriva.com> |
[TIMEWAIT]: Introduce inet_timewait_death_row That groups all of the tables and variables associated to the TCP timewait scheduling/recycling/killing code, that now can be isolated from the TCP specific code and used by other transport protocols, such as DCCP. Next changeset will move this code to net/ipv4/inet_timewait_sock.c Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a019d6fe2b9da68ea4ba6cf3c4e86fc1dbf554c3 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[ICSK]: Move generalised functions from tcp to inet_connection_sock This also improves reqsk_queue_prune and renames it to inet_csk_reqsk_queue_prune, as it deals with both inet_connection_sock and inet_request_sock objects, not just with request_sock ones, and thus belongs with inet_request_sock. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
295f7324ff8d9ea58b4d3ec93b1aaa1d80e048a9 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[ICSK]: Introduce reqsk_queue_prune from code in tcp_synack_timer With this we're very close to getting all of the current TCP refactorings in my dccp-2.6 tree merged, next changeset will export some functions needed by the current DCCP code and then dccp-2.6.git will be born! Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0a5578cf8e5e045aaa68643c17ce885426697c6b |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[ICSK]: Generalise tcp_listen_{start,stop} This also moved inet_iif from tcp to inet_hashtables.h, as it is needed by the inet_lookup callers, perhaps this needs a bit of polishing, but for now seems fine. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3f421baa4720b708022f8bcc52a61e5cd6f10bf8 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Just move the inet_connection_sock function from tcp sources Completing the previous changeset, this also generalises tcp_v4_synq_add, renaming it to inet_csk_reqsk_queue_hash_add, already being used in the DCCP tree, which I plan to merge RSN. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
463c84b97f24010a67cd871746d6a7e4c925a5f9 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Introduce inet_connection_sock This creates struct inet_connection_sock, moving members out of struct tcp_sock that are shareable with other INET connection oriented protocols, such as DCCP, that in my private tree already uses most of these members. The functions that operate on these members were renamed, using a inet_csk_ prefix while not being moved yet to a new file, so as to ease the review of these changes. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8feaf0c0a5488b3d898a9c207eb6678f44ba3f26 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise tcp_tw_bucket, aka TIME_WAIT sockets This paves the way to generalise the rest of the sock ID lookup routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro kernels (where IPv6 is always built as a module): [root@qemu ~]# grep tw_sock /proc/slabinfo tw_sock_TCPv6 0 0 128 31 1 tw_sock_TCP 0 0 96 41 1 [root@qemu ~]# Now if a protocol wants to use the TIME_WAIT generic infrastructure it only has to set the sk_prot->twsk_obj_size field with the size of its inet_timewait_sock derived sock and proto_register will create sk_prot->twsk_slab. For now it's only for INET sockets, but we can introduce timewait_sock later if some non-INET transport protocol wants to use this stuff. Next changesets will take advantage of this new infrastructure to generalise even more TCP code. [acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size /tmp/before.size: 188646 11764 5068 205478 322a6 net/ipv4/built-in.o /tmp/after.size: 188144 11764 5068 204976 320b0 net/ipv4/built-in.o [acme@toy net-2.6.14]$ Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1 (qemu host)). Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6e04e02165a7209a71db553b7bc48d68421e5ebf |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Move tcp_port_rover to inet_hashinfo Also expose all of the tcp_hashinfo members, i.e. killing those tcp_ehash, etc macros, this will more clearly expose already generic functions and some that need just a bit of work to become generic, as we'll see in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2d8c4ce51903636ce0f60addc8134aa50ab8fa76 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Generalise tcp_bind_hash & tcp_inherit_port This required moving tcp_bucket_cachep to inet_hashinfo. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a55ebcc4c4532107ad9eee1c9bb698ab5f12c00f |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Move bind_hash from tcp_sk to inet_sk This should really be in an inet_connection_sock, but I'm leaving it for a later optimization, when some more fields common to INET transport protocols now in tcp_sk or inet_sk will be chunked out into inet_connection_sock. For now it's better to concentrate on getting the changes in the core merged to leave the DCCP tree with only DCCP specific code. Next changesets will take advantage of this move to generalise things like tcp_bind_hash, tcp_put_port, tcp_inherit_port, making the latter receive an inet_hashinfo parameter, and even __tcp_tw_hashdance, etc in the future, when tcp_tw_bucket gets transformed into the struct timewait_sock hierarchy. tcp_destroy_sock also is eligible as soon as tcp_orphan_count gets moved to sk_prot. A cascade of incremental changes will ultimately make the tcp_lookup functions be fully generic. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0f7ff9274e72fd254fbd1ab117bbc1db6e7cdb34 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[INET]: Just rename the TCP hashtable functions/structs to inet_ This is to break down the complexity of the series of patches, making it very clear that this one just does: 1. renames tcp_ prefixed hashtable functions and data structures that were already mostly generic to inet_ to share it with DCCP and other INET transport protocols. 2. Removes unused functions (__tb_head & tb_head) 3. Removes some leftover prototypes in the headers (tcp_bucket_unlock & tcp_v4_build_header) Next changesets will move tcp_sk(sk)->bind_hash to inet_sock so that we can make functions such as tcp_inherit_port, __tcp_inherit_port, tcp_v4_get_port, __tcp_put_port, generic and get others like tcp_destroy_sock closer to generic (tcp_orphan_count will go to sk->sk_prot to allow this). Eventually most of these functions will be used passing the transport protocol inet_hashinfo structure. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e6848976b721eeb5551cd94673faafeef78d9f35 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET]: Cleanup INET_REFCNT_DEBUG code Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
83e3609eba3818f6e18b8bf9442195169ac306b7 |
|
10-Aug-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[REQSK]: Move the syn_table destroy from tcp_listen_stop to reqsk_queue_destroy Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8728b834b226ffcf2c94a58530090e292af2a7bf |
|
10-Aug-2005 |
David S. Miller <davem@davemloft.net> |
[NET]: Kill skb->list Remove the "list" member of struct sk_buff, as it is entirely redundant. All SKB list removal callers know which list the SKB is on, so storing this in sk_buff does nothing other than taking up some space. Two tricky bits were SCTP, which I took care of, and two ATM drivers which Francois Romieu <romieu@fr.zoreil.com> fixed up. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
|
89ebd197eb2cd31d6187db344d5117064e19fdde |
|
23-Aug-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Unconditionally clear TCP_NAGLE_PUSH in skb_entail(). Intention of this bit is to force pushing of the existing send queue when TCP_CORK or TCP_NODELAY state changes via setsockopt(). But it's easy to create a situation where the bit never clears. For example, if the send queue starts empty: 1) set TCP_NODELAY 2) clear TCP_NODELAY 3) set TCP_CORK 4) do small write() The current code will leave TCP_NAGLE_PUSH set after that sequence. Unconditionally clearing the bit when new data is added via skb_entail() solves the problem. Signed-off-by: David S. Miller <davem@davemloft.net>
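The fix itself is a one-line unconditional clear at the point where new data is queued (sketch):

    /* in skb_entail(), once the new skb is added to the write queue: */
    tp->nonagle &= ~TCP_NAGLE_PUSH;   /* a stale PUSH must not outlive new data */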
|
b03efcfb2180289718991bb984044ce6c5b7d1b0 |
|
08-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[NET]: Transform skb_queue_len() binary tests into skb_queue_empty() This is part of the grand scheme to eliminate the qlen member of skb_queue_head, and subsequently remove the 'list' member of sk_buff. Most users of skb_queue_len() want to know if the queue is empty or not, and that's trivially done with skb_queue_empty() which doesn't use the skb_queue_head->qlen member and instead uses the queue list emptyness as the test. Signed-off-by: David S. Miller <davem@davemloft.net>
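The transformation is mechanical; an illustrative before/after:

    /* before: reads skb_queue_head->qlen */
    if (skb_queue_len(&sk->sk_receive_queue) == 0)
        return;

    /* after: tests list emptiness directly, no qlen needed */
    if (skb_queue_empty(&sk->sk_receive_queue))
        return;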
|
c1b4a7e69576d65efc31a8cea0714173c2841244 |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Move to new TSO segmenting scheme. Make TSO segment transmit size decisions at send time not earlier. The basic scheme is that we try to build as large a TSO frame as possible when pulling in the user data, but the size of the TSO frame output to the card is determined at transmit time. This is guided by tp->xmit_size_goal. It is always set to a multiple of MSS and tells sendmsg/sendpage how large an SKB to try and build. Later, tcp_write_xmit() and tcp_push_one() chop up the packet if necessary and conditions warrant. These routines can also decide to "defer" in order to wait for more ACKs to arrive and thus allow larger TSO frames to be emitted. A general observation is that TSO elongates the pipe, thus requiring a larger congestion window and larger buffering especially at the sender side. Therefore, it is important that applications 1) get a large enough socket send buffer (this is accomplished by our dynamic send buffer expansion code) 2) do large enough writes. Signed-off-by: David S. Miller <davem@davemloft.net>
|
b4e26f5ea0dbdd1e813c5571fb467022d8eb948a |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Fix send-side cpu utilization regression. Only put user data purely to pages when doing TSO. The extra page allocations cause two problems: 1) Add the overhead of the page allocations themselves. 2) Make us do small user copies when we get to the end of the TCP socket cache page. It is still beneficial to purely use pages for TSO, so we will do it for that case. Signed-off-by: David S. Miller <davem@davemloft.net>
|
c65f7f00c587828e3d50737805a78f74804972de |
|
06-Jul-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Simplify SKB data portion allocation with NETIF_F_SG. The ideal and most optimal layout for an SKB when doing scatter-gather is to put all the headers at skb->data, and all the user data in the page array. This makes SKB splitting and combining extremely simple, especially before a packet goes onto the wire the first time. So, when sk_stream_alloc_pskb() is given a zero size, make sure there is no skb_tailroom(). This is achieved by applying SKB_DATA_ALIGN() to the header length used here. Next, make select_size() in TCP output segmentation use a length of zero when NETIF_F_SG is true on the outgoing interface. Signed-off-by: David S. Miller <davem@davemloft.net>
|
5f8ef48d240963093451bcf83df89f1a1364f51d |
|
24-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Allow choosing TCP congestion control via sockopt. Allow using setsockopt to set TCP congestion control to use on a per socket basis. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
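From user space the knob looks like this (assuming the TCP_CONGESTION socket option this patch introduces, with the algorithm named by a string):

    #include <string.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* ask for classic Reno on one socket; fails if the algorithm is
     * unavailable or the caller lacks the required capability */
    static int use_reno(int fd)
    {
        const char name[] = "reno";

        return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
    }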
|
317a76f9a44b437d6301718f4e5d08bd93f98da7 |
|
23-Jun-2005 |
Stephen Hemminger <shemminger@osdl.org> |
[TCP]: Add pluggable congestion control algorithm infrastructure. Allow TCP to have multiple pluggable congestion control algorithms. Algorithms are defined by a set of operations and can be built in or modules. The legacy "new RENO" algorithm is used as a starting point and fallback. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
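An algorithm module registers a table of operations, along these lines (a sketch; hook names are hypothetical, bodies omitted, and the field set is trimmed to the core ssthresh/cong_avoid pair):

    static struct tcp_congestion_ops example_cc = {
        .name       = "example",
        .owner      = THIS_MODULE,
        .ssthresh   = example_ssthresh,   /* slow-start threshold after loss */
        .cong_avoid = example_cong_avoid, /* grow cwnd as ACKs arrive */
    };

    static int __init example_cc_init(void)
    {
        return tcp_register_congestion_control(&example_cc);
    }

Unregistering on module exit goes through tcp_unregister_congestion_control().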
|
7df551254add79a445d2e47e8f849cef8fee6e38 |
|
19-Jun-2005 |
David S. Miller <davem@davemloft.net> |
[TCP]: Fix sysctl_tcp_low_latency When enabled, this should disable UCOPY prequeue'ing altogether, but it does not due to a missing test. Signed-off-by: David S. Miller <davem@davemloft.net>
|
2ad69c55a282315e6119cf7fd744f26a925bdfd2 |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] rename struct tcp_listen_opt to struct listen_sock Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0e87506fcc734647c7b2497eee4eb81e785c857a |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Generalise tcp_listen_opt This chunks out the accept_queue and tcp_listen_opt code and moves them to net/core/request_sock.c and include/net/request_sock.h, to make it useful for other transport protocols, DCCP being the first one to use it. Next patches will rename tcp_listen_opt to accept_sock and remove the inline tcp functions that just call a reqsk_queue_ function. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
60236fdd08b2169045a3bbfc5ffe1576e6c3c17b |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Rename open_request to request_sock Ok, this one just renames some stuff to have a better namespace and to dissassociate it from TCP: struct open_request -> struct request_sock tcp_openreq_alloc -> reqsk_alloc tcp_openreq_free -> reqsk_free tcp_openreq_fastfree -> __reqsk_free With this most of the infrastructure closely resembles a struct sock methods subset. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2e6599cb899ba4b133f42cbf9d2b1883d2dc583a |
|
19-Jun-2005 |
Arnaldo Carvalho de Melo <acme@ghostprotocols.net> |
[NET] Generalise TCP's struct open_request minisock infrastructure Kept this first changeset minimal, without changing existing names to ease peer review. Basically tcp_openreq_alloc now receives the or_calltable, that in turn has two new members: ->slab, that replaces tcp_openreq_cachep ->obj_size, to inform the size of the openreq descendant for a specific protocol The protocol specific fields in struct open_request were moved to a class hierarchy, with the things that are common to all connection oriented PF_INET protocols in struct inet_request_sock, the TCP ones in tcp_request_sock, that is an inet_request_sock, that is an open_request. I.e. this uses the same approach used for the struct sock class hierarchy, with sk_prot indicating if the protocol wants to use the open_request infrastructure by filling in sk_prot->rsk_prot with an or_calltable. Results? Performance is improved and TCP v4 now uses only 64 bytes per open request minisock, down from 96 without this patch :-) Next changeset will rename some of the structs, fields and functions mentioned above; struct or_calltable is way unclear, better name it struct request_sock_ops, s/struct open_request/struct request_sock/g, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e7626486c3c4ce456b11a7944edf164ef76fc599 |
|
13-Jun-2005 |
Andi Kleen <ak@suse.de> |
[TCP]: Adjust TCP mem order check to new alloc_large_system_hash Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
|
02c30a84e6298b6b20a56f0896ac80b47839e134 |
|
06-May-2005 |
Jesper Juhl <juhl-lkml@dif.dk> |
[PATCH] update Ross Biro bouncing email address Ross moved. Remove the bad email address so people will find the correct one in ./CREDITS. Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 |
|
17-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|