History log of /net/ipv4/datagram.c
Revision Date Author Comments
b73c3d0e4f0e1961e15bec18720e48aabebe2109 02-Jul-2014 Tom Herbert <therbert@google.com> net: Save TX flow hash in sock and set in skbuf on xmit

For a connected socket we can precompute the flow hash for setting
in skb->hash on output. This is a performance advantage over
calculating the skb->hash for every packet on the connection. The
computation is done using the common hash algorithm to be consistent
with computations done for packets of the connection in other states
where thers is no socket (e.g. time-wait, syn-recv, syn-cookies).

This patch adds sk_txhash to the sock structure. inet_set_txhash and
ip6_set_txhash functions are added which are called from points in
TCP and UDP where socket moves to established state.

skb_set_hash_from_sk is a function which sets skb->hash from the
sock txhash value. This is called in UDP and TCP transmit path when
transmitting within the context of a socket.

Tested: ran super_netperf with 200 TCP_RR streams over a vxlan
interface (in this case skb_get_hash called on every TX packet to
create a UDP source port).

Before fix:

95.02% CPU utilization
154/256/505 90/95/99% latencies
1.13042e+06 tps

Time in functions:
0.28% skb_flow_dissect
0.21% __skb_get_hash

After fix:

94.95% CPU utilization
156/254/485 90/95/99% latencies
1.15447e+06

Neither __skb_get_hash nor skb_flow_dissect appear in perf

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9709674e68646cee5a24e3000b3558d25412203a 10-Jun-2014 Eric Dumazet <edumazet@google.com> ipv4: fix a race in ip4_datagram_release_cb()

Alexey gave a AddressSanitizer[1] report that finally gave a good hint
at where was the origin of various problems already reported by Dormando
in the past [2]

Problem comes from the fact that UDP can have a lockless TX path, and
concurrent threads can manipulate sk_dst_cache, while another thread,
is holding socket lock and calls __sk_dst_set() in
ip4_datagram_release_cb() (this was added in linux-3.8)

It seems that all we need to do is to use sk_dst_check() and
sk_dst_set() so that all the writers hold same spinlock
(sk->sk_dst_lock) to prevent corruptions.

TCP stack do not need this protection, as all sk_dst_cache writers hold
the socket lock.

[1]
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel

AddressSanitizer: heap-use-after-free in ipv4_dst_check
Read of size 2 by thread T15453:
[<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
[<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
[<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
[<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
[<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
[<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
[<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
[<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
[<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Freed by thread T15455:
[<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
[<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
[<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
[<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
[<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
[<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
[<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Allocated by thread T15453:
[<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
[<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
[< inlined >] __ip_route_output_key+0x3e8/0xf70
__mkroute_output ./net/ipv4/route.c:1939
[<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
[<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
[<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
[<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
[<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
[<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
[<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

[2]
<4>[196727.311203] general protection fault: 0000 [#1] SMP
<4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
<4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
<4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
<4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>] [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.311399] RSP: 0018:ffff885effd23a70 EFLAGS: 00010282
<4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
<4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
<4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
<4>[196727.311510] FS: 0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
<4>[196727.311554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
<4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[196727.311713] Stack:
<4>[196727.311733] ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
<4>[196727.311784] ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
<4>[196727.311834] ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
<4>[196727.311885] Call Trace:
<4>[196727.311907] <IRQ>
<4>[196727.311912] [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
<4>[196727.311959] [<ffffffff815b86c6>] dst_release+0x56/0x80
<4>[196727.311986] [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
<4>[196727.312013] [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
<4>[196727.312041] [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312070] [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
<4>[196727.312097] [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312125] [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[196727.312154] [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[196727.312183] [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[196727.312212] [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[196727.312242] [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
<4>[196727.312275] [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
<4>[196727.312308] [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
<4>[196727.312338] [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
<4>[196727.312368] [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
<4>[196727.312397] [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
<4>[196727.312433] [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
<4>[196727.312463] [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
<4>[196727.312491] [<ffffffff815b1691>] net_rx_action+0x111/0x210
<4>[196727.312521] [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
<4>[196727.312552] [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[196727.312583] [<ffffffff816cef3c>] call_softirq+0x1c/0x30
<4>[196727.312613] [<ffffffff81004205>] do_softirq+0x55/0x90
<4>[196727.312640] [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[196727.312668] [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
<4>[196727.312696] [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
<4>[196727.312722] <EOI>
<1>[196727.313071] RIP [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.313100] RSP <ffff885effd23a70>
<4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
<0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt

Reported-by: Alexey Preobrazhensky <preobr@google.com>
Reported-by: dormando <dormando@rydia.ne>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 8141ed9fcedb2 ("ipv4: Add a socket release callback for datagram sockets")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0e0d44ab4275549998567cd4700b43f7496eb62b 28-Aug-2013 Steffen Klassert <steffen.klassert@secunet.com> net: Remove FLOWI_FLAG_CAN_SLEEP

FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
c9e9042994d37cbc1ee538c500e9da1bb9d1bcdf 14-Nov-2013 Eric Dumazet <edumazet@google.com> ipv4: fix possible seqlock deadlock

ip4_datagram_connect() being called from process context,
it should use IP_INC_STATS() instead of IP_INC_STATS_BH()
otherwise we can deadlock on 32bit arches, or get corruptions of
SNMP counters.

Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8141ed9fcedb278f4a3a78680591bef1e55f75fb 21-Jan-2013 Steffen Klassert <steffen.klassert@secunet.com> ipv4: Add a socket release callback for datagram sockets

This implements a socket release callback function to check
if the socket cached route got invalid during the time
we owned the socket. The function is used from udp, raw
and ping sockets.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3038eeac027d8ec62e4936143498f2b37baf4cb5 07-May-2011 David S. Miller <davem@davemloft.net> ipv4: Lock socket and use cork flow in ip4_datagram_connect().

This is to make sure that an l2tp socket's inet cork flow is
fully filled in, when it's encapsulated in UDP.

Signed-off-by: David S. Miller <davem@davemloft.net>
87321c839fb4a65b5d78c16d79d1674cf223a452 29-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Get route daddr from flow key in ip4_datagram_connect().

Now that output route lookups update the flow with
destination address selection, we can fetch it from
fl4->daddr instead of rt->rt_dst

Signed-off-by: David S. Miller <davem@davemloft.net>
a406b611b5f26a18773e4237d79f6df3eaed1d32 29-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Fetch route saddr from flow key in ip4_datagram_connect().

Now that output route lookups update the flow with
source address selection, we can fetch it from
fl4->saddr instead of rt->rt_src

Signed-off-by: David S. Miller <davem@davemloft.net>
2d7192d6cbab20e153c47fa1559ffd41ceef0e79 26-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Sanitize and simplify ip_route_{connect,newports}()

These functions are used together as a unit for route resolution
during connect(). They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object. That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
b23dd4fe42b455af5c6e20966b7d6959fa8352ea 02-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
abdf7e7239da270e68262728f125ea94b9b7d42d 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Can final ip_route_connect() arg to boolean "can_sleep".

Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller <davem@davemloft.net>
a02cec2155fbea457eca8881870fd2de1a4c4c76 22-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> net: return operator cleanup

Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
719f835853a92f6090258114a72ffe41f09155cd 08-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> udp: add rehash on connect()

commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).

Problem is that following sequence :

fd = socket(...)
connect(fd, &remote, ...)

not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)

Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.

When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.

One solution to this problem is to rehash datagram socket if needed.

We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.

This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.

Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
4bc2f18ba4f22a90ab593c0a580fc9a19c4777b6 09-Jul-2010 Eric Dumazet <eric.dumazet@gmail.com> net/ipv4: EXPORT_SYMBOL cleanups

CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8d1f30b95a635dbd610dcc5eb641aca8f4768cf 11-Jun-2010 Changli Gao <xiaosuo@gmail.com> net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c720c7e8383aff1cb219bddf474ed89d850336e3 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> inet: rename some inet_sock fields

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7c73a6faffae0bfae70639113aecf06af666e714 17-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: add net to IP_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f97c1e0c6ebdb606c97b6cb5e837c6110ac5a961 16-Dec-2007 Joe Perches <joe@perches.com> [IPV4] net/ipv4: Use ipv4_is_<type>

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
584bdf8cbdf6f277c2a00e083257ee75687cf6f4 01-Jun-2007 Wei Dong <weidong@cn.fujitsu.com> [IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP

Signed-off-by: Wei Dong <weidong@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e905a9edab7f4f14f9213b52234e4a346c690911 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8eb9086f21c73b38b5ca27558db4c91d62d0e70b 08-Feb-2007 David S. Miller <davem@sunset.davemloft.net> [IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.

Do this even for non-blocking sockets. This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller <davem@davemloft.net>
bada8adc4e6622764205921e6ba3f717aa03c882 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: ip_route_connect() ipv4 address arguments annotated

annotated address arguments (port number left alone for now); ditto
for inferred net-endian variables in callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
20380731bc2897f2952ae055420972ded4cd786e 16-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [NET]: Fix sparse warnings

Of this type, mostly:

CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c752f0739f09b803aed191c4765a3b6650a08653 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [TCP]: Move the tcp sock states to net/tcp_states.h

Lots of places just needs the states, not even linux/tcp.h, where this
enum was, needs it.

This speeds up development of the refactorings as less sources are
rebuilt when things get moved from net/tcp.h.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!