History log of /net/ipv4/udp.c
Revision Date Author Comments
ba3d8d3f9f65807763b2e0e1ea7645d74a962248 31-Mar-2014 Lorenzo Colitti <lorenzo@google.com> net: core: Support UID-based routing.

This contains the following commits:

1. cc2f522 net: core: Add a UID range to fib rules.
2. d7ed2bd net: core: Use the socket UID in routing lookups.
3. 2f9306a net: core: Add a RTA_UID attribute to routes.
This is so that userspace can do per-UID route lookups.
4. 8e46efb net: ipv6: Use the UID in IPv6 PMTUD
IPv4 PMTUD already does this because ipv4_sk_update_pmtu
uses __build_flow_key, which includes the UID.

Bug: 15413527
Change-Id: I81bd31dae655de9cce7d7a1f9a905dc1c2feba7c
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
82eabd9eb2ec1603282a2c3f74dfcb6fe0aaea0e 04-Sep-2014 Alexander Duyck <alexander.h.duyck@intel.com> net: merge cases where sock_efree and sock_edemux are the same function

Since sock_efree and sock_demux are essentially the same code for non-TCP
sockets and the case where CONFIG_INET is not defined we can combine the
code or replace the call to sock_edemux in several spots. As a result we
can avoid a bit of unnecessary code or code duplication.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2abb7cdc0dc84e99b76ef983a1ae1978922aa9b3 01-Sep-2014 Tom Herbert <therbert@google.com> udp: Add support for doing checksum unnecessary conversion

Add support for doing CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE
conversion in UDP tunneling path.

In the normal UDP path, we call skb_checksum_try_convert after locating
the UDP socket. The check is that checksum conversion is enabled for
the socket (new flag in UDP socket) and that checksum field is
non-zero.

In the UDP GRO path, we call skb_gro_checksum_try_convert after
checksum is validated and checksum field is non-zero. Since this is
already in GRO we assume that checksum conversion is always wanted.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
57c67ff4bd92af634f7c91c40eb02a96dd785dda 22-Aug-2014 Tom Herbert <therbert@google.com> udp: additional GRO support

Implement GRO for UDPv6. Add UDP checksum verification in gro_receive
for both UDP4 and UDP6 calling skb_gro_checksum_validate_zero_check.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8fc54f68919298ff9689d980efb495707ef43f30 23-Aug-2014 Daniel Borkmann <dborkman@redhat.com> net: use reciprocal_scale() helper

Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
274f482d33a309c87096f2983601ceda2761094e 22-Jul-2014 Sorin Dumitru <sorin@returnze.ro> sock: remove skb argument from sk_rcvqueues_full

It hasn't been used since commit 0fd7bac(net: relax rcvbuf limits).

Signed-off-by: Sorin Dumitru <sorin@returnze.ro>
Signed-off-by: David S. Miller <davem@davemloft.net>
2dc41cff7545d55c6294525c811594576f8e119c 16-Jul-2014 David Held <drheld@google.com> udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.

Many multicast sources can have the same port which can result in a very
large list when hashing by port only. Hash by address and port instead
if this is the case. This makes multicast more similar to unicast.

On a 24-core machine receiving from 500 multicast sockets on the same
port, before this patch 80% of system CPU was used up by spin locking
and only ~25% of packets were successfully delivered.

With this patch, all packets are delivered and kernel overhead is ~8%
system CPU on spinlocks.

Signed-off-by: David Held <drheld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5cf3d46192fccf68b4a4759e4d7346e41c669a76 16-Jul-2014 David Held <drheld@google.com> udp: Simplify __udp*_lib_mcast_deliver.

Switch to using sk_nulls_for_each which shortens the code and makes it
easier to update.

Signed-off-by: David Held <drheld@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
155e010edbc168f43bb332887c43e8579ee3894a 14-Jul-2014 Tom Herbert <therbert@google.com> udp: Move udp_tunnel_segment into udp_offload.c

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a2f983f83b0f66a74a045d36eb84e2f01bb950c7 11-Jul-2014 Li RongQing <roy.qing.li@gmail.com> ipv4: remove the unnecessary variable in udp_mcast_next

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3e215c8d1b6b772d107f1811b5ee8eae7a046fb4 25-Jun-2014 James M Leddy <james.leddy@redhat.com> udp: Add MIB counters for rcvbuferrors

Add MIB counters for rcvbuferrors in UDP to help diagnose problems.

Signed-off-by: James M Leddy <james.leddy@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
63c6f81cdde58c41da62a8d8a209592e42a0203e 13-Jun-2014 Eric Dumazet <edumazet@google.com> udp: ipv4: do not waste time in __udp4_lib_mcast_demux_lookup

Its too easy to add thousand of UDP sockets on a particular bucket,
and slow down an innocent multicast receiver.

Early demux is supposed to be an optimization, we should avoid spending
too much time in it.

It is interesting to note __udp4_lib_demux_lookup() only tries to
match first socket in the chain.

10 is the threshold we already have in __udp4_lib_lookup() to switch
to secondary hash.

Fixes: 421b3885bf6d5 ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Held <drheld@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ebbe495f19d565bb39be31ca28126a222b6a66dd 03-Jun-2014 WANG Cong <xiyou.wangcong@gmail.com> ipv4: use skb frags api in udp4_hwcsum()

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0f4f4ffa7b7c3d29d0537a126145c9f8d8ed5dbc 05-Jun-2014 Tom Herbert <therbert@google.com> net: Add GSO support for UDP tunnels with checksum

Added a new netif feature for GSO_UDP_TUNNEL_CSUM. This indicates
that a device is capable of computing the UDP checksum in the
encapsulating header of a UDP tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
af5fcba7f38f3166392f4087ab734433c84f160b 05-Jun-2014 Tom Herbert <therbert@google.com> udp: Generic functions to set checksum

Added udp_set_csum and udp6_set_csum functions to set UDP checksums
in packets. These are for simple UDP packets such as those that might
be created in UDP tunnels.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1c19448c9ba6545b80ded18488a64a7f3d8e6998 23-May-2014 Tom Herbert <therbert@google.com> net: Make enabling of zero UDP6 csums more restrictive

RFC 6935 permits zero checksums to be used in IPv6 however this is
recommended only for certain tunnel protocols, it does not make
checksums completely optional like they are in IPv4.

This patch restricts the use of IPv6 zero checksums that was previously
intoduced. no_check6_tx and no_check6_rx have been added to control
the use of checksums in UDP6 RX and TX path. The normal
sk_no_check_{rx,tx} settings are not used (this avoids ambiguity when
dealing with a dual stack socket).

A helper function has been added (udp_set_no_check6) which can be
called by tunnel impelmentations to all zero checksums (send on the
socket, and accept them as valid).

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
28448b80456feafe07e2d05b6363b00f61f6171e 23-May-2014 Tom Herbert <therbert@google.com> net: Split sk_no_check into sk_no_check_{rx,tx}

Define separate fields in the sock structure for configuring disabling
checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
removed UDP_CSUM_* defines since they are no longer necessary.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7228317441f4dee5e5916e30300dd8c61f75af7 14-May-2014 Joe Perches <joe@perches.com> net: Use a more standard macro for INET_ADDR_COOKIE

Missing a colon on definition use is a bit odd so
change the macro for the 32 bit case to declare an
__attribute__((unused)) and __deprecated variable.

The __deprecated attribute will cause gcc to emit
an error if the variable is actually used.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
122ff243f5f104194750ecbc76d5946dd1eec934 13-May-2014 WANG Cong <xiyou.wangcong@gmail.com> ipv4: make ip_local_reserved_ports per netns

ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.

By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0a80966b1043c3e2dc684140f155a3fded308660 08-May-2014 Tom Herbert <therbert@google.com> net: Verify UDP checksum before handoff to encap

Moving validation of UDP checksum to be done in UDP not encap layer.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ed70fcfcee953a76028bfc3f963d2167c2990020 03-May-2014 Tom Herbert <therbert@google.com> net: Call skb_checksum_init in IPv4

Call skb_checksum_init instead of private functions.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c8e6ad0829a723a74cd2fea9996a3392d2579a18 18-Feb-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv6: honor IPV6_PKTINFO with v4 mapped addresses on sendmsg

In case we decide in udp6_sendmsg to send the packet down the ipv4
udp_sendmsg path because the destination is either of family AF_INET or
the destination is an ipv4 mapped ipv6 address, we don't honor the
maybe specified ipv4 mapped ipv6 address in IPV6_PKTINFO.

We simply can check for this option in ip_cmsg_send because no calls to
ipv6 module functions are needed to do so.

Reported-by: Gert Doering <gert@space.net>
Cc: Tore Anderson <tore@fud.no>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
342dfc306fb32155314dad277f3c3686b83fb9f1 17-Jan-2014 Steffen Hurrle <steffen@hurrle.net> net: add build-time checks for msg->msg_name size

This is a follow-up patch to f3d3342602f8bc ("net: rework recvmsg
handler msg_name and msg_namelen logic").

DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.

Signed-off-by: Steffen Hurrle <steffen@hurrle.net>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
63862b5bef7349dd1137e4c70702c67d77565785 11-Jan-2014 Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> net: replace macros net_random and net_srandom with direct calls to prandom

This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.

Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7a7ffbabf99445704be01bff5d7e360da908cf8e 26-Dec-2013 Wei-Chun Chao <weichunc@plumgrid.com> ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC

VM to VM GSO traffic is broken if it goes through VXLAN or GRE
tunnel and the physical NIC on the host supports hardware VXLAN/GRE
GSO offload (e.g. bnx2x and next-gen mlx4).

Two issues -
(VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with
SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header
integrity check fails in udp4_ufo_fragment if inner protocol is
TCP. Also gso_segs is calculated incorrectly using skb->len that
includes tunnel header. Fix: robust check should only be applied
to the inner packet.

(VXLAN & GRE) Once GSO header integrity check passes, NULL segs
is returned and the original skb is sent to hardware. However the
tunnel header is already pulled. Fix: tunnel header needs to be
restored so that hardware can perform GSO properly on the original
packet.

Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e47eb5dfb296bf217e9ebee7b2f07486670b9c1b 15-Dec-2013 Eric Dumazet <edumazet@google.com> udp: ipv4: do not use sk_dst_lock from softirq context

Using sk_dst_lock from softirq context is not supported right now.

Instead of adding BH protection everywhere,
udp_sk_rx_dst_set() can instead use xchg(), as suggested
by David.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 975022310233 ("udp: ipv4: must add synchronization in udp_sk_rx_dst_set()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
975022310233fb0f0193873d79a7b8438070fa82 11-Dec-2013 Eric Dumazet <edumazet@google.com> udp: ipv4: must add synchronization in udp_sk_rx_dst_set()

Unlike TCP, UDP input path does not hold the socket lock.

Before messing with sk->sk_rx_dst, we must use a spinlock, otherwise
multiple cpus could leak a refcount.

This patch also takes care of renewing a stale dst entry.
(When the sk->sk_rx_dst would not be used by IP early demux)

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
610438b74496b2986a9025f8e23c134cb638e338 11-Dec-2013 Eric Dumazet <edumazet@google.com> udp: ipv4: fix potential use after free in udp_v4_early_demux()

pskb_may_pull() can reallocate skb->head, we need to move the
initialization of iph and uh pointers after its call.

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8afdd99a1315e759de04ad6e2344f0c5f17ecb1b 11-Dec-2013 Eric Dumazet <edumazet@google.com> udp: ipv4: fix an use after free in __udp4_lib_rcv()

Dave Jones reported a use after free in UDP stack :

[ 5059.434216] =========================
[ 5059.434314] [ BUG: held lock freed! ]
[ 5059.434420] 3.13.0-rc3+ #9 Not tainted
[ 5059.434520] -------------------------
[ 5059.434620] named/863 is freeing memory ffff88005e960000-ffff88005e96061f, with a lock still held there!
[ 5059.434815] (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435012] 3 locks held by named/863:
[ 5059.435086] #0: (rcu_read_lock){.+.+..}, at: [<ffffffff8143054d>] __netif_receive_skb_core+0x11d/0x940
[ 5059.435295] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff81467a5e>] ip_local_deliver_finish+0x3e/0x410
[ 5059.435500] #2: (slock-AF_INET){+.-...}, at: [<ffffffff8149bd21>] udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.435734]
stack backtrace:
[ 5059.435858] CPU: 0 PID: 863 Comm: named Not tainted 3.13.0-rc3+ #9 [loadavg: 0.21 0.06 0.06 1/115 1365]
[ 5059.436052] Hardware name: /D510MO, BIOS MOPNV10J.86A.0175.2010.0308.0620 03/08/2010
[ 5059.436223] 0000000000000002 ffff88007e203ad8 ffffffff8153a372 ffff8800677130e0
[ 5059.436390] ffff88007e203b10 ffffffff8108cafa ffff88005e960000 ffff88007b00cfc0
[ 5059.436554] ffffea00017a5800 ffffffff8141c490 0000000000000246 ffff88007e203b48
[ 5059.436718] Call Trace:
[ 5059.436769] <IRQ> [<ffffffff8153a372>] dump_stack+0x4d/0x66
[ 5059.436904] [<ffffffff8108cafa>] debug_check_no_locks_freed+0x15a/0x160
[ 5059.437037] [<ffffffff8141c490>] ? __sk_free+0x110/0x230
[ 5059.437147] [<ffffffff8112da2a>] kmem_cache_free+0x6a/0x150
[ 5059.437260] [<ffffffff8141c490>] __sk_free+0x110/0x230
[ 5059.437364] [<ffffffff8141c5c9>] sk_free+0x19/0x20
[ 5059.437463] [<ffffffff8141cb25>] sock_edemux+0x25/0x40
[ 5059.437567] [<ffffffff8141c181>] sock_queue_rcv_skb+0x81/0x280
[ 5059.437685] [<ffffffff8149bd21>] ? udp_queue_rcv_skb+0xd1/0x4b0
[ 5059.437805] [<ffffffff81499c82>] __udp_queue_rcv_skb+0x42/0x240
[ 5059.437925] [<ffffffff81541d25>] ? _raw_spin_lock+0x65/0x70
[ 5059.438038] [<ffffffff8149bebb>] udp_queue_rcv_skb+0x26b/0x4b0
[ 5059.438155] [<ffffffff8149c712>] __udp4_lib_rcv+0x152/0xb00
[ 5059.438269] [<ffffffff8149d7f5>] udp_rcv+0x15/0x20
[ 5059.438367] [<ffffffff81467b2f>] ip_local_deliver_finish+0x10f/0x410
[ 5059.438492] [<ffffffff81467a5e>] ? ip_local_deliver_finish+0x3e/0x410
[ 5059.438621] [<ffffffff81468653>] ip_local_deliver+0x43/0x80
[ 5059.438733] [<ffffffff81467f70>] ip_rcv_finish+0x140/0x5a0
[ 5059.438843] [<ffffffff81468926>] ip_rcv+0x296/0x3f0
[ 5059.438945] [<ffffffff81430b72>] __netif_receive_skb_core+0x742/0x940
[ 5059.439074] [<ffffffff8143054d>] ? __netif_receive_skb_core+0x11d/0x940
[ 5059.442231] [<ffffffff8108c81d>] ? trace_hardirqs_on+0xd/0x10
[ 5059.442231] [<ffffffff81430d83>] __netif_receive_skb+0x13/0x60
[ 5059.442231] [<ffffffff81431c1e>] netif_receive_skb+0x1e/0x1f0
[ 5059.442231] [<ffffffff814334e0>] napi_gro_receive+0x70/0xa0
[ 5059.442231] [<ffffffffa01de426>] rtl8169_poll+0x166/0x700 [r8169]
[ 5059.442231] [<ffffffff81432bc9>] net_rx_action+0x129/0x1e0
[ 5059.442231] [<ffffffff810478cd>] __do_softirq+0xed/0x240
[ 5059.442231] [<ffffffff81047e25>] irq_exit+0x125/0x140
[ 5059.442231] [<ffffffff81004241>] do_IRQ+0x51/0xc0
[ 5059.442231] [<ffffffff81542bef>] common_interrupt+0x6f/0x6f

We need to keep a reference on the socket, by using skb_steal_sock()
at the right place.

Note that another patch is needed to fix a race in
udp_sk_rx_dst_set(), as we hold no lock protecting the dst.

Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0e0d44ab4275549998567cd4700b43f7496eb62b 28-Aug-2013 Steffen Klassert <steffen.klassert@secunet.com> net: Remove FLOWI_FLAG_CAN_SLEEP

FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
f1d8cba61c3c4b1eb88e507249c4cb8d635d9a76 28-Nov-2013 Eric Dumazet <edumazet@google.com> inet: fix possible seqlock deadlocks

In commit c9e9042994d3 ("ipv4: fix possible seqlock deadlock") I left
another places where IP_INC_STATS_BH() were improperly used.

udp_sendmsg(), ping_v4_sendmsg() and tcp_v4_connect() are called from
process context, not from softirq context.

This was detected by lockdep seqlock support.

Reported-by: jongman heo <jongman.heo@samsung.com>
Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP")
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
d3f7d56a7a4671d395e8af87071068a195257bf6 25-Nov-2013 Shawn Landden <shawn@churchofgit.com> net: update consumers of MSG_MORE to recognize MSG_SENDPAGE_NOTLAST

Commit 35f9c09fe (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag MSG_SENDPAGE_NOTLAST, similar to
MSG_MORE.

algif_hash, algif_skcipher, and udp used MSG_MORE from tcp_sendpages()
and need to see the new flag as identical to MSG_MORE.

This fixes sendfile() on AF_ALG.

v3: also fix udp

Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> # 3.4.x + 3.2.x
Reported-and-tested-by: Shawn Landden <shawnlandden@gmail.com>
Original-patch: Richard Weinberger <richard@nod.at>
Signed-off-by: Shawn Landden <shawn@churchofgit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
85fbaa75037d0b6b786ff18658ddf0b4014ce2a4 23-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions

Commit bceaa90240b6019ed73b49965eac7d167610be69 ("inet: prevent leakage
of uninitialized memory to user in recv syscalls") conditionally updated
addr_len if the msg_name is written to. The recv_error and rxpmtu
functions relied on the recvmsg functions to set up addr_len before.

As this does not happen any more we have to pass addr_len to those
functions as well and set it to the size of the corresponding sockaddr
length.

This broke traceroute and such.

Fixes: bceaa90240b6 ("inet: prevent leakage of uninitialized memory to user in recv syscalls")
Reported-by: Brad Spengler <spender@grsecurity.net>
Reported-by: Tom Labanowski
Cc: mpb <mpb.mail@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bceaa90240b6019ed73b49965eac7d167610be69 18-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> inet: prevent leakage of uninitialized memory to user in recv syscalls

Only update *addr_len when we actually fill in sockaddr, otherwise we
can return uninitialized memory from the stack to the caller in the
recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL)
checks because we only get called with a valid addr_len pointer either
from sock_common_recvmsg or inet_recvmsg.

If a blocking read waits on a socket which is concurrently shut down we
now return zero and set msg_msgnamelen to 0.

Reported-by: mpb <mpb.mail@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
652586df95e5d76b37d07a11839126dcfede1621 14-Nov-2013 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> seq_file: remove "%n" usage from seq_file users

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1bbdceef1e535add893bf71d7b7ab102e4eb69eb 19-Oct-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once

Initialize the ehash and ipv6_hash_secrets with net_get_random_once.

Each compilation unit gets its own secret now:
ipv4/inet_hashtables.o
ipv4/udp.o
ipv6/inet6_hashtables.o
ipv6/udp.o
rds/connection.o

The functions still get inlined into the hashing functions. In the fast
path we have at most two (needed in ipv6) if (unlikely(...)).

Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
65cd8033ff375b68037df61603ee68070dc48578 19-Oct-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: split inet_ehashfn to hash functions per compilation unit

This duplicates a bit of code but let's us easily introduce
separate secret keys later. The separate compilation units are
ipv4/inet_hashtabbles.o, ipv4/udp.o and rds/connection.o.

Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f69b923a758f598fd6bb69e57564b59506f4f1fc 09-Oct-2013 Eric Dumazet <edumazet@google.com> udp: fix a typo in __udp4_lib_mcast_demux_lookup

At this point sk might contain garbage.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fbf8866d65d5de84f75563eb0edd7fc27dbe9a90 07-Oct-2013 Shawn Bohrer <sbohrer@rgmadvisors.com> net: ipv4 only populate IP_PKTINFO when needed

The since the removal of the routing cache computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received. This has introduced a performance regression for some
UDP workloads.

This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.

Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After 90587.62 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After 12.48us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
421b3885bf6d56391297844f43fb7154a6396e12 07-Oct-2013 Shawn Bohrer <sbohrer@rgmadvisors.com> udp: ipv4: Add udp early demux

The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.

For UDP multicast we can only cache the dst if there is only one
receiving socket on the host. Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.

For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups. Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.

Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After 89789.68 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After 12.63us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
005ec9743394010cd37d86c3fd2e81978231cdbf 07-Oct-2013 Shawn Bohrer <sbohrer@rgmadvisors.com> udp: Only allow busy read/poll on connected sockets

UDP sockets can receive packets from multiple endpoints and thus may be
received on multiple receive queues. Since packets packets can arrive
on multiple receive queues we should not mark the napi_id for all
packets. This makes busy read/poll only work for connected UDP sockets.

This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0bbf87d852d243680ed7074110ccc1dea003b61a 28-Sep-2013 Eric W. Biederman <ebiederm@xmission.com> net ipv4: Convert ipv4.ip_local_port_range to be per netns v3

- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
of the callers.
- Move the initialization of sysctl_local_ports into
sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c

v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next

v3:
- Compile fixes of strange callers of inet_get_local_port_range.
This patch now successfully passes an allmodconfig build.
Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range

Originally-by: Samya <samya@twitter.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aa6615814533c634190019ee3a5b10490026d545 24-Sep-2013 Francesco Fusco <ffusco@redhat.com> ipv4: processing ancillary IP_TOS or IP_TTL

If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1a462d189280b560bd84af1407e4d848e262b3b3 20-Sep-2013 Duan Jiong <duanj.fnst@cn.fujitsu.com> net: udp: do not report ICMP redirects to user space

Redirect isn't an error condition, it should leave
the error handler without touching the socket.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eb3c0d83cc78361a28e52e514a7095fdbf771e7e 31-Aug-2013 Cong Wang <amwang@redhat.com> net: unify skb_udp_tunnel_segment() and skb_udp6_tunnel_segment()

As suggested by Pravin, we can unify the code in case of duplicated
code.

Cc: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d14c5ab6bef6a46170b84c3589b27768e979f93d 15-Aug-2013 Francesco Fusco <ffusco@redhat.com> net: proc_fs: trivial: print UIDs as unsigned int

UIDs are printed in the proc_fs as signed int, whereas
they are unsigned int.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c26bf4a51308c85a6f97628253b99767a84ff90a 25-Jul-2013 Thomas Graf <tgraf@suug.ch> pktgen: Add UDPCSUM flag to support UDP checksums

UDP checksums are optional, hence pktgen has been omitting them in
favour of performance. The optional flag UDPCSUM enables UDP
checksumming. If the output device supports hardware checksumming
the skb is prepared and marked CHECKSUM_PARTIAL, otherwise the
checksum is generated in software.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cdbaa0bb26d8116d00be24e6b49043777b382f3a 11-Jul-2013 Alexander Duyck <alexander.h.duyck@intel.com> gso: Update tunnel segmentation to support Tx checksum offload

This change makes it so that the GRE and VXLAN tunnels can make use of Tx
checksum offload support provided by some drivers via the hw_enc_features.
Without this fix enabling GSO means sacrificing Tx checksum offload and
this actually leads to a performance regression as shown below:

Utilization
Send
Throughput local GSO
10^6bits/s % S state
6276.51 8.39 enabled
7123.52 8.42 disabled

To resolve this it was necessary to address two items. First
netif_skb_features needed to be updated so that it would correctly handle
the Trans Ether Bridging protocol without impacting the need to check for
Q-in-Q tagging. To do this it was necessary to update harmonize_features
so that it used skb_network_protocol instead of just using the outer
protocol.

Second it was necessary to update the GRE and UDP tunnel segmentation
offloads so that they would reset the encapsulation bit and inner header
offsets after the offload was complete.

As a result of this change I have seen the following results on a interface
with Tx checksum enabled for encapsulated frames:

Utilization
Send
Throughput local GSO
10^6bits/s % S state
7123.52 8.42 disabled
8321.75 5.43 enabled

v2: Instead of replacing refrence to skb->protocol with
skb_network_protocol just replace the protocol reference in
harmonize_features to allow for double VLAN tag checks.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8b80cda536ea9bceec0364e897868a30ee13b992 10-Jul-2013 Eliezer Tamir <eliezer.tamir@linux.intel.com> net: rename ll methods to busy-poll

Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all useres of these functions.
Update comments and defines in include/net/busy_poll.h

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
076bb0c82a44fbe46fe2c8527a5b5b64b69f679d 10-Jul-2013 Eliezer Tamir <eliezer.tamir@linux.intel.com> net: rename include/net/ll_poll.h to include/net/busy_poll.h

Rename the file and correct all the places where it is included.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8822b64a0fa64a5dd1dfcf837c5b0be83f8c05d1 01-Jul-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv6: call udp_push_pending_frames when uncorking a socket with AF_INET pending data

We accidentally call down to ip6_push_pending_frames when uncorking
pending AF_INET data on a ipv6 socket. This results in the following
splat (from Dave Jones):

skbuff: skb_under_panic: text:ffffffff816765f6 len:48 put:40 head:ffff88013deb6df0 data:ffff88013deb6dec tail:0x2c end:0xc0 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:126!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: dccp_ipv4 dccp 8021q garp bridge stp dlci mpoa snd_seq_dummy sctp fuse hidp tun bnep nfnetlink scsi_transport_iscsi rfcomm can_raw can_bcm af_802154 appletalk caif_socket can caif ipt_ULOG x25 rose af_key pppoe pppox ipx phonet irda llc2 ppp_generic slhc p8023 psnap p8022 llc crc_ccitt atm bluetooth
+netrom ax25 nfc rfkill rds af_rxrpc coretemp hwmon kvm_intel kvm crc32c_intel snd_hda_codec_realtek ghash_clmulni_intel microcode pcspkr snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep usb_debug snd_seq snd_seq_device snd_pcm e1000e snd_page_alloc snd_timer ptp snd pps_core soundcore xfs libcrc32c
CPU: 2 PID: 8095 Comm: trinity-child2 Not tainted 3.10.0-rc7+ #37
task: ffff8801f52c2520 ti: ffff8801e6430000 task.ti: ffff8801e6430000
RIP: 0010:[<ffffffff816e759c>] [<ffffffff816e759c>] skb_panic+0x63/0x65
RSP: 0018:ffff8801e6431de8 EFLAGS: 00010282
RAX: 0000000000000086 RBX: ffff8802353d3cc0 RCX: 0000000000000006
RDX: 0000000000003b90 RSI: ffff8801f52c2ca0 RDI: ffff8801f52c2520
RBP: ffff8801e6431e08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88022ea0c800
R13: ffff88022ea0cdf8 R14: ffff8802353ecb40 R15: ffffffff81cc7800
FS: 00007f5720a10740(0000) GS:ffff880244c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000005862000 CR3: 000000022843c000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Stack:
ffff88013deb6dec 000000000000002c 00000000000000c0 ffffffff81a3f6e4
ffff8801e6431e18 ffffffff8159a9aa ffff8801e6431e90 ffffffff816765f6
ffffffff810b756b 0000000700000002 ffff8801e6431e40 0000fea9292aa8c0
Call Trace:
[<ffffffff8159a9aa>] skb_push+0x3a/0x40
[<ffffffff816765f6>] ip6_push_pending_frames+0x1f6/0x4d0
[<ffffffff810b756b>] ? mark_held_locks+0xbb/0x140
[<ffffffff81694919>] udp_v6_push_pending_frames+0x2b9/0x3d0
[<ffffffff81694660>] ? udplite_getfrag+0x20/0x20
[<ffffffff8162092a>] udp_lib_setsockopt+0x1aa/0x1f0
[<ffffffff811cc5e7>] ? fget_light+0x387/0x4f0
[<ffffffff816958a4>] udpv6_setsockopt+0x34/0x40
[<ffffffff815949f4>] sock_common_setsockopt+0x14/0x20
[<ffffffff81593c31>] SyS_setsockopt+0x71/0xd0
[<ffffffff816f5d54>] tracesys+0xdd/0xe2
Code: 00 00 48 89 44 24 10 8b 87 d8 00 00 00 48 89 44 24 08 48 8b 87 e8 00 00 00 48 c7 c7 c0 04 aa 81 48 89 04 24 31 c0 e8 e1 7e ff ff <0f> 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55
RIP [<ffffffff816e759c>] skb_panic+0x63/0x65
RSP <ffff8801e6431de8>

This patch adds a check if the pending data is of address family AF_INET
and directly calls udp_push_ending_frames from udp_v6_push_pending_frames
if that is the case.

This bug was found by Dave Jones with trinity.

(Also move the initialization of fl6 below the AF_INET check, even if
not strictly necessary.)

Cc: Dave Jones <davej@redhat.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7c0cadc69ca2ac8893aa162ee80d92a805840909 12-Jun-2013 Eric Dumazet <edumazet@google.com> udp: fix two sparse errors

commit ba418fa357a7b3c ("soreuseport: UDP/IPv4 implementation")
added following sparse errors :

net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:433:60: expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:433:60: got restricted __be16 [usertype] sport
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:514:60: expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:514:60: got restricted __be16 [usertype] sport
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
da5bab079f9b7d90ba234965a14914ace55e45e9 08-Jun-2013 Daniel Borkmann <dborkman@redhat.com> net: udp4: move GSO functions to udp_offload

Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a5b50476f77a8fcc8055c955720d05a7c2d9c532 10-Jun-2013 Eliezer Tamir <eliezer.tamir@linux.intel.com> udp: add low latency socket poll support

Add upport for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram When there is no data and the user
tries to read we busy poll.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c3f1dbaf6e281642848b78fe101764170c15f168 31-May-2013 David Majnemer <majnemer@google.com> net: Update RFS target at poll for tcp/udp

The current state of affairs is that read()/write() will setup
RFS (Receive Flow Steering) for internet protocol sockets while
poll()/epoll() does not.

When poll() gets called with a TCP or UDP socket, we should update
the flow target.

This permits to RFS (if enabled) to select the appropriate CPU for
following incoming packets.

Note: Only connected UDP sockets can benefit from RFS.

Signed-off-by: David Majnemer <majnemer@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0d89d2035fe063461a5ddb609b2c12e7fb006e44 23-May-2013 Simon Horman <horms@verge.net.au> MPLS: Add limited GSO support

In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

SKB Usage:

When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:

* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
skb->inner_protocol is added by this patch.

* Set skb->protocol to the new MPLS ethertype of the packet.

* Set skb->network_header to correspond to the
end of the L3 header, including the MPLS label stack.

I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.

Features:

I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.

This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.

A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.

Alternate Implementation:

One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().

I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.

MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.

Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
19acc327258ac5bcd0f31c07853e6d9784010fb4 07-May-2013 Pravin B Shelar <pshelar@nicira.com> gso: Handle Trans-Ether-Bridging protocol in skb_network_protocol()

Rather than having logic to calculate inner protocol in every
tunnel gso handler move it to gso code. This simplifies code.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0d05535d413322341f89c77f32936b4df296ac5a 02-May-2013 Pravin B Shelar <pshelar@nicira.com> vxlan: Fix TCPv6 segmentation.

This patch set correct skb->protocol so that inner packet can
lookup correct gso handler.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6a5dc9e598fe90160fee7de098fa319665f5253e 29-Apr-2013 Eric Dumazet <edumazet@google.com> net: Add MIB counters for checksum errors

Add MIB counters for checksum errors in IP layer,
and TCP/UDP/ICMP layers, to help diagnose problems.

$ nstat -a | grep Csum
IcmpInCsumErrors 72 0.0
TcpInCsumErrors 382 0.0
UdpInCsumErrors 463221 0.0
Icmp6InCsumErrors 75 0.0
Udp6InCsumErrors 173442 0.0
IpExtInCsumErrors 10884 0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bf84a01063eaab2f1a37d72d1b903445b3a25a4e 14-Apr-2013 Daniel Borkmann <dborkman@redhat.com> net: sock: make sock_tx_timestamp void

Currently, sock_tx_timestamp() always returns 0. The comment that
describes the sock_tx_timestamp() function wrongly says that it
returns an error when an invalid argument is passed (from commit
20d4947353be, ``net: socket infrastructure for SO_TIMESTAMPING'').
Make the function void, so that we can also remove all the unneeded
if conditions that check for such a _non-existant_ error case in the
output path.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d9dda78bad879595d8c4220a067fc029d6484a16 01-Apr-2013 Al Viro <viro@zeniv.linux.org.uk> procfs: new helper - PDE_DATA(inode)

The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5594c32187e4cb673d2ac815b1adb5a96d9f6254 24-Mar-2013 Pravin B Shelar <pshelar@nicira.com> Revert "udp: increase inner ip header ID during segmentation"

This reverts commit d6a8c36dd6f6f06f046e5c61d3fb39b777c3bdc6.
Next commit makes this commit unnecessary.

Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d6a8c36dd6f6f06f046e5c61d3fb39b777c3bdc6 22-Mar-2013 Cong Wang <amwang@redhat.com> udp: increase inner ip header ID during segmentation

Similar to GRE tunnel, UDP tunnel should take care of IP header ID
too.

Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
44046a593eb770dbecdabf1c82bcd252f2a8337b 19-Mar-2013 Tom Parkin <tparkin@katalix.com> udp: add encap_destroy callback

Users of udp encapsulation currently have an encap_rcv callback which they can
use to hook into the udp receive path.

In situations where a encapsulation user allocates resources associated with a
udp encap socket, it may be convenient to be able to also hook the proto
.destroy operation. For example, if an encap user holds a reference to the
udp socket, the destroy hook might be used to relinquish this reference.

This patch adds a socket destroy hook into udp, which is set and enabled
in the same way as the existing encap_rcv hook.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
731362674580cb0c696cd1b1a03d8461a10cf90a 07-Mar-2013 Pravin B Shelar <pshelar@nicira.com> tunneling: Add generic Tunnel segmentation.

Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature,
software offload is used.

This can be used by tunneling protocols like VXLAN.

CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ece31ffd539e8e2b586b1ca5f50bc4f4591e3893 18-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com> net: proc: change proc_net_remove to remove_proc_entry

proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
68c331631143f5f039baac99a650e0b9e1ea02b6 14-Feb-2013 Pravin B Shelar <pshelar@nicira.com> v4 GRE: Add TCP segmentation offload for GRE

Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently none of devices support it therefore GRE GSO
always fall backs to software GSO.

[ Compute pkt_len before ip_local_out() invocation. -DaveM ]

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba418fa357a7b3c9d477f4706c6c7c96ddbd1360 22-Jan-2013 Tom Herbert <therbert@google.com> soreuseport: UDP/IPv4 implementation

Allow multiple UDP sockets to bind to the same port.

Motivation soreuseport would be something like a DNS server.  An
alternative would be to recv on the same socket from multiple threads.
As in the case of TCP, the load across these threads tends to be
disproportionate and we also see a lot of contection on the socketlock.
Note that SO_REUSEADDR already allows multiple UDP sockets to bind to
the same port, however there is no provision to prevent hijacking and
nothing to distribute packets across all the sockets sharing the same
bound port.  This patch does not change the semantics of SO_REUSEADDR,
but provides usable functionality of it for unicast.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
50c3a487d507568359ccbce1b15bcdfc01edf7ef 22-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> ipv4: Use IS_ERR_OR_NULL().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8141ed9fcedb278f4a3a78680591bef1e55f75fb 21-Jan-2013 Steffen Klassert <steffen.klassert@secunet.com> ipv4: Add a socket release callback for datagram sockets

This implements a socket release callback function to check
if the socket cached route got invalid during the time
we owned the socket. The function is used from udp, raw
and ping sockets.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
979402b16cde048ced4839e21cc49e0779352b80 06-Sep-2012 Eric Dumazet <edumazet@google.com> udp: increment UDP_MIB_INERRORS if copy failed

In UDP recvmsg(), we miss an increase of UDP_MIB_INERRORS if the copy
of skb to userspace failed for whatever reason.

Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a7cb5a49bf64ba64864ae16a6be028f8b0d3cc06 24-May-2012 Eric W. Biederman <ebiederm@xmission.com> userns: Print out socket uids in a user namespace aware fashion.

Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
b5ec8eeac46a99004c26791f70b15d001e970acf 10-Aug-2012 Eric Dumazet <edumazet@google.com> ipv4: fix ip_send_skb()

ip_send_skb() can send orphaned skb, so we must pass the net pointer to
avoid possible NULL dereference in error path.

Bug added by commit 3a7c384ffd57 (ipv4: tcp: unicast_sock should not
land outside of TCP stack)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
55be7a9c6074f749d617a7fc1914c9a23505438c 12-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Add redirect support to all protocol icmp error handlers.

Signed-off-by: David S. Miller <davem@davemloft.net>
22911fc581f6a241e2897a7a8603e97344a6ec82 27-Jun-2012 Eric Dumazet <edumazet@google.com> net: skb_free_datagram_locked() doesnt drop all packets

dropwatch wrongly diagnose all received UDP packets as drops.

This patch removes trace_kfree_skb() done in skb_free_datagram_locked().

Locations calling skb_free_datagram_locked() should do it on their own.

As a result, drops are accounted on the right function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
36393395536064e483b73d173f6afc103eadfbc4 15-Jun-2012 David S. Miller <davem@davemloft.net> ipv4: Handle PMTU in all ICMP error handlers.

With ip_rt_frag_needed() removed, we have to explicitly update PMTU
information in every ICMP error handler.

Create two helper functions to facilitate this.

1) ipv4_sk_update_pmtu()

This updates the PMTU when we have a socket context to
work with.

2) ipv4_update_pmtu()

Raw version, used when no socket context is available. For this
interface, we essentially just pass in explicit arguments for
the flow identity information we would have extracted from the
socket.

And you'll notice that ipv4_sk_update_pmtu() is simply implemented
in terms of ipv4_update_pmtu()

Note that __ip_route_output_key() is used, rather than something like
ip_route_output_flow() or ip_route_output_key(). This is because we
absolutely do not want to end up with a route that does IPSEC
encapsulation and the like. Instead, we only want the route that
would get us to the node described by the outermost IP header.

Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
31fe62b9586643953f0c0c37a6357dafc69034e2 23-May-2012 Tim Bird <tim.bird@am.sony.com> mm: add a low limit to alloc_large_system_hash

UDP stack needs a minimum hash size value for proper operation and also
uses alloc_large_system_hash() for proper NUMA distribution of its hash
tables and automatic sizing depending on available system memory.

On some low memory situations, udp_table_init() must ignore the
alloc_large_system_hash() result and reallocs a bigger memory area.

As we cannot easily free old hash table, we leak it and kmemleak can
issue a warning.

This patch adds a low limit parameter to alloc_large_system_hash() to
solve this problem.

We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
allocation.

Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
413c27d8697751f72d2d6cf289140a8e060a8032 19-May-2012 Eldad Zack <eldad@fogrefinery.com> net/ipv4: replace simple_strtoul with kstrtoul

Replace simple_strtoul with kstrtoul in three similar occurrences, all setup
handlers:
* route.c: set_rhash_entries
* tcp.c: set_thash_entries
* udp.c: set_uhash_entries

Also check if the conversion failed.

Signed-off-by: Eldad Zack <eldad@fogrefinery.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f545a38f74584cc7424cb74f792a00c6d2589485 23-Apr-2012 Eric Dumazet <edumazet@google.com> net: add a limit parameter to sk_add_backlog()

sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.

No functional change expected in this patch, all callers still using the
old sk_rcvbuf limit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
95c961747284a6b83a5e2d81240e214b0fa3464d 15-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5e73ea1a31c3612aa6dfe44f864ca5b7b6a4cff9 15-Apr-2012 Daniel Baluta <dbaluta@ixiacom.com> ipv4: fix checkpatch errors

Fix checkpatch errors of the following type:
* ERROR: "foo * bar" should be "foo *bar"
* ERROR: "(foo*)" should be "(foo *)"

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
447167bf565a474ff0cfb0f41d54936937479e97 12-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com> udp: intoduce udp_encap_needed static_key

Most machines dont use UDP encapsulation (L2TP)

Adds a static_key so that udp_queue_rcv_skb() doesnt have to perform a
test if L2TP never setup the encap_rcv on a socket.

Idea of this patch came after Simon Horman proposal to add a hook on TCP
as well.

If static_key is not yet enabled, the fast path does a single JMP .

When static_key is enabled, JMP destination is patched to reach the real
encap_type/encap_rcv logic, possibly adding cache misses.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: dev@openvswitch.org
Signed-off-by: David S. Miller <davem@davemloft.net>
9ffc93f203c18a70623f21950f1dd473c9ec48cd 28-Mar-2012 David Howells <dhowells@redhat.com> Remove all #inclusions of asm/system.h

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
afd465030acb4098abcb6b965a5aebc7ea2209e0 12-Mar-2012 Joe Perches <joe@perches.com> net: ipv4: Standardize prefixes for message logging

Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3f518bf745cbd6007d8069100fb9cb09e960c872 21-Feb-2012 Pavel Emelyanov <xemul@parallels.com> datagram: Add offset argument to __skb_recv_datagram

This one is only considered for MSG_PEEK flag and the value pointed by
it specifies where to start peeking bytes from. If the offset happens to
point into the middle of the returned skb, the offset within this skb is
put back to this very argument.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
76e21053b5bf33a07c76f99d27a74238310e3c71 08-Feb-2012 Erich E. Hoover <ehoover@mines.edu> ipv4: Implement IP_UNICAST_IF socket option.

The IP_UNICAST_IF feature is needed by the Wine project. This patch
implements the feature by setting the outgoing interface in a similar
fashion to that of IP_MULTICAST_IF. A separate option is needed to
handle this feature since the existing options do not provide all of
the characteristics required by IP_UNICAST_IF, a summary is provided
below.

SO_BINDTODEVICE:
* SO_BINDTODEVICE requires administrative privileges, IP_UNICAST_IF
does not. From reading some old mailing list articles my
understanding is that SO_BINDTODEVICE requires administrative
privileges because it can override the administrator's routing
settings.
* The SO_BINDTODEVICE option restricts both outbound and inbound
traffic, IP_UNICAST_IF only impacts outbound traffic.

IP_PKTINFO:
* Since IP_PKTINFO and IP_UNICAST_IF are independent options,
implementing IP_UNICAST_IF with IP_PKTINFO will likely break some
applications.
* Implementing IP_UNICAST_IF on top of IP_PKTINFO significantly
complicates the Wine codebase and reduces the socket performance
(doing this requires a lot of extra communication between the
"server" and "user" layers).

bind():
* bind() does not work on broadcast packets, IP_UNICAST_IF is
specifically intended to work with broadcast packets.
* Like SO_BINDTODEVICE, bind() restricts both outbound and inbound
traffic.

Signed-off-by: Erich E. Hoover <ehoover@mines.edu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fce823381e3c082ba1b2e15d5151d1aa8afdc9e9 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> udp: Export code sk lookup routines

The UDP diag get_exact handler will require them to find a
socket by provided net, [sd]addr-s, [sd]ports and device.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
59c2cdae2791c0b2ee13d148edc6b771e7e7953f 01-Dec-2011 David S. Miller <davem@davemloft.net> Revert "udp: remove redundant variable"

This reverts commit 81d54ec8479a2c695760da81f05b5a9fb2dbe40a.

If we take the "try_again" goto, due to a checksum error,
the 'len' has already been truncated. So we won't compute
the same values as the original code did.

Reported-by: paul bilke <fsmail@conspiracy.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c8f44affb7244f2ac3e703cab13d55ede27621bb 15-Nov-2011 Michał Mirosław <mirq-linux@rere.qmqm.pl> net: introduce and use netdev_features_t for device features sets

v2: add couple missing conversions in drivers
split unexporting netdev_fix_features()
implemented %pNF
convert sock::sk_route_(no?)caps

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
d826eb14ecef3574b6b3be55e5f4329f4a76fbf3 09-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com> ipv4: PKTINFO doesnt need dst reference

Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :

> At least, in recent kernels we dont change dst->refcnt in forwarding
> patch (usinf NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>

OK I found it, I did some extra tests and believe its ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0ad92ad03aa444b312bd318b0341011a8be09d13 01-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com> udp: fix a race in encap_rcv handling

udp_queue_rcv_skb() has a possible race in encap_rcv handling, since
this pointer can be changed anytime.

We should use ACCESS_ONCE() to close the race.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
73cb88ecb950ee67906d02354f781ea293bcf895 30-Oct-2011 Arjan van de Ven <arjan@linux.intel.com> net: make the tcp and udp file_operations for the /proc stuff const

the tcp and udp code creates a set of struct file_operations at runtime
while it can also be done at compile time, with the added benefit of then
having these file operations be const.

the trickiest part was to get the "THIS_MODULE" reference right; the naive
method of declaring a struct in the place of registration would not work
for this reason.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bdeab991918663aed38757904219e8398214334c 14-Aug-2011 Tom Herbert <therbert@google.com> rps: Add flag to skb to indicate rxhash is based on L4 tuple

The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet. This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
33d480ce6d43326e2541fd79b3548858a174ec3c 11-Aug-2011 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup some rcu_dereference_raw

RCU api had been completed and rcu_access_pointer() or
rcu_dereference_protected() are better than generic
rcu_dereference_raw()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f03d78db65085609938fdb686238867e65003181 07-Jul-2011 Eric Dumazet <eric.dumazet@gmail.com> net: refine {udp|tcp|sctp}_mem limits

Current tcp/udp/sctp global memory limits are not taking into account
hugepages allocations, and allow 50% of ram to be used by buffers of a
single protocol [ not counting space used by sockets / inodes ...]

Lets use nr_free_buffer_pages() and allow a default of 1/8 of kernel ram
per protocol, and a minimum of 128 pages.
Heavy duty machines sysadmins probably need to tweak limits anyway.


References: https://bugzilla.stlinux.com/show_bug.cgi?id=38032
Reported-by: starlight <starlight@binnacle.cx>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9cfaa8def1c795a512bc04f2aec333b03724ca2e 21-Jun-2011 Xufeng Zhang <xufeng.zhang@windriver.com> udp/recvmsg: Clear MSG_TRUNC flag when starting over for a new packet

Consider this scenario: When the size of the first received udp packet
is bigger than the receive buffer, MSG_TRUNC bit is set in msg->msg_flags.
However, if checksum error happens and this is a blocking socket, it will
goto try_again loop to receive the next packet. But if the size of the
next udp packet is smaller than receive buffer, MSG_TRUNC flag should not
be set, but because MSG_TRUNC bit is not cleared in msg->msg_flags before
receive the next packet, MSG_TRUNC is still set, which is wrong.

Fix this problem by clearing MSG_TRUNC flag when starting over for a
new packet.

Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
296f7ea75b45913d5e2067baef0812087e0b6eb9 17-Jun-2011 Satoru Moriya <satoru.moriya@hds.com> udp: add tracepoints for queueing skb to rcvbuf

This patch adds a tracepoint to __udp_queue_rcv_skb to get the
return value of ip_queue_rcv_skb. It indicates why kernel drops
a packet at this point.

ip_queue_rcv_skb returns following values in the packet drop case:

rcvbuf is full : -ENOMEM
sk_filter returns error : -EINVAL, -EACCESS, -ENOMEM, etc.
__sk_mem_schedule returns error: -ENOBUF

Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
71338aa7d050c86d8765cd36e46be514fb0ebbce 23-May-2011 Dan Rosenberg <drosenberg@vsecurity.com> net: convert %p usage to %pK

The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces. Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers. The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs. If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges. Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree. This patch converts users of %p in net/ to %pK. Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
79ab053145ae10edf8f0809d1c788b853c7f6ef1 09-May-2011 David S. Miller <davem@davemloft.net> ipv4: udp: Eliminate remaining uses of rt->rt_src

We already track and pass around the correct flow key,
so simply use it in udp_send_skb().

Signed-off-by: David S. Miller <davem@davemloft.net>
f5fca6086511294653a9e821f8e22f041415ba7b 09-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass flow key down into ip_append_*().

This way rt->rt_dst accesses are unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
77968b78242ee25e2a4d759f0fca8dd52df6d479 09-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass flow keys down into datagram packet building engine.

This way ip_output.c no longer needs rt->rt_{src,dst}.

We already have these keys sitting, ready and waiting, on the stack or
in a socket structure.

Signed-off-by: David S. Miller <davem@davemloft.net>
e474995f290ff7bc236b549aa9a89ae445ee5b1b 09-May-2011 David S. Miller <davem@davemloft.net> udp: Use flow key information instead of rt->rt_{src,dst}

We have two cases.

Either the socket is in TCP_ESTABLISHED state and connect() filled
in the inet socket cork flow, or we looked up the route here and
used an on-stack flow.

Track which one it was, and use it to obtain src/dst addrs.

Signed-off-by: David S. Miller <davem@davemloft.net>
f6d8bd051c391c1c0458a30b2a7abcd939329259 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
b71d1d426d263b0b6cb5760322efebbfc89d4463 22-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
c0951cbcfd7142003bc683046ef78a53c66d3265 31-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4_init_output() in udp_sendmsg()

Signed-off-by: David S. Miller <davem@davemloft.net>
9cce96df5b76691712dba22e83ff5efe900361e1 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put fl4_* macros to struct flowi4 and use them again.

Signed-off-by: David S. Miller <davem@davemloft.net>
b6f21b268026165a239edb5f4d7120eacf7bc593 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4 in UDP

Signed-off-by: David S. Miller <davem@davemloft.net>
9d6ec938019c6b16cb9ec96598ebe8f20de435fe 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4 in public route lookup interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>
6281dcc94a96bd73017b2baa8fa83925405109ef 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Make flowi ports AF dependent.

Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*

This will let us to create AF optimal flow instances.

It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.

Signed-off-by: David S. Miller <davem@davemloft.net>
1d28f42c1bd4bb2363d88df74d0128b4da135b4a 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>
06dc94b1ed05f91e246315afeb1c652d6d0dc9ab 03-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Fix crash in dst_release when udp_sendmsg route lookup fails.

As reported by Eric:

[11483.697233] IP: [<c12b0638>] dst_release+0x18/0x60
...
[11483.697741] Call Trace:
[11483.697764] [<c12fc9d2>] udp_sendmsg+0x282/0x6e0
[11483.697790] [<c12a1c01>] ? memcpy_toiovec+0x51/0x70
[11483.697818] [<c12dbd90>] ? ip_generic_getfrag+0x0/0xb0

The pointer passed to dst_release() is -EINVAL, that's because
we leave an error pointer in the local variable "rt" by accident.

NULL it out to fix the bug.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b23dd4fe42b455af5c6e20966b7d6959fa8352ea 02-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
273447b352e69c327efdecfd6e1d6fe3edbdcd14 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Kill can_sleep arg to ip_route_output_flow()

This boolean state is now available in the flow flags.

Signed-off-by: David S. Miller <davem@davemloft.net>
5df65e5567a497a28067019b8ff08f98fb026629 01-Mar-2011 David S. Miller <davem@davemloft.net> net: Add FLOWI_FLAG_CAN_SLEEP.

And set is in contexts where the route resolution can sleep.

Signed-off-by: David S. Miller <davem@davemloft.net>
420d44daa7aa1cc847e9e527f0a27a9ce61768ca 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"

Since that is what the current vague "flags" argument means.

Signed-off-by: David S. Miller <davem@davemloft.net>
903ab86d195cca295379699299c5fc10beba31c7 01-Mar-2011 Herbert Xu <herbert@gondor.apana.org.au> udp: Add lockless transmit path

The UDP transmit path has been running under the socket lock
for a long time because of the corking feature. This means that
transmitting to the same socket in multiple threads does not
scale at all.

However, as most users don't actually use corking, the locking
can be removed in the common case.

This patch creates a lockless fast path where corking is not used.

Please note that this does create a slight inaccuracy in the
enforcement of socket send buffer limits. In particular, we
may exceed the socket limit by up to (number of CPUs) * (packet
size) because of the way the limit is computed.

As the primary purpose of socket buffers is to indicate congestion,
this should not be a great problem for now.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f6b9664f8b711cf4fd53e70aa0d21f72d5bf806c 01-Mar-2011 Herbert Xu <herbert@gondor.apana.org.au> udp: Switch to ip_finish_skb

This patch converts UDP to use the new ip_finish_skb API. This
would then allows us to more easily use ip_make_skb which allows
UDP to run without a socket lock.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
04ed3e741d0f133e02bed7fa5c98edba128f90e7 25-Jan-2011 Michał Mirosław <mirq-linux@rere.qmqm.pl> net: change netdev->features to u32

Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
55508d601dab7df5cbcc7a63f4be8620eface204 14-Dec-2010 Michał Mirosław <mirq-linux@rere.qmqm.pl> net: Use skb_checksum_start_offset()

Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
fcbdf09d9652c8919dcf47072e3ae7dcb4eb98ac 16-Dec-2010 Octavian Purdila <opurdila@ixiacom.com> net: fix nulls list corruptions in sk_prot_alloc

Special care is taken inside sk_port_alloc to avoid overwriting
skc_node/skc_nulls_node. We should also avoid overwriting
skc_bind_node/skc_portaddr_node.

The patch fixes the following crash:

BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370
[<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360
[<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700
[<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0
[<ffffffff812eda35>] udp_rcv+0x15/0x20
[<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190
[<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0
[<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0
[<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0
[<ffffffff812bb94b>] ip_rcv+0x2bb/0x350
[<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0

Signed-off-by: Leonard Crestez <lcrestez@ixiacom.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5811662b15db018c740c57d037523683fd3e6123 12-Nov-2010 Changli Gao <xiaosuo@gmail.com> net: use the macros defined for the members of flowi

Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c31504dc0d1dc853dcee509d9999169a9097a717 15-Nov-2010 Eric Dumazet <eric.dumazet@gmail.com> udp: use atomic_inc_not_zero_hint

UDP sockets refcount is usually 2, unless an incoming frame is going to
be queued in receive or backlog queue.

Using atomic_inc_not_zero_hint() permits to reduce latency, because
processor issues less memory transactions.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8d987e5c75107ca7515fa19e857cfa24aab6ec8f 10-Nov-2010 Eric Dumazet <eric.dumazet@gmail.com> net: avoid limits overflow

Robin Holt tried to boot a 16TB machine and found some limits were
reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]

We can switch infrastructure to use long "instead" of "int", now
atomic_long_t primitives are available for free.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Robin Holt <holt@sgi.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0d7da9ddd9a4eb7808698d04b98bf9d62d02649b 25-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> net: add __rcu annotation to sk_filter

Add __rcu annotation to :
(struct sock)->sk_filter

And use appropriate rcu primitives to reduce sparse warnings if
CONFIG_SPARSE_RCU_POINTER=y

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
719f835853a92f6090258114a72ffe41f09155cd 08-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> udp: add rehash on connect()

commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).

Problem is that following sequence :

fd = socket(...)
connect(fd, &remote, ...)

not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)

Sequence is :
- autobind() : choose a random local port, insert socket in hash tables
[while local address is INADDR_ANY]
- connect() : set remote address and port, change local address to IP
given by a route lookup.

When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.

One solution to this problem is to rehash datagram socket if needed.

We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.

This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.

Reported-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2244d07bfa2097cb00600da91c715a8aa547917e 17-Aug-2010 Oliver Hartkopp <socketcan@hartkopp.net> net: simplify flags for tx timestamping

This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8d1f30b95a635dbd610dcc5eb641aca8f4768cf 11-Jun-2010 Changli Gao <xiaosuo@gmail.com> net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b1faf5666438090a4dc4fceac8502edc7788b7e3 01-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com> net: sock_queue_err_skb() dont mess with sk_forward_alloc

Correct sk_forward_alloc handling for error_queue would need to use a
backlog of frames that softirq handler could not deliver because socket
is owned by user thread. Or extend backlog processing to be able to
process normal and error packets.

Another possibility is to not use mem charge for error queue, this is
what I implemented in this patch.

Note: this reverts commit 29030374
(net: fix sk_forward_alloc corruptions), since we dont need to lock
socket anymore.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2903037400a26e7c0cc93ab75a7d62abfacdf485 29-May-2010 Eric Dumazet <eric.dumazet@gmail.com> net: fix sk_forward_alloc corruptions

As David found out, sock_queue_err_skb() should be called with socket
lock hold, or we risk sk_forward_alloc corruption, since we use non
atomic operations to update this field.

This patch adds bh_lock_sock()/bh_unlock_sock() pair to three spots.
(BH already disabled)

1) skb_tstamp_tx()
2) Before calling ip_icmp_error(), in __udp4_lib_err()
3) Before calling ipv6_icmp_error(), in __udp6_lib_err()

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8a74ad60a546b13bd1096b2a61a7a5c6fd9ae17c 26-May-2010 Eric Dumazet <eric.dumazet@gmail.com> net: fix lock_sock_bh/unlock_sock_bh

This new sock lock primitive was introduced to speedup some user context
socket manipulation. But it is unsafe to protect two threads, one using
regular lock_sock/release_sock, one using lock_sock_bh/unlock_sock_bh

This patch changes lock_sock_bh to be careful against 'owned' state.
If owned is found to be set, we must take the slow path.
lock_sock_bh() now returns a boolean to say if the slow path was taken,
and this boolean is used at unlock_sock_bh time to call the appropriate
unlock function.

After this change, BH are either disabled or enabled during the
lock_sock_bh/unlock_sock_bh protected section. This might be misleading,
so we rename these functions to lock_sock_fast()/unlock_sock_fast().

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4be929be34f9bdeffa40d815d32d7d60d2c7f03b 24-May-2010 Alexey Dobriyan <adobriyan@gmail.com> kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN

- C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
USHORT_MAX/SHORT_MAX/SHORT_MIN.

- Make SHRT_MIN of type s16, not int, for consistency.

[akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
[akpm@linux-foundation.org: fix security/keys/keyring.c]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e3826f1e946e7d2354943232f1457be1455a29e2 05-May-2010 Amerigo Wang <amwang@redhat.com> net: reserve ports for applications using fixed port numbers

(Dropped the infiniband part, because Tetsuo modified the related code,
I will send a separate patch for it once this is accepted.)

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ccc2d97cb7c798e785c9f198de243e2b59f7073b 06-May-2010 Bjørn Mork <bjorn@mork.no> ipv4: udp: fix short packet and bad checksum logging

commit 2783ef23 moved the initialisation of saddr and daddr after
pskb_may_pull() to avoid a potential data corruption. Unfortunately
also placing it after the short packet and bad checksum error paths,
where these variables are used for logging. The result is bogus
output like

[92238.389505] UDP: short packet: From 2.0.0.0:65535 23715/178 to 0.0.0.0:65535

Moving the saddr and daddr initialisation above the error paths, while still
keeping it after the pskb_may_pull() to keep the fix from commit 2783ef23.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Cc: stable@kernel.org
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f84af32cbca70a3c6d30463dc08c7984af11c277 29-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: ip_queue_rcv_skb() helper

When queueing a skb to socket, we can immediately release its dst if
target socket do not use IP_CMSG_PKTINFO.

tcp_data_queue() can drop dst too.

This to benefit from a hot cache line and avoid the receiver, possibly
on another cpu, to dirty this cache line himself.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4b0b72f7dd617b13abd1b04c947e15873e011a24 28-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: speedup udp receive path

Since commit 95766fff ([UDP]: Add memory accounting.),
each received packet needs one extra sock_lock()/sock_release() pair.

This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.

This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.

skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.

UDP receive path now take the socket spinlock only once.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c377411f2494a931ff7facdbb3a6839b1266bcf6 28-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: sk_add_backlog() take rmem_alloc into account

Current socket backlog limit is not enough to really stop DDOS attacks,
because user thread spend many time to process a full backlog each
round, and user might crazy spin on socket lock.

We should add backlog size and receive_queue size (aka rmem_alloc) to
pace writers, and let user run without being slow down too much.

Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
stress situations.

Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on a 8 core machine.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c58dc01babfd58ec9e71a6ce080150dc27755d88 28-Apr-2010 David S. Miller <davem@davemloft.net> net: Make RFS socket operations not be inet specific.

Idea from Eric Dumazet.

As for placement inside of struct sock, I tried to choose a place
that otherwise has a 32-bit hole on 64-bit systems.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
0eae88f31ca2b88911ce843452054139e028771f 21-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: Fix various endianness glitches

Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fec5e652e58fa6017b2c9e06466cb2a6538de5b4 17-Apr-2010 Tom Herbert <therbert@google.com> rfs: Receive Flow Steering

This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
No RFS or RPS 104K tps at 30% CPU
No RFS (best RPS config): 290K tps at 63% CPU
RFS 303K tps at 61% CPU

RPC test tps CPU% 50/90/99% usec latency Latency StdDev
No RFS/RPS 103K 48% 757/900/3185 4472.35
RPS only: 174K 73% 415/993/2468 491.66
RFS 223K 73% 379/651/1382 315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1223c67c0938d2df309fde618bd82c87c8c1af04 08-Apr-2010 Jorge Boncompte [DTI2] <jorge@dti2.net> udp: fix for unicast RX path optimization

Commits 5051ebd275de672b807c28d93002c2fb0514a3c9 and
5051ebd275de672b807c28d93002c2fb0514a3c9 ("ipv[46]: udp: optimize unicast RX
path") broke some programs.

After upgrading a L2TP server to 2.6.33 it started to fail, tunnels going up an
down, after the 10th tunnel came up. My modified rp-l2tp uses a global
unconnected socket bound to (INADDR_ANY, 1701) and one connected socket per
tunnel after parameter negotiation.

After ten sockets were open and due to mixed parameters to
udp[46]_lib_lookup2() kernel started to drop packets.

Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
a3a858ff18a72a8d388e31ab0d98f7e944841a62 04-Mar-2010 Zhu Yi <yi.zhu@intel.com> net: backlog functions rename

sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog

Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
55349790d7cbf0d381873a7ece1dcafcffd4aaa9 04-Mar-2010 Zhu Yi <yi.zhu@intel.com> udp: use limited socket backlog

Make udp adapt to the limited socket backlog change.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
81d54ec8479a2c695760da81f05b5a9fb2dbe40a 10-Feb-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> udp: remove redundant variable

The variable 'copied' is used in udp_recvmsg() to emphasize that the passed
'len' is adjusted to fit the actual datagram length. But the same can be
done by adjusting 'len' directly. This patch thus removes the indirection.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2c8c1e7297e19bdef3c178c3ea41d898a7716e3e 17-Jan-2010 Alexey Dobriyan <adobriyan@gmail.com> net: spread __net_init, __net_exit

__net_init/__net_exit are apparently not going away, so use them
to full extent.

In some cases __net_init was removed, because it was called from
__net_exit code.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5781b2356cbecb0b73b06ec8c3897cabdfdd0928 14-Dec-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: udp_lib_get_port() fix

Now we can have a large udp hash table, udp_lib_get_port() loop
should be converted to a do {} while (cond) form,
or we dont enter it at all if hash table size is exactly 65536.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9d4fb27db90043cd2640e4bc778f9c755d3c17c1 23-Nov-2009 Joe Perches <joe@perches.com> net/ipv4: Move && and || to end of previous line

On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote:
> It should be of the form:
> if (x &&
> y)
>
> or:
> if (x && y)
>
> Fix patches, rather than complaints, for existing cases where things
> do not follow this pattern are certainly welcome.

Also collapsed some multiple tabs to single space.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
30fff9231fad757c061285e347b33c5149c2c2e4 09-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: bind() optimisation

UDP bind() can be O(N^2) in some pathological cases.

Thanks to secondary hash tables, we can make it O(N)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f6b8f32ca71406de718391369490f6b1e81fe0bb 08-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: multicast RX should increment SNMP/sk_drops counter in allocation failures

When skb_clone() fails, we should increment sk_drops and SNMP counters.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1240d1373cd7f874dd0f3057c3e9643e71ef75c6 08-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> ipv4: udp: Optimise multicast reception

UDP multicast rx path is a bit complex and can hold a spinlock
for a long time.

Using a small (32 or 64 entries) stack of socket pointers can help
to perform expensive operations (skb_clone(), udp_queue_rcv_skb())
outside of the lock, in most cases.

It's also a base for a future RCU conversion of multicast recption.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5051ebd275de672b807c28d93002c2fb0514a3c9 08-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> ipv4: udp: optimize unicast RX path

We first locate the (local port) hash chain head
If few sockets are in this chain, we proceed with previous lookup algo.

If too many sockets are listed, we take a look at the secondary
(port, address) hash chain we added in previous patch.

We choose the shortest chain and proceed with a RCU lookup on the elected chain.

But, if we chose (port, address) chain, and fail to find a socket on given address,
we must try another lookup on (port, INADDR_ANY) chain to find socket not bound
to a particular IP.

-> No extra cost for typical setups, where the first lookup will probabbly
be performed.

RCU lookups everywhere, we dont acquire spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
512615b6b843ff3ff5ad583f34c39b3f302f5f26 08-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: secondary hash on (local port, local address)

Extends udp_table to contain a secondary hash table.

socket anchor for this second hash is free, because UDP
doesnt use skc_bind_node : We define an union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node

udp_lib_get_port() inserts sockets into second hash chain
(additional cost of one atomic op)

udp_lib_unhash() deletes socket from second hash chain
(additional cost of one atomic op)

Note : No spinlock lockdep annotation is needed, because
lock for the secondary hash chain is always get after
lock for primary hash chain.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d4cada4ae1c012815f95fa507eb86a0ae9d607d7 08-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: split sk_hash into two u16 hashes

Union sk_hash with two u16 hashes for udp (no extra memory taken)

One 16 bits hash on (local port) value (the previous udp 'hash')

One 16 bits hash on (local address, local port) values, initialized
but not yet used. This second hash is using jenkin hash for better
distribution.

Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdcc8aa953a1123a289791dd192090651036d593 08-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: add a counter into udp_hslot

Adds a counter in udp_hslot to keep an accurate count
of sockets present in chain.

This will permit to upcoming UDP lookup algo to chose
the shortest chain when secondary hash is added.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9d410c796067686b1e032d54ce475b7055537138 30-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> net: fix sk_forward_alloc corruption

On UDP sockets, we must call skb_free_datagram() with socket locked,
or risk sk_forward_alloc corruption. This requirement is not respected
in SUNRPC.

Add a convenient helper, skb_free_datagram_locked() and use it in SUNRPC

Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8edf19c2fe028563fc6ea9cb1995b8ee4172d4b6 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> net: sk_drops consolidation part 2

- skb_kill_datagram() can increment sk->sk_drops itself, not callers.

- UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c720c7e8383aff1cb219bddf474ed89d850336e3 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> inet: rename some inet_sock fields

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
766e9037cc139ee25ed93ee5ad11e1450c4b99f6 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> net: sk_drops consolidation

sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.

This adds sk_drops managment to many protocols that not cared yet.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
85584672012ee0c3b7b8e033a1ecf7c11878e45f 09-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: Fix udp_poll() and ioctl()

udp_poll() can in some circumstances drop frames with incorrect checksums.

Problem is we now have to lock the socket while dropping frames, or risk
sk_forward corruption.

This bug is present since commit 95766fff6b9a78d1
([UDP]: Add memory accounting.)

While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b885787ea4112eaa80945999ea0901bf742707f 12-Oct-2009 Neil Horman <nhorman@tuxdriver.com> net: Generalize socket rx gap / receive queue overflow cmsg

Create a new socket level option to report number of queue overflows

Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.

Notes:

1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.

2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.

3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f86dcc5aa8c7908f2c287e7a211228df599e3e71 07-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: dynamically size hash tables at boot time

UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for
several setups.

4000 active UDP sockets -> 32 sockets per chain in average. An
incoming frame has to lookup all sockets to find best match, so long
chains hurt latency.

Instead of a fixed size hash table that cant be perfect for every
needs, let UDP stack choose its table size at boot time like tcp/ip
route, using alloc_large_system_hash() helper

Add an optional boot parameter, uhash_entries=x so that an admin can
force a size between 256 and 65536 if needed, like thash_entries and
rhash_entries.

dmesg logs two new lines :
[ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
[ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)

Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
debugging spinlocks.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
914a9ab386a288d0f22252fc268ecbc048cdcbd5 02-Oct-2009 Atis Elsts <atis@mikrotik.com> net: Use sk_mark for routing lookup in more places

This patch against v2.6.31 adds support for route lookup using sk_mark in some
more places. The benefits from this patch are the following.
First, SO_MARK option now has effect on UDP sockets too.
Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
lookup correctly if TCP sockets with SO_MARK were used.

Signed-off-by: Atis Elsts <atis@mikrotik.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
b7058842c940ad2c08dd829b21e5c92ebe3b8758 01-Oct-2009 David S. Miller <davem@davemloft.net> net: Make setsockopt() optlen be unsigned.

This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the the place, in
each and every implementation.

Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.

Signed-off-by: David S. Miller <davem@davemloft.net>
6ce9e7b5fe3195d1ae6e3a0753d4ddcac5cd699e 03-Sep-2009 Eric Dumazet <eric.dumazet@gmail.com> ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at qdisc level where not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.

Example after an UDP tx flood
# netstat -s
...
IP:
1487048 outgoing packets dropped
...
Udp:
...
SndbufErrors: 1487048


send() syscalls, do however still return an OK status, to not
break applications.

Note : send() manual page explicitly says for -ENOBUFS error :

"The output queue for a network interface was full.
This generally indicates that the interface has stopped sending,
but may be caused by transient congestion.
(Normally, this does not occur in Linux. Packets are just silently
dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c482c568577a2b31716e1019f2868bda7cf5629c 17-Jul-2009 Eric Dumazet <eric.dumazet@gmail.com> udp: cleanups

Pure style cleanups.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d7ca4cc01fd154f2da30ae6dae160fa5800af758 09-Jul-2009 Sridhar Samudrala <sri@us.ibm.com> udpv4: Handle large incoming UDP/IPv4 packets and support software UFO.

- validate and forward GSO UDP/IPv4 packets from untrusted sources.
- do software UFO if the outgoing device doesn't support UFO.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
31e6d363abcd0d05766c82f1a9c905a4c974a199 18-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: correct off-by-one write allocations reports

commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.

We need to take into account this offset when reporting
sk_wmem_alloc to user, in PROC_FS files or various
ioctls (SIOCOUTQ/TIOCOUTQ)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
adf30907d63893e4208dfe3f5c88ae12bc2f25d5 02-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
499923c7a3254971873e55a1690d07d3700eea47 09-Apr-2009 Vlad Yasevich <vladislav.yasevich@hp.com> ipv6: Fix NULL pointer dereference with time-wait sockets

Commit b2f5e7cd3dee2ed721bf0675e1a1ddebb849aee6
(ipv6: Fix conflict resolutions during ipv6 binding)
introduced a regression where time-wait sockets were
not treated correctly. This resulted in the following:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000062
IP: [<ffffffff805d7d61>] ipv4_rcv_saddr_equal+0x61/0x70
...
Call Trace:
[<ffffffffa033847b>] ipv6_rcv_saddr_equal+0x1bb/0x250 [ipv6]
[<ffffffffa03505a8>] inet6_csk_bind_conflict+0x88/0xd0 [ipv6]
[<ffffffff805bb18e>] inet_csk_get_port+0x1ee/0x400
[<ffffffffa0319b7f>] inet6_bind+0x1cf/0x3a0 [ipv6]
[<ffffffff8056d17c>] ? sockfd_lookup_light+0x3c/0xd0
[<ffffffff8056ed49>] sys_bind+0x89/0x100
[<ffffffff80613ea2>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[<ffffffff8020bf9b>] system_call_fastpath+0x16/0x1b

Tested-by: Brian Haley <brian.haley@hp.com>
Tested-by: Ed Tomlinson <edt@aei.ca>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b2f5e7cd3dee2ed721bf0675e1a1ddebb849aee6 24-Mar-2009 Vlad Yasevich <vladislav.yasevich@hp.com> ipv6: Fix conflict resolutions during ipv6 binding

The ipv6 version of bind_conflict code calls ipv6_rcv_saddr_equal()
which at times wrongly identified intersections between addresses.
It particularly broke down under a few instances and caused erroneous
bind conflicts.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
30842f2989aacfaba3ccb39829b3417be9313dbe 23-Mar-2009 Vitaly Mayatskikh <v.mayatskih@gmail.com> udp: Wrong locking code in udp seq_file infrastructure

Reading zero bytes from /proc/net/udp or other similar files which use
the same seq_file udp infrastructure panics kernel in that way:

=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
read/1985 is trying to release lock (&table->hash[i].lock) at:
[<ffffffff81321d83>] udp_seq_stop+0x27/0x29
but there are no more locks to release!

other info that might help us debug this:
1 lock held by read/1985:
#0: (&p->lock){--..}, at: [<ffffffff810eefb6>] seq_read+0x38/0x348

stack backtrace:
Pid: 1985, comm: read Not tainted 2.6.29-rc8 #9
Call Trace:
[<ffffffff81321d83>] ? udp_seq_stop+0x27/0x29
[<ffffffff8106dab9>] print_unlock_inbalance_bug+0xd6/0xe1
[<ffffffff8106db62>] lock_release_non_nested+0x9e/0x1c6
[<ffffffff810ef030>] ? seq_read+0xb2/0x348
[<ffffffff8106bdba>] ? mark_held_locks+0x68/0x86
[<ffffffff81321d83>] ? udp_seq_stop+0x27/0x29
[<ffffffff8106dde7>] lock_release+0x15d/0x189
[<ffffffff8137163c>] _spin_unlock_bh+0x1e/0x34
[<ffffffff81321d83>] udp_seq_stop+0x27/0x29
[<ffffffff810ef239>] seq_read+0x2bb/0x348
[<ffffffff810eef7e>] ? seq_read+0x0/0x348
[<ffffffff8111aedd>] proc_reg_read+0x90/0xaf
[<ffffffff810d878f>] vfs_read+0xa6/0x103
[<ffffffff8106bfac>] ? trace_hardirqs_on_caller+0x12f/0x153
[<ffffffff810d88a2>] sys_read+0x45/0x69
[<ffffffff8101123a>] system_call_fastpath+0x16/0x1b
BUG: scheduling while atomic: read/1985/0xffffff00
INFO: lockdep is turned off.
Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table dm_multipath kvm ppdev snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event arc4 snd_s
eq ecb thinkpad_acpi snd_seq_device iwl3945 hwmon sdhci_pci snd_pcm_oss sdhci rfkill mmc_core snd_mixer_oss i2c_i801 mac80211 yenta_socket ricoh_mmc i2c_core iTCO_wdt snd_pcm iTCO_vendor_support rs
rc_nonstatic snd_timer snd lib80211 cfg80211 soundcore snd_page_alloc video parport_pc output parport e1000e [last unloaded: scsi_wait_scan]
Pid: 1985, comm: read Not tainted 2.6.29-rc8 #9
Call Trace:
[<ffffffff8106b456>] ? __debug_show_held_locks+0x1b/0x24
[<ffffffff81043660>] __schedule_bug+0x7e/0x83
[<ffffffff8136ede9>] schedule+0xce/0x838
[<ffffffff810d7972>] ? fsnotify_access+0x5f/0x67
[<ffffffff810112d0>] ? sysret_careful+0xb/0x37
[<ffffffff8106be9c>] ? trace_hardirqs_on_caller+0x1f/0x153
[<ffffffff8137127b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff810112f6>] sysret_careful+0x31/0x37
read[1985]: segfault at 7fffc479bfe8 ip 0000003e7420a180 sp 00007fffc479bfa0 error 6
Kernel panic - not syncing: Aiee, killing interrupt handler!

udp_seq_stop() tries to unlock not yet locked spinlock. The lock was lost
during splitting global udp_hash_lock to subsequent spinlocks.

Signed-off by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ead2ceb0ec9f85cff19c43b5cdb2f8a054484431 11-Mar-2009 Neil Horman <nhorman@tuxdriver.com> Network Drop Monitor: Adding kfree_skb_clean for non-drops and modifying end-of-line points for skbs

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

include/linux/skbuff.h | 4 +++-
net/core/datagram.c | 2 +-
net/core/skbuff.c | 22 ++++++++++++++++++++++
net/ipv4/arp.c | 2 +-
net/ipv4/udp.c | 2 +-
net/packet/af_packet.c | 2 +-
6 files changed, 29 insertions(+), 5 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
51f31cabe3ce5345b51e4a4f82138b38c4d5dc91 12-Feb-2009 Patrick Ohly <patrick.ohly@intel.com> ip: support for TX timestamps on UDP and RAW sockets

Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2783ef23128ad0a4b34e4121c1f7ff664785712f 06-Feb-2009 Jesper Dangaard Brouer <hawk@comx.dk> udp: Fix potential wrong ip_hdr(skb) pointers

Like the UDP header fix, pskb_may_pull() can potentially
alter the SKB buffer. Thus the saddr and daddr, pointers
may point to the old skb->data buffer.

I haven't seen corruptions, as its only seen if the old
skb->data buffer were reallocated by another user and
written into very quickly (or poison'd by SLAB debugging).

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
7b5e56f9d635643ad54f2f42e69ad16b80a2cff1 06-Feb-2009 Jesper Dangaard Brouer <hawk@comx.dk> udp: Fix UDP short packet false positive

The UDP header pointer assignment must happen after calling
pskb_may_pull(). As pskb_may_pull() can potentially alter the SKB
buffer.

This was exposted by running multicast traffic through the NIU driver,
as it won't prepull the protocol headers into the linear area on
receive.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
e408b8dcb5ce42243a902205005208e590f28454 02-Feb-2009 Eric Dumazet <dada1@cosmosbay.com> udp: increments sk_drops in __udp_queue_rcv_skb()

Commit 93821778def10ec1e69aa3ac10adee975dad4ff3 (udp: Fix rcv socket
locking) accidentally removed sk_drops increments for UDP IPV4
sockets.

This field can be used to detect incorrect sizing of socket receive
buffers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
98322f22eca889478045cf896b572250d03dc45f 27-Jan-2009 Eric Dumazet <dada1@cosmosbay.com> udp: optimize bind(0) if many ports are in use

commit 9088c5609584684149f3fb5b065aa7f18dcb03ff
(udp: Improve port randomization) introduced a regression for UDP bind() syscall
to null port (getting a random port) in case lot of ports are already in use.

This is because we do about 28000 scans of very long chains (220 sockets per chain),
with many spin_lock_bh()/spin_unlock_bh() calls.

Fix this using a bitmap (64 bytes for current value of UDP_HTABLE_SIZE)
so that we scan chains at most once.

Instead of 250 ms per bind() call, we get after patch a time of 2.9 ms

Based on a report from Vitaly Mayatskikh

Reported-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
723b46108f8ee75b61ce703d0c9225e4f537bc46 25-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: udp_unhash() can test if sk is hashed

Impact: Optimization

Like done in inet_unhash(), we can avoid taking a chain lock if
socket is not hashed in udp_unhash()

Triggered by close(socket(AF_INET, SOCK_DGRAM, 0));

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2e77d89b2fa8e3f8325b8ce7893ec3645f41aff5 25-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: avoid a pair of dst_hold()/dst_release() in ip_append_data()

We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_append_data() eventually steal the refcount its
callers had to take on the dst entry.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a134f85c131ffd56720e38af2967ec6265480757 20-Nov-2008 Balazs Scheidler <bazsi@balabit.hu> TPROXY: fill struct flowi->flags in udp_sendmsg()

udp_sendmsg() didn't fill struct flowi->flags, which means that
the route lookup would fail for non-local IPs even if the
IP_TRANSPARENT sockopt was set.

This prevents sendto() to work properly for UDP sockets, whereas
bind(foreign-ip) + connect() + send() worked fine.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
88ab1932eac721c6e7336708558fa5ed02c85c80 17-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> udp: Use hlist_nulls in UDP RCU code

This is a straightforward patch, using hlist_nulls infrastructure.

RCUification already done on UDP two weeks ago.

Using hlist_nulls permits us to avoid some memory barriers, both
at lookup time and delete time.

Patch is large because it adds new macros to include/net/sock.h.
These macros will be used by TCP & DCCP in next patch.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
920a46115ca3fa88990276d98520abab85495b2d 02-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> udp: multicast packets need to check namespace

Current UDP multicast delivery is not namespace aware.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
c37ccc0d4e2a4ee52f1a40cff1be0049f2104bba 02-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> udp: add a missing smp_wmb() in udp_lib_get_port()

Corey Minyard spotted a missing memory barrier in udp_lib_get_port()

We need to make sure a reader cannot read the new 'sk->sk_next' value
and previous value of 'sk->sk_hash'. Or else, an item could be deleted
from a chain, and inserted into another chain. If new chain was empty
before the move, 'next' pointer is NULL, and lockless reader can
not detect it missed following items in original chain.

This patch is temporary, since we expect an upcoming patch
to introduce another way of handling the problem.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
673d57e72398edfedc93fb50ff58048077c9d587 31-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com> net: replace NIPQUAD() in net/ipv4/ net/ipv6/

Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c8db3fec5b02f4cefe441903fe1c142ff14e1771 30-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: Should use spin_lock_bh()/spin_unlock_bh() in udp_lib_unhash()

Spotted by Alexander Beregalov

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
96631ed16c514cf8b28fab991a076985ce378c26 29-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: introduce sk_for_each_rcu_safenext()

Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d
(udp: RCU handling for Unicast packets.)

"If the socket is moved from one list to another list in-between the
time the hash is calculated and the next field is accessed, and the
socket has moved to the end of the new list, the traversal will not
complete properly on the list it should have, since the socket will
be on the end of the new list and there's not a way to tell it's on a
new list and restart the list traversal. I think that this can be
solved by pre-fetching the "next" field (with proper barriers) before
checking the hash."

This patch corrects this problem, introducing a new
sk_for_each_rcu_safenext() macro.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f52b5054ec108aaa9e903850d6b62af8ae3fe6ae 29-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: udp_get_next() should use spin_unlock_bh()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8203efb3c612743fecb1ed67cf1daf9d9c127462 29-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: calculate udp_mem based on low memory instead of all memory

This patch mimics commit 57413ebc4e0f1e471a3b4db4aff9a85c083d090e
(tcp: calculate tcp_mem based on low memory instead of all memory)

The udp_mem array which contains limits on the total amount of memory
used by UDP sockets is calculated based on nr_all_pages. On a 32 bits
x86 system, we should base this on the number of lowmem pages.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
271b72c7fa82c2c7a795bc16896149933110672d 29-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: RCU handling for Unicast packets.

Goals are :

1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.

Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.

2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.

- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Cristopher Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.

Theory of operation :
---------------------

As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.

Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.

In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the begining. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we dont want to send same message several times to the same socket.

We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
645ca708f936b2fbeb79e52d7823e3eb2c0905f8 29-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: introduce struct udp_table and multiple spinlocks

UDP sockets are hashed in a 128 slots hash table.

This hash table is protected by *one* rwlock.

This rwlock is readlocked each time an incoming UDP message is handled.

This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (close time)

This is not scalable on SMP machines :

1) Even in read mode, lock() and unlock() are atomic operations and
must dirty a contended cache line, shared by all cpus.

2) A writer might be starved if many readers are 'in flight'. This can
happen on a machine with some NIC receiving many UDP messages. User
process can be delayed a long time at socket creation/dismantle time.

This patch prepares RCU migration, by introducing 'struct udp_table
and struct udp_hslot', and using one spinlock per chain, to reduce
contention on central rwlock.

Introducing one spinlock per chain reduces latencies, for port
randomization on heavily loaded UDP servers. This also speedup
bindings to specific ports.

udp_lib_unhash() was uninlined, becoming to big.

Some cleanups were done to ease review of following patch
(RCUification of UDP Unicast lookups)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
113aa838ec3a235d883f8357d31d90e16c47fc89 14-Oct-2008 Alan Cox <alan@redhat.com> net: Rationalise email address: Network Specific Parts

Clean up the various different email addresses of mine listed in the code
to a single current and valid address. As Dave says his network merges
for 2.6.28 are now done this seems a good point to send them in where
they won't risk disrupting real changes.

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f24d43c07e208372aa3d3bff419afbf43ba87698 09-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: complete port availability checking

While looking at UDP port randomization, I noticed it
was litle bit pessimistic, not looking at type of sockets
(IPV6/IPV4) and not looking at bound addresses if any.

We should perform same tests than when binding to a
specific port.

This permits a cleanup of udp_lib_get_port()

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9088c5609584684149f3fb5b065aa7f18dcb03ff 08-Oct-2008 Eric Dumazet <dada1@cosmosbay.com> udp: Improve port randomization

Current UDP port allocation is suboptimal.
We select the shortest chain to chose a port (out of 512)
that will hash in this shortest chain.

First, it can lead to give not so ramdom ports and ease
give attackers more opportunities to break the system.

Second, it can consume a lot of CPU to scan all table
in order to find the shortest chain.

Third, in some pathological cases we can fail to find
a free port even if they are plenty of them.

This patch zap the search for a short chain and only
use one random seed. Problem of getting long chains
should be addressed in another way, since we can
obtain long chains with non random ports.

Based on a report and patch from Vitaly Mayatskikh

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c7ed677fb7013c8028045d409a48ac42151187a 07-Oct-2008 Denis V. Lunev <den@openvz.org> netns: make udpv6 mib per/namespace

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
23542618deb77cfed312842fe8c41ed19fb16470 07-Oct-2008 KOVACS Krisztian <hidden@sch.bme.hu> inet: Don't lookup the socket if there's a socket attached to the skb

Use the socket cached in the skb if it's present.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
607c4aaf03041c8bd81555a0218050c0f895088e 07-Oct-2008 KOVACS Krisztian <hidden@sch.bme.hu> inet: Add udplib_lookup_skb() helpers

To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bcd41303f422015ab662c9276d108414aa75b796 01-Oct-2008 KOVACS Krisztian <hidden@sch.bme.hu> udp: Export UDP socket lookup function

The iptables tproxy code has to be able to do UDP socket hash lookups,
so we have to provide an exported lookup function for this purpose.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
93821778def10ec1e69aa3ac10adee975dad4ff3 15-Sep-2008 Herbert Xu <herbert@gondor.apana.org.au> udp: Fix rcv socket locking

The previous patch in response to the recursive locking on IPsec
reception is broken as it tries to drop the BH socket lock while in
user context.

This patch fixes it by shrinking the section protected by the
socket lock to sock_queue_rcv_skb only. The only reason we added
the lock is for the accounting which happens in that function.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
d97106ea52aa57e63ff40d04479016836bbb5a4e 09-Aug-2008 Herbert Xu <herbert@gondor.apana.org.au> udp: Drop socket lock for encapsulated packets

The socket lock is there to protect the normal UDP receive path.
Encapsulation UDP sockets don't need that protection. In fact
the locking is deadly for them as they may contain another UDP
packet within, possibly with the same addresses.

Also the nested bit was copied from TCP. TCP needs it because
of accept(2) spawning sockets. This simply doesn't apply to UDP
so I've removed it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
47112e25da41d9059626033986dc3353e101f815 21-Jul-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> udplite: Protection against coverage value wrap-around

This patch clamps the cscov setsockopt values to a maximum of 0xFFFF.

Setsockopt values greater than 0xffff can cause an unwanted
wrap-around. Further, IPv6 jumbograms are not supported (RFC 3838,
3.5), so that values greater than 0xffff are not even useful.

Further changes: fixed a typo in the documentation.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2f275f91a438abd8eec5321798d66a4ffe6869fa 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put udp statistics on struct net

Similar to... ouch, I repeat myself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7c73a6faffae0bfae70639113aecf06af666e714 17-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: add net to IP_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
84a3aa000eacbaf841d745b07ef3a3280899056b 17-Jul-2008 Pavel Emelyanov <xemul@openvz.org> ipv4: prepare net initialization for IP accounting

Some places, that deal with IP statistics already have where to
get a struct net from, but use it directly, without declaring
a separate variable on the stack.

So, save this net on the stack for future IP_XXX_STATS macros.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dcfc23cac103b54dbc00a6f52f47656ad5c75844 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: add struct net to ICMP_INC_STATS_BH

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fd54d716b1f6a3551ec17a4bb34027727b2db09a 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org> inet: toss struct net initialization around

Some places, that deal with ICMP statistics already have where
to get a struct net from, but use it directly, without declaring
a separate variable on the stack.

Since I will need this net soon, I declare a struct net on the
stack and use it in the existing places in a separate patch not
to spoil the future ones.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0283328e2360bbf3081e97f1b720dd4c4dbae111 06-Jul-2008 Pavel Emelyanov <xemul@openvz.org> MIB: add struct net to UDP_INC_STATS_BH

Two special cases here - one is rxrpc - I put init_net there
explicitly, since we haven't touched this part yet. The second
place is in __udp4_lib_rcv - we already have a struct net there,
but I have to move its initialization above to make it ready
at the "drop" label.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
629ca23c331ec75ac87b016debbb3c4d2fe62650 06-Jul-2008 Pavel Emelyanov <xemul@openvz.org> MIB: add struct net to UDP_INC_STATS_USER

Nothing special - all the places already have a struct sock
at hands, so use the sock_net() net.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb61cb9b8b5ef6c2697d84e5015e314626eb2fba 18-Jun-2008 Eric Dumazet <dada1@cosmosbay.com> udp: sk_drops handling

In commits 33c732c36169d7022ad7d6eb474b0c9be43a2dc1 ([IPV4]: Add raw
drops counter) and a92aa318b4b369091fd80433c80e62838db8bc1c ([IPV6]:
Add raw drops counter), Wang Chen added raw drops counter for
/proc/net/raw & /proc/net/raw6

This patch adds this capability to UDP sockets too (/proc/net/udp &
/proc/net/udp6).

This means that 'RcvbufErrors' errors found in /proc/net/snmp can be also
be examined for each udp socket.

# grep Udp: /proc/net/snmp
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 23971006 75 899420 16390693 146348 0

# cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt ---
uid timeout inode ref pointer drops
75: 00000000:02CB 00000000:0000 07 00000000:00000000 00:00000000 00000000 ---
0 0 2358 2 ffff81082a538c80 0
111: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 ---
0 0 2286 2 ffff81042dd35c80 146348

In this example, only port 111 (0x006F) was flooded by messages that
user program could not read fast enough. 146348 messages were lost.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
19c7578fb22b0aef103222cae9b522f03ae489d6 17-Jun-2008 Pavel Emelyanov <xemul@openvz.org> udp: add struct net argument to udp_hashfn

Every caller already has this one. The new argument is currently
unused, but this will be fixed shortly.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
e31634931d00081c75e3fb3f3ec51a50dbf108bb 17-Jun-2008 Pavel Emelyanov <xemul@openvz.org> udp: provide a struct net pointer for __udp[46]_lib_mcast_deliver

They both calculate the hash chain, but currently do not have
a struct net pointer, so pass one there via additional argument,
all the more so their callers already have such.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
d6266281f8175e3ad68c28b20a609b278b47ade5 17-Jun-2008 Pavel Emelyanov <xemul@openvz.org> udp: introduce a udp_hashfn function

Currently the chain to store a UDP socket is calculated with
simple (x & (UDP_HTABLE_SIZE - 1)). But taking net into account
would make this calculation a bit more complex, so moving it into
a function would help.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7d06b2e053d2d536348e3a0f6bb02982a41bea37 15-Jun-2008 Brian Haley <brian.haley@hp.com> net: change proto destroy method to return void

Change struct proto destroy function pointer to return void. Noticed
by Al Viro.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0b040829952d84bf2a62526f0e24b624e0699447 11-Jun-2008 Adrian Bunk <bunk@kernel.org> net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
36d926b94a9908937593e5669162305a071b9cc3 04-Jun-2008 Denis V. Lunev <den@openvz.org> [IPV6]: inet_sk(sk)->cork.opt leak

IPv6 UDP sockets wth IPv4 mapped address use udp_sendmsg to send the data
actually. In this case ip_flush_pending_frames should be called instead
of ip6_flush_pending_frames.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
84841c3c6cf2ed9a8e0dfd842fb35ef6e2bce091 02-May-2008 Denis V. Lunev <den@openvz.org> ipv4: assign PDE->data before gluing PDE into /proc tree

The check for PDE->data != NULL becomes useless after the replacement
of proc_net_fops_create with proc_create_data.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5e659e4cb0eedacdc1f621a61e400a4611ddef8a 24-Apr-2008 Pavel Emelyanov <xemul@openvz.org> [NET]: Fix heavy stack usage in seq_file output routines.

Plan C: we can follow the Al Viro's proposal about %n like in this patch.
The same applies to udp, fib (the /proc/net/route file), rt_cache and
sctp debug. This is minus ~150-200 bytes for each.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a7d632b6b4ad1c92746ed409e41f9dc571ec04e2 14-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.

And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
c29a0bc4dfc4d833eb702b1929cec96a3eeb9f7a 01-Apr-2008 Pavel Emelyanov <xemul@openvz.org> [SOCK][NETNS]: Add a struct net argument to sock_prot_inuse_add and _get.

This counter is about to become per-proto-and-per-net, so we'll need
two arguments to determine which cell in this "table" to work with.

All the places, but proc already pass proper net to it - proc will be
tuned a bit later.

Some indentation with spaces in proc files is done to keep the file
coding style consistent.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
b50660f1fe4ebd6129064e4fba0bd882b60c2425 01-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IP] UDP: Use SEQ_START_TOKEN.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4ad96d39a2d74c1b2e400b602da2594f5098fc26 29-Mar-2008 Denis V. Lunev <den@openvz.org> [UDP]: Remove owner from udp_seq_afinfo.

Move it to udp_seq_afinfo->seq_fops as should be.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ba9441bdf07370670a684e6d95dfc523476677f 29-Mar-2008 Denis V. Lunev <den@openvz.org> [UDP]: Place file operations directly into udp_seq_afinfo.

No need to have separate never-used variable.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a2be75c182cd16bc104919e33451ffda8b6f65b4 29-Mar-2008 Denis V. Lunev <den@openvz.org> [UDP]: Cleanup /proc/udp[6] creation/removal.

Replace seq_open with seq_open_net and remove udp_seq_release
completely. seq_release_net will do this job just fine.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dda61925f84d89e2f2a4597d6298a05a2bc05c20 29-Mar-2008 Denis V. Lunev <den@openvz.org> [UDP]: Move seq_ops from udp_iter_state to udp_seq_afinfo.

No need to create seq_operations for each instance of 'netstat'.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
997feb5e7a24ce155c1d3bea9d0b5c6530bb6197 29-Mar-2008 Denis V. Lunev <den@openvz.org> [UDP]: No need to check afinfo != NULL in udp_proc_(un)register.

udp_proc_register/udp_proc_unregister are called with a static pointer only.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6f191efe48af62dd5f8917dd21d187f896cd6c81 29-Mar-2008 Denis V. Lunev <den@openvz.org> [UDP]: Replace struct net on udp_iter_state with seq_net_private.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bdcde3d71a67e97f25e851f3ca97c9bb5ef03e7f 29-Mar-2008 Pavel Emelyanov <xemul@openvz.org> [SOCK]: Drop inuse pcounter from struct proto (v2).

An uppercut - do not use the pcounter on struct proto.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
878628fbf2589eb24357e42027d5f54b1dafd3c8 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit namespace comparision without CONFIG_NET_NS.

Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.

We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
3b1e0a655f8eba44ab1ee2a1068d169ccfb853b9 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
c346dca10840a874240c78efe3f39acf4312a1f2 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.

Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
05cf89d40c85e622dac20e44713168767be5c520 24-Mar-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Process INET socket layer in the correct namespace.

Replace all the reast of the init_net with a proper net on the socket
layer.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7a6adb92fe301c10ca4dbd0d9f2422f5880595e7 24-Mar-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_cmsg_send.

Pass the init_net there for now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
15439febb0bd530f85e40ad6fa8e9f75106639ef 24-Mar-2008 Pavel Emelyanov <xemul@openvz.org> [NETNS][UDP]: Register /proc/net/udp in a namespace.

After the commit a91275eff43a527e1a25d6d034cbcd19ee323e64 ([NETNS][IPV6]
udp - make proc handle the network namespace) it is now possible to make
this file present in newly created namespaces.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ba5a3c52da00015e739469e3b00cd6d0d4c5c67 23-Mar-2008 Pavel Emelyanov <xemul@openvz.org> [UDP]: Make full use of proto.h.udp_hash innovation.

After this we have only udp_lib_get_port to get the port and two
stubs for ipv4 and ipv6. No difference in udp and udplite except
for initialized h.udp_hash member.

I tried to find a graceful way to drop the only difference between
udp_v4_get_port and udp_v6_get_port (i.e. the rcv_saddr comparison
routine), but adding one more callback on the struct proto didn't
appear such :( Maybe later.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
28518fc1701a757a3df8aa2d2ac2e5d1efd1c3e5 21-Mar-2008 Pavel Emelyanov <xemul@openvz.org> [NET]: NULL pointer dereference and other nasty things in /proc/net/(tcp|udp)[6]

Commits f40c81 ([NETNS][IPV4] tcp - make proc handle the network
namespaces) and a91275 ([NETNS][IPV6] udp - make proc handle the
network namespace) both introduced bad checks on sockets and tw
buckets to belong to proper net namespace.

I.e. when checking for socket to belong to given net and family the

do {
sk = sk_next(sk);
} while (sk && sk->sk_net != net && sk->sk_family != family);

constructions were used. This is wrong, since as soon as the
sk->sk_net fits the net the socket is immediately returned, even if it
belongs to other family.

As the result four /proc/net/(udp|tcp)[6] entries show wrong info.
The udp6 entry even oopses when dereferencing inet6_sk(sk) pointer:

static void udp6_sock_seq_show(struct seq_file *seq, struct sock *sp, int bucket)
{
...
struct ipv6_pinfo *np = inet6_sk(sp);
...

dest = &np->daddr; /* will be NULL for AF_INET sockets */
...
seq_printf(...
dest->s6_addr32[0], dest->s6_addr32[1],
dest->s6_addr32[2], dest->s6_addr32[3],
...

Fix it by converting && to ||.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c96d8c50bffb7f02690dd8a8cf1adb8e07e100f 21-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com> [NETNS][IPV6] udp6 - make proc per namespace

The proc init/exit functions take a new network namespace parameter in
order to register/unregister /proc/net/udp6 for a namespace.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a91275eff43a527e1a25d6d034cbcd19ee323e64 21-Mar-2008 Daniel Lezcano <dlezcano@fr.ibm.com> [NETNS][IPV6] udp - make proc handle the network namespace

This patch makes the common udp proc functions to take care of which
socket they should show taking into account the namespace it belongs.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
db8dac20d5199307dcfcf4e01dac4bda5edf9e89 07-Mar-2008 David S. Miller <davem@davemloft.net> [UDP]: Revert udplite and code split.

This reverts commit db1ed684f6c430c4cdad67d058688b8a1b5e607c ("[IPV6]
UDP: Rename IPv6 UDP files."), commit
8be8af8fa4405652e6c0797db5465a4be8afb998 ("[IPV4] UDP: Move
IPv4-specific bits to other file.") and commit
e898d4db2749c6052072e9bc4448e396cbdeb06a ("[UDP]: Allow users to
configure UDP-Lite.").

First, udplite is of such small cost, and it is a core protocol just
like TCP and normal UDP are.

We spent enormous amounts of effort to make udplite share as much code
with core UDP as possible. All of that work is less valuable if we're
just going to slap a config option on udplite support.

It is also causing build failures, as reported on linux-next, showing
that the changeset was not tested very well. In fact, this is the
second build failure resulting from the udplite change.

Finally, the config options provided was a bool, instead of a modular
option. Meaning the udplite code does not even get build tested
by allmodconfig builds, and furthermore the user is not presented
with a reasonable modular build option which is particularly needed
by distribution vendors.

Signed-off-by: David S. Miller <davem@davemloft.net>
8be8af8fa4405652e6c0797db5465a4be8afb998 04-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4] UDP: Move IPv4-specific bits to other file.

Move IPv4-specific UDP bits from net/ipv4/udp.c into (new) net/ipv4/udp_ipv4.c.
Rename net/ipv4/udplite.c to net/ipv4/udplite_ipv4.c.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
e898d4db2749c6052072e9bc4448e396cbdeb06a 29-Feb-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [UDP]: Allow users to configure UDP-Lite.

Let's give users an option for disabling UDP-Lite (~4K).

old:
| text data bss dec hex filename
| 286498 12432 6072 305002 4a76a net/ipv4/built-in.o
| 193830 8192 3204 205226 321aa net/ipv6/ipv6.o

new (without UDP-Lite):
| text data bss dec hex filename
| 284086 12136 5432 301654 49a56 net/ipv4/built-in.o
| 191835 7832 3076 202743 317f7 net/ipv6/ipv6.o

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
fa4d3c6210380c55cf7f295d28fd981fbcbb828c 31-Jan-2008 Pavel Emelyanov <xemul@openvz.org> [NETNS]: Udp sockets per-net lookup.

Add the net parameter to udp_get_port family of calls and
udp_lookup one and use it to filter sockets.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1b050bf7a88910f9f00c9c8989c1bf5a67dd140 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_route_output_flow.

Needed to propagate it down to the __ip_route_output_key.

Signed_off_by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fc80be87dca8968fa087aae81544ba3f86afdd50 22-Jan-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4] UDP,UDPLITE: Sparse: {__udp4_lib,udp,udplite}_err() are of void.

Fix following sparse warnings:
| net/ipv4/udp.c:421:2: warning: returning void-valued expression
| net/ipv4/udplite.c:38:2: warning: returning void-valued expression

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
65f7651788e18fadb2fbb7276af935d7871e1803 04-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [NET]: prot_inuse cleanups and optimizations

1) Cleanups (all functions are prefixed by sock_prot_inuse)

sock_prot_inc_use(prot) -> sock_prot_inuse_add(prot,-1)
sock_prot_dec_use(prot) -> sock_prot_inuse_add(prot,-1)
sock_prot_inuse() -> sock_prot_inuse_get()

New functions :

sock_prot_inuse_init() and sock_prot_inuse_free() to abstract pcounter use.

2) if CONFIG_PROC_FS=n, we can zap 'inuse' member from "struct proto",
since nobody wants to read the inuse value.

This saves 1372 bytes on i386/SMP and some cpu cycles.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9a429c4983deae020f1e757ecc8f547b6d4e2f2b 02-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [NET]: Add some acquires/releases sparse annotations.

Add __acquires() and __releases() annotations to suppress some sparse
warnings.

example of warnings :

net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
95766fff6b9a78d11fc2d3812dd035381690b55d 31-Dec-2007 Hideo Aoki <haoki@redhat.com> [UDP]: Add memory accounting.

Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Hideo Aoki <haoki@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f97c1e0c6ebdb606c97b6cb5e837c6110ac5a961 16-Dec-2007 Joe Perches <joe@perches.com> [IPV4] net/ipv4: Use ipv4_is_<type>

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9055e051b8d4b266054fe511a65a9888d30fa64f 14-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Move udp_stats_in6 into net/ipv4/udp.c

Now that external users may increment the counters directly, we need
to ensure that udp_stats_in6 is always available. Otherwise we'd
either have to requrie the external users to be built as modules or
ipv6 to be built-in.

This isn't too bad because udp_stats_in6 is just a pair of pointers
plus an EXPORT, e.g., just 40 (16 + 24) bytes on x86-64.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
a59322be07c964e916d15be3df473fb7ba20c41e 05-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Only increment counter on first peek/recv

The previous move of the the UDP inDatagrams counter caused each
peek of the same packet to be counted separately. This may be
undesirable.

This patch fixes this by adding a bit to sk_buff to record whether
this packet has already been seen through skb_recv_datagram. We
then only increment the counter when the packet is seen for the
first time.

The only dodgy part is the fact that skb_recv_datagram doesn't have
a good way of returning this new bit of information. So I've added
a new function __skb_recv_datagram that does return this and made
skb_recv_datagram a wrapper around it.

The plan is to eventually replace all uses of skb_recv_datagram with
this new function at which time it can be renamed its proper name.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
1781f7f5804e52ee2d35328b129602146a8d8254 11-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Restore missing inDatagrams increments

The previous move of the the UDP inDatagrams counter caused the
counting of encapsulated packets, SUNRPC data (as opposed to call)
packets and RXRPC packets to go missing.

This patch restores all of these.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
27ab2568649d5ba6c5a20212079b7c4f6da4ca0d 05-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Avoid repeated counting of checksum errors due to peeking

Currently it is possible for two processes to peek on the same socket
and end up incrementing the error counter twice for the same packet.

This patch fixes it by making skb_kill_datagram return whether it
succeeded in unlinking the packet and only incrementing the counter
if it did.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
bbca17680f82b1ba3122d41e8bc675aebf6d1cf2 03-Dec-2007 Wang Chen <wangchen@cn.fujitsu.com> [UDP]: Counter increment should be in USER mode for recvmsg

System calls should be USER. So change the BH to USER for
UDP*_INC_STATS_BH().

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
b2bf1e2659b1cba5e65f81781cfd530be447f80b 03-Dec-2007 Wang Chen <wangchen@cn.fujitsu.com> [UDP]: Clean up for IS_UDPLITE macro

Since we have macro IS_UDPLITE, we can use it.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb75994ec311b2cd50e5205efdcc0696abd6675d 03-Dec-2007 Wang Chen <wangchen@cn.fujitsu.com> [UDP]: Defer InDataGrams increment until recvmsg() does checksum

Thanks dave, herbert, gerrit, andi and other people for your
discussion about this problem.

UdpInDatagrams can be confusing because it counts packets that
might be dropped later.
Move UdpInDatagrams into recvmsg() as allowed by the RFC.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
47a31a6ffcca3b55149bccd5b99763e5eea60ac4 06-Nov-2007 Eric Dumazet <dada1@cosmosbay.com> [IPV4]: Use the {DEFINE|REF}_PROTO_INUSE infrastructure

Trivial patch to make "tcp,udp,udplite,raw" protocols uses the fast
"inuse sockets" infrastructure

Each protocol use then a static percpu var, instead of a dynamic one.
This saves some ram and some cpu cycles

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fee9dee730a40f671c1972a324ed54f0d68523e1 26-Oct-2007 Vlad Yasevich <vladislav.yasevich@hp.com> [UDP]: Make use of inet_iif() when doing socket lookups.

UDP currently uses skb->dev->ifindex which may provide the wrong
information when the socket bound to a specific interface.
This patch makes inet_iif() accessible to UDP and makes UDP use it.

The scenario we are trying to fix is when a client is running on
the same system and the server and both client and server bind to
a non-loopback device.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Acked-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a25de534f89c515c82d3553c42d3bb02c2d1a7da 19-Oct-2007 Anton Arapov <aarapov@redhat.com> [INET]: Justification for local port range robustness.

There is a justifying patch for Stephen's patches. Stephen's patches
disallows using a port range of one single port and brakes the meaning
of the 'remaining' variable, in some places it has different meaning.
My patch gives back the sense of 'remaining' variable. It should mean
how many ports are remaining and nothing else. Also my patch allows
using a single port.

I sure we must be able to use mentioned port range, this does not
restricted by documentation and does not brake current behavior.

usefull links:
Patches posted by Stephen Hemminger
http://marc.info/?l=linux-netdev&m=119206106218187&w=2
http://marc.info/?l=linux-netdev&m=119206109918235&w=2

Andrew Morton's comment
http://marc.info/?l=linux-kernel&m=119248225007737&w=2

1. Allows using a port range of one single port.
2. Gives back sense of 'remaining' variable.

Signed-off-by: Anton Arapov <aarapov@redhat.com>
Acked-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
227b60f5102cda4e4ab792b526a59c8cb20cd9f8 11-Oct-2007 Stephen Hemminger <shemminger@linux-foundation.org> [INET]: local port range robustness

Expansion of original idea from Denis V. Lunev <den@openvz.org>

Add robustness and locking to the local_port_range sysctl.
1. Enforce that low < high when setting.
2. Use seqlock to ensure atomic update.

The locking might seem like overkill, but there are
cases where sysadmin might want to change value in the
middle of a DoS attack.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
457c4cbc5a3dde259d2a1f15d5f9785290397267 12-Sep-2007 Eric W. Biederman <ebiederm@xmission.com> [NET]: Make /proc/net per network namespace

This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
32c1da70810017a98aa6c431a5494a302b6b9a30 25-Aug-2007 Stephen Hemminger <shemminger@linux-foundation.org> [UDP]: Randomize port selection.

This patch causes UDP port allocation to be randomized like TCP.
The earlier code would always choose same port (ie first empty list).

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2a0c6c980d209827e5d69017fa736f107a744d1d 15-Sep-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4]: Just increment OutDatagrams once per a datagram.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3be550f34b03e5eb762f74d447ebbeba97efbd6d 11-Jul-2007 Patrick McHardy <kaber@trash.net> [UDP]: Fix length check.

Rémi Denis-Courmont wrote:
> Right. By the way, shouldn't "len" rather be signed in there?
>
> unsigned int len;
>
> /* if we're overly short, let UDP handle it */
> len = skb->len - sizeof(struct udphdr);
> if (len <= 0)
> goto udp;

It should, but the < 0 case can't happen since __udp4_lib_rcv
already makes sure that we have at least a complete UDP header.

Anyways, this patch fixes it.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
067b207b281db5e3f03f8d244286c20f61aa2343 06-Jul-2007 James Chapman <jchapman@katalix.com> [UDP]: Cleanup UDP encapsulation code

This cleanup fell out after adding L2TP support where a new encap_rcv
funcptr was added to struct udp_sock. Have XFRM use the new encap_rcv
funcptr, which allows us to move the XFRM encap code from udp.c into
xfrm4_input.c.

Make xfrm4_rcv_encap() static since it is no longer called externally.

Signed-off-by: James Chapman <jchapman@katalix.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
342f0234c71b40da785dd6a7ce1dd481ecbfdb81 28-Jun-2007 James Chapman <jchapman@katalix.com> [UDP]: Introduce UDP encapsulation type for L2TP

This patch adds a new UDP_ENCAP_L2TPINUDP encapsulation type for UDP
sockets. When a UDP socket's encap_type is UDP_ENCAP_L2TPINUDP, the
skb is delivered to a function pointed to by the udp_sock's
encap_rcv funcptr. If the skb isn't wanted by L2TP, it returns >0, which
causes it to be passed through to UDP.

Include padding to put the new encap_rcv field on a 4-byte boundary.

Previously, the only user of UDP encap sockets was ESP, so when
CONFIG_XFRM was not defined, some of the encap code was compiled
out. This patch changes that. As a result, udp_encap_rcv() will
now do a little more work when CONFIG_XFRM is not defined.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
df2bc459a3ad71f8b44c358bf7169acf9caf4acd 06-Jun-2007 David S. Miller <davem@sunset.davemloft.net> [UDP]: Revert 2-pass hashing changes.

This reverts changesets:

6aaf47fa48d3c44280810b1b470261d340e4ed87
b7b5f487ab39bc10ed0694af35651a03d9cb97ff
de34ed91c4ffa4727964a832c46e624dd1495cf5
fc038410b4b1643766f8033f4940bcdb1dace633

There are still some correctness issues recently
discovered which do not have a known fix that doesn't
involve doing a full hash table scan on port bind.

So revert for now.

Signed-off-by: David S. Miller <davem@davemloft.net>
584bdf8cbdf6f277c2a00e083257ee75687cf6f4 01-Jun-2007 Wei Dong <weidong@cn.fujitsu.com> [IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP

Signed-off-by: Wei Dong <weidong@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fc038410b4b1643766f8033f4940bcdb1dace633 10-May-2007 David S. Miller <davem@sunset.davemloft.net> [UDP]: Fix AF-specific references in AF-agnostic code.

__udp_lib_port_inuse() cannot make direct references to
inet_sk(sk)->rcv_saddr as that is ipv4 specific state and
this code is used by ipv6 too.

Use an operations vector to solve this, and this also paves
the way for ipv6 support for non-wild saddr hashing in UDP.

Signed-off-by: David S. Miller <davem@davemloft.net>
59c51591a0ac7568824f541f57de967e88adaa07 09-May-2007 Michael Opdenacker <michael@free-electrons.com> Fix occurrences of "the the "

Signed-off-by: Michael Opdenacker <michael@free-electrons.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
de34ed91c4ffa4727964a832c46e624dd1495cf5 30-Apr-2007 David S. Miller <davem@sunset.davemloft.net> [UDP]: Do not allow specific bind when wildcard bind exists.

When allocating local ports, do not allow a bind to a port
with a specific local address when a bind to that port with
a wildcard local address already exists.

Noticed by Linus.

Signed-off-by: David S. Miller <davem@davemloft.net>
b7b5f487ab39bc10ed0694af35651a03d9cb97ff 30-Apr-2007 David S. Miller <davem@sunset.davemloft.net> [IPV4] UDP: Fix endianness bugs in hashing changes.

I accidently applied an earlier version of Eric Dumazet's patch, from
March 21st. His version from March 30th didn't have these bugs, so
this just interdiffs to the correct patch.

Signed-off-by: David S. Miller <davem@davemloft.net>
6aaf47fa48d3c44280810b1b470261d340e4ed87 30-Apr-2007 Eric Dumazet <dada1@cosmosbay.com> [PATCH] INET : IPV4 UDP lookups converted to a 2 pass algo

Some people want to have many UDP sockets, binded to a single port but
many different addresses. We currently hash all those sockets into a
single chain. Processing of incoming packets is very expensive,
because the whole chain must be examined to find the best match.

I chose in this patch to hash UDP sockets with a hash function that
take into account both their port number and address : This has a
drawback because we need two lookups : one with a given address, one
with a wildcard (null) address.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
604763722c655c7e3f31ecf6f7b4dafcd26a7a15 09-Apr-2007 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY

When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY. Therefore we should
treat it as such in the stack.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
663ead3bb8d5b561e70fc3bb3861c9220b5a77eb 09-Apr-2007 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Use csum_start offset instead of skb_transport_header

The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload. Unfortunately,
the same pointer is also used during receive side processing.

This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.

This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums. This offset is calculated from skb->head so that
it does not have to change when skb->data changes.

No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.

For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
88c7664f13bd1a36acb8566b93892a4c58759ac6 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4bedb45203eab92a87b4c863fe2d0cded633427f 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce udp_hdr(), remove skb->h.uh

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ea2ae17d6443abddc79480dc9f7af8feacabddc4 26-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_transport_offset()

For the quite common 'skb->h.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
badff6d01a8589a1c828b0bf118903ca38627f4e 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_reset_transport_header(skb)

For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eddc9ec53be2ecdbf4efe0efd4a83052594f0ac0 21-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7a3c5da35055e2fa97ed4f0da3eec4bd0ef4c38 09-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [UDP]: Use __skb_pull since we have checked it won't fail with pskb_may_pull

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3fbe070a4293e8ab2d2edb1bc23f1e5220ce61af 09-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [UDP]: deinline

A couple of functions are exported or used indirectly
so it is pointless to mark them as inline.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6516c65573fde5e421c6c92c4b180bbe2245b23b 09-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [UDP]: ipv4 whitespace cleanup

Fix whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
759e5d006462d53fb708daa8284b4ad909415da1 26-Mar-2007 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Clean up UDP-Lite receive checksum

This patch eliminates some duplicate code for the verification of
receive checksums between UDP-Lite and UDP. It does this by
introducing __skb_checksum_complete_head which is identical to
__skb_checksum_complete_head apart from the fact that it takes
a length parameter rather than computing the first skb->len bytes.

As a result UDP-Lite will be able to use hardware checksum offload
for packets which do not use partial coverage checksums. It also
means that UDP-Lite loopback no longer does unnecessary checksum
verification.

If any NICs start support UDP-Lite this would also start working
automatically.

This patch removes the assumption that msg_flags has MSG_TRUNC clear
upon entry in recvmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
d644329bc9f10213a282387dfe4984ff891dd768 07-Mar-2007 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Reread uh pointer after pskb_trim

The header may have moved when trimming.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
95f30b336b944e3e418f825044b4793d9e9aac09 10-Feb-2007 Eric Dumazet <dada1@cosmosbay.com> [UDP]: UDP can use sk_hash to speedup lookups

In a prior patch, I introduced a sk_hash field (__sk_common.skc_hash) to let
tcp lookups use one cache line per unmatched entry instead of two.

We can also use sk_hash to speedup UDP part as well. We store in sk_hash the
hnum value, and use sk->sk_hash (same cache line than 'next' pointer),
instead of inet->num (different cache line)

Note : We still have a false sharing problem for SMP machines, because
sock_hold(sock) dirties the cache line containing the 'next' pointer. Not
counting the udp_hash_lock rwlock. (did someone mentioned RCU ? :) )

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e905a9edab7f4f14f9213b52234e4a346c690911 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8eb9086f21c73b38b5ca27558db4c91d62d0e70b 08-Feb-2007 David S. Miller <davem@sunset.davemloft.net> [IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.

Do this even for non-blocking sockets. This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller <davem@davemloft.net>
5c668704b7fa5a4ebf21a490ddfbd6dc2e01fc97 22-Dec-2006 David S. Miller <davem@sunset.davemloft.net> [UDP]: Fix reversed logic in udp_get_port().

When this code was converted to use sk_for_each() the
logic for the "best hash chain length" code was reversed,
breaking everything.

The original code was of the form:

size = 0;
do {
if (++size >= best_size_so_far)
goto next;
} while ((sk = sk->next) != NULL);
best_size_so_far = size;
best = result;
next:;

and this got converted into:

sk_for_each(sk2, node, head)
if (++size < best_size_so_far) {
best_size_so_far = size;
best = result;
}

Which does something very very different from the original.

Signed-off-by: David S. Miller <davem@davemloft.net>
4c0a6cb0db19de411c4bf7fcdc79d4c7c4ccafb1 27-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [UDP(-Lite)]: consolidate v4 and v6 get|setsockopt code

This patch consolidates set/getsockopt code between UDP(-Lite) v4 and 6. The
justification is that UDP(-Lite) is a transport-layer protocol and therefore
the socket option code (at least in theory) should be AF-independent.

Furthermore, there is the following code reduplication:
* do_udp{,v6}_getsockopt is 100% identical between v4 and v6
* do_udp{,v6}_setsockopt is identical up to the following differerence
--v4 in contrast to v4 additionally allows the experimental encapsulation
types UDP_ENCAP_ESPINUDP and UDP_ENCAP_ESPINUDP_NON_IKE
--the remainder is identical between v4 and v6
I believe that this difference is of little relevance.

The advantages in not duplicating twice almost completely identical code.

The patch further simplifies the interface of udp{,v6}_push_pending_frames,
since for the second argument (struct udp_sock *up) it always holds that
up = udp_sk(sk); where sk is the first function argument.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
ff1dcadb1b55dbf471c5ed109dbbdf06bd19ef3b 21-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Split skb->csum

... into anonymous union of __wsum and __u32 (csum and csum_offset resp.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
8e5200f54062b8af0ed1d186ea0f113854786d89 21-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Fix assorted misannotations (from md5 and udplite merges).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
078250d68d883495f31d0e6ce8321292f2dc1cfa 15-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [NET/IPv4]: Make udp_push_pending_frames static

udp_push_pending_frames is only referenced within
net/ipv4/udp.c and hence can remain static.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
f6ab028804bdc580fe0915494dbf31f5ea473ca7 16-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Make mangling a checksum (0 -> 0xffff on the wire) explicit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
6b11687ef003ed595033da89643c8995676f929d 15-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Annotate csum_tcpudp_magic() callers in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba4e58eca8aa9473b44fdfd312f26c4a2e7798b3 27-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [NET]: Supporting UDP-Lite (RFC 3828) in Linux

This is a revision of the previously submitted patch, which alters
the way files are organized and compiled in the following manner:

* UDP and UDP-Lite now use separate object files
* source file dependencies resolved via header files
net/ipv{4,6}/udp_impl.h
* order of inclusion files in udp.c/udplite.c adapted
accordingly

[NET/IPv4]: Support for the UDP-Lite protocol (RFC 3828)

This patch adds support for UDP-Lite to the IPv4 stack, provided as an
extension to the existing UDPv4 code:
* generic routines are all located in net/ipv4/udp.c
* UDP-Lite specific routines are in net/ipv4/udplite.c
* MIB/statistics support in /proc/net/snmp and /proc/net/udplite
* shared API with extensions for partial checksum coverage

[NET/IPv6]: Extension for UDP-Lite over IPv6

It extends the existing UDPv6 code base with support for UDP-Lite
in the same manner as per UDPv4. In particular,
* UDPv6 generic and shared code is in net/ipv6/udp.c
* UDP-Litev6 specific extensions are in net/ipv6/udplite.c
* MIB/statistics support in /proc/net/snmp6 and /proc/net/udplite6
* support for IPV6_ADDRFORM
* aligned the coding style of protocol initialisation with af_inet6.c
* made the error handling in udpv6_queue_rcv_skb consistent;
to return `-1' on error on all error cases
* consolidation of shared code

[NET]: UDP-Lite Documentation and basic XFRM/Netfilter support

The UDP-Lite patch further provides
* API documentation for UDP-Lite
* basic xfrm support
* basic netfilter support for IPv4 and IPv6 (LOG target)

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
753eab76a3337863a0d86ce045fa4eb6c3cbeef9 23-Nov-2006 Olaf Kirch <okir@suse.de> [UDP]: Make udp_encap_rcv use pskb_may_pull

Make udp_encap_rcv use pskb_may_pull

IPsec with NAT-T breaks on some notebooks using the latest e1000 chipset,
when header split is enabled. When receiving sufficiently large packets, the
driver puts everything up to and including the UDP header into the header
portion of the skb, and the rest goes into the paged part. udp_encap_rcv
forgets to use pskb_may_pull, and fails to decapsulate it. Instead, it
passes it up it to the IKE daemon.

Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
1e0c14f49d6b393179f423abbac47f85618d3d46 03-Oct-2006 Herbert Xu <herbert@gondor.apana.org.au> [UDP]: Fix MSG_PROBE crash

UDP tracks corking status through the pending variable. The
IP layer also tracks it through the socket write queue. It
is possible for the two to get out of sync when MSG_PROBE is
used.

This patch changes UDP to check the write queue to ensure
that the two stay in sync.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
734ab87f632e94dfd4299f4d4d1d10668548f9d8 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [UDP] net/ipv4/udp.c: trivial annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ca3c68e76686bee058937ade2b96f4de58ee434 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: struct ip_options annotations

->faddr is net-endian; annotated as such, variables inferred to be net-endian
annotated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
e3b4eadbea77ecb3c3a74d1bc81b392f454c7f2e 27-Aug-2006 David S. Miller <davem@sunset.davemloft.net> [UDP]: saddr_cmp function should take const socket pointers

This also kills a warning while building ipv6:

net/ipv6/udp.c: In function ‘udp_v6_get_port’:
net/ipv6/udp.c:66: warning: passing argument 3 of ‘udp_get_port’ from incompatible pointer type

Signed-off-by: David S. Miller <davem@davemloft.net>
bed53ea7fef37820b7c92ad74feff1b817c6aae3 27-Aug-2006 David S. Miller <davem@sunset.davemloft.net> [UDP]: Mark udp_port_rover static.

It is not referenced outside of net/ipv4/udp.c any longer.

Signed-off-by: David S. Miller <davem@davemloft.net>
25030a7f9eeab2dcefff036469e0e2b4f956198f 27-Aug-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [UDP]: Unify UDPv4 and UDPv6 ->get_port()

This patch creates one common function which is called by
udp_v4_get_port() and udp_v6_get_port(). As a result,
* duplicated code is removed
* udp_port_rover and local port lookup can now be removed from udp.h
* further savings follow since the same function will be used by UDP-Litev4
and UDP-Litev6

In contrast to the patch sent in response to Yoshifujis comments
(fixed by this variant), the code below also removes the
EXPORT_SYMBOL(udp_port_rover), since udp_port_rover can now remain
local to net/ipv4/udp.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
81aa646cc4df3779bcbf9d18cc2c0813ee9b3262 15-Aug-2006 Martin Bligh <mbligh@google.com> [IPV4]: add the UdpSndbufErrors and UdpRcvbufErrors MIBs

Signed-off-by: Martin Bligh <mbligh@google.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
84fa7933a33f806bbbaae6775e87459b1ec584c0 30-Aug-2006 Patrick McHardy <kaber@trash.net> [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE

Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).

Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
beb8d13bed80f8388f1a9a107d07ddd342e627e8 05-Aug-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com> [MLSXFRM]: Add flow labeling

This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.

The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.

ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0da974f4f303a6842516b764507e3c0a03f41e5a 21-Jul-2006 Panagiotis Issaris <takis@issaris.org> [NET]: Conversions from kmalloc+memset to k(z|c)alloc.

Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
543d9cfeec4d58ad3fd974db5531b06b6b95deb4 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [NET]: Identation & other cleanups related to compat_[gs]etsockopt cset

No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3fdadf7d27e3fbcf72930941884387d1f4936f04 21-Mar-2006 Dmitry Mishin <dim@openvz.org> [NET]: {get|set}sockopt compatibility layer

This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
b59c270104f03960069596722fea70340579244d 07-Jan-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: Keep conntrack reference until IPsec policy checks are done

Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
14c850212ed8f8cbb5972ad6b8812e08a0bc901c 27-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.h

To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.

Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
65a45441d7e91ef6107fadbc55f775db4ba23874 14-Dec-2005 Stephen Hemminger <shemminger@osdl.org> [UDP]: udp_checksum_init return value

Since udp_checksum_init always returns 0 there is no point in
having it return a value.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3305b80c214c642b89cd5c21af83bc91ec13f8bd 14-Dec-2005 Herbert Xu <herbert@gondor.apana.org.au> [IP]: Simplify and consolidate MSG_PEEK error handling

When a packet is obtained from skb_recv_datagram with MSG_PEEK enabled
it is left on the socket receive queue. This means that when we detect
a checksum error we have to be careful when trying to free the packet
as someone could have dequeued it in the time being.

Currently this delicate logic is duplicated three times between UDPv4,
UDPv6 and RAWv6. This patch moves them into a one place and simplifies
the code somewhat.

This is based on a suggestion by Eric Dumazet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
fb286bb2990a107009dbf25f6ffebeb7df77f9be 10-Nov-2005 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Detect hardware rx checksum faults correctly

Here is the patch that introduces the generic skb_checksum_complete
which also checks for hardware RX checksum faults. If that happens,
it'll call netdev_rx_csum_fault which currently prints out a stack
trace with the device name. In future it can turn off RX checksum.

I've converted every spot under net/ that does RX checksum checks to
use skb_checksum_complete or __skb_checksum_complete with the
exceptions of:

* Those places where checksums are done bit by bit. These will call
netdev_rx_csum_fault directly.

* The following have not been completely checked/converted:

ipmr
ip_vs
netfilter
dccp

This patch is based on patches and suggestions from Stephen Hemminger
and David S. Miller.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
e308e25c97f06cf704e65eeb773412f5460a3b93 08-Sep-2005 Stephen Hemminger <shemminger@osdl.org> [IPV4] udp: trim forgets about CHECKSUM_HW

A UDP packet may contain extra data that needs to be trimmed off.
But when doing so, UDP forgets to fixup the skb checksum if CHECKSUM_HW
is being used.

I think this explains the case of a NFS receive using skge driver
causing 'udp hw checksum failures' when interacting with a crufty
settop box.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba89966c1984513f4f2cc0a6c182266be44ddd03 26-Aug-2005 Eric Dumazet <dada1@cosmosbay.com> [NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointers

This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.

On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
64ce207306debd7157f47282be94770407bec01c 10-Aug-2005 Patrick McHardy <kaber@trash.net> [NET]: Make NETDEBUG pure printk wrappers

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c752f0739f09b803aed191c4765a3b6650a08653 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [TCP]: Move the tcp sock states to net/tcp_states.h

Lots of places just needs the states, not even linux/tcp.h, where this
enum was, needs it.

This speeds up development of the refactorings as less sources are
rebuilt when things get moved from net/tcp.h.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ca9334523c853e407da7b3a0bd02f54d0fa59414 08-Aug-2005 Heikki Orsila <heikki.orsila@iki.fi> [IPV4]: Debug cleanup

Here's a small patch to cleanup NETDEBUG() use in net/ipv4/ for Linux
kernel 2.6.13-rc5. Also weird use of indentation is changed in some
places.

Signed-off-by: Heikki Orsila <heikki.orsila@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
208d89843b7b03978d8e748b8b991c1be81c4f43 31-May-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Fix BUG() in 2.6.x, udp_poll(), fragments + CONFIG_HIGHMEM

Steven Hand <Steven.Hand@cl.cam.ac.uk> wrote:
>
> Reconstructed forward trace:
>
> net/ipv4/udp.c:1334 spin_lock_irq()
> net/ipv4/udp.c:1336 udp_checksum_complete()
> net/core/skbuff.c:1069 skb_shinfo(skb)->nr_frags > 1
> net/core/skbuff.c:1086 kunmap_skb_frag()
> net/core/skbuff.h:1087 local_bh_enable()
> kernel/softirq.c:0140 WARN_ON(irqs_disabled());

The receive queue lock is never taken in IRQs (and should never be) so
we can simply substitute bh for irq.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
02c30a84e6298b6b20a56f0896ac80b47839e134 06-May-2005 Jesper Juhl <juhl-lkml@dif.dk> [PATCH] update Ross Biro bouncing email address

Ross moved. Remove the bad email address so people will find the correct
one in ./CREDITS.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4d78b6c78ae6d87e4c1c8072f42efa716f04afb9 20-Apr-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPSEC]: COW skb header in UDP decap

The following patch just makes the header part of the skb writeable.
This is needed since we modify the IP headers just a few lines below.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!