History log of /net/ipv4/af_inet.c
Revision Date Author Comments
6a3db02c6f32102f98e1bd8d33b28ba5ec4528bb 30-Jun-2009 Chia-chi Yeh <chiachi@android.com> net: Replace AID_NET_RAW checks with capable(CAP_NET_RAW).

Signed-off-by: Chia-chi Yeh <chiachi@android.com>
c7c3ec4903d32c60423ee013d96e94602f66042c 12-May-2008 Robert Love <rlove@google.com> net: socket ioctl to reset connections matching local address

Introduce a new socket ioctl, SIOCKILLADDR, that nukes all sockets
bound to the same local address. This is useful in situations with
dynamic IPs, to kill stuck connections.

Signed-off-by: Brian Swetland <swetland@google.com>

net: fix tcp_v4_nuke_addr

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix a spinlock recursion bug in tcp_v4_nuke.

We can't hold the lock while calling to tcp_done(), so we drop
it before calling. We then have to start at the top of the chain again.

Signed-off-by: Dima Zavin <dima@android.com>

net: ipv4: Fix race in tcp_v4_nuke_addr().

To fix a recursive deadlock in 2.6.29, we stopped holding the hash table lock
across tcp_done() calls. This fixed the deadlock, but introduced a race where
the socket could die or change state.

Fix: Before unlocking the hash table, we grab a reference to the socket. We
can then unlock the hash table without risk of the socket going away. We then
lock the socket, which is safe because it is pinned. We can then call
tcp_done() without recursive deadlock and without race. Upon return, we unlock
the socket and then unpin it, killing it.

Change-Id: Idcdae072b48238b01bdbc8823b60310f1976e045
Signed-off-by: Robert Love <rlove@google.com>
Acked-by: Dima Zavin <dima@android.com>

ipv4: disable bottom halves around call to tcp_done().

Signed-off-by: Robert Love <rlove@google.com>
Signed-off-by: Colin Cross <ccross@android.com>

ipv4: Move sk_error_report inside bh_lock_sock in tcp_v4_nuke_addr

When sk_error_report is called, it wakes up the user-space thread, which then
calls tcp_close. When the tcp_close is interrupted by the tcp_v4_nuke_addr
ioctl thread running tcp_done, it leaks 392 bytes and triggers a WARN_ON.

This patch moves the call to sk_error_report inside the bh_lock_sock, which
matches the locking used in tcp_v4_err.

Signed-off-by: Colin Cross <ccross@android.com>
db0e07948289cd4d332b82b1b5e9fe05846b2bc6 15-Oct-2008 Robert Love <rlove@google.com> Paranoid network.

With CONFIG_ANDROID_PARANOID_NETWORK, require specific uids/gids to instantiate
network sockets.

Signed-off-by: Robert Love <rlove@google.com>

paranoid networking: Use in_egroup_p() to check group membership

The previous group_search() caused trouble for partners with module builds.
in_egroup_p() is also cleaner.

Signed-off-by: Nick Pelly <npelly@google.com>

Fix 2.6.29 build.

Signed-off-by: Arve Hjønnevåg <arve@android.com>

net: Fix compilation of the IPv6 module

Fix compilation of the IPv6 module -- current->euid does not exist anymore,
current_euid() is what needs to be used.

Signed-off-by: Steinar H. Gunderson <sesse@google.com>

net: bluetooth: Remove the AID_NET_BT* gid numbers

Removed bluetooth checks for AID_NET_BT and AID_NET_BT_ADMIN
which are not useful anymore.
This is in preparation for getting rid of all the AID_* gids.

Signed-off-by: JP Abgrall <jpa@google.com>
f4713a3dfad045d46afcb9c2a7d0bba288920ed4 26-Nov-2014 Willem de Bruijn <willemb@google.com> net-timestamp: make tcp_recvmsg call ipv6_recv_error for AF_INET6 socks

TCP timestamping introduced MSG_ERRQUEUE handling for TCP sockets.
If the socket is of family AF_INET6, call ipv6_recv_error instead
of ip_recv_error.

This change is more complex than a single branch due to the loadable
ipv6 module. It reuses a pre-existing indirect function call from
ping. The ping code is safe to call, because it is part of the core
ipv6 module and always present when AF_INET6 sockets are active.

Fixes: 4ed2d765 (net-timestamp: TCP timestamping)
Signed-off-by: Willem de Bruijn <willemb@google.com>

----

It may also be worthwhile to add WARN_ON_ONCE(sk->family == AF_INET6)
to ip_recv_error.
Signed-off-by: David S. Miller <davem@davemloft.net>
1e16aa3ddf863c6b9f37eddf52503230a62dedb3 20-Oct-2014 Florian Westphal <fw@strlen.de> net: gso: use feature flag argument in all protocol gso handlers

skb_gso_segment() has a 'features' argument representing offload features
available to the output path.

A few handlers, e.g. GRE, instead re-fetch the features of skb->dev and use
those instead of the provided ones when handing encapsulation/tunnels.

Depending on dev->hw_enc_features of the output device skb_gso_segment() can
then return NULL even when the caller has disabled all GSO feature bits,
as segmentation of inner header thinks device will take care of segmentation.

This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs
that did not fit the remaining token quota as the segmentation does not work
when device supports corresponding hw offload capabilities.

Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2c804d0f8fc7799981d9fdd8c88653541b28c1a7 01-Oct-2014 Eric Dumazet <edumazet@google.com> ipv4: mentions skb_gro_postpull_rcsum() in inet_gro_receive()

Proper CHECKSUM_COMPLETE support needs to adjust skb->csum
when we remove one header. Its done using skb_gro_postpull_rcsum()

In the case of IPv4, we know that the adjustment is not really needed,
because the checksum over IPv4 header is 0. Lets add a comment to
ease code comprehension and avoid copy/paste errors.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
53e50398968d43338c4d932114e68bc099fc5fbd 20-Sep-2014 Tom Herbert <therbert@google.com> net: Remove gso_send_check as an offload callback

The send_check logic was only interesting in cases of TCP offload and
UDP UFO where the checksum needed to be initialized to the pseudo
header checksum. Now we've moved that logic into the related
gso_segment functions so gso_send_check is no longer needed.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9667e9bb3f366435dde74f22578876daae850feb 09-Sep-2014 Tom Herbert <therbert@google.com> ipip: Add gro callbacks to ipip offload

Add inet_gro_receive and inet_gro_complete to ipip_offload to
support GRO.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
49a601589caaf0e93194c0cc9b4ecddbe75dd2d5 05-Sep-2014 Vincent Bernat <vincent@bernat.im> net/ipv4: bind ip_nonlocal_bind to current netns

net.ipv4.ip_nonlocal_bind sysctl was global to all network
namespaces. This patch allows to set a different value for each
network namespace.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
c3caf1192f904de2f1381211f564537235d50de3 15-Jul-2014 Jerry Chu <hkchu@google.com> net-gre-gro: Fix a bug that breaks the forwarding path

Fixed a bug that was introduced by my GRE-GRO patch
(bf5a755f5e9186406bbf50f4087100af5bd68e40 net-gre-gro: Add GRE
support to the GRO stack) that breaks the forwarding path
because various GSO related fields were not set. The bug will
cause on the egress path either the GSO code to fail, or a
GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following
fix has been tested for both cases.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4749c09c37030ccdc44aecebe0f71b02a377fc14 05-Jun-2014 Tom Herbert <therbert@google.com> gre: Call gso_make_checksum

Call gso_make_checksum. This should have the benefit of using a
checksum that may have been previously computed for the packet.

This also adds NETIF_F_GSO_GRE_CSUM to differentiate devices that
offload GRE GSO with and without the GRE checksum offloaed.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0f4f4ffa7b7c3d29d0537a126145c9f8d8ed5dbc 05-Jun-2014 Tom Herbert <therbert@google.com> net: Add GSO support for UDP tunnels with checksum

Added a new netif feature for GSO_UDP_TUNNEL_CSUM. This indicates
that a device is capable of computing the UDP checksum in the
encapsulating header of a UDP tunnel.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b26ba202e0500eb852e89499ece1b2deaa64c3a7 23-May-2014 Tom Herbert <therbert@google.com> net: Eliminate no_check from protosw

It doesn't seem like an protocols are setting anything other
than the default, and allowing to arbitrarily disable checksums
for a whole protocol seems dangerous. This can be done on a per
socket basis.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
122ff243f5f104194750ecbc76d5946dd1eec934 13-May-2014 WANG Cong <xiyou.wangcong@gmail.com> ipv4: make ip_local_reserved_ports per netns

ip_local_port_range is already per netns, so should ip_local_reserved_ports
be. And since it is none by default we don't actually need it when we don't
enable CONFIG_SYSCTL.

By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba6b918ab234186d3aa1663e296586a1b526b77a 06-May-2014 Cong Wang <xiyou.wangcong@gmail.com> ping: move ping_group_range out of CONFIG_SYSCTL

Similarly, when CONFIG_SYSCTL is not set, ping_group_range should still
work, just that no one can change it. Therefore we should move it out of
sysctl_net_ipv4.c. And, it should not share the same seqlock with
ip_local_port_range.

BTW, rename it to ->ping_group_range instead.

Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c9d8f1a64225dfcc2f721d73a5984a2444920744 06-May-2014 Cong Wang <xiyou.wangcong@gmail.com> ipv4: move local_port_range out of CONFIG_SYSCTL

When CONFIG_SYSCTL is not set, ip_local_port_range should still work,
just that no one can change it. Therefore we should move it out of sysctl_inet.c.
Also, rename it to ->ip_local_ports instead.

Cc: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com>
Reported-by: Stefan de Konink <stefan@konink.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
698365fa1874aa7635d51667a34a2842228e9837 06-May-2014 WANG Cong <xiyou.wangcong@gmail.com> net: clean up snmp stats code

commit 8f0ea0fe3a036a47767f9c80e (snmp: reduce percpu needs by 50%)
reduced snmp array size to 1, so technically it doesn't have to be
an array any more. What's more, after the following commit:

commit 933393f58fef9963eac61db8093689544e29a600
Date: Thu Dec 22 11:58:51 2011 -0600

percpu: Remove irqsafe_cpu_xxx variants

We simply say that regular this_cpu use must be safe regardless of
preemption and interrupt state. That has no material change for x86
and s390 implementations of this_cpu operations. However, arches that
do not provide their own implementation for this_cpu operations will
now get code generated that disables interrupts instead of preemption.

probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after
almost 3 years, no one complains.

So, just convert the array to a single pointer and remove snmp_mib_init()
and snmp_mib_free() as well.

Cc: Christoph Lameter <cl@linux.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
57a7744e09867ebcfa0ccf1d6d529caa7728d552 14-Mar-2014 Eric W. Biederman <ebiederm@xmission.com> net: Replace u64_stats_fetch_begin_bh to u64_stats_fetch_begin_irq

Replace the bh safe variant with the hard irq safe variant.

We need a hard irq safe variant to deal with netpoll transmitting
packets from hard irq context, and we need it in most if not all of
the places using the bh safe variant.

Except on 32bit uni-processor the code is exactly the same so don't
bother with a bh variant, just have a hard irq safe variant that
everyone can use.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
91a48a2e85a3b70ce10ead34b4ab5347f8d215c9 24-Feb-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: ipv6: better estimate tunnel header cut for correct ufo handling

Currently the UFO fragmentation process does not correctly handle inner
UDP frames.

(The following tcpdumps are captured on the parent interface with ufo
disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
both sit device):

IPv6:
16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471

We can see that fragmentation header offset is not correctly updated.
(fragmentation id handling is corrected by 916e4cf46d0204 ("ipv6: reuse
ip6_frag_id from ip6_ufo_append_data")).

IPv4:
16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109

In this case fragmentation id is incremented and offset is not updated.

First, I aligned inet_gso_segment and ipv6_gso_segment:
* align naming of flags
* ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
always ensure that the state of this flag is left untouched when
returning from upper gso segmenation function
* ipv6_gso_segment: move skb_reset_inner_headers below updating the
fragmentation header data, we don't care for updating fragmentation
header data
* remove currently unneeded comment indicating skb->encapsulation might
get changed by upper gso_segment callback (gre and udp-tunnel reset
encapsulation after segmentation on each fragment)

If we encounter an IPIP or SIT gso skb we now check for the protocol ==
IPPROTO_UDP and that we at least have already traversed another ip(6)
protocol header.

The reason why we have to special case GSO_IPIP and GSO_SIT is that
we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
protocol of GSO_UDP_TUNNEL or GSO_GRE packets.

Reported-by: Wolfgang Walter <linux@stwm.de>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8ed1dc44d3e9e8387a104b1ae8f92e9a3fbf1b1e 09-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: introduce hardened ip_no_pmtu_disc mode

This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
to be honored by protocols which do more stringent validation on the
ICMP's packet payload. This knob is useful for people who e.g. want to
run an unmodified DNS server in a namespace where they need to use pmtu
for TCP connections (as they are used for zone transfers or fallback
for requests) but don't want to use possibly spoofed UDP pmtu information.

Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
if the returned packet is in the window or if the association is valid.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bf5a755f5e9186406bbf50f4087100af5bd68e40 07-Jan-2014 Jerry Chu <hkchu@google.com> net-gre-gro: Add GRE support to the GRO stack

This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.

The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.

Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.

The patch also support csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.

Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.

I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehesive study. Local testing and tuning
will be needed to decide the best setting.

All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)

An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.

The GRO support for pkts AFTER decap are controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).

1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
thruput: 9.16Gbps
CPU utilization: 19%

1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
thruput: 5.9Gbps
CPU utilization: 15%

1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 12-13%

1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 10%

The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).

2.1 ethtool -K gre1 gro on (hence will use gro_cells)

2.1.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 8.53Gbps
CPU utilization: 9%

2.1.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 8.97Gbps
CPU utilization: 7-8%

2.1.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 8.83Gbps
CPU utilization: 5-6%

2.1.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.98Gbps
CPU utilization: 5%

2.2 ethtool -K gre1 gro off

2.2.1 ethtool -K eth0 gro off; csum offload disabled
thruput: 5.93Gbps
CPU utilization: 9%

2.2.2 ethtool -K eth0 gro off; csum offload enabled
thruput: 5.62Gbps
CPU utilization: 8%

2.2.3 ethtool -K eth0 gro on; csum offload disabled
thruput: 7.69Gbps
CPU utilization: 8%

2.2.4 ethtool -K eth0 gro on; csum offload enabled
thruput: 8.96Gbps
CPU utilization: 5-6%

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
974eda11c54290a1be8f8b155edae7d791e5ce57 14-Dec-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> inet: make no_pmtu_disc per namespace and kill ipv4_config

The other field in ipv4_config, log_martians, was converted to a
per-interface setting, so we can just remove the whole structure.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
299603e8370a93dd5d8e8d800f0dff1ce2c53d36 12-Dec-2013 Jerry Chu <hkchu@google.com> net-gro: Prepare GRO stack for the upcoming tunneling support

This patch modifies the GRO stack to avoid the use of "network_header"
and associated macros like ip_hdr() and ipv6_hdr() in order to allow
an arbitary number of IP hdrs (v4 or v6) to be used in the
encapsulation chain. This lays the foundation for various IP
tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.

With this patch, the GRO stack traversing now is mostly based on
skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
skb->network_header). As a result all but the top layer (i.e., the
the transport layer) must have hdrs of the same length in order for
a pkt to be considered for aggregation. Therefore when adding a new
encap layer (e.g., for tunneling), one must check and skip flows
(e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
different hdr length.

Note that unlike the network header, the transport header can and
will continue to be set by the GRO code since there will be at
most one "transport layer" in the encap chain.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0e0d44ab4275549998567cd4700b43f7496eb62b 28-Aug-2013 Steffen Klassert <steffen.klassert@secunet.com> net: Remove FLOWI_FLAG_CAN_SLEEP

FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
dcd607718385d02ce3741de225927a57f528f93b 08-Nov-2013 Eric Dumazet <edumazet@google.com> inet: fix a UFO regression

While testing virtio_net and skb_segment() changes, Hannes reported
that UFO was sending wrong frames.

It appears this was introduced by a recent commit :
8c3a897bfab1 ("inet: restore gso for vxlan")

The old condition to perform IP frag was :

tunnel = !!skb->encapsulation;
...
if (!tunnel && proto == IPPROTO_UDP) {

So the new one should be :

udpfrag = !skb->encapsulation && proto == IPPROTO_UDP;
...
if (udpfrag) {

Initialization of udpfrag must be done before call
to ops->callbacks.gso_segment(skb, features), as
skb_udp_tunnel_segment() clears skb->encapsulation

(We want udpfrag to be true for UFO, false for VXLAN)

With help from Alexei Starovoitov

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
827da44c61419f29ae3be198c342e2147f1a10cb 08-Oct-2013 John Stultz <john.stultz@linaro.org> net: Explicitly initialize u64_stats_sync structures for lockdep

In order to enable lockdep on seqcount/seqlock structures, we
must explicitly initialize any locks.

The u64_stats_sync structure, uses a seqcount, and thus we need
to introduce a u64_stats_init() function and use it to initialize
the structure.

This unfortunately adds a lot of fairly trivial initialization code
to a number of drivers. But the benefit of ensuring correctness makes
this worth while.

Because these changes are required for lockdep to be enabled, and the
changes are quite trivial, I've not yet split this patch out into 30-some
separate patches, as I figured it would be better to get the various
maintainers thoughts on how to best merge this change along with
the seqcount lockdep enablement.

Feedback would be appreciated!

Signed-off-by: John Stultz <john.stultz@linaro.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: James Morris <jmorris@namei.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mirko Lindner <mlindner@marvell.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Roger Luethi <rl@hellgate.ch>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Simon Horman <horms@verge.net.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
8c3a897bfab10f68f90252440bb29e6749a7312a 28-Oct-2013 Eric Dumazet <edumazet@google.com> inet: restore gso for vxlan

Alexei reported a performance regression on vxlan, caused
by commit 3347c9602955 "ipv4: gso: make inet_gso_segment() stackable"

GSO vxlan packets were not properly segmented, adding IP fragments
while they were not expected.

Rename 'bool tunnel' to 'bool encap', and add a new boolean
to express the fact that UDP should be fragmented.
This fragmentation is triggered by skb->encapsulation being set.

Remove a "skb->encapsulation = 1" added in above commit,
as its not needed, as frags inherit skb->frag from original
GSO skb.

Reported-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
61c1db7fae21ed33c614356a43bf6580c5e53118 21-Oct-2013 Eric Dumazet <edumazet@google.com> ipv6: sit: add GSO/TSO support

Now ipv6_gso_segment() is stackable, its relatively easy to
implement GSO/TSO support for SIT tunnels

Performance results, when segmentation is done after tunnel
device (as no NIC is yet enabled for TSO SIT support) :

Before patch :

lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 3168.31 4.81 4.64 2.988 2.877

After patch :

lpq84:~# ./netperf -H 2002:af6:1153:: -Cc
MIGRATED TCP STREAM TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:1153:: () port 0 AF_INET6
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 5525.00 7.76 5.17 2.763 1.840

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a4fe34bf902b8f709c635ab37f1f39de0b86cff2 20-Oct-2013 Eric W. Biederman <ebiederm@xmission.com> tcp_memcontrol: Remove the per netns control.

The code that is implemented is per memory cgroup not per netns, and
having per netns bits is just confusing. Remove the per netns bits to
make it easier to see what is really going on.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1bbdceef1e535add893bf71d7b7ab102e4eb69eb 19-Oct-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once

Initialize the ehash and ipv6_hash_secrets with net_get_random_once.

Each compilation unit gets its own secret now:
ipv4/inet_hashtables.o
ipv4/udp.o
ipv6/inet6_hashtables.o
ipv6/udp.o
rds/connection.o

The functions still get inlined into the hashing functions. In the fast
path we have at most two (needed in ipv6) if (unlikely(...)).

Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb32f511a70be8967ac9025cf49c44324ced9a39 19-Oct-2013 Eric Dumazet <edumazet@google.com> ipip: add GSO/TSO support

Now inet_gso_segment() is stackable, its relatively easy to
implement GSO/TSO support for IPIP

Performance results, when segmentation is done after tunnel
device (as no NIC is yet enabled for TSO IPIP support) :

Before patch :

lpq83:~# ./netperf -H 7.7.9.84 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 3357.88 5.09 3.70 2.983 2.167

After patch :

lpq83:~# ./netperf -H 7.7.9.84 -Cc
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.9.84 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

87380 16384 16384 10.00 7710.19 4.52 6.62 1.152 1.687

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3347c960295583eee3fd58e5c539fb1972fbc005 19-Oct-2013 Eric Dumazet <edumazet@google.com> ipv4: gso: make inet_gso_segment() stackable

In order to support GSO on IPIP, we need to make
inet_gso_segment() stackable.

It should not assume network header starts right after mac
header.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
47d27aad44169372f358cda88a223883f6760fa5 18-Oct-2013 Eric Dumazet <edumazet@google.com> ipv4: gso: send_check() & segment() cleanups

inet_gso_segment() and inet_gso_send_check() are called by
skb_mac_gso_segment() under rcu lock, no need to use
rcu_read_lock() / rcu_read_unlock()

Avoid calling ip_hdr() twice per function.

We can use ip_send_check() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
421b3885bf6d56391297844f43fb7154a6396e12 07-Oct-2013 Shawn Bohrer <sbohrer@rgmadvisors.com> udp: ipv4: Add udp early demux

The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.

For UDP multicast we can only cache the dst if there is only one
receiving socket on the host. Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.

For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups. Additionally since the hash chains may be
long we only check the first socket to see if it is a match and not
waste extra time searching the whole chain when we might not find an
exact match.

Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After 89789.68 transactions/s

Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After 12.63us RTT

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9a3bab6b05383f1e4c3716b3615500c51285959e 24-Sep-2013 Eric Dumazet <edumazet@google.com> net: net_secret should not depend on TCP

A host might need net_secret[] and never open a single socket.

Problem added in commit aebda156a570782
("net: defer net_secret[] initialization")

Based on prior patch from Hannes Frederic Sowa.

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@strressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a17a390de7bdbcfff9b8f344273a886ca4cf8bf 02-Sep-2013 Cong Wang <amwang@redhat.com> net: make snmp_mib_free static inline

Fengguang reported:

net/built-in.o: In function `in6_dev_finish_destroy':
(.text+0x4ca7d): undefined reference to `snmp_mib_free'

this is due to snmp_mib_free() is defined when CONFIG_INET is enabled,
but in6_dev_finish_destroy() is now moved to core kernel.

I think snmp_mib_free() is small enough to be inlined, so just make it
static inline.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5b9b6263775d45ccc0c8b27344bfb1a97cf6f725 12-Jun-2013 Eric Dumazet <edumazet@google.com> gro: remove a sparse error

Fix following sparse error :

net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
integer

added in commit db8caf3dbc77599
("gro: should aggregate frames without DF")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
da5bab079f9b7d90ba234965a14914ace55e45e9 08-Jun-2013 Daniel Borkmann <dborkman@redhat.com> net: udp4: move GSO functions to udp_offload

Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
28850dc7c71da9d0c0e39246e9ff6913f41f8d0a 07-Jun-2013 Daniel Borkmann <dborkman@redhat.com> net: tcp: move GRO/GSO functions to tcp_offload

Would be good to make things explicit and move those functions to
a new file called tcp_offload.c, thus make this similar to tcpv6_offload.c.
While moving all related functions into tcp_offload.c, we can also
make some of them static, since they are only used there. Also, add
an explicit registration function.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
db8caf3dbc77599dc90f4ea0a803cd1d97116f30 31-May-2013 Eric Dumazet <edumazet@google.com> gro: should aggregate frames without DF

GRO on IPv4 doesn't aggregate frames if they don't have DF bit set.

Some servers use IP_MTU_DISCOVER/IP_PMTUDISC_PROBE, so linux receivers
are unable to aggregate this kind of traffic.

The right thing to do is to allow aggregation as long as the DF bit has
same value on all segments.

bnx2x LRO does this correctly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0d89d2035fe063461a5ddb609b2c12e7fb006e44 23-May-2013 Simon Horman <horms@verge.net.au> MPLS: Add limited GSO support

In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.

The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.

SKB Usage:

When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:

* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
skb->inner_protocol is added by this patch.

* Set skb->protocol to the new MPLS ethertype of the packet.

* Set skb->network_header to correspond to the
end of the L3 header, including the MPLS label stack.

I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.

Features:

I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.

This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.

A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.

Alternate Implementation:

One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().

I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.

MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.

Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
9b3eb5edf33897dc9128aa27300066153d4f8b9c 02-May-2013 Pravin B Shelar <pshelar@nicira.com> gre: Fix GREv4 TCPv6 segmentation.

For ipv6 traffic, GRE can generate packet with strange GSO
bits, e.g. ipv4 packet with SKB_GSO_TCPV6 flag set. Therefore
following patch relaxes check in inet gso handler to allow
such packet for segmentation.
This patch also fixes wrong skb->protocol set that was done in
gre_gso_segment() handler.

Reported-by: Steinar H. Gunderson <sesse@google.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aebda156a570782a86fc4426842152237a19427d 29-Apr-2013 Eric Dumazet <edumazet@google.com> net: defer net_secret[] initialization

Instead of feeding net_secret[] at boot time, defer the init
at the point first socket is created.

This permits some platforms to use better entropy sources than
the ones available at boot time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
330305cc4a6b0cb75c22fc01b8826f0ad755550f 24-Mar-2013 Pravin B Shelar <pshelar@nicira.com> ipv4: Fix ip-header identification for gso packets.

ip-header id needs to be incremented even if IP_DF flag is set.
This behaviour was changed in commit 490ab08127cebc25e3a26
(IP_GRE: Fix IP-Identification).

Following patch fixes it so that identification is always
incremented.

Reported-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c54419321455631079c7d6e60bc732dd0c5914c5 25-Mar-2013 Pravin B Shelar <pshelar@nicira.com> GRE: Refactor GRE tunneling code.

Following patch refactors GRE code into ip tunneling code and GRE
specific code. Common tunneling code is moved to ip_tunnel module.
ip_tunnel module is written as generic library which can be used
by different tunneling implementations.

ip_tunnel module contains following components:
- packet xmit and rcv generic code. xmit flow looks like
(gre_xmit/ipip_xmit)->ip_tunnel_xmit->ip_local_out.
- hash table of all devices.
- lookup for tunnel devices.
- control plane operations like device create, destroy, ioctl, netlink
operations code.
- registration for tunneling modules, like gre, ipip etc.
- define single pcpu_tstats dev->tstats.
- struct tnl_ptk_info added to pass parsed tunnel packet parameters.

ipip.h header is renamed to ip_tunnel.h

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
25c7704d8bdcf6746dab4397953df51759924b37 24-Mar-2013 Pravin B Shelar <pshelar@nicira.com> ipv4: Fix ip-header identification for gso packets.

ip-header id needs to be incremented even if IP_DF flag is set.
This behaviour was changed in commit 490ab08127cebc25e3a26
(IP_GRE: Fix IP-Identification).

Following patch fixes it so that identification is always
incremented.

Reported-by: Cong Wang <amwang@redhat.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
731362674580cb0c696cd1b1a03d8461a10cf90a 07-Mar-2013 Pravin B Shelar <pshelar@nicira.com> tunneling: Add generic Tunnel segmentation.

Adds generic tunneling offloading support for IPv4-UDP based
tunnels.
GSO type is added to request this offload for a skb.
netdev feature NETIF_F_UDP_TUNNEL is added for hardware offloaded
udp-tunnel support. Currently no device supports this feature,
software offload is used.

This can be used by tunneling protocols like VXLAN.

CC: Jesse Gross <jesse@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ec5f061564238892005257c83565a0b58ec79295 07-Mar-2013 Pravin B Shelar <pshelar@nicira.com> net: Kill link between CSUM and SG features.

Earlier SG was unset if CSUM was not available for given device to
force skb copy to avoid sending inconsistent csum.
Commit c9af6db4c11c (net: Fix possible wrong checksum generation)
added explicit flag to force copy to fix this issue. Therefore
there is no need to link SG and CSUM, following patch kills this
link between there two features.

This patch is also required following patch in series.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
490ab08127cebc25e3a260a74556b56ce5f47c0f 22-Feb-2013 Pravin B Shelar <pshelar@nicira.com> IP_GRE: Fix IP-Identification.

GRE-GSO generates ip fragments with id 0,2,3,4... for every
GSO packet, which is not correct. Following patch fixes it
by setting ip-header id unique id of fragments are allowed.
As Eric Dumazet suggested it is optimized by using inner ip-header
whenever inner packet is ipv4.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5b0520425e5ea81ba95ec486dd6bbb59a09fff0e 21-Feb-2013 Li Wei <lw@cn.fujitsu.com> ipv4: fix error handling in icmp_protocol.

Now we handle icmp errors in each transport protocol's err_handler,
for icmp protocols, that is ping_err. Since this handler only care
of those icmp errors triggered by echo request, errors triggered
by echo reply(which sent by kernel) are sliently ignored.

So wrap ping_err() with icmp_err() to deal with those icmp errors.

Signed-off-by: Li Wei <lw@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
08dcdbf6a7b9d14c2302c5bd0c5390ddf122f664 21-Feb-2013 Eric Dumazet <edumazet@google.com> ipv6: use a stronger hash for tcp

It looks like its possible to open thousands of TCP IPv6
sessions on a server, all landing in a single slot of TCP hash
table. Incoming packets have to lookup sockets in a very
long list.

We should hash all bits from foreign IPv6 addresses, using
a salt and hash mix, not a simple XOR.

inet6_ehashfn() can also separately use the ports, instead
of xoring them.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
68c331631143f5f039baac99a650e0b9e1ea02b6 14-Feb-2013 Pravin B Shelar <pshelar@nicira.com> v4 GRE: Add TCP segmentation offload for GRE

Following patch adds GRE protocol offload handler so that
skb_gso_segment() can segment GRE packets.
SKB GSO CB is added to keep track of total header length so that
skb_segment can push entire header. e.g. in case of GRE, skb_segment
need to push inner and outer headers to every segment.
New NETIF_F_GRE_GSO feature is added for devices which support HW
GRE TSO offload. Currently none of devices support it therefore GRE GSO
always fall backs to software GSO.

[ Compute pkt_len before ip_local_out() invocation. -DaveM ]

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c9af6db4c11ccc6c3e7f19bbc15d54023956f97c 11-Feb-2013 Pravin B Shelar <pshelar@nicira.com> net: Fix possible wrong checksum generation.

Patch cef401de7be8c4e (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
547472b8e1da72ae226430c0c4273e36fc8ca768 05-Feb-2013 David S. Miller <davem@davemloft.net> ipv4: Disallow non-namespace aware protocols to register.

All in-tree ipv4 protocol implementations are now namespace
aware. Therefore all the run-time checks are superfluous.

Reject registry of any non-namespace aware ipv4 protocol.
Eventually we'll remove prot->netns_ok and this registry
time check as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
cef401de7be8c4e155c6746bfccf721a4fa5fab9 25-Jan-2013 Eric Dumazet <edumazet@google.com> net: fix possible wrong checksum generation

Pravin Shelar mentioned that GSO could potentially generate
wrong TX checksum if skb has fragments that are overwritten
by the user between the checksum computation and transmit.

He suggested to linearize skbs but this extra copy can be
avoided for normal tcp skbs cooked by tcp_sendmsg().

This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
in skb_shinfo(skb)->gso_type if at least one frag can be
modified by the user.

Typical sources of such possible overwrites are {vm}splice(),
sendfile(), and macvtap/tun/virtio_net drivers.

Tested:

$ netperf -H 7.7.8.84
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
7.7.8.84 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 3959.52

$ netperf -H 7.7.8.84 -t TCP_SENDFILE
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

87380 16384 16384 10.00 3216.80

Performance of the SENDFILE is impacted by the extra allocation and
copy, and because we use order-0 pages, while the TCP_STREAM uses
bigger pages.

Reported-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
50c3a487d507568359ccbce1b15bcdfc01edf7ef 22-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> ipv4: Use IS_ERR_OR_NULL().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
95c7e0e4d4792f9c9629ad759c9be047a3ddbcc5 09-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> ipv4: Use FIELD_SIZEOF() in inet_init().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3594698a1fb8e5ae60a92c72ce9ca280256939a7 16-Nov-2012 Eric W. Biederman <ebiederm@xmission.com> net: Make CAP_NET_BIND_SERVICE per user namespace

Allow privileged users in any user namespace to bind to
privileged sockets in network namespaces they control.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
52e804c6dfaa5df1e4b0e290357b82ad4e4cda2c 16-Nov-2012 Eric W. Biederman <ebiederm@xmission.com> net: Allow userns root to control ipv4

Allow an unpriviled user who has created a user namespace, and then
created a network namespace to effectively use the new network
namespace, by reducing capable(CAP_NET_ADMIN) and
capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls.

Settings that merely control a single network device are allowed.
Either the network device is a logical network device where
restrictions make no difference or the network device is hardware NIC
that has been explicity moved from the initial network namespace.

In general policy and network stack state changes are allowed
while resource control is left unchanged.

Allow creating raw sockets.
Allow the SIOCSARP ioctl to control the arp cache.
Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting gre tunnels.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipip tunnels.

Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
adding, changing and deleting ipsec virtual tunnel interfaces.

Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
sockets.

Allow setting and receiving IPOPT_CIPSO, IP_OPT_SEC, IP_OPT_SID and
arbitrary ip options.

Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
Allow setting the IP_TRANSPARENT ipv4 socket option.
Allow setting the TCP_REPAIR socket option.
Allow setting the TCP_CONGESTION socket option.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f191a1d17f227032c159e5499809f545402b6dc6 15-Nov-2012 Vlad Yasevich <vyasevic@redhat.com> net: Remove code duplication between offload structures

Move the offload callbacks into its own structure.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
808a8f884554f93315f663b2694addb4a177c578 15-Nov-2012 Vlad Yasevich <vyasevic@redhat.com> ipv4: Pull GSO registration out of inet_init()

Since GSO/GRO support is now separated, make IPv4 GSO a
stand-alone init call and not part of inet_init().

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bca49f843eac59cbb1ddd1f4a5d65fcc23b62efd 15-Nov-2012 Vlad Yasevich <vyasevic@redhat.com> ipv4: Switch to using the new offload infrastructure.

Switch IPv4 code base to using the new GRO/GSO calls and data.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
de27d001d187721de775ed9e0c485aa6f517c745 15-Nov-2012 Vlad Yasevich <vyasevic@redhat.com> net: Add net protocol offload registration infrustructure

Create a new data structure for IPv4 protocols that holds GRO/GSO
callbacks and a new array to track the protocols that register GRO/GSO.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22061d8014455b01eb018bd6c35a1b3040ccc230 15-Nov-2012 Vlad Yasevich <vyasevic@redhat.com> net: Switch to using the new packet offload infrustructure

Convert to using the new GSO/GRO registration mechanism and new
packet offload structure.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bb68b64724a4fd6b93d83b39aeffa4aadb2562fc 18-Sep-2012 Christoph Paasch <christoph.paasch@uclouvain.be> ipv4: Don't add TCP-code in inet_sock_destruct

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Acked-by: H.K. Jerry Chu <hkchu@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7ab4551f3b391818e29263279031dca1e26417c6 06-Sep-2012 Eric Dumazet <edumazet@google.com> tcp: fix TFO regression

Fengguang Wu reported various panics and bisected to commit
8336886f786fdac (tcp: TCP Fast Open Server - support TFO listeners)

Fix this by making sure socket is a TCP socket before accessing TFO data
structures.

[ 233.046014] kfree_debugcheck: out of range ptr ea6000000bb8h.
[ 233.047399] ------------[ cut here ]------------
[ 233.048393] kernel BUG at /c/kernel-tests/src/stable/mm/slab.c:3074!
[ 233.048393] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 233.048393] Modules linked in:
[ 233.048393] CPU 0
[ 233.048393] Pid: 3929, comm: trinity-watchdo Not tainted 3.6.0-rc3+
#4192 Bochs Bochs
[ 233.048393] RIP: 0010:[<ffffffff81169653>] [<ffffffff81169653>]
kfree_debugcheck+0x27/0x2d
[ 233.048393] RSP: 0018:ffff88000facbca8 EFLAGS: 00010092
[ 233.048393] RAX: 0000000000000031 RBX: 0000ea6000000bb8 RCX:
00000000a189a188
[ 233.048393] RDX: 000000000000a189 RSI: ffffffff8108ad32 RDI:
ffffffff810d30f9
[ 233.048393] RBP: ffff88000facbcb8 R08: 0000000000000002 R09:
ffffffff843846f0
[ 233.048393] R10: ffffffff810ae37c R11: 0000000000000908 R12:
0000000000000202
[ 233.048393] R13: ffffffff823dbd5a R14: ffff88000ec5bea8 R15:
ffffffff8363c780
[ 233.048393] FS: 00007faa6899c700(0000) GS:ffff88001f200000(0000)
knlGS:0000000000000000
[ 233.048393] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 233.048393] CR2: 00007faa6841019c CR3: 0000000012c82000 CR4:
00000000000006f0
[ 233.048393] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 233.048393] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 233.048393] Process trinity-watchdo (pid: 3929, threadinfo
ffff88000faca000, task ffff88000faec600)
[ 233.048393] Stack:
[ 233.048393] 0000000000000000 0000ea6000000bb8 ffff88000facbce8
ffffffff8116ad81
[ 233.048393] ffff88000ff588a0 ffff88000ff58850 ffff88000ff588a0
0000000000000000
[ 233.048393] ffff88000facbd08 ffffffff823dbd5a ffffffff823dbcb0
ffff88000ff58850
[ 233.048393] Call Trace:
[ 233.048393] [<ffffffff8116ad81>] kfree+0x5f/0xca
[ 233.048393] [<ffffffff823dbd5a>] inet_sock_destruct+0xaa/0x13c
[ 233.048393] [<ffffffff823dbcb0>] ? inet_sk_rebuild_header
+0x319/0x319
[ 233.048393] [<ffffffff8231c307>] __sk_free+0x21/0x14b
[ 233.048393] [<ffffffff8231c4bd>] sk_free+0x26/0x2a
[ 233.048393] [<ffffffff825372db>] sctp_close+0x215/0x224
[ 233.048393] [<ffffffff810d6835>] ? lock_release+0x16f/0x1b9
[ 233.048393] [<ffffffff823daf12>] inet_release+0x7e/0x85
[ 233.048393] [<ffffffff82317d15>] sock_release+0x1f/0x77
[ 233.048393] [<ffffffff82317d94>] sock_close+0x27/0x2b
[ 233.048393] [<ffffffff81173bbe>] __fput+0x101/0x20a
[ 233.048393] [<ffffffff81173cd5>] ____fput+0xe/0x10
[ 233.048393] [<ffffffff810a3794>] task_work_run+0x5d/0x75
[ 233.048393] [<ffffffff8108da70>] do_exit+0x290/0x7f5
[ 233.048393] [<ffffffff82707415>] ? retint_swapgs+0x13/0x1b
[ 233.048393] [<ffffffff8108e23f>] do_group_exit+0x7b/0xba
[ 233.048393] [<ffffffff8108e295>] sys_exit_group+0x17/0x17
[ 233.048393] [<ffffffff8270de10>] tracesys+0xdd/0xe2
[ 233.048393] Code: 59 01 5d c3 55 48 89 e5 53 41 50 0f 1f 44 00 00 48
89 fb e8 d4 b0 f0 ff 84 c0 75 11 48 89 de 48 c7 c7 fc fa f7 82 e8 0d 0f
57 01 <0f> 0b 5f 5b 5d c3 55 48 89 e5 0f 1f 44 00 00 48 63 87 d8 00 00
[ 233.048393] RIP [<ffffffff81169653>] kfree_debugcheck+0x27/0x2d
[ 233.048393] RSP <ffff88000facbca8>

Reported-by: Fengguang Wu <wfg@linux.intel.com>
Tested-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "H.K. Jerry Chu" <hkchu@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: H.K. Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8336886f786fdacbc19b719c1f7ea91eb70706d4 31-Aug-2012 Jerry Chu <hkchu@google.com> tcp: TCP Fast Open Server - support TFO listeners

This patch builds on top of the previous patch to add the support
for TFO listeners. This includes -

1. allocating, properly initializing, and managing the per listener
fastopen_queue structure when TFO is enabled

2. changes to the inet_csk_accept code to support TFO. E.g., the
request_sock can no longer be freed upon accept(), not until 3WHS
finishes

3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
if it's a TFO socket

4. properly closing a TFO listener, and a TFO socket before 3WHS
finishes

5. supporting TCP_FASTOPEN socket option

6. modifying tcp_check_req() to use to check a TFO socket as well
as request_sock

7. supporting TCP's TFO cookie option

8. adding a new SYN-ACK retransmit handler to use the timer directly
off the TFO socket rather than the listener socket. Note that TFO
server side will not retransmit anything other than SYN-ACK until
the 3WHS is completed.

The patch also contains an important function
"reqsk_fastopen_remove()" to manage the somewhat complex relation
between a listener, its request_sock, and the corresponding child
socket. See the comment above the function for the detail.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a9e050f4e7f9d36afe0dcc0bddba864ee442715e 06-Aug-2012 Eric Dumazet <edumazet@google.com> net: tcp: GRO should be ECN friendly

While doing TCP ECN tests, I discovered GRO was reordering packets if it
receives one packet with CE set, while previous packets in same NAPI run
have ECT(0) for the same flow :

09:25:25.857620 IP (tos 0x2,ECT(0), ttl 64, id 27893, offset 0, flags
[DF], proto TCP (6), length 4396)
172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq
233801:238145, ack 1, win 115, options [nop,nop,TS val 3397779 ecr
1990627], length 4344

09:25:25.857626 IP (tos 0x3,CE, ttl 64, id 27892, offset 0, flags [DF],
proto TCP (6), length 1500)
172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq
232353:233801, ack 1, win 115, options [nop,nop,TS val 3397779 ecr
1990627], length 1448

09:25:25.857638 IP (tos 0x0, ttl 64, id 34581, offset 0, flags [DF],
proto TCP (6), length 64)
172.30.42.13.44139 > 172.30.42.19.54550: Flags [.], cksum 0xac8f
(incorrect -> 0xca69), ack 232353, win 1271, options [nop,nop,TS val
1990627 ecr 3397779,nop,nop,sack 1 {233801:238145}], length 0

We have two problems here :

1) GRO reorders packets

If NIC gave packet1, then packet2, which happen to be from "different
flows" GRO feeds stack with packet2, then packet1. I have yet to
understand how to solve this problem.

2) GRO is not ECN friendly

Delivering packets out of order makes TCP stack not as fast as it could
be.

In this patch I suggest we make the tos test not part of the 'same_flow'
determination, but part of the 'should flush' logic

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
cf60af03ca4e71134206809ea892e49b92a88896 19-Jul-2012 Yuchung Cheng <ycheng@google.com> net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)

sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
and write(2). The application should replace connect() with it to
send data in the opening SYN packet.

For blocking socket, sendmsg() blocks until all the data are buffered
locally and the handshake is completed like connect() call. It
returns similar errno like connect() if the TCP handshake fails.

For non-blocking socket, it returns the number of bytes queued (and
transmitted in the SYN-data packet) if cookie is available. If cookie
is not available, it transmits a data-less SYN packet with Fast Open
cookie request option and returns -EINPROGRESS like connect().

Using MSG_FASTOPEN on connecting or connected socket will result in
simlar errno like repeating connect() calls. Therefore the application
should only use this flag on new sockets.

The buffer size of sendmsg() is independent of the MSS of the connection.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
783237e8daf13481ee234997cbbbb823872ac388 19-Jul-2012 Yuchung Cheng <ycheng@google.com> net-tcp: Fast Open client - sending SYN-data

This patch implements sending SYN-data in tcp_connect(). The data is
from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).

The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
type of the SYN. If the cookie is not cached (len==0), the host sends
data-less SYN with Fast Open cookie request option to solicit a cookie
from the remote. If cookie is not available (len > 0), the host sends
a SYN-data with Fast Open cookie option. If cookie length is negative,
the SYN will not include any Fast Open option (for fall back operations).

To deal with middleboxes that may drop SYN with data or experimental TCP
option, the SYN-data is only sent once. SYN retransmits do not include
data or Fast Open options. The connection will fall back to regular TCP
handshake.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
41063e9dd11956f2d285e12e4342e1d232ba0ea2 20-Jun-2012 David S. Miller <davem@davemloft.net> ipv4: Early TCP socket demux.

Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.

But we can optimize this down to one demux for certain kinds of local
sockets.

Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.

If a TCP socket is established then it's identity is fully specified.

This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.

Once we move to established state, we cache the receive packet's input
route to use later.

Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().

Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.

Signed-off-by: David S. Miller <davem@davemloft.net>
f9242b6b28d61295f2bf7e8adfb1060b382e5381 20-Jun-2012 David S. Miller <davem@davemloft.net> inet: Sanitize inet{,6} protocol demux.

Don't pretend that inet_protos[] and inet6_protos[] are hashes, thay
are just a straight arrays. Remove all unnecessary hash masking.

Document MAX_INET_PROTOS.

Use RAW_HTABLE_SIZE when appropriate.

Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e3192690a3c889767d1161b228374f4926d92af0 03-Jun-2012 Joe Perches <joe@perches.com> net: Remove casts to same type

Adding casts of objects to the same type is unnecessary
and confusing for a human reader.

For example, this cast:

int y;
int *p = (int *)&y;

I used the coccinelle script below to find and remove these
unnecessary casts. I manually removed the conversions this
script produces of casts with __force and __user.

@@
type T;
T *p;
@@

- (T *)p
+ p

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4a17fd5229c1b6066aa478f6b690f8293ce811a1 19-Apr-2012 Pavel Emelyanov <xemul@parallels.com> sock: Introduce named constants for sk_reuse

Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5e73ea1a31c3612aa6dfe44f864ca5b7b6a4cff9 15-Apr-2012 Daniel Baluta <dbaluta@ixiacom.com> ipv4: fix checkpatch errors

Fix checkpatch errors of the following type:
* ERROR: "foo * bar" should be "foo *bar"
* ERROR: "(foo*)" should be "(foo *)"

Signed-off-by: Daniel Baluta <dbaluta@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ffc93f203c18a70623f21950f1dd473c9ec48cd 28-Mar-2012 David Howells <dhowells@redhat.com> Remove all #inclusions of asm/system.h

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
afd465030acb4098abcb6b965a5aebc7ea2209e0 12-Mar-2012 Joe Perches <joe@perches.com> net: ipv4: Standardize prefixes for message logging

Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
058bd4d2a4ff0aaa4a5381c67e776729d840c785 11-Mar-2012 Joe Perches <joe@perches.com> net: Convert printks to pr_<level>

Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini. Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
#define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
text data bss dec hex filename
887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4c507d2897bd9be810b3403ade73b04cf6fdfd4a 09-Feb-2012 Jiri Benc <jbenc@redhat.com> net: implement IP_RECVTOS for IP_PKTOPTIONS

Currently, it is not easily possible to get TOS/DSCP value of packets from
an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
not actually implemented for TOS, though.

This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
(IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
the TOS field is stored from the other party's ACK.

This is needed for proxies which require DSCP transparency. One such example
is at http://zph.bratcheda.org/.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3dc43e3e4d0b52197d3205214fe8f162f9e0c334 11-Dec-2011 Glauber Costa <glommer@parallels.com> per-netns ipv4 sysctl_tcp_mem

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make this values
per group of process somehow. This is achieved in the
patches that follows in this patchset.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c8f44affb7244f2ac3e703cab13d55ede27621bb 15-Nov-2011 Michał Mirosław <mirq-linux@rere.qmqm.pl> net: introduce and use netdev_features_t for device features sets

v2: add couple missing conversions in drivers
split unexporting netdev_fix_features()
implemented %pNF
convert sock::sk_route_(no?)caps

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
acb32ba3dee66d58704caeeb8c6ff95f60efdc66 08-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com> ipv4: reduce percpu needs for icmpmsg mibs

Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
(can be ~88000 us).

This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
for all possible cpus can read 16 Mbytes of memory.

ICMP messages are not considered as fast path on a typical server, and
eventually few cpus handle them anyway. We can afford an atomic
operation instead of using percpu data.

This saves 4096 bytes per cpu and per network namespace.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
686dc6b64b58e69715ce92177da0732a6464db69 15-Oct-2011 Gerrit Renker <gerrit@erg.abdn.ac.uk> ipv4: compat_ioctl is local to af_inet.c, make it static

ipv4: compat_ioctl is local to af_inet.c, make it static

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
29c486df6a208432b370bd4be99ae1369ede28d8 31-Aug-2011 Eric Dumazet <eric.dumazet@gmail.com> net: ipv4: relax AF_INET check in bind()

commit d0733d2e29b65 (Check for mistakenly passed in non-IPv4 address)
added regression on legacy apps that use bind() with AF_UNSPEC family.

Relax the check, but make sure the bind() is done on INADDR_ANY
addresses, as AF_UNSPEC has probably no sane meaning for other
addresses.

Bugzilla reference : https://bugzilla.kernel.org/show_bug.cgi?id=42012

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-and-bisected-by: Rene Meier <r_meier@freenet.de>
CC: Marcus Meissner <meissner@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
c349a528cd47e2272ded0ea358363855e86180da 04-Jul-2011 Marcus Meissner <meissner@novell.com> net: bind() fix error return on wrong address family

Hi,

Reinhard Max also pointed out that the error should EAFNOSUPPORT according
to POSIX.

The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.

Other protocols error values in their af bind() methods in current mainline git as far
as a brief look shows:
EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25,
No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip

Ciao, Marcus

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
1eddceadb0d6441cd39b2c38705a8f5fec86e770 17-Jun-2011 Eric Dumazet <eric.dumazet@gmail.com> net: rfs: enable RFS before first data packet is received

Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 17 Jun 2011 00:50:46 +0100
>
> > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
> >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> >> goto discard;
> >>
> >> if (nsk != sk) {
> >> + sock_rps_save_rxhash(nsk, skb->rxhash);
> >> if (tcp_child_process(sk, nsk, skb)) {
> >> rsk = nsk;
> >> goto reset;
> >>
> >
> > I haven't tried this, but it looks reasonable to me.
> >
> > What about IPv6? The logic in tcp_v6_do_rcv() looks very similar.
>
> Indeed ipv6 side needs the same fix.
>
> Eric please add that part and resubmit. And in fact I might stick
> this into net-2.6 instead of net-next-2.6
>

OK, here is the net-2.6 based one then, thanks !

[PATCH v2] net: rfs: enable RFS before first data packet is received

First packet received on a passive tcp flow is not correctly RFS
steered.

One sock_rps_record_flow() call is missing in inet_accept()

But before that, we also must record rxhash when child socket is setup.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@conan.davemloft.net>
8f0ea0fe3a036a47767f9c80e81b13e379a1f43b 10-Jun-2011 Eric Dumazet <eric.dumazet@gmail.com> snmp: reduce percpu needs by 50%

SNMP mibs use two percpu arrays, one used in BH context, another in USER
context. With increasing number of cpus in machines, and fact that ipv6
uses per network device ipstats_mib, this is consuming a lot of memory
if many network devices are registered.

commit be281e554e2a (ipv6: reduce per device ICMP mib sizes) shrinked
percpu needs for ipv6, but we can reduce memory use a bit more.

With recent percpu infrastructure (irqsafe_cpu_inc() ...), we no longer
need this BH/USER separation since we can update counters in a single
x86 instruction, regardless of the BH/USER context.

Other arches than x86 might need to disable irq in their
irqsafe_cpu_inc() implementation : If this happens to be a problem, we
can make SNMP_ARRAY_SZ arch dependent, but a previous poll
( https://lkml.org/lkml/2011/3/17/174 ) to arch maintainers did not
raise strong opposition.

Only on 32bit arches, we need to disable BH for 64bit counters updates
done from USER context (currently used for IP MIB)

This also reduces vmlinux size :

1) x86_64 build
$ size vmlinux.before vmlinux.after
text data bss dec hex filename
7853650 1293772 1896448 11043870 a8841e vmlinux.before
7850578 1293772 1896448 11040798 a8781e vmlinux.after

2) i386 build
$ size vmlinux.before vmlinux.afterpatch
text data bss dec hex filename
6039335 635076 3670016 10344427 9dd7eb vmlinux.before
6037342 635076 3670016 10342434 9dd022 vmlinux.afterpatch

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <andi@firstfloor.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Tejun Heo <tj@kernel.org>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org
CC: linux-arch@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
d0733d2e29b652b2e7b1438ececa732e4eed98eb 02-Jun-2011 Marcus Meissner <meissner@suse.de> net/ipv4: Check for mistakenly passed in non-IPv4 address

Check against mistakenly passing in IPv6 addresses (which would result
in an INADDR_ANY bind) or similar incompatible sockaddrs.

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
c319b4d76b9e583a5d88d6bf190e079c4e43213d 13-May-2011 Vasiliy Kulikov <segoon@openwall.com> net: ipv4: add IPPROTO_ICMP socket kind

This patch adds IPPROTO_ICMP socket kind. It makes it possible to send
ICMP_ECHO messages and receive the corresponding ICMP_ECHOREPLY messages
without any special privileges. In other words, the patch makes it
possible to implement setuid-less and CAP_NET_RAW-less /bin/ping. In
order not to increase the kernel's attack surface, the new functionality
is disabled by default, but is enabled at bootup by supporting Linux
distributions, optionally with restriction to a group or a group range
(see below).

Similar functionality is implemented in Mac OS X:
http://www.manpagez.com/man/4/icmp/

A new ping socket is created with

socket(PF_INET, SOCK_DGRAM, PROT_ICMP)

Message identifiers (octets 4-5 of ICMP header) are interpreted as local
ports. Addresses are stored in struct sockaddr_in. No port numbers are
reserved for privileged processes, port 0 is reserved for API ("let the
kernel pick a free number"). There is no notion of remote ports, remote
port numbers provided by the user (e.g. in connect()) are ignored.

Data sent and received include ICMP headers. This is deliberate to:
1) Avoid the need to transport headers values like sequence numbers by
other means.
2) Make it easier to port existing programs using raw sockets.

ICMP headers given to send() are checked and sanitized. The type must be
ICMP_ECHO and the code must be zero (future extensions might relax this,
see below). The id is set to the number (local port) of the socket, the
checksum is always recomputed.

ICMP reply packets received from the network are demultiplexed according
to their id's, and are returned by recv() without any modifications.
IP header information and ICMP errors of those packets may be obtained
via ancillary data (IP_RECVTTL, IP_RETOPTS, and IP_RECVERR). ICMP source
quenches and redirects are reported as fake errors via the error queue
(IP_RECVERR); the next hop address for redirects is saved to ee_info (in
network order).

socket(2) is restricted to the group range specified in
"/proc/sys/net/ipv4/ping_group_range". It is "1 0" by default, meaning
that nobody (not even root) may create ping sockets. Setting it to "100
100" would grant permissions to the single group (to either make
/sbin/ping g+s and owned by this group or to grant permissions to the
"netadmins" group), "0 4294967295" would enable it for the world, "100
4294967295" would enable it for the users, but not daemons.

The existing code might be (in the unlikely case anyone needs it)
extended rather easily to handle other similar pairs of ICMP messages
(Timestamp/Reply, Information Request/Reply, Address Mask Request/Reply
etc.).

Userspace ping util & patch for it:
http://openwall.info/wiki/people/segoon/ping

For Openwall GNU/*/Linux it was the last step on the road to the
setuid-less distro. A revision of this patch (for RHEL5/OpenVZ kernels)
is in use in Owl-current, such as in the 2011/03/12 LiveCD ISOs:
http://mirrors.kernel.org/openwall/Owl/current/iso/

Initially this functionality was written by Pavel Kankovsky for
Linux 2.4.32, but unfortunately it was never made public.

All ping options (-b, -p, -Q, -R, -s, -t, -T, -M, -I), are tested with
the patch.

PATCH v3:
- switched to flowi4.
- minor changes to be consistent with raw sockets code.

PATCH v2:
- changed ping_debug() to pr_debug().
- removed CONFIG_IP_PING.
- removed ping_seq_fops.owner field (unused for procfs).
- switched to proc_net_fops_create().
- switched to %pK in seq_printf().

PATCH v1:
- fixed checksumming bug.
- CAP_NET_RAW may not create icmp sockets anymore.

RFC v2:
- minor cleanups.
- introduced sysctl'able group range to restrict socket(2).

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6e869138101bc607e4780187210b79d531f9b2ce 07-May-2011 David S. Miller <davem@davemloft.net> ipv4: Use cork flow in inet_sk_{reselect_saddr,rebuild_header}()

These two functions must be invoked only when the socket is locked
(because socket identity modifications are made non-atomically).

Therefore we can use the cork flow for output route lookups.

Signed-off-by: David S. Miller <davem@davemloft.net>
31e4543db29fb85496a122b965d6482c8d1a2bfe 04-May-2011 David S. Miller <davem@davemloft.net> ipv4: Make caller provide on-stack flow key to ip_route_output_ports().

Signed-off-by: David S. Miller <davem@davemloft.net>
b883187785004cacea053569f1655fada0dfc299 29-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Fetch route saddr from flow key in inet_sk_reselect_saddr().

Now that output route lookups update the flow with
source address selection, we can fetch it from
fl4->saddr instead of rt->rt_src

Signed-off-by: David S. Miller <davem@davemloft.net>
f6d8bd051c391c1c0458a30b2a7abcd939329259 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2d7192d6cbab20e153c47fa1559ffd41ceef0e79 26-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Sanitize and simplify ip_route_{connect,newports}()

These functions are used together as a unit for route resolution
during connect(). They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object. That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
b71d1d426d263b0b6cb5760322efebbfc89d4463 22-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
78fbfd8a653ca972afe479517a40661bfff6d8c3 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Create and use route lookup helpers.

The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.

Signed-off-by: David S. Miller <davem@davemloft.net>
b23dd4fe42b455af5c6e20966b7d6959fa8352ea 02-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
273447b352e69c327efdecfd6e1d6fe3edbdcd14 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Kill can_sleep arg to ip_route_output_flow()

This boolean state is now available in the flow flags.

Signed-off-by: David S. Miller <davem@davemloft.net>
420d44daa7aa1cc847e9e527f0a27a9ce61768ca 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"

Since that is what the current vague "flags" argument means.

Signed-off-by: David S. Miller <davem@davemloft.net>
abdf7e7239da270e68262728f125ea94b9b7d42d 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Can final ip_route_connect() arg to boolean "can_sleep".

Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller <davem@davemloft.net>
709b46e8d90badda1898caea50483c12af178e96 29-Jan-2011 Eric W. Biederman <ebiederm@xmission.com> net: Add compat ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT

SIOCGETSGCNT is not a unique ioctl value as it it maps tio SIOCPROTOPRIVATE +1,
which unfortunately means the existing infrastructure for compat networking
ioctls is insufficient. A trivial compact ioctl implementation would conflict
with:

SIOCAX25ADDUID
SIOCAIPXPRISLT
SIOCGETSGCNT_IN6
SIOCGETSGCNT
SIOCRSSCAUSE
SIOCX25SSUBSCRIP
SIOCX25SDTEFACILITIES

To make this work I have updated the compat_ioctl decode path to mirror the
the normal ioctl decode path. I have added an ipv4 inet_compat_ioctl function
so that I can have ipv4 specific compat ioctls. I have added a compat_ioctl
function into struct proto so I can break out ioctls by which kind of ip socket
I am using. I have added a compat_raw_ioctl function because SIOCGETSGCNT only
works on raw sockets. I have added a ipmr_compat_ioctl that mirrors the normal
ipmr_ioctl.

This was necessary because unfortunately the struct layout for the SIOCGETSGCNT
has unsigned longs in it so changes between 32bit and 64bit kernels.

This change was sufficient to run a 32bit ip multicast routing daemon on a
64bit kernel.

Reported-by: Bill Fenner <fenner@aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
04ed3e741d0f133e02bed7fa5c98edba128f90e7 25-Jan-2011 Michał Mirosław <mirq-linux@rere.qmqm.pl> net: change netdev->features to u32

Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
struct netdev, as per Eric Dumazet's suggestion -DaveM ]

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
5811662b15db018c740c57d037523683fd3e6123 12-Nov-2010 Changli Gao <xiaosuo@gmail.com> net: use the macros defined for the members of flowi

Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
49e8ab03ebcacd8e37660ffec20c0c46721a2800 19-Aug-2010 Eric Dumazet <eric.dumazet@gmail.com> net: build_ehash_secret() and rt_bind_peer() cleanups

Now cmpxchg() is available on all arches, we can use it in
build_ehash_secret() and rt_bind_peer() instead of using spinlocks.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
7ba42910073f8432934d61a6c08b1023c408fb62 10-Jul-2010 Changli Gao <xiaosuo@gmail.com> inet, inet6: make tcp_sendmsg() and tcp_sendpage() through inet_sendmsg() and inet_sendpage()

a new boolean flag no_autobind is added to structure proto to avoid the autobind
calls when the protocol is TCP. Then sock_rps_record_flow() is called int the
TCP's sendmsg() and sendpage() pathes.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
include/net/inet_common.h | 4 ++++
include/net/sock.h | 1 +
include/net/tcp.h | 8 ++++----
net/ipv4/af_inet.c | 15 +++++++++------
net/ipv4/tcp.c | 11 +++++------
net/ipv4/tcp_ipv4.c | 3 +++
net/ipv6/af_inet6.c | 8 ++++----
net/ipv6/tcp_ipv6.c | 3 +++
8 files changed, 33 insertions(+), 20 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
4ce3c183fcade7f4b30a33dae90cd774c3d9e094 30-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com> snmp: 64bit ipstats_mib for all arches

/proc/net/snmp and /proc/net/netstat expose SNMP counters.

Width of these counters is either 32 or 64 bits, depending on the size
of "unsigned long" in kernel.

This means user program parsing these files must already be prepared to
deal with 64bit values, regardless of user program being 32 or 64 bit.

This patch introduces 64bit snmp values for IPSTAT mib, where some
counters can wrap pretty fast if they are 32bit wide.

# netstat -s|egrep "InOctets|OutOctets"
InOctets: 244068329096
OutOctets: 244069348848

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1823e4c80eeae2a774c75569ce3035070e5ee009 22-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com> snmp: add align parameter to snmp_mib_init()

In preparation for 64bit snmp counters for some mibs,
add an 'align' parameter to snmp_mib_init(), instead
of assuming mibs only contain 'unsigned long' fields.

Callers can use __alignof__(type) to provide correct
alignment.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7b2ff18ee7b0ec4bc3162f821e221781aaca48bd 15-Jun-2010 Jiri Olsa <jolsa@redhat.com> net - IP_NODEFRAG option for IPv4 socket

this patch is implementing IP_NODEFRAG option for IPv4 socket.
The reason is, there's no other way to send out the packet with user
customized header of the reassembly part.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8d1f30b95a635dbd610dcc5eb641aca8f4768cf 11-Jun-2010 Changli Gao <xiaosuo@gmail.com> net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e3826f1e946e7d2354943232f1457be1455a29e2 05-May-2010 Amerigo Wang <amwang@redhat.com> net: reserve ports for applications using fixed port numbers

(Dropped the infiniband part, because Tetsuo modified the related code,
I will send a separate patch for it once this is accepted.)

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c58dc01babfd58ec9e71a6ce080150dc27755d88 28-Apr-2010 David S. Miller <davem@davemloft.net> net: Make RFS socket operations not be inet specific.

Idea from Eric Dumazet.

As for placement inside of struct sock, I tried to choose a place
that otherwise has a 32-bit hole on 64-bit systems.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
0eae88f31ca2b88911ce843452054139e028771f 21-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: Fix various endianness glitches

Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aa395145165cb06a0d0885221bbe0ce4a564391d 20-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: sk_sleep() helper

Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fec5e652e58fa6017b2c9e06466cb2a6538de5b4 17-Apr-2010 Tom Herbert <therbert@google.com> rfs: Receive Flow Steering

This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).

The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.

The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.

To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.

rps_sock_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.

rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.

Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.

And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:

- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.

Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.

This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.

There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.

The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).

The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.

Below are some benchmark results which show the potential benfit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is an request/response
test similar in structure to netperf RR test ith 100 threads on
each host, but does more work in userspace that netperf.

e1000e on 8 core Intel
No RFS or RPS 104K tps at 30% CPU
No RFS (best RPS config): 290K tps at 63% CPU
RFS 303K tps at 61% CPU

RPC test tps CPU% 50/90/99% usec latency Latency StdDev
No RFS/RPS 103K 48% 757/900/3185 4472.35
RPS only: 174K 73% 415/993/2468 491.66
RFS 223K 73% 379/651/1382 315.61

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b6c6712a42ca3f9fa7f4a3d7c40e3a9dd1fd9e03 09-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> net: sk_dst_cache RCUification

With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this
work.

sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)

This rwlock is readlocked for a very small amount of time, and dst
entries are already freed after RCU grace period. This calls for RCU
again :)

This patch converts sk_dst_lock to a spinlock, and use RCU for readers.

__sk_dst_get() is supposed to be called with rcu_read_lock() or if
socket locked by user, so use appropriate rcu_dereference_check()
condition (rcu_read_lock_held() || sock_owned_by_user(sk))

This patch avoids two atomic ops per tx packet on UDP connected sockets,
for example, and permits sk_dst_lock to be much less dirtied.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6503d96168f891ffa3b70ae6c9698a1a722025a0 01-Apr-2010 Changli Gao <xiaosuo@gmail.com> net: check the length of the socket address passed to connect(2)

check the length of the socket address passed to connect(2).

Check the length of the socket address passed to connect(2). If the
length is invalid, -EINVAL will be returned.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
----
net/bluetooth/l2cap.c | 3 ++-
net/bluetooth/rfcomm/sock.c | 3 ++-
net/bluetooth/sco.c | 3 ++-
net/can/bcm.c | 3 +++
net/ieee802154/af_ieee802154.c | 3 +++
net/ipv4/af_inet.c | 5 +++++
net/netlink/af_netlink.c | 3 +++
7 files changed, 20 insertions(+), 3 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
ec733b15a3ef0b5759141a177f8044a2f40c41e7 18-Mar-2010 Eric Dumazet <eric.dumazet@gmail.com> net: snmp mib cleanup

There is no point to align or pad mibs to cache lines, they are per cpu
allocated with a 8 bytes alignment anyway.
This wastes space for no gain. This patch removes __SNMP_MIB_ALIGN__

Since SNMP mibs contain "unsigned long" fields only, we can relax the
allocation alignment from "unsigned long long" to "unsigned long"

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7d720c3e4f0c4fc152a6bf17e24244a3c85412d2 16-Feb-2010 Tejun Heo <tj@kernel.org> percpu: add __percpu sparse annotations to net

Add __percpu sparse annotations to net.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.

The macro and type tricks around snmp stats make things a bit
interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
snmp_mib_*() users which used to cast the argument to (void **) are
updated to cast it to (void __percpu **).

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
c84b3268da3b85c9d8a9e504e1001a14ed829e94 06-Nov-2009 Eric Paris <eparis@redhat.com> net: check kern before calling security subsystem

Before calling capable(CAP_NET_RAW) check if this operations is on behalf
of the kernel or on behalf of userspace. Do not do the security check if
it is on behalf of the kernel.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3f378b684453f2a028eda463ce383370545d9cc9 06-Nov-2009 Eric Paris <eparis@redhat.com> net: pass kern to net_proto_family create function

The generic __sock_create function has a kern argument which allows the
security system to make decisions based on if a socket is being created by
the kernel or by userspace. This patch passes that flag to the
net_proto_family specific create function, so it can do the same thing.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13f18aa05f5abe135f47b6417537ae2b2fedc18c 06-Nov-2009 Eric Paris <eparis@redhat.com> net: drop capability from protocol definitions

struct can_proto had a capability field which wasn't ever used. It is
dropped entirely.

struct inet_protosw had a capability field which can be more clearly
expressed in the code by just checking if sock->type = SOCK_RAW.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
38bfd8f5bec496e8e0db8849e01c99a33479418a 29-Oct-2009 Cyrill Gorcunov <gorcunov@openvz.org> net,socket: introduce DECLARE_SOCKADDR helper to catch overflow at build time

proto_ops->getname implies copying protocol specific data
into storage unit (particulary to __kernel_sockaddr_storage).
So when we implement new protocol support we should keep such
a detail in mind (which is easy to forget about).

Lets introduce DECLARE_SOCKADDR helper which check if
storage unit is not overfowed at build time.

Eventually inet_getname is switched to use DECLARE_SOCKADDR
(to show example of usage).

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
c720c7e8383aff1cb219bddf474ed89d850336e3 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> inet: rename some inet_sock fields

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ec1b4cf74c81bfd0fbe5bf62bafc86c45917e72f 05-Oct-2009 Stephen Hemminger <shemminger@vyatta.com> net: mark net_proto_ops as const

All usages of structure net_proto_ops should be declared const.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
914a9ab386a288d0f22252fc268ecbc048cdcbd5 02-Oct-2009 Atis Elsts <atis@mikrotik.com> net: Use sk_mark for routing lookup in more places

This patch against v2.6.31 adds support for route lookup using sk_mark in some
more places. The benefits from this patch are the following.
First, SO_MARK option now has effect on UDP sockets too.
Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
lookup correctly if TCP sockets with SO_MARK were used.

Signed-off-by: Atis Elsts <atis@mikrotik.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
32613090a96dba2ca2cc524c8d4749d3126fdde5 14-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com> net: constify struct net_protocol

Remove long removed "inet_protocol_base" declaration.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3d1427f87002735aa54c370558e0c2bacc61f31e 29-Aug-2009 Eric Dumazet <eric.dumazet@gmail.com> ipv4: af_inet.c cleanups

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d7ca4cc01fd154f2da30ae6dae160fa5800af758 09-Jul-2009 Sridhar Samudrala <sri@us.ibm.com> udpv4: Handle large incoming UDP/IPv4 packets and support software UFO.

- validate and forward GSO UDP/IPv4 packets from untrusted sources.
- do software UFO if the outgoing device doesn't support UFO.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2307f866f542f3397d24f78d0efd74f4ab214a96 04-Jun-2009 Rami Rosen <ramirose@gmail.com> ipv4: remove ip_mc_drop_socket() declaration from af_inet.c.

ip_mc_drop_socket() method is declared in linux/igmp.h, which
is included anyhow in af_inet.c. So there is no need for this declaration.
This patch removes it from af_inet.c.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f771bef98004d9d141b085d987a77d06669d4f4f 28-May-2009 Nivedita Singhvi <niv@us.ibm.com> ipv4: New multicast-all socket option

After some discussion offline with Christoph Lameter and David Stevens
regarding multicast behaviour in Linux, I'm submitting a slightly
modified patch from the one Christoph submitted earlier.

This patch provides a new socket option IP_MULTICAST_ALL.

In this case, default behaviour is _unchanged_ from the current
Linux standard. The socket option is set by default to provide
original behaviour. Sockets wishing to receive data only from
multicast groups they join explicitly will need to clear this
socket option.

Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
Signed-off-by: Christoph Lameter<cl@linux.com>
Acked-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1075f3f65d0e0f49351b7d4310e9f94483972a51 26-May-2009 Herbert Xu <herbert@gondor.apana.org.au> ipv4: Use 32-bit loads for ID and length in GRO

This patch optimises the IPv4 GRO code by using 32-bit loads
(instead of 16-bit ones) on the ID and length checks in the receive
function.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
a5b1cf288d4200506ab62fbb86cc81ace948a306 26-May-2009 Herbert Xu <herbert@gondor.apana.org.au> gro: Avoid unnecessary comparison after skb_gro_header

For the overwhelming majority of cases, skb_gro_header's return
value cannot be NULL. Yet we must check it because of its current
form. This patch splits it up into multiple functions in order
to avoid this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
573636cbafeba88f7c88832ae053dafe265bf94b 17-Apr-2009 Eric Dumazet <dada1@cosmosbay.com> [PATCH] net: remove superfluous call to synchronize_net()

inet_register_protosw() function is responsible for adding a new
inet protocol into a global table (inetsw[]) that is used with RCU rules.

As soon as the store of the pointer is done, other cpus might see
this new protocol in inetsw[], so we have to make sure new protocol
is ready for use. All pending memory updates should thus be committed
to memory before setting the pointer.
This is correctly done using rcu_assign_pointer()

synchronize_net() is typically used at unregister time, after
unsetting the pointer, to make sure no other cpu is still using
the object we want to dismantle. Using it at register time
is only adding an artificial delay that could hide a real bug,
and this bug could popup if/when synchronize_rcu() can proceed
faster than now.

This saves about 13 ms on boot time on a HZ=1000 8 cpus machine ;)
(4 calls to inet_register_protosw(), and about 3200 us per call)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7546dd97d27306d939c13e03318aae695badaa88 09-Mar-2009 Stephen Hemminger <shemminger@vyatta.com> net: convert usage of packet_type to read_mostly

Protocols that use packet_type can be __read_mostly section for better
locality. Elminate any unnecessary initializations of NULL.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
313e458f81ec3852106c5a83830fe0d4f405a71a 20-Feb-2009 Rusty Russell <rusty@rustcorp.com.au> alloc_percpu: add align argument to __alloc_percpu.

This prepares for a real __alloc_percpu, by adding an alignment argument.
Only one place uses __alloc_percpu directly, and that's for a string.

tj: af_inet also uses __alloc_percpu(), update it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
a5ad24be728d4352b71a81fba471aa41eb71f83a 08-Feb-2009 Herbert Xu <herbert@gondor.apana.org.au> gro: Optimise IPv4 packet reception

As this function can be called more than half a million times for
10GbE, it's important to optimise it as much as we can.

This patch does some obvious changes to use 2-byte and 4-byte
operations instead of byte-oriented ones where possible. Bit
ops are also used to replace logical ops to reduce branching.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
f15fbcd7d857ca2ea20b57ba6dfe63aab89d0b8b 02-Feb-2009 Herbert Xu <herbert@gondor.apana.org.au> ipv4: Delete redundant sk_family assignment

sk_alloc now sets sk_family so this is redundant. In fact it caught
my eye because sock_init_data already uses sk_family so this is too
late anyway.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
09640e6365c679b5642b1c41b6d7078f51689ddf 01-Feb-2009 Harvey Harrison <harvey.harrison@gmail.com> net: replace uses of __constant_{endian}

Base versions handle constant folding now.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
86911732d3996a9da07914b280621450111bb6da 29-Jan-2009 Herbert Xu <herbert@gondor.apana.org.au> gro: Avoid copying headers of unmerged packets

Unfortunately simplicity isn't always the best. The fraginfo
interface turned out to be suboptimal. The problem was quite
obvious. For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.

LRO didn't have this problem because it directly read the headers
from the frags structure.

This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it. Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
b4ee07df3d8121060200dbe1c6686a4e0682bee2 26-Dec-2008 Alexey Dobriyan <adobriyan@gmail.com> netns: igmp: allow IPPROTO_IGMP sockets in netns

Looks like everything is already ready.

Required for ebtables(8) for one thing.

Also, required for ipmr per-netns (coming soon). (Benjamin)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
bf296b125b21b8d558ceb6ec30bb4eba2730cd6b 16-Dec-2008 Herbert Xu <herbert@gondor.apana.org.au> tcp: Add GRO support

This patch adds the TCP-specific portion of GRO. The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation. Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
73cc19f1556b95976934de236fd9043f7208844f 16-Dec-2008 Herbert Xu <herbert@gondor.apana.org.au> ipv4: Add GRO infrastructure

This patch adds GRO support for IPv4.

The criteria for merging is more stringent than LRO, in particular,
we require all fields in the IP header to be identical except for
the length, ID and checksum. In addition, the ID must form an
arithmetic sequence with a difference of one.

The ID requirement might seem overly strict, however, most hardware
TSO solutions already obey this rule. Linux itself also obeys this
whether GSO is in use or not.

In future we could relax this rule by storing the IDs (or rather
making sure that we don't drop them when pulling the aggregate
skb's tail).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
04f258ce7f085dd69422fa01d41c8f0194a0e270 24-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: some optimizations in af_inet

1) Use eq_net() in inet_netns_ok() to speedup socket creation if
!CONFIG_NET_NS

2) Reorder the tests about inet_ehash_secret generation (once only)
Use the unlikely() macro when testing if inet_ehash_secret already
generated.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c82838458200ec4167ce7083b0a17474150c5bf7 20-Nov-2008 Balazs Scheidler <bazsi@balabit.hu> TPROXY: supply a struct flowi->flags argument in inet_sk_rebuild_header()

inet_sk_rebuild_header() does a new route lookup if the dst_entry
associated with a socket becomes stale. However inet_sk_rebuild_header()
didn't use struct flowi->flags, causing the route lookup to
fail for foreign-bound IP_TRANSPARENT sockets, causing an error
state to be set for the sockets in question.

Signed-off-by: Balazs Scheidler <bazsi@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
673d57e72398edfedc93fb50ff58048077c9d587 31-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com> net: replace NIPQUAD() in net/ipv4/ net/ipv6/

Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b9fb15067ce93497bef852c05e406d7a96212a9a 01-Oct-2008 Tóth László Attila <panther@balabit.hu> ipv4: Allow binding to non-local addresses if IP_TRANSPARENT is set

Setting IP_TRANSPARENT is not really useful without allowing non-local
binds for the socket. To make user-space code simpler we allow these
binds even if IP_TRANSPARENT is set but IP_FREEBIND is not.

Signed-off-by: Tóth László Attila <panther@balabit.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
bd7b1533cd6a68c734062aa69394bec7e2b1718e 15-Jul-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] sysctl: make sure that /proc/sys/net/ipv4 appears before per-ns ones

Massage ipv4 initialization - make sure that net.ipv4 appears as
non-per-net-namespace before it shows up in per-net-namespace sysctls.
That's the only change outside of sysctl.c needed to get sane ordering
rules and data structures for sysctls (esp. for procfs side of that
mess).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
547b792cac0a038b9dbf958d3c120df3740b5572 26-Jul-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> net: convert BUG_TRAP to generic WARN_ON

Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
696adfe84c11c571a1e0863460ff0ec142b4e5a9 25-Jul-2008 Paul E. McKenney <paulmck@linux.vnet.ibm.com> list_for_each_rcu must die: networking

All uses of list_for_each_rcu() can be profitably replaced by the
easier-to-use list_for_each_entry_rcu(). This patch makes this change for
networking, in preparation for removing the list_for_each_rcu() API
entirely.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d89cbbb1e69a3bfea38038fb058bd51013902ef3 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> ipv4: clean the init_ipv4_mibs error paths

After moving all the stuff outside this function it looks
a bit ugly - make it look better.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
923c6586b0dc0a00df07a1608185437145a0c68b 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put icmpmsg statistics on struct net

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
b60538a0d737609213e4b758881913498d3ff0b4 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put icmp statistics on struct net

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
386019d3514b3ed9de8d0b05b67e638a7048375b 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put udplite statistics on struct net

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2f275f91a438abd8eec5321798d66a4ffe6869fa 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put udp statistics on struct net

Similar to... ouch, I repeat myself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
61a7e26028b94805fd686a6dc9dbd9941f8f19b0 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put net statistics on struct net

Similar to ip and tcp ones :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a20f5799ca7ceb24d63c74b6fdad4b0c0ee91f4f 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put ip statistics on struct net

Similar to tcp one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
57ef42d59d1c1d79be59fc3c6380ae14234e38c3 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: put tcp statistics on struct net

Proc temporary uses stats from init_net.

BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9b4661bd6e5437508e0920608f3213c23212cd1b 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> ipv4: add pernet mib operations

These ones are currently empty, but stuff from init_ipv4_mibs will
sequentially migrate there.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a9c19329eccdb145a08a4a2e969d7b40c54c9bcc 17-Jul-2008 Pavel Emelyanov <xemul@openvz.org> tcp: add net to tcp_mib_init

This one sets TCP MIBs after zeroing them, and thus requires
the net.

The existing single caller can use init_net (temporarily).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
03d2f897e9fb3218989baa2139a951ce7f5414bf 02-Jul-2008 Wang Chen <wangchen@cn.fujitsu.com> ipv4: Do cleanup for ip_mr_init

Same as ip6_mr_init(), make ip_mr_init() return errno if fails.
But do not do error handling in inet_init(), just print a msg.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
0b040829952d84bf2a62526f0e24b624e0699447 11-Jun-2008 Adrian Bunk <bunk@kernel.org> net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
801678c5a3b4c79236970bcca27c733f5559e0d1 29-Apr-2008 Hirofumi Nakagawa <hnakagawa@miraclelinux.com> Remove duplicated unlikely() in IS_ERR()

Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.

This patch cleans up such pointless code.

Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a7d632b6b4ad1c92746ed409e41f9dc571ec04e2 14-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.

And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5616bdd6dfeb4e36be499dbac245e4d3be90a138 03-Apr-2008 Denis V. Lunev <den@openvz.org> [INET]: uc_ttl assignment in inet_ctl_sock_create is redundant.

uc_ttl is initialized in inet(6)_create and never changed except
setsockopt ioctl. Remove this assignment.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5677242f432102dea9e6eceec1dc089e2f709ca4 03-Apr-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Inet control socket should not hold a namespace.

This is a generic requirement, so make inet_ctl_sock_create namespace
aware and create a inet_ctl_sock_destroy wrapper around
sk_release_kernel.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eee4fe4ded6e9c196168aee8f9787771f4df9c90 03-Apr-2008 Denis V. Lunev <den@openvz.org> [INET]: Let inet_ctl_sock_create return sock rather than socket.

All upper protocol layers are already use sock internally.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3d58b5fa8e4c461ab09afdacd3d1754fccca06ad 03-Apr-2008 Denis V. Lunev <den@openvz.org> [INET]: Rename inet_csk_ctl_sock_create to inet_ctl_sock_create.

This call is nothing common with INET connection sockets code. It
simply creates an unhashes kernel sockets for protocol messages.

Move the new call into af_inet.c after the rename.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b1e0a655f8eba44ab1ee2a1068d169ccfb853b9 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
92f1fecb45ef97acae94463302f79228a4b454d9 24-Mar-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Enable TCP/UDP/ICMP inside namespace.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2342fd7e146f05edeb13feb03490c13a1bdab2e0 24-Mar-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Allow to create sockets in non-initial namespace.

Allow to create sockets in the namespace if the protocol ok with this.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
05cf89d40c85e622dac20e44713168767be5c520 24-Mar-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Process INET socket layer in the correct namespace.

Replace all the reast of the init_net with a proper net on the socket
layer.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6f1cebf71c4e7aae7dfa43414ce2631291def9f 18-Mar-2008 Al Viro <viro@zeniv.linux.org.uk> [NET] endianness noise: INADDR_ANY

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
db8dac20d5199307dcfcf4e01dac4bda5edf9e89 07-Mar-2008 David S. Miller <davem@davemloft.net> [UDP]: Revert udplite and code split.

This reverts commit db1ed684f6c430c4cdad67d058688b8a1b5e607c ("[IPV6]
UDP: Rename IPv6 UDP files."), commit
8be8af8fa4405652e6c0797db5465a4be8afb998 ("[IPV4] UDP: Move
IPv4-specific bits to other file.") and commit
e898d4db2749c6052072e9bc4448e396cbdeb06a ("[UDP]: Allow users to
configure UDP-Lite.").

First, udplite is of such small cost, and it is a core protocol just
like TCP and normal UDP are.

We spent enormous amounts of effort to make udplite share as much code
with core UDP as possible. All of that work is less valuable if we're
just going to slap a config option on udplite support.

It is also causing build failures, as reported on linux-next, showing
that the changeset was not tested very well. In fact, this is the
second build failure resulting from the udplite change.

Finally, the config options provided was a bool, instead of a modular
option. Meaning the udplite code does not even get build tested
by allmodconfig builds, and furthermore the user is not presented
with a reasonable modular build option which is particularly needed
by distribution vendors.

Signed-off-by: David S. Miller <davem@davemloft.net>
0dc47877a3de00ceadea0005189656ae8dc52669 06-Mar-2008 Harvey Harrison <harvey.harrison@gmail.com> net: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e898d4db2749c6052072e9bc4448e396cbdeb06a 29-Feb-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [UDP]: Allow users to configure UDP-Lite.

Let's give users an option for disabling UDP-Lite (~4K).

old:
| text data bss dec hex filename
| 286498 12432 6072 305002 4a76a net/ipv4/built-in.o
| 193830 8192 3204 205226 321aa net/ipv6/ipv6.o

new (without UDP-Lite):
| text data bss dec hex filename
| 284086 12136 5432 301654 49a56 net/ipv4/built-in.o
| 191835 7832 3076 202743 317f7 net/ipv6/ipv6.o

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
a5710d6582868a96cf0fe02eac11d34cfbc783c9 29-Feb-2008 Denis V. Lunev <den@openvz.org> [ICMP]: Add return code to icmp_init.

icmp_init could fail and this is normal for namespace other than initial.
So, the panic should be triggered only on init_net initialization path.

Additionally create rollback path for icmp_init as a separate function.
It will also be used later during namespace destruction.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9b0f976f27f00a81cf47643d90854659626795b4 29-Feb-2008 Denis V. Lunev <den@openvz.org> [INET]: Remove struct net_proto_family* from _init calls.

struct net_proto_family* is not used in icmp[v6]_init, ndisc_init,
igmp_init and tcp_v4_init. Remove it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e5b13cb10de209f924fdf9478214bcf7e4008d6d 29-Feb-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Process devinet ioctl in the correct namespace.

Add namespace parameter to devinet_ioctl and locate device inside it for
state changes.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1b050bf7a88910f9f00c9c8989c1bf5a67dd140 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_route_output_flow.

Needed to propagate it down to the __ip_route_output_key.

Signed_off_by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
1bad118a330d494b23663fce94d4e9d9d5065fa7 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Pass namespace through ip_rt_ioctl.

... up to rtentry_to_fib_config

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6b175b26c1048d331508940ad3516ead1998084f 10-Jan-2008 Eric W. Biederman <ebiederm@xmission.com> [NETNS]: Add netns parameter to inet_(dev_)add_type.

The patch extends the inet_addr_type and inet_dev_addr_type with the
network namespace pointer. That allows to access the different tables
relatively to the network namespace.

The modification of the signature function is reported in all the
callers of the inet_addr_type using the pointer to the well known
init_net.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7b1a74fdbb9ec38a9780620fae25519fde4b21ee 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Refactor fib initialization so it can handle multiple namespaces.

This patch makes the fib to be initialized as a subsystem for the
network namespaces. The code does not handle several namespaces yet,
so in case of a creation of a network namespace, the
creation/initialization will not occur.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
61a0265344786a548e8a0b26cb668e78a71f9602 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace to API for routing /proc entries creation.

This adds netns parameter to fib_proc_init/exit and replaces __init
specifier with __net_init. After this, we will not yet have these proc
files show info from the specific namespace - this will be done when
these tables become namespaced.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
95766fff6b9a78d11fc2d3812dd035381690b55d 31-Dec-2007 Hideo Aoki <haoki@redhat.com> [UDP]: Add memory accounting.

Signed-off-by: Takahiro Yasui <tyasui@redhat.com>
Signed-off-by: Hideo Aoki <haoki@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
32e569b7277f13c4b27bb29c761189963e49ce7a 16-Dec-2007 Pavel Emelyanov <xemul@openvz.org> [IPV4]: Pass the net pointer to the arp_req_set_proxy()

This one will need to set the IPV4_DEVCONF_ALL(PROXY_ARP), but
there's no ways to get the net right in place, so we have to
pull one from the inet_ioctl's struct sock.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8b7817f3a959ed99d7443afc12f78a7e1fcc2063 12-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au> [IPSEC]: Add ICMP host relookup support

RFC 4301 requires us to relookup ICMP traffic that does not match any
policies using the reverse of its payload. This patch implements this
for ICMP traffic that originates from or terminates on localhost.

This is activated on outbound with the new policy flag XFRM_POLICY_ICMP,
and on inbound by the new state flag XFRM_STATE_ICMP.

On inbound the policy check is now performed by the ICMP protocol so
that it can repeat the policy check where necessary.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
c69bce20dda7f79160856a338298d65a284ba303 24-Jan-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET]: Remove unused "mibalign" argument for snmp_mib_init().

With fixes from Arnaldo Carvalho de Melo.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ba639797606acbcd97be886f41fbce163914e7b 05-Dec-2007 Pavel Emelyanov <xemul@openvz.org> [IPV4]: Cleanup the sysctl_net_ipv4.c file

This includes several cleanups:

* tune Makefile to compile out this file when SYSCTL=n. Now
it looks like net/core/sysctl_net_core.c one;
* move the ipv4_config to af_inet.c to exist all the time;
* remove additional sysctl_ip_nonlocal_bind declaration
(it is already declared in net/ip.h);
* remove no nonger needed ifdefs from this file.

This is a preparation for using ctl paths for net/ipv4/
sysctl table.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9c55e01c0cc835818475a6ce8c4d684df9949ac8 07-Nov-2007 Jens Axboe <jens.axboe@oracle.com> [TCP]: Splice receive support.

Support for network splice receive.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6257ff2177ff02d7f260a7a501876aa41cb9a9f6 01-Nov-2007 Pavel Emelyanov <xemul@openvz.org> [NET]: Forget the zero_it argument of sk_alloc()

Finally, the zero_it argument can be completely removed from
the callers and from the function prototype.

Besides, fix the checkpatch.pl warnings about using the
assignments inside if-s.

This patch is rather big, and it is a part of the previous one.
I splitted it wishing to make the patches more readable. Hope
this particular split helped.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
96793b482540f3a26e2188eaf75cb56b7829d3e3 17-Sep-2007 David L Stevens <dlstevens@us.ibm.com> [IPV4]: Add ICMPMsgStats MIB (RFC 4293)

Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c40f6fff401f693f627d3d44ef7663b993943517 17-Sep-2007 Denis Cheng <crquan@gmail.com> [IPV4] af_inet.c: use ARRAY_SIZE macro from kernel.h instead

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1b8d7ae42d02e483ad94035cca851e4f7fbecb40 09-Oct-2007 Eric W. Biederman <ebiederm@xmission.com> [NET]: Make socket creation namespace safe.

This patch passes in the namespace a new socket should be created in
and has the socket code do the appropriate reference counting. By
virtue of this all socket create methods are touched. In addition
the socket create methods are modified so that they will fail if
you attempt to create a socket in a non-default network namespace.

Failing if we attempt to create a socket outside of the default
network namespace ensures that as we incrementally make the network stack
network namespace aware we will not export functionality that someone
has not audited and made certain is network namespace safe.
Allowing us to partially enable network namespaces before all of the
exotic protocols are supported.

Any protocol layers I have missed will fail to compile because I now
pass an extra parameter into the socket creation code.

[ Integrated AF_IUCV build fixes from Andrew Morton... -DaveM ]

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3516ffb0fef710749daf288c0fe146503e0cf9d4 03-Aug-2007 David S. Miller <davem@sunset.davemloft.net> [TCP]: Invoke tcp_sendmsg() directly, do not use inet_sendmsg().

As discovered by Evegniy Polyakov, if we try to sendmsg after
a connection reset, we can do incredibly stupid things.

The core issue is that inet_sendmsg() tries to autobind the
socket, but we should never do that for TCP. Instead we should
just go straight into TCP's sendmsg() code which will do all
of the necessary state and pending socket error checks.

TCP's sendpage already directly vectors to tcp_sendpage(), so this
merely brings sendmsg() in line with that.

Signed-off-by: David S. Miller <davem@davemloft.net>
d212f87b068c9d72065ef579d85b5ee6b8b59381 27-Jun-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: IPV6 checksum offloading in network devices

The existing model for checksum offload does not correctly handle
devices that can offload IPV4 and IPV6 only. The NETIF_F_HW_CSUM flag
implies device can do any arbitrary protocol.

This patch:
* adds NETIF_F_IPV6_CSUM for those devices
* fixes bnx2 and tg3 devices that need it
* add NETIF_F_IPV6_CSUM to ipv6 output (incl GSO)
* fixes assumptions about NETIF_F_ALL_CSUM in nat
* adjusts bridge union of checksumming computation

Signed-off-by: David S. Miller <davem@davemloft.net>
e63340ae6b6205fef26b40a75673d1c9c0c8bb90 08-May-2007 Randy Dunlap <randy.dunlap@oracle.com> header cleaning: don't include smp_lock.h when not used

Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5e0f04351d11e07a23b5ab4914282cbb78027e50 25-Apr-2007 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Consolidate common SNMP code

This patch moves the SNMP code shared between IPv4/IPv6 from proc.c
into net/ipv4/af_inet.c. This makes sense because these functions
aren't specific to /proc.

As a result we can again skip proc.o if /proc is disabled.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
334901700f9f58993ebd7f6136d3f9062460d34d 21-Apr-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4] SNMP: Move some statistic bits to net/ipv4/proc.c.

This also fixes memory leak in error path.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
be776281aee54626a474ba06f91926b98bdd180d 27-Mar-2007 Eric Dumazet <dada1@cosmosbay.com> [NET]: inet_ehash_secret should be __read_mostly and set only once

There is a very tiny probability that build_ehash_secret() is called
at the same time by different CPUS.

Also, using __read_mostly is a must for inet_ehash_secret

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b3da2cf37c5c6e47698957a25ab43a7223dbb90f 23-Mar-2007 David S. Miller <davem@davemloft.net> [INET]: Use jhash + random secret for ehash.

The days are gone when this was not an issue, there are folks out
there with huge bot networks that can be used to attack the
established hash tables on remote systems.

So just like the routing cache and connection tracking
hash, use Jenkins hash with random secret input.

Signed-off-by: David S. Miller <davem@davemloft.net>
badff6d01a8589a1c828b0bf118903ca38627f4e 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_reset_transport_header(skb)

For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eddc9ec53be2ecdbf4efe0efd4a83052594f0ac0 21-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d56f90a7c96da5187f0cdf07ee7434fe6aa78bbc 11-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_network_header()

For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
132adf54639cf7dd9315e8df89c2faa59f6e46d9 09-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [IPV4]: cleanup

Add whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ae40eb1ef30ab4120bd3c8b7e3da99ee53d27a23 19-Mar-2007 Eric Dumazet <dada1@cosmosbay.com> [NET]: Introduce SIOCGSTAMPNS ioctl to get timestamps with nanosec resolution

Now network timestamps use ktime_t infrastructure, we can add a new
ioctl() SIOCGSTAMPNS command to get timestamps in 'struct timespec'.
User programs can thus access to nanosecond resolution.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
e905a9edab7f4f14f9213b52234e4a346c690911 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8eb9086f21c73b38b5ca27558db4c91d62d0e70b 08-Feb-2007 David S. Miller <davem@sunset.davemloft.net> [IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.

Do this even for non-blocking sockets. This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller <davem@davemloft.net>
469de9b90f739f130ab3d483e819888e977596b8 09-Jan-2007 Paul Moore <paul.moore@hp.com> [INET]: style updates for the inet_sock->is_icsk assignment fix

A quick patch to change the inet_sock->is_icsk assignment to better fit with
existing kernel coding style.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
cbbd7d4f36a61631f8c0d73be43df985d1e7d6a6 05-Jan-2007 Paul Moore <paul.moore@hp.com> [INET]: Fix incorrect "inet_sock->is_icsk" assignment.

The inet_create() and inet6_create() functions incorrectly set the
inet_sock->is_icsk field. Both functions assume that the is_icsk field is
large enough to hold at least a INET_PROTOSW_ICSK value when it is actually
only a single bit. This patch corrects the assignment by doing a boolean
comparison whose result will safely fit into a single bit field.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
714e85be3557222bc25f69c252326207c900a7db 15-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV6]: Assorted trivial endianness annotations.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba4e58eca8aa9473b44fdfd312f26c4a2e7798b3 27-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [NET]: Supporting UDP-Lite (RFC 3828) in Linux

This is a revision of the previously submitted patch, which alters
the way files are organized and compiled in the following manner:

* UDP and UDP-Lite now use separate object files
* source file dependencies resolved via header files
net/ipv{4,6}/udp_impl.h
* order of inclusion files in udp.c/udplite.c adapted
accordingly

[NET/IPv4]: Support for the UDP-Lite protocol (RFC 3828)

This patch adds support for UDP-Lite to the IPv4 stack, provided as an
extension to the existing UDPv4 code:
* generic routines are all located in net/ipv4/udp.c
* UDP-Lite specific routines are in net/ipv4/udplite.c
* MIB/statistics support in /proc/net/snmp and /proc/net/udplite
* shared API with extensions for partial checksum coverage

[NET/IPv6]: Extension for UDP-Lite over IPv6

It extends the existing UDPv6 code base with support for UDP-Lite
in the same manner as per UDPv4. In particular,
* UDPv6 generic and shared code is in net/ipv6/udp.c
* UDP-Litev6 specific extensions are in net/ipv6/udplite.c
* MIB/statistics support in /proc/net/snmp6 and /proc/net/udplite6
* support for IPV6_ADDRFORM
* aligned the coding style of protocol initialisation with af_inet6.c
* made the error handling in udpv6_queue_rcv_skb consistent;
to return `-1' on error on all error cases
* consolidation of shared code

[NET]: UDP-Lite Documentation and basic XFRM/Netfilter support

The UDP-Lite patch further provides
* API documentation for UDP-Lite
* basic xfrm support
* basic netfilter support for IPv4 and IPv6 (LOG target)

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
72a3effaf633bcae9034b7e176bdbd78d64a71db 16-Nov-2006 Eric Dumazet <dada1@cosmosbay.com> [NET]: Size listen hash tables using backlog hint

We currently allocate a fixed size (TCP_SYNQ_HSIZE=512) slots hash table for
each LISTEN socket, regardless of various parameters (listen backlog for
example)

On x86_64, this means order-1 allocations (might fail), even for 'small'
sockets, expecting few connections. On the contrary, a huge server wanting a
backlog of 50000 is slowed down a bit because of this fixed limit.

This patch makes the sizing of listen hash table a dynamic parameter,
depending of :
- net.core.somaxconn tunable (default is 128)
- net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128)
- backlog value given by user application (2nd parameter of listen())

For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of
kmalloc().

We still limit memory allocation with the two existing tunables (somaxconn &
tcp_max_syn_backlog). So for standard setups, this patch actually reduce RAM
usage.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ca3c68e76686bee058937ade2b96f4de58ee434 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: struct ip_options annotations

->faddr is net-endian; annotated as such, variables inferred to be net-endian
annotated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
321efff7c3b7a26fa0522cb12b2af2ac82c05f1e 28-Sep-2006 Olaf Kirch <okir@suse.de> [IPV4]: Fix order in inet_init failure path.

This is just a minor buglet I came across by accident - when inet_init
fails to register raw_prot, it jumps to out_unregister_udp_proto which
should unregister UDP _and_ TCP.

Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
bada8adc4e6622764205921e6ba3f717aa03c882 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: ip_route_connect() ipv4 address arguments annotated

annotated address arguments (port number left alone for now); ditto
for inferred net-endian variables in callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
ef047f5e1085d6393748d1ee27d6327905f098dc 01-Sep-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET]: Use BUILD_BUG_ON() for checking size of skb->cb.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ab32ea5d8a760e7dd4339634e95d7be24ee5b842 22-Sep-2006 Brian Haley <brian.haley@hp.com> [NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly

Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bf0d52492d82ad70684e17c8a46942c13d0e140e 22-Sep-2006 Dave Jones <davej@redhat.com> [NET]: Remove unnecessary config.h includes from net/

config.h is automatically included by kbuild these days.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
beb8d13bed80f8388f1a9a107d07ddd342e627e8 05-Aug-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com> [MLSXFRM]: Add flow labeling

This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.

The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.

ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a430a43d087545c96542ee64573237919109d370 08-Jul-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET] gso: Fix up GSO packets with broken checksums

Certain subsystems in the stack (e.g., netfilter) can break the partial
checksum on GSO packets. Until they're fixed, this patch allows this to
work by recomputing the partial checksums through the GSO mechanism.

Once they've all been converted to update the partial checksum instead of
clearing it, this workaround can be removed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
bbcf467dab42ea3c85f368df346c82af2fbba665 04-Jul-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Verify gso_type too in gso_segment

We don't want nasty Xen guests to pass a TCPv6 packet in with gso_type set
to TCPv4 or even UDP (or a packet that's both TCP and UDP).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
576a30eb6453439b3c37ba24455ac7090c247b5a 27-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Added GSO header verification

When GSO packets come from an untrusted source (e.g., a Xen guest domain),
we need to verify the header integrity before passing it to the hardware.

Since the first step in GSO is to verify the header, we can reuse that
code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this
bit set can only be fed directly to devices with the corresponding bit
NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb
is fed to the GSO engine which will allow the packet to be sent to the
hardware if it passes the header check.

This patch changes the sg flag to a full features flag. The same method
can be used to implement TSO ECN support. We simply have to mark packets
with CWR set with SKB_GSO_ECN so that only hardware with a corresponding
NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment
the packet, or segment the first MTU and pass the rest to the hardware for
further segmentation.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
f4c50d990dcf11a296679dc05de3873783236711 22-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Add software TSOv4

This patch adds the GSO implementation for IPv4 TCP.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
a536e0778781c1f2ce38cf8e1d82ce258f0193c1 29-Apr-2006 Heiko Carstens <heiko.carstens@de.ibm.com> [IPV4]: inet_init() -> fs_initcall

Convert inet_init to an fs_initcall to make sure its called before any
device driver's initcall.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
543d9cfeec4d58ad3fd974db5531b06b6b95deb4 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [NET]: Identation & other cleanups related to compat_[gs]etsockopt cset

No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3fdadf7d27e3fbcf72930941884387d1f4936f04 21-Mar-2006 Dmitry Mishin <dim@openvz.org> [NET]: {get|set}sockopt compatibility layer

This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4fc268d24ceb9f4150777c1b5b2b8e6214e56b2b 11-Jan-2006 Randy Dunlap <rdunlap@xenotime.net> [PATCH] capable/capability.h (net/)

net: Use <linux/capability.h> where capable() is used.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b5e5fa5e093e42cab4ee3d6dcbc4f450ad29a723 03-Jan-2006 Christoph Hellwig <hch@lst.de> [NET]: Add a dev_ioctl() fallback to sock_ioctl()

Currently all network protocols need to call dev_ioctl as the default
fallback in their ioctl implementations. This patch adds a fallback
to dev_ioctl to sock_ioctl if the protocol returned -ENOIOCTLCMD.
This way all the procotol ioctl handlers can be simplified and we don't
need to export dev_ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
14c850212ed8f8cbb5972ad6b8812e08a0bc901c 27-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.h

To help in reducing the number of include dependencies, several files were
touched as they were getting needed headers indirectly for stuff they use.

Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had
linux/dccp.h include twice.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
90ddc4f0470427df306f308ad03db6b6b21644b8 22-Dec-2005 Eric Dumazet <dada1@cosmosbay.com> [NET]: move struct proto_ops to const

I noticed that some of 'struct proto_ops' used in the kernel may share
a cache line used by locks or other heavily modified data. (default
linker alignement is 32 bytes, and L1_CACHE_LINE is 64 or 128 at
least)

This patch makes sure a 'struct proto_ops' can be declared as const,
so that all cpus can share all parts of it without false sharing.

This is not mandatory : a driver can still use a read/write structure
if it needs to (and eventually a __read_mostly)

I made a global stubstitute to change all existing occurences to make
them const.

This should reduce the possibility of false sharing on SMP, and
speedup some socket system calls.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d83d8461f902c672bc1bd8fbc6a94e19f092da97 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [IP_SOCKGLUE]: Remove most of the tcp specific calls

As DCCP needs to be called in the same spots.

Now we have a member in inet_sock (is_icsk), set at sock creation time from
struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and
DCCP) to see if a struct sock instance is a inet_connection_sock for places
like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if
sk_type was SOCK_STREAM, that is insufficient because we now use the same code
for DCCP, that has sk_type SOCK_DCCP.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
86c8f9d158f68538a971a47206a46a22c7479bac 03-Dec-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4] Fix EPROTONOSUPPORT error in inet_create

There is a coding error in inet_create that causes it to always return
ESOCKTNOSUPPORT. It should return EPROTONOSUPPORT when there are
protocols registered for a given socket type but none of them match
the requested protocol.

This is based on a patch by Jayachandran C.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
a51482bde22f99c63fbbb57d5d46cc666384e379 08-Nov-2005 Jesper Juhl <jesper.juhl@gmail.com> [NET]: kfree cleanup

From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
cb7b593c2c808b32a1ea188599713c434b95f849 09-Sep-2005 Stephen Hemminger <shemminger@osdl.org> [IPV4] fib_trie: fix proc interface

Create one iterator for walking over FIB trie, and use it
for all the /proc functions. Add a /proc/net/route
output for backwards compatibility with old applications.

Make initialization of fib_trie same as fib_hash so no #ifdef
is needed in af_inet.c

Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=5209

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba89966c1984513f4f2cc0a6c182266be44ddd03 26-Aug-2005 Eric Dumazet <dada1@cosmosbay.com> [NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointers

This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.

On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
20380731bc2897f2952ae055420972ded4cd786e 16-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [NET]: Fix sparse warnings

Of this type, mostly:

CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bb97d31f5130d677644d9931ef38613d1164ec94 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [INET]: Make inet_create try to load protocol modules

Syntax is net-pf-PROTOCOL_FAMILY-PROTOCOL-SOCK_TYPE and if this
fails net-pf-PROTOCOL_FAMILY-PROTOCOL.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
295f7324ff8d9ea58b4d3ec93b1aaa1d80e048a9 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [ICSK]: Introduce reqsk_queue_prune from code in tcp_synack_timer

With this we're very close to getting all of the current TCP
refactorings in my dccp-2.6 tree merged, next changeset will export
some functions needed by the current DCCP code and then dccp-2.6.git
will be born!

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
0a5578cf8e5e045aaa68643c17ce885426697c6b 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [ICSK]: Generalise tcp_listen_{start,stop}

This also moved inet_iif from tcp to inet_hashtables.h, as it is
needed by the inet_lookup callers, perhaps this needs a bit of
polishing, but for now seems fine.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
32519f11d38ea8f4f60896763bacec7db1760f9c 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [INET]: Introduce inet_sk_rebuild_header

From tcp_v4_rebuild_header, that already was pretty generic, I only
needed to use sk->sk_protocol instead of the hardcoded IPPROTO_TCP and
establish the requirement that INET transport layer protocols that
want to use this function map TCP_SYN_SENT to its equivalent state.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6848976b721eeb5551cd94673faafeef78d9f35 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [NET]: Cleanup INET_REFCNT_DEBUG code

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c877efb207bf4629cfa97ac13412f7392a873485 19-Jul-2005 Stephen Hemminger <shemminger@osdl.org> [IPV4]: Fix up lots of little whitespace indentation stuff in fib_trie.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
30e224d76f34e041c30df66a4dcbeeb53556ea3f 05-Jul-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Fix crash in ip_rcv while booting related to netconsole

Makes IPv4 ip_rcv registration happen last in af_inet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
19baf839ff4a8daa1f2a7400897094fc18e4f5e9 21-Jun-2005 Robert Olsson <Robert.Olsson@data.slu.se> [IPV4]: Add LC-Trie FIB lookup algorithm.

Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
cdac4e07748934e37e415437055ed591aed9eb21 14-Jun-2005 Neil Horman <nhorman@redhat.com> [SCTP] Add support for ip_nonlocal_bind sysctl & IP_FREEBIND socket option

Signed-off-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
02c30a84e6298b6b20a56f0896ac80b47839e134 06-May-2005 Jesper Juhl <juhl-lkml@dif.dk> [PATCH] update Ross Biro bouncing email address

Ross moved. Remove the bad email address so people will find the correct
one in ./CREDITS.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
5523662c4cd585b892811d7bb3e25d9a787e19b3 26-Apr-2005 Al Viro <viro@parcelfarce.linux.theplanet.co.uk> [NET]: kill gratitious includes of major.h

A lot of places in there are including major.h for no reason
whatsoever. Removed. And yes, it still builds.

The history of that stuff is often amusing. E.g. for net/core/sock.c
the story looks so, as far as I've been able to reconstruct it: we used to
need major.h in net/socket.c circa 1.1.early. In 1.1.13 that need had
disappeared, along with register_chrdev(SOCKET_MAJOR, "socket", &net_fops)
in sock_init(). Include had not. When 1.2 -> 1.3 reorg of net/* had moved
a lot of stuff from net/socket.c to net/core/sock.c, this crap had followed...

Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
b453257f057b834fdf9f4a6ad6133598b79bd982 26-Apr-2005 Al Viro <viro@www.linux.org.uk> [PATCH] kill gratitious includes of major.h under net/*

A lot of places in there are including major.h for no reason whatsoever.
Removed. And yes, it still builds.

The history of that stuff is often amusing. E.g. for net/core/sock.c
the story looks so, as far as I've been able to reconstruct it: we used
to need major.h in net/socket.c circa 1.1.early. In 1.1.13 that need
had disappeared, along with register_chrdev(SOCKET_MAJOR, "socket",
&net_fops) in sock_init(). Include had not. When 1.2 -> 1.3 reorg of
net/* had moved a lot of stuff from net/socket.c to net/core/sock.c,
this crap had followed...

Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!