History log of /net/ipv4/ip_output.c
Revision Date Author Comments
ba3d8d3f9f65807763b2e0e1ea7645d74a962248 31-Mar-2014 Lorenzo Colitti <lorenzo@google.com> net: core: Support UID-based routing.

This contains the following commits:

1. cc2f522 net: core: Add a UID range to fib rules.
2. d7ed2bd net: core: Use the socket UID in routing lookups.
3. 2f9306a net: core: Add a RTA_UID attribute to routes.
This is so that userspace can do per-UID route lookups.
4. 8e46efb net: ipv6: Use the UID in IPv6 PMTUD
IPv4 PMTUD already does this because ipv4_sk_update_pmtu
uses __build_flow_key, which includes the UID.

Bug: 15413527
Change-Id: I81bd31dae655de9cce7d7a1f9a905dc1c2feba7c
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
330966e501ffe282d7184fde4518d5e0c24bc7f8 20-Oct-2014 Florian Westphal <fw@strlen.de> net: make skb_gso_segment error handling more robust

skb_gso_segment has three possible return values:
1. a pointer to the first segmented skb
2. an errno value (IS_ERR())
3. NULL. This can happen when GSO is used for header verification.

However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL
and would oops when NULL is returned.

Note that these call sites should never actually see such a NULL return
value; all callers mask out the GSO bits in the feature argument.

However, there have been issues with some protocol handlers erronously not
respecting the specified feature mask in some cases.

It is preferable to get 'have to turn off hw offloading, else slow' reports
rather than 'kernel crashes'.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
4062090e3e5caaf55bed4523a69f26c3265cc1d2 15-Oct-2014 Vasily Averin <vvs@parallels.com> ipv4: dst_entry leak in ip_send_unicast_reply()

ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
and in case errors in __ip_append_data() nobody frees stolen dst entry

Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
Signed-off-by: Vasily Averin <vvs@parallels.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1109a90c01177e8f4a5fd95c5b685ad02f1fe9bb 01-Oct-2014 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: use IS_ENABLED(CONFIG_BRIDGE_NETFILTER)

In 34666d4 ("netfilter: bridge: move br_netfilter out of the core"),
the bridge netfilter code has been modularized.

Use IS_ENABLED instead of ifdef to cover the module case.

Fixes: 34666d4 ("netfilter: bridge: move br_netfilter out of the core")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
24a2d43d8886f5a29c3cf108927f630c545a9a38 27-Sep-2014 Eric Dumazet <edumazet@google.com> ipv4: rename ip_options_echo to __ip_options_echo()

ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
Lets break this assumption, but provide a helper to not change all call points.

ip_send_unicast_reply() gets a new struct ip_options pointer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
09c2d251b70723650ba47e83571ff49281320f7c 05-Aug-2014 Willem de Bruijn <willemb@google.com> net-timestamp: add key to disambiguate concurrent datagrams

Datagrams timestamped on transmission can coexist in the kernel stack
and be reordered in packet scheduling. When reading looped datagrams
from the socket error queue it is not always possible to unique
correlate looped data with original send() call (for application
level retransmits). Even if possible, it may be expensive and complex,
requiring packet inspection.

Introduce a data-independent ID mechanism to associate timestamps with
send calls. Pass an ID alongside the timestamp in field ee_data of
sock_extended_err.

The ID is a simple 32 bit unsigned int that is associated with the
socket and incremented on each send() call for which software tx
timestamp generation is enabled.

The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to
avoid changing ee_data for existing applications that expect it 0.
The counter is reset each time the flag is reenabled. Reenabling
does not change the ID of already submitted data. It is possible
to receive out of order IDs if the timestamp stream is not quiesced
first.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
11878b40ed5c5bc20d6a115bae156a5b90b0fb3e 14-Jul-2014 Willem de Bruijn <willemb@google.com> net-timestamp: SOCK_RAW and PING timestamping

Add SO_TIMESTAMPING to sockets of type PF_INET[6]/SOCK_RAW:

Add the necessary sock_tx_timestamp calls to the datapath for RAW
sockets (ping sockets already had these calls).

Fix the IP output path to pass the timestamp flags on the first
fragment also for these sockets. The existing code relies on
transhdrlen != 0 to indicate a first fragment. For these sockets,
that assumption does not hold.

This fixes http://bugzilla.kernel.org/show_bug.cgi?id=77221

Tested SOCK_RAW on IPv4 and IPv6, not PING.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
73f156a6e8c1074ac6327e0abd1169e95eb66463 02-Jun-2014 Eric Dumazet <edumazet@google.com> inetpeer: get rid of ip_id_count

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.

4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e110861f86094cd78cc85593b873970092deb43a 13-May-2014 Lorenzo Colitti <lorenzo@google.com> net: add a sysctl to reflect the fwmark on replies

Kernel-originated IP packets that have no user socket associated
with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
are emitted with a mark of zero. Add a sysctl to make them have
the same mark as the packet they are replying to.

This allows an administrator that wishes to do so to use
mark-based routing, firewalling, etc. for these replies by
marking the original packets inbound.

Tested using user-mode linux:
- ICMP/ICMPv6 echo replies and errors.
- TCP RST packets (IPv4 and IPv6).

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
60ff746739bf805a912484643c720b6124826140 05-May-2014 WANG Cong <xiyou.wangcong@gmail.com> net: rename local_df to ignore_df

As suggested by several people, rename local_df to ignore_df,
since it means "ignore df bit if it is set".

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7ba65d7b64984ff371cb5630b36af23506c50d5 05-May-2014 Florian Westphal <fw@strlen.de> net: ip: push gso skb forwarding handling down the stack

Doing the segmentation in the forward path has one major drawback:

When using virtio, we may process gso udp packets coming
from host network stack. In that case, netfilter POSTROUTING
will see one packet with udp header followed by multiple ip
fragments.

Delay the segmentation and do it after POSTROUTING invocation
to avoid this.

Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
aad88724c9d54acb1a9737cb6069d8470fa85f74 15-Apr-2014 Eric Dumazet <edumazet@google.com> ipv4: add a sock pointer to dst->output() path.

In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.

The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.

Fixes: 8f646c922d55 ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b0270e91014dabfceaf37f5b40ad51bbf21a1302 15-Apr-2014 Eric Dumazet <edumazet@google.com> ipv4: add a sock pointer to ip_queue_xmit()

ip_queue_xmit() assumes the skb it has to transmit is attached to an
inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
changed l2tp to not change skb ownership and thus broke this assumption.

One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
so that we do not assume skb->sk points to the socket used by l2tp
tunnel.

Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
Reported-by: Zhan Jianyu <nasa4836@gmail.com>
Tested-by: Zhan Jianyu <nasa4836@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1b346576359c72bee34b1476b4fc63d77d37b314 26-Feb-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT

IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
generation of fragments if the interface mtu is exceeded, it is very
hard to make use of this option in already deployed name server software
for which I introduced this option.

This patch adds yet another new IP_MTU_DISCOVER option to not honor any
path mtu information and not accepting new icmp notifications destined for
the socket this option is enabled on. But we allow outgoing fragmentation
in case the packet size exceeds the outgoing interface mtu.

As such this new option can be used as a drop-in replacement for
IP_PMTUDISC_DONT, which is currently in use by most name server software
making the adoption of this option very smooth and easy.

The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
ignoring incoming path MTU updates and not honoring discovered path MTUs
in the output path.

Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
69647ce46a236a355a7a3096d793819a9bd7c1d3 26-Feb-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment

ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
is attached to the skb (in case of forwarding) or determines the mtu like
we do in ip_finish_output, which actually checks if we should branch to
ip_fragment. Thus use the same function to determine the mtu here, too.

This is important for the introduction of IP_PMTUDISC_OMIT, where we
want the packets getting cut in pieces of the size of the outgoing
interface mtu. IPv6 already does this correctly.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
478b360a47b71f3b5030eacd3aae6acb1a32c2b6 15-Feb-2014 Florian Westphal <fw@strlen.de> netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n

When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.

Unlike iptables tracing functionality is not conditional in nftables,
so always copy/zero nf_trace setting when nftables is enabled.

Move this into __nf_copy() helper.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
72c1d3bdd5bf10a789608336ba0d61f1e44e4350 11-Jan-2014 WANG Cong <xiyou.wangcong@gmail.com> ipv4: register igmp_notifier even when !CONFIG_PROC_FS

We still need this notifier even when we don't config
PROC_FS.

It should be rare to have a kernel without PROC_FS,
so just for completeness.

Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f87c10a8aa1e82498c42d0335524d6ae7cf5a52b 09-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing

While forwarding we should not use the protocol path mtu to calculate
the mtu for a forwarded packet but instead use the interface mtu.

We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
introduced for multicast forwarding. But as it does not conflict with
our usage in unicast code path it is perfect for reuse.

I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
dependencies because of IPSKB_FORWARDED.

Because someone might have written a software which does probe
destinations manually and expects the kernel to honour those path mtus
I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
can disable this new behaviour. We also still use mtus which are locked on a
route for forwarding.

The reason for this change is, that path mtus information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv4 forwarding path to
wrongfully emit fragmentation needed notifications or start to fragment
packets along a path.

Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
won't be set and further fragmentation logic will use the path mtu to
determine the fragmentation size. They also recheck packet size with
help of path mtu discovery and report appropriate errors.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
61e7f09d0f437c9614029445754099383ec2eec4 19-Dec-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: consistent reporting of pmtu data in case of corking

We report different pmtu values back on the first write and on further
writes on an corked socket.

Also don't include the dst.header_len (respectively exthdrlen) as this
should already be dealt with by the interface mtu of the outgoing
(virtual) interface and policy of that interface should dictate if
fragmentation should happen.

Instead reduce the pmtu data by IP options as we do for IPv6. Make the
same changes for ip_append_data, where we did not care about options or
dst.header_len at all.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
482fc6094afad572a4ea1fd722e7b11ca72022a0 05-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE

Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
their sockets won't accept and install new path mtu information and they
will always use the interface mtu for outgoing packets. It is guaranteed
that the packet is not fragmented locally. But we won't set the DF-Flag
on the outgoing frames.

Florian Weimer had the idea to use this flag to ensure DNS servers are
never generating outgoing fragments. They may well be fragmented on the
path, but the server never stores or usees path mtu values, which could
well be forged in an attack.

(The root of the problem with path MTU discovery is that there is
no reliable way to authenticate ICMP Fragmentation Needed But DF Set
messages because they are sent from intermediate routers with their
source addresses, and the IMCP payload will not always contain sufficient
information to identify a flow.)

Recent research in the DNS community showed that it is possible to
implement an attack where DNS cache poisoning is feasible by spoofing
fragments. This work was done by Amir Herzberg and Haya Shulman:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>

This issue was previously discussed among the DNS community, e.g.
<http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
without leading to fixes.

This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
regarding local fragmentation with UFO/CORK" for the enforcement of the
non-fragmentable checks. If other users than ip_append_page/data should
use this semantic too, we have to add a new flag to IPCB(skb)->flags to
suppress local fragmentation and check for this in ip_finish_output.

Many thanks to Florian Weimer for the idea and feedback while implementing
this patch.

Cc: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
daba287b299ec7a2c61ae3a714920e90e8396ad5 27-Oct-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK

UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and
IP_PMTUDISC_PROBE well enough.

UFO enabled packet delivery just appends all frags to the cork and hands
it over to the network card. So we just deliver non-DF udp fragments
(DF-flag may get overwritten by hardware or virtual UFO enabled
interface).

UDP_CORK does enqueue the data until the cork is disengaged. At this
point it sets the correct IP_DF and local_df flags and hands it over to
ip_fragment which in this case will generate an icmp error which gets
appended to the error socket queue. This is not reflected in the syscall
error (of course, if UFO is enabled this also won't happen).

Improve this by checking the pmtudisc flags before appending data to the
socket and if we still can fit all data in one packet when IP_PMTUDISC_DO
or IP_PMTUDISC_PROBE is set, only then proceed.

We use (mtu-fragheaderlen) to check for the maximum length because we
ensure not to generate a fragment and non-fragmented data does not need
to have its length aligned on 64 bit boundaries. Also the passed in
ip_options are already aligned correctly.

Maybe, we can relax some other checks around ip_fragment. This needs
more research.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
e93b7d748be887cd7639b113ba7d7ef792a7efb9 19-Oct-2013 Jiri Pirko <jiri@resnulli.us> ip_output: do skb ufo init for peeked non ufo skb as well

Now, if user application does:
sendto len<mtu flag MSG_MORE
sendto len>mtu flag 0
The skb is not treated as fragmented one because it is not initialized
that way. So move the initialization to fix this.

introduced by:
commit e89e9cf539a28df7d0eb1d0a545368e9920b34ac "[IPv4/IPv6]: UFO Scatter-gather approach"

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
aa6615814533c634190019ee3a5b10490026d545 24-Sep-2013 Francesco Fusco <ffusco@redhat.com> ipv4: processing ancillary IP_TOS or IP_TTL

If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
703133de331a7a7df47f31fb9de51dc6f68a9de8 19-Sep-2013 Ansis Atteka <aatteka@nicira.com> ip: generate unique IP identificator if local fragmentation is allowed

If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.

For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
749154aa56b57652a282cbde57a57abc278d1205 19-Sep-2013 Ansis Atteka <aatteka@nicira.com> ip: use ip_hdr() in __ip_make_skb() to retrieve IP header

skb->data already points to IP header, but for the sake of
consistency we can also use ip_hdr() to retrieve it.

Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0ea9d5e3e0e03a63b11392f5613378977dae7eca 13-Aug-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> xfrm: introduce helper for safe determination of mtu

skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we
always have to make sure we a referring to the correct interpretation
of skb->sk.

We only depend on header defines to query the mtu, so we don't introduce
a new dependency to ipv6 by this change.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2fbd967973ae6ae1a989f5638da8bbed93cad2c5 09-May-2013 Denis Efremov <yefremov.denis@gmail.com> ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functions

EXPORT_SYMBOL and inline directives are contradictory to each other.
The patch fixes this inconsistency.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Denis Efremov <yefremov.denis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f0165888610a1701a39670c7eadf63a61fad708d 21-Mar-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: use IS_ENABLE to replace if defined in TRACE target

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
c9af6db4c11ccc6c3e7f19bbc15d54023956f97c 11-Feb-2013 Pravin B Shelar <pshelar@nicira.com> net: Fix possible wrong checksum generation.

Patch cef401de7be8c4e (net: fix possible wrong checksum
generation) fixed wrong checksum calculation but it broke TSO by
defining new GSO type but not a netdev feature for that type.
net_gso_ok() would not allow hardware checksum/segmentation
offload of such packets without the feature.

Following patch fixes TSO and wrong checksum. This patch uses
same logic that Eric Dumazet used. Patch introduces new flag
SKBTX_SHARED_FRAG if at least one frag can be modified by
the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
info tx_flags rather than gso_type.

tx_flags is better compared to gso_type since we can have skb with
shared frag without gso packet. It does not link SHARED_FRAG to
GSO, So there is no need to define netdev feature for this.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fc70fb640b159f1d6bf5ad2321cd55e874c8d1b8 07-Dec-2012 Alexander Duyck <alexander.h.duyck@intel.com> net: Handle encapsulated offloads before fragmentation or handing to lower dev

This change allows the VXLAN to enable Tx checksum offloading even on
devices that do not support encapsulated checksum offloads. The
advantage to this is that it allows for the lower device to change due
to routing table changes without impacting features on the VXLAN itself.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
155e8336c373d14d87a7f91e356d85ef4b93b8f9 08-Oct-2012 Julian Anastasov <ja@ssi.bg> ipv4: introduce rt_uses_gateway

Add new flag to remember when route is via gateway.
We will use it to allow rt_gateway to contain address of
directly connected host for the cases when DST_NOCACHE is
used or when the NH exception caches per-destination route
without DST_NOCACHE flag, i.e. when routes are not used for
other destinations. By this way we force the neighbour
resolving to work with the routed destination but we
can use different address in the packet, feature needed
for IPVS-DR where original packet for virtual IP is routed
via route to real IP.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
5640f7685831e088fe6c2e1f863a6805962f8e81 24-Sep-2012 Eric Dumazet <edumazet@google.com> net: use a per task frag allocator

We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5f2d04f1f9b52604fca6ee08a77972c0df67e082 26-Aug-2012 Patrick McHardy <kaber@trash.net> ipv4: fix path MTU discovery with connection tracking

IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and
(in case of forwarded packets) refragments them at POST_ROUTING
independent of the IP_DF flag. Refragmentation uses the dst_mtu() of
the local route without caring about the original fragment sizes,
thereby breaking PMTUD.

This patch fixes this by keeping track of the largest received fragment
with IP_DF set and generates an ICMP fragmentation required error during
refragmentation if that size exceeds the MTU.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
a9915a1b52df52ad87f3b33422da95cf25372f09 20-Aug-2012 Eric Dumazet <edumazet@google.com> ipv4: fix ip header ident selection in __ip_make_skb()

Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
memory in __ip_select_ident().

It turns out that __ip_make_skb() called ip_select_ident() before
properly initializing iph->daddr.

This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of
cached route inetpeer.)

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131

Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b5ec8eeac46a99004c26791f70b15d001e970acf 10-Aug-2012 Eric Dumazet <edumazet@google.com> ipv4: fix ip_send_skb()

ip_send_skb() can send orphaned skb, so we must pass the net pointer to
avoid possible NULL dereference in error path.

Bug added by commit 3a7c384ffd57 (ipv4: tcp: unicast_sock should not
land outside of TCP stack)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3a7c384ffd57ef5fbd95f48edaa2ca4eb3d9f2ee 09-Aug-2012 Eric Dumazet <edumazet@google.com> ipv4: tcp: unicast_sock should not land outside of TCP stack

commit be9f4a44e7d41cee (ipv4: tcp: remove per net tcp_sock) added a
selinux regression, reported and bisected by John Stultz

selinux_ip_postroute_compat() expect to find a valid sk->sk_security
pointer, but this field is NULL for unicast_sock

It turns out that unicast_sock are really temporary stuff to be able
to reuse part of IP stack (ip_append_data()/ip_push_pending_frames())

Fact is that frames sent by ip_send_unicast_reply() should be orphaned
to not fool LSM.

Note IPv6 never had this problem, as tcp_v6_send_response() doesnt use a
fake socket at all. I'll probably implement tcp_v4_send_response() to
remove these unicast_sock in linux-3.7

Reported-by: John Stultz <johnstul@us.ibm.com>
Bisected-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9871f1ad677d95ffeca80e2c21b70af9bfc9cc91 06-Aug-2012 Vasiliy Kulikov <segoon@openwall.com> ip: fix error handling in ip_finish_output2()

__neigh_create() returns either a pointer to struct neighbour or PTR_ERR().
But the caller expects it to return either a pointer or NULL. Replace
the NULL check with IS_ERR() check.

The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184
("ipv4: Make neigh lookups directly in output packet path.").

Signed-off-by: Vasily Kulikov <segoon@openwall.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0980e56e506b1e45022fad00bca8c8a974fda4e6 21-Jul-2012 Eric Dumazet <eric.dumazet@gmail.com> ipv4: tcp: set unicast_sock uc_ttl to -1

Set unicast_sock uc_ttl to -1 so that we select the right ttl,
instead of sending packets with a 0 ttl.

Bug added in commit be9f4a44e7d4 (ipv4: tcp: remove per net tcp_sock)

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f8126f1d5136be1ca1a3536d43ad7a710b5620f8 13-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Adjust semantics of rt->rt_gateway.

In order to allow prefixed routes, we have to adjust how rt_gateway
is set and interpreted.

The new interpretation is:

1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

2) rt_gateway != 0, destination requires a nexthop gateway

Abstract the fetching of the proper nexthop value using a new
inline helper, rt_nexthop(), as suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
be9f4a44e7d41cee50ddb5f038fc2391cbbb4046 19-Jul-2012 Eric Dumazet <edumazet@google.com> ipv4: tcp: remove per net tcp_sock

tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.

This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.

To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)

Additional features :

1) We also mirror the queue_mapping of the incoming skb, so that
answers use the same queue if possible.

2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()

3) We now limit the number of in-flight RST/ACK [1] packets
per cpu, instead of per namespace, and we honor the sysctl_wmem_default
limit dynamically. (Prior to this patch, sysctl_wmem_default value was
copied at boot time, so any further change would not affect tcp_sock
limit)

[1] These packets are only generated when no socket was matched for
the incoming packet.

Reported-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5110effee8fde2edfacac9cd12a9960ab2dc39ea 02-Jul-2012 David S. Miller <davem@davemloft.net> net: Do delayed neigh confirmation.

When a dst_confirm() happens, mark the confirmation as pending in the
dst. Then on the next packet out, when we have the neigh in-hand, do
the update.

This removes the dependency in dst_confirm() of dst's having an
attached neigh.

While we're here, remove the explicit 'dst' NULL check, all except 2
or 3 call sites ensure it's not NULL. So just fix those cases up.

Signed-off-by: David S. Miller <davem@davemloft.net>
a263b3093641fb1ec377582c90986a7fd0625184 02-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Make neigh lookups directly in output packet path.

Do not use the dst cached neigh, we'll be getting rid of that.

Signed-off-by: David S. Miller <davem@davemloft.net>
70e7341673a47fb1525cfc7d6651cc98b5348928 28-Jun-2012 David S. Miller <davem@davemloft.net> ipv4: Show that ip_send_reply() is purely unicast routine.

Rename it to ip_send_unicast_reply() and add explicit 'saddr'
argument.

This removed one of the few users of rt->rt_spec_dst.

Signed-off-by: David S. Miller <davem@davemloft.net>
95603e2293de556de7e82221649bfd7fd98b64a3 12-Jun-2012 Michel Machado <michel@digirati.com.br> net-next: add dev_loopback_xmit() to avoid duplicate code

Add dev_loopback_xmit() in order to deduplicate functions
ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).

I was about to reinvent the wheel when I noticed that
ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
need and are not IP-only functions, but they were not available to reuse
elsewhere.

ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
understand that this is harmless, and should be in dev_loopback_xmit().

Signed-off-by: Michel Machado <michel@digirati.com.br>
CC: "David S. Miller" <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jpirko@redhat.com>
CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
CC: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5d0ba55b6486f58cc890918d7167063d83f7fbb4 04-Jun-2012 Eric Dumazet <edumazet@google.com> net: use consume_skb() in place of kfree_skb()

Remove some dropwatch/drop_monitor false positives.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e87cc4728f0e2fb663e592a1141742b1d6c63256 13-May-2012 Joe Perches <joe@perches.com> net: Convert net_ratelimit uses to net_<level>_ratelimited

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ffc93f203c18a70623f21950f1dd473c9ec48cd 28-Mar-2012 David Howells <dhowells@redhat.com> Remove all #inclusions of asm/system.h

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2721745501a26d0dc3b88c0d2f3aa11471891388 02-Dec-2011 David Miller <davem@davemloft.net> net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.

To reflect the fact that a refrence is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
84f9307c5da84a7c8513e7607dc8427d2cbd63e3 30-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com> ipv4: use a 64bit load/store in output path

gcc compiler is smart enough to use a single load/store if we
memcpy(dptr, sptr, 8) on x86_64, regardless of
CONFIG_CC_OPTIMIZE_FOR_SIZE

In IP header, daddr immediately follows saddr, this wont change in the
future. We only need to make sure our flowi4 (saddr,daddr) fields wont
break the rule.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
66b13d99d96a1a69f47a6bc3dc47f45955967377 24-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com> ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT

There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.

In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.

Example of things that were broken :
- Routing using TOS as a selector
- Firewalls
- Trafic classification / shaping

We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.

Notes :
- We still reflect incoming TOS in RST messages.
- We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
- A patch is needed for IPv6

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9e903e085262ffbf1fc44a17ac06058aca03524a 18-Oct-2011 Eric Dumazet <eric.dumazet@gmail.com> net: add skb frag size accessors

To ease skb->truesize sanitization, its better to be able to localize
all references to skb frags size.

Define accessors : skb_frag_size() to fetch frag size, and
skb_frag_size_{set|add|sub}() to manipulate it.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aff65da0f1be5daec44231972b6b5fc45bfa7a58 23-Aug-2011 Ian Campbell <Ian.Campbell@citrix.com> net: ipv4: convert to SKB frag APIs

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
d52fbfc9e5c7bb0b0dbc256edf17dee170ce839d 07-Aug-2011 Julian Anastasov <ja@ssi.bg> ipv4: use dst with ref during bcast/mcast loopback

Make sure skb dst has reference when moving to
another context. Currently, I don't see protocols that can
hit it when sending broadcasts/multicasts to loopback using
noref dsts, so it is just a precaution.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
f2c31e32b378a6653f8de606149d963baf11d7d3 29-Jul-2011 Eric Dumazet <eric.dumazet@gmail.com> net: fix NULL dereferences in check_peer_redir()

Gergely Kalman reported crashes in check_peer_redir().

It appears commit f39925dbde778 (ipv4: Cache learned redirect
information in inetpeer.) added a race, leading to possible NULL ptr
dereference.

Since we can now change dst neighbour, we should make sure a reader can
safely use a neighbour.

Add RCU protection to dst neighbour, and make sure check_peer_redir()
can be called safely by different cpus in parallel.

As neighbours are already freed after one RCU grace period, this patch
should not add typical RCU penalty (cache cold effects)

Many thanks to Gergely for providing a pretty report pointing to the
bug.

Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d9be4f7a6f5a8da3133b832eca41c3591420b1ca 19-Jul-2011 Bill Sommerfeld <wsommerfeld@google.com> ipv4: Constrain UFO fragment sizes to multiples of 8 bytes

Because the ip fragment offset field counts 8-byte chunks, ip
fragments other than the last must contain a multiple of 8 bytes of
payload. ip_ufo_append_data wasn't respecting this constraint and,
depending on the MTU and ip option sizes, could create malformed
non-final fragments.

Google-Bug-Id: 5009328
Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
69cce1d1404968f78b177a0314f5822d5afdbbfb 18-Jul-2011 David S. Miller <davem@davemloft.net> net: Abstract dst->neighbour accesses behind helpers.

dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>
05e3aa0949c138803185f92bd7db9be59cfca1be 17-Jul-2011 David S. Miller <davem@davemloft.net> net: Create and use new helper, neigh_output().

Signed-off-by: David S. Miller <davem@davemloft.net>
fec8292d9cb662274b583c5c69fdfa6932e593a5 16-Jul-2011 David S. Miller <davem@davemloft.net> ipv4: Use calculated 'neigh' instead of re-evaluating dst->neighbour

Signed-off-by: David S. Miller <davem@davemloft.net>
f6b72b6217f8c24f2a54988e58af858b4e66024d 14-Jul-2011 David S. Miller <davem@davemloft.net> net: Embed hh_cache inside of struct neighbour.

Now that there is a one-to-one correspondance between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.

Signed-off-by: David S. Miller <davem@davemloft.net>
c146066ab80267c3305de5dda6a4083f06df9265 30-Jun-2011 Steffen Klassert <steffen.klassert@secunet.com> ipv4: Don't use ufo handling on later transformed packets

We might call ip_ufo_append_data() for packets that will be IPsec
transformed later. This function should be used just for real
udp packets. So we check for rt->dst.header_len which is only
nonzero on IPsec handling and call ip_ufo_append_data() just
if rt->dst.header_len is zero.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
353e5c9abd900de3d1a40925386ffe4abf76111e 22-Jun-2011 Steffen Klassert <steffen.klassert@secunet.com> ipv4: Fix IPsec slowpath fragmentation problem

ip_append_data() builds packets based on the mtu from dst_mtu(rt->dst.path).
On IPsec the effective mtu is lower because we need to add the protocol
headers and trailers later when we do the IPsec transformations. So after
the IPsec transformations the packet might be too big, which leads to a
slowpath fragmentation then. This patch fixes this by building the packets
based on the lower IPsec mtu from dst_mtu(&rt->dst) and adapts the exthdr
handling to this.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
33f99dc7fd948bbc808a24a0989c167f8973b643 22-Jun-2011 Steffen Klassert <steffen.klassert@secunet.com> ipv4: Fix packet size calculation in __ip_append_data

Git commit 59104f06 (ip: take care of last fragment in ip_append_data)
added a check to see if we exceed the mtu when we add trailer_len.
However, the mtu is already subtracted by the trailer length when the
xfrm transfomation bundles are set up. So IPsec packets with mtu
size get fragmented, or if the DF bit is set the packets will not
be send even though they match the mtu perfectly fine. This patch
actually reverts commit 59104f06.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
56f8a75c17abb854b5907f4a815dc4c3f186ba11 22-Jun-2011 Paul Gortmaker <paul.gortmaker@windriver.com> ip: introduce ip_is_fragment helper inline function

There are enough instances of this:

iph->frag_off & htons(IP_MF | IP_OFFSET)

that a helper function is probably warranted.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
96d7303e9cfb6a9bc664174a4dfdb6fa689284fe 05-Jun-2011 Steffen Klassert <steffen.klassert@secunet.com> ipv4: Fix packet size calculation for raw IPsec packets in __ip_append_data

We assume that transhdrlen is positive on the first fragment
which is wrong for raw packets. So we don't add exthdrlen to the
packet size for raw packets. This leads to a reallocation on IPsec
because we have not enough headroom on the skb to place the IPsec
headers. This patch fixes this by adding exthdrlen to the packet
size whenever the send queue of the socket is empty. This issue was
introduced with git commit 1470ddf7 (inet: Remove explicit write
references to sk/inet in ip_append_data)

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
22f728f8f311659b068e73ed92833c205651a47f 13-May-2011 David S. Miller <davem@davemloft.net> ipv4: Always call ip_options_build() after rest of IP header is filled in.

This will allow ip_options_build() to reliably look at the values of
iph->{daddr,saddr}

Signed-off-by: David S. Miller <davem@davemloft.net>
0a5ebb8000c5362be368df9d197943deb06b6916 09-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass explicit daddr arg to ip_send_reply().

This eliminates an access to rt->rt_src.

Signed-off-by: David S. Miller <davem@davemloft.net>
f5fca6086511294653a9e821f8e22f041415ba7b 09-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass flow key down into ip_append_*().

This way rt->rt_dst accesses are unnecessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
77968b78242ee25e2a4d759f0fca8dd52df6d479 09-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass flow keys down into datagram packet building engine.

This way ip_output.c no longer needs rt->rt_{src,dst}.

We already have these keys sitting, ready and waiting, on the stack or
in a socket structure.

Signed-off-by: David S. Miller <davem@davemloft.net>
ea4fc0d6193ff56fcef39b0d2210d402a7acb5f0 07-May-2011 David S. Miller <davem@davemloft.net> ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit().

Now we can pick it out of the provided flow key.

Signed-off-by: David S. Miller <davem@davemloft.net>
d9d8da805dcb503ef8ee49918a94d49085060f23 07-May-2011 David S. Miller <davem@davemloft.net> inet: Pass flowi to ->queue_xmit().

This allows us to acquire the exact route keying information from the
protocol, however that might be managed.

It handles all of the possibilities, from the simplest case of storing
the key in inet->cork.fl to the more complex setup SCTP has where
individual transports determine the flow.

Signed-off-by: David S. Miller <davem@davemloft.net>
b57ae01a8a8446dbbed7365c9b05aef1fc6dea20 07-May-2011 David S. Miller <davem@davemloft.net> ipv4: Use cork flow in ip_queue_xmit()

All invokers of ip_queue_xmit() must make certain that the
socket is locked. All of SCTP, TCP, DCCP, and L2TP now make
sure this is the case.

Therefore we can use the cork flow during output route lookup in
ip_queue_xmit() when the socket route check fails.

Signed-off-by: David S. Miller <davem@davemloft.net>
706527280ec38fcdcd0466f10b607105fd23801b 07-May-2011 David S. Miller <davem@davemloft.net> ipv4: Initialize cork->opt using NULL not 0.

Noticed by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
b80d72261aec5e763a76497eba5fddc84833a154 07-May-2011 David S. Miller <davem@davemloft.net> ipv4: Initialize on-stack cork more efficiently.

ip_setup_cork() explicitly initializes every member of
inet_cork except flags, addr, and opt. So we can simply
set those three members to zero instead of using a
memset() via an empty struct assignment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
bdc712b4c2baf9515887de3a52e7ecd89fafc0c7 07-May-2011 David S. Miller <davem@davemloft.net> inet: Decrease overhead of on-stack inet_cork.

When we fast path datagram sends to avoid locking by putting
the inet_cork on the stack we use up lots of space that isn't
necessary.

This is because inet_cork contains a "struct flowi" which isn't
used in these code paths.

Split inet_cork to two parts, "inet_cork" and "inet_cork_full".
Only the latter of which has the "struct flowi" and is what is
stored in inet_sock.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
dd927a2694ee412b440284dd72dd8e32caada3fc 04-May-2011 David S. Miller <davem@davemloft.net> ipv4: In ip_build_and_send_pkt() use 'saddr' and 'daddr' args passed in.

Instead of rt->rt_{dst,src}

The only tricky part is source route option handling.

If the source route option is enabled we can't just use plain 'daddr',
we have to use opt->opt.faddr.

Signed-off-by: David S. Miller <davem@davemloft.net>
31e4543db29fb85496a122b965d6482c8d1a2bfe 04-May-2011 David S. Miller <davem@davemloft.net> ipv4: Make caller provide on-stack flow key to ip_route_output_ports().

Signed-off-by: David S. Miller <davem@davemloft.net>
f6d8bd051c391c1c0458a30b2a7abcd939329259 21-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: add RCU protection to inet->opt

We lack proper synchronization to manipulate inet->opt ip_options

Problem is ip_make_skb() calls ip_setup_cork() and
ip_setup_cork() possibly makes a copy of ipc->opt (struct ip_options),
without any protection against another thread manipulating inet->opt.

Another thread can change inet->opt pointer and free old one under us.

Use RCU to protect inet->opt (changed to inet->inet_opt).

Instead of handling atomic refcounts, just copy ip_options when
necessary, to avoid cache line dirtying.

We cant insert an rcu_head in struct ip_options since its included in
skb->cb[], so this patch is large because I had to introduce a new
ip_options_rcu structure.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
538de0e01f1ca3568ad03877ff297c646dd8ad23 31-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4_init_output() in ip_send_reply()

Signed-off-by: David S. Miller <davem@davemloft.net>
9cce96df5b76691712dba22e83ff5efe900361e1 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put fl4_* macros to struct flowi4 and use them again.

Signed-off-by: David S. Miller <davem@davemloft.net>
9d6ec938019c6b16cb9ec96598ebe8f20de435fe 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4 in public route lookup interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>
6281dcc94a96bd73017b2baa8fa83925405109ef 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Make flowi ports AF dependent.

Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*

This will let us to create AF optimal flow instances.

It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.

Signed-off-by: David S. Miller <davem@davemloft.net>
1d28f42c1bd4bb2363d88df74d0128b4da135b4a 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>
78fbfd8a653ca972afe479517a40661bfff6d8c3 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Create and use route lookup helpers.

The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.

Signed-off-by: David S. Miller <davem@davemloft.net>
b23dd4fe42b455af5c6e20966b7d6959fa8352ea 02-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
07df5294a753dfac2cc9f75e6159fc25fdc22149 02-Mar-2011 Herbert Xu <herbert@gondor.apana.org.au> inet: Replace left-over references to inet->cork

The patch to replace inet->cork with cork left out two spots in
__ip_append_data that can result in bogus packet construction.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
273447b352e69c327efdecfd6e1d6fe3edbdcd14 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Kill can_sleep arg to ip_route_output_flow()

This boolean state is now available in the flow flags.

Signed-off-by: David S. Miller <davem@davemloft.net>
420d44daa7aa1cc847e9e527f0a27a9ce61768ca 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"

Since that is what the current vague "flags" argument means.

Signed-off-by: David S. Miller <davem@davemloft.net>
1c32c5ad6fac8cee1a77449f5abf211e911ff830 01-Mar-2011 Herbert Xu <herbert@gondor.apana.org.au> inet: Add ip_make_skb and ip_finish_skb

This patch adds the helper ip_make_skb which is like ip_append_data
and ip_push_pending_frames all rolled into one, except that it does
not send the skb produced. The sending part is carried out by
ip_send_skb, which the transport protocol can call after it has
tweaked the skb.

It is meant to be called in cases where corking is not used should
have a one-to-one correspondence to sendmsg.

This patch also adds the helper ip_finish_skb which is meant to
be replace ip_push_pending_frames when corking is required.
Previously the protocol stack would peek at the socket write
queue and add its header to the first packet. With ip_finish_skb,
the protocol stack can directly operate on the final skb instead,
just like the non-corking case with ip_make_skb.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1470ddf7f8cecf776921e5ccee72e3d2b3d60cbc 01-Mar-2011 Herbert Xu <herbert@gondor.apana.org.au> inet: Remove explicit write references to sk/inet in ip_append_data

In order to allow simultaneous calls to ip_append_data on the same
socket, it must not modify any shared state in sk or inet (other
than those that are designed to allow that such as atomic counters).

This patch abstracts out write references to sk and inet_sk in
ip_append_data and its friends so that we may use the underlying
code in parallel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a2ef92023506d4e9cd13617b5a46b4d0f1b6747 01-Mar-2011 Herbert Xu <herbert@gondor.apana.org.au> inet: Remove unused sk_sndmsg_* from UFO

UFO doesn't really use the sk_sndmsg_* parameters so touching
them is pointless. It can't use them anyway since the whole
point of UFO is to use the original pages without copying.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
323e126f0c5995f779d7df7fd035f6e8fed8764d 13-Dec-2010 David S. Miller <davem@davemloft.net> ipv4: Don't pre-seed hoplimit metric.

Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

This allowed several simplifications:

1) The interim dst_metric_hoplimit() can go as it's no longer
userd.

2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.

3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.

When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.

We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h

Signed-off-by: David S. Miller <davem@davemloft.net>
5170ae824ddf1988a63fb12cbedcff817634c444 13-Dec-2010 David S. Miller <davem@davemloft.net> net: Abstract RTAX_HOPLIMIT metric accesses behind helper.

Signed-off-by: David S. Miller <davem@davemloft.net>
5811662b15db018c740c57d037523683fd3e6123 12-Nov-2010 Changli Gao <xiaosuo@gmail.com> net: use the macros defined for the members of flowi

Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
59104f062435c7816e39ee5ed504a69cb8037f10 20-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> ip: take care of last fragment in ip_append_data

While investigating a bit, I found ip_fragment() slow path was taken
because ip_append_data() provides following layout for a send(MTU +
N*(MTU - 20)) syscall :

- one skb with 1500 (mtu) bytes
- N fragments of 1480 (mtu-20) bytes (before adding IP header)
last fragment gets 17 bytes of trail data because of following bit:

if (datalen == length + fraggap)
alloclen += rt->dst.trailer_len;

Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
another bug ?)

In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
so we take slow path, building another skb chain.

In order to avoid taking slow path, we should correct ip_append_data()
to make sure last fragment has real trail space, under mtu...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3d13008e7345fa7a79d8f6438150dc15d6ba6e9d 21-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> ip: fix truesize mismatch in ip fragmentation

Special care should be taken when slow path is hit in ip_fragment() :

When walking through frags, we transfert truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test program.

Thanks to Jarek for reviewing my first patch and providing a V2

While Nick bisection pointed to commit 2b85a34e911 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older
(2.6.12-rc5)

A side effect is to extend work done in commit b2722b1c3a893e
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ec550d246e38e1b4ea8604b5c71ccb72e38f3290 24-Aug-2010 Eric Dumazet <eric.dumazet@gmail.com> net: ip_append_data() optim

Compiler is not smart enough to avoid a conditional branch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21dc330157454046dd7c494961277d76e1c957fe 23-Aug-2010 David S. Miller <davem@davemloft.net> net: Rename skb_has_frags to skb_has_frag_list

SKBs can be "fragmented" in two ways, via a page array (called
skb_shinfo(skb)->frags[]) and via a list of SKBs (called
skb_shinfo(skb)->frag_list).

Since skb_has_frags() tests the latter, it's name is confusing
since it sounds more like it's testing the former.

Signed-off-by: David S. Miller <davem@davemloft.net>
2244d07bfa2097cb00600da91c715a8aa547917e 17-Aug-2010 Oliver Hartkopp <socketcan@hartkopp.net> net: simplify flags for tx timestamping

This patch removes the abstraction introduced by the union skb_shared_tx in
the shared skb data.

The access of the different union elements at several places led to some
confusion about accessing the shared tx_flags e.g. in skb_orphan_try().

http://marc.info/?l=linux-netdev&m=128084897415886&w=2

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c893b8066c7bf6156e4d760e5acaf4c148e37190 31-Jul-2010 Changli Gao <xiaosuo@gmail.com> ip_fragment: fix subtracting PPPOE_SES_HLEN from mtu twice

6c79bf0f2440fd250c8fce8d9b82fcf03d4e8350 subtracts PPPOE_SES_HLEN from mtu at
the front of ip_fragment(). So the later subtraction should be removed. The
MTU of 802.1q is also 1500, so MTU should not be changed.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo>
----
net/ipv4/ip_output.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
Signed-off-by: Bart De Schuymer <bdschuym@pandora.bo>
Signed-off-by: David S. Miller <davem@davemloft.net>
4bc2f18ba4f22a90ab593c0a580fc9a19c4777b6 09-Jul-2010 Eric Dumazet <eric.dumazet@gmail.com> net/ipv4: EXPORT_SYMBOL cleanups

CodingStyle cleanups

EXPORT_SYMBOL should immediately follow the symbol declaration.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
49085bd7d498c47d635851cfda22627b085cd9af 06-Jul-2010 George Kadianakis <desnacked@gmail.com> net/ipv4/ip_output.c: Removal of unused variable in ip_fragment()

Removal of unused integer variable in ip_fragment().

Signed-off-by: George Kadianakis <desnacked@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fe76cda3081b502986f0c8b28b0cf8bfc27d44d5 02-Jul-2010 Eric Dumazet <eric.dumazet@gmail.com> ipv4: use skb_dst_copy() in ip_copy_metadata()

Avoid touching dst refcount in ip_fragment().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
26cde9f7e2747b6d254b704594eed87ab959afa5 15-Jun-2010 Herbert Xu <herbert@gondor.apana.org.au> udp: Fix bogus UFO packet generation

It has been reported that the new UFO software fallback path
fails under certain conditions with NFS. I tracked the problem
down to the generation of UFO packets that are smaller than the
MTU. The software fallback path simply discards these packets.

This patch fixes the problem by not generating such packets on
the UFO path.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8d1f30b95a635dbd610dcc5eb641aca8f4768cf 11-Jun-2010 Changli Gao <xiaosuo@gmail.com> net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ab6e3feba1f1bc3b9418b854da6f481408d243de 10-May-2010 Eric Dumazet <eric.dumazet@gmail.com> net: No dst refcounting in ip_queue_xmit()

TCP outgoing packets can avoid two atomic ops, and dirtying
of previously higly contended cache line using new refdst
infrastructure.

Note 1: loopback device excluded because of !IFF_XMIT_DST_RELEASE
Note 2: UDP packets dsts are built before ip_queue_xmit().

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6c79bf0f2440fd250c8fce8d9b82fcf03d4e8350 20-Apr-2010 Bart De Schuymer <bdschuym@pandora.be> netfilter: bridge-netfilter: fix refragmenting IP traffic encapsulated in PPPoE traffic

The MTU for IP traffic encapsulated inside PPPoE traffic is smaller
than the MTU of the Ethernet device (1500). Connection tracking
gathers all IP packets and sometimes will refragment them in
ip_fragment(). We then need to subtract the length of the
encapsulating header from the mtu used in ip_fragment(). The check in
br_nf_dev_queue_xmit() which determines if ip_fragment() has to be
called is also updated for the PPPoE-encapsulated packets.
nf_bridge_copy_header() is also updated to make sure the PPPoE data
length field has the correct value.

Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
cd58bcd9787ef4c16ab6e442c4f1bf3539b3ab39 19-Apr-2010 Jan Engelhardt <jengelh@medozas.de> netfilter: xt_TEE: have cloned packet travel through Xtables too

Since Xtables is now reentrant/nestable, the cloned packet can also go
through Xtables and be subject to rules itself.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
e281b19897dc21c1071802808d461627d747a877 19-Apr-2010 Jan Engelhardt <jengelh@medozas.de> netfilter: xtables: inclusion of xt_TEE

xt_TEE can be used to clone and reroute a packet. This can for
example be used to copy traffic at a router for logging purposes
to another dedicated machine.

References: http://www.gossamer-threads.com/lists/iptables/devel/68781
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
4e15ed4d930297c127d280ca1d0c785be870def4 15-Apr-2010 Shan Wei <shanwei@cn.fujitsu.com> net: replace ipfragok with skb->local_df

As Herbert Xu said: we should be able to simply replace ipfragok
with skb->local_df. commit f88037(sctp: Drop ipfargok in sctp_xmit function)
has droped ipfragok and set local_df value properly.

The patch kills the ipfragok parameter of .queue_xmit().

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e30b38c298b55e09456d3ccbc1df2f3e2e8dc6e9 15-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> ip: Fix ip_dev_loopback_xmit()

Eric Paris got following trace with a linux-next kernel

[ 14.203970] BUG: using smp_processor_id() in preemptible [00000000]
code: avahi-daemon/2093
[ 14.204025] caller is netif_rx+0xfa/0x110
[ 14.204035] Call Trace:
[ 14.204064] [<ffffffff81278fe5>] debug_smp_processor_id+0x105/0x110
[ 14.204070] [<ffffffff8142163a>] netif_rx+0xfa/0x110
[ 14.204090] [<ffffffff8145b631>] ip_dev_loopback_xmit+0x71/0xa0
[ 14.204095] [<ffffffff8145b892>] ip_mc_output+0x192/0x2c0
[ 14.204099] [<ffffffff8145d610>] ip_local_out+0x20/0x30
[ 14.204105] [<ffffffff8145d8ad>] ip_push_pending_frames+0x28d/0x3d0
[ 14.204119] [<ffffffff8147f1cc>] udp_push_pending_frames+0x14c/0x400
[ 14.204125] [<ffffffff814803fc>] udp_sendmsg+0x39c/0x790
[ 14.204137] [<ffffffff814891d5>] inet_sendmsg+0x45/0x80
[ 14.204149] [<ffffffff8140af91>] sock_sendmsg+0xf1/0x110
[ 14.204189] [<ffffffff8140dc6c>] sys_sendmsg+0x20c/0x380
[ 14.204233] [<ffffffff8100ad82>] system_call_fastpath+0x16/0x1b

While current linux-2.6 kernel doesnt emit this warning, bug is latent
and might cause unexpected failures.

ip_dev_loopback_xmit() runs in process context, preemption enabled, so
must call netif_rx_ni() instead of netif_rx(), to make sure that we
process pending software interrupt.

Same change for ip6_dev_loopback_xmit()

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
9bbc768aa911a3ef336272eaa6d220abfba8ce50 23-Mar-2010 Jan Engelhardt <jengelh@medozas.de> netfilter: ipv4: use NFPROTO values for NF_HOOK invocation

The semantic patch that was used:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_COND
|nf_hook
)(
-PF_INET,
+NFPROTO_IPV4,
...)
// </smpl>

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
7ad6848c7e81a603605fad3f3575841aab004eea 07-Jan-2010 Octavian Purdila <opurdila@ixiacom.com> ip: fix mc_loop checks for tunnels with multicast outer addresses

When we have L3 tunnels with different inner/outer families
(i.e. IPV4/IPV6) which use a multicast address as the outer tunnel
destination address, multicast packets will be loopbacked back to the
sending socket even if IP*_MULTICAST_LOOP is set to disabled.

The mc_loop flag is present in the family specific part of the socket
(e.g. the IPv4 or IPv4 specific part). setsockopt sets the inner
family mc_loop flag. When the packet is pushed through the L3 tunnel
it will eventually be processed by the outer family which if different
will check the flag in a different part of the socket then it was set.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b2722b1c3a893ec6021508da15b32282ec79f4da 02-Dec-2009 Patrick McHardy <kaber@trash.net> ip_fragment: also adjust skb->truesize for packets not owned by a socket

When a large packet gets reassembled by ip_defrag(), the head skb
accounts for all the fragments in skb->truesize. If this packet is
refragmented again, skb->truesize is not re-adjusted to reflect only
the head size since its not owned by a socket. If the head fragment
then gets recycled and reused for another received fragment, it might
exceed the defragmentation limits due to its large truesize value.

skb_recycle_check() explicitly checks for linear skbs, so any recycled
skb should reflect its true size in skb->truesize. Change ip_fragment()
to also adjust the truesize value of skbs not owned by a socket.

Reported-and-tested-by: Ben Menchaca <ben@bigfootnetworks.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9d4fb27db90043cd2640e4bc778f9c755d3c17c1 23-Nov-2009 Joe Perches <joe@perches.com> net/ipv4: Move && and || to end of previous line

On Sun, 2009-11-22 at 16:31 -0800, David Miller wrote:
> It should be of the form:
> if (x &&
> y)
>
> or:
> if (x && y)
>
> Fix patches, rather than complaints, for existing cases where things
> do not follow this pattern are certainly welcome.

Also collapsed some multiple tabs to single space.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c720c7e8383aff1cb219bddf474ed89d850336e3 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> inet: rename some inet_sock fields

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
914a9ab386a288d0f22252fc268ecbc048cdcbd5 02-Oct-2009 Atis Elsts <atis@mikrotik.com> net: Use sk_mark for routing lookup in more places

This patch against v2.6.31 adds support for route lookup using sk_mark in some
more places. The benefits from this patch are the following.
First, SO_MARK option now has effect on UDP sockets too.
Second, ip_queue_xmit() and inet_sk_rebuild_header() could fail to do routing
lookup correctly if TCP sockets with SO_MARK were used.

Signed-off-by: Atis Elsts <atis@mikrotik.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
6ce9e7b5fe3195d1ae6e3a0753d4ddcac5cd699e 03-Sep-2009 Eric Dumazet <eric.dumazet@gmail.com> ip: Report qdisc packet drops

Christoph Lameter pointed out that packet drops at qdisc level where not
accounted in SNMP counters. Only if application sets IP_RECVERR, drops
are reported to user (-ENOBUFS errors) and SNMP counters updated.

IP_RECVERR is used to enable extended reliable error message passing,
but these are not needed to update system wide SNMP stats.

This patch changes things a bit to allow SNMP counters to be updated,
regardless of IP_RECVERR being set or not on the socket.

Example after an UDP tx flood
# netstat -s
...
IP:
1487048 outgoing packets dropped
...
Udp:
...
SndbufErrors: 1487048


send() syscalls, do however still return an OK status, to not
break applications.

Note : send() manual page explicitly says for -ENOBUFS error :

"The output queue for a network interface was full.
This generally indicates that the interface has stopped sending,
but may be caused by transient congestion.
(Normally, this does not occur in Linux. Packets are just silently
dropped when a device queue overflows.) "

This is not true for IP_RECVERR enabled sockets : a send() syscall
that hit a qdisc drop returns an ENOBUFS error.

Many thanks to Christoph, David, and last but not least, Alexey !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
788d908f2879a17e5f80924f3da2e23f1034482d 27-Aug-2009 Julien TINNES <julien@cr0.org> ipv4: make ip_append_data() handle NULL routing table

Add a check in ip_append_data() for NULL *rtp to prevent future bugs in
callers from being exploitable.

Signed-off-by: Julien Tinnes <julien@cr0.org>
Signed-off-by: Tavis Ormandy <taviso@sdf.lonestar.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e51a67a9c8a2ea5c563f8c2ba6613fe2100ffe67 08-Jul-2009 Eric Dumazet <eric.dumazet@gmail.com> net: ip_push_pending_frames() fix

After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
we do not take any more references on sk->sk_refcnt on outgoing packets.

I forgot to delete two __sock_put() from ip_push_pending_frames()
and ip6_push_pending_frames().

Reported-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Emil S Tantilov <emils.tantilov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2b85a34e911bf483c27cfdd124aeb1605145dc80 11-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: No more expensive sock_hold()/sock_put() on each tx

One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.

This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.

We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.

We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.

As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.

skb_set_owner_w() doesnt call sock_hold() anymore

sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d7fcf1a5cae2c970e9afe7192fe0c13d931247e0 09-Jun-2009 David S. Miller <davem@davemloft.net> ipv4: Use frag list abstraction interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>
adf30907d63893e4208dfe3f5c88ae12bc2f25d5 02-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
511c3f92ad5b6d9f8f6464be1b4f85f0422be91a 02-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: skb->rtable accessor

Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
edf391ff17232f097d72441c9ad467bcb3b5db18 27-Apr-2009 Neil Horman <nhorman@tuxdriver.com> snmp: add missing counters for RFC 4293

The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that easy to separate from other
protocols. This patch adds those missing counters to the stats file. Tested
successfully by me

With help from Eric Dumazet.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
51f31cabe3ce5345b51e4a4f82138b38c4d5dc91 12-Feb-2009 Patrick Ohly <patrick.ohly@intel.com> ip: support for TX timestamps on UDP and RAW sockets

Instructions for time stamping outgoing packets are take from the
socket layer and later copied into the new skb.

Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a21bba945430f3f5e00c349665f88cdacdb32a8d 25-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames()

We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_push_pending_frames() steal the refcount its
callers had to take when filling inet->cork.dst.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2e77d89b2fa8e3f8325b8ce7893ec3645f41aff5 25-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: avoid a pair of dst_hold()/dst_release() in ip_append_data()

We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_append_data() eventually steal the refcount its
callers had to take on the dst entry.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d9319100c1ad7d0ed4045ded767684ad25670436 03-Nov-2008 Jianjun Kong <jianjun@zeuux.org> net: clean up net/ipv4/ah4.c esp4.c fib_semantics.c inet_connection_sock.c inetpeer.c ip_output.c

Signed-off-by: Jianjun Kong <jianjun@zeuux.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
86b08d867d7de001ab224180ed7865fab93fd56e 01-Oct-2008 KOVACS Krisztian <hidden@sch.bme.hu> ipv4: Make Netfilter's ip_route_me_harder() non-local address compatible

Netfilter's ip_route_me_harder() tries to re-route packets either
generated or re-routed by Netfilter. This patch changes
ip_route_me_harder() to handle packets from non-locally-bound sockets
with IP_TRANSPARENT set as local and to set the appropriate flowi
flags when re-doing the routing lookup.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
547b792cac0a038b9dbf958d3c120df3740b5572 26-Jul-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> net: convert BUG_TRAP to generic WARN_ON

Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
5e38e270444f2629de7a706b5a9ca1b333d14517 17-Jul-2008 Pavel Emelyanov <xemul@openvz.org> mib: add net to IP_INC_STATS

All the callers already have either the net itself, or the place
where to get it from.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0388b0042624714e6f8db8cc7994101a0a02d392 15-Jul-2008 Pavel Emelyanov <xemul@openvz.org> icmp: add struct net argument to icmp_out_count

This routine deals with ICMP statistics, but doesn't have a
struct net at hands, so add one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0b040829952d84bf2a62526f0e24b624e0699447 11-Jun-2008 Adrian Bunk <bunk@kernel.org> net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
be9164e769d57aa10b2bbe15d103edc041b9e7de 30-Apr-2008 Kostya B <bkostya@hotmail.com> [IPv4] UFO: prevent generation of chained skb destined to UFO device

Problem: ip_append_data() could wrongly generate a chained skb for
devices which support UFO. When sk_write_queue is not empty
(e.g. MSG_MORE), __instead__ of appending data into the next nr_frag
of the queued skb, a new chained skb is created.

I would normally assume UFO device should get data in nr_frags and not
in frag_list. Later the udp4_hwcsum_outgoing() resets csum to NONE
and skb_gso_segment() has oops.

Proposal:
1. Even length is less than mtu, employ ip_ufo_append_data()
and append data to the __existed__ skb in the sk_write_queue.

2. ip_ufo_append_data() is fixed due to a wrong manipulation of
peek-ing and later enqueue-ing of the same skb. Now, enqueuing is
always performed, because on error the further
ip_flush_pending_frames() would release the queued skb.

Signed-off-by: Kostya B <bkostya@hotmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b1e0a655f8eba44ab1ee2a1068d169ccfb853b9 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
c8cdaf998df221b01134a051aba38c570105061b 10-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4,IPV6]: Share cork.rt between IPv4 and IPv6.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
cb84663e4d239f23f0d872bc6463c272e74daad8 24-Mar-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Process IP layer in the context of the correct namespace.

Replace all the rest of the init_net with a proper net on the IP layer.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ee6b967301b4aa5d4a4b61e2f682f086266db9fb 06-Mar-2008 Eric Dumazet <dada1@cosmosbay.com> [IPV4]: Add 'rtable' field in struct sk_buff to alias 'dst' and avoid casts

(Anonymous) unions can help us to avoid ugly casts.

A common cast it the (struct rtable *)skb->dst one.

Defining an union like :
union {
struct dst_entry *dst;
struct rtable *rtable;
};
permits to use skb->rtable in place.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4a19ec5800fc3bb64e2d87c4d9fdd9e636086fe0 31-Jan-2008 Laszlo Attila Toth <panther@balabit.hu> [NET]: Introducing socket mark socket option.

A userspace program may wish to set the mark for each packets its send
without using the netfilter MARK target. Changing the mark can be used
for mark based routing without netfilter or for packet filtering.

It requires CAP_NET_ADMIN capability.

Signed-off-by: Laszlo Attila Toth <panther@balabit.hu>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
29ffe1a5c52dae13b6efead97aab9b058f38fce4 29-Jan-2008 Herbert Xu <herbert@gondor.apana.org.au> [INET]: Prevent out-of-sync truesize on ip_fragment slow path

When ip_fragment has to hit the slow path the value of skb->truesize
may go out of sync because we would have updated it without changing
the packet length. This violates the constraints on truesize.

This patch postpones the update of skb->truesize to prevent this.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
dde1bc0e6f86183bc095d0774cd109f4edf66ea2 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace for ICMP replying code.

All needed API is done, the namespace is available when required from
the device on the DST entry from the incoming packet. So, just replace
init_net with proper namespace.

Other protocols will follow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f206351a50ea86250fabea96b9af8d8f8fc02603 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_route_output_key.

Needed to propagate it down to the ip_route_output_flow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1b050bf7a88910f9f00c9c8989c1bf5a67dd140 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_route_output_flow.

Needed to propagate it down to the __ip_route_output_key.

Signed_off_by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a067d9ac39cd207b5a0994c73199a56e7d5a17a3 06-Jan-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> [NET]: Remove obsolete comment

It seems that ip_build_xmit is no longer used in here and
ip_append_data is used.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
6e23ae2a48750bda407a4a58f52a4865d7308bf5 20-Nov-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: Introduce NF_INET_ hook values

The IPv4 and IPv6 hook values are identical, yet some code tries to figure
out the "correct" value by looking at the address family. Introduce NF_INET_*
values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__
section for userspace compatibility.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
c439cb2e4b13cf1cb2abcd006b906315a3381323 12-Jan-2008 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Add ip_local_out

Most callers of the LOCAL_OUT chain will set the IP packet length and
header checksum before doing so. They also share the same output
function dst_output.

This patch creates a new function called ip_local_out which does all
of that and converts the appropriate users over to it.

Apart from removing duplicate code, it will also help in merging the
IPsec output path once the same thing is done for IPv6.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
f945fa7ad9c12a3356a3de7fb2143ccc2f2c3bca 23-Jan-2008 Herbert Xu <herbert@gondor.apana.org.au> [INET]: Fix truesize setting in ip_append_data

As it is ip_append_data only counts page fragments to the skb that
allocated it. As such it means that the first skb gets hit with a
4K charge even though it might have only used a fraction of it while
all subsequent skb's that use the same page gets away with no charge
at all.

This bug was exposed by the UDP accounting patch.

[ The wmem_alloc bumping needs to be moved with the truesize,
noticed by Takahiro Yasui. -DaveM ]

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
1e34a11d55c9437775367d72ad03f9af99e78bd0 23-Jan-2008 David S. Miller <davem@davemloft.net> [IPV4]: Add missing skb->truesize increment in ip_append_page().

And as noted by Takahiro Yasui, we thus need to bump the
sk->sk_wmem_alloc at this spot as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
429f08e950a88cd826b203ea898c2f2d0f7db9de 06-Nov-2007 Pavel Emelyanov <xemul@openvz.org> [IPV4]: Consolidate the ip cork destruction in ip_output.c

The ip_push_pending_frames and ip_flush_pending_frames do the
same things to flush the sock's cork. Move this into a separate
function and save ~80 bytes from the .text

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
c2636b4d9e8ab8d16b9e2bf0f0744bb8418d4026 24-Oct-2007 Chuck Lever <chuck.lever@oracle.com> [NET]: Treat the sign of the result of skb_headroom() consistently

In some places, the result of skb_headroom() is compared to an unsigned
integer, and in others, the result is compared to a signed integer. Make
the comparisons consistent and correct.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
861d04860725dc85944bf9fa815af338d9e56b43 15-Oct-2007 Patrick McHardy <kaber@trash.net> [IPV4]: Uninline netfilter okfns

Now that we don't pass double skb pointers to nf_hook_slow anymore, gcc
can generate tail calls for some of the netfilter hook okfn invocations,
so there is no need to inline the functions anymore. This caused huge
code bloat since we ended up with one inlined version and one out-of-line
version since we pass the address to nf_hook_slow.

Before:
text data bss dec hex filename
8997385 1016524 524652 10538561 a0ce41 vmlinux

After:
text data bss dec hex filename
8994009 1016524 524652 10535185 a0c111 vmlinux
-------------------------------------------------------
-3376

All cases have been verified to generate tail-calls with and without
netfilter. The okfns in ipmr and xfrm4_input still remain inline because
gcc can't generate tail-calls for them.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b04ddde02cf1b6f14f2697da5c20eca5715017f 09-Oct-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: Move hardware header operations out of netdevice.

Since hardware header operations are part of the protocol class
not the device instance, make them into a separate object and
save memory.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
96793b482540f3a26e2188eaf75cb56b7829d3e3 17-Sep-2007 David L Stevens <dlstevens@us.ibm.com> [IPV4]: Add ICMPMsgStats MIB (RFC 4293)

Background: RFC 4293 deprecates existing individual, named ICMP
type counters to be replaced with the ICMPMsgStatsTable. This table
includes entries for both IPv4 and IPv6, and requires counting of all
ICMP types, whether or not the machine implements the type.

These patches "remove" (but not really) the existing counters, and
replace them with the ICMPMsgStats tables for v4 and v6.
It includes the named counters in the /proc places they were, but gets the
values for them from the new tables. It also counts packets generated
from raw socket output (e.g., OutEchoes, MLD queries, RA's from
radvd, etc).

Changes:
1) create icmpmsg_statistics mib
2) create icmpv6msg_statistics mib
3) modify existing counters to use these
4) modify /proc/net/snmp to add "IcmpMsg" with all ICMP types
listed by number for easy SNMP parsing
5) modify /proc/net/snmp printing for "Icmp" to get the named data
from new counters.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f49f9967b263cc88b48d912172afdc621bcb0a3c 11-Aug-2007 Jesper Juhl <jesper.juhl@gmail.com> [IPV4]: Clean up duplicate includes in net/ipv4/

This patch cleans up duplicate includes in
net/ipv4/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba9dda3ab5a865542e69dfe01edb2436857c9420 08-Jul-2007 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> [NETFILTER]: x_tables: add TRACE target

The TRACE target can be used to follow IP and IPv6 packets through
the ruleset.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick NcHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d212f87b068c9d72065ef579d85b5ee6b8b59381 27-Jun-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: IPV6 checksum offloading in network devices

The existing model for checksum offload does not correctly handle
devices that can offload IPV4 and IPV6 only. The NETIF_F_HW_CSUM flag
implies device can do any arbitrary protocol.

This patch:
* adds NETIF_F_IPV6_CSUM for those devices
* fixes bnx2 and tg3 devices that need it
* add NETIF_F_IPV6_CSUM to ipv6 output (incl GSO)
* fixes assumptions about NETIF_F_ALL_CSUM in nat
* adjusts bridge union of checksumming computation

Signed-off-by: David S. Miller <davem@davemloft.net>
f0e48dbfc5c74e967fea4c0fd0c5ad07557ae0c8 05-Jun-2007 Patrick McHardy <kaber@trash.net> [TCP]: Honour sk_bound_dev_if in tcp_v4_send_ack

A time_wait socket inherits sk_bound_dev_if from the original socket,
but it is not used when sending ACK packets using ip_send_reply.

Fix by passing the oif to ip_send_reply in struct ip_reply_arg and
use it for output routing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
80787ebc2bbd8e675d8b9ff8cfa40f15134feebe 30-Apr-2007 Mitsuru Chinen <mitch@linux.vnet.ibm.com> [IPV4] SNMP: Support OutMcastPkts and OutBcastPkts

A transmitted IP multicast datagram should be counted as OutMcastPkts.
By the same token, a transmitted IP broadcast datagram should be
counted as OutBcastPkts.

Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
628a5c561890a9a9a74dea017873530584aab06e 21-Apr-2007 John Heffner <jheffner@psc.edu> [INET]: Add IP(V6)_PMTUDISC_RPOBE

Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER. This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery). This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
d626f62b11e00c16e81e4308ab93d3f13551812a 27-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_copy_from_linear_data{_offset}

To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
b0e380b1d8a8e0aca215df97702f99815f05c094 11-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: unions of just one member don't get anything done, kill them

Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cfe1fc7759fdacb0c650b575daed1692bf3eaece 16-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_network_header_len

For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bff9b61ce330df04c6830d823c30c04203543f01 16-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Use the helpers to get the layer header pointer

Some more cases...

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e7ac05f3407a3fb5a1b2ff5d5554899eaa0a10a3 15-Mar-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: add nf_copy() to safely copy members in skb

This unifies the codes to copy netfilter related datas. Before copying,
nf_copy() puts original members in destination skb.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9c70220b73908f64792422a2c39c593c4792f2c5 26-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_transport_header(skb)

For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aa8223c7bb0b05183e1737881ed21827aa5b9e73 11-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
967b05f64e27d04a4c8879addd0e1c52137e2c9e 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_set_transport_header

For the cases where the transport header is being set to a offset from
skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
badff6d01a8589a1c828b0bf118903ca38627f4e 13-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_reset_transport_header(skb)

For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eddc9ec53be2ecdbf4efe0efd4a83052594f0ac0 21-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c14d2450cb7fe1786e2ec325172baf66922bf597 12-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_set_network_header

For the cases where the network header is being set to a offset from skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d56f90a7c96da5187f0cdf07ee7434fe6aa78bbc 11-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_network_header()

For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bbe735e4247dba32568a305553b010081c8dea99 11-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_network_offset()

For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8856dfa3e9b71ac2177016f66ace3a8978afecc1 10-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Use skb_reset_network_header after skb_push

Some more cases where skb->nh.iph was being set that were converted
to using skb_reset_network_header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2ca9e6f2c2a4117d21947e911ae1f5e5306b0df0 10-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Some more skb_put cases converted to skb_reset_network_header

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e2d1bca7e6134671bcb19810d004a252aa6a644d 11-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Use skb_reset_network_header in skb_push cases

skb_push updates and returns skb->data, so we can just call
skb_reset_network_header after the call to skb_push.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c1d2bbe1cd6c7bbdc6d532cefebb66c7efb789ce 11-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_reset_network_header(skb)

For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can
later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
459a98ed881802dee55897441bc7f77af614368e 19-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_reset_mac_header(skb)

For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
132adf54639cf7dd9315e8df89c2faa59f6e46d9 09-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [IPV4]: cleanup

Add whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
cd354f1ae75e6466a7e31b727faede57a1f89ca5 14-Feb-2007 Tim Schmielau <tim@physik3.uni-rostock.de> [PATCH] remove many unneeded #includes of sched.h

After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e905a9edab7f4f14f9213b52234e4a346c690911 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
e89862f4c5b3c4ac9afcd8cb1365d2f1e16ddc3b 26-Jan-2007 David S. Miller <davem@sunset.davemloft.net> [TCP]: Restore SKB socket owner setting in tcp_transmit_skb().

Revert 931731123a103cfb3f70ac4b7abfc71d94ba1f03

We can't elide the skb_set_owner_w() here because things like certain
netfilter targets (such as owner MATCH) need a socket to be set on the
SKB for correct operation.

Thanks to Jan Engelhardt and other netfilter list members for
pointing this out.

Signed-off-by: David S. Miller <davem@davemloft.net>
3644f0cee77494190452de132e82245107939284 08-Dec-2006 Stephen Hemminger <shemminger@osdl.org> [NET]: Convert hh_lock to seqlock.

The hard header cache is in the main output path, so using
seqlock instead of reader/writer lock should reduce overhead.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a1f8e7f7fb9d7e2cbcb53170edca7c0ac4680697 19-Oct-2006 Al Viro <viro@zeniv.linux.org.uk> [PATCH] severing skbuff.h -> highmem.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5084205faf45384fff25c4cf77dd5c96279283ad 15-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Annotate callers of csum_partial_copy_...() and csum_and_copy...() in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
44bb93633f57a55979f3c2589b10fd6a2bfc7c08 15-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Annotate csum_partial() callers in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
d3bc23e7ee9db8023dff5a86bb3b0069ed018789 15-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Annotate callers of csum_fold() in net/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
714e85be3557222bc25f69c252326207c900a7db 15-Nov-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV6]: Assorted trivial endianness annotations.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
931731123a103cfb3f70ac4b7abfc71d94ba1f03 10-Nov-2006 David S. Miller <davem@sunset.davemloft.net> [TCP]: Don't set SKB owner in tcp_transmit_skb().

The data itself is already charged to the SKB, doing
the skb_set_owner_w() just generates a lot of noise and
extra atomics we don't really need.

Lmbench improvements on lat_tcp are minimal:

before:
TCP latency using localhost: 23.2701 microseconds
TCP latency using localhost: 23.1994 microseconds
TCP latency using localhost: 23.2257 microseconds

after:
TCP latency using localhost: 22.8380 microseconds
TCP latency using localhost: 22.9465 microseconds
TCP latency using localhost: 22.8462 microseconds

Signed-off-by: David S. Miller <davem@davemloft.net>
82e91ffef60e6eba9848fe149ce1eecd2b5aef12 10-Nov-2006 Thomas Graf <tgraf@suug.ch> [NET]: Turn nfmark into generic mark

nfmark is being used in various subsystems and has become
the defacto mark field for all kinds of packets. Therefore
it makes sense to rename it to `mark' and remove the
dependency on CONFIG_NETFILTER.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ca3c68e76686bee058937ade2b96f4de58ee434 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: struct ip_options annotations

->faddr is net-endian; annotated as such, variables inferred to be net-endian
annotated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
13d8eaa06abfeb708b60fa64203a20db033088b3 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: ip_build_and_send_pkt() annotations

saddr and daddr are net-endian

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
9bcfcaf5e9cc887eb39236e43bdbe4b4b2572229 30-Aug-2006 Stephen Hemminger <shemminger@osdl.org> [NETFILTER] bridge: simplify nf_bridge_pad

Do some simple optimization on the nf_bridge_pad() function
and don't use magic constants. Eliminate a double call and
the #ifdef'd code for CONFIG_BRIDGE_NETFILTER.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ab32ea5d8a760e7dd4339634e95d7be24ee5b842 22-Sep-2006 Brian Haley <brian.haley@hp.com> [NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly

Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
84fa7933a33f806bbbaae6775e87459b1ec584c0 30-Aug-2006 Patrick McHardy <kaber@trash.net> [NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE

Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).

Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
beb8d13bed80f8388f1a9a107d07ddd342e627e8 05-Aug-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com> [MLSXFRM]: Add flow labeling

This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.

The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.

ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0668b47205e42c04e9c1b594573be5a822ac7f09 01-Sep-2006 Wei Dong <weid@nanjing-fnst.com> [IPV4]: Fix SNMPv2 "ipFragFails" counter error

When I tested Linux kernel 2.6.17.7 about statistics
"ipFragFails",found that this counter couldn't increase correctly. The
criteria is RFC2011:
RFC2011
ipFragFails OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of IP datagrams that have been discarded because
they needed to be fragmented at this entity but could not
be, e.g., because their Don't Fragment flag was set."
::= { ip 18 }

When I send big IP packet to a router with DF bit set to 1 which need to
be fragmented, and router just sends an ICMP error message
ICMP_FRAG_NEEDED but no increments for this counter(in the function
ip_fragment).

Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e9fa4f7bd291c29a785666e2fa5a9cf3241ee6c3 14-Aug-2006 Herbert Xu <herbert@gondor.apana.org.au> [INET]: Use pskb_trim_unique when trimming paged unique skbs

The IPv4/IPv6 datagram output path was using skb_trim to trim paged
packets because they know that the packet has not been cloned yet
(since the packet hasn't been given to anything else in the system).

This broke because skb_trim no longer allows paged packets to be
trimmed. Paged packets must be given to one of the pskb_trim functions
instead.

This patch adds a new pskb_trim_unique function to cover the IPv4/IPv6
datagram output path scenario and replaces the corresponding skb_trim
calls with it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
dafee490858f79e144c5e6cdd84ceb9efa20a3f1 02-Aug-2006 Wei Dong <weid@nanjing-fnst.com> [IPV6]: SNMPv2 "ipv6IfStatsOutFragCreates" counter error

When I tested linux kernel 2.6.71.7 about statistics
"ipv6IfStatsOutFragCreates", and found that it couldn't increase
correctly. The criteria is RFC 2465:

ipv6IfStatsOutFragCreates OBJECT-TYPE
SYNTAX Counter32
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The number of output datagram fragments that have
been generated as a result of fragmentation at
this output interface."
::= { ipv6IfStatsEntry 15 }

I think there are two issues in Linux kernel.
1st:
RFC2465 specifies the counter is "The number of output datagram
fragments...". I think increasing this counter after output a fragment
successfully is better. And it should not be increased even though a
fragment is created but failed to output.

2nd:
If we send a big ICMP/ICMPv6 echo request to a host, and receive
ICMP/ICMPv6 echo reply consisted of some fragments. As we know that in
Linux kernel first fragmentation occurs in ICMP layer(maybe saying
transport layer is better), but this is not the "real"
fragmentation,just do some "pre-fragment" -- allocate space for date,
and form a frag_list, etc. The "real" fragmentation happens in IP layer
-- set offset and MF flag and so on. So I think in "fast path" for
ip_fragment/ip6_fragment, if we send a fragment which "pre-fragment" by
upper layer we should also increase "ipv6IfStatsOutFragCreates".

Signed-off-by: Wei Dong <weid@nanjing-fnst.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
89114afd435a486deb8583e89f490fc274444d18 08-Jul-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET] gso: Add skb_is_gso

This patch adds the wrapper function skb_is_gso which can be used instead
of directly testing skb_shinfo(skb)->gso_size. This makes things a little
nicer and allows us to change the primary key for indicating whether an skb
is GSO (if we ever want to do that).

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
f83ef8c0b58dac17211a4c0b6df0e2b1bd6637b1 30-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au> [IPV6]: Added GSO support for TCPv6

This patch adds GSO support for IPv6 and TCPv6. This is based on a patch
by Ananda Raju <Ananda.Raju@neterion.com>. His original description is:

This patch enables TSO over IPv6. Currently Linux network stacks
restricts TSO over IPv6 by clearing of the NETIF_F_TSO bit from
"dev->features". This patch will remove this restriction.

This patch will introduce a new flag NETIF_F_TSO6 which will be used
to check whether device supports TSO over IPv6. If device support TSO
over IPv6 then we don't clear of NETIF_F_TSO and which will make the
TCP layer to create TSO packets. Any device supporting TSO over IPv6
will set NETIF_F_TSO6 flag in "dev->features" along with NETIF_F_TSO.

In case when user disables TSO using ethtool, NETIF_F_TSO will get
cleared from "dev->features". So even if we have NETIF_F_TSO6 we don't
get TSO packets created by TCP layer.

SKB_GSO_TCPV4 renamed to SKB_GSO_TCP to make it generic GSO packet.
SKB_GSO_UDPV4 renamed to SKB_GSO_UDP as UFO is not a IPv4 feature.
UFO is supported over IPv6 also

The following table shows there is significant improvement in
throughput with normal frames and CPU usage for both normal and jumbo.

--------------------------------------------------
| | 1500 | 9600 |
| ------------------|-------------------|
| | thru CPU | thru CPU |
--------------------------------------------------
| TSO OFF | 2.00 5.5% id | 5.66 20.0% id |
--------------------------------------------------
| TSO ON | 2.63 78.0 id | 5.67 39.0% id |
--------------------------------------------------

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
7967168cefdbc63bf332d6b1548eca7cd65ebbcc 22-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Merge TSO/UFO fields in sk_buff

Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not
going to scale if we add any more segmentation methods (e.g., DCCP). So
let's merge them.

They were used to tell the protocol of a packet. This function has been
subsumed by the new gso_type field. This is essentially a set of netdev
feature bits (shifted by 16 bits) that are required to process a specific
skb. As such it's easy to tell whether a given device can process a GSO
skb: you just have to and the gso_type field and the netdev's features
field.

I've made gso_type a conjunction. The idea is that you have a base type
(e.g., SKB_GSO_TCPV4) that can be modified further to support new features.
For example, if we add a hardware TSO type that supports ECN, they would
declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would
have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO
packets would be SKB_GSO_TCPV4. This means that only the CWR packets need
to be emulated in software.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
8648b3053bff39a7ee4c711d74268079c928a657 18-Jun-2006 Herbert Xu <herbert@gondor.apana.org.au> [NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUM

The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM
identically so we test for them in quite a few places. For the sake
of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two. We
also test the disjunct of NETIF_F_IP_CSUM and the other two in various
places, for that purpose I've added NETIF_F_ALL_CSUM.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
984bc16cc92ea3c247bf34ad667cfb95331b9d3c 09-Jun-2006 James Morris <jmorris@namei.org> [SECMARK]: Add secmark support to core networking.

Add a secmark field to the skbuff structure, to allow security subsystems to
place security markings on network packets. This is similar to the nfmark
field, except is intended for implementing security policy, rather than than
networking policy.

This patch was already acked in principle by Dave Miller.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3d9dd7564d5d7c28eb87b14c13a23806484667f3 15-Apr-2006 Zach Brown <zach.brown@oracle.com> [PATCH] ip_output: account for fraggap when checking to add trailer_len

During other work I noticed that ip_append_data() seemed to be forgetting to
include the frag gap in its calculation of a fragment that consumes the rest of
the payload. Herbert confirmed that this was a bug that snuck in during a
previous rework.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2e2f7aefa8a8ba4adb6ecee8cbb43fbe9ca4cc89 04-Apr-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: Fix fragmentation issues with bridge netfilter

The conntrack code doesn't do re-fragmentation of defragmented packets
anymore but relies on fragmentation in the IP layer. Purely bridged
packets don't pass through the IP layer, so the bridge netfilter code
needs to take care of fragmentation itself.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1a55d57b107c3e06935763905dc0fb235214569d 22-Mar-2006 Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> [TCP]: Do not use inet->id of global tcp_socket when sending RST.

The problem is in ip_push_pending_frames(), which uses:

if (!df) {
__ip_select_ident(iph, &rt->u.dst, 0);
} else {
iph->id = htons(inet->id++);
}

instead of ip_select_ident().

Right now I think the code is a nonsense. Most likely, I copied it from
old ip_build_xmit(), where it was really special, we had to decide
whether to generate unique ID when generating the first (well, the last)
fragment.

In ip_push_pending_frames() it does not make sense, it should use plain
ip_select_ident() instead.

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
baa829d8926f02ab04be6ec37780810d221c5b4b 13-Mar-2006 Patrick McHardy <kaber@trash.net> [IPV4/6]: Fix UFO error propagation

When ufo_append_data fails err is uninitialized, but returned back.
Strangely gcc doesn't notice it.

Coverity #901 and #902

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
48d5cad87c3a4998d0bda16ccfb5c60dfe4de5fb 16-Feb-2006 Patrick McHardy <kaber@trash.net> [XFRM]: Fix SNAT-related crash in xfrm4_output_finish

When a packet matching an IPsec policy is SNATed so it doesn't match any
policy anymore it looses its xfrm bundle, which makes xfrm4_output_finish
crash because of a NULL pointer dereference.

This patch directs these packets to the original output path instead. Since
the packets have already passed the POST_ROUTING hook, but need to start at
the beginning of the original output path which includes another
POST_ROUTING invocation, a flag is added to the IPCB to indicate that the
packet was rerouted and doesn't need to pass the POST_ROUTING hook again.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
cfacb0577e319b02ed42685a0a8e0f1657ac461b 09-Jan-2006 Patrick McHardy <kaber@trash.net> [IPV4]: ip_output.c needs xfrm.h

This patch fixes a warning from my IPsec patches:

CC net/ipv4/ip_output.o
net/ipv4/ip_output.c: In function 'ip_finish_output':
net/ipv4/ip_output.c:208: warning: implicit declaration of function
'xfrm4_output_finish'

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
97dc627fb3471664c72d0933790a90ba3f91e131 07-Jan-2006 Adrian Bunk <bunk@stusta.de> [IPV4]: make ip_fragment() static

Since there's no longer any external user of ip_fragment() we can make
it static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
5c901daaea3be0d900b3ae1fc9b5f64ff94e4f02 07-Jan-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: Redo policy lookups after NAT when neccessary

When NAT changes the key used for the xfrm lookup it needs to be done
again. If a new policy is returned in POST_ROUTING the packet needs
to be passed to xfrm4_output_one manually after all hooks were called
because POST_ROUTING is called with fixed okfn (ip_finish_output).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
76ab608d86cf1ef5c5c46819b5733eb9f9f964f8 06-Jan-2006 Alexey Dobriyan <adobriyan@gmail.com> [NET]: Endian-annotate struct iphdr

And fix trivial warnings that emerged.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1bd9bef6f9fe06dd0c628ac877c85b6b36aca062 05-Jan-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: Call POST_ROUTING hook before fragmentation

Call POST_ROUTING hook before fragmentation to get rid of the okfn use
in ip_refrag and save the useless fragmentation/defragmentation step
when NAT is used.

The patch introduces one user-visible change, the POSTROUTING chain
in the mangle table gets entire packets, not fragments, which should
simplify use of the MARK and CLASSIFY targets for queueing as a nice
side-effect.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
89cee8b1cbb9dac40c92ef1968aea2b45f82fd18 14-Dec-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Safer reassembly

Another spin of Herbert Xu's "safer ip reassembly" patch
for 2.6.16.

(The original patch is here:
http://marc.theaimsgroup.com/?l=linux-netdev&m=112281936522415&w=2
and my only contribution is to have tested it.)

This patch (optionally) does additional checks before accepting IP
fragments, which can greatly reduce the possibility of reassembling
fragments which originated from different IP datagrams.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4b30b1c6a3e58dc74f2dbb0aa39f16a23cfcdd56 30-Nov-2005 Adrian Bunk <bunk@stusta.de> [IPV4]: make two functions static

This patch makes two needlessly global functions static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
89f5f0aeed14ac7245f760b0b96c9269c87bcbbe 08-Nov-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Fix ip_queue_xmit identity increment for TSO packets

When ip_queue_xmit calls ip_select_ident_more for IP identity selection
it gives it the wrong packet count for TSO packets. The ip_select_*
functions expect one less than the number of packets, so we need to
subtract one for TSO packets.

This bug was diagnosed and fixed by Tom Young.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
a51482bde22f99c63fbbb57d5d46cc666384e379 08-Nov-2005 Jesper Juhl <jesper.juhl@gmail.com> [NET]: kfree cleanup

From: Jesper Juhl <jesper.juhl@gmail.com>

This is the net/ part of the big kfree cleanup patch.

Remove pointless checks for NULL prior to calling kfree() in net/.

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnaldo Carvalho de Melo <acme@conectiva.com.br>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
e89e9cf539a28df7d0eb1d0a545368e9920b34ac 19-Oct-2005 Ananda Raju <ananda.raju@neterion.com> [IPv4/IPv6]: UFO Scatter-gather approach

Attached is kernel patch for UDP Fragmentation Offload (UFO) feature.

1. This patch incorporate the review comments by Jeff Garzik.
2. Renamed USO as UFO (UDP Fragmentation Offload)
3. udp sendfile support with UFO

This patches uses scatter-gather feature of skb to generate large UDP
datagram. Below is a "how-to" on changes required in network device
driver to use the UFO interface.

UDP Fragmentation Offload (UFO) Interface:
-------------------------------------------
UFO is a feature wherein the Linux kernel network stack will offload the
IP fragmentation functionality of large UDP datagram to hardware. This
will reduce the overhead of stack in fragmenting the large UDP datagram to
MTU sized packets

1) Drivers indicate their capability of UFO using
dev->features |= NETIF_F_UFO | NETIF_F_HW_CSUM | NETIF_F_SG

NETIF_F_HW_CSUM is required for UFO over ipv6.

2) UFO packet will be submitted for transmission using driver xmit routine.
UFO packet will have a non-zero value for

"skb_shinfo(skb)->ufo_size"

skb_shinfo(skb)->ufo_size will indicate the length of data part in each IP
fragment going out of the adapter after IP fragmentation by hardware.

skb->data will contain MAC/IP/UDP header and skb_shinfo(skb)->frags[]
contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW
indicating that hardware has to do checksum calculation. Hardware should
compute the UDP checksum of complete datagram and also ip header checksum of
each fragmented IP packet.

For IPV6 the UFO provides the fragment identification-id in
skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating
IPv6 fragments.

Signed-off-by: Ananda Raju <ananda.raju@neterion.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
0d0d2bba97cb091218ea0bcb3d8debcc7bf48397 13-Oct-2005 Jayachandran C <jchandra@digeo.com> [IPV4]: Remove dead code from ip_output.c

skb_prev is assigned from skb, which cannot be NULL. This patch removes the
unnecessary NULL check.

Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
c98d80edc827277c28f88d662a7d6e9affa7e12f 22-Oct-2005 Julian Anastasov <ja@ssi.bg> [SK_BUFF]: ipvs_property field must be copied

IPVS used flag NFC_IPVS_PROPERTY in nfcache but as now nfcache was removed the
new flag 'ipvs_property' still needs to be copied. This patch should be
included in 2.6.14.

Further comments from Harald Welte:

Sorry, seems like the bug was introduced by me.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
33d043d65bbd3d97efca96c9bbada443cac3c4da 21-Aug-2005 Thomas Graf <tgraf@suug.ch> [IPV4]: ip_finish_output() can be inlined

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
64ce207306debd7157f47282be94770407bec01c 10-Aug-2005 Patrick McHardy <kaber@trash.net> [NET]: Make NETDEBUG pure printk wrappers

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8c97a9451068dd9f7b838a240bb6db894133a5e 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [NET]: Export symbols needed by the current DCCP code

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
32519f11d38ea8f4f60896763bacec7db1760f9c 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [INET]: Introduce inet_sk_rebuild_header

From tcp_v4_rebuild_header, that already was pretty generic, I only
needed to use sk->sk_protocol instead of the hardcoded IPPROTO_TCP and
establish the requirement that INET transport layer protocols that
want to use this function map TCP_SYN_SENT to its equivalent state.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6cbb0df788b90777a7ed0f9d8261260353f48076 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [SOCK]: Introduce sk_setup_caps

From tcp_v4_setup_caps, that always is preceded by a call to
__sk_dst_set, so coalesce this sequence into sk_setup_caps, removing
one call to a TCP function in the IP layer.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
0742fd53a3774781255bd1e471e7aa2e4a82d5f7 10-Aug-2005 Adrian Bunk <bunk@stusta.de> [IPV4]: possible cleanups

This patch contains the following possible cleanups:
- make needlessly global code static
- #if 0 the following unused global function:
- xfrm4_state.c: xfrm4_state_fini
- remove the following unneeded EXPORT_SYMBOL's:
- ip_output.c: ip_finish_output
- ip_output.c: sysctl_ip_default_ttl
- fib_frontend.c: ip_dev_find
- inetpeer.c: inet_peer_idlock
- ip_options.c: ip_options_compile
- ip_options.c: ip_options_undo
- net/core/request_sock.c: sysctl_max_syn_backlog

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6869c4d8e066e21623c812c448a05f1ed931c9c6 10-Aug-2005 Harald Welte <laforge@netfilter.org> [NETFILTER]: reduce netfilter sk_buff enlargement

As discussed at netconf'05, we're trying to save every bit in sk_buff.
The patch below makes sk_buff 8 bytes smaller. I did some basic
testing on my notebook and it seems to work.

The only real in-tree user of nfcache was IPVS, who only needs a
single bit. Unfortunately I couldn't find some other free bit in
sk_buff to stuff that bit into, so I introduced a separate field for
them. Maybe the IPVS guys can resolve that to further save space.

Initially I wanted to shrink pkt_type to three bits (PACKET_HOST and
alike are only 6 values defined), but unfortunately the bluetooth code
overloads pkt_type :(

The conntrack-event-api (out-of-tree) uses nfcache, but Rusty just
came up with a way how to do it without any skb fields, so it's safe
to remove it.

- remove all never-implemented 'nfcache' code
- don't have ipvs code abuse 'nfcache' field. currently get's their own
compile-conditional skb->ipvs_property field. IPVS maintainers can
decide to move this bit elswhere, but nfcache needs to die.
- remove skb->nfcache field to save 4 bytes
- move skb->nfctinfo into three unused bits to save further 4 bytes

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
84531c24f27b02daa8e54e2bb6dc74a730fdf0a5 12-Jul-2005 Phil Oester <kernel@linuxace.com> [NETFILTER]: Revert nf_reset change

Revert the nf_reset change that caused so much trouble, drop conntrack
references manually before packets are queued to packet sockets.

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
30e224d76f34e041c30df66a4dcbeeb53556ea3f 05-Jul-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Fix crash in ip_rcv while booting related to netconsole

Makes IPv4 ip_rcv registration happen last in af_inet.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
e176fe8954a5239c24afe79b1001ba3c29511963 05-Jul-2005 Thomas Graf <tgraf@suug.ch> [NET]: Remove unused security member in sk_buff

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
9666dae51013b064e7d77fc36b5cee98dd167ed5 29-Jun-2005 Patrick McHardy <kaber@trash.net> [NETFILTER]: Fix connection tracking bug in 2.6.12

In 2.6.12 we started dropping the conntrack reference when a packet
leaves the IP layer. This broke connection tracking on a bridge,
because bridge-netfilter defers calling some NF_IP_* hooks to the bridge
layer for locally generated packets going out a bridge, where the
conntrack reference is no longer available. This patch keeps the
reference in this case as a temporary solution, long term we will
remove the defered hook calling. No attempt is made to drop the
reference in the bridge-code when it is no longer needed, tc actions
could already have sent the packet anywhere.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
18b8afc771102b1b6af97962808291a7d27f52af 21-Jun-2005 Patrick McHardy <kaber@trash.net> [NETFILTER]: Kill nf_debug

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2fdba6b085eb7068e9594cfa55ffe40466184b4d 19-May-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4/IPV6] Ensure all frag_list members have NULL sk

Having frag_list members which holds wmem of an sk leads to nightmares
with partially cloned frag skb's. The reason is that once you unleash
a skb with a frag_list that has individual sk ownerships into the stack
you can never undo those ownerships safely as they may have been cloned
by things like netfilter. Since we have to undo them in order to make
skb_linearize happy this approach leads to a dead-end.

So let's go the other way and make this an invariant:

For any skb on a frag_list, skb->sk must be NULL.

That is, the socket ownership always belongs to the head skb.
It turns out that the implementation is actually pretty simple.

The above invariant is actually violated in the following patch
for a short duration inside ip_fragment. This is OK because the
offending frag_list member is either destroyed at the end of the
slow path without being sent anywhere, or it is detached from
the frag_list before being sent.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
02c30a84e6298b6b20a56f0896ac80b47839e134 06-May-2005 Jesper Juhl <juhl-lkml@dif.dk> [PATCH] update Ross Biro bouncing email address

Ross moved. Remove the bad email address so people will find the correct
one in ./CREDITS.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
bd96535b81ad09d7593cc75093534acb984d3dc9 04-May-2005 Patrick McHardy <kaber@trash.net> [NETFILTER]: Drop conntrack reference in ip_dev_loopback_xmit()

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
b31e5b1bb53b99dfd5e890aa07e943aff114ae1c 25-Apr-2005 Patrick McHardy <kaber@trash.net> [NETFILTER]: Drop conntrack reference when packet leaves IP

In the event a raw socket is created for sending purposes only, the creator
never bothers to check the socket's receive queue. But we continue to
add skbs to its queue until it fills up.

Unfortunately, if ip_conntrack is loaded on the box, each skb we add to the
queue potentially holds a reference to a conntrack. If the user attempts
to unload ip_conntrack, we will spin around forever since the queued skbs
are pinned.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!