History log of /net/core/neighbour.c
Revision Date Author Comments
545469f7a5d7f7b2a17b74da0a1bd0c1aea2f545 25-Jul-2014 Jun Zhao <mypopydev@gmail.com> neighbour : fix ndm_type type error issue

ndm_type is the L3 address type; for neighbour proxy and vxlan it is RTN_UNICAST.
NDA_DST is a netlink TLV type, hence it is not the right value in this context.

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ecf07a1d8f70f72ec99a0f102c8aa24609d84f4 12-Jul-2014 Mathias Krause <minipli@googlemail.com> neigh: sysctl - simplify address calculation of gc_* variables

The code in neigh_sysctl_register() relies on a specific layout of
struct neigh_table, namely that the 'gc_*' variables are directly
following the 'parms' member in a specific order. The code, though,
expresses this in the most ugly way.

Get rid of the ugly casts and use the 'tbl' pointer to get a handle to
the table. This way we can refer to the 'gc_*' variables directly.

Similarly seen in the grsecurity patch, written by Brad Spengler.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2176d5d41891753774f648b67470398a5acab584 09-May-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com> neigh: set nud_state to NUD_INCOMPLETE when probing router reachability

Since commit 7e98056964 ("ipv6: router reachability probing"), a router that
falls into NUD_FAILED will be probed.

Now if function rt6_select() selects a router whose neighbour state is NUD_FAILED,
and at the same time function rt6_probe() changes the neighbour state to NUD_PROBE,
then function dst_neigh_output() can directly send packets, but the neighbour
is actually still unreachable. If we set nud_state to NUD_INCOMPLETE instead of
NUD_PROBE, packets will not be sent out until the neighbour is reachable.

In addition, because the route should be probed with a single NS, we must
set neigh->probes to neigh_max_probes(), so that when the neigh timer expires,
neigh_timer_handler() will not send further NS messages.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
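
A minimal userspace sketch of why the choice of state matters. It assumes the
transmit path skips resolution for NUD_CONNECTED, NUD_DELAY and NUD_PROBE, and
that the timer only re-solicits while probes is below the maximum. The NUD_*
values mirror the uapi header; both helper names are hypothetical and are not
kernel functions.

```c
#include <stdbool.h>

/* Values mirror include/uapi/linux/neighbour.h. */
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE  0x02
#define NUD_STALE      0x04
#define NUD_DELAY      0x08
#define NUD_PROBE      0x10
#define NUD_FAILED     0x20
#define NUD_NOARP      0x40
#define NUD_PERMANENT  0x80
#define NUD_CONNECTED  (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical helper: true when a packet would go out without waiting. */
static bool sends_without_resolving(unsigned char nud_state)
{
	return (nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE)) != 0;
}

/* Hypothetical helper: true when the timer may send another solicitation. */
static bool timer_may_probe_again(unsigned char nud_state, int probes,
				  int max_probes)
{
	return (nud_state & (NUD_INCOMPLETE | NUD_PROBE)) != 0 &&
	       probes < max_probes;
}
```

Under these assumptions NUD_PROBE lets traffic through immediately while
NUD_INCOMPLETE holds it back, and starting with probes equal to the maximum
leaves room for only the single NS that triggered the probe.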
feff9ab2e7fa773b6a3965f77375fe89f7fd85cf 27-Feb-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com> neigh: recompute reachabletime before returning from neigh_periodic_work()

If the number of entries in the neigh table is less than gc_thresh1, the
function returns directly and the reachabletime is not recomputed,
so the reachabletime can be guessed.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
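
The ordering can be sketched in userspace as follows. This is a simplified
model, not the kernel function: the struct, its fields and the scans counter
are hypothetical, the random number is passed in so the result is
reproducible, and the real code only re-randomizes periodically rather than on
every run.

```c
#include <stdint.h>

struct table_sketch {
	unsigned int entries;
	unsigned int gc_thresh1;
	uint32_t base_reachable_time;
	uint32_t reachable_time;
	unsigned int scans;	/* counts full table scans, for illustration */
};

/* Pick a value in [base/2, 3*base/2), given a random 32-bit number. */
static uint32_t rand_reach_time(uint32_t base, uint32_t rnd)
{
	return base ? (rnd % base) + (base >> 1) : 0;
}

static void periodic_work(struct table_sketch *t, uint32_t rnd)
{
	/* Recompute first, so the early return below cannot skip it. */
	t->reachable_time = rand_reach_time(t->base_reachable_time, rnd);

	if (t->entries < t->gc_thresh1)
		return;		/* few entries: nothing to collect */

	t->scans++;		/* stand-in for walking the hash table */
}
```

With the recomputation placed after the threshold check instead, a table that
stays small would keep the same reachable time indefinitely.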
5e2c21dceb5d324b204fda1f28270bb3dbccedb3 27-Feb-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com> neigh: directly goto out after setting nud_state to NUD_FAILED

Because those following if conditions will not be matched.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a960ff81f0cd6390940faa75a078ac76acec7940 26-Feb-2014 Timo Teräs <timo.teras@iki.fi> neigh: probe application via netlink in NUD_PROBE

iproute2 arpd seems to expect this as there's code and comments
to handle netlink probes with NUD_PROBE set. It is used to flush
the arpd cached mappings.

opennhrp instead turns off unicast probes (so it can handle all
neighbour discovery). Without this change it will not see NUD_PROBE
probes and cannot reconfirm the mapping. Thus currently the neigh entry
will just fail, which can cause a few packets to be dropped until broadcast
discovery is restarted.

Earlier discussion on the subject:
http://marc.info/?t=139305877100001&r=1&w=2

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
b194c1f1dbd5f2671e49e0ac801b1b78dc7de93b 21-Feb-2014 Jiri Pirko <jiri@resnulli.us> neigh: fix setting of default gc_* values

This patch fixes bug introduced by:
commit 1d4c8c29841b9991cdf3c7cc4ba7f96a94f104ca
"neigh: restore old behaviour of default parms values"

The thing is that in neigh_sysctl_register, extra1 and extra2 which were
previously set for NEIGH_VAR_GC_* are overwritten. That leads to
nonsense int limits for gc_* variables. So fix this by not touching
extra* fields for gc_* variables.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
f618002b0b598036ebb8feceb44ea8e05f4cd37b 21-Jan-2014 viresh kumar <viresh.kumar@linaro.org> net/neighbour: queue work on power efficient wq

Workqueues used in the neighbour layer have no real dependency on being scheduled on
the cpu which queued them.

On an idle system, it is observed that an idle cpu wakes up many times just to
service this work. It would be better if we could schedule it on a cpu which the
scheduler believes to be the most appropriate one.

This patch replaces normal workqueues with power efficient versions. This
doesn't change existing behavior of code unless CONFIG_WQ_POWER_EFFICIENT is
enabled.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3977458c9c61fc27a5a88cfcf51696b838d1ecc9 14-Jan-2014 Jiri Pirko <jiri@resnulli.us> neigh: split lines for NEIGH_VAR_SET so they are not too long

introduced by:
commit 1f9248e5606afc6485255e38ad57bdac08fa7711
"neigh: convert parms to an array"

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
63862b5bef7349dd1137e4c70702c67d77565785 11-Jan-2014 Aruna-Hewapathirane <aruna.hewapathirane@gmail.com> net: replace macros net_random and net_srandom with direct calls to prandom

This patch removes the net_random and net_srandom macros and replaces
them with direct calls to the prandom ones. As new commits only seem to
use prandom_u32 there is no use to keep them around.
This change makes it easier to grep for users of prandom_u32.

Signed-off-by: Aruna-Hewapathirane <aruna.hewapathirane@gmail.com>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2205369a314e12fcec4781cc73ac9c08fc2b47de 31-Dec-2013 David S. Miller <davem@davemloft.net> vlan: Fix header ops passthru when doing TX VLAN offload.

When the vlan code detects that the real device can do TX VLAN offloads
in hardware, it tries to arrange for the real device's header_ops to
be invoked directly.

But it does so illegally, by simply hooking the real device's
header_ops up to the VLAN device.

This doesn't work because we will end up invoking a set of header_ops
routines which expect a device type which matches the real device, but
will see a VLAN device instead.

Fix this by providing a pass-thru set of header_ops which will arrange
to pass the proper real device instead.

To facilitate this add a dev_rebuild_header(). There are
implementations which provide a ->cache and ->create but not a
->rebuild (f.e. PLIP). So we need a helper function just like
dev_hard_header() to avoid crashes.

Use this helper in the one existing place where the
header_ops->rebuild was being invoked, the neighbour code.

With lots of help from Florian Westphal.

Signed-off-by: David S. Miller <davem@davemloft.net>
53385d2d1de84f4036a0919ec46964c4e81b83f5 15-Dec-2013 Bob Gilligan <gilligan@aristanetworks.com> neigh: Netlink notification for administrative NUD state change

The neighbour code sends up an RTM_NEWNEIGH netlink notification if
the NUD state of a neighbour cache entry is changed by a timer (e.g.
from REACHABLE to STALE), even if the lladdr of the entry has not
changed.

But an administrative change to the NUD state of a neighbour cache
entry that does not change the lladdr (e.g. via "ip -4 neigh change
... nud ...") does not trigger a netlink notification. This means
that netlink listeners will not hear about administrative NUD state
changes such as from a resolved state to PERMANENT.

This patch changes the neighbor code to generate an RTM_NEWNEIGH
message when the NUD state of an entry is changed administratively.

Signed-off-by: Bob Gilligan <gilligan@aristanetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7e9805696428113e34625a65a30dbc62cb78acc5 11-Dec-2013 Jiri Benc <jbenc@redhat.com> ipv6: router reachability probing

RFC 4191 states in 3.5:

When a host avoids using any non-reachable router X and instead sends
a data packet to another router Y, and the host would have used
router X if router X were reachable, then the host SHOULD probe each
such router X's reachability by sending a single Neighbor
Solicitation to that router's address. A host MUST NOT probe a
router's reachability in the absence of useful traffic that the host
would have sent to the router if it were reachable. In any case,
these probes MUST be rate-limited to no more than one per minute per
router.

Currently, when the neighbour corresponding to a router falls into
NUD_FAILED, it's never considered again. Introduce a new rt6_nud_state
value, RT6_NUD_FAIL_PROBE, which suggests the route should not be used but
should be probed with a single NS. The probe is ratelimited by the existing
code. To better distinguish meanings of the failure values, rename
RT6_NUD_FAIL_SOFT to RT6_NUD_FAIL_DO_RR.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
77d47afbf3c58350c3708b609005358bbd33e085 10-Dec-2013 Jiri Pirko <jiri@resnulli.us> neigh: use neigh_parms_net() to get struct neigh_parms->net pointer

This fixes compile error when CONFIG_NET_NS is not set.

Introduced by:
commit 1d4c8c29841b9991cdf3c7cc4ba7f96a94f104ca
"neigh: restore old behaviour of default parms values"

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
bba24896f022d4d239494bebf18e713cd8aec7a5 07-Dec-2013 Jiri Pirko <jiri@resnulli.us> neigh: ipv6: respect default values set before an address is assigned to device

Make the behaviour similar to ipv4. This allows the user to set sysctl
default neigh param values, and these values will be respected even by
devices registered earlier (those that do not have an address set yet).

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
1d4c8c29841b9991cdf3c7cc4ba7f96a94f104ca 07-Dec-2013 Jiri Pirko <jiri@resnulli.us> neigh: restore old behaviour of default parms values

Previously inet devices were only constructed when addresses are added.
Therefore the default neigh parms values they get are the ones at the
time of these operations.

Now that we're creating inet devices earlier, this changes the behaviour
of default neigh parms values in an incompatible way (see bug #8519).

This patch creates a compromise by setting the default values at the
same point as before but only for those that have not been explicitly
set by the user since the inet device's creation.

Introduced by:
commit 8030f54499925d073a88c09f30d5d844fb1b3190
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu Feb 22 01:53:47 2007 +0900

[IPV4] devinet: Register inetdev earlier.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
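
One way to picture the compromise is a per-device bitmask recording which
values the user has written, with the defaults copied only into the remaining
slots. The sketch below is an illustration under that assumption; the enum,
struct and function names are hypothetical and do not claim to match the
kernel's.

```c
enum { VAR_MCAST_PROBES, VAR_UCAST_PROBES, VAR_RETRANS_TIME, VAR_MAX };

struct parms_sketch {
	int data[VAR_MAX];
	unsigned long set_mask;	/* bit i set: data[i] was written by the user */
};

/* Store a value and remember that it was set explicitly. */
static void var_set(struct parms_sketch *p, int idx, int val)
{
	p->data[idx] = val;
	p->set_mask |= 1UL << idx;
}

/* Copy defaults into every slot the user has not touched. */
static void copy_unset_defaults(struct parms_sketch *p,
				const struct parms_sketch *dflt)
{
	for (int i = 0; i < VAR_MAX; i++)
		if (!(p->set_mask & (1UL << i)))
			p->data[i] = dflt->data[i];
}
```

Calling copy_unset_defaults() at the point where the first address is added
would pick up the current defaults while leaving explicit per-device settings
alone.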
73af614aedd221df8495fc8c9993c50e87f899f2 07-Dec-2013 Jiri Pirko <jiri@resnulli.us> neigh: use tbl->family to distinguish ipv4 from ipv6

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb5b09c17fe60056bc8f127ffc987d361c40ed4b 07-Dec-2013 Jiri Pirko <jiri@resnulli.us> neigh: wrap proc dointvec functions

This will be needed later on to provide better management of default values.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
1f9248e5606afc6485255e38ad57bdac08fa7711 07-Dec-2013 Jiri Pirko <jiri@resnulli.us> neigh: convert parms to an array

This patch converts the neigh param members to an array. This allows easier
manipulation which will be needed later on to provide better management of
default values.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
4ed377e36ec2f385484d12e516faf88516fad31c 21-Sep-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> net: neighbour: use source address of last enqueued packet for solicitation

Currently we always use the first member of the arp_queue to determine
the sender ip address of the arp packet (or in case of IPv6 - source
address of the ndisc packet). This skb is fixed as long as the queue is
not drained by a complete purge because of a timeout or by a successful
response.

If the first packet enqueued on the arp_queue is from a local application
with a manually set source address and the to be discovered system
does some kind of uRPF checks on the source address in the arp packet
the resolving process hangs until a timeout and restarts. This hurts
communication with the participating network node.

This could be mitigated a bit if we use the latest enqueued skb's
source address for the resolving process, which is not as static as
the arp_queue's head. This change of the source address could result in
better recovery of a failed solicitation.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
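
A small sketch of the selection rule, using a plain array in place of the
skb queue. The struct and function names are hypothetical; the only point is
that the newest entry is consulted rather than the head.

```c
#include <stddef.h>
#include <stdint.h>

#define QLEN_MAX 8

struct pending_queue {
	uint32_t saddr[QLEN_MAX];	/* source address of each queued packet */
	size_t len;
};

/* Returns 0 and stores the chosen source address, or -1 if the queue is empty. */
static int solicit_source(const struct pending_queue *q, uint32_t *saddr)
{
	if (q->len == 0)
		return -1;
	*saddr = q->saddr[q->len - 1];	/* newest entry, not q->saddr[0] */
	return 0;
}
```

If the oldest packet carries a source address the peer rejects, a later packet
with a different source gets a chance to drive the next solicitation.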
3e25c65ed085b361cc91a8f02e028f1158c9f255 29-Aug-2013 Tim Gardner <tim.gardner@canonical.com> net: neighbour: Remove CONFIG_ARPD

This config option is superfluous in that it only guards a call
to neigh_app_ns(). Enabling CONFIG_ARPD by default causes no
change in behavior. There will now be a call to __neigh_notify()
for each ARP resolution, which has no impact unless there is a
user space daemon waiting to receive the notification, i.e.,
the case for which CONFIG_ARPD was designed anyway.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Joe Perches <joe@perches.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fc7f8f5c53fdb82d4689e24df3da1a88bc3859f7 02-Aug-2013 Veaceslav Falico <vfalico@redhat.com> neighbour: populate neigh_parms on alloc before calling ndo_neigh_setup

dev->ndo_neigh_setup() might need some of the values of neigh_parms, so
populate them before calling it.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
63134803a6369dcf7dddf7f0d5e37b9566b308d2 02-Aug-2013 Veaceslav Falico <vfalico@redhat.com> neighbour: populate neigh_parms on alloc before calling ndo_neigh_setup

dev->ndo_neigh_setup() might need some of the values of neigh_parms, so
populate them before calling it.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
555445cd11803c6bc93b2be31968f3949ef7708b 24-Jul-2013 Francesco Fusco <ffusco@redhat.com> neigh: prevent overflowing params in /proc/sys/net/ipv4/neigh/

Without this patch, the fields app_solicit, gc_thresh1, gc_thresh2,
gc_thresh3, proxy_qlen, ucast_solicit, mcast_solicit could have
assumed negative values when setting large numbers.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
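
A userspace sketch of the kind of range check that prevents this, assuming the
intent is to accept only values in [0, INT_MAX]. parse_bounded_int() is a
hypothetical helper written for illustration, not the kernel's sysctl handler.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a decimal string into *out, accepting only values in [min, max]. */
static int parse_bounded_int(const char *s, int min, int max, int *out)
{
	char *end;
	long long v;

	errno = 0;
	v = strtoll(s, &end, 10);
	if (errno != 0 || end == s || *end != '\0')
		return -EINVAL;		/* not a clean decimal number */
	if (v < min || v > max)
		return -EINVAL;		/* would not fit the int field */
	*out = (int)v;			/* *out is untouched on failure */
	return 0;
}
```

Parsing into a wider type and checking before narrowing is what keeps a large
input from silently turning into a negative int.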
c9ab4d85de222f3390c67aedc9c18a50e767531e 28-Jun-2013 Eric Dumazet <eric.dumazet@gmail.com> neighbour: fix a race in neigh_destroy()

There is a race in neighbour code, because neigh_destroy() uses
skb_queue_purge(&neigh->arp_queue) without holding neighbour lock,
while other parts of the code assume neighbour rwlock is what
protects arp_queue

Convert all skb_queue_purge() calls to the __skb_queue_purge() variant

Use __skb_queue_head_init() instead of skb_queue_head_init()
to make clear we do not use arp_queue.lock

And hold neigh->lock in neigh_destroy() to close the race.

Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dc25c676f54addb10e598daa9da9b8dd4fd487ab 20-Jun-2013 Gao feng <gaofeng@cn.fujitsu.com> neigh: disallow un-init_net to change thresh of neigh

thresh and interval are global resources,
only init net can change them.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
170d6f99541600ec7512f1d2b0b0c349009098d2 20-Jun-2013 Gao feng <gaofeng@cn.fujitsu.com> neigh: only allow init_net to change the default neigh_parms

Though we don't export the /proc/sys/net/ipv[4,6]/neigh/default/
directory to the un-init_net, we can still use a command such as
"ip ntable change name arp_cache locktime 129" to change the locktime
of the default neigh_parms.

This patch prevents the un-init_net from finding the neigh_table.parms,
so the un-init_net will fail to influence the init_net.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cf89d6b2803ab99ac596f95d585c3057d2be645c 20-Jun-2013 Gao feng <gaofeng@cn.fujitsu.com> neigh: no need to call lookup_neigh_parms in neigh_parms_alloc

neigh_table.parms always exists and is initialized, so kmemdup
can use it to create the new neigh_parms; lookup_neigh_parms
here would actually return neigh_table.parms too.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fe2c6338fd2c6f383c4d4164262f35c8f3708e1f 12-Jun-2013 Joe Perches <joe@perches.com> net: Convert uses of typedef ctl_table to struct ctl_table

Reduce the uses of this unnecessary typedef.

Done via perl script:

$ git grep --name-only -w ctl_table net | \
xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

Reflow the modified lines that now exceed 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d5d427cdaeae33752fbd5c674cc52a8f8e65a550 15-Apr-2013 Joe Perches <joe@perches.com> neighbour: Convert NEIGH_PRINTK to neigh_dbg

Update debugging messages to a more current style.

Emit these debugging messages at KERN_DEBUG instead
of KERN_DEFAULT.

Add and use neigh_dbg(level, fmt, ...) macro
Add dynamic_debug capability via pr_debug
Convert embedded function names to "%s: ", __func__

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d9dda78bad879595d8c4220a067fc029d6484a16 01-Apr-2013 Al Viro <viro@zeniv.linux.org.uk> procfs: new helper - PDE_DATA(inode)

The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data. Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
661d2967b3f1b34eeaa7e212e7b9bbe8ee072b59 21-Mar-2013 Thomas Graf <tgraf@suug.ch> rtnetlink: Remove passing of attributes into rtnl_doit functions

With decnet converted, we can finally get rid of rta_buf and its
computations around it. It also gets rid of the minimal header
length verification since all message handlers do that explicitly
anyway.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
08433eff2d041b263c68306f6a6ccb4e1f75e196 24-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> net neigh: Optimize neighbor entry size calculation.

When allocating memory for neighbour cache entry, if
tbl->entry_size is not set, we always calculate
sizeof(struct neighbour) + tbl->key_len, which is common
in the same table.

With this change, set tbl->entry_size during the table
initialization phase, if it was not set, and use it in
neigh_alloc() and neighbour_priv().

This change also allows us to have both protocol private
data and device private data at the same time.

Note that the only user of protocol private data is DECnet
and the only user of device private data is ATM CLIP.
Since those are exclusive, we have not been facing issues
here.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
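
A sketch of the idea in userspace C: compute the per-entry size once at table
initialization, round it up so that private data placed after the key stays
aligned, and reuse it for allocation and for locating the private area. The
struct layouts, the alignment constant and all names are assumptions made for
this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define PRIV_ALIGN sizeof(long long)
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

struct entry_sketch {
	long refcnt;
	unsigned char primary_key[];	/* key bytes follow the fixed part */
};

struct entry_table {
	size_t key_len;
	size_t entry_size;	/* 0 means "not set by the protocol" */
};

/* Compute the size once, unless the protocol already provided one. */
static void table_init(struct entry_table *t)
{
	if (!t->entry_size)
		t->entry_size = ALIGN_UP(offsetof(struct entry_sketch, primary_key) +
					 t->key_len, PRIV_ALIGN);
}

/* Allocate one entry plus priv_len bytes of private data behind it. */
static struct entry_sketch *entry_alloc(const struct entry_table *t,
					size_t priv_len)
{
	return calloc(1, t->entry_size + priv_len);
}

/* Private data starts right after the (aligned) entry. */
static void *entry_priv(const struct entry_table *t, struct entry_sketch *e)
{
	return (char *)e + t->entry_size;
}
```

Because the private area is found from entry_size rather than from the key
length, a protocol-provided entry_size and a device private area can coexist.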
2724680bceee94eac391552863771af105a7355c 22-Jan-2013 YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> neigh: Keep neighbour cache entries if number of them is small enough.

Since we have removed NCE (Neighbour Cache Entry) reference from
routing entries, the only refcnt holders of an NCE are its timer
(if running) and its owner table, in usual cases. As a result,
neigh_periodic_work() purges NCEs over and over again even for
gateways.

It does not make sense to purge entries, if number of them is
very small, so keep them. The minimum number of entries to keep
is specified by gc_thresh1.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
b93196dc5af7729ff7cc50d3d322ab1a364aa14f 06-Dec-2012 Cong Wang <amwang@redhat.com> net: fix some compiler warning in net/core/neighbour.c

net/core/neighbour.c:65:12: warning: 'zero' defined but not used [-Wunused-variable]
net/core/neighbour.c:66:12: warning: 'unres_qlen_max' defined but not used [-Wunused-variable]

These variables are only used when CONFIG_SYSCTL is defined,
so move them under #ifdef CONFIG_SYSCTL.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Acked-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ce46cc64d47a8afaf13c300b09a7f9c29f4979b6 04-Dec-2012 Shan Wei <davidshan@tencent.com> net: neighbour: prohibit negative value for unres_qlen_bytes parameter

unres_qlen_bytes and unres_qlen are of int type.
But the multiplicative relation (unres_qlen_bytes = unres_qlen * SKB_TRUESIZE(ETH_FRAME_LEN))
will cause type overflow when setting unres_qlen, e.g.

$ echo 1027506 > /proc/sys/net/ipv4/neigh/eth1/unres_qlen
$ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen
1182657265
$ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen_bytes
-2147479756

The resulting value is not the one we set,
but the user/administrator has no way of knowing this is caused by int type overflow.

What's more, it is meaningless and even dangerous for unres_qlen_bytes to be set
to a negative number, because for an unresolved neighbour address the kernel will then cache packets
without limit in __neigh_event_send() (e.g. (u32)-1 = 2GB).

Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
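
The overflow in the example can be reproduced with a checked conversion. The
per-packet size of 2090 bytes used in the note below is inferred from the
numbers shown above (1027506 * 2090 = 2147487540, which exceeds a 32-bit
INT_MAX and wraps to -2147479756); the real SKB_TRUESIZE(ETH_FRAME_LEN)
depends on architecture and configuration. qlen_to_bytes() is a hypothetical
helper, not kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Convert a packet count into a byte budget; fails instead of wrapping. */
static int qlen_to_bytes(int qlen, int truesize, int *bytes)
{
	int64_t wide;

	assert(truesize > 0);
	if (qlen < 0)
		return -1;			/* negative limits are rejected */
	wide = (int64_t)qlen * truesize;	/* 64-bit, so it cannot wrap here */
	if (wide > INT_MAX)
		return -1;			/* would overflow the int field */
	*bytes = (int)wide;
	return 0;
}
```

With a truesize of 2090, qlen_to_bytes() accepts values up to
INT_MAX / 2090 and rejects 1027506, which is equivalent to bounding the sysctl
between zero and INT_MAX divided by the per-packet size.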
b51642f6d77b131dc85d1d71029c3cbb5b07c262 16-Nov-2012 Eric W. Biederman <ebiederm@xmission.com> net: Enable a userns root rtnl calls that are safe for unprivilged users

- Only allow moving network devices to network namespaces you have
CAP_NET_ADMIN privileges over.

- Enable creating/deleting/modifying interfaces
- Enable adding/deleting addresses
- Enable adding/setting/deleting neighbour entries
- Enable adding/removing routes
- Enable adding/removing fib rules
- Enable setting the forwarding state
- Enable adding/removing ipv6 address labels
- Enable setting bridge parameter

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dfc47ef8639facd77210e74be831943c2fdd9c74 16-Nov-2012 Eric W. Biederman <ebiederm@xmission.com> net: Push capable(CAP_NET_ADMIN) into the rtnl methods

- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user_ns, CAP_NET_ADMIN), allowing unprivileged
users to make netlink calls to modify their local network
namespace.

- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.

Later patches will remove the extra capable calls from methods
that are safe for unprivileged users.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
464dc801c76aa0db88e16e8f5f47c6879858b9b2 16-Nov-2012 Eric W. Biederman <ebiederm@xmission.com> net: Don't export sysctls to unprivileged users

In preparation for supporting the creation of network namespaces
by unprivileged users, modify all of the per net sysctl exports
and refuse to allow them to unprivileged users.

This makes it safe for unprivileged users in general to access
per net sysctls, and allows sysctls to be exported to unprivileged
users on an individual basis as they are deemed safe.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e1f165032c8bade3a6bdf546f8faf61fda4dd01c 05-Oct-2012 ramesh.nagappa@gmail.com <ramesh.nagappa@gmail.com> net: Fix skb_under_panic oops in neigh_resolve_output

The retry loops in neigh_resolve_output() and neigh_connected_output()
call dev_hard_header() without resetting the skb to network_header.
This causes the retry to fail with skb_under_panic. The fix is to
reset the network_header within the retry loop.

Signed-off-by: Ramesh Nagappa <ramesh.nagappa@ericsson.com>
Reviewed-by: Shawn Lu <shawn.lu@ericsson.com>
Reviewed-by: Robert Coulson <robert.coulson@ericsson.com>
Reviewed-by: Billie Alsup <billie.alsup@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
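
The failure mode can be modelled with a buffer that tracks its headroom. This
is a simplified stand-in for an skb, with hypothetical names throughout: each
header push consumes headroom, so a retry that does not first return to the
network header eventually runs out.

```c
#include <stddef.h>

struct pkt_sketch {
	unsigned char buf[64];
	size_t data;		/* offset of the current start of data */
	size_t network_header;	/* offset where the L3 header begins */
};

/* Prepend len bytes of link-layer header; fails when headroom runs out. */
static int push_header(struct pkt_sketch *p, size_t len)
{
	if (len > p->data)
		return -1;	/* the kernel would hit skb_under_panic here */
	p->data -= len;
	return 0;
}

/* Build the header up to `tries` times, as a retry loop would. */
static int build_with_retries(struct pkt_sketch *p, size_t hlen, int tries,
			      int reset_each_try)
{
	for (int i = 0; i < tries; i++) {
		if (reset_each_try)
			p->data = p->network_header;	/* undo the previous push */
		if (push_header(p, hlen) < 0)
			return -1;
	}
	return 0;
}
```

With 16 bytes of headroom and a 14-byte header, the loop survives any number
of retries when it resets each time, and fails on the second attempt when it
does not.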
15e473046cb6e5d18a4d0057e61d76315230382b 07-Sep-2012 Eric W. Biederman <ebiederm@xmission.com> netlink: Rename pid to portid to avoid confusion

It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.

I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.

I have successfully built an allyesconfig kernel with this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
203b42f7317494ae5e5efc7be6fb7f29c927f102 21-Aug-2012 Tejun Heo <tj@kernel.org> workqueue: make deferrable delayed_work initializer names consistent

Initializers for deferrable delayed_work are confused.

* __DEFERRED_WORK_INITIALIZER()
* DECLARE_DEFERRED_WORK()
* INIT_DELAYED_WORK_DEFERRABLE()

Rename them to

* __DEFERRABLE_WORK_INITIALIZER()
* DECLARE_DEFERRABLE_WORK()
* INIT_DEFERRABLE_WORK()

This patch doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
13a43d94ab026c423dc8902170ef27c2bd36aa87 03-Jul-2012 David S. Miller <davem@davemloft.net> neigh: Convert over to dst_neigh_lookup_skb().

Signed-off-by: David S. Miller <davem@davemloft.net>
a263b3093641fb1ec377582c90986a7fd0625184 02-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Make neigh lookups directly in output packet path.

Do not use the dst cached neigh, we'll be getting rid of that.

Signed-off-by: David S. Miller <davem@davemloft.net>
4bd6683bd400c8b1d2ad544bb155d86a5d10f91c 07-Jun-2012 Eric Dumazet <edumazet@google.com> net: neighbour: fix neigh_dump_info()

Denys found out "ip neigh" output was truncated to
about 54 neighbours.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Denys Fedoryshchenko <denys@visp.net.lb>
Signed-off-by: David S. Miller <davem@davemloft.net>
e005d193d55ee5f757b13306112d8c23aac27a88 16-May-2012 Joe Perches <joe@perches.com> net: core: Use pr_<level>

Use the current logging style.

This enables use of dynamic debugging as well.

Convert printk(KERN_<LEVEL> to pr_<level>.
Add pr_fmt. Remove embedded prefixes, use
%s, __func__ instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8f40a1f9821a4ccb2a237f14d4eb6d6e0e665c14 19-Apr-2012 Eric W. Biederman <ebiederm@xmission.com> net neighbour: Convert to use register_net_sysctl

Using an ascii path to register_net_sysctl as opposed to the slightly
awkward ctl_path allows for much simpler code.

We no longer need to malloc dev_name to keep it alive for the lifetime of our
sysctl registration; instead we can use a small temporary buffer on the
stack.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5dd3df105b9f6cb7dd2472b59e028d0d1c878ecb 19-Apr-2012 Eric W. Biederman <ebiederm@xmission.com> net: Move all of the network sysctls without a namespace into init_net.

This makes it clearer which sysctls are relative to your current network
namespace.

This makes it a little less error prone by not exposing sysctls for the
initial network namespace in other namespaces.

This is the same way we handle all of our other network interfaces to
userspace and I can't honestly remember why we didn't do this for
sysctls right from the start.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
95c961747284a6b83a5e2d81240e214b0fa3464d 15-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dcd2ba92e842eec0d0372415fa26f1c411f5530d 13-Apr-2012 Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> neighbour: Make neigh_table_init_no_netlink() static.

neigh_table_init_no_netlink() is only used in net/core/neighbour.c file.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9a6308d74edb791c05d0e292e6263efc69640942 02-Apr-2012 David S. Miller <davem@davemloft.net> neighbour: Stop using NLA_PUT*().

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>
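
To illustrate the concern, here is a self-contained comparison of the two
styles. PUT_U32, put_u32() and the message struct are hypothetical stand-ins
written for this sketch; they are not the netlink attribute API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msg_sketch {
	unsigned char buf[8];
	size_t used;
};

/* Append a 32-bit value; returns -1 when the buffer is full. */
static int put_u32(struct msg_sketch *m, uint32_t v)
{
	if (sizeof(m->buf) - m->used < sizeof(v))
		return -1;
	memcpy(m->buf + m->used, &v, sizeof(v));
	m->used += sizeof(v);
	return 0;
}

/* Old style: the macro jumps to a label the caller must define. */
#define PUT_U32(m, v) \
	do { if (put_u32((m), (v)) < 0) goto put_failure; } while (0)

static int fill_hidden_goto(struct msg_sketch *m, int n)
{
	for (int i = 0; i < n; i++)
		PUT_U32(m, (uint32_t)i);	/* control flow is invisible here */
	return 0;
put_failure:
	return -1;
}

/* New style: every failure path is spelled out at the call site. */
static int fill_explicit(struct msg_sketch *m, int n)
{
	for (int i = 0; i < n; i++)
		if (put_u32(m, (uint32_t)i) < 0)
			return -1;
	return 0;
}
```

Both functions behave the same, but only the second shows a reader where
execution can leave the loop.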
84338a6c9dbb6ff3de4749864020f8f25d86fc81 21-Feb-2012 Michel Machado <michel@digirati.com.br> neighbour: Fixed race condition at tbl->nht

The race condition being fixed happens as follows:

1. While function neigh_periodic_work scans the neighbor hash table
pointed by field tbl->nht, it unlocks and locks tbl->lock between
buckets in order to call cond_resched.

2. Assume that function neigh_periodic_work calls cond_resched, that is,
the lock tbl->lock is available, and function neigh_hash_grow runs.

3. Once function neigh_hash_grow finishes, and RCU calls
neigh_hash_free_rcu, the original struct neigh_hash_table that function
neigh_periodic_work was using doesn't exist anymore.

4. Once back at neigh_periodic_work, whenever the old struct
neigh_hash_table is accessed, things can go badly.

Signed-off-by: Michel Machado <michel@digirati.com.br>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
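
The steps above suggest the general shape of a fix: reload the table pointer
every time the lock has been dropped. The single-threaded sketch below
simulates this with a callback that runs where the lock would be released; all
names are hypothetical, and it illustrates the pattern rather than reproducing
the actual patch.

```c
#include <stdlib.h>

struct hash_sketch {
	unsigned int shift;	/* table holds 1 << shift buckets */
	int *buckets;
};

struct tbl_sketch {
	struct hash_sketch *nht;
};

static struct hash_sketch *hash_alloc(unsigned int shift)
{
	struct hash_sketch *h = malloc(sizeof(*h));

	if (!h)
		return NULL;
	h->shift = shift;
	h->buckets = calloc((size_t)1 << shift, sizeof(int));
	if (!h->buckets) {
		free(h);
		return NULL;
	}
	return h;
}

static void hash_free(struct hash_sketch *h)
{
	free(h->buckets);
	free(h);
}

/* Replace the table with one twice as large and free the old one. */
static int hash_grow(struct tbl_sketch *t)
{
	struct hash_sketch *bigger = hash_alloc(t->nht->shift + 1);

	if (!bigger)
		return -1;
	hash_free(t->nht);
	t->nht = bigger;
	return 0;
}

/* Demo callback: grow the table once, while the scan has "dropped the lock". */
static void grow_once(struct tbl_sketch *t)
{
	if (t->nht->shift < 2)
		(void)hash_grow(t);
}

/* Walk all buckets; `unlocked` runs where the lock would be released. */
static unsigned int scan(struct tbl_sketch *t,
			 void (*unlocked)(struct tbl_sketch *))
{
	struct hash_sketch *nht = t->nht;
	unsigned int visited = 0;

	for (unsigned int i = 0; i < (1u << nht->shift); i++) {
		nht->buckets[i] = 1;	/* stand-in for per-bucket work */
		visited++;
		if (unlocked)
			unlocked(t);	/* the table may be replaced here */
		nht = t->nht;		/* reload: the old pointer may be stale */
	}
	return visited;
}
```

Without the reload at the end of the loop body, the next iteration would
dereference the table that hash_grow() has already freed.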
84920c1420e2b4a4150e5bb45ee5a23ea4641523 26-Jan-2012 Tony Zelenoff <antonz@parallels.com> net: Allow ipv6 proxies and arp proxies be shown with iproute2

Add ability to return neighbour proxies list to caller if
it sent full ndmsg structure and has NTF_PROXY flag set.

Before this patch (and before iproute2 patches):
$ ip neigh add proxy 2001::1 dev eth0
$ ip -6 neigh show
$

After it and with applied iproute2 patches:
$ ip neigh add proxy 2001::1 dev eth0
$ ip -6 neigh show
2001::1 dev eth0 proxy
$

Compatibility with old versions of iproute2 is not broken:
the kernel checks the incoming structure size and works
properly if an old structure arrives.

[v2]
* changed comments style.
* removed useless line with continue and curly bracket.
* changed incoming message size check from equal to more or
equal.

CC: davem@davemloft.net
CC: kuznet@ms2.inr.ac.ru
CC: netdev@vger.kernel.org
CC: xemul@parallels.com
Signed-off-by: Tony Zelenoff <antonz@parallels.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2c2aba6c561ac425602f4a0be61422224cb87151 28-Dec-2011 David S. Miller <davem@davemloft.net> ipv6: Use universal hash for NDISC.

In order to perform a proper universal hash on a vector of integers,
we have to use different universal hashes on each vector element.

Which means we need 4 different hash randoms for ipv6.
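
A minimal sketch of the idea, assuming the key is treated as four 32-bit words and each word gets its own random multiplier. The function name vec_hash4 and the parameter layout are hypothetical and are not taken from the kernel source.

```c
#include <stdint.h>

/* Hypothetical sketch: hash a 4-word key with one random multiplier per word. */
static uint32_t vec_hash4(const uint32_t key[4], const uint32_t rnd[4])
{
	uint32_t h = 0;

	for (int i = 0; i < 4; i++)
		h += key[i] * rnd[i];	/* unsigned arithmetic wraps mod 2^32 */
	return h;
}
```

If all four multipliers were equal, any two keys whose words are permutations of each other would collide. Distinct multipliers make the hash depend on the position of each word.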

Signed-off-by: David S. Miller <davem@davemloft.net>
447f219190bf0368b8b36cf60155744cb43510df 19-Dec-2011 David S. Miller <davem@davemloft.net> Revert "net: Remove unused neighbour layer ops."

This reverts commit 5c3ddec73d01a1fae9409c197078cb02c42238c3.

S390 qeth driver actually still uses the setup ops.

Reported-by: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5c3ddec73d01a1fae9409c197078cb02c42238c3 13-Dec-2011 David S. Miller <davem@davemloft.net> net: Remove unused neighbour layer ops.

It's simpler to just keep these things out until there is a real user
of them, so we can see what the needs actually are, rather than keep
these things around as useless overhead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2721745501a26d0dc3b88c0d2f3aa11471891388 02-Dec-2011 David Miller <davem@davemloft.net> net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.

To reflect the fact that a reference is not obtained to the
resulting neighbour entry.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Roland Dreier <roland@purestorage.com>
da6a8fa0275e2178c44a875374cae80d057538d1 25-Jul-2011 David Miller <davem@davemloft.net> neigh: Add device constructor/destructor capability.

If the neigh entry has device private state, it will need
constructor/destructor ops.

Signed-off-by: David S. Miller <davem@davemloft.net>
596b9b68ef118f7409afbc78487263e08ef96261 25-Jul-2011 David Miller <davem@davemloft.net> neigh: Add infrastructure for allocating device neigh privates.

netdev->neigh_priv_len records the private area length.

This will trigger for neigh_table objects which set tbl->entry_size
to zero, and the first instances of this will be forthcoming.

Signed-off-by: David S. Miller <davem@davemloft.net>
5b8b0060cbd6332ae5d1fa0bec0e8e211248d0e7 25-Jul-2011 David Miller <davem@davemloft.net> neigh: Get rid of neigh_table->kmem_cachep

We are going to alloc for device specific private areas for
neighbour entries, and in order to do that we have to move
away from the fixed allocation size enforced by using
neigh_table->kmem_cachep

As a nice side effect we can now use kfree_rcu().

Signed-off-by: David S. Miller <davem@davemloft.net>
df07a94cf50eb73d09bf2350c3fe2598e4cbeee1 25-Nov-2011 Jorge Boncompte [DTI2] <jorge@dti2.net> netns: fix proxy ARP entries listing on a netns

Skip entries from foreign network namespaces.

Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
8b5c171bb3dc0686b2647a84e990199c5faa9ef8 09-Nov-2011 Eric Dumazet <eric.dumazet@gmail.com> neigh: new unresolved queue limits

On Wednesday, 09 November 2011 at 16:21 -0500, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST)
>
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 09 Nov 2011 12:14:09 +0100
> >
> >> unres_qlen is the number of frames we are able to queue per unresolved
> >> neighbour. Its default value (3) was never changed and is responsible
> >> for strange drops, especially if IP fragments are used, or multiple
> >> sessions start in parallel. Even a single tcp flow can hit this limit.
> > ...
> >
> > Ok, I've applied this, let's see what happens :-)
>
> Early answer, build fails.
>
> Please test build this patch with DECNET enabled and resubmit. The
> decnet neigh layer still refers to the removed ->queue_len member.
>
> Thanks.

Ouch, this was fixed on one machine yesterday, but not the other one I
used this morning, sorry.

[PATCH V5 net-next] neigh: new unresolved queue limits

unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. Even a single tcp flow can hit this limit.

$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms

Signed-off-by: David S. Miller <davem@davemloft.net>
045f7b3b0005bf30ad8d664c7651d816e2650cd2 01-Nov-2011 David S. Miller <davem@davemloft.net> neigh: Kill bogus SMP protected debugging message.

Whatever situations make this state legitimate on SMP
would also be legitimate on !SMP when, e.g., preemption is
enabled.

This is dubious enough that we should just delete it entirely. If we
want to add debugging for neigh timer races, better more thorough
mechanisms are needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
e049f28883126c689cf95859480d9ee4ab23b7fa 18-Oct-2011 roy.qing.li@gmail.com <roy.qing.li@gmail.com> neigh: fix rcu splat in neigh_update()

When using dst_get_neighbour() to get the neighbour, we need
rcu_read_lock() protection, since dst_get_neighbour() uses
rcu_dereference().

The bug was reported by Ari Savolainen <ari.m.savolainen@gmail.com>

[ 105.612095]
[ 105.612096] ===================================================
[ 105.612100] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 105.612101] ---------------------------------------------------
[ 105.612103] include/net/dst.h:91 invoked rcu_dereference_check() without protection!
[ 105.612105]
[ 105.612106] other info that might help us debug this:
[ 105.612106]
[ 105.612108]
[ 105.612108] rcu_scheduler_active = 1, debug_locks = 0
[ 105.612110] 1 lock held by dnsmasq/2618:
[ 105.612111] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815df8c7>] rtnl_lock+0x17/0x20
[ 105.612120]
[ 105.612121] stack backtrace:
[ 105.612123] Pid: 2618, comm: dnsmasq Not tainted 3.1.0-rc1 #41
[ 105.612125] Call Trace:
[ 105.612129] [<ffffffff810ccdcb>] lockdep_rcu_dereference+0xbb/0xc0
[ 105.612132] [<ffffffff815dc5a9>] neigh_update+0x4f9/0x5f0
[ 105.612135] [<ffffffff815da001>] ? neigh_lookup+0xe1/0x220
[ 105.612139] [<ffffffff81639298>] arp_req_set+0xb8/0x230
[ 105.612142] [<ffffffff8163a59f>] arp_ioctl+0x1bf/0x310
[ 105.612146] [<ffffffff810baa40>] ? lock_hrtimer_base.isra.26+0x30/0x60
[ 105.612150] [<ffffffff8163fb75>] inet_ioctl+0x85/0x90
[ 105.612154] [<ffffffff815b5520>] sock_do_ioctl+0x30/0x70
[ 105.612157] [<ffffffff815b55d3>] sock_ioctl+0x73/0x280
[ 105.612162] [<ffffffff811b7698>] do_vfs_ioctl+0x98/0x570
[ 105.612165] [<ffffffff811a5c40>] ? fget_light+0x340/0x3a0
[ 105.612168] [<ffffffff811b7bbf>] sys_ioctl+0x4f/0x80
[ 105.612172] [<ffffffff816fdcab>] system_call_fastpath+0x16/0x1b

Reported-by: Ari Savolainen <ari.m.savolainen@gmail.com>
Signed-off-by: RongQing <roy.qing.li@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
20e6074eb8e096b3a595c093d1cb222f378cd671 22-Aug-2011 Eric Dumazet <eric.dumazet@gmail.com> arp: fix rcu lockdep splat in arp_process()

Dave Jones reported a lockdep splat triggered by an arp_process() call
from parp_redo().

Commit faa9dcf793be (arp: RCU changes) is the origin of the bug, since
it assumed arp_process() was called under rcu_read_lock(), which is not
true in this particular path.

Instead of adding rcu_read_lock() in parp_redo(), I chose to add it in
neigh_proxy_process() to take care of IPv6 side too.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
4 locks held by setfiles/2123:
#0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>] walk_component+0x1ef/0x3e8
#1: (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>] inode_doinit_with_dentry+0x3f/0x41f
#2: (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>] run_timer_softirq+0x157/0x372
#3: (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process+0x36/0x103

stack backtrace:
Pid: 2123, comm: setfiles Tainted: G W 3.1.0-0.rc2.git7.2.fc16.x86_64 #1
Call Trace:
<IRQ> [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf
[<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d
[<ffffffff8146a751>] arp_process+0x25/0x4d7
[<ffffffff8146ac11>] parp_redo+0xe/0x10
[<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103
[<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372
[<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372
[<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41
[<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95
[<ffffffff81062bb6>] __do_softirq+0x112/0x25a
[<ffffffff8150d27c>] call_softirq+0x1c/0x30
[<ffffffff81010bf5>] do_softirq+0x4b/0xa2
[<ffffffff81062f65>] irq_exit+0x5d/0xcf
[<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a
[<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80
<EOI> [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
[<ffffffff814fc285>] ? __slab_free+0x30/0x24c
[<ffffffff814fc283>] ? __slab_free+0x2e/0x24c
[<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
[<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
[<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f
[<ffffffff81130cb0>] kfree+0x108/0x131
[<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f
[<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e
[<ffffffff81200f4f>] security_d_instantiate+0x21/0x23
[<ffffffff81154625>] d_instantiate+0x5c/0x61
[<ffffffff811563ca>] d_splice_alias+0xbc/0xd2
[<ffffffff811b17ff>] ext4_lookup+0xba/0xeb
[<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b
[<ffffffff8114cbea>] walk_component+0x215/0x3e8
[<ffffffff8114cdf8>] lookup_last+0x3b/0x3d
[<ffffffff8114daf3>] path_lookupat+0x82/0x2af
[<ffffffff8110fc53>] ? might_fault+0xa5/0xac
[<ffffffff8110fc0a>] ? might_fault+0x5c/0xac
[<ffffffff8114c564>] ? getname_flags+0x31/0x1ca
[<ffffffff8114dd48>] do_path_lookup+0x28/0x97
[<ffffffff8114df2c>] user_path_at+0x59/0x96
[<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d
[<ffffffff811469a6>] vfs_fstatat+0x44/0x6e
[<ffffffff811469ee>] vfs_lstat+0x1e/0x20
[<ffffffff81146b3d>] sys_newlstat+0x1a/0x33
[<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158
[<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8150af82>] system_call_fastpath+0x16/0x1b

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cd28ca0a3dd17c68d24b839602a0e6268ad28b5d 09-Aug-2011 Eric Dumazet <eric.dumazet@gmail.com> neigh: reduce arp latency

Remove the artificial HZ latency on arp resolution.

Instead of firing a timer in one jiffy (up to 10 ms if HZ=100), let's
send the ARP message immediately.

Before patch :

# arp -d 192.168.20.108 ; ping -c 3 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data.
64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=9.91 ms
64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.061 ms

After patch :

$ arp -d 192.168.20.108 ; ping -c 3 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data.
64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=0.152 ms
64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.064 ms
64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.074 ms

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
69cce1d1404968f78b177a0314f5822d5afdbbfb 18-Jul-2011 David S. Miller <davem@davemloft.net> net: Abstract dst->neighbour accesses behind helpers.

dst_{get,set}_neighbour()

Signed-off-by: David S. Miller <davem@davemloft.net>
8f40b161de4f27402b4c0659ad2ae83fad5a0cdd 17-Jul-2011 David S. Miller <davem@davemloft.net> neigh: Pass neighbour entry to output ops.

This will get us closer to being able to do "neigh stuff"
completely independent of the underlying dst_entry for
protocols (ipv4/ipv6) that wish to do so.

We will also be able to make dst entries neigh-less.

Signed-off-by: David S. Miller <davem@davemloft.net>
542d4d685febf3110d1a08d0bcb9f6ef060b76f7 17-Jul-2011 David S. Miller <davem@davemloft.net> neigh: Kill ndisc_ops->queue_xmit

It is always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
b23b5455b6458920179a1f27513ce42e70d11f37 17-Jul-2011 David S. Miller <davem@davemloft.net> neigh: Kill hh_cache->hh_output

It's just taking on one of two possible values, either
neigh_ops->output or dev_queue_xmit(). And this is purely depending
upon whether nud_state has NUD_CONNECTED set or not.

Signed-off-by: David S. Miller <davem@davemloft.net>
47ec132a40d788d45e2f088545dea68798034dab 17-Jul-2011 David S. Miller <davem@davemloft.net> neigh: Kill neigh_ops->hh_output

It's always dev_queue_xmit().

Signed-off-by: David S. Miller <davem@davemloft.net>
0895b08adeb3f660cdff21990d0a9c2b59a919e7 17-Jul-2011 David S. Miller <davem@davemloft.net> neigh: Simply destroy handling wrt. hh_cache.

Now that hh_cache entries are embedded inside of neighbour
entries, their lifetimes and accesses are now synchronous
to that of the encompassing neighbour object.

Therefore we don't need to hook up the blackhole op to
hh_output on destroy.

Signed-off-by: David S. Miller <davem@davemloft.net>
f6b72b6217f8c24f2a54988e58af858b4e66024d 14-Jul-2011 David S. Miller <davem@davemloft.net> net: Embed hh_cache inside of struct neighbour.

Now that there is a one-to-one correspondence between neighbour
and hh_cache entries, we no longer need:

1) dynamic allocation
2) attachment to dst->hh
3) refcounting

Initialization of the hh_cache entry is indicated by hh_len
being non-zero, and such initialization is always done with
the neighbour's lock held as a writer.

Signed-off-by: David S. Miller <davem@davemloft.net>
5c25f686db352082eef8daa21b760192351a023a 13-Jul-2011 David S. Miller <davem@davemloft.net> net: Kill support for multiple hh_cache entries per neighbour

This never, ever, happens.

Neighbour entries are always tied to one address family, and therefore
one set of dst_ops, and therefore one dst_ops->protocol "hh_type"
value.

This capability was blindly imported by Alexey Kuznetsov when he wrote
the neighbour layer.

Signed-off-by: David S. Miller <davem@davemloft.net>
e69dd336ee3a05a589629b505b18ba5e7a5b4c54 13-Jul-2011 David S. Miller <davem@davemloft.net> net: Push protocol type directly down to header_ops->cache()

Signed-off-by: David S. Miller <davem@davemloft.net>
f610b74b14d74a069f61583131e689550fd5bab3 11-Jul-2011 David S. Miller <davem@davemloft.net> ipv4: Use universal hash for ARP.

We need to make sure the multiplier is odd.
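
A short sketch of why the multiplier should be odd, assuming a 32-bit multiplicative hash. The helper names make_odd and mul_hash are hypothetical. An odd multiplier is invertible modulo 2^32, so multiplication maps distinct keys to distinct values. An even multiplier discards the top bit of the key.

```c
#include <stdint.h>

/* Hypothetical helper: force a random value to be odd by setting bit 0. */
static uint32_t make_odd(uint32_t rnd)
{
	return rnd | 1u;
}

/* Hypothetical helper: multiplicative hash, wrapping modulo 2^32. */
static uint32_t mul_hash(uint32_t key, uint32_t mult)
{
	return key * mult;
}
```

For instance, with the even multiplier 2, the keys 1 and 0x80000001 both hash to 2, while with the odd multiplier 3 they hash to different values.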

Signed-off-by: David S. Miller <davem@davemloft.net>
cd0893369ca85fd11bc517081b2d9079d2ef2f90 11-Jul-2011 David S. Miller <davem@davemloft.net> neigh: Store hash shift instead of mask.

And mask the hash function result by simply shifting
down the "->hash_shift" most significant bits.

Currently which bits we use is arbitrary since jhash
produces entropy evenly across the whole hash function
result.

But soon we'll be using universal hashing functions,
and in those cases more entropy exists in the higher
bits than the lower bits, because they use multiplies.
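
A minimal sketch of selecting a bucket from the most significant bits, assuming a 32-bit hash value and a table of 2^hash_shift buckets. The function name bucket_from_hash is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: keep the hash_shift most significant bits of a
 * 32-bit hash as the bucket index. Only valid for 1 <= hash_shift <= 32,
 * since shifting a 32-bit value by 32 is undefined in C.
 */
static uint32_t bucket_from_hash(uint32_t hash, unsigned int hash_shift)
{
	return hash >> (32 - hash_shift);
}
```

With hash_shift = 4 there are 16 buckets, and only the top four bits of the hash decide which one is used.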

Signed-off-by: David S. Miller <davem@davemloft.net>
c7ac8679bec9397afe8918f788cbcef88c38da54 10-Jun-2011 Greg Rose <gregory.v.rose@intel.com> rtnetlink: Compute and store minimum ifinfo dump size

The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.

Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
6193d2be290990b789021e06fa770ecb45319f2d 19-Jan-2011 Eric Dumazet <eric.dumazet@gmail.com> neigh: __rcu annotations

fix some minor issues and sparse (__rcu) warnings

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4c306a9291a077879fc3e933326caac3bc319caa 20-Dec-2010 Shan Wei <shanwei@cn.fujitsu.com> net: kill unused macros

These macros are never used, so remove them.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a5c30b349b872aa2ac13babbd5ceb26737f17e95 19-Oct-2010 Tejun Heo <tj@kernel.org> net/neighbour: cancel_delayed_work() + flush_scheduled_work() -> cancel_delayed_work_sync()

flush_scheduled_work() is going away. Prepare for it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0ed8ddf4045fcfcac36bad753dc4046118c603ec 07-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> neigh: Protect neigh->ha[] with a seqlock

Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying neighbour in stress situation (many different flows / dsts)

Dirtying takes place because of read_lock(&n->lock) and n->used writes.

Switching to a seqlock, and writing n->used only on jiffies changes
permits less dirtying.
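
The seqlock idea can be sketched in user space with C11 atomics. This is an illustration under stated assumptions, not the kernel's seqlock API: the names seq_addr, seq_addr_write and seq_addr_read are hypothetical, there is a single writer at a time (writers must be serialised by some other lock), and the address bytes are stored as relaxed atomics so that concurrent reads are well defined in C11.

```c
#include <stdatomic.h>
#include <stdint.h>

#define HA_LEN 6

/* Hypothetical stand-in for a hardware address guarded by a sequence counter. */
struct seq_addr {
	atomic_uint seq;		/* even: stable, odd: write in progress */
	_Atomic uint8_t ha[HA_LEN];
};

/* Writer side; callers must serialise writers themselves. */
static void seq_addr_write(struct seq_addr *a, const uint8_t src[HA_LEN])
{
	unsigned int s = atomic_load_explicit(&a->seq, memory_order_relaxed);

	atomic_store_explicit(&a->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	for (int i = 0; i < HA_LEN; i++)
		atomic_store_explicit(&a->ha[i], src[i], memory_order_relaxed);
	atomic_store_explicit(&a->seq, s + 2, memory_order_release);
}

/* Reader side: takes no lock and writes nothing shared; retries on overlap. */
static void seq_addr_read(struct seq_addr *a, uint8_t dst[HA_LEN])
{
	unsigned int s0, s1;

	do {
		s0 = atomic_load_explicit(&a->seq, memory_order_acquire);
		for (int i = 0; i < HA_LEN; i++)
			dst[i] = atomic_load_explicit(&a->ha[i],
						      memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&a->seq, memory_order_relaxed);
	} while ((s0 & 1u) || s0 != s1);
}
```

The reader never stores to the shared structure, so many CPUs can read the address without bouncing its cache line, which is the "less dirtying" effect the message refers to.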

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
34d101dd6204bd100fc2e6f7b5f9a10f959ce2c9 11-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> neigh: speedup neigh_hh_init()

When a new dst is used to send a frame, neigh_resolve_output() tries to
associate a struct hh_cache with this dst, calling neigh_hh_init() with
the neigh rwlock write locked.

Most of the time, hh_cache is already known and linked into neighbour,
so we find it and increment its refcount.

This patch changes the logic so that we call neigh_hh_init() with
neighbour lock read locked only, so that fast path can be run in
parallel by concurrent cpus.

This brings part of the speedup we got with commit c7d4426a98a5f
(introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
removing one of the contention points that routers hit on multiqueue
enabled machines.

Further improvements would need to use a seqlock instead of an rwlock to
protect neigh->ha[], to not dirty neigh too often and remove two atomic
ops.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
767e97e1e0db0d0f3152cd2f3bd3403596aedbad 07-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> neigh: RCU conversion of struct neighbour

This is the second step for neighbour RCU conversion.

(first was commit d6bf7817 : RCU conversion of neigh hash table)

neigh_lookup() becomes lockless, but still takes a reference on the found
neighbour. (no more read_lock()/read_unlock() on tbl->lock)

struct neighbour gets an additional rcu_head field and is freed after an
RCU grace period.

Future work would need to eventually not take a reference on neighbour
for temporary dst (DST_NOCACHE), but this would need dst->_neighbour to
use a noref bit like we did for skb->_dst.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d6bf781712a1d25cc8987036b3a48535b331eb91 04-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> net neigh: RCU conversion of neigh hash table

This is the first step for RCU conversion of neigh code.

Next patches will convert hash_buckets[] and "struct neighbour" to RCU
protected objects.

Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
neigh_table", a new structure is defined :

struct neigh_hash_table {
	struct neighbour **hash_buckets;
	unsigned int hash_mask;
	__u32 hash_rnd;
	struct rcu_head rcu;
};

And "struct neigh_table" has an RCU protected pointer to such a
neigh_hash_table.

This means the signature of (*hash)() function changed: We need to add a
third parameter with the actual hash_rnd value, since this is not
anymore a neigh_table field.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
110b2499370c401cdcc7c63e481084467291d556 04-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> net neigh: neigh_delete() and neigh_add() changes

neigh_delete() and neigh_add() don't need to touch device refcount,
we hold RTNL when calling them, so device cannot disappear under us.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7d4426a98a5f6654cd0b4b33d9dab2e77192c18 04-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> net: introduce DST_NOCACHE flag

While doing stress tests with IP route cache disabled, and multi queue
devices, I noticed a very high contention on one rwlock used in
neighbour code.

When many cpus are trying to send frames (possibly using a high
performance multiqueue device) to the same neighbour, they fight for the
neigh->lock rwlock in order to call neigh_hh_init(), and fight on
hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())

But we don't need to call neigh_hh_init() for dst that are used only
once. It costs four atomic operations at least, on two contended cache
lines, plus the high contention on neigh->lock rwlock.

Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
inserted in route cache.

With the stress test bench, sending 160000000 frames on one neighbour,
results are :

Before patch:

real 2m28.406s
user 0m11.781s
sys 36m17.964s


After patch:

real 1m26.532s
user 0m12.185s
sys 20m3.903s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a02cec2155fbea457eca8881870fd2de1a4c4c76 22-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> net: return operator cleanup

Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
91a72a70594e5212c97705ca6a694bd307f7a26b 15-Jul-2010 Doug Kehn <rdkehn@yahoo.com> net/core: neighbour update Oops

When configuring DMVPN (GRE + openNHRP) and a GRE remote
address is configured a kernel Oops is observed. The
observed Oops is caused by a NULL header_ops pointer
(neigh->dev->header_ops) in neigh_update_hhs() when

void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
= neigh->dev->header_ops->cache_update;

is executed. The dev associated with the NULL header_ops is
the GRE interface. This patch guards against the
possibility that header_ops is NULL.

This Oops was first observed in kernel version 2.6.26.8.

Signed-off-by: Doug Kehn <rdkehn@yahoo.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a47311380e094bb201be8a818370c73c3f52122c 28-May-2010 Eric Dumazet <eric.dumazet@gmail.com> net: fix __neigh_event_send()

commit 7fee226ad23 (net: add a noref bit on skb dst) missed one spot
where an skb is enqueued, with a possibly not refcounted dst entry.

__neigh_event_send() inserts skb into arp_queue, so we must make sure
dst entry is refcounted, or dst entry can be freed by garbage collector
after caller exits from rcu protected section.

Reported-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
0a141509ede48ac33ef756ac1640f4d3f46fa2db 09-Mar-2010 Eric Dumazet <eric.dumazet@gmail.com> net: Annotates neigh_invalidate()

Annotates neigh_invalidate() with __releases() and __acquires() for
sparse sake.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
54716e3beb0ab20c49471348dfe399a71bfc8fd3 14-Feb-2010 Eric W. Biederman <ebiederm@xmission.com> net neigh: Decouple per interface neighbour table controls from binary sysctls

Stop computing the number of neighbour table settings we have by
counting the number of binary sysctls. This behaviour was silly
and meant that we could not add another neighbour table setting
without also adding another binary sysctl.

Don't pass the binary sysctl path for neighbour table entries
into neigh_sysctl_register. These parameters are no longer
used and so are just dead code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
81c1ebfc4379f529b001e23164dd5c2282bdc0ec 22-Jan-2010 Alexey Dobriyan <adobriyan@gmail.com> neigh: simplify seq_file code

Simply pass 'struct neigh_table' with seq_file private pointer,
and save one dereference. Proc entry itself isn't interesting.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
09ad9bc752519cc167d0a573e1acf69b5c707c67 26-Nov-2009 Octavian Purdila <opurdila@ixiacom.com> net: use net_eq to compare nets

Generated with the following semantic patch

@@
struct net *n1;
struct net *n2;
@@
- n1 == n2
+ net_eq(n1, n2)

@@
struct net *n1;
struct net *n2;
@@
- n1 != n2
+ !net_eq(n1, n2)

applied over {include,net,drivers/net}.

Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f8572d8f2a2ba75408b97dc24ef47c83671795d7 05-Nov-2009 Eric W. Biederman <ebiederm@xmission.com> sysctl net: Remove unused binary sysctl code

Now that sys_sysctl is a compatibility wrapper around /proc/sys,
all sysctl strategy routines, and all ctl_name and strategy
entries in the sysctl tables, are unused and can be
removed.

In addition neigh_sysctl_register has been modified to no longer
take a strategy argument and its callers have been modified not
to pass one.

Cc: "David Miller" <davem@davemloft.net>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
e4c4e448cf557921ffbbbd6d6ddac81fdceacb4f 30-Jul-2009 Eric Dumazet <eric.dumazet@gmail.com> neigh: Convert garbage collection from softirq to workqueue

Current neigh_periodic_timer() function is fired by timer IRQ, and
scans one hash bucket each round (very little work in fact)

As we are supposed to scan whole hash table in 15 seconds, this means
neigh_periodic_timer() can be fired very often. (depending on the number
of concurrent hash entries we stored in this table)

Converting this to a workqueue permits scanning whole table, minimizing
icache pollution, and firing this work every 15 seconds, independently
of hash table size.

This 15 seconds delay is not a hard number, as work is a deferrable one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
97fd5bc7f2e442482a7a6cc4bc2a286cbb5f4754 13-Jul-2009 Tobias Klauser <klto@zhaw.ch> net: Rename lookup_neigh_params function

Rename lookup_neigh_params to lookup_neigh_parms as the struct is named
neigh_parms and all other functions dealing with the struct carry
neigh_parms in their names.

Signed-off-by: Tobias Klauser <klto@zhaw.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5ef12d98a19254ee5dc851bd83e214b43ec1f725 11-Jun-2009 Timo Teras <timo.teras@iki.fi> neigh: fix state transition INCOMPLETE->FAILED via Netlink request

The current code errors out the INCOMPLETE neigh entry skb queue only from
the timer if maximum probes have been attempted and there has been no reply.
This also causes the transition to FAILED state.

However, the neigh entry can be also updated via Netlink to inform that the
address is unavailable. Currently, neigh_update() just stops the timers and
leaves the pending skb's unreleased. As a result, the clean up code in
the timer callback is never called, which also prevents proper garbage collection.

This fixes neigh_update() to process the pending skb queue immediately if
INCOMPLETE -> FAILED state transition occurs due to a Netlink request.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
adf30907d63893e4208dfe3f5c88ae12bc2f25d5 02-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: skb->dst accessors

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field
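
A user-space toy model of the three accessors may help picture the change. The structures below are simplified stand-ins, the `_dst` field name and the reference counter are assumptions made for illustration, and this is not the kernel implementation:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (illustration only). */
struct dst_entry {
	int refcnt;
};

struct sk_buff {
	struct dst_entry *_dst;   /* hypothetical private field name */
};

/* Toy release: drop one reference if a dst is attached. */
static void dst_release(struct dst_entry *dst)
{
	if (dst)
		dst->refcnt--;
}

/* Read the dst attached to an skb. */
static struct dst_entry *skb_dst(const struct sk_buff *skb)
{
	return skb->_dst;
}

/* Attach a dst to an skb. */
static void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
	skb->_dst = dst;
}

/* Replaces the open-coded "dst_release(skb->dst); skb->dst = NULL;" pair. */
static void skb_dst_drop(struct sk_buff *skb)
{
	dst_release(skb->_dst);
	skb->_dst = NULL;
}
```

With accessors like these, callers no longer touch the field directly, so its representation can change later without editing every user.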

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c5c2d3089068d4aa378f7a40d2b5ad9d4f52ce8 04-Mar-2009 Eric Biederman <ebiederm@aristanetworks.com> neigh: Allow for user space users of the neighbour table

Currently it is possible to do just about everything with the arp table
from user space except treat an entry like you are using it. To that end
implement a flag NTF_USE that, when set in a netlink update request,
treats the neighbour table entry like the kernel does on the output path.

This allows user space applications to share the kernel's arp cache.

Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f3fbbe0f6f6cbac4c2aa3d71d95e49cf148286d6 25-Feb-2009 Wei Yongjun <yjwei@cn.fujitsu.com> core: remove some pointless conditionals before kfree_skb()

Remove some pointless conditionals before kfree_skb().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1ce85fe402137824246bad03ff85f3913d565c17 25-Feb-2009 Pablo Neira Ayuso <pablo@netfilter.org> netlink: change nlmsg_notify() return value logic

This patch changes the return value of nlmsg_notify() as follows:

If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way of a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.

This patch is useful in case that the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take any action in case
that the delivery failed. For example, ctnetlink can drop packets
if the event delivery failed to provide reliable logging and
state-synchronization at the cost of dropping packets.

This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
efc683fc2a692735029067b4f939af2a3625e31d 06-Feb-2009 Gautam Kachroo <gk@aristanetworks.com> neigh: some entries can be skipped during dumping

neightbl_dump_info and neigh_dump_table can skip entries if the
*fill*info functions return an error. This results in an incomplete
dump (invoked by netlink requests for RTM_GETNEIGHTBL or
RTM_GETNEIGH).

nidx and idx should not be incremented if the current entry was not
placed in the output buffer
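
The rule can be sketched in user-space C. The names `fill_entry` and `dump_entries` are hypothetical, and the integer table stands in for the neighbour entries; this is only a model of the resume-index logic, not the kernel code:

```c
#include <stddef.h>

/* Hypothetical stand-in for a *fill*info helper: copies one entry into the
 * output buffer, or returns -1 when the buffer has no room left. */
static int fill_entry(int *buf, size_t cap, size_t *used, int entry)
{
	if (*used >= cap)
		return -1;
	buf[(*used)++] = entry;
	return 0;
}

/* Dump entries starting at *resume_idx. The index advances only after an
 * entry has really been placed in the buffer, so the next call resumes with
 * the entry that did not fit instead of skipping it. */
static size_t dump_entries(const int *table, size_t n, size_t *resume_idx,
			   int *buf, size_t cap)
{
	size_t used = 0;
	size_t idx = *resume_idx;

	while (idx < n) {
		if (fill_entry(buf, cap, &used, table[idx]) < 0)
			break;   /* do not count the entry that failed */
		idx++;
	}
	*resume_idx = idx;
	return used;
}
```

Calling `dump_entries` repeatedly with a small buffer walks the whole table without losing an entry at each buffer boundary.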

Signed-off-by: Gautam Kachroo <gk@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0f23174aa8c1aa7a2a6050a72a60d290ef9ee578 29-Dec-2008 Rusty Russell <rusty@rustcorp.com.au> cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits: net

In future all cpumask ops will only be valid (in general) for bit
numbers < nr_cpu_ids. So use that instead of NR_CPUS in iterators
and other comparisons.

This is always safe: no cpu number can be >= nr_cpu_ids, and
nr_cpu_ids is initialized to NR_CPUS at boot.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
008298231abbeb91bc7be9e8b078607b816d1a4a 21-Nov-2008 Stephen Hemminger <shemminger@vyatta.com> netdev: add more functions to netdevice ops

This patch moves neigh_setup and hard_start_xmit into the network device ops
structure. For bisection, fix all the previously converted drivers as well.
Bonding driver took the biggest hit on this.

Added a prefetch of the hard_start_xmit in the fast path to try and reduce
any impact this would have.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e42ea986e4a4cab4209d982feffcaf50f21e80e3 12-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: Cleanup of neighbour code

Using read_pnet() and write_pnet() in neighbour code eases the reading
of code.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9b739ba5e66c96938fbc07a4dbd9da5b81eac56f 12-Nov-2008 Alexey Dobriyan <adobriyan@gmail.com> net: remove struct neigh_table::pde

->pde isn't actually needed, since name is stashed in ->id.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6d9f239a1edb31d6133230f478fd1dc2da338ec5 04-Nov-2008 Alexey Dobriyan <adobriyan@gmail.com> net: '&' redux

I want to compile out proc_* and sysctl_* handlers totally and
stub them to NULL depending on config options, however usage of &
will prevent this, since taking the address of a NULL pointer will break
compilation.

So, drop & in front of every ->proc_handler and every ->strategy
handler, it was never needed in fact.
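
A small stand-alone illustration of the problem, with made-up names (`WITH_HANDLERS`, `demo_handler`, `struct ctl_entry`) that are assumptions for this sketch rather than kernel identifiers:

```c
#include <stddef.h>

typedef int (*handler_fn)(int);

struct ctl_entry {
	const char *name;
	handler_fn proc_handler;
};

#ifdef WITH_HANDLERS            /* hypothetical config switch */
static int demo_handler(int v)
{
	return v + 1;
}
#else
#define demo_handler NULL       /* handler compiled out and stubbed to NULL */
#endif

/* No '&' here: a function name already converts to a pointer, and with the
 * stub this expands to plain NULL. Writing &demo_handler would expand to
 * &NULL in the stubbed build and fail to compile. */
static struct ctl_entry demo_entry = {
	.name = "demo",
	.proc_handler = demo_handler,
};

/* Call the handler if one is configured, otherwise report -1. */
static int call_handler(const struct ctl_entry *e, int v)
{
	return e->proc_handler ? e->proc_handler(v) : -1;
}
```

The same initializer compiles in both configurations, which is the property the change is after.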

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
93adcc80f3288f1827baf6f821af818f6eeef7f9 28-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> net: don't use INIT_RCU_HEAD

call_rcu() will unconditionally rewrite RCU head anyway.
Applies to
struct neigh_parms
struct neigh_table
struct net
struct cipso_v4_doi
struct in_ifaddr
struct in_device
rt->u.dst

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f72051b0674f36c960698653a0583edaec1e495e 23-Sep-2008 David S. Miller <davem@davemloft.net> neigh: Remove by-hand SKB queue handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
745e203164a9057e0de769ff4649e6e455daf753 03-Aug-2008 Chris Larson <clarson@mvista.com> net: fix missing pneigh entries in the neighbor seq_file code

When pneigh entries exist, but the user's read buffer isn't sufficient to
hold them all, one of the pneigh entries will be missing from the results.

In neigh_get_idx_any, the number of elements which neigh_get_idx
encountered is not correctly subtracted from the position number before
the call to pneigh_get_idx. neigh_get_idx reduces the position by 1 for
each call to neigh_get_next, but it does not reduce it by one for the
first element (neigh_get_first). The patch alters the neigh_get_idx and
pneigh_get_idx functions to subtract one from pos, for the first element,
when pos is non-zero.

Signed-off-by: Chris Larson <clarson@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bff69732c9947f821a64a8477f7dcaa9c30e6a69 03-Aug-2008 Chris Larson <clarson@mvista.com> net: in the first call to neigh_seq_next, call neigh_get_first, not neigh_get_idx.

neigh_seq_next won't be called both with *pos > 0 && v ==
SEQ_START_TOKEN, so there's no point calling neigh_get_idx when we're
on the start token, just call neigh_get_first directly.

Signed-off-by: Chris Larson <clarson@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9a6d276e85aa3d8f308fc5e8de6892daeb60ae5f 17-Jul-2008 Neil Horman <nhorman@tuxdriver.com> core: add stat to track unresolved discards in neighbor cache

in __neigh_event_send, if we have a neighbour entry which is in
NUD_INCOMPLETE state, we enqueue any outbound frames to that neighbour
to the neighbour's arp_queue, which is capped by default to a length of 3
skbs. If that queue exceeds its set length, it will drop an skb on
the queue to enqueue the newly arrived skb. This results in a drop
for which we have no statistics incremented. This patch adds an
unresolved_discards stat to /proc/net/stat/ndisc_cache to track these
lost frames.
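
A rough user-space sketch of a bounded pending queue with such a counter follows. It assumes the oldest queued packet is the one dropped, uses integers in place of skbs, and all names except `unresolved_discards` are hypothetical:

```c
#include <stddef.h>

#define QUEUE_CAP 3   /* mirrors the default queue length quoted above */

struct pending_queue {
	int pkts[QUEUE_CAP];                 /* packet ids stand in for skbs */
	size_t len;
	unsigned long unresolved_discards;   /* drops while unresolved */
};

/* Enqueue a packet. When the queue is full, drop the oldest packet first
 * (an assumption of this sketch) and count the drop. */
static void enqueue_pending(struct pending_queue *q, int pkt)
{
	if (q->len == QUEUE_CAP) {
		for (size_t i = 1; i < QUEUE_CAP; i++)
			q->pkts[i - 1] = q->pkts[i];
		q->len--;
		q->unresolved_discards++;
	}
	q->pkts[q->len++] = pkt;
}
```

Without the counter, the overflow branch would discard packets silently, which is the gap the new statistic closes.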

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bc3ed28caaef55e7e3a9316464256353c5f9b1df 04-Jun-2008 Thomas Graf <tgraf@suug.ch> netlink: Improve returned error codes

Make nlmsg_trim(), nlmsg_cancel(), genlmsg_cancel(), and
nla_nest_cancel() void functions.

Return -EMSGSIZE instead of -1 if the provided message buffer is not
big enough.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
b9f5f52cca3e94f1e7509f366aa250ebbe1ed0b5 04-Jun-2008 Stephen Hemminger <shemminger@vyatta.com> net: neighbour table ABI problem

The neighbor table time of last use information is returned in the
incorrect unit. Kernel to user space ABI's need to use USER_HZ (or
milliseconds), otherwise the application has to try and discover the
real system HZ value which is problematic. Linux has standardized on
keeping USER_HZ consistent (100hz) even when kernel is running
internally at some other value.
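
The unit conversion itself is simple arithmetic. The helper below is a hypothetical user-space model (the kernel has its own helper, jiffies_to_clock_t(), for this kind of conversion; whether this change uses it is not stated here):

```c
#include <stdint.h>

#define USER_HZ 100u   /* tick rate exposed to user space, per the text above */

/* Convert a tick count measured at the kernel's internal rate `hz` into
 * USER_HZ ticks. Simplified model: plain integer arithmetic, no rounding
 * or overflow handling. */
static uint64_t ticks_to_user_hz(uint64_t ticks, unsigned int hz)
{
	return ticks * USER_HZ / hz;
}
```

For instance, one second is 1000 ticks on a kernel running at HZ=1000 and 250 ticks at HZ=250, yet both convert to 100 USER_HZ ticks, so user space never needs to know the internal rate.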

This change is small, but it breaks the ABI for older version of
iproute2 utilities. But these utilities are already broken since they
are looking at the psched_hz values which are completely different. So
let's just go ahead and fix both kernel and user space. Older
utilities will just print wrong values.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5efdccbcda20d3e5fbaa85f726dcc9cfeb005577 02-May-2008 Denis V. Lunev <den@openvz.org> net: assign PDE->data before gluing PDE into /proc tree

Simply replace proc_create and the subsequent data assignment with proc_create_data.
Additionally, there is no need to assign NULL to PDE->data after creation,
/proc generic has already done this for us.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
be01d655d9b07c1350b19bf3d80eae0059254b4b 27-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NEIGHBOUR: Extract hash/lookup functions for pneigh entries.

Extract hash function for pneigh entries from pneigh_lookup(),
__pneigh_lookup() and pneigh_delete() as pneigh_hash().
Extract core of pneigh_lookup() and __pneigh_lookup() as
__pneigh_lookup_1().

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
0a204500f913974b4ca9b6f509a43e1544239c6d 24-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NEIGHBOUR: Make each EXPORT_SYMBOL{,_GPL}() immediately follow its function/variable.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
878628fbf2589eb24357e42027d5f54b1dafd3c8 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit namespace comparision without CONFIG_NET_NS.

Introduce an inline net_eq() to compare two namespaces.
Without CONFIG_NET_NS, since no namespace other than &init_net
exists, it is always 1.

We do not need to convert 1) inline vs inline and
2) inline vs &init_net comparisons.
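
A minimal sketch of what such an inline could look like, assuming a placeholder `struct net`; the bodies are inferred from the description above and are not copied from the kernel headers:

```c
struct net {
	int id;   /* placeholder member; the real struct net is far larger */
};

#ifdef CONFIG_NET_NS
/* With namespaces enabled, two namespaces are equal iff the pointers match. */
static inline int net_eq(const struct net *a, const struct net *b)
{
	return a == b;
}
#else
/* Without namespaces only &init_net exists, so the answer is always 1 and
 * the compiler can drop the comparison and any branch depending on it. */
static inline int net_eq(const struct net *a, const struct net *b)
{
	(void)a;
	(void)b;
	return 1;
}
#endif
```

Callers write `if (net_eq(x, y))` in both configurations, and the constant result lets the non-namespace build optimize the check away.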

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
57da52c1e62c6c13875e97de6c69d3156f8416da 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit neigh_parms->net and pneigh_entry->net without CONFIG_NET_NS.

Introduce neigh_parms/pneigh_entry inlines: neigh_parms_net(), pneigh_net().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
1218854afa6f659be90b748cf1bc7badee954a35 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS.

Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
3b1e0a655f8eba44ab1ee2a1068d169ccfb853b9 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
c346dca10840a874240c78efe3f39acf4312a1f2 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit net_device->nd_net without CONFIG_NET_NS.

Introduce per-net_device inlines: dev_net(), dev_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
fa86d322d89995fef1bfb5cc768b89d8c22ea0d9 24-Mar-2008 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Fix race between pneigh deletion and ipv6's ndisc_recv_ns (v3).

Proxy neighbors do not have any reference counting, so any caller
of pneigh_lookup (unless it's a netlink triggered add/del routine)
should _not_ perform any actions on the found proxy entry.

There's one exception from this rule - the ipv6's ndisc_recv_ns()
uses found entry to check the flags for NTF_ROUTER.

This creates a race between the ndisc and pneigh_delete - after
the pneigh is returned to the caller, the nd_tbl.lock is dropped
and the deleting procedure may proceed.

One of the fixes would be to add a reference counting, but this
problem exists for ndisc only. Besides such a patch would be too
big for -rc4.

So I propose to introduce a __pneigh_lookup() which is supposed
to be called with the lock held and use it in ndisc code to check
the flags on alive pneigh entry.


Changes from v2:
As David noticed, exported the __pneigh_lookup() to ipv6 module.
The checkpatch generates a warning on it, since the EXPORT_SYMBOL
does not follow the symbol itself, but in this file all the
exports come at the end, so I decided not to break this harmony.

Changes from v1:
Fixed comments from YOSHIFUJI - indentation of prototype in header
and the pndisc_check_router() name - and a compilation fix, pointed
by Daniel - the is_routed was (falsely) considered as uninitialized
by gcc.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7e36763b2c204d59de4e88087f84a2c0c8421f25 03-Mar-2008 Frank Blaschka <frank.blaschka@de.ibm.com> [NET]: Fix race in generic address resolution.

neigh_update sends skb from neigh->arp_queue while neigh_timer_handler
has increased the skb's refcount and calls solicit with the
skb. neigh_timer_handler should not increase the skb's refcount but make a
copy of the skb and do solicit with the copy.

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c65babd6ce758dd06330b3d9d677b7624f9e3fa 29-Feb-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Default arp parameters lookup.

Default ARP parameters should be findable regardless of the context.
Required to make inetdev_event work.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4ab438fcd7373da9e559576e418e890b7cfd94f4 29-Feb-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Register neighbour table parameters in the correct namespace.

neigh_sysctl_register should register sysctl entries inside correct namespace
to avoid naming conflicts. A typical example is loopback: entries for it
are present in all namespaces.

Required to make inetdev_event work.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
46ecf0b994715589b9f5f620beca4d6aaaa02028 28-Feb-2008 Wang Chen <wangchen@cn.fujitsu.com> [NEIGHBOUR]: Use proc_create() to setup ->proc_fops first

Use proc_create() to make sure that ->proc_fops is set up before gluing
PDE to main tree.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bc4bf5f38cf0a623e6a29f52ec80bfcc56a373c6 24-Feb-2008 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Fix race between neighbor lookup and table's hash_rnd update.

The neigh_hash_grow() may update the tbl->hash_rnd value, which
is used in all tbl->hash callbacks to calculate the hashval.

Two lookup routines may race with this, since they call the
->hash callback without the tbl->lock held. Since the hash_rnd
is changed with this lock write-locked moving the calls to ->hash
under this lock read-locked closes this gap.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
da12f7356da1dfb97f1c6c418f828b7ce442fef9 20-Feb-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Namespace leak in pneigh_lookup.

release_net is missing on the error path in pneigh_lookup.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ff566074689e3aed1488780b97714ec43ba361d 18-Feb-2008 David S. Miller <davem@davemloft.net> Revert "[NDISC]: Fix race in generic address resolution"

This reverts commit 69cc64d8d92bf852f933e90c888dfff083bd4fc9.

It causes recursive locking in IPV6 because unlike other
neighbour layer clients, it even needs neighbour cache
entries to send neighbour solicitation messages :-(

We'll have to find another way to fix this race.

Signed-off-by: David S. Miller <davem@davemloft.net>
69cc64d8d92bf852f933e90c888dfff083bd4fc9 12-Feb-2008 David S. Miller <davem@davemloft.net> [NDISC]: Fix race in generic address resolution

Frank Blaschka provided the bug report and the initial suggested fix
for this bug. He also validated this version of this fix.

The problem is that the access to neigh->arp_queue is inconsistent, we
grab references when dropping the lock to call
neigh->ops->solicit() but this does not prevent other threads of
control from trying to send out that packet at the same time causing
corruptions because both code paths believe they have exclusive access
to the skb.

The best option seems to be to hold the write lock on neigh->lock
during the ->solicit() call. I looked at all of the ndisc_ops
implementations and this seems workable. The only case that needs
special care is the IPV4 ARP implementation of arp_solicit(). It
wants to take neigh->lock as a reader to protect the header entry in
neigh->ha during the emission of the solicitation. We can simply
remove the read lock calls to take care of that since holding the lock
as a writer at the caller provides a superset of the protection
afforded by the existing read locking.

The rest of the ->solicit() implementations don't care whether the
neigh is locked or not.

Signed-off-by: David S. Miller <davem@davemloft.net>
06f0511df1b3b32fc8e0840514d4b207150f1fa7 24-Jan-2008 Denis V. Lunev <den@openvz.org> [ARP]: neigh_parms_put(destroy) are essentially local to core/neighbour.c.

Make them static.

[ Moved the inline before, instead of after, call sites. -DaveM ]

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
14db4133d59e2c1bed122bf87393e2ded05e42dc 15-Jan-2008 Denis V. Lunev <den@openvz.org> [ARP]: Remove forward declaration of neigh_changeaddr.

No need for this. It is declared in the neighbour.h

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
486b51d3706c5493b6c50992eaaafc44e628a7ed 15-Jan-2008 Denis V. Lunev <den@openvz.org> [ARP]: Remove overkill checks from neigh_param_alloc.

Valid network device is always passed into neigh_param_alloc, so
remove extra checking for dev == NULL. Additionally, cleanup bogus
netns assignment.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4250846146c04ac6f17bf92619ddfef6db2cf34f 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NEIGH]: Make /proc/net/arp opening consistent with seq_net_open semantics

seq_open_net requires the first field of the seq->private data to be
struct seq_net_private. In reality this is a single pointer to a
struct net for now. The patch makes code consistent.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9a429c4983deae020f1e757ecc8f547b6d4e2f2b 02-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [NET]: Add some acquires/releases sparse annotations.

Add __acquires() and __releases() annotations to suppress some sparse
warnings.

example of warnings :

net/ipv4/udp.c:1555:14: warning: context imbalance in 'udp_seq_start' - wrong
count at exit
net/ipv4/udp.c:1571:13: warning: context imbalance in 'udp_seq_stop' -
unexpected unlock

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
426b5303eb435d98b9bee37a807be386bc2b3320 24-Jan-2008 Eric W. Biederman <ebiederm@xmission.com> [NETNS]: Modify the neighbour table code so it handles multiple network namespaces

I'm actually surprised at how much was involved. At first glance it
appears that the neighbour table data structures are already split by
network device so all that should be needed is to modify the user
interface commands to filter the set of neighbours by the network
namespace of their devices.

However a couple things turned up while I was reading through the
code. The proxy neighbour table allows entries with no network
device, and the neighbour parms are per network device (except for the
defaults) so they now need a per network namespace default.

So I updated the two structures (which surprised me) with their very
own network namespace parameter. Updated the relevant lookup and
destroy routines with a network namespace parameter and modified the
code that interacts with users to filter out neighbour table entries
for devices of other namespaces.

I'm a little concerned that we can modify and display the global table
configuration from all network namespaces. But this appears good
enough for now.

I keep thinking modifying the neighbour table to have per network
namespace instances of each table type would be cleaner. The
hash table is already dynamically sized so it is not a
limiter. The default parameter would be straight forward to take care
of. However when I look at how the network table is built and
used I still find some assumptions that there is only a single
neighbour table for each type of table in the kernel. The netlink
operations, neigh_seq_start, the non-core network users that call
neigh_lookup. So while it might be doable it would require more
refactoring than my current approach of just doing a little extra
filtering in the code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a43d8994b959a6daeeadcd1be6d4a9045b7029ac 21-Dec-2007 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Make neigh_add_timer symmetrical to neigh_del_timer.

The neigh_del_timer() looks sane - it removes the timer and
(conditionally) puts the neighbor. I expected, that the
neigh_add_timer() is symmetrical to the del one - i.e. it
holds the neighbor and arms the timer - but it turned out
that it was not so.

I think, that making them look symmetrical makes the code
more readable.
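
The intended symmetry can be modelled in a few lines of user-space C. The structure and function names below are hypothetical, and a plain integer replaces the real reference counter and timer:

```c
struct entry {
	int refcnt;        /* references held on the entry */
	int timer_armed;   /* 1 while a timer is pending */
};

static void entry_hold(struct entry *e)
{
	e->refcnt++;
}

static void entry_put(struct entry *e)
{
	e->refcnt--;
}

/* Arm the timer and take the reference that the pending timer owns. */
static void entry_add_timer(struct entry *e)
{
	entry_hold(e);
	e->timer_armed = 1;
}

/* Disarm the timer; if it was pending, drop the reference it owned.
 * Returns 1 if a timer was removed, 0 otherwise. */
static int entry_del_timer(struct entry *e)
{
	if (e->timer_armed) {
		e->timer_armed = 0;
		entry_put(e);
		return 1;
	}
	return 0;
}
```

Because the add side takes exactly what the del side releases, callers do not have to pair a separate hold with each timer they arm.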

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
c3bac5a71b24f6ed892b250d4f7511cedc33d34c 01-Dec-2007 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Use the ctl paths to create neighbours sysctls

The appropriate path is prepared right inside this function. It
is prepared similar to how the ctl tables were.

Since the path is modified, it is put on the stack, to avoid
possible races with multiple calls to neigh_sysctl_register() : it
is called by protocols and I didn't find any protection in this
case. Did I overlook the rtnl lock?

The stack growth of the neigh_sysctl_register() is 40 bytes. I
believe this is OK, since this is not that much and this function
is not called with the deep stack (device/protocols register).

The device's name is stored on the template to free it later.

This will help with the net namespaces, as each namespace should
have its own set of these ctls.

Besides, this saves ~350 bytes from the neigh template :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
3c607bbb472814f01b077af01ae326944ff6b8b3 01-Dec-2007 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Cleanup the neigh_sysctl_register

This mainly removes the err variable, as this call always
returns the same error code (-ENOBUFS).

Besides, I moved the call to kmalloc() from the *t declaration
into the code (this is confusing when a variable is initialized
with the result of some call) and removed unneeded comment near
the error path.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
97c53cacf00d1f5aa04adabfebcc806ca8b22b10 20-Nov-2007 Denis V. Lunev <den@openvz.org> [NET]: Make rtnetlink infrastructure network namespace aware (v3)

After this patch none of the netlink callbacks support anything
except the initial network namespace but the rtnetlink infrastructure
now handles multiple network namespaces.

Changes from v2:
- IPv6 addrlabel processing

Changes from v1:
- no need for special rtnl_unlock handling
- fixed IPv6 ndisc

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b854272b3c732316676e9128f7b9e6f1e1ff88b0 30-Nov-2007 Denis V. Lunev <den@openvz.org> [NET]: Modify all rtnetlink methods to only work in the initial namespace (v2)

Before I can enable rtnetlink to work in all network namespaces I need
to be certain that something won't break. So this patch deliberately
disables all of the rtnetlink methods in everything except the
initial network namespace. After the methods have been audited this
extra check can be disabled.

Changes from v1:
- added IPv6 addrlabel protection

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
b24b8a247ff65c01b252025926fe564209fae4fc 24-Jan-2008 Pavel Emelyanov <xemul@openvz.org> [NET]: Convert init_timer into setup_timer

A lot of code in the kernel initializes the timer->function
and timer->data together with calling init_timer(timer). There
is already a helper for this. Use it for networking code.

The patch is HUGE, but makes the code 130 lines shorter
(98 insertions(+), 228 deletions(-)).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cecbb63967b4f36701b9412a12377e8fe006a93b 21-Jan-2008 David S. Miller <davem@davemloft.net> [NEIGH]: Revert 'Fix race between neigh_parms_release and neightbl_fill_parms'

Commit 9cd40029423701c376391da59d2c6469672b4bed (Fix race between
neigh_parms_release and neightbl_fill_parms) introduced device
reference counting regressions for several people, see:

http://bugzilla.kernel.org/show_bug.cgi?id=9778

for example.

Signed-off-by: David S. Miller <davem@davemloft.net>
9cd40029423701c376391da59d2c6469672b4bed 10-Jan-2008 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Fix race between neigh_parms_release and neightbl_fill_parms

The neightbl_fill_parms() is called under the write-locked tbl->lock
and accesses the parms->dev. The neigh_parms_release() calls the
dev_put(parms->dev) without this lock. This creates a tiny race window
on which the parms contains potentially stale dev pointer.

To fix this race it's enough to move the dev_put() up under the
tbl->lock, but note, that the parms are held by neighbors and thus can
live after the neigh_parms_release() is called, so we still can have a
parm with bad dev pointer.

I didn't find where the neigh->parms->dev is accessed, but still think
that putting the dev is to be done in a place, where the parms are
really freed. Am I right with that?

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3f192b5c584b8ecddc6069717aaf36d8fa244713 06-Nov-2007 Alexey Dobriyan <adobriyan@sw.ru> [NET]: Remove /proc/net/stat/*_arp_cache upon module removal

neigh_table_init_no_netlink() creates them, but they aren't removed anywhere.

Steps to reproduce:

modprobe clip
rmmod clip
cat /proc/net/stat/clip_arp_cache

BUG: unable to handle kernel paging request at virtual address f89d7758
printing eip: c05a99da *pdpt = 0000000000004001 *pde = 0000000004408067 *pte = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: atm af_packet ipv6 binfmt_misc sbs sbshc fan dock battery backlight ac power_supply parport loop rtc_cmos rtc_core rtc_lib serio_raw button k8temp hwmon amd_rng sr_mod cdrom shpchp pci_hotplug ehci_hcd ohci_hcd uhci_hcd usbcore
Pid: 2082, comm: cat Not tainted (2.6.24-rc1-b1d08ac064268d0ae2281e98bf5e82627e0f0c56-bloat #4)
EIP: 0060:[<c05a99da>] EFLAGS: 00210256 CPU: 0
EIP is at neigh_stat_seq_next+0x26/0x3f
EAX: 00000001 EBX: f89d7600 ECX: c587bf40 EDX: 00000000
ESI: 00000000 EDI: 00000001 EBP: 00000400 ESP: c587bf1c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process cat (pid: 2082, ti=c587b000 task=c5984e10 task.ti=c587b000)
Stack: c06228cc c5313790 c049e5c0 0804f000 c45a7b00 c53137b0 00000000 00000000
00000082 00000001 00000000 00000000 00000000 fffffffb c58d6780 c049e437
c45a7b00 c04b1f93 c587bfa0 00000400 0804f000 00000400 0804f000 c04b1f2f
Call Trace:
[<c049e5c0>] seq_read+0x189/0x281
[<c049e437>] seq_read+0x0/0x281
[<c04b1f93>] proc_reg_read+0x64/0x77
[<c04b1f2f>] proc_reg_read+0x0/0x77
[<c048907e>] vfs_read+0x80/0xd1
[<c0489491>] sys_read+0x41/0x67
[<c04080fa>] sysenter_past_esp+0x6b/0xc1
=======================
Code: e9 ec 8d 05 00 56 8b 11 53 8b 40 70 8b 58 3c eb 29 0f a3 15 80 91 7b c0 19 c0 85 c0 8d 42 01 74 17 89 c6 c1 fe 1f 89 01 89 71 04 <8b> 83 58 01 00 00 f7 d0 8b 04 90 eb 09 89 c2 83 fa 01 7e d2 31
EIP: [<c05a99da>] neigh_stat_seq_next+0x26/0x3f SS:ESP 0068:c587bf1c

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
bfb85c9f753a7172bd962e8717118191dfd612cc 22-Oct-2007 Randy Dunlap <randy.dunlap@oracle.com> [ATM]: Fix clip module reload crash.

net/atm/clip.c crashes the kernel if it (module) is loaded, removed,
and then loaded again. Its exit call to neigh_table_clear()
should destroy the cache after freeing it.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d12af679bcf8995a237560bdf7a4d734f8df5dbb 18-Oct-2007 Eric W. Biederman <ebiederm@xmission.com> sysctl: fix neighbour table sysctls.

- In ipv6 fix ndisc_ifinfo_sysctl_change so it doesn't depend on binary
sysctl names for a function that works with proc.

- In neighbour.c reorder the table to put the possibly unused entries
at the end so we can remove them by terminating the table early.

- In neighbour.c kill the entries with questionable binary sysctl
handling behavior.

- In neighbour.c, if we don't have a strategy routine, remove the
binary path, so we don't run the default sysctl strategy routine
on data that is not ready for it.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4ae289444b968b4cefd776ada8da519ce10e56fa 15-Oct-2007 Pavel Emelyanov <xemul@openvz.org> [NEIGH]: Ensure that pneigh_lookup is protected with RTNL

pneigh_lookup is used to look up proxy entries and to
create them if the lookup fails.

However, the "creation" code does not perform the re-lookup
after GFP_KERNEL allocation. This is done because the code
is expected to be protected with the RTNL lock, so add the
assertion (mainly to address future questions from new network
developers like me :) ).

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
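
The lookup-or-create pattern described in the pneigh_lookup commit above can be sketched in userspace as follows. This is a minimal illustration, not the kernel code: toy_pneigh, toy_proxy_list and the toy_rtnl_held flag are hypothetical stand-ins (the flag plays the role of "RTNL is held", and the assert plays the role of the added assertion).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical proxy entry; the real structure carries far more state. */
struct toy_pneigh {
	int key;
	struct toy_pneigh *next;
};

struct toy_pneigh *toy_proxy_list; /* head of the proxy entry list */
int toy_rtnl_held;                 /* stand-in for "the caller holds RTNL" */

/* Look up a proxy entry by key; optionally create it when missing. */
struct toy_pneigh *toy_pneigh_lookup(int key, int create)
{
	struct toy_pneigh *n;

	for (n = toy_proxy_list; n; n = n->next)
		if (n->key == key)
			return n;
	if (!create)
		return NULL;

	/*
	 * There is no second lookup after the allocation below. That is
	 * only safe if creators are serialized, so assert that they are.
	 */
	assert(toy_rtnl_held);

	n = malloc(sizeof(*n));
	if (!n)
		return NULL;
	n->key = key;
	n->next = toy_proxy_list;
	toy_proxy_list = n;
	return n;
}
```

Without the serialization, two callers could both miss in the lookup, both allocate, and both insert an entry for the same key.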
3b04ddde02cf1b6f14f2697da5c20eca5715017f 09-Oct-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: Move hardware header operations out of netdevice.

Since hardware header operations are part of the protocol class
not the device instance, make them into a separate object and
save memory.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c4e85813d0a94eeb8bf813397a4907bdd7bb610 09-Oct-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: Wrap netdevice hardware header creation.

Add inline for common usage of hardware header creation, and
fix a bug in IPv6 mcast, which assumed that a negative return is
an errno. A negative return from hard_header means not enough space
was available (i.e. -N bytes).

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
881d966b48b035ab3f3aeaae0f3d3f9b584f45b2 17-Sep-2007 Eric W. Biederman <ebiederm@xmission.com> [NET]: Make the device list and device lookups per namespace.

This patch makes most of the generic device layer network
namespace safe. This patch makes dev_base_head a
network namespace variable, and then it picks up
a few associated variables. The functions:
dev_getbyhwaddr
dev_getfirsthwbytype
dev_get_by_flags
dev_get_by_name
__dev_get_by_name
dev_get_by_index
__dev_get_by_index
dev_ioctl
dev_ethtool
dev_load
wireless_process_ioctl

were modified to take a network namespace argument, and
deal with it.

vlan_ioctl_set and brioctl_set were modified so their
hooks will receive a network namespace argument.

So basically anything in the core of the network stack that was
affected by the change of dev_base was modified to handle
multiple network namespaces. The rest of the network stack was
simply modified to explicitly use &init_net the initial network
namespace. This can be fixed when those components of the network
stack are modified to handle multiple network namespaces.

For now the ifindex generator is left global.

Fundamentally ifindex numbers are per namespace, or else
we will have corner case problems with migration when
we get that far.

At the same time there are assumptions in the network stack
that the ifindex of a network device won't change. Making
the ifindex number global seems a good compromise until
the network stack can cope with ifindex changes when
you change namespaces, and the like.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
457c4cbc5a3dde259d2a1f15d5f9785290397267 12-Sep-2007 Eric W. Biederman <ebiederm@xmission.com> [NET]: Make /proc/net per network namespace

This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d961db358f41033a8fc7b62948bc7cff1b4bb1fe 09-Aug-2007 Thomas Graf <tgraf@suug.ch> [NEIGH]: Netlink notifications

Currently neighbour event notifications are limited to update
notifications and only sent if the ARP daemon is enabled. This
patch extends the existing notification code by also reporting
neighbours being removed due to gc or administratively and
removes the dependency on the ARP daemon. This makes it possible to keep
track of neighbour states without periodically fetching the
complete neighbour table.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4f494554f9b95d0de57c14c460d525e3715e3f6f 09-Aug-2007 Thomas Graf <tgraf@suug.ch> [NEIGH]: Combine neighbour cleanup and release

Introduces neigh_cleanup_and_release() to be used after a
neighbour has been removed from its neighbour table. Serves
as preparation to add event notifications.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
c3609d510f844100669965db8a9ff10ba029bb4a 25-Aug-2007 vignesh babu <vignesh.babu@wipro.com> [NET]: is_power_of_2 in net/core/neighbour.c

Replacing n & (n - 1) for power of 2 check by is_power_of_2(n)

Signed-off-by: vignesh babu <vignesh.babu@wipro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
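
The two forms of the power-of-2 check mentioned in the commit above can be compared in a small userspace sketch. The function names here are hypothetical; pow2_helper mirrors the usual definition of the kernel helper, which also rejects zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Open-coded test: true for powers of 2, but also true for n == 0. */
bool pow2_open_coded(unsigned long n)
{
	return (n & (n - 1)) == 0;
}

/* Helper-style test: same bit trick, with zero explicitly excluded. */
bool pow2_helper(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}
```

The two agree for every nonzero input, so the conversion is behaviour-preserving wherever zero cannot occur.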
20c2df83d25c6a95affe6157a4c9cac4cf5ffaac 20-Jul-2007 Paul Mundt <lethal@linux-sh.org> mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
ef7c79ed645f52bcbdd88f8d54a9702c4d3fd15d 05-Jun-2007 Patrick McHardy <kaber@trash.net> [NETLINK]: Mark netlink policies const

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c8822a4e00442e65d42d50db8e529d75c2025630 22-Mar-2007 Thomas Graf <tgraf@suug.ch> [NEIGH]: Use rtnl registration interface

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
f690808e17925fc45217eb22e8670902ecee5c1b 12-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: make seq_operations const

The seq_file operations stuff can be marked constant to
get it out of dirty cache.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bbe735e4247dba32568a305553b010081c8dea99 11-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_network_offset()

For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c2ecba71717c4f60671175fd26083c35a4b9ad58 17-Apr-2007 Pavel Emelianov <xemul@sw.ru> [NET]: Set a separate lockdep class for neighbour table's proxy_queue

Otherwise the following calltrace will lead to a wrong
lockdep warning:

neigh_proxy_process()
`- lock(neigh_table->proxy_queue.lock);
arp_redo /* via tbl->proxy_redo */
arp_process
neigh_event_ns
neigh_update
skb_queue_purge
`- lock(neighbor->arp_queue.lock);

This is not a deadlock actually, as neighbor table's proxy_queue
and the neighbor's arp_queue are different queues.

Lockdep thinks there is a deadlock as both queues are initialized
with skb_queue_head_init() and thus have a common class.

Signed-off-by: David S. Miller <davem@davemloft.net>
ecbb416939da77c0d107409976499724baddce7b 24-Mar-2007 Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> [NET]: Fix neighbour destructor handling.

->neigh_destructor() is killed (not used), replaced with
->neigh_cleanup(), which is called when neighbor entry goes to dead
state. At this point everything is still valid: neigh->dev,
neigh->parms etc.

The device should guarantee that dead neighbor entries (neigh->dead !=
0) do not get their private part initialized, otherwise nobody will clean
it up.

I think this is enough for ipoib, which is the only user of this thing.
Initialization of the private part of neighbor entries happens in the ipoib
start_xmit routine, which is not reached when the device is down. But it
would be better to add an explicit test for neigh->dead in any case.

Signed-off-by: David S. Miller <davem@davemloft.net>
0b4d414714f0d2f922d39424b0c5c82ad900a381 14-Feb-2007 Eric W. Biederman <ebiederm@xmission.com> [PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow newly registered
sysctl entries to override existing sysctl entries of the same name, which is a
pain for caching and which the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (except for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancements harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cd354f1ae75e6466a7e31b727faede57a1f89ca5 14-Feb-2007 Tim Schmielau <tim@physik3.uni-rostock.de> [PATCH] remove many unneeded #includes of sched.h

After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9a32144e9d7b4e21341174b1a83b82a82353be86 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com> [PATCH] mark struct file_operations const 7

Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c376222960ae91d5ffb9197ee36771aaed1d9f90 10-Feb-2007 Robert P. J. Day <rpjday@mindspring.com> [PATCH] Transform kmem_cache_alloc()+memset(0) -> kmem_cache_zalloc().

Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4ec93edb14fe5fdee9fae6335f2cbba204627eac 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] CORE: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f5a6e01c093ca60c0cab15c47c8e7e199fbbc9e6 06-Feb-2007 Arjan van de Ven <arjan@linux.intel.com> [NET]: user of the jiffies rounding code: Networking

This patch introduces users of the round_jiffies() function in the
networking code.

These timers all were of the "about once a second" or "about once
every X seconds" variety and several showed up in the "what wakes the
cpu up" profiles that the tickless patches provide. Some timers are
highly dynamic based on network load; but even on low activity systems
they still show up so the rounding is done only in cases of low
activity, allowing higher frequency timers in the high activity case.

The various hardware watchdogs are an obvious case; they run every 2
seconds but aren't otherwise particular about exactly when they need to
run.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
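
The rounding idea behind the jiffies commit above can be sketched as follows. This is a simplified userspace model, not the kernel's round_jiffies(): HZ_TICKS and round_to_second are hypothetical names, and the sketch ignores any per-CPU skew the real helper may apply.

```c
#include <assert.h>

#define HZ_TICKS 1000UL /* hypothetical tick rate: 1000 ticks per second */

/*
 * Round an absolute expiry time (in ticks) to a whole-second boundary so
 * that unrelated "about once a second" timers tend to fire together.
 * Rounds down when less than a quarter second past the boundary,
 * otherwise rounds up.
 */
unsigned long round_to_second(unsigned long expiry, unsigned long now)
{
	unsigned long rem = expiry % HZ_TICKS;
	unsigned long rounded;

	if (rem < HZ_TICKS / 4)
		rounded = expiry - rem;
	else
		rounded = expiry - rem + HZ_TICKS;

	/* Never move a timer into the past; fall back to the exact value. */
	return rounded > now ? rounded : expiry;
}
```

Timers that need precision, such as those driven by high network load in the commit's description, would simply skip the rounding step.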
26932566a42d46aee7e5d526cb34fba9380cad10 01-Feb-2007 Patrick McHardy <kaber@trash.net> [NETLINK]: Don't BUG on undersized allocations

Currently netlink users BUG when the allocated skb for an event
notification is undersized. While this is certainly a kernel bug,
it's not critical and crashing the kernel is too drastic, especially
when considering that these errors have appeared multiple times in
the past and it BUGs even if no listeners are present.

This patch replaces BUG by WARN_ON and changes the notification
functions to inform potential listeners of undersized allocations
using a unique error code (EMSGSIZE).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
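
The change in error handling described in the commit above can be illustrated with a toy message builder. Everything here is a hypothetical userspace stand-in: toy_msg plays the role of the undersized skb, and listener_err models the error code handed to potential listeners.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size message buffer standing in for an skb. */
struct toy_msg {
	unsigned char data[32];
	size_t len;
};

/* Append n bytes; return -EMSGSIZE instead of aborting when they don't fit. */
int toy_msg_put(struct toy_msg *m, const void *src, size_t n)
{
	if (n > sizeof(m->data) - m->len)
		return -EMSGSIZE;
	memcpy(m->data + m->len, src, n);
	m->len += n;
	return 0;
}

/*
 * Build a notification. On overflow, record the error for listeners and
 * return it to the caller rather than crashing.
 */
int toy_notify(struct toy_msg *m, const void *payload, size_t n,
	       int *listener_err)
{
	int err;

	m->len = 0;
	err = toy_msg_put(m, payload, n);
	*listener_err = err; /* 0 on success, -EMSGSIZE on overflow */
	return err;
}
```

The undersized buffer is still a bug in the size calculation, but it now surfaces as a distinct error code instead of taking the whole system down.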
3644f0cee77494190452de132e82245107939284 08-Dec-2006 Stephen Hemminger <shemminger@osdl.org> [NET]: Convert hh_lock to seqlock.

The hard header cache is in the main output path, so using
seqlock instead of reader/writer lock should reduce overhead.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
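
The reason a seqlock is cheaper for readers than a reader/writer lock can be seen in a small sketch using C11 atomics. This is an illustrative userspace model, not the kernel's seqlock implementation: toy_hh and its functions are hypothetical, and the writer side assumes that writers are already serialized by some other means.

```c
#include <assert.h>
#include <stdatomic.h>

#define HDR_WORDS 4

/* Hypothetical cached header protected by a sequence counter. */
struct toy_hh {
	atomic_uint seq;             /* even: stable, odd: write in progress */
	atomic_uint data[HDR_WORDS]; /* cached header words */
};

void toy_hh_init(struct toy_hh *hh)
{
	atomic_init(&hh->seq, 0u);
	for (int i = 0; i < HDR_WORDS; i++)
		atomic_init(&hh->data[i], 0u);
}

/* Writer: bump the counter to odd, update the data, bump it back to even. */
void toy_hh_write(struct toy_hh *hh, const unsigned int *src)
{
	unsigned int s = atomic_load_explicit(&hh->seq, memory_order_relaxed);

	atomic_store_explicit(&hh->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	for (int i = 0; i < HDR_WORDS; i++)
		atomic_store_explicit(&hh->data[i], src[i],
				      memory_order_relaxed);
	atomic_store_explicit(&hh->seq, s + 2, memory_order_release);
}

/* Reader: never writes shared state; retries if a write overlapped. */
void toy_hh_read(struct toy_hh *hh, unsigned int *dst)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load_explicit(&hh->seq, memory_order_acquire);
		for (int i = 0; i < HDR_WORDS; i++)
			dst[i] = atomic_load_explicit(&hh->data[i],
						      memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&hh->seq, memory_order_relaxed);
	} while ((s1 & 1u) || s1 != s2);
}
```

Because readers only load, concurrent readers on the output path do not bounce a lock cache line between CPUs; the cost is an occasional retry when a write is in flight.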
54e6ecb23951b195d02433a741c7f7cb0b796c78 07-Dec-2006 Christoph Lameter <clameter@sgi.com> [PATCH] slab: remove SLAB_ATOMIC

SLAB_ATOMIC is an alias of GFP_ATOMIC

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b1a98bf685e26f1a0b509d6f0f6bd8f7764303a5 21-Nov-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [NET] neighbour: Use kmemdup where applicable

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
339bf98ffc6a8d8eb16fc532ac57ffbced2f8a68 10-Nov-2006 Thomas Graf <tgraf@suug.ch> [NETLINK]: Do precise netlink message allocations where possible

Account for the netlink message header size directly in nlmsg_new()
instead of relying on the caller to calculate it correctly.

Replaces error handling of message construction functions when
constructing notifications with bug traps since a failure implies
a bug in calculating the size of the skb.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c5e29460f5f9eb189cab5d9fdaa137e64f7734b6 04-Oct-2006 Julian Anastasov <ja@ssi.bg> [NEIGH]: always use hash_mask under tbl lock

Make sure hash_mask is protected with tbl->lock in all cases just like
the hash_buckets.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
d77072ecfb6d28287d5e2a61d60d87a3a444ac97 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [NET]: Annotate dst_ops protocol

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
62dd93181aaa1d5a501a9cebcb254f44b8a48af7 22-Sep-2006 Ville Nuorvala <vnuorval@tcs.hut.fi> [IPV6] NDISC: Set per-entry is_router flag in Proxy NA.

We have sent NA with router flag from the node-wide forwarding
configuration. This is not appropriate for proxy NA, and it should be
set according to each proxy entry's configuration.

This is used by Mobile IPv6 home agent to support physical home link
in acting as a proxy router for mobile node which is not a router,
for example.

Based on MIPL2 kernel patch.

Signed-off-by: Ville Nuorvala <vnuorval@tcs.hut.fi>
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
e5d679f33900c71d1a76ba07c5b04055abd34480 27-Aug-2006 Alexey Dobriyan <adobriyan@gmail.com> [NET]: Use SLAB_PANIC

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e92b43a3455d3e817c13481bb3ea3cd29d0a47f4 18-Aug-2006 Stephen Hemminger <shemminger@osdl.org> [NET] neighbour: reduce exports

There are several symbols only used by rtnetlink and since it can
not be a module, there is no reason to export them.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ab32ea5d8a760e7dd4339634e95d7be24ee5b842 22-Sep-2006 Brian Haley <brian.haley@hp.com> [NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly

Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b8673311804ca29680dd584bd08352001fcbe2f8 15-Aug-2006 Thomas Graf <tgraf@suug.ch> [NEIGH]: Convert neighbour notifications to use rtnl_notify()

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
ca860fb39b4aa1479e2fea67435a2c1eac9ce789 08-Aug-2006 Thomas Graf <tgraf@suug.ch> [NEIGH]: Convert neighbour table dumping to new netlink api

Also fixes skipping of already dumped neighbours.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
6b3f8674bccbb2e784d01e44373fb730af6cb149 08-Aug-2006 Thomas Graf <tgraf@suug.ch> [NEIGH]: Convert neighbour table modification to new netlink api

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
8b8aec508302d4e63fd88f47894805115277f70f 08-Aug-2006 Thomas Graf <tgraf@suug.ch> [NEIGH]: Convert neighbour dumping to new netlink api

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
5208debd0f1da07bbb350f8b0b142775d4f002ea 08-Aug-2006 Thomas Graf <tgraf@suug.ch> [NEIGH]: Convert neighbour addition to new netlink api

Fixes:
Return EAFNOSUPPORT if no table matches the specified
address family.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
a14a49d2b7b9290e87751f21f503f1954267d4c4 08-Aug-2006 Thomas Graf <tgraf@suug.ch> [NEIGH]: Convert neighbour deletion to new netlink api

Fixes:
Return ENOENT if the neighbour is not found (was EINVAL)
Return EAFNOSUPPORT if no table matches the specified
address family.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
3fcde74b3877756f4b4725a883d0b48696c0d369 01-Sep-2006 Kirill Korotaev <dev@openvz.org> [NEIGH]: neigh_table_clear() doesn't free stats

neigh_table_clear() doesn't free tbl->stats.
Found by Alexey Kuznetsov. Though Alexey considers this
leak minor for mainstream, I still believe that cleanup
code should not forget to free some of the resources :)

At least, this is critical for OpenVZ with virtualized
neighbour tables.

Signed-Off-By: Kirill Korotaev <dev@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8d71740c56a9058acc4378504a356d543ff1308b 31-Jul-2006 Tom Tucker <tom@opengridcomputing.com> [NET]: Core net changes to generate netevents

Generate netevents for:
- neighbour changes
- routing redirects
- pmtu changes

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
bd89efc532fe41f867f848144cc8b42054ddf6f9 12-May-2006 Simon Kelley <simon@thekelleys.org.uk> [NEIGH]: Fix IP-over-ATM and ARP interaction.

The classical IP over ATM code maintains its own IPv4 <-> <ATM stuff>
ARP table, using the standard neighbour-table code. The
neigh_table_init function adds this neighbour table to a linked list
of all neighbor tables which is used by the functions neigh_delete()
neigh_add() and neightbl_set(), all called by the netlink code.

Once the ATM neighbour table is added to the list, there are two
tables with family == AF_INET there, and ARP entries sent via netlink
go into the first table with matching family. This is indeterminate
and often wrong.

To see the bug, on a kernel with CLIP enabled, create a standard IPv4
ARP entry by pinging an unused address on a local subnet. Then attempt
to complete that entry by doing

ip neigh replace <ip address> lladdr <some mac address> nud reachable

Looking at the ARP tables by using

ip neigh show

will reveal two ARP entries for the same address. One of these can be
found in /proc/net/arp, and the other in /proc/net/atm/arp.

This patch adds a new function, neigh_table_init_no_netlink() which
does everything the neigh_table_init() does, except add the table to
the netlink all-arp-tables chain. In addition neigh_table_init() has a
check that all tables on the chain have a distinct address family.
The init call in clip.c is changed to call
neigh_table_init_no_netlink().

Since ATM ARP tables are rather more complicated than can currently be
handled by the available rtattrs in the netlink protocol, no
functionality is lost by this patch, and non-ATM ARP manipulation via
netlink is rescued. A more complete solution would involve a rtattr
for ATM ARP entries and some way for the netlink code to give
neigh_add and friends more information than just address family with
which to find the correct ARP table.

[ I've changed the assertion checking in neigh_table_init() to not
use BUG_ON() while holding neigh_tbl_lock. Instead we remember that
we found an existing tbl with the same family, and after dropping
the lock we'll give a diagnostic kernel log message and a stack dump.
-DaveM ]

Signed-off-by: Simon Kelley <simon@thekelleys.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
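
The registration scheme described in the IP-over-ATM commit above can be modelled with a toy table chain. The structure and function names below are hypothetical; only the idea comes from the commit message: one init path that stays off the netlink chain, one that joins it, and a duplicate-family check that is reported only after the walk (in the kernel, after the lock is dropped).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, pared-down table descriptor. */
struct toy_table {
	int family;
	struct toy_table *next;
};

struct toy_table *toy_tables; /* chain searched by netlink requests */

/* Set up a table without exposing it to family-based netlink lookups. */
void toy_table_init_no_netlink(struct toy_table *tbl)
{
	tbl->next = NULL;
}

/*
 * Set up a table and add it to the chain. Returns 1 if a table with the
 * same family was already present, 0 otherwise.
 */
int toy_table_init(struct toy_table *tbl)
{
	struct toy_table *t;
	int dup = 0;

	toy_table_init_no_netlink(tbl);

	/* Per the note above, the kernel does this walk under a lock. */
	for (t = toy_tables; t; t = t->next)
		if (t->family == tbl->family)
			dup = 1;
	tbl->next = toy_tables;
	toy_tables = tbl;

	/* Complain only after the walk, instead of asserting inside it. */
	if (dup)
		fprintf(stderr, "toy: duplicate table for family %d\n",
			tbl->family);
	return dup;
}

/* Family-only lookup, as a netlink request would perform it. */
struct toy_table *toy_table_find(int family)
{
	struct toy_table *t;

	for (t = toy_tables; t; t = t->next)
		if (t->family == family)
			return t;
	return NULL;
}
```

With the second AF_INET table kept off the chain, a family-only lookup has exactly one candidate and is no longer indeterminate.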
6f912042256c12b0927438122594f5379b364f5d 11-Apr-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [PATCH] for_each_possible_cpu: network codes

for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu under /net

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
77d04bd957ddca9d48a664e28b40f33993f4550e 07-Apr-2006 Andrew Morton <akpm@osdl.org> [NET]: More kzalloc conversions.

Signed-off-by: David S. Miller <davem@davemloft.net>
c5ecd62c25400a3c6856e009f84257d5bd03f03b 21-Mar-2006 Michael S. Tsirkin <mst@mellanox.co.il> [NET]: Move destructor from neigh->ops to neigh_params

struct neigh_ops currently has a destructor field, which no in-kernel
drivers outside of infiniband use. The infiniband/ulp/ipoib in-tree
driver stashes some info in the neighbour structure (the results of
the second-stage lookup from ARP results to real link-level path), and
it uses neigh->ops->destructor to get a callback so it can clean up
this extra info when a neighbour is freed. We've run into problems
with this: since the destructor is in an ops field that is shared
between neighbours that may belong to different net devices, there's
no way to set/clear it safely.

The following patch moves this field to neigh_parms where it can be
safely set, together with its twin neigh_setup. Two additional
patches in the patch series update ipoib to use this new interface.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
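
Why a per-device hook is safer than one in the shared ops structure can be shown with a small model. The types below are hypothetical simplifications; the field name neigh_destructor follows the commit messages in this log, but the rest is illustrative only.

```c
#include <assert.h>
#include <stddef.h>

struct toy_neigh;

/* Shared between neighbours that may belong to different devices. */
struct toy_ops {
	int family;
};

/* One instance per device, so a driver can set hooks here safely. */
struct toy_parms {
	void (*neigh_destructor)(struct toy_neigh *);
};

struct toy_neigh {
	const struct toy_ops *ops;
	struct toy_parms *parms;
	int private_freed; /* stands in for driver-private state */
};

/* Hypothetical driver callback that releases the private state. */
void toy_ipoib_destructor(struct toy_neigh *n)
{
	n->private_freed = 1;
}

/* Tear down a neighbour, invoking the per-device hook if one is set. */
void toy_neigh_destroy(struct toy_neigh *n)
{
	assert(n->parms != NULL);
	if (n->parms->neigh_destructor)
		n->parms->neigh_destructor(n);
}
```

Two neighbours can share the same ops while only the one whose device registered a hook gets the callback, which is not possible when the hook lives in the shared structure.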
955aaa2fe39e21e49521449c09548ce1ba501010 21-Mar-2006 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET]: NEIGHBOUR: Ensure to record time to neigh->updated when neighbour's state changed.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
670c02c2bfd2c8a305a90f5285409a7b0a8fd630 13-Oct-2005 John Hawkes <hawkes@sgi.com> [NET]: Wider use of for_each_*cpu()

In 'net' change the explicit use of for-loops and NR_CPUS into the
general for_each_cpu() or for_each_online_cpu() constructs, as
appropriate. This widens the scope of potential future optimizations
of the general constructs, as well as takes advantage of the existing
optimizations of first_cpu() and next_cpu(), which is advantageous
when the true CPU count is much smaller than NR_CPUS.

Signed-off-by: John Hawkes <hawkes@sgi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
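
The difference between looping over 0..NR_CPUS and using a for_each-style iterator can be sketched as follows. The macro and helpers are hypothetical userspace stand-ins operating on a 64-bit mask; the linear scan in toy_next_cpu is only for illustration and is not how an optimized bit search would be written.

```c
#include <assert.h>

#define TOY_NR_CPUS 64

/* Index of the next set bit at or after 'from', or TOY_NR_CPUS if none. */
int toy_next_cpu(unsigned long long mask, int from)
{
	for (int cpu = from; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (1ULL << cpu))
			return cpu;
	return TOY_NR_CPUS;
}

/* Visit only the CPUs whose bit is set in 'mask'. */
#define toy_for_each_cpu(cpu, mask)                                \
	for ((cpu) = toy_next_cpu((mask), 0); (cpu) < TOY_NR_CPUS; \
	     (cpu) = toy_next_cpu((mask), (cpu) + 1))

/* Sum per-CPU counters over the CPUs present in 'mask' only. */
unsigned long toy_sum_stats(const unsigned long *percpu,
			    unsigned long long mask)
{
	unsigned long sum = 0;
	int cpu;

	toy_for_each_cpu(cpu, mask)
		sum += percpu[cpu];
	return sum;
}
```

Callers no longer mention the array bound at all, so the iterator can later be made smarter without touching them.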
49636bb12892786e4a7b207b37ca7b0c5ca1cae0 23-Oct-2005 Herbert Xu <herbert@gondor.apana.org.au> [NEIGH] Fix timer leak in neigh_changeaddr

neigh_changeaddr attempts to delete neighbour timers without setting
nud_state. This doesn't work because the timer may have already fired
when we acquire the write lock in neigh_changeaddr. The result is that
the timer may keep firing for quite a while until the entry reaches
NUD_FAILED.

It should be setting the nud_state straight away so that if the timer
has already fired it can simply exit once we relinquish the lock.

In fact, this whole function is simply duplicating the logic in
neigh_ifdown which in turn is already doing the right thing when
it comes to deleting timers and setting nud_state.

So all we have to do is take that code out and put it into a common
function and make both neigh_changeaddr and neigh_ifdown call it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
6fb9974f49f7a6032118c5b6caa6e08e7097913e 23-Oct-2005 Herbert Xu <herbert@gondor.apana.org.au> [NEIGH] Fix add_timer race in neigh_add_timer

neigh_add_timer cannot use add_timer unconditionally. The reason is that
by the time it has obtained the write lock someone else (e.g., neigh_update)
could have already added a new timer.

So it should only use mod_timer and deal with its return value accordingly.

This bug would have led to rare neighbour cache entry leaks.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
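
The point about using the return value of mod_timer can be illustrated with a toy timer. This is a sketch under assumptions, not the actual fix: toy_mod_timer only models the documented return convention (nonzero if the timer was already pending), and tying the reference count to that result is one possible way to "deal with its return value".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical timer: just enough state to model pending/expiry. */
struct toy_timer {
	bool pending;
	unsigned long expires;
};

/* (Re)arm the timer; return 1 if it was already pending, else 0. */
int toy_mod_timer(struct toy_timer *t, unsigned long expires)
{
	int was_pending = t->pending;

	t->expires = expires;
	t->pending = true;
	return was_pending;
}

/* Hypothetical neighbour entry; the timer holds one reference while armed. */
struct toy_neigh_entry {
	int refcnt;
	struct toy_timer timer;
};

/*
 * Arm the entry's timer. Take the timer's reference only if this call
 * really armed it; returns 1 when someone else had armed it first.
 */
int toy_neigh_add_timer(struct toy_neigh_entry *n, unsigned long when)
{
	if (toy_mod_timer(&n->timer, when))
		return 1; /* already armed: the existing reference stands */
	n->refcnt++;
	return 0;
}
```

An unconditional add would take a second reference for a timer that fires only once, which is the kind of imbalance that leaks an entry.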
203755029e063066ecc4cf5eee1110ab946c2d88 23-Oct-2005 Herbert Xu <herbert@gondor.apana.org.au> [NEIGH] Print stack trace in neigh_add_timer

Stack traces are very helpful in determining the exact nature of a bug.
So let's print a stack trace when the timer is added twice.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
667347f1ca7e099f6833551f194cf2bcc778871b 27-Sep-2005 David S. Miller <davem@sunset.davemloft.net> [NEIGH]: Add debugging check when adding timers.

If we double-add a neighbour entry timer, which should be
impossible but has been reported, dump the current state of
the entry so that we can debug this.

Signed-off-by: David S. Miller <davem@davemloft.net>
45fc3b11f1d419ed6c636e5ca84472d9805f520e 25-Sep-2005 Amos Waterland <apw@us.ibm.com> [NET]: Protect neigh_stat_seq_fops by CONFIG_PROC_FS

From: Amos Waterland <apw@us.ibm.com>

If CONFIG_PROC_FS is not selected, the compiler emits this warning:

net/core/neighbour.c:64: warning: `neigh_stat_seq_fops' defined but not used

Which is correct, because neigh_stat_seq_fops is in fact only
initialized and used by code that is protected by CONFIG_PROC_FS. So
this patch fixes that up.

Signed-off-by: Amos Waterland <apw@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac6d439d2097b72ea0cbc2322ce1263a38bc1fd0 15-Aug-2005 Patrick McHardy <kaber@trash.net> [NETLINK]: Convert netlink users to use group numbers instead of bitmasks

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
a61bbcf28a8cb0ba56f8193d512f7222e711a294 15-Aug-2005 Patrick McHardy <kaber@trash.net> [NET]: Store skb->timestamp as offset to a base timestamp

Reduces skb size by 8 bytes on 64-bit.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ef1d4c7c7aca1cd436612b6ca785b726ffb8ed8 28-Jun-2005 Patrick McHardy <kaber@trash.net> [NETLINK]: Missing initializations in dumped data

Mostly missing initialization of padding fields of 1 or 2 bytes length,
two instances of uninitialized nlmsgerr->msg of 16 bytes length.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
543537bd922692bc978e2e356fcd8bfc9c2ee7d5 23-Jun-2005 Paulo Marques <pmarques@grupopie.com> [PATCH] create a kstrdup library function

This patch creates a new kstrdup library function and changes the "local"
implementations in several places to use this function.

Most of the changes come from the sound and net subsystems. The sound part
had already been acknowledged by Takashi Iwai and the net part by David S.
Miller.

I left UML alone for now because I would need more time to read the code
carefully before making changes there.

Signed-off-by: Paulo Marques <pmarques@grupopie.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
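
The kind of helper that the kstrdup commit above consolidates looks roughly like this in userspace. toy_strdup is a hypothetical stand-in that uses malloc; the kernel function additionally takes allocation flags, which are omitted here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate a NUL-terminated string into freshly allocated memory.
 * Returns NULL for a NULL input or on allocation failure; the caller
 * owns the result and must free() it.
 */
char *toy_strdup(const char *s)
{
	size_t len;
	char *buf;

	if (!s)
		return NULL;

	len = strlen(s) + 1; /* include the terminating NUL */
	buf = malloc(len);
	if (buf)
		memcpy(buf, s, len);
	return buf;
}
```

Having one shared copy of this logic replaces several slightly different local versions, each of which had to get the length and NULL handling right on its own.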
b6544c0b4cf2bd96195f3cdb7cebfb35090fc557 19-Jun-2005 Jamal Hadi Salim <hadi@cyberus.ca> [NETLINK]: Correctly set NLM_F_MULTI without checking the pid

This patch rectifies some rtnetlink message builders that derive the
flags from the pid. It is now explicit like the other cases
which get it right. Also fixes half a dozen dumpers which did not
set NLM_F_MULTI at all.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
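
As a rough sketch of why the NLM_F_MULTI commit above matters to receivers:
a userspace reader of a dump usually treats every part as belonging to a
multipart reply until it sees NLMSG_DONE. The function name
count_multipart() is hypothetical, and treating a part without NLM_F_MULTI
as an error is an assumption made for this example, not something the
commit states.

```c
#include <linux/netlink.h>

/*
 * Walk a buffer of netlink messages and count the parts that precede
 * NLMSG_DONE. Returns -1 if a part lacks NLM_F_MULTI or if the buffer
 * ends before NLMSG_DONE is seen.
 */
int count_multipart(void *buf, int len)
{
    struct nlmsghdr *nlh = buf;
    int parts = 0;

    for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
        if (nlh->nlmsg_type == NLMSG_DONE)
            return parts;               /* end of the dump */
        if (!(nlh->nlmsg_flags & NLM_F_MULTI))
            return -1;                  /* dumper forgot the flag */
        parts++;
    }
    return -1;                          /* truncated: no NLMSG_DONE */
}
```

A dumper that omits the flag would make a strict reader like this one give
up early, which is the kind of inconsistency the commit addresses.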
1797754ea7ee5e0d859b0a32506ff999f8d5fb71 19-Jun-2005 Thomas Graf <tgraf@suug.ch> [NETLINK]: Introduce NLMSG_NEW macro to better handle netlink flags

Introduces a new macro NLMSG_NEW which extends NLMSG_PUT but takes
a flags argument. NLMSG_PUT stays there for compatibility but now
calls NLMSG_NEW with flags == 0. NLMSG_PUT_ANSWER is renamed to
NLMSG_NEW_ANSWER which now also takes a flags argument.

Also converts the users of NLMSG_PUT_ANSWER to use NLMSG_NEW_ANSWER
and fixes the two direct users of __nlmsg_put to either provide
the flags or use NLMSG_NEW(_ANSWER).

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
e386c6eb431ca2e435d0202ad6997f3d2ccab2ce 19-Jun-2005 Thomas Graf <tgraf@suug.ch> [NEIGH]: Fix use of uninitialized variable when trimming in neightbl_fill_parms

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4b6ea82dd18c97598c3caaa8d0b1feec87857e70 19-Jun-2005 Thomas Graf <tgraf@suug.ch> [NETLINK]: Kill bogus NLMSG_SET_MULTIPART uses.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7fb64db001f83ece669c76a02d8ec2fdb1dd307 19-Jun-2005 Thomas Graf <tgraf@suug.ch> [NETLINK]: Neighbour table configuration and statistics via rtnetlink

To retrieve the neighbour tables send RTM_GETNEIGHTBL with the
NLM_F_DUMP flag set. Every neighbour table configuration is
spread over multiple messages to avoid running into message
size limits on systems with many interfaces. The first message
in the sequence transports all data that is not device specific,
such as statistics, configuration, and the default parameter set.
This message is followed by 0..n messages carrying device
specific parameter sets.

Although the ordering should be sufficient, NDTA_NAME can be
used to identify sequences. The initial message can be identified
by checking for NDTA_CONFIG. The device specific messages do
not contain this TLV but have NDTPA_IFINDEX set to the
corresponding interface index.

To change neighbour table attributes, send RTM_SETNEIGHTBL
with NDTA_NAME set. Changeable attributes include NDTA_THRESH[1-3],
NDTA_GC_INTERVAL, and all TLVs in NDTA_PARMS unless marked
otherwise. Device specific parameter sets can be changed by
setting NDTPA_IFINDEX to the interface index of the corresponding
device.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
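
The dump request described in the commit above could be built from
userspace roughly as follows. This is a minimal sketch: the struct name
neightbl_dump_req and the function build_neightbl_dump() are hypothetical,
and choosing AF_UNSPEC as the family is an assumption made here rather than
something the commit specifies. Only the request is built; sending it and
parsing the replies are left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/neighbour.h>

/* Hypothetical request layout: netlink header followed by struct ndtmsg. */
struct neightbl_dump_req {
    struct nlmsghdr nlh;
    struct ndtmsg   ndtm;
};

/* Fill in an RTM_GETNEIGHTBL request with NLM_F_DUMP set. */
void build_neightbl_dump(struct neightbl_dump_req *req, unsigned int seq)
{
    memset(req, 0, sizeof(*req));       /* also clears the padding fields */
    req->nlh.nlmsg_len    = NLMSG_LENGTH(sizeof(struct ndtmsg));
    req->nlh.nlmsg_type   = RTM_GETNEIGHTBL;
    req->nlh.nlmsg_flags  = NLM_F_REQUEST | NLM_F_DUMP;
    req->nlh.nlmsg_seq    = seq;
    req->ndtm.ndtm_family = AF_UNSPEC;  /* assumption: no family filter */
}
```

The filled request would then be written to an AF_NETLINK socket opened with
the NETLINK_ROUTE protocol. In the replies, per the description above, the
message carrying NDTA_CONFIG starts a table and the following messages
carry NDTPA_IFINDEX for each device specific parameter set.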
5bec0039f4ac8d707d7afe7739cc2e7004447e38 28-Apr-2005 Olaf Rempel <razzor@kopf-tisch.de> [NET]: /proc/net/stat/* header cleanup

Signed-off-by: Olaf Rempel <razzor@kopf-tisch.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!