History log of /net/ipv4/fib_trie.c
Revision Date Author Comments
1d023284c31a4e40a94d5bbcb7dbb7a35ee0bcbc 07-Aug-2014 Ken Helias <kenhelias@firemail.de> list: fix order of arguments for hlist_add_after(_rcu)

All other add functions for lists have the new item as first argument
and the position where it is added as second argument. This was changed
for no good reason in this function and makes using it unnecessary
confusing.

The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f830b0223cabfc614552a73dabff920859191f2e 05-Jun-2014 Sasha Levin <sasha.levin@oracle.com> net: Revert "fib_trie: use seq_file_net rather than seq->private"

This reverts commit 30f38d2fdd79f13fc929489f7e6e517b4a4bfe63.

fib_triestat is surrounded by a big lie: while it claims that it's a
seq_file (fib_triestat_seq_open, fib_triestat_seq_show), it isn't:

static const struct file_operations fib_triestat_fops = {
.owner = THIS_MODULE,
.open = fib_triestat_seq_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release_net,
};

Yes, fib_triestat is just a regular file.

A small detail (assuming CONFIG_NET_NS=y) is that while for seq_files
you could do seq_file_net() to get the net ptr, doing so for a regular
file would be wrong and would dereference an invalid pointer.

The fib_triestat lie claimed a victim, and trying to show the file would
be bad for the kernel. This patch just reverts the issue and fixes
fib_triestat, which still needs a rewrite to either be a seq_file or
stop claiming it is.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
30f38d2fdd79f13fc929489f7e6e517b4a4bfe63 30-May-2014 David Ahern <dsahern@gmail.com> fib_trie: use seq_file_net rather than seq->private

Make fib_triestat_seq_show consistent with other /proc/net files and
use seq_file_net.

Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
652586df95e5d76b37d07a11839126dcfede1621 14-Nov-2013 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> seq_file: remove "%n" usage from seq_file users

All seq_printf() users are using "%n" for calculating padding size,
convert them to use seq_setwidth() / seq_pad() pair.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4c60f1d67fae632743df9324301e3cb2682f54d4 08-Oct-2013 baker.zhang <baker.kernel@gmail.com> fib_trie: only calc for the un-first node

This is a enhancement.

for the first node in fib_trie, newpos is 0, bit is 1.
Only for the leaf or node with unmatched key need calc pos.

Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bbe34cf8a1a2cc174e6516fc230b91b531da7ddf 01-Oct-2013 baker.zhang <baker.kernel@gmail.com> fib_trie: avoid a redundant bit judgement in inflate

Because 'node' is the i'st child of 'oldnode',
thus, here 'i' equals
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits)

we just get 1 more bit,
and need not care the detail value of this bits.

I apologize for the mistake.

I generated the patch on a branch version,
and did not notice the put_child has been changed.

I have redone the test on HEAD version with my patch.

two cases are used.
case 1. inflate a node which has a leaf child node.
case 2: inflate a node which has a an child node with skipped bits

test env:
ip link set eth0 up
ip a add dev eth0 192.168.11.1/32
here, we just focus on route table(MAIN),
so I use a "192.168.11.1/32" address to simplify the test case.

call trace:
+ fib_insert_node
+ + trie_rebalance
+ + + resize
+ + + + inflate

Test case 1: inflate a node which has a leaf child node.

===========================================================
step 1. prepare a fib trie
------------------------------------------
ip r a 192.168.0.0/24 via 192.168.11.1
ip r a 192.168.1.0/24 via 192.168.11.1

we get a fib trie.
root@baker:~# cat /proc/net/fib_trie
Main:
+-- 192.168.0.0/23 1 0 0
|-- 192.168.0.0
/24 universe UNICAST
|-- 192.168.1.0
/24 universe UNICAST
Local:
.....

step 2. Add the third route
------------------------------------------
root@baker:~# ip r a 192.168.2.0/24 via 192.168.11.1

A fib_trie leaf will be inserted in fib_insert_node before trie_rebalance.

For function 'inflate':
'inflate' is called with following trie.
+-- 192.168.0.0/22 1 1 0 <=== tn node
+-- 192.168.0.0/23 1 0 0 <== node a
|-- 192.168.0.0
/24 universe UNICAST
|-- 192.168.1.0
/24 universe UNICAST
|-- 192.168.2.0 <== leaf(node b)

When process node b, which is a leaf. here:
i is 1,
node key "192.168.2.0"
oldnode is (pos:22, bits:1)

unpatch source:
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,2,0", 22 + 1, 1)

thus got 0, and call put_child(tn, 2*i, node); <== 2*i=2.

patched source:
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits + 1),
tkey_extract_bits("192.168,2,0", 22, 1 + 1) <== get 2.

Test case 2: inflate a node which has a an child node with skipped bits
==========================================================================
step 1. prepare a fib trie.
ip link set eth0 up
ip a add dev eth0 192.168.11.1/32
ip r a 192.168.128.0/24 via 192.168.11.1
ip r a 192.168.0.0/24 via 192.168.11.1
ip r a 192.168.16.0/24 via 192.168.11.1
ip r a 192.168.32.0/24 via 192.168.11.1
ip r a 192.168.48.0/24 via 192.168.11.1
ip r a 192.168.144.0/24 via 192.168.11.1
ip r a 192.168.160.0/24 via 192.168.11.1
ip r a 192.168.176.0/24 via 192.168.11.1

check:
root@baker:~# cat /proc/net/fib_trie
Main:
+-- 192.168.0.0/16 1 0 0
+-- 192.168.0.0/18 2 0 0
|-- 192.168.0.0
/24 universe UNICAST
|-- 192.168.16.0
/24 universe UNICAST
|-- 192.168.32.0
/24 universe UNICAST
|-- 192.168.48.0
/24 universe UNICAST
+-- 192.168.128.0/18 2 0 0
|-- 192.168.128.0
/24 universe UNICAST
|-- 192.168.144.0
/24 universe UNICAST
|-- 192.168.160.0
/24 universe UNICAST
|-- 192.168.176.0
/24 universe UNICAST
Local:
...

step 2. add a route to trigger inflate.
ip r a 192.168.96.0/24 via 192.168.11.1

This command will call serveral times inflate.
In the first time, the fib_trie is:
________________________
+-- 192.168.128.0/(16, 1) <== tn node
+-- 192.168.0.0/(17, 1) <== node a
+-- 192.168.0.0/(18, 2)
|-- 192.168.0.0
|-- 192.168.16.0
|-- 192.168.32.0
|-- 192.168.48.0
|-- 192.168.96.0
+-- 192.168.128.0/(18, 2) <== node b.
|-- 192.168.128.0
|-- 192.168.144.0
|-- 192.168.160.0
|-- 192.168.176.0

NOTE: node b is a interal node with skipped bits.
here,
i:1,
node->key "192.168.128.0",
oldnode:(pos:16, bits:1)
so
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16 + 1, 1) <=== 0

tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16, 1+1) <=== 2

2*i + 0 == 2, so the result is same.

Signed-off-by: baker.zhang <baker.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aab515d7c32a34300312416c50314e755ea6f765 05-Aug-2013 Eric Dumazet <edumazet@google.com> fib_trie: remove potential out of bound access

AddressSanitizer [1] dynamic checker pointed a potential
out of bound access in leaf_walk_rcu()

We could allocate one more slot in tnode_new() to leave the prefetch()
in-place but it looks not worth the pain.

Bug added in commit 82cfbb008572b ("[IPV4] fib_trie: iterator recode")

[1] :
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f585a991e1d1612265f0d4e812f77e40dd54975b 22-Jul-2013 Jerry Snitselaar <jerry.snitselaar@oracle.com> fib_trie: potential out of bounds access in trie_show_stats()

With the <= max condition in the for loop, it will be always go 1
element further than needed. If the condition for the while loop is
never met, then max is MAX_STAT_DEPTH, and for loop will walk off the
end of nodesizes[].

Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0020356355192cbaf6d315515e6c95bd09618c3b 05-May-2013 Al Viro <viro@ZenIV.linux.org.uk> fib_trie: no need to delay vfree()

Now that vfree() can be called from interrupt contexts, there's no
need to play games with schedule_work() to escape calling vfree()
from RCU callbacks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
b67bfe0d42cac56c512dd5da4b1b347a23f4b70a 28-Feb-2013 Sasha Levin <sasha.levin@oracle.com> hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ece31ffd539e8e2b586b1ca5f50bc4f4591e3893 18-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com> net: proc: change proc_net_remove to remove_proc_entry

proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d4beaa66add8aebf83ab16d2fde4e4de8dac36df 18-Feb-2013 Gao feng <gaofeng@cn.fujitsu.com> net: proc: change proc_net_fops_create to proc_create

Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bafa6d9d89072c1a18853afe9ee5de05c491c13a 07-Sep-2012 Nicolas Dichtel <nicolas.dichtel@6wind.com> ipv4/route: arg delay is useless in rt_cache_flush()

Since route cache deletion (89aef8921bfbac22f), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15e473046cb6e5d18a4d0057e61d76315230382b 07-Sep-2012 Eric W. Biederman <ebiederm@xmission.com> netlink: Rename pid to portid to avoid confusion

It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.

I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.

I have successfully built an allyesconfig kernel with this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4ccfe6d4109252dfadcd6885f33ed600ee03dbf8 07-Sep-2012 Nicolas Dichtel <nicolas.dichtel@6wind.com> ipv4/route: arg delay is useless in rt_cache_flush()

Since route cache deletion (89aef8921bfbac22f), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ad5b310228da567e35a2ea5dcb2fd62e3a36654e 13-Aug-2012 Igor Maravic <igorm@etf.rs> net: ipv4: fib_trie: Don't unnecessarily search for already found fib leaf

We've already found leaf, don't search for it again. Same is for fib leaf info.

Signed-off-by: Igor Maravic <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c03eca3d995e73d691edea8c787e25929ec156d 07-Aug-2012 Eric Dumazet <edumazet@google.com> net: fib: fix incorrect call_rcu_bh()

After IP route cache removal, I believe rcu_bh() has very little use and
we should remove this RCU variant, since it adds some cycles in fast
path.

Anyway, the call_rcu_bh() use in fib_true is obviously wrong, since
some users only assert rcu_read_lock().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
79cda75a107da0d49732b5cb642b456264dd7e0e 07-Aug-2012 Eric Dumazet <edumazet@google.com> fib: use __fls() on non null argument

__fls(x) is a bit faster than fls(x), granted we know x is non null.

As Ben Hutchings pointed out, fls(x) = __fls(x) + 1

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
61648d91fc278fd1d500da8061d17e6920cd3500 29-Jul-2012 Lin Ming <mlin@ss.pku.edu.cn> ipv4: clean up put_child

The first parameter struct trie *t is not used anymore.
Remove it.

Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
4ea4bf7ebcbacee2f4736d261efb0693e87a34c9 29-Jul-2012 Lin Ming <mlin@ss.pku.edu.cn> ipv4: fix debug info in tnode_new

It should print size of struct rt_trie_node * allocated instead of size
of struct rt_trie_node.

Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
391e5c22f5f4e55817f8ba18a08ea717ed2d4a1f 12-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Remove tb_peers from fib_table.

No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
8e77327783c753689a1a766ab9d301b81c2529f1 11-Jun-2012 David S. Miller <davem@davemloft.net> inet: Add inetpeer tree roots to the FIB tables.

Signed-off-by: David S. Miller <davem@davemloft.net>
e3192690a3c889767d1161b228374f4926d92af0 03-Jun-2012 Joe Perches <joe@perches.com> net: Remove casts to same type

Adding casts of objects to the same type is unnecessary
and confusing for a human reader.

For example, this cast:

int y;
int *p = (int *)&y;

I used the coccinelle script below to find and remove these
unnecessary casts. I manually removed the conversions this
script produces of casts with __force and __user.

@@
type T;
T *p;
@@

- (T *)p
+ p

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dccd9ecc374462e5d6a5b8f8110415a86c2213d8 11-May-2012 David S. Miller <davem@davemloft.net> ipv4: Do not use dead fib_info entries.

Due to RCU lookups and RCU based release, fib_info objects can
be found during lookup which have fi->fib_dead set.

We must ignore these entries, otherwise we risk dereferencing
the parts of the entry which are being torn down.

Reported-by: Yevgen Pronenko <yevgen.pronenko@sonymobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ffc93f203c18a70623f21950f1dd473c9ec48cd 28-Mar-2012 David Howells <dhowells@redhat.com> Remove all #inclusions of asm/system.h

Remove all #inclusions of asm/system.h preparatory to splitting and killing
it. Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
058bd4d2a4ff0aaa4a5381c67e776729d840c785 11-Mar-2012 Joe Perches <joe@perches.com> net: Convert printks to pr_<level>

Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini. Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
#define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
text data bss dec hex filename
887888 31558 249696 1169142 11d6f6 net/ipv4/built-in.o.new
887934 31558 249800 1169292 11d78c net/ipv4/built-in.o.old

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cf778b00e96df6d64f8e21b8395d1f8a859ecdc7 12-Jan-2012 Eric Dumazet <eric.dumazet@gmail.com> net: reintroduce missing rcu_assign_pointer() calls

commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) did a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).

We miss needed barriers, even on x86, when y is not NULL.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6fc01438a94702bd160cb1b89203d9b97ae68ced 25-Aug-2011 Florian Westphal <fw@strlen.de> net: ipv4: export fib_lookup and fib_table_lookup

The reverse path filter module will use fib_lookup.

If CONFIG_IP_MULTIPLE_TABLES is not set, fib_lookup is
only a static inline helper that calls fib_table_lookup,
so export that too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
bc3b2d7fb9b014d75ebb79ba371a763dbab5e8cf 15-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com> net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules

These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
a9b3cd7f323b2e57593e7215362a7b02fc933e3a 01-Aug-2011 Stephen Hemminger <shemminger@vyatta.com> rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER

When assigning a NULL value to an RCU protected pointer, no barrier
is needed. The rcu_assign_pointer, used to handle that but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

//smpl
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5c74501f76360ce6f410730b9b5e5976f38e8504 18-Jul-2011 Eric Dumazet <eric.dumazet@gmail.com> ipv4: save cpu cycles from check_leaf()

Compiler is not smart enough to avoid double BSWAP instructions in
ntohl(inet_make_mask(plen)).

Lets cache this value in struct leaf_info, (fill a hole on 64bit arches)

With route cache disabled, this saves ~2% of cpu in udpflood bench on
x86_64 machine.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
70c71606190e9115e5f8363bfcd164c582eb314a 22-May-2011 Paul Gortmaker <paul.gortmaker@windriver.com> Add appropriate <linux/prefetch.h> include for prefetch users

After discovering that wide use of prefetch on modern CPUs
could be a net loss instead of a win, net drivers which were
relying on the implicit inclusion of prefetch.h via the list
headers showed up in the resulting cleanup fallout. Give
them an explicit include via the following $0.02 script.

=========================================
#!/bin/bash
MANUAL=""
for i in `git grep -l 'prefetch(.*)' .` ; do
grep -q '<linux/prefetch.h>' $i
if [ $? = 0 ] ; then
continue
fi

( echo '?^#include <linux/?a'
echo '#include <linux/prefetch.h>'
echo .
echo w
echo q
) | ed -s $i > /dev/null 2>&1
if [ $? != 0 ]; then
echo $i needs manual fixup
MANUAL="$i $MANUAL"
fi
done
echo ------------------- 8\<----------------------
echo vi $MANUAL
=========================================

Signed-off-by: Paul <paul.gortmaker@windriver.com>
[ Fixed up some incorrect #include placements, and added some
non-network drivers and the fib_trie.c case - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
120a3d5c7c1f37752e7956d296a4f031f38cb217 23-May-2011 David S. Miller <davem@davemloft.net> ipv4: Include linux/prefetch.h in fib_trie.c

Signed-off-by: David S. Miller <davem@davemloft.net>
bceb0f4512d448763fe98c9f37504c98bbebbed6 18-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com> net,rcu: convert call_rcu(__leaf_info_free_rcu) to kfree_rcu()

The rcu callback __leaf_info_free_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(__leaf_info_free_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
7cfd260910b881250cde76ba92ebe3cbf8493a8f 01-May-2011 Alexey Dobriyan <adobriyan@gmail.com> ipv4: don't spam dmesg with "Using LC-trie" messages

fib_trie_table() is called during netns creation and
Chromium uses clone(CLONE_NEWNET) to sandbox renderer process.

Don't print anything.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
21d8c49e01a0c1c6eb6c750cd04110db4a539284 14-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Call fib_select_default() only when actually necessary.

fib_select_default() is a complete NOP, and completely pointless
to invoke, when we have no more than 1 default route installed.

And this is far and away the common case.

So remember how many prefixlen==0 routes we have in the routing
table, and elide the call when we have no more than one of those.

This cuts output route creation time by 157 cycles on Niagara2+.

In order to add the new int to fib_table, we have to correct the type
of ->tb_data[] to unsigned long, otherwise the private area will be
unaligned on 64-bit systems.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
0a5c047507aaaf00519921336d19c0f8f5f9f363 31-Mar-2011 Eric Dumazet <eric.dumazet@gmail.com> fib: add __rcu annotations

Add __rcu annotations and lockdep checks.

Add const qualifiers

node_parent() and node_parent_rcu() can use
rcu_dereference_index_check()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1fbc78439291627642517f15b9b91f3125588143 26-Mar-2011 Julian Anastasov <ja@ssi.bg> ipv4: do not ignore route errors

The "ipv4: Inline fib_semantic_match into check_leaf"
change forgets to return the route errors. check_leaf should
return the same results as fib_table_lookup.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
37e826c513883099c298317bad1b3b677b2905fb 25-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Fix nexthop caching wrt. scoping.

Move the scope value out of the fib alias entries and into fib_info,
so that we always use the correct scope when recomputing the nexthop
cached source address.

Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
74cb3c108bc0f599a4eb40980db8580cfba725c9 19-Mar-2011 Julian Anastasov <ja@ssi.bg> ipv4: match prefsrc when deleting routes

fib_table_delete forgets to match the routes by prefsrc.
Callers can specify known IP in fc_prefsrc and we should remove
the exact route. This is needed for cases when same local or
broadcast addresses are used in different subnets and the
routes differ only in prefsrc. All callers that do not provide
fc_prefsrc will ignore the route prefsrc as before and will
delete the first occurence. That is how the ip route del default
magic works.

Current callers are:

- ip_rt_ioctl where rtentry_to_fib_config provides fc_prefsrc only
when the provided device name matches IP label with colon.

- inet_rtm_delroute where RTA_PREFSRC is optional too

- fib_magic which deals with routes when deleting addresses
and where the fc_prefsrc is always set with the primary IP
for the concerned IFA.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
22bd5b9b13f2931ac80949f8bfbc40e8cab05be7 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Pass ipv4 flow objects into fib_lookup() paths.

To start doing these conversions, we need to add some temporary
flow4_* macros which will eventually go away when all the protocol
code paths are changed to work on AF specific flowi objects.

Signed-off-by: David S. Miller <davem@davemloft.net>
1d28f42c1bd4bb2363d88df74d0128b4da135b4a 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>
3be0686b6e2f953afe83626e871b4a7b0ceae49b 08-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Inline fib_semantic_match into check_leaf

This elimiates a lot of pure overhead due to parameter
passing.

Signed-off-by: David S. Miller <davem@davemloft.net>
3b004569d86d02786ebae496e75dc0b625be3e9a 16-Feb-2011 David S. Miller <davem@davemloft.net> ipv4: Avoid use of signed integers in fib_trie code.

GCC emits all kinds of crazy zero extensions when we go from signed
int, to unsigned short, etc. etc.

This transformation has to be legal because:

1) In tkey_extract_bits() in mask_pfx(), the values are used to
perform shifts, on which negative values are undefined by C.

2) In fib_table_lookup() we perform comparisons with unsigned
values, constants, and additions. None of which should
encounter negative values.

Signed-off-by: David S. Miller <davem@davemloft.net>
b299e4f001cfa16205f9121f4630970049652268 03-Feb-2011 David S. Miller <davem@davemloft.net> ipv4: Fix fib_trie build in some configurations.

If we end up including include/linux/node.h (either explicitly
or implicitly) that header has a definition of "structt node"
too.

So rename the one we use in fib_trie to "rt_trie_node" to avoid
the conflict.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5348ba85a02ffe80a8af33a524b6610966760d3d 02-Feb-2011 David S. Miller <davem@davemloft.net> ipv4: Update some fib_hash centric interface names.

fib_hash_init() --> fib_trie_init()
fib_hash_table() --> fib_trie_table()

Signed-off-by: David S. Miller <davem@davemloft.net>
0c838ff1ade71162775afffd9e5c6478a60bdca6 01-Feb-2011 David S. Miller <davem@davemloft.net> ipv4: Consolidate all default route selection implementations.

Both fib_trie and fib_hash have a local implementation of
fib_table_select_default(). This is completely unnecessary
code duplication.

Since we now remember the fib_table and the head of the fib
alias list of the default route, we can implement one single
generic version of this routine.

Looking at the fib_hash implementation you may get the impression
that it's possible for there to be multiple top-level routes in
the table for the default route. The truth is, it isn't, the
insert code will only allow one entry to exist in the zero
prefix hash table, because all keys evaluate to zero and all
keys in a hash table must be unique.

Signed-off-by: David S. Miller <davem@davemloft.net>
5b4704419cbd0b7597a91c19f9e8e8b17c1af071 01-Feb-2011 David S. Miller <davem@davemloft.net> ipv4: Remember FIB alias list head and table in lookup results.

This will be used later to implement fib_select_default() in a
completely generic manner, instead of the current situation where the
default route is re-looked up in the TRIE/HASH table and then the
available aliases are analyzed.

Signed-off-by: David S. Miller <davem@davemloft.net>
7a1c8e5ab120a5f352e78bbc1fa5bb64e6f23639 20-Nov-2010 Eric Dumazet <eric.dumazet@gmail.com> net: allow GFP_HIGHMEM in __vmalloc()

We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.

In ceph, add the missing flag.

In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
cleaner and allows using HIGHMEM pages as well.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4aa2c466a7733af093a526e9d1cdd0b3b90d47e9 28-Oct-2010 Pavel Emelyanov <xemul@parallels.com> fib: Fix fib zone and its hash leak on namespace stop

When we stop a namespace we flush the table and free one, but the
added fn_zone-s (and their hashes if grown) are leaked. Need to free.
Tries releases all its stuff in the flushing code.

Shame on us - this bug exists since the very first make-fib-per-net
patches in 2.6.27 :(

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9b0c290e78d667e6a483bde8c7cef7dd15f49017 21-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> fib: introduce fib_alias_accessed() helper

Perf tools session at NFWS 2010 pointed out a false sharing on struct
fib_alias that can be avoided pretty easily, if we set FA_S_ACCESSED bit
only if needed (ie : not already set)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
631dd1a885b6d7e9f6f51b4e5b311c2bb04c323c 18-Oct-2010 Justin P. Mattock <justinmattock@gmail.com> Update broken web addresses in the kernel.

The patch below updates broken web addresses in the kernel

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Dimitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mike Frysinger <vapier.adi@gmail.com>
Acked-by: Ben Pfaff <blp@cs.stanford.edu>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Reviewed-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
874ffa8f72444d6253d2669fed304875c128f86b 13-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> fib_trie: use fls() instead of open coded loop

fib_table_lookup() might use fls() to speedup an open coded loop.

Noticed while doing a profile analysis.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ebc0ffae5dfb4447e0a431ffe7fe1d467c48bbb9 05-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> fib: RCU conversion of fib_lookup()

fib_lookup() converted to be called in RCU protected context, no
reference taken and released on a contended cache line (fib_clntref)

fib_table_lookup() and fib_semantic_match() get an additional parameter.

struct fib_info gets an rcu_head field, and is freed after an rcu grace
period.

Stress test :
(Sending 160.000.000 UDP frames on same neighbour,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_HASH) (about same results for FIB_TRIE)

Before patch :

real 1m31.199s
user 0m13.761s
sys 23m24.780s

After patch:

real 1m5.375s
user 0m14.997s
sys 15m50.115s

Before patch Profile :

13044.00 15.4% __ip_route_output_key vmlinux
8438.00 10.0% dst_destroy vmlinux
5983.00 7.1% fib_semantic_match vmlinux
5410.00 6.4% fib_rules_lookup vmlinux
4803.00 5.7% neigh_lookup vmlinux
4420.00 5.2% _raw_spin_lock vmlinux
3883.00 4.6% rt_set_nexthop vmlinux
3261.00 3.9% _raw_read_lock vmlinux
2794.00 3.3% fib_table_lookup vmlinux
2374.00 2.8% neigh_resolve_output vmlinux
2153.00 2.5% dst_alloc vmlinux
1502.00 1.8% _raw_read_lock_bh vmlinux
1484.00 1.8% kmem_cache_alloc vmlinux
1407.00 1.7% eth_header vmlinux
1406.00 1.7% ipv4_dst_destroy vmlinux
1298.00 1.5% __copy_from_user_ll vmlinux
1174.00 1.4% dev_queue_xmit vmlinux
1000.00 1.2% ip_output vmlinux

After patch Profile :

13712.00 15.8% dst_destroy vmlinux
8548.00 9.9% __ip_route_output_key vmlinux
7017.00 8.1% neigh_lookup vmlinux
4554.00 5.3% fib_semantic_match vmlinux
4067.00 4.7% _raw_read_lock vmlinux
3491.00 4.0% dst_alloc vmlinux
3186.00 3.7% neigh_resolve_output vmlinux
3103.00 3.6% fib_table_lookup vmlinux
2098.00 2.4% _raw_read_lock_bh vmlinux
2081.00 2.4% kmem_cache_alloc vmlinux
2013.00 2.3% _raw_spin_lock vmlinux
1763.00 2.0% __copy_from_user_ll vmlinux
1763.00 2.0% ip_output vmlinux
1761.00 2.0% ipv4_dst_destroy vmlinux
1631.00 1.9% eth_header vmlinux
1440.00 1.7% _raw_read_unlock_bh vmlinux

Reference results, if IP route cache is enabled :

real 0m29.718s
user 0m10.845s
sys 7m37.341s

25213.00 29.5% __ip_route_output_key vmlinux
9011.00 10.5% dst_release vmlinux
4817.00 5.6% ip_push_pending_frames vmlinux
4232.00 5.0% ip_finish_output vmlinux
3940.00 4.6% udp_sendmsg vmlinux
3730.00 4.4% __copy_from_user_ll vmlinux
3716.00 4.4% ip_route_output_flow vmlinux
2451.00 2.9% __xfrm_lookup vmlinux
2221.00 2.6% ip_append_data vmlinux
1718.00 2.0% _raw_spin_lock_bh vmlinux
1655.00 1.9% __alloc_skb vmlinux
1572.00 1.8% sock_wfree vmlinux
1345.00 1.6% kfree vmlinux

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a034ee3cca5726b14107f281f4bed1c0fd44472a 10-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> fib: cleanups

Use rcu_dereference_rtnl() helper

Change hard coded constants in fib_flag_trans()
7 -> RTN_UNREACHABLE
8 -> RTN_PROHIBIT

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f6b085b69d1cbbd62f49f34e71a3d58cb6d34b7e 07-Sep-2010 Jarek Poplawski <jarkao2@gmail.com> ipv4: Suppress lockdep-RCU false positive in FIB trie (3)

Hi,
Here is one more of these warnings and a patch below:

Sep 5 23:52:33 del kernel: [46044.244833] ===================================================
Sep 5 23:52:33 del kernel: [46044.269681] [ INFO: suspicious rcu_dereference_check() usage. ]
Sep 5 23:52:33 del kernel: [46044.277000] ---------------------------------------------------
Sep 5 23:52:33 del kernel: [46044.285185] net/ipv4/fib_trie.c:1756 invoked rcu_dereference_check() without protection!
Sep 5 23:52:33 del kernel: [46044.293627]
Sep 5 23:52:33 del kernel: [46044.293632] other info that might help us debug this:
Sep 5 23:52:33 del kernel: [46044.293634]
Sep 5 23:52:33 del kernel: [46044.325333]
Sep 5 23:52:33 del kernel: [46044.325335] rcu_scheduler_active = 1, debug_locks = 0
Sep 5 23:52:33 del kernel: [46044.348013] 1 lock held by pppd/1717:
Sep 5 23:52:33 del kernel: [46044.357548] #0: (rtnl_mutex){+.+.+.}, at: [<c125dc1f>] rtnl_lock+0xf/0x20
Sep 5 23:52:33 del kernel: [46044.367647]
Sep 5 23:52:33 del kernel: [46044.367652] stack backtrace:
Sep 5 23:52:33 del kernel: [46044.387429] Pid: 1717, comm: pppd Not tainted 2.6.35.4.4a #3
Sep 5 23:52:33 del kernel: [46044.398764] Call Trace:
Sep 5 23:52:33 del kernel: [46044.409596] [<c12f9aba>] ? printk+0x18/0x1e
Sep 5 23:52:33 del kernel: [46044.420761] [<c1053969>] lockdep_rcu_dereference+0xa9/0xb0
Sep 5 23:52:33 del kernel: [46044.432229] [<c12b7235>] trie_firstleaf+0x65/0x70
Sep 5 23:52:33 del kernel: [46044.443941] [<c12b74d4>] fib_table_flush+0x14/0x170
Sep 5 23:52:33 del kernel: [46044.455823] [<c1033e92>] ? local_bh_enable_ip+0x62/0xd0
Sep 5 23:52:33 del kernel: [46044.467995] [<c12fc39f>] ? _raw_spin_unlock_bh+0x2f/0x40
Sep 5 23:52:33 del kernel: [46044.480404] [<c12b24d0>] ? fib_sync_down_dev+0x120/0x180
Sep 5 23:52:33 del kernel: [46044.493025] [<c12b069d>] fib_flush+0x2d/0x60
Sep 5 23:52:33 del kernel: [46044.505796] [<c12b06f5>] fib_disable_ip+0x25/0x50
Sep 5 23:52:33 del kernel: [46044.518772] [<c12b10d3>] fib_netdev_event+0x73/0xd0
Sep 5 23:52:33 del kernel: [46044.531918] [<c1048dfd>] notifier_call_chain+0x2d/0x70
Sep 5 23:52:33 del kernel: [46044.545358] [<c1048f0a>] raw_notifier_call_chain+0x1a/0x20
Sep 5 23:52:33 del kernel: [46044.559092] [<c124f687>] call_netdevice_notifiers+0x27/0x60
Sep 5 23:52:33 del kernel: [46044.573037] [<c124faec>] __dev_notify_flags+0x5c/0x80
Sep 5 23:52:33 del kernel: [46044.586489] [<c124fb47>] dev_change_flags+0x37/0x60
Sep 5 23:52:33 del kernel: [46044.599394] [<c12a8a8d>] devinet_ioctl+0x54d/0x630
Sep 5 23:52:33 del kernel: [46044.612277] [<c12aabb7>] inet_ioctl+0x97/0xc0
Sep 5 23:52:34 del kernel: [46044.625208] [<c123f6af>] sock_ioctl+0x6f/0x270
Sep 5 23:52:34 del kernel: [46044.638046] [<c109d2b0>] ? handle_mm_fault+0x420/0x6c0
Sep 5 23:52:34 del kernel: [46044.650968] [<c123f640>] ? sock_ioctl+0x0/0x270
Sep 5 23:52:34 del kernel: [46044.663865] [<c10c3188>] vfs_ioctl+0x28/0xa0
Sep 5 23:52:34 del kernel: [46044.676556] [<c10c38fa>] do_vfs_ioctl+0x6a/0x5c0
Sep 5 23:52:34 del kernel: [46044.688989] [<c1048676>] ? up_read+0x16/0x30
Sep 5 23:52:34 del kernel: [46044.701411] [<c1021376>] ? do_page_fault+0x1d6/0x3a0
Sep 5 23:52:34 del kernel: [46044.714223] [<c10b6588>] ? fget_light+0xf8/0x2f0
Sep 5 23:52:34 del kernel: [46044.726601] [<c1241f98>] ? sys_socketcall+0x208/0x2c0
Sep 5 23:52:34 del kernel: [46044.739140] [<c10c3eb3>] sys_ioctl+0x63/0x70
Sep 5 23:52:34 del kernel: [46044.751967] [<c12fca3d>] syscall_call+0x7/0xb
Sep 5 23:52:34 del kernel: [46044.764734] [<c12f0000>] ? cookie_v6_check+0x3d0/0x630

-------------->

This patch fixes the warning:
===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/ipv4/fib_trie.c:1756 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by pppd/1717:
#0: (rtnl_mutex){+.+.+.}, at: [<c125dc1f>] rtnl_lock+0xf/0x20

stack backtrace:
Pid: 1717, comm: pppd Not tainted 2.6.35.4a #3
Call Trace:
[<c12f9aba>] ? printk+0x18/0x1e
[<c1053969>] lockdep_rcu_dereference+0xa9/0xb0
[<c12b7235>] trie_firstleaf+0x65/0x70
[<c12b74d4>] fib_table_flush+0x14/0x170
...

Allow trie_firstleaf() to be called either under rcu_read_lock()
protection or with RTNL held. The same annotation is added to
node_parent_rcu() to prevent a similar warning a bit later.

Followup of commits 634a4b20 and 4eaa0e3c.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3fa21e07e6acefa31f974d57fba2b6920a7ebd1a 18-May-2010 Joe Perches <joe@perches.com> net: Remove unnecessary returns from void function()s

This patch removes from net/ (but not any netfilter files)
all the unnecessary return; statements that precede the
last closing brace of void functions.

It does not remove the returns that are immediately
preceded by a label as gcc doesn't like that.

Done via:
$ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4eaa0e3c869acd5dbc7c2e3818a9ae9cbf221d27 15-Apr-2010 Eric Dumazet <eric.dumazet@gmail.com> fib: suppress lockdep-RCU false positive in FIB trie.

Followup of commit 634a4b20

Allow tnode_get_child_rcu() to be called either under rcu_read_lock()
protection or with RTNL held.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
634a4b2038a6eba4c211fb906fa2f6ec9a4bbfc7 22-Mar-2010 Paul E. McKenney <paulmck@linux.vnet.ibm.com> net: suppress lockdep-RCU false positive in FIB trie.

Allow fib_find_node() to be called either under rcu_read_lock()
protection or with RTNL held.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
16c6cf8bb471392fd09b48b7c27e7d83a446b4bc 20-Sep-2009 Stephen Hemminger <shemminger@vyatta.com> ipv4: fib table algorithm performance improvement

The FIB algorithim for IPV4 is set at compile time, but kernel goes through
the overhead of function call indirection at runtime. Save some
cycles by turning the indirect calls to direct calls to either
hash or trie code.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
80b71b80df14d885f7e50e115c1348398f418759 29-Aug-2009 Jens Låås <jens.laas@its.uu.se> fib_trie: resize rework

Here is rework and cleanup of the resize function.

Some bugs we had. We were using ->parent when we should use
node_parent(). Also we used ->parent which is not assigned by
inflate in inflate loop.

Also a fix to set thresholds to power 2 to fit halve
and double strategy.

max_resize is renamed to max_work which better indicates
it's function.

Reaching max_work is not an error, so warning is removed.
max_work only limits amount of work done per resize.
(limits CPU-usage, outstanding memory etc).

The clean-up makes it relatively easy to add fixed sized
root-nodes if we would like to decrease the memory pressure
on routers with large routing tables and dynamic routing.
If we'll need that...

Its been tested with 280k routes.

Work done together with Robert Olsson.

Signed-off-by: Jens Låås <jens.laas@its.uu.se>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
36cbd3dcc10384f813ec0814255f576c84f2bcd4 05-Aug-2009 Jan Engelhardt <jengelh@medozas.de> net: mark read-only arrays as const

String literals are constant, and usually, we can also tag the array
of pointers const too, moving it to the .rodata section.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
b902e5735272b6a79fe2853180b2ad6658aa9678 14-Jul-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups

While looking for other fib_trie problems reported by Pawel Staszewski
I noticed there are a few uses of tnode_get_child() and node_parent()
in lookups instead of their rcu versions.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
be916cdebe4dc720a23b1a9bb589f2c22afd6589 14-Jul-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: Fix inflate_threshold_root automatically

During large updates there could be triggered warnings like: "Fix
inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
node isn't finished in 10 loops. It should be much rarer now, after
changing the threshold from 15 to 25, and a temporary problem, so
this patch tries to handle it automatically using a fix variable to
increase by one inflate threshold for next root resizes (up to the 35
limit, max fix = 10). The fix variable is decreased when root's
inflate() finishes below 7 loops (even if some other, smaller table/
trie is updated -- for simplicity the fix variable is global for now).

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
c3059477fce2d956a0bb3e04357324780c5d8eeb 14-Jul-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: Use synchronize_rcu() during trie_rebalance()

During trie_rebalance() we free memory after resizing with call_rcu(),
but large updates, especially with PREEMPT_NONE configs, can cause
memory stresses, so this patch calls synchronize_rcu() in
tnode_free_flush() after each sync_pages to guarantee such freeing
(especially before resizing the root node).

The value of sync_pages = 128 is based on Pawel Staszewski's tests as
the lowest which doesn't hinder updating times. (For testing purposes
there was a sysfs module parameter to change it on demand, but it's
removed until we're sure it could be really useful.)

The patch is based on suggestions by: Paul E. McKenney
<paulmck@linux.vnet.ibm.com>

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
345aa031207d02d7438c1aa96ed9315911ecd745 08-Jul-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
Aver depth: 2.28
Max depth: 6
Leaves: 276539
Prefixes: 289922
Internal nodes: 66762
1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5
9: 1 18: 1
Pointers: 691228
Null ptrs: 347928
Total size: 35709 kB
</blockquote>

It seems, the current threshold for root resizing is too aggressive,
and it causes misleading warnings during big updates, but it might be
also responsible for memory problems, especially with non-preempt
configs, when RCU freeing is delayed long after call_rcu.

It should be also mentioned that because of non-atomic changes during
resizing/rebalancing the current lookup algorithm can miss valid leaves
so it's additional argument to shorten these activities even at a cost
of a minimally longer searching.

This patch restores values before the patch "[IPV4]: fib_trie root
node settings", commit: 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5 from
v2.6.22.

Pawel's report:
<blockquote>
I dont see any big change of (cpu load or faster/slower
routing/propagating routes from bgpd or something else) - in avg there
is from 2% to 3% more of CPU load i dont know why but it is - i change
from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
1 30"
always avg cpu load was from 2 to 3% more compared to "no preempt"
[...]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277814
Prefixes: 291306
Internal nodes: 66420
1: 32737 2: 14850 3: 10332 4: 4871 5: 2313 6: 942 7: 371 8: 3 17: 1
Pointers: 599098
Null ptrs: 254865
Total size: 18067 kB
</blockquote>

According to this and other similar reports average depth is slightly
increased (~0.2), and root nodes are shorter (log 17 vs. 18), but
there is no visible performance decrease. So, until memory handling is
improved or added parameters for changing this individually, this
patch resets to safer defaults.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08 30-Jun-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: Fix fib_trie rebalancing, part 3

Alas current delaying of freeing old tnodes by RCU in trie_rebalance
is still not enough because we can free a top tnode before updating a
t->trie pointer.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7b85576d15bf2574b0a451108f59f9ad4170dd3f 18-Jun-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: Fix fib_trie rebalancing, part 2

My previous patch, which explicitly delays freeing of tnodes by adding
them to the list to flush them after the update is finished, isn't
strict enough. It treats exceptionally tnodes without parent, assuming
they are newly created, so "invisible" for the read side yet.

But the top tnode doesn't have parent as well, so we have to exclude
all exceptions (at least until a better way is found). Additionally we
need to move rcu assignment of this node before flushing, so the
return type of the trie_rebalance() function is changed.

Reported-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f 15-Jun-2009 Jarek Poplawski <jarkao2@gmail.com> ipv4: Fix fib_trie rebalancing

While doing trie_rebalance(): resize(), inflate(), halve() RCU free
tnodes before updating their parents. It depends on RCU delaying the
real destruction, but if RCU readers start after call_rcu() and before
parent update they could access freed memory.

It is currently prevented with preempt_disable() on the update side,
but it's not safe, except maybe classic RCU, plus it conflicts with
memory allocations with GFP_KERNEL flag used from these functions.

This patch explicitly delays freeing of tnodes by adding them to the
list, which is flushed after the update is finished.

Reported-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ed18d76d959e5cbfa5d70c8f7ba95476582a556 22-May-2009 Robert Olsson <robert.olsson@its.uu.se> ipv4: Fix oops with FIB_TRIE

It seems we can fix this by disabling preemption while we re-balance the
trie. This is with the CONFIG_CLASSIC_RCU. It's been stress-tested at high
loads continuesly taking a full BGP table up/down via iproute -batch.

Note. fib_trie is not updated for CONFIG_PREEMPT_RCU

Reported-by: Andrei Popa
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
e204a345a0b11c6522ec06e62dee71628a492408 18-May-2009 Rami Rosen <ramirose@gmail.com> ipv4: cleanup - remove two unused parameters from fib_semantic_match().

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
673d57e72398edfedc93fb50ff58048077c9d587 31-Oct-2008 Harvey Harrison <harvey.harrison@gmail.com> net: replace NIPQUAD() in net/ipv4/ net/ipv6/

Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b6fcbdb4f283f7ba67cec3cda6be23da8e959031 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> proc: consolidate per-net single-release callers

They are symmetrical to single_open ones :)

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
de05c557b24c7dffc6d392e3db120cf11c9f6ae7 18-Jul-2008 Pavel Emelyanov <xemul@openvz.org> proc: consolidate per-net single_open callers

There are already 7 of them - time to kill some duplicate code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2e655571c618434c24ac2ca989374fdd84470d6d 11-Jul-2008 Ben Hutchings <bhutchings@solarflare.com> ipv4: fib_trie: Fix lookup error return

In commit a07f5f508a4d9728c8e57d7f66294bf5b254ff7f "[IPV4] fib_trie: style
cleanup", the changes to check_leaf() and fn_trie_lookup() were wrong - where
fn_trie_lookup() would previously return a negative error value from
check_leaf(), it now returns 0.

Now fn_trie_lookup() doesn't appear to care about plen, so we can revert
check_leaf() to returning the error value.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Tested-by: William Boughton <bill@boughton.de>
Acked-by: Stephen Heminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
76e6ebfb40a2455c18234dcb0f9df37533215461 06-Jul-2008 Denis V. Lunev <den@openvz.org> netns: add namespace parameter to rt_cache_flush

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0b040829952d84bf2a62526f0e24b624e0699447 11-Jun-2008 Adrian Bunk <bunk@kernel.org> net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5e659e4cb0eedacdc1f621a61e400a4611ddef8a 24-Apr-2008 Pavel Emelyanov <xemul@openvz.org> [NET]: Fix heavy stack usage in seq_file output routines.

Plan C: we can follow the Al Viro's proposal about %n like in this patch.
The same applies to udp, fib (the /proc/net/route file), rt_cache and
sctp debug. This is minus ~150-200 bytes for each.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a7d632b6b4ad1c92746ed409e41f9dc571ec04e2 14-Apr-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [IPV4]: Use NIPQUAD_FMT to format ipv4 addresses.

And use %u to format port.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
387a5487f5a1f8bfc3b2c5818e50dfd19eeb4f3f 10-Apr-2008 Stephen Hemminger <shemminger@vyatta.com> ipv4: fib_trie leaf free optimization

Avoid unneeded test in the case where object to be freed
has to be a leaf. Don't need to use the generic tnode_free()
function, instead just setup leaf to be freed.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ef3660ce0649fa10265455f539b72607cff53d02 10-Apr-2008 Stephen Hemminger <shemminger@vyatta.com> ipv4: fib_trie remove unused argument

The trie pointer is passed down to flush_list and flush_leaf
but never used.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15be75cdb5db442d0e33d37b20832b88f3ccd383 10-Apr-2008 Stephen Hemminger <shemminger@vyatta.com> IPV4: fib_trie use vmalloc for large tnodes

Use vmalloc rather than alloc_pages to avoid wasting memory.
The problem is that tnode structure has a power of 2 sized array,
plus a header. So the current code wastes almost half the memory
allocated because it always needs the next bigger size to hold
that small header.

This is similar to an earlier patch by Eric, but instead of a list
and lock, I used a workqueue to handle the fact that vfree can't
be done in interrupt context.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1218854afa6f659be90b748cf1bc7badee954a35 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit seq_net_private->net without CONFIG_NET_NS.

Without CONFIG_NET_NS, no namespace other than &init_net exists,
no need to store net in seq_net_private.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
3d3b2d25a4debaff05a9e6f5c55a0d31e4334234 24-Mar-2008 Stephen Hemminger <shemminger@vyatta.com> fib_trie: print information on all routing tables

Make /proc/net/fib_trie and /proc/net/fib_triestat display
all routing tables, not just local and main.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6440cc9e0f48ade57af7be28008cbfa6a991f287 23-Mar-2008 Stephen Hemminger <shemminger@vyatta.com> [IPV4] fib_trie: fix warning from rcu_assign_poinger

This gets rid of a warning caused by the test in rcu_assign_pointer.
I tried to fix rcu_assign_pointer, but that devolved into a long set
of discussions about doing it right that came to no real solution.
Since the test in rcu_assign_pointer for constant NULL would never
succeed in fib_trie, just open code instead.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8315f5d80a90247bf92232f92ca49933ac49327b 12-Feb-2008 Stephen Hemminger <shemminger@vyatta.com> fib_trie: /proc/net/route performance improvement

Use key/offset caching to change /proc/net/route (use by iputils route)
from O(n^2) to O(n). This improves performance from 30sec with 160,000
routes to 1sec.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ec28cf738d899e9d0652108e1986101771aacb2e 12-Feb-2008 Stephen Hemminger <shemminger@vyatta.com> fib_trie: handle empty tree

This fixes possible problems when trie_firstleaf() returns NULL
to trie_leafindex().

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b9c4d82a853713d49ac53b507964d7cf30ee408d 05-Feb-2008 Denis V. Lunev <den@openvz.org> [IPV4]: Formatting fix for /proc/net/fib_trie.

The line in the /proc/net/fib_trie for route with TOS specified
- has extra \n at the end
- does not have a space after route scope
like below.
|-- 1.1.1.1
/32 universe UNICASTtos =1

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
71d67e666e73e3b7e9ef124745ee2e454ac04be8 01-Feb-2008 Stephen Hemminger <shemminger@vyatta.com> [IPV4] fib_trie: rescan if key is lost during dump

Normally during a dump the key of the last dumped entry is used for
continuation, but since lock is dropped it might be lost. In that case
fallback to the old counter based N^2 behaviour. This means the dump
will end up skipping some routes which matches what FIB_HASH does.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
936f6f8e1bc46834bbb3e3fa3ac13ab44f1e7ba6 29-Jan-2008 Julian Anastasov <ja@ssi.bg> [IPV4] fib_trie: apply fixes from fib_hash

Update fib_trie with some fib_hash fixes:
- check for duplicate alternative routes for prefix+tos+priority when
replacing route
- properly insert by matching tos together with priority
- fix alias walking to use list_for_each_entry_continue for insertion
and deletion when fa_head is not NULL
- copy state from fa to new_fa on replace (not a problem for now)
- additionally, avoid replacement without error if new route is same,
as Joonwoo Park suggests.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac97f75faae2a18648145bc6bbcdd326bac6a1c2 24-Jan-2008 Stephen Hemminger <shemminger@vyatta.com> [IPV4] fib_trie: remove unneeded NULL check

Since fib_route_seq_show now uses hlist_for_each_entry(), the leaf
info can not be NULL.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f638a2f0579f74dc93d7da4299146e2822c06889 24-Jan-2008 Stephen Hemminger <shemminger@vyatta.com> [IPV4] fib_trie: More whitespace cleanup.

Remove extra blank lines.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d5ce8a0e97073169b5fe0b7c52bd020cdb017dfa 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: avoid rescan on dump

This converts dumping (and flushing) of large route tables form O(N^2)
to O(N). If the route dump took multiple pages then the dump routine
gets called again. The old code kept track of location by counter, the
new code instead uses the last key.

This is a really big win ( 0.3 sec vs 12 sec) for big route tables.

One side effect is that if the table changes during the dump, then the
last key will not be found, and we will return -EBUSY.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9195bef7fb0ba0a91d5ffa566bcf8e007e3c7172 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: avoid extra search on delete

Get rid of extra search that made route deletion O(n).

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a88ee229253b31e3a844b30525ff77fbfe3111d3 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: dump table in sorted order

It is easier with TRIE to dump the data traversal rather than
interating over every possible prefix. This saves some time and makes
the dump come out in sorted order.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
82cfbb008572b1a953091ef78f767aa3ca213092 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: iterator recode

Remove the complex loop structure of nextleaf() and replace it with a
simpler tree walker. This improves the performance and is much
cleaner.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
64347f786d13349d6a6f812f3a83c269e26c0136 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: dump message multiple part flag

Match fib_hash, and set NLM_F_MULTI to handle multiple part messages.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1328042e268c936189f15eba5bd9a5a4605a8581 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: use hash list

The code to dump can use the existing hash chain rather than doing
repeated lookup.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
936722922f6d2366378de606a40c14f96915474d 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: compute size when needed

Compute the number of prefixes when needed, rather than doing bookeeping.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a07f5f508a4d9728c8e57d7f66294bf5b254ff7f 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: style cleanup

Style cleanups:
* make check_leaf return -1 or plen, rather than by reference
* Get rid of #ifdef that is always set
* split out embedded function calls in if statements.
* checkpatch warnings

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bc3c8c1e02ae89668239742fd592f21e1998fa46 23-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: put leaf nodes in a slab cache

This improves locality for operations that touch all the leaves. Save
space since these entries don't need to be hardware cache aligned.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b59cfbf77dc8368c2c90b012c79553613f4d70c3 18-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [FIB]: Fix rcu_dereference() abuses in fib_trie.c

node_parent() and tnode_get_child() currently use rcu_dereference().

These functions are called from both
- readers only paths (where rcu_dereference() is needed), and
- writer path (where rcu_dereference() is not needed)

To make explicit where rcu_dereference() is really needed, I
introduced new node_parent_rcu() and tnode_get_child_rcu() functions
which use rcu_dereference(), while node_parent() and tnode_get_child()
dont use it.

Then I changed calling sites where rcu_dereference() was really needed
to call the _rcu() variants.

This should have no impact but for alpha architecture, and may help
future sparse checks.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7f9b80529b8a2ad8b3273b15fb444a0e34b760a9 15-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4]: fib hash|trie initialization

Initialization of the slab cache's should be done when IP is
initialized to make sure of available memory, and that code can be
marked __init.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d717a9a62049a03e85c3c2dd3399416eeb34a8be 15-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: size and statistics

Show number of entries in trie, the size field was being set but never used,
but it only counted leaves, not all entries. Refactor the two cases in
fib_triestat_seq_show into a single routine.

Note: the stat structure was being malloc'd but the stack usage isn't so
high (288 bytes) that it is worth the additional complexity.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
28d36e3702fcbed73c38e877bcf2a8f8946b7f3d 15-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [FIB]: Avoid using static variables without proper locking

fib_trie_seq_show() uses two helper functions, rtn_scope() and
rtn_type() that can write to static storage without locking.

Just pass to them a temporary buffer to avoid potential corruption
(probably not triggerable but still...)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8d96544475b236a0f319e492f4828aa8c0801c7f 14-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [FIB]: full_children & empty_children should be uint, not ushort

If declared as unsigned short, these fields can overflow, and whole
trie logic is broken. I could not make the machine crash, but some
tnode can never be freed.

Note for 64 bit arches : By reordering t_key and parent in [node,
leaf, tnode] structures, we can use 32 bits hole after t_key so that
sizeof(struct tnode) doesnt change after this patch.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
4dde4610c4ab54e9d36a4afaa98c23b017f7f9e3 13-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [IPV4] fib_trie: removes a memset() call in tnode_new()

tnode_alloc() already clears allocated memory, using kcalloc() or
alloc_pages(GFP_KERNEL|__GFP_ZERO, ...)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
112d8cfcbf4f5ef0cf669cb5864f1206972076d6 13-Jan-2008 Eric Dumazet <dada1@cosmosbay.com> [FIB]: Reduce text size of net/ipv4/fib_trie.o

In struct tnode, we use two fields of 5 bits for 'pos' and 'bits'.
Switching to plain 'unsigned char' (8 bits) take the same space
because of compiler alignments, and reduce text size by 435 bytes
on i386.

On i386 :
$ size net/ipv4/fib_trie.o.before_patch net/ipv4/fib_trie.o
text data bss dec hex filename
13714 4 64 13782 35d6 net/ipv4/fib_trie.o.before
13279 4 64 13347 3423 net/ipv4/fib_trie.o

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c95aaf9af5a1f6dee56d1f2ab4915cd722d608da 13-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: Fix sparse warnings.

Make FIB TRIE go through sparse checker without warnings.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
66a2f7fd2fddee1ddc5d1d286cd832e50a97258e 13-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: Add statistics.

The FIB TRIE code has a bunch of statistics, but the code is hidden
behind an ifdef that was never implemented. Since it was dead code, it
was broken as well.

This patch fixes that by making it a config option.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fea86ad8123df0d49188cbc1dd2f48da6ae49d65 13-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: fib_insert_node cleanup

The only error from fib_insert_node is if memory allocation fails, so
instead of passing by reference, just use the convention of returning
NULL.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
187b5188a78694fa6608fa1252d5197a7b3ab076 13-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: Use %u for unsigned printfs.

Use %u instead of %d when printing unsigned values.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
93e4308b3bea445dc2d3e3c1897a93fe111eba17 13-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: Get rid of unused revision element.

The revision element must of been part of an earlier design, because
currently it is set but never used.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c28a1cf448e59019fa681741963c3acaeaeb6d27 13-Jan-2008 Stephen Hemminger <stephen.hemminger@vyatta.com> [IPV4] fib_trie: Get rid of trie_init().

trie_init is worthless it is just zeroing stuff that is already zero!
Move the memset() down to make it obvious.

Signed-off-by: Stephen Hemminger <stephen.hemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1c340b2fd73880136c438e6e7978288fbec8273f 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Show routing information from correct namespace (fib_trie.c)

This is the second part (for the CONFIG_IP_FIB_TRIE case) of the patch
#4, where we have created proc files in namespaces.

Now we can dump correct info in them.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8ad4942cd5bdad4143f7aa1d1bd4f7b2526c19c5 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add netns parameter to fib_get_table/fib_new_table.

This patch extends the fib_get_table and the fib_new_table functions
with the network namespace pointer. That will allow to access the
table relatively from the network namespace.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7b1a74fdbb9ec38a9780620fae25519fde4b21ee 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Refactor fib initialization so it can handle multiple namespaces.

This patch makes the fib to be initialized as a subsystem for the
network namespaces. The code does not handle several namespaces yet,
so in case of a creation of a network namespace, the
creation/initialization will not occur.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
61a0265344786a548e8a0b26cb668e78a71f9602 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace to API for routing /proc entries creation.

This adds netns parameter to fib_proc_init/exit and replaces __init
specifier with __net_init. After this, we will not yet have these proc
files show info from the specific namespace - this will be done when
these tables become namespaced.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f5026fabda54e5ab5d469d8cfac5f46b4d321ce9 13-Dec-2007 Denis V. Lunev <den@openvz.org> [IPV4]: Thresholds in fib_trie.c are used as consts, so make them const.

There are several thresholds for trie fib hash management. They are used
in the code as a constants. Make them constants from the compiler point of
view.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
971b893e79db0f7dccfcea15dbdebca3ca64a84d 08-Dec-2007 Denis V. Lunev <den@openvz.org> [IPV4]: last default route is a fib table property

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
a2bbe6822f8928e254452765c07cb863633113b8 08-Dec-2007 Denis V. Lunev <den@openvz.org> [IPV4]: Unify assignment of fi to fib_result

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
c17860a039bbde134324ad6f9331500635f5799d 08-Dec-2007 Denis V. Lunev <den@openvz.org> [IPV4]: no need pass pointer to a default into fib_detect_death

ipv4: no need pass pointer to a default into fib_detect_death

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
877a9bff3889512d7326d6bf0ba6ed3ddda6d772 07-Dec-2007 Eric W. Biederman <ebiederm@xmission.com> [IPV4]: Move trie_local and trie_main into the proc iterator.

We only use these variables when displaying the trie in proc so
place them into the iterator to make this explicit. We should
probably do something smarter to handle the CONFIG_IP_MULTIPLE_TABLES
case but at least this makes it clear that the silliness is limited
to the display in /proc.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6725033fa27c8f49e1221d2badbaaaf1ef459519 18-Jan-2008 Joonwoo Park <joonwpark81@gmail.com> [IPV4] fib_trie: fix duplicated route issue

http://bugzilla.kernel.org/show_bug.cgi?id=9493

The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.

Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1977f032722c27ee3730284582fd3991ad9ac81b 19-Oct-2007 Jiri Slaby <jirislaby@gmail.com> remove asm/bitops.h includes

remove asm/bitops.h includes

including asm/bitops directly may cause compile errors. don't include it
and include linux/bitops instead. next patch will deny including asm header
directly.

Cc: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cf7732e4cc14b56d593ff53352673e1fd5e3ba52 10-Oct-2007 Pavel Emelyanov <xemul@openvz.org> [NET]: Make core networking code use seq_open_private

This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.

The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
457c4cbc5a3dde259d2a1f15d5f9785290397267 12-Sep-2007 Eric W. Biederman <ebiederm@xmission.com> [NET]: Make /proc/net per network namespace

This patch makes /proc/net per network namespace. It modifies the global
variables proc_net and proc_net_stat to be per network namespace.
The proc_net file helpers are modified to take a network namespace argument,
and all of their callers are fixed to pass &init_net for that argument.
This ensures that all of the /proc/net files are only visible and
usable in the initial network namespace until the code behind them
has been updated to be handle multiple network namespaces.

Making /proc/net per namespace is necessary as at least some files
in /proc/net depend upon the set of network devices which is per
network namespace, and even more files in /proc/net have contents
that are relevant to a single network namespace.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ab66b4a7a3969077f6e2a18a0d2d849d3b84a337 11-Aug-2007 Stephen Hemminger <shemminger@linux-foundation.org> [IPV4] fib_trie: macro cleanup

This patch converts the messy macro for MASK_PFX to inline function
and expands TKEY_GET_MASK in the one place it is used.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0680191642c27c81c9be4557d9c6aa3487c15f69 11-Aug-2007 Stephen Hemminger <shemminger@linux-foundation.org> [IPV4] fib_trie: cleanup

Try this out:
* replace macro's with inlines
* get rid of places doing multiple evaluations of NODE_PARENT

[akpm@linux-foundation.org: rcu_dereference wants an lval]

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
20c2df83d25c6a95affe6157a4c9cac4cf5ffaac 20-Jul-2007 Paul Mundt <lethal@linux-sh.org> mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
b8f558313506b5bc435f2e031f3bec4b1725098e 23-May-2007 Milan Kocian <milon@wq.cz> [RTNETLINK]: Fix sending netlink message when replace route.

When you replace route via ip r r command the netlink multicast message is
not send. This patch corrects it. NL message is sent with NLM_F_REPLACE
flag.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=8320

Signed-off-by: Milan Kocian <milon@wq.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
965ffea43d4ebe8cd7b9fee78d651268dd7d23c5 20-Mar-2007 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie root node settings

The threshold for root node can be more aggressive set to get
better tree compression. The new setting mekes the root grow
from 16 to 19 bits and substansial improvemnt in Aver depth
this with the current table of 214393 prefixes

But really the dynamic resize should need more investigation
both in terms convergence and performance and maybe it should
be possible to change...

Maybe just for the brave to start with or we may have to back
this out.
05eee48c5af8213a71bd908ce17f577b2b776f79 20-Mar-2007 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie resize break

The patch below adds break condition for the resize operations. If
we don't achieve the desired fill factor a warning is printed. Trie
should still be operational but new thresholds should be considered.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
f690808e17925fc45217eb22e8670902ecee5c1b 12-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [NET]: make seq_operations const

The seq_file operations stuff can be marked constant to
get it out of dirty cache.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
132adf54639cf7dd9315e8df89c2faa59f6e46d9 09-Mar-2007 Stephen Hemminger <shemminger@linux-foundation.org> [IPV4]: cleanup

Add whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
d562f1f8a92035d5d4681c178fccb991ce57f33a 26-Mar-2007 Robert Olsson <robert.olsson@its.uu.se> [IPV4] fib_trie: Document locking.

Paul E. McKenney writes:

> Those of use who dive into networking only occasionally would much
> appreciate this. ;-)

No problem here...

Acked-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (but trivial)
Signed-off-by: David S. Miller <davem@davemloft.net>
d5cc4a73a5b5c8374b810d5371e9e7ed05c1e02c 16-Mar-2007 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: Do not disable preemption in trie_leaf_remove().

Hello, Just discussed this Patrick...

We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete
both are holding RTNL. So there shouldn't be need for this preempt stuff.
This is assumed to a leftover from an older RCU-take.

> Mhh .. I think I just remembered something - me incorrectly suggesting
> to add it there while we were talking about this at OLS :) IIRC the
> idea was to make sure tnode_free (which at that time didn't use
> call_rcu) wouldn't free memory while still in use in a rcu read-side
> critical section. It should have been synchronize_rcu of course,
> but with tnode_free using call_rcu it seems to be completely
> unnecessary. So I guess we can simply remove it.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
cd354f1ae75e6466a7e31b727faede57a1f89ca5 14-Feb-2007 Tim Schmielau <tim@physik3.uni-rostock.de> [PATCH] remove many unneeded #includes of sched.h

After Al Viro (finally) succeeded in removing the sched.h #include in module.h
recently, it makes sense again to remove other superfluous sched.h includes.
There are quite a lot of files which include it but don't actually need
anything defined in there. Presumably these includes were once needed for
macros that used to live in sched.h, but moved to other header files in the
course of cleaning it up.

To ease the pain, this time I did not fiddle with any header files and only
removed #includes from .c-files, which tend to cause less trouble.

Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha,
arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig,
allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all
configs in arch/arm/configs on arm. I also checked that no new warnings were
introduced by the patch (actually, some warnings are removed that were emitted
by unnecessarily included header files).

Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9a32144e9d7b4e21341174b1a83b82a82353be86 12-Feb-2007 Arjan van de Ven <arjan@linux.intel.com> [PATCH] mark struct file_operations const 7

Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e905a9edab7f4f14f9213b52234e4a346c690911 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
095b8501e4168ae5a879fcb9420ac48cbd43f95a 27-Jan-2007 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: Fix single-entry /proc/net/fib_trie output.

When main table is just a single leaf this gets printed as belonging to the
local table in /proc/net/fib_trie. A fix is below.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6640e69731b42fd5e3d2b26201c8b34fc897a0ee 24-Jan-2007 Eric W. Biederman <ebiederm@xmission.com> [IPV4]: Fix the fib trie iterator to work with a single entry routing tables

In a kernel with trie routing enabled I had a simple routing setup
with only a single route to the outside world and no default
route. "ip route table list main" showed my the route just fine but
/proc/net/route was an empty file. What was going on?

Thinking it was a bug in something I did and I looked deeper. Eventually
I setup a second route and everything looked correct, huh? Finally I
realized that the it was just the iterator pair in fib_trie_get_first,
fib_trie_get_next just could not handle a routing table with a single entry.

So to save myself and others further confusion, here is a simple fix for
the fib proc iterator so it works even when there is only a single route
in a routing table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
e18b890bb0881bbab6f4f1a6cd20d9c60d66b003 07-Dec-2006 Christoph Lameter <clameter@sgi.com> [PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e94b1766097d53e6f3ccfb36c8baa562ffeda3fc 07-Dec-2006 Christoph Lameter <clameter@sgi.com> [PATCH] slab: remove SLAB_KERNEL

SLAB_KERNEL is an alias of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
32ab5f80334fc067386c4c56c434010c01cff6b9 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4] fib_trie.c: trivial annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
be403ea1856f1428b5912b42184acbba808c41d6 18-Aug-2006 Thomas Graf <tgraf@suug.ch> [IPv4]: Convert FIB dumping to use new netlink api

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4e902c57417c4c285b98ba2722468d1c3ed83d1b 18-Aug-2006 Thomas Graf <tgraf@suug.ch> [IPv4]: FIB configuration using struct fib_config

Introduces struct fib_config replacing the ugly struct kern_rta
prone to ordering issues. Avoids creating faked netlink messages
for auto generated routes or requests via ioctl.

A new interface net/nexthop.h is added to help navigate through
nexthop configuration arrays.

A new struct nl_info will be used to carry the necessary netlink
information to be used for notifications later on.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
1af5a8c4a11cfed0c9a7f30fcfb689981750599c 11-Aug-2006 Patrick McHardy <kaber@trash.net> [IPV4]: Increase number of possible routing tables to 2^32

Increase the number of possible routing tables to 2^32 by replacing the
fixed sized array of pointers by a hash table and replacing iterations
over all possible table IDs by hash table walking.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2dfe55b47e3d66ded5a84caf71e0da5710edf48b 11-Aug-2006 Patrick McHardy <kaber@trash.net> [NET]: Use u32 for routing table IDs

Use u32 for routing table IDs in net/ipv4 and net/decnet in preparation of
support for a larger number of routing tables. net/ipv6 already uses u32
everywhere and needs no further changes. No functional changes are made by
this patch.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
888454c57a45511808d3fa52597b3d765df034a6 19-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4] fib_trie: missing ntohl() when calling fib_semantic_match()

fib_trie.c::check_leaf() passes host-endian where fib_semantic_match()
expects (and stores into) net-endian.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
b47b2ec19892ffc2b06ebf138ed4aa141275a1c2 12-Jul-2006 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Fix error handling for fib_insert_node call

The error handling around fib_insert_node was broken because we always
zeroed the error before checking it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
550e29bc96e6f1ced2bca82dace197b009434367 04-Apr-2006 Robert Olsson <robert.olsson@its.uu.se> [FIB_TRIE]: Fix leaf freeing.

Seems like leaf (end-nodes) has been freed by __tnode_free_rcu and not
by __leaf_free_rcu. This fixes the problem. Only tnode_free is now
used which checks for appropriate node type. free_leaf can be removed.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
06ef921d60bbf6f765d1b9492fb4fc88ac7814bd 21-Mar-2006 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie stats fix

fib_triestats has been buggy and caused oopses some platforms as
openwrt. The patch below should cure those problems.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
5ddf0eb2bfd613e941dd8748870c71da2e5ad409 21-Mar-2006 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie initialzation fix

In some kernel configs /proc functions seems to be accessed before the
trie is initialized. The patch below checks for this.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
cd8787ab04d23f925f440b712b43a6fd5cb31ece 03-Jan-2006 Stephen Hemminger <shemminger@osdl.org> [IPV4] fib_trie: build fix

Need this to fix build of fib_trie in net-2.6.16 (rebased) tree.
The code needs the new inet_make_mask inline.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
fd9662555cc35f8bf9242cd7bba8b44ae168a68b 22-Dec-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4] fib_trie: Add credits.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
c9e53cbe7ad6eabb3c7c5140b6127b4e5f9ee840 21-Nov-2005 Patrick McHardy <kaber@trash.net> [FIB_TRIE]: Don't show local table in /proc/net/route output

Don't show local table to behave similar to fib_hash.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1371e37da299d4df6267ad0ddf010435782c28e9 15-Oct-2005 Herbert Xu <herbert@gondor.apana.org.au> [IPV4]: Kill redundant rcu_dereference on fa_info

This patch kills a redundant rcu_dereference on fa->fa_info in fib_trie.c.
As this dereference directly follows a list_for_each_entry_rcu line, we
have already taken a read barrier with respect to getting an entry from
the list.

This read barrier guarantees that all values read out of fa are valid.
In particular, the contents of structure pointed to by fa->fa_info is
initialised before fa->fa_info is actually set (see fn_trie_insert);
the setting of fa->fa_info itself is further separated with a write
barrier from the insertion of fa into the list.

Therefore by taking a read barrier after obtaining fa from the list
(which is given by list_for_each_entry_rcu), we can be sure that
fa->fa_info contains a valid pointer, as well as the fact that the
data pointed to by fa->fa_info is itself valid.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
e6308be85afee685347fa3440bed10faaa5d6c1a 04-Oct-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie root-node expansion

The patch below introduces special thresholds to keep root node in the trie
large. This gives a flatter tree at the cost of a modest memory increase.
Overall it seems to be gain and this was also proposed by one the authors
of the paper in recent a seminar.

Main table after loading 123 k routes.

Aver depth: 3.30
Max depth: 9
Root-node size 12 bits
Total size: 4044 kB

With the patch:
Aver depth: 2.78
Max depth: 8
Root-node size 15 bits
Total size: 4150 kB

An increase of 8-10% was seen in forwading performance for an rDoS attack.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
78c6671a88313fd3c4364dc46e8c8186612616b8 21-Sep-2005 Stephen Hemminger <shemminger@osdl.org> [FIB_TRIE]: message cleanup

Cleanup the printk's in fib_trie:
* Convert a couple of places in the dump code to BUG_ON
* Put log level's on each message
The version message really needed the message since it leaks out
on the pretty Fedora bootup.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>,
Signed-off-by: David S. Miller <davem@davemloft.net>
772cb712b1373d335ef2874ea357ec681edc754b 20-Sep-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie RCU refinements

* This patch is from Paul McKenney's RCU reviewing.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
1d25cd6cc2528e4af12ab18e84fe95ed78f3f21a 20-Sep-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie tnode stats refinements

* Prints the route tnode and set the stats level deepth as before.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
cb7b593c2c808b32a1ea188599713c434b95f849 09-Sep-2005 Stephen Hemminger <shemminger@osdl.org> [IPV4] fib_trie: fix proc interface

Create one iterator for walking over FIB trie, and use it
for all the /proc functions. Add a /proc/net/route
output for backwards compatibility with old applications.

Make initialization of fib_trie same as fib_hash so no #ifdef
is needed in af_inet.c

Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=5209

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba89966c1984513f4f2cc0a6c182266be44ddd03 26-Aug-2005 Eric Dumazet <dada1@cosmosbay.com> [NET]: use __read_mostly on kmem_cache_t , DEFINE_SNMP_STAT pointers

This patch puts mostly read only data in the right section
(read_mostly), to help sharing of these data between CPUS without
memory ping pongs.

On one of my production machine, tcp_statistics was sitting in a
heavily modified cache line, so *every* SNMP update had to force a
reload.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2373ce1ca04dd46bf2b8b0f9a799eb2a90da92fb 25-Aug-2005 Robert Olsson <Robert.Olsson@data.slu.se> [IPV4]: Convert FIB Trie to RCU.

* Removes RW-lock
* Proteced read functions uses
rcu_dereference proteced with rcu_read_lock()
* writing of procted pointer w. rcu_assigen_pointer
* Insert/Replace atomic list_replace_rcu
* A BUG_ON condition removed.in trie_rebalance

With help from Paul E. McKenney.

Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c7770c740156c8802c23d24fc094d06967d997d 24-Aug-2005 Stephen Hemminger <shemminger@osdl.org> [IPV4]: FIB trie cleanup

This is a redo of earlier cleanup stuff:
* replace DBG() macro with pr_debug()
* get rid of duplicate extern's that are already in fib_lookup.h
* use BUG_ON and WARN_ON
* don't use BUG checks for null pointers where next statement would
get a fault anyway
* remove debug printout when rebalance causes deep tree
* remove trailing blanks

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
bb435b8d816582064ee0ddb1e2a6fbca67f34108 10-Aug-2005 Stephen Hemmigner <shemminger@osdl.org> [IPV4]: fib_trie: Use const

Use const where possible and get rid of EXTRACT() macro
that was never used.

Signed-off-by: Stephen Hemmigner <shemminger@osdl.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2f80b3c8262d0d646812f776db024d88d569a0c1 10-Aug-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: fib_trie: Use ERR_PTR to handle errno return

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
91b9a277fc4d207249e459a455abf804ebb5499d 10-Aug-2005 Olof Johansson <olof@lixom.net> [IPV4]: FIB Trie cleanups.

Below is a patch that cleans up some of this, supposedly without
changing any behaviour:

* Whitespace cleanups
* Introduce DBG()
* BUG_ON() instead of if () { BUG(); }
* Remove some of the deep nesting to make the code flow more
comprehensible
* Some mask operations were simplified

Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
06c7427021f1cc83703f14659d8405ca773ba1ef 24-Aug-2005 Patrick McHardy <kaber@trash.net> [FIB_TRIE]: Don't ignore negative results from fib_semantic_match

When a semantic match occurs either success, not found or an error
(for matching unreachable routes/blackholes) is returned. fib_trie
ignores the errors and looks for a different matching route. Treat
results other than "no match" as success and end lookup.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c877efb207bf4629cfa97ac13412f7392a873485 19-Jul-2005 Stephen Hemminger <shemminger@osdl.org> [IPV4]: Fix up lots of little whitespace indentation stuff in fib_trie.

Signed-off-by: Stephen Hemminger <shemminger@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2f36895aa774cf4d1c3d68921e0209e796b66600 06-Jul-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: More broken memory allocation fixes for fib_trie

Below a patch to preallocate memory when doing resize of trie (inflate halve)
If preallocations fails it just skips the resize of this tnode for this time.

The oops we got when killing bgpd (with full routing) is now gone.
Patrick memory patch is also used.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
f0e36f8cee8101604378085171c980d9cc71d779 05-Jul-2005 Patrick McHardy <kaber@trash.net> [IPV4]: Handle large allocations in fib_trie

Inflating a node a couple of times makes it exceed the 128k kmalloc limit.
Use __get_free_pages for allocations > PAGE_SIZE, as in fib_hash.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
f835e471b557c45d2e5701ea5215f6e739b4eb39 29-Jun-2005 Robert Olsson <robert.olsson@its.uu.se> [IPV4]: Broken memory allocation in fib_trie

This should help up the insertion... but the resize is more crucial.
and complex and needs some thinking.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
90f66914c89b0be63548d4387d1211280aa7bc8e 21-Jun-2005 David S. Miller <davem@davemloft.net> [IPV4]: Fix fib_trie.c's args to fib_dump_info().

Signed-off-by: David S. Miller <davem@davemloft.net>
19baf839ff4a8daa1f2a7400897094fc18e4f5e9 21-Jun-2005 Robert Olsson <Robert.Olsson@data.slu.se> [IPV4]: Add LC-Trie FIB lookup algorithm.

Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>