History log of /net/netfilter/nf_conntrack_core.c
Revision Date Author Comments
43612d7c04f1a4f5e60104143918fcdf018b66ee 25-Nov-2014 Pablo Neira <pablo@netfilter.org> Revert "netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse"

This reverts commit 5195c14c8b27cc0b18220ddbf0e5ad3328a04187.

If the conntrack clashes with an existing one, it is left out of
the unconfirmed list. This causes a crash when the packet is dropped
and the conntrack is released, since the golden rule is that conntracks
are always placed in one of the existing lists for traceability reasons.

Reported-by: Daniel Borkmann <dborkman@redhat.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=88841
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
5195c14c8b27cc0b18220ddbf0e5ad3328a04187 06-Nov-2014 bill bonaparte <programme110@gmail.com> netfilter: conntrack: fix race in __nf_conntrack_confirm against get_next_corpse

After removal of the central spinlock nf_conntrack_lock, in
commit 93bb0ceb75be2 ("netfilter: conntrack: remove central
spinlock nf_conntrack_lock"), it is possible to race against
get_next_corpse().

The race is against the get_next_corpse() cleanup on
the "unconfirmed" list (a per-cpu list with separate locking),
which sets the DYING bit.

Fix this race, in __nf_conntrack_confirm(), by removing the CT
from the unconfirmed list before checking the DYING bit. In case
the race occurred, re-add the CT to the dying list.

While at it, fix the coding style of the comment that has been
updated.

Fixes: 93bb0ceb75be2 ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Reported-by: bill bonaparte <programme110@gmail.com>
Signed-off-by: bill bonaparte <programme110@gmail.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8fc54f68919298ff9689d980efb495707ef43f30 23-Aug-2014 Daniel Borkmann <dborkman@redhat.com> net: use reciprocal_scale() helper

Replace open codings of (((u64) <x> * <y>) >> 32) with reciprocal_scale().
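
Editor's sketch (not part of the commit): the multiply-and-shift expression quoted above maps a 32-bit value onto the range [0, y) without a division. The user-space version below only illustrates that arithmetic; the names reciprocal_scale_sketch() and bucket_for_hash() are hypothetical and are not the kernel helper itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space sketch of the multiply-shift pattern named in the commit.
 * Maps a 32-bit value val onto the half-open interval [0, ep_ro)
 * without using a division.
 */
static inline uint32_t reciprocal_scale_sketch(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Hypothetical helper: pick a hash bucket for a 32-bit hash value. */
static inline uint32_t bucket_for_hash(uint32_t hash, uint32_t nbuckets)
{
	uint32_t b = reciprocal_scale_sketch(hash, nbuckets);

	/* The result is always below nbuckets (or 0 when nbuckets is 0). */
	assert(nbuckets == 0 || b < nbuckets);
	return b;
}
```

Unlike a modulo, the result depends mostly on the high bits of the input, so this pattern assumes the hash value is well mixed across all 32 bits.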

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
d2de875c6d4cbec8a99c880160181a3ed5b9992e 23-Aug-2014 Eric Dumazet <edumazet@google.com> net: use ktime_get_ns() and ktime_get_real_ns() helpers

ktime_get_ns() replaces ktime_to_ns(ktime_get())

ktime_get_real_ns() replaces ktime_to_ns(ktime_get_real())

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9500507c61381ceda4edbefa7361a4d26f54eb17 10-Jun-2014 Florian Westphal <fw@strlen.de> netfilter: conntrack: remove timer from ecache extension

This brings the (per-conntrack) ecache extension back to 24 bytes in size
(was 152 byte on x86_64 with lockdep on).

When event delivery fails, re-delivery is attempted via work queue.

Redelivery is attempted at least every 0.1 seconds, but can happen
more frequently if userspace is not congested.

The nf_ct_release_dying_list() function is removed.
With this patch, ownership of the to-be-redelivered conntracks
(on-dying-list-with-DYING-bit not yet set) is with the work queue,
which will release the references once event is out.

Joint work with Pablo Neira Ayuso.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
4e857c58efeb99393cba5a5d0d8ec7117183137c 17-Mar-2014 Peter Zijlstra <peterz@infradead.org> arch: Mass conversion of smp_mb__*()

Mostly scripted conversion of the smp_mb__* barriers.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ee214d54bf3d51259adf8917e26dc84df1cab05a 11-Apr-2014 Andrey Vagin <avagin@openvz.org> netfilter: nf_conntrack: initialize net.ct.generation

[ 251.920788] INFO: trying to register non-static key.
[ 251.921386] the code is fine but needs lockdep annotation.
[ 251.921386] turning off the locking correctness validator.
[ 251.921386] CPU: 2 PID: 15715 Comm: socket_listen Not tainted 3.14.0+ #294
[ 251.921386] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 251.921386] 0000000000000000 000000009d18c210 ffff880075f039b8 ffffffff816b7ecd
[ 251.921386] ffffffff822c3b10 ffff880075f039c8 ffffffff816b36f4 ffff880075f03aa0
[ 251.921386] ffffffff810c65ff ffffffff810c4a85 00000000fffffe01 ffffffffa0075172
[ 251.921386] Call Trace:
[ 251.921386] [<ffffffff816b7ecd>] dump_stack+0x45/0x56
[ 251.921386] [<ffffffff816b36f4>] register_lock_class.part.24+0x38/0x3c
[ 251.921386] [<ffffffff810c65ff>] __lock_acquire+0x168f/0x1b40
[ 251.921386] [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
[ 251.921386] [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
[ 251.921386] [<ffffffff816c1215>] ? _raw_spin_unlock_bh+0x35/0x40
[ 251.921386] [<ffffffffa0075172>] ? nf_nat_setup_info+0x252/0x3a0 [nf_nat]
[ 251.921386] [<ffffffff810c7272>] lock_acquire+0xa2/0x120
[ 251.921386] [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[ 251.921386] [<ffffffffa0055989>] __nf_conntrack_confirm+0x129/0x410 [nf_conntrack]
[ 251.921386] [<ffffffffa008ab90>] ? ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[ 251.921386] [<ffffffffa008ab90>] ipv4_confirm+0x90/0xf0 [nf_conntrack_ipv4]
[ 251.921386] [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[ 251.921386] [<ffffffff815d8c5a>] nf_iterate+0xaa/0xc0
[ 251.921386] [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[ 251.921386] [<ffffffff815d8d14>] nf_hook_slow+0xa4/0x190
[ 251.921386] [<ffffffff815e7b00>] ? ip_fragment+0x9f0/0x9f0
[ 251.921386] [<ffffffff815e98f2>] ip_output+0x92/0x100
[ 251.921386] [<ffffffff815e8df9>] ip_local_out+0x29/0x90
[ 251.921386] [<ffffffff815e9240>] ip_queue_xmit+0x170/0x4c0
[ 251.921386] [<ffffffff815e90d5>] ? ip_queue_xmit+0x5/0x4c0
[ 251.921386] [<ffffffff81601208>] tcp_transmit_skb+0x498/0x960
[ 251.921386] [<ffffffff81602d82>] tcp_connect+0x812/0x960
[ 251.921386] [<ffffffff810e3dc5>] ? ktime_get_real+0x25/0x70
[ 251.921386] [<ffffffff8159ea2a>] ? secure_tcp_sequence_number+0x6a/0xc0
[ 251.921386] [<ffffffff81606f57>] tcp_v4_connect+0x317/0x470
[ 251.921386] [<ffffffff8161f645>] __inet_stream_connect+0xb5/0x330
[ 251.921386] [<ffffffff8158dfc3>] ? lock_sock_nested+0x33/0xa0
[ 251.921386] [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
[ 251.921386] [<ffffffff81078885>] ? __local_bh_enable_ip+0x75/0xe0
[ 251.921386] [<ffffffff8161f8f8>] inet_stream_connect+0x38/0x50
[ 251.921386] [<ffffffff8158b157>] SYSC_connect+0xe7/0x120
[ 251.921386] [<ffffffff810e3789>] ? current_kernel_time+0x69/0xd0
[ 251.921386] [<ffffffff810c4a85>] ? trace_hardirqs_on_caller+0x105/0x1d0
[ 251.921386] [<ffffffff810c4b5d>] ? trace_hardirqs_on+0xd/0x10
[ 251.921386] [<ffffffff8158c36e>] SyS_connect+0xe/0x10
[ 251.921386] [<ffffffff816caf69>] system_call_fastpath+0x16/0x1b
[ 312.014104] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=60003 jiffies, g=42359, c=42358, q=333)
[ 312.015097] INFO: Stall ended before state dump start

Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
d5d20912d33f13766902a27087323f5c94e831c8 17-Mar-2014 Eric Dumazet <edumazet@google.com> netfilter: conntrack: Fix UP builds

ARRAY_SIZE(nf_conntrack_locks) is undefined if spinlock_t is an
empty structure. Replace it by CONNTRACK_LOCKS

Fixes: 93bb0ceb75be ("netfilter: conntrack: remove central spinlock nf_conntrack_lock")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
93bb0ceb75be2fdfa9fc0dd1fb522d9ada515d9c 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com> netfilter: conntrack: remove central spinlock nf_conntrack_lock

nf_conntrack_lock is a monolithic lock and suffers from huge contention
on current generation servers (8 or more core/threads).

Perf locking congestion is clear on base kernel:

-  72.56%  ksoftirqd/6  [kernel.kallsyms]  [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 25.33% init_conntrack
      + 24.86% nf_ct_delete_from_lists
      + 24.62% __nf_conntrack_confirm
      + 24.38% destroy_conntrack
      + 0.70% tcp_packet
+   2.21%  ksoftirqd/6  [kernel.kallsyms]  [k] fib_table_lookup
+   1.15%  ksoftirqd/6  [kernel.kallsyms]  [k] __slab_free
+   0.77%  ksoftirqd/6  [kernel.kallsyms]  [k] inet_getpeer
+   0.70%  ksoftirqd/6  [nf_conntrack]     [k] nf_ct_delete
+   0.55%  ksoftirqd/6  [ip_tables]        [k] ipt_do_table

This patch changes conntrack locking and provides a huge performance
improvement. SYN-flood attack tested on a 24-core E5-2695v2(ES) with
10Gbit/s ixgbe (with tool trafgen):

Base kernel: 810.405 new conntrack/sec
After patch: 2.233.876 new conntrack/sec

Note that other flood attacks (SYN+ACK or ACK) can easily be deflected using:
# iptables -A INPUT -m state --state INVALID -j DROP
# sysctl -w net/netfilter/nf_conntrack_tcp_loose=0

Use an array of hashed spinlocks to protect insertions/deletions of
conntracks into the hash table. 1024 spinlocks seem to give good
results, at minimal cost (4KB memory). Due to lockdep max depth,
1024 becomes 8 if CONFIG_LOCKDEP=y

The hash resize is a bit tricky, because we need to take all locks in
the array. A seqcount_t is used to synchronize the hash table users
with the resizing process.
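
Editor's sketch (not part of the commit): one way to picture the "array of hashed spinlocks" is a fixed table of locks indexed by hash modulo the table size. The user-space code below uses a toy C11 atomic spinlock in place of spinlock_t, and all sketch_* names are hypothetical. Taking the lower index first when two buckets must be held is an assumption of this sketch, made so that the lock order is global; the seqcount_t resize synchronization is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_LOCKS 1024	/* lock count quoted in the commit message */

/* Toy spinlocks; static atomics are zero-initialized, i.e. unlocked. */
static atomic_bool sketch_locks[SKETCH_LOCKS];

static void sketch_lock(unsigned int i)
{
	/* Spin until we are the one who flips the flag from false to true. */
	while (atomic_exchange_explicit(&sketch_locks[i], true,
					memory_order_acquire))
		;
}

static void sketch_unlock(unsigned int i)
{
	assert(atomic_load(&sketch_locks[i]));	/* must be held */
	atomic_store_explicit(&sketch_locks[i], false, memory_order_release);
}

/*
 * Lock the buckets for two hash values. Always taking the lower lock
 * index first gives a global order, so two CPUs locking the same pair
 * cannot deadlock. If both hashes map to one lock, take it only once.
 */
static void sketch_double_lock(unsigned int h1, unsigned int h2)
{
	h1 %= SKETCH_LOCKS;
	h2 %= SKETCH_LOCKS;
	if (h1 > h2) {
		unsigned int tmp = h1;

		h1 = h2;
		h2 = tmp;
	}
	sketch_lock(h1);
	if (h1 != h2)
		sketch_lock(h2);
}

static void sketch_double_unlock(unsigned int h1, unsigned int h2)
{
	h1 %= SKETCH_LOCKS;
	h2 %= SKETCH_LOCKS;
	sketch_unlock(h1);
	if (h1 != h2)
		sketch_unlock(h2);
}

static bool sketch_is_locked(unsigned int h)
{
	return atomic_load(&sketch_locks[h % SKETCH_LOCKS]);
}
```

With this layout, two insertions contend only when their hashes fall on the same lock, which is the property the commit message relies on to reduce contention.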

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ca7433df3a672efc88e08222cfa4b3aa965ca324 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com> netfilter: conntrack: seperate expect locking from nf_conntrack_lock

Netfilter expectations are protected with the same lock as conntrack
entries (nf_conntrack_lock). This patch splits out expectations locking
to use its own lock (nf_conntrack_expect_lock).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
e1b207dac13ddb2f8ddebc7dc9729a97421909bd 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com> netfilter: avoid race with exp->master ct

Preparation for disconnecting the nf_conntrack_lock from the
expectations code. Once the nf_conntrack_lock is lifted, a race
condition is exposed.

The expectation's master conntrack, exp->master, can race with
delete operations, as the refcnt increment happens too late in
init_conntrack(). The race is against other CPUs invoking
->destroy() (destroy_conntrack()), or nf_ct_delete() (via timeout
or early_drop()).

Avoid this race in nf_ct_find_expectation() by using atomic_inc_not_zero(),
and checking if nf_ct_is_dying() (path via nf_ct_delete()).
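
Editor's sketch (not part of the commit): the "increment unless zero" idea can be written in user space with a C11 compare-and-exchange loop. The struct and the sketch_* names below are hypothetical stand-ins, not the kernel's atomic_inc_not_zero() or struct nf_conn; the point is only that a reference is refused once the count has reached zero or the object is marked dying.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a conntrack. */
struct sketch_obj {
	atomic_int use;		/* reference count; 0 means "being freed" */
	atomic_bool dying;	/* set once deletion has started */
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true when the reference was obtained.
 */
static bool sketch_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool sketch_put(atomic_int *v)
{
	int old = atomic_fetch_sub(v, 1);

	assert(old > 0);
	return old == 1;
}

/*
 * Sketch of the check described above: refuse a master object that is
 * marked dying or whose count has already dropped to zero.
 */
static bool sketch_try_get_master(struct sketch_obj *master)
{
	if (atomic_load(&master->dying) ||
	    !sketch_inc_not_zero(&master->use))
		return false;
	return true;
}
```

A plain increment would resurrect an object that another CPU is already tearing down; the conditional increment makes the lookup fail instead.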

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
b7779d06f9950e14a008a2de970b44233fe49c86 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com> netfilter: conntrack: spinlock per cpu to protect special lists.

One spinlock per cpu to protect dying/unconfirmed/template special lists.
(These lists are now per cpu, a bit like the untracked ct)
Add a @cpu field to nf_conn, to make sure we hold the appropriate
spinlock at removal time.
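
Editor's sketch (not part of the commit): the reason for recording a CPU number in the object is that removal may run on a different CPU than insertion, so the lock has to be chosen from the object rather than from the current CPU. The user-space code below illustrates this with a toy spinlock and a hand-rolled list; the fixed CPU count and every sketch_* name are assumptions of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4	/* assumption: fixed CPU count for the sketch */

struct sketch_ct {
	struct sketch_ct *next;
	struct sketch_ct **pprev;	/* address of the pointer pointing at us */
	unsigned int cpu;		/* which per-cpu list we were added to */
};

struct sketch_pcpu {
	atomic_bool lock;		/* toy spinlock, one per CPU */
	struct sketch_ct *unconfirmed;	/* head of this CPU's list */
};

static struct sketch_pcpu sketch_pcpu_lists[SKETCH_NR_CPUS];

static void pcpu_lock(struct sketch_pcpu *p)
{
	while (atomic_exchange_explicit(&p->lock, true, memory_order_acquire))
		;
}

static void pcpu_unlock(struct sketch_pcpu *p)
{
	atomic_store_explicit(&p->lock, false, memory_order_release);
}

/* Add ct to the list of the given CPU and remember that CPU in ct. */
static void sketch_add_unconfirmed(struct sketch_ct *ct, unsigned int this_cpu)
{
	struct sketch_pcpu *p;

	assert(this_cpu < SKETCH_NR_CPUS);
	p = &sketch_pcpu_lists[this_cpu];
	ct->cpu = this_cpu;
	pcpu_lock(p);
	ct->next = p->unconfirmed;
	ct->pprev = &p->unconfirmed;
	if (ct->next)
		ct->next->pprev = &ct->next;
	p->unconfirmed = ct;
	pcpu_unlock(p);
}

/*
 * Remove ct. The caller may run on a different CPU than the one that
 * added ct, so the lock is chosen from ct->cpu, not the current CPU.
 */
static void sketch_del_unconfirmed(struct sketch_ct *ct)
{
	struct sketch_pcpu *p = &sketch_pcpu_lists[ct->cpu];

	pcpu_lock(p);
	*ct->pprev = ct->next;
	if (ct->next)
		ct->next->pprev = ct->pprev;
	ct->next = NULL;
	ct->pprev = NULL;
	pcpu_unlock(p);
}
```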

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
b476b72a0f8514a5a4c561bab731ddd506a284e7 03-Mar-2014 Jesper Dangaard Brouer <brouer@redhat.com> netfilter: trivial code cleanup and doc changes

Changes while reading through the netfilter code.

Added a hint about how the conntrack nf_conn refcnt is accessed,
and renamed repl_hash to reply_hash for readability.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
e53376bef2cd97d3e3f61fdc677fb8da7d03d0da 03-Feb-2014 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt

With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the lists, so we fulfill
Eric's golden rule, which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.

CPU1                                    CPU2
____nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
if (!nf_ct_key_equal(h, tuple, zone))
 nf_ct_put(ct); (use = 0)
  destroy_conntrack()
                                        /* continue to work with CT */

After applying the patch "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G C --------------- 2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[<ffffffffa03d99ac>] [<ffffffffa03d99ac>] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255] [<ffffffff814844a7>] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255] [<ffffffffa03d9bb5>] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255] [<ffffffffa03d9fb2>] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255] [<ffffffffa048c771>] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255] [<ffffffff81484419>] nf_iterate+0x69/0xb0
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814845d4>] nf_hook_slow+0x74/0x110
<4>[67096.760255] [<ffffffff814b5b00>] ? dst_output+0x0/0x20
<4>[67096.760255] [<ffffffff814b66d5>] raw_sendmsg+0x775/0x910
<4>[67096.760255] [<ffffffff8104c5a8>] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814c136a>] inet_sendmsg+0x4a/0xb0
<4>[67096.760255] [<ffffffff81444e93>] ? sock_sendmsg+0x13/0x140
<4>[67096.760255] [<ffffffff81444f97>] sock_sendmsg+0x117/0x140
<4>[67096.760255] [<ffffffff8102e299>] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255] [<ffffffff81519beb>] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255] [<ffffffff8109d930>] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255] [<ffffffff814960f0>] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255] [<ffffffff814457c9>] sys_sendto+0x139/0x190
<4>[67096.760255] [<ffffffff810efa77>] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255] [<ffffffff810ef7c5>] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255] [<ffffffff81474daf>] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Andrew Vagin <avagin@parallels.com>
Cc: Florian Westphal <fw@strlen.de>
Reported-by: Andrew Vagin <avagin@parallels.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Andrew Vagin <avagin@parallels.com>
c6825c0976fa7893692e0e43b09740b419b23c09 29-Jan-2014 Andrey Vagin <avagin@openvz.org> netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

Let's look at destroy_conntrack:

hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
...
nf_conntrack_free(ct)
kmem_cache_free(net->ct.nf_conntrack_cachep, ct);

net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.

The hash is protected by RCU, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but at this moment a few readers
can still be using the conntrack. Then this conntrack is released and another
thread creates a conntrack with the same address and an equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.

But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().

Florian Westphal suggested to check the confirm bit too. I think it's
right.
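
Editor's sketch (not part of the commit): with a type-stable slab, a lockless reader can end up holding a pointer to an object that was freed and reused, so after taking a reference it has to re-validate what it found. The user-space code below illustrates that re-check, including the confirmed bit suggested above. The struct and every sketch_* name are hypothetical, and a single integer stands in for the connection tuple.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct nf_conn. */
struct sketch_conn {
	atomic_int use;		/* reference count */
	atomic_bool dying;	/* deletion has started */
	atomic_bool confirmed;	/* fully set up and inserted in the hash */
	unsigned int tuple;	/* stand-in for the connection tuple */
};

/* Take a reference unless the count has already dropped to zero. */
static bool sketch_get_unless_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

static void sketch_conn_put(struct sketch_conn *ct)
{
	int old = atomic_fetch_sub(&ct->use, 1);

	assert(old > 0);
}

/*
 * found is what a lockless hash walk returned for tuple. Returns found
 * with a reference held, or NULL if the object cannot be trusted and
 * the caller should redo the lookup.
 */
static struct sketch_conn *sketch_find_get(struct sketch_conn *found,
					   unsigned int tuple)
{
	if (found == NULL)
		return NULL;
	if (atomic_load(&found->dying) ||
	    !sketch_get_unless_zero(&found->use))
		return NULL;
	/*
	 * The memory may have been recycled for a new connection while we
	 * were looking at it: re-check the key and require the confirmed
	 * bit, so a half-initialized object is never handed out.
	 */
	if (found->tuple != tuple || !atomic_load(&found->confirmed)) {
		sketch_conn_put(found);
		return NULL;
	}
	return found;
}
```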

task 1                  task 2                  task 3
                        nf_conntrack_find_get
                         ____nf_conntrack_find
destroy_conntrack
 hlist_nulls_del_rcu
 nf_conntrack_free
 kmem_cache_free
                                                __nf_conntrack_alloc
                                                 kmem_cache_alloc
                                                 memset(&ct->tuplehash[IP_CT_DIR_MAX],
                         if (nf_ct_is_dying(ct))
                         if (!nf_ct_tuple_equal()

I'm not sure that I have ever seen this race condition in real life.
Currently we are investigating a bug which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently;
we don't have any other explanation for this.

<2>[46267.083061] kernel BUG at net/ipv4/netfilter/nf_nat_core.c:322!
...
<4>[46267.083951] RIP: 0010:[<ffffffffa01e00a4>] [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590 [nf_nat]
...
<4>[46267.085549] Call Trace:
<4>[46267.085622] [<ffffffffa023421b>] alloc_null_binding+0x5b/0xa0 [iptable_nat]
<4>[46267.085697] [<ffffffffa02342bc>] nf_nat_rule_find+0x5c/0x80 [iptable_nat]
<4>[46267.085770] [<ffffffffa0234521>] nf_nat_fn+0x111/0x260 [iptable_nat]
<4>[46267.085843] [<ffffffffa0234798>] nf_nat_out+0x48/0xd0 [iptable_nat]
<4>[46267.085919] [<ffffffff814841b9>] nf_iterate+0x69/0xb0
<4>[46267.085991] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086063] [<ffffffff81484374>] nf_hook_slow+0x74/0x110
<4>[46267.086133] [<ffffffff81494e70>] ? ip_finish_output+0x0/0x2f0
<4>[46267.086207] [<ffffffff814b5890>] ? dst_output+0x0/0x20
<4>[46267.086277] [<ffffffff81495204>] ip_output+0xa4/0xc0
<4>[46267.086346] [<ffffffff814b65a4>] raw_sendmsg+0x8b4/0x910
<4>[46267.086419] [<ffffffff814c10fa>] inet_sendmsg+0x4a/0xb0
<4>[46267.086491] [<ffffffff814459aa>] ? sock_update_classid+0x3a/0x50
<4>[46267.086562] [<ffffffff81444d67>] sock_sendmsg+0x117/0x140
<4>[46267.086638] [<ffffffff8151997b>] ? _spin_unlock_bh+0x1b/0x20
<4>[46267.086712] [<ffffffff8109d370>] ? autoremove_wake_function+0x0/0x40
<4>[46267.086785] [<ffffffff81495e80>] ? do_ip_setsockopt+0x90/0xd80
<4>[46267.086858] [<ffffffff8100be0e>] ? call_function_interrupt+0xe/0x20
<4>[46267.086936] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087006] [<ffffffff8118cb10>] ? ub_slab_ptr+0x20/0x90
<4>[46267.087081] [<ffffffff8118f2e8>] ? kmem_cache_alloc+0xd8/0x1e0
<4>[46267.087151] [<ffffffff81445599>] sys_sendto+0x139/0x190
<4>[46267.087229] [<ffffffff81448c0d>] ? sock_setsockopt+0x16d/0x6f0
<4>[46267.087303] [<ffffffff810efa47>] ? audit_syscall_entry+0x1d7/0x200
<4>[46267.087378] [<ffffffff810ef795>] ? __audit_syscall_exit+0x265/0x290
<4>[46267.087454] [<ffffffff81474885>] ? compat_sys_setsockopt+0x75/0x210
<4>[46267.087531] [<ffffffff81474b5f>] compat_sys_socketcall+0x13f/0x210
<4>[46267.087607] [<ffffffff8104dea3>] ia32_sysret+0x0/0x5
<4>[46267.087676] Code: 91 20 e2 01 75 29 48 89 de 4c 89 f7 e8 56 fa ff ff 85 c0 0f 84 68 fc ff ff 0f b6 4d c6 41 8b 45 00 e9 4d fb ff ff e8 7c 19 e9 e0 <0f> 0b eb fe f6 05 17 91 20 e2 80 74 ce 80 3d 5f 2e 00 00 00 74
<1>[46267.088023] RIP [<ffffffffa01e00a4>] nf_nat_setup_info+0x564/0x590

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
dcd93ed4cd1669b2c1510e801fe5f1132390761c 31-Dec-2013 stephen hemminger <stephen@networkplumber.org> netfilter: nf_conntrack: remove dead code

The following code is not used in current upstream code.
Some of this seems to be old hooks, some might be used by an
out of tree module (which I don't care about breaking), and
need_ipv4_conntrack was used by old NAT code but is no longer
called.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
0c3c6c00c69649f4749642b3e5d82125fde1600c 18-Nov-2013 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: decrement global counter after object release

nf_conntrack_free() decrements our counter (net->ct.count)
before releasing the conntrack object. That counter is used in the
nf_conntrack_cleanup_net_list path to check if it's time to
kmem_cache_destroy our cache of conntrack objects. I think we have
a race there that should be easier to trigger (although still hard)
with CONFIG_DEBUG_OBJECTS_FREE as object releases become slower
according to the following splat:

[ 1136.321305] WARNING: CPU: 2 PID: 2483 at lib/debugobjects.c:260
debug_print_object+0x83/0xa0()
[ 1136.321311] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x20
...
[ 1136.321390] Call Trace:
[ 1136.321398] [<ffffffff8160d4a2>] dump_stack+0x45/0x56
[ 1136.321405] [<ffffffff810514e8>] warn_slowpath_common+0x78/0xa0
[ 1136.321410] [<ffffffff81051557>] warn_slowpath_fmt+0x47/0x50
[ 1136.321414] [<ffffffff812f8883>] debug_print_object+0x83/0xa0
[ 1136.321420] [<ffffffff8106aa90>] ? execute_in_process_context+0x90/0x90
[ 1136.321424] [<ffffffff812f99fb>] debug_check_no_obj_freed+0x20b/0x250
[ 1136.321429] [<ffffffff8112e7f2>] ? kmem_cache_destroy+0x92/0x100
[ 1136.321433] [<ffffffff8115d945>] kmem_cache_free+0x125/0x210
[ 1136.321436] [<ffffffff8112e7f2>] kmem_cache_destroy+0x92/0x100
[ 1136.321443] [<ffffffffa046b806>] nf_conntrack_cleanup_net_list+0x126/0x160 [nf_conntrack]
[ 1136.321449] [<ffffffffa046c43d>] nf_conntrack_pernet_exit+0x6d/0x80 [nf_conntrack]
[ 1136.321453] [<ffffffff81511cc3>] ops_exit_list.isra.3+0x53/0x60
[ 1136.321457] [<ffffffff815124f0>] cleanup_net+0x100/0x1b0
[ 1136.321460] [<ffffffff8106b31e>] process_one_work+0x18e/0x430
[ 1136.321463] [<ffffffff8106bf49>] worker_thread+0x119/0x390
[ 1136.321467] [<ffffffff8106be30>] ? manage_workers.isra.23+0x2a0/0x2a0
[ 1136.321470] [<ffffffff8107210b>] kthread+0xbb/0xc0
[ 1136.321472] [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321477] [<ffffffff8161b8fc>] ret_from_fork+0x7c/0xb0
[ 1136.321479] [<ffffffff81072050>] ? kthread_create_on_node+0x110/0x110
[ 1136.321481] ---[ end trace 25f53c192da70825 ]---

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
f7b13e4330ef3c20e62ac4908cc96c1c318056c2 26-Sep-2013 Holger Eitzenberger <holger@eitzenberger.org> netfilter: introduce nf_conn_acct structure

Encapsulate counters for both directions into nf_conn_acct. During
that process also consistently name pointers to the extend 'acct',
not 'counters'. This patch is a cleanup.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
48b1de4c110a7afa4b85862f6c75af817db26fad 27-Aug-2013 Patrick McHardy <kaber@trash.net> netfilter: add SYNPROXY core/target

Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
core with common functions and an address family specific target.

The SYNPROXY receives the connection request from the client, responds with
a SYN/ACK containing a SYN cookie and announcing a zero window and checks
whether the final ACK from the client contains a valid cookie.

It then establishes a connection to the original destination and, if
successful, sends a window update to the client with the window size
announced by the server.

Support for timestamps, SACK, window scaling and MSS options can be
statically configured as target parameters if the features of the server
are known. If timestamps are used, the timestamp value sent back to
the client in the SYN/ACK will be different from the real timestamp of
the server. In order to not break PAWS, the timestamps are translated in
the direction server->client.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
41d73ec053d2424599c4ed8452b889374d523ade 27-Aug-2013 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: make sequence number adjustments usuable without NAT

Split out sequence number adjustments from NAT and move them to the conntrack
core to make them usable for SYN proxying. The sequence number adjustment
information is moved to a separate extend. The extend is added to new
conntracks when a NAT mapping is set up for a connection using a helper.

As a side effect, this saves 24 bytes per connection with NAT in the common
case that a connection does not have a helper assigned.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
c655bc6896b94ee0223393f26155c6daf1e2d148 29-Jul-2013 Florian Westphal <fw@strlen.de> netfilter: nf_conntrack: don't send destroy events from iterator

Let nf_ct_delete handle delivery of the DESTROY event.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2d89c68ac78ae432038ef23371d2fa949d725d43 28-Jul-2013 Patrick McHardy <kaber@trash.net> netfilter: nf_nat: change sequence number adjustments to 32 bits

Using 16 bits is too small, when many adjustments happen the offsets might
overflow and break the connection.
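
Editor's sketch (not part of the commit): the arithmetic behind "16 bits is too small" can be checked in user space. The code below assumes the offset is a signed 16-bit quantity (an assumption of this sketch) and counts how many equal-sized adjustments fit before the running total leaves that range; both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a cumulative sequence offset still fits a signed 16-bit field. */
static bool offset_fits_s16(int32_t total)
{
	return total >= INT16_MIN && total <= INT16_MAX;
}

/*
 * Accumulate up to n adjustments of delta bytes each in a 32-bit
 * counter. Returns how many adjustments were applied before the total
 * would no longer fit in 16 bits, or n if all of them fit.
 */
static unsigned int adjustments_until_s16_overflow(int32_t delta,
						   unsigned int n)
{
	int32_t total = 0;

	for (unsigned int i = 0; i < n; i++) {
		total += delta;
		if (!offset_fits_s16(total))
			return i;
	}
	/* Reaching this point means every partial sum stayed in range. */
	assert(offset_fits_s16(total));
	return n;
}
```

For example, with 100-byte adjustments the 328th one pushes the total to 32800, which is past the 16-bit limit of 32767, while a 32-bit field still has ample headroom.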

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
02982c27ba1e1bd9f9d4747214e19ca83aa88d0e 29-Jul-2013 Florian Westphal <fw@strlen.de> netfilter: nf_conntrack: remove duplicate code in ctnetlink

ctnetlink contains copy-paste code from death_by_timeout. In order to
avoid changing both places in upcoming event delivery patch,
export death_by_timeout functionality and use it in the ctnetlink code.

Based on earlier patch from Pablo Neira.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
312a0c16c1fa9dd7cb5af413cf73b2fe2806c962 28-Jul-2013 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: constify sk_buff argument to nf_ct_attach()

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ca3d41a588d3cceb06e151cf31cc73f910384e2b 30-Apr-2013 Akinobu Mita <akinobu.mita@gmail.com> net/netfilter: rename random32() to prandom_u32()

Use preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ec464e5dc504a164c5dbff4a06812d495e44e34d 17-Apr-2013 Patrick McHardy <kaber@trash.net> netfilter: rename netlink related "pid" variables to "portid"

Get rid of the confusing mix of pid and portid and use portid consistently
for all netlink related socket identities.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
f229f6ce481ceb33a966311722b8ef0cb6c25de7 06-Apr-2013 Patrick McHardy <kaber@trash.net> netfilter: add my copyright statements

Add copyright statements to all netfilter files which have had significant
changes done by myself in the past.

Some notes:

- nf_conntrack_ecache.c was incorrectly attributed to Rusty and Netfilter
Core Team when it got split out of nf_conntrack_core.c. The copyrights
even state a date which lies six years before it was written. It was
written in 2005 by Harald and myself.

- net/ipv{4,6}/netfilter.c, net/netfilter/nf_queue.c were missing copyright
statements. I've added the copyright statement from net/netfilter/core.c,
where this code originated.

- for nf_conntrack_proto_tcp.c I've also added Jozsef, since I didn't want
it to give the wrong impression

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
dece40e848f6e022f960dc9de54be518928460c3 14-Mar-2013 Vladimir Davydov <VDavydov@parallels.com> netfilter: nf_conntrack: speed up module removal path if netns in use

The patch introduces nf_conntrack_cleanup_net_list(), which cleans up
nf_conntrack for a list of netns and calls synchronize_net() only once
for them all. This should reduce netns destruction time.

I've measured cleanup time for 1k dummy net ns. Here are the results:

<without the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack

real 0m10.337s
user 0m0.000s
sys 0m0.376s

<with the patch>
# modprobe nf_conntrack
# time modprobe -r nf_conntrack

real 0m5.661s
user 0m0.000s
sys 0m0.216s

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
493763684fefca54502e2d95b057075ac8e279ea 16-Mar-2013 stephen hemminger <stephen@networkplumber.org> netfilter: nf_conntrack: add include to fix sparse warning

Include header file to pickup prototype of nf_nat_seq_adjust_hook

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
04d870017908f40bbb1c51910acc030ae4979db4 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_proto: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5f69b8f5218dc303cbcb6f71d221c27d3cd17ebb 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_labels: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5e615b220087c5551f486c967831cecdfd338dbe 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_helper: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
8684094cf17d8ce96e0a8c63003f331aa017e22d 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_timeout: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
3fe0f943d4f52f875f0fdf8dbe472c8a9b852891 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_ecache: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
73f4001a52c986114f540504d70b21e52eb0d92a 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_tstamp: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
b7ff3a1fae78783e0ab1ef82f5978aeb89ddd16b 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_acct: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
83b4dbe19844b5472a8f44b6cf1d88693c080ef7 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_expect: move initialization out of pernet_operations

Move the global initialization code to the module_init/exit context.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
f94161c1bbdf7af11729cf106b4452f2432448e0 21-Jan-2013 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_conntrack: move initialization out of pernet operations

nf_conntrack initialization and cleanup code currently runs in the
pernet operations functions. This task should be done in
module_init/exit. We can't use init_net to identify if it's the right
time to initialize or clean up, since we cannot make assumptions about
the order in which netns are created/destroyed.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
c539f01717c239cfa0921dd43927afc976f1eedc 11-Jan-2013 Florian Westphal <fw@strlen.de> netfilter: add connlabel conntrack extension

Similar to connmarks, except that labels are bit-based, i.e.
all labels may be attached to a flow at the same time.

Up to 128 labels are supported. Supporting more labels
is possible, but requires increasing the ct offset delta
from u8 to u16 type due to increased extension sizes.
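
As a rough illustration of the bit-based scheme (not the kernel's
implementation; struct labels, label_set() and label_test() are
hypothetical names), a 128-bit label set can be kept as an array of
words, so any combination of labels can be attached at once:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LABELS 128 /* limit quoted in the commit message */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define LABEL_WORDS ((MAX_LABELS + WORD_BITS - 1) / WORD_BITS)

/* Hypothetical container: one bit per label. */
struct labels {
	unsigned long bits[LABEL_WORDS];
};

/* Attach label `bit`; returns false if the id is out of range. */
static bool label_set(struct labels *l, unsigned int bit)
{
	if (bit >= MAX_LABELS)
		return false;
	l->bits[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
	return true;
}

/* Report whether label `bit` is attached. */
static bool label_test(const struct labels *l, unsigned int bit)
{
	if (bit >= MAX_LABELS)
		return false;
	return (l->bits[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1UL;
}
```

Setting one label never disturbs another, which is the difference
from a single connmark value.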

Mapping of bit-identifier to label name is done in userspace.

The extension is enabled at run-time once "-m connlabel" netfilter
rules are added.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
1e47ee8367babe6a5e8adf44a714c7086657b87e 10-Jan-2013 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: fix BUG_ON while removing nf_conntrack with netns

canqun zhang reported that we're hitting BUG_ON in the
nf_conntrack_destroy path when calling kfree_skb while
rmmod'ing the nf_conntrack module.

Currently, the nf_ct_destroy hook is being set to NULL in the
destroy path of conntrack.init_net. However, this is a problem
since init_net may be destroyed before any other existing netns
(we cannot assume any specific ordering while releasing existing
netns according to what I read in recent emails).

Thanks to Gao feng for initial patch to address this issue.

Reported-by: canqun zhang <canqunzhang@gmail.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
252b3e8c1bc0c2b20348ae87d67efcd0a8209f72 11-Dec-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: xt_CT: fix crash while destroy ct templates

In (d871bef netfilter: ctnetlink: dump entries from the dying and
unconfirmed lists), we assume that all conntrack objects are
inserted in any of the existing lists. However, template conntrack
objects were not. This results in hitting BUG_ON in the
destroy_conntrack path while removing a rule that uses the CT target.

This patch fixes the situation by adding the template lists, which
is where template conntrack objects reside now.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
a71258d79e3d05632e90c9f7db5ccf929d276529 10-Dec-2012 Abhijit Pawar <abhi.c.pawar@gmail.com> net: remove obsolete simple_strto<foo>

This patch removes the redundant occurrences of simple_strto<foo>

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4b5511ebc7e1cf94e4f13be19c2cf3e90edc3395 10-Dec-2012 Abhijit Pawar <abhi.c.pawar@gmail.com> net: remove obsolete simple_strto<foo>

This patch replaces the obsolete simple_strto<foo> with kstrto<foo>

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
04dac0111da7e1d284952cd415162451ffaa094d 27-Nov-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: improve nf_conn object traceability

This patch modifies the conntrack subsystem so that all existing
allocated conntrack objects can be found in any of the following
places:

* the hash table, this is the typical place for alive conntrack objects.
* the unconfirmed list, this is the place for newly created conntrack objects
that are still traversing the stack.
* the dying list, this is where you can find conntrack objects that are dying
or that should die anytime soon (eg. once the destroy event is delivered to
the conntrackd daemon).

Thus, we make sure that we can keep track of all existing conntrack
objects. This patch, together with some extension of the ctnetlink interface
to dump the content of the dying and unconfirmed lists, will help when
debugging suspected nf_conn object leaks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
b0cdb1d9a9522b4f0905f11e4c7d7a59e0f7dc44 19-Sep-2012 Patrick McHardy <kaber@trash.net> netfilter: nf_nat: fix oops when unloading protocol modules

When unloading a protocol module nf_ct_iterate_cleanup() is used to
remove all conntracks using the protocol from the bysource hash and
clean their NAT sections. Since the conntrack isn't actually killed,
the NAT callback is invoked twice, once for each direction, which
causes an oops when trying to delete it from the bysource hash for
the second time.

The same oops can also happen when removing both an L3 and L4 protocol
since the cleanup function doesn't check whether the conntrack has
already been cleaned up.

Pid: 4052, comm: modprobe Not tainted 3.6.0-rc3-test-nat-unload-fix+ #32 Red Hat KVM
RIP: 0010:[<ffffffffa002c303>] [<ffffffffa002c303>] nf_nat_proto_clean+0x73/0xd0 [nf_nat]
RSP: 0018:ffff88007808fe18 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800728550c0 RCX: ffff8800756288b0
RDX: dead000000200200 RSI: ffff88007808fe88 RDI: ffffffffa002f208
RBP: ffff88007808fe28 R08: ffff88007808e000 R09: 0000000000000000
R10: dead000000200200 R11: dead000000100100 R12: ffffffff81c6dc00
R13: ffff8800787582b8 R14: ffff880078758278 R15: ffff88007808fe88
FS: 00007f515985d700(0000) GS:ffff88007cd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f515986a000 CR3: 000000007867a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 4052, threadinfo ffff88007808e000, task ffff8800756288b0)
Stack:
ffff88007808fe68 ffffffffa002c290 ffff88007808fe78 ffffffff815614e3
ffffffff00000000 00000aeb00000246 ffff88007808fe68 ffffffff81c6dc00
ffff88007808fe88 ffffffffa00358a0 0000000000000000 000000000040f5b0
Call Trace:
[<ffffffffa002c290>] ? nf_nat_net_exit+0x50/0x50 [nf_nat]
[<ffffffff815614e3>] nf_ct_iterate_cleanup+0xc3/0x170
[<ffffffffa002c55a>] nf_nat_l3proto_unregister+0x8a/0x100 [nf_nat]
[<ffffffff812a0303>] ? compat_prepare_timeout+0x13/0xb0
[<ffffffffa0035848>] nf_nat_l3proto_ipv4_exit+0x10/0x23 [nf_nat_ipv4]
...

To fix this,

- check whether the conntrack has already been cleaned up in
nf_nat_proto_clean

- change nf_ct_iterate_cleanup() to only invoke the callback function
once for each conntrack (IP_CT_DIR_ORIGINAL).

The second change doesn't affect other callers since when conntracks are
actually killed, both directions are removed from the hash immediately
and the callback is already only invoked once. If it is not killed, the
second callback invocation will always return the same decision not to
kill it.
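
The second change can be pictured with a small sketch. It is only an
illustration under assumed names (struct conn, struct tuple_hash,
iterate_once() and mark_cleaned() are hypothetical, and the table is a
flat array rather than the real hash): each connection owns one entry
per direction, and the walker skips everything except the original
direction so the callback fires once per connection.

```c
#include <stddef.h>

enum dir { DIR_ORIGINAL = 0, DIR_REPLY = 1, DIR_MAX = 2 };

struct conn;

/* One table entry per direction, pointing back at its connection. */
struct tuple_hash {
	struct conn *owner;
	enum dir dir;
};

struct conn {
	struct tuple_hash th[DIR_MAX];
	int cleaned; /* how many times the callback saw this connection */
};

/* Example callback: records that the connection was visited. */
static int mark_cleaned(struct conn *c)
{
	c->cleaned++;
	return 0;
}

/* Walk all entries, invoking cb only for original-direction entries.
 * Returns the number of callback invocations. */
static int iterate_once(struct tuple_hash **table, size_t n,
			int (*cb)(struct conn *))
{
	int calls = 0;

	for (size_t i = 0; i < n; i++) {
		if (table[i]->dir != DIR_ORIGINAL)
			continue;
		cb(table[i]->owner);
		calls++;
	}
	return calls;
}
```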

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
84b5ee939eba0115739c19c0e01ea903b029c9da 28-Aug-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: add nf_ct_timeout_lookup

This patch adds the new nf_ct_timeout_lookup function to encapsulate
the timeout policy attachment that is called in the nf_conntrack_in
path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5b423f6a40a0327f9d40bc8b97ce9be266f74368 29-Aug-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: fix racy timer handling with reliable events

Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets / ctnetlink may
hold references to conntrack entries that were just inserted to such
list.

This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use variable size extensions
to allocate this area on demand.

Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
c7232c9979cba684c50b64c513c4a83c9aa70563 26-Aug-2012 Patrick McHardy <kaber@trash.net> netfilter: add protocol independent NAT core

Convert the IPv4 NAT implementation to a protocol independent core and
address family specific modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
1afc56794e03229fa53cfa3c5012704d226e1dec 07-Jun-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_ct_helper: implement variable length helper private data

This patch uses the new variable length conntrack extensions.

Instead of using union nf_conntrack_help that contains all the
helper private data information, we allocate a variable length
area to store the private helper data.

This patch includes the modification of all existing helpers.
It also adds a couple of header includes to avoid compilation
warnings.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
15f585bd76b6bd2974b23c9e69ff038a0826a0be 28-May-2012 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_ct_generic: add namespace support

This patch adds namespace support for the generic layer 4 protocol
tracker.

Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
e3192690a3c889767d1161b228374f4926d92af0 03-Jun-2012 Joe Perches <joe@perches.com> net: Remove casts to same type

Adding casts of objects to the same type is unnecessary
and confusing for a human reader.

For example, this cast:

int y;
int *p = (int *)&y;

I used the coccinelle script below to find and remove these
unnecessary casts. I manually removed the conversions this
script produces of casts with __force and __user.

@@
type T;
T *p;
@@

- (T *)p
+ p

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e87cc4728f0e2fb663e592a1141742b1d6c63256 13-May-2012 Joe Perches <joe@perches.com> net: Convert net_ratelimit uses to net_<level>_ratelimited

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a9006892643a8f4e885b692de0708bcb35a7d530 18-Apr-2012 Eric Leblond <eric@regit.org> netfilter: nf_ct_helper: allow to disable automatic helper assignment

This patch allows you to disable automatic conntrack helper
lookup based on TCP/UDP ports, eg.

echo 0 > /proc/sys/net/netfilter/nf_conntrack_helper

[ Note: flows that already got a helper will keep using it even
if automatic helper assignment has been disabled ]

Once this behaviour has been disabled, you have to explicitly
use the iptables CT target to attach helper to flows.

There are good reasons to stop supporting automatic helper
assignment, for further information, please read:

http://www.netfilter.org/news.html#2012-04-03

This patch also adds one message to inform that automatic helper
assignment is deprecated and that it will be removed soon (this is
printed only once, with the first flow that gets a helper attached,
to make it as unobtrusive as possible).

Signed-off-by: Eric Leblond <eric@regit.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
6ba900676bec8baaf61aa2f85b7345c0e65774d9 07-Apr-2012 Gao feng <gaofeng@cn.fujitsu.com> netfilter: nf_conntrack: fix incorrect logic in nf_conntrack_init_net

In function nf_conntrack_init_net, when nf_conntrack_timeout_init fails,
we should call nf_conntrack_ecache_fini to roll back,
but the current code calls nf_conntrack_timeout_fini.
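
The rule being restored is the usual goto-based unwinding: a failed
step must undo only the steps that already succeeded, in reverse
order, and never its own. A minimal sketch, assuming three
hypothetical steps (the STEP_* names, step_init(), step_fini() and
init_all() are illustrative, not the kernel functions):

```c
/* Hypothetical init steps; `fail_at` selects which one fails. */
enum { STEP_ACCT, STEP_ECACHE, STEP_TIMEOUT, STEP_COUNT };

static int initialized[STEP_COUNT];

static int step_init(int step, int fail_at)
{
	if (step == fail_at)
		return -1;
	initialized[step] = 1;
	return 0;
}

static void step_fini(int step)
{
	initialized[step] = 0;
}

/* Each error label undoes the step *before* the one that failed. */
static int init_all(int fail_at)
{
	if (step_init(STEP_ACCT, fail_at) < 0)
		goto err_acct;
	if (step_init(STEP_ECACHE, fail_at) < 0)
		goto err_ecache;
	if (step_init(STEP_TIMEOUT, fail_at) < 0)
		goto err_timeout;
	return 0;

err_timeout:
	step_fini(STEP_ECACHE);
err_ecache:
	step_fini(STEP_ACCT);
err_acct:
	return -1;
}
```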

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
d96fc659aeb27686cef42d305cfd0c9702f8841c 03-Apr-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: fix count leak in error path of __nf_conntrack_alloc

We have to decrement the conntrack counter if we fail to access the
zone extension.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
bae65be896cc420f58460cb6f6ac03e71d1bf240 02-Apr-2012 David S. Miller <davem@davemloft.net> nf_conntrack_core: Stop using NLA_PUT*().

These macros contain a hidden goto, and are thus extremely error
prone and make code hard to audit.

Signed-off-by: David S. Miller <davem@davemloft.net>
60b5f8f745739a4789395648595ed31ede582448 23-Mar-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: permanently attach timeout policy to conntrack

We need to permanently attach the timeout policy to the conntrack,
otherwise we may apply the custom timeout policy inconsistently.

Without this patch, the following example:

nfct timeout add test inet icmp timeout 100
iptables -I PREROUTING -t raw -p icmp -s 1.1.1.1 -j CT --timeout test

Will only apply the custom timeout policy to outgoing packets from
1.1.1.1, but not to reply packets from 2.2.2.2 going to 1.1.1.1.

To fix this issue, this patch modifies the current logic to attach the
timeout policy when the first packet is seen (which is when the
conntrack entry is created). Then, we keep using the attached timeout
policy until the conntrack entry is destroyed.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
24de58f465165298aaa8f286b2592f0163706cfe 29-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: xt_CT: allow to attach timeout policy + glue code

This patch allows you to attach the timeout policy via the
CT target, it adds a new revision of the target to ensure
backward compatibility. Moreover, it also contains the glue
code to stick the timeout object defined via nfnetlink_cttimeout
to the given flow.

Example usage (it requires installing the nfct tool and
libnetfilter_cttimeout):

1) create the timeout policy:

nfct timeout add tcp-policy0 inet tcp \
established 1000 close 10 time_wait 10 last_ack 10

2) attach the timeout policy to the packet:

iptables -I PREROUTING -t raw -p tcp -j CT --timeout tcp-policy0

You have to install the following user-space software:

a) libnetfilter_cttimeout:
git://git.netfilter.org/libnetfilter_cttimeout

b) nfct:
git://git.netfilter.org/nfct

You also have to get iptables with -j CT --timeout support.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
dd705072412225a97784fe38feee2ebf8d14814d 28-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_ct_ext: add timeout extension

This patch adds the timeout extension, which allows you to attach
specific timeout policies to flows.

This extension is only used by the template conntrack.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2c8503f55fbdfbeff4164f133df804cf4d316290 28-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: pass timeout array to l4->new and l4->packet

This patch defines a new interface for l4 protocol trackers:

unsigned int *(*get_timeouts)(struct net *net);

that is used to return the array of unsigned int that contains
the timeouts that will be applied for this flow. This is passed
to the l4proto->new(...) and l4proto->packet(...) functions to
specify the timeout policy.

This interface allows per-net global timeout configuration
(although only DCCP supports this for now) and it will allow
custom timeout configuration by means of follow-up
patches.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
741385119706d4370eb7899c5ca96ad125c520e5 06-Mar-2012 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: fix early_drop with reliable event delivery

If reliable event delivery is enabled and ctnetlink fails to deliver
the destroy event in early_drop, the conntrack subsystem cannot
drop the candidate flow that was planned to be evicted.

Reported-by: Kerin Millar <kerframil@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7d367e06688dc7a2cc98c2ace04e1296e1d987e2 24-Feb-2012 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> netfilter: ctnetlink: fix soft lockup when netlink adds new entries (v2)

Marcell Zambo and Janos Farago noticed and reported that when
new conntrack entries are added via netlink and the conntrack table
gets full, a soft lockup happens. This is because the nf_conntrack_lock
is held while nf_conntrack_alloc is called, which in turn wants
to lock nf_conntrack_lock while evicting entries from the full table.

The patch fixes the soft lockup by limiting the holding of the
nf_conntrack_lock to the minimum, where it's absolutely required.
This required extending (and thus changing) nf_conntrack_hash_insert
so that it makes sure conntrack and ctnetlink do not add the same entry
twice to the conntrack table.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
cf778b00e96df6d64f8e21b8395d1f8a859ecdc7 12-Jan-2012 Eric Dumazet <eric.dumazet@gmail.com> net: reintroduce missing rcu_assign_pointer() calls

commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to
RCU_INIT_POINTER) made a lot of incorrect changes, since it did a
complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x,
y).

We miss needed barriers, even on x86, when y is not NULL.
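
The distinction can be sketched in userspace with C11 atomics. This is
an analogy only, not the kernel primitives: publish() stands in for
rcu_assign_pointer() with a non-NULL value, unpublish() for
RCU_INIT_POINTER(x, NULL), and all names below are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct item {
	int value;
};

static _Atomic(struct item *) shared;

/* Publishing a non-NULL pointer: the release store keeps the
 * initialization of *p ordered before the pointer becomes visible. */
static void publish(struct item *p, int value)
{
	p->value = value;
	atomic_store_explicit(&shared, p, memory_order_release);
}

/* Storing NULL publishes no data, so no ordering is required. */
static void unpublish(void)
{
	atomic_store_explicit(&shared, NULL, memory_order_relaxed);
}

/* Reader side; acquire is used here as a conservative stand-in for
 * the dependency ordering that rcu_dereference() relies on. */
static int read_value(int fallback)
{
	struct item *p = atomic_load_explicit(&shared, memory_order_acquire);

	return p ? p->value : fallback;
}
```

Using the relaxed form in publish() would let a reader observe the
pointer before the initialized contents.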

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4d4e61c6ca683cdc0ea07d39c80cc8d6d478b31e 23-Dec-2011 Patrick McHardy <kaber@trash.net> netfilter: nf_nat: use hash random for bysource hash

Use nf_conntrack_hash_rnd in NAT bysource hash to avoid hash chain attacks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
966567b7644b540962d90a0878706f59ae22c7e1 19-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com> net: two vzalloc() cleanups

We can use vzalloc() helper now instead of __vmalloc() trick

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b3e0bfa71b1db9d7a9fbea6965867784fd00ca3c 14-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: use atomic64 for accounting counters

We can use atomic64_t infrastructure to avoid taking a spinlock in fast
path, and remove inaccuracies while reading values in
ctnetlink_dump_counters() and connbytes_mt() on 32bit arches.
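
A userspace sketch of the idea, using C11 atomics in place of the
kernel's atomic64_t (struct flow_counter, counter_account() and
counter_bytes() are hypothetical names): updates need no lock, and a
reader gets a whole 64-bit value rather than two halves that could be
torn on a 32-bit machine.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-direction accounting counters. */
struct flow_counter {
	_Atomic uint64_t packets;
	_Atomic uint64_t bytes;
};

/* Fast path: account one packet of `len` bytes, lock-free. */
static void counter_account(struct flow_counter *c, uint64_t len)
{
	atomic_fetch_add_explicit(&c->packets, 1, memory_order_relaxed);
	atomic_fetch_add_explicit(&c->bytes, len, memory_order_relaxed);
}

/* Reader: a single atomic 64-bit load. */
static uint64_t counter_bytes(struct flow_counter *c)
{
	return atomic_load_explicit(&c->bytes, memory_order_relaxed);
}
```

Note that packets and bytes are still two independent loads, so a
reader may see one updated slightly before the other.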

Suggested by Pablo.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
c0cd115667bcd23c2a31fe2114beaab3608de68c 12-Dec-2011 Igor Maravić <igorm@etf.rs> net:netfilter: use IS_ENABLED

Use IS_ENABLED(CONFIG_FOO)
instead of defined(CONFIG_FOO) || defined (CONFIG_FOO_MODULE)
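
For readers unfamiliar with the idiom, here is one well-known way such
a macro can be built with the preprocessor alone. It is a sketch under
assumed names (every DEMO_* and CONFIG_DEMO_* identifier is
hypothetical) and is not claimed to match the kernel's definition at
the time of this commit. It relies on enabled options being defined
to 1.

```c
/* If the option expands to 1, the pasted name becomes "0," and shifts
 * a 1 into the second argument slot; otherwise junk stays in slot one. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND_ARG(ignored, val, ...) val
#define DEMO_IS_DEFINED_3(arg1_or_junk) \
	DEMO_TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define DEMO_IS_DEFINED_2(val) DEMO_IS_DEFINED_3(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED_1(x) DEMO_IS_DEFINED_2(x)

/* True if the option is built in or built as a module. */
#define DEMO_IS_ENABLED(option) \
	(DEMO_IS_DEFINED_1(option) || DEMO_IS_DEFINED_1(option##_MODULE))

/* Hypothetical configuration: one built-in, one modular option. */
#define CONFIG_DEMO_BUILTIN 1
#define CONFIG_DEMO_MOD_MODULE 1

/* Usable in ordinary C expressions, not just in #if. */
static int demo_enabled_count(void)
{
	return DEMO_IS_ENABLED(CONFIG_DEMO_BUILTIN) +
	       DEMO_IS_ENABLED(CONFIG_DEMO_MOD) +
	       DEMO_IS_ENABLED(CONFIG_DEMO_ABSENT);
}
```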

Signed-off-by: Igor Maravić <igorm@etf.rs>
Signed-off-by: David S. Miller <davem@davemloft.net>
0a9ee81349d90c6c85831f38118bf569c60a4d51 29-Aug-2011 Joe Perches <joe@perches.com> netfilter: Remove unnecessary OOM logging messages

Site specific OOM messages are duplications of a generic MM
out of memory message and aren't really useful, so just
delete them.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
a9b3cd7f323b2e57593e7215362a7b02fc933e3a 01-Aug-2011 Stephen Hemminger <shemminger@vyatta.com> rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER

When assigning a NULL value to an RCU protected pointer, no barrier
is needed. rcu_assign_pointer used to handle that, but will soon
change to not handle the special case.

Convert all rcu_assign_pointer of NULL value.

// <smpl>
@@ expression P; @@

- rcu_assign_pointer(P, NULL)
+ RCU_INIT_POINTER(P, NULL)

// </smpl>

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
88ed01d17b44bc2bed4ad4835d3b1099bff3dd71 02-Jun-2011 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: fix ct refcount leak in l4proto->error()

This patch fixes a refcount leak of ct objects that may occur if
l4proto->error() assigns one conntrack object to one skbuff. In
that case, we have to skip further processing in nf_conntrack_in().

With this patch, we can also fix wrong return values (-NF_ACCEPT)
for special cases in ICMP[v6] that should not bump the invalid/error
statistic counters.

Reported-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
fb04883371f2cb7867d24783e7d590036dc9b548 19-May-2011 Eric Dumazet <eric.dumazet@gmail.com> netfilter: add more values to enum ip_conntrack_info

The following warning is raised (along with other similar ones):

net/ipv4/netfilter/nf_nat_standalone.c: In function ‘nf_nat_fn’:
net/ipv4/netfilter/nf_nat_standalone.c:119:2: warning: case value ‘4’
not in enumerated type ‘enum ip_conntrack_info’

gcc barfs on adding two enum values and getting a not enumerated
result :

case IP_CT_RELATED+IP_CT_IS_REPLY:

Add missing enum values
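
A reduced illustration of the fix, with hypothetical CT_* names
standing in for the real enumerators: once the combined values are
named members of the enum, a switch can use them directly and the
compiler no longer sees a case value outside the enumerated type.

```c
/* Hypothetical stand-in for the conntrack info enum. */
enum ct_info {
	CT_ESTABLISHED,	/* 0 */
	CT_RELATED,	/* 1 */
	CT_NEW,		/* 2 */
	CT_IS_REPLY,	/* 3: offset added for reply-direction packets */

	/* Named sums, so they are valid enumerators in a switch. */
	CT_ESTABLISHED_REPLY = CT_ESTABLISHED + CT_IS_REPLY,	/* 3 */
	CT_RELATED_REPLY = CT_RELATED + CT_IS_REPLY,		/* 4 */
	CT_NEW_REPLY = CT_NEW + CT_IS_REPLY			/* 5 */
};

/* Strip the direction and return the base state. */
static enum ct_info ct_base_state(enum ct_info info)
{
	switch (info) {
	case CT_ESTABLISHED:
	case CT_ESTABLISHED_REPLY:	/* same value as CT_IS_REPLY */
		return CT_ESTABLISHED;
	case CT_RELATED:
	case CT_RELATED_REPLY:		/* previously: CT_RELATED + CT_IS_REPLY */
		return CT_RELATED;
	case CT_NEW:
	case CT_NEW_REPLY:
		return CT_NEW;
	}
	return info;
}
```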

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
fe8f661f2c2bb058822f13f6f232e121bde1338f 14-Mar-2011 Stephen Hemminger <shemminger@vyatta.com> netfilter: nf_conntrack: fix sysctl memory leak

A message appears in the log because the sysctl table was not empty at netns exit:
WARNING: at net/sysctl_net.c:84 sysctl_net_exit+0x2a/0x2c()

Instrumenting showed that the nf_conntrack_timestamp was the entry
that was being created but not cleared.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
c317428644c0af137d80069ab178cd797da3be45 09-Feb-2011 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: set conntrack templates again if we return NF_REPEAT

The TCP tracking code has a special case that allows returning
NF_REPEAT if we receive a new SYN packet while in TIME_WAIT state.

In this situation, the TCP tracking code destroys the existing
conntrack to start a new clean session.

[DESTROY] tcp 6 src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925 [ASSURED]
[NEW] tcp 6 120 SYN_SENT src=192.168.0.2 dst=192.168.1.2 sport=38925 dport=8000 [UNREPLIED] src=192.168.1.2 dst=192.168.1.100 sport=8000 dport=38925

However, this is a problem for the iptables' CT target event filtering
which will not work in this case since the conntrack template will not
be there for the new session. To fix this, we reassign the conntrack
template to the packet if we return NF_REPEAT.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
a992ca2a0498edd22a88ac8c41570f536de29c9e 19-Jan-2011 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack_tstamp: add flow-based timestamp extension

This patch adds flow-based timestamping for conntracks. Basically,
we use two 64-bit variables: one to store the creation timestamp
once the conntrack has been confirmed, and the other to store the
deletion time. This extension is disabled by default; to enable it,
you have to:

echo 1 > /proc/sys/net/netfilter/nf_conntrack_timestamp

This patch helps save memory for user-space flow-based
loggers such as ulogd2. In short, ulogd2 does not need to
keep a hashtable with the conntracks in user-space to know
when they were created and destroyed; instead we use the
kernel timestamp. If we want to have a sane IPFIX implementation
in user-space, these nanosecond-resolution timestamps are also
useful. Other custom user-space applications can benefit from
this via libnetfilter_conntrack.

This patch modifies the /proc output to display the delta time
in seconds since the flow start. You can also obtain the
flow-start date by means of the conntrack-tools.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
45eec34195853e918518231dcefaca1ea4ebacfc 18-Jan-2011 Changli Gao <xiaosuo@gmail.com> netfilter: nf_conntrack: remove an atomic bit operation

As this ct won't be seen by the others, we don't need to set the
IPS_CONFIRMED_BIT atomically.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Cc: Tim Gardner <tim.gardner@canonical.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
d862a6622e9db508d4b28cc7c5bc28bd548cc24e 14-Jan-2011 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: use is_vmalloc_addr()

Use is_vmalloc_addr() in nf_ct_free_hashtable() and get rid of
the vmalloc flags to indicate that a hash table has been allocated
using vmalloc().

Signed-off-by: Patrick McHardy <kaber@trash.net>
f682cefa5ad204d3bfaa54a58046c66d2d035ac1 05-Jan-2011 Changli Gao <xiaosuo@gmail.com> netfilter: fix the race when initializing nf_ct_expect_hash_rnd

Since nf_ct_expect_dst_hash() may be called without nf_conntrack_lock
locked, nf_ct_expect_hash_rnd should be initialized atomically.

In this patch, we use nf_conntrack_hash_rnd instead of
nf_ct_expect_hash_rnd.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e5fc9e7a666e5964b60e05903b90aa832354b68c 12-Nov-2010 Changli Gao <xiaosuo@gmail.com> netfilter: nf_conntrack: don't always initialize ct->proto

ct->proto is big (60 bytes) due to structure ip_ct_tcp, and we don't need
to initialize all of it for the other protocols. This patch moves
proto to the end of structure nf_conn, and pushes the initialization down
to the individual protocols.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
6b1686a71e3158d3c5f125260effce171cc7852b 28-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: allow nf_ct_alloc_hashtable() to get highmem pages

commit ea781f197d6a8 (use SLAB_DESTROY_BY_RCU and get rid of call_rcu())
made a mistake in the __vmalloc() call in nf_ct_alloc_hashtable().

I forgot to add __GFP_HIGHMEM, so pages were taken from LOWMEM only.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
99f07e91bef34db0fc8b1a224096e97f02dc0d56 21-Sep-2010 Changli Gao <xiaosuo@gmail.com> netfilter: save the hash of the tuple in the original direction for latter use

Since we don't change the tuple in the original direction, we can save it
in ct->tuplehash[IP_CT_DIR_REPLY].hnode.pprev for __nf_conntrack_confirm()
use.

__hash_conntrack() is split into two steps: hash_conntrack_raw() is used
to get the raw hash, and __hash_bucket() is used to get the bucket id.

In SYN-flood case, early_drop() doesn't need to recompute the hash again.
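
The two-step split can be sketched as follows. The raw hash below is a
toy FNV-1a stand-in for the kernel's keyed hash, the function names
are hypothetical, and the multiply-and-shift bucket mapping is an
assumption about how the bucket id is derived. The point is that the
expensive step runs once and its 32-bit result can be stored and
mapped to a bucket later.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1: raw hash of the tuple bytes (toy FNV-1a mixed with a seed). */
static uint32_t demo_hash_raw(const void *tuple, size_t len, uint32_t seed)
{
	const unsigned char *p = tuple;
	uint32_t h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Step 2: scale the 32-bit hash into [0, size) without a modulo. */
static uint32_t demo_hash_bucket(uint32_t hash, uint32_t size)
{
	return (uint32_t)(((uint64_t)hash * size) >> 32);
}
```

A caller would keep the value returned by demo_hash_raw() alongside
the entry and call only demo_hash_bucket() when the bucket is needed.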

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
b23909695c33f53df5f1d16696b1aa5b874c1904 16-Sep-2010 Changli Gao <xiaosuo@gmail.com> netfilter: nf_conntrack: fix the hash random initializing race

nf_conntrack_alloc() isn't called with nf_conntrack_lock locked, so the
hash random initialization code may be executed more than once on
different CPUs.
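
One lock-free way to make such a one-time initialization safe is a
compare-and-swap from the "unset" value, so only the first writer
wins. A sketch with C11 atomics and hypothetical names (demo_hash_rnd,
demo_hash_rnd_init_once()); whether the patch uses exactly this shape
is not stated in the message.

```c
#include <stdatomic.h>

/* 0 means "not initialized yet". */
static _Atomic unsigned int demo_hash_rnd;

/* Try to install `candidate` as the seed; returns the seed in effect.
 * Concurrent callers all end up with the same value. */
static unsigned int demo_hash_rnd_init_once(unsigned int candidate)
{
	unsigned int expected = 0;

	if (candidate == 0)
		candidate = 1; /* 0 is reserved for the unset state */

	/* Succeeds only if nobody has stored a seed before us. */
	atomic_compare_exchange_strong(&demo_hash_rnd, &expected, candidate);
	return atomic_load(&demo_hash_rnd);
}
```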

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
6661481d5a8975657742c7ed40ae16bdaa7d0a6e 02-Aug-2010 Changli Gao <xiaosuo@gmail.com> netfilter: nf_conntrack_acct: use skb->len for accounting

Use skb->len for accounting as xt_quota does. Since nf_conntrack works
at the network layer, skb_network_offset should always return zero.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
b3c5163fe0193a74016dba1bb22491e0d1e9aaa4 09-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: per_cpu untracking

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
5bfddbd46a95c978f4d3c992339cbdf4f4b790a3 08-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: IPS_UNTRACKED bit

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.

This patch introduces the IPS_UNTRACKED bit so that we can later use a
per_cpu untrack structure more easily.

A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.

Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.

nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
c2d9ba9bce8d7323ca96f239e1f505c14d6244fb 01-Jun-2010 Eric Dumazet <eric.dumazet@gmail.com> net: CONFIG_NET_NS reduction

Use read_pnet() and write_pnet() to reduce number of ifdef CONFIG_NET_NS

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fc350777c705a39a312728ac5e8a6f164a828f5d 20-May-2010 Joerg Marx <joerg.marx@secunet.com> netfilter: nf_conntrack: fix a race in __nf_conntrack_confirm against nf_ct_get_next_corpse()

This race was triggered by a 'conntrack -F' command running in parallel
to the insertion of a hash for a new connection. Losing this race led to
a dead conntrack entry effectively blocking traffic for a particular
connection until timeout or flushing the conntrack hashes again.
Now the check for an already dying connection is done inside the lock.

Signed-off-by: Joerg Marx <joerg.marx@secunet.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
654d0fbdc8fe1041918741ed5b6abc8ad6b4c1d8 13-May-2010 Stephen Hemminger <shemminger@vyatta.com> netfilter: cleanup printk messages

Make sure all printk messages have a severity level.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
af740b2c8f4521e2c45698ee6040941a82d6349d 23-Apr-2010 Jesper Dangaard Brouer <hawk@comx.dk> netfilter: nf_conntrack: extend with extra stat counter

I suspect an unfortunate series of events occurring under a DDoS
attack, in function __nf_conntrack_find() in nf_conntrack_core.c.

Adding a stats counter to see if the search is restarted too often.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
5d0aa2ccd4699a01cfdf14886191c249d7b45a01 15-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: add support for "conntrack zones"

Normally, each connection needs a unique identity. Conntrack zones allow
specifying a numerical zone using the CT target; connections in different
zones can use the same identity.

Example:

iptables -t raw -A PREROUTING -i veth0 -j CT --zone 1
iptables -t raw -A OUTPUT -o veth1 -j CT --zone 1

Signed-off-by: Patrick McHardy <kaber@trash.net>
8fea97ec1772bbf553d89187340ef624d548e115 15-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: pass template to l4proto ->error() handler

The error handlers might need the template to get the conntrack zone
introduced in the next patches to perform a conntrack lookup.

Signed-off-by: Patrick McHardy <kaber@trash.net>
d696c7bdaa55e2208e56c6f98e6bc1599f34286d 08-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix hash resizing with namespaces

As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.

Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instantiating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5b3501faa8741d50617ce4191c20061c6ef36cb3 08-Feb-2010 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: per netns nf_conntrack_cachep

nf_conntrack_cachep is currently shared by all netns instances, but
because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.

If we use a shared slab cache, one object can instantly move from
one hash table (netns ONE) to another one (netns TWO), and a concurrent
reader (doing a lookup in netns ONE, 'finding' an object of netns TWO)
can be fooled without notice, because no RCU grace period has to be
observed between object freeing and its reuse.

We don't have this problem with UDP/TCP slab caches because TCP/UDP
hashtables are global to the machine (and each object has a pointer to
its netns).

If we use per netns conntrack hash tables, we also *must* use per netns
conntrack slab caches, to guarantee an object can not escape from one
namespace to another one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[Patrick: added unique slab name allocation]
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
9edd7ca0a3e3999c260642c92fa008892d82ca6e 08-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix memory corruption with multiple namespaces

As discovered by Jon Masters <jonathan@jonmasters.org>, the "untracked"
conntrack, which is located in the data section, might be accidentally
freed when a new namespace is instantiated while the untracked conntrack
is attached to a skb because the reference count is re-initialized.

The best fix would be to use a separate untracked conntrack per
namespace since it includes a namespace pointer. Unfortunately this is
not possible without larger changes since the namespace is not easily
available everywhere we need it. For now move the untracked conntrack
initialization to the init_net setup function to make sure the reference
count is not re-initialized and handle cleanup in the init_net cleanup
function to make sure namespaces can exit properly while the untracked
conntrack is in use in other namespaces.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9ab48ddcb144fdee908708669448dd136cf4894a 08-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix hash resizing with namespaces

As noticed by Jon Masters <jonathan@jonmasters.org>, the conntrack hash
size is global and not per namespace, but modifiable at runtime through
/sys/module/nf_conntrack/hashsize. Changing the hash size will only
resize the hash in the current namespace however, so other namespaces
will use an invalid hash size. This can cause crashes when enlarging
the hashsize, or false negative lookups when shrinking it.

Move the hash size into the per-namespace data and only use the global
hash size to initialize the per-namespace value when instantiating a
new namespace. Additionally restrict hash resizing to init_net for
now as other namespaces are not handled currently.

Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
ab59b19be78aac65cdd599fb5002c9019885e061 04-Feb-2010 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: per netns nf_conntrack_cachep

nf_conntrack_cachep is currently shared by all netns instances, but
because of SLAB_DESTROY_BY_RCU special semantics, this is wrong.

If we use a shared slab cache, one object can instantly move from
one hash table (netns ONE) to another one (netns TWO), and a concurrent
reader (doing a lookup in netns ONE, 'finding' an object of netns TWO)
can be fooled without notice, because no RCU grace period has to be
observed between object freeing and its reuse.

We don't have this problem with UDP/TCP slab caches because TCP/UDP
hashtables are global to the machine (and each object has a pointer to
its netns).

If we use per netns conntrack hash tables, we also *must* use per netns
conntrack slab caches, to guarantee an object can not escape from one
namespace to another one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[Patrick: added unique slab name allocation]
Signed-off-by: Patrick McHardy <kaber@trash.net>
b2a15a604d379af323645e330638e2cfcc696aff 03-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: support conntrack templates

Support initializing selected parameters of new conntrack entries from a
"conntrack template", which is a specially marked conntrack entry attached
to the skb.

Currently the helper and the event delivery masks can be initialized this
way.

Signed-off-by: Patrick McHardy <kaber@trash.net>
0cebe4b4163b6373c9d24c1a192939777bc27e55 03-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: ctnetlink: support selective event delivery

Add two masks for conntrack and expectation events to struct nf_conntrack_ecache
and use them to filter events. Their default value is "all events" when the
event sysctl is on and "no events" when it is off. A following patch will add
specific initializations. Expectation events depend on the ecache struct of
their master conntrack.

Signed-off-by: Patrick McHardy <kaber@trash.net>
858b31330054a9ad259feceea0ad1ce5385c47f0 03-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: split up IPCT_STATUS event

Split up the IPCT_STATUS event into an IPCT_REPLY event, which is generated
when the IPS_SEEN_REPLY bit is set, and an IPCT_ASSURED event, which is
generated when the IPS_ASSURED bit is set.

In combination with a following patch to support selective event delivery,
this can be used for "sparse" conntrack replication: start replicating the
conntrack entry after it reached the ASSURED state and that way it's SYN-flood
resistant.

Signed-off-by: Patrick McHardy <kaber@trash.net>
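
The "sparse" replication idea described in the commit above can be
sketched in userspace C. This is only an illustration under stated
assumptions: the EV_* bits, struct toy_ecache and
toy_events_to_deliver() are hypothetical stand-ins and do not reproduce
the kernel's IPCT_* values or the nf_conntrack_ecache interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event bits; the real IPCT_* values are not reproduced here. */
#define EV_NEW     (1u << 0)
#define EV_REPLY   (1u << 1)
#define EV_ASSURED (1u << 2)
#define EV_DESTROY (1u << 3)

/* Toy stand-in for a per-conntrack event cache with a delivery mask. */
struct toy_ecache {
	unsigned int pending;	/* events recorded but not yet delivered */
	unsigned int mask;	/* events the listener asked for */
};

/* Return the pending events that pass the mask, then clear all pending bits. */
static unsigned int toy_events_to_deliver(struct toy_ecache *e)
{
	unsigned int out;

	assert(e != NULL);
	out = e->pending & e->mask;
	e->pending = 0;
	return out;
}
```

With a mask of EV_ASSURED | EV_DESTROY, the NEW and REPLY events of a
half-open connection are filtered out, so a SYN flood would generate no
replication traffic in this model.
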
056ff3e3bd1563969a311697323ff929df94415c 03-Feb-2010 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix memory corruption with multiple namespaces

As discovered by Jon Masters <jonathan@jonmasters.org>, the "untracked"
conntrack, which is located in the data section, might be accidentally
freed when a new namespace is instantiated while the untracked conntrack
is attached to a skb because the reference count is re-initialized.

The best fix would be to use a separate untracked conntrack per
namespace since it includes a namespace pointer. Unfortunately this is
not possible without larger changes since the namespace is not easily
available everywhere we need it. For now move the untracked conntrack
initialization to the init_net setup function to make sure the reference
count is not re-initialized and handle cleanup in the init_net cleanup
function to make sure namespaces can exit properly while the untracked
conntrack is in use in other namespaces.

Signed-off-by: Patrick McHardy <kaber@trash.net>
f9dd09c7f7199685601d75882447a6598be8a3e0 06-Nov-2009 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> netfilter: nf_nat: fix NAT issue in 2.6.30.4+

Vitezslav Samel discovered that since 2.6.30.4+ active FTP can not work
over NAT. The "cause" of the problem was a fix of unacknowledged data
detection with NAT (commit a3a9f79e361e864f0e9d75ebe2a0cb43d17c4272).
However, actually, that fix uncovered a long standing bug in TCP conntrack:
when NAT was enabled, we simply updated the max of the right edge of
the segments we have seen (td_end), by the offset NAT produced with
changing IP/port in the data. However, we did not update the other parameter
(td_maxend) which is affected by the NAT offset. Thus that could drift
away from the correct value, which resulted in breaking active FTP.

The patch below fixes the issue by *not* updating the conntrack parameters
from NAT, but instead taking into account the NAT offsets in conntrack in a
consistent way. (Updating from NAT would be harder and more expensive because
it'd need to re-calculate parameters we already calculated in conntrack.)

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5ae27aa2b16478a84d833ab4065798e752941c5a 05-Nov-2009 Changli Gao <xiaosuo@gmail.com> netfilter: nf_conntrack: avoid additional compare.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
d43c36dc6b357fa1806800f18aa30123c747a6d1 07-Oct-2009 Alexey Dobriyan <adobriyan@gmail.com> headers: remove sched.h from interrupt.h

After m68k's task_thread_info() doesn't refer to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
4481374ce88ba8f460c8b89f2572027bd27057d0 22-Sep-2009 Jan Beulich <JBeulich@novell.com> mm: replace various uses of num_physpages by totalram_pages

Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.

Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ee254fa44d902ab89fd0d66851701098f07872a7 31-Aug-2009 Alexey Dobriyan <adobriyan@gmail.com> netfilter: nf_conntrack: netns fix re reliable conntrack event delivery

Conntracks on the dying list of a netns other than init_net were never killed.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
3993832464dd4e14a4c926583a11f0fa92c1f0f0 25-Aug-2009 Patrick McHardy <kaber@trash.net> netfilter: nfnetlink: constify message attributes and headers

Signed-off-by: Patrick McHardy <kaber@trash.net>
941297f443f871b8c3372feccf27a8733f6ce9e9 16-Jul-2009 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: nf_conntrack_alloc() fixes

When a slab cache uses SLAB_DESTROY_BY_RCU, we must be careful when allocating
objects, since the slab allocator could hand out a freed object still used by lockless
readers.

In particular, nf_conntrack RCU lookups rely on ct->tuplehash[xxx].hnnode.next
being always valid (ie containing a valid 'nulls' value, or a valid pointer to next
object in hash chain.)

kmem_cache_zalloc() sets up the object with NULL values, but a NULL value is not valid
for ct->tuplehash[xxx].hnnode.next.

The fix is to call kmem_cache_alloc() and do the zeroing ourselves.

As spotted by Patrick, we also need to make sure lookup keys are committed to
memory before setting refcount to 1, or a lockless reader could get a reference
on the old version of the object. Its key re-check could then pass the barrier.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
8d8890b7751387f58ce0a6428773de2fbc0fd596 22-Jun-2009 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix conntrack lookup race

The RCU protected conntrack hash lookup only checks whether the entry
has a refcount of zero to decide whether it is stale. This is not
sufficient: entries are explicitly removed while there is at least
one reference left, possibly more. Explicitly check whether the entry
has been marked as dying to fix this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
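
The extra check described in the commit above can be modeled in
userspace with C11 atomics. This is a sketch under assumptions, not the
kernel code: struct toy_entry, TOY_DYING and the helper names are
hypothetical, and the real lookup also re-checks the tuple after taking
the reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DYING (1u << 0)	/* hypothetical "marked as dying" status bit */

struct toy_entry {
	atomic_int refcnt;
	atomic_uint status;
};

/* Take a reference only if the count is still non-zero. */
static bool toy_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* A lookup hit is usable only if it is not dying and a reference was taken. */
static bool toy_entry_try_get(struct toy_entry *e)
{
	assert(e != NULL);
	if (atomic_load(&e->status) & TOY_DYING)
		return false;
	return toy_inc_not_zero(&e->refcnt);
}
```

A refcount test alone would accept an entry that has been unlinked but
is still referenced elsewhere; the dying test rejects it.
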
5c8ec910e789a92229978d8fd1fce7b62e8ac711 22-Jun-2009 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix confirmation race condition

New connection tracking entries are inserted into the hash before they
are fully set up, namely the CONFIRMED bit is not set and the timer not
started yet. This can theoretically lead to a race with the timer, which
would set the timeout value to a relative value, most likely already in
the past.

Perform hash insertion as the final step to fix this.

Signed-off-by: Patrick McHardy <kaber@trash.net>
8cc20198cfccd06cef705c14fd50bde603e2e306 22-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> netfilter: nf_conntrack: death_by_timeout() fix

death_by_timeout() might delete a conntrack from hash list
and insert it in dying list.

nf_ct_delete_from_lists(ct);
nf_ct_insert_dying_list(ct);

I believe a (lockless) reader could *catch* ct while doing a lookup
and miss the end of its chain.
(The nulls lookup algorithm must check the nulls value at the end of the
lookup and restart if it is not the expected one;
cf. Documentation/RCU/rculist_nulls.txt for details.)

We need to change nf_conntrack_init_net() and use a different "null" value,
guaranteed not to be used in regular lists. Choose very large values, since
the hash table uses [0..size-1] null values.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
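
The end-of-chain check referred to in the commit above can be
illustrated with a small userspace model. It is a sketch under
assumptions: the toy_* names are hypothetical, the marker encoding
(odd value, id shifted left by one) follows the scheme described in
Documentation/RCU/rculist_nulls.txt, and no RCU or memory barriers are
modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy chain node; next holds either a node pointer or a nulls marker. */
struct toy_node {
	uintptr_t next;
	int key;
};

/* A nulls marker is an odd value that carries a list id. */
static uintptr_t toy_make_nulls(unsigned long id)
{
	return ((uintptr_t)id << 1) | 1;
}

static int toy_is_nulls(uintptr_t p)
{
	return (int)(p & 1);
}

/*
 * Walk a chain looking for key. Returns 1 if the walk ended on a nulls
 * marker that does not belong to the bucket we started in, meaning the
 * reader was carried onto another list and must restart the lookup.
 */
static int toy_lookup_must_restart(uintptr_t head, int key, unsigned long bucket)
{
	uintptr_t p = head;

	while (!toy_is_nulls(p)) {
		const struct toy_node *n = (const struct toy_node *)p;

		if (n->key == key)
			return 0;	/* found, nothing to restart */
		p = n->next;
	}
	return (p >> 1) != bucket;
}
```

Giving the dying list a marker far outside [0..size-1] means a reader
that followed a moved entry ends on an unexpected value and retries
instead of reporting a false miss.
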
dd7669a92c6066b2b31bae7e04cd787092920883 13-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: conntrack: optional reliable conntrack event delivery

This patch improves ctnetlink event reliability if one broadcast
listener has set the NETLINK_BROADCAST_ERROR socket option.

The logic is the following: if an event delivery fails, we keep
the undelivered events in the missed event cache. Once the next
packet arrives, we add the new events (if any) to the missed
events in the cache and we try a new delivery, and so on. Thus,
if ctnetlink fails to deliver an event, we try to deliver them
once we see a new packet. Therefore, we may lose state
transitions but the userspace process gets in sync at some point.

In the worst case, if no events were delivered to userspace, we make
sure that destroy events are successfully delivered. Basically,
if ctnetlink fails to deliver the destroy event, we remove the
conntrack entry from the hashes and we insert them in the dying
list, which contains inactive entries. Then, the conntrack timer
is added with an extra grace timeout of random32() % 15 seconds
to trigger the event again (this grace timeout is tunable via
/proc). The use of a limited random timeout value allows
distributing the "destroy" resends, thus avoiding the accumulation of
lots of "destroy" events at the same time. Events may be delivered
out of order, but we can identify them by means of the tuple plus
the conntrack ID.

The maximum number of conntrack entries (active or inactive) is
still handled by nf_conntrack_max. Thus, we may start dropping
packets at some point if we accumulate a lot of inactive conntrack
entries that did not successfully report the destroy event to
userspace.

During my stress tests consisting of setting a very small buffer
of 2048 bytes for conntrackd and the NETLINK_BROADCAST_ERROR socket
flag, and generating lots of very small connections, I noticed
very few destroy entries on the fly waiting to be resent.

A simple way to test this patch consists of creating a lot of
entries, setting a very small Netlink buffer in conntrackd (+ a patch
which is not in the git tree to set the BROADCAST_ERROR flag)
and invoking `conntrack -F'.

For expectations, no changes are introduced in this patch.
Currently, event delivery is only done for new expectations (no
events from expectation expiration, removal and confirmation).
In that case, they need a per-expectation event cache to implement
the same idea that is exposed in this patch.

This patch can be useful to provide reliable flow-accounting. We
still have to add a new conntrack extension to store the creation
and destroy time.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
9858a3ae1d4b390fbaa9c30b83cb66d861b76294 13-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: conntrack: move helper destruction to nf_ct_helper_destroy()

This patch moves the helper destruction to a function that lives
in nf_conntrack_helper.c. This new function is used in the patch
to add ctnetlink reliable event delivery.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
a0891aa6a635f658f29bb061a00d6d3486941519 13-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: conntrack: move event caching to conntrack extension infrastructure

This patch reworks the per-cpu event caching to use the conntrack
extension infrastructure.

The main drawback is that we consume more memory per conntrack
if event delivery is enabled. This patch is required by the
reliable event delivery that follows this patch.

BTW, this patch allows you to enable/disable event delivery via
/proc/sys/net/netfilter/nf_conntrack_events at runtime, although
you can still disable event caching as a compile-time option.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
65cb9fda32be613216f601a330b311c3bd7a8436 13-Jun-2009 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: use mod_timer_pending() for conntrack refresh

Use mod_timer_pending() instead of atomic sequence of del_timer()/
add_timer(). mod_timer_pending() does not rearm an inactive timer,
so we don't need the conntrack lock anymore to make sure we don't
accidentally rearm a timer of a conntrack which is in the process
of being destroyed.

With this change, we don't need to take the global lock anymore at all,
counter updates can be performed under the per-conntrack lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
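
The property the commit above relies on can be shown with a toy model.
This is a userspace sketch, not the kernel timer API: struct toy_timer,
toy_mod_timer_pending() and toy_del_timer() are hypothetical stand-ins
that only mimic the "do not rearm an inactive timer" behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_timer {
	bool pending;		/* true while the timer is armed */
	unsigned long expires;
};

/* Update the expiry only if the timer is still armed; report whether it was. */
static bool toy_mod_timer_pending(struct toy_timer *t, unsigned long expires)
{
	assert(t != NULL);
	if (!t->pending)
		return false;	/* inactive: leave it alone, never rearm */
	t->expires = expires;
	return true;
}

/* Destroy path: disarm the timer. */
static void toy_del_timer(struct toy_timer *t)
{
	assert(t != NULL);
	t->pending = false;
}
```

Once the destroy path has disarmed the timer, a late refresh becomes a
no-op in this model, which is why no extra lock is needed to order the
two.
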
440f0d588555892601cfe511728a0fc0c8204063 10-Jun-2009 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: use per-conntrack locks for protocol data

Introduce per-conntrack locks and use them instead of the global protocol
locks to avoid contention. Especially tcp_lock shows up very high in
profiles on larger machines.

This will also allow simplifying the upcoming reliable event delivery patches.

Signed-off-by: Patrick McHardy <kaber@trash.net>
17e6e4eac070607a35464ea7e2c5eceac32e5eca 02-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: conntrack: simplify event caching system

This patch simplifies the conntrack event caching system by removing
several events:

* IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO have been deleted
since they have no clients.
* IPCT_COUNTER_FILLING, which is a leftover of the 32-bit counter
days.
* IPCT_REFRESH, which is not of any use since we always include the
timeout in the messages.

After this patch, the existing events are:

* IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, that are used to identify
addition and deletion of entries.
* IPCT_STATUS, that notes that the status bits have changed,
e.g. IPS_SEEN_REPLY and IPS_ASSURED.
* IPCT_PROTOINFO, that reports that internal protocol information has
changed, e.g. the TCP, DCCP and SCTP protocol state.
* IPCT_HELPER, that reports that a helper has been assigned to or
unassigned from this entry.
* IPCT_MARK and IPCT_SECMARK, that report that the mark has changed; this
covers the case when a mark is set to zero.
* IPCT_NATSEQADJ, that reports that there are updates in the NAT sequence
adjustment.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
274d383b9c1906847a64bbb267b0183599ce86a0 02-Jun-2009 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: conntrack: don't report events on module removal

During the module removal there are no possible event listeners
since ctnetlink must be removed first to allow removing
nf_conntrack. This patch removes the event reporting for the
module removal case which is not of any use in the existing code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
5c0de29d06318ec8f6e3ba0d17d62529dbbdc1e8 25-Mar-2009 Holger Eitzenberger <holger@eitzenberger.org> netfilter: nf_conntrack: add generic function to get len of generic policy

Useful for all protocols which do not add additional data, such
as GRE or UDPlite.

Signed-off-by: Holger Eitzenberger <holger@eitzenberger.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ea781f197d6a835cbb93a0bf88ee1696296ed8aa 25-Mar-2009 Eric Dumazet <dada1@cosmosbay.com> netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()

Use "hlist_nulls" infrastructure we added in 2.6.29 for RCUification of UDP & TCP.

This permits an easy conversion from call_rcu() based hash lists to a
SLAB_DESTROY_BY_RCU one.

Avoiding call_rcu() delay at nf_conn freeing time has numerous gains.

- It doesn't fill RCU queues (up to 10000 elements per cpu).
This reduces OOM possibility, if queued elements are not taken into account.
This also reduces latency problems when RCU queue size hits hilimit and triggers
emergency mode.

- It allows fast reuse of just freed elements, permitting better use of
CPU cache.

- We delete rcu_head from "struct nf_conn", shrinking size of this structure
by 8 or 16 bytes.

This patch only takes care of "struct nf_conn".
call_rcu() is still used for less critical conntrack parts, that may
be converted later if necessary.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
78f3648601fdc7a8166748bbd6d0555a88efa24a 25-Mar-2009 Eric Dumazet <dada1@cosmosbay.com> netfilter: nf_conntrack: use hlist_add_head_rcu() in nf_conntrack_set_hashsize()

Using hlist_add_head() in nf_conntrack_set_hashsize() is quite dangerous.
Without any barrier, one CPU could see a loop while doing its lookup.
It's true the new table cannot be seen by another cpu, but the previous table is still
readable.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
1d45209d89e647e9f27e4afa1f47338df73bc112 24-Mar-2009 Eric Dumazet <dada1@cosmosbay.com> netfilter: nf_conntrack: Reduce conntrack count in nf_conntrack_free()

We use RCU to defer freeing of conntrack structures. In a DoS situation, RCU might
accumulate about 10.000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), the conntrack
counter should not take into account elements that are about to be freed and
are waiting in RCU queues. We thus decrement it in nf_conntrack_free(), not in the RCU
callback.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Tested-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ec8d540969da9a70790e9028d57b5b577dd7aa77 16-Mar-2009 Christoph Paasch <christoph.paasch@gmail.com> netfilter: conntrack: fix dropping packet after l4proto->packet()

We currently use the negative value in the conntrack code to encode
the packet verdict in the error. As NF_DROP is equal to 0, inverting
NF_DROP makes no sense and, as a result, no packets are ever dropped.

Signed-off-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
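
The arithmetic behind the commit above is easy to check in plain C. The
verdict values are the ones from the netfilter headers; the two
toy_*_check_catches() helpers are hypothetical and only assume that the
caller distinguishes "error" results by sign, which is a guess at the
shape of the surrounding code rather than a quote of it.

```c
#include <assert.h>

/* Verdict values as defined in the netfilter UAPI header. */
#define NF_DROP   0
#define NF_ACCEPT 1

/* Assumed pre-fix style check: only strictly negative results are errors. */
static int toy_old_check_catches(int ret)
{
	return ret < 0;
}

/* A check that also treats 0 as an error result, so -NF_DROP is seen. */
static int toy_new_check_catches(int ret)
{
	return ret <= 0;
}

/* Decode the verdict from a negated return code. */
static int toy_verdict_from_ret(int ret)
{
	assert(ret <= 0);
	return -ret;
}
```

Because -NF_DROP is 0, a strictly-negative test can never see it, while
-NF_ACCEPT (-1) is caught by both checks.
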
7d1e04598e5e92527840b6889fb75b4b30fdd33b 24-Feb-2009 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: account packets drop by tcp_packet()

Since tcp_packet() may return -NF_DROP in two situations, the
packet-drop stats must be increased.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
af07d241dc76f0a52c7ff04df3a3970020fe6157 20-Feb-2009 Hagen Paul Pfeifer <hagen@jauu.net> netfilter: fix hardcoded size assumptions

get_random_bytes() is sometimes called with a hard-coded assumption
about the size of an integer. This may not hold true in coming centuries. This patch
replaces it with a compile-time statement.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
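
The difference between a hard-coded length and sizeof can be seen in a
userspace sketch. Everything here is an assumption for illustration:
toy_fill_bytes() stands in for get_random_bytes() and writes a fixed
0xA5 pattern so the result is deterministic, and toy_seed_t is
deliberately 8 bytes wide to make the mismatch visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical seed type, wider than the 4 bytes the old call assumed. */
typedef uint64_t toy_seed_t;

/* Deterministic stand-in for get_random_bytes(buf, nbytes). */
static void toy_fill_bytes(void *buf, size_t nbytes)
{
	assert(buf != NULL);
	memset(buf, 0xA5, nbytes);
}

/* Hard-coded length: only part of the variable is initialized. */
static toy_seed_t toy_seed_hardcoded(void)
{
	toy_seed_t s = 0;

	toy_fill_bytes(&s, 4);
	return s;
}

/* sizeof follows the variable's type, whatever its width. */
static toy_seed_t toy_seed_sizeof(void)
{
	toy_seed_t s = 0;

	toy_fill_bytes(&s, sizeof(s));
	return s;
}
```
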
e478075c6f07a383c378fb400edc1a7407a941b0 20-Feb-2009 Hagen Paul Pfeifer <hagen@jauu.net> netfilter: nf_conntrack: table max size should hold at least table size

Table size is defined as unsigned, whereas the table maximum size is
defined as a signed integer. The calculation of max is 8 or 4,
multiplied by the table size. Therefore the max value is made
unsigned as well.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
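
A short sketch of why the maximum should be unsigned, assuming 32-bit
integers. The toy_* helpers are hypothetical; only the "4 or 8 times the
table size" rule is taken from the commit message above.

```c
#include <assert.h>
#include <stdint.h>

/* max = factor * table size, computed in unsigned 32-bit arithmetic. */
static uint32_t toy_table_max(uint32_t htable_size, uint32_t factor)
{
	assert(factor == 4 || factor == 8);
	return factor * htable_size;
}

/* Would this value survive being stored in a signed 32-bit integer? */
static int toy_fits_signed(uint32_t v)
{
	return v <= (uint32_t)INT32_MAX;
}
```

For a table of 2^29 buckets and a factor of 4 the product is 2^31,
which an unsigned 32-bit value can hold but a signed one cannot.
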
cd7fcbf1cb6933bfb9171452b4a370c92923544d 12-Jan-2009 Julia Lawall <julia@diku.dk> netfilter 07/09: simplify nf_conntrack_alloc() error handling

nf_conntrack_alloc cannot return NULL, so there is no need to check for
NULL before using the value. I have also removed the initialization of ct
to NULL in nf_conntrack_alloc, since the value is never used, and since
perhaps it might lead one to think that return ct at the end might return
NULL.

The semantic patch that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@match exists@
expression x, E;
position p1,p2;
statement S1, S2;
@@

x@p1 = nf_conntrack_alloc(...)
... when != x = E
(
if (x@p2 == NULL || ...) S1 else S2
|
if (x@p2 == NULL && ...) S1 else S2
)

@other_match exists@
expression match.x, E1, E2;
position p1!=match.p1,match.p2;
@@

x@p1 = E1
... when != x = E2
x@p2

@ script:python depends on !other_match@
p1 << match.p1;
p2 << match.p2;
@@

print "%s: call to nf_conntrack_alloc %s bad test %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
b54ad409fd09a395b839fb81f300880d76861c0e 25-Nov-2008 Patrick McHardy <kaber@trash.net> netfilter: ctnetlink: fix conntrack creation race

Conntrack creation through ctnetlink has two races:

- the timer may expire and free the conntrack concurrently, causing an
invalid memory access when attempting to put it in the hash tables

- an identical conntrack entry may be created in the packet processing
path in the time between the lookup and hash insertion

Hold the conntrack lock between the lookup and insertion to avoid this.

Reported-by: Zoltan Borbely <bozo@andrews.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e17b666a468285409ab9f6caff9df16936d27d71 18-Nov-2008 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix warning and prototype mismatch

net/netfilter/nf_conntrack_core.c:46:1: warning: symbol 'nfnetlink_parse_nat_setup_hook' was not declared. Should it be static?

Including the proper header also revealed an incorrect prototype.

Signed-off-by: Patrick McHardy <kaber@trash.net>
19abb7b090a6bce88d4e9b2914a0367f4f684432 18-Nov-2008 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: ctnetlink: deliver events for conntracks changed from userspace

As of now, the creation and update of conntracks via ctnetlink do not
propagate an event to userspace. This can result in inconsistent situations
if several userspace processes modify the connection tracking table by means
of ctnetlink at the same time. Specifically, using the conntrack command
line tool and conntrackd at the same time can trigger inconsistencies.

This patch also modifies the event cache infrastructure to pass the
process PID and the ECHO flag to nfnetlink_send() to report back
to userspace if the process that triggered the change requests it.
Based on a suggestion from Patrick McHardy.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
226c0c0ef2abdf91b8d9cce1aaf7d4635a5e5926 18-Nov-2008 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: ctnetlink: helper modules load-on-demand support

This patch adds module loading for helpers via ctnetlink.

* Creation path: We support explicit and implicit helper assignment. For
the explicit case, we try to load the module. If the module is correctly
loaded and the helper is present, we return EAGAIN to re-start the
creation. Otherwise, we return EOPNOTSUPP.
* Update path: release the spin lock, load the module and check. If it is
present, then return EAGAIN to re-start the update.

This patch provides a refactored function to look up and set the
connection tracking helper. The patch also removes the exported symbol
__nf_ct_helper_find as it has no clients anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
e6a7d3c04f8fe49099521e6dc9a46b0272381f2f 14-Oct-2008 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: ctnetlink: remove bogus module dependency between ctnetlink and nf_nat

This patch removes the module dependency between ctnetlink and
nf_nat by means of an indirect call that is initialized when
nf_nat is loaded. Now, nf_conntrack_netlink only requires
nf_conntrack and nfnetlink.

This patch puts nfnetlink_parse_nat_setup_hook into the
nf_conntrack_core to avoid dependencies between ctnetlink,
nf_conntrack_ipv4 and nf_conntrack_ipv6.

This patch also introduces the function ctnetlink_change_nat
that is only invoked from the creation path. Actually, the
nat handling cannot be invoked from the update path since
this is not allowed. By introducing this function, we remove
the useless nat handling in the update path and we avoid
deadlock-prone code.

This patch also adds the required EAGAIN logic for nfnetlink.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
08f6547d266fdba087f7fa7963fc0610be5b7cd7 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: final netns tweaks

Add init_net checks to not remove kmem_caches twice and so on.

Refactor functions to split code which should be executed only for
init_net into one place.

ip_ct_attach and ip_ct_destroy assignments remain separate, because
they're separate stages in setup and teardown.

NOTE: NOTRACK code is in for-every-net part. It will be made per-netns
after we decide how to do it correctly.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
d716a4dfbbdf0d4731d596a96e5f4b0d892ac168 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns conntrack accounting

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
c2a2c7e0cc39e7f9336cd67e8307a110bdba82f3 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns net.netfilter.nf_conntrack_log_invalid sysctl

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
0d55af8791bfb42e04cc456b348910582f230343 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns statistics

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
6058fa6bb96a5b6145cba10c5171f09c2783ca69 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns event cache

Heh, last minute proof-reading of this patch made me think,
that this is actually unneeded, simply because "ct" pointers will be
different for different conntracks in different netns, just like they
are different in one netns.

Not so sure anymore.

[Patrick: pointers will be different, flushing can only be done while
inactive though and thus it needs to be per netns]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
a71996fccce4b2086a26036aa3c915365ca36926 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: pass conntrack to nf_conntrack_event_cache() not skb

This is cleaner, we already know conntrack to which event is relevant.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
74c51a1497033e6ff7b8096797daca233a4a30df 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: pass netns pointer to L4 protocol's ->error hook

Again, it's deducible from skb, but we're going to use it for
nf_conntrack_checksum and statistics, so just pass it from upper layer.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
a702a65fc1376fc1f6757ec2a6960348af3f1876 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: pass netns pointer to nf_conntrack_in()

It's deducible from skb->dev or skb->dst->dev, but we know netns at
the moment of call, so pass it down and use for finding and creating
conntracks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
63c9a26264be108b52de087724673f8664570e34 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns unconfirmed list

What is confirmed connection in one netns can very well be unconfirmed
in another one.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
9b03f38d0487f3908696242286d934c9b38f9d2a 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns expectations

Make per-netns a) expectation hash and b) expectations count.

An expectation always belongs to the netns to which its master conntrack belongs.
This is natural and doesn't bloat expectation.

Proc files and leaf users are stubbed to init_net, this is temporary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
400dad39d1c33fe797e47326d87a3f54d0ac5181 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns conntrack hash

* make per-netns conntrack hash

The other solution is to add a ->ct_net pointer to tuplehashes and still have
one hash. I tried that; it's ugly and requires more code deep down in protocol
modules et al.

* propagate netns pointer to where needed, e. g. to conntrack iterators.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
49ac8713b6d064adf7474080fdccebd7cce76be0 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: per-netns conntrack count

Sysctls and proc files are stubbed to init_net's one. This is temporary.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
5a1fb391d881905e89623d78858d05b248cbc86a 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: add ->ct_net -- pointer from conntrack to netns

Conntrack (struct nf_conn) gets pointer to netns: ->ct_net -- netns in which
it was created. It comes from netdevice.

->ct_net is write-once field.

Every conntrack in system has ->ct_net initialized, no exceptions.

->ct_net doesn't pin netns: conntracks are recycled after timeouts and
pinning background traffic will prevent netns from even starting shutdown
sequence.

Right now every conntrack is created in init_net.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
dfdb8d791877052bbb527d9688d94a064721d8f7 08-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com> netfilter: netns nf_conntrack: add netns boilerplate

One comment: #ifdefs around #include is necessary to overcome amazing compile
breakages in NOTRACK-in-netns patch (see below).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
76108cea065cda58366d16a7eb6ca90d717a1396 08-Oct-2008 Jan Engelhardt <jengelh@medozas.de> netfilter: Use unsigned types for hooknum and pf vars

and (try to) consistently use u_int8_t for the L3 family.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
9714be7da8b32f36d2468fe08ff603b6402df8cf 06-Aug-2008 Krzysztof Piotr Oledzki <ole@ans.pl> netfilter: fix two recent sysctl problems

Starting with 9043476f726802f4b00c96d0c4f418dde48d1304 ("[PATCH]
sanitize proc_sysctl") we have two netfilter related problems:

- WARNING: at kernel/sysctl.c:1966 unregister_sysctl_table+0xcc/0x103(),
caused by wrong order of init/fini calls

- net.netfilter is duplicated and has truncated set of records

Thanks to very useful guidelines from Al Viro, this patch fixes both
of them.

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
584015727a3b88b46602b20077b46cd04f8b4ab3 21-Jul-2008 Krzysztof Piotr Oledzki <ole@ans.pl> netfilter: accounting rework: ct_extend + 64bit counters (v4)

Initially netfilter had 64bit counters for conntrack-based accounting, but
this was changed in 2.6.14 to save memory. Unfortunately in-kernel 64bit counters are
still required, for example for the "connbytes" extension. However, 64bit counters
waste a lot of memory and it was not possible to enable/disable them at runtime.

This patch:
- reimplements accounting with respect to the extension infrastructure,
- makes one global version of seq_print_acct() instead of two seq_print_counters(),
- makes it possible to enable it at boot time (for CONFIG_SYSCTL/CONFIG_SYSFS=n),
- makes it possible to enable/disable it at runtime by sysctl or sysfs,
- extends counters from 32bit to 64bit,
- renames ip_conntrack_counter -> nf_conn_counter,
- enables accounting code unconditionally (no longer depends on CONFIG_NF_CT_ACCT),
- sets initial accounting enable state based on CONFIG_NF_CT_ACCT,
- removes buggy IPCT_COUNTER_FILLING event handling.

If accounting is enabled, newly created connections get an additional acct extend.
Old connections are not changed as it is not possible to add a ct_extend area
to a confirmed conntrack. Accounting is performed for all connections with an
acct extend regardless of the current state of "net.netfilter.nf_conntrack_acct".

Signed-off-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
4c8894980010536915c4f5513ee180e3614aeca9 15-Jul-2008 David S. Miller <davem@davemloft.net> netfilter: Let nf_ct_kill() callers know if del_timer() returned true.

Signed-off-by: David S. Miller <davem@davemloft.net>
b891c5a831b13f74989dcbd7b39d04537b2a05d9 08-Jul-2008 Pablo Neira Ayuso <pablo@netfilter.org> netfilter: nf_conntrack: add allocation flag to nf_conntrack_alloc

ctnetlink does not need to allocate the conntrack entries with GFP_ATOMIC
as its code is executed in user context.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ceeff7541e5a4ba8e8d97ffbae32b3f283cb7a3f 12-Jun-2008 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: fix ctnetlink related crash in nf_nat_setup_info()

When creation of a new conntrack entry in ctnetlink fails after having
set up the NAT mappings, the conntrack has an extension area allocated
that is not getting properly destroyed when freeing the conntrack again.
This means the NAT extension is still in the bysource hash, causing a
crash when walking over the hash chain the next time:

BUG: unable to handle kernel paging request at 00120fbd
IP: [<c03d394b>] nf_nat_setup_info+0x221/0x58a
*pde = 00000000
Oops: 0000 [#1] PREEMPT SMP

Pid: 2795, comm: conntrackd Not tainted (2.6.26-rc5 #1)
EIP: 0060:[<c03d394b>] EFLAGS: 00010206 CPU: 1
EIP is at nf_nat_setup_info+0x221/0x58a
EAX: 00120fbd EBX: 00120fbd ECX: 00000001 EDX: 00000000
ESI: 0000019e EDI: e853bbb4 EBP: e853bbc8 ESP: e853bb78
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process conntrackd (pid: 2795, ti=e853a000 task=f7de10f0 task.ti=e853a000)
Stack: 00000000 e853bc2c e85672ec 00000008 c0561084 63c1db4a 00000000 00000000
00000000 0002e109 61d2b1c3 00000000 00000000 00000000 01114e22 61d2b1c3
00000000 00000000 f7444674 e853bc04 00000008 c038e728 0000000a f7444674
Call Trace:
[<c038e728>] nla_parse+0x5c/0xb0
[<c0397c1b>] ctnetlink_change_status+0x190/0x1c6
[<c0397eec>] ctnetlink_new_conntrack+0x189/0x61f
[<c0119aee>] update_curr+0x3d/0x52
[<c03902d1>] nfnetlink_rcv_msg+0xc1/0xd8
[<c0390228>] nfnetlink_rcv_msg+0x18/0xd8
[<c0390210>] nfnetlink_rcv_msg+0x0/0xd8
[<c038d2ce>] netlink_rcv_skb+0x2d/0x71
[<c0390205>] nfnetlink_rcv+0x19/0x24
[<c038d0f5>] netlink_unicast+0x1b3/0x216
...

Move invocation of the extension destructors to nf_conntrack_free()
to fix this problem.

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=10875

Reported-and-Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
718d4ad98e272daebc258e49dc02f52a6a8de9d3 10-Jun-2008 Fabian Hugelshofer <hugelshofer2006@gmx.ch> netfilter: nf_conntrack: properly account terminating packets

Currently the last packet of a connection isn't accounted when it's causing
abnormal termination.

Introduces nf_ct_kill_acct() which increments the accounting counters on
conntrack kill. The new function was necessary, because there are calls
to nf_ct_kill() which don't need accounting:

nf_conntrack_proto_tcp.c line ~847:
Kills ct and returns NF_REPEAT. We don't want to count twice.

nf_conntrack_proto_tcp.c line ~880:
Kills ct and returns NF_DROP. I think we don't want to count dropped
packets.

nf_conntrack_netlink.c line ~824:
As far as I can see ctnetlink_del_conntrack() is used to destroy a
conntrack on behalf of the user. There is an sk_buff, but I don't think
this is an actual packet. Incrementing counters here is therefore not
desired.

Signed-off-by: Fabian Hugelshofer <hugelshofer2006@gmx.ch>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
51091764f26ec36c02e35166f083193a30f426fc 10-Jun-2008 Patrick McHardy <kaber@trash.net> netfilter: nf_conntrack: add nf_ct_kill()

Encapsulate the common

if (del_timer(&ct->timeout))
ct->timeout.function((unsigned long)ct)

sequence in a new function.
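
A minimal userspace sketch of that encapsulation, assuming a mock timer in
place of the kernel timer API. All demo_* names are hypothetical; only the
"delete the timer, and if it was still pending run its function by hand"
shape is taken from the sequence quoted above.

```c
#include <stdbool.h>

/* Mock timer: 'pending' models whether the timer is still queued. */
struct demo_timer {
	bool pending;
	void (*function)(unsigned long);
};

/* Hypothetical conntrack stand-in; 'destroyed' counts handler runs. */
struct demo_ct {
	struct demo_timer timeout;
	int destroyed;
};

/* Stand-in for del_timer(): true only if the timer was still pending. */
static bool demo_del_timer(struct demo_timer *t)
{
	bool was_pending = t->pending;

	t->pending = false;
	return was_pending;
}

/* Timeout handler. Assumes a pointer fits in unsigned long (LP64/ILP32). */
static void demo_death_by_timeout(unsigned long data)
{
	struct demo_ct *ct = (struct demo_ct *)data;

	ct->destroyed++;
}

static void demo_ct_init(struct demo_ct *ct)
{
	ct->timeout.pending = true;
	ct->timeout.function = demo_death_by_timeout;
	ct->destroyed = 0;
}

/* The encapsulated sequence: only the caller that wins the timer
 * deletion runs the timeout function, so it runs at most once. */
static void demo_ct_kill(struct demo_ct *ct)
{
	if (demo_del_timer(&ct->timeout))
		ct->timeout.function((unsigned long)ct);
}
```

Calling demo_ct_kill() twice on the same object runs the handler once; the
second call finds the timer already deleted and does nothing.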

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
443a70d50bdc212e1292778e264ce3d0a85b896f 29-Apr-2008 Philip Craig <philipc@snapgear.com> netfilter: nf_conntrack: padding breaks conntrack hash on ARM

commit 0794935e "[NETFILTER]: nf_conntrack: optimize hash_conntrack()"
results in ARM platforms hashing uninitialised padding. This padding
doesn't exist on other architectures.

Fix this by replacing NF_CT_TUPLE_U_BLANK() with memset() to ensure
everything is initialised. There were only 4 bytes that
NF_CT_TUPLE_U_BLANK() wasn't clearing anyway (or 12 bytes on ARM).
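
The effect can be illustrated with a small userspace sketch. The struct
layout, the demo_* names and the FNV-1a hash (a stand-in for jhash) are
assumptions made for illustration, not the kernel's tuple or hash. The point
is only that hashing a whole struct also hashes its padding bytes, so the
struct is zeroed with memset() before the fields are filled in.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuple; compilers typically pad after 'proto'. */
struct demo_tuple {
	uint32_t addr;
	uint8_t  proto;
	uint16_t port;
};

/* FNV-1a over raw bytes, used here only as a stand-in for jhash. */
static uint32_t demo_hash_bytes(const void *p, size_t len)
{
	const unsigned char *b = p;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= b[i];
		h *= 16777619u;
	}
	return h;
}

static uint32_t demo_tuple_hash(uint32_t addr, uint8_t proto, uint16_t port)
{
	struct demo_tuple t;

	/* Zero everything, padding included, before setting fields. */
	memset(&t, 0, sizeof(t));
	t.addr = addr;
	t.proto = proto;
	t.port = port;

	/* Hash the entire object in one go, padding bytes and all. */
	return demo_hash_bytes(&t, sizeof(t));
}
```

Strictly speaking, ISO C leaves padding contents unspecified after a member
store; the sketch relies on the common compiler behaviour that member
stores leave the zeroed padding alone.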

Signed-off-by: Philip Craig <philipc@snapgear.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ef1a5a50bbd509b8697dcd4d13017e9e0053867b 14-Apr-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: fix incorrect check for expectations

The expectation classes changed help->expectations to an array,
fix use as scalar value.

Signed-off-by: Patrick McHardy <kaber@trash.net>
3c9fba656a185cf56872a325e5594d9b4d4168ec 14-Apr-2008 Jan Engelhardt <jengelh@computergmbh.de> [NETFILTER]: nf_conntrack: replace NF_CT_DUMP_TUPLE macro indirection by function call

Directly call IPv4 and IPv6 variants where the address family is
easily known.

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
5f2b4c9006fc667c4614f0b079efab3721f68316 14-Apr-2008 Jan Engelhardt <jengelh@computergmbh.de> [NETFILTER]: nf_conntrack: use bool type in struct nf_conntrack_tuple.h

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
5e8fbe2ac8a3f1e34e7004c5750ef59bf9304f82 14-Apr-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: add tuplehash l3num/protonum accessors

Add accessors for l3num and protonum and get rid of some overly long
expressions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
4e29e9ec7e0707d3925f5dcc29af0d3f04e49833 27-Feb-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: fix smp_processor_id() in preemptible code warning

Since we're using RCU for the conntrack hash now, we need to avoid
getting preempted or interrupted by BHs while changing the stats.

Fixes warning reported by Tilman Schmidt <tilman@imap.cc> when using
preemptible RCU:

[ 48.180297] BUG: using smp_processor_id() in preemptible [00000000] code: ntpdate/3562
[ 48.180297] caller is __nf_conntrack_find+0x9b/0xeb [nf_conntrack]
[ 48.180297] Pid: 3562, comm: ntpdate Not tainted 2.6.25-rc2-mm1-testing #1
[ 48.180297] [<c02015b9>] debug_smp_processor_id+0x99/0xb0
[ 48.180297] [<fac643a7>] __nf_conntrack_find+0x9b/0xeb [nf_conntrack]

Tested-by: Tilman Schmidt <tilman@imap.cc>
Tested-by: Christian Casteyde <casteyde.christian@free.fr> [Bugzilla #10097]

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d33b7c06bd721e21534c120d1c4a5944dc3eb9ce 31-Jan-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> [NETFILTER]: nf_conntrack: kill unused static inline (do_iter)

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c88130bcd546e73e66165f9c29113dae9facf1ec 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: naming unification

Rename all "conntrack" variables to "ct" for more consistency and
avoiding some overly long lines.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
76eb946040a7b4c797979a9c22464b9a07890ba5 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: don't inline early_drop()

early_drop() is only called *very* rarely, unfortunately gcc inlines it
into the hotpath because there is only a single caller. Explicitly mark
it noinline.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
0794935e21a18e7c171b604c31219b60ad9749a9 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: optimize hash_conntrack()

Avoid calling jhash three times and hash the entire tuple in one go.

__hash_conntrack | -485 # 760 -> 275, # inlines: 3 -> 1, size inlines: 717 -> 252
1 function changed, 485 bytes removed

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ba419aff2cda91680e5d4d3eeff95df49bd2edec 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: optimize __nf_conntrack_find()

Ignoring specific entries in __nf_conntrack_find() is only needed by NAT
for nf_conntrack_tuple_taken(). Remove it from __nf_conntrack_find()
and make nf_conntrack_tuple_taken() search the hash itself.

Saves 54 bytes of text in the hotpath on x86_64:

__nf_conntrack_find | -54 # 321 -> 267, # inlines: 3 -> 2, size inlines: 181 -> 127
nf_conntrack_tuple_taken | +305 # 15 -> 320, lexblocks: 0 -> 3, # inlines: 0 -> 3, size inlines: 0 -> 181
nf_conntrack_find_get | -2 # 90 -> 88
3 functions changed, 305 bytes added, 56 bytes removed, diff: +249

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
f8ba1affa18398610e765736153fff614309ccc8 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: switch rwlock to spinlock

With the RCU conversion only write_lock usages of nf_conntrack_lock are
left (except one read_lock that should actually use write_lock in the
H.323 helper). Switch to a spinlock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
76507f69c44ed199a1a68086145398459e55835d 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: use RCU for conntrack hash

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c52fbb410b2662a7bbc5cbe5969d73c733151498 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack_core: avoid taking nf_conntrack_lock in nf_conntrack_alter_reply

The conntrack is unconfirmed, so we have an exclusive reference, which
means that the write_lock is definitely unneeded. A read_lock used to
be needed for the helper lookup, but since we're using RCU for helpers
now rcu_read_lock is enough.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
47d9504543817b0aa908a37a335b90c30704a100 31-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: fix accounting with fixed timeouts

Don't skip accounting for conntracks with the FIXED_TIMEOUT bit.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
96eb24d770381b8a257b26183f6b6c131ad51ab9 31-Jan-2008 Stephen Hemminger <shemminger@vyatta.com> [NETFILTER]: nf_conntrack: sparse warnings

The hashtable size is really unsigned so sparse complains when you pass
a signed integer. Change all uses to make it consistent.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
b334aadc3c5cd4dae2a44f3dac09b3ef718ccde1 15-Jan-2008 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: clean up a few header files

- Remove declarations of non-existing variables and functions
- Move helper init/cleanup function declarations to nf_conntrack_helper.h
- Remove unneeded __nf_conntrack_attach declaration and make it static

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
34498825cb9062192b77fa02dae672a4fe6eec70 18-Dec-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: non-power-of-two jhash optimizations

Apply Eric Dumazet's jhash optimizations where applicable. Quoting Eric:

Thanks to jhash, hash value uses full 32 bits. Instead of returning
hash % size (implying a divide) we return the high 32 bits of the
(hash * size) that will give results between [0 and size-1] and same
hash distribution.

On most cpus, a multiply is less expensive than a divide, by an order
of magnitude.
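
The multiply-and-shift mapping quoted above can be sketched as follows. The
function name is hypothetical; the arithmetic is the "high 32 bits of
(hash * size)" described in the quote.

```c
#include <assert.h>
#include <stdint.h>

/* Map a full-width 32-bit hash onto [0, size) with a multiply and a
 * shift instead of 'hash % size'. */
static inline uint32_t demo_scale_hash(uint32_t hash, uint32_t size)
{
	assert(size > 0);
	/* Widen to 64 bits so the product cannot overflow, then keep
	 * the high 32 bits. */
	return (uint32_t)(((uint64_t)hash * size) >> 32);
}
```

Unlike the modulo, this only spreads values evenly when the hash really
uses all 32 bits, which is the property the quote attributes to jhash.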

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
77236b6e33b06aaf756a86ed1965ca7d460b1b53 18-Dec-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: ctnetlink: use netlink attribute helpers

Use NLA_PUT_BE32, nla_get_be32() etc.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fae718ddaf2b00e222dddec6717aca023376723c 25-Dec-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility

Some users do "modprobe ip_conntrack hashsize=...". Since we have the
module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the
hashsize parameter is unknown for nf_conntrack_ipv4 however and makes
it fail.

Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4.

Note: the nf_conntrack message in the ringbuffer will display an
incorrect hashsize since nf_conntrack is first pulled in as a
dependency and calculates the size itself, then it gets changed
through a call to nf_conntrack_set_hashsize().

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
29b67497f256399c4aa2adec27ab7ba24bba44e8 30-Oct-2007 Andrew Morton <akpm@linux-foundation.org> [NETFILTER]: nf_ct_alloc_hashtable(): use __GFP_NOWARN

This allocation is expected to fail and we handle it by fallback to vmalloc().

So don't scare people with nasty messages like
http://bugzilla.kernel.org/show_bug.cgi?id=9190

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3db05fea51cdb162cfa8f69e9cfb9e228919d2a9 15-Oct-2007 Herbert Xu <herbert@gondor.apana.org.au> [NETFILTER]: Replace sk_buff ** with sk_buff *

With all the users of the double pointers removed, this patch mops up by
finally replacing all occurrences of sk_buff ** in the netfilter API by
sk_buff *.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
7f85f914721ffcef382a57995182916bd43d8a65 28-Sep-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: kill unique ID

Remove the per-conntrack ID, it's not necessary anymore for dumping.
For compatibility reasons we send the address of the conntrack to
userspace as ID.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
f73e924cdd166360e8cc9a1b193008fdc9b3e3e2 28-Sep-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: ctnetlink: use netlink policy

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fdf708322d4658daa6eb795d1a835b97efdb335e 28-Sep-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nfnetlink: rename functions containing 'nfattr'

There is no struct nfattr anymore, rename functions to 'nlattr'.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
df6fb868d6118686805c2fa566e213a8f31c8e4f 28-Sep-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nfnetlink: convert to generic netlink attribute functions

Get rid of the duplicated rtnetlink macros and use the generic netlink
attribute functions. The old duplicated stuff is moved to a new header
file that exists just for userspace.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
a34c45896a723ee7b13128ac8bf564ea42fcd1eb 26-Jul-2007 Al Viro <viro@ftp.linux.org.uk> netfilter endian regressions

no real bugs, just misannotations cropping up

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
20c2df83d25c6a95affe6157a4c9cac4cf5ffaac 20-Jul-2007 Paul Mundt <lethal@linux-sh.org> mm: Remove slab destructors from kmem_cache_create().

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
e2a3123fbe58da9fd3f35cd242087896ace6049f 15-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: Introduces nf_ct_get_tuplepr and uses it

nf_ct_get_tuple() requires the offset to transport header and that bothers
callers such as icmp[v6] l4proto modules. This introduces a new function
to simplify them.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ffc30690480bdd337e4914302b926d24870b56b2 15-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: make l3proto->prepare() generic and renames it

The icmp[v6] l4proto modules parse headers in ICMP[v6] error to get tuple.
But they have to find the offset to transport protocol header before that.
Their processing is almost the same as prepare() of l3proto modules.
This makes prepare() more generic to simplify icmp[v6] l4proto module
later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d87d8469e2dd19a3a134b99f78288d41854c614b 15-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: Increment error count on parsing IPv4 header

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
0d53778e81ac7af266dac8a20cc328328c327112 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: Convert DEBUGP to pr_debug

Convert DEBUGP to pr_debug and fix lots of non-compiling debug statements.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7ae7730fd6d98be1afe8ad9ea77813de607ec970 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: early_drop improvement

When the maximum number of conntrack entries is reached and a new
one needs to be allocated, conntrack tries to drop an unassured
connection from the same hash bucket the new conntrack would hash
to. Since with a properly sized hash the average number of entries
per bucket is 1, the chances of actually finding one are not very
good. This patch makes it walk the hash until a minimum number of
8 entries are checked.
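
A rough sketch of that walk, assuming a simplified table of singly linked
chains. The demo_* names and types are hypothetical, and only the victim
search is shown, not the eviction itself.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_EVICTION_RANGE 8	/* minimum number of entries to check */

/* Hypothetical chain entry; 'assured' models the assured status bit. */
struct demo_entry {
	bool assured;
	struct demo_entry *next;
};

/* Start at the bucket the new entry would hash to and keep walking
 * buckets until an unassured entry is found or at least
 * DEMO_EVICTION_RANGE entries have been looked at. */
static struct demo_entry *demo_find_victim(struct demo_entry **table,
					   unsigned int nbuckets,
					   unsigned int hash)
{
	struct demo_entry *victim = NULL;
	struct demo_entry *e;
	unsigned int checked = 0;
	unsigned int i;

	for (i = 0; i < nbuckets; i++) {
		for (e = table[hash]; e != NULL; e = e->next) {
			if (!e->assured)
				victim = e;	/* keep the last one in the chain */
			checked++;
		}
		if (victim != NULL || checked >= DEMO_EVICTION_RANGE)
			break;
		hash = (hash + 1) % nbuckets;
	}
	return victim;
}
```

The entry count bounds the extra work: the walk stops after the chain on
which the eighth entry was checked, even when every entry seen is assured.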

Based on patch by Vasily Averin <vvs@sw.ru>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
b560580a13b180bc1e3cad7ffbc93388cc39be5d 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack_expect: maintain per conntrack expectation list

This patch brings back the per-conntrack expectation list that was
removed around 2.6.10 to avoid walking all expectations on expectation
eviction and conntrack destruction.

As these were the last users of the global expectation list, this patch
also kills that.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e9c1b084e17ca225b6be731b819308ee0f9e04b8 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: move expectation related init code to nf_conntrack_expect.c

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6823645d608541c2c69e8a99454936e058c294e0 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack_expect: function naming unification

Currently there is a wild mix of nf_conntrack_expect_, nf_ct_exp_,
expect_, exp_, ...

Consistently use nf_ct_ as prefix for exported functions.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac565e5fc104fe1842a87f2206fcfb7b6dda903d 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: export hash allocation/destruction functions

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
330f7db5e578e1e298ba3a41748e5ea333a64a2b 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: remove 'ignore_conntrack' argument from nf_conntrack_find_get

All callers pass NULL, this also doesn't seem very useful for modules.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
f205c5e0c28aa7e0fb6eaaa66e97928f9d9e6994 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: use hlists for conntrack hash

Convert conntrack hash to hlists to reduce its size and cache
footprint. Since the default hashsize to max. entries ratio
sucks (1:16), this patch doesn't reduce the amount of memory
used for the hash by default, but instead uses a better ratio
of 1:8, which results in the same max. entries value.
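
The arithmetic behind "same memory, same max. entries" can be checked with
a short sketch. The struct definitions are simplified stand-ins for the
kernel's list and hlist heads, and demo_max_entries() is a hypothetical
helper: a doubly linked list head holds two pointers per bucket, an hlist
head only one, so the same memory holds twice as many buckets.

```c
#include <stddef.h>

/* Simplified stand-ins for the two bucket head types. */
struct demo_list_head {
	struct demo_list_head *next, *prev;	/* two pointers per bucket */
};

struct demo_hlist_head {
	void *first;				/* one pointer per bucket */
};

/* Max. entries for a hash of 'hash_bytes' bytes, given the size of one
 * bucket head and the entries-per-bucket ratio. */
static size_t demo_max_entries(size_t hash_bytes, size_t bucket_size,
			       size_t ratio)
{
	return hash_bytes / bucket_size * ratio;
}
```

For a 1 MiB hash, demo_max_entries(1 << 20, sizeof(struct demo_list_head), 16)
and demo_max_entries(1 << 20, sizeof(struct demo_hlist_head), 8) give the same
value, while the hlist variant has twice the buckets and so half the average
chain length.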

One thing worth noting is early_drop. It really should use LRU,
so it now has to iterate over the entire chain to find the last
unconfirmed entry. Since chains shouldn't be very long and the
entire operation is very rare this shouldn't be a problem.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
8e5105a0c36a059dfd0f0bb9e73ee7c97d306247 08-Jul-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: round up hashsize to next multiple of PAGE_SIZE

Don't let the rest of the page go to waste.
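
One way to read this, sketched under the assumption that the table's byte
size is what gets rounded and the bucket count is then derived from it
(the function and parameter names are hypothetical):

```c
#include <stddef.h>

/* Round the table's byte size up to a whole number of pages and return
 * how many buckets fit, so the tail of the last page is used too. */
static size_t demo_round_buckets(size_t buckets, size_t bucket_size,
				 size_t page_size)
{
	size_t bytes = buckets * bucket_size;
	size_t rounded = (bytes + page_size - 1) / page_size * page_size;

	return rounded / bucket_size;
}
```

With 8-byte buckets and 4096-byte pages, a request for 1000 buckets would
grow to 1024, while a request for 512 buckets already fills one page and
stays unchanged.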

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8a0509a696de60296a66ba4fe4f9eaade497103 08-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_nat: kill global 'destroy' operation

This kills the global 'destroy' operation which was used by NAT.
Instead it uses the extension infrastructure so that multiple
extensions can register own operations.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
dacd2a1a5cf621288833aa3c6e815b86a1536538 08-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: remove old memory allocator of conntrack

Now memory space for help and NAT are allocated by extension
infrastructure.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ceceae1b1555a9afcb8dacf90df5fa1f20fd5466 08-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: use extension infrastructure for helper

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ecfab2c9fe5597221c2b30dec48634a2361a0d08 08-Jul-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: introduce extension infrastructure

The old space allocator of conntrack had extensibility problems.
- It required a slab cache per combination of extensions.
- It had to predict which extensions would be assigned, but that could
not be predicted completely, so we allocated a bigger memory object than
really required.
- It needed to search for the helper twice due to a locking issue.

Now the basic information about a connection is stored in 'struct nf_conn',
and storage for extensions (helper, NAT) is allocated by kmalloc.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
3c158f7f57601bc27eab82f0dc4fd3fad314d845 05-Jun-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: fix helper module unload races

When a helper module is unloaded all conntracks referring to it have their
helper pointer NULLed out, leading to lots of races. In most places this
can be fixed by proper use of RCU (they do already check for != NULL,
but in a racy way), additionally nf_conntrack_expect_related needs to
bail out when no helper is present.

Also remove two paranoid BUG_ONs in nf_conntrack_proto_gre that are racy
and not worth fixing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5397e97d7533a03b28a7b8aeee648cbb36a8afc6 19-May-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: fix use-after-free in helper destroy callback invocation

When the helper module is removed for a master connection that has a
fulfilled expectation, but has already timed out and got removed from
the hash tables, nf_conntrack_helper_unregister can't find the master
connection to unset the helper, causing a use-after-free when the
expected connection is destroyed and releases the last reference to
the master.

The helper destroy callback was introduced for the PPtP helper to clean
up expectations and expected connections when the master connection
times out, but doing this from destroy_conntrack only works for
unfulfilled expectations since expected connections hold a reference
to the master, preventing its destruction. Move the destroy callback to
the timeout function, which fixes both problems.

Reported/tested by Gabor Burjan <buga@buvoshetes.hu>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5d78a84913abc1b2ef1ec0c14a78ec99517cc122 10-May-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_nat: Clears helper private area when NATing

Some helpers (e.g. ftp) assume that the private area in conntrack is
zero-filled. It should be cleared when the helper is changed.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
fda61436835f6d46b6d85d4fe9206ffe682fe7f0 10-May-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: Removes unused destroy operation of l3proto

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
de6e05c49f8b4ed63224c5d38891f531ecc4eabb 23-Mar-2007 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: kill destroy() in struct nf_conntrack for diet

The per-conntrack destructor is unnecessary, so this replaces it with
a system-wide destructor.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6f689db51a789807edede411b32eb7c9e457948 23-Mar-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: Use setup_timer

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1b53d9042c04b8eb875d02e65792e9884efc3784 23-Mar-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: Remove changelogs and CVS IDs

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9b88790972498d235a2a4d2b66640c3c5b70bb7c 15-Mar-2007 Sami Farin <safari-netfilter@safari.iki.fi> [NETFILTER]: nf_conntrack: use jhash2 in __hash_conntrack

Now it uses jhash, but using jhash2 would be around 3-4 times faster
(on P4).

Signed-off-by: Sami Farin <safari-netfilter@safari.iki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac5357ebac43e191003c2cd0722377dccfa01a84 15-Mar-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: remove ugly hack in l4proto registration

Remove ugly special-casing of nf_conntrack_l4proto_generic, all it
wants is its sysctl tables registered, so do that explicitly in an
init function and move the remaining protocol initialization and
cleanup code to nf_conntrack_proto.c as well.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
bbe735e4247dba32568a305553b010081c8dea99 11-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Introduce skb_network_offset()

For the quite common 'skb->nh.raw - skb->data' sequence.
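
A toy sketch of what such a helper computes, assuming a simplified stand-in for struct sk_buff. The names toy_skb and toy_network_offset are hypothetical; only the pointer subtraction mirrors the sequence quoted above:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff: only the two pointers the helper needs. */
struct toy_skb {
	unsigned char *data;		/* start of the packet data */
	unsigned char *network_header;	/* start of the layer 3 header */
};

/* Distance in bytes from the start of the data to the network header. */
static ptrdiff_t toy_network_offset(const struct toy_skb *skb)
{
	assert(skb->network_header >= skb->data);
	return skb->network_header - skb->data;
}
```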

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e281db5cdfc3ab077ab3e459d098cb4fde0bc57a 05-Mar-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack/nf_nat: fix incorrect config ifdefs

The nf_conntrack_netlink config option is named CONFIG_NF_CT_NETLINK,
but multiple files use CONFIG_IP_NF_CONNTRACK_NETLINK or
CONFIG_NF_CONNTRACK_NETLINK for ifdefs.

Fix this and reformat all CONFIG_NF_CT_NETLINK ifdefs to use only one line.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ec68e97dedacc1c7fb20a4b23b7fa76bee56b5ff 05-Mar-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: conntrack: fix {nf,ip}_ct_iterate_cleanup endless loops

Fix {nf,ip}_ct_iterate_cleanup unconfirmed list handling:

- unconfirmed entries can not be killed manually, they are removed on
confirmation or final destruction of the conntrack entry, which means
we might iterate forever without making forward progress.

This can happen in combination with the conntrack event cache, which
holds a reference to the conntrack entry, which is only released when
the packet makes it all the way through the stack or a different
packet is handled.

- taking references to an unconfirmed entry and using it outside the
locked section doesn't work, the list entries are not refcounted and
another CPU might already be waiting to destroy the entry

What the code really wants to do is make sure the references of the hash
table to the selected conntrack entries are released, so they will be
destroyed once all references from skbs and the event cache are dropped.

Since unconfirmed entries haven't even entered the hash yet, simply mark
them as dying and skip confirmation based on that.
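
The approach can be sketched roughly as follows. This is a simplified userspace model with hypothetical names (toy_ct, TOY_DYING, toy_confirm); it ignores locking and reference counting entirely:

```c
#include <assert.h>

enum {
	TOY_CONFIRMED = 1u << 0,	/* entry has been inserted into the hash */
	TOY_DYING     = 1u << 1,	/* entry was selected for cleanup */
};

struct toy_ct {
	unsigned int status;
};

/* Cleanup path: an unconfirmed entry cannot be unlinked, so only flag it. */
static void toy_kill_unconfirmed(struct toy_ct *ct)
{
	ct->status |= TOY_DYING;
}

/* Confirmation path: returns 1 if the entry was confirmed, 0 if it is skipped. */
static int toy_confirm(struct toy_ct *ct)
{
	assert(!(ct->status & TOY_CONFIRMED));
	if (ct->status & TOY_DYING)
		return 0;	/* never enters the hash; freed once refs drop */
	ct->status |= TOY_CONFIRMED;
	return 1;
}
```

The cleanup loop then makes forward progress because it only sets a flag instead of waiting for the entry to leave the unconfirmed list.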

Reported and tested by Chuck Ebbert <cebbert@redhat.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
601e68e100b6bf8ba13a32db8faf92d43acaa997 12-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NETFILTER]: Fix whitespace errors

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
982d9a9ce389c396bc83ce29d799937f379ddcb7 12-Feb-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: properly use RCU for nf_conntrack_destroyed callback

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c0e912d7ed8999f87fa7f084928aac1266e251f3 12-Feb-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: fix invalid conntrack statistics RCU assumption

NF_CT_STAT_INC assumes rcu_read_lock in nf_hook_slow disables
preemption as well, making it legal to use __get_cpu_var without
disabling preemption manually. The assumption is not correct anymore
with preemptable RCU, additionally we need to protect against softirqs
when not holding nf_conntrack_lock.

Add NF_CT_STAT_INC_ATOMIC macro, which disables local softirqs,
and use where necessary.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
923f4902fefdf4e89b0fb32c4e069d4f57d704f5 12-Feb-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: properly use RCU API for nf_ct_protos/nf_ct_l3protos arrays

Replace preempt_{enable,disable} based RCU by proper use of the
RCU API and add missing rcu_read_lock/rcu_read_unlock calls in
all paths not obviously only used within packet process context
(nfnetlink_conntrack).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c3a47ab3e5ad62601449e4e5401352271b777e28 12-Feb-2007 Patrick McHardy <kaber@trash.net> [NETFILTER]: Properly use RCU in nf_ct_attach

Use rcu_assign_pointer/rcu_dereference for ip_ct_attach pointer instead
of self-made RCU and use rcu_read_lock to make sure the conntrack module
doesn't disappear below us while calling it, since this function can be
called from outside the netfilter hooks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e18b890bb0881bbab6f4f1a6cd20d9c60d66b003 07-Dec-2006 Christoph Lameter <clameter@sgi.com> [PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
	quilt add $file
	sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
	mv /tmp/$$ $file
	quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
272491ef423b6976a230a998b10f46976aa91342 07-Dec-2006 Randy Dunlap <randy.dunlap@oracle.com> [NETFILTER]: Fix non-ANSI func. decl.

Fix non-ANSI function declaration:
net/netfilter/nf_conntrack_core.c:1096:25: warning: non-ANSI function declaration of function 'nf_conntrack_flush'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d7fe0f241dceade9c8d4af75498765c5ff7f27e6 04-Dec-2006 Al Viro <viro@zeniv.linux.org.uk> [PATCH] severing skbuff.h -> mm.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
13b1833910205289172cdc655cb9bc61188f77e9 03-Dec-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: EXPORT_SYMBOL cleanup

- move EXPORT_SYMBOL next to exported symbol
- use EXPORT_SYMBOL_GPL since this is what the original code used

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
f09943fefe6b702e40893d35b4f10fd1064037fe 03-Dec-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port

Add nf_conntrack port of the PPtP conntrack/NAT helper. Since there seems
to be no IPv6-capable PPtP implementation, the helper only supports IPv4.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5b1158e909ecbe1a052203e0d8df15633f829930 03-Dec-2006 Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> [NETFILTER]: Add NAT support for nf_conntrack

Add NAT support for nf_conntrack. Joint work of Jozsef Kadlecsik,
Yasuyuki Kozakai, Martin Josefsson and myself.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9457d851fc5df54522d733f72cbb1f02ab59272e 03-Dec-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: automatic helper assignment for expectations

Some helpers (namely H.323) manually assign further helpers to expected
connections. This is not possible with nf_conntrack anymore since we
need to know whether a helper is used at allocation time.

Handle the helper assignment centrally, which allows the correct
allocation to be performed and, as a nice side effect, eliminates the
need for the H.323 helper to fiddle with nf_conntrack_lock.

Mid term the allocation scheme really needs to be redesigned since
we do both the helper and expectation lookup _twice_ for every new
connection.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
bff9a89bcac5b68ac0a1ea856b1726a35ae1eabb 03-Dec-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: endian annotations

Resync with Al Viro's ip_conntrack annotations and fix a missed
spot in ip_nat_proto_icmp.c.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
a999e6837603e4b5a164333c93918a1292f074c8 29-Nov-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: sysctl compatibility with old connection tracking

This patch adds an option to keep the connection tracking sysctls visible
under their old names.

Signed-off-by: Patrick McHardy <kaber@trash.net>
933a41e7e12b773d1dd026018f02b86b5d257a22 29-Nov-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: nf_conntrack: move conntrack protocol sysctls to individual modules

Signed-off-by: Patrick McHardy <kaber@trash.net>
be00c8e48993368663e2714bd1e7c886b7736406 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: reduce timer updates in __nf_ct_refresh_acct()

Only update the conntrack timer if there's been at least HZ jiffies since
the last update. Reduces the number of del_timer/add_timer cycles from one
per packet to one per connection per second (plus once for each state change
of a connection)

Should handle timer wraparounds and connection timeout changes.
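
A minimal sketch of the check described above, assuming jiffies-style unsigned counters. The function name and parameters are hypothetical; hz stands in for the kernel's HZ constant:

```c
#include <assert.h>

/*
 * Decide whether the timer should be re-armed. Unsigned subtraction
 * keeps the comparison valid when the counter wraps around.
 */
static int toy_timer_needs_update(unsigned long old_expires, unsigned long now,
				  unsigned long extra, unsigned long hz)
{
	unsigned long new_expires = now + extra;	/* may wrap; that is fine */

	assert(hz > 0);
	return new_expires - old_expires >= hz;
}
```

In this sketch a shortened timeout also triggers an update, because the unsigned difference wraps to a large value when the new expiry lies before the old one.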

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
3ffd5eeb1a031ad226c80ae6e658970cd08569e2 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: minor __nf_ct_refresh_acct() whitespace cleanup

Minor whitespace cleanup.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
951d36cace3d3ad2ac6c222e126aed4113ad2bf7 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: remove ASSERT_{READ,WRITE}_LOCK

Remove the usage of ASSERT_READ_LOCK/ASSERT_WRITE_LOCK in nf_conntrack,
it didn't do anything, it was just an empty define and it uglified the code.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ae5718fb3dd0a11a4c9a061bf86417d52d58a6b3 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: more sanity checks in protocol registration/unregistration

Add some more sanity checks when registering/unregistering l3/l4 protocols.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
605dcad6c85226e6d43387917b329d65b95cef39 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: rename struct nf_conntrack_protocol

Rename 'struct nf_conntrack_protocol' to 'struct nf_conntrack_l4proto' in
order to help distinguish it from 'struct nf_conntrack_l3proto'. It gets
rather confusing with 'nf_conntrack_protocol'.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
e2b7606cdb602a4f69c02cfc8bebe9c63b595e24 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: More __read_mostly annotations

Place rarely written variables in the read-mostly section by using
__read_mostly

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
8f03dea52b1d0227656319e1b0822628b43791a8 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: split out protocol handling

This patch splits out L3/L4 protocol handling into its own file
nf_conntrack_proto.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
f61801218a58381f498ae5c38ae3eae0bc73e976 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: split out the event cache

This patch splits out the event cache into its own file
nf_conntrack_ecache.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
7e5d03bb9d2b96fdeab0cb0c98b93e6cf7130c96 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: split out helper handling

This patch splits out handling of helpers into its own file
nf_conntrack_helper.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
77ab9cff0f4112703df3ef7903c1a15adb967114 29-Nov-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: split out expectation handling

This patch splits out expectation handling into its own file
nf_conntrack_expect.c

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
2e47c264a2e6ea24c27b4987607222202818c1f4 27-Nov-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: conntrack: fix refcount leak when finding expectation

No user of __{ip,nf}_conntrack_expect_find() expects it to increment
the reference count of the expectation.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
22e7410b760b9c1777839fdd10382c60df8cbda2 27-Nov-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: fix the race on assign helper to new conntrack

A helper found earlier cannot safely be assigned to the conntrack after
nf_conntrack_lock has been unlocked. This looks the helper up again
before assigning it.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c073e3fa8b7f9841aa6451885f135656d455f511 31-Oct-2006 Martin Josefsson <gandalf@wlug.westbo.se> [NETFILTER]: nf_conntrack: add missing unlock in get_next_corpse()

Add missing unlock in get_next_corpse() in nf_conntrack. It was missed
during the removal of listhelp.h. Also remove an unneeded use of
nf_ct_tuplehash_to_ctrack() in the same function.

Should be applied before 2.6.19 is released.

Signed-off-by: Martin Josefsson <gandalf@wlug.westbo.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
1192e403e9ea2dc23bbbe2b4fe9bdbc47e8c6056 20-Sep-2006 Brian Haley <brian.haley@hp.com> [NETFILTER]: make some netfilter globals __read_mostly

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5251e2d2125407bbff0c39394a4011be9ed8b5d0 20-Sep-2006 Pablo Neira Ayuso <pablo@netfilter.org> [NETFILTER]: conntrack: fix race condition in early_drop

On SMP systems the maximum number of conntracks can be exceeded
under heavy load due to an existing race condition.

CPU A                   CPU B
atomic_read()           ...
early_drop()            ...
...                     atomic_read()
allocate conntrack      allocate conntrack
atomic_inc()            atomic_inc()

This patch moves the counter incrementation before the early drop stage.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
df0933dcb027e156cb5253570ad694b81bd52b69 20-Sep-2006 Patrick McHardy <kaber@trash.net> [NETFILTER]: kill listhelp.h

Kill listhelp.h and use the list.h functions instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
94aec08ea426903a3fb3cafd4d8b900cd50df702 18-Sep-2006 Brian Haley <brian.haley@hp.com> [NETFILTER]: Change tunables to __read_mostly

Change some netfilter tunables to __read_mostly. Also fixed some
incorrect file reference comments while I was in there.

(this will be my last __read_mostly patch unless someone points out
something else that needs it)

Signed-off-by: Brian Haley <brian.haley@hp.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
7c9728c393dceb724d66d696cfabce82151a78e5 09-Jun-2006 James Morris <jmorris@namei.org> [SECMARK]: Add secmark support to conntrack

Add a secmark field to IP and NF conntracks, so that security markings
on packets can be copied to their associated connections, and also
copied back to packets as required. This is similar to the network
mark field currently used with conntrack, although it is intended for
enforcement of security policy rather than network policy.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
997ae831ade74bdaed4172b1c02060b9efd6e206 30-May-2006 Eric Leblond <eric@inl.fr> [NETFILTER]: conntrack: add fixed timeout flag in connection tracking

Add a flag to the connection status that keeps the timeout from being
updated. This allows a connection to die automatically at a given
time.
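
A rough sketch of how such a flag could gate the refresh path. The structure, flag value, and function name are hypothetical and simplified for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FIXED_TIMEOUT (1u << 0)	/* hypothetical flag value */

struct toy_conn {
	unsigned int status;
	unsigned long expires;
};

/* Extend the timeout unless the connection has a fixed one. */
static void toy_refresh(struct toy_conn *ct, unsigned long now,
			unsigned long extra)
{
	assert(ct != NULL);
	if (ct->status & TOY_FIXED_TIMEOUT)
		return;		/* timeout is pinned; leave it alone */
	ct->expires = now + extra;
}
```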

Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2c16b774c7a9b1684b0ff10121915903e9f0cf6c 25-Apr-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: kill unused callback init_conntrack

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e1bbdebdba615ddd957de81103aa2f7fa0581952 25-Apr-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: Fix module refcount dropping too far

If nf_ct_l3proto_find_get() fails to get the refcount of
nf_ct_l3proto_generic, nf_ct_l3proto_put() will drop the refcount
too far.

This gets rid of '.me = THIS_MODULE' of nf_ct_l3proto_generic so that
nf_ct_l3proto_find_get() doesn't try to get refcount of it.
It's OK because its symbol is usable until nf_conntrack.ko is unloaded.

This also kills an unnecessary NULL pointer check.
__nf_ct_proto_find() always returns a non-NULL pointer.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6f912042256c12b0927438122594f5379b364f5d 11-Apr-2006 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [PATCH] for_each_possible_cpu: network codes

for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.

We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.

This patch replaces for_each_cpu with for_each_possible_cpu under /net

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e041c683412d5bf44dc2b109053e3b837b71742d 27-Mar-2006 Alan Stern <stern@rowland.harvard.edu> [PATCH] Notifier chain update: API changes

The kernel's implementation of notifier chains is unsafe. There is no
protection against entries being added to or removed from a chain while the
chain is in use. The issues were discussed in this thread:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

We noticed that notifier chains in the kernel fall into two basic usage
classes:

"Blocking" chains are always called from a process context
and the callout routines are allowed to sleep;

"Atomic" chains can be called from an atomic context and
the callout routines are not allowed to sleep.

We decided to codify this distinction and make it part of the API. Therefore
this set of patches introduces three new, parallel APIs: one for blocking
notifiers, one for atomic notifiers, and one for "raw" notifiers (which is
really just the old API under a new name). New kinds of data structures are
used for the heads of the chains, and new routines are defined for
registration, unregistration, and calling a chain. The three APIs are
explained in include/linux/notifier.h and their implementation is in
kernel/sys.c.

With atomic and blocking chains, the implementation guarantees that the chain
links will not be corrupted and that chain callers will not get messed up by
entries being added or removed. For raw chains the implementation provides no
guarantees at all; users of this API must provide their own protections. (The
idea was that situations may come up where the assumptions of the atomic and
blocking APIs are not appropriate, so it should be possible for users to
handle these things in their own way.)

There are some limitations, which should not be too hard to live with. For
atomic/blocking chains, registration and unregistration must always be done in
a process context since the chain is protected by a mutex/rwsem. Also, a
callout routine for a non-raw chain must not try to register or unregister
entries on its own chain. (This did happen in a couple of places and the code
had to be changed to avoid it.)

Since atomic chains may be called from within an NMI handler, they cannot use
spinlocks for synchronization. Instead we use RCU. The overhead falls almost
entirely in the unregister routine, which is okay since unregistration is much
less frequent than calling a chain.

Here is the list of chains that we adjusted and their classifications. None
of them use the raw API, so for the moment it is only a placeholder.

ATOMIC CHAINS
-------------
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain

BLOCKING CHAINS
---------------
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
arch/x86_64/kernel/process.c: idle_notifier
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c: cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c: cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c: sleep_notifier_list
drivers/macintosh/via-pmu68k.c: sleep_notifier_list
drivers/macintosh/windfarm_core.c: wf_client_list
drivers/usb/core/notify.c: usb_notifier_list
drivers/video/fbmem.c: fb_notifier_list
kernel/cpu.c: cpu_chain
kernel/module.c: module_notify_list
kernel/profile.c: munmap_notifier
kernel/profile.c: task_exit_notifier
kernel/sys.c: reboot_notifier_list
net/core/dev.c: netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain

It's possible that some of these classifications are wrong. If they are,
please let us know or submit a patch to fix them. Note that any chain that
gets called very frequently should be atomic, because the rwsem read-locking
used for blocking chains is very likely to incur cache misses on SMP systems.
(However, if the chain's callout routines may sleep then the chain cannot be
atomic.)

The patch set was written by Alan Stern and Chandra Seetharaman, incorporating
material written by Keith Owens and suggestions from Paul McKenney and Andrew
Morton.

[jes@sgi.com: restructure the notifier chain initialization macros]
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Jes Sorensen <jes@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b9f78f9fca626875af8adc0f7366a38b8e625a0e 22-Mar-2006 Pablo Neira Ayuso <pablo@netfilter.org> [NETFILTER]: nf_conntrack: support for layer 3 protocol load on demand

x_tables matches and targets that require nf_conntrack_ipv[4|6] to work
don't have enough information to load these modules on demand. This
patch introduces the following changes to solve this issue:

o nf_ct_l3proto_try_module_get: tries to load the layer 3 connection
  tracker module and increases the refcount.
o nf_ct_l3proto_module_put: drops the refcount of the module.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
4e3882f77376e036a52b022909d7e910714bd27b 22-Mar-2006 Pablo Neira Ayuso <pablo@netfilter.org> [NETFILTER]: conntrack: cleanup the conntrack ID initialization

Currently the first conntrack ID assigned is 2, use 1 instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
57b47a53ec4a67691ba32cff5768e8d78fa6c67f 21-Mar-2006 Ingo Molnar <mingo@elte.hu> [NET]: sem2mutex part 2

Semaphore to mutex conversion.

The conversion was generated via scripts, and the result was validated
automatically via a script as well.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dc808fe28db59fadf4ec32d53f62477fa28f3be8 21-Mar-2006 Harald Welte <laforge@netfilter.org> [NETFILTER] nf_conntrack: clean up to reduce size of 'struct nf_conn'

This patch moves all helper related data fields of 'struct nf_conn'
into a separate structure 'struct nf_conn_help'. This new structure
is only present in conntrack entries for which we actually have a
helper loaded.

Also, this patch cleans up the nf_conntrack 'features' mechanism to
resemble what the original idea was: Just glue the feature-specific
data structures at the end of 'struct nf_conn', and explicitly
re-calculate the pointer to it when needed rather than keeping
pointers around.
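
The "glue at the end and recompute the pointer" idea can be sketched in userspace C as follows. All names (toy_conn, toy_help, TOY_FEATURE_HELP) are hypothetical and the layout is an assumption for illustration, not the kernel's actual structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_FEATURE_HELP 0x1u

struct toy_conn {
	unsigned long status;
	unsigned int features;	/* which optional areas follow the struct */
};

struct toy_help {
	void *helper;
	unsigned int expecting;
};

/* Offset of the helper area: just past the base struct, suitably aligned. */
static size_t toy_help_offset(void)
{
	size_t align = _Alignof(struct toy_help);

	return (sizeof(struct toy_conn) + align - 1) / align * align;
}

/* Allocate only as much memory as the requested features need. */
static struct toy_conn *toy_conn_alloc(unsigned int features)
{
	size_t size = sizeof(struct toy_conn);
	struct toy_conn *ct;

	if (features & TOY_FEATURE_HELP)
		size = toy_help_offset() + sizeof(struct toy_help);
	ct = calloc(1, size);
	if (ct)
		ct->features = features;
	return ct;
}

/* Recompute the helper pointer on demand instead of storing it. */
static struct toy_help *toy_conn_help(struct toy_conn *ct)
{
	assert(ct != NULL);
	if (!(ct->features & TOY_FEATURE_HELP))
		return NULL;
	return (struct toy_help *)((char *)ct + toy_help_offset());
}
```

In this sketch a connection without a helper pays only for the base structure, and no pointer to the trailing area is stored in it.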

Saves 20 bytes per conntrack on my x86_64 box. A non-helped conntrack
is 276 bytes. We still need to save another 20 bytes in order to fit
into the target of 256 bytes.

Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7d3cdc6b554137a7a0534ce38b155a63a3117f27 16-Feb-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: move registration of __nf_ct_attach

Move registration of __nf_ct_attach to nf_conntrack_core to make it usable
for IPv6 connection tracking as well.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ddc8d029ac6813827849801bce2d8c8813070db6 04-Feb-2006 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: nf_conntrack: check address family when finding protocol module

__nf_conntrack_{l3}proto_find() doesn't check the passed protocol family,
so it's possible to access beyond the end of the array, which has only
AF_MAX items.
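
A small sketch of the kind of bounds check this calls for, assuming a lookup table indexed by address family with a generic fallback entry. TOY_AF_MAX and the toy_* names are hypothetical stand-ins:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_AF_MAX 4	/* hypothetical table size standing in for AF_MAX */

struct toy_l3proto {
	const char *name;
};

static const struct toy_l3proto toy_generic = { "generic" };
static const struct toy_l3proto toy_ipv4 = { "ipv4" };

/* Every slot is pre-filled; unregistered families point at the fallback. */
static const struct toy_l3proto *toy_l3protos[TOY_AF_MAX] = {
	&toy_generic, &toy_generic, &toy_ipv4, &toy_generic,
};

static const struct toy_l3proto *toy_l3proto_find(unsigned int family)
{
	if (family >= TOY_AF_MAX)
		return &toy_generic;	/* out of range: never index the table */
	assert(toy_l3protos[family] != NULL);
	return toy_l3protos[family];
}
```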

Spotted by Pablo Neira Ayuso.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c1d10adb4a521de5760112853f42aaeefcec96eb 05-Jan-2006 Pablo Neira Ayuso <pablo@netfilter.org> [NETFILTER]: Add ctnetlink port for nf_conntrack

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
d695aa8a1f133359485e15db06d53e15e7309e4d 05-Jan-2006 Jesper Juhl <jesper.juhl@gmail.com> [NETFILTER]: Decrease number of pointer derefs in nf_conntrack_core.c

Benefits of the patch:
- Fewer pointer dereferences should make the code slightly faster.
- Size of generated code is smaller
- improved readability

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6636568cf85ef5898a892e90fcc88b61cca9ca27 05-Dec-2005 Patrick McHardy <kaber@trash.net> [NETFILTER]: Wait for untracked references in nf_conntrack module unload

Noticed by Pablo Neira <pablo@eurodev.net>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
4a59a810513d5f7aa76515908b8e3620fa1b9b69 17-Nov-2005 Patrick McHardy <kaber@trash.net> [NETFILTER]: Fix nf_conntrack compilation with CONFIG_NETFILTER_DEBUG

CC [M] net/netfilter/nf_conntrack_core.o
net/netfilter/nf_conntrack_core.c: In function 'nf_ct_unlink_expect':
net/netfilter/nf_conntrack_core.c:390: error: 'exp_timeout' undeclared (first use in this function)
net/netfilter/nf_conntrack_core.c:390: error: (Each undeclared identifier is reported only once
net/netfilter/nf_conntrack_core.c:390: error: for each function it appears in.)

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a6f294e43e432bd207a702fea49ebb303ef9b23 16-Nov-2005 KOVACS Krisztian <hidden@balabit.hu> [NETFILTER] Free layer-3 specific protocol tables at cleanup

Although the comment around the allocation code tells us that
the layer-3 specific protocol tables will be freed when cleaning up,
they aren't. And this makes nfsim complain loudly...

Signed-off-by: KOVACS Krisztian <hidden@balabit.hu>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
9fb9cbb1082d6b31fb45aa1a14432449a0df6cf1 10-Nov-2005 Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp> [NETFILTER]: Add nf_conntrack subsystem.

The existing connection tracking subsystem in netfilter can only
handle ipv4. There were basically two choices present to add
connection tracking support for ipv6. We could either duplicate all
of the ipv4 connection tracking code into an ipv6 counterpart, or (the
choice taken by these patches) we could design a generic layer that
could handle both ipv4 and ipv6, thus requiring only one sub-protocol
(TCP, UDP, etc.) connection tracking helper module to be written.

In fact nf_conntrack is capable of working with any layer 3
protocol.

The existing ipv4 specific conntrack code could also not deal
with the peculiarities of doing connection tracking on ipv6,
which is also cured here. For example, these issues include:

1) ICMPv6 handling: ICMPv6 is used for neighbour discovery in
ipv6, so such messages should not participate in connection
tracking since effectively they are like ARP messages

2) fragmentation must be handled differently in ipv6, because
the simplistic "defrag, connection track and NAT, refrag"
(which the existing ipv4 connection tracking does) approach simply
isn't feasible in ipv6

3) ipv6 extension header parsing must occur at the correct spots
before and after connection tracking decisions, and there were
no provisions for this in the existing connection tracking
design

4) ipv6 has no need for stateful NAT

The ipv4 specific conntrack layer is kept around, until all of
the ipv4 specific conntrack helpers are ported over to nf_conntrack
and it is feature complete. Once that occurs, the old conntrack
stuff will get placed into the feature-removal-schedule and we will
fully kill it off 6 months later.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Harald Welte <laforge@netfilter.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>