ed13998c319b050fc9abdb73915859dfdbe1fb38 |
|
05-Jun-2013 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
sock_diag: fix filter code sent to userspace Filters need to be translated to real BPF code for userland, like SO_GETFILTER. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3e5289d5e3f98b7b5b8cac32e9e5a7004c067436 |
|
19-Mar-2013 |
Daniel Borkmann <dborkman@redhat.com> |
filter: add ANC_PAY_OFFSET instruction for loading payload start offset It is very useful to do dynamic truncation of packets. In particular, we're interested to push the necessary header bytes to the user space and cut off user payload that should probably not be transferred for some reasons (e.g. privacy, speed, or others). With the ancillary extension PAY_OFFSET, we can load it into the accumulator, and return it. E.g. in bpfc syntax ... ld #poff ; { 0x20, 0, 0, 0xfffff034 }, ret a ; { 0x16, 0, 0, 0x00000000 }, ... as a filter will accomplish this without having to do a big hackery in a BPF filter itself. Follow-up JIT implementations are welcome. Thanks to Eric Dumazet for suggesting and discussing this during the Netfilter Workshop in Copenhagen. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d59577b6ffd313d0ab3be39cb1ab47e29bdc9182 |
|
16-Jan-2013 |
Vincent Bernat <bernat@luffy.cx> |
sk-filter: Add ability to lock a socket filter program While a privileged program can open a raw socket, attach some restrictive filter and drop its privileges (or send the socket to an unprivileged program through some Unix socket), the filter can still be removed or modified by the unprivileged program. This commit adds a socket option to lock the filter (SO_LOCK_FILTER) preventing any modification of a socket filter program. This is similar to OpenBSD BIOCLOCK ioctl on bpf sockets, except even root is not allowed change/drop the filter. The state of the lock can be read with getsockopt(). No error is triggered if the state is not changed. -EPERM is returned when a user tries to remove the lock or to change/remove the filter while the lock is active. The check is done directly in sk_attach_filter() and sk_detach_filter() and does not affect only setsockopt() syscall. Signed-off-by: Vincent Bernat <bernat@luffy.cx> Signed-off-by: David S. Miller <davem@davemloft.net>
|
aa1113d9f85da59dcbdd32aeb5d71da566e46def |
|
28-Dec-2012 |
Daniel Borkmann <dborkman@redhat.com> |
net: filter: return -EINVAL if BPF_S_ANC* operation is not supported Currently, we return -EINVAL for malformed or wrong BPF filters. However, this is not done for BPF_S_ANC* operations, which makes it more difficult to detect if it's actually supported or not by the BPF machine. Therefore, we should also return -EINVAL if K is within the SKF_AD_OFF universe and the ancillary operation did not match. Why exactly is it needed? If tools such as libpcap/tcpdump want to make use of new ancillary operations (like filtering VLAN in kernel space), there is currently no sane way to test if this feature / BPF_S_ANC* op is present or not, since no error is returned. This patch will make life easier for that and allow for a proper usage for user space applications. There was concern, if this patch will break userland. Short answer: Yes and no. Long answer: It will "break" only for code that calls ... { BPF_LD | BPF_(W|H|B) | BPF_ABS, 0, 0, <K> }, ... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is *not* an ancillary. And here comes the BUT: assuming some *old* code will have such an instruction where <K> is between [0xfffff000, 0xffffffff] and it doesn't know ancillary operations, then this will give a non-expected / unwanted behavior as well (since we do not return the BPF machine with 0 after a failed load_pointer(), which was the case before introducing ancillary operations, but load sth. into the accumulator instead, and continue with the next instruction, for instance). Thus, user space code would already have been broken by introducing ancillary operations into the BPF machine per se. Code that does such a direct load, e.g. "load word at packet offset 0xffffffff into accumulator" ("ld [0xffffffff]") is quite broken, isn't it? The whole assumption of ancillary operations is that no-one intentionally calls things like "ld [0xffffffff]" and expect this word to be loaded from such a packet offset. Hence, we can also safely make use of this feature testing patch and facilitate application development. Therefore, at least from this patch onwards, we have *for sure* a check whether current or in future implemented BPF_S_ANC* ops are supported in the kernel. Patch was tested on x86_64. (Thanks to Eric for the previous review.) Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by: Ani Sinha <ani@aristanetworks.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a8fc92778080c845eaadc369a0ecf5699a03bef0 |
|
01-Nov-2012 |
Pavel Emelyanov <xemul@parallels.com> |
sk-filter: Add ability to get socket filter program (v2) The SO_ATTACH_FILTER option is set only. I propose to add the get ability by using SO_ATTACH_FILTER in getsockopt. To be less irritating to eyes the SO_GET_FILTER alias to it is declared. This ability is required by checkpoint-restore project to be able to save full state of a socket. There are two issues with getting filter back. First, kernel modifies the sock_filter->code on filter load, thus in order to return the filter element back to user we have to decode it into user-visible constants. Fortunately the modification in question is interconvertible. Second, the BPF_S_ALU_DIV_K code modifies the command argument k to speed up the run-time division by doing kernel_k = reciprocal(user_k). Bad news is that different user_k may result in same kernel_k, so we can't get the original user_k back. Good news is that we don't have to do it. What we need to is calculate a user2_k so, that reciprocal(user2_k) == reciprocal(user_k) == kernel_k i.e. if it's re-loaded back the compiled again value will be exactly the same as it was. That said, the user2_k can be calculated like this user2_k = reciprocal(kernel_k) with an exception, that if kernel_k == 0, then user2_k == 1. The optlen argument is treated like this -- when zero, kernel returns the amount of sock_fprog elements in filter, otherwise it should be large enough for the sock_fprog array. changes since v1: * Declared SO_GET_FILTER in all arch headers * Added decode of vlan-tag codes Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f3335031b9452baebfe49b8b5e55d3fe0c4677d1 |
|
27-Oct-2012 |
Eric Dumazet <edumazet@google.com> |
net: filter: add vlan tag access BPF filters lack ability to access skb->vlan_tci This patch adds two new ancillary accessors : SKF_AD_VLAN_TAG (44) mapped to vlan_tx_tag_get(skb) SKF_AD_VLAN_TAG_PRESENT (48) mapped to vlan_tx_tag_present(skb) This allows libpcap/tcpdump to use a kernel filter instead of having to fallback to accept all packets, then filter them in user space. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Ani Sinha <ani@aristanetworks.com> Suggested-by: Daniel Borkmann <danborkmann@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9e49e88958feb41ec701fa34b44723dabadbc28c |
|
24-Sep-2012 |
Daniel Borkmann <dxchgb@gmail.com> |
filter: add XOR instruction for use with X/K SKF_AD_ALU_XOR_X has been added a while ago, but as an 'ancillary' operation that is invoked through a negative offset in K within BPF load operations. Since BPF_MOD has recently been added, BPF_XOR should also be part of the common ALU operations. Removing SKF_AD_ALU_XOR_X might not be an option since this is exposed to user space. Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b6069a95706ca5738be3f5d90fd286cbd13ac695 |
|
08-Sep-2012 |
Eric Dumazet <edumazet@google.com> |
filter: add MOD operation Add a new ALU opcode, to compute a modulus. Commit ffe06c17afbbb used an ancillary to implement XOR_X, but here we reserve one of the available ALU opcode to implement both MOD_X and MOD_K Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: George Bakos <gbakos@alpinista.org> Cc: Jay Schulist <jschlst@samba.org> Cc: Jiri Pirko <jpirko@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c93bdd0e03e848555d144eb44a1f275b871a8dd5 |
|
01-Aug-2012 |
Mel Gorman <mgorman@suse.de> |
netvm: allow skb allocation to use PFMEMALLOC reserves Change the skb allocation API to indicate RX usage and use this to fall back to the PFMEMALLOC reserve when needed. SKBs allocated from the reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the reserve and the socket is later found to be unrelated to page reclaim, the packet is dropped so that the memory remains available for page reclaim. Network protocols are expected to recover from this packet loss. [a.p.zijlstra@chello.nl: Ideas taken from various patches] [davem@davemloft.net: Use static branches, coding style corrections] [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
c6c4b97c6b7003e8082dd43db224c1d1f7a24aa2 |
|
08-Jun-2012 |
Randy Dunlap <rdunlap@xenotime.net> |
net/core: fix kernel-doc warnings Fix kernel-doc warnings in net/core: Warning(net/core/skbuff.c:3368): No description found for parameter 'delta_truesize' Warning(net/core/filter.c:628): No description found for parameter 'pfp' Warning(net/core/filter.c:628): Excess function parameter 'sk' description in 'sk_unattached_filter_create' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
95c961747284a6b83a5e2d81240e214b0fa3464d |
|
15-Apr-2012 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: cleanup unsigned to unsigned int Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
46b325c7eb01482674406701825ff67f561ccdd4 |
|
12-Apr-2012 |
Will Drewry <wad@chromium.org> |
sk_run_filter: add BPF_S_ANC_SECCOMP_LD_W Introduces a new BPF ancillary instruction that all LD calls will be mapped through when skb_run_filter() is being used for seccomp BPF. The rewriting will be done using a secondary chk_filter function that is run after skb_chk_filter. The code change is guarded by CONFIG_SECCOMP_FILTER which is added, along with the seccomp_bpf_load() function later in this series. This is based on http://lkml.org/lkml/2012/3/2/141 Suggested-by: Indan Zupancic <indan@nul.nu> Signed-off-by: Will Drewry <wad@chromium.org> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Paris <eparis@redhat.com> v18: rebase ... v15: include seccomp.h explicitly for when seccomp_bpf_load exists. v14: First cut using a single additional instruction ... v13: made bpf functions generic. Signed-off-by: James Morris <james.l.morris@oracle.com>
|
ffe06c17afbbbd4d73cdc339419be232847d667a |
|
31-Mar-2012 |
Jiri Pirko <jpirko@redhat.com> |
filter: add XOR operation Add XOR instruction fo BPF machine. Needed for computing packet hashes. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
302d663740cfaf2c364df6bb61cd339014ed714c |
|
31-Mar-2012 |
Jiri Pirko <jpirko@redhat.com> |
filter: Allow to create sk-unattached filters Today, BPF filters are bind to sockets. Since BPF machine becomes handy for other purposes, this patch allows to create unattached filter. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f03fb3f455c6c3a3dfcef6c7f2dcab104c813f4b |
|
30-Mar-2012 |
Jan Seiffert <kaffeemonster@googlemail.com> |
bpf jit: Make the filter.c::__load_pointer helper non-static for the jits The function is renamed to make it a little more clear what it does. It is not added to any .h because it is not for general consumption, only for bpf internal use (and so by the jits). Signed-of-by: Jan Seiffert <kaffeemonster@googlemail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9ffc93f203c18a70623f21950f1dd473c9ec48cd |
|
28-Mar-2012 |
David Howells <dhowells@redhat.com> |
Remove all #inclusions of asm/system.h Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *` Signed-off-by: David Howells <dhowells@redhat.com>
|
4f25af27827080c3163e59c7af1ca84a05ce121c |
|
17-Oct-2011 |
Dan Carpenter <dan.carpenter@oracle.com> |
filter: use unsigned int to silence static checker warning This is just a cleanup. My testing version of Smatch warns about this: net/core/filter.c +380 check_load_and_stores(6) warn: check 'flen' for negative values flen comes from the user. We try to clamp the values here between 1 and BPF_MAXINSNS but the clamp doesn't work because it could be negative. This is a bug, but it's not exploitable. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
a9b3cd7f323b2e57593e7215362a7b02fc933e3a |
|
01-Aug-2011 |
Stephen Hemminger <shemminger@vyatta.com> |
rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER When assigning a NULL value to an RCU protected pointer, no barrier is needed. The rcu_assign_pointer, used to handle that but will soon change to not handle the special case. Convert all rcu_assign_pointer of NULL value. //smpl @@ expression P; @@ - rcu_assign_pointer(P, NULL) + RCU_INIT_POINTER(P, NULL) // </smpl> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
86e4ca66e81bba0f8640f1fa19b8b8f72cbd0561 |
|
26-May-2011 |
David S. Miller <davem@davemloft.net> |
bug.h: Move ratelimit warn interfaces to ratelimit.h As reported by Ingo Molnar, we still have configuration combinations where use of the WARN_RATELIMIT interfaces break the build because dependencies don't get met. Instead of going down the long road of trying to make it so that ratelimit.h can get included by kernel.h or asm-generic/bug.h, just move the interface into ratelimit.h and make users have to include that. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
|
6c4a5cb219520c7bc937ee186ca53f03733bd09f |
|
21-May-2011 |
Joe Perches <joe@perches.com> |
net: filter: Use WARN_RATELIMIT A mis-configured filter can spam the logs with lots of stack traces. Rate-limit the warnings and add printout of the bogus filter information. Original-patch-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0a14842f5a3c0e88a1e59fac5c3025db39721f74 |
|
20-Apr-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: filter: Just In Time compiler for x86-64 In order to speedup packet filtering, here is an implementation of a JIT compiler for x86_64 It is disabled by default, and must be enabled by the admin. echo 1 >/proc/sys/net/core/bpf_jit_enable It uses module_alloc() and module_free() to get memory in the 2GB text kernel range since we call helpers functions from the generated code. EAX : BPF A accumulator EBX : BPF X accumulator RDI : pointer to skb (first argument given to JIT function) RBP : frame pointer (even if CONFIG_FRAME_POINTER=n) r9d : skb->len - skb->data_len (headlen) r8 : skb->data To get a trace of generated code, use : echo 2 >/proc/sys/net/core/bpf_jit_enable Example of generated code : # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24 flen=18 proglen=147 pass=3 image=ffffffffa00b5000 JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60 JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00 JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00 JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0 JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00 JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00 JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24 JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31 JIT code: ffffffffa00b5090: c0 c9 c3 BPF program is 144 bytes long, so native program is almost same size ;) (000) ldh [12] (001) jeq #0x800 jt 2 jf 8 (002) ld [26] (003) and #0xffffff00 (004) jeq #0xc0a81400 jt 16 jf 5 (005) ld [30] (006) and #0xffffff00 (007) jeq #0xc0a81400 jt 16 jf 17 (008) jeq #0x806 jt 10 jf 9 (009) jeq #0x8035 jt 10 jf 17 (010) ld [28] (011) and #0xffffff00 (012) jeq #0xc0a81400 jt 16 jf 13 (013) ld [38] (014) and #0xffffff00 (015) jeq #0xc0a81400 jt 16 jf 17 (016) ret #65535 (017) ret #0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
25985edcedea6396277003854657b5f3cb31a628 |
|
31-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
|
80f8f1027b99660897bdeaeae73002185d829906 |
|
18-Jan-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: filter: dont block softirqs in sk_run_filter() Packet filter (BPF) doesnt need to disable softirqs, being fully re-entrant and lock-less. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
697d0e338c7fd392cf73bf120150ab6e5516a3a3 |
|
08-Jan-2011 |
Randy Dunlap <randy.dunlap@oracle.com> |
net: fix kernel-doc warning in core/filter.c Fix new kernel-doc notation warning in net/core/filter.c: Warning(net/core/filter.c:172): No description found for parameter 'fentry' Warning(net/core/filter.c:172): Excess function parameter 'filter' description in 'sk_run_filter' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
12b16dadbc2406144d408754f96d0f44aa016239 |
|
15-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: optimize accesses to ancillary data We can translate pseudo load instructions at filter check time to dedicated instructions to speed up filtering and avoid one switch(). libpcap currently uses SKF_AD_PROTOCOL, but custom filters probably use other ancillary accesses. Note : I made the assertion that ancillary data was always accessed with BPF_LD|BPF_?|BPF_ABS instructions, not with BPF_LD|BPF_?|BPF_IND ones (offset given by K constant, not by K + X register) On x86_64, this saves a few bytes of text : # size net/core/filter.o.* text data bss dec hex filename 4864 0 0 4864 1300 net/core/filter.o.new 4944 0 0 4944 1350 net/core/filter.o.old Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4bc65dd8d88671712d71592a83374cfb0b5fce7a |
|
07-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: use size of fetched data in __load_pointer() __load_pointer() checks data we fetch from skb is included in head portion, but assumes we fetch one byte, instead of up to four. This wont crash because we have extra bytes (struct skb_shared_info) after head, but this can read uninitialized bytes. Fix this using size of the data (1, 2, 4 bytes) in the test. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
62ab0812137ec4f9884dd7de346238841ac03283 |
|
06-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: constify sk_run_filter() sk_run_filter() doesnt write on skb, change its prototype to reflect this. Fix two af_packet comments. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2d5311e4e8272fd398fc1cf278f12fd6dee4074b |
|
01-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: add a security check at install time We added some security checks in commit 57fe93b374a6 (filter: make sure filters dont read uninitialized memory) to close a potential leak of kernel information to user. This added a potential extra cost at run time, while we can perform a check of the filter itself, to make sure a malicious user doesnt try to abuse us. This patch adds a check_loads() function, whole unique purpose is to make this check, allocating a temporary array of mask. We scan the filter and propagate a bitmask information, telling us if a load M(K) is allowed because a previous store M(K) is guaranteed. (So that sk_run_filter() can possibly not read unitialized memory) Note: this can uncover application bug, denying a filter attach, previously allowed. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dan Rosenberg <drosenberg@vsecurity.com> Cc: Changli Gao <xiaosuo@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
da2033c282264bfba4e339b7cb3df62adb5c5fc8 |
|
30-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: add SKF_AD_RXHASH and SKF_AD_CPU Add SKF_AD_RXHASH and SKF_AD_CPU to filter ancillary mechanism, to be able to build advanced filters. This can help spreading packets on several sockets with a fast selection, after RPS dispatch to N cpus for example, or to catch a percentage of flows in one queue. tcpdump -s 500 "cpu = 1" : [0] ld CPU [1] jeq #1 jt 2 jf 3 [2] ret #500 [3] ret #0 # take 12.5 % of flows (average) tcpdump -s 1000 "rxhash & 7 = 2" : [0] ld RXHASH [1] and #7 [2] jeq #2 jt 3 jf 4 [3] ret #1000 [4] ret #0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Rui <wirelesser@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
46bcf14f44d8f31ecfdc8b6708ec15a3b33316d9 |
|
06-Dec-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: fix sk_filter rcu handling Pavel Emelyanov tried to fix a race between sk_filter_(de|at)tach and sk_clone() in commit 47e958eac280c263397 Problem is we can have several clones sharing a common sk_filter, and these clones might want to sk_filter_attach() their own filters at the same time, and can overwrite old_filter->rcu, corrupting RCU queues. We can not use filter->rcu without being sure no other thread could do the same thing. Switch code to a more conventional ref-counting technique : Do the atomic decrement immediately and queue one rcu call back when last reference is released. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
c26aed40f4fd18f86bcc6aba557cab700b129b73 |
|
18-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: use reciprocal divide At compile time, we can replace the DIV_K instruction (divide by a constant value) by a reciprocal divide. At exec time, the expensive divide is replaced by a multiply, a less expensive operation on most processors. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8c1592d68bc89248bfd0ee287648f41c1370d826 |
|
18-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: cleanup codes[] init Starting the translated instruction to 1 instead of 0 allows us to remove one descrement at check time and makes codes[] array init cleaner. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
93aaae2e01e57483256b7da05c9a7ebd65ad4686 |
|
19-Nov-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: optimize sk_run_filter Remove pc variable to avoid arithmetic to compute fentry at each filter instruction. Jumps directly manipulate fentry pointer. As the last instruction of filter[] is guaranteed to be a RETURN, and all jumps are before the last instruction, we dont need to check filter bounds (number of instructions in filter array) at each iteration, so we remove it from sk_run_filter() params. On x86_32 remove f_k var introduced in commit 57fe93b374a6b871 (filter: make sure filters dont read uninitialized memory) Note : We could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS in order to avoid too many ifdefs in this code. This helps compiler to use cpu registers to hold fentry and A accumulator. On x86_32, this saves 401 bytes, and more important, sk_run_filter() runs much faster because less register pressure (One less conditional branch per BPF instruction) # size net/core/filter.o net/core/filter_pre.o text data bss dec hex filename 2948 0 0 2948 b84 net/core/filter.o 3349 0 0 3349 d15 net/core/filter_pre.o on x86_64 : # size net/core/filter.o net/core/filter_pre.o text data bss dec hex filename 5173 0 0 5173 1435 net/core/filter.o 5224 0 0 5224 1468 net/core/filter_pre.o Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0302b8622ce696af1cda22fcf207d3793350e896 |
|
18-Nov-2010 |
Randy Dunlap <randy.dunlap@oracle.com> |
net: fix kernel-doc for sk_filter_rcu_release Fix kernel-doc warning for sk_filter_rcu_release(): Warning(net/core/filter.c:586): missing initial short description on line: * sk_filter_rcu_release: Release a socket filter by rcu_head Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
4c3710afbc333c33100739dec10662b4ee64e219 |
|
16-Nov-2010 |
Changli Gao <xiaosuo@gmail.com> |
net: move definitions of BPF_S_* to net/core/filter.c BPF_S_* are used internally, should not be exposed to the others. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cba328fc5ede9091616e7296483840869b615a46 |
|
16-Nov-2010 |
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> |
filter: Optimize instruction revalidation code. Since repeating u16 value to u8 value conversion using switch() clause's case statement is wasteful, this patch introduces u16 to u8 mapping table and removes most of case statements. As a result, the size of net/core/filter.o is reduced by about 29% on x86. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
57fe93b374a6b8711995c2d466c502af9f3a08bb |
|
10-Nov-2010 |
David S. Miller <davem@davemloft.net> |
filter: make sure filters dont read uninitialized memory There is a possibility malicious users can get limited information about uninitialized stack mem array. Even if sk_run_filter() result is bound to packet length (0 .. 65535), we could imagine this can be used by hostile user. Initializing mem[] array, like Dan Rosenberg suggested in his patch is expensive since most filters dont even use this array. Its hard to make the filter validation in sk_chk_filter(), because of the jumps. This might be done later. In this patch, I use a bitmap (a single long var) so that only filters using mem[] loads/stores pay the price of added security checks. For other filters, additional cost is a single instruction. [ Since we access fentry->k a lot now, cache it in a local variable and mark filter entry pointer as const. -DaveM ] Reported-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0d7da9ddd9a4eb7808698d04b98bf9d62d02649b |
|
25-Oct-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add __rcu annotation to sk_filter Add __rcu annotation to : (struct sock)->sk_filter And use appropriate rcu primitives to reduce sparse warnings if CONFIG_SPARSE_RCU_POINTER=y Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
f91ff5b9ff529be8aac2039af63b2c8ea6cd6ebe |
|
27-Sep-2010 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: sk_{detach|attach}_filter() rcu fixes sk_attach_filter() and sk_detach_filter() are run with socket locked. Use the appropriate rcu_dereference_protected() instead of blocking BH, and rcu_dereference_bh(). There is no point adding BH prevention and memory barrier. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
01f2f3f6ef4d076c0c10a8a7b42624416d56b523 |
|
19-Jun-2010 |
Hagen Paul Pfeifer <hagen@jauu.net> |
net: optimize Berkeley Packet Filter (BPF) processing Gcc is currenlty not in the ability to optimize the switch statement in sk_run_filter() because of dense case labels. This patch replace the OR'd labels with ordered sequenced case labels. The sk_chk_filter() function is modified to patch/replace the original OPCODES in a ordered but equivalent form. gcc is now in the ability to transform the switch statement in sk_run_filter into a jump table of complexity O(1). Until this patch gcc generates a sequence of conditional branches (O(n) of 567 byte .text segment size (arch x86_64): 7ff: 8b 06 mov (%rsi),%eax 801: 66 83 f8 35 cmp $0x35,%ax 805: 0f 84 d0 02 00 00 je adb <sk_run_filter+0x31d> 80b: 0f 87 07 01 00 00 ja 918 <sk_run_filter+0x15a> 811: 66 83 f8 15 cmp $0x15,%ax 815: 0f 84 c5 02 00 00 je ae0 <sk_run_filter+0x322> 81b: 77 73 ja 890 <sk_run_filter+0xd2> 81d: 66 83 f8 04 cmp $0x4,%ax 821: 0f 84 17 02 00 00 je a3e <sk_run_filter+0x280> 827: 77 29 ja 852 <sk_run_filter+0x94> 829: 66 83 f8 01 cmp $0x1,%ax [...] With the modification the compiler translate the switch statement into the following jump table fragment: 7ff: 66 83 3e 2c cmpw $0x2c,(%rsi) 803: 0f 87 1f 02 00 00 ja a28 <sk_run_filter+0x26a> 809: 0f b7 06 movzwl (%rsi),%eax 80c: ff 24 c5 00 00 00 00 jmpq *0x0(,%rax,8) 813: 44 89 e3 mov %r12d,%ebx 816: e9 43 03 00 00 jmpq b5e <sk_run_filter+0x3a0> 81b: 41 89 dc mov %ebx,%r12d 81e: e9 3b 03 00 00 jmpq b5e <sk_run_filter+0x3a0> Furthermore, I reordered the instructions to reduce cache line misses by order the most common instruction to the start. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
40eaf96271526a9f71030dd1a199ce46c045752e |
|
22-Apr-2010 |
Paul LeoNerd Evans <leonerd@leonerd.org.uk> |
net: Socket filter ancilliary data access for skb->dev->type Add an SKF_AD_HATYPE field to the packet ancilliary data area, giving access to skb->dev->type, as reported in the sll_hatype field. When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all interfaces, there doesn't appear to be a way for the filter program to actually find out the underlying hardware type the packet was captured on. This patch adds such ability. This patch also handles the case where skb->dev can be NULL, such as on netlink sockets. Signed-off-by: Paul Evans <leonerd@leonerd.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
5a0e3ad6af8660be21ca98a971cd00f331318c05 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
a898def29e4119bc01ebe7ca97423181f4c0ea2d |
|
23-Feb-2010 |
Paul E. McKenney <paulmck@linux.vnet.ibm.com> |
net: Add checking to rcu_dereference() primitives Update rcu_dereference() primitives to use new lockdep-based checking. The rcu_dereference() in __in6_dev_get() may be protected either by rcu_read_lock() or RTNL, per Eric Dumazet. The rcu_dereference() in __sk_free() is protected by the fact that it is never reached if an update could change it. Check for this by using rcu_dereference_check() to verify that the struct sock's ->sk_wmem_alloc counter is zero. Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
5ff3f073670b544a9c0547cc6fef1f7eed5762ed |
|
14-Feb-2010 |
Michael S. Tsirkin <mst@redhat.com> |
net: export attach/detach filter routines Export sk_attach_filter/sk_detach_filter routines, so that tun module can use them. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d19742fb1c68e6db83b76e06dea5a374c99e104f |
|
20-Oct-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
filter: Add SKF_AD_QUEUE instruction It can help being able to filter packets on their queue_mapping. If filter performance is not good, we could add a "numqueue" field in struct packet_type, so that netif_nit_deliver() and other functions can directly ignore packets with not expected queue number. Lets experiment this simple filter extension first. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7e75f93eda027d9f9e0203ee6ffd210ea92e98f3 |
|
19-Oct-2009 |
jamal <hadi@cyberus.ca> |
pkt_sched: ingress socket filter by mark Allow bpf to set a filter to drop packets that dont match a specific mark Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d214c7537bbf2f247991fb65b3420b0b3d712c67 |
|
20-Nov-2008 |
Pablo Neira Ayuso <pablo@netfilter.org> |
filter: add SKF_AD_NLATTR_NEST to look for nested attributes SKF_AD_NLATTR allows us to find the first matching attribute in a stream of netlink attributes from one offset to the end of the netlink message. This is not suitable to look for a specific matching inside a set of nested attributes. For example, in ctnetlink messages, if we look for the CTA_V6_SRC attribute in a message that talks about an IPv4 connection, SKF_AD_NLATTR returns the offset of CTA_STATUS which has the same value of CTA_V6_SRC but outside the nest. To differenciate CTA_STATUS and CTA_V6_SRC, we would have to make assumptions on the size of the attribute and the usual offset, resulting in horrible BSF code. This patch adds SKF_AD_NLATTR_NEST, which is a variant of SKF_AD_NLATTR, that looks for an attribute inside the limits of a nested attributes, but not further. This patch validates that we have enough room to look for the nested attributes - based on a suggestion from Patrick McHardy. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
8fde8a076940969d32805b853efdce8b988d7dda |
|
02-Jul-2008 |
Wang Chen <wangchen@cn.fujitsu.com> |
net: Tyop of sk_filter() comment Parameter "needlock" no long exists. Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d3e2ce3bcdbf4319dea308c79b5f72a8ecc8015c |
|
03-May-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
net: use get/put_unaligned_* helpers Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4738c1db1593687713869fa69e733eebc7b0d6d8 |
|
10-Apr-2008 |
Patrick McHardy <kaber@trash.net> |
[SKFILTER]: Add SKF_ADF_NLATTR instruction SKF_ADF_NLATTR searches for a netlink attribute, which avoids manually parsing and walking attributes. It takes the offset at which to start searching in the 'A' register and the attribute type in the 'X' register and returns the offset in the 'A' register. When the attribute is not found it returns zero. A top-level attribute can be located using a filter like this (example for nfnetlink, using struct nfgenmsg): ... { /* A = offset of first attribute */ .code = BPF_LD | BPF_IMM, .k = sizeof(struct nlmsghdr) + sizeof(struct nfgenmsg) }, { /* X = CTA_PROTOINFO */ .code = BPF_LDX | BPF_IMM, .k = CTA_PROTOINFO, }, { /* A = netlink attribute offset */ .code = BPF_LD | BPF_B | BPF_ABS, .k = SKF_AD_OFF + SKF_AD_NLATTR }, { /* Exit if not found */ .code = BPF_JMP | BPF_JEQ | BPF_K, .k = 0, .jt = <error> }, ... A nested attribute below the CTA_PROTOINFO attribute would then be parsed like this: ... { /* A += sizeof(struct nlattr) */ .code = BPF_ALU | BPF_ADD | BPF_K, .k = sizeof(struct nlattr), }, { /* X = CTA_PROTOINFO_TCP */ .code = BPF_LDX | BPF_IMM, .k = CTA_PROTOINFO_TCP, }, { /* A = netlink attribute offset */ .code = BPF_LD | BPF_B | BPF_ABS, .k = SKF_AD_OFF + SKF_AD_NLATTR }, ... The data of an attribute can be loaded into 'A' like this: ... { /* X = A (attribute offset) */ .code = BPF_MISC | BPF_TAX, }, { /* A = skb->data[X + k] */ .code = BPF_LD | BPF_B | BPF_IND, .k = sizeof(struct nlattr), }, ... Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
43db6d65e0ef943a361cb91f8baa49132009227b |
|
10-Apr-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
socket: sk_filter deinline The sk_filter function is too big to be inlined. This saves 2296 bytes of text on allyesconfig. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
b715631fad3ed320b85d386a84a6fb0b3f86b0b9 |
|
10-Apr-2008 |
Stephen Hemminger <shemminger@vyatta.com> |
socket: sk_filter minor cleanups Some minor style cleanups: * Move __KERNEL__ definitions to one place in filter.h * Use const for sk_filter_len * Line wrapping * Put EXPORT_SYMBOL next to function definition Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9b013e05e0289c190a53d78ca029e2f21c0e4485 |
|
19-Oct-2007 |
Olof Johansson <olof@lixom.net> |
[NET]: Fix bug in sk_filter race cures. Looks like this might be causing problems, at least for me on ppc. This happened during a normal boot, right around first interface config/dhcp run.. cpu 0x0: Vector: 300 (Data Access) at [c00000000147b820] pc: c000000000435e5c: .sk_filter_delayed_uncharge+0x1c/0x60 lr: c0000000004360d0: .sk_attach_filter+0x170/0x180 sp: c00000000147baa0 msr: 9000000000009032 dar: 4 dsisr: 40000000 current = 0xc000000004780fa0 paca = 0xc000000000650480 pid = 1295, comm = dhclient3 0:mon> t [c00000000147bb20] c0000000004360d0 .sk_attach_filter+0x170/0x180 [c00000000147bbd0] c000000000418988 .sock_setsockopt+0x788/0x7f0 [c00000000147bcb0] c000000000438a74 .compat_sys_setsockopt+0x4e4/0x5a0 [c00000000147bd90] c00000000043955c .compat_sys_socketcall+0x25c/0x2b0 [c00000000147be30] c000000000007508 syscall_exit+0x0/0x40 --- Exception: c01 (System Call) at 000000000ff618d8 SP (fffdf040) is in userspace 0:mon> I.e. null pointer deref at sk_filter_delayed_uncharge+0x1c: 0:mon> di $.sk_filter_delayed_uncharge c000000000435e40 7c0802a6 mflr r0 c000000000435e44 fbc1fff0 std r30,-16(r1) c000000000435e48 7c8b2378 mr r11,r4 c000000000435e4c ebc2cdd0 ld r30,-12848(r2) c000000000435e50 f8010010 std r0,16(r1) c000000000435e54 f821ff81 stdu r1,-128(r1) c000000000435e58 380300a4 addi r0,r3,164 c000000000435e5c 81240004 lwz r9,4(r4) That's the deref of fp: static void sk_filter_delayed_uncharge(struct sock *sk, struct sk_filter *fp) { unsigned int size = sk_filter_len(fp); ... That is called from sk_attach_filter(): ... rcu_read_lock_bh(); old_fp = rcu_dereference(sk->sk_filter); rcu_assign_pointer(sk->sk_filter, fp); rcu_read_unlock_bh(); sk_filter_delayed_uncharge(sk, old_fp); return 0; ... So, looks like rcu_dereference() returned NULL. I don't know the filter code at all, but it seems like it might be a valid case? sk_detach_filter() seems to handle a NULL sk_filter, at least. So, this needs review by someone who knows the filter, but it fixes the problem for me: Signed-off-by: Olof Johansson <olof@lixom.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
47e958eac280c263397582d5581e868c3227a1bd |
|
18-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Fix the race between sk_filter_(de|at)tach and sk_clone() The proposed fix is to delay the reference counter decrement until the quiescent state pass. This will give sk_clone() a chance to get the reference on the cloned filter. Regular sk_filter_uncharge can happen from the sk_free() only and there's no need in delaying the put - the socket is dead anyway and is to be release itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d3904b739928edd83d117f1eb5bfa69f18d6f046 |
|
18-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Cleanup the error path in sk_attach_filter The sk_filter_uncharge is called for error handling and for releasing the former filter, but this will have to be done in a bit different manner, so cleanup the error path a bit. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
309dd5fc872448e35634d510049642312ebc170d |
|
18-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Move the filter releasing into a separate call This is done merely as a preparation for the fix. The sk_filter_uncharge() unaccounts the filter memory and calls the sk_filter_release(), which in turn decrements the refcount anf frees the filter. The latter function will be required separately. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
55b333253d5bcafbe187b50474e40789301c53c6 |
|
18-Oct-2007 |
Pavel Emelyanov <xemul@openvz.org> |
[NET]: Introduce the sk_detach_filter() call Filter is attached in a separate function, so do the same for filter detaching. This also removes one variable sock_setsockopt(). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
27a884dc3cb63b93c2b3b643f5b31eed5f8a4d26 |
|
20-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Convert skb->tail to sk_buff_data_t So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes on 64bit architectures, allowing us to combine the 4 bytes hole left by the layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4 64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN... :-) Many calculations that previously required that skb->{transport,network, mac}_header be first converted to a pointer now can be done directly, being meaningful as offsets or pointers. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
d56f90a7c96da5187f0cdf07ee7434fe6aa78bbc |
|
11-Apr-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_network_header() For the places where we need a pointer to the network header, it is still legal to touch skb->nh.raw directly if just adding to, subtracting from or setting it to another layer header. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
98e399f82ab3a6d863d1d4a7ea48925cc91c830e |
|
19-Mar-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_mac_header() For the places where we need a pointer to the mac header, it is still legal to touch skb->mac.raw directly if just adding to, subtracting from or setting it to another layer header. This one also converts some more cases to skb_reset_mac_header() that my regex missed as it had no spaces before nor after '=', ugh. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
cd354f1ae75e6466a7e31b727faede57a1f89ca5 |
|
14-Feb-2007 |
Tim Schmielau <tim@physik3.uni-rostock.de> |
[PATCH] remove many unneeded #includes of sched.h After Al Viro (finally) succeeded in removing the sched.h #include in module.h recently, it makes sense again to remove other superfluous sched.h includes. There are quite a lot of files which include it but don't actually need anything defined in there. Presumably these includes were once needed for macros that used to live in sched.h, but moved to other header files in the course of cleaning it up. To ease the pain, this time I did not fiddle with any header files and only removed #includes from .c-files, which tend to cause less trouble. Compile tested against 2.6.20-rc2 and 2.6.20-rc2-mm2 (with offsets) on alpha, arm, i386, ia64, mips, powerpc, and x86_64 with allnoconfig, defconfig, allmodconfig, and allyesconfig as well as a few randconfigs on x86_64 and all configs in arch/arm/configs on arm. I also checked that no new warnings were introduced by the patch (actually, some warnings are removed that were emitted by unnecessarily included header files). Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
4ec93edb14fe5fdee9fae6335f2cbba204627eac |
|
09-Feb-2007 |
YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> |
[NET] CORE: Fix whitespace errors. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
252e33467a3b016f20dd8df12269cef3b167f21e |
|
15-Nov-2006 |
Al Viro <viro@zeniv.linux.org.uk> |
[NET] net/core: Annotations. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
|
fda9ef5d679b07c9d9097aaf6ef7f069d794a8f9 |
|
01-Sep-2006 |
Dmitry Mishin <dim@openvz.org> |
[NET]: Fix sk->sk_filter field access Function sk_filter() is called from tcp_v{4,6}_rcv() functions with arg needlock = 0, while socket is not locked at that moment. In order to avoid this and similar issues in the future, use rcu for sk->sk_filter field read protection. Signed-off-by: Dmitry Mishin <dim@openvz.org> Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: Kirill Korotaev <dev@openvz.org>
|
40daafc80b0f6a950c9252f9f1a242ab5cb6a648 |
|
18-Apr-2006 |
Dmitry Mishin <dim@openvz.org> |
unaligned access in sk_run_filter() This patch fixes unaligned access warnings noticed on IA64 in sk_run_filter(). 'ptr' can be unaligned. Signed-off-By: Dmitry Mishin <dim@openvz.org> Signed-off-By: Kirill Korotaev <dev@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
|
2966b66c25f81ad2b3298b651614c6a3be1a977f |
|
24-Jan-2006 |
Kris Katterjohn <kjak@users.sourceforge.net> |
[NET]: more whitespace issues in net/core/filter.c This fixes some whitespace issues in net/core/filter.c Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
e35bedf369b17120dbd7d554bee45407a3825267 |
|
17-Jan-2006 |
Kris Katterjohn <kjak@users.sourceforge.net> |
[NET]: Fix whitespace issues in net/core/filter.c This fixes some whitespace issues in net/core/filter.c Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
7b11f69fb5c475f521db79f5fa22104e15842671 |
|
13-Jan-2006 |
Kris Katterjohn <kjak@users.sourceforge.net> |
[NET]: Clean up comments for sk_chk_filter() This removes redundant comments, and moves one comment to a better location. Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
4bad4dc919573dbe9a5b41dd9edff279e99822d7 |
|
06-Jan-2006 |
Kris Katterjohn <kjak@ispwest.com> |
[NET]: Change sk_run_filter()'s return type in net/core/filter.c It should return an unsigned value, and fix sk_filter() as well. Signed-off-by: Kris Katterjohn <kjak@ispwest.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
9369986306d4692f37b61302d4e1ce3054d8833e |
|
04-Jan-2006 |
Kris Katterjohn <kjak@users.sourceforge.net> |
[NET]: More instruction checks fornet/core/filter.c Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1b93ae64cabe5e28dd5a1f35f96f938ca4f6ae20 |
|
27-Dec-2005 |
David S. Miller <davem@sunset.davemloft.net> |
[NET]: Validate socket filters against BPF_MAXINSNS in one spot. Currently the checks are scattered all over and this leads to inconsistencies and even cases where the check is not made. Based upon a patch from Kris Katterjohn. Signed-off-by: David S. Miller <davem@davemloft.net>
|
fb0d366b0803571f06a5b838f02c6706fc287995 |
|
20-Nov-2005 |
Kris Katterjohn <kjak@users.sourceforge.net> |
[NET]: Reject socket filter if division by constant zero is attempted. This way we don't have to check it in sk_run_filter(). Signed-off-by: Kris Katterjohn <kjak@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1198ad002ad36291817c7bf0308ab9c50ee2571d |
|
06-Sep-2005 |
Herbert Xu <herbert@gondor.apana.org.au> |
[NET]: 2.6.13 breaks libpcap (and tcpdump) Patrick McHardy says: Never mind, I got it, we never fall through to the second switch statement anymore. I think we could simply break when load_pointer returns NULL. The switch statement will fall through to the default case and return 0 for all cases but 0 > k >= SKF_AD_OFF. Here's a patch to do just that. I left BPF_MSH alone because it's really a hack to calculate the IP header length, which makes no sense when applied to the special data. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
3154e540e374bbfd62693d95bc8ed51da95efe75 |
|
05-Jul-2005 |
Patrick McHardy <kaber@trash.net> |
[NET]: net/core/filter.c: make len cover the entire packet As suggested by Herbert Xu: Since we don't require anything to be in the linear packet range anymore make len cover the entire packet. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
0b05b2a49e430220876f8faa7e4778dc7497033c |
|
05-Jul-2005 |
Patrick McHardy <kaber@trash.net> |
[NET]: Consolidate common code in net/core/filter.c Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
|
6935d46c2da64aa032a557374c95336e265cd7ef |
|
05-Jul-2005 |
Patrick McHardy <kaber@trash.net> |
[NET]: Remove redundant code in net/core/filter.c skb_header_pointer handles linear and non-linear data, no need to handle linear data again. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
|
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 |
|
17-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|