History log of /net/ipv4/inet_diag.c
Revision Date Author Comments
70315d22d3c7383f9a508d0aab21e2eb35b2303a 10-Jan-2014 Neal Cardwell <ncardwell@google.com> inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets

Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT
and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock
(not just TIME_WAIT), and for such sockets the tw_substate field holds
the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2.

This brings the inet_diag state-matching code in line with the field
it uses to populate idiag_state. This is also analogous to the info
exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state
and get_timewait4_sock() exports tw->tw_substate.

Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state
fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi
state time-wait" would also return sockets in state TCP_FIN_WAIT2.

This is an old bug that predates 05dbc7b ("tcp/dccp: remove twchain").

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b1aac815c0891fe4a55a6b0b715910142227700f 17-Dec-2013 Daniel Borkmann <dborkman@redhat.com> net: inet_diag: zero out uninitialized idiag_{src,dst} fields

Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].

In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...

rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);

... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:

r->id.idiag_src[0] = inet->inet_rcv_saddr;
r->id.idiag_dst[0] = inet->inet_daddr;

struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.

So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).

Similarly, fix also other places where we fill that out.

Reported-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c1d607cc4a8ea1ef89d7f6f5728112bc5a52f2f6 11-Oct-2013 Eric Dumazet <edumazet@google.com> inet_diag: use sock_gen_put()

TCP listener refactoring, part 6 :

Use sock_gen_put() from inet_diag_dump_one_icsk() for future
SYN_RECV support.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
634fb979e8f3a70f04c1f2f519d0cd1142eb5c1a 10-Oct-2013 Eric Dumazet <edumazet@google.com> inet: includes a sock_common in request_sock

TCP listener refactoring, part 5 :

We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.

This patch includes the needed struct sock_common in front
of struct request_sock

This means there is no more inet6_request_sock IPv6 specific
structure.

Following inet_request_sock fields were renamed as they became
macros to reference fields from struct sock_common.
Prefix ir_ was chosen to avoid name collisions.

loc_port -> ir_loc_port
loc_addr -> ir_loc_addr
rmt_addr -> ir_rmt_addr
rmt_port -> ir_rmt_port
iif -> ir_iif

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
efe4208f47f907b86f528788da711e8ab9dea44d 04-Oct-2013 Eric Dumazet <edumazet@google.com> ipv6: make lookups simpler and faster

TCP listener refactoring, part 4 :

To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common

Now is time to do the same for IPv6, because it permits us to have fast
lookups for all kind of sockets, including upcoming SYN_RECV.

Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).

inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6

This patch is way bigger than its IPv4 counter part, because for IPv4,
we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
it's not doable easily.

inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr

And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
at the same offset.

We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
05dbc7b59481ca891bbcfe6799a562d48159fbf7 03-Oct-2013 Eric Dumazet <edumazet@google.com> tcp/dccp: remove twchain

TCP listener refactoring, part 3 :

Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.

Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
friend states) sockets, another for TIME_WAIT sockets only.

As the hash table is sized to get at most one socket per bucket, it
makes little sense to have separate twchain, as it makes the lookup
slightly more complicated, and doubles hash table memory usage.

If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.

[ INET_TW_MATCH() is no longer needed ]

I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()

This way, SYN_RECV pseudo sockets will be supported the same.

A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].

Note this helper should only be called in real slow path, when rcu
lookup found a socket that was moved to another identity (freed/reused
immediately), but could eventually be used in other contexts, like
sock_edemux()

Before patch :

dmesg | grep "TCP established"

TCP established hash table entries: 524288 (order: 11, 8388608 bytes)

After patch :

TCP established hash table entries: 524288 (order: 10, 4194304 bytes)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
96f817fedec48b59c9e8b22141cec4e56ad47913 03-Oct-2013 Eric Dumazet <edumazet@google.com> tcp: shrink tcp6_timewait_sock by one cache line

While working on tcp listener refactoring, I found that it
would really make things easier if sock_common could include
the IPv6 addresses needed in the lookups, instead of doing
very complex games to get their values (depending on sock
being SYN_RECV, ESTABLISHED, TIME_WAIT)

For this to happen, I need to be sure that tcp6_timewait_sock
and tcp_timewait_sock consume same number of cache lines.

This is possible if we only use 32bits for tw_ttd, as we remove
one 32bit hole in inet_timewait_sock

inet_tw_time_stamp() is defined and used, even if its current
implementation looks like tcp_time_stamp : We might need finer
resolution for tcp_time_stamp in the future.

Before patch : sizeof(struct tcp6_timewait_sock) = 0xc8

After patch : sizeof(struct tcp6_timewait_sock) = 0xc0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e32123e59871b9389d5b3fe9318611c7f1d1307a 17-Apr-2013 Patrick McHardy <kaber@trash.net> netlink: rename ssk to sk in struct netlink_skb_params

Memory mapped netlink needs to store the receiving userspace socket
when sending from the kernel to userspace. Rename 'ssk' to 'sk' to
avoid confusion.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ba8a3b19e764b6a65e4030ab0999be50c291e6c 11-Mar-2013 Nandita Dukkipati <nanditad@google.com> tcp: Tail loss probe (TLP)

This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.

TLP Algorithm

On transmission of new data in Open state:
-> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
-> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
-> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
-> Connection is in Open state.
-> Connection is either cwnd limited or no new data to send.
-> Number of probes per tail loss episode is limited to one.
-> Connection is SACK enabled.

When PTO fires:
new_segment_exists:
-> transmit new segment.
-> packets_out++. cwnd remains same.

no_new_packet:
-> retransmit the last segment.
Its ACK triggers FACK or early retransmit based recovery.

ACK path:
-> rearm RTO at start of ACK processing.
-> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
==1; enables RFC5827 ER.
==2; delayed ER.
==3; TLP and delayed ER. [DEFAULT]
==4; TLP only.

The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5e1f54201cb481f40a04bc47e1bc8c093a189e23 09-Dec-2012 Neal Cardwell <ncardwell@google.com> inet_diag: validate port comparison byte code to prevent unsafe reads

Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.

Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f67caec9068cee426ec23cf9005a1dee2ecad187 08-Dec-2012 Neal Cardwell <ncardwell@google.com> inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()

Add logic to check the address family of the user-supplied conditional
and the address family of the connection entry. We now do not do
prefix matching of addresses from different address families (AF_INET
vs AF_INET6), except for the previously existing support for having an
IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
maintains as-is).

This change is needed for two reasons:

(1) The addresses are different lengths, so comparing a 128-bit IPv6
prefix match condition to a 32-bit IPv4 connection address can cause
us to unwittingly walk off the end of the IPv4 address and read
garbage or oops.

(2) The IPv4 and IPv6 address spaces are semantically distinct, so a
simple bit-wise comparison of the prefixes is not meaningful, and
would lead to bogus results (except for the IPv4-mapped IPv6 case,
which this commit maintains).

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
405c005949e47b6e91359159c24753519ded0c67 08-Dec-2012 Neal Cardwell <ncardwell@google.com> inet_diag: validate byte code to prevent oops in inet_diag_bc_run()

Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
operations.

Previously we did not validate the inet_diag_hostcond, address family,
address length, and prefix length. So a malicious user could make the
kernel read beyond the end of the bytecode array by claiming to have a
whole inet_diag_hostcond when the bytecode was not long enough to
contain a whole inet_diag_hostcond of the given address family. Or
they could make the kernel read up to about 27 bytes beyond the end of
a connection address by passing a prefix length that exceeded the
length of addresses of the given family.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1c95df85ca49640576de2f0a850925957b547b84 08-Dec-2012 Neal Cardwell <ncardwell@google.com> inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state

Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
instantiated for IPv4 traffic and in the SYN-RECV state were actually
created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
means that for such connections inet6_rsk(req) returns a pointer to a
random spot in memory up to roughly 64KB beyond the end of the
request_sock.

With this bug, for a server using AF_INET6 TCP sockets and serving
IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
inet_diag_fill_req() causing an oops or the export to user space of 16
bytes of kernel memory as a garbage IPv6 address, depending on where
the garbage inet6_rsk(req) pointed.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cacb6ba0f36ab14a507f4ee7697e8332899015d2 03-Nov-2012 Cyrill Gorcunov <gorcunov@openvz.org> net: inet_diag -- Return error code if protocol handler is missed

We've observed that in case if UDP diag module is not
supported in kernel the netlink returns NLMSG_DONE without
notifying a caller that handler is missed.

This patch makes __inet_diag_dump to return error code instead.

So as example it become possible to detect such situation
and handle it gracefully on userspace level.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: David Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6c022a4fa2d2d9ca9d0a7ac3b05ad988f39fc30 28-Oct-2012 Eric Dumazet <edumazet@google.com> tcp: better retrans tracking for defer-accept

For passive TCP connections using TCP_DEFER_ACCEPT facility,
we incorrectly increment req->retrans each time timeout triggers
while no SYNACK is sent.

SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
which we received the ACK from client). Only the last SYNACK is sent
so that we can receive again an ACK from client, to move the req into
accept queue. We plan to change this later to avoid the useless
retransmit (and potential problem as this SYNACK could be lost)

TCP_INFO later gives wrong information to user, claiming imaginary
retransmits.

Decouple req->retrans field into two independent fields :

num_retrans : number of retransmit
num_timeout : number of timeouts

num_timeout is the counter that is incremented at each timeout,
regardless of actual SYNACK being sent or not, and used to
compute the exponential timeout.

Introduce inet_rtx_syn_ack() helper to increment num_retrans
only if ->rtx_syn_ack() succeeded.

Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
when we re-send a SYNACK in answer to a (retransmitted) SYN.
Prior to this patch, we were not counting these retransmits.

Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
only if a synack packet was successfully queued.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Elliott Hughes <enh@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e4e541a84863b6a41f2427f59cc9156c644491a8 23-Oct-2012 Pavel Emelyanov <xemul@parallels.com> sock-diag: Report shutdown for inet and unix sockets (v2)

Make it simple -- just put new nlattr with just sk->sk_shutdown bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15e473046cb6e5d18a4d0057e61d76315230382b 07-Sep-2012 Eric W. Biederman <ebiederm@xmission.com> netlink: Rename pid to portid to avoid confusion

It is a frequent mistake to confuse the netlink port identifier with a
process identifier. Try to reduce this confusion by renaming fields
that hold port identifiers portid instead of pid.

I have carefully avoided changing the structures exported to
userspace to avoid changing the userspace API.

I have successfully built an allyesconfig kernel with this change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d06ca9564350184a19b5aae9ac150f1b1306de29 25-May-2012 Eric W. Biederman <ebiederm@xmission.com> userns: Teach inet_diag to work with user namespaces

Compute the user namespace of the socket that we are replying to
and translate the kuids of reported sockets into that user namespace.

Cc: Andrew Vagin <avagin@openvz.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
51d7cccf07238f5236c5b9269231a30dd5f8e714 16-Jul-2012 Andrey Vagin <avagin@openvz.org> net: make sock diag per-namespace

Before this patch sock_diag works for init_net only and dumps
information about sockets from all namespaces.

This patch expands sock_diag for all name-spaces.
It creates a netlink kernel socket for each netns and filters
data during dumping.

v2: filter accoding with netns in all places
remove an unused variable.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6e277ed59a45544786f4c4643a69527138c24fc1 27-Jun-2012 Thomas Graf <tgraf@suug.ch> inet_diag: Do not use RTA_PUT() macros

Also, no need to trim on nlmsg_put() failure, nothing has been added
yet. We also want to use nlmsg_end(), nlmsg_new() and nlmsg_free().

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
d106352d9f527fe336749ad89de7e07e5af91a68 27-Jun-2012 David S. Miller <davem@davemloft.net> inet_diag: Move away from NLMSG_PUT().

And use nlmsg_data() while we're here too, and remove useless
casts.

Signed-off-by: David S. Miller <davem@davemloft.net>
8dcf01fc009d12d01fd195ed95eaaee61178f21a 24-Apr-2012 Shan Wei <davidshan@tencent.com> net: sock_diag_handler structs can be const

read only, so change it to const.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
62ad6fcd743792bf294f2a7ba26ab8f462065150 24-Apr-2012 Shan Wei <davidshan@tencent.com> udp_diag: implement idiag_get_info for udp/udplite to get queue information

When we use netlink to monitor queue information for udp socket,
idiag_rqueue and idiag_wqueue of inet_diag_msg are returned with 0.

Keep consistent with netstat, just return back allocated rmem/wmem size.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
80d326fab534a5380e8f6e509a0b9076655a9670 24-Feb-2012 Pablo Neira Ayuso <pablo@netfilter.org> netlink: add netlink_dump_control structure for netlink_dump_start()

Davem considers that the argument list of this interface is getting
out of control. This patch tries to address this issue following
his proposal:

struct netlink_dump_control c = { .dump = dump, .done = done, ... };

netlink_dump_start(..., &c);

Suggested by David S. Miller.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b09c84cb622ffbcdb5d541986b1eaf7d5812602 10-Jan-2012 Pavel Emelyanov <xemul@parallels.com> inet_diag: Rename inet_diag_req_compat into inet_diag_req

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c8991362a0d3cf317dfbfb6cb946607870654e6d 10-Jan-2012 Pavel Emelyanov <xemul@parallels.com> inet_diag: Rename inet_diag_req into inet_diag_req_v2

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c0636faa539ec4205ec50e80844a5b0454b4bbaa 30-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Add the SKMEMINFO extension

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f65c1b534b99aef1809b893387b295963821549f 15-Dec-2011 Pavel Emelyanov <xemul@parallels.com> sock_diag: Generalize requests cookies managements

The sk address is used as a cookie between dump/get_exact calls.
It will be required for unix socket sdumping, so move it from
inet_diag to sock_diag.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aec8dc62f66199aef153d86e1f90d9c1d14696e3 15-Dec-2011 Pavel Emelyanov <xemul@parallels.com> sock_diag: Fix module netlink aliases

I've made a mistake when fixing the sock_/inet_diag aliases :(

1. The sock_diag layer should request the family-based alias,
not just the IPPROTO_IP one;
2. The inet_diag layer should request for AF_INET+protocol alias,
not just the protocol one.

Thus fix this.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dfd56b8b38fff3586f36232db58e1e9f7885a605 10-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com> net: use IS_ENABLED(CONFIG_IPV6)

Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1942c518ca017f376b267a7c5e78c15d37202442 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Generalize inet_diag dump and get_exact calls

Introduce two callbacks in inet_diag_handler -- one for dumping all
sockets (with filters) and the other one for dumping a single sk.

Replace direct calls to icsk handlers with indirect calls to callbacks
provided by handlers.

Make existing TCP and DCCP handlers use provided helpers for icsk-s.

The UDP diag module will provide its own.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3c4d05c8056724aff3abc20650807dd828fded54 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Introduce the inet socket dumping routine

The existing inet_csk_diag_fill dumps the inet connection sock info
into the netlink inet_diag_message. Prepare this routine to be able
to dump only the inet_sock part of a socket if the icsk part is missing.

This will be used by UDP diag module when dumping UDP sockets.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8d07d1518a074a08b90be02eee5ee15e60ac9779 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Introduce the byte-code run on an inet socket

The upcoming UDP module will require exactly this ability, so just
move the existing code to provide one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
efb3cb428dcde2356802b3f93e152efa7abc5a62 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Split inet_diag_get_exact into parts

Similar to previous patch: the 1st part locks the inet handler
and will get generalized and the 2nd one dumps icsk-s and will
be used by TCP and DCCP handlers.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
476f7dbff30a7c122899d753c9119d9233928380 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Split inet_diag_get_exact into parts

The 1st part locks the inet handler and the 2nd one dump the
inet connection sock.

In the next patches the 1st part will be generalized to call
the socket dumping routine indirectly (i.e. TCP/UDP/DCCP) and
the 2nd part will be used by TCP and DCCP handlers.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b005ab4ef8805dc4604848c9d2ccca9d71f8fc46 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Export inet diag cookie checking routine

The netlink diag susbsys stores sk address bits in the nl message
as a "cookie" and uses one when dumps details about particular
socket.

The same will be required for udp diag module, so introduce a heler
in inet_diag module

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
87c22ea52e1bb467f58b3e9a71509fccb70c7bd3 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Reduce the number of args for bytecode run routine

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7b35eadd7eee2e0b42421ce3efbc30f1c3c745e5 09-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Remove indirect sizeof from inet diag handlers

There's an info_size value stored on inet_diag_handler, but for existing
code this value is effectively constant, so just use sizeof(struct tcp_info)
where required.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8ef874bfc7296fa206eea2ad1e8a426f576bf6f6 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> sock_diag: Move the sock_ code to net/core/

This patch moves the sock_ code from inet_diag.c to generic sock_diag.c
file and provides necessary request_module-s calls and a pointer on
inet_diag_compat dumping routine.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a029fe26b67f815eddd432d9e702b9f2edbc2535 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Cleanup type2proto last user

Now all the code works with sock_diag_req-compatible structs, so it's
possible to stop using the inet_diag_type2proto in inet_csk_diag_fill.
Pass the inet_diag_req into it and use the sdiag_protocol field. At the
same time remove the explicit ext argument, since it's also on the req.

However, this conversion is still required in _compat code, so just move
this routine, not remove.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d23deaa07b9b788e781a2253672cdc8b65b85e53 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Introduce socket family checks

The new API will specify family to work with. Teach the existing
socket walking code to bypass not interesting ones.

To preserve compatibility with existing behavior the _compat code
sets interesting family to AF_UNSPEC to dump them all.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
25c4cd2b6dfd8e3d8efd8e85f167b66c032b80d9 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Switch the _dump to work with new header

Make inet_diag_dumo work with given header instead of calculating
one from the nl message.

The SOCK_DIAG_BY_FAMILY just passes skb's one through, the compat code
converts the old header to new one.

Also fix the bytecode calculation to find one at proper offset.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fe50ce284616c3131e353ff7158002aa47a41a81 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Switch the _get_exact to work with new header

Make inet_diag_get_exact work with given header instead of calculating
one from the nl message.

The SOCK_DIAG_BY_FAMILY just passes skb's one through, the compat code
converts the old header to new one.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
126fdc3249c9ced2a0d20f916858fec26a445f61 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Introduce new inet_diag_req header

This one coinsides with the sock_diag_req in the beginning and
contains only used fields from its previous analogue.

The existing code is patched to use the _compat version of it
for now.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d366477a52f1df29fa066ffb18e4e6101ee2ad04 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> sock_diag: Initial skeleton

When receiving the SOCK_DIAG_BY_FAMILY message we have to find the
handler for provided family and pass the nl message to it.

This patch describes an infrastructure to work with such nandlers
and implements stubs for AF_INET(6) ones.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f13c95f0e255e6d21762259875295cc212e6bc32 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Switch from _GETSOCK to IPPROTO_ numbers

Sorry, but the vger didn't let this message go to the list. Re-sending it with
less spam-filter-prone subject.

When dumping the AF_INET/AF_INET6 sockets user will also specify the protocol,
so prepare the protocol diag handlers to work with IPPROTO_ constants.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
37f352b5e3e89337f7a9a3a90250b5dde3c5f40d 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Move byte-code finding up the call-stack

Current code calculates it at fixed offset. This offset will change, so
move the BC calculation upper to make the further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8d34172dfdb762a306cdf58b547aa10d798622ec 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> sock_diag: Introduce new message type

This type will run the family+protocol based socket dumping.
Also prepare the stub function for it.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7f1fb60c4fc9fb29fbb406ac8c4cfb4e59e168d6 06-Dec-2011 Pavel Emelyanov <xemul@parallels.com> inet_diag: Partly rename inet_ to sock_

The ultimate goal is to get the sock_diag module, that works in
family+protocol terms. Currently this is suitable to do on the
inet_diag basis, so rename parts of the code. It will be moved
to sock_diag.c later.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4e3fd7a06dc20b2d8ec6892233ad2012968fe7b6 21-Nov-2011 Alexey Dobriyan <adobriyan@gmail.com> net: remove ipv6_addr_copy()

C assignment can handle struct in6_addr copying.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
717b6d83664646963c71d014c71babaa802333b9 22-Nov-2011 Maciej Żenczykowski <maze@google.com> net-netlink: fix diag to export IPv4 tos for dual-stack IPv6 sockets

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
06236ac3726f15124839cf16a9e2730a852dad9b 07-Nov-2011 Maciej Żenczykowski <maze@google.com> net-netlink: Add a new attribute to expose TCLASS values via netlink

commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute.

Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket,
furthermore they can have different values. As such one cannot in a
sane way expose both through a single attribute.

Signed-off-by: Maciej Żenczyowski <maze@google.com>
CC: Murali Raja <muralira@google.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ceca749668a52bd795585e0f71c6f0b04814f7b 12-Oct-2011 Murali Raja <muralira@google.com> net-netlink: Add a new attribute to expose TOS values via netlink

This patch exposes the tos value for the TCP sockets when the TOS flag
is requested in the ext_flags for the inet_diag request. This would mainly be
used to expose TOS values for both for TCP and UDP sockets. Currently it is
supported for TCP. When netlink support for UDP would be added the support
to expose the TOS values would alse be done. For IPV4 tos value is exposed
and for IPV6 tclass value is exposed.

Signed-off-by: Murali Raja <muralira@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eeb1497277d6b1a0a34ed36b97e18f2bd7d6de0d 17-Jun-2011 Eric Dumazet <eric.dumazet@gmail.com> inet_diag: fix inet_diag_bc_audit()

A malicious user or buggy application can inject code and trigger an
infinite loop in inet_diag_bc_audit()

Also make sure each instruction is aligned on 4 bytes boundary, to avoid
unaligned accesses.

Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7ac8679bec9397afe8918f788cbcef88c38da54 10-Jun-2011 Greg Rose <gregory.v.rose@intel.com> rtnetlink: Compute and store minimum ifinfo dump size

The message size allocated for rtnl ifinfo dumps was limited to
a single page. This is not enough for additional interface info
available with devices that support SR-IOV and caused a bug in
which VF info would not be displayed if more than approximately
40 VFs were created per interface.

Implement a new function pointer for the rtnl_register service that will
calculate the amount of data required for the ifinfo dump and allocate
enough data to satisfy the request.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
b71d1d426d263b0b6cb5760322efebbfc89d4463 22-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b8f3ab4290f1e720166e888ea2a1d1d44c4d15dd 18-Jan-2011 David S. Miller <davem@davemloft.net> Revert "netlink: test for all flags of the NLM_F_DUMP composite"

This reverts commit 0ab03c2b1478f2438d2c80204f7fef65b1bca9cf.

It breaks several things including the avahi daemon.

Signed-off-by: David S. Miller <davem@davemloft.net>
0ab03c2b1478f2438d2c80204f7fef65b1bca9cf 07-Jan-2011 Jan Engelhardt <jengelh@medozas.de> netlink: test for all flags of the NLM_F_DUMP composite

Due to NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
when doing "if (x & NLM_F_DUMP)", it tests for _either_ of the bits
being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL,
non-dump requests with NLM_F_EXCL set are mistaken as dump requests.

Substitute the condition to test for _all_ bits being set.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
22e76c849d505d87c5ecf3d3e6742a65f0ff4860 03-Nov-2010 Nelson Elhage <nelhage@ksplice.com> inet_diag: Make sure we actually run the same bytecode we audited.

We were using nlmsg_find_attr() to look up the bytecode by attribute when
auditing, but then just using the first attribute when actually running
bytecode. So, if we received a message with two attribute elements, where only
the second had type INET_DIAG_REQ_BYTECODE, we would validate and run different
bytecode strings.

Fix this by consistently using nlmsg_find_attr everywhere.

Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Thomas Graf <tgraf@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a02cec2155fbea457eca8881870fd2de1a4c4c76 22-Sep-2010 Eric Dumazet <eric.dumazet@gmail.com> net: return operator cleanup

Change "return (EXPR);" to "return EXPR;"

return is not a function, parentheses are not required.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
b4ced2b768ab6c580148d1163c82a655fe147edc 19-Jan-2010 Roel Kluin <roel.kluin@gmail.com> netlink: With opcode INET_DIAG_BC_S_LE dport was compared in inet_diag_bc_run()

The s-port should be compared.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c720c7e8383aff1cb219bddf474ed89d850336e3 15-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> inet: rename some inet_sock fields

In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.

Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)

This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f373b53b5fe67aa4a6f28f921a529cc90f88e79b 09-Oct-2009 Eric Dumazet <eric.dumazet@gmail.com> tcp: replace ehash_size by ehash_mask

Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
31e6d363abcd0d05766c82f1a9c905a4c974a199 18-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: correct off-by-one write allocations reports

commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
(net: No more expensive sock_hold()/sock_put() on each tx)
changed initial sk_wmem_alloc value.

We need to take into account this offset when reporting
sk_wmem_alloc to user, in PROC_FS files or various
ioctls (SIOCOUTQ/TIOCOUTQ)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac5978e7f83a6cbd9a70282d6e39bd936a31b2bc 28-Apr-2009 Arnaldo Carvalho de Melo <acme@redhat.com> inet_diag: Remove dup assignments

These are later assigned to other values without being used meanwhile.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c25eb3bfb97294d0543a81230fbc237046b4b84c 24-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: Convert TCP/DCCP listening hash tables to use RCU

This is the last step to be able to perform full RCU lookups
in __inet_lookup() : After established/timewait tables, we
add RCU lookups to listening hash table.

The only trick here is that a socket of a given type (TCP ipv4,
TCP ipv6, ...) can now flight between two different tables
(established and listening) during a RCU grace period, so we
must use different 'nulls' end-of-chain values for two tables.

We define a large value :

#define LISTENING_NULLS_BASE (1U << 29)

So that slots in listening table are guaranteed to have different
end-of-chain values than slots in established table. A reader can
still detect it finished its lookup in the right chain.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7e3aab4a9cd7d37f80eee75bebb6a71347f82476 22-Nov-2008 David S. Miller <davem@davemloft.net> inet_diag: Missed conversion after changing inet ehash lockl to spinlocks.

They are no longer a rwlocks.

Signed-off-by: David S. Miller <davem@davemloft.net>
5caea4ea7088e80ac5410d04660346094608b909 20-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: listening_hash get a spinlock per bucket

This patch prepares RCU migration of listening_hash table for
TCP/DCCP protocols.

listening_hash table being small (32 slots per protocol), we add
a spinlock for each slot, instead of a single rwlock for whole table.

This should reduce hold time of readers, and writers concurrency.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3ab5aee7fe840b5b1b35a8d1ac11c3de5281e611 17-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls

RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
95a5afca4a8d2e1cb77e1d4bc6ff9f718dc32f7a 17-Oct-2008 Johannes Berg <johannes@sipsolutions.net> net: Remove CONFIG_KMOD from net/ (towards removing CONFIG_KMOD entirely)

Some code here depends on CONFIG_KMOD to not try to load
protocol modules or similar, replace by CONFIG_MODULES
where more than just request_module depends on CONFIG_KMOD
and and also use try_then_request_module in ebtables.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
6be547a61d6220199826070cda792297c3d15994 28-Aug-2008 Andi Kleen <ak@linux.intel.com> inet_diag: Add empty bucket optimization to inet_diag too

Skip quickly over empty buckets in inet_diag.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0b040829952d84bf2a62526f0e24b624e0699447 11-Jun-2008 Adrian Bunk <bunk@kernel.org> net: remove CVS keywords

This patch removes CVS keywords that weren't updated for a long time
from comments.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
d86e0dac2ce412715181f792aa0749fe3effff11 31-Jan-2008 Pavel Emelyanov <xemul@openvz.org> [NETNS]: Tcp-v6 sockets per-net lookup.

Add a net argument to inet6_lookup and propagate it further.
Actually, this is tcp-v6 implementation of what was done for
tcp-v4 sockets in a previous patch.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
c67499c0e772064b37ad75eb69b28fc218752636 31-Jan-2008 Pavel Emelyanov <xemul@openvz.org> [NETNS]: Tcp-v4 sockets per-net lookup.

Add a net argument to inet_lookup and propagate it further
into lookup calls. Plus tune the __inet_check_established.

The dccp and inet_diag, which use that lookup functions
pass the init_net into them.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8cf8e5a67fb07f583aac94482ba51a7930dab493 29-Jan-2008 Arnaldo Carvalho de Melo <acme@redhat.com> [INET_DIAG]: Fix inet_diag_lock_handler error path.

Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=9825

The inet_diag_lock_handler function uses ERR_PTR to encode errors but
its callers were testing against NULL.

This only happens when the only inet_diag modular user, DCCP, is not
built into the kernel or available as a module.

Also there was a problem with not dropping the mutex lock when a handler
was not found, also fixed in this patch.

This caused an OOPS and ss would then hang on subsequent calls, as
&inet_diag_table_mutex was being left locked.

Thanks to spike at ml.yaroslavl.ru for report it after trying 'ss -d'
on a kernel that doesn't have DCCP available.

This bug was introduced in cset
d523a328fb0271e1a763e985a21f2488fd816e7e ("Fix inet_diag dead-lock
regression"), after 2.6.24-rc3, so just 2.6.24 seems to be affected.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
b7c6ba6eb1234e35a74fb8ba8123232a7b1ba9e4 28-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Consolidate kernel netlink socket destruction.

Create a specific helper for netlink kernel socket disposal. This just
let the code look better and provides a ground for proper disposal
inside a namespace.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Tested-by: Alexey Dobriyan <adobriyan@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
d523a328fb0271e1a763e985a21f2488fd816e7e 03-Dec-2007 Herbert Xu <herbert@gondor.apana.org.au> [INET]: Fix inet_diag dead-lock regression

The inet_diag register fix broke inet_diag module loading because the
loaded module had to take the same mutex that's already held by the
loader in order to register the new handler.

This patch fixes it by introducing a separate mutex to protect the
handling of handlers.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
076931989fe96823a577259cc6bc205d7ec31754 29-Nov-2007 Pavel Emelyanov <xemul@openvz.org> [INET]: Fix inet_diag register vs rcv race

The following race is possible when one cpu unregisters the handler
while other one is trying to receive a message and call this one:

CPU1: CPU2:
inet_diag_rcv() inet_diag_unregister()
mutex_lock(&inet_diag_mutex);
netlink_rcv_skb(skb, &inet_diag_rcv_msg);
if (inet_diag_table[nlh->nlmsg_type] ==
NULL) /* false handler is still registered */
...
netlink_dump_start(idiagnl, skb, nlh,
inet_diag_dump, NULL);
cb = kzalloc(sizeof(*cb), GFP_KERNEL);
/* sleep here freeing memory
* or preempt
* or sleep later on nlk->cb_mutex
*/
spin_lock(&inet_diag_register_lock);
inet_diag_table[type] = NULL;
... spin_unlock(&inet_diag_register_lock);
synchronize_rcu();
/* CPU1 is sleeping - RCU quiescent
* state is passed
*/
return;
/* inet_diag_dump is finally called: */
inet_diag_dump()
handler = inet_diag_table[cb->nlh->nlmsg_type];
BUG_ON(handler == NULL);
/* OOPS! While we slept the unregister has set
* handler to NULL :(
*/

Grep showed, that the register/unregister functions are called
from init/fini module callbacks for tcp_/dccp_diag, so it's OK
to use the inet_diag_mutex to synchronize manipulations with the
inet_diag_table and the access to it.

Besides, as Herbert pointed out, asynchronous dumps should hold
this mutex as well, and thus, we provide the mutex as cb_mutex one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
230140cffa7feae90ad50bf259db1fa07674f3a7 07-Nov-2007 Eric Dumazet <dada1@cosmosbay.com> [INET]: Remove per bucket rwlock in tcp/dccp ehash table.

As done two years ago on IP route cache table (commit
22c047ccbc68fa8f3fa57f0e8f906479a062c426) , we can avoid using one
lock per hash bucket for the huge TCP/DCCP hash tables.

On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
litle performance differences. (we hit a different cache line for the
rwlock, but then the bucket cache line have a better sharing factor
among cpus, since we dirty it less often). For netstat or ss commands
that want a full scan of hash table, we perform fewer memory accesses.

Using a 'small' table of hashed rwlocks should be more than enough to
provide correct SMP concurrency between different buckets, without
using too much memory. Sizing of this table depends on
num_possible_cpus() and various CONFIG settings.

This patch provides some locking abstraction that may ease a future
work using a different model for TCP/DCCP table.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
305e1e96911417d8cda2699f6a20a6f434616a8c 22-Oct-2007 Jean Delvare <jdelvare@suse.de> [INET]: Let inet_diag and friends autoload

By adding module aliases to inet_diag, tcp_diag and dccp_diag, we let
them load automatically as needed. This makes tools like "ss" run
faster.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
cd40b7d3983c708aabe3d3008ec64ffce56d33b0 11-Oct-2007 Denis V. Lunev <den@openvz.org> [NET]: make netlink user -> kernel interface synchronious

This patch make processing netlink user -> kernel messages synchronious.
This change was inspired by the talk with Alexey Kuznetsov about current
netlink messages processing. He says that he was badly wrong when introduced
asynchronious user -> kernel communication.

The call netlink_unicast is the only path to send message to the kernel
netlink socket. But, unfortunately, it is also used to send data to the
user.

Before this change the user message has been attached to the socket queue
and sk->sk_data_ready was called. The process has been blocked until all
pending messages were processed. The bad thing is that this processing
may occur in the arbitrary process context.

This patch changes nlk->data_ready callback to get 1 skb and force packet
processing right in the netlink_unicast.

Kernel -> user path in netlink_unicast remains untouched.

EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
drop, but the process remains in the cycle until the message will be fully
processed. So, there is no need to use this kludges now.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
0cfad07555312468296ea3bbbcdf99038f58678b 17-Sep-2007 Herbert Xu <herbert@gondor.apana.org.au> [NETLINK]: Avoid pointer in netlink_run_queue

I was looking at Patrick's fix to inet_diag and it occured
to me that we're using a pointer argument to return values
unnecessarily in netlink_run_queue. Changing it to return
the value will allow the compiler to generate better code
since the value won't have to be memory-backed.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
b4b510290b056b86611757ce1175a230f1080f53 12-Sep-2007 Eric W. Biederman <ebiederm@xmission.com> [NET]: Support multiple network namespaces with netlink

Each netlink socket will live in exactly one network namespace,
this includes the controlling kernel sockets.

This patch updates all of the existing netlink protocols
to only support the initial network namespace. Request
by clients in other namespaces will get -ECONREFUSED.
As they would if the kernel did not have the support for
that netlink protocol compiled in.

As each netlink protocol is updated to be multiple network
namespace safe it can register multiple kernel sockets
to acquire a presence in the rest of the network namespaces.

The implementation in af_netlink is a simple filter implementation
at hash table insertion and hash table look up time.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
172589ccdde41b59861c92c4a971b95514ef24e3 29-Aug-2007 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> [NET]: DIV_ROUND_UP cleanup (part two)

Hopefully captured all single statement cases under net/. I'm
not too sure if there is some policy about #includes that are
"guaranteed" (ie., in the current tree) to be available through
some other #included header, so I just added linux/kernel.h to
each changed file that didn't #include it previously.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
0a9c73014415d2a84dac346c1e12169142a6ad37 11-Sep-2007 Patrick McHardy <kaber@trash.net> [INET_DIAG]: Fix oops in netlink_rcv_skb

netlink_run_queue() doesn't handle multiple processes processing the
queue concurrently. Serialize queue processing in inet_diag to fix
a oops in netlink_rcv_skb caused by netlink_run_queue passing a
NULL for the skb.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000054
[349587.500454] printing eip:
[349587.500457] c03318ae
[349587.500459] *pde = 00000000
[349587.500464] Oops: 0000 [#1]
[349587.500466] PREEMPT SMP
[349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa
[349587.500483] CPU: 0
[349587.500485] EIP: 0060:[<c03318ae>] Not tainted VLI
[349587.500487] EFLAGS: 00010246 (2.6.22.3 #1)
[349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e
[349587.500506] eax: 00000000 ebx: 00000000 ecx: c148d2a0 edx: c0398819
[349587.500510] esi: 00000000 edi: c0398819 ebp: c7a21c8c esp: c7a21c80
[349587.500517] ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068
[349587.500521] Process oidentd (pid: 17943, ti=c7a20000 task=cee231c0 task.ti=c7a20000)
[349587.500527] Stack: 00000000 c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 f7c8ba00 0000004c
[349587.500542] f736f000 c7a21cb4 c03988e3 00000001 f7c8ba00 c7a21cc4 c03312a5 0000004c
[349587.500558] f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 c03307c6 7fffffff
[349587.500578] Call Trace:
[349587.500581] [<c010361a>] show_trace_log_lvl+0x1c/0x33
[349587.500591] [<c01036d4>] show_stack_log_lvl+0x8d/0xaa
[349587.500595] [<c010390e>] show_registers+0x1cb/0x321
[349587.500604] [<c0103bff>] die+0x112/0x1e1
[349587.500607] [<c01132d2>] do_page_fault+0x229/0x565
[349587.500618] [<c03c8d3a>] error_code+0x72/0x78
[349587.500625] [<c0331962>] netlink_run_queue+0x40/0x76
[349587.500632] [<c03988e3>] inet_diag_rcv+0x1f/0x2c
[349587.500639] [<c03312a5>] netlink_data_ready+0x57/0x59
[349587.500643] [<c0330681>] netlink_sendskb+0x24/0x45
[349587.500651] [<c03307c6>] netlink_unicast+0x100/0x116
[349587.500656] [<c0330f83>] netlink_sendmsg+0x1c2/0x280
[349587.500664] [<c02fcce9>] sock_sendmsg+0xba/0xd5
[349587.500671] [<c02fe4d1>] sys_sendmsg+0x17b/0x1e8
[349587.500676] [<c02fe92d>] sys_socketcall+0x230/0x24d
[349587.500684] [<c01028d2>] syscall_call+0x7/0xb
[349587.500691] =======================
[349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 <8b> 50 54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f

Reported by Athanasius <link@miggy.org>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
af65bdfce98d7965fbe93a48b8128444a2eea024 20-Apr-2007 Patrick McHardy <kaber@trash.net> [NETLINK]: Switch cb_lock spinlock to mutex and allow to override it

Switch cb_lock to mutex and allow netlink kernel users to override it
with a subsystem specific mutex for consistent locking in dump callbacks.
All netlink_dump_start users have been audited not to rely on any
side-effects of the previously used spinlock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
c702e8047fe74648f7852a9c1de781b0d5a98402 23-Mar-2007 Thomas Graf <tgraf@suug.ch> [NETLINK]: Directly return -EINTR from netlink_dump_start()

Now that all users of netlink_dump_start() use netlink_run_queue()
to process the receive queue, it is possible to return -EINTR from
netlink_dump_start() directly, therefore simplying the callers.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
ead592ba246dfcc643b3f0f0c8c03f7bc898a59f 23-Mar-2007 Thomas Graf <tgraf@suug.ch> [IPv4] diag: Use netlink_run_queue() to process the receive queue

Makes use of netlink_run_queue() to process the receive queue and
converts inet_diag_rcv_msg() to use the type safe netlink interface.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
dc5fc579b90ed0a9a4e55b0218cdbaf0a8cf2e67 26-Mar-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [NETLINK]: Use nlmsg_trim() where appropriate

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b529ccf2799c14346d1518e9bdf1f88f03643e99 26-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [NETLINK]: Introduce nlmsg_hdr() helper

For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the
number of direct accesses to skb->data and for consistency with all the other
cast skb member helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
27a884dc3cb63b93c2b3b643f5b31eed5f8a4d26 20-Apr-2007 Arnaldo Carvalho de Melo <acme@redhat.com> [SK_BUFF]: Convert skb->tail to sk_buff_data_t

So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e905a9edab7f4f14f9213b52234e4a346c690911 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] IPV4: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
dbca9b2750e3b1ee6f56a616160ccfc12e8b161f 08-Feb-2007 Eric Dumazet <dada1@cosmosbay.com> [NET]: change layout of ehash table

ehash table layout is currently this one :

First half of this table is used by sockets not in TIME_WAIT state
Second half of it is used by sockets in TIME_WAIT state.

This is non optimal because of for a given hash or socket, the two chain heads
are located in separate cache lines.
Moreover the locks of the second half are never used.

If instead of this halving, we use two list heads in inet_ehash_bucket instead
of only one, we probably can avoid one cache miss, and reduce ram usage,
particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK,
CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep
together related chains to speedup lookups and socket state change.

In this patch I did not try to align struct inet_ehash_bucket, but a future
patch could try to make this structure have a convenient size (a power of two
or a multiple of L1_CACHE_SIZE).
I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so
maybe we dont need to scratch our heads to align the bucket...

Note : In case struct inet_ehash_bucket is not a power of two, we could
probably change alloc_large_system_hash() (in case it use __get_free_pages())
to free the unused space. It currently allocates a big zone, but the last
quarter of it could be freed. Again, this should be a temporary 'problem'.

Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
26932566a42d46aee7e5d526cb34fba9380cad10 01-Feb-2007 Patrick McHardy <kaber@trash.net> [NETLINK]: Don't BUG on undersized allocations

Currently netlink users BUG when the allocated skb for an event
notification is undersized. While this is certainly a kernel bug,
its not critical and crashing the kernel is too drastic, especially
when considering that these errors have appeared multiple times in
the past and it BUGs even if no listeners are present.

This patch replaces BUG by WARN_ON and changes the notification
functions to inform potential listeners of undersized allocations
using a unique error code (EMSGSIZE).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9f8552996d969f56039ec88128cf5ad35b12f141 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: inet_diag annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
0da974f4f303a6842516b764507e3c0a03f41e5a 21-Jul-2006 Panagiotis Issaris <takis@issaris.org> [NET]: Conversions from kmalloc+memset to k(z|c)alloc.

Signed-off-by: Panagiotis Issaris <takis@issaris.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
dff2c03534f525813342ab8dec90c5bb1ee07471 09-Jan-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_DIAG]: Introduce sk_diag_fill

To be called from inet_diag_get_exact, also rename inet_diag_fill to
inet_csk_diag_fill, for consistency with inet_twsk_diag_fill.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c7d58aabdc08aea92d6c5e01dcd7473a74b3642c 09-Jan-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_DIAG]: Introduce inet_twsk_diag_dump & inet_twsk_diag_fill

To properly dump TIME_WAIT sockets and to reduce complexity a bit by
having per socket class accessor routines.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4e852c027989c9fd5c51fb0080e28590c35c75ba 09-Jan-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_DIAG]: whitespace/simple cleanups

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7dbf0755249336f44f57368bdbf6f84103b3ba75 09-Jan-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_DIAG]: Use inet_twsk() with TIME_WAIT sockets

The fields being accessed in inet_diag_dump are outside sock_common, the
common part of struct sock and struct inet_timewait_sock.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0fa1a53e1f055a6c790f40e7728f42a825b29248 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [IPV6]: Introduce inet6_timewait_sock

Out of tcp6_timewait_sock, that now is just an aggregation of
inet_timewait_sock and inet6_timewait_sock, using tw_ipv6_offset in struct
inet_timewait_sock, that is common to the IPv6 transport protocols that use
timewait sockets, like DCCP and TCP.

tw_ipv6_offset plays the struct inet_sock pinfo6 role, i.e. for the generic
code to find the IPv6 area in a timewait sock.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ca304b6104ffdd120bb6687a88a0625e58bc71cd 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [IPV6]: Introduce inet6_rsk()

And inet6_rsk_offset in inet_request_sock, for the same reasons as
inet_sock's pinfo6 member.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a8f74b228826eef1cbe04a05647d61e896f5fd63 10-Nov-2005 Thomas Graf <tgraf@suug.ch> [NETLINK]: Make netlink_callback->done() optional

Most netlink families make no use of the done() callback, making
it optional gets rid of all unnecessary dummy implementations.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
066286071d3542243baa68166acb779187c848b3 15-Aug-2005 Patrick McHardy <kaber@trash.net> [NETLINK]: Add "groups" argument to netlink_kernel_create

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
17b085eacef81a6286bd478f2ec75e04abb091cb 12-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_DIAG]: Move the tcp_diag interface to the proper place

With this the previous setup is back, i.e. tcp_diag can be built as a module,
as dccp_diag and both share the infrastructure available in inet_diag.

If one selects CONFIG_INET_DIAG as module CONFIG_INET_TCP_DIAG will also be
built as a module, as will CONFIG_INET_DCCP_DIAG, if CONFIG_IP_DCCP was
selected static or as a module, if CONFIG_INET_DIAG is y, being statically
linked CONFIG_INET_TCP_DIAG will follow suit and CONFIG_INET_DCCP_DIAG will be
built in the same manner as CONFIG_IP_DCCP.

Now to aim at UDP, converting it to use inet_hashinfo, so that we can use
iproute2 for UDP sockets as well.

Ah, just to show an example of this new infrastructure working for DCCP :-)

[root@qemu ~]# ./ss -dane
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 0 *:5001 *:* ino:942 sk:cfd503a0
ESTAB 0 0 127.0.0.1:5001 127.0.0.1:32770 ino:943 sk:cfd50a60
ESTAB 0 0 127.0.0.1:32770 127.0.0.1:5001 ino:947 sk:cfd50700
TIME-WAIT 0 0 127.0.0.1:32769 127.0.0.1:5001 timer:(timewait,3.430ms,0) ino:0 sk:cf209620

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a8c2190ee7da1a1dc68ff1a6b5f03feb61e523a5 12-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [INET_DIAG]: Rename tcp_diag.[ch] to inet_diag.[ch]

Next changeset will introduce net/ipv4/tcp_diag.c, moving the code that was put
transitioanlly in inet_diag.c.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>