History log of /include/net/route.h
Revision Date Author Comments
83511cc43b561d957b0295c53063127bc5b185da 29-Sep-2015 Amit Pundir <amit.pundir@linaro.org> net: core: fix UID-based routing build

Use kuid_t GLOBAL_ROOT_UID instead of "0"
otherwise we run into following build failure:

include/net/route.h: In function 'ip_route_output_ports':
include/net/route.h:143:55: error: type mismatch in conditional expression
daddr, saddr, dport, sport, sk ? sock_i_uid(sk) : 0);
^

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2de7009bb038330dd9e9ea4567bbe19b14e14edf 08-Jul-2014 Sreeram Ramachandran <sreeram@google.com> Handle 'sk' being NULL in UID-based routing.

Bug: 15413527
Change-Id: Iab1fae9da6053b284591628ef1de878761b137b1
Signed-off-by: Sreeram Ramachandran <sreeram@google.com>
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
ba3d8d3f9f65807763b2e0e1ea7645d74a962248 31-Mar-2014 Lorenzo Colitti <lorenzo@google.com> net: core: Support UID-based routing.

This contains the following commits:

1. cc2f522 net: core: Add a UID range to fib rules.
2. d7ed2bd net: core: Use the socket UID in routing lookups.
3. 2f9306a net: core: Add a RTA_UID attribute to routes.
This is so that userspace can do per-UID route lookups.
4. 8e46efb net: ipv6: Use the UID in IPv6 PMTUD
IPv4 PMTUD already does this because ipv4_sk_update_pmtu
uses __build_flow_key, which includes the UID.

Bug: 15413527
Change-Id: I81bd31dae655de9cce7d7a1f9a905dc1c2feba7c
Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
0b8c7f6f2a26ed2dee24881299fc69f554596dbb 21-Mar-2014 Li RongQing <roy.qing.li@gmail.com> ipv4: remove ip_rt_dump from route.c

ip_rt_dump do nothing after IPv4 route caches removal, so we can remove it.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
f87c10a8aa1e82498c42d0335524d6ae7cf5a52b 09-Jan-2014 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing

While forwarding we should not use the protocol path mtu to calculate
the mtu for a forwarded packet but instead use the interface mtu.

We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
introduced for multicast forwarding. But as it does not conflict with
our usage in unicast code path it is perfect for reuse.

I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
dependencies because of IPSKB_FORWARDED.

Because someone might have written a software which does probe
destinations manually and expects the kernel to honour those path mtus
I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
can disable this new behaviour. We also still use mtus which are locked on a
route for forwarding.

The reason for this change is, that path mtus information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv4 forwarding path to
wrongfully emit fragmentation needed notifications or start to fragment
packets along a path.

Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
won't be set and further fragmentation logic will use the path mtu to
determine the fragmentation size. They also recheck packet size with
help of path mtu discovery and report appropriate errors.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0e0d44ab4275549998567cd4700b43f7496eb62b 28-Aug-2013 Steffen Klassert <steffen.klassert@secunet.com> net: Remove FLOWI_FLAG_CAN_SLEEP

FLOWI_FLAG_CAN_SLEEP was used to notify xfrm about the posibility
to sleep until the needed states are resolved. This code is gone,
so FLOWI_FLAG_CAN_SLEEP is not needed anymore.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
482fc6094afad572a4ea1fd722e7b11ca72022a0 05-Nov-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE

Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
their sockets won't accept and install new path mtu information and they
will always use the interface mtu for outgoing packets. It is guaranteed
that the packet is not fragmented locally. But we won't set the DF-Flag
on the outgoing frames.

Florian Weimer had the idea to use this flag to ensure DNS servers are
never generating outgoing fragments. They may well be fragmented on the
path, but the server never stores or usees path mtu values, which could
well be forged in an attack.

(The root of the problem with path MTU discovery is that there is
no reliable way to authenticate ICMP Fragmentation Needed But DF Set
messages because they are sent from intermediate routers with their
source addresses, and the IMCP payload will not always contain sufficient
information to identify a flow.)

Recent research in the DNS community showed that it is possible to
implement an attack where DNS cache poisoning is feasible by spoofing
fragments. This work was done by Amir Herzberg and Haya Shulman:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>

This issue was previously discussed among the DNS community, e.g.
<http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
without leading to fixes.

This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
regarding local fragmentation with UFO/CORK" for the enforcement of the
non-fragmentable checks. If other users than ip_append_page/data should
use this semantic too, we have to add a new flag to IPCB(skb)->flags to
suppress local fragmentation and check for this in ip_finish_output.

Many thanks to Florian Weimer for the idea and feedback while implementing
this patch.

Cc: David S. Miller <davem@davemloft.net>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0baf2b35fc70ab16c385963d2502da26a55d2cb7 16-Oct-2013 Eric Dumazet <edumazet@google.com> ipv4: shrink rt_cache_stat

Half of the rt_cache_stat fields are no longer used after IP
route cache removal, lets shrink this per cpu area.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
aa6615814533c634190019ee3a5b10490026d545 24-Sep-2013 Francesco Fusco <ffusco@redhat.com> ipv4: processing ancillary IP_TOS or IP_TTL

If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
packets with the specified TTL or TOS overriding the socket values specified
with the traditional setsockopt().

The struct inet_cork stores the values of TOS, TTL and priority that are
passed through the struct ipcm_cookie. If there are user-specified TOS
(tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
used to override the per-socket values. In case of TOS also the priority
is changed accordingly.

Two helper functions get_rttos and get_rtconn_flags are defined to take
into account the presence of a user specified TOS value when computing
RT_TOS and RT_CONN_FLAGS.

Signed-off-by: Francesco Fusco <ffusco@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2702c4bb89b2a9bfe49f3db77a1bac88d5c6404c 22-Sep-2013 Joe Perches <joe@perches.com> route.h: Remove extern from function prototypes

There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0ea9d5e3e0e03a63b11392f5613378977dae7eca 13-Aug-2013 Hannes Frederic Sowa <hannes@stressinduktion.org> xfrm: introduce helper for safe determination of mtu

skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we
always have to make sure we a referring to the correct interpretation
of skb->sk.

We only depend on header defines to query the mtu, so we don't introduce
a new dependency to ipv6 by this change.

Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
6da025fa23bb10c82f80de319c837ed2b02306e4 28-Oct-2012 Eric Dumazet <edumazet@google.com> ipv4: avoid a test in ip_rt_put()

We can save a test in ip_rt_put(), considering dst_release() accepts
a NULL parameter, and dst is first element in rtable.

Add a BUILD_BUG_ON() to catch any change that could break this
assertion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <amwang@redhat.com>
Acked-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
155e8336c373d14d87a7f91e356d85ef4b93b8f9 08-Oct-2012 Julian Anastasov <ja@ssi.bg> ipv4: introduce rt_uses_gateway

Add new flag to remember when route is via gateway.
We will use it to allow rt_gateway to contain address of
directly connected host for the cases when DST_NOCACHE is
used or when the NH exception caches per-destination route
without DST_NOCACHE flag, i.e. when routes are not used for
other destinations. By this way we force the neighbour
resolving to work with the routed destination but we
can use different address in the packet, feature needed
for IPVS-DR where original packet for virtual IP is routed
via route to real IP.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
bafa6d9d89072c1a18853afe9ee5de05c491c13a 07-Sep-2012 Nicolas Dichtel <nicolas.dichtel@6wind.com> ipv4/route: arg delay is useless in rt_cache_flush()

Since route cache deletion (89aef8921bfbac22f), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4ccfe6d4109252dfadcd6885f33ed600ee03dbf8 07-Sep-2012 Nicolas Dichtel <nicolas.dichtel@6wind.com> ipv4/route: arg delay is useless in rt_cache_flush()

Since route cache deletion (89aef8921bfbac22f), delay is no
more used. Remove it.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
caacf05e5ad1abf0a2864863da4e33024bc68ec6 01-Aug-2012 David S. Miller <davem@davemloft.net> ipv4: Properly purge netdev references on uncached routes.

When a device is unregistered, we have to purge all of the
references to it that may exist in the entire system.

If a route is uncached, we currently have no way of accomplishing
this.

So create a global list that is scanned when a network device goes
down. This mirrors the logic in net/core/dst.c's dst_ifdown().

Signed-off-by: David S. Miller <davem@davemloft.net>
c6cffba4ffa26a8ffacd0bb9f3144e34f20da7de 26-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Fix input route performance regression.

With the routing cache removal we lost the "noref" code paths on
input, and this can kill some routing workloads.

Reinstate the noref path when we hit a cached route in the FIB
nexthops.

With help from Eric Dumazet.

Reported-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
13378cad02afc2adc6c0e07fca03903c7ada0b37 23-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Change rt->rt_iif encoding.

On input packet processing, rt->rt_iif will be zero if we should
use skb->dev->ifindex.

Since we access rt->rt_iif consistently via inet_iif(), that is
the only spot whose interpretation have to adjust.

Signed-off-by: David S. Miller <davem@davemloft.net>
2860583fe840d972573363dfa190b2149a604534 17-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Kill rt->fi

It's not really needed.

We only grabbed a reference to the fib_info for the sake of fib_info
local metrics.

However, fib_info objects are freed using RCU, as are therefore their
private metrics (if any).

We would have triggered a route cache flush if we eliminated a
reference to a fib_info object in the routing tables.

Therefore, any existing cached routes will first check and see that
they have been invalidated before an errant reference to these
metric values would occur.

Signed-off-by: David S. Miller <davem@davemloft.net>
9917e1e8762745191eba5a3bf2040278cbddbee1 17-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Turn rt->rt_route_iif into rt->rt_is_input.

That is this value's only use, as a boolean to indicate whether
a route is an input route or not.

So implement it that way, using a u16 gap present in the struct
already.

Signed-off-by: David S. Miller <davem@davemloft.net>
4fd551d7bed93af60af61c5a324b8f5dff37953a 17-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Kill rt->rt_oif

Never actually used.

It was being set on output routes to the original OIF specified in the
flow key used for the lookup.

Adjust the only user, ipmr_rt_fib_lookup(), for greater correctness of
the flowi4_oif and flowi4_iif values, thanks to feedback from Julian
Anastasov.

Signed-off-by: David S. Miller <davem@davemloft.net>
f8126f1d5136be1ca1a3536d43ad7a710b5620f8 13-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Adjust semantics of rt->rt_gateway.

In order to allow prefixed routes, we have to adjust how rt_gateway
is set and interpreted.

The new interpretation is:

1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr

2) rt_gateway != 0, destination requires a nexthop gateway

Abstract the fetching of the proper nexthop value using a new
inline helper, rt_nexthop(), as suggested by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
f1ce3062c53809d862d8a04e7a0566c3cc4e0bda 12-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Remove 'rt_dst' from 'struct rtable'

Signed-off-by: David S. Miller <davem@davemloft.net>
b48698895de86e07b685f8e4b8db0f1cd5a97e9a 01-Jul-2012 David Miller <davem@davemloft.net> ipv4: Remove 'rt_mark' from 'struct rtable'

Signed-off-by: David S. Miller <davem@davemloft.net>
d6c0a4f609847d6e65658913f9ccbcb1c137cff3 01-Jul-2012 David Miller <davem@davemloft.net> ipv4: Kill 'rt_src' from 'struct rtable'

Signed-off-by: David S. Miller <davem@davemloft.net>
1a00fee4ffb22312a0ac40045ecd6f222b55eb3d 01-Jul-2012 David Miller <davem@davemloft.net> ipv4: Remove rt_key_{src,dst,tos} from struct rtable.

They are always used in contexts where they can be reconstituted,
or where the finally resolved rt->rt_{src,dst} is semantically
equivalent.

Signed-off-by: David S. Miller <davem@davemloft.net>
38a424e4657462fe9f8b76f01a0e879abde99ab4 01-Jul-2012 David Miller <davem@davemloft.net> ipv4: Kill ip_route_input_noref().

The "noref" argument to ip_route_input_common() is now always ignored
because we do not cache routes, and in that case we must always grab
a reference to the resulting 'dst'.

Signed-off-by: David S. Miller <davem@davemloft.net>
89aef8921bfbac22f00e04f8450f6e447db13e42 17-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Delete routing cache.

The ipv4 routing cache is non-deterministic, performance wise, and is
subject to reasonably easy to launch denial of service attacks.

The routing cache works great for well behaved traffic, and the world
was a much friendlier place when the tradeoffs that led to the routing
cache's design were considered.

What it boils down to is that the performance of the routing cache is
a product of the traffic patterns seen by a system rather than being a
product of the contents of the routing tables. The former of which is
controllable by external entitites.

Even for "well behaved" legitimate traffic, high volume sites can see
hit rates in the routing cache of only ~%10.

Signed-off-by: David S. Miller <davem@davemloft.net>
1f42539d257af671d56d4bdbcf13aef31abff6ef 12-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Kill ip_rt_redirect().

No longer needed, as the protocol handlers now all properly
propagate the redirect back into the routing code.

Signed-off-by: David S. Miller <davem@davemloft.net>
b42597e2f36e2043756aa7462faddf630962f061 12-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions.

Signed-off-by: David S. Miller <davem@davemloft.net>
94206125c4aac32e43c25bfe1b827e7ab993b7dc 12-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Rearrange arguments to ip_rt_redirect()

Pass in the SKB rather than just the IP addresses, so that policy
and other aspects can reside in ip_rt_redirect() rather then
icmp_redirect().

Signed-off-by: David S. Miller <davem@davemloft.net>
f185071ddf799e194ba015d040d3d49cdbfa7e48 10-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Remove inetpeer from routes.

No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
5943634fc5592037db0693b261f7f4bea6bb9457 10-Jul-2012 David S. Miller <davem@davemloft.net> ipv4: Maintain redirect and PMTU info in struct rtable again.

Maintaining this in the inetpeer entries was not the right way to do
this at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
3e12939a2a67fbb4cbd962c3b9bc398c73319766 10-Jul-2012 David S. Miller <davem@davemloft.net> inet: Kill FLOWI_FLAG_PRECOW_METRICS.

No longer needed. TCP writes metrics, but now in it's own special
cache that does not dirty the route metrics. Therefore there is no
longer any reason to pre-cow metrics in this way.

Signed-off-by: David S. Miller <davem@davemloft.net>
41347dcdd81988b8e60853257b2875285cc17a4e 28-Jun-2012 David S. Miller <davem@davemloft.net> ipv4: Kill rt->rt_spec_dst, no longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
c10237e077cef50e925f052e49f3b4fead9d71f9 28-Jun-2012 David S. Miller <davem@davemloft.net> Revert "ipv4: tcp: dont cache unconfirmed intput dst"

This reverts commit c074da2810c118b3812f32d6754bd9ead2f169e7.

This change has several unwanted side effects:

1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
cache at all.

Signed-off-by: David S. Miller <davem@davemloft.net>
c074da2810c118b3812f32d6754bd9ead2f169e7 27-Jun-2012 Eric Dumazet <edumazet@google.com> ipv4: tcp: dont cache unconfirmed intput dst

DDOS synflood attacks hit badly IP route cache.

On typical machines, this cache is allowed to hold up to 8 Millions dst
entries, 256 bytes for each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to cleanup things.

Eventually route cache is disabled but machine is under fire and might
OOM and crash.

This patch exploits the new TCP early demux, to set a nocache
boolean in case incoming TCP frame is for a not yet ESTABLISHED or
TIMEWAIT socket.

This 'nocache' boolean is then used in case dst entry is not found in
route cache, to create an unhashed dst entry (DST_NOCACHE)

SYN-cookie-ACK sent use a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDOS synflood attack without polluting its IP route cache.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
36393395536064e483b73d173f6afc103eadfbc4 15-Jun-2012 David S. Miller <davem@davemloft.net> ipv4: Handle PMTU in all ICMP error handlers.

With ip_rt_frag_needed() removed, we have to explicitly update PMTU
information in every ICMP error handler.

Create two helper functions to facilitate this.

1) ipv4_sk_update_pmtu()

This updates the PMTU when we have a socket context to
work with.

2) ipv4_update_pmtu()

Raw version, used when no socket context is available. For this
interface, we essentially just pass in explicit arguments for
the flow identity information we would have extracted from the
socket.

And you'll notice that ipv4_sk_update_pmtu() is simply implemented
in terms of ipv4_update_pmtu()

Note that __ip_route_output_key() is used, rather than something like
ip_route_output_flow() or ip_route_output_key(). This is because we
absolutely do not want to end up with a route that does IPSEC
encapsulation and the like. Instead, we only want the route that
would get us to the node described by the outermost IP header.

Reported-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
55afabaa0df0dd139c8796a71beb43d1216fbe43 12-Jun-2012 David S. Miller <davem@davemloft.net> inet: Fix BUG triggered by __rt{,6}_get_peer().

If no peer actually gets attached (either because create is zero or
the peer allocation fails) we'll trigger a BUG because we
unconditionally do an rt{,6}_peer_ptr() afterwards.

Fix this by guarding it with the proper check.

Signed-off-by: David S. Miller <davem@davemloft.net>
46517008e1168dc926cf2c47d529efc07eca85c0 10-Jun-2012 David S. Miller <davem@davemloft.net> ipv4: Kill ip_rt_frag_needed().

There is zero point to this function.

It's only real substance is to perform an extremely outdated BSD4.2
ICMP check, which we can safely remove. If you really have a MTU
limited link being routed by a BSD4.2 derived system, here's a nickel
go buy yourself a real router.

The other actions of ip_rt_frag_needed(), checking and conditionally
updating the peer, are done by the per-protocol handlers of the ICMP
event.

TCP, UDP, et al. have a handler which will receive this event and
transmit it back into the associated route via dst_ops->update_pmtu().

This simplification is important, because it eliminates the one place
where we do not have a proper route context in which to make an
inetpeer lookup.

Signed-off-by: David S. Miller <davem@davemloft.net>
97bab73f987e2781129cd6f4b6379bf44d808cc6 10-Jun-2012 David S. Miller <davem@davemloft.net> inet: Hide route peer accesses behind helpers.

We encode the pointer(s) into an unsigned long with one state bit.

The state bit is used so we can store the inetpeer tree root to use
when resolving the peer later.

Later the peer roots will be per-FIB table, and this change works to
facilitate that.

Signed-off-by: David S. Miller <davem@davemloft.net>
c5d21c4b2a7765ab0600c8426374b50eb9f4a36f 10-Jun-2012 Roland Dreier <roland@purestorage.com> net: Reorder initialization in ip_route_output to fix gcc warning

If I build with W=1, for every file that includes <net/route.h>, I get the warning

include/net/route.h: In function 'ip_route_output':
include/net/route.h:135:3: warning: initialized field overwritten [-Woverride-init]
include/net/route.h:135:3: warning: (near initialization for 'fl4') [-Woverride-init]

(This is with "gcc (Debian 4.6.3-1) 4.6.3")

A fix seems pretty trivial: move the initialization of .flowi4_tos
earlier. As far as I can tell, this has no effect on code generation.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fbfe95a42e90b3dd079cc9019ba7d7700feee0f6 09-Jun-2012 David S. Miller <davem@davemloft.net> inet: Create and use rt{,6}_get_peer_create().

There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one. So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
95c961747284a6b83a5e2d81240e214b0fa3464d 15-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6b45241c57a83197e5de9166b3b0d32ac562609 04-Feb-2012 Julian Anastasov <ja@ssi.bg> ipv4: reset flowi parameters on route connect

Eric Dumazet found that commit 813b3b5db83
(ipv4: Use caller's on-stack flowi as-is in output
route lookups.) that comes in 3.0 added a regression.
The problem appears to be that resulting flowi4_oif is
used incorrectly as input parameter to some routing lookups.
The result is that when connecting to local port without
listener if the IP address that is used is not on a loopback
interface we incorrectly assign RTN_UNICAST to the output
route because no route is matched by oif=lo. The RST packet
can not be sent immediately by tcp_v4_send_reset because
it expects RTN_LOCAL.

So, change ip_route_connect and ip_route_newports to
update the flowi4 fields that are input parameters because
we do not want unnecessary binding to oif.

To make it clear what are the input parameters that
can be modified during lookup and to show which fields of
floiw4 are reused add a new function to update the flowi4
structure: flowi4_update_output.

Thanks to Yurij M. Plotnikov for providing a bug report including a
program to reproduce the problem.

Thanks to Eric Dumazet for tracking the problem down to
tcp_v4_send_reset and providing initial fix.

Reported-by: Yurij M. Plotnikov <Yurij.Plotnikov@oktetlabs.ru>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b8400f3718a11c9b0ca400705cddf94f3132c1c3 23-Nov-2011 Steffen Klassert <steffen.klassert@secunet.com> route: struct rtable can be const in rt_is_input_route and rt_is_output_route

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a48eff128865aa20520fa6e0e0c5fbd2ac50d712 19-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass explicit destination address to rt_bind_peer().

Signed-off-by: David S. Miller <davem@davemloft.net>
ed2361e66eec60645f8e4715fe39a42235ef43ae 19-May-2011 David S. Miller <davem@davemloft.net> ipv4: Pass explicit destination address to rt_get_peer().

This will next trickle down to rt_bind_peer().

Signed-off-by: David S. Miller <davem@davemloft.net>
8e36360ae876995e92d3a7538dda70548e64e685 13-May-2011 David S. Miller <davem@davemloft.net> ipv4: Remove route key identity dependencies in ip_rt_get_source().

Pass in the sk_buff so that we can fetch the necessary keys from
the packet header when working with input routes.

Signed-off-by: David S. Miller <davem@davemloft.net>
cbb1e85f9cfd2bd9b7edfd21d167e89a4279faf0 04-May-2011 David S. Miller <davem@davemloft.net> ipv4: Kill rt->rt_{src, dst} usage in IP GRE tunnels.

First, make callers pass on-stack flowi4 to ip_route_output_gre()
so they can get at the fully resolved flow key.

Next, use that in ipgre_tunnel_xmit() to avoid the need to use
rt->rt_{dst,src}.

Signed-off-by: David S. Miller <davem@davemloft.net>
31e4543db29fb85496a122b965d6482c8d1a2bfe 04-May-2011 David S. Miller <davem@davemloft.net> ipv4: Make caller provide on-stack flow key to ip_route_output_ports().

Signed-off-by: David S. Miller <davem@davemloft.net>
475949d8e86bbde5ea3ffa4d95e022ca69233b14 04-May-2011 David S. Miller <davem@davemloft.net> ipv4: Renamt struct rtable's rt_tos to rt_key_tos.

To more accurately reflect that it is purely a routing
cache lookup key and is used in no other context.

Signed-off-by: David S. Miller <davem@davemloft.net>
6706b6ebab85dfca4e2886e35ec9c3c4ee13e27e 29-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Remove now superfluous code in ip_route_connect().

Now that output route lookups update the flow with
source address et al. selections, the fl4->{saddr,daddr}
assignments here are no longer necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
813b3b5db831ddbd92b5ce0fdeb74e3368f1323c 28-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Use caller's on-stack flowi as-is in output route lookups.

Signed-off-by: David S. Miller <davem@davemloft.net>
b678027cb77b079bc8e5b94172995d173bdb494b 26-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Kill RTO_CONN.

It's not used by anything in the kernel, and defined in net/route.h so
never exported to userspace.

Therefore we can safely remove it.

Signed-off-by: David S. Miller <davem@davemloft.net>
2d7192d6cbab20e153c47fa1559ffd41ceef0e79 26-Apr-2011 David S. Miller <davem@davemloft.net> ipv4: Sanitize and simplify ip_route_{connect,newports}()

These functions are used together as a unit for route resolution
during connect(). They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object. That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.

Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
2a9e9507011440a57d6356ded630ba0c0f5d4b77 24-Apr-2011 David S. Miller <davem@davemloft.net> net: Remove __KERNEL__ cpp checks from include/net

These header files are never installed to user consumption, so any
__KERNEL__ cpp checks are superfluous.

Projects should also not copy these files into their userland utility
sources and try to use them there. If they insist on doing so, the
onus is on them to sanitize the headers as needed.

Signed-off-by: David S. Miller <davem@davemloft.net>
b71d1d426d263b0b6cb5760322efebbfc89d4463 22-Apr-2011 Eric Dumazet <eric.dumazet@gmail.com> inet: constify ip headers and in6_addr

Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1b86a58f9d7ce4fe2377687f378fbfb53bdc9b6c 07-Apr-2011 OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> ipv4: Fix "Set rt->rt_iif more sanely on output routes."

Commit 1018b5c01636c7c6bda31a719bda34fc631db29a ("Set rt->rt_iif more
sanely on output routes.") breaks rt_is_{output,input}_route.

This became the cause to return "IP_PKTINFO's ->ipi_ifindex == 0".

To fix it, this does:

1) Add "int rt_route_iif;" to struct rtable

2) For input routes, always set rt_route_iif to same value as rt_iif

3) For output routes, always set rt_route_iif to zero. Set rt_iif
as it is done currently.

4) Change rt_is_{output,input}_route() to test rt_route_iif

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
94b92b88344641dbeadd2ef371549b7663a48fb1 31-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4_init_output() in net/route.h

Signed-off-by: David S. Miller <davem@davemloft.net>
6df59a84eccd4cad7fcefda3e0c5e55239a3b2dd 25-Mar-2011 Steffen Klassert <steffen.klassert@secunet.com> route: Take the right src and dst addresses in ip_route_newports

When we set up the flow informations in ip_route_newports(), we take
the address informations from the the rt_key_src and rt_key_dst fields
of the rtable. They appear to be empty. So take the address
informations from rt_src and rt_dst instead. This issue was introduced
by commit 5e2b61f78411be25f0b84f97d5b5d312f184dfd1 ("ipv4: Remove
flowi from struct rtable.")

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6abbaa2725a43cf5d26c4c2a5dc6c0f6029ea19 19-Mar-2011 Julian Anastasov <ja@ssi.bg> ipv4: fix route deletion for IPs on many subnets

Alex Sidorenko reported for problems with local
routes left after IP addresses are deleted. It happens
when same IPs are used in more than one subnet for the
device.

Fix fib_del_ifaddr to restrict the checks for duplicate
local and broadcast addresses only to the IFAs that use
our primary IFA or another primary IFA with same address.
And we expect the prefsrc to be matched when the routes
are deleted because it is possible they to differ only by
prefsrc. This patch prevents local and broadcast routes
to be leaked until their primary IP is deleted finally
from the box.

As the secondary address promotion needs to delete
the routes for all secondaries that used the old primary IFA,
add option to ignore these secondaries from the checks and
to assume they are already deleted, so that we can safely
delete the route while these IFAs are still on the device list.

Reported-by: Alex Sidorenko <alexandre.sidorenko@hp.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
9cce96df5b76691712dba22e83ff5efe900361e1 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put fl4_* macros to struct flowi4 and use them again.

Signed-off-by: David S. Miller <davem@davemloft.net>
9d6ec938019c6b16cb9ec96598ebe8f20de435fe 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use flowi4 in public route lookup interfaces.

Signed-off-by: David S. Miller <davem@davemloft.net>
6281dcc94a96bd73017b2baa8fa83925405109ef 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Make flowi ports AF dependent.

Create two sets of port member accessors, one set prefixed by fl4_*
and the other prefixed by fl6_*

This will let us to create AF optimal flow instances.

It will work because every context in which we access the ports,
we have to be fully aware of which AF the flowi is anyways.

Signed-off-by: David S. Miller <davem@davemloft.net>
1d28f42c1bd4bb2363d88df74d0128b4da135b4a 12-Mar-2011 David S. Miller <davem@davemloft.net> net: Put flowi_* prefix on AF independent members of struct flowi

I intend to turn struct flowi into a union of AF specific flowi
structs. There will be a common structure that each variant includes
first, much like struct sock_common.

This is the first step to move in that direction.

Signed-off-by: David S. Miller <davem@davemloft.net>
78fbfd8a653ca972afe479517a40661bfff6d8c3 12-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Create and use route lookup helpers.

The idea here is this minimizes the number of places one has to edit
in order to make changes to how flows are defined and used.

Signed-off-by: David S. Miller <davem@davemloft.net>
5e2b61f78411be25f0b84f97d5b5d312f184dfd1 05-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Remove flowi from struct rtable.

The only necessary parts are the src/dst addresses, the
interface indexes, the TOS, and the mark.

The rest is unnecessary bloat, which amounts to nearly
50 bytes on 64-bit.

Signed-off-by: David S. Miller <davem@davemloft.net>
4157434c23f8f5126a2ffd3cc7b2c3bd928be075 05-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Use passed-in protocol in ip_route_newports().

Signed-off-by: David S. Miller <davem@davemloft.net>
5bfa787fb2c29cce0722500f90df29e049ff07fc 02-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: ip_route_output_key() is better as an inline.

This avoid a stack frame at zero cost.

Signed-off-by: David S. Miller <davem@davemloft.net>
b23dd4fe42b455af5c6e20966b7d6959fa8352ea 02-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make output route lookup return rtable directly.

Instead of on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2774c131b1d19920b4587db1cfbd6f0750ad1f15 01-Mar-2011 David S. Miller <davem@davemloft.net> xfrm: Handle blackhole route creation via afinfo.

That way we don't have to potentially do this in every xfrm_lookup()
caller.

Signed-off-by: David S. Miller <davem@davemloft.net>
273447b352e69c327efdecfd6e1d6fe3edbdcd14 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Kill can_sleep arg to ip_route_output_flow()

This boolean state is now available in the flow flags.

Signed-off-by: David S. Miller <davem@davemloft.net>
5df65e5567a497a28067019b8ff08f98fb026629 01-Mar-2011 David S. Miller <davem@davemloft.net> net: Add FLOWI_FLAG_CAN_SLEEP.

And set is in contexts where the route resolution can sleep.

Signed-off-by: David S. Miller <davem@davemloft.net>
420d44daa7aa1cc847e9e527f0a27a9ce61768ca 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep"

Since that is what the current vague "flags" argument means.

Signed-off-by: David S. Miller <davem@davemloft.net>
abdf7e7239da270e68262728f125ea94b9b7d42d 01-Mar-2011 David S. Miller <davem@davemloft.net> ipv4: Can final ip_route_connect() arg to boolean "can_sleep".

Since that's what the current vague "flags" thing means.

Signed-off-by: David S. Miller <davem@davemloft.net>
dca8b089c95d94afa1d715df257de0286350e99d 24-Feb-2011 David S. Miller <davem@davemloft.net> ipv4: Rearrange how ip_route_newports() gets port keys.

ip_route_newports() is the only place in the entire kernel that
cares about the port members in the routing cache entry's lookup
flow key.

Therefore the only reason we store an entire flow inside of the
struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need
to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied
from the routing cache entry's flow.

Signed-off-by: David S. Miller <davem@davemloft.net>
6431cbc25fa21635ee04eb0516ba6c51389fbfac 08-Feb-2011 David S. Miller <davem@davemloft.net> inet: Create a mechanism for upward inetpeer propagation into routes.

If we didn't have a routing cache, we would not be able to properly
propagate certain kinds of dynamic path attributes, for example
PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would
be no way to lookup all of the active cached routes hanging off of
sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer
entry existed and also we were not asked to pre-COW the route metrics
and therefore did not force the creation a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this
information in a new inetpeer entry, there is no way to teach that
cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID. When we create a cached route of any
kind, we remember the generation ID at the time of attachment. Any
time we force-create an inetpeer entry in response to new path
information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event
is propagated. If the global generation ID does not equal the one
stored in the cached route, and the cached route has not attached
to an inetpeer yet, we look it up and attach if one is found. Now
that we've updated the cached route's information, we update the
route's generation ID too.

This clears the way for implementing PMTU and redirects directly in
the inetpeer cache. There is absolutely no need to consult cached
route information in order to maintain this information.

At this point nothing bumps the inetpeer genids, that comes in the
later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <davem@davemloft.net>
a4daad6b0923030fbd3b00a01f570e4c3eef446b 28-Jan-2011 David S. Miller <davem@davemloft.net> net: Pre-COW metrics for TCP.

TCP is going to record metrics for the connection,
so pre-COW the route metrics at route cache entry
creation time.

This avoids several atomic operations that have to
occur if we COW the metrics after the entry reaches
global visibility.

Signed-off-by: David S. Miller <davem@davemloft.net>
62fa8a846d7de4b299232e330c74b7783539df76 27-Jan-2011 David S. Miller <davem@davemloft.net> net: Implement read-only protection and COW'ing of metrics.

Routing metrics are now copy-on-write.

Initially a route entry points it's metrics at a read-only location.
If a routing table entry exists, it will point there. Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing. Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads. In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit. But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline. This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.

Signed-off-by: David S. Miller <davem@davemloft.net>
6561a3b12d62ed5317e6ac32182d87a03f62c8dc 20-Dec-2010 David S. Miller <davem@davemloft.net> ipv4: Flush per-ns routing cache more sanely.

Flush the routing cache only of entries that match the
network namespace in which the purge event occurred.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
323e126f0c5995f779d7df7fd035f6e8fed8764d 13-Dec-2010 David S. Miller <davem@davemloft.net> ipv4: Don't pre-seed hoplimit metric.

Always go through a new ip4_dst_hoplimit() helper, just like ipv6.

This allowed several simplifications:

1) The interim dst_metric_hoplimit() can go as it's no longer
userd.

2) The sysctl_ip_default_ttl entry no longer needs to use
ipv4_doint_and_flush, since the sysctl is not cached in
routing cache metrics any longer.

3) ipv4_doint_and_flush no longer needs to be exported and
therefore can be marked static.

When ipv4_doint_and_flush_strategy was removed some time ago,
the external declaration in ip.h was mistakenly left around
so kill that off too.

We have to move the sysctl_ip_default_ttl declaration into
ipv4's route cache definition header net/route.h, because
currently net/ip.h (where the declaration lives now) has
a back dependency on net/route.h

Signed-off-by: David S. Miller <davem@davemloft.net>
5811662b15db018c740c57d037523683fd3e6123 12-Nov-2010 Changli Gao <xiaosuo@gmail.com> net: use the macros defined for the members of flowi

Use the macros defined for the members of flowi to clean the code up.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c753796769e4fb0cd813b6e5801b3c01f4681d4f 12-Nov-2010 David S. Miller <davem@davemloft.net> ipv4: Make rt->fl.iif tests lest obscure.

When we test rt->fl.iif against zero, we're seeing if it's
an output or an input route.

Make that explicit with some helper functions.

Signed-off-by: David S. Miller <davem@davemloft.net>
72cdd1d971c0deb1619c5c339270570c43647a78 11-Nov-2010 Eric Dumazet <eric.dumazet@gmail.com> net: get rid of rtable->idev

It seems idev field in struct rtable has no special purpose, but adding
extra atomic ops.

We hold refcounts on the device itself (using percpu data, so pretty
cheap in current kernel).

infiniband case is solved using dst.dev instead of idev->dev

Removal of this field means routing without route cache is now using
shared data, percpu data, and only potential contention is a pair of
atomic ops on struct neighbour per forwarded packet.

About 5% speedup on routing test.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fb0c5f0bc8b69b40549449ee7fc65f3706f12062 27-Sep-2010 Ulrich Weber <uweber@astaro.com> tproxy: check for transparent flag in ip_route_newports

as done in ip_route_connect()

Signed-off-by: Ulrich Weber <uweber@astaro.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d8d1f30b95a635dbd610dcc5eb641aca8f4768cf 11-Jun-2010 Changli Gao <xiaosuo@gmail.com> net-next: remove useless union keyword

remove useless union keyword in rtable, rt6_info and dn_route.

Since there is only one member in a union, the union keyword isn't useful.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
407eadd996dc62a827db85f1d0c286a98fd5d336 10-May-2010 Eric Dumazet <eric.dumazet@gmail.com> net: implements ip_route_input_noref()

ip_route_input() is the version returning a refcounted dst, while
ip_route_input_noref() returns a non refcounted one.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7d720c3e4f0c4fc152a6bf17e24244a3c85412d2 16-Feb-2010 Tejun Heo <tj@kernel.org> percpu: add __percpu sparse annotations to net

Add __percpu sparse annotations to net.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors. This patch doesn't affect normal builds.

The macro and type tricks around snmp stats make things a bit
interesting. DEFINE/DECLARE_SNMP_STAT() macros mark the target field
as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly. All
snmp_mib_*() users which used to cast the argument to (void **) are
updated to cast it to (void __percpu **).

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
a5ee155136b4a8f4ab0e4c9c064b661da475e298 29-Nov-2009 Eric W. Biederman <ebiederm@xmission.com> net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH

The motivation for an additional notifier in batched netdevice
notification (rt_do_flush) only needs to be called once per batch not
once per namespace.

For further batching improvements I need a guarantee that the
netdevices are unregistered in order allowing me to unregister an all
of the network devices in a network namespace at the same time with
the guarantee that the loopback device is really and truly
unregistered last.

Additionally it appears that we moved the route cache flush after
the final synchronize_net, which seems wrong and there was no
explanation. So I have restored the original location of the final
synchronize_net.

Cc: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fd2c3ef761fbc5e6c27fa7d40b30cda06bfcd7d8 03-Nov-2009 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup include/net

This cleanup patch puts struct/union/enum opening braces,
in first line to ease grep games.

struct something
{

becomes :

struct something {

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
511c3f92ad5b6d9f8f6464be1b4f85f0422be91a 02-Jun-2009 Eric Dumazet <eric.dumazet@gmail.com> net: skb->rtable accessor

Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
79876874ce20d37ecdc7f481ebf142466999152f 01-Oct-2008 KOVACS Krisztian <hidden@sch.bme.hu> ipv4: Conditionally enable transparent flow flag when connecting

Set FLOWI_FLAG_ANYSRC in flowi->flags if the socket has the
transparent socket option set. This way we selectively enable certain
connections with non-local source addresses to be routed.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
1668e010cbe1a7567c81d4c02d31dde9859e9da1 01-Oct-2008 KOVACS Krisztian <hidden@sch.bme.hu> ipv4: Make inet_sock.h independent of route.h

inet_iif() in inet_sock.h requires route.h. Since users of inet_iif()
usually require other route.h functionality anyway this patch moves
inet_iif() to route.h.

Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
eeb61f719c00c626115852bbc91189dc3011a844 27-Jul-2008 Al Viro <viro@ZenIV.linux.org.uk> missing bits of net-namespace / sysctl

Piss-poor sysctl registration API strikes again, film at 11...

What we really need is _pathname_ required to be present in already
registered table, so that kernel could warn about bad order. That's the
next target for sysctl stuff (and generally saner and more explicit
order of initialization of ipv[46] internals wouldn't hurt either).

For the time being, here are full fixups required by ..._rotable()
stuff; we make per-net sysctl sets descendents of "ro" one and make sure
that sufficient skeleton is there before we start registering per-net
sysctls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6f9f489a4eeaa3c8a8618e078a5270d2c4872b67 27-Jul-2008 Al Viro <viro@zeniv.linux.org.uk> net: missing bits of net-namespace / sysctl

Piss-poor sysctl registration API strikes again, film at 11...
What we really need is _pathname_ required to be present in
already registered table, so that kernel could warn about bad
order. That's the next target for sysctl stuff (and generally
saner and more explicit order of initialization of ipv[46]
internals wouldn't hurt either).

For the time being, here are full fixups required by ..._rotable()
stuff; we make per-net sysctl sets descendents of "ro" one and
make sure that sufficient skeleton is there before we start registering
per-net sysctls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
76e6ebfb40a2455c18234dcb0f9df37533215461 06-Jul-2008 Denis V. Lunev <den@openvz.org> netns: add namespace parameter to rt_cache_flush

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0010e46577a27c1d915034637f6c2fa57a9a091c 29-Apr-2008 Timo Teras <timo.teras@iki.fi> ipv4: Update MTU to all related cache entries in ip_rt_frag_needed()

Add struct net_device parameter to ip_rt_frag_needed() and update MTU to
cache entries where ifindex is specified. This is similar to what is
already done in ip_rt_redirect().

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
3b1e0a655f8eba44ab1ee2a1068d169ccfb853b9 25-Mar-2008 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.

Introduce per-sock inlines: sock_net(), sock_net_set()
and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set().
Without CONFIG_NET_NS, no namespace other than &init_net exists.
Let's explicitly define them to help compiler optimizations.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
7d164be8aa4392fe55474f4608547f2097e07c41 24-Mar-2008 Joe Perches <joe@perches.com> [NET]: include/net/route.h - remove duplicate include

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
29e75252da20f3ab9e132c68c9aed156b87beae6 01-Feb-2008 Eric Dumazet <dada1@cosmosbay.com> [IPV4] route cache: Introduce rt_genid for smooth cache invalidation

Current ip route cache implementation is not suited to large caches.

We can consume a lot of CPU when cache must be invalidated, since we
currently need to evict all cache entries, and this eviction is
sometimes asynchronous. min_delay & max_delay can somewhat control this
asynchronism behavior, but whole thing is a kludge, regularly triggering
infamous soft lockup messages. When entries are still in use, this also
consumes a lot of ram, filling dst_garbage.list.

A better scheme is to use a generation identifier on each entry,
so that cache invalidation can be performed by changing the table
identifier, without having to scan all entries.
No more delayed flushing, no more stalling when secret_interval expires.

Invalidated entries will then be freed at GC time (controled by
ip_rt_gc_timeout or stress), or when an invalidated entry is found
in a chain when an insert is done.
Thus we keep a normal equilibrium.

This patch :
- renames rt_hash_rnd to rt_genid (and makes it an atomic_t)
- Adds a new rt_genid field to 'struct rtable' (filling a hole on 64bit)
- Checks entry->rt_genid at appropriate places :
4a19ec5800fc3bb64e2d87c4d9fdd9e636086fe0 31-Jan-2008 Laszlo Attila Toth <panther@balabit.hu> [NET]: Introducing socket mark socket option.

A userspace program may wish to set the mark for each packets its send
without using the netfilter MARK target. Changing the mark can be used
for mark based routing without netfilter or for packet filtering.

It requires CAP_NET_ADMIN capability.

Signed-off-by: Laszlo Attila Toth <panther@balabit.hu>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
b5921910a1de4ba82add59154976c3dc7352c8c2 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Routing cache virtualization.

Basically, this piece looks relatively easy. Namespace is already
available on the dst entry via device and the device is safe to
dereferrence. Compare it with one of a searcher and skip entry if
appropriate.

The only exception is ip_rt_frag_needed. So, add namespace parameter to it.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
eee80592c3c1f7381c04913d9d3eb6e3c3c87628 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Correct namespace for connect-time routing.

ip_route_connect and ip_route_newports are a part of routing API
presented to the socket layer. The namespace is available inside them
through a socket.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f206351a50ea86250fabea96b9af8d8f8fc02603 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_route_output_key.

Needed to propagate it down to the ip_route_output_flow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
f1b050bf7a88910f9f00c9c8989c1bf5a67dd140 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to ip_route_output_flow.

Needed to propagate it down to the __ip_route_output_key.

Signed_off_by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
611c183ebcb5af384df3a4ddb391034a1b6ac255 23-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Add namespace parameter to __ip_route_output_key.

This is only required to propagate it down to the
ip_route_output_slow.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
1bad118a330d494b23663fce94d4e9d9d5065fa7 10-Jan-2008 Denis V. Lunev <den@openvz.org> [NETNS]: Pass namespace through ip_rt_ioctl.

... up to rtentry_to_fib_config

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6b175b26c1048d331508940ad3516ead1998084f 10-Jan-2008 Eric W. Biederman <ebiederm@xmission.com> [NETNS]: Add netns parameter to inet_(dev_)add_type.

The patch extends the inet_addr_type and inet_dev_addr_type with the
network namespace pointer. That allows to access the different tables
relatively to the network namespace.

The modification of the signature function is reported in all the
callers of the inet_addr_type using the pointer to the well known
init_net.

Acked-by: Benjamin Thery <benjamin.thery@bull.net>
Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0553811612a6178365f3b062c30234913b218a96 05-Dec-2007 Laszlo Attila Toth <panther@balabit.hu> [IPV4]: Add inet_dev_addr_type()

Address type search can be limited to an interface by
inet_dev_addr_type function.

Signed-off-by: Laszlo Attila Toth <panther@balabit.hu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
56c99d0415e8b778c200f115b198c126243ec351 06-Dec-2007 Denis V. Lunev <den@openvz.org> [IPV4]: Remove prototype of ip_rt_advice

ip_rt_advice has been gone, so no need to keep prototype and debug message.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
4839c52b01ca91be1c62761e08fb3deb3881e857 10-Jul-2007 Philippe De Muyter <phdm@macqel.be> [IPV4]: Make ip_tos2prio const.

Signed-off-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
e06e7c615877026544ad7f8b309d1a3706410383 11-Jun-2007 David S. Miller <davem@sunset.davemloft.net> [IPV4]: The scheduled removal of multipath cached routing support.

With help from Chris Wedgwood.

Signed-off-by: David S. Miller <davem@davemloft.net>
093c2ca4167cf66f69020329d14138da0da8599b 10-Feb-2007 Eric Dumazet <dada1@cosmosbay.com> [IPV4]: Convert ipv4 route to use the new dst_entry 'next' pointer

This patch removes the rt_next pointer from 'struct rtable.u' union,
and renames u.rt_next to u.dst_rt_next.

It also moves 'struct flowi' right after 'struct dst_entry' to prepare
the gain on lookups.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8eb9086f21c73b38b5ca27558db4c91d62d0e70b 08-Feb-2007 David S. Miller <davem@sunset.davemloft.net> [IPV4/IPV6]: Always wait for IPSEC SA resolution in socket contexts.

Do this even for non-blocking sockets. This avoids the silly -EAGAIN
that applications can see now, even for non-blocking sockets in some
cases (f.e. connect()).

With help from Venkat Tekkirala.

Signed-off-by: David S. Miller <davem@davemloft.net>
39dccd9d922b595301e5d43ca7a30823d81393b6 28-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: route.h annotations

ip_route_connect(), ip_route_newports() get port numbers net-endian

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
fd6832220974809141b3981e380b78690bba8911 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: inet_addr_type() annotations

argument and inferred net-endian variables in callers annotated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
bada8adc4e6622764205921e6ba3f717aa03c882 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: ip_route_connect() ipv4 address arguments annotated

annotated address arguments (port number left alone for now); ditto
for inferred net-endian variables in callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
f2c3fe24119ee4f8faca08699f0488f500014a27 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: annotate ipv4 addresses in struct rtable and struct flowi

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
f7655229c06d041323b40bd6eb9f95ca0ce95506 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: ip_rt_redirect() annotations

The first 4 arguments of ip_rt_redirect() are net-endian. Annotated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
9e12bb22e32389b41222c9d9fb55724fed83a038 27-Sep-2006 Al Viro <viro@zeniv.linux.org.uk> [IPV4]: ip_route_input() annotations

ip_route_input() takes net-endian source and destination address.
* Annotated as such.
* arguments of its invocations annotated where needed.
* local helpers getting the same values passed to by it (ip_route_input_mc(),
ip_route_input_slow(), ip_handle_martian_source(), ip_mkroute_input(),
ip_mkroute_input_def(), __mkroute_input()) annotated

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
beb8d13bed80f8388f1a9a107d07ddd342e627e8 05-Aug-2006 Venkat Yekkirala <vyekkirala@TrustedCS.com> [MLSXFRM]: Add flow labeling

This labels the flows that could utilize IPSec xfrms at the points the
flows are defined so that IPSec policy and SAs at the right label can
be used.

The following protos are currently not handled, but they should
continue to be able to use single-labeled IPSec like they currently
do.

ipmr
ip_gre
ipip
igmp
sit
sctp
ip6_tunnel (IPv6 over IPv6 tunnel device)
decnet

Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
62c4f0a2d5a188f73a94f2cb8ea0dba3e7cf0a7f 26-Apr-2006 David Woodhouse <dwmw2@infradead.org> Don't include linux/config.h from anywhere else in include/

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
cef2685e0053945ea0f3c02297386b040f486ea7 25-Mar-2006 Ilia Sotnikov <hostcc@gmail.com> [IPV4]: Aggregate route entries with different TOS values

When we get an ICMP need-to-frag message, the original TOS value in the
ICMP payload cannot be used as a key to look up the routes to update.
This is because the TOS field may have been modified by routers on the
way. Similarly, ip_rt_redirect should also ignore the TOS as the router
that gave us the message may have modified the TOS value.

The patch achieves this objective by aggregating entries with different
TOS values (but are otherwise identical) into the same bucket. This
makes it easy to update them at the same time when an ICMP message is
received.

In future we should use a twin-hashing scheme where teh aggregation
occurs at the entry level. That is, the TOS goes back into the hash
for normal lookups while ICMP lookups will end up with a node that
gives us a list that contains all other route entries that differ
only by TOS.

Signed-off-by: Ilia Sotnikov <hostcc@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
5d39a795bfa217b5f7637028c83ab5cb291f37bf 01-Feb-2006 Patrick McHardy <kaber@trash.net> [IPV4]: Always set fl.proto in ip_route_newports

ip_route_newports uses the struct flowi from the struct rtable returned
by ip_route_connect for the new route lookup and just replaces the port
numbers if they have changed. If an IPsec policy exists which doesn't match
port 0 the struct flowi won't have the proto field set and no xfrm lookup
is done for the changed ports.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
0ff60a45678e67b2547256a636fd00c1667ce4fa 22-Nov-2005 Jamal Hadi Salim <hadi@cyberus.ca> [IPV4]: Fix secondary IP addresses after promotion

This patch fixes the problem with promoting aliases when:
a) a single primary and > 1 secondary addresses
b) multiple primary addresses each with at least one secondary address

Based on earlier efforts from Brian Pomerantz <bapper@piratehaven.org>,
Patrick McHardy <kaber@trash.net> and Thomas Graf <tgraf@suug.ch>

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
20380731bc2897f2952ae055420972ded4cd786e 16-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [NET]: Fix sparse warnings

Of this type, mostly:

CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0742fd53a3774781255bd1e471e7aa2e4a82d5f7 10-Aug-2005 Adrian Bunk <bunk@stusta.de> [IPV4]: possible cleanups

This patch contains the following possible cleanups:
- make needlessly global code static
- #if 0 the following unused global function:
- xfrm4_state.c: xfrm4_state_fini
- remove the following unneeded EXPORT_SYMBOL's:
- ip_output.c: ip_finish_output
- ip_output.c: sysctl_ip_default_ttl
- fib_frontend.c: ip_dev_find
- inetpeer.c: inet_peer_idlock
- ip_options.c: ip_options_compile
- ip_options.c: ip_options_undo
- net/core/request_sock.c: sysctl_max_syn_backlog

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
39c715b71740c4a78ba4769fb54826929bac03cb 22-Jun-2005 Ingo Molnar <mingo@elte.hu> [PATCH] smp_processor_id() cleanup

This patch implements a number of smp_processor_id() cleanup ideas that
Arjan van de Ven and I came up with.

The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
spaghetti was hard to follow both on the implementational and on the
usage side.

Some of the complexity arose from picking wrong names, some of the
complexity comes from the fact that not all architectures defined
__smp_processor_id.

In the new code, there are two externally visible symbols:

- smp_processor_id(): debug variant.

- raw_smp_processor_id(): nondebug variant. Replaces all existing
uses of _smp_processor_id() and __smp_processor_id(). Defined
by every SMP architecture in include/asm-*/smp.h.

There is one new internal symbol, dependent on DEBUG_PREEMPT:

- debug_smp_processor_id(): internal debug variant, mapped to
smp_processor_id().

Also, i moved debug_smp_processor_id() from lib/kernel_lock.c into a new
lib/smp_processor_id.c file. All related comments got updated and/or
clarified.

I have build/boot tested the following 8 .config combinations on x86:

{SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}

I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT. (Other
architectures are untested, but should work just fine.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
37e20a66db02eff9adbeee043af990cca85d0034 30-May-2005 Pravin B. Shelar <pravins@calsoftinc.com> [IPV4]: Kill MULTIPATHHOLDROUTE flag.

It cannot work properly, so just ignore it in drr
and rr multipath algorithms just like the random
multipath algorithm does.

Suggested by Herbert Xu.

Signed-off by: Pravin B. Shelar <pravins@calsoftinc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
02c30a84e6298b6b20a56f0896ac80b47839e134 06-May-2005 Jesper Juhl <juhl-lkml@dif.dk> [PATCH] update Ross Biro bouncing email address

Ross moved. Remove the bad email address so people will find the correct
one in ./CREDITS.

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!