History log of /net/dccp/dccp.h
Revision Date Author Comments
fd34d627aee8f32d6e54fdc0347be7a18e2d65ee 04-Jan-2014 stephen hemminger <stephen@networkplumber.org> dccp: remove obsolete code

This function is defined but not used.
Remove it now, can be resurrected if ever needed.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
a402a5aa9b4cbb42cc41bf573d2e5c4713541af0 18-Oct-2013 Joe Perches <joe@perches.com> net: dccp: Remove extern from function prototypes

There is a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.

Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2c53040f018b6c36a46eec75b9b937aaa5f78e6d 10-Jul-2012 Ben Hutchings <bhutchings@solarflare.com> net: Fix (nearly-)kernel-doc comments for various functions

Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
95c961747284a6b83a5e2d81240e214b0fa3464d 15-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
eb93992207dadb946a3b5cf4544957dc924a6f58 19-Dec-2011 Rusty Russell <rusty@rustcorp.com.au> module_param: make bool parameters really bool (net & drivers/net)

module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.

(Thanks to Joe Perches for suggesting coccinelle for 0/1 -> true/false).

Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
dfd56b8b38fff3586f36232db58e1e9f7885a605 10-Dec-2011 Eric Dumazet <eric.dumazet@gmail.com> net: use IS_ENABLED(CONFIG_IPV6)

Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d6916f87ca5e566786f1a935a7cabba54774bbda 25-Jul-2011 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: support for the exchange of NN options in established state 1/2

In contrast to static feature negotiation at the beginning of a connection, this
patch introduces support for exchange of dynamically changing options.

Such an update/exchange is necessary in at least two cases:
* CCID-2's Ack Ratio (RFC 4341, 6.1.2) which changes during the connection;
* Sequence Window values that, as per RFC 4340, 7.5.2, should be sent "as
the connection progresses".

Both are non-negotiable (NN) features, which means that no new capabilities
are negotiated, but rather that changes in known parameters are brought
up-to-date at either end.

These characteristics are reflected by the implementation:
* only NN options can be exchanged after connection setup;
* an ack is scheduled directly after activation to speed up the update;
* CCIDs may request changes to an NN feature even if a negotiation for that
feature is already underway: this is required by CCID-2, where changes in
cwnd necessitate Ack Ratio changes, such that the previous Ack Ratio (which
is still being negotiated) would cause irrecoverable RTO timeouts (thanks
to work by Samuel Jero).
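The rule "only NN options after setup, and a pending NN change may be overwritten" can be sketched as a small gate function. This is a minimal sketch under stated assumptions: `enum feat_kind`, `struct feat_change` and `feat_request_change()` are hypothetical names, not the kernel's feat.c API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature classification: server-priority (SP) features are
 * negotiated only at connection setup; non-negotiable (NN) features just
 * carry a value that the peer must accept. */
enum feat_kind { FEAT_SP, FEAT_NN };

/* Hypothetical pending-change record for one feature. */
struct feat_change {
	enum feat_kind kind;
	uint64_t       value;
	bool           pending;   /* a Change option is awaiting its Confirm */
};

/* Returns true if the change was queued. After setup (established == true)
 * only NN features may change; an NN change that is already pending is
 * overwritten with the newer value instead of being rejected, as needed
 * when cwnd changes force a new Ack Ratio mid-negotiation. */
bool feat_request_change(struct feat_change *fc, uint64_t value, bool established)
{
	if (established && fc->kind != FEAT_NN)
		return false;
	fc->value   = value;
	fc->pending = true;
	return true;
}
```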

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.uk>
763dadd47c884853a22f2f19ea27e58431303ff3 30-Dec-2010 Samuel Jero <sj323707@ohio.edu> dccp: fix bug in updating the GSR

Currently dccp_check_seqno allows any valid packet to update the Greatest
Sequence Number Received, even if that packet's sequence number is less than
the current GSR. This patch adds a check to make sure that the new packet's
sequence number is greater than GSR.
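The added check amounts to "update GSR only if the new sequence number is after it" in DCCP's 48-bit circular sequence space. A minimal sketch, assuming hypothetical helper names (`seq48_delta`, `seq48_after`, `update_gsr`) rather than the kernel's own:

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ48_MASK ((UINT64_C(1) << 48) - 1)

/* Signed distance from s1 to s2 in 48-bit circular sequence space. */
static int64_t seq48_delta(uint64_t s1, uint64_t s2)
{
	uint64_t d = (s2 - s1) & SEQ48_MASK;

	/* Sign-extend bit 47 so that "behind" yields a negative distance. */
	if (d & (UINT64_C(1) << 47))
		return (int64_t)d - (INT64_C(1) << 48);
	return (int64_t)d;
}

/* True if s1 comes strictly after s2. */
static bool seq48_after(uint64_t s1, uint64_t s2)
{
	return seq48_delta(s2, s1) > 0;
}

/* Let a valid packet advance GSR, but never move it backwards. */
void update_gsr(uint64_t *gsr, uint64_t seqno)
{
	if (seq48_after(seqno, *gsr))
		*gsr = seqno;
}
```

The wraparound case matters: a GSR of 2^48 - 1 must still advance to 2.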

Signed-off-by: Samuel Jero <sj323707@ohio.edu>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
b7ec19af63b467e30189984fb24e6157603608e3 10-Dec-2010 Shan Wei <shanwei@cn.fujitsu.com> dccp: remove unused macros

Remove macros which have been unused since the initial implementation
(commit 7c657876b63cb1d8a2ec06f8fc6c37bb8412e66c, [DCCP]: Initial
implementation from Tue Aug 9 20:14:34 2005 -0700).

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
04910265078f08a73208beab70ed2a3cce4a919f 04-Dec-2010 Tomasz Grobelny <tomasz@grobelny.oswiecenia.net> dccp qpolicy: Parameter checking of cmsg qpolicy parameters

Ensure that cmsg->cmsg_type value is valid for qpolicy
that is currently in use.
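One way to express "the cmsg type must be valid for the policy in use" is a per-policy bitmask of accepted parameters. This is a hypothetical sketch: the enum, struct and `qpolicy_param_ok()` are illustrative names, not the kernel's qpolicy code.

```c
#include <stdbool.h>

/* Hypothetical ancillary-data parameter types. */
enum qpolicy_param { QP_PARAM_PRIORITY = 1 << 0 };

/* Hypothetical policy table: each policy lists the parameters it accepts. */
struct qpolicy {
	const char  *name;
	unsigned int params;   /* bitmask of enum qpolicy_param values */
};

static const struct qpolicy qpolicies[] = {
	{ "simple", 0 },                   /* FIFO: takes no parameters */
	{ "prio",   QP_PARAM_PRIORITY },   /* priority: takes a priority */
};

/* Reject a cmsg type that the currently selected policy does not accept. */
bool qpolicy_param_ok(unsigned int policy, unsigned int param)
{
	if (policy >= sizeof(qpolicies) / sizeof(qpolicies[0]))
		return false;
	return (qpolicies[policy].params & param) == param;
}
```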

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
871a2c16c21b988688b4ab1a78eadd969765c0a3 04-Dec-2010 Tomasz Grobelny <tomasz@grobelny.oswiecenia.net> dccp: Policy-based packet dequeueing infrastructure

This patch adds a generic infrastructure for policy-based dequeueing of
TX packets and provides two policies:
* a simple FIFO policy (which is the default) and
* a priority based policy (set via socket options).
Both policies honour the tx_qlen sysctl for the maximum size of the write
queue (can be overridden via socket options).

The priority policy uses skb->priority internally to assign a u32 priority
identifier, using the same ranking as SO_PRIORITY. The skb->priority field
is set to 0 when the packet leaves DCCP. The priority is supplied as ancillary
data using cmsg(3), the patch also provides the requisite parsing routines.
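The selection step of a priority policy can be sketched over a plain array instead of an skb queue. Assumptions: `struct pkt` and `prio_select()` are hypothetical, and a higher value is taken to mean higher priority, as with SO_PRIORITY.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical queued packet with only the fields the sketch needs. */
struct pkt {
	uint32_t priority;   /* from the cmsg; higher value = more important */
	int      id;
};

/* Index of the packet to send next: highest priority wins; ties go to the
 * oldest entry (lowest index), preserving FIFO order among equal
 * priorities. Returns -1 for an empty queue. */
int prio_select(const struct pkt *q, size_t len)
{
	int best = -1;

	for (size_t i = 0; i < len; i++)
		if (best < 0 || q[i].priority > q[best].priority)
			best = (int)i;
	return best;
}
```

With all priorities equal, this degenerates to the simple FIFO policy.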

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
b3d14bff12a38ad13a174eb0cc83d2ac7169eee4 10-Nov-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Implementation of circular Ack Vector buffer with overflow handling

This completes the implementation of a circular buffer for Ack Vectors, by
extending the current (linear array-based) implementation. The changes are:

(a) An `overflow' flag to deal with the case of overflow. As before, dynamic
growth of the buffer will not be supported; but code will be added to deal
robustly with overflowing Ack Vector buffers.

(b) A `tail_seqno' field. When naively implementing the algorithm of Appendix A
in RFC 4340, problems arise whenever subsequent Ack Vector records overlap,
which can bring the entire run length calculation completely out of synch.
(This is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/\
ack_vectors/tracking_tail_ackno/ .)
(c) The buffer length is now computed dynamically (i.e. current fill level),
as the span from head to tail.

As a result, dccp_ackvec_pending() is now simpler - the #ifdef is no longer
necessary since buf_empty is always true when IP_DCCP_ACKVEC is not configured.
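Deriving the fill level from the head/tail span, with a flag to disambiguate the full case, can be sketched generically. This is a hypothetical ring buffer (`struct ring`, `ring_len()`, `BUF_SIZE` are illustrative); it is not the kernel's Ack Vector layout, and here the overflow flag simply marks the full state.

```c
#include <stdbool.h>
#include <stddef.h>

#define BUF_SIZE 8   /* hypothetical capacity */

/* The fill level is not stored; it is derived from head and tail. */
struct ring {
	size_t head;      /* index of the next slot to write */
	size_t tail;      /* index of the oldest live entry */
	bool   overflow;  /* set once writes have overrun the tail */
};

/* Live entries = distance from tail to head, modulo the size.
 * head == tail is ambiguous (empty or full); the overflow flag decides. */
size_t ring_len(const struct ring *r)
{
	if (r->overflow)
		return BUF_SIZE;
	return (r->head + BUF_SIZE - r->tail) % BUF_SIZE;
}
```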

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
b1fcf55eea541af9efa5d39f5a0d1aec8ceca55d 27-Oct-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Refine the wait-for-ccid mechanism

This extends the existing wait-for-ccid routine so that it may be used with
different types of CCID, addressing the following problems:

1) The queue-drain mechanism only works with rate-based CCIDs. If CCID-2 for
example has a full TX queue and becomes network-limited just as the
application wants to close, then waiting for CCID-2 to become unblocked
could lead to an indefinite delay (i.e., application "hangs").
2) Since each TX CCID in turn uses a feedback mechanism, there may be changes
in its sending policy while the queue is being drained. This can lead to
further delays during which the application will not be able to terminate.
3) The minimum wait time for CCID-3/4 can be expected to be the queue length
times the current inter-packet delay. For example if tx_qlen=100 and a delay
of 15 ms is used for each packet, then the application would have to wait
for a minimum of 1.5 seconds before being allowed to exit.
4) There is no way for the user/application to control this behaviour. It would
be good to use the timeout argument of dccp_close() as an upper bound. Then
the maximum time that an application is willing to wait for its CCIDs to finish
can be set via the SO_LINGER option.

These problems are addressed by giving the CCID a grace period of up to the
`timeout' value.
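The arithmetic from point 3) and the timeout bound can be written down directly. A sketch in milliseconds for readability (the kernel works in jiffies); the function names are hypothetical.

```c
#include <stdint.h>

/* Minimum drain time for a rate-based CCID: one inter-packet
 * interval per queued packet (all values in milliseconds). */
uint64_t drain_time_ms(uint32_t qlen, uint32_t ipi_ms)
{
	return (uint64_t)qlen * ipi_ms;
}

/* Grace period on close: never longer than the caller's timeout
 * (e.g. the SO_LINGER time), even if draining would take longer. */
uint64_t grace_period_ms(uint32_t qlen, uint32_t ipi_ms, uint64_t timeout_ms)
{
	uint64_t need = drain_time_ms(qlen, ipi_ms);

	return need < timeout_ms ? need : timeout_ms;
}
```

For tx_qlen=100 and 15 ms per packet, drain_time_ms() gives 1500 ms; a 1 s linger time caps the wait at 1000 ms.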

The wait-for-ccid function is, as before, used when the application
(a) has read all the data in its receive buffer and
(b) if SO_LINGER was set with a non-zero linger time, or
(c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ
state (client application closes after receiving CloseReq).

In addition, there is a catch-all case of __skb_queue_purge() after waiting for
the CCID. This is necessary since the write queue may still have data when
(a) the host has been passively-closed,
(b) abnormal termination (unread data, zero linger time),
(c) wait-for-ccid could not finish within the given time limit.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
ecdfbdabbe4e0cf0443cbbea2df1bf51bf67f3f3 11-Oct-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: schedule an Ack when receiving timestamps

This schedules an Ack when receiving a timestamp, exploiting the
existing inet_csk_schedule_ack() function, saving one case in the
`dccp_ack_pending()' function.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
d196c9a5d4e150cdff675662214c80c69b906958 11-Oct-2010 Ivo Calado <ivocalado@embedded.ufcg.edu.br> dccp: generalise data-loss condition

This patch generalises the task of determining data loss from RFC 4340, 7.7.1.

Let S_A, S_B be sequence numbers such that S_B is "after" S_A, and let
N_B be the NDP count of packet S_B. Then, using modulo-2^48 arithmetic,
D = S_B - S_A - 1 is an upper bound of the number of lost data packets,
D - N_B is an approximation of the number of lost data packets
(there are cases where this is not exact).

The patch implements this as
dccp_loss_count(S_A, S_B, N_B) := max(S_B - S_A - 1 - N_B, 0)
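The formula can be sketched with modulo-2^48 subtraction and a guard that keeps the unsigned result from underflowing. Assumptions: the names `seq48_sub` and `loss_count` are hypothetical, and S_B is taken to be "after" S_A as stated above.

```c
#include <stdint.h>

#define SEQ48_MOD (UINT64_C(1) << 48)

/* Forward distance from s_a to s_b in modulo-2^48 sequence space. */
static uint64_t seq48_sub(uint64_t s_b, uint64_t s_a)
{
	return (s_b - s_a) & (SEQ48_MOD - 1);
}

/* max(S_B - S_A - 1 - N_B, 0): estimated data packets lost between
 * S_A and S_B, given the NDP count N_B carried by packet S_B. */
uint64_t loss_count(uint64_t s_a, uint64_t s_b, uint64_t ndp_b)
{
	uint64_t gap = seq48_sub(s_b, s_a);

	if (gap <= ndp_b + 1)   /* result would be <= 0: clamp to 0 */
		return 0;
	return gap - 1 - ndp_b;
}
```

For example, S_A=10, S_B=15 leaves 4 missing sequence numbers; if N_B=2 of them were non-data packets, 2 data packets are counted as lost.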

Signed-off-by: Ivo Calado <ivocalado@embedded.ufcg.edu.br>
Signed-off-by: Erivaldo Xavier <desadoc@gmail.com>
Signed-off-by: Leandro Sales <leandroal@gmail.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
0b53d4604ac2b4f2faa9a62a04ea9b383ad2efe0 11-Oct-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: fix the adjustments to AWL and SWL

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: currently the initial adjustments to AWL/SWL are only performed
once at the beginning of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore it is better to perform that safety check each time SWL/AWL are
updated, as implemented by the patch.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection;
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).
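Performing the safety check on every update means recomputing the lower window edges with a floor at the initial sequence numbers. A sketch, assuming the RFC 4340, 7.5.1 forms SWL = max(GSR + 1 - floor(W/4), ISR) and AWL = max(GSS + 1 - W', ISS); the function names are hypothetical.

```c
#include <stdint.h>

#define SEQ48_MASK ((UINT64_C(1) << 48) - 1)

/* (a - b) mod 2^48, interpreted as a signed 48-bit value. */
static int64_t seq48_diff(uint64_t a, uint64_t b)
{
	uint64_t d = (a - b) & SEQ48_MASK;

	return (d & (UINT64_C(1) << 47)) ? (int64_t)d - (INT64_C(1) << 48)
					 : (int64_t)d;
}

/* The later of two 48-bit sequence numbers (circular comparison). */
static uint64_t seq48_max(uint64_t a, uint64_t b)
{
	return seq48_diff(a, b) > 0 ? a : b;
}

/* SWL = max(GSR + 1 - floor(W/4), ISR), recomputed on each GSR update. */
uint64_t compute_swl(uint64_t gsr, uint64_t w, uint64_t isr)
{
	return seq48_max((gsr + 1 - w / 4) & SEQ48_MASK, isr);
}

/* AWL = max(GSS + 1 - W', ISS), recomputed on each GSS update. */
uint64_t compute_awl(uint64_t gss, uint64_t w_prime, uint64_t iss)
{
	return seq48_max((gss + 1 - w_prime) & SEQ48_MASK, iss);
}
```

Early in a connection the raw lower edge lies before ISR/ISS and the floor takes over; later the window term dominates.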

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
1f4f0f645cc1d7f1187fcdb0ac22c2e69bd68050 05-Oct-2010 stephen hemminger <shemminger@vyatta.com> dccp: Kill dead code and add static markers.

Remove dead code and make some functions static.
Compile tested only.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a7d13fbf85375698879d16f118af77fbfcc2de44 22-Jun-2010 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: remove unused function argument

This removes an unused 'sk' argument from several option-inserting functions.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
bb29624614c2afe2873ee8ee97cf09df42701694 11-Apr-2010 Herbert Xu <herbert@gondor.apana.org.au> inet: Remove unused send_check length argument

This patch removes the unused length argument from the send_check
function in struct inet_connection_sock_af_ops.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ec733b15a3ef0b5759141a177f8044a2f40c41e7 18-Mar-2010 Eric Dumazet <eric.dumazet@gmail.com> net: snmp mib cleanup

There is no point to align or pad mibs to cache lines, they are per cpu
allocated with an 8-byte alignment anyway.
This wastes space for no gain. This patch removes __SNMP_MIB_ALIGN__

Since SNMP mibs contain "unsigned long" fields only, we can relax the
allocation alignment from "unsigned long long" to "unsigned long"

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b7058842c940ad2c08dd829b21e5c92ebe3b8758 01-Oct-2009 David S. Miller <davem@davemloft.net> net: Make setsockopt() optlen be unsigned.

This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the place, in
each and every implementation.

Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.

Signed-off-by: David S. Miller <davem@davemloft.net>
86739fb96e8c8269fc5b3d300c959bede272a6f6 27-Feb-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Do not let initial option overhead shrink the MPS

This fixes a problem caused by the overlap of the connection-setup and
established-state phases of DCCP connections.

During connection setup, the client retransmits Confirm Feature-Negotiation
options until a response from the server signals that it can move from the
half-established PARTOPEN into the OPEN state, whereupon the connection is
fully established on both ends (RFC 4340, 8.1.5).

However, since the client may already send data while it is in the PARTOPEN
state, consequences arise for the Maximum Packet Size: the problem is that the
initial option overhead is much higher than for the subsequent established
phase, as it involves potentially many variable-length list-type options
(server-priority options, RFC 4340, 6.4).

Applying the standard MPS is insufficient here: especially with larger
payloads this can lead to annoying, counter-intuitive EMSGSIZE errors.

On the other hand, reducing the MPS available for the established phase by
the added initial overhead is highly wasteful and inefficient.

The solution chosen therefore is a two-phase strategy:

If the payload length of the DataAck in PARTOPEN is too large, an Ack is sent
to carry the options, and the feature-negotiation list is then flushed.

This means that the server gets two Acks for one Response. If both Acks get
lost, it is probably better to restart the connection anyway and devising yet
another special-case does not seem worth the extra complexity.

The result is a higher utilisation of the available packet space for the data
transmission phase (established state) of a connection.

The patch (over-)estimates the initial overhead to be 32*4 bytes -- commonly
seen values were around 90 bytes for initial feature-negotiation options.

It uses sizeof(u32) to mean "aligned units of 4 bytes".
For consistency, another use of 4-byte alignment is adapted.
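The two-phase decision can be sketched as a single predicate. Assumptions: `need_separate_ack()` is a hypothetical name, and `mps` here stands for the payload space of an established-state packet; the 32 * 4 = 128 byte estimate is taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Over-estimate of the feature-negotiation option space in PARTOPEN:
 * 32 aligned 4-byte units, i.e. 128 bytes. */
#define INIT_OPT_OVERHEAD (32 * sizeof(uint32_t))

/* If payload plus pending options would exceed the MPS, the options are
 * sent first on a separate Ack (and the negotiation list flushed), so
 * the data only has to fit the normal established-state MPS. */
bool need_separate_ack(size_t payload_len, size_t mps)
{
	return payload_len + INIT_OPT_OVERHEAD > mps;
}
```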

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
f3f3abb62ccb1a1c77bcce855c04e12356e6ac95 17-Jan-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Debugging functions for feature negotiation

Since all feature-negotiation processing now takes place in feat.c,
functions for producing verbose debugging output are concentrated
there.

New functions to print out values, entry records, and options are
provided, and also a macro is defined to not always have the function
name in the output line.

Thanks a lot to Wei Yongjun and Giuseppe Galeota for help and
discussion with an earlier revision of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
883ca833e5fb814fb03426c9d35e5489ce43e8da 17-Jan-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Initialisation and type-checking of feature sysctls

This patch takes care of initialising and type-checking sysctls
related to feature negotiation. Type checking is important since some
of the sysctls now directly impact the feature-negotiation process.

The sysctls are initialised with the known default values for each
feature. For the type-checking the value constraints from RFC 4340
are used:

* Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
tested and confirmed that it works up to 4294967295 - for Gbps speed;
* Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
* CCIDs are between 0 .. 255;
* request_retries, retries1, retries2 also between 0..255 for good measure;
* tx_qlen is checked to be non-negative;
* sync_ratelimit remains as before.
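The constraints in the list reduce to inclusive range checks. In the kernel such checks are typically delegated to sysctl handlers such as proc_dointvec_minmax(); the standalone sketch below uses a hypothetical `struct range` and `in_range()` only to make the bounds explicit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive bounds mirroring the constraints listed above. */
struct range { int64_t min, max; };

static const struct range seq_window_range = { 32, 4294967295LL };
static const struct range ack_ratio_range  = { 0, 0xffff };
static const struct range u8_range         = { 0, 255 };   /* CCIDs, retries */
static const struct range tx_qlen_range    = { 0, INT64_MAX };

/* Accept v only if it lies within [r->min, r->max]. */
bool in_range(const struct range *r, int64_t v)
{
	return v >= r->min && v <= r->max;
}
```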

Notes:
------
1. Die s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.
2. As pointed out by Arnaldo, the pattern of type-checking repeats itself in
other places, sometimes with exactly the same kind of definitions (e.g.
"static int zero;"). It may be a good idea (kernel janitors?) to consolidate
type checking. For the sake of keeping the changeset small and in order not
to affect other subsystems, I have not strived to generalise here.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
792b48780e8b6435d017cef4b5c304876a48653e 17-Jan-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Implement both feature-local and feature-remote Sequence Window feature

This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.

Specifically, the following is contained in this patch:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
ddebc973c56b51b4e5d84d606f0430d81b895d67 05-Jan-2009 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Lockless integration of CCID congestion-control plugins

Based on Arnaldo's earlier patch, this patch integrates the standardised
CCID congestion control plugins (CCID-2 and CCID-3) of DCCP with dccp.ko:

* enables a faster connection path by eliminating the need to always go
through the CCID registration lock;

* updates the implementation to use only a single array whose size equals
the number of configured CCIDs instead of the maximum (256);

* since the CCIDs are now fixed array elements, synchronization is no
longer needed, simplifying use and implementation.

CCID-2 is suggested as minimum for a basic DCCP implementation (RFC 4340, 10);
CCID-3 is a standards-track CCID supported by RFC 4342 and RFC 5348.
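A fixed, configuration-sized array that is never modified after build time can be read without a lock. A hypothetical sketch (`struct ccid_ops` is trimmed to two fields, and `ccid_by_id()` is an illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down CCID descriptor. */
struct ccid_ops {
	uint8_t     id;
	const char *name;
};

static const struct ccid_ops ccid2_ops = { 2, "TCP-like" };
static const struct ccid_ops ccid3_ops = { 3, "TFRC" };

/* Only the built-in CCIDs: sized by configuration rather than by the 256
 * possible CCID values, and read-only at runtime, so no lock is needed. */
static const struct ccid_ops *const ccids[] = { &ccid2_ops, &ccid3_ops };

const struct ccid_ops *ccid_by_id(uint8_t id)
{
	for (size_t i = 0; i < sizeof(ccids) / sizeof(ccids[0]); i++)
		if (ccids[i]->id == id)
			return ccids[i];
	return NULL;
}
```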

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
6fdd34d43bff8be9bb925b49d87a0ee144d2ab07 08-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Phase out the use of boolean Ack Vector sysctl

This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, as it now is handled fully dynamically via feature negotiation
(i.e. when CCID-2 is enabled, Ack Vectors are automatically enabled as per
RFC 4341, 4.).

Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type

	if (dccp_msk(sk)->dccpms_send_ack_vector)
		/* ... */
with
	if (dp->dccps_hc_rx_ackvec != NULL)
		/* ... */

The dccps_hc_rx_ackvec is allocated by the dccp_hdlr_ackvec() when feature
negotiation concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.

The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it cannot be used for anything useful - to detect loss, for instance, only
packets received after the loss can serve as pseudo-dupacks.

There was a FIXME to change the error code when dccp_ackvec_add() fails.
I removed this after finding out that:
* the check whether ackno < ISN is already made earlier,
* this Response is likely the 1st packet with an Ackno that the client gets,
* so when dccp_ackvec_add() fails, the reason is likely not a packet error.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
4098dce5be537a157eed4a326efd464109825b8b 08-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Remove manual influence on NDP Count feature

Updating the NDP count feature is handled automatically now:
* for CCID-2 it is disabled, since the code does not use NDP counts;
* for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

Allowing the user to change NDP values leads to unpredictable and failing
behaviour, since it is then possible to disable NDP counts even when they
are needed (e.g. in CCID-3).

This means that only those user settings are sensible that agree with the
values for Send NDP Count implied by the choice of CCID. But those settings
are already activated by the feature negotiation (CCID dependency tracking),
hence this form of support is redundant.

At startup the initialisation of the NDP count feature uses the default
value of 0, which is done implicitly by the zeroing-out of the socket when
it is allocated. If the choice of CCID or feature negotiation enables NDP
count, this will then be updated via the NDP activation handler.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
422d9cdcb85b3622d08a590fed66021af7aea333 02-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Feature activation handlers

This patch provides the post-processing of feature negotiation state, after
the negotiation has completed.

To this purpose, handlers are used and added to the dccp_feat_table. Each
handler is passed a boolean flag whether the RX or TX side of the feature
is meant.

Several handlers are provided already, new handlers can easily be added.
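A handler table indexed by feature number, with each handler told whether the RX or TX side is meant, might look like the sketch below. Assumptions: `struct conn`, the handlers and `feat_activate()` are hypothetical; feature numbers 3 (Sequence Window) and 7 (Send NDP Count) are from RFC 4340; the rx == false meaning "local" follows the Sequence Window note elsewhere in this log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state touched by the handlers. */
struct conn {
	uint64_t seq_win_local, seq_win_remote;
	bool     send_ndp_count;
};

/* A handler gets the negotiated value and whether RX or TX is meant. */
typedef void (*feat_hdlr)(struct conn *c, uint64_t val, bool rx);

static void hdlr_seq_win(struct conn *c, uint64_t val, bool rx)
{
	if (rx)
		c->seq_win_remote = val;
	else
		c->seq_win_local = val;   /* TX side: the local feature */
}

static void hdlr_ndp(struct conn *c, uint64_t val, bool rx)
{
	(void)rx;                         /* side ignored in this sketch */
	c->send_ndp_count = (val != 0);
}

/* Indexed by feature number; features without an entry have no handler. */
static const feat_hdlr feat_hdlrs[] = {
	[3] = hdlr_seq_win,   /* Sequence Window */
	[7] = hdlr_ndp,       /* Send NDP Count */
};

/* Run the activation handler once negotiation of `feat` has completed. */
void feat_activate(struct conn *c, uint8_t feat, uint64_t val, bool rx)
{
	if (feat < sizeof(feat_hdlrs) / sizeof(feat_hdlrs[0]) && feat_hdlrs[feat])
		feat_hdlrs[feat](c, val, rx);
}
```

Adding a handler only requires a new table entry, which matches the "new handlers can easily be added" design.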

The initialisation is now fully dynamic, i.e. CCIDs are activated only
after the feature negotiation. The integration of this dynamic activation
is done in the subsequent patches.

Thanks to Wei Yongjun for pointing out the necessity of skipping over empty
Confirm options while copying the negotiated feature values.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
0971d17ca3d80f61863f4750091a64448bf91600 02-Dec-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Insert feature-negotiation options into skb

This patch replaces the earlier insertion routine from options.c, so that
code specific to feature negotiation can remain in feat.c. This is possible
by calling a function already existing in options.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
dd24c00191d5e4a1ae896aafe33c6b8095ab4bd1 26-Nov-2008 Eric Dumazet <dada1@cosmosbay.com> net: Use a percpu_counter for orphan_count

Instead of using one atomic_t per protocol, use a percpu_counter
for "orphan_count", to reduce cache line contention on
heavy duty network servers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dd9c0e363cef32b7d6f23d4c87e8dfe4f91fd1c5 17-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Deprecate Ack Ratio sysctl

This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it worked in CCID-2, there is no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since it waits for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.

The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.

With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.

Thanks to Tomasz Grobelny for discussion leading up to this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0c1168398ecbfacbb27203b281bde20ec9f78017 17-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Mechanism to resolve CCID dependencies

This adds a hook to resolve features whose value depends on the choice of
CCID. It is done at the server since it can only be done after the CCID
values have been negotiated; i.e. the client will add its CCID preference
list on the Change options sent in the Request, which will be reconciled
with the local preference list of the server.

The concept is documented on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
implementation_notes.html#ccid_dependencies

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
9eca0a47dee201a73967026985b5f0a79a46bd36 12-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Resolve dependencies of features on choice of CCID

This provides a missing link in the code chain, as several features implicitly
depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector
feature, but also Ack Ratio and Send Loss Event Rate (also taken care of).

For Send Ack Vector, the situation is as follows:
* since CCID2 mandates the use of Ack Vectors, there is no point in allowing
endpoints which use CCID2 to disable Ack Vector features on such a connection;

* a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer
with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4);

* for all other CCIDs, the use of (Send) Ack Vector is optional and thus
negotiable. However, this implies that the code negotiating the use of Ack
Vectors also supports it (i.e. is able to supply and to either parse or
ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack
Vector support), the use of Ack Vectors is here disabled, with a comment
in the source code.

An analogous consideration arises for the Send Loss Event Rate feature,
since the CCID-3 implementation does not support the loss interval options
of RFC 4342. To make such use explicit, corresponding feature-negotiation
options are inserted which signal the use of the loss event rate option,
as it is used by the CCID3 code.

Lastly, the values of the Ack Ratio feature are matched to the choice of CCID.

The patch implements this as a function which is called after the user has
made all other registrations for changing default values of features.

The table is variable-length, the reserved (and hence for feature-negotiation
invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0'
is used to mark the end of the table.
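A variable-length table terminated by the reserved feature number 0 can be sketched as below. This is illustrative, not the kernel's table: feature numbers 6 (Send Ack Vector) and 7 (Send NDP Count) are from RFC 4340, 192 (Send Loss Event Rate) is assumed from RFC 4342, and the values follow the dependencies described in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per feature whose value follows from the CCID choice. */
struct ccid_dep {
	uint8_t  feat;   /* feature number; 0 terminates the table */
	uint64_t val;
};

static const struct ccid_dep ccid2_deps[] = {
	{ 6, 1 },     /* Send Ack Vector: mandated by CCID-2 */
	{ 7, 0 },     /* Send NDP Count: unused by CCID-2 */
	{ 0, 0 },     /* end of table */
};

static const struct ccid_dep ccid3_deps[] = {
	{ 6, 0 },     /* no Ack Vector support in CCID-3 */
	{ 7, 1 },     /* NDP counts determine loss lengths */
	{ 192, 1 },   /* Send Loss Event Rate */
	{ 0, 0 },     /* end of table */
};

/* Count entries up to, but not including, the terminating feature 0. */
size_t dep_count(const struct ccid_dep *d)
{
	size_t n = 0;

	while (d[n].feat != 0)
		n++;
	return n;
}
```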

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac75773c2742d82cbcb078708df406e9017224b7 05-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Per-socket initialisation of feature negotiation

This provides feature-negotiation initialisation for both DCCP sockets
and DCCP request_sockets, to support feature negotiation during
connection setup.

It also resolves a FIXME regarding the congestion control
initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
61e6473efbd6087e1db3aaa93a5266c5bfd8aa99 05-Nov-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: List management for new feature negotiation

This adds initial list fields and list management functions for the
new feature negotiation implementation.

Thanks to Arnaldo for suggestions and improvements.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
410e27a49bb98bc7fa3ff5fc05cc313817b9f253 09-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> This reverts "Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/dccp_exp"
as it accidentally contained the wrong set of patches. These will be
submitted separately.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
49ffc29a0223adbe0ea7005eea3ab2a03abbeb06 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Clamping RTT values

This extracts the clamping part of dccp_sample_rtt() and makes it available
to other parts of the code (as e.g. used in the next patch).

Note: The function dccp_sample_rtt() now reduces to subtracting the elapsed
time. This could be eliminated but would require shorter prefixes and thus
is not done by this patch - maybe an idea for later.
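Separating the clamp from sampling leaves sampling as "subtract the elapsed time, then clamp". The bounds below (100 us and 3 s) are assumed values for this sketch, and `clamp_rtt()`/`sample_rtt()` are hypothetical names.

```c
#include <stdint.h>

/* Assumed sanity bounds in microseconds. */
#define SANE_RTT_MIN_US 100u
#define SANE_RTT_MAX_US (3u * 1000000u)

/* Clamp an RTT sample into [SANE_RTT_MIN_US, SANE_RTT_MAX_US]. */
uint32_t clamp_rtt(uint32_t rtt_us)
{
	if (rtt_us < SANE_RTT_MIN_US)
		return SANE_RTT_MIN_US;
	if (rtt_us > SANE_RTT_MAX_US)
		return SANE_RTT_MAX_US;
	return rtt_us;
}

/* Subtract the peer's reported elapsed time from the raw delta; an
 * unusable (zero or negative) result is clamped up to the minimum. */
uint32_t sample_rtt(uint32_t delta_us, uint32_t elapsed_us)
{
	uint32_t rtt = delta_us > elapsed_us ? delta_us - elapsed_us : 0;

	return clamp_rtt(rtt);
}
```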

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
f76fd327a8b32d3ad5b51639faf6f54d18be0981 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-3: Runtime verification of timer resolution

The DCCP base time resolution is 10 microseconds (RFC 4340, 13.1 ... 13.3).

Using a timer with a lower resolution was found to trigger the following
bug warnings/problems on high-speed networks (e.g. local loopback):
* RTT samples are rounded down to 0 if below resolution;
* in some cases, negative RTT samples were observed;
* the CCID-3 feedback timer complains that the feedback interval is 0,
since the feedback interval is in the order of 1 RTT or less and RTT
measurement rounded this down to 0;
On an Intel computer this will for instance happen when using a
boot-time parameter of "clocksource=jiffies".

The following system log messages were observed:
11:24:00 kernel: BUG: delta (0) <= 0 at ccid3_hc_rx_send_feedback()
11:26:12 kernel: BUG: delta (0) <= 0 at ccid3_hc_rx_send_feedback()
11:26:30 kernel: dccp_sample_rtt: unusable RTT sample 0, using min
11:26:30 last message repeated 5 times

This patch defines a global constant for the time resolution, adds this in
timer.c, and checks the available clock resolution at CCID-3 module load time.

When the resolution is worse than 10 microseconds, module loading exits with
a message "socket type not supported".
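The load-time check can be illustrated with a userspace analogue. This sketch uses the POSIX clock_getres() on CLOCK_MONOTONIC as a stand-in for the kernel's own clock source query; the function names are hypothetical, and only the 10-microsecond threshold comes from the text above:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <stdbool.h>
#include <time.h>

/* DCCP base time resolution: 10 microseconds (RFC 4340, 13.1 ... 13.3). */
#define DCCP_TIME_RESOLUTION_NS 10000L

/* True if a clock with the given resolution (in ns) is fine enough. */
static bool resolution_ok(long res_ns)
{
	return res_ns > 0 && res_ns <= DCCP_TIME_RESOLUTION_NS;
}

/* Userspace analogue of the module-load check: query the resolution of
 * CLOCK_MONOTONIC and refuse to proceed if it is too coarse. */
static bool monotonic_clock_ok(void)
{
	struct timespec res;

	if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
		return false;
	if (res.tv_sec != 0)      /* coarser than one second */
		return false;
	return resolution_ok(res.tv_nsec);
}
```

A jiffies-based clock at, say, HZ=250 has a 4 ms resolution and would fail this test.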

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
7d1af6a8d935678248d057564e75e1452409a53c 04-Sep-2008 Tomasz Grobelny <tomasz@grobelny.oswiecenia.net> dccp qpolicy: Parameter checking of cmsg qpolicy parameters

Ensure that cmsg->cmsg_type value is valid for qpolicy
that is currently in use.

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
d6da3511d6b558d0b017777b61dc08b8fbc06ea4 04-Sep-2008 Tomasz Grobelny <tomasz@grobelny.oswiecenia.net> dccp: Policy-based packet dequeueing infrastructure

This patch adds a generic infrastructure for policy-based dequeueing of
TX packets and provides two policies:
* a simple FIFO policy (which is the default) and
* a priority based policy (set via socket options).
Both policies honour the tx_qlen sysctl for the maximum size of the write
queue (can be overridden via socket options).

The priority policy uses skb->priority internally to assign a u32 priority
identifier, using the same ranking as SO_PRIORITY. The skb->priority field
is set to 0 when the packet leaves DCCP. The priority is supplied as ancillary
data using cmsg(3), the patch also provides the requisite parsing routines.
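To illustrate the difference between the two policies, here is a simplified sketch on a fixed-size array rather than an skb queue. The types and function names are hypothetical; it assumes a higher value means higher priority (as with SO_PRIORITY) and simply refuses new packets once tx_qlen is reached, so it does not show how the real policies treat a full queue:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QCAP 16  /* hypothetical backing-array capacity */

struct pkt { int id; uint32_t prio; };

struct txq {
	struct pkt q[QCAP];  /* packets in arrival order */
	size_t len;          /* number of queued packets */
	size_t tx_qlen;      /* maximum queue length (<= QCAP) */
};

static bool txq_enqueue(struct txq *t, struct pkt p)
{
	if (t->len >= t->tx_qlen)
		return false;  /* queue full: caller must wait or drop */
	t->q[t->len++] = p;
	return true;
}

/* Remove entry i, keeping the remaining packets in arrival order. */
static void txq_remove(struct txq *t, size_t i, struct pkt *out)
{
	*out = t->q[i];
	for (; i + 1 < t->len; i++)
		t->q[i] = t->q[i + 1];
	t->len--;
}

/* FIFO policy: always release the oldest packet. */
static bool fifo_dequeue(struct txq *t, struct pkt *out)
{
	if (t->len == 0)
		return false;
	txq_remove(t, 0, out);
	return true;
}

/* Priority policy: release the highest-priority packet; ties go to
 * the older packet. */
static bool prio_dequeue(struct txq *t, struct pkt *out)
{
	size_t i, best = 0;

	if (t->len == 0)
		return false;
	for (i = 1; i < t->len; i++)
		if (t->q[i].prio > t->q[best].prio)
			best = i;
	txq_remove(t, best, out);
	return true;
}
```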

Signed-off-by: Tomasz Grobelny <tomasz@grobelny.oswiecenia.net>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
146993cf5174472644ed11bd5fb539f0af8bfa49 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Refine the wait-for-ccid mechanism

This extends the existing wait-for-ccid routine so that it may be used with
different types of CCID. It further addresses the problems listed below.

The code checks whether the write queue is non-empty and, if so, grants the TX CCID up to
`timeout' jiffies to drain the queue. It will instead purge that queue if
* the delay suggested by the CCID exceeds the time budget;
* a socket error occurred while waiting for the CCID;
* there is a signal pending (eg. annoyed user pressed Control-C);
* the CCID does not support delays (we don't know how long it will take).


D e t a i l s [can be removed]
-------------------------------
DCCP's sending mechanism functions a bit like non-blocking I/O: dccp_sendmsg()
will enqueue up to net.dccp.default.tx_qlen packets (default=5), without waiting
for them to be released to the network.

Rate-based CCIDs, such as CCID3/4, can impose sending delays of up to 64 seconds
(t_mbi in RFC 3448). Hence the write queue may still contain packets
when the application closes. Since the write queue is congestion-controlled by
the CCID, draining the queue is also under control of the CCID.

There are several problems that needed to be addressed:
1) The queue-drain mechanism only works with rate-based CCIDs. If CCID2 for
example has a full TX queue and becomes network-limited just as the
application wants to close, then waiting for CCID2 to become unblocked could
lead to an indefinite delay (i.e., application "hangs").
2) Since each TX CCID in turn uses a feedback mechanism, there may be changes
in its sending policy while the queue is being drained. This can lead to
further delays during which the application will not be able to terminate.
3) The minimum wait time for CCID3/4 can be expected to be the queue length
times the current inter-packet delay. For example if tx_qlen=100 and a delay
of 15 ms is used for each packet, then the application would have to wait
for a minimum of 1.5 seconds before being allowed to exit.
4) There is no way for the user/application to control this behaviour. It would
be good to use the timeout argument of dccp_close() as an upper bound. Then
the maximum time that an application is willing to wait for its CCIDs could
be set via the SO_LINGER option.

These problems are addressed by giving the CCID a grace period of up to the
`timeout' value.

The wait-for-ccid function is, as before, used when the application
(a) has read all the data in its receive buffer and
(b) if SO_LINGER was set with a non-zero linger time, or
(c) the socket is either in the OPEN (active close) or in the PASSIVE_CLOSEREQ
state (client application closes after receiving CloseReq).

In addition, there is a catch-all case by calling __skb_queue_purge() after
waiting for the CCID. This is necessary since the write queue may still have
data when
(a) the host has been passively-closed,
(b) abnormal termination (unread data, zero linger time),
(c) wait-for-ccid could not finish within the given time limit.
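The purge conditions listed at the top of this message can be condensed into a per-iteration decision. This is a hypothetical sketch (the enum, struct and function names are illustrative, times are in milliseconds, and a negative CCID delay stands for "CCID does not support delays"); it also includes the drain-time lower bound from problem 3:

```c
#include <stdbool.h>

enum wait_action { WAIT_SEND_NOW, WAIT_SLEEP, WAIT_PURGE };

struct wait_state {
	long remaining_ms;    /* what is left of the `timeout' budget */
	long ccid_delay_ms;   /* delay suggested by the TX CCID, < 0 if unknown */
	bool sock_error;      /* a socket error occurred while waiting */
	bool signal_pending;  /* e.g. the user pressed Control-C */
};

static enum wait_action wait_for_ccid_step(const struct wait_state *w)
{
	if (w->sock_error || w->signal_pending)
		return WAIT_PURGE;
	if (w->ccid_delay_ms < 0)        /* CCID cannot say how long it takes */
		return WAIT_PURGE;
	if (w->ccid_delay_ms == 0)       /* the CCID allows sending right now */
		return WAIT_SEND_NOW;
	if (w->ccid_delay_ms > w->remaining_ms)
		return WAIT_PURGE;       /* delay exceeds the time budget */
	return WAIT_SLEEP;
}

/* Lower bound on the drain time for a rate-based CCID: queue length
 * times the inter-packet delay, e.g. 100 packets * 15 ms = 1.5 s. */
static long min_drain_ms(long tx_qlen, long ipi_ms)
{
	return tx_qlen * ipi_ms;
}
```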

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
d7dc7e5f49299739e610ea8febf9ea91a4dc1ae9 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Implementation of circular Ack Vector buffer with overflow handling

This completes the implementation of a circular buffer for Ack Vectors, by
extending the current (linear array-based) implementation. The changes are:

(a) An `overflow' flag to deal with the case of overflow. As before, dynamic
growth of the buffer will not be supported; but code will be added to deal
robustly with overflowing Ack Vector buffers.

(b) A `tail_seqno' field. When naively implementing the algorithm of Appendix A
in RFC 4340, problems arise whenever subsequent Ack Vector records overlap,
which can bring the entire run length calculation completely out of synch.
(This is documented on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/\
ack_vectors/tracking_tail_ackno/ .)
(c) The buffer length is now computed dynamically (i.e. current fill level),
as the span from head to tail.

As a result, dccp_ackvec_pending() is now simpler - the #ifdef is no longer
necessary since buf_empty is always true when IP_DCCP_ACKVEC is not configured.

Note on overflow handling:
-------------------------
The Ack Vector code previously simply started to drop packets when the
Ack Vector buffer overflowed. This means that the userspace application
will not be able to receive, only because of an Ack Vector storage problem.

Furthermore, overflow may be transient, so that applications may later
recover from the overflow. Recovering from dropped packets is more difficult
(e.g. video key frames).

Hence the patch uses a different policy: when the buffer overflows, the oldest
entries are subsequently overwritten. This has a higher chance of recovery.
Details are on http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/ack_vectors/
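A generic sketch of the overwrite-oldest policy with a dynamically computed fill level. It stores plain bytes in a tiny buffer and uses hypothetical names; the real Ack Vector buffer stores run-length encoded states and sequence numbers and is laid out differently:

```c
#include <stdbool.h>
#include <stdint.h>

#define AV_LEN 8  /* hypothetical, tiny buffer size for illustration */

struct ackvec_buf {
	uint8_t  buf[AV_LEN];
	unsigned head;      /* next slot to write */
	unsigned tail;      /* slot holding the oldest valid entry */
	bool     overflow;  /* set once every slot holds a valid entry; from
			     * then on each new entry overwrites the oldest */
};

/* Current fill level: the span from tail to head, or the whole buffer
 * once it has wrapped around. */
static unsigned av_buflen(const struct ackvec_buf *av)
{
	if (av->overflow)
		return AV_LEN;
	return (av->head + AV_LEN - av->tail) % AV_LEN;
}

/* Append an entry. Instead of refusing (and dropping the packet) when
 * the buffer is full, the oldest entry is overwritten. */
static void av_add(struct ackvec_buf *av, uint8_t state)
{
	av->buf[av->head] = state;
	av->head = (av->head + 1) % AV_LEN;
	if (av->overflow || av->head == av->tail) {
		/* Buffer is full: the oldest entry now sits at head. */
		av->overflow = true;
		av->tail = av->head;
	}
}
```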

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
bfbddd085a5bced6efb9e1bc4d029438f9639784 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Fix the adjustments to AWL and SWL

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: the problem is that the initial adjustments to AWL/SWL were
only performed at the beginning of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore the consequence is to perform this safety check each time SWL/AWL
are updated.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection;
the changes added by this patch ensure that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).
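A minimal sketch of "perform the safety check each time SWL is updated", using circular 48-bit arithmetic and the window formulas of RFC 4340, 7.5.1 with integer division. The struct, helpers and function name are hypothetical and are not the dccp.h definitions:

```c
#include <stdint.h>

/* DCCP sequence numbers are 48 bits wide. */
#define SEQ48_MASK ((UINT64_C(1) << 48) - 1)

static uint64_t add48(uint64_t a, uint64_t b) { return (a + b) & SEQ48_MASK; }
static uint64_t sub48(uint64_t a, uint64_t b) { return (a - b) & SEQ48_MASK; }

/* a comes before b if (a - b) mod 2^48 lies in the upper half. */
static int before48(uint64_t a, uint64_t b)
{
	return (sub48(a, b) & (UINT64_C(1) << 47)) != 0;
}

struct seq_state {
	uint64_t gsr, isr;   /* greatest / initial sequence number received */
	uint64_t swl, swh;   /* sequence-number validity window */
	uint64_t r_seq_win;  /* feature-remote Sequence Window W */
};

static void update_gsr(struct seq_state *s, uint64_t seq)
{
	if (before48(s->gsr, seq))
		s->gsr = seq;
	s->swl = sub48(add48(s->gsr, 1), s->r_seq_win / 4);
	/* Applied on every update, not only during the handshake, so that
	 * SWL never falls below ISR even if W changes mid-connection. */
	if (before48(s->swl, s->isr))
		s->swl = s->isr;
	s->swh = add48(s->gsr, (3 * s->r_seq_win) / 4);
}
```

With W = 100 and ISR = 990, receiving seqno 1000 gives SWL = max(1001 - 25, 990) = 990 and SWH = 1075.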

A detailed rationale is below -- can be removed from the commit message.


1. Server sequence number checks during initial handshake
---------------------------------------------------------
The server can not use the fields of the listening socket for seqno/ackno checks
and thus needs to store all relevant information on a per-connection basis on
the dccp_request socket. This is a size-constrained structure and has currently
only ISS (dreq_iss) and ISR (dreq_isr) defined.
Adding further fields (SW{L,H}, AW{L,H}) would increase the size of the struct
and it is questionable whether this will have any practical gain. The currently
implemented solution is as follows.
* receiving first Request: dccp_v{4,6}_conn_request sets
ISR := P.seqno, ISS := dccp_v{4,6}_init_sequence()

* sending first Response: dccp_v{4,6}_send_response via dccp_make_response()
sets P.seqno := ISS, sets P.ackno := ISR

* receiving retransmitted Request: dccp_check_req() overrides ISR := P.seqno

* answering retransmitted Request: dccp_make_response() sets ISS += 1,
otherwise as per first Response

* completing the handshake: succeeds in dccp_check_req() for the first Ack
where P.ackno == ISS (P.seqno is not tested)

* creating child socket: ISS, ISR are copied from the request_sock

This solution will succeed whenever the server can receive the Request and the
subsequent Ack in succession, without retransmissions. If there is packet loss,
the client needs to retransmit until this condition succeeds; it will otherwise
eventually give up. Adding further fields to the request_sock could increase
the robustness a bit, in that it would make it possible to let a reordered Ack
(from a retransmitted Response) pass. The argument against such a solution is
that if the packet loss is not persistent and an Ack gets through, why not
wait for the one answering the original response: if the loss is persistent, it
is probably better to not start the connection in the first place.

Long story short: the present design (by Arnaldo) is simple and will likely work
just as well as a more complicated solution. As a consequence, {A,S}W{L,H} are
not needed until the moment the request_sock is cloned into the accept queue.

At that stage feature negotiation has completed, so that the values for the local
and remote Sequence Window feature (7.5.2) are known, i.e. we are now in a better
position to compute {A,S}W{L,H}.


2. Client sequence number checks during initial handshake
---------------------------------------------------------
Until entering PARTOPEN the client does not need the adjustments, since it
constrains the Ack window to the packet it sent.

* sending first Request: dccp_v{4,6}_connect() chooses ISS,
dccp_connect() then sets GAR := ISS (as per 8.5),
dccp_transmit_skb() (with the previous bug fix) sets
GSS := ISS, AWL := ISS, AWH := GSS
* n-th retransmitted Request (with previous patch):
dccp_retransmit_skb() via timer calls
dccp_transmit_skb(), which sets GSS := ISS+n
and then AWL := ISS, AWH := ISS+n

* receiving any Response: dccp_rcv_request_sent_state_process()
-- accepts packet if AWL <= P.ackno <= AWH;
-- sets GSR = ISR = P.seqno

* sending the Ack completing the handshake: dccp_send_ack() calls
dccp_transmit_skb(), which sets GSS += 1
and AWL := ISS, AWH := GSS
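The client-side acceptance test for a Response can be sketched as a range check in circular 48-bit space. The helper names are hypothetical; the sketch assumes n retransmissions, so that GSS = ISS + n, AWL = ISS and AWH = GSS as listed above:

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ48_MASK ((UINT64_C(1) << 48) - 1)

static uint64_t add48(uint64_t a, uint64_t b) { return (a + b) & SEQ48_MASK; }
static uint64_t sub48(uint64_t a, uint64_t b) { return (a - b) & SEQ48_MASK; }

/* lo <= x <= hi in circular 48-bit sequence space. */
static bool between48(uint64_t x, uint64_t lo, uint64_t hi)
{
	return sub48(x, lo) <= sub48(hi, lo);
}

/* Accept a Response only if AWL <= P.ackno <= AWH. */
static bool response_ackno_ok(uint64_t iss, unsigned n, uint64_t ackno)
{
	uint64_t awl = iss;
	uint64_t awh = add48(iss, n);   /* GSS after n retransmissions */

	return between48(ackno, awl, awh);
}
```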


Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
2975abd251d795810932b20354729ba236d95bf9 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Schedule an Ack when receiving timestamps

This schedules an Ack when receiving a timestamp, exploiting the
existing inet_csk_schedule_ack() function, saving one case in the
`dccp_ack_pending()' function.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
88ddac513a4e7e04234214b600401ec22abfbb46 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Special case of the MPS for client-PARTOPEN with DataAcks

To increase robustness, it is necessary to resend Confirm feature-negotiation
options, even though the RFC does not mandate it. But feature negotiation
options can take (much) more room than the options on common DataAck packets.

Instead of always reducing the MPS for a case which only applies to the three
messages sent during the initial handshake, this patch devises a special case:

if the payload length of the DataAck in PARTOPEN is too large, an Ack is sent
to carry the options, and the feature-negotiation list is then flushed.

This means that the server gets two Acks for one Response. If both Acks get
lost, it is probably better to restart the connection anyway and devising yet
another special-case does not seem worth the extra complexity.

The patch (over-)estimates the expected overhead to be 32*4 bytes -- commonly
seen values were 20-90 bytes for initial feature-negotiation options.

It uses sizeof(u32) to mean "aligned units of 4 bytes". For consistency,
another use of sizeof is modified.
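The special case can be expressed as a small predicate. This is a sketch under the assumptions stated above (32 aligned 4-byte units of overhead, i.e. 128 bytes); the constant and function names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Over-estimate of feature-negotiation option overhead:
 * 32 aligned units of 4 bytes = 128 bytes. */
#define FEATNEG_OVERHEAD (32 * sizeof(uint32_t))

/* True if a DataAck in PARTOPEN is too large to also carry the pending
 * Confirm options; the options are then sent on a separate Ack and the
 * feature-negotiation list is flushed. */
static bool partopen_needs_separate_ack(size_t payload_len, size_t mps,
					bool featneg_pending)
{
	size_t cur_mps;

	if (!featneg_pending)
		return false;
	cur_mps = mps > FEATNEG_OVERHEAD ? mps - FEATNEG_OVERHEAD : 0;
	return payload_len > cur_mps;
}
```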

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
624a965a93610152b10c73d050ed44812efa8abe 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Support for the exchange of NN options in established state

In contrast to static feature negotiation at the beginning of a connection, which
establishes the capabilities of both endpoints, this patch introduces support
for dynamic exchange of feature negotiation options.

Such a dynamic exchange is necessary in at least two cases:
* CCID-2's Ack Ratio (RFC 4341, 6.1.2) which changes during the connection;
* Sequence Window values that, as per RFC 4340, 7.5.2, should be sent "as
the connection progresses".

Both are NN (non-negotiable) features. Hence dynamic feature "negotiation" is
distinguished from static/pre-connection negotiation by the following:
* no new capabilities are negotiated (those that matter for the connection
are negotiated prior to setting up the connection, comparable to SIP);
* features must be understood by each endpoint: as per RFC 4340, 6.4,
Sequence Window is "Req'd" and Ack Ratio must be understood when CCID-2
is used as per the note underneath Table 4.

These characteristics are reflected in the implementation:
* only NN options can be exchanged after connection setup;
* NN options are activated directly after validating them. The rationale is
that a peer must accept every valid NN value (RFC 4340, 6.3.2), hence it
will either accept the value and send a "Confirm R", or it will send an
empty Confirm (which will reset the connection according to FN rules).
* An Ack is scheduled directly after activation to accelerate communicating
the update to the peer.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
76f738a7950b559a23ab3c692c99a02f35a54f7f 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Debugging functions for feature negotiation

Since all feature-negotiation processing now takes place in feat.c, functions
for producing verbose debugging output are concentrated there.

New functions to print out values, entry records, and options are provided,
and also a macro is defined to not always have the function name in the
output line.

Thanks a lot to Wei Yongjun and Giuseppe Galeota for help with errors in an
earlier revision of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
0a4822679d94e2b0117aeead06a19fad59533905 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Initialisation and type-checking of feature sysctls

This patch takes care of initialising and type-checking sysctls related to
feature negotiation. Type checking is important since some of the sysctls
now directly act on the feature-negotiation process.

The sysctls are initialised with the known default values for each feature.
For the type-checking the value constraints from RFC 4340 are used:

* Sequence Window uses the specified Wmin=32, the maximum is ulong (4 bytes),
tested and confirmed that it works up to 4294967295 - for Gbps speed;
* Ack Ratio is between 0 .. 0xffff (2-byte unsigned integer);
* CCIDs are between 0 .. 255;
* request_retries, retries1, retries2 also between 0..255 for good measure;
* tx_qlen is checked to be non-negative;
* sync_ratelimit remains as before.
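A table-driven sketch of the range checks listed above. The sysctl names are modelled on the net.dccp.default.* entries but should be treated as illustrative, and the table and function are hypothetical (the kernel wires such bounds into its sysctl handlers instead). long long is used so that 4294967295 fits on any platform:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sysctl_bounds {
	const char *name;
	long long min, max;
};

static const struct sysctl_bounds dccp_bounds[] = {
	{ "seq_window",      32, 4294967295LL },  /* Wmin=32 .. 4 bytes */
	{ "ack_ratio",        0, 0xffff },        /* 2-byte unsigned */
	{ "rx_ccid",          0, 255 },
	{ "tx_ccid",          0, 255 },
	{ "request_retries",  0, 255 },
	{ "retries1",         0, 255 },
	{ "retries2",         0, 255 },
	{ "tx_qlen",          0, LLONG_MAX },     /* non-negative */
};

static bool sysctl_value_ok(const char *name, long long val)
{
	size_t i;

	for (i = 0; i < sizeof(dccp_bounds) / sizeof(dccp_bounds[0]); i++)
		if (strcmp(dccp_bounds[i].name, name) == 0)
			return val >= dccp_bounds[i].min &&
			       val <= dccp_bounds[i].max;
	return false;  /* unknown sysctl */
}
```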

Further changes:
----------------
Performed s@sysctl_dccp_feat@sysctl_dccp@g since the sysctls are now in feat.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
51c7d4fa2675c106a980ddcdbe308b54b5151945 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Implement both feature-local and feature-remote Sequence Window feature

This adds full support for local/remote Sequence Window feature, from which the
* sequence-number-validity (W) and
* acknowledgment-number-validity (W') windows
derive as specified in RFC 4340, 7.5.3.

Specifically, the following changes are introduced:
* integrated new socket fields into dccp_sk;
* updated the update_gsr/gss routines with regard to these fields;
* updated handler code: the Sequence Window feature is located at the TX side,
so the local feature is meant if the handler-rx flag is false;
* the initialisation of `rcv_wnd' in reqsk is removed, since
- rcv_wnd is not used by the code anywhere;
- sequence number checks are not done in the LISTEN state (cf. 7.5.3);
- dccp_check_req checks the Ack number validity more rigorously;
* the `struct dccp_minisock' became empty and is now removed.

Until the handshake completes with activating negotiated values, the local/remote
Sequence-Window values are undefined and thus can not reliably be estimated.
This issue is addressed in a separate patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
b235dc4abbc1356284bd0dc730efa711f394e0e2 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-2: Phase out the use of boolean Ack Vector sysctl

This removes the use of the sysctl and the minisock variable for the Send Ack
Vector feature, which is now handled fully dynamically via feature negotiation;
i.e. when CCID2 is enabled, Ack Vectors are automatically enabled (as per
RFC 4341, 4.).

Using a sysctl in parallel to this implementation would open the door to
crashes, since much of the code relies on tests of the boolean minisock /
sysctl variable. Thus, this patch replaces all tests of type

	if (dccp_msk(sk)->dccpms_send_ack_vector)
		/* ... */
with
	if (dp->dccps_hc_rx_ackvec != NULL)
		/* ... */

The dccps_hc_rx_ackvec is allocated by dccp_hdlr_ackvec() when feature
negotiation has concluded that Ack Vectors are to be used on the half-connection.
Otherwise, it is NULL (due to dccp_init_sock/dccp_create_openreq_child),
so that the test is a valid one.

The activation handler for Ack Vectors is called as soon as the feature
negotiation has concluded at the
* server when the Ack marking the transition RESPOND => OPEN arrives;
* client after it has sent its ACK, marking the transition REQUEST => PARTOPEN.

Adding the sequence number of the Response packet to the Ack Vector has been
removed, since
(a) connection establishment implies that the Response has been received;
(b) the CCIDs only look at packets received in the (PART)OPEN state, i.e.
this entry will always be ignored;
(c) it can not be used for anything useful - to detect loss for instance, only
packets received after the loss can serve as pseudo-dupacks.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
68e074bfcef269bc61006c2740d7f89ccbbd93d7 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Remove manual influence on NDP Count feature

Updating the NDP count feature is handled automatically now:
* for CCID-2 it is disabled, since the code does not use NDP counts;
* for CCID-3 it is enabled, as NDP counts are used to determine loss lengths.

Allowing the user to change NDP values leads to unpredictable and failing
behaviour, since it is then possible to disable NDP counts even when they
are needed (e.g. in CCID-3).

This means that only those user settings are sensible that agree with the
values for Send NDP Count implied by the choice of CCID. But those settings
are already activated by the feature negotiation (CCID dependency tracking),
hence this form of support is redundant.

At startup the initialisation of the NDP count feature is with the default
value of 0, which is done implicitly by the zeroing-out of the socket when
it is allocated. If the choice of CCID or feature negotiation enables NDP
count, this will then be updated via the NDP activation handler.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
c926c6aed3e444e8c88a768f063b2de8fd6ae760 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Feature activation handlers

This patch provides the post-processing of feature negotiation state, after
the negotiation has completed.

To this purpose, handlers are used and added to the dccp_feat_table. Each
handler is passed a boolean flag whether the RX or TX side of the feature
is meant.

Several handlers are provided already, new handlers can easily be added.

The initialisation is now fully dynamic, i.e. CCIDs are activated only
after the feature negotiation. The integration of this dynamic activation
is done in the subsequent patches.
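The handler mechanism can be pictured as a dispatch table keyed by feature number, where each handler gets the negotiated value and an RX/TX flag. Everything here (struct, handler bodies, table) is a hypothetical illustration; the Sequence Window handler follows the rule that this feature is located at the TX side, so rx == false means the local value:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sock_state {
	uint64_t local_seq_win, remote_seq_win;
	bool send_ndp_count;
};

/* rx == true: RX side of the feature is meant; false: TX side. */
typedef int (*feat_activate_fn)(struct sock_state *sk, uint64_t val, bool rx);

static int hdlr_seq_win(struct sock_state *sk, uint64_t val, bool rx)
{
	if (rx)
		sk->remote_seq_win = val;
	else
		sk->local_seq_win = val;
	return 0;
}

static int hdlr_ndp(struct sock_state *sk, uint64_t val, bool rx)
{
	if (!rx)  /* illustrative: only the sending side acts on it */
		sk->send_ndp_count = (val != 0);
	return 0;
}

struct feat_entry {
	uint8_t feat_num;
	feat_activate_fn activation_hdlr;
};

static const struct feat_entry feat_table[] = {
	{ 3, hdlr_seq_win },  /* Sequence Window */
	{ 7, hdlr_ndp },      /* Send NDP Count */
};

/* Run the activation handler for a negotiated feature value. */
static int feat_activate(struct sock_state *sk, uint8_t feat,
			 uint64_t val, bool rx)
{
	size_t i;

	for (i = 0; i < sizeof(feat_table) / sizeof(feat_table[0]); i++)
		if (feat_table[i].feat_num == feat)
			return feat_table[i].activation_hdlr(sk, val, rx);
	return -1;  /* no handler registered for this feature */
}
```

New handlers are added by appending an entry to the table.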

Thanks to Wei Yongjun for pointing out the necessity of skipping over empty
Confirm options while copying the negotiated feature values.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
0ef118a017919cd661cf294811d1889ac556ee80 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Insert feature-negotiation options into skb

This patch replaces the earlier insertion routine from options.c, so that
code specific to feature negotiation can remain in feat.c. This is possible
by calling a function already existing in options.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
17c30b40ed79e9f3955e884632c8f01e577b204a 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Deprecate Ack Ratio sysctl

This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it worked in CCID-2, there would be no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since it waits for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.

The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.

With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.

Thanks to Tomasz Grobelny for discussion leading up to this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
d4c8741c431e07cfc66eb2b4c3a17b8d4975d9c0 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Mechanism to resolve CCID dependencies

This adds a hook to resolve features whose value depends on the choice of
CCID. It is done at the server since it can only be done after the CCID
values have been negotiated; i.e. the client will add its CCID preference
list on the Change options sent in the Request, which will be reconciled
with the local preference list of the server.

The concept is documented on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
implementation_notes.html#ccid_dependencies
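The reconciliation step follows the server-priority rule of RFC 4340, 6.3.1: the result is the first entry of the server's preference list that also occurs in the client's list. A minimal sketch, with a hypothetical function name and 0 used as a "no common value" sentinel:

```c
#include <stddef.h>
#include <stdint.h>

/* Server-priority reconciliation: first entry of the server's list that
 * also appears in the client's list, or 0 if there is none. */
static uint8_t reconcile_server_priority(const uint8_t *srv, size_t srv_len,
					  const uint8_t *cli, size_t cli_len)
{
	size_t i, j;

	for (i = 0; i < srv_len; i++)
		for (j = 0; j < cli_len; j++)
			if (srv[i] == cli[j])
				return srv[i];
	return 0;
}
```

Only once this value is known can the CCID-dependent features be resolved, which is why the hook runs at the server.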

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
093e1f46cf162913d05e1d4eeb01baa3e297b683 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Resolve dependencies of features on choice of CCID

This provides a missing link in the code chain, as several features implicitly
depend and/or rely on the choice of CCID. Most notably, this is the Send Ack Vector
feature, but also Ack Ratio and Send Loss Event Rate (also taken care of).

For Send Ack Vector, the situation is as follows:
* since CCID2 mandates the use of Ack Vectors, there is no point in allowing
endpoints which use CCID2 to disable Ack Vector features on such a connection;

* a peer with a TX CCID of CCID2 will always expect Ack Vectors, and a peer
with a RX CCID of CCID2 must always send Ack Vectors (RFC 4341, sec. 4);

* for all other CCIDs, the use of (Send) Ack Vector is optional and thus
negotiable. However, this implies that the code negotiating the use of Ack
Vectors also supports it (i.e. is able to supply and to either parse or
ignore received Ack Vectors). Since this is not the case (CCID-3 has no Ack
Vector support), the use of Ack Vectors is here disabled, with a comment
in the source code.

An analogous consideration arises for the Send Loss Event Rate feature,
since the CCID-3 implementation does not support the loss interval options
of RFC 4342. To make such use explicit, corresponding feature-negotiation
options are inserted which signal the use of the loss event rate option,
as it is used by the CCID3 code.

Lastly, the values of the Ack Ratio feature are matched to the choice of CCID.

The patch implements this as a function which is called after the user has
made all other registrations for changing default values of features.

The table is variable-length, the reserved (and hence for feature-negotiation
invalid, confirmed by considering section 19.4 of RFC 4340) feature number `0'
is used to mark the end of the table.
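A sketch of a variable-length dependency table terminated by the reserved feature number 0. Feature numbers follow RFC 4340 (Ack Ratio = 5, Send Ack Vector = 6, Send NDP Count = 7) and RFC 4342 (Send Loss Event Rate = 192); the values are taken from the descriptions in this log. The table layout and lookup function are hypothetical and, unlike the real implementation, do not distinguish the local and remote side of each feature:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
	FEAT_END             = 0,    /* reserved: marks the end of a table */
	FEAT_ACK_RATIO       = 5,
	FEAT_SEND_ACK_VECTOR = 6,
	FEAT_SEND_NDP_COUNT  = 7,
	FEAT_SEND_LEV_RATE   = 192,
};

struct ccid_dep {
	uint8_t  feat;
	uint16_t val;
};

static const struct ccid_dep ccid2_deps[] = {
	{ FEAT_SEND_ACK_VECTOR, 1 },  /* mandated by CCID2 */
	{ FEAT_SEND_NDP_COUNT,  0 },  /* NDP counts not used */
	{ FEAT_ACK_RATIO,       2 },  /* initial Ack Ratio */
	{ FEAT_END,             0 },
};

static const struct ccid_dep ccid3_deps[] = {
	{ FEAT_SEND_ACK_VECTOR, 0 },  /* no Ack Vector support in CCID-3 */
	{ FEAT_SEND_NDP_COUNT,  1 },  /* needed for loss lengths */
	{ FEAT_SEND_LEV_RATE,   1 },  /* loss event rate option in use */
	{ FEAT_END,             0 },
};

static const struct ccid_dep *ccid_deps(uint8_t ccid)
{
	switch (ccid) {
	case 2:  return ccid2_deps;
	case 3:  return ccid3_deps;
	default: return NULL;
	}
}

/* Value a feature must take for the given CCID; false if unconstrained. */
static bool ccid_dep_value(uint8_t ccid, uint8_t feat, uint16_t *val)
{
	const struct ccid_dep *d;

	for (d = ccid_deps(ccid); d != NULL && d->feat != FEAT_END; d++)
		if (d->feat == feat) {
			*val = d->val;
			return true;
		}
	return false;
}
```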

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
702083839b607f390dbed5d2304eb8fc5f4c85ac 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Cleanup routines for feature negotiation

This inserts the required de-allocation routines for memory allocated by
feature negotiation in the socket destructors, replacing dccp_feat_clean()
in one instance.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
828755cee087e4a34f45d6c9db661ccd0631cc6d 04-Sep-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Per-socket initialisation of feature negotiation

This provides feature-negotiation initialisation for both DCCP sockets and
DCCP request_sockets, to support feature negotiation during connection setup.

It also resolves a FIXME regarding the congestion control initialisation.

Thanks to Wei Yongjun for help with the IPv6 side of this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
6edafaaf6f5e70ef1e620ff01bd6bacebe1e0718 07-Aug-2008 Gui Jianfeng <guijianfeng@cn.fujitsu.com> tcp: Fix kernel panic when calling tcp_v(4/6)_md5_do_lookup

If the following packet flow happens, the kernel will panic.
MachineA                         MachineB
                SYN
        ---------------------->
              SYN+ACK
        <----------------------
            ACK(bad seq)
        ---------------------->
When a bad seq ACK is received, tcp_v4_md5_do_lookup(skb->sk, ip_hdr(skb)->daddr)
is finally called by tcp_v4_reqsk_send_ack(), but the first parameter (skb->sk) is
NULL at that moment, so the kernel panics.
This patch fixes this bug.

OOPS output is as following:
[ 302.812793] IP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42
[ 302.817075] Oops: 0000 [#1] SMP
[ 302.819815] Modules linked in: ipv6 loop dm_multipath rtc_cmos rtc_core rtc_lib pcspkr pcnet32 mii i2c_piix4 parport_pc i2c_core parport ac button ata_piix libata dm_mod mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: scsi_wait_scan]
[ 302.849946]
[ 302.851198] Pid: 0, comm: swapper Not tainted (2.6.27-rc1-guijf #5)
[ 302.855184] EIP: 0060:[<c05cfaa6>] EFLAGS: 00010296 CPU: 0
[ 302.858296] EIP is at tcp_v4_md5_do_lookup+0x12/0x42
[ 302.861027] EAX: 0000001e EBX: 00000000 ECX: 00000046 EDX: 00000046
[ 302.864867] ESI: ceb69e00 EDI: 1467a8c0 EBP: cf75f180 ESP: c0792e54
[ 302.868333] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 302.871287] Process swapper (pid: 0, ti=c0792000 task=c0712340 task.ti=c0746000)
[ 302.875592] Stack: c06f413a 00000000 cf75f180 ceb69e00 00000000 c05d0d86 000016d0 ceac5400
[ 302.883275] c05d28f8 000016d0 ceb69e00 ceb69e20 681bf6e3 00001000 00000000 0a67a8c0
[ 302.890971] ceac5400 c04250a3 c06f413a c0792eb0 c0792edc cf59a620 cf59a620 cf59a634
[ 302.900140] Call Trace:
[ 302.902392] [<c05d0d86>] tcp_v4_reqsk_send_ack+0x17/0x35
[ 302.907060] [<c05d28f8>] tcp_check_req+0x156/0x372
[ 302.910082] [<c04250a3>] printk+0x14/0x18
[ 302.912868] [<c05d0aa1>] tcp_v4_do_rcv+0x1d3/0x2bf
[ 302.917423] [<c05d26be>] tcp_v4_rcv+0x563/0x5b9
[ 302.920453] [<c05bb20f>] ip_local_deliver_finish+0xe8/0x183
[ 302.923865] [<c05bb10a>] ip_rcv_finish+0x286/0x2a3
[ 302.928569] [<c059e438>] dev_alloc_skb+0x11/0x25
[ 302.931563] [<c05a211f>] netif_receive_skb+0x2d6/0x33a
[ 302.934914] [<d0917941>] pcnet32_poll+0x333/0x680 [pcnet32]
[ 302.938735] [<c05a3b48>] net_rx_action+0x5c/0xfe
[ 302.941792] [<c042856b>] __do_softirq+0x5d/0xc1
[ 302.944788] [<c042850e>] __do_softirq+0x0/0xc1
[ 302.948999] [<c040564b>] do_softirq+0x55/0x88
[ 302.951870] [<c04501b1>] handle_fasteoi_irq+0x0/0xa4
[ 302.954986] [<c04284da>] irq_exit+0x35/0x69
[ 302.959081] [<c0405717>] do_IRQ+0x99/0xae
[ 302.961896] [<c040422b>] common_interrupt+0x23/0x28
[ 302.966279] [<c040819d>] default_idle+0x2a/0x3d
[ 302.969212] [<c0402552>] cpu_idle+0xb2/0xd2
[ 302.972169] =======================
[ 302.974274] Code: fc ff 84 d2 0f 84 df fd ff ff e9 34 fe ff ff 83 c4 0c 5b 5e 5f 5d c3 90 90 57 89 d7 56 53 89 c3 50 68 3a 41 6f c0 e8 e9 55 e5 ff <8b> 93 9c 04 00 00 58 85 d2 59 74 1e 8b 72 10 31 db 31 c9 85 f6
[ 303.011610] EIP: [<c05cfaa6>] tcp_v4_md5_do_lookup+0x12/0x42 SS:ESP 0068:c0792e54
[ 303.018360] Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
59435444a13ed52d3444c5df26b73d3086bcd57b 26-Jul-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp: Allow to distinguish original and retransmitted packets

This patch allows the sender to distinguish original and retransmitted packets,
which is in particular needed for the retransmission of DCCP-Requests:
* the first Request uses ISS (generated in net/dccp/ip*.c), and sets GSS = ISS;
* all retransmitted Requests use GSS' = GSS + 1, so that the n-th retransmitted
Request has sequence number ISS + n (mod 2^48).

To add generic support, the patch reorganises existing code so that:
* icsk_retransmits == 0 for the original packet and
* icsk_retransmits = n > 0 for the n-th retransmitted packet
at the time dccp_transmit_skb() is called, via dccp_retransmit_skb().

Thanks to Wei Yongjun for pointing this problem out.

Further changes:
----------------
* removed the `skb' argument from dccp_retransmit_skb(), since sk_send_head
is used for all retransmissions (the exception is client-Acks in PARTOPEN
state, but these do not use sk_send_head);
* since sk_send_head always contains the original skb (via dccp_entail()),
skb_cloned() never evaluated to true and thus pskb_copy() was never used.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
547b792cac0a038b9dbf958d3c120df3740b5572 26-Jul-2008 Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> net: convert BUG_TRAP to generic WARN_ON

Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuitively, BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013c7e35aeba39777f9b3eef8a70207b3931152 13-Jul-2008 Gerrit Renker <gerrit@erg.abdn.ac.uk> dccp ccid-3: Fix error in loss detection

The TFRC loss detection code used the wrong loss condition (RFC 4340, 7.7.1):
* the difference between sequence numbers s1 and s2 instead of
* the number of packets missing between s1 and s2 (one less than the distance).

Since this condition appears in many places of the code, it has been put into a
separate function, dccp_loss_free().
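The corrected condition can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the kernel code: `delta_seqno48()` and `loss_free()` are hypothetical names, `ndp` is assumed to be the NDP count carried by s2, and treating a non-positive distance (reordering or duplicates) as loss-free is a choice made for the sketch:

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ48_MASK 0xFFFFFFFFFFFFULL

/* Signed circular distance from s1 to s2 in the 48-bit sequence space. */
static int64_t delta_seqno48(uint64_t s1, uint64_t s2)
{
	uint64_t d = (s2 - s1) & SEQ48_MASK;

	/* Distances in the upper half of the space are negative. */
	return (d & 0x800000000000ULL) ? (int64_t)d - (int64_t)(1ULL << 48)
				       : (int64_t)d;
}

/*
 * RFC 4340, 7.7.1 style check: the number of sequence numbers missing
 * between s1 and s2 is (distance - 1), not the distance itself. No data
 * packet was lost if all of the missing ones are accounted for by ndp
 * non-data packets.
 */
static bool loss_free(uint64_t s1, uint64_t s2, uint64_t ndp)
{
	int64_t delta = delta_seqno48(s1, s2);

	if (delta <= 0)		/* reordered or duplicate packet (sketch choice) */
		return true;
	return (uint64_t)(delta - 1) <= ndp;
}
```

With s1 = 10 and s2 = 12 one packet is missing, so the pair is loss-free only if ndp is at least 1; using the raw difference of 2 would be off by one, which is the error the patch describes.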

Further changes:
----------------
* tidied up incorrect typing (it was using `int' for u64/s64 types);
* optimised conditional statements for common case of non-reordered packets;
* rewrote comments/documentation to match the changes.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
7d06b2e053d2d536348e3a0f6bb02982a41bea37 15-Jun-2008 Brian Haley <brian.haley@hp.com> net: change proto destroy method to return void

Change struct proto destroy function pointer to return void. Noticed
by Al Viro.

Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
028b027524b162eef90839a92ba4b8bddf23e06c 13-Apr-2008 Patrick McHardy <kaber@trash.net> [DCCP]: Fix skb->cb conflicts with IP

dev_queue_xmit() and the other IP output functions expect to get a skb
with clear or properly initialized skb->cb. Unlike TCP and UDP, the
dccp_skb_cb doesn't contain a struct inet_skb_parm at the beginning,
so the DCCP-specific data is interpreted by the IP output functions.
This can cause false negatives for the conditional POST_ROUTING hook
invocation, making the packet bypass the hook.

Add an inet_skb_parm/inet6_skb_parm union to the beginning of
dccp_skb_cb to avoid clashes. Also add a BUILD_BUG_ON to make
sure it fits in the cb.
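The layout idea can be sketched with C11 `static_assert` standing in for BUILD_BUG_ON. The two `*_standin` structs are hypothetical placeholders with illustrative sizes, not the real inet_skb_parm/inet6_skb_parm; only the 48-byte size of skb->cb is taken from the kernel:

```c
#include <assert.h>	/* static_assert (C11) */
#include <stdint.h>

/* Size of sk_buff->cb[] in the kernel. */
#define SKB_CB_SIZE 48

/* Hypothetical stand-ins for the IP control blocks; sizes are illustrative. */
struct ip4_parm_standin { uint32_t opt[4]; uint16_t flags; };
struct ip6_parm_standin { uint32_t fields[6]; };

/*
 * The IP-layer parameter block sits at offset 0, so the IP output path
 * reads its own (cleared) fields there instead of DCCP-private data.
 * The DCCP fields follow the union.
 */
struct dccp_cb_sketch {
	union {
		struct ip4_parm_standin h4;
		struct ip6_parm_standin h6;
	} header;
	uint8_t  type;
	uint8_t  reset_code;
	uint64_t seq;
	uint64_t ack_seq;
};

/* Compile-time size check, analogous to the BUILD_BUG_ON in the patch. */
static_assert(sizeof(struct dccp_cb_sketch) <= SKB_CB_SIZE,
	      "DCCP control block does not fit into skb->cb");
```

If a later change grows the private fields past 48 bytes, the build fails instead of silently overrunning the buffer.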

[ Combined with patch from Gerrit Renker to remove two now unnecessary
memsets of IPCB(skb)->opt ]

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7630f026810a63464e47391ab1e03674c33eb1b8 03-Apr-2008 Denis V. Lunev <den@openvz.org> [DCCP]: Replace socket with sock for reset sending.

Replace dccp_v(4|6)_ctl_socket with sock to unify a code with TCP/ICMP.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
0dc47877a3de00ceadea0005189656ae8dc52669 06-Mar-2008 Harvey Harrison <harvey.harrison@gmail.com> net: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ab1e0a13d70299e792fd0527cefd070c1405fa5b 03-Feb-2008 Arnaldo Carvalho de Melo <acme@redhat.com> [SOCK] proto: Add hashinfo member to struct proto

This way we can remove TCP and DCCP specific versions of

sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port
sk->sk_prot->hash: inet_hash is directly used, only v6 need
a specific version to deal with mapped sockets
sk->sk_prot->unhash: both v4 and v6 use inet_unhash directly

struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so
that inet_csk_get_port can find the per family routine.

Now only the lookup routines receive as a parameter a struct inet_hashtable.

With this we further reuse code, reducing the difference among INET transport
protocols.

Eventually work has to be done on UDP and SCTP to make them share this
infrastructure and get as a bonus inet_diag interfaces so that iproute can be
used with these protocols.

net-2.6/net/ipv4/inet_hashtables.c:
struct proto | +8
struct inet_connection_sock_af_ops | +8
2 structs changed
__inet_hash_nolisten | +18
__inet_hash | -210
inet_put_port | +8
inet_bind_bucket_create | +1
__inet_hash_connect | -8
5 functions changed, 27 bytes added, 218 bytes removed, diff: -191

net-2.6/net/core/sock.c:
proto_seq_show | +3
1 function changed, 3 bytes added, diff: +3

net-2.6/net/ipv4/inet_connection_sock.c:
inet_csk_get_port | +15
1 function changed, 15 bytes added, diff: +15

net-2.6/net/ipv4/tcp.c:
tcp_set_state | -7
1 function changed, 7 bytes removed, diff: -7

net-2.6/net/ipv4/tcp_ipv4.c:
tcp_v4_get_port | -31
tcp_v4_hash | -48
tcp_v4_destroy_sock | -7
tcp_v4_syn_recv_sock | -2
tcp_unhash | -179
5 functions changed, 267 bytes removed, diff: -267

net-2.6/net/ipv6/inet6_hashtables.c:
__inet6_hash | +8
1 function changed, 8 bytes added, diff: +8

net-2.6/net/ipv4/inet_hashtables.c:
inet_unhash | +190
inet_hash | +242
2 functions changed, 432 bytes added, diff: +432

vmlinux:
16 functions changed, 485 bytes added, 492 bytes removed, diff: -7

/home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c:
tcp_v6_get_port | -31
tcp_v6_hash | -7
tcp_v6_syn_recv_sock | -9
3 functions changed, 47 bytes removed, diff: -47

/home/acme/git/net-2.6/net/dccp/proto.c:
dccp_destroy_sock | -7
dccp_unhash | -179
dccp_hash | -49
dccp_set_state | -7
dccp_done | +1
5 functions changed, 1 bytes added, 242 bytes removed, diff: -241

/home/acme/git/net-2.6/net/dccp/ipv4.c:
dccp_v4_get_port | -31
dccp_v4_request_recv_sock | -2
2 functions changed, 33 bytes removed, diff: -33

/home/acme/git/net-2.6/net/dccp/ipv6.c:
dccp_v6_get_port | -31
dccp_v6_hash | -7
dccp_v6_request_recv_sock | +5
3 functions changed, 5 bytes added, 38 bytes removed, diff: -33

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a07a5a86d091699fd5e791765b8a79e6b1ef96cb 17-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Remove unused inline function

The function follows48(), which is a special-case of dccp_delta_seqno(),
is nowhere used in the DCCP code, thus removed by this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
af3b867e2f6b72422bc7aacb1f1e26f47a9649bc 13-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Support inserting options during the 3-way handshake

This provides a separate routine to insert options during the initial handshake.
The main purpose is to conduct feature negotiation; for the moment, the only user
is the timestamp echo needed for the (CCID3) handshake RTT sample.

Padding of options has been put into a small separate routine, to be shared among
the two functions. This could also be used as a generic routine to finish inserting
options.

Also removed an `XXX' comment since its content was obvious.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
28be5440044d5b19b0331f79fb3e81845ad6d77e 13-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Use maximum-RTO backoff from DCCP spec

This removes another FIXME: the code used the TCP maximum RTO rather than the
value specified by the DCCP specification. Across the sections in RFC 4340, 64
seconds is consistently suggested as the maximum RTO backoff value, and this is
the value which is now used.

I have checked both termination cases for retransmissions of Close/CloseReq:
with the default value 15 of `retries2', and an initial icsk_retransmits = 0,
it takes about 614 seconds to declare a non-responding peer as dead, after
which the final terminating Reset is sent. With the TCP maximum RTO value of
120 seconds it takes (as might be expected) almost twice as long, about 23
minutes.
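The effect of the cap can be illustrated with a generic capped exponential backoff. This is a sketch only: the initial RTO and the number of timeouts are free parameters here, because the exact timer details behind the 614 s and 23 min figures are not given above, and the sketch does not try to reproduce them. `total_backoff_ms()` is a hypothetical name:

```c
#include <stdint.h>

/*
 * Total time (ms) spent across `timeouts` consecutive retransmission
 * timeouts when the timeout starts at rto_init_ms, doubles after every
 * expiry, and never exceeds rto_max_ms.
 */
static uint64_t total_backoff_ms(uint64_t rto_init_ms, uint64_t rto_max_ms,
				 unsigned int timeouts)
{
	uint64_t rto = rto_init_ms < rto_max_ms ? rto_init_ms : rto_max_ms;
	uint64_t total = 0;

	for (unsigned int i = 0; i < timeouts; i++) {
		total += rto;			/* wait out the current timeout */
		rto *= 2;			/* exponential backoff ... */
		if (rto > rto_max_ms)
			rto = rto_max_ms;	/* ... capped at the maximum RTO */
	}
	return total;
}
```

Once the doubling reaches the cap, every further timeout costs exactly the cap, so the total grows linearly with the cap: with eight timeouts starting at 1 s, a 64 s cap gives 191 s while a 120 s cap gives 247 s.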

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
954c2db868ce896325dced91d5fba5e2226897a4 12-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [CCID3]: Interface CCID3 code with newer Loss Intervals Database

This hooks up the TFRC Loss Interval database with CCID 3 packet reception.
In addition, it makes the CCID-specific computation of the first loss
interval (which requires access to all the guts of CCID3) local to ccid3.c.

The patch also fixes an omission in the DCCP code, that of a default /
fallback RTT value (defined in section 3.4 of RFC 4340 as 0.2 sec); while
at it, the upper bound of 4 seconds for an RTT sample has been reduced to
match the initial TCP RTO value of 3 seconds from [RFC 1122, 4.2.3.1].

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2180c41ca5c1a36c67f4140e80154699333109d2 06-Dec-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Introduce generic function to test for `data packets'

as per RFC 4340, sec. 7.7.
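A sketch of such a test, assuming the packet-type numbering of RFC 4340, 5.1 and treating the four types that can carry application data (Request, Response, Data, DataAck) as data packets. The enum and `is_data_packet()` are hypothetical names for illustration:

```c
#include <stdbool.h>

/* DCCP packet types, RFC 4340, 5.1. */
enum dccp_pkt_type {
	PKT_REQUEST = 0, PKT_RESPONSE, PKT_DATA, PKT_ACK, PKT_DATAACK,
	PKT_CLOSEREQ, PKT_CLOSE, PKT_RESET, PKT_SYNC, PKT_SYNCACK,
};

/* True for the packet types that can carry application data. */
static bool is_data_packet(enum dccp_pkt_type type)
{
	switch (type) {
	case PKT_REQUEST:
	case PKT_RESPONSE:
	case PKT_DATA:
	case PKT_DATAACK:
		return true;
	default:
		return false;	/* Ack, CloseReq, Close, Reset, Sync, SyncAck */
	}
}
```

Everything that is not a data packet is what the NDP count in the loss-detection code counts.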

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e356d37a096a990ea1a74c44c15640122e56110b 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Factor out common code for generating Resets

This factors code common to dccp_v{4,6}_ctl_send_reset into a separate function,
and adds support for filling in the Data 1 ... Data 3 fields from RFC 4340, 5.6.

It is useful to have this separate, since the following Reset codes will always
be generated from the control socket rather than via dccp_send_reset:
* Code 3, "No Connection", cf. 8.3.1;
* Code 4, "Packet Error" (identification for Data 1 added);
* Code 5, "Option Error" (identification for Data 1..3 added, will be used later);
* Code 6, "Mandatory Error" (same as Option Error);
* Code 7, "Connection Refused" (what on Earth is the difference to "No Connection"?);
* Code 8, "Bad Service Code";
* Code 9, "Too Busy";
* Code 10, "Bad Init Cookie" (not used).

Code 0 is not recommended by the RFC. The following codes would be used in
dccp_send_reset() instead, since they all relate to an established DCCP connection:
* Code 1, "Closed";
* Code 2, "Aborted";
* Code 11, "Aggression Penalty" (12.3).
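The split described above can be written as a lookup. This is a sketch that only encodes the classification in this message; the enum names and `reset_path_for()` are hypothetical, while the numeric values follow RFC 4340, 5.6:

```c
/* Reset codes, RFC 4340, 5.6. */
enum reset_code {
	RESET_UNSPECIFIED = 0,
	RESET_CLOSED,			/* 1 */
	RESET_ABORTED,			/* 2 */
	RESET_NO_CONNECTION,		/* 3 */
	RESET_PACKET_ERROR,		/* 4 */
	RESET_OPTION_ERROR,		/* 5 */
	RESET_MANDATORY_ERROR,		/* 6 */
	RESET_CONNECTION_REFUSED,	/* 7 */
	RESET_BAD_SERVICE_CODE,		/* 8 */
	RESET_TOO_BUSY,			/* 9 */
	RESET_BAD_INIT_COOKIE,		/* 10 */
	RESET_AGGRESSION_PENALTY,	/* 11 */
};

enum reset_path { PATH_NOT_USED, PATH_CTL_SOCKET, PATH_SEND_RESET };

/* Which sending path the list above assigns to each Reset code. */
static enum reset_path reset_path_for(enum reset_code code)
{
	switch (code) {
	case RESET_CLOSED:
	case RESET_ABORTED:
	case RESET_AGGRESSION_PENALTY:
		return PATH_SEND_RESET;	/* established connection */
	case RESET_UNSPECIFIED:
		return PATH_NOT_USED;	/* not recommended by the RFC */
	default:
		return PATH_CTL_SOCKET;	/* codes 3 .. 10 */
	}
}
```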

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
a94f0f970549e63e54c80c4509db299c514d8c11 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Rate-limit DCCP-Syncs

This implements a SHOULD from RFC 4340, 7.5.4:
"To protect against denial-of-service attacks, DCCP implementations SHOULD
impose a rate limit on DCCP-Syncs sent in response to sequence-invalid packets,
such as not more than eight DCCP-Syncs per second."

The rate-limit is maintained on a per-socket basis. This is a more stringent
policy than enforcing the rate-limit on a per-source-address basis and
protects against attacks with forged source addresses.

Moreover, the mechanism is deliberately kept simple. In contrast to
xrlim_allow(), bursts of Sync packets in reply to sequence-invalid packets
are not supported. This foils attacks in which the receipt of a Sync
triggers further sequence-invalid packets. (I have tested this mechanism against
the xrlim_allow algorithm for Syncs; permitting bursts just increases the problems.)

In order to keep flexibility, the timeout parameter can be set via sysctl; and
the whole mechanism can even be disabled (which is however not recommended).

The algorithm in this patch has been improved with regard to wrapping issues
thanks to a suggestion by Arnaldo.
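A user-space sketch of a per-socket, burst-free rate limit with a wrap-safe time comparison in the style of the kernel's time_before(). The struct, the function names, and the tick values are hypothetical; for example, a gap of 125 ticks at an assumed 1000 ticks per second allows at most eight Syncs per second:

```c
#include <stdbool.h>

/*
 * Wrap-safe "a is before b" for a free-running unsigned tick counter:
 * the signed difference stays correct across wrap-around as long as the
 * two values are less than half the counter range apart.
 */
static bool ticks_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical per-socket state. */
struct sync_limiter {
	unsigned long last_sync;	/* tick at which the last Sync was sent */
	unsigned long min_gap;		/* minimum ticks between Syncs; 0 disables */
};

static void sync_limiter_init(struct sync_limiter *rl, unsigned long min_gap,
			      unsigned long now)
{
	rl->min_gap = min_gap;
	rl->last_sync = now - min_gap;	/* a Sync at `now` is allowed */
}

/* Returns true if a Sync may be sent at `now`, and records the send. */
static bool sync_allowed(struct sync_limiter *rl, unsigned long now)
{
	if (rl->min_gap != 0 && ticks_before(now, rl->last_sync + rl->min_gap))
		return false;		/* too soon: suppress, no burst credit */
	rl->last_sync = now;
	return true;
}
```

Comparing through a signed difference rather than `now < last + gap` is what keeps the check correct when the counter wraps.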

Committer note: Rate limited the step 6 DCCP_WARN too, as it says we're
sending a sync.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
0430ee3451f4589b68f522552b1896825f2043b3 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Add Support for Data 1 .. 3 fields of Reset packets

This adds fields to support the informational Data 1..3 fields of the
DCCP-Reset packets (RFC 4340, 5.6), and makes minor cosmetic changes
to documentation.
Code which fills in these fields follows in subsequent patches; it is
primarily used for reporting option-processing and feature-negotiation
errors.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
727ecc5faaf6e976fc841649821c865ebd1e822d 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Add FIXME for send_delayed_ack

This adds a FIXME to signal that the function dccp_send_delayed_ack is nowhere
used in the entire DCCP/CCID code.

Using a delayed Ack timer is suggested in 11.3 of RFC 4340, but it also has
rather subtle implications for the Ack-Ratio accounting.

CCID2 does not use this (maybe it should).

I think leaving the function in is good, in case someone wants to implement
this.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
3393da8241ae3a53e183ba15f8bd822995ec97cd 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Simplify interface of dccp_sample_rtt

The third parameter of dccp_sample_rtt now becomes useless and is removed.

Also combined the subtraction of the timestamp echo and the elapsed time.
This is safe, since (a) presence of timestamp echo is tested first and (b)
elapsed time is either present and non-zero or it is not set and equals 0
due to the memset in dccp_parse_options.

To avoid measuring option-processing time, the timestamp for measuring the
initial Request/Response RTT sample is taken directly when the function is
called (the Linux implementation always adds a timestamp on the Request,
so there is no loss in doing this).
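The simplified computation can be sketched as one subtraction followed by clamping. Assumptions: `delta_us` is "now minus the echoed timestamp" in microseconds, the Elapsed Time value is in units of 10 microseconds and is 0 when the option was absent, and the 100 us / 4 s bounds are the ones quoted elsewhere in this log. `sample_rtt_us()` is a hypothetical name, not the kernel function:

```c
#include <stdint.h>

#define RTT_MIN_US 100		/* 100 microseconds */
#define RTT_MAX_US 4000000	/* 4 seconds */

/*
 * RTT sample from a Timestamp Echo: subtract the peer's reported
 * processing delay (Elapsed Time, 10 us units, 0 if absent), then
 * clamp the result to sane bounds.
 */
static uint32_t sample_rtt_us(int64_t delta_us, uint32_t elapsed_10us)
{
	delta_us -= (int64_t)elapsed_10us * 10;

	if (delta_us < RTT_MIN_US)
		return RTT_MIN_US;
	if (delta_us > RTT_MAX_US)
		return RTT_MAX_US;
	return (uint32_t)delta_us;
}
```

Because an absent Elapsed Time reads as 0, the subtraction needs no separate branch, which is the simplification the patch relies on.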

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
4c70f383e0c0273c4092c4efdb414be0966978b7 26-Sep-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Provide 10s of microsecond timesource

This provides a timesource, conveniently used for DCCP timestamps, which
returns the elapsed time in 10s of microseconds since initialisation.
This makes for a wrap-around time of about 11.9 hours, which should be
sufficient for most applications.
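The wrap-around figure follows from the counter width: 2^32 ticks of 10 microseconds are about 42,950 seconds, or roughly 11.9 hours. A small sketch with hypothetical helper names, assuming the timestamp is kept as a 32-bit count of 10 us ticks:

```c
#include <stdint.h>

/* Microseconds since initialisation -> 32-bit count of 10 us ticks (wraps mod 2^32). */
static uint32_t ts10_from_us(uint64_t us_since_init)
{
	return (uint32_t)(us_since_init / 10);
}

/* Seconds until such a counter wraps: 2^32 * 10 us. */
static double ts10_wrap_seconds(void)
{
	return 4294967296.0 * 10e-6;
}
```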

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6168b96c07d8d40f83622cfb488ca27e4178a603 20-Aug-2007 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [DCCP]: Nuke the timeval helpers now that we fully converted to ktime_t

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
8fb8354af9b92ce3bd41083995f1fe26024d0959 20-Aug-2007 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [DCCP]: Nuke dccp_timestamp and dccps_epoch, not used anymore

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9823b7b5542858afe5b6a1e2df83b3847c28f3d6 20-Aug-2007 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [DCCP]: Convert dccp_sample_rtt to ktime_t

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
e961811fcde4202ae5c3c9ce81dcfc244e8959bb 28-May-2007 Ian McDonald <ian.mcdonald@jandi.co.nz> Fix dccp_sum_coverage

When compiling with EXTRA_CFLAGS=-W, notice that we have a signed/unsigned issue
in dccp.h.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
4712a792ee661921374c163eb6a4d06e33fd305f 20-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Provide function for RTT sampling

A recurring problem, in particular in the CCID code, is that RTT samples
from packets with timestamp echo and elapsed time options need to be taken.

This service is provided via a new function dccp_sample_rtt in this patch.
Furthermore, to protect against `insane' RTT samples, the sampled value
is bounded between 100 microseconds and 4 seconds - for which u32 is sufficient.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ac12b0c49571fe4c3a2f4957ed494da316d558be 20-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Always use debug-toggle parameters

Currently debugging output (when configured) is automatically enabled when
DCCP modules are compiled into the kernel rather than built as loadable modules.
This is not necessary, since the module parameters in this case become kernel
commandline parameters, e.g. DCCP or CCID3 debug output can be enabled for a
static build by appending the following at the boot prompt:

dccp.dccp_debug=1 dccp_ccid3.ccid3_debug=1

This patch therefore does away with the more complicated way of always enabling
debug output for static builds.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b16be51b5e5d75cec71b18ebc75f15a4734c62ad 20-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Fix for follows48

The follows48 relation identifies whether 48-bit sequence number
x is the direct successor of y. Currently, it does not handle cases
of the following type correctly:

follows48(0x(prefix)10000LL, 0x(prefix)0FFFFLL)

where prefix is an arbitrary hex sequence of up to 7 digits.

This is fixed by reusing the new dccp_delta_seqno function.
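Expressed through a signed 48-bit distance, the relation reduces to "the distance from y to x is exactly 1", which also covers the carry case quoted above and the wrap from 2^48 - 1 to 0. A sketch with hypothetical names (`delta_seqno48()` plays the role of dccp_delta_seqno):

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ48_MASK 0xFFFFFFFFFFFFULL

/* Signed circular distance from s1 to s2 in the 48-bit sequence space. */
static int64_t delta_seqno48(uint64_t s1, uint64_t s2)
{
	uint64_t d = (s2 - s1) & SEQ48_MASK;

	return (d & 0x800000000000ULL) ? (int64_t)d - (int64_t)(1ULL << 48)
				       : (int64_t)d;
}

/* x is the direct successor of y iff the distance from y to x is 1. */
static bool follows48(uint64_t x, uint64_t y)
{
	return delta_seqno48(y, x) == 1;
}
```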

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d52de17b8cf36d43a9d6977e7861a9f415541c6b 20-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Make `before' relation unambiguous

Problem:
0aec51c86986f61de26dd04913667af544a8b8eb 20-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Make dccp_delta_seqno return signed numbers

Problem:
6b811d43f6cc9eccdfc011a99f8571df2abc46d1 20-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: 48-bit sequence number arithmetic

This patch
* organizes the sequence arithmetic functions into one corner of dccp.h
* performs a small modification of dccp_set_seqno to make it more widely reusable
(now it is safe to use any number, since it performs modulo-2^48 assignment)
* adds functions and generic macros for 48-bit sequence arithmetic:
--48 bit complement
--modulo-48 addition and modulo-48 subtraction
--dccp_inc_seqno now a special case of add48
Constants renamed following a suggestion by Arnaldo.
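The operations listed above can be sketched as follows, assuming sequence numbers are stored in a uint64_t whose upper 16 bits must stay zero. The names are hypothetical and do not claim to match the kernel's macros:

```c
#include <stdint.h>

#define SEQ48_MAX 0xFFFFFFFFFFFFULL	/* 2^48 - 1 */

/* Modulo-2^48 assignment: keep only the low 48 bits of any value. */
static void set_seqno48(uint64_t *seqno, uint64_t value)
{
	*seqno = value & SEQ48_MAX;
}

/* 48-bit complement (additive inverse): x + complement48(x) == 0 mod 2^48. */
static uint64_t complement48(uint64_t x)
{
	return ((SEQ48_MAX + 1) - x) & SEQ48_MAX;
}

static uint64_t add48(uint64_t a, uint64_t b)
{
	return (a + b) & SEQ48_MAX;
}

/* Subtraction as addition of the complement. */
static uint64_t sub48(uint64_t a, uint64_t b)
{
	return add48(a, complement48(b));
}

/* Incrementing is just a special case of add48. */
static void inc_seqno48(uint64_t *seqno)
{
	*seqno = add48(*seqno, 1);
}
```

Because every result is masked, callers can pass any 64-bit value without first reducing it themselves.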

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c93a882ebe673b5e6da0a70fd433f7517e032d23 25-Mar-2007 Adrian Bunk <bunk@stusta.de> [DCCP]: make dccp_write_xmit_timer() static again

dccp_write_xmit_timer() needlessly became global.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
aabb601b0f08b909b650f1a7bfa1e8d9b5a8d999 09-Mar-2007 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Initialise write_xmit_timer also on passive sockets

The TX CCID needs the write_xmit_timer for delaying packet sends. Previously
this timer was only activated on active (connecting) sockets.

This patch initialises the write_xmit_timer in sync with the other timers, i.e.
the timer will be ready on any socket. This is used by applications with a
listening socket which start to stream after receiving an initiation by the
client. The write_xmit_timer is stopped when the application closes, as before.

This was tested to work and to remove the timer bug reported on dccp@vger.

Also moved timer initialisation into timer.c (static).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
c9eaf17341834de00351bf79f16b2d879c8aea96 09-Feb-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [NET] DCCP: Fix whitespace errors.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
0f9e5b573f7249b0e04a03457b55081d1f60f2bf 10-Dec-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Debug timeval operations

Problem:

Most target types in the CCID3 code are u32, so subtle conversion errors
can occur if signed time calculations yield negative results: the original
values are lost in the conversion to unsigned, calculation errors go undetected.

This patch therefore
* sets all critical time types from unsigned to suseconds_t
* avoids comparison between signed/unsigned via type-casting
* provides ample warning messages in case time calculations are negative

These warning messages can be removed at a later stage when the code
has undergone more testing.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
5cc3741d6cc9f07d8ddd9c45cb5088460ce3364f 10-Dec-2006 Ian McDonald <ian.mcdonald@jandi.co.nz> [DCCP]: Remove timeo from output.c

It simplifies waiting for the CCID module to signal that a packet
is ready to be sent. Other simplifications flow on from this such as
removing constants.

As a result of this EAGAIN is not returned any more by dccp_wait_for_ccid
(which would otherwise lead to unnecessarily discarding the packet in
dccp_write_xmit).

Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
59348b19efebfd6a8d0791ff81d207b16594c94b 20-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Simplified conditions due to use of enum:8 states

This reaps the benefit of the earlier patch, which changed the type of
CCID 3 states to use enums, in that many conditions are now simplified
and the number of possible (unexpected) values is greatly reduced.

In a few instances, this also allowed pre-conditions to be simplified;
care has been taken to retain logical equivalence.

[DCCP]: Introduce a consistent BUG/WARN message scheme

This refines the existing set of DCCP messages so that
* BUG(), BUG_ON(), WARN_ON() have meaningful DCCP-specific counterparts
* DCCP_CRIT (for severe warnings) is not rate-limited
* DCCP_WARN() is introduced as rate-limited wrapper

Using these allows a faster and cleaner transition to their original
counterparts once the code has matured into a full DCCP implementation.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
b1308dc015eb09cf094ca169296738a13ae049ad 20-Nov-2006 Ian McDonald <ian.mcdonald@jandi.co.nz> [DCCP]: Set TX Queue Length Bounds via Sysctl

Previously the transmit queue was unbounded.

This patch:
* puts a limit on transmit queue length
and sends back EAGAIN if the buffer is full
* sets the TX queue length to a sensible default
* implements tx buffer sysctls for DCCP
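The queue limit can be sketched as a length check before enqueueing. The struct and `tx_enqueue()` are hypothetical, and treating a limit of 0 as "unbounded" is an assumption of the sketch, not something stated in this patch:

```c
#include <errno.h>

/* Hypothetical transmit-queue state. */
struct tx_queue {
	unsigned int len;	/* packets currently queued */
	unsigned int max_len;	/* limit, e.g. from a sysctl; 0 = unbounded (assumed) */
};

/* Queue one packet; returns 0 on success or -EAGAIN when the queue is full. */
static int tx_enqueue(struct tx_queue *q)
{
	if (q->max_len != 0 && q->len >= q->max_len)
		return -EAGAIN;
	q->len++;
	return 0;
}
```

Returning -EAGAIN leaves the decision to retry with the sender rather than letting the queue grow without bound.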

Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
84116716cc9404356f775443b460f76766f08f65 20-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: enable debug messages also for static builds

This patch
* makes debugging (when configured) work both for static / module build
* provides generic debugging macros for use in other DCCP / CCID modules
* adds missing information about debug parameters to Kconfig
* performs some code tidy-up

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
3c6952624a8f600f9a0fbc1f5db5560a7ef9b13e 16-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Introduce DCCP_{BUG{_ON},CRIT} macros, use enum:8 for the ccid3 states

This patch tackles the following problem:
* the ccid3_hc_{t,r}x_sock define ccid3hc{t,r}x_state as `u8', but
in reality there can only be a few, pre-defined enum names
* this necessitates additional checking for unexpected values
which would otherwise be caught by the compiler

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
afb0a34dd3e20b3f534de19993271b8664cf10bb 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Introduce a consistent naming scheme for sysctls

In order to make their function clearer and obtain a consistent naming
scheme to identify sysctls, all existing DCCP sysctls have been prefixed
with `sysctl_dccp', following the same convention as used by TCP.

Feature-specific sysctls retain the `feat' in the middle, although the
`default' has been dropped, since it is obvious from use.

Also removed a duplicate `dccp_feat_default_sequence_window' in ipv4.c.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2e2e9e92bd723244ea20fa488b1780111f2b05e1 13-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Add sysctls to control retransmission behaviour

This adds 3 sysctls which govern the retransmission behaviour of DCCP control
packets (3way handshake, feature negotiation).

It removes 4 FIXMEs from the code.

The close resemblance of sysctl variables to their TCP analogues is emphasised
not only by their name, but also by giving them the same initial values.
This is useful since there is not much practical experience with DCCP yet.

Furthermore, with regard to the previous patch, it is now possible to limit
the number of keepalive-Responses by setting net.dccp.default.request_retries
(also a bit like in TCP).

Lastly, added documentation of all existing DCCP sysctls.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
6f4e5fff1e4d46714ea554fd83e44eab534e8b11 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Support for partial checksums (RFC 4340, sec. 9.2)

This patch does the following:
a) introduces variable-length checksums as specified in [RFC 4340, sec. 9.2]
b) provides necessary socket options and documentation as to how to use them
c) basic support and infrastructure for the Minimum Checksum Coverage feature
[RFC 4340, sec. 9.2.1]: acceptability tests, user notification and user
interface

In addition, it

(1) fixes two bugs in the DCCPv4 checksum computation:
* pseudo-header used checksum_len instead of skb->len
* incorrect checksum coverage calculation based on dccph_x
(2) removes dccp_v4_verify_checksum() since it reduplicates code of the
checksum computation; code calling this function is updated accordingly.
(3) now uses skb_checksum(), which is safer than checksum_partial() if the
sk_buff is a non-linear buffer (has pages attached to it).
(4) fixes an outstanding TODO item:
* If P.CsCov is too large for the packet size, drop packet and return.
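The coverage rule of RFC 4340, 9.2 and the drop check from (4) can be sketched as below. Assumptions: `doff` is the Data Offset (header plus options, in 32-bit words) and `pkt_len` the total DCCP packet length in bytes; the function names are hypothetical:

```c
#include <stdbool.h>

/*
 * Bytes covered by the DCCP checksum: CsCov == 0 covers the whole packet;
 * otherwise the header and options plus (CsCov - 1) * 4 bytes of
 * application data are covered.
 */
static unsigned int csum_coverage(unsigned int doff, unsigned int cscov,
				  unsigned int pkt_len)
{
	if (cscov == 0)
		return pkt_len;
	return (doff + cscov - 1) * 4;
}

/* Receiver check: a coverage extending past the packet means drop. */
static bool csum_coverage_valid(unsigned int doff, unsigned int cscov,
				unsigned int pkt_len)
{
	return csum_coverage(doff, cscov, pkt_len) <= pkt_len;
}
```

With doff = 4 (a 16-byte header), CsCov = 1 covers only the header, while CsCov = 15 covers 72 bytes and is rejected for a 60-byte packet.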

The code has been tested with applications, the latest version of tcpdump now
comes with support for partial DCCP checksums.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
cf557926f6955b4c3fa55e81fdb3675e752e8eed 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: tidy up dccp_v{4,6}_conn_request

This is a code simplification to remove reduplicated code
by concentrating and abstracting shared code.

Detailed Changes:
f45b3ec481581f24719d8ab0bc812c02fcedc2bc 10-Nov-2006 Ian McDonald <ian.mcdonald@jandi.co.nz> [DCCP]: Fix logfile overflow

This patch fixes data being spewed into the logs continually. As the
code stood, if there was a large queue and long delays, timeo would go
down to zero and never get reset.

This fixes it by resetting timeo. Put constant into header as well.

Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
8a73cd09d96aa01743316657fc4e6864fe79b703 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: calling dccp_v{4,6}_reqsk_send_ack is a BUG

This patch removes two functions, the send_ack functions of request_sock,
which are not called/used by the DCCP code. It is correct that these
functions are not called, below is a justification why calling these
functions (on a passive socket in the LISTEN/RESPOND state) would mean
a DCCP protocol violation.

A) Background: using request_sock in TCP:
f6484f7c7ad22e4bb018875c386d6a7aaa441426 10-Nov-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP] timewait: Remove leftover extern declarations

Gerrit Renker noticed dccp_tw_deschedule and submitted a patch with a FIXME,
but, as he suggests in the same patch, the best thing is to just ditch this
declaration. While doing that, I also noticed that tcp_tw_count is not
defined anywhere either, so ditch it too.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
60361be1be7854cbffb6dc268d1bc094da33431c 10-Nov-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: set safe upper bound for option length

This is a re-send from
http://www.mail-archive.com/dccp@vger.kernel.org/msg00553.html

It is the same patch as before, but I have built in Arnaldo's suggestions
pointed out in that posting.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
0e64e94e477f8ed04e9295b11a5898d443c28a47 25-Oct-2006 Gerrit Renker <gerrit@erg.abdn.ac.uk> [DCCP]: Update documentation references.

Updates the references to spec documents throughout the code, taking into
account that

* the DCCP, CCID 2, and CCID 3 drafts all became RFCs in March this year

* RFC 1063 was obsoleted by RFC 1191

* draft-ietf-tcpimpl-pmtud-0x.txt was published as an Informational
RFC, RFC 2923 on 2000-09-22.

All references verified.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
97e5848dd39e7e76bd6077735ebb5473763ab9c5 27-Aug-2006 Ian McDonald <ian.mcdonald@jandi.co.nz> [DCCP]: Introduce tx buffering

This adds transmit buffering to DCCP.

I have tested with CCID2/3 and with loss and rate limiting.

Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
837d107cd101fbf734e9ea2bbb5c7336a329e432 27-Aug-2006 Ian McDonald <ian.mcdonald@jandi.co.nz> [DCCP]: Introduces follows48 function

This adds a new function to see if two sequence numbers follow each
other.

Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
e6bccd357343e98db9e1fd0d487f4f924e1a7921 27-Aug-2006 Ian McDonald <ian.mcdonald@jandi.co.nz> [DCCP]: Update contact details and copyright

Just updating copyright and contacts

Signed off by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
a4bf3902427a128455b8de299ff0918072b2e974 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP] minisock: Rename struct dccp_options to struct dccp_minisock

This will later be included in struct dccp_request_sock so that we can
have per connection feature negotiation state while in the 3way
handshake, when we clone the DCCP_ROLE_LISTEN socket (in
dccp_create_openreq_child) we'll just copy this state from
dreq_minisock to dccps_minisock.

Also the feature negotiation and option parsing code will mostly touch
dccps_minisock, which will simplify some stuff.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3fdadf7d27e3fbcf72930941884387d1f4936f04 21-Mar-2006 Dmitry Mishin <dim@openvz.org> [NET]: {get|set}sockopt compatibility layer

This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.

Signed-off-by: Dmitry Mishin <dim@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2d0817d11eaec57435feb61493331a763f732a2b 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP] options: Make dccp_insert_options & friends yell on error

And not the silly LIMIT_NETDEBUG and silently return without inserting
the option requested.

Also drop some old debugging messages associated to option insertion.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
110bae4efb5ed5565257a0fb9f6d26e6125a1c4b 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Remove leftover dccp_send_response prototype

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
72478873571d869906f7a250b09e12fa5b65e321 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP] ipv6: Add missing ipv6 control socket

I guess I forgot to add it, nah, now it just works:

18:04:33.274066 IP6 ::1.1476 > ::1.5001: request (service=0)
18:04:33.334482 IP6 ::1.5001 > ::1.1476: reset (code=bad_service_code)

Ditched IP_DCCP_UNLOAD_HACK, as now we would have to do it for both
IPv6 and IPv4, so I'll come up with another way for freeing the
control sockets in upcoming changesets.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c25a18ba347f091d1ce620ba33e6772b60a528e1 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Uninline some functions

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b61fafc4ef3faf54236d57e3b230ca19167663bf 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Move the IPv4 specific bits from proto.c to ipv4.c

With this patch in place we can break down the complexity by better
compartmentalizing the code that is common to ipv6 and ipv4.

Now we have these modules:
Module                 Size  Used by
dccp_diag              1344  0
inet_diag              9448  1 dccp_diag
dccp_ccid3            15856  0
dccp_tfrc_lib         12320  1 dccp_ccid3
dccp_ccid2             5764  0
dccp_ipv4             16996  2
dccp                  48208  4 dccp_diag,dccp_ccid3,dccp_ccid2,dccp_ipv4

dccp_ipv6 still requires dccp_ipv4 due to dccp_ipv6_mapped, that is
the next target to work on the "hey, ipv4 is legacy, I only want ipv6
dude!" direction.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
c985ed705ffc682ce40d46a5f7bf98db86b27899 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Move dccp_[un]hash from ipv4.c to the core

As this is used by both ipv4 and ipv6 and is not ipv4 specific.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3e0fadc51f2fde01e0e22f481370a9b5f073bfc3 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Move dccp_v4_{init,destroy}_sock to the core

Removing one more ipv6 uses ipv4 stuff case in dccp land.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
017487d7d1e905a5bb529f6a2bc8cf8ea14e2307 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Generalize dccp_v4_send_reset

Renaming it to dccp_send_reset and moving it from the ipv4 specific
code to the core dccp code.

This fixes some bugs in IPV6 where timers would send v4 resets, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e55d912f5b75723159348a7fc7692f869a86636a 21-Mar-2006 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP] feat: Introduce sysctls for the default features

[root@qemu ~]# for a in /proc/sys/net/dccp/default/* ; do echo $a ; cat $a ; done
/proc/sys/net/dccp/default/ack_ratio
2
/proc/sys/net/dccp/default/rx_ccid
3
/proc/sys/net/dccp/default/send_ackvec
1
/proc/sys/net/dccp/default/send_ndp
1
/proc/sys/net/dccp/default/seq_window
100
/proc/sys/net/dccp/default/tx_ccid
3
[root@qemu ~]#

So if wanting to test ccid3 as the tx CCID one can just do:

[root@qemu ~]# echo 3 > /proc/sys/net/dccp/default/tx_ccid
[root@qemu ~]# echo 2 > /proc/sys/net/dccp/default/rx_ccid
[root@qemu ~]# cat /proc/sys/net/dccp/default/[tr]x_ccid
2
3
[root@qemu ~]#

Of course we also need the setsockopt for each app to tell its preferences, but
for testing or defining something other than CCID2 as the default for apps that
don't explicitly set their preference, the sysctl interface is handy.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
60fe62e789076ae7c13f7ffb35fec4b24802530d 21-Mar-2006 Andrea Bittau <a.bittau@cs.ucl.ac.uk> [DCCP]: sparse endianness annotations

This also fixes the layout of dccp_hdr short sequence numbers, problem
was not fatal now as we only support long (48 bits) sequence numbers.

Signed-off-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
f21e68caa0ddffddf98a1e729e734a470957b6ec 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Prepare the AF agnostic core for the introduction of DCCPv6

Basically exports a similar set of functions as the one exported by
the non-AF specific TCP code.

In the process moved some non-AF specific code from dccp_v4_connect to
dccp_connect_init and moved the checksum verification from
dccp_invalid_packet to dccp_v4_rcv, so as to use it in dccp_v6_rcv
too.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
34ca6860810342441f801226b19ae6c9e0ecb34f 14-Dec-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Just rename dccp_v4_prot to dccp_prot

To match TCP equivalent.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
48918a4dbd6c599d6af30bd64cb355fadca708eb 30-Oct-2005 Herbert Xu <herbert@gondor.apana.org.au> [DCCP]: Simplify skb_set_owner_w semantics

While we're at it let's reorganise the set_owner_w calls a little so that:

1) dccp_transmit_skb sets the owner for all packets except data packets.
2) Add dccp_skb_entail to set owner for packets queued for retransmission.
3) Make dccp_transmit_skb static.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
ae31c3399d17b1f7bc1742724f70476b5417744f 18-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Move the ack vector code to net/dccp/ackvec.[ch]

Isolating it, that will be used when we introduce a CCID2 (TCP-Like)
implementation.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
67e6b629212fa9ffb7420e8a88a41806af637e28 17-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Introduce DCCP_SOCKOPT_SERVICE

As discussed in the dccp@vger mailing list:

Now applications have to use setsockopt(DCCP_SOCKOPT_SERVICE, service[s]),
prior to calling listen() and connect().

An array of unsigned ints can be passed meaning that the listening sock accepts
connection requests for several services.

With this we can ditch struct sockaddr_dccp and use only sockaddr_in (and
sockaddr_in6 in the future).

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b0e567806d16586629468c824dfb2e71155df7da 09-Sep-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP] Introduce dccp_timestamp

To start the timestamps with 0.0ms, easing the integer maths in the CCIDs, this
probably will be reworked to use the to be introduced struct timeval_offset
infrastructure out of skb_get_timestamp, etc.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
c530cfb1ce1e8f230744c3f3bd86771f50725053 29-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [CCID3]: Call sk->sk_write_space(sk) when receiving a feedback packet

This makes the send rate calculations behave way more closely to what
is specified, with the jitter previously seen on x and x_recv
disappearing completely on non lossy setups.

This resembles the tcp_data_snd_check code, that possibly we'll end up
using in DCCP as well, perhaps moving this code to
inet_connection_sock.

For now I'm doing the simplest implementation tho.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
b6ee3d4ada4e85d9b9b9164c1327ef0850c79d5e 27-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [CCID3]: Reorganise timeval handling

Introducing functions to add to or subtract from a timeval variable
and renaming now_delta to timeval_new_delta that calls do_gettimeofday
and then timeval_delta, that should be used when there are several
deltas made relative to the current time or setting variables to it,
so as to avoid calling do_gettimeofday excessively.

I'm leaving these "timeval_" prefixed functions internal to DCCP for a
while till we're sure there are no subtle bugs in it.

It also is more correct as it checks if the number of usecs added to
or subtracted from a tv_usec field is more than 2 seconds.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d6809c12b3334a929c39bf08ea63bd819e0500f7 27-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Introduce dccp_wait_for_ccid and use it in dccp_write_xmit

This is not quite what I think we should have long term but improves
performance for now, so let's use it till we get CCID3 working well,
then we can think about using sk_write_queue, perhaps using some ideas
from Juwen Lai's old stack for 2.4.20.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d4b81ff70547b40c9b0742b163e8354560003cc0 24-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Export dccp_insert_option_timestamp to CCIDs

And don't insert a TIMESTAMP option in all packets, leave the decision
to the CCIDs.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7ad07e7cf343181002c10c39d3f57a88e4903d4f 24-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Implement the CLOSING timer

So that we retransmit CLOSE/CLOSEREQ packets till they elicit an
answer or we hit a timeout.

Most of the machinery uses TCP approaches, this code has to be
polished & audited, but this is better than we had before.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
03ace394ac9bcad38043a381ae5f4860b9c9fa1c 21-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Fix the ACK and SEQ window variables settings

This is from a first audit, more eyeballs are more than welcome.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1bc0986957b63a2fbbc46ab95d3d1d72830bda83 20-Aug-2005 Ian McDonald <iam4@cs.waikato.ac.nz> [DCCP]: Fix the timestamp options

This changes timestamp, timestamp echo, and elapsed time to use units of 10
usecs as per DCCP spec. This has been tested to verify that times are correct.
Also fixed up length and used hton/ntoh more.

Still to add in later patches:
- actually use elapsed time to adjust RTT
(commented out as was prior to this patch)
- send options at times more closely following the spec
(content is now correct)

Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e92ae93a8aa66aea12935420cb22d4df1c18d023 17-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Send SYNCACK packets in response to SYNC packets

Also fix step 6 when receiving SYNC or SYNCACK packets, i.e. we were not using
the updated swl.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a10cedd4b905236603c6c4fd77cf338ebbfb1a60 15-Aug-2005 Patrick McHardy <kaber@trash.net> [DCCP]: Fix compiler warnings

may be a false warning if there always is something on ccid3hcrx_hist:

net/dccp/ccids/ccid3.c: In function 'ccid3_hc_rx_packet_recv':
net/dccp/ccids/ccid3.c:1634: warning: 'tstamp.tv_usec' may be used uninitialized in this function
net/dccp/ccids/ccid3.c:1634: warning: 'tstamp.tv_sec' may be used uninitialized in this function

const on inline functions doesn't have any effect:

net/dccp/dccp.h:64: warning: type qualifiers ignored on function return type
net/dccp/dccp.h:70: warning: type qualifiers ignored on function return type
net/dccp/dccp.h:76: warning: type qualifiers ignored on function return type

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a1d3a35518779df0579dd9de0121354b49c68ddc 14-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Fix sparse warnings

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
725ba8eee3881e619c8e5a0116f1bdb6480ac2d9 14-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Introduce the DCCP Kernel hacking menu

Only available if CONFIG_DEBUG_KERNEL is enabled in the "Kernel
Hacking" Menu.

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7690af3fff7633e40b1b9950eb8489129251d074 14-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Just reflow the source code to fit in 80 columns

Andrew Morton should be happy now 8)

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
27258ee54f8cd4a43d09319aa5448145afc2cb8d 10-Aug-2005 Arnaldo Carvalho de Melo <acme@mandriva.com> [DCCP]: Introduce dccp_write_xmit from code in dccp_sendmsg

This way it gets closer to the TCP flow, where congestion window
checks are done, it seems we can map ccid_hc_tx_send_packet in
dccp_write_xmit to tcp_snd_wnd_test in tcp_write_xmit, a CCID2
decision should just fit in here as well...

Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
95b81ef794278c835b321f6376b0522cd5df59b7 10-Aug-2005 Yoshifumi Nishida <nishida@csl.sony.co.jp> [DCCP]: Fix checksum routines

Signed-off-by: Yoshifumi Nishida <nishida@csl.sony.co.jp>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7c657876b63cb1d8a2ec06f8fc6c37bb8412e66c 10-Aug-2005 Arnaldo Carvalho de Melo <acme@ghostprotocols.net> [DCCP]: Initial implementation

Development to this point was done on a subversion repository at:

http://oops.ghostprotocols.net:81/cgi-bin/viewcvs.cgi/dccp-2.6/

This repository will be kept at this site for the foreseeable future,
so that interested parties can see the history of this code,
attributions, etc.

If I ever decide to take this offline I'll provide the full history at
some other suitable place.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>