History log of /net/sunrpc/xprtrdma/rpc_rdma.c
Revision Date Author Comments
539431a437d2e5d6d94016184dfc0aab263c01e1 29-Jul-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Don't invalidate FRMRs if registration fails

If FRMR registration fails, it's likely to transition the QP to the
error state. Or, registration may have failed because the QP is
_already_ in ERROR.

Thus calling rpcrdma_deregister_external() in
rpcrdma_create_chunks() is useless in FRMR mode: the LOCAL_INVs just
get flushed.

It is safe to leave existing registrations: when FRMR registration
is tried again, rpcrdma_register_frmr_external() checks if each FRMR
is already/still VALID, and knocks it down first if it is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
6ab59945f292a5c6cbc4a6c2011f1a732a116af2 29-Jul-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Update rkeys after transport reconnect

Various reports of:

rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0
ep ffff8800bfd3e848

Ensure that rkeys in already-marshalled RPC/RDMA headers are
refreshed after the QP has been replaced by a reconnect.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=249
Suggested-by: Selvin Xavier <Selvin.Xavier@Emulex.Com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
c93c62231cf55df4a26bd08937efeea97e6fc5e8 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Disconnect on registration failure

If rpcrdma_register_external() fails during request marshaling, the
current RPC request is killed. Instead, this RPC should be retried
after reconnecting the transport instance.

The most likely reason for registration failure with FRMR is a
failed post_send, which would be due to a remote transport
disconnect or memory exhaustion. These issues can be recovered
by a retry.

Problems encountered in the marshaling logic itself will not be
corrected by trying again, so these should still kill a request.

Now that we've added a clean exit for marshaling errors, take the
opportunity to defang some BUG_ON's.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
e7ce710a8802351bd4118c5d6136c1d850f67cf9 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Avoid deadlock when credit window is reset

Update the cwnd while processing the server's reply. Otherwise the
next task on the xprt_sending queue is still subject to the old
credit window. Currently, no task is awoken if the old congestion
window is still exceeded, even if the new window is larger, and a
deadlock results.

This is an issue during a transport reconnect. Servers don't
normally shrink the credit window, but the client does reset it to
1 when reconnecting so the server can safely grow it again.

As a minor optimization, remove the hack of grabbing the initial
cwnd size (which happens to be RPC_CWNDSCALE) and using that value
as the congestion scaling factor. The scaling value is invariant,
and we are better off without the multiplication operation.
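
The effect of refreshing the window during reply processing can be sketched as follows. This is an illustration only, not the code from this commit: the struct, the function names, and the shift value of 8 are assumptions made for the example.

```c
#include <assert.h>

/* Assumed values, modeled on the RPC_CWNDSCALE constant named above. */
#define EXAMPLE_CWNDSHIFT 8U
#define EXAMPLE_CWNDSCALE (1UL << EXAMPLE_CWNDSHIFT)	/* one request's worth */

/* Hypothetical, reduced view of a transport's congestion state. */
struct example_xprt {
	unsigned long cong;	/* scaled count of requests in flight */
	unsigned long cwnd;	/* scaled congestion (credit) window */
};

/* A new request may be sent only while cong is below cwnd. */
static int example_congested(const struct example_xprt *xprt)
{
	return xprt->cong >= xprt->cwnd;
}

/* Apply the credit grant carried in a reply before waking the next sender. */
static void example_update_cwnd(struct example_xprt *xprt, unsigned int credits)
{
	assert(credits > 0);
	/* A fixed shift replaces multiplying by a saved scaling factor. */
	xprt->cwnd = (unsigned long)credits << EXAMPLE_CWNDSHIFT;
}
```

With a window of one credit and one request in flight, the transport counts as congested. Once a reply granting 32 credits has been applied, the next queued task can proceed.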

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
18906972aa1103c07869c9b43860a52e0e27e8e5 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Reset connection timeout after successful reconnect

If the new connection is able to make forward progress, reset the
re-establish timeout. Otherwise it keeps growing even if disconnect
events are rare.

The same behavior as TCP is adopted: reconnect immediately if the
transport instance has been able to make some forward progress.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
196c69989d84ab902bbe545f7bd8ce78ee74dac4 28-May-2014 Shirley Ma <shirley.ma@oracle.com> xprtrdma: Allocate missing pagelist

GETACL relies on the transport layer to allocate memory for the reply
buffer. However, xprtrdma assumes that the reply buffer (pagelist) has
been pre-allocated by the upper layer. This problem was reported by an
IOL OFA lab test on PPC.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Edward Mossman <emossman@iol.unh.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
13c9ff8f673862b69e795ea99a237b461c557eb3 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Simplify rpcrdma_deregister_external() synopsis

Clean up: All remaining callers of rpcrdma_deregister_external()
pass NULL as the last argument, so remove that argument.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
0ac531c1832318efa3dc3d723e356a7e09330e80 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Remove REGISTER memory registration mode

All kernel RDMA providers except amso1100 support either MTHCAFMR
or FRMR, both of which are faster than REGISTER. amso1100 can
continue to use ALLPHYSICAL.

The only other ULP consumer in the kernel that uses the reg_phys_mr
verb is Lustre.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
b45ccfd25d506e83d9ecf93d0ac7edf031d35d2f 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Remove MEMWINDOWS registration modes

The MEMWINDOWS and MEMWINDOWS_ASYNC memory registration modes were
intended as stop-gap modes before the introduction of FRMR. They
are now considered obsolete.

MEMWINDOWS_ASYNC is also considered unsafe because it can leave
client memory registered and exposed for an indeterminate time after
each I/O.

At this point, the MEMWINDOWS modes add needless complexity, so
remove them.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
03ff8821eb5ed168792667cfc3ddff903e97af99 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: Remove BOUNCEBUFFERS memory registration mode

Clean up: This memory registration mode is slow and was never
meant for use in production environments. Remove it to reduce
implementation complexity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
254f91e2fa1f4cc18fd2eb9d5481888ffe126d5b 28-May-2014 Chuck Lever <chuck.lever@oracle.com> xprtrdma: RPC/RDMA must invoke xprt_wake_pending_tasks() in process context

An IB provider can invoke rpcrdma_conn_func() in an IRQ context,
thus rpcrdma_conn_func() cannot be allowed to directly invoke
generic RPC functions like xprt_wake_pending_tasks().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
0fc6c4e7bb287148eb5e949efd89327929d4841d 28-May-2014 Steve Wise <swise@opengridcomputing.com> xprtrdma: mind the device's max fast register page list depth

Some rdma devices don't support a fast register page list depth of
at least RPCRDMA_MAX_DATA_SEGS. So xprtrdma needs to chunk its fast
register regions according to the minimum of the device max supported
depth or RPCRDMA_MAX_DATA_SEGS.
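
The chunking rule can be sketched as below. This is not the patch itself: EXAMPLE_MAX_DATA_SEGS is a placeholder for RPCRDMA_MAX_DATA_SEGS (its real value is not stated in this log), and both helper names are hypothetical.

```c
#include <assert.h>

/* Placeholder for RPCRDMA_MAX_DATA_SEGS; the value 64 is an assumption. */
#define EXAMPLE_MAX_DATA_SEGS 64U

/* Page list depth used per fast-register region:
 * the smaller of the device limit and the transport limit. */
static unsigned int example_frmr_depth(unsigned int dev_max_depth)
{
	assert(dev_max_depth > 0);
	return dev_max_depth < EXAMPLE_MAX_DATA_SEGS ?
		dev_max_depth : EXAMPLE_MAX_DATA_SEGS;
}

/* Number of fast-register regions needed to cover nsegs segments. */
static unsigned int example_frmr_count(unsigned int nsegs,
				       unsigned int dev_max_depth)
{
	unsigned int depth = example_frmr_depth(dev_max_depth);

	return (nsegs + depth - 1) / depth;	/* round up */
}
```

Under these assumptions, a device that supports a depth of only 16 needs four regions to cover 64 segments, while a device that supports 256 is capped at the transport limit and needs one.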

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2b7bbc963da8d076f263574af4138b5df2e1581f 12-Mar-2014 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Fix large reads on NFS/RDMA

After commit a11a2bf4, "SUNRPC: Optimise away unnecessary data moves
in xdr_align_pages", Thu Aug 2 13:21:43 2012, READs larger than a
few hundred bytes via NFS/RDMA no longer work. This commit exposed
a long-standing bug in rpcrdma_inline_fixup().

I reproduce this with an rsize=4096 mount using the cthon04 basic
tests. Test 5 fails with an EIO error.

For my reproducer, kernel log shows:

NFS: server cheating in read reply: count 4096 > recvd 0

rpcrdma_inline_fixup() is zeroing the xdr_stream::page_len field,
and xdr_align_pages() is now returning that value to the READ XDR
decoder function.

That field is set up by xdr_inline_pages() by the READ XDR encoder
function. As far as I can tell, it is supposed to be left alone
after that, as it describes the dimensions of the reply xdr_stream,
not the contents of that stream.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=68391
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
a4f0835c604f80f945ab3e72ffd00547145c4b2b 08-Jan-2013 Trond Myklebust <Trond.Myklebust@netapp.com> SUNRPC: Eliminate task->tk_xprt accesses that bypass rcu_dereference()

tk_xprt is just a shortcut for tk_client->cl_xprt, however cl_xprt is
defined as an __rcu variable. Replace dereferences of tk_xprt with
non-rcu dereferences where it is safe to do so.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
4a6862b3649d705bf41a36e3c7943d0322a9ee27 20-Feb-2012 Tom Tucker <tom@ogc.us> xprtrdma: The transport should not bug-check when a dup reply is received

The client-side RDMA transport will bug check if it receives a duplicate
reply. Instead, it should simply drop the duplicate reply.

Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
b85417860172ff693dc115d7999805fc240cec1c 25-Nov-2011 Cong Wang <amwang@redhat.com> sunrpc: remove the second argument of k[un]map_atomic()

Signed-off-by: Cong Wang <amwang@redhat.com>
bd7ea31b9e8a342be76e0fe8d638343886c2d8c5 09-Feb-2011 Tom Tucker <tom@ogc.us> RPCRDMA: Fix to XDR page base interpretation in marshalling logic.

The RPCRDMA marshalling logic assumed that xdr->page_base was an
offset into the first page of xdr->page_list. It is in fact an
offset into the xdr->page_list itself, that is, it selects the
first page in the page_list and the offset into that page.
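
The corrected interpretation can be sketched as follows. The sketch assumes 4 KiB pages, and the struct and function names are hypothetical; it shows only how a page_base value splits into a page index and an offset within that page.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_PAGE_SHIFT 12	/* assumes 4 KiB pages */
#define EXAMPLE_PAGE_SIZE  ((size_t)1 << EXAMPLE_PAGE_SHIFT)

/* Hypothetical result type: which page, and where inside it. */
struct example_page_pos {
	size_t index;	/* index of the first page used in the page list */
	size_t offset;	/* byte offset into that page */
};

/* Treat page_base as an offset into the whole page list,
 * not as an offset into the first page only. */
static struct example_page_pos example_locate(size_t page_base, size_t npages)
{
	struct example_page_pos pos;

	pos.index = page_base >> EXAMPLE_PAGE_SHIFT;
	pos.offset = page_base & (EXAMPLE_PAGE_SIZE - 1);
	assert(pos.index < npages);	/* base must fall inside the list */
	return pos;
}
```

A page_base of 4196 therefore selects the second page (index 1) at offset 100, where the old assumption would have read it as an offset of 4196 into page 0.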

The symptom depended in part on the rpc_memreg_strategy. If it was
FRMR, or some other one-shot mapping mode, the connection would get
torn down on a base and bounds error. When the badly marshalled RPC
was retransmitted it would reconnect, get the error, and tear down the
connection again in a loop forever. This resulted in a hung mount. For
the other modes, it would result in silent data corruption. This bug is
most easily reproduced by writing more data than the filesystem
has space for.

This fix corrects the page_base assumption and otherwise simplifies
the iov mapping logic.

Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
15cdc644b268a9a9ce73ce0b153129222c254b7b 11-Aug-2010 Tom Tucker <tom@ogc.us> rpcrdma: Fix SQ size calculation when memreg is FRMR

This patch updates the computation to include the worst case situation
where three FRMR are required to map a single RPC REQ.

Signed-off-by: Tom Tucker <tom@ogc.us>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
b38ab40ad58c1fc43ea590d6342f6a6763ac8fb6 11-Mar-2009 Tom Talpey <tmtalpey@gmail.com> XPRTRDMA: correct an rpc/rdma inline send marshaling error

Certain client rpc's which contain both lengthy page-contained
metadata and a non-empty xdr_tail buffer require careful handling
to avoid overlapped memory copying. Rearranging of existing rpcrdma
marshaling code avoids it; this fixes an NFSv4 symlink creation error
detected with connectathon basic/test8 to multiple servers.

Signed-off-by: Tom Talpey <tmtalpey@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
5f37d561e0f0cd98017c389cbc22080290f11c3c 09-Oct-2008 Tom Talpey <talpey@netapp.com> RPC/RDMA: reformat a debug printk to keep lines together.

The send marshaling code split a particular dprintk across two
lines, which makes it hard to extract from logfiles.

Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
926449ba66ce2a45c619bbe755b00d6bdbf0d83e 09-Oct-2008 Tom Talpey <talpey@netapp.com> RPC/RDMA: return a consistent error, when connect fails.

The xprt_connect call path does not expect such errors as ECONNREFUSED
to be returned from failed transport connection attempts, otherwise it
translates them to EIO and signals fatal errors. For example, mount.nfs
prints simply "internal error". Translate all such errors to ENOTCONN
from RPC/RDMA to match sockets behavior.
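
A minimal sketch of that translation follows. The function name is hypothetical, and the real patch may map errors at a different point in the connect path.

```c
#include <assert.h>
#include <errno.h>

/* Collapse any connect failure into -ENOTCONN so that the caller
 * does not turn it into -EIO and treat it as fatal. */
static int example_connect_status(int err)
{
	assert(err <= 0);	/* kernel-style status: 0 or a negative errno */
	return err ? -ENOTCONN : 0;
}
```

Success passes through unchanged, while a refused connection (-ECONNREFUSED) is reported as -ENOTCONN.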

Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
9191ca3b381b15b9a88785a8ae2fa4db8e553b0c 09-Oct-2008 Tom Talpey <talpey@netapp.com> RPC/RDMA: adhere to protocol for unpadded client trailing write chunks.

The RPC/RDMA protocol allows clients and servers to avoid RDMA
operations for data which is purely the result of XDR padding.
On the client, automatically insert the necessary padding for
such server replies, and optionally don't marshal such chunks.
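
The client-side padding step can be illustrated as below. XDR rounds every item up to a multiple of four bytes; the helper names and the flat buffer are assumptions made for the example and do not reflect how the transport actually lays out reply data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EXAMPLE_XDR_UNIT 4U	/* XDR items are padded to 4-byte multiples */

/* Number of pad bytes needed after len bytes of data (0 to 3). */
static size_t example_xdr_pad(size_t len)
{
	return (EXAMPLE_XDR_UNIT - (len % EXAMPLE_XDR_UNIT)) % EXAMPLE_XDR_UNIT;
}

/* Zero-fill the pad locally, since the server did not send it by RDMA.
 * Returns the padded length. */
static size_t example_insert_pad(unsigned char *buf, size_t buflen,
				 size_t datalen)
{
	size_t pad = example_xdr_pad(datalen);

	assert(datalen + pad <= buflen);
	memset(buf + datalen, 0, pad);
	return datalen + pad;
}
```

For five bytes of data the client supplies three zero bytes itself, so no RDMA operation is spent on the pad.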

Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
575448bd36208f99fe0dd554a43518d798966740 09-Oct-2008 Tom Talpey <talpey@netapp.com> RPC/RDMA: suppress retransmit on RPC/RDMA clients.

An RPC/RDMA client cannot retransmit on an unbroken connection;
doing so violates its flow control with the server.

Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
6067804047b64dde89f4f133fc7eba48ee44107d 21-Sep-2008 Arnaldo Carvalho de Melo <acme@redhat.com> net: Use hton[sl]() instead of __constant_hton[sl]() where applicable

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
d4b37ff73540ab90bee57b882a10b21e2f97939f 26-Oct-2007 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Fix an unnecessary implicit type cast in rpcrdma_count_chunks()

Nit: rl_nchunks is an unsigned integer, so pass it into
rpcrdma_count_chunks() via an unsigned integer argument. This eliminates
a harmless mixed sign comparison in rpcrdma_count_chunks()

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Thomas Talpey <Thomas.Talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2a428b2b8fe2c270a5889086ebe3ab914e3ea7d8 26-Oct-2007 Chuck Lever <chuck.lever@oracle.com> SUNRPC: Prevent mixed sign comparisons in rpcrdma_convert_iovs()

Keep the type of the buffer position the same during iovec conversion to
reduce the likelihood of unexpected results from comparisons and length
computations.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Thomas Talpey <Thomas.Talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
8d614434ab77b440b69e66a9bd44e46e7194c34a 11-Dec-2007 YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> [SUNRPC]: Use htonl() where appropriate.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
50e1092b3a119bb4660bb6bd2e1749dc2d8ac62e 10-Dec-2007 James Lentini <jlentini@netapp.com> SUNRPC xprtrdma: fix XDR tail buf marshalling for all ops

rpcrdma_convert_iovs is passed an xdr_buf representing either an RPC
request or an RPC reply. In the case of a request, several
calculations and tests involving pos are unnecessary. In the case of a
reply, several calculations and tests involving pos are incorrect (the
code tests pos against the reply xdr buf's len field, which is always
0 at the time rpcrdma_convert_iovs is executed). This change removes
the incorrect/unnecessary calculations and tests involving pos.

This fixes an observed problem when reading certain file sizes over
NFS/RDMA.

Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: James Lentini <jlentini@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
e08a132b0ef3cf89dfbf1dea2c6248ea624bdcd7 30-Oct-2007 Stephen Rothwell <sfr@canb.auug.org.au> [SUNRPC] rpc_rdma: we need to cast u64 to unsigned long long for printing

as some architectures have unsigned long for u64.

net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_create_chunks':
net/sunrpc/xprtrdma/rpc_rdma.c:222: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'
net/sunrpc/xprtrdma/rpc_rdma.c:234: warning: format '%llx' expects type 'long long unsigned int', but argument 5 has type 'u64'
net/sunrpc/xprtrdma/rpc_rdma.c: In function 'rpcrdma_count_chunks':
net/sunrpc/xprtrdma/rpc_rdma.c:577: warning: format '%llx' expects type 'long long unsigned int', but argument 4 has type 'u64'

Noticed on PowerPC pseries_defconfig build.
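
The portable form of such a print can be sketched in user-space C as follows. uint64_t stands in for the kernel's u64 here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit value with %llx. The explicit cast keeps the argument
 * type correct on platforms where the 64-bit type is unsigned long. */
static int example_format_addr(char *buf, size_t len, uint64_t addr)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%llx", (unsigned long long)addr);
}
```

The cast costs nothing at run time on platforms where both types are 64 bits wide; it only makes the argument type match the format string.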

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2d8a972661832719931b0dd5b80e97215cb93d94 29-Oct-2007 Al Viro <viro@ftp.linux.org.uk> SUNRPC endianness annotations

rpcrdma stuff lacks endianness annotations for on-the-wire data.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e96018280cb36210f4c69663561825114a57e7e1 10-Sep-2007 "Talpey, Thomas" <Thomas.Talpey@netapp.com> RPCRDMA: rpc rdma protocol implementation

This implements the marshaling and unmarshaling of the rpcrdma transport
headers. Connection management is also addressed.

Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
f58851e6b0f148fb4b2a1c6f70beb2f125863c0f 10-Sep-2007 "Talpey, Thomas" <Thomas.Talpey@netapp.com> RPCRDMA: rpc rdma transport switch

This implements the configuration and building of the core transport
switch implementation of the rpcrdma transport. Stubs are provided for
the rpcrdma protocol handling, and the infiniband/iwarp verbs interface.
These are provided in following patches.

Signed-off-by: Tom Talpey <talpey@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>