History log of /net/ceph/messenger.c
Revision Date Author Comments
89baaa570ab0b476db09408d209578cfed700e9f 16-Oct-2014 Mike Christie <michaelc@cs.wisc.edu> libceph: use memalloc flags for net IO

This patch has ceph's lib code use the memalloc flags.

If the VM layer needs to write data out to free up memory to handle new
allocation requests, the block layer must be able to make forward progress.
To handle that requirement we use structs like mempools to reserve memory for
objects like bios and requests.

The problem is when we send/receive block layer requests over the network
layer, net skb allocations can fail and the system can lock up.
To solve this, the memalloc related flags were added. NBD, iSCSI
and NFS uses these flags to tell the network/vm layer that it should
use memory reserves to fullfill allcation requests for structs like
skbs.

I am running ceph in a bunch of VMs in my laptop, so this patch was
not tested very harshly.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
f9865f06f7f18c6661c88d0511f05c48612319cc 10-Oct-2014 Ilya Dryomov <idryomov@redhat.com> libceph: ceph-msgr workqueue needs a resque worker

Commit f363e45fd118 ("net/ceph: make ceph_msgr_wq non-reentrant")
effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq. This is
wrong - libceph is very much a memory reclaim path, so restore it.

Cc: stable@vger.kernel.org # needs backporting for < 3.12
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
e4339d28f640a7c0d92903bcf389a2dfa281270d 16-Sep-2014 Yan, Zheng <zyan@redhat.com> libceph: reference counting pagelist

this allow pagelist to present data that may be sent multiple times.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
b9a678994b4a64b1106ab2cf7cfe7cbc10bb6f40 10-Sep-2014 Joe Perches <joe@perches.com> libceph: Convert pr_warning to pr_warn

Use the more common pr_warn.

Other miscellanea:

o Coalesce formats
o Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
5f740d7e1531099b888410e6bab13f68da9b1a4d 07-Aug-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly

Determining ->last_piece based on the value of ->page_offset + length
is incorrect because length here is the length of the entire message.
->last_piece set to false even if page array data item length is <=
PAGE_SIZE, which results in invalid length passed to
ceph_tcp_{send,recv}page() and causes various asserts to fire.

# cat pages-cursor-init.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
# rbd_resize calls librbd rbd_resize(), size is in bytes
./rbd_resize bar $(((4 << 20) + 512))
rbd resize --size 10 bar
BAR_DEV=$(rbd map bar)
# trigger a 512-byte copyup -- 512-byte page array data item
dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5

The problem exists only in ceph_msg_data_pages_cursor_init(),
ceph_msg_data_pages_advance() does the right thing. The size_t cast is
unnecessary.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
37ab77ac29b5bdec029a66f6d6eb4756679c7e12 24-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: drop osd ref when canceling con work

queue_con() bumps osd ref count. We should do the reverse when
canceling con work.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
0215e44bb390a968d01404aa2f35af56f9b55fc8 20-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: move and add dout()s to ceph_msg_{get,put}()

Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn
kref release callback into a static function.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
178eda29ca721842f2146378e73d43e0044c4166 22-Apr-2014 Chunwei Chen <tuxoko@gmail.com> libceph: fix corruption when using page_count 0 page in rbd

It has been reported that using ZFSonLinux on rbd will result in memory
corruption. The bug report can be found here:

https://github.com/zfsonlinux/spl/issues/241
http://tracker.ceph.com/issues/7790

The reason is that ZFS will send pages with page_count 0 into rbd, which in
turns send them to tcp_sendpage. However, tcp_sendpage cannot deal with
page_count 0, as it will do get_page and put_page, and erroneously free the
page.

This type of issue has been noted before, and handled in iscsi, drbd,
etc. So, rbd should also handle this. This fix address this issue by fall back
to slower sendmsg when page_count 0 detected.

Cc: Sage Weil <sage@inktank.com>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
676d23690fb62b5d51ba5d659935e9f7d9da9f8e 11-Apr-2014 David S. Miller <davem@davemloft.net> net: Fix use after free by removing length arg from sk_data_ready callbacks.

Several spots in the kernel perform a sequence like:

skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.

Signed-off-by: David S. Miller <davem@davemloft.net>
d90deda69cb82411ba7d990e97218e0f8b2d07bb 22-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com> libceph: fix oops in ceph_msg_data_{pages,pagelist}_advance()

When there is no more data, ceph_msg_data_{pages,pagelist}_advance()
should not move on to the next page.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
0ec1d15ec6ed513ab2cc86c67d94761d71228a32 05-Feb-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: do not dereference a NULL bio pointer

Commit f38a5181d9f3 ("ceph: Convert to immutable biovecs") introduced
a NULL pointer dereference, which broke rbd in -rc1. Fix it.

Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
eeb0bed5572b1282009dfc2635604df5a35d1a02 09-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: add ceph_kv{malloc,free}() and switch to them

Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into
two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.

ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is
vmalloc()'ed with __GFP_HIGHMEM set. This changes the existing
behaviour:

- for buffers (ceph_buffer_new()), from trying to kmalloc() everything
and using vmalloc() just as a fallback

- for messages (ceph_msg_new()), from going to vmalloc() for anything
bigger than a page

- for messages (ceph_msg_new()), from disallowing vmalloc() to use high
memory

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
f2be82b0058e90b5d9ac2cb896b4914276fb50ef 09-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: fix preallocation check in get_reply()

The check that makes sure that we have enough memory allocated to read
in the entire header of the message in question is currently busted.
It compares front_len of the incoming message with iov_len field of
ceph_msg::front structure, which is used primarily to indicate the
amount of data already read in, and not the size of the allocated
buffer. Under certain conditions (e.g. a short read from a socket
followed by that socket's shutdown and owning ceph_connection reset)
this results in a warning similar to

[85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0)

and, through another bug, leads to forever hung tasks and forced
reboots. Fix this by comparing front_len with front_alloc_len field of
struct ceph_msg, which stores the actual size of the buffer.

Fixes: http://tracker.ceph.com/issues/5425

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
3cea4c3071d4e55e9d7356efe9d0ebf92f0c2204 09-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: rename ceph_msg::front_max to front_alloc_len

Rename front_max field of struct ceph_msg to front_alloc_len to make
its purpose more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
f48db1e9ac6f1578ab7efef9f66c70279e2f0cb5 30-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: use CEPH_MON_PORT when the specified port is 0

Similar to userspace, don't bail with "parse_ips bad ip ..." if the
specified port is port 0, instead use port CEPH_MON_PORT (6789, the
default monitor port).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2b3e0c905af43cfe402a2ef3f800be5dc1684005 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: update ceph_features.h

This updates ceph_features.h so that it has all feature bits defined in
ceph.git. In the interim since the last update, ceph.git crossed the
"32 feature bits" point, and, the addition of the 33rd bit wasn't
handled correctly. The work-around is squashed into this commit and
reflects ceph.git commit 053659d05e0349053ef703b414f44965f368b9f0.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12b4629a9fb80fecaebadc217b13b8776ed8dbef 24-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: all features fields must be u64

In preparation for ceph_features.h update, change all features fields
from unsigned int/u32 to u64. (ceph.git has ~40 feature bits at this
point.)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
f38a5181d9f3e004b1f50f9d7e1f2a8492ce240a 07-Aug-2013 Kent Overstreet <kmo@daterainc.com> ceph: Convert to immutable biovecs

Now that we've got a mechanism for immutable biovecs -
bi_iter.bi_bvec_done - we need to convert drivers to use primitives that
respect it instead of using the bvec array directly.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
4d1829a59de402fc95daf4576c51aa0a7439aee8 30-Jul-2013 Tejun Heo <tj@kernel.org> ceph: WQ_NON_REENTRANT is meaningless and going away

dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Cc: ceph-devel@vger.kernel.org
64dc61306ce7da370833289739e2f52dfc6b37ba 23-Jul-2013 Eric Dumazet <edumazet@google.com> net: add sk_stream_is_writeable() helper

Several call sites use the hardcoded following condition :

sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)

Lets use a helper because TCP_NOTSENT_LOWAT support will change this
condition for TCP sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
81b36be4c56299ac4c4c786908cb117ad232b62e 01-May-2013 Alex Elder <elder@inktank.com> libceph: allocate ceph message data with a slab allocator

Create a slab cache to manage ceph_msg_data structure allocation.

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e3d5d6380482b4a5e2e9d0d662f2ef6d56504aef 01-May-2013 Alex Elder <elder@inktank.com> libceph: allocate ceph messages with a slab allocator

Create a slab cache to manage ceph_msg structure allocation.

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a51b272e9e99f912e8e07d4c9f58c1d433afea7c 19-Apr-2013 Alex Elder <elder@inktank.com> libceph: fix two messenger bugs

This patch makes four small changes in the ceph messenger.

While getting copyup functionality working I found two bugs in the
messenger. Existing paths through the code did not trigger these
problems, but they're fixed here:
- In ceph_msg_data_pagelist_cursor_init(), the cursor's
last_piece field was being checked against the length
supplied. This was OK until this commit: ccba6d98 libceph:
implement multiple data items in a message That commit changed
the cursor init routines to allow lengths to be supplied that
exceeded the size of the current data item. Because of this,
we have to use the assigned cursor resid field rather than the
provided length in determining whether the cursor points to
the last piece of a data item.
- In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
catching attempts to add page data to a message if the message
already had data assigned to it. That was OK until that same
commit, at which point it was fine for messages to have
multiple data items. It slipped through because that BUG_ON()
call was present twice in that function. (You can never be too
careful.)

In addition two other minor things are changed:
- In ceph_msg_data_cursor_init(), the local variable "data" was
getting assigned twice.
- In ceph_msg_data_advance(), it was assumed that the
type-specific advance routine would set new_piece to true
after it advanced past the last piece. That may have been
fine, but since we check for that case we might as well set it
explicitly in ceph_msg_data_advance().

This resolves:
http://tracker.ceph.com/issues/4762

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
90af36022aecdeeb1b9c0755461187de717c86dd 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: add, don't set data for a message

Change the names of the functions that put data on a pagelist to
reflect that we're adding to whatever's already there rather than
just setting it to the one thing. Currently only one data item is
ever added to a message, but that's about to change.

This resolves:
http://tracker.ceph.com/issues/2770

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ca8b3a69174b04376722672d7dd6b666a7f17c50 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: implement multiple data items in a message

This patch adds support to the messenger for more than one data item
in its data list.

A message data cursor has two more fields to support this:
- a count of the number of bytes left to be consumed across
all data items in the list, "total_resid"
- a pointer to the head of the list (for validation only)

The cursor initialization routine has been split into two parts: the
outer one, which initializes the cursor for traversing the entire
list of data items; and the inner one, which initializes the cursor
to start processing a single data item.

When a message cursor is first initialized, the outer initialization
routine sets total_resid to the length provided. The data pointer
is initialized to the first data item on the list. From there, the
inner initialization routine finishes by setting up to process the
data item the cursor points to.

Advancing the cursor consumes bytes in total_resid. If the resid
field reaches zero, it means the current data item is fully
consumed. If total_resid indicates there is more data, the cursor
is advanced to point to the next data item, and then the inner
initialization routine prepares for using that. (A check is made at
this point to make sure we don't wrap around the front of the list.)

The type-specific init routines are modified so they can be given a
length that's larger than what the data item can support. The resid
field is initialized to the smaller of the provided length and the
length of the entire data item.

When total_resid reaches zero, we're done.

This resolves:
http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5240d9f95dfe0f0701b35fbff1cb5b70825ad23f 14-Mar-2013 Alex Elder <elder@inktank.com> libceph: replace message data pointer with list

In place of the message data pointer, use a list head which links
through message data items. For now we only support a single entry
on that list.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8ae4f4f5c056150d5480710ab356801e84d01a3d 14-Mar-2013 Alex Elder <elder@inktank.com> libceph: have cursor point to data

Rather than having a ceph message data item point to the cursor it's
associated with, have the cursor point to a data item. This will
allow a message cursor to be used for more than one data item.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
36153ec9dd6287d7cedf6afb51453c445d946cee 14-Mar-2013 Alex Elder <elder@inktank.com> libceph: move cursor into message

A message will only be processing a single data item at a time, so
there's no need for each data item to have its own cursor.

Move the cursor embedded in the message data structure into the
message itself. To minimize the impact, keep the data->cursor
field, but make it be a pointer to the cursor in the message.

Move the definition of ceph_msg_data above ceph_msg_data_cursor so
the cursor can point to the data without a forward definition rather
than vice-versa.

This and the upcoming patches are part of:
http://tracker.ceph.com/issues/3761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c851c49591ebf000c610711e39eea7da5ff05b21 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: record bio length

The bio is the only data item type that doesn't record its full
length. Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f759ebb968dbf185fc079dd2e824b1aa3a3d71aa 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: skip message if too big to receive

We know the length of our message buffers. If we get a message
that's too long, just dump it and ignore it. If skip was set
then con->in_msg won't be valid, so be careful not to dereference
a null pointer in the process.

This resolves:
http://tracker.ceph.com/issues/4664

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ea96571f7b865edaf1acd472e6f2cddc9fb67892 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: fix possible CONFIG_BLOCK build problem

This patch:
15a0d7b libceph: record message data length
did not enclose some bio-specific code inside CONFIG_BLOCK as
it should have. Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
98fa5dd883aadbb0020b68d0f9367ba152dfe511 02-Apr-2013 Alex Elder <elder@inktank.com> libceph: provide data length when preparing message

In prepare_message_data(), the length used to initialize the cursor
is taken from the header of the message provided. I'm working
toward not using the header data length field to determine length in
outbound messages, and this is a step in that direction. For
inbound messages this will be set to be the actual number of bytes
that are arriving (which may be less than the total size of the data
buffer available).

This resolves:
http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a19308048182d5f9e16b03b1d1c038d9346c7589 14-Mar-2013 Alex Elder <elder@inktank.com> libceph: record message data length

Keep track of the length of the data portion for a message in a
separate field in the ceph_msg structure. This information has
been maintained in wire byte order in the message header, but
that's going to change soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
56fc5659162965ce3018a34c6bb8a022f3a3b33c 31-Mar-2013 Alex Elder <elder@inktank.com> libceph: account for alignment in pages cursor

When a cursor for a page array data message is initialized it needs
to determine the initial value for cursor->last_piece. Currently it
just checks if length is less than a page, but that's not correct.
The data in the first page in the array will be offset by a page
offset based on the alignment recorded for the data. (All pages
thereafter will be aligned at the base of the page, so there's
no need to account for this except for the first page.)

Because this was wrong, there was a case where the length of a piece
would be calculated as all of the residual bytes in the message and
that plus the page offset could exceed the length of a page.

So fix this case. Make sure the sum won't wrap.

This resolves a third issue described in:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
5df521b1eecf276c4bae8ffb7945acef45530449 30-Mar-2013 Alex Elder <elder@inktank.com> libceph: page offset must be less than page size

Currently ceph_msg_data_pages_advance() allows the page offset value
to be PAGE_SIZE, apparently assuming ceph_msg_data_pages_next() will
treat it as 0. But that doesn't happen, and the result led to a
helpful assertion failure.

Change ceph_msg_data_pages_advance() to truncate the offset to 0
before returning if it reaches PAGE_SIZE.

Make a few other minor adjustments in this area (comments and a
better assertion) while modifying it.

This resolves a second issue described in:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
1190bf06a6b033384a65b5acdb1193d41cd257a6 30-Mar-2013 Alex Elder <elder@inktank.com> libceph: fix broken data length assertions

It's OK for the result of a read to come back with fewer bytes than
were requested. So don't trigger a BUG() in that case when
initializing the data cursor.

This resolves the first problem described in:
http://tracker.ceph.com/issues/4598

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
6644ed7b7e04f8e588aebdaa58cededb9416ab95 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: make message data be a pointer

Begin the transition from a single message data item to a list of
them by replacing the "data" structure in a message with a pointer
to a ceph_msg_data structure.

A null pointer will indicate the message has no data; replace the
use of ceph_msg_has_data() with a simple check for a null pointer.

Create functions ceph_msg_data_create() and ceph_msg_data_destroy()
to dynamically allocate and free a data item structure of a given type.

When a message has its data item "set," allocate one of these to
hold the data description, and free it when the last reference to
the message is dropped.

This partially resolves:
http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8ea299bcbc85aeaf5348d99614b35433287bec6b 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: use only ceph_msg_data_advance()

The *_msg_pos_next() functions do little more than call
ceph_msg_data_advance(). Replace those wrapper functions with
a simple call to ceph_msg_data_advance().

This cleanup is related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
143334ff446d634fcd3145919b5cddcc9148a74a 29-Mar-2013 Alex Elder <elder@inktank.com> libceph: don't add to crc unless data sent

In write_partial_message_data() we aggregate the crc for the data
portion of the message as each new piece of the data item is
encountered. Because it was computed *before* sending the data, if
an attempt to send a new piece resulted in 0 bytes being sent, the
crc crc across that piece would erroneously get computed again and
added to the aggregate result. This would occasionally happen in
the evnet of a connection failure.

The crc value isn't really needed until the complete value is known
after sending all data, so there's no need to compute it before
sending.

So don't calculate the crc for a piece until *after* we know at
least one byte of it has been sent. That will avoid this problem.

This resolves:
http://tracker.ceph.com/issues/4450

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
f5db90bcf2c69d099f9d828a8104796f41de6bc5 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: kill last of ceph_msg_pos

The only remaining field in the ceph_msg_pos structure is
did_page_crc. In the new cursor model of things that flag (or
something like it) belongs in the cursor.

Define a new field "need_crc" in the cursor (which applies to all
types of data) and initialize it to true whenever a cursor is
initialized.

In write_partial_message_data(), the data CRC still will be computed
as before, but it will check the cursor->need_crc field to determine
whether it's needed. Any time the cursor is advanced to a new piece
of a data item, need_crc will be set, and this will cause the crc
for that entire piece to be accumulated into the data crc.

In write_partial_message_data() the intermediate crc value is now
held in a local variable so it doesn't have to be byte-swapped so
many times. In read_partial_msg_data() we do something similar
(but mainly for consistency there).

With that, the ceph_msg_pos structure can go away, and it no longer
needs to be passed as an argument to prepare_message_data().

This cleanup is related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
859a35d5523e8e6a5c3568c12febe2e1270bc3a1 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: kill most of ceph_msg_pos

All but one of the fields in the ceph_msg_pos structure are now
never used (only assigned), so get rid of them. This allows
several small blocks of code to go away.

This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
643c68a4a990612720479078f3450d5b766da9f2 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: use cursor resid for loop condition

Use the "resid" field of a cursor rather than finding when the
message data position has moved up to meet the data length to
determine when all data has been sent or received in
write_partial_message_data() and read_partial_msg_data().

This is cleanup of old code related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4c59b4a278f9b7a418ad8af933fd7b341df64393 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: collapse all data items into one

It turns out that only one of the data item types is ever used at
any one time in a single message (currently).
- A page array is used by the osd client (on behalf of the file
system) and by rbd. Only one osd op (and therefore at most
one data item) is ever used at a time by rbd. And the only
time the file system sends two, the second op contains no
data.
- A bio is only used by the rbd client (and again, only one
data item per message)
- A page list is used by the file system and by rbd for outgoing
data, but only one op (and one data item) at a time.

We can therefore collapse all three of our data item fields into a
single field "data", and depend on the messenger code to properly
handle it based on its type.

This allows us to eliminate quite a bit of duplicated code.

This is related to:
http://tracker.ceph.com/issues/4429

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
686be20875db63c6103573565c63db20153ee6e1 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: get rid of read helpers

Now that read_partial_message_pages() and read_partial_message_bio()
are literally identical functions we can factor them out. They're
pretty simple as well, so just move their relevant content into
read_partial_msg_data().

This is and previous patches together resolve:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
61fcdc97c06bce7b6d16dd2a6b478f24cd121d96 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: no outbound zero data

There is handling in write_partial_message_data() for the case where
only the length of--and no other information about--the data to be
sent has been specified. It uses the zero page as the source of
data to send in this case.

This case doesn't occur. All message senders set up a page array,
pagelist, or bio describing the data to be sent. So eliminate the
block of code that handles this (but check and issue a warning for
now, just in case it happens for some reason).

This resolves:
http://tracker.ceph.com/issues/4426

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
878efabd3236abaedd0a4539bbb248ac69fed115 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: use cursor for inbound data pages

The cursor code for a page array selects the right page, page
offset, and length to use for a ceph_tcp_recvpage() call, so
we can use it to replace a block in read_partial_message_pages().

This partially resolves:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6518be47f910f62a98cb6044dbb457af55241f95 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: kill ceph message bio_iter, bio_seg

The bio_iter and bio_seg fields in a message are no longer used, we
use the cursor instead. So get rid of them and the functions that
operate on them them.

This is related to:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
463207aa40cf2cadcae84866b3f85ccaa7022ee8 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: use cursor for bio reads

Replace the use of the information in con->in_msg_pos for incoming
bio data. The old in_msg_pos and the new cursor mechanism do
basically the same thing, just slightly differently.

The main functional difference is that in_msg_pos keeps track of the
length of the complete bio list, and assumed it was fully consumed
when that many bytes had been transferred. The cursor does not assume
a length, it simply consumes all bytes in the bio list. Because the
only user of bio data is the rbd client, and because the length of a
bio list provided by rbd client always matches the number of bytes
in the list, both ways of tracking length are equivalent.

In addition, for in_msg_pos the initial bio vector is selected as
the initial value of the bio->bi_idx, while the cursor assumes this
is zero. Again, the rbd client always passes 0 as the initial index
so the effect is the same.

Other than that, they basically match:
in_msg_pos cursor
---------- ------
bio_iter bio
bio_seg vec_index
page_pos page_offset

The in_msg_pos field is initialized by a call to init_bio_iter().
The bio cursor is initialized by ceph_msg_data_cursor_init().
Both now happen in the same spot, in prepare_message_data().

The in_msg_pos field is advanced by a call to in_msg_pos_next(),
which updates page_pos and calls iter_bio_next() to move to the next
bio vector, or to the next bio in the list. The cursor is advanced
by ceph_msg_data_advance(). That isn't currently happening so
add a call to that in in_msg_pos_next().

Finally, the next piece of data to use for a read is determined
by a bunch of lines in read_partial_message_bio(). Those can be
replaced by an equivalent ceph_msg_data_bio_next() call.

This partially resolves:
http://tracker.ceph.com/issues/4428

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
25aff7c559c8b54a810bc094d59fe037cfed6b18 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: record residual bytes for all message data types

All of the data types can use this, not just the page array. Until
now, only the bio type doesn't have it available, and only the
initiator of the request (the rbd client) is able to supply the
length of the full request without re-scanning the bio list. Change
the cursor init routines so the length is supplied based on the
message header "data_len" field, and use that length to intiialize
the "resid" field of the cursor.

In addition, change the way "last_piece" is defined so it is based
on the residual number of bytes in the original request. This is
necessary (at least for bio messages) because it is possible for
a read request to succeed without consuming all of the space
available in the data buffer.

This resolves:
http://tracker.ceph.com/issues/4427

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
28a89ddece39890c255a0c41baf622731a08c288 12-Mar-2013 Alex Elder <elder@inktank.com> libceph: drop pages parameter

The value passed for "pages" in read_partial_message_pages() is
always the pages pointer from the incoming message, which can be
derived inside that function. So just get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
888334f966fab232fe9158c2c2f0a935e356b583 25-Mar-2013 Alex Elder <elder@inktank.com> libceph: initialize data fields on last msg put

When the last reference to a ceph message is dropped,
ceph_msg_last_put() is called to clean things up.

For "normal" messages (allocated via ceph_msg_new() rather than
being allocated from a memory pool) it's sufficient to just release
resources. But for a mempool-allocated message we actually have to
re-initialize the data fields in the message back to initial state
so they're ready to go in the event the message gets reused.

Some of this was already done; this fleshes it out so it's done
more completely.

This resolves:
http://tracker.ceph.com/issues/4540

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
20e55c4cc758e4dccdfd92ae8e9588dd624b2cd7 25-Mar-2013 Sage Weil <sage@inktank.com> libceph: clear messenger auth_retry flag when we authenticate

We maintain a counter of failed auth attempts to allow us to retry once
before failing. However, if the second attempt succeeds, the flag isn't
cleared, which makes us think auth failed again later when the connection
resets for other reasons (like a socket error).

This is one part of the sorry sequence of events in bug

http://tracker.ceph.com/issues/4282

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
3a23083bda56850a1dc0e1c6d270b1f5dc789f07 25-Mar-2013 Sage Weil <sage@inktank.com> libceph: implement RECONNECT_SEQ feature

This is an old protocol extension that allows the client and server to
avoid resending old messages after a reconnect (following a socket error).
Instead, the exchange their sequence numbers during the handshake. This
avoids sending a bunch of useless data over the socket.

It has been supported in the server code since v0.22 (Sep 2010).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
8a166d05369f6a0369bb194a795e6e3928ac6e34 08-Mar-2013 Alex Elder <elder@inktank.com> libceph: more cleanup of write_partial_msg_pages()

Basically all cases in write_partial_msg_pages() use the cursor, and
as a result we can simplify that function quite a bit.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9d2a06c2750177dca5f8d0e89884c1d409d64bbc 08-Mar-2013 Alex Elder <elder@inktank.com> libceph: kill message trail

The wart that is the ceph message trail can now be removed, because
its only user was the osd client, and the previous patch made that
no longer the case.

The result allows write_partial_msg_pages() to be simplified
considerably.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e766d7b55e10f93c7bab298135a4e90dcc46620d 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: implement pages array cursor

Implement and use cursor routines for page array message data items
for outbound message data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6aaa4511deb4b0fd776d1153dc63a89cdc024fb8 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: implement bio message data item cursor

Implement and use cursor routines for bio message data items for
outbound message data.

(See the previous commit for reasoning in support of the changes
in out_msg_pos_next().)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7fe1e5e57b84eab98ff352519aa66e86dac5bf61 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: use data cursor for message pagelist

Switch to using the message cursor for the (non-trail) outgoing
pagelist data item in a message if present.

Notes on the logic changes in out_msg_pos_next():
- only the mds client uses a ceph pagelist for message data;
- if the mds client ever uses a pagelist, it never uses a page
array (or anything else, for that matter) for data in the same
message;
- only the osd client uses the trail portion of a message data,
and when it does, it never uses any other data fields for
outgoing data in the same message; and finally
- only the rbd client uses bio message data (never pagelist).

Therefore out_msg_pos_next() can assume:
- if we're in the trail portion of a message, the message data
pagelist, data, and bio can be ignored; and
- if there is a page list, there will never be any a bio or page
array data, and vice-versa.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
dd236fcb65d7b6b80c408cb5f66aab55f4594284 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: prepare for other message data item types

This just inserts some infrastructure in preparation for handling
other types of ceph message data items. No functional changes,
just trying to simplify review by separating out some noise.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
fe38a2b67bc6b3a60da82a23e9082256a30e39d9 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: start defining message data cursor

This patch lays out the foundation for using generic routines to
manage processing items of message data.

For simplicity, we'll start with just the trail portion of a
message, because it stands alone and is only present for outgoing
data.

First some basic concepts. We'll use the term "data item" to
represent one of the ceph_msg_data structures associated with a
message. There are currently four of those, with single-letter
field names p, l, b, and t. A data item is further broken into
"pieces" which always lie in a single page. A data item will
include a "cursor" that will track state as the memory defined by
the item is consumed by sending data from or receiving data into it.

We define three routines to manipulate a data item's cursor: the
"init" routine; the "next" routine; and the "advance" routine. The
"init" routine initializes the cursor so it points at the beginning
of the first piece in the item. The "next" routine returns the
page, page offset, and length (limited by both the page and item
size) of the next unconsumed piece in the item. It also indicates
to the caller whether the piece being returned is the last one in
the data item.

The "advance" routine consumes the requested number of bytes in the
item (advancing the cursor). This is used to record the number of
bytes from the current piece that were actually sent or received by
the network code. It returns an indication of whether the result
means the current piece has been fully consumed. This is used by
the message send code to determine whether it should calculate the
CRC for the next piece processed.

The trail of a message is implemented as a ceph pagelist. The
routines defined for it will be usable for non-trail pagelist data
as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
437945094fed0deb1810e8da95465c8f26bc6f80 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: abstract message data

Group the types of message data into an abstract structure with a
type indicator and a union containing fields appropriate to the
type of data it represents. Use this to represent the pages,
pagelist, bio, and trail in a ceph message.

Verify message data is of type NONE in ceph_msg_data_set_*()
routines. Since information about message data of type NONE really
should not be interpreted, get rid of the other assertions in those
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f9e15777afd87585f2222dfd446c2e52deb65eba 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: be explicit about message data representation

A ceph message has a data payload portion. The memory for that data
(either the source of data to send or the location to place data
that is received) is specified in several ways. The ceph_msg
structure includes fields for all of those ways, but this
mispresents the fact that not all of them are used at a time.

Specifically, the data in a message can be in:
- an array of pages
- a list of pages
- a list of Linux bios
- a second list of pages (the "trail")
(The two page lists are currently only ever used for outgoing data.)

Impose more structure on the ceph message, making the grouping of
some of these fields explicit. Shorten the name of the
"page_alignment" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
97fb1c7f6637ee61c90b8bc186d464cfd426b063 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: define ceph_msg_has_*() data macros

Define and use macros ceph_msg_has_*() to determine whether to
operate on the pages, pagelist, bio, and trail fields of a message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
35b6280899424a0faf5410ce1ee86f9682528e6c 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: define and use ceph_crc32c_page()

Factor out a common block of code that updates a CRC calculation
over a range of data in a page.

This and the preceding patches are related to:
http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
afb3d90e205140415477d501ff9e2a33ff0b197f 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: define and use ceph_tcp_recvpage()

Define a new function ceph_tcp_recvpage() that behaves in a way
comparable to ceph_tcp_sendpage().

Rearrange the code in both read_partial_message_pages() and
read_partial_message_bio() so they have matching structure,
(similar to what's in write_partial_msg_pages()), and use
this new function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
34d2d2006cc82fd21f716e10568b8c8b4ef61c0e 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: encapsulate reading message data

Pull the code that reads the data portion into a message into
a separate function read_partial_msg_data().

Rename write_partial_msg_pages() to be write_partial_message_data()
to match its read counterpart, and to reflect its more generic
purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e387d525b0ceeecf07b074781eab77414dc9697e 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: small write_partial_msg_pages() refactor

Define local variables page_offset and length to represent the range
of bytes within a page that will be sent by ceph_tcp_sendpage() in
write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
78625051b524e104332e69a9079d0ee9a2100cf2 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: consolidate message prep code

In prepare_write_message_data(), various fields are initialized in
preparation for writing message data out. Meanwhile, in
read_partial_message(), there is essentially the same block of code,
operating on message variables associated with an incoming message.

Generalize prepare_write_message_data() so it works for both
incoming and outcoming messages, and use it in both spots. The
did_page_crc is not used for input (so it's harmless to initialize
it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bae6acd9c65cbfeffc66a9f48ae91dca6e3aec85 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: use local variables for message positions

There are several places where a message's out_msg_pos or in_msg_pos
field is used repeatedly within a function. Use a local pointer
variable for this purpose to unclutter the code.

This and the upcoming cleanup patches are related to:
http://tracker.ceph.com/issues/4403

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
98a0370898799895aa8f55109f54c33fcd8196b0 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: don't clear bio_iter in prepare_write_message()

At one time it was necessary to clear a message's bio_iter field to
avoid a bad pointer dereference in write_partial_msg_pages().

That no longer seems to be the case. Here's why.

The message's bio fields represent (in this case) outgoing data.
Between where the bio_iter is made NULL in prepare_write_message()
and the call in that function to prepare_message_data(), the
bio fields are never used.

In prepare_message_data(), init-bio_iter() is called, and the result
of that overwrites the value in the message's bio_iter field.

Because it gets overwritten anyway, there is no need to set it to
NULL. So don't do it.

This resolves:
http://tracker.ceph.com/issues/4402

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
07aa155878499f599a709eeecfaa0ca9ea764a88 05-Mar-2013 Alex Elder <elder@inktank.com> libceph: activate message data assignment checks

The mds client no longer tries to assign zero-length message data,
and the osd client no longer sets its data info more than once.
This allows us to activate assertions in the messenger to verify
these things never happen.

This resolves both of these:
http://tracker.ceph.com/issues/4263
http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
4a73ef27ad04f1b8ea23eb55e50b20fcc0530a6f 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: record message data byte length

Record the number of bytes of data in a page array rather than the
number of pages in the array. It can be assumed that the page array
is of sufficient size to hold the number of bytes indicated (and
offset by the indicated alignment).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ebf18f47093e968105767eed4a0aa155e86b224e 05-Mar-2013 Alex Elder <elder@inktank.com> ceph: only set message data pointers if non-empty

Change it so we only assign outgoing data information for messages
if there is outgoing data to send.

This then allows us to add a few more (currently commented-out)
assertions.

This is related to:
http://tracker.ceph.com/issues/4284

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
27fa83852ba275361eaa1a1283cf6704fa8191a6 14-Feb-2013 Alex Elder <elder@inktank.com> libceph: isolate other message data fields

Define ceph_msg_data_set_pagelist(), ceph_msg_data_set_bio(), and
ceph_msg_data_set_trail() to clearly abstract the assignment of the
remaining data-related fields in a ceph message structure. Use the
new functions in the osd client and mds client.

This partially resolves:
http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f1baeb2b9fc1c2c87ec02f1bf8cb88e108d4fbce 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: set page info with byte length

When setting page array information for message data, provide the
byte length rather than the page count ceph_msg_data_set_pages().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
02afca6ca00b7972887c5cc77068356f33bdfc18 14-Feb-2013 Alex Elder <elder@inktank.com> libceph: isolate message page field manipulation

Define a function ceph_msg_data_set_pages(), which more clearly
abstracts the assignment page-related fields for data in a ceph
message structure. Use this new function in the osd client and mds
client.

Ideally, these fields would never be set more than once (with
BUG_ON() calls to guarantee that). At the moment though the osd
client sets these every time it receives a message, and in the event
of a communication problem this can happen more than once. (This
will be resolved shortly, but setting up these helpers first makes
it all a bit easier to work with.)

Rearrange the field order in a ceph_msg structure to group those
that are used to define the possible data payloads.

This partially resolves:
http://tracker.ceph.com/issues/4263

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9516e45b25d9967c35d2e798496ec5e590aaa24f 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: simplify new message initialization

Rather than explicitly initializing many fields to 0, NULL, or false
in a newly-allocated message, just use kzalloc() for allocating new
messages. This will become a much more convenient way of doing
things anyway for upcoming patches that abstract the data field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
35c7bfbcd4fabded090e5ab316a1cbf053a0a980 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: advance pagelist with list_rotate_left()

While processing an outgoing pagelist (either the data pagelist or
trail) in a ceph message, the messenger cycles through each of the
pages on the list. This is accomplished in out_msg_pos_next(), if
the end of the first page on the list is reached, the first page is
moved to the end of the list.

There is a list operation, list_rotate_left(), which performs
exactly this operation, and by using it, what's really going on
becomes more obvious.

So replace these two list_move_tail() calls with list_rotate_left().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e788182fa6c1a400076278a75d0efa0a8a08e4ec 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: define and use in_msg_pos_next()

Define a new function in_msg_pos_next() to match out_msg_pos_next(),
and use it in place of code at the end of read_partial_message_pages()
and read_partial_message_bio().

Note that the page number is incremented and offset reset under
slightly different conditions from before. The result is
equivalent, however, as explained below.

Each time an incoming message is going to arrive, we find out how
much room is left--not surpassing the current page--and provide that
as the number of bytes to receive. So the amount we'll use is the
lesser of: all that's left of the entire request; and all that's
left in the current page.

If we received exactly how many were requested, we either reached
the end of the request or the end of the page. In the first case,
we're done, in the second, we move onto the next page in the array.

In all cases but (possibly) on the last page, after adding the
number of bytes received, page_pos == PAGE_SIZE. On the last page,
it doesn't really matter whether we increment the page number and
reset the page position, because we're done and we won't come back
here again. The code previously skipped over that last case,
basically. The new code handles that case the same as the others,
incrementing and resetting.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b3d56fab333bbb3ac7300843d69e52d7bd8a016b 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: kill args in read_partial_message_bio()

There is only one caller for read_partial_message_bio(), and it
always passes &msg->bio_iter and &bio_seg as the second and third
arguments. Furthermore, the message in question is always the
connection's in_msg, and we can get that inside the called function.

So drop those two parameters and use their derived equivalents.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e1dcb128f88958e7212fdd7ceebba4f84d6bc47a 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: change type of ceph_tcp_sendpage() "more"

Change the type of the "more" parameter from int to bool.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6ebc8b32b327463f552d9d4499aba2ef1e02a600 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: minor byte order problems in read_partial_message()

Some values printed are not (necessarily) in CPU order. We already
have a copy of the converted versions, so use them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7b11ba37585595034a91df8869414f732466b800 09-Mar-2013 Alex Elder <elder@inktank.com> libceph: define CEPH_MSG_MAX_MIDDLE_LEN

This is probably unnecessary but the code read as if it were wrong
in read_partial_message().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4137577ae398837b0d5e47d4d9365320584efdad 05-Mar-2013 Alex Elder <elder@inktank.com> libceph: clean up skipped message logic

In ceph_con_in_msg_alloc() it is possible for a connection's
alloc_msg method to indicate an incoming message should be skipped.
By default, read_partial_message() initializes the skip variable
to 0 before it gets provided to ceph_con_in_msg_alloc().

The osd client, mon client, and mds client each supply an alloc_msg
method. The mds client always assigns skip to be 0.

The other two leave the skip value of as-is, or assigns it to zero,
except:
- if no (osd or mon) request having the given tid is found, in
which case skip is set to 1 and NULL is returned; or
- in the osd client, if the data of the reply message is not
adequate to hold the message to be read, it assigns skip
value 1 and returns NULL.
So the returned message pointer will always be NULL if skip is ever
non-zero.

Clean up the logic a bit in ceph_con_in_msg_alloc() to make this
state of affairs more obvious. Add a comment explaining how a null
message pointer can mean either a message that should be skipped or
a problem allocating a message.

This resolves:
http://tracker.ceph.com/issues/4324

Reported-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
53ded495c6ac9f79d9a7f91bac92ba977944306c 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: define mds_alloc_msg() method

The only user of the ceph messenger that doesn't define an alloc_msg
method is the mds client. Define one, such that it works just like
it did before, and simplify ceph_con_in_msg_alloc() by assuming the
alloc_msg method is always present.

This and the next patch resolve:
http://tracker.ceph.com/issues/4322

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
1d866d1c31110db177cbd0636b95c4cb32ca2c6e 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: drop mutex while allocating a message

In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a
connection a new message is allocated with ceph_msg_new().

Drop the mutex before making this call, and make sure we're still
connected when we get it back again.

This is preparing for the next patch, which ensures all connections
define an alloc_msg method, and then handles them all the same way.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
ec02a2f2ffae13e038453ae89592a8c6210f7f4d 02-Mar-2013 Alex Elder <elder@inktank.com> libceph: kill ceph_msg->pagelist_count

The pagelist_count field is never actually used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d4b515fa10dd52a2aef88df7299e9f3a8ab0957a 26-Feb-2013 Alex Elder <elder@inktank.com> libceph: distinguish page array and pagelist count

Use distinct fields for tracking the number of pages in a message's
page array and in a message's page list. Currently only one or the
other is used at a time, but that will be changing soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
07c09b725543ff2958c11522d583f90f7fdba735 16-Feb-2013 Alex Elder <elder@inktank.com> libceph: make ceph_msg->bio_seg be unsigned

The bio_seg field is used by the ceph messenger in iterating through
a bio. It should never have a negative value, so make it an
unsigned. (I contemplated making it unsigned short to match the
struct bio definition, but it offered no benefit.)

Change variables used to hold bio_seg values to all be unsigned as
well. Change two variable names in init_bio_iter() to match the
convention used everywhere else.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
49659416ba4fa8308bd29e453f54c3bcf8a0fbf1 19-Feb-2013 Alex Elder <elder@inktank.com> libceph: use a do..while loop in con_work()

This just converts a manually-implemented loop into a do..while loop
in con_work(). It also moves handling of EAGAIN inside the blocks
where it's already been determined an error code was returned.

Also update a few dout() calls near the affected code for
consistency.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b6e7b6a11923bda6102b4e3e196693567944869c 19-Feb-2013 Alex Elder <elder@inktank.com> libceph: use a flag to indicate a fault has occurred

This just rearranges the logic in con_work() a little bit so that a
flag is used to indicate a fault has occurred. This allows both the
fault and non-fault case to be handled the same way and avoids a
couple of nearly consecutive gotos.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
93209264203987cdd2c69d34df6eaa2cd184e283 19-Feb-2013 Alex Elder <elder@inktank.com> libceph: separate non-locked fault handling

An error occurring on a ceph connection is treated as a fault,
causing the connection to be reset. The initial part of this fault
handling has to be done while holding the connection mutex, but
it must then be dropped for the last part.

Separate the part of this fault handling that executes without the
lock into its own function, con_fault_finish(). Move the call to
this new function, as well as call that drops the connection mutex,
into ceph_fault(). Rename that function con_fault() to reflect that
it's only handling the connection part of the fault handling.

The motivation for this was a warning from sparse about the locking
being done here. Rearranging things this way keeps all the mutex
manipulation within ceph_fault(), and this stops sparse from
complaining.

This partially resolves:
http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f20a39fd6e6356b4cf3c1650c4dc6c66c99d8bae 19-Feb-2013 Alex Elder <elder@inktank.com> libceph: encapsulate connection backoff

Collect the code that tests for and implements a backoff delay for a
ceph connection into a new function, ceph_backoff().

Make the debug output messages in that part of the code report
things consistently by reporting a message in the socket closed
case, and by making the one for PREOPEN state report the connection
pointer like the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
154171678989950f6c392e126fa8006a145ed1cc 19-Feb-2013 Alex Elder <elder@inktank.com> libceph: eliminate sparse warnings

Eliminate most of the problems in the libceph code that cause sparse
to issue warnings.
- Convert functions that are never referenced externally to have
static scope.
- Pass NULL rather than 0 for a pointer argument in one spot in
ceph_monc_delete_snapid()

This partially resolves:
http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c9ffc77adebf9dfe3026ede6c8b3c61586b485b7 20-Feb-2013 Alex Elder <elder@inktank.com> libceph: define connection flag helpers

Define and use functions that encapsulate operations performed on
a connection's flags.

This resolves:
http://tracker.ceph.com/issues/4234

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3ebc21f7bc2f9c0145bbbf0f12430b766a200f9f 31-Jan-2013 Alex Elder <elder@inktank.com> libceph: fix messenger CONFIG_BLOCK dependencies

The ceph messenger has a few spots that are only used when
bio messages are supported, and that's only when CONFIG_BLOCK
is defined. This surrounds a couple of spots with #ifdef's
that would cause a problem if CONFIG_BLOCK were not present
in the kernel configuration.

This resolves:
http://tracker.ceph.com/issues/3976

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0fa6ebc600bc8e830551aee47a0e929e818a1868 28-Dec-2012 Sage Weil <sage@inktank.com> libceph: fix protocol feature mismatch failure path

We should not set con->state to CLOSED here; that happens in
ceph_fault() in the caller, where it first asserts that the state
is not yet CLOSED. Avoids a BUG when the features don't match.

Since the fail_protocol() has become a trivial wrapper, replace
calls to it with direct calls to reset_connection().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
122070a2ffc91f87fe8e8493eb0ac61986c5557c 26-Dec-2012 Alex Elder <elder@inktank.com> libceph: WARN, don't BUG on unexpected connection states

A number of assertions in the ceph messenger are implemented with
BUG_ON(), killing the system if connection's state doesn't match
what's expected. At this point our state model is (evidently) not
well understood enough for these assertions to trigger a BUG().
Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...)
so we learn about these issues without killing the machine.

We now recognize that a connection fault can occur due to a socket
closure at any time, regardless of the state of the connection. So
there is really nothing we can assert about the state of the
connection at that point so eliminate that assertion.

Reported-by: Ugis <ugis22@gmail.com>
Tested-by: Ugis <ugis22@gmail.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
28362986f8743124b3a0fda20a8ed3e80309cce1 14-Dec-2012 Alex Elder <elder@inktank.com> libceph: report connection fault with warning

When a connection's socket disconnects, or if there's a protocol
error of some kind on the connection, a fault is signaled and
the connection is reset (closed and reopened, basically). We
currently get an error message on the log whenever this occurs.

A ceph connection will attempt to reestablish a socket connection
repeatedly if a fault occurs. This means that these error messages
will get repeatedly added to the log, which is undesirable.

Change the error message to be a warning, so they don't get
logged by default.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
7bb21d68c535ad8be38e14a715632ae398b37ac1 08-Dec-2012 Alex Elder <elder@inktank.com> libceph: socket can close in any connection state

A connection's socket can close for any reason, independent of the
state of the connection (and without irrespective of the connection
mutex). As a result, the connectino can be in pretty much any state
at the time its socket is closed.

Handle those other cases at the top of con_work(). Pull this whole
block of code into a separate function to reduce the clutter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
7246240c7c186542f73af4fadc744d66440f616f 25-Oct-2012 Sage Weil <sage@inktank.com> libceph: avoid NULL kref_put from NULL alloc_msg return

The ceph_on_in_msg_alloc() method calls the ->alloc_msg() helper which
may return NULL. It also drops con->mutex while it allocates a message,
which means that the connection state may change (e.g., get closed). If
that happens, we clean up and bail out. Avoid calling ceph_msg_put() on
a NULL return value and triggering a crash.

This was observed when an ->alloc_msg() call races with a timeout that
resends a zillion messages and resets the connection, and ->alloc_msg()
returns NULL (because the request was resent to another target).

Fixes http://tracker.newdream.net/issues/3342

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
9bd952615a42d7e2ce3fa2c632e808e804637a1a 25-Oct-2012 Sage Weil <sage@inktank.com> libceph: avoid NULL kref_put when osd reset races with alloc_msg

The ceph_on_in_msg_alloc() method drops con->mutex while it allocates a
message. If that races with a timeout that resends a zillion messages and
resets the connection, and the ->alloc_msg() method returns a NULL message,
it will call ceph_msg_put(NULL) and BUG.

Fix by only calling put if msg is non-NULL.

Fixes http://tracker.newdream.net/issues/3142

Signed-off-by: Sage Weil <sage@inktank.com>
802c6d967fbdcd2cbc91b917425661bb8bbfaade 09-Oct-2012 Alex Elder <elder@inktank.com> rbd: define common queue_con_delay()

This patch defines a single function, queue_con_delay() to call
queue_delayed_work() for a connection. It basically generalizes
what was previously queue_con() by adding the delay argument.
queue_con() is now a simple helper that passes 0 for its delay.
queue_con_delay() returns 0 if it queued work or an errno if it
did not for some reason.

If con_work() finds the BACKOFF flag set for a connection, it now
calls queue_con_delay() to handle arranging to start again after a
delay.

Note about connection reference counts: con_work() only ever gets
called as a work item function. At the time that work is scheduled,
a reference to the connection is acquired, and the corresponding
con_work() call is then responsible for dropping that reference
before it returns.

Previously, the backoff handling inside con_work() silently handed
off its reference to delayed work it scheduled. Now that
queue_con_delay() is used, a new reference is acquired for the
newly-scheduled work, and the original reference is dropped by the
con->ops->put() call at the end of the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
8618e30bc14b06bfafa0f164cca7b0e06451f88a 09-Oct-2012 Alex Elder <elder@inktank.com> rbd: let con_work() handle backoff

Both ceph_fault() and con_work() include handling for imposing a
delay before doing further processing on a faulted connection.
The latter is used only if ceph_fault() is unable to.

Instead, just let con_work() always be responsible for implementing
the delay. After setting up the delay value, set the BACKOFF flag
on the connection unconditionally and call queue_con() to ensure
con_work() will get called to handle it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
588377d6199034c36d335e7df5818b731fea072c 09-Oct-2012 Alex Elder <elder@inktank.com> rbd: reset BACKOFF if unable to re-queue

If ceph_fault() is unable to queue work after a delay, it sets the
BACKOFF connection flag so con_work() will attempt to do so.

In con_work(), when BACKOFF is set, if queue_delayed_work() doesn't
result in newly-queued work, it simply ignores this condition and
proceeds as if no backoff delay were desired. There are two
problems with this--one of which is a bug.

The first problem is simply that the intended behavior is to back
off, and if we aren't able queue the work item to run after a delay
we're not doing that.

The only reason queue_delayed_work() won't queue work is if the
provided work item is already queued. In the messenger, this
means that con_work() is already scheduled to be run again. So
if we simply set the BACKOFF flag again when this occurs, we know
the next con_work() call will again attempt to hold off activity
on the connection until after the delay.

The second problem--the bug--is a leak of a reference count. If
queue_delayed_work() returns 0 in con_work(), con->ops->put() drops
the connection reference held on entry to con_work(). However,
processing is (was) allowed to continue, and at the end of the
function a second con->ops->put() is called.

This patch fixes both problems.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
5ce765a540f34d1e2005e1210f49f67fdf11e997 22-Sep-2012 Alex Elder <elder@inktank.com> libceph: only kunmap kmapped pages

In write_partial_msg_pages(), pages need to be kmapped in order to
perform a CRC-32c calculation on them. As an artifact of the way
this code used to be structured, the kunmap() call was separated
from the kmap() call and both were done conditionally. But the
conditions under which the kmap() and kunmap() calls were made
differed, so there was a chance a kunmap() call would be done on a
page that had not been mapped.

The symptom of this was tripping a BUG() in kunmap_high() when
pkmap_count[nr] became 0.

Reported-by: Bryan K. Wright <bryan@virginia.edu>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
6d4221b53707486dfad3f5bfe568d2ce7f4c9863 10-Aug-2012 Jim Schutt <jaschut@sandia.gov> libceph: avoid truncation due to racing banners

Because the Ceph client messenger uses a non-blocking connect, it is
possible for the sending of the client banner to race with the
arrival of the banner sent by the peer.

When ceph_sock_state_change() notices the connect has completed, it
schedules work to process the socket via con_work(). During this
time the peer is writing its banner, and arrival of the peer banner
races with con_work().

If con_work() calls try_read() before the peer banner arrives, there
is nothing for it to do, after which con_work() calls try_write() to
send the client's banner. In this case Ceph's protocol negotiation
can complete succesfully.

The server-side messenger immediately sends its banner and addresses
after accepting a connect request, *before* actually attempting to
read or verify the banner from the client. As a result, it is
possible for the banner from the server to arrive before con_work()
calls try_read(). If that happens, try_read() will read the banner
and prepare protocol negotiation info via prepare_write_connect().
prepare_write_connect() calls con_out_kvec_reset(), which discards
the as-yet-unsent client banner. Next, con_work() calls
try_write(), which sends the protocol negotiation info rather than
the banner that the peer is expecting.

The result is that the peer sees an invalid banner, and the client
reports "negotiation failed".

Fix this by moving con_out_kvec_reset() out of
prepare_write_connect() to its callers at all locations except the
one where the banner might still need to be sent.

[elder@inktak.com: added note about server-side behavior]

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@inktank.com>
6139919133377652992a5fe134e22abce3e9c25e 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: recheck con state after allocating incoming message

We drop the lock when calling the ->alloc_msg() con op, which means
we need to (a) not clobber con->in_msg without the mutex held, and (b)
we need to verify that we are still in the OPEN state when we retake
it to avoid causing any mayhem. If the state does change, -EAGAIN
will get us back to con_work() and loop.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
4740a623d20c51d167da7f752b63e2b8714b2543 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: change ceph_con_in_msg_alloc convention to be less weird

This function's calling convention is very limiting. In particular,
we can't return any error other than ENOMEM (and only implicitly),
which is a problem (see next patch).

Instead, return an normal 0 or error code, and make the skip a pointer
output parameter. Drop the useless in_hdr argument (we have the con
pointer).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
8636ea672f0c5ab7478c42c5b6705ebd1db7eb6a 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: avoid dropping con mutex before fault

The ceph_fault() function takes the con mutex, so we should avoid
dropping it before calling it. This fixes a potential race with
another thread calling ceph_con_close(), or _open(), or similar (we
don't reverify con->state after retaking the lock).

Add annotation so that lockdep realizes we will drop the mutex before
returning.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
7b862e07b1a4d5c963d19027f10ea78085f27f9b 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: verify state after retaking con lock after dispatch

We drop the con mutex when delivering a message. When we retake the
lock, we need to verify we are still in the OPEN state before
preparing to read the next tag, or else we risk stepping on a
connection that has been closed.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
8007b8d626b49c34fb146ec16dc639d8b10c862f 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: fix handling of immediate socket connect failure

If the connect() call immediately fails such that sock == NULL, we
still need con_close_socket() to reset our socket state to CLOSED.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
43c7427d100769451601b8a36988ac0528ce0124 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: clear all flags on con_close

Signed-off-by: Sage Weil <sage@inktank.com>
4a8616920860920abaa51193146fe36b38ef09aa 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: clean up con flags

Rename flags with CON_FLAG prefix, move the definitions into the c file,
and (better) document their meaning.

Signed-off-by: Sage Weil <sage@inktank.com>
8dacc7da69a491c515851e68de6036f21b5663ce 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: replace connection state bits with states

Use a simple set of 6 enumerated values for the socket states (CON_STATE_*)
and use those instead of the state bits. All of the con->state checks are
now under the protection of the con mutex, so this is safe. It also
simplifies many of the state checks because we can check for anything other
than the expected state instead of various bits for races we can think of.

This appears to hold up well to stress testing both with and without socket
failure injection on the server side.

Signed-off-by: Sage Weil <sage@inktank.com>
d7353dd5aaf22ed611fbcd0d4a4a12fb30659290 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: drop unnecessary CLOSED check in socket state change callback

If we are CLOSED, the socket is closed and we won't get these.

Signed-off-by: Sage Weil <sage@inktank.com>
ee76e0736db8455e3b11827d6899bd2a4e1d0584 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: close socket directly from ceph_con_close()

It is simpler to do this immediately, since we already hold the con mutex.
It also avoids the need to deal with a not-quite-CLOSED socket in con_work.

Signed-off-by: Sage Weil <sage@inktank.com>
2e8cb10063820af7ed7638e3fd9013eee21266e7 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: drop gratuitous socket close calls in con_work

If the state is CLOSED or OPENING, we shouldn't have a socket.

Signed-off-by: Sage Weil <sage@inktank.com>
a59b55a602b6c741052d79c1e3643f8440cddd27 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: move ceph_con_send() closed check under the con mutex

Take the con mutex before checking whether the connection is closed to
avoid racing with someone else closing it.

Signed-off-by: Sage Weil <sage@inktank.com>
00650931e52e97fe64096bec167f5a6780dfd94a 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: move msgr clear_standby under con mutex protection

Avoid dropping and retaking con->mutex in the ceph_con_send() case by
leaving locking up to the caller.

Signed-off-by: Sage Weil <sage@inktank.com>
3b5ede07b55b52c3be27749d183d87257d032065 21-Jul-2012 Sage Weil <sage@inktank.com> libceph: fix fault locking; close socket on lossy fault

If we fault on a lossy connection, we should still close the socket
immediately, and do so under the con mutex.

We should also take the con mutex before printing out the state bits in
the debug output.

Signed-off-by: Sage Weil <sage@inktank.com>
85effe183dd45854d1ad1a370b88cddb403c4c91 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: reset connection retry on successfully negotiation

We exponentially back off when we encounter connection errors. If several
errors accumulate, we will eventually wait ages before even trying to
reconnect.

Fix this by resetting the backoff counter after a successful negotiation/
connection with the remote node. Fixes ceph issue #2802.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
5469155f2bc83bb2c88b0a0370c3d54d87eed06e 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: protect ceph_con_open() with mutex

Take the con mutex while we are initiating a ceph open. This is necessary
because the may have previously been in use and then closed, which could
result in a racing workqueue running con_work().

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
a4107026976f06c9a6ce8cc84a763564ee39d901 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: (re)initialize bio_iter on start of message receive

Previously, we were opportunistically initializing the bio_iter if it
appeared to be uninitialized in the middle of the read path. The problem
is that a sequence like:

- start reading message
- initialize bio_iter
- read half a message
- messenger fault, reconnect
- restart reading message
- ** bio_iter now non-NULL, not reinitialized **
- read past end of bio, crash

Instead, initialize the bio_iter unconditionally when we allocate/claim
the message for read.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
8c50c817566dfa4581f82373aac39f3e608a7dc8 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: fix mutex coverage for ceph_con_close

Hold the mutex while twiddling all of the state bits to avoid possible
races. While we're here, make not of why we cannot close the socket
directly.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
3a140a0d5c4b9e35373b016e41dfc85f1e526bdb 31-Jul-2012 Sage Weil <sage@inktank.com> libceph: report socket read/write error message

We need to set error_msg to something useful before calling ceph_fault();
do so here for try_{read,write}(). This is more informative than

libceph: osd0 192.168.106.220:6801 (null)

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
a2a3258417eb6a1799cf893350771428875a8287 09-Jul-2012 Guanjun He <gjhe@suse.com> libceph: prevent the race of incoming work during teardown

Add an atomic variable 'stopping' as flag in struct ceph_messenger,
set this flag to 1 in function ceph_destroy_client(), and add the condition code
in function ceph_data_ready() to test the flag value, if true(1), just return.

Signed-off-by: Guanjun He <gjhe@suse.com>
Reviewed-by: Sage Weil <sage@inktank.com>
a16cb1f70799c851410d9dca0a24122e258df06c 10-Jul-2012 Sage Weil <sage@inktank.com> libceph: fix messenger retry

In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
5bdca4e0768d3e0f4efa43d9a2cc8210aeb91ab9 10-Jul-2012 Sage Weil <sage@inktank.com> libceph: fix messenger retry

In ancient times, the messenger could both initiate and accept connections.
An artifact if that was data structures to store/process an incoming
ceph_msg_connect request and send an outgoing ceph_msg_connect_reply.
Sadly, the negotiation code was referencing those structures and ignoring
important information (like the peer's connect_seq) from the correct ones.

Among other things, this fixes tight reconnect loops where the server sends
RETRY_SESSION and we (the client) retries with the same connect_seq as last
time. This bug pretty easily triggered by injecting socket failures on the
MDS and running some fs workload like workunits/direct_io/test_sync_io.

Signed-off-by: Sage Weil <sage@inktank.com>
fbb85a478f6d4cce6942f1c25c6a68ec5b1e7e7f 27-Jun-2012 Sage Weil <sage@inktank.com> libceph: allow sock transition from CONNECTING to CLOSED

It is possible to close a socket that is in the OPENING state. For
example, it can happen if ceph_con_close() is called on the con before
the TCP connection is established. con_work() will come around and shut
down the socket.

Signed-off-by: Sage Weil <sage@inktank.com>
b7a9e5dd40f17a48a72f249b8bbc989b63bae5fd 27-Jun-2012 Sage Weil <sage@inktank.com> libceph: set peer name on con_open, not init

The peer name may change on each open attempt, even when the connection is
reused.

Signed-off-by: Sage Weil <sage@inktank.com>
bc18f4b1c850ab355e38373fbb60fd28568d84b5 21-Jun-2012 Alex Elder <elder@inktank.com> libceph: add some fine ASCII art

Sage liked the state diagram I put in my commit description so
I'm putting it in with the code.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
5821bd8ccdf5d17ab2c391c773756538603838c3 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: small changes to messenger.c

This patch gathers a few small changes in "net/ceph/messenger.c":
out_msg_pos_next()
- small logic change that mostly affects indentation
write_partial_msg_pages().
- use a local variable trail_off to represent the offset into
a message of the trail portion of the data (if present)
- once we are in the trail portion we will always be there, so we
don't always need to check against our data position
- avoid computing len twice after we've reached the trail
- get rid of the variable tmpcrc, which is not needed
- trail_off and trail_len never change so mark them const
- update some comments
read_partial_message_bio()
- bio_iovec_idx() will never return an error, so don't bother
checking for it

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
7593af920baac37752190a0db703d2732bed4a3b 24-May-2012 Alex Elder <elder@inktank.com> libceph: distinguish two phases of connect sequence

Currently a ceph connection enters a "CONNECTING" state when it
begins the process of (re-)connecting with its peer. Once the two
ends have successfully exchanged their banner and addresses, an
additional NEGOTIATING bit is set in the ceph connection's state to
indicate the connection information exhange has begun. The
CONNECTING bit/state continues to be set during this phase.

Rather than have the CONNECTING state continue while the NEGOTIATING
bit is set, interpret these two phases as distinct states. In other
words, when NEGOTIATING is set, clear CONNECTING. That way only
one of them will be active at a time.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
ab166d5aa3bc036fba7efaca6e4e43a7e9510acf 31-May-2012 Alex Elder <elder@inktank.com> libceph: separate banner and connect writes

There are two phases in the process of linking together the two ends
of a ceph connection. The first involves exchanging a banner and
IP addresses, and if that is successful a second phase exchanges
some detail about each side's connection capabilities.

When initiating a connection, the client side now queues to send
its information for both phases of this process at the same time.
This is probably a bit more efficient, but it is slightly messier
from a layering perspective in the code.

So rearrange things so that the client doesn't send the connection
information until it has received and processed the response in the
initial banner phase (in process_banner()).

Move the code (in the (con->sock == NULL) case in try_write()) that
prepares for writing the connection information, delaying doing that
until the banner exchange has completed. Move the code that begins
the transition to this second "NEGOTIATING" phase out of
process_banner() and into its caller, so preparing to write the
connection information and preparing to read the response are
adjacent to each other.

Finally, preparing to write the connection information now requires
the output kvec to be reset in all cases, so move that into the
prepare_write_connect() and delete it from all callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e27947c767f5bed15048f4e4dad3e2eb69133697 23-May-2012 Alex Elder <elder@inktank.com> libceph: define and use an explicit CONNECTED state

There is no state explicitly defined when a ceph connection is fully
operational. So define one.

It's set when the connection sequence completes successfully, and is
cleared when the connection gets closed.

Be a little more careful when examining the old state when a socket
disconnect event is reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
3ec50d1868a9e0493046400bb1fdd054c7f64ebd 23-May-2012 Alex Elder <elder@inktank.com> libceph: clear NEGOTIATING when done

A connection state's NEGOTIATING bit gets set while in CONNECTING
state after we have successfully exchanged a ceph banner and IP
addresses with the connection's peer (the server). But that bit
is not cleared again--at least not until another connection attempt
is initiated.

Instead, clear it as soon as the connection is fully established.
Also, clear it when a socket connection gets prematurely closed
in the midst of establishing a ceph connection (in case we had
reached the point where it was set).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
bb9e6bba5d8b85b631390f8dbe8a24ae1ff5b48a 21-Jun-2012 Alex Elder <elder@inktank.com> libceph: clear CONNECTING in ceph_con_close()

A connection that is closed will no longer be connecting. So
clear the CONNECTING state bit in ceph_con_close(). Similarly,
if the socket has been closed we no longer are in connecting
state (a new connect sequence will need to be initiated).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
456ea46865787283088b23a8a7f69244513b95f0 21-Jun-2012 Alex Elder <elder@inktank.com> libceph: don't touch con state in con_close_socket()

In con_close_socket(), a connection's SOCK_CLOSED flag gets set and
then cleared while its shutdown method is called and its reference
gets dropped.

Previously, that flag got set only if it had not already been set,
so setting it in con_close_socket() might have prevented additional
processing being done on a socket being shut down. We no longer set
SOCK_CLOSED in the socket event routine conditionally, so setting
that bit here no longer provides whatever benefit it might have
provided before.

A race condition could still leave the SOCK_CLOSED bit set even
after we've issued the call to con_close_socket(), so we still clear
that bit after shutting the socket down. Add a comment explaining
the reason for this.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
d65c9e0b9eb43d14ece9dd843506ccba06162ee7 21-Jun-2012 Alex Elder <elder@inktank.com> libceph: just set SOCK_CLOSED when state changes

When a TCP_CLOSE or TCP_CLOSE_WAIT event occurs, the SOCK_CLOSED
connection flag bit is set, and if it had not been previously set
queue_con() is called to ensure con_work() will get a chance to
handle the changed state.

con_work() atomically checks--and if set, clears--the SOCK_CLOSED
bit if it was set. This means that even if the bit were set
repeatedly, the related processing in con_work() only gets called
once per transition of the bit from 0 to 1.

What's important then is that we ensure con_work() gets called *at
least* once when a socket close event occurs, not that it gets
called *exactly* once.

The work queue mechanism already takes care of queueing work
only if it is not already queued, so there's no need for us
to call queue_con() conditionally.

So this patch just makes it so the SOCK_CLOSED flag gets set
unconditionally in ceph_sock_state_change().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
188048bce311ee41e5178bc3255415d0eae28423 21-Jun-2012 Alex Elder <elder@inktank.com> libceph: don't change socket state on sock event

Currently the socket state change event handler records an error
message on a connection to distinguish a close while connecting from
a close while a connection was already established.

Changing connection information during handling of a socket event is
not very clean, so instead move this assignment inside con_work(),
where it can be done during normal connection-level processing (and
under protection of the connection mutex as well).

Move the handling of a socket closed event up to the top of the
processing loop in con_work(); there's no point in handling backoff
etc. if we have a newly-closed socket to take care of.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
a8d00e3cdef4c1c4f194414b72b24cd995439a05 21-Jun-2012 Alex Elder <elder@inktank.com> libceph: SOCK_CLOSED is a flag, not a state

The following commit changed it so SOCK_CLOSED bit was stored in
a connection's new "flags" field rather than its "state" field.

libceph: start separating connection flags from state
commit 928443cd

That bit is used in con_close_socket() to protect against setting an
error message more than once in the socket event handler function.

Unfortunately, the field being operated on in that function was not
updated to be "flags" as it should have been. This fixes that
error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
abdaa6a849af1d63153682c11f5bbb22dacb1f6b 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: don't use bio_iter as a flag

Recently a bug was fixed in which the bio_iter field in a ceph
message was not being properly re-initialized when a message got
re-transmitted:
commit 43643528cce60ca184fe8197efa8e8da7c89a037
Author: Yan, Zheng <zheng.z.yan@intel.com>
rbd: Clear ceph_msg->bio_iter for retransmitted message

We are now only initializing the bio_iter field when we are about to
start to write message data (in prepare_write_message_data()),
rather than every time we are attempting to write any portion of the
message data (in write_partial_msg_pages()). This means we no
longer need to use the msg->bio_iter field as a flag.

So just don't do that any more. Trust prepare_write_message_data()
to ensure msg->bio_iter is properly initialized, every time we are
about to begin writing (or re-writing) a message's bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
572c588edadaa3da3992bd8a0fed830bbcc861f8 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: move init of bio_iter

If a message has a non-null bio pointer, its bio_iter field is
initialized in write_partial_msg_pages() if this has not been done
already. This is really a one-time setup operation for sending a
message's (bio) data, so move that initialization code into
prepare_write_message_data() which serves that purpose.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
df6ad1f97342ebc4270128222e896541405eecdb 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: move init_bio_*() functions up

Move init_bio_iter() and iter_bio_next() up in their source file so
the'll be defined before they're needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
fd154f3c75465abd83b7a395033e3755908a1e6e 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: don't mark footer complete before it is

This is a nit, but prepare_write_message() sets the FOOTER_COMPLETE
flag before the CRC for the data portion (recorded in the footer)
has been completely computed. Hold off setting the complete flag
until we've decided it's ready to send.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
84ca8fc87fcf4ab97bb8acdb59bf97bb4820cb14 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: encapsulate advancing msg page

In write_partial_msg_pages(), once all the data from a page has been
sent we advance to the next one. Put the code that takes care of
this into its own function.

While modifying write_partial_msg_pages(), make its local variable
"in_trail" be Boolean, and use the local variable "msg" (which is
just the connection's current out_msg pointer) consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
739c905baa018c99003564ebc367d93aa44d4861 11-Jun-2012 Alex Elder <elder@inktank.com> libceph: encapsulate out message data setup

Move the code that prepares to write the data portion of a message
into its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
d59315ca8c0de00df9b363f94a2641a30961ca1c 21-Jun-2012 Sage Weil <sage@inktank.com> libceph: drop ceph_con_get/put helpers and nref member

These are no longer used. Every ceph_connection instance is embedded in
another structure, and refcounts manipulated via the get/put ops.

Signed-off-by: Sage Weil <sage@inktank.com>
36eb71aa57e6a33d61fd90a2fd87f00c6844bc86 21-Jun-2012 Sage Weil <sage@inktank.com> libceph: use con get/put methods

The ceph_con_get/put() helpers manipulate the embedded con ref
count, which isn't used now that ceph_connections are embedded in
other structures.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
b132cf4c733f91bb4dd2277ea049243cf16e8b66 07-Jun-2012 Yan, Zheng <zheng.z.yan@intel.com> rbd: Clear ceph_msg->bio_iter for retransmitted message

The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 43643528cce60ca184fe8197efa8e8da7c89a037)
26ce171915f348abd1f41da1ed139d93750d987f 19-Jun-2012 Dan Carpenter <dan.carpenter@oracle.com> libceph: fix NULL dereference in reset_connection()

We dereference "con->in_msg" on the line after it was set to NULL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
89a86be0ce20022f6ede8bccec078dbb3d63caaa 09-Jun-2012 Sage Weil <sage@inktank.com> libceph: transition socket state prior to actual connect

Once we call ->connect(), we are racing against the actual
connection, and a subsequent transition from CONNECTING ->
CONNECTED. Set the state to CONNECTING before that, under the
protection of the mutex, to avoid the race.

This was introduced in 928443cd9644e7cfd46f687dbeffda2d1a357ff9,
with the original socket state code.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
43643528cce60ca184fe8197efa8e8da7c89a037 07-Jun-2012 Yan, Zheng <zheng.z.yan@intel.com> rbd: Clear ceph_msg->bio_iter for retransmitted message

The bug can cause NULL pointer dereference in write_partial_msg_pages

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
8921d114f5574c6da2cdd00749d185633ecf88f3 01-Jun-2012 Alex Elder <elder@inktank.com> libceph: make ceph_con_revoke_message() a msg op

ceph_con_revoke_message() is passed both a message and a ceph
connection. A ceph_msg allocated for incoming messages on a
connection always has a pointer to that connection, so there's no
need to provide the connection when revoking such a message.

Note that the existing logic does not preclude the message supplied
being a null/bogus message pointer. The only user of this interface
is the OSD client, and the only value an osd client passes is a
request's r_reply field. That is always non-null (except briefly in
an error path in ceph_osdc_alloc_request(), and that drops the
only reference so the request won't ever have a reply to revoke).
So we can safely assume the passed-in message is non-null, but add a
BUG_ON() to make it very obvious we are imposing this restriction.

Rename the function ceph_msg_revoke_incoming() to reflect that it is
really an operation on an incoming message.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
6740a845b2543cc46e1902ba21bac743fbadd0dc 01-Jun-2012 Alex Elder <elder@inktank.com> libceph: make ceph_con_revoke() a msg operation

ceph_con_revoke() is passed both a message and a ceph connection.
Now that any message associated with a connection holds a pointer
to that connection, there's no need to provide the connection when
revoking a message.

This has the added benefit of precluding the possibility of the
providing the wrong connection pointer. If the message's connection
pointer is null, it is not being tracked by any connection, so
revoking it is a no-op. This is supported as a convenience for
upper layers, so they can revoke a message that is not actually
"in flight."

Rename the function ceph_msg_revoke() to reflect that it is really
an operation on a message, not a connection.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
92ce034b5a740046cc643a21ea21eaad589e0043 04-Jun-2012 Alex Elder <elder@inktank.com> libceph: have messages take a connection reference

There are essentially two types of ceph messages: incoming and
outgoing. Outgoing messages are always allocated via ceph_msg_new(),
and at the time of their allocation they are not associated with any
particular connection. Incoming messages are always allocated via
ceph_con_in_msg_alloc(), and they are initially associated with the
connection from which incoming data will be placed into the message.

When an outgoing message gets sent, it becomes associated with a
connection and remains that way until the message is successfully
sent. The association of an incoming message goes away at the point
it is sent to an upper layer via a con->ops->dispatch method.

This patch implements reference counting for all ceph messages, such
that every message holds a reference (and a pointer) to a connection
if and only if it is associated with that connection (as described
above).


For background, here is an explanation of the ceph message
lifecycle, emphasizing when an association exists between a message
and a connection.

Outgoing Messages
An outgoing message is "owned" by its allocator, from the time it is
allocated in ceph_msg_new() up to the point it gets queued for
sending in ceph_con_send(). Prior to that point the message's
msg->con pointer is null; at the point it is queued for sending its
message pointer is assigned to refer to the connection. At that
time the message is inserted into a connection's out_queue list.

When a message on the out_queue list has been sent to the socket
layer to be put on the wire, it is transferred out of that list and
into the connection's out_sent list. At that point it is still owned
by the connection, and will remain so until an acknowledgement is
received from the recipient that indicates the message was
successfully transferred. When such an acknowledgement is received
(in process_ack()), the message is removed from its list (in
ceph_msg_remove()), at which point it is no longer associated with
the connection.

So basically, any time a message is on one of a connection's lists,
it is associated with that connection. Reference counting outgoing
messages can thus be done at the points a message is added to the
out_queue (in ceph_con_send()) and the point it is removed from
either its two lists (in ceph_msg_remove())--at which point its
connection pointer becomes null.

Incoming Messages
When an incoming message on a connection is getting read (in
read_partial_message()) and there is no message in con->in_msg,
a new one is allocated using ceph_con_in_msg_alloc(). At that
point the message is associated with the connection. Once that
message has been completely and successfully read, it is passed to
upper layer code using the connection's con->ops->dispatch method.
At that point the association between the message and the connection
no longer exists.

Reference counting of connections for incoming messages can be done
by taking a reference to the connection when the message gets
allocated, and releasing that reference when it gets handed off
using the dispatch method.

We should never fail to get a connection reference for a
message--the since the caller should already hold one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
38941f8031bf042dba3ced6394ba3a3b16c244ea 01-Jun-2012 Alex Elder <elder@inktank.com> libceph: have messages point to their connection

When a ceph message is queued for sending it is placed on a list of
pending messages (ceph_connection->out_queue). When they are
actually sent over the wire, they are moved from that list to
another (ceph_connection->out_sent). When acknowledgement for the
message is received, it is removed from the sent messages list.

During that entire time the message is "in the possession" of a
single ceph connection. Keep track of that connection in the
message. This will be used in the next patch (and is a helpful
bit of information for debugging anyway).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
1c20f2d26795803fc4f5155fe4fca5717a5944b6 04-Jun-2012 Alex Elder <elder@inktank.com> libceph: tweak ceph_alloc_msg()

The function ceph_alloc_msg() is only used to allocate a message
that will be assigned to a connection's in_msg pointer. Rename the
function so this implied usage is more clear.

In addition, make that assignment inside the function (again, since
that's precisely what it's intended to be used for). This allows us
to return what is now provided via the passed-in address of a "skip"
variable. The return type is now Boolean to be explicit that there
are only two possible outcomes.

Make sure the result of an ->alloc_msg method call always sets the
value of *skip properly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
1bfd89f4e6e1adc6a782d94aa5d4c53be1e404d7 27-May-2012 Alex Elder <elder@inktank.com> libceph: fully initialize connection in con_init()

Move the initialization of a ceph connection's private pointer,
operations vector pointer, and peer name information into
ceph_con_init(). Rearrange the arguments so the connection pointer
is first. Hide the byte-swapping of the peer entity number inside
ceph_con_init()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
a5988c490ef66cb04ea2f610681949b25c773b3c 29-May-2012 Alex Elder <elder@inktank.com> libceph: set CLOSED state bit in con_init

Once a connection is fully initialized, it is really in a CLOSED
state, so make that explicit by setting the bit in its state field.

It is possible for a connection in NEGOTIATING state to get a
failure, leading to ceph_fault() and ultimately ceph_con_close().
Clear that bits if it is set in that case, to reflect that the
connection truly is closed and is no longer participating in a
connect sequence.

Issue a warning if ceph_con_open() is called on a connection that
is not in CLOSED state.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
ce2c8903e76e690846a00a0284e4bd9ee954d680 23-May-2012 Alex Elder <elder@inktank.com> libceph: start tracking connection socket state

Start explicitly keeping track of the state of a ceph connection's
socket, separate from the state of the connection itself. Create
placeholder functions to encapsulate the state transitions.

--------
| NEW* | transient initial state
--------
| con_sock_state_init()
v
----------
| CLOSED | initialized, but no socket (and no
---------- TCP connection)
^ \
| \ con_sock_state_connecting()
| ----------------------
| \
+ con_sock_state_closed() \
|\ \
| \ \
| ----------- \
| | CLOSING | socket event; \
| ----------- await close \
| ^ |
| | |
| + con_sock_state_closing() |
| / \ |
| / --------------- |
| / \ v
| / --------------
| / -----------------| CONNECTING | socket created, TCP
| | / -------------- connect initiated
| | | con_sock_state_connected()
| | v
-------------
| CONNECTED | TCP connection established
-------------

Make the socket state an atomic variable, reinforcing that it's a
distinct transtion with no possible "intermediate/both" states.
This is almost certainly overkill at this point, though the
transitions into CONNECTED and CLOSING state do get called via
socket callback (the rest of the transitions occur with the
connection mutex held). We can back out the atomicity later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>
928443cd9644e7cfd46f687dbeffda2d1a357ff9 22-May-2012 Alex Elder <elder@inktank.com> libceph: start separating connection flags from state

A ceph_connection holds a mixture of connection state (as in "state
machine" state) and connection flags in a single "state" field. To
make the distinction more clear, define a new "flags" field and use
it rather than the "state" field to hold Boolean flag values.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil<sage@inktank.com>
15d9882c336db2db73ccf9871ae2398e452f694c 27-May-2012 Alex Elder <elder@inktank.com> libceph: embed ceph messenger structure in ceph_client

A ceph client has a pointer to a ceph messenger structure in it.
There is always exactly one ceph messenger for a ceph client, so
there is no need to allocate it separate from the ceph client
structure.

Switch the ceph_client structure to embed its ceph_messenger
structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e22004235a900213625acd6583ac913d5a30c155 23-May-2012 Alex Elder <elder@inktank.com> libceph: rename kvec_reset and kvec_add functions

The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
are entirely private functions, so drop the "ceph_" prefix in their
name to make them slightly more wieldy.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
327800bdc2cb9b71f4b458ca07aa9d522668dde0 22-May-2012 Alex Elder <elder@inktank.com> libceph: rename socket callbacks

Change the names of the three socket callback functions to make it
more obvious they're specifically associated with a connection's
socket (not the ceph connection that uses it).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
6384bb8b8e88a9c6bf2ae0d9517c2c0199177c34 30-May-2012 Alex Elder <elder@inktank.com> libceph: kill bad_proto ceph connection op

No code sets a bad_proto method in its ceph connection operations
vector, so just get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
e5e372da9a469dfe3ece40277090a7056c566838 22-May-2012 Alex Elder <elder@inktank.com> libceph: eliminate connection state "DEAD"

The ceph connection state "DEAD" is never set and is therefore not
needed. Eliminate it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
3da54776e2c0385c32d143fd497a7f40a88e29dd 16-May-2012 Alex Elder <elder@inktank.com> ceph: add auth buf in prepare_write_connect()

Move the addition of the authorizer buffer to a connection's
out_kvec out of get_connect_authorizer() and into its caller. This
way, the caller--prepare_write_connect()--can avoid adding the
connect header to out_kvec before it has been fully initialized.

Prior to this patch, it was possible for a connect header to be
sent over the wire before the authorizer protocol or buffer length
fields were initialized. An authorizer buffer associated with that
header could also be queued to send only after the connection header
that describes it was on the wire.

Fixes http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
dac1e716c60161867a47745bca592987ca3a9cb2 16-May-2012 Alex Elder <elder@inktank.com> ceph: rename prepare_connect_authorizer()

Change the name of prepare_connect_authorizer(). The next
patch is going to make this function no longer add anything to the
connection's out_kvec, so it will no longer fit the pattern of
the rest of the prepare_connect_*() functions.

In addition, pass the address of a variable that will hold the
authorization protocol to use. Move the assignment of that to the
connection's out_connect structure into prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
729796be9190f57ca40ccca315e8ad34a1eb8fef 16-May-2012 Alex Elder <elder@inktank.com> ceph: return pointer from prepare_connect_authorizer()

Change prepare_connect_authorizer() so it returns a pointer (or
pointer-coded error).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
8f43fb53894079bf0caab6e348ceaffe7adc651a 16-May-2012 Alex Elder <elder@inktank.com> ceph: use info returned by get_authorizer

Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
a3530df33eb91d787d08c7383a0a9982690e42d0 16-May-2012 Alex Elder <elder@inktank.com> ceph: have get_authorizer methods return pointers

Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value. This is to pave the way for making use of the
returned value in an upcoming patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
ed96af646011412c2bf1ffe860db170db355fae5 16-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: check return from get_authorizer

In prepare_connect_authorizer(), a connection's get_authorizer
method is called but ignores its return value. This function can
return an error, so check for it and return it if that ever occurs.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
b1c6b9803f5491e94041e6da96bc9dec3870e792 16-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: rework prepare_connect_authorizer()

Change prepare_connect_authorizer() so it returns without dropping
the connection mutex if the connection has no get_authorizer method.

Use the symbolic CEPH_AUTH_UNKNOWN instead of 0 when assigning
authorization protocols.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
5a0f8fdd8a0ebe320952a388331dc043d7e14ced 17-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: check prepare_write_connect() result

prepare_write_connect() can return an error, but only one of its
callers checks for it. All the rest are in functions that already
return errors, so it should be fine to return the error if one
gets returned.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e10c758e4031a801ea4d2f8fb39bf14c2658d74b 16-May-2012 Alex Elder <elder@inktank.com> ceph: don't set WRITE_PENDING too early

prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection. Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending. It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING big getting set.

Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.

Partially addresses http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e825a66df97776d30a48a187e3a986736af43945 16-May-2012 Alex Elder <elder@inktank.com> ceph: drop msgr argument from prepare_write_connect()

In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con->msgr. Just get the msgr
value from the ceph connection and drop the unneeded argument.

The only msgr passed to prepare_write_banner() is also therefore
just the one from con->msgr, so change that function to drop the
msgr argument as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
41b90c00858129f52d08e6a05c9cfdb0f2bd074d 16-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: send banner in process_connect()

prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message. It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the "include_banner"
argument from prepare_write_connect().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
84fb3adf6413862cff51d8af3fce5f0b655586a2 16-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: reset connection kvec caller

Reset a connection's kvec fields in the caller rather than in
prepare_write_connect(). This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
d329156f16306449c273002486c28de3ddddfd89 16-May-2012 Alex Elder <elder@inktank.com> libceph: don't reset kvec in prepare_write_banner()

Move the kvec reset for a connection out of prepare_write_banner and
into its only caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
fd51653f78cf40a0516e521b6de22f329c5bad8d 10-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: change read_partial() to take "end" arg

Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents. This
amounts to moving the addition "to + size" into the caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e6cee71fac27c946a0bbad754dd076e66c4e9dbd 10-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: update "to" in read_partial() caller

read_partial() always increases whatever "to" value is supplied by
adding the requested size to it, and that's the only thing it does
with that pointed-to value.

Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
57dac9d1620942608306d8c17c98a9d1568ffdf4 10-May-2012 Alex Elder <elder@inktank.com> ceph: messenger: use read_partial() in read_partial_message()

There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial(). Do that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
95c961747284a6b83a5e2d81240e214b0fa3464d 15-Apr-2012 Eric Dumazet <eric.dumazet@gmail.com> net: cleanup unsigned to unsigned int

Use of "unsigned int" is preferred to bare "unsigned" in net tree.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8d63e318c4eb1bea6f7e3cb4b77849eaa167bfec 07-Mar-2012 Alex Elder <elder@dreamhost.com> libceph: isolate kmap() call in write_partial_msg_pages()

In write_partial_msg_pages(), every case now does an identical call
to kmap(page). Instead, just call it once inside the CRC-computing
block where it's needed. Move the definition of kaddr inside that
block, and make it a (char *) to ensure portable pointer arithmetic.

We still don't kunmap() it until after the sendpage() call, in case
that also ends up needing to use the mapping.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
9bd1966344bf975b5ce65e80fd6bacc41b4325a8 07-Mar-2012 Alex Elder <elder@dreamhost.com> libceph: rename "page_shift" variable to something sensible

In write_partial_msg_pages() there is a local variable used to
track the starting offset within a bio segment to use. Its name,
"page_shift" defies the Linux convention of using that name for
log-base-2(page size).

Since it's only used in the bio case rename it "bio_offset". Use it
along with the page_pos field to compute the memory offset when
computing CRC's in that function. This makes the bio case match the
others more closely.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
0cdf9e60189a87356a865a96dbafc2240af5c91d 07-Mar-2012 Alex Elder <elder@dreamhost.com> libceph: get rid of zero_page_address

There's not a lot of benefit to zero_page_address, which basically
holds a mapping of the zero page through the life of the messenger
module. Even with our own mapping, the sendpage interface where
it's used may need to kmap() it again. It's almost certain to
be in low memory anyway.

So stop treating the zero page specially in write_partial_msg_pages()
and just get rid of zero_page_address entirely.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
e36b13cceb46136d849aeee06b4907ad3570ba78 07-Mar-2012 Alex Elder <elder@dreamhost.com> libceph: only call kernel_sendpage() via helper

Make ceph_tcp_sendpage() be the only place kernel_sendpage() is
used, by using this helper in write_partial_msg_pages().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
31739139f3ed7be802dd9019ec8d8cc910e3d241 07-Mar-2012 Alex Elder <elder@dreamhost.com> libceph: use kernel_sendpage() for sending zeroes

If a message queued for send gets revoked, zeroes are sent over the
wire instead of any unsent data. This is done by constructing a
message and passing it to kernel_sendmsg() via ceph_tcp_sendmsg().

Since we are already working with a page in this case we can use
the sendpage interface instead. Create a new ceph_tcp_sendpage()
helper that sets up flags to match the way ceph_tcp_sendmsg()
does now.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
37675b0f42a8f7699c3602350d1c3b2a1698a3d3 07-Mar-2012 Alex Elder <elder@dreamhost.com> libceph: fix inverted crc option logic

CRC's are computed for all messages between ceph entities. The CRC
computation for the data portion of message can optionally be
disabled using the "nocrc" (common) ceph option. The default is
for CRC computation for the data portion to be enabled.

Unfortunately, the code that implements this feature interprets the
feature flag wrong, meaning that by default the CRC's have *not*
been computed (or checked) for the data portion of messages unless
the "nocrc" option was supplied.

Fix this, in write_partial_msg_pages() and read_partial_message().
Also change the flag variable in write_partial_msg_pages() to be
"no_datacrc" to match the usage elsewhere in the file.

This fixes http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
84495f496170a73ed79667b7fbf91947b7f47c87 15-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: some simple changes

Nothing too big here.
- define the size of the buffer used for consuming ignored
incoming data using a symbolic constant
- simplify the condition determining whether to unmap the page
in write_partial_msg_pages(): do it for crc but not if the
page is the zero page

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
f42299e6c3883c69c14079b8c88fe33815b2dcc3 15-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: small refactor in write_partial_kvec()

Make a small change in the code that counts down kvecs consumed by
a ceph_tcp_sendmsg() call. Same functionality, just blocked out
a little differently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
fe3ad593e2c34457ffa6233014ab19f4d36f85f2 15-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: do crc calculations outside loop

Move blocks of code out of loops in read_partial_message_section()
and read_partial_message(). They were only was getting called at
the end of the last iteration of the loop anyway.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
a9a0c51af4e7c825c014b40694571456a75ebbc4 15-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: separate CRC calculation from byte swapping

Calculate CRC in a separate step from rearranging the byte order
of the result, to improve clarity and readability.

Use offsetof() to determine the number of bytes to include in the
CRC calculation.

In read_partial_message(), switch which value gets byte-swapped,
since the just-computed CRC is already likely to be in a register.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
bca064d236a2e3162a07c758855221bcbe3c475b 15-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: use "do" in CRC-related Boolean variables

Change the name (and type) of a few CRC-related Boolean local
variables so they contain the word "do", to distingish their purpose
from variables used for holding an actual CRC value.

Note that in the process of doing this I identified a fairly serious
logic error in write_partial_msg_pages(): the value of "do_crc"
assigned appears to be the opposite of what it should be. No
attempt to fix this is made here; this change preserves the
erroneous behavior. The problem I found is documented here:
http://tracker.newdream.net/issues/2064

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
d3002b974cefbb7c1e325cc296966f768ff76b06 14-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: a few small changes

This gathers a number of very minor changes:
- use %hu when formatting the a socket address's address family
- null out the ceph_msgr_wq pointer after the queue has been
destroyed
- drop a needless cast in ceph_write_space()
- add a WARN() call in ceph_state_change() in the event an
unrecognized socket state is encountered
- rearrange the logic in ceph_con_get() and ceph_con_put() so
that:
- the reference counts are only atomically read once
- the values displayed via dout() calls are known to
be meaningful at the time they are formatted

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
41617d0c9c9832e030667277ddf6b4ffb4ecdc90 14-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: make ceph_tcp_connect() return int

There is no real need for ceph_tcp_connect() to return the socket
pointer it creates, since it already assigns it to con->sock, which
is visible to the caller. Instead, have it return an error code,
which tidies things up a bit.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
6173d1f02fb19c0fba02857ae4e1109b5ec95034 14-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: encapsulate some messenger cleanup code

Define a helper function to perform various cleanup operations. Use
it both in the exit routine and in the init routine in the event of
an error.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
e0f43c9419c1900e5b50de4261e9686a45a0a2b8 14-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: make ceph_msgr_wq private

The messenger workqueue has no need to be public. So give it static
scope.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
859eb7994876f26fd9f52d9589fbcab8e2fe8069 14-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: encapsulate connection kvec operations

Encapsulate the operation of adding a new chunk of data to the next
open slot in a ceph_connection's out_kvec array. Also add a "reset"
operation to make subsequent add operations start at the beginning
of the array again.

Use these routines throughout, avoiding duplicate code and ensuring
all calls are handled consistently.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
963be4d7709f84d865f76d12d5b0ec7edad1c498 14-Feb-2012 Alex Elder <elder@dreamhost.com> libceph: move prepare_write_banner()

One of the arguments to prepare_write_connect() indicates whether it
is being called immediately after a call to prepare_write_banner().
Move the prepare_write_banner() call inside prepare_write_connect(),
and reinterpret (and rename) the "after_banner" argument so it
indicates that prepare_write_connect() should *make* the call
rather than should know it has already been made.

This was split out from the next patch to highlight this change in
logic.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
99f0f3b2c4be15784bb4ede33b5f2c3f7861dba7 23-Jan-2012 Alex Elder <elder@dreamhost.com> ceph: eliminate some abusive casts

This fixes some spots where a type cast to (void *) was used as
as a universal type hiding mechanism. Instead, properly cast the
type to the intended target type.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
bd406145129e8724cc71b65ff2a788dbd4d60c50 23-Jan-2012 Alex Elder <elder@dreamhost.com> ceph: eliminate some needless casts

This eliminates type casts in some places where they are not
required.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
f64a93172b97dcfcfa68f595652220653562f605 23-Jan-2012 Alex Elder <elder@dreamhost.com> ceph: kill addr_str_lock spinlock; use atomic instead

A spinlock is used to protect a value used for selecting an array
index for a string used for formatting a socket address for human
consumption. The index is reset to 0 if it ever reaches the maximum
index value.

Instead, use an ever-increasing atomic variable as a sequence
number, and compute the array index by masking off all but the
sequence number's lowest bits. Make the number of entries in the
array a power of two to allow the use of such a mask (to avoid jumps
in the index value when the sequence number wraps).

The length of these strings is somewhat arbitrarily set at 60 bytes.
The worst-case length of a string produced is 54 bytes, for an IPv6
address that can't be shortened, e.g.:
[1234:5678:9abc:def0:1111:2222:123.234.210.100]:32767
Change it so we arbitrarily use 64 bytes instead; if nothing else
it will make the array of these line up better in hex dumps.

Rename a few things to reinforce the distinction between the number
of strings in the array and the length of individual strings.

Signed-off-by: Alex Elder <elder@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
a5bc3129a296fd4663c3ef0be5575e82453739dd 23-Jan-2012 Alex Elder <elder@dreamhost.com> ceph: make use of "else" where appropriate

Rearrange ceph_tcp_connect() a bit, making use of "else" rather than
re-testing a value with consecutive "if" statements. Don't record a
connection's socket pointer unless the connect operation is
successful.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
5766651971e81298732466c9aa462ff47898ba37 23-Jan-2012 Alex Elder <elder@dreamhost.com> ceph: use a shared zero page rather than one per messenger

Each messenger allocates a page to be used when writing zeroes
out in the event of error or other abnormal condition. Instead,
use the kernel ZERO_PAGE() for that purpose.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
182fac2689b769a96e7fc9defcd560c5cca92b1e 29-Feb-2012 Jim Schutt <jaschut@sandia.gov> net/ceph: Only clear SOCK_NOSPACE when there is sufficient space in the socket buffer

The Ceph messenger would sometimes queue multiple work items to write
data to a socket when the socket buffer was full.

Fix this problem by making ceph_write_space() use SOCK_NOSPACE in the
same way that net/core/stream.c:sk_stream_write_space() does, i.e.,
clearing it only when sufficient space is available in the socket buffer.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Reviewed-by: Alex Elder <elder@dreamhost.com>
bc3b2d7fb9b014d75ebb79ba371a763dbab5e8cf 15-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com> net: Add export.h for EXPORT_SYMBOL/THIS_MODULE to non-modules

These files are non modular, but need to export symbols using
the macros now living in export.h -- call out the include so
that things won't break when we remove the implicit presence
of module.h from everywhere.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
ee3b56f265cafb28c9a1b58faed4f1dbdf51af86 23-Sep-2011 Noah Watkins <noahwatkins@gmail.com> ceph: use kernel DNS resolver

Change ceph_parse_ips to take either names given as
IP addresses or standard hostnames (e.g. localhost).
The DNS lookup is done using the dns_resolver facility
similar to its use in AFS, NFS, and CIFS.

This patch defines CONFIG_CEPH_LIB_USE_DNS_RESOLVER
that controls if this feature is on or off.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
f0ed1b7cef1e801ef470efc501f9c663fe10cebd 10-Aug-2011 Sage Weil <sage@newdream.net> libceph: warn on msg allocation failures

Any non-masked msg allocation failure should generate a warning and stack
trace to the console. All of these need to eventually be replaced by
safe preallocation or msgpools.

Signed-off-by: Sage Weil <sage@newdream.net>
b61c27636fffbaf1980e675282777b9467254a40 10-Aug-2011 Sage Weil <sage@newdream.net> libceph: don't complain on msgpool alloc failures

The pool allocation failures are masked by the pool; there is no need to
spam the console about them. (That's the whole point of having the pool
in the first place.)

Mark msg allocations whose failure is safely handled as such.

Signed-off-by: Sage Weil <sage@newdream.net>
c0d5f9db1c7d1b8a9e2f217706e8ea233bac2754 16-Sep-2011 Jim Schutt <jaschut@sandia.gov> libceph: initialize ack_stamp to avoid unnecessary connection reset

Commit 4cf9d544631c recorded when an outgoing ceph message was ACKed,
in order to avoid unnecessary connection resets when an OSD is busy.

However, ack_stamp is uninitialized, so there is a window between
when the message is sent and when it is ACKed in which handle_timeout()
interprets the unitialized value as an expired timeout, and resets
the connection unnecessarily.

Close the window by initializing ack_stamp.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
4cf9d544631c92809cb94ea680c71df56e9437aa 26-Jul-2011 Sage Weil <sage@newdream.net> libceph: don't time out osd requests that haven't been received

Keep track of when an outgoing message is ACKed (i.e., the server fully
received it and, presumably, queued it for processing). Time out OSD
requests only if it's been too long since they've been received.

This prevents timeouts and connection thrashing when the OSDs are simply
busy and are throttling the requests they read off the network.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
a2a79609c044d3ddb540671d5029a41c90c57251 13-May-2011 Sage Weil <sage@newdream.net> libceph: add missing breaks in addr_set_port

Signed-off-by: Sage Weil <sage@newdream.net>
04177882265bc5014300a631e7384f8fe6b6aa0f 13-May-2011 Sage Weil <sage@newdream.net> libceph: fix TAG_WAIT case

If we get a WAIT as a client something went wrong; error out. And don't
fall through to an unrelated case.

Signed-off-by: Sage Weil <sage@newdream.net>
12a2f643b0e6e791ba61485430d0003eeb3e373c 12-May-2011 Sage Weil <sage@newdream.net> libceph: use snprintf for unknown addrs

Signed-off-by: Sage Weil <sage@newdream.net>
e8f54ce169125a2e59330fac25ad3c9ac0ce22a5 12-May-2011 Sage Weil <sage@newdream.net> libceph: fix uninitialized value when no get_authorizer method is set

If there is no get_authorizer method we set the out_kvec to a bogus
pointer. The length is also zero in that case, so it doesn't much matter,
but it's better not to add the empty item in the first place.

Signed-off-by: Sage Weil <sage@newdream.net>
0da5d70369e87f80adf794080cfff1ca15a34198 19-May-2011 Sage Weil <sage@newdream.net> libceph: handle connection reopen race with callbacks

If a connection is closed and/or reopened (ceph_con_close, ceph_con_open)
it can race with a callback. con_work does various state checks for
closed or reopened sockets at the beginning, but drops con->mutex before
making callbacks. We need to check for state bit changes after retaking
the lock to ensure we restart con_work and execute those CLOSED/OPENING
tests or else we may end up operating under stale assumptions.

In Jim's case, this was causing 'bad tag' errors.

There are four cases where we re-take the con->mutex inside con_work: catch
them all and return EAGAIN from try_{read,write} so that we can restart
con_work.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
ca20892db7567c40e8ed0668f46cf0d085d7db6d 03-May-2011 Henry C Chang <henry.cy.chang@gmail.com> libceph: fix ceph_msg_new error path

If memory allocation failed, calling ceph_msg_put() will cause GPF
since some of ceph_msg variables are not initialized first.

Fix Bug #970.

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
e00de341fdb76c955703b4438100f9933c452b7f 04-Mar-2011 Sage Weil <sage@newdream.net> libceph: fix msgr standby handling

The standby logic used to be pretty dependent on the work requeueing
behavior that changed when we switched to WQ_NON_REENTRANT. It was also
very fragile.

Restructure things so that:
- We clear WRITE_PENDING when we set STANDBY. This ensures we will
requeue work when we wake up later.
- con_work backs off if STANDBY is set. There is nothing to do if we are
in standby.
- clear_standby() helper is called by both con_send() and con_keepalive(),
the two actions that can wake us up again. Move the connect_seq++
logic here.

Signed-off-by: Sage Weil <sage@newdream.net>
e76661d0a59e53e5cc4dccbe4b755d1dc8a968ec 03-Mar-2011 Sage Weil <sage@newdream.net> libceph: fix msgr keepalive flag

There was some broken keepalive code using a dead variable. Shift to using
the proper bit flag.

Signed-off-by: Sage Weil <sage@newdream.net>
60bf8bf8815e6adea4c1d0423578c3b8000e2ec8 04-Mar-2011 Sage Weil <sage@newdream.net> libceph: fix msgr backoff

With commit f363e45f we replaced a bunch of hacky workqueue mutual
exclusion logic with the WQ_NON_REENTRANT flag. One pieces of fallout is
that the exponential backoff breaks in certain cases:

* con_work attempts to connect.
* we get an immediate failure, and the socket state change handler queues
immediate work.
* con_work calls con_fault, we decide to back off, but can't queue delayed
work.

In this case, we add a BACKOFF bit to make con_work reschedule delayed work
next time it runs (which should be immediately).

Signed-off-by: Sage Weil <sage@newdream.net>
692d20f576fb26f62c83f80dbf3ea899998391b7 03-Mar-2011 Sage Weil <sage@newdream.net> libceph: retry after authorization failure

If we mark the connection CLOSED we will give up trying to reconnect to
this server instance. That is appropriate for things like a protocol
version mismatch that won't change until the server is restarted, at which
point we'll get a new addr and reconnect. An authorization failure like
this is probably due to the server not properly rotating it's secret keys,
however, and should be treated as transient so that the normal backoff and
retry behavior kicks in.

Signed-off-by: Sage Weil <sage@newdream.net>
42961d2333a1855c649fa3790e258ab4f0fa66a4 25-Jan-2011 Sage Weil <sage@newdream.net> libceph: fix socket write error handling

Pass errors from writing to the socket up the stack. If we get -EAGAIN,
return 0 from the helper to simplify the callers' checks.

Signed-off-by: Sage Weil <sage@newdream.net>
98bdb0aa007ff7e8e0061936d8d0e210faf2e655 25-Jan-2011 Sage Weil <sage@newdream.net> libceph: fix socket read error handling

If we get EAGAIN when trying to read from the socket, it is not an error.
Return 0 from the helper in this case to simplify the error handling cases
in the caller (indirectly, try_read).

Fix try_read to pass any error to it's caller (con_work) instead of almost
always returning 0. This let's us respond to things like socket
disconnects.

Signed-off-by: Sage Weil <sage@newdream.net>
f363e45fd1184219b472ea549cb7e192e24ef4d2 03-Jan-2011 Tejun Heo <tj@kernel.org> net/ceph: make ceph_msgr_wq non-reentrant

ceph messenger code does a rather complex dancing around multithread
workqueue to make sure the same work item isn't executed concurrently
on different CPUs. This restriction can be provided by workqueue with
WQ_NON_REENTRANT.

Make ceph_msgr_wq non-reentrant workqueue with the default concurrency
level and remove the QUEUED/BUSY logic.

* This removes backoff handling in con_work() but it couldn't reliably
block execution of con_work() to begin with - queue_con() can be
called after the work started but before BUSY is set. It seems that
it was an optimization for a rather cold path and can be safely
removed.

* The number of concurrent work items is bound by the number of
connections and connetions are independent from each other. With
the default concurrency level, different connections will be
executed independently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Sage Weil <sage@newdream.net>
Cc: ceph-devel@vger.kernel.org
Signed-off-by: Sage Weil <sage@newdream.net>
d96c9043d1588f04c7f467167f653c07d83232d5 14-Dec-2010 Sage Weil <sage@newdream.net> ceph: fix msgr_init error path

create_workqueue() returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>
c5c6b19d4b8f5431fca05f28ae9e141045022149 09-Nov-2010 Sage Weil <sage@newdream.net> ceph: explicitly specify page alignment in network messages

The alignment used for reading data into or out of pages used to be taken
from the data_off field in the message header. This only worked as long
as the page alignment matched the object offset, breaking direct io to
non-page aligned offsets.

Instead, explicitly specify the page alignment next to the page vector
in the ceph_msg struct, and use that instead of the message header (which
probably shouldn't be trusted). The alloc_msg callback is responsible for
filling in this field properly when it sets up the page vector.

Signed-off-by: Sage Weil <sage@newdream.net>
df9f86faf3ee610527ed02031fe7dd3c8b752e44 01-Nov-2010 Sage Weil <sage@newdream.net> ceph: fix small seq message skipping

If the client gets out of sync with the server message sequence number, we
normally skip low seq messages (ones we already received). The skip code
was also incrementing the expected seq, such that all subsequent messages
also appeared old and got skipped, and an eventual timeout on the osd
connection. This resulted in some lagging requests and console messages
like

[233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017
[233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018
[233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019
[233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020
[233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021
[233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022
[233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023
[233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024
[233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025
[233500.534614] ceph: tid 6023602 timed out on osd22, will reset osd

Reported-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
3d14c5d2b6e15c21d8e5467dc62d33127c23a644 07-Apr-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: factor out libceph from Ceph file system

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:

- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>