History log of /drivers/block/rbd.c
Revision Date Author Comments
a8d4205623ae965e36c68629db306ca0695a2771 22-Oct-2014 Jan Kara <jack@suse.cz> rbd: Fix error recovery in rbd_obj_read_sync()

When we fail to allocate page vector in rbd_obj_read_sync() we just
basically ignore the problem and continue which will result in an oops
later. Fix the problem by returning proper error.

CC: Yehuda Sadeh <yehuda@inktank.com>
CC: Sage Weil <sage@inktank.com>
CC: ceph-devel@vger.kernel.org
CC: stable@vger.kernel.org
Coverity-id: 1226882
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
f5ee37bd31678d6cb2313631f203794b5c25e862 09-Oct-2014 Ilya Dryomov <idryomov@redhat.com> rbd: use a single workqueue for all devices

Using one queue per device doesn't make much sense given that our
workfn processes "devices" and not "requests". Switch to a single
workqueue for all devices.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
792c3a914910bd34302c5345578f85cfcb5e2c01 10-Oct-2014 Ilya Dryomov <idryomov@redhat.com> rbd: rbd workqueues need a resque worker

Need to use WQ_MEM_RECLAIM for our workqueues to prevent I/O lockups
under memory pressure - we sit on the memory reclaim path.

Cc: stable@vger.kernel.org # 3.17, needs backporting for 3.16
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Tested-by: Micha Krause <micha@krausam.de>
Reviewed-by: Sage Weil <sage@redhat.com>
b76f82398c1017e303d87760e22125714010207f 08-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: set the remaining discard properties to enable support

max_discard_sectors must be set for the queue to support discard.
Operations implementing discard for rbd zero data, so report that.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
d3246fb0da5d70838469c01d5b6b11163b49cd86 08-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: use helpers to handle discard for layered images correctly

Only allocate two osd ops for discard requests, since the
preallocation hint is only added for regular writes. Use
rbd_img_obj_request_fill() to recreate the original write or discard
osd operations, isolating that logic to one place, and change the
assert in rbd_osd_req_create_copyup() to accept discard requests as
well.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
3b434a2aff38029ea053ce6c8fced53b2d01f7f0 05-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: extract a method for adding object operations

rbd_img_request_fill() creates a ceph_osd_request and has logic for
adding the appropriate osd ops to it based on the request type and
image properties.

For layered images, the original rbd_obj_request is resent with a
copyup operation in front, using a new ceph_osd_request. The logic for
adding the original operations should be the same as when first
sending them, so move it to a helper function.

op_type only needs to be checked once, so create a helper for that as
well and call it outside the loop in rbd_img_request_fill().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
1c220881e307b62cc2f77d911219de332aa3f61e 05-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: make discard trigger copy-on-write

Discard requests are a form of write, so they should go through the
same process as plain write requests and trigger copy-on-write for
layered images.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
d0265de7c358d71a494dcd1ee28206b32754bb0f 08-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: tolerate -ENOENT for discard operations

Discard may try to delete an object from a non-layered image that does not exist.
If this occurs, the image already has no data in that range, so change the
result to success.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
bef95455a44e2533fcea376740bb1a5cbd71269f 05-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: fix snapshot context reference count for discards

Discards take a reference to the snapshot context of an image when
they are created. This reference needs to be cleaned up when the
request is done just as it is for regular writes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
3c5df89367761d09d76454a2c4301a73bf2d46ce 04-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: read image size for discard check safely

In rbd_img_request_fill() the image size is only checked to determine
whether we can truncate an object instead of zeroing it for discard
requests. Take rbd_dev->header_rwsem while reading the image size, and
move this read into the discard check, so that non-discard ops don't
need to take the semaphore in this function.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
90e98c5229c0adfadf2c2ad2c91d72902bf61bc4 01-Apr-2014 Guangliang Zhao <lucienchao@gmail.com> rbd: initial discard bits from Guangliang Zhao

This patch add the discard support for rbd driver.

There are three types operation in the driver:
1. The objects would be removed if they completely contained
within the discard range.
2. The objects would be truncated if they partly contained within
the discard range, and align with their boundary.
3. Others would be zeroed.

A discard request from blkdev_issue_discard() is defined which
REQ_WRITE and REQ_DISCARD both marked and no data, so we must
check the REQ_DISCARD first when getting the request type.

This resolve:
http://tracker.ceph.com/issues/190

[ Ilya Dryomov: This is incomplete and somewhat buggy, see follow up
commits by Josh Durgin for refinements and fixes which weren't
folded in to preserve authorship. ]

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
6d2940c881aeb9f46baac548dc4e906a53957dba 13-Mar-2014 Guangliang Zhao <lucienchao@gmail.com> rbd: extend the operation type

It could only handle the read and write operations now,
extend it for the coming discard support.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
c622d226155b12276ae3d29d546f4b314d7cd68c 01-Apr-2014 Guangliang Zhao <lucienchao@gmail.com> rbd: skip the copyup when an entire object writing

It need to copyup the parent's content when layered writing,
but an entire object write would overwrite it, so skip it.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
70d045f660c7331bce8c9377929b52a9738a12cb 12-Sep-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: add img_obj_request_simple() helper

To clarify the conditions and make it easier to add new ones.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
4e752f0ab0e8114f4edd7574081dc625d679dd15 08-Apr-2014 Josh Durgin <josh.durgin@inktank.com> rbd: access snapshot context and mapping size safely

These fields may both change while the image is mapped if a snapshot
is created or deleted or the image is resized. They are guarded by
rbd_dev->header_rwsem, so hold that while reading them, and store a
local copy to refer to outside of the critical section. The local copy
will stay consistent since the snapshot context is reference counted,
and the mapping size is just a u64. This prevents torn loads from
giving us inconsistent values.

Move reading header.snapc into the caller of rbd_img_request_create()
so that we only need to take the semaphore once. The read-only caller,
rbd_parent_request_create() can just pass NULL for snapc, since the
snapshot context is only relevant for writes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
7dd440c9e0711d828442c3e129ab8bcb9aeeac23 11-Sep-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: do not return -ERANGE on auth failures

Trying to map an image out of a pool for which we don't have an 'x'
permission bit fails with -ERANGE from ceph_extract_encoded_string()
due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL
sink, thus propagating rbd::get_id cls method errors. (I've seen
a bunch of unexplained -ERANGE reports, I bet this is it).

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
255939e783d8f45f8c58487dfc18957c44ea9871 14-Aug-2014 Wei Yongjun <yongjun_wei@trendmicro.com.cn> rbd: fix error return code in rbd_dev_device_setup()

Fix to return -ENOMEM from the workqueue alloc error handling
case instead of 0, as done elsewhere in this function.

Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
58d1362b50dc87ebf18cd137e7a879fd99b7e721 12-Aug-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: avoid format-security warning inside alloc_workqueue()

drivers/block/rbd.c: In function ‘rbd_dev_device_setup’:
drivers/block/rbd.c:5090:19: warning: format not a string literal and no format arguments [-Wformat-security]

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
9584d5082653429ea219f9739a08566478b39f16 10-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: remove extra newlines from rbd_warn() messages

rbd_warn() string should be a single line - rbd_warn() appends \n.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
7a716aac01eedb8a7ebf36a0e81237c56f9f1bc1 05-Aug-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC

Now that rbd_img_request_create() is called from work functions, no
need to use GFP_ATOMIC.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
bc1ecc65a259fa9333dc8bd6a4ba0cf03b7d4bf8 04-Aug-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: rework rbd_request_fn()

While it was never a good idea to sleep in request_fn(), commit
34c6bc2c919a ("locking/mutexes: Add extra reschedule point") made it
a *bad* idea. mutex_lock() since 3.15 may reschedule *before* putting
task on the mutex wait queue, which for tasks in !TASK_RUNNING state
means block forever. request_fn() may be called with !TASK_RUNNING on
the way to schedule() in io_schedule().

Offload request handling to a workqueue, one per rbd device, to avoid
calling blocking primitives from rbd_request_fn().

Fixes: http://tracker.ceph.com/issues/8818

Cc: stable@vger.kernel.org # 3.16, needs backporting for 3.15
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Tested-by: Eric Eastman <eric0e@aol.com>
Tested-by: Greg Wilson <greg.wilson@keepertech.com>
Reviewed-by: Alex Elder <elder@linaro.org>
4d9b67cddd9b9bc320473a334cc8023a4186092f 24-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: take snap_id into account when reading in parent info

If we are mapping a snapshot, we must read in the parent_overlap value
of that snapshot instead of that of the base image. Not doing so may
in particular result in us returning zeros instead of user data:

# cat overlap-snap.sh
#!/bin/bash
rbd create --size 10 --image-format 2 foo
FOO_DEV=$(rbd map foo)
dd if=/dev/urandom of=$FOO_DEV bs=1M &>/dev/null
echo "Base image"
dd if=$FOO_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
rbd snap create foo@snap
rbd snap protect foo@snap
rbd clone foo@snap bar
rbd snap create bar@snap
BAR_DEV=$(rbd map bar@snap)
echo "Snapshot"
dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd
rbd resize --allow-shrink --size 4 bar
echo "Snapshot after base image resize"
dd if=$BAR_DEV bs=1 count=16 skip=$(((4 << 20) - 8)) 2>/dev/null | xxd

# ./overlap-snap.sh
Base image
0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
Snapshot
0000000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K"%`4(E..6.
Resizing image: 100% complete...done.
Snapshot after base image resize
0000000: e781 e33b d34b 2225 0000 0000 0000 0000 ...;.K"%........

Even though bar@snap is taken with the old bar parent_overlap (8M),
reads from bar@snap beyond the new bar parent_overlap (4M) return
zeroes. Fix it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
e8f59b595d05b7251a9a3054c14567fd8c8220ef 24-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: do not read in parent info before snap context

Currently rbd_dev_v2_header_info() reads in parent info before the snap
context is read in. This is wrong, because we may need to look at the
the parent_overlap value of the snapshot instead of that of the base
image, for example when mapping a snapshot - see next commit. (When
mapping a snapshot, all we got is its name and we need the snap context
to translate that name into an id to know which parent info to look
for.)

The approach taken here is to make sure rbd_dev_v2_parent_info() is
called after the snap context has been read in. The other approach
would be to add a parent_overlap field to struct rbd_mapping and
maintain it the same way rbd_mapping::size is maintained. The reason
I chose the first approach is that the value of keeping around both
base image values and the actual mapping values is unclear to me.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
5ff1108ccc10dbb07bf5875e38fee313844ccef6 23-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: update mapping size only on refresh

There is no sense in trying to update the mapping size before it's even
been set.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
52bb1f9bed796127e8b446b12e5b834026241cdd 23-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: harden rbd_dev_refresh() and callers a bit

Recently discovered watch/notify problems showed that we really can't
ignore errors in anything refresh related. Alas, currently there is
not much we can do in response to those errors, except print warnings.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
0407759971cdbd302e0efcb03ff9435a0d3db3ab 23-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: split rbd_dev_spec_update() into two functions

rbd_dev_spec_update() has two modes of operation, with nothing in
common between them. Split it into two functions, one for each mode
and make our expectations more clear.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
7626eb7d82e4f1bd008e0a0bb534704d02a39832 23-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: remove unnecessary asserts in rbd_dev_image_probe()

spec->image_id assert doesn't buy us much and image_format is asserted
in rbd_dev_header_name() and rbd_dev_header_info() anyway.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
a720ae0901eddab5c94a17402b7ed29e1afb5003 23-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: introduce rbd_dev_header_info()

A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
ff96128fb020e26e7b32e12e887013956d840f08 22-Jul-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: show the entire chain of parent images

Make /sys/bus/rbd/devices/<id>/parent show the entire chain of parent
images. While at it, kernel sprintf() doesn't return negative values,
casting to unsigned long long is no longer necessary and there is no
good reason to split into multiple sprintf() calls.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
7d5079aa8bc9ca25e61576820d07503b2a558f9b 23-Jul-2014 Himangi Saraogi <himangi774@gmail.com> rbd: use rbd_segment_name_free() instead of kfree()

Free memory allocated using kmem_cache_zalloc using kmem_cache_free
rather than kfree. The helper rbd_segment_name_free does the job here.
Its position is shifted above the calling function.

The Coccinelle semantic patch that detects this change is as follows:

// <smpl>
@@
expression x,E,c;
@@

x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
... when != x = E
when != &x
?-kfree(x)
+kmem_cache_free(c,x)
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
fbba11b3bec52ff560cb42d102f61341049defb0 27-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: do not leak image_id in rbd_dev_v2_parent_info()

image_id is leaked if the parent happens to have been recorded already.
Fix it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
76756a51e27984692fe0affa564e89ee8d323e66 20-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: use rbd_obj_watch_request_helper() helper

Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix
rbd_dev_header_unwatch_sync() to destroy watch_request structures
before queuing watch-remove message while at it. This mistake slipped
into commit b30a01f2a307 ("rbd: fix osd_request memory leak in
__rbd_dev_header_watch_sync()") and could lead to "image still in use"
errors on image removal.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
bb040aa03ce870b0eff21ee75f7f324cd8cabe03 19-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: add rbd_obj_watch_request_helper() helper

In the past, rbd_dev_header_watch_sync() used to handle both watch and
unwatch requests and was entangled and leaky. Commit b30a01f2a307
("rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()")
split it into two separate functions. This commit cleanly abstracts
the common bits, relying on the fixed rbd_obj_request_wait().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
71c20a066f1a4ee1339db0efb58290fbb62e62f2 19-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: rbd_obj_request_wait() should cancel the request if interrupted

rbd_obj_request_wait() should cancel the underlying OSD request if
interrupted. Otherwise libceph will hold onto it indefinitely, causing
assert failures or leaking the original object request.

This also adds an rbd wrapper around ceph_osdc_cancel_request() to
match rbd_obj_request_submit() and rbd_obj_request_wait().

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
9638556a276125553549fdfe349c464481ec2f39 10-Jun-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: handle parent_overlap on writes correctly

The following check in rbd_img_obj_request_submit()

rbd_dev->parent_overlap <= obj_request->img_offset

allows the fall through to the non-layered write case even if both
parent_overlap and obj_request->img_offset belong to the same RADOS
object. This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data. Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:

rbd_data.<id>.0000000000000001
---------------------|----------------------|------------
| should be copyup'ed | should be zeroed out | write ...
---------------------|----------------------|------------
4M 5M 6M
parent_overlap obj_request->img_offset

4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.

Given that the only striping mode kernel client currently supports is
chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
22001f619f29ddf66582d834223dcff4c0b74595 01-Oct-2013 Josh Durgin <josh.durgin@inktank.com> rbd: only set disk to read-only once

rbd_open(), called every time the device is opened, calls
set_device_ro(). There's no reason to set the device read-only or
read-write every time it is opened. Just do this once during device
setup, using set_disk_ro() instead because the struct block_device
isn't available to us there.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
77f33c03739697d01c2e730e4c2610424059ceaf 01-Oct-2013 Josh Durgin <josh.durgin@inktank.com> rbd: move calls that may sleep out of spin lock range

get_user() and set_disk_ro() may allocate memory, leading to a
potential deadlock if theye are called while a spin lock is held.

Move the acquisition and release of rbd_dev->lock from rbd_ioctl()
into rbd_ioctl_set_ro(), so it can occur between get_user() and
set_disk_ro().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
131fd9f6fc89ad2cc993f80664d18ca49d6f8483 24-Sep-2013 Guangliang Zhao <guangliang@unitedstack.com> rbd: add ioctl for rbd

When running the following commands:
[root@ceph0 mnt]# blockdev --setro /dev/rbd1
[root@ceph0 mnt]# blockdev --getro /dev/rbd1
0

The block setro didn't take effect, it is because
the rbd doesn't support ioctl of block driver.

This resolves:
http://tracker.ceph.com/issues/6265

Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ffe312cf31c7d8616096616d469eb5f6bb8905c0 20-May-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: fix ida/idr memory leak

ida_destroy() needs to be called on module exit to release ida caches.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
0f2d5be792b0466b06797f637cfbb0f64dbb408c 26-Apr-2014 Alex Elder <elder@linaro.org> rbd: use reference counts for image requests

Each image request contains a reference count, but to date it has
not actually been used. (I think this was just an oversight.) A
recent report involving rbd failing an assertion shed light on why
and where we need to use these reference counts.

Every OSD request associated with an object request uses
rbd_osd_req_callback() as its callback function. That function will
call a helper function (dependent on the type of OSD request) that
will set the object request's "done" flag if the object request if
appropriate. If that "done" flag is set, the object request is
passed to rbd_obj_request_complete().

In rbd_obj_request_complete(), requests are processed in sequential
order. So if an object request completes before one of its
predecessors in the image request, the completion is deferred.
Otherwise, if it's a completing object's "turn" to be completed, it
is passed to rbd_img_obj_end_request(), which records the result of
the operation, accumulates transferred bytes, and so on. Next, the
successor to this request is checked and if it is marked "done",
(deferred) completion processing is performed on that request, and
so on. If the last object request in an image request is completed,
rbd_img_request_complete() is called, which (typically) destroys
the image request.

There is a race here, however. The instant an object request is
marked "done" it can be provided (by a thread handling completion of
one of its predecessor operations) to rbd_img_obj_end_request(),
which (for the last request) can then lead to the image request
getting torn down. And this can happen *before* that object has
itself entered rbd_img_obj_end_request(). As a result, once it
*does* enter that function, the image request (and even the object
request itself) may have been freed and become invalid.

All that's necessary to avoid this is to properly count references
to the image requests. We tear down an image request's object
requests all at once--only when the entire image request has
completed. So there's no need for an image request to count
references for its object requests. However, we don't want an
image request to go away until the last of its object requests
has passed through rbd_img_obj_callback(). In other words,
we don't want rbd_img_request_complete() to necessarily
result in the image request being destroyed, because it may
get called before we've finished processing on all of its
object requests.

So the fix is to add a reference to an image request for
each of its object requests. The reference can be viewed
as representing an object request that has not yet finished
its call to rbd_img_obj_callback(). That is emphasized by
getting the reference right after assigning that as the image
object's callback function. The corresponding release of that
reference is done at the end of rbd_img_obj_callback(), which
every image object request passes through exactly once.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
b30a01f2a307f55a505762ba09c0440d906c6711 22-May-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()

osd_request, along with r_request and r_reply messages attached to it
are leaked in __rbd_dev_header_watch_sync() if the requested image
doesn't exist. This is because lingering requests are special and get
an extra ref in the reply path. Fix it by unregistering linger request
on the error path and split __rbd_dev_header_watch_sync() into two
functions to make it maintainable.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
30ba1f020221991cf239d905c82984958f29bdfe 13-May-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: make sure we have latest osdmap on 'rbd map'

Given an existing idle mapping (img1), mapping an image (img2) in
a newly created pool (pool2) fails:

$ ceph osd pool create pool1 8 8
$ rbd create --size 1000 pool1/img1
$ sudo rbd map pool1/img1
$ ceph osd pool create pool2 8 8
$ rbd create --size 1000 pool2/img2
$ sudo rbd map pool2/img2
rbd: sysfs write failed
rbd: map failed: (2) No such file or directory

This is because client instances are shared by default and we don't
request an osdmap update when bumping a ref on an existing client. The
fix is to use the mon_get_version request to see if the osdmap we have
is the latest, and block until the requested update is received if it's
not.

Fixes: http://tracker.ceph.com/issues/8184

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
461f758ac0bad40fe8e0959f415dae38efa16c12 11-Apr-2014 Duan Jiong <duanj.fnst@cn.fujitsu.com> rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO

This patch fixes coccinelle error regarding usage of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
0ccd59266973047770d5160318561c9189b79c93 25-Feb-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op

In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order). Backwards compatibility is taken care
of on the libceph/osd side.

"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once. The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into. To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
deb236b300cea3e7a114115571194b9872dbdfd1 25-Feb-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: num_ops parameter for rbd_osd_req_create()

In preparation for prefixing rbd writes with an allocation hint
introduce a num_ops parameter for rbd_osd_req_create(). The rationale
is that not every write request is a write op that needs to be prefixed
(e.g. watch op), so the num_ops logic needs to be in the callers.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
7cc69d42e6950404587bef9489a5ed6f9f6bab4e 25-Feb-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: bump CEPH_OSD_MAX_OP to 3

Our longest osd request now contains 3 ops: copyup+hint+write.

Also, CEPH_OSD_MAX_OP value in a BUG_ON in rbd_osd_req_callback() was
hard-coded to 2. Fix it, and switch to rbd_assert while at it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
42dd037c08c7cd6e3e9af7824b0c1d063f838885 04-Mar-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: fix error paths in rbd_img_request_fill()

Doing rbd_obj_request_put() in rbd_img_request_fill() error paths is
not only insufficient, but also triggers an rbd_assert() in
rbd_obj_request_destroy():

Assertion failure in rbd_obj_request_destroy() at line 1867:

rbd_assert(obj_request->img_request == NULL);

rbd_img_obj_request_add() adds obj_requests to the img_request, the
opposite is rbd_img_obj_request_del(). Use it.

Fixes: http://tracker.ceph.com/issues/7327

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
62054da65c626dd603190c16805f92cf2cf47d4c 04-Mar-2014 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: remove out_partial label in rbd_img_request_fill()

Commit 03507db631c94 ("rbd: fix buffer size for writes to images with
snapshots") moved the call to rbd_img_obj_request_add() up, making the
out_partial label bogus. Remove it.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
638c323c4d1f8eaf25224946e21ce8818f1bcee1 25-Mar-2014 Alex Elder <elder@linaro.org> rbd: drop an unsafe assertion

Olivier Bonvalet reported having repeated crashes due to a failed
assertion he was hitting in rbd_img_obj_callback():

Assertion failure in rbd_img_obj_callback() at line 2165:
rbd_assert(which >= img_request->next_completion);

With a lot of help from Olivier with reproducing the problem
we were able to determine the object and image requests had
already been completed (and often freed) at the point the
assertion failed.

There was a great deal of discussion on the ceph-devel mailing list
about this. The problem only arose when there were two (or more)
object requests in an image request, and the problem was always
seen when the second request was being completed.

The problem is due to a race in the window between setting the
"done" flag on an object request and checking the image request's
next completion value. When the first object request completes, it
checks to see if its successor request is marked "done", and if
so, that request is also completed. In the process, the image
request's next_completion value is updated to reflect that both
the first and second requests are completed. By the time the
second request is able to check the next_completion value, it
has been set to a value *greater* than its own "which" value,
which caused an assertion to fail.

Fix this problem by skipping over any completion processing
unless the completing object request is the next one expected.
Test only for inequality (not >=), and eliminate the bad
assertion.

Tested-by: Olivier Bonvalet <ob@daevel.fr>
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
3c972c95c68f455d80ff185aa440857be046bbe0 27-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}

Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before
introducing r_target_{oloc,oid} needed for redirects.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
4295f2217a5aa8ef2738e3a368db3c1ceab41212 27-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: introduce and start using oid abstraction

In preparation for tiering support, which would require having two
(base and target) object names for each osd request and also copying
those names around, introduce struct ceph_object_id (oid) and a couple
helpers to facilitate those copies and encapsulate the fact that object
name is not necessarily a NUL-terminated string.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2d0ebc5d591f49131bf8f93b54c5424162c3fb7f 27-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN

In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to
CEPH_MAX_OID_NAME_LEN.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
22116525baec1d63f4878eaa92f0b57946a78819 27-Jan-2014 Ilya Dryomov <ilya.dryomov@inktank.com> libceph: start using oloc abstraction

Instead of relying on pool fields in ceph_file_layout (for mapping) and
ceph_pg (for enconding), start using ceph_object_locator (oloc)
abstraction. Note that userspace oloc currently consists of pool, key,
nspace and hash fields, while this one contains only a pool. This is
OK, because at this point we only send (i.e. encode) olocs and never
have to receive (i.e. decode) them.

This makes keeping a copy of ceph_file_layout in every osd request
unnecessary, so ceph_osd_request::r_file_layout field is nuked.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e37180c0f2f0c5b21e9295d5b19874ff4a659be1 16-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: tear down watch request if rbd_dev_device_setup() fails

Tear down watch request if rbd_dev_device_setup() fails.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
fca270653909404112ea5f6eed274ed5272d5252 16-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: introduce rbd_dev_header_unwatch_sync() and switch to it

Rename rbd_dev_header_watch_sync() to __rbd_dev_header_watch_sync() and
introduce two helpers: rbd_dev_header_{,un}watch_sync() to make it more
clear what is going on.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7e513d43669a0505ee3b122344176147a674bcbf 16-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: enable extended devt in single-major mode

If single-major device number allocation scheme is turned on, instead
of reserving 256 minors per device, which imposes a limit of 4096
images mapped at once, reserve 16 minors per device and enable extended
devt feature. This results in a theoretical limit of 65536 images
mapped at once, and still allows to have more than 15 partititions:
partitions starting with 16th are mapped under major 259 (Block
Extended Major):

$ rbd showmapped
id pool image snap device
0 rbd b5 - /dev/rbd0 # no partitions
1 rbd b2 - /dev/rbd1 # 40 partitions
2 rbd b3 - /dev/rbd2 # 2 partitions

$ cat /proc/partitions
251 0 1024 rbd0
251 16 1024 rbd1
251 17 0 rbd1p1
251 18 0 rbd1p2
...
251 30 0 rbd1p14
251 31 0 rbd1p15
259 0 0 rbd1p16
259 1 0 rbd1p17
...
259 23 0 rbd1p39
259 24 0 rbd1p40
251 32 1024 rbd2
251 33 0 rbd2p1
251 34 0 rbd2p2

(major 251 was assigned dynamically at module load time)

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9b60e70b3b6a8e4bc2d1b6d9f858a30e1cec496b 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: add support for single-major device number allocation scheme

Currently each rbd device is allocated its own major number, which
leads to a hard limit of 230-250 images mapped at once. This commit
adds support for a new single-major device number allocation scheme,
which is hidden behind a new single_major boolean module parameter and
is disabled by default for backwards compatibility reasons. (Old
userspace cannot correctly unmap images mapped under single-major
scheme and would essentially just unmap a random image, if that.)

$ rbd showmapped
id pool image snap device
0 rbd b100 - /dev/rbd0
1 rbd b101 - /dev/rbd1
2 rbd b102 - /dev/rbd2
3 rbd b103 - /dev/rbd3

Old scheme (modprobe rbd):

$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253, 0 Dec 10 12:24 /dev/rbd0
brw-rw---- 1 root disk 252, 0 Dec 10 12:28 /dev/rbd1
brw-rw---- 1 root disk 252, 1 Dec 10 12:28 /dev/rbd1p1
brw-rw---- 1 root disk 252, 2 Dec 10 12:28 /dev/rbd1p2
brw-rw---- 1 root disk 252, 3 Dec 10 12:28 /dev/rbd1p3
brw-rw---- 1 root disk 251, 0 Dec 10 12:28 /dev/rbd2
brw-rw---- 1 root disk 251, 1 Dec 10 12:28 /dev/rbd2p1
brw-rw---- 1 root disk 250, 0 Dec 10 12:24 /dev/rbd3

New scheme (modprobe rbd single_major=Y):

$ ls -l /dev/rbd*
brw-rw---- 1 root disk 253, 0 Dec 10 12:30 /dev/rbd0
brw-rw---- 1 root disk 253, 256 Dec 10 12:30 /dev/rbd1
brw-rw---- 1 root disk 253, 257 Dec 10 12:30 /dev/rbd1p1
brw-rw---- 1 root disk 253, 258 Dec 10 12:30 /dev/rbd1p2
brw-rw---- 1 root disk 253, 259 Dec 10 12:30 /dev/rbd1p3
brw-rw---- 1 root disk 253, 512 Dec 10 12:30 /dev/rbd2
brw-rw---- 1 root disk 253, 513 Dec 10 12:30 /dev/rbd2p1
brw-rw---- 1 root disk 253, 768 Dec 10 12:30 /dev/rbd3

(major 253 was assigned dynamically at module load time)

The new limit is 4096 images mapped at once, and it comes from the fact
that, as before, 256 minor numbers are reserved for each mapping.
(A follow-up commit changes the number of minors reserved and the way
we deal with partitions over that number.)

If single_major is set to true, two new sysfs interfaces show up:
/sys/bus/rbd/{add,remove}_single_major. These are to be used instead
of /sys/bus/rbd/{add,remove}, which are disabled for backwards
compatibility reasons outlined above.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
92c76dc036e2139226e90851864d3e01e1db5dd8 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: wire up is_visible() sysfs callback for rbd bus

In preparation for single-major device number allocation scheme, wire
up attribute_group::is_visible() callback for rbd bus. This allows us
to make the new single-major attributes conditional.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
dd82fff1e8e7b486887dd88981776bb44e370848 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: add 'minor' sysfs rbd device attribute

Introduce /sys/bus/rbd/devices/<id>/minor sysfs attribute for exporting
rbd whole disk minor numbers. This is a step towards single-major
device number allocation scheme, but also a good thing on its own.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f8a22fc238a449ff982bfb40e30c3f3c9c90a08a 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: switch to ida for rbd id assignments

Currently rbd ids are allocated using an atomic variable that keeps
track of the highest id currently in use and each new id is simply one
more than the value of that variable. That's nice and cheap, but it
does mean that rbd ids are allowed to grow boundlessly, and, more
importantly, it's completely unpredictable. So, in preparation for
single-major device number allocation scheme, which is going to
establish and rely on a constant mapping between rbd ids and device
numbers, switch to ida for rbd id assignments.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e1b4d96dea61c3078775090e8b121f571aab8fda 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: refactor rbd_init() a bit

Refactor rbd_init() a bit to make it more clear what's going on.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
90da258b887538ed3a2f904fa593173f26a1adbc 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: tweak "loaded" message and module description

Tweak "loaded" message, so that it looks like

[ 30.184235] rbd: loaded

instead of

[ 38.056564] rbd: loaded rbd (rados block device)

Also move (and slightly tweak) MODULE_DESCRIPTION so that all authors
are next to each other in modinfo output.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
70eebd200696aea897bfb596053d0c688ec1894b 13-Dec-2013 Ilya Dryomov <ilya.dryomov@inktank.com> rbd: rbd_device::dev_id is an int, format it as such

rbd_device::dev_id is an int, format it as such.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5341a6278bc5d10dbbb2ab6031b41d95c8db7a35 07-Aug-2013 Kent Overstreet <kmo@daterainc.com> rbd: Refactor bio cloning

Now that we've got drivers converted to the new immutable bvec
primitives, bio splitting becomes much easier - this is how the new
bio_split() will work. (Someone more familiar with the ceph code could
probably use bio_clone_fast() instead of bio_clone() here).

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
7988613b0e5b2638caf6cd493cc78e9595eba19c 24-Nov-2013 Kent Overstreet <kmo@daterainc.com> block: Convert bio_for_each_segment() to bvec_iter

More prep work for immutable biovecs - with immutable bvecs drivers
won't be able to use the biovec directly, they'll need to use helpers
that take into account bio->bi_iter.bi_bvec_done.

This updates callers for the new usage without changing the
implementation yet.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Jim Paris <jim@jtan.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
Cc: support@lsi.com
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-m68k@lists.linux-m68k.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: drbd-user@lists.linbit.com
Cc: nbd-general@lists.sourceforge.net
Cc: cbe-oss-dev@lists.ozlabs.org
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: DL-MPTFusionLinux@lsi.com
Cc: linux-scsi@vger.kernel.org
Cc: devel@driverdev.osuosl.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cluster-devel@redhat.com
Cc: linux-mm@kvack.org
Acked-by: Geoff Levand <geoff@infradead.org>
4f024f3797c43cb4b73cd2c50cec728842d0e49e 12-Oct-2013 Kent Overstreet <kmo@daterainc.com> block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
bb8e0e84b30afc9827931c9773d75d5c99fcddff 11-Sep-2013 Jingoo Han <jg1.han@samsung.com> block: replace strict_strtoul() with kstrtoul()

The use of strict_strtoul() is not preferred, because strict_strtoul() is
obsolete. Thus, kstrtoul() should be used.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
da6a6b63978d45f9ae582d1f362f182012da3a22 05-Sep-2013 Josh Durgin <josh.durgin@inktank.com> rbd: fix error handling from rbd_snap_name()

rbd_snap_name() calls rbd_dev_v{1,2}_snap_name() depending on the
format of the image. The format 1 version returns NULL on error, which
is handled by the caller. The format 2 version returns an ERR_PTR,
which the caller of rbd_snap_name() does not expect.

Fortunately this is unlikely to occur in practice because
rbd_snap_id_by_name() is called before rbd_snap_name(). This would hit
similar errors to rbd_snap_name() (like the snapshot not existing) and
return early, so rbd_snap_name() would not hit an error unless the
snapshot was removed between the two calls or memory was exhausted.

Use an ERR_PTR in rbd_dev_v1_snap_name() so that the specific error
can be propagated, and it is consistent with rbd_dev_v2_snap_name().
Handle the ERR_PTR in the only rbd_snap_name() caller.

Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
efadc98aab674153709cc357ba565f04e3164fcd 30-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: ignore unmapped snapshots that no longer exist

This prevents erroring out while adding a device when a snapshot
unrelated to the current mapping is deleted between reading the
snapshot context and reading the snapshot names. If the mapped
snapshot name is not found an error still occurs as usual.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
9875201e10496612080e7d164acc8f625c18725c 30-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: fix use-after free of rbd_dev->disk

Removing a device deallocates the disk, unschedules the watch, and
finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
from the watch callback, updates the disk size and rbd_dev
structure. With no locking between them, rbd_dev_refresh() may use the
device or rbd_dev after they've been freed.

To fix this, check whether RBD_DEV_FLAG_REMOVING is set before
updating the disk size in rbd_dev_refresh(). In order to prevent a
race where rbd_dev_refresh() is already revalidating the disk when
rbd_remove() is called, move the call to rbd_bus_del_dev() after the
watch is unregistered and all notifies are complete. It's safe to
defer deleting this structure because no new requests can be submitted
once the RBD_DEV_FLAG_REMOVING is set, since the device cannot be
opened.

Fixes: http://tracker.ceph.com/issues/5636
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
20e0af67ce88c657d0601977b9941a2256afbdaa 30-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: make rbd_obj_notify_ack() synchronous

The only user of rbd_obj_notify_ack() is rbd_watch_cb(). It used
asynchronously with no tracking of when the notify ack completes, so
it may still be in progress when the osd_client is shut down. This
results in a BUG() since the osd client assumes no requests are in
flight when it stops. Since all notifies are flushed before the
osd_client is stopped, waiting for the notify ack to complete before
returning from the watch callback ensures there are no notify acks in
flight during shutdown.

Rename rbd_obj_notify_ack() to rbd_obj_notify_ack_sync() to reflect
its new synchronous nature.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
9abc59908e0c5f983aaa91150da32d5b62cf60b7 30-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: complete notifies before cleaning up osd_client and rbd_dev

To ensure rbd_dev is not used after it's released, flush all pending
notify callbacks before calling rbd_dev_image_release(). No new
notifies can be added to the queue at this point because the watch has
already be unregistered with the osd_client.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
c35455791c1131e7ccbf56ea6fbdd562401c2ce2 29-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: fix null dereference in dout

The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
the dout() always derefences it. Move this to another dout() protected
by a check that order is non-NULL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
03507db631c94a48e316c7f638ffb2991544d617 27-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: fix buffer size for writes to images with snapshots

rbd_osd_req_create() needs to know the snapshot context size to create
a buffer large enough to send it with the message front. It gets this
from the img_request, which was not set for the obj_request yet. This
resulted in trying to write past the end of the front payload, hitting
this BUG:

libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len);

Fix this by associating the obj_request with its img_request
immediately after it's created, before the osd request is created.

Fixes: http://tracker.ceph.com/issues/5760
Suggested-by: Alex Elder <alex.elder@linaro.org>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
17c1cc1d9293a568a00545469078e29555cc7f39 27-Aug-2013 Josh Durgin <josh.durgin@inktank.com> rbd: fix I/O error propagation for reads

When a request returns an error, the driver needs to report the entire
extent of the request as completed. Writes already did this, since
they always set xferred = length, but reads were skipping that step if
an error other than -ENOENT occurred. Instead, rbd would end up
passing 0 xferred to blk_end_request(), which would always report
needing more data. This resulted in an assert failing when more data
was required by the block layer, but all the object requests were
done:

[ 1868.719077] rbd: obj_request read result -108 xferred 0
[ 1868.719077]
[ 1868.719518] end_request: I/O error, dev rbd1, sector 0
[ 1868.719739]
[ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736:
[ 1868.719739]
[ 1868.719739] rbd_assert(more ^ (which == img_request->obj_request_count));

Without this assert, reads that hit errors would hang forever, since
the block layer considered them incomplete.

Fixes: http://tracker.ceph.com/issues/5647
CC: stable@vger.kernel.org # v3.10
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
b15a21dddad552f7e42ae8a7da84de334f6acdcf 23-Aug-2013 Greg Kroah-Hartman <gregkh@linuxfoundation.org> rbd: convert bus code to use bus_groups

The bus_attrs field of struct bus_type is going away soon, dev_groups
should be used instead. This converts the RBD bus code to use the
correct field.

Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Acked-by: Alex Elder <elder@linaro.org>
Cc: <ceph-devel@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
a158073c43b3aa26407b4c7987c909d21a12b5e5 09-Aug-2013 Jingoo Han <jg1.han@samsung.com> block: rbd: use NULL instead of 0

The local variables such as 'bio_list', and 'pages' are pointers;
thus, use NULL instead of 0 to fix the following sparse warnings.

drivers/block/rbd.c:2166:32: warning: Using plain integer as NULL pointer
drivers/block/rbd.c:2168:31: warning: Using plain integer as NULL pointer

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Reviewed-by: Sage Weil <sage@inktank.com>
e976cad0f0dbe5440a4ca38e29e1f932d9319125 09-Jun-2013 Sage Weil <sage@inktank.com> rbd: fix a couple warnings

gcc isn't quite smart enough and generates these warnings:

drivers/block/rbd.c: In function 'rbd_img_request_fill':
drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here
drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized]

even though they are initialized for their respective code paths.

Signed-off-by: Sage Weil <sage@inktank.com>
d552c6191bcd952991ffdfff585c8849e8be911d 01-Jun-2013 Alex Elder <elder@inktank.com> rbd: take a little credit

Add a name to the list of authors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cfbf6377b696d88461eef6966bef9e6184111183 01-Jun-2013 Alex Elder <elder@inktank.com> rbd: use rwsem to protect header updates

Updating an image header needs to be protected to ensure it's
done consistently. However distinct headers can be updated
concurrently without a problem. Instead of using the global
control lock to serialize headder updates, just rely on the header
semaphore. (It's already used, this just moves it out to cover
a broader section of the code.)

That leaves the control mutex protecting only the creation of rbd
clients, so rename it.

This resolves:
http://tracker.ceph.com/issues/5222

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1ba0f1e7975ad07557f7a931522bdcd813ae35f6 31-May-2013 Alex Elder <elder@inktank.com> rbd: don't hold ctl_mutex to get/put device

When an rbd device is first getting mapped, its device registration
is protected the control mutex. There is no need to do that though,
because the device has already been assigned an id that's guaranteed
to be unique.

An unmap of an rbd device won't proceed if the device has a non-zero
open count or is already being unmapped. So there's no need to hold
the control mutex in that case either.

Finally, an rbd device can't be opened if it is being removed, and
it won't go away if there is a non-zero open count. So here too
there's no need to hold the control mutex while getting or putting a
reference to an rbd device's Linux device structure.

Drop the mutex calls in these cases.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
82a442d239695a242c4d584464c9606322cd02aa 01-Jun-2013 Alex Elder <elder@inktank.com> rbd: protect against concurrent unmaps

Make sure two concurrent unmap operations on the same rbd device
won't collide, by only proceeding with the removal and cleanup of a
device if is not already underway.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
751cc0e3cfabdda87c4c21519253c6751e97a8d4 31-May-2013 Alex Elder <elder@inktank.com> rbd: set removing flag while holding list lock

When unmapping a device, its id is supplied, and that is used to
look up which rbd device should be unmapped. Looking up the
device involves searching the rbd device list while holding
a spinlock that protects access to that list.

Currently all of this is done under protection of the control lock,
but that protection is going away soon. To ensure the rbd_dev is
still valid (still on the list) while setting its REMOVING flag, do
so while still holding the list lock. To do so, get rid of
__rbd_get_dev(), and open code what it did in the one place it
was used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
08f75463c15e26e9d67a7c992ce7dd8964c6cbdd 29-May-2013 Alex Elder <elder@inktank.com> rbd: protect against duplicate client creation

If more than one rbd image has the same ceph cluster configuration
(same options, same set of monitors, same keys) they normally share
a single rbd client.

When an image is getting mapped, rbd looks to see if an existing
client can be used, and creates a new one if not.

The lookup and creation are not done under a common lock though, so
mapping two images concurrently could lead to duplicate clients
getting set up needlessly. This isn't a major problem, but it's
wasteful and different from what's intended.

This patch fixes that by using the control mutex to protect
both the lookup and (if needed) creation of the client. It
was previously used just when creating.

This resolves:
http://tracker.ceph.com/issues/3094

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3b5cf2a2f1746a253d56f54ffbb45170c90b1cbd 29-May-2013 Alex Elder <elder@inktank.com> rbd: clean up a few things in the refresh path

This includes a few relatively small fixes I found while examining
the code that refreshes image information.

This resolves:
http://tracker.ceph.com/issues/5040

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e215605417b87732c6debf65da6d953016a1e5bc 23-May-2013 Alex Elder <elder@inktank.com> rbd: flush dcache after zeroing page data

Neither zero_bio_chain() nor zero_pages() contains a call to flush
caches after zeroing a portion of a page. This can cause problems
on architectures that have caches that allow virtual address
aliasing.

This resolves:
http://tracker.ceph.com/issues/4777

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
912c317d4600b81664ad8f3d3ba6c1f2ff4b49c2 14-May-2013 Alex Elder <elder@inktank.com> rbd: drop original request earlier for existence check

The reference to the original request dropped at the end of
rbd_img_obj_exists_callback() corresponds to the reference taken
in rbd_img_obj_exists_submit() to account for the stat request
referring to it. Move the put of that reference up right after
clearing that pointer to make its purpose more obvious.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
491205a8b45e3d9b594e1e7a997284f2e82f22e9 14-May-2013 Geert Uytterhoeven <geert@linux-m68k.org> rbd: Use min_t() to fix comparison of distinct pointer types warning

drivers/block/rbd.c: In function ‘zero_pages’:
drivers/block/rbd.c:1102: warning: comparison of distinct pointer types lacks a cast

Remove the hackish casts and use min_t() to fix this.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Alex Elder <elder@inktank.com>
d2d1f17a0dad823a4cb71583433d26cd7f734e08 26-Jun-2013 Josh Durgin <josh.durgin@inktank.com> rbd: send snapshot context with writes

Sending the right snapshot context with each write is required for
snapshots to work. Due to the ordering of calls, the snapshot context
is never set for any requests. This causes writes to the current
version of the image to be reflected in all snapshots, which are
supposed to be read-only.

This happens because rbd_osd_req_format_write() sets the snapshot
context based on obj_request->img_request. At this point, however,
obj_request->img_request has not been set yet, to the snapshot context
is set to NULL. Fix this by moving rbd_img_obj_request_add(), which
sets obj_request->img_request, before the osd request formatting
calls.

This resolves:
http://tracker.ceph.com/issues/5465

Reported-by: Karol Jurak <karol.jurak@gmail.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
1617e40c1eeeeb857ff4b66acee20ed2acc1b5e7 12-Jun-2013 Josh Durgin <josh.durgin@inktank.com> rbd: fetch object order before using it

rbd_dev_v2_header_onetime() fetches striping information, and
checks whether the image can be read by compariing the stripe unit
to the object size. It determines the object size by shifting
the object order, which is 0 at this point since it has not been
read yet. Move the call to get the image size and object order
before rbd_dev_v2_header_onetime() so it is set before use.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
3a96d5cd7bdce45d5dded75c3a62d4fb98050280 13-Jun-2013 Josh Durgin <josh.durgin@inktank.com> rbd: use the correct length for format 2 object names

Format 2 objects use 16 characters for the object name suffix to be
able to express the full 64-bit range of object numbers. Format 1
images only use 12 characters for this. Using 12-character names for
format 2 caused userspace and kernel rbd clients to read differently
named objects, which made an image written by one client look empty to
the other client.

CC: stable@vger.kernel.org # 3.9+
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
3abef3b3585bbc67d56fdc9c67761a900fb4b69d 14-May-2013 Alex Elder <elder@inktank.com> rbd: fix cleanup in rbd_add()

Bjorn Helgaas pointed out that a recent commit introduced a
use-after-free condition in an error path for rbd_add().
He correctly stated:

I think b536f69a3a5 "rbd: set up devices only for mapped images"
introduced a use-after-free error in rbd_add():
...
If rbd_dev_device_setup() returns an error, we call
rbd_dev_image_release(), which ultimately kfrees rbd_dev.
Then we call rbd_dev_destroy(), which references fields in
the already-freed rbd_dev struct before kfreeing it again.

The simple fix is to return the error code after the call to
rbd_dev_image_release().

Closer examination revealed that there's no need to clean up
rbd_opts in that function, so fix that too.

Update some other comments that have also become out of date.

Reported-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7262cfca430a1a0e0707149af29ae86bc0ded230 16-May-2013 Alex Elder <elder@inktank.com> rbd: don't destroy ceph_opts in rbd_add()

Whether rbd_client_create() successfully creates a new client or
not, it takes responsibility for getting the ceph_opts structure
it's passed destroyed. If successful, the structure becomes
associated with the created client; if not, rbd_client_create()
will destroy it.

Previously, rbd_get_client() would call ceph_destroy_options()
if rbd_get_client() failed, and that meant it got called twice.
That led freeing various pointers more than once, which is never a
good idea.

This resolves:
http://tracker.ceph.com/issues/4559

Cc: stable@vger.kernel.org # 3.8+
Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
638f5abed3f7d8a7fc24087bd760fa3d99f68a39 07-May-2013 Alex Elder <elder@inktank.com> rbd: re-submit flattened write request (part 2)

Add code to rbd_img_obj_exists_callback() to detect when a clone's
parent image has disappeared, and re-submit the original write
request in that case.

Kill off some redundant assertions.

This completes the resolution for:
http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bbea1c1a31b1128d34b2b5b4f28276969cce14e9 07-May-2013 Alex Elder <elder@inktank.com> rbd: re-submit write request for flattened clone

Add code to rbd_img_parent_read_full_callback() to detect when a
clone's parent image has disappeared, and re-submit the original
write request in that case. (See the previous commit for more
reasoning about why this is appropriate.)

Rename some variables in rbd_img_obj_parent_read_full_callback()
to match the convention used in the previous patch.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
02c74fbad9d4a5149756eb35be7300736e4904e9 07-May-2013 Alex Elder <elder@inktank.com> rbd: re-submit read request for flattened clone

If a clone image gets flattened while a parent read request is
underway, the original rbd object request needs to be resubmitted.

The reason is that by the time we get the response to the parent
read request, the data read from the parent may be out of date.
In other words, we could see this sequence of events:

rbd client parent image/osd
---------- ----------------
original object ENOENT;
issue parent read
respond to parent read
child image flattened
original image header refresh
<--- original object written independently here
parent read response received

Add code to rbd_img_parent_read_callback() to detect when a clone's
parent image has disappeared (as evidenced by its parent overlap
becoming 0), and re-submit the original read request in that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
392a9dad7e777296fe79d97a6b3acd735ad2eb5f 07-May-2013 Alex Elder <elder@inktank.com> rbd: detect when clone image is flattened

A format 2 clone image can be the subject of a "flatten" operation,
during which all of its data gets "copied up" from its parent image,
leaving the image fully populated. Once this is complete, the
clone's association with the parent is abolished.

Since this can occur when a clone is mapped, we need to detect when
it has occurred and handle it accordingly. We know an image has
been flattened when we know it at one time had a parent, but we have
learned (via a "get_parent" object class method call) it no longer
has one.

There might be in-flight requests at the point we learn an image has
been flattened, so we can't simply clean up parent data structures
right away. Instead, we'll drop the initial parent reference when
the parent has disappeared (rather than when the image gets
destroyed), which will allow the last in-flight reference to clean
things up when it's complete.

We leverage the fact that a zero parent overlap renders an image
effectively unlayered. We set the overlap to 0 at the point we
detect the clone image has flattened, which allows the unlayered
behavior to take effect immediately, while keeping other parent
structures in place until in-flight requests to complete.

This and the next few patches resolve:
http://tracker.ceph.com/issues/3763

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a2acd00e7964dbb1668a5956c9d0a4bdeb838c4a 09-May-2013 Alex Elder <elder@inktank.com> rbd: reference count parent requests

Keep a reference count for uses of the parent information for an rbd
device.

An initial reference is set in rbd_img_request_create() if the
target image has a parent (with non-zero overlap). Each image
request for an image with a non-zero parent overlap gets another
reference when it's created, and that reference is dropped when the
request is destroyed.

The initial reference is dropped when the image gets torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e93f3152357ca75284284bef8eeea7d45fe1bab1 09-May-2013 Alex Elder <elder@inktank.com> rbd: define parent image request routines

Define rbd_parent_request_create() and rbd_parent_request_destroy()
to handle the creation of parent image requests submitted for
layered image objects. For simplicity, let rbd_img_request_put()
handle dropping the reference to any image request (parent or not),
and call whichever destructor is appropriate on the last put.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
fb65d2284c117cfc28d30217d25a14a8e7a75a94 09-May-2013 Alex Elder <elder@inktank.com> rbd: define rbd_dev_unparent()

Define rbd_dev_unparent() to encapsulate cleaning up parent data
structures from a layered rbd image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8785b1d487f0a31afd2c802499786d3b355eccea 09-May-2013 Alex Elder <elder@inktank.com> rbd: don't release write request until necessary

Previously when a layered write was going to involve a copyup
request, the original osd request was released before submitting the
parent full-object read. The osd request for the copyup would then
be allocated in rbd_img_obj_parent_read_full_callback().

Shortly we will be handling the event of mapped layered images
getting flattened, and when that occurs we need to resubmit the
original request. We therefore don't want to release the osd
request until we really konw we're going to replace it--in the
callback function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
642a25375f4c863607d2170f4471aec8becf7788 07-May-2013 Alex Elder <elder@inktank.com> rbd: get parent info on refresh

Get parent info for format 2 images on every refresh (rather than
just during the initial probe). This will be needed to detect the
disappearance of the parent image in the event a mapped image
becomes unlayered (i.e., flattened). Avoid leaking the previous
parent spec on the second and subsequent times this information is
requested by dropping the previous one (if any) before updating it.
(Also, extract the pool id into a local variable before assigning
it into the parent spec.)

Switch to using a non-zero parent overlap value rather than the
existence of a parent (a non-null parent_spec pointer) to determine
whether to mark a request layered. It will soon be possible for
a layered image to become unlayered while a request is in flight.

This means that the layered flag for an image request indicates that
there was a non-zero parent overlap at the time the image request
was created. The parent overlap can change thereafter, which may
lead to special handling at request submission or completion time.

This and the next several patches are related to:
http://tracker.ceph.com/issues/3763

NOTE:
If an error occurs while refreshing the parent info (i.e.,
requesting it after initial probe), the old parent info will
persist. This is not really correct, and is a scenario that needs
to be addressed. For now we'll assert that the failure mode is
unlikely, but the issue has been documented in tracker issue 5040.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
70cf49cfc7a4d1eb4aeea6cd128b88230be9d0b1 07-May-2013 Alex Elder <elder@inktank.com> rbd: ignore zero-overlap parent

An rbd clone image that has an overlap with its parent of 0 is
effectively not a layered image at all. Detect this case and treat
such an image as non-layered. Issue a warning to be sure the user
knows what's going on.

This resolves:
http://tracker.ceph.com/issues/5028

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b91f09f17b2a302f07022e2f766969e2536d71b3 10-May-2013 Alex Elder <elder@inktank.com> rbd: support reading parent page data for writes

Currently, rbd_img_obj_parent_read_full() assumes the incoming
object request contains bio data. But if a layered image is part of
a multi-layer stack of images it will result in read requests of
page data to parent images.

This is handling the same kind of issue as was resolved by this
commit:
5b2ab72d rbd: support reading parent page data

This resolves:
http://tracker.ceph.com/issues/5027

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ebda6408f2227a774343c3cc2861384942143ff3 10-May-2013 Alex Elder <elder@inktank.com> rbd: fix parent request size assumption

The code that reads object data from the parent for a copyup on
write request currently assumes that the size of that request is the
size of a "full" object from the original target image.

That is not necessarily the case. The parent overlap could reduce
the request size below that. To fix that assumption we need to
record the number of pages in the copyup_pages array, for both an
image request and an object request. Rename a local variable in
rbd_img_obj_parent_read_full_callback() to reflect we're recording
the length of the parent read request, not the size of the target
object.

This resolves:
http://tracker.ceph.com/issues/5038

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c48f3f86e248b1649ad22151dd45ef2610101ed3 08-May-2013 Alex Elder <elder@inktank.com> rbd: kill rbd_img_request_get()

Get rid of rbd_img_request_get(), because it isn't used, and maybe
won't ever be needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1f3ef78861ac4b510175e177899b9b5ba4bbed91 07-May-2013 Alex Elder <elder@inktank.com> rbd: only set up watch for mapped images

Any changes to parent images are immaterial to any mapped clone.
So there is no need to have a watch event registered on header
objects except for the header object of an image that is mapped.
In fact, a watch request is a write operation, and we may only
have read access to a parent image.

We can't set up the watch request until we know the name of the
header object though. So pass a flag to rbd_dev_image_probe() to
indicate whether this probe is for a mapping or for a parent image.

Change the second parameter to rbd_dev_header_watch_sync() be
Boolean while we're at it.

This resolves:
http://tracker.ceph.com/issues/4941

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7ce4eef7b5fad73b365b7e4b8892af3af72d4bd3 07-May-2013 Alex Elder <elder@inktank.com> rbd: set mapping read-only flag in rbd_add()

The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.

So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.

This resolves:
http://tracker.ceph.com/issues/4940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5b2ab72d367d2682c1a237448fbc1595881a88fa 07-May-2013 Alex Elder <elder@inktank.com> rbd: support reading parent page data

Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data. But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.

Fortunately, it's not hard to add support for page data.

This resolves:
http://tracker.ceph.com/issues/4939

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
91c6febb3817be576785ef06aeaaa8ed423e0a2a 07-May-2013 Alex Elder <elder@inktank.com> rbd: fix an incorrect assertion condition

In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object. But
assertion was looking at the parent image order rather than the
original one, and these values can differ.

Fix that.

This resolves:
http://tracker.ceph.com/issues/4938

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2df3fac75851dc4257b90dc72fdd3cf27ba177bc 06-May-2013 Alex Elder <elder@inktank.com> rbd: define rbd_dev_v2_header_info()

This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info(). While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively. So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().

Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.

Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
99a41ebcee1a1ea0463b1b29d2e888de21a60c66 06-May-2013 Alex Elder <elder@inktank.com> rbd: get rid of trivial v1 header wrappers

Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.

Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
30d60ba2f258da24b91edb880338c7178f901de9 06-May-2013 Alex Elder <elder@inktank.com> rbd: simplify rbd_dev_v1_probe()

An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe(). Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.

Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.

This highlights that are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.

Now rbd_dev_v1_probe() *is* a trivial wrapper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
662518b128c27def65e9af4bea2b56a1e04b3251 06-May-2013 Alex Elder <elder@inktank.com> rbd: update in-core header directly

Now that rbd_header_from_disk() only fills in one-time fields once,
we can extend it slightly so it releases the other fields before
replacing their values. This way there's no need to pass a
temporary buffer and then copy all the results in. Just use the rbd
device header structure in rbd_header_from_disk() so its values get
updated directly.

Note that this means we need to take the header semaphore at the
point we update things. So pass the rbd_dev rather than the address
of its header as its first argument to rbd_header_from_disk(), and
have it return an error code.

As a result, rbd_dev_v1_header_read() does all the work,
rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh()
becomes a very simple wrapper.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bb23e37acb2ae9604130c4819fb8ae0f784a3a2b 06-May-2013 Alex Elder <elder@inktank.com> rbd: refactor rbd_header_from_disk()

This rearranges rbd_header_from_disk so that it:
- allocates the snapshot context right away
- keeps results in local variables, not changing the passed-in
header until it's known we'll succeed
- does initialization of set-once fields in a header only if
they have not already been set

The last point is moot at the moment, because rbd_read_header()
(the only caller) always supplies a zero-filled header buffer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
46578dcdca951f3da70d3a5a9b5166b2a492a182 06-May-2013 Alex Elder <elder@inktank.com> rbd: zero format 1 header structure earlier

The passed-in header structure is zeroed in rbd_header_from_disk().
Instead, have the caller do it. Note that there are two callers,
rbd_dev_v1_refresh() and rbd_dev_v1_probe(). The latter already has
a zeroed header structure so zeroing it isn't necessary there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f35a4dee14c31dc00807f3bcd59cc7a6959f63d7 06-May-2013 Alex Elder <elder@inktank.com> rbd: set the mapping size and features later

Defer setting the size and features fields of a mapped image until
after the Linux disk structure is set up. Set the capacity of the
disk after that.

Rearrange the definition of rbd_image_header, separating the fields
that are set only once from those that can be updated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
51344a38ba2033be18a4ec23e318845caeccdc04 06-May-2013 Alex Elder <elder@inktank.com> rbd: always set read-only flag in rbd_add()

Hold off setting the read-only flag in rbd_add() for an image being
mapped until we have successfully probed the image. At that point
we know whether it's a snapshot mapping or not, so we can set the
read-only flag in that one place rather than doing so (for
snapshots) in rbd_dev_mapping_set(). To do this, pass a flag to the
image probe routine indicating whether we want a read-only mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6d80b130d516deef51666e210fde674c947b8b5c 06-May-2013 Alex Elder <elder@inktank.com> rbd: kill rbd_dev_clear_mapping()

This function is a duplicate of rbd_dev_mapping_clear(), and was
added by mistake.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8f4b7d9821715767ac28bbc2d401bbb5f3f9a448 06-May-2013 Alex Elder <elder@inktank.com> rbd: don't look up snapshot id in rbd_dev_mapping_set()

Currently rbd_dev_mapping_set() looks up the snapshot id for the
snapshot whose name is found in the rbd device's spec structure.

That function gets called by rbd_dev_device_setup(), which is
called by rbd_add() *after* rbd_dev_image_probe(). If the
image probe succeeds, the rbd device's spec will already have
been updated to include names and ids for all fields.

Therefore there's no need to look up the snapshot id in
rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c734b79655a91a24afcae73738a43a0db09a801a 06-May-2013 Alex Elder <elder@inktank.com> rbd: don't print warning if not mapping a parent

The presence of the LAYERING bit in an rbd image's feature mask does
not guarantee the image actually has a parent image. Currently that
bit is set only when a clone (i.e., image with a parent) is created,
but it is (currently) not cleared if that clone gets flattened back
into a "normal" image. A "parent_id" query will leave the
parent_spec for the image being mapped a null pointer, but will not
return an error.

Currently, whenever an image with the LAYERED feature gets mapped, a
warning about the use of layered images gets printed. But we don't
want to do this for a flattened image, so print the warning only
if we find there is a parent spec after the probe.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
29334ba49c3e3defd9a2697cd4a199c597c30dc9 06-May-2013 Alex Elder <elder@inktank.com> rbd: kill rbd_update_mapping_size()

Since rbd_update_mapping_size() is now a trivial wrapper, just open
code it in its two callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
00a653e216a8427547774ab3f2cc92709c3e28c9 06-May-2013 Alex Elder <elder@inktank.com> rbd: update capacity in rbd_dev_refresh()

When a mapped image changes size, we change the capacity recorded
for the Linux disk associated with it, in rbd_update_mapping_size().
That function is called in two places--the format 1 and format 2
refresh routines.

There is no need to set the capacity while holding the header
semaphore. Instead, do it in the common rbd_dev_refresh(), using
the logic that's already there to initiate disk revalidation.

Add handling in the request function, just in case a request
that exceeds the capacity of the device comes in (perhaps one
that was started before a refresh shrunk the device).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e627db085e0dab7744b68f3c927be6ed6df2f7f9 06-May-2013 Alex Elder <elder@inktank.com> rbd: revalidate only for mapping size changes

This commit:
d98df63e rbd: revalidate_disk upon rbd resize
instituted a call to revalidate_disk() to notify interested parties
that a mapped image has changed size. This works well, as long as
the the rbd device doesn't map a snapshot.

A snapshot will never change size. However, the base image the
snapshot is associated with can, and it can do so while the snapshot
is mapped.

The problem is that the test for the size is looking at the size of
the base image, not the size of the mapped snapshot. This patch
corrects that.

Update the warning message shown in the event of error, and move
it into the callers.

This resolves:
http://tracker.ceph.com/issues/4911

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
49ece554288caf1a8ea9e546ab1ff5bc4b175456 06-May-2013 Alex Elder <elder@inktank.com> rbd: fix leak of format 2 snapshot context

When rbd_dev_v2_refresh() is called, the rbd device already has a
snapshot context associated with it. But that never gets freed,
the pointer just gets overwritten.

Fix this by dropping the rbd device's reference to the snapshot
context before overwriting the pointer.

Because ceph_put_snap_context() already handles for a null pointer
we don't need to check for that (for the probe case, where no
context has yet been assigned).

This resolves:
http://tracker.ceph.com/issues/4912

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
db2a144bedd58b3dcf19950c2f476c58c9f39d18 06-May-2013 Al Viro <viro@zeniv.linux.org.uk> block_device_operations->release() should return void

The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
b5b09be30cf99f9c699e825629f02e3bce555d44 02-May-2013 Alex Elder <elder@inktank.com> rbd: fix image request leak on parent read

When a read for a layered image object finds the target object
doesn't exist, a read image request for the parent image is created
and submitted. When that completes, the callback routine was
not releasing that parent image request. Fix that.

The slab allocation stuff just added has greatly simplified the
search for the source of this memory leak.

This resolves:
http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
78c2a44aae2950ecf0279590572b861288714946 01-May-2013 Alex Elder <elder@inktank.com> rbd: allocate image object names with a slab allocator

The names of objects used for image object requests are always fixed
size. So create a slab cache to manage them. Define a new function
rbd_segment_name_free() to match rbd_segment_name() (which is what
supplies the dynamically-allocated name buffer).

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
868311b1ebc9b203bae0d6d1f012ea5cbdadca03 01-May-2013 Alex Elder <elder@inktank.com> rbd: allocate object requests with a slab allocator

Create a slab cache to manage rbd_obj_request allocation. We aren't
using a constructor, and we'll zero-fill object request structures
when they're allocated.

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f907ad55967fec6bc6ec5ee84021070c49cf0bb1 01-May-2013 Alex Elder <elder@inktank.com> rbd: allocate name separate from obj_request

The next patch will define a slab allocator for a object requests.
To use that we'll need to allocate the name of an object separate
from the request structure itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1c2a9dfe2107e81b9f0ee90845c687cf7ff84106 01-May-2013 Alex Elder <elder@inktank.com> rbd: allocate image requests with a slab allocator

Create a slab cache to manage rbd_img_request allocation. Nothing
too fancy at this point--we'll still initialize everything at
allocation time (no constructor)

This is part of:
http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
30d1cff817808fca9801c743d2de4c61f3f38e15 01-May-2013 Alex Elder <elder@inktank.com> rbd: use binary search for snapshot lookup

Use bsearch(3) to make snapshot lookup by id more efficient. (There
could be thousands of snapshots, and conceivably many more.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
15228ede7d9437b0dcfe9331c9830b3646fdadf7 01-May-2013 Alex Elder <elder@inktank.com> rbd: clear EXISTS flag if mapped snapshot disappears

This functionality inadvertently disappeared in the last patch.

Image snapshots can get removed at just about any time. In
particular it can disappear even if it is in use by an rbd
client as a mapped image.

The rbd client deals with such a disappearance by responding to new
requests with ENXIO. This is implemented by each rbd device
maintaining an EXISTS flag, which is normally set but cleared if a
snapshot disappears.

This patch (re-)implements the clearing of that flag.

Whenever mapped image header information is refreshed, if the
mapping is for a snapshot, verify the mapped snapshot is still
present in the updated snapshot context. If it is not, clear the
flag.

It is not necessary to check this in the initial probe, because the
probe will not succeed if the snapshot doesn't exist.

This resolves:
http://tracker.ceph.com/issues/4880

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
33dca39f5c0c750d37d3d89ce8ae66be08280a45 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: kill off the snapshot list

We no longer use the snapshot list for anything. When we need to
look up a snapshot name, id, size, or feature mask, we just do it
directly rather than relying on this list being updated with every
refresh. The main reason it existed was for the benefit of the
device/sysfs entries that previously were associated with snapshots.

So get rid of the snapshot list, and struct rbd_snap, and the
hundreds of lines of code that supported them.

This resolves:
http://tracker.ceph.com/issues/4868

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2ad3d7167e599fb149ed370a3128140b9deabd5a 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: define rbd_snap_size() and rbd_snap_features()

This patch defines a handful of new functions that will allow
us to get rid of the rbd device structure's list of snapshots.

Define rbd_snap_id_by_name() to look up a snapshot id given its
name. This is efficient for format 1 images but not for format 2.
Fortunately it only gets called at mapping time so it's not that
critical.

Use rbd_snap_id_by_name() to find out the id for a snapshot getting
mapped, and pass that id to new functions rbd_snap_size() and
rbd_snap_features() to look up information about a given snapshot's
size and feature mask given its snapshot id. All this gets done
in rbd_dev_mapping_set().

As a result, snap_by_name() is no longer needed, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
54cac61fb6b3bacecf5367d3838307b1dd69ace2 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: use snap_id not index to look up snap info

In order to align with what was needed for format 1 rbd images,
rbd_dev_v2_snap_info() was set up to take as argument an index into
the array of snapshot ids in a rbd device's snapshot context.

This switches that around, so we pass the snapshot id instead.
In doing this, rbd_snap_name() now returns a dynamically-allocated
string rather than a fixed one, so there's no need to make a
duplicate in its caller, rbd_dev_spec_update().

This means the following functions take a snapshot id where they
previously used an index value:
rbd_dev_snap_info()
rbd_dev_v1_snap_info()
rbd_dev_v2_snap_info()

A new function, rbd_dev_snap_index(), determines the snap index for
format 1 images and uses it to look up the name.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9682fc6d3a8b63f58fbfc5084f32c038170cfd6b 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: look up snapshot name in names buffer

Rather than scanning the list of snapshot structures for it, scan
the snapshot context buffer containing snapshot names in order to
determine for a format 1 image the name associated with a given
snapshot id.

Pull out the part of rbd_dev_v1_snap_info() that does this scan into
a new function, _rbd_dev_v1_snap_name(). Have that function return
a dynamically-allocated copy of the name, and don't duplicate it in
rbd_dev_v1_snap_info().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
dedc81ea8468fd29bdd13eb5a362cab96b53d802 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: drop obj_request->version

Nothing ever uses the version field maintained in the object request
structure any more, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e2a58ee55b0f132c2a6cbf2504a1c651b261fb67 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: drop rbd_obj_method_sync() version parameter

Only NULL is passed as the version argument to rbd_obj_method_sync(),
so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cc4a38bdd587a1843540989f262feb7bdc43c468 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: more version parameter removal

Continued from the last patch, more parameters that can go away
because we no longer have a need to track object versions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7097f8df6e679207c949673d2959505b59a1a30e 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: get rid of some version parameters

Several functions in rbd have parameters meant to allow the version
of an object to be passed in or out. The purpose of those was to
allow the version of a header object to be maintained, but we no
longer do that. As a result, these parameters are never actually
needed or used, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b21ebdddeb2aa86677dc7d0e3cf6918cac08f92c 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: stop tracking header object version

The rbd code takes care to maintain the version of the header
object. This was done in hopes of using it to detect a change in
the object between reading it and setting up a watch request to
be notified of changes.

The mechanism was never fully implemented, however. And we now
avoid the original problem by setting up the watch request before
ever reading the content of the header.

The osd doesn't interpret the object version supplied with a WATCH
osd op, nor does it use the version supplied with a NOTIFY_ACK op
(we can just supply 0 for both). There is therefore no need to
maintain the header's object version any more, so stop doing so.

We'll be able to simplify some more rbd code in the next few patches
as a result of this.

This resolves:
http://tracker.ceph.com/issues/3952

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cb75223d2b19161e8d916049673cd297cce43cdd 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: snap names are pointer to constant data

Make explicit that snapshot names don't change by making functions
return and take parameters that that point to const qualified data.

This resolves:
http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a3fbe5d447bf1f63efa7f4d8c222002ef136cf4b 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't revalidate so much

Whenever a header object event causes a mapped rbd image to refresh
its header information, revalidate_disk() is being called. This was
done in rbd_dev_refresh() outside the control mutex in order to
avoid a lock inversion. Although a an event like this *might*
indicate the image has changed size, most of the time it does not.

Record the image size before and after the refresh, and only
call revalidate_disk() if it changes.

This resolves:
http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
96882f55c40dcb4cd80b81a4374fdd297109ec98 30-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix up the layering warning message

A warning gets spewed for any image being probed, including parent
images. Set up a condition such that the warning message only gets
printed for the image being mapped, not any of its parents.

Also, I didn't like the way the warning ended up being so long.
Make it a terse warning instead. People experimenting with layering
will know what the message means.

This is part of:
http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
812164f8c3f6f5348aa69003a2f81775c2872ac0 30-Apr-2013 Alex Elder <elder@inktank.com> ceph: use ceph_create_snap_context()

Now that we have a library routine to create snap contexts, use it.

This is part of:
http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b536f69a3a589113992c32982bf2981c8225c9da 29-Apr-2013 Alex Elder <elder@inktank.com> rbd: set up devices only for mapped images

Stop setting up Linux devices during the image probe operation.
Instead, set up the devices as a separate step after the image
probe, in rbd_add().

A consequence of this is that only mapped images get devices
assigned to them, which is pretty sweet.

This resolves:
http://tracker.ceph.com/issues/4774

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8ad42cd0c002fa278f6d0135e22fcb188e400a28 29-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't have device release destroy rbd_dev

Currently an rbd_device structure gets destroyed from the release
routine for the device embedded within it. Stop doing that, instead
calling rbd_dev_image_release() right after rbd_bus_del_dev()
wherever the latter is called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6fd48b3be9f6d195a970b92040d097b5b886a99b 29-Apr-2013 Alex Elder <elder@inktank.com> rbd: define rbd_dev_unprobe()

Define a new function rbd_dev_unprobe() which undoes state changes
that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe().
Note that this is a superset of rbd_header_free(), which is now
getting removed (it seems to have been used improperly anyway).

Flesh out rbd_dev_image_release() so it undoes exactly what
rbd_dev_image_probe() does.

This means that:
- rbd_dev_device_release() gets called when the last device
reference gets dropped;
- that undoes everything done by the rbd_dev_device_setup() call
at the end of rbd_dev_image_probe() (and nothing more), ending
by calling rbd_dev_image_release(); and
- rbd_dev_image_release() undoes everything else done by
rbd_dev_image_probe() (and this includes a call to
rbd_dev_unprobe().

This means the image and device portions of an rbd device are fairly
cleanly separated now, so error paths should be a little easier to
verify than they used to be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
200a6a8be5dba96df121f3d2363964dd77ee7e1b 29-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't destroy rbd_dev in device release function

Rename rbd_dev_probe_finish() to be rbd_dev_device_setup(). Its
purpose is to set up the Linux side of an rbd device mapping.
Rename rbd_dev_release() to be rbd_dev_device_release(), making
it more obvious it serves as the inverse of the setup function
(or it will).

Encapsulate some of what was done in rbd_dev_release() into a new
function rbd_dev_image_release(), which serves as the inverse of
setting up the ceph side of the mapped rbd image.

Define a new helper rbd_dev_clear_mapping() to simply zero out the
fields of a mapping structure--the inverse of rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
79ab7558aac7622109e9d9b089cac2c5f06aca20 29-Apr-2013 Alex Elder <elder@inktank.com> rbd: drop module later

Drop the module reference at the end of rbd_remove() for symmetry
with adding a reference at the top of rbd_add().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b644de2ba0c5b590db9195c03358ccd0f061daa6 27-Apr-2013 Alex Elder <elder@inktank.com> rbd: set up watch in rbd_dev_image_probe()

Move setting up the watch request for an image so it's done in
rbd_dev_image_probe() rather than rbd_dev_probe_finish(). Move
it all the way up to before doing the initial probe. This avoids
a potential race condition, in which we get (and use) the initial
snapshot context for an image, and it gets changed between that
time and the time we get the watch set up.

This resolves:
http://tracker.ceph.com/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
96f03e08f9f27cf72d2c24b4e75ade81d2df3c75 27-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't bother checking whether order changes

When a format 2 image is refreshed, code is in place to verify that
the object order never changes from what it was originally. This
relies on the fact that the refresh will occur *after* an initial
load of information about the image.

An upcoming patch makes it possible for the refresh to occur first,
so we can no longer make this order check. The order really can't
ever change anyway--this was just a sanity check. So get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0d8189e175380c029a309f05f44e82bacf1c0404 27-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't clean up watch in device release function

Currently, a watch on an rbd device header object gets torn down
when its final Linux device reference gets dropped. Instead, tear
it down when removing the device. If an error occurs cleaning up
the watch event when unmapping, abort the unmap request.

All images (including parents) still get watch requests set up, so
tear these down also, in rbd_dev_remove_parent(). For now, ignore
any errors that occur in this case.

Get rid of local variable "rc" in rbd_remove(); use "ret" instead
(they both somehow ended up defined in the function and only one is
needed).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
332bb12db9459d52dfcdb278e7607351d2eff6ab 27-Apr-2013 Alex Elder <elder@inktank.com> rbd: define rbd_header_name()

Define a new function rbd_header_name(), which allocates and formats
the name of the header object for the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9bb81c9be90c1ad265547f0a40f543548d263fb4 27-Apr-2013 Alex Elder <elder@inktank.com> rbd: move more initialization into rbd_dev_image_probe()

Move a block of initialization related to the "ceph-side" of an rbd
image out of rbd_dev_probe_finish() and into rbd_dev_image_probe().

Add appropriate error handling to clean things up in the event any
of these new functions return an error.

We know that rbd_dev_snaps_update(), rbd_dev_spec_update(), and
rbd_dev_probe_parent() all clean up after themselves before they
return an error, so no special cleanup is required except when an
earlier call succeeds. Since rbd_dev_spec_update() only updates the
spec field (whose cleanup will be handled by dropping the last
reference to the spec) there is no cleanup action associatied with
that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5de10f3b0c99983e3f9ec19baa1eb691685d9b8f 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: probe for the parent earlier

Probe for a parent device earlier in rbd_dev_probe_finish(), before
starting to set up the Linux side of the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2e93bf9e465b7d0ccf703fb791c663435d9522cf 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: remove parent devices on probe error

When an error occurs while finishing probing a device it is assumed
that parent devices get cleaned up when deleting a device. They
don't. Add a call to clean them up. Note that this means the
parent spec will already be cleaned up so it doesn't have to be
in one of the rbd_add() error paths.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ad945fc1da42965a31089d29de3754047861f348 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix rbd_dev_remove_parent()

In certain error paths, it is possible for an rbd device to have a
parent spec but no parent rbd_dev. In rbd_dev_remove_parent() use
the parent field rather than parent_spec in determining whether to
try to remove any parent devices. Use assertions to indicate that
any non-null parent pointer has parent_spec associated with it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b480815a17bc6bfe85d4931c53e5a8fded7f889e 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: kill __rbd_remove()

The function __rbd_remove() is used in two spots, and it's fairly
simple. It combines cleanup of part of the ceph-side state as well
as cleaning up the Linux-side state. Just open code it in the two
callers and eliminate the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d1cf5788450e1781f63a0626a854fe8309b32cb1 27-Apr-2013 Alex Elder <elder@inktank.com> rbd: set mapping info earlier

Set the mapping size and features earlier in rbd_dev_probe_finish().

Define rbd_dev_mapping_clear() as an inverse for setting those
fields, and use it both in error handling in rbd_dev_image_probe()
and in the final cleanup in rbd_dev_release(). Change the name
of rbd_dev_set_mapping() to of rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
05a46afdc7f0f73d42dcecd8ee80f9558b4c38f7 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: encapsulate removing parent devices

Encapsulate the code that removes an rbd device's parent images into
a new function, rbd_dev_remove_parent().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
124afba25d58e2b52d7d4bad993065572a28d57f 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: encapsulate probing for parent devices

Encapsulate the code that probes for an rbd device's parent images
into a new function, rbd_dev_probe_parent().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b5156e76da01c23e14e962594553f1735b1db298 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: defer setting disk capacity

Don't set the disk capacity until right before we announce the
device as available for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
129b79d4498581e52175ac5c3ef2168f616b0e5e 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: only set device exists flag when ready

Hold off setting the EXISTS rbd device flag until just before we
announce the disk as available for use. There's no point in doing
so any earlier than that, and at that point the device truly is
fully set up and ready to use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
fc71d8330e39ef3af816a9c869150250952cb712 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix up some sysfs stuff

This just tweaks a few things in the routines that implement
rbd sysfs files.

All of the entries for an rbd device in /sys/bus/rbd/devices/<id>/
will represent information whose valid values are known by the time
they are accessible.

Right now we get the size of the mapped image by a call to
get_capacity(). There's no need to do this, because that will
return what we last set the capacity to, which is just the size
recorded for the mapping. So just show that value instead.

We also get this under protection of the header semaphore, in order
to provide a precisely correct value. This isn't really necessary;
these files are really informational only and it's not necessary to
be so careful.

Finally, print a special value in case the major device number is
not recorded. Right now that won't matter much but soon the parent
images won't have devices associated with them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e28626a08b3e7412158551a639dd36887e2d728d 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix a bug in resizing a mapping

When a snapshot context update occurs, rbd_update_mapping_size() is
called to set the capacity of the disk to record the updated
size of the image in case it has changed.

There's a bug though. The mapping size is in units of *bytes*. The
code that updates the mapping size field is assigning a value that
has been scaled down to *sectors*.

Fix that. Also, check to see if the size has actually changed, and
don't bother updating things (specifically, calling set_capacity())
if it has not.

This resolves:
http://tracker.ceph.com/issues/4833

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2e9f7f1c0de23156e225046f10fad939a4017e97 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: refactor rbd_dev_probe_update_spec()

Fairly straightforward refactoring of rbd_dev_probe_update_spec().
The name is changed to rbd_dev_spec_update().

Rearrange it so nothing gets assigned to the spec until all of the
names have been successfully acquired.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
71f293e26e760c4151e00b8f611e67da222f89c7 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: rename rbd_dev_probe()

Rename rbd_dev_probe() to be rbd_dev_image_probe(). Its purpose
will eventually be to probe for the existence of a valid rbd image
for the rbd device--focusing only on the ceph side and not the Linux
device side of initialization.

For now the two "sides" are not fully separated, and this function
is still the entry point for initializing the full rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9f5dffdc8f5dbc16493566b6aac59f275d5cb3f9 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: make rbd_dev_destroy() match rbd_dev_create()

Currently, rbd_dev_destroy() does more than just the inverse of what
rbd_dev_create() does. Stop doing that, and move the two extra
things it does into the three call sites.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
468521c1b1450d8e9bda22df9455deaa4feed00f 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: define rbd snap context routines

Encapsulate the creation of a snapshot context for rbd in a new
function rbd_snap_context_create(). Define rbd wrappers for getting
and dropping references to them once they're created.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c0cd10db4685a76397f32bed246e861705642576 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: use rbd_warn(), not WARN_ON()

Change some calls to WARN_ON() so they use rbd_warn() instead, so we
get consistent messaging. A few remain but they can probably just
go away eventually.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
500d0c0fbb85b59e5e75fc83ff701b7d8aa285f9 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: move stripe_unit and stripe_count into header

This commit added fetching if fancy striping parameters:
09186ddb rbd: get and check striping parameters

They are almost unused, but the two fields storing the information
really belonged in the rbd_image_header structure.

This patch moves them there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ecb4dc225612e1c0b28d2c1b168422dde4f442a6 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: make rbd spec names pointer to const

Make the names and image id in an rbd_spec be pointers to constant
data. This required the use of a local variable to hold the
snapshot name in rbd_add_parse_args() to avoid a warning.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e1d4213f090644b06aab6ea70e307ecf16182148 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: set snapshot id in rbd_dev_probe_update_spec()

Set the rbd spec's snapshot id for an image getting mapped in
rbd_dev_probe_update_spec() rather than rbd_dev_set_mapping().
This is the more logical place for that to happen (even though
it means we might look up the snapshot by name twice).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8b0241f85ab11c87075f9de0191acd8b546c6f6a 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: have snap_by_name() return a snapshot

A function called snap_by_name() ought to just look up a snapshot by
name. It does that, but then it assigns some stuff to the rbd
device structure as well.

Change the function to do just the lookup, and have the caller do
the assignments that follow.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5655c4d940ba8dd32250ab1e4ba3db785943a28e 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix image id leak in initial probe

If a format 2 image id is found for an image being mapped, but the
subsequent probe of the image fails, rbd_dev_probe() quits without
freeing the image id. Fix that.

Also drop a redundant hunk of code in rbd_dev_image_id().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c0fba36880288afbeca872298c970fb4abb76464 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: have rbd_dev_image_id() set format 1 image id

Currently, rbd_dev_probe() assumes that any error returned by
rbd_dev_image_id() is most likely -ENOENT, and responds by
calling the format 1 probe routine, rbd_dev_v1_probe(). Then,
at the top of rbd_dev_v1_probe(), an empty string is allocated
for the image id.

This is sort of unbalanced. Fix this by having rbd_dev_image_id()
look for -ENOENT from its "get_id" method call. If that is seen,
have it allocate the empty string there rather than depending on
rbd_dev_v1_probe() to do it.

Given that this is effectively defining the format of the image,
set rbd_dev->image_format inside rbd_dev_image_id() rather than in
the format-specific probe routines.

Also drop a redundant hunk of code in rbd_dev_image_id().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a0cab924324fac8d6414009bc25ce31eeece038e 26-Apr-2013 Alex Elder <elder@inktank.com> rbd: avoid dropping extra reference in rbd_free_disk()

I found during some failure injection testing that the call to
rbd_free_disk() in the error path of rbd_dev_probe_finish() was
dropping an extra reference to the disk queue. The problem
occurred when put_disk tried to drop a reference to the disk's
queue. A call to blk_cleanup_queue() just prior to that will have
also dropped a reference to the queue.

The problem is that the reference dropped by put_disk() is assumed
to have been taken by add_disk(). Our code has error paths that can
occur after the disk and its queue are initialized, but before the
call to add_disk(), and in those paths we won't have that extra
reference.

The fix is easy though. In rbd_free_disk() we're already checking
the disk's GENHD_FL_UP flag. That flag is an indication that
add_disk() has been called, so just call blk_cleanup_queue()
conditional on that flag being set.

This resolves:
http://tracker.ceph.com/issues/4800

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f40eb349e032bee2b6f06e9b6f1dbfae561bd30a 25-Apr-2013 Alex Elder <elder@inktank.com> rbd: use rbd_obj_method_sync() return value

Now that rbd_obj_method_sync() returns the number of bytes
returned by the method call, that value should be used by
callers to ensure we don't overrun the valid portion of the
buffer.

Fix the two spots that remained that weren't doing that,
rbd_dev_image_name() and rbd_dev_v2_snap_name().

Rearrange the error path slightly in rbd_dev_v2_snap_name().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6e584f5244060edc77141700d814a2af7d697685 25-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix leak of format 2 snapshot names

When the snapshot context for an rbd device gets updated (or the
initial one is recorded) a a list of snapshot structures is created
to represent them, one entry per snapshot. Each entry includes a
dynamically-allocated copy of the snapshot name.

Currently the name is allocated in rbd_snap_create(), as a duplicate
of the passed-in name.

For format 1 images, the snapshot name provided is just a pointer to
an existing name. But for format 2 images, the passed-in name is
already dynamically allocated, and in the the process of duplicating
it here we are leaking the passed-in name.

Fix this by dynamically allocating the name for format 1 snapshots
also, and then stop allocating a duplicate in rbd_snap_create().

Change rbd_dev_v1_snap_info() so none of its parameters is
side-effected unless it's going to return success.

This is part of:
http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6087b51b9e7b311353408945bcc48368a54b8bbc 25-Apr-2013 Alex Elder <elder@inktank.com> rbd: rename __rbd_add_snap_dev()

Rename __rbd_add_snap_dev() to be rbd_snap_create(). We no longer
have devices for non-mapped snapshots, and we're not actually
"adding" it to the list in this function, just creating it.

Rename rbd_remove_snap_dev() to be rbd_snap_destroy() for reasons
similar to the above. Stop having this function delete the snapshot
from its list (to be symmetrical with its create counterpart) and do
that in the caller instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
acb1b6caf179d405ebd1dddefe916ccbb9b90298 25-Apr-2013 Alex Elder <elder@inktank.com> rbd: only update values on snap_info success

Change rbd_dev_v2_snap_info() so it only ever sets values of the
size and features parameters if looking up the snapshot name was
successful.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c86f86e9e75e77e4d51ded9edbad30834ff606f7 25-Apr-2013 Alex Elder <elder@inktank.com> rbd: make snap_size order parameter optional

Only one of the two callers of _rbd_dev_v2_snap_size() needs the
order value returned. So make that an optional argument--a null
pointer if the caller doesn't need it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
522a0cc0f0ecdb1857db7795b1c17591f28f9ca0 25-Apr-2013 Alex Elder <elder@inktank.com> rbd: fix leak of snapshots during initial probe

When an rbd image is initially mapped, its snapshot context is
collected, and then a list of snapshot entries representing the
snapshots in that context is created. The list is created using
rbd_dev_snaps_update(). (This function also supports updating an
existing snapshot list based on a new snapshot context.)

If an error occurs, updating the list is aborted, and the list is
currently left as-is, in an inconsistent state. At that point,
there may be a partially-constructed list, but the calling functions
(rbd_dev_probe_finish() from rbd_dev_probe() from rbd_add()) never
clean them up. So this constitutes a leak.

A snapshot list that is inconsistent with the current snapshot
context is of no use, and might even be actively bad. So rather
than just having the caller clean it up, have rbd_dev_snaps_update()
just clear out the entire snapshot list in the event an error
occurs.

The other place rbd_dev_snaps_update() is used is when a refresh is
triggered, either because of a watch callback or via a write to the
/sys/bus/rbd/devices/<id>/refresh interface. An error while
updating the snapshots has no substantive effect in either of those
cases, but one of them issues a warning. Move that warning to the
common rbd_dev_refresh() function so it gets issued regardless of
how it got initiated.

This is part of:
http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3e83b65bb9a9f3a4d7f0200139bd947c940ec3ab 23-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't create sysfs entries for non-mapped snapshots

When an rbd image gets mapped a device entry gets created for it
under /sys/bus/rbd/devices/<id>/. Inside that directory there are
sysfs files that contain information about the image: its size,
feature bits, major device number, and so on.

Additionally, if that image has any snapshots, a device entry gets
created for each of those as a "child" of the mapped device. Each
of these is a subdirectory of the mapped device, and each directory
contains a few files with information about the snapshot (its
snapshot id, size, and feature mask).

There is no clear benefit to having those device entries for the
snapshots. The information provided via sysfs of of little real
value--and all of it is available via rbd CLI commands. If we
still wanted to see the kernel's view of this information it could
be done much more simply by including it in a single sysfs file for
the mapped image.

But there *is* a clear cost to supporting them. Every time a snapshot
context changes, these entries need to be updated (deleted snapshots
removed, new snapshots created). The rbd driver is notified of
changes to the snapshot context via callbacks from an osd, and care
must be taken to coordinate removal of snapshot data structures
with the possibility of one these notifications occurring.

Things would be considerably simpler if we just didn't have to
maintain device entries for the snapshots.

So get rid of them.

The ability to map a snapshot of an rbd image will remain; the only
thing lost will be the ability to query these sysfs directories for
information about snapshots of mapped images.

This resolves:
http://tracker.ceph.com/issues/4796

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
770eba6e295fd36e43881176ee0644b9cc2803f1 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: activate support for layered images

Now that we have most everything in place to support layered rbd
images, enable support for them in the kernel client. Issue a
warning to the log that the support is considered experimental
whenever a format 2 layered image is mapped.

Note that we also have to claim to support the STRIPINGV2 feature,
due to a mistake in the way the rbd CLI set up those flags. This
feature can work if it has the right parameters, and safeguards
have been put in place to reject those images that do not have
compatible parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cc070d59bc422945f83a89e9d60f749d0f82787d 21-Apr-2013 Alex Elder <elder@inktank.com> rbd: get and check striping parameters

If an rbd format 2 image indicates it supports the STRIPINGV2
feature we need to find out its stripe unit and stripe count in
order to know whether we can use it. We don't yet support fancy
striping fully, but if the default parameters are used the behavior
is indistinguishible from non-fancy striping.

This is necessary because some images require the STRIPINGV2 feature
even if they use the default parameters. (Which is to say the feature
bit was erroneously set even if the feature was not used.)

This resolves:
http://tracker.ceph.com/issues/4709

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
57385b51c3ffd0fed2dd9d5d8e4ec080c85ecbcd 21-Apr-2013 Alex Elder <elder@inktank.com> rbd: have rbd_obj_method_sync() return transfer count

Callers of rbd_obj_method_sync() don't know how many bytes of data
got returned by the class method call. As a result, they have been
assuming enough got returned to decode whatever was expected.

This isn't safe. We know how many bytes got transferred, so have
rbd_obj_method_sync() return that amount (rather than just 0) if
the call is successful.

Change all callers to use this return value to ensure decoding of
the results is done safely.

On the other hand, most callers of rbd_obj_method_sync() only
indicate success or failure, so all of *their* callers can simply
test for non-zero result.

This resolves:
http://tracker.ceph.com/issues/4773

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4157976b27287e239d5ae879d2916540fe0b576e 21-Apr-2013 Alex Elder <elder@inktank.com> rbd: void data pointers for rbd_obj_method_sync()

Make the inbound and outbound data parameters have void rather than
character type for rbd_obj_method_sync(). This makes it more clear
they don't expect typed data, and eliminates the need for some silly
type casts.

One more unrelated change: define the features buffer used in
_rbd_dev_v2_snap_features() to be a packed data structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
80ef15bf71a8ed40e47238e1f4f8b3f2a41f58fe 21-Apr-2013 Alex Elder <elder@inktank.com> rbd: give rbd_obj_read_sync() buffer void type

Make the buf parameter into which the data is to be read have type
void pointer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a9e8ba2cb3eb64cf6cfa509d096ef79bc1c827ae 21-Apr-2013 Alex Elder <elder@inktank.com> rbd: enforce parent overlap

A clone image has a defined overlap point with its parent image.
That is the byte offset beyond which the parent image has no
defined data to back the clone, and anything thereafter can be
viewed as being zero-filled by the clone image.

This is needed because a clone image can be resized. If it gets
resized larger than the snapshot it is based on, the overlap defines
the original size. If the clone gets resized downward below the
original size the new clone size defines the overlap. If the clone
is subsequently resized to be larger, the overlap won't be increased
because the previous resize invalidated any parent data beyond that
point.

This resolves:
http://tracker.ceph.com/issues/4724

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0eefd470f034cc18349fa1a9e4fda000e963c4e3 19-Apr-2013 Alex Elder <elder@inktank.com> rbd: issue a copyup for layered writes

This implements the main copyup functionality for layered writes.

Here we add a copyup_pages field to the object request, which is
used only for copyup requests to keep track of the page array
containing data read from the parent image.

A copyup request is currently the only request rbd has that requires
two osd operations. Because of this we handle copyup specially.
All image object requests get an osd request allocated when they are
created. For a write request, if a copyup is required, the osd
request originally allocated is released, and a new one (with room
for two osd ops) is allocated to replace it. A new function
rbd_osd_req_create_copyup() allocates an osd request suitable for
a copyup request.

The first op is then filled with a copyup object class method call,
supplying the array of pages containing data read from the parent.
The second op is filled in with the original write request.

The original request otherwise remains intact, and it describes the
original write request (found in the second osd op). The presence
of the copyup op is sort of implicit; a non-null copyup_pages field
could be used to distinguish between a "normal" write request and a
request containing both a copyup call and a write.

This resolves:
http://tracker.ceph.com/issues/3419

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3d7efd18d9df628e30ff36e9e488a8f0e782b678 19-Apr-2013 Alex Elder <elder@inktank.com> rbd: implement full object parent reads

As a step toward implementing layered writes, implement reading the
data for a target object from the parent image for a write request
whose target object is known to not exist. Add a copyup_pages field
to an image request to track the page array used (only) for such a
request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d98df63ea7e87d5df4dce0cece0210e2a777ac00 11-Apr-2013 Laurent Barbe <laurent@ksperis.com> rbd: revalidate_disk upon rbd resize

If rbd disk is open and rbd resize is done, new size is not
visible by filesystem. Like is done in virtio-blk and dm driver,
revalidate_disk() permits to update the bd_inode size.

Signed-off-by: Laurent Barbe <laurent@ksperis.com>
Reviewed-by: Alex Elder <elder@inktank.com>
f1a4739f333b519fe041e1ad81d9b31c94b9d6a3 19-Apr-2013 Alex Elder <elder@inktank.com> rbd: support page array image requests

This patch adds the ability to build an image request whose data
will be written from or read into memory described by a page array.
(Previously only bio lists were supported.)

Originally this was going to define a new function for this purpose
but it was largely identical to the rbd_img_request_fill_bio(). So
instead, rbd_img_request_fill_bio() has been generalized to handle
both types of image request.

For the moment we still only fill image requests with bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b9434c5b43d1a90e762fe64169862fb198746935 19-Apr-2013 Alex Elder <elder@inktank.com> rbd: define zero_pages()

Define a new function zero_pages() that zeroes a range of memory
defined by a page array, along the lines of zero_bio_chain(). It
saves and the irq flags like bvec_kmap_irq() does, though I'm not
sure at this point that it's necessary.

Update rbd_img_obj_request_read_callback() to use the new function
if the object request contains page rather than bio data.

For the moment, only bio data is used for osd READ ops.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b454e36d2638c005c6574c2289529f5738f156cb 19-Apr-2013 Alex Elder <elder@inktank.com> rbd: encapsulate submission of image object requests

Object requests that are part of an image request are subject to
some additional handling. Define rbd_img_obj_request_submit() to
encapsulate that, and use it when initially submitting an image
object request, and when re-submitting it during callback of
an object existence check.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9d4df01f08e2f2a777f3476741ff4ef8afb04be6 19-Apr-2013 Alex Elder <elder@inktank.com> rbd: define separate read and write format funcs

Separate rbd_osd_req_format() into two functions, one for read
requests and the other for write requests.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c5b5ef6c51124e61829632251098f8b5efecae8a 11-Feb-2013 Alex Elder <elder@inktank.com> rbd: issue stat request before layered write

This is a step toward fully implementing layered writes.

Add checks before request submission for the object(s) associated
with an image request. For write requests, if we don't know that
the target object exists, issue a STAT request to find out. When
that request completes, mark the known and exists flags for the
original object request accordingly and re-submit the object
request. (Note that this still does the existence check only; the
copyup operation is not yet done.)

A new object request is created to perform the existence check. A
pointer to the original request is added to that object request to
allow the stat request to re-issue the original request after
updating its flags. If there is a failure with the stat request
the error code is stored with the original request, which is then
completed.

This resolves:
http://tracker.ceph.com/issues/3418

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5679c59f608f2fedff313e59b374257f1c945234 11-Feb-2013 Alex Elder <elder@inktank.com> rbd: add target object existence flags

This creates two new flags for object requests to indicate what is
known about the existence of the object to which a request is to be
sent. The KNOWN flag will be true if the the EXISTS flag is
meaningful. That is:

KNOWN EXISTS
----- ------
0 0 don't know whether the object exists
0 1 (not used/invalid)
1 0 object is known to not exist
1 0 object is known to exist

This will be used in determining how to handle write requests for
data objects for layered rbd images.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
57acbaa7fb00b6e1a74d29aaaaf273ed8cb4dabc 11-Feb-2013 Alex Elder <elder@inktank.com> rbd: always check IMG_DATA flag

In a few spots, whether the an object request's img_request pointer
is null is used to determine whether an object request is being done
as part of an image data request.

Stop doing that, and instead always use the object request IMG_DATA
flag for that purpose. Swap the order of the definition of the
IMG_DATA and DONE flag helpers, because obj_request_done_set() now
refers to obj_request_img_data_set() to get its rbd_dev value.

This will become important because the img_request pointer is
about to become part of a union.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b155e86cf619886388d80ec298b0f13694c83595 15-Apr-2013 Alex Elder <elder@inktank.com> rbd: adjust image object request ref counting

An extra reference is taken when an object request is added as one
of the requests making up an image object. A reference is dropped
again when the image's object requests get submitted.

The original reference for the object request will remain throughout
this period, so we don't need to add and then take away an extra
one.

This can be interpreted as the image request inheriting the original
object request's reference.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
406e2c9f9286fc93ae2191a7abf477dea05aadc9 15-Apr-2013 Alex Elder <elder@inktank.com> libceph: kill off osd data write_request parameters

In the incremental move toward supporting distinct data items in an
osd request some of the functions had "write_request" parameters to
indicate, basically, whether the data belonged to in_data or the
out_data. Now that we maintain the data fields in the op structure
there is no need to indicate the direction, so get rid of the
"write_request" parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8b3e1a56982d0eafff0afb0ff9e87c8b944a9bdc 24-Jan-2013 Alex Elder <elder@inktank.com> rbd: implement layered reads

Implement layered read requests for format 2 rbd images.

If an rbd image is a clone of a snapshot, the snapshot will be the
clone's "parent" image. When an object read request on a clone
comes back with ENOENT it indicates that the clone is not yet
populated with that portion of the image's data, and the parent
image should be consulted to satisfy the read.

When this occurs, a new image request is created, directed to the
parent image. The offset and length of the image are the same as
the image-relative offset and length of the object request that
produced ENOENT. Data from the parent image therefore satisfies the
object read request for the original image request.

While this code works, it will not be active until we enable the
layering feature (by adding RBD_FEATURE_LAYERING to the value of
RBD_FEATURES_SUPPORTED).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2f82ee54d95c9430838e4580f3bcc196ad36e4f2 31-Oct-2012 Alex Elder <elder@inktank.com> rbd: probe the parent of an image if present

Call the probe function for the parent device if one is present.
Since we don't formally support the layering feature we won't
be using this functionality just yet.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6365d33a275b392d3b224808490cd6172123969e 11-Feb-2013 Alex Elder <elder@inktank.com> rbd: add an object request flag for image data objects

Add a flag to distinguish between object requests being done on
standalone objects and requests being sent for objects representing
rbd image data (i.e., object requests that are the result of image
request).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
926f9b3f085cec8be0cbf4dcc66c28b5ac49cc14 11-Feb-2013 Alex Elder <elder@inktank.com> rbd: define an rbd object request flags field

We're going to need some more Boolean values for object requests,
so create a flags bit field and use it to record whether the request
is done.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1217857fbf0fe6245aa0ce775480a759a0bbadeb 08-Feb-2013 Alex Elder <elder@inktank.com> rbd: encapsulate image object end request handling

Encapsulate the code that completes processing of an object request
that's part of an image request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d0b2e944555d1f06cf6df8a37b76367d10b05b01 24-Jan-2013 Alex Elder <elder@inktank.com> rbd: define image request layered flag

Define a flag indicating whether an image request is for a layered
image (one with a parent image to which requests will be redirected
if the target object of a request does not exist). The code that
checks this flag will be added shortly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9849e986367ef95bac92609bba0349669ed87b53 24-Jan-2013 Alex Elder <elder@inktank.com> rbd: define image request originator flag

Define a flag indicating whether an image request originated from
the Linux block layer (from blk_fetch_request()) or whether it was
initiated in order to satisfy an object request for a child image
of a layered rbd device. For image requests initiated by objects of
child images we'll save a pointer to the object request rather than
the Linux block request.

For now, only block requests are used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0c425248e0c6b3ebb64489b178b5412ab164b7f8 08-Feb-2013 Alex Elder <elder@inktank.com> rbd: define image request flags

There are several Boolean values we'll be maintaining for image
requests. Switch from the single write_request field to a
general-purpose flags field, and use one if its bits to represent
the direction of I/O for the image request. Define helper functions
for setting and testing that flag.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7da22d296d871174f3e8251a02a8f86a90c7463b 24-Jan-2013 Alex Elder <elder@inktank.com> rbd: record image-relative offset in object requests

For an image object request we will need to know what offset within
the rbd image the request covers. Record that when the object
request gets created.

Update the I/O error warnings so they use this so what's reported
is more informative.

Rename a local variable to fit the convention used everywhere else.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
55f27e09312310d4dea9bb7b80c696f407caf1be 10-Apr-2013 Alex Elder <elder@inktank.com> rbd: record aggregate image transfer count

Compute the total number of bytes transferred for an image
request--the sum across each of the request's object requests.
To avoid contention do it only when all object requests are
complete, in rbd_img_request_complete().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a5a337d4382dfe0f9e9e072e7d3eaad8e05e4b0b 24-Jan-2013 Alex Elder <elder@inktank.com> rbd: record overall image request result

If any image object request produces a non-zero result, preserve
that as the result of the overall image request. If multiple
objects have non-zero results, save only the first one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5cbf6f12c48121199cc214c93dea98cce719343b 11-Apr-2013 Alex Elder <elder@inktank.com> rbd: update feature bits

There is a new rbd feature bit defined for "fancy striping." Add
it to the ones defined in the kernel client.

Change RBD_FEATURES_ALL so it represents the set of all feature
bits (rather than just the ones we support). Define a new symbol
RBD_FEATURES_SUPPORTED to indicate the supported ones.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
04017e29bbcf0673d8a6af616c56e395d05f5971 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: make method call data be a separate data item

Right now the data for a method call is specified via a pointer and
length, and it's copied--along with the class and method name--into
a pagelist data item to be sent to the osd. Instead, encode the
data in a data item separate from the class and method names.

This will allow large amounts of data to be supplied to methods
without copying. Only rbd uses the class functionality right now,
and when it really needs this it will probably need to use a page
array rather than a page list. But this simple implementation
demonstrates the functionality on the osd client, and that's enough
for now.

This resolves:
http://tracker.ceph.com/issues/4104

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a4ce40a9a7c1053ac2a41cf64255e44e356e5522 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: combine initializing and setting osd data

This ends up being a rather large patch but what it's doing is
somewhat straightforward.

Basically, this is replacing two calls with one. The first of the
two calls is initializing a struct ceph_osd_data with data (either a
page array, a page list, or a bio list); the second is setting an
osd request op so it associates that data with one of the op's
parameters. In place of those two will be a single function that
initializes the op directly.

That means we sort of fan out a set of the needed functions:
- extent ops with pages data
- extent ops with pagelist data
- extent ops with bio list data
and
- class ops with page data for receiving a response

We also have define another one, but it's only used internally:
- class ops with pagelist data for request parameters

Note that we *still* haven't gotten rid of the osd request's
r_data_in and r_data_out fields. All the osd ops refer to them for
their data. For now, these data fields are pointers assigned to the
appropriate r_data_* field when these new functions are called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2169238dd3a01bc06670fb9c85635cbe97338ff8 05-Apr-2013 Alex Elder <elder@inktank.com> rbd: rearrange some code for consistency

This patch just trivially moves around some code for consistency.

In preparation for initializing osd request data fields in
ceph_osdc_build_request(), I wanted to verify that rbd did in fact
call that immediately before it called ceph_osdc_start_request().
It was true (although image requests are built in a group and then
started as a group). But I made the changes here just to make
it more obvious, by making all of the calls follow a common
sequence:
osd_req_op_<optype>_init();
ceph_osd_data_<type>_init()
osd_req_op_<optype>_<datafield>()
rbd_osd_req_format()
...
ret = rbd_obj_request_submit()

I moved the initialization of the callback for image object requests
into rbd_img_request_fill_bio(), again, for consistency. To avoid
a forward reference, I moved the definition of rbd_img_obj_callback()
up in the file.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
44cd188d48a95e42651c59ff552d45cc8c667f2c 05-Apr-2013 Alex Elder <elder@inktank.com> rbd: separate initialization of osd data

The osd data for a request is currently initialized inside
rbd_osd_req_create(), but that assumes an object request's data
belongs in the osd request's data in or data out field.

There are only three places where requests with data are set up, and
it turns out it's easier to call just the osd data init routines
directly there rather than handling it in rbd_osd_req_create().

(The real motivation here is moving toward getting rid of the
osd request in and out data fields.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2fa123201a86ff979813e24f9e5c5fa54931ab7f 05-Apr-2013 Alex Elder <elder@inktank.com> rbd: don't set data in rbd_osd_req_format_op()

Currently an object request has its osd request's data field set in
rbd_osd_req_format_op(). That assumes a single osd op per object
request, and that won't be the case for long.

Move the code that sets this out and into the caller.

Rename rbd_osd_req_format_op() to be just rbd_osd_req_format(),
removing the notion that it's doing anything op-specific.

This and the next patch resolve:
http://tracker.ceph.com/issues/4658

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c99d2d4abb6c405ef52e9bc1da87b382b8f41739 05-Apr-2013 Alex Elder <elder@inktank.com> libceph: specify osd op by index in request

An osd request now holds all of its source op structures, and every
place that initializes one of these is in fact initializing one
of the entries in the the osd request's array.

So rather than supplying the address of the op to initialize, have
caller specify the osd request and an indication of which op it
would like to initialize. This better hides the details the
op structure (and faciltates moving the data pointers they use).

Since osd_req_op_init() is a common routine, and it's not used
outside the osd client code, give it static scope. Also make
it return the address of the specified op (so all the other
init routines don't have to repeat that code).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8c042b0df99cd06ef8473ef6e204b87b3dc80158 03-Apr-2013 Alex Elder <elder@inktank.com> libceph: add data pointers in osd op structures

An extent type osd operation currently implies that there will
be corresponding data supplied in the data portion of the request
(for write) or response (for read) message. Similarly, an osd class
method operation implies a data item will be supplied to receive
the response data from the operation.

Add a ceph_osd_data pointer to each of those structures, and assign
it to point to eithre the incoming or the outgoing data structure in
the osd message. The data is not always available when an op is
initially set up, so add two new functions to allow setting them
after the op has been initialized.

Begin to make use of the data item pointer available in the osd
operation rather than the request data in or out structure in
places where it's convenient. Add some assertions to verify
pointers are always set the way they're expected to be.

This is a sort of stepping stone toward really moving the data
into the osd request ops, to allow for some validation before
making that jump.

This is the first in a series of patches that resolve:
http://tracker.ceph.com/issues/4657

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
79528734f3ae4699a2886f62f55e18fb34fb3651 04-Apr-2013 Alex Elder <elder@inktank.com> libceph: keep source rather than message osd op array

An osd request keeps a pointer to the osd operations (ops) array
that it builds in its request message.

In order to allow each op in the array to have its own distinct
data, we will need to keep track of each op's data, and that
information does not go over the wire.

As long as we're tracking the data we might as well just track the
entire (source) op definition for each of the ops. And if we're
doing that, we'll have no more need to keep a pointer to the
wire-encoded version.

This patch makes the array of source ops be kept with the osd
request structure, and uses that instead of the version encoded in
the message in places where that was previously used. The array
will be embedded in the request structure, and the maximum number of
ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP
to 2 to reduce the size of the structure.

The result of doing this sort of ripples back up, and as a result
various function parameters and local variables become unnecessary.

Make r_num_ops be unsigned, and move the definition of struct
ceph_osd_req_op earlier to ensure it's defined where needed.

It does not yet add per-op data, that's coming soon.

This resolves:
http://tracker.ceph.com/issues/4656

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
430c28c3cb7f3dbd87de266ed52d65928957ff78 04-Apr-2013 Alex Elder <elder@inktank.com> rbd: define rbd_osd_req_format_op()

Define rbd_osd_req_format_op(), which encapsulates formatting
an osd op into an object request's osd request message. Only
one op is supported right now.

Stop calling ceph_osdc_build_request() in rbd_osd_req_create().
Instead, call rbd_osd_req_format_op() in each of the callers of
rbd_osd_req_create().

This is to prepare for the next patch, in which the source ops for
an osd request will be held in the osd request itself. Because of
that, we won't have the source op to work with until after the
request is created, so we can't format the op until then.

This an the next patch resolve:
http://tracker.ceph.com/issues/4656

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
43bfe5de9fa78e07248b70992ce50321efec622c 03-Apr-2013 Alex Elder <elder@inktank.com> libceph: define osd data initialization helpers

Define and use functions that encapsulate the initializion of a
ceph_osd_data structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6010a451c38b04cf10808a508f33e5bf32e7de63 05-Apr-2013 Alex Elder <elder@inktank.com> rbd: define inbound data size for method ops

When rbd creates an object request containing an object method call
operation it is passing 0 for the size. I originally thought this
was because the length was not needed for method calls, but I think
it really should be supplied, to describe how much space is
available to receive response data. So provide the supplied length.

This resolves:
http://tracker.ceph.com/issues/4659

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
fdce58ccb5df621695b079378c619046acabc778 14-Mar-2013 Alex Elder <elder@inktank.com> libceph: record length of bio list with bio

When assigning a bio pointer to an osd request, we don't have an
efficient way of knowing the total length bytes in the bio list.
That information is available at the point it's set up by the rbd
code, so record it with the osd data when it's set.

This and the next patch are related to maintaining the length of a
message's data independent of the message header, as described here:
http://tracker.ceph.com/issues/4589

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
33803f3300265661b5c5d20a9811c6a2a157d545 14-Mar-2013 Alex Elder <elder@inktank.com> libceph: define source request op functions

The rbd code has a function that allocates and populates a
ceph_osd_req_op structure (the in-core version of an osd request
operation). When reviewed, Josh suggested two things: that the
big varargs function might be better split into type-specific
functions; and that this functionality really belongs in the osd
client rather than rbd.

This patch implements both of Josh's suggestions. It breaks
up the rbd function into separate functions and defines them
in the osd client module as exported interfaces. Unlike the
rbd version, however, the functions don't allocate an osd_req_op
structure; they are provided the address of one and that is
initialized instead.

The rbd function has been eliminated and calls to it have been
replaced by calls to the new routines. The rbd code now now use a
stack (struct) variable to hold the op rather than allocating and
freeing it each time.

For now only the capabilities used by rbd are implemented.
Implementing all the other osd op types, and making the rest of the
code use it will be done separately, in the next few patches.

Note that only the extent, cls, and watch portions of the
ceph_osd_req_op structure are currently used. Delete the others
(xattr, pgls, and snap) from its definition so nobody thinks it's
actually implemented or needed. We can add it back again later
if needed, when we know it's been tested.

This (and a few follow-on patches) resolves:
http://tracker.ceph.com/issues/3861

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
adfe695a25e92e3a4597807fbc7f9a8105218776 14-Mar-2013 Alex Elder <elder@inktank.com> ceph: move max constant definitions

Move some definitions for max integer values out of the rbd code and
into the more central "decode.h" header file. These really belong
in a Linux (or libc) header somewhere, but I haven't gotten around
to proposing that yet.

This is in preparation for moving some code out of rbd.c and into
the osd client.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
175face2ba31025b0dcd6da4e711fca7764287fa 08-Mar-2013 Alex Elder <elder@inktank.com> libceph: let osd ops determine request data length

The length of outgoing data in an osd request is dependent on the
osd ops that are embedded in that request. Each op is encoded into
a request message using osd_req_encode_op(), so that should be used
to determine the amount of outgoing data implied by the op as it
is encoded.

Have osd_req_encode_op() return the number of bytes of outgoing data
implied by the op being encoded, and accumulate and use that in
ceph_osdc_build_request().

As a result, ceph_osdc_build_request() no longer requires its "len"
parameter, so get rid of it.

Using the sum of the op lengths rather than the length provided is
a valid change because:
- The only callers of osd ceph_osdc_build_request() are
rbd and the osd client (in ceph_osdc_new_request() on
behalf of the file system).
- When rbd calls it, the length provided is only non-zero for
write requests, and in that case the single op has the
same length value as what was passed here.
- When called from ceph_osdc_new_request(), (it's not all that
easy to see, but) the length passed is also always the same
as the extent length encoded in its (single) write op if
present.

This resolves:
http://tracker.ceph.com/issues/4406

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e0c594878e3211b09208c779df5f996f0b831d9e 07-Mar-2013 Alex Elder <elder@inktank.com> libceph: record byte count not page count

Record the byte count for an osd request rather than the page count.
The number of pages can always be derived from the byte count (and
alignment/offset) but the reverse is not true.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0fff87ec798abdb4a99f01cbb0197266bb68c5dc 14-Feb-2013 Alex Elder <elder@inktank.com> libceph: separate read and write data

An osd request defines information about where data to be read
should be placed as well as where data to write comes from.
Currently these are represented by common fields.

Keep information about data for writing separate from data to be
read by splitting these into data_in and data_out fields.

This is the key patch in this whole series, in that it actually
identifies which osd requests generate outgoing data and which
generate incoming data. It's less obvious (currently) that an osd
CALL op generates both outgoing and incoming data; that's the focus
of some upcoming work.

This resolves:
http://tracker.ceph.com/issues/4127

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2ac2b7a6d4976bd6b5dc0751aa77d12d48d3ac4c 14-Feb-2013 Alex Elder <elder@inktank.com> libceph: distinguish page and bio requests

An osd request uses either pages or a bio list for its data. Use a
union to record information about the two, and add a data type
tag to select between them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2794a82a11cfeae0890741b18b0049ddb55ce646 14-Feb-2013 Alex Elder <elder@inktank.com> libceph: separate osd request data info

Pull the fields in an osd request structure that define the data for
the request out into a separate structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
46faeed4a61e220b99591e9773057160eb437cc8 11-Apr-2013 Alex Elder <elder@inktank.com> rbd: do a safe list traversal in rbd_img_request_submit()

It's possible that the reference to the object request dropped
inside the loop in rbd_img_request_submit() will be the last
one, in which case the content of the object pointer can't be
trusted.

Use a safe form of the object request list traversal to avoid
problems.

This resolves:
http://tracker.ceph.com/issues/4705

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6e2a4505dba0cae8faa701426185dfb7b49f537c 27-Mar-2013 Alex Elder <elder@inktank.com> rbd: don't zero-fill non-image object requests

A result of ENOENT from a read request for an object that's part of
an rbd image indicates that there is a hole in that portion of the
image. Similarly, a short read for such an object indicates that
the remainder of the read should be interpreted a full read with
zeros filling out the end of the request.

This behavior is not correct for objects that are not backing rbd
image data. Currently rbd_img_obj_request_callback() assumes it
should be done for all objects.

Change rbd_img_obj_request_callback() so it only does this zeroing
for image objects. Encapsulate that special handling in its own
function. Add an assertion that the image object request is a bio
request, since we assume that (and we currently don't support any
other types).

This resolves a problem identified here:
http://tracker.ceph.com/issues/4559

The regression was introduced by bf0d5f503dc11d6314c0503591d258d60ee9c944.

Reported-by: Dan van der Ster <dan@vanderster.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-off-by: Sage Weil <sage@inktank.com>
d74c6d514fe314b8bdab58b487b25992291577ec 06-Feb-2013 Kent Overstreet <koverstreet@google.com> block: Add bio_for_each_segment_all()

__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx. Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.

For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Neil Brown <neilb@suse.de>
CC: Boaz Harrosh <bharrosh@panasas.com>
1b83bef24c6746a146d39915a18fb5425f2facb0 26-Feb-2013 Sage Weil <sage@inktank.com> libceph: update osd request/reply encoding

Use the new version of the encoding for osd requests and replies. In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request. Update the rbd and fs/ceph users
appropriately.

The main changes are:
- we keep pointers into the request memory for fields we need to update
each time the request is sent out over the wire
- we keep information about the result in an array in the request struct
where the users can easily get at it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
c47f9371545abe2510ac3b66c3fc180921816f65 26-Feb-2013 Alex Elder <elder@inktank.com> rbd: pass length, not op for osd completions

The only thing type-specific osd completion functions do with their
osd op parameter is (in some cases) extract the number of bytes
transferred from it. In the other cases, the xferred bytes field
is not used, and total message data transfer byte count (which may
well be zero) is used.

Just set the object request transfer count in the main osd request
callback function and provide that to the other routines. There is
then no longer any need to pass the op pointer to the type-specific
completion routines, so drop those parameters.

Stop doing anything with the total message data length.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
39bf2c5d096729939cab657fe641044eceaa84a2 26-Feb-2013 Alex Elder <elder@inktank.com> rbd: move rbd_osd_trivial_callback()

This function is slightly out of place, probably the result
of an errant automatic merge or something.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
cc344fa1b541b116190291d366583585f03d0fe6 19-Feb-2013 Alex Elder <elder@inktank.com> rbd: eliminate sparse warnings

Fengguang Wu reminded me that there were outstanding sparse reports
in the ceph and rbd code. This patch fixes these problems in rbd
that lead to those reports:
- Convert functions that are never referenced externally to have
static scope.
- Add a lockdep annotation to rbd_request_fn(), because it
releases a lock before acquiring it again.

This partially resolves:
http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
37206ee5bede14d59306fea3af4c0105d4712342 21-Feb-2013 Alex Elder <elder@inktank.com> rbd: normalize dout() calls

Add dout() calls to facilitate tracing of image and object requests.
Change a few existing calls so they use __func__ rather than the
hard-coded function name. Have calls always add ":" after the name
of the function, and prefix pointer values with a consistent tag
indicating what it represents. (Note that there remain some older
dout() calls that are left untouched by this patch.)

Issue a warning if rbd_osd_write_callback() ever gets a short write.

This resolves:
http://tracker.ceph.com/issues/4235

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
632b88cadece050ca925d74bda250c4a320c5cc7 21-Feb-2013 Alex Elder <elder@inktank.com> rbd: barriers are hard

Let's go shopping!

I'm afraid this may not have gotten it right:
07741308 rbd: add barriers near done flag operations

The smp_wmb() should have been done *before* setting the done flag,
to ensure all other data was valid before marking the object request
done.

Switch to use atomic_inc_return() here to set the done flag, which
allows us to verify we don't mark something done more than once.
Doing this also implies general barriers before and after the call.

And although a read memory barrier might have been sufficient before
reading the done flag, convert this to a full memory barrier just
to put this issue to bed.

This resolves:
http://tracker.ceph.com/issues/4238

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4dda41d3d76747414586a4bad5615b550e0986b1 21-Feb-2013 Alex Elder <elder@inktank.com> rbd: ignore zero-length requests

The old request code simply ignored zero-length requests. We should
still operate that same way to avoid any changes in behavior. We
can implement handling for special zero-length requests separately
(see http://tracker.ceph.com/issues/4236).

Add some assertions based on this new constraint.

This resolves:
http://tracker.ceph.com/issues/4237

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
903bb32e890237ca43ab847e561e5377cfe0fdb3 06-Feb-2013 Alex Elder <elder@inktank.com> libceph: drop return value from page vector copy routines

The return values provided for ceph_copy_to_page_vector() and
ceph_copy_from_page_vector() serve no purpose, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
23ed6e13b320b33decb516cbe66e71b132df488d 06-Feb-2013 Alex Elder <elder@inktank.com> rbd: ignore result of ceph_copy_from_page_vector()

The result of ceph_copy_from_page_vector() is simply the length
argument it is provided.

This is called by rbd_obj_method_sync(), which returns the result if
it's non-negative. But we always either ignore or overwrite that
return value. So explicitly ignore what's returned by the copy
function, and have rbd_obj_method_sync() always return either a
negative errno or 0.

We also return the result of ceph_copy_from_page_vector() in
rbd_obj_read_sync(). There we still want to return the number of
bytes transferred, but we can use the value we already have in hand
rather than what ceph_copy_from_page_vector() provides.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1ceae7ef0fd00c965a2257c6e9eb497ca91f01c7 06-Feb-2013 Alex Elder <elder@inktank.com> rbd: prevent bytes transferred overflow

In rbd_obj_read_sync(), verify the number of bytes transferred won't
exceed what can be represented by a size_t before using it to
indicate the number of bytes to copy to the result buffer.

(The real motivation for this is to prepare for the next patch.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
fbfab53966b279f9cdb36b96ffa1e22f042c96ff 08-Feb-2013 Alex Elder <elder@inktank.com> libceph: allow STAT osd operations

Add support for CEPH_OSD_OP_STAT operations in the osd client
and in rbd.

This operation sends no data to the osd; everything required is
encoded in identity of the target object.

The result will be ENOENT if the object doesn't exist. If it does
exist and no other error occurs the server returns the size and last
modification time of the target object as output data (in little
endian format). The size is a 64 bit unsigned and the time is
ceph_timespec structure (two unsigned 32-bit integers, representing
a seconds and nanoseconds value).

This resolves:
http://tracker.ceph.com/issues/4007

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ef06f4d32ae5b656f17b53ee3f3c43471a11cc73 08-Feb-2013 Alex Elder <elder@inktank.com> rbd: add parentheses to object request iterator macros

The for_each_obj_request*() macros should parenthesize their uses of
the ireq parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3c663bbdcdf9296e0fe3362acb9e81f49d7b72c6 15-Feb-2013 Alex Elder <elder@inktank.com> libceph: kill ceph_osdc_create_event() "one_shot" parameter

There is only one caller of ceph_osdc_create_event(), and it
provides 0 as its "one_shot" argument. Get rid of that argument and
just use 0 in its place.

Replace the code in handle_watch_notify() that executes if one_shot
is nonzero in the event with a BUG_ON() call.

While modifying "osd_client.c", give handle_watch_notify() static
scope.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
077413082f9ade9ca4d9774dbdc81ee7256d8089 06-Feb-2013 Alex Elder <elder@inktank.com> rbd: add barriers near done flag operations

Somehow, I missed this little item in Documentation/atomic_ops.txt:
*** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

Create and use some helper functions that include the proper memory
barriers for manipulating the done field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a14ea269dd6b5e48a2941ba73b202cd7cd5d716d 05-Feb-2013 Alex Elder <elder@inktank.com> rbd: turn off interrupts for open/remove locking

This commit:
bc7a62ee5 rbd: prevent open for image being removed
added checking for removing rbd before allowing an open, and used
the same request spinlock for protecting that and updating the open
count as is used for the request queue.

However it used the non-irq protected version of the spinlocks.
Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9cbb1d7268afa997a7f96d779470cc57d28e1a13 31-Jan-2013 Alex Elder <elder@inktank.com> libceph: don't require r_num_pages for bio requests

There is a check in the completion path for osd requests that
ensures the number of pages allocated is enough to hold the amount
of incoming data expected.

For bio requests coming from rbd the "number of pages" is not really
meaningful (although total length would be). So stop requiring that
nr_pages be supplied for bio requests. This is done by checking
whether the pages pointer is null before checking the value of
nr_pages.

Note that this value is passed on to the messenger, but there it's
only used for debugging--it's never used for validation.

While here, change another spot that used r_pages in a debug message
inappropriately, and also invalidate the r_con_filling_msg pointer
after dropping a reference to it.

This resolves:
http://tracker.ceph.com/issues/3875

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1e32d34cfa6759df58b5f4002664241f2a0fef6a 30-Jan-2013 Alex Elder <elder@inktank.com> rbd: don't take extra bio reference for osd client

Currently, if the OSD client finds an osd request has had a bio list
attached to it, it drops a reference to it (or rather, to the first
entry on that list) when the request is released.

The code that added that reference (i.e., the rbd client) is
therefore required to take an extra reference to that first bio
structure.

The osd client doesn't really do anything with the bio pointer other
than transfer it from the osd request structure to outgoing (for
writes) and ingoing (for reads) messages. So it really isn't the
right place to be taking or dropping references.

Furthermore, the rbd client already holds references to all bio
structures it passes to the osd client, and holds them until the
request is completed. So there's no need for this extra reference
whatsoever.

So remove the bio_put() call in ceph_osdc_release_request(), as
well as its matching bio_get() call in rbd_osd_req_create().

This change could lead to a crash if old libceph.ko was used with
new rbd.ko. Add a compatibility check at rbd initialization time to
avoid this possibilty.

This resolves:
http://tracker.ceph.com/issues/3798 and
http://tracker.ceph.com/issues/3799

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b82d167be64b3e88d9434d8a98ce83c83a07aa48 14-Jan-2013 Alex Elder <elder@inktank.com> rbd: prevent open for image being removed

An open request for a mapped rbd image can arrive while removal of
that mapping is underway. We need to prevent such an open request
from succeeding. (It appears that Maciej Galkiewicz ran into this
problem.)

Define and use a "removing" flag to indicate a mapping is getting
removed. Set it in the remove path after verifying nothing holds
the device open. And check it in the open path before allowing the
open to proceed. Acquire the rbd device's lock around each of these
spots to avoid any races accessing the flags and open_count fields.

This addresses:
http://tracker.newdream.net/issues/3427

Reported-by: Maciej Galkiewicz <maciejgalkiewicz@ragnarson.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6d292906f80170f4647079dd503df18b737750af 14-Jan-2013 Alex Elder <elder@inktank.com> rbd: define flags field, use it for exists flag

Define a new rbd device flags field, manipulated using bit
operations. Replace the use of the current "exists" flag with a bit
in this new "flags" field. Add a little commentary about the
"exists" flag, which does not need to be manipulated atomically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8eb87565306cf40a32f5d0883d008675cd2dd510 26-Jan-2013 Alex Elder <elder@inktank.com> rbd: don't drop watch requests on completion

When we register an osd request to linger, it means that request
will stay around (under control of the osd client) until we've
unregistered it. We do that for an rbd image's header object, and
we keep a pointer to the object request associated with it.

Keep a reference to the watch object request for as long as it is
registered to linger. Drop it again after we've removed the linger
registration.

This resolves:
http://tracker.ceph.com/issues/3937

(Note: this originally came about because the osd client was
issuing a callback more than once. But that behavior will be
changing soon, documented in tracker issue 3967.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
25dcf954c3230946b5f3e18db9f91d7640eff76e 26-Jan-2013 Alex Elder <elder@inktank.com> rbd: decrement obj request count when deleting

Decrement the obj_request_count value when deleting an object
request from its image request's list. Rearrange a few lines
in the surrounding code.

This resolves:
http://tracker.ceph.com/issues/3940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
975241afcbba82ab1ddc6ebf8a02246d1e9314fd 26-Jan-2013 Alex Elder <elder@inktank.com> rbd: track object rather than osd request for watch

Switch to keeping track of the object request pointer rather than
the osd request used to watch the rbd image header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6977c3f983b0d2b481a65b1fa3e85683fd1318af 26-Jan-2013 Alex Elder <elder@inktank.com> rbd: unregister linger in watch sync routine

Move the code that unregisters an rbd device's lingering header
object watch request into rbd_dev_header_watch_sync(), so it
occurs in the same function that originally sets up that request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9f20e02a53b944a54a35b9f0db1243cd64872f7d 20-Jan-2013 Alex Elder <elder@inktank.com> rbd: get rid of rbd_req_sync_exec()

Get rid rbd_req_sync_exec() because it is no longer used. That
eliminates the last use of rbd_req_sync_op(), so get rid of that
too. And finally, that leaves rbd_do_request() unreferenced, so get
rid of that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
36be9a761844e186f629f463b665945df4f67766 19-Jan-2013 Alex Elder <elder@inktank.com> rbd: implement sync method with new code

Reimplement synchronous object method calls using the new request
tracking code. Use the name rbd_obj_method_sync()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cf81b60e4bbd4a1281fe2640f9c0c40fe3a85fdf 17-Jan-2013 Alex Elder <elder@inktank.com> rbd: send notify ack asynchronously

When we receive notification of a change to an rbd image's header
object we need to refresh our information about the image (its
size and snapshot context). Once we have refreshed our rbd image
we need to acknowledge the notification.

This acknowledgement was previously done synchronously, but there's
really no need to wait for it to complete.

Change it so the caller doesn't wait for the notify acknowledgement
request to complete. And change the name to reflect it's no longer
synchronous.

This resolves:
http://tracker.newdream.net/issues/3877

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5ae9db81b45c2d95554c665043afffd5e9a7d5ac 20-Jan-2013 Alex Elder <elder@inktank.com> rbd: get rid of rbd_req_sync_notify_ack()

Get rid rbd_req_sync_notify_ack() because it is no longer used.
As a result rbd_simple_req_cb() becomes unreferenced, so get rid
of that too.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b8d70035b35dc12135d5835b659b229bcd6d4f94 01-Dec-2012 Alex Elder <elder@inktank.com> rbd: use new code for notify ack

Use the new object request tracking mechanism for handling a
notify_ack request.

Move the callback function below the definition of this so we don't
have to do a pre-declaration.

This resolves:
http://tracker.newdream.net/issues/3754

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ecf7a0318b1b026fb147623c36324fa8c73447a9 20-Jan-2013 Alex Elder <elder@inktank.com> rbd: get rid of rbd_req_sync_watch()

Get rid of rbd_req_sync_watch(), because it is no longer used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9969ebc5af93028c21f2614621737f0d6ff6fc06 18-Jan-2013 Alex Elder <elder@inktank.com> rbd: implement watch/unwatch with new code

Implement a new function to set up or tear down a watch event
for an mapped rbd image header using the new request code.

Create a new object request type "nodata" to handle this. And
define rbd_osd_trivial_callback() which simply marks a request done.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
86ea43bfcbeae61858b0fee4971e5b1e894d7174 20-Jan-2013 Alex Elder <elder@inktank.com> rbd: get rid of rbd_req_sync_read()

Delete rbd_req_sync_read() is no longer used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
788e2df3b92e30f1fff74139bb53e68ec13fe2a5 17-Jan-2013 Alex Elder <elder@inktank.com> rbd: implement sync object read with new code

Reimplement the synchronous read operation used for reading a
version 1 header using the new request tracking code. Name the
resulting function rbd_obj_read_sync() to better reflect that
it's a full object operation, not an object request. To do this,
implement a new OBJ_REQUEST_PAGES object request type.

This implements a new mechanism to allow the caller to wait for
completion for an rbd_obj_request by calling rbd_obj_request_wait().

This partially resolves:
http://tracker.newdream.net/issues/3755

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7d250b949a33c8a658a2ad4ab390d8394b842224 01-Dec-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_req_coll and rbd_request

The two remaining callers of rbd_do_request() always pass a null
collection pointer, so the "coll" and "coll_index" parameters are
not needed. There is no other use of that data structure, so it
can be eliminated.

Deleting them means there is no need to allocate a rbd_request
structure for the callback function. And since that's the only use
of *that* structure, it too can be eliminated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2250a71b591728092db9adcc51629401deb2f9f8 01-Dec-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_rq_fn() and all other related code

Now that the request function has been replaced by one using the new
request management data structures the old one can go away.
Deleting it makes rbd_dev_do_request() no longer needed, and
deleting that makes other functions unneeded, and so on.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bf0d5f503dc11d6314c0503591d258d60ee9c944 22-Nov-2012 Alex Elder <elder@inktank.com> rbd: new request tracking code

This patch fully implements the new request tracking code for rbd
I/O requests.

Each I/O request to an rbd image will get an rbd_image_request
structure allocated to track it. This provides access to all
information about the original request, as well as access to the
set of one or more object requests that are initiated as a result
of the image request.

An rbd_obj_request structure defines a request sent to a single osd
object (possibly) as part of an rbd image request. An rbd object
request refers to a ceph_osd_request structure built up to represent
the request; for now it will contain a single osd operation. It
also provides space to hold the result status and the version of the
object when the osd request completes.

An rbd_obj_request structure can also stand on its own. This will
be used for reading the version 1 header object, for issuing
acknowledgements to event notifications, and for making object
method calls.

All rbd object requests now complete asynchronously with respect
to the osd client--they supply a common callback routine.

This resolves:
http://tracker.newdream.net/issues/3741

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c04306471ad93f1daf60771a0373316d4c3494ae 18-Jan-2013 Alex Elder <elder@inktank.com> rbd: don't retry setting up header watch

When an rbd image is initially mapped a watch event is registered so
we can do something if the header object changes.

The code that does this currently loops if initiating the watch
request results in an ERANGE error. The osds will never return
ERANGE, so there's no reason to do this loop, so get rid of it.

This resolves:
http://tracker.newdream.net/issues/3860

Note that the problem this loop was intended to solve is a race
between collecting image header information and setting up the watch
on the header object. The real fix for that problem is described
here:
http://tracker.newdream.net/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
38901e0f240b634467fb31743365af80a006be33 10-Jan-2013 Alex Elder <elder@inktank.com> rbd: check for overflow in rbd_get_num_segments()

The return type of rbd_get_num_segments() is int, but the values it
operates on are u64. Although it's not likely, there's no guarantee
the result won't exceed what can be respresented in an int. The
function is already designed to return -ERANGE on error, so just add
this possible overflow as another reason to return that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
98571b5aa776d4a69eadd7d4e5c9d4e69365ab9a 20-Jan-2013 Alex Elder <elder@inktank.com> rbd: small changes

A few very minor changes to the rbd code:
- RBD_MAX_OPT_LEN is unused, so get rid of it
- Consolidate rbd options definitions
- Make rbd_segment_name() return pointer to const char

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e0b49868d3629708eda593b6739cb78f33ab238a 09-Jan-2013 Alex Elder <elder@inktank.com> rbd: fix type of snap_id in rbd_dev_v2_snap_info()

The type of the snap_id local variable is defined with the
wrong byte order. Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8b84de7940b69fd7326946ba244621aa5fc412e0 20-Nov-2012 Alex Elder <elder@inktank.com> rbd: assign watch request more directly

Both rbd_req_sync_op() and rbd_do_request() have a "linger"
parameter, which is the address of a pointer that should refer to
the osd request structure used to issue a request to an osd.

Only one case ever supplies a non-null "linger" argument: an
CEPH_OSD_OP_WATCH start. And in that one case it is assigned
&rbd_dev->watch_request.

Within rbd_do_request() (where the assignment ultimately gets made)
we know the rbd_dev and therefore its watch_request field. We
also know whether the op being sent is CEPH_OSD_OP_WATCH start.

Stop opaquely passing down the "linger" pointer, and instead just
assign the value directly inside rbd_do_request() when it's needed.

This makes it unnecessary for rbd_req_sync_watch() to make
arrangements to hold a value that's not available until a
bit later. This more clearly separates setting up a watch
request from submitting it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5efea49a98d1a3b3a7301d3a17f826ad4c31b290 20-Nov-2012 Alex Elder <elder@inktank.com> rbd: move remaining osd op setup into rbd_osd_req_op_create()

The two remaining osd ops used by rbd are CEPH_OSD_OP_WATCH and
CEPH_OSD_OP_NOTIFY_ACK. Move the setup of those operations into
rbd_osd_req_op_create(), and get rid of rbd_create_rw_op() and
rbd_destroy_op().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2647ba38100765298fc67ce1ec5d32e80d9fe046 20-Nov-2012 Alex Elder <elder@inktank.com> rbd: move call osd op setup into rbd_osd_req_op_create()

Move the initialization of the CEPH_OSD_OP_CALL operation into
rbd_osd_req_op_create().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8d23bf29095e5fab84535035e7a27c4920812c44 20-Nov-2012 Alex Elder <elder@inktank.com> rbd: don't assign extent info in rbd_req_sync_op()

Move the assignment of the extent offset and length and payload
length out of rbd_req_sync_op() and into its caller in the one spot
where a read (and note--no write) operation might be initiated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c561191813e232aa52022532751855ff5c9fa319 20-Nov-2012 Alex Elder <elder@inktank.com> rbd: don't assign extent info in rbd_do_request()

In rbd_do_request() there's a sort of last-minute assignment of the
extent offset and length and payload length for read and write
operations. Move those assignments into the caller (in those spots
that might initiate read or write operations)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1821665749a3d7e26ad470c63fc2c46990dc6041 30-Nov-2012 Alex Elder <elder@inktank.com> rbd: don't leak rbd_req for rbd_req_sync_notify_ack()

When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies
rbd_simple_req_cb() as its callback function. Because the callback
is supplied, an rbd_req structure gets allocated and populated so it
can be used by the callback. However rbd_simple_req_cb() is not
freeing (or even using) the rbd_req structure, so it's getting
leaked.

Since rbd_simple_req_cb() has no need for the rbd_req structure,
just avoid allocating one for this case. Of the three calls to
rbd_do_request(), only the one from rbd_do_op() needs the rbd_req
structure, and that call can be distinguished from the other two
because it supplies a non-null rbd_collection pointer.

So fix this leak by only allocating the rbd_req structure if a
non-null "coll" value is provided to rbd_do_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2e53c6c379b65372df21f4d6019f6eb63af81384 30-Nov-2012 Alex Elder <elder@inktank.com> rbd: don't leak rbd_req on synchronous requests

When rbd_do_request() is called it allocates and populates an
rbd_req structure to hold information about the osd request to be
sent. This is done for the benefit of the callback function (in
particular, rbd_req_cb()), which uses this in processing when
the request completes.

Synchronous requests provide no callback function, in which case
rbd_do_request() waits for the request to complete before returning.
This case is not handling the needed free of the rbd_req structure
like it should, so it is getting leaked.

Note however that the synchronous case has no need for the rbd_req
structure at all. So rather than simply freeing this structure for
synchronous requests, just don't allocate it to begin with.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
907703d050df92979b3848ee42f88d5c9c6c13fe 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: combine rbd sync watch/unwatch functions

The rbd_req_sync_watch() and rbd_req_sync_unwatch() functions are
nearly identical. Combine them into a single function with a flag
indicating whether a watch is to be initiated or torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0903e875caa93e1fb231dd66c69b118dbdad25cb 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: use a common layout for each device

Each osd message includes a layout structure, and for rbd it is
always the same (at least for osd's in a given pool).

Initialize a layout structure when an rbd_dev gets created and just
copy that into osd requests for the rbd image.

Replace an assertion that was done when initializing the layout
structures with code that catches and handles anything that would
trigger the assertion as soon as it is identified. This precludes
that (bad) condition from ever occurring.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
47dba7ba2623b088cbbe1ac0aaa1a034f3249b6d 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: don't bother calculating file mapping

When rbd_do_request() has a request to process it initializes a ceph
file layout structure and uses it to compute offsets and limits for
the range of the request using ceph_calc_file_object_mapping().

The layout used is fixed, and is based on RBD_MAX_OBJ_ORDER (30).
It sets the layout's object size and stripe unit to be 1 GB (2^30),
and sets the stripe count to be 1.

The job of ceph_calc_file_object_mapping() is to determine which
of a sequence of objects will contain data covered by range, and
within that object, at what offset the range starts. It also
truncates the length of the range at the end of the selected object
if necessary.

This is needed for ceph fs, but for rbd it really serves no purpose.
It does its own blocking of images into objects, echo of which is
(1 << obj_order) in size, and as a result it ignores the "bno"
value returned by ceph_calc_file_object_mapping(). In addition,
by the point a request has reached this function, it is already
destined for a single rbd object, and its length will not exceed
that object's extent. Because of this, and because the mapping will
result in blocking up the range using an integer multiple of the
image's object order, ceph_calc_file_object_mapping() will never
change the offset or length values defined by the request.

In other words, this call is a big no-op for rbd data requests.

There is one exception. We read the header object using this
function, and in that case we will not have already limited the
request size. However, the header is a single object (not a file or
rbd image), and should not be broken into pieces anyway. So in fact
we should *not* be calling ceph_calc_file_object_mapping() when
operating on the header object.

So...

Don't call ceph_calc_file_object_mapping() in rbd_do_request(),
because useless for image data and incorrect to do sofor the image
header.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e01e79273b251dbb35ff2522a688229b09481923 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: open code rbd_calc_raw_layout()

This patch gets rid of rbd_calc_raw_layout() by simply open coding
it in its one caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0829661863fb5c8031c1c5c119693ea157517783 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: pull in ceph_calc_raw_layout()

This is the first in a series of patches aimed at eliminating
the use of ceph_calc_raw_layout() by rbd.

It simply pulls in a copy of that function and renames it
rbd_calc_raw_layout().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
30573d680355ca0de4db2113b9080cd078ac726f 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: assume single op in a request

We now know that every of rbd_req_sync_op() passes an array of
exactly one operation, as evidenced by all callers passing 1 as its
num_op argument. So get rid of that argument, assuming a single op.

Similarly, we now know that all callers of rbd_do_request() pass 1
as the num_op value, so that parameter can be eliminated as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
139b4318ad93ae4370d88882ff89b42dcbfaaab1 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: there is really only one op

Throughout the rbd code there are spots where it appears we can
handle an osd request containing more than one osd request op.

But that is only the way it appears. In fact, currently only one
operation at a time can be supported, and supporting more than
one will require much more than fleshing out the support that's
there now.

This patch changes names to make it perfectly clear that anywhere
we're dealing with a block of ops, we're in fact dealing with
exactly one of them. We'll be able to simplify some things as
a result.

When multiple op support is implemented, we can update things again
accordingly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ae7ca4a35b1f5df86e2c32b2cfc01a8d528c7b8c 14-Nov-2012 Alex Elder <elder@inktank.com> libceph: pass num_op with ops

Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are
provided an array of ceph osd request operations. Rather than just
passing the number of operations in the array, the caller is
required append an additional zeroed operation structure to signal
the end of the array.

All callers know the number of operations at the time these
functions are called, so drop the silly zero entry and supply that
number directly. As a result, get_num_ops() is no longer needed.
This also means that ceph_osdc_alloc_request() never uses its ops
argument, so that can be dropped.

Also rbd_create_rw_ops() no longer needs to add one to reserve room
for the additional op.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d07c09589f533db9ab500ac38151bc9f3a4d0648 14-Nov-2012 Alex Elder <elder@inktank.com> rbd: pass num_op with ops array

Add a num_op parameter to rbd_do_request() and rbd_req_sync_op() to
indicate the number of entries in the array. The callers of these
functions always know how many entries are in the array, so just
pass that information down.

This is in anticipation of eliminating the extra zero-filled entry
in these ops arrays.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
54a5400721da7fa5a16cea151aade5bdfee74111 14-Nov-2012 Alex Elder <elder@inktank.com> libceph: don't set pages or bio in ceph_osdc_alloc_request()

Only one of the two callers of ceph_osdc_alloc_request() provides
page or bio data for its payload. And essentially all that function
was doing with those arguments was assigning them to fields in the
osd request structure.

Simplify ceph_osdc_alloc_request() by having the caller take care of
making those assignments

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d178a9e74006e80f568d87e29f2a68f14fc7cbb1 14-Nov-2012 Alex Elder <elder@inktank.com> libceph: don't set flags in ceph_osdc_alloc_request()

The only thing ceph_osdc_alloc_request() really does with the
flags value it is passed is assign it to the newly-created
osd request structure. Do that in the caller instead.

Both callers subsequently call ceph_osdc_build_request(), so have
that function (instead of ceph_osdc_alloc_request()) issue a warning
if a request comes through with neither the read nor write flags set.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e75b45cf36565fd8ba206a9d80f670a86e61ba2f 14-Nov-2012 Alex Elder <elder@inktank.com> libceph: drop osdc from ceph_calc_raw_layout()

The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
of it. Consequently, the corresponding parameter in calc_layout()
becomes unused, so get rid of that as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4d6b250bf18d44571d69a0f4afec4b6a1969729f 14-Nov-2012 Alex Elder <elder@inktank.com> libceph: drop snapid in ceph_calc_raw_layout()

A snapshot id must be provided to ceph_calc_raw_layout() even though
it is not needed at all for calculating the layout.

Where the snapshot id *is* needed is when building the request
message for an osd operation.

Drop the snapid parameter from ceph_calc_raw_layout() and pass
that value instead in ceph_osdc_build_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0120be3c60d46d6d55f4bf7a3d654cc705eb0c54 14-Nov-2012 Alex Elder <elder@inktank.com> libceph: pass length to ceph_osdc_build_request()

The len argument to ceph_osdc_build_request() is set up to be
passed by address, but that function never updates its value
so there's no need to do this. Tighten up the interface by
passing the length directly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7c3d22cf16f1bbcb37a73e88338c042bb49ff112 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: don't bother setting snapid in rbd_do_request()

For some reason, the snapid field of the osd request header is
explicitly set to CEPH_NOSNAP in rbd_do_request(). Just a few lines
later--with no code that would access this field in between--a call
is made to ceph_calc_raw_layout() passing the snapid provided to
rbd_do_request(), which encodes the snapid value it is provided into
that field instead.

In other words, there is no need to fill in CEPH_NOSNAP, and doing
so suggests it might be necessary. Don't do that any more.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
25704ac9de30ac3e73c123e7b2734f7ca744c8d8 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_req_sync_op() snapc and snapid parameters

The snapc and snapid parameters to rbd_req_sync_op() always take
the values NULL and CEPH_NOSNAP, respectively. So just get rid
of them and use those values where needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
07b2391fbbcefdecbc2f16321f8e454802e0b926 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: drop flags parameter from rbd_req_sync_exec()

All callers of rbd_req_sync_exec() pass CEPH_OSD_FLAG_READ as their
flags argument. Delete that parameter and use CEPH_OSD_FLAG_READ
within the function. If we find a need to support write operations
we can add it back again.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4775618d9255c0c135580bbee8bee6815f8194cf 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: drop snapid parameter from rbd_req_sync_read()

There is only one caller of rbd_req_sync_read(), and it passes
CEPH_NOSNAP as the snapshot id argument. Delete that parameter
and just use CEPH_NOSNAP within the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
af77f26caa35a95af09d1dac5c513b3901de7e37 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: drop oid parameters from ceph_osdc_build_request()

The last two parameters to ceph_osd_build_request() describe the
object id, but the values passed always come from the osd request
structure whose address is also provided. Get rid of those last
two parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0ec8ce87f3bb5e4a561190f5320934e003405b6f 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: separate layout init

Pull a block of code that initializes the layout structure in an osd
request into its own function so it can be reused.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a7b4c65f4f15aa657b09d13da8f45ba0b72ec0df 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: only get snap context for write requests

Right now we get the snapshot context for an rbd image (under
protection of the header semaphore) for every request processed.

There's no need to get the snap context if we're doing a read,
so avoid doing so in that case.

Note that we no longer need to hold the header semaphore to
check the rbd_dev's existence flag.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d78b650a595e23e5a115d332e3c37e019baf7703 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: make exists flag atomic

The rbd_device->exists field can be updated asynchronously, changing
from set to clear if a mapped snapshot disappears from the base
image's snapshot context.

Currently, value of the "exists" flag is only read and modified
under protection of the header semaphore, but that will change with
the next patch. Making it atomic ensures this won't be a problem
because the a the non-existence of device will be immediately known.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b395e8b5b8f06399e3fe3ee016c9cf41ff665efc 08-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: a little more cleanup of rbd_rq_fn()

Now that a big hunk in the middle of rbd_rq_fn() has been moved
into its own routine we can simplify it a little more.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cd323ac0eb433b14cbb270bfc5a82f308f2662de 08-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: end request on error in rbd_do_request() caller

Only one of the three callers of rbd_do_request() provide a
collection structure to aggregate status.

If an error occurs in rbd_do_request(), have the caller
take care of calling rbd_coll_end_req() if necessary in
that one spot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8295cda7ceceb7b25f9a12cd21bbfb6206e28a6d 08-Nov-2012 Alex Elder <elder@inktank.com> rbd: encapsulate handling for a single request

In rbd_rq_fn(), requests are fetched from the block layer and each
request is processed, looping through the request's list of bio's
until they've all been consumed.

Separate the handling for a single request into its own function to
make it a bit easier to see what's going on.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8986cb37b1cf1f54b35f062f0a12dc68dd89f311 08-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: be picky about osd request status type

The result field in a ceph osd reply header is a signed 32-bit type,
but rbd code often casually uses int to represent it.

The following changes the types of variables that handle this result
value to be "s32" instead of "int" to be completely explicit about
it. Only at the point we pass that result to __blk_end_request()
does the type get converted to the plain old int defined for that
interface.

There is almost certainly no binary impact of this change, but I
prefer to show the exact size and signedness of the value since we
know it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
5f29ddd4f0954ad6c84e28b934773f128840f064 08-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: standardize ceph_osd_request variable names

There are spots where a ceph_osds_request pointer variable is given
the name "req". Since we're dealing with (at least) three types of
requests (block layer, rbd, and osd), I find this slightly
distracting.

Change such instances to use "osd_req" consistently to make the
abstraction represented a little more obvious.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
725afc97c91cd5f71a015143da5095d20cd668b9 08-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: standardize rbd_request variable names

There are two names used for items of rbd_request structure type:
"req" and "req_data". The former name is also used to represent
items of pointers to struct ceph_osd_request.

Change all variables that have these names so they are instead
called "rbd_req" consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
935dc89f3e29e2ef1d7c89778cdb9f37ff36e94b 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: add warnings to rbd_dev_probe_update_spec()

Josh suggested adding warnings to this function to help users
diagnose problems.

Other than memory allocatino errors, there are two places where
errors can be returned. Both represent problems that should
have been caught earlier, and as such might well have been
handled with BUG_ON() calls. But if either ever did manage to
happen, it will be reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f5400b7a0e78a53edce8960a079aa022640849a1 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: add a warning in bio_chain_clone_range()

Add a warning in bio_chain_clone_range() to help a user determine
what exactly might have led to a failure. There is only one; please
say something if you disagree with the following reasoning.

There are three places this can return abnormally:
- Initially, if there is nothing to clone. It turns out that
right now this cannot happen anyway. The test is in place
because the code below it doesn't work if those conditions
don't hold. As such they could be assertions but since I can
return a null to indicate an error I just do that instead.
I have not added a warning here because it won't happen.
- While processing bio's, if none remain but there are supposed
to be more bytes to clone. Here I have added a warning.
- If bio_clone_range() returns a null pointer. That function
will have already produced a warning (at least the first
time, via WARN_ON_ONCE()) to distinguish the cause of the
error. The only exception is memory exhaustion, and I'd
rather not pepper the code with warnings in all those spots.
So no warning is added in that place.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4fb5d671399e83d3875593db2f56d5b57fcb104f 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: add warning messages for missing arguments

Tell the user (via dmesg) what was wrong with the arguments provided
via /sys/bus/rbd/add.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
06ecc6cbf7a60cd5abd9fd2bda8ae69b395c2be3 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: define and use rbd_warn()

Define a new function rbd_warn() that produces a boilerplate warning
message, identifying in the resulting message the affected rbd
device in the best way available. Use it in a few places that now
use pr_warning().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4caf35f9ecdca1feef1d2e5e223b78e52ffbea87 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: use kmemdup()

This replaces two kmalloc()/memcpy() combinations with a single
call to kmemdup().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
979ed480a2722ad8d9f87054635158f652a1241e 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: kill rbd_spec->image_id_len

There is no real benefit to keeping the length of an image id, so
get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
69e7a02f633ba7de0aefa87f3f0e3b31e953b09a 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: kill rbd_spec->image_name_len

There may have been a benefit to hanging on to the length of an
image name before, but there is really none now. The only time it's
used is when probing for rbd images, so we can just compute the
length then.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c66c6e0c0b04ce4a152fe02c562dd0752665d580 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: document rbd_spec structure

I promised Josh I would document whether there were any restrictions
needed for accessing fields of an rbd_spec structure. This adds a
big block of comments that documents the structure and how it is
used--including the fact that we don't attempt to synchronize access
to it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c3e946ce7276faf0b302acd25c7b874edbeba661 16-Nov-2012 Alex Elder <elder@inktank.com> rbd: get rid of rbd_{get,put}_dev()

The functions rbd_get_dev() and rbd_put_dev() are trivial wrappers
that add no value, and their existence suggests they may do more
than what they do.

Get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
b8f5c6edca34ff441e1ccdec68828e933a1b905b 01-Nov-2012 Alex Elder <elder@dreamhost.com> rbd: don't use ENOTSUPP

ENOTSUPP is not a standard errno (it shows up as "Unknown error 524"
in an error message). This is what was getting produced when the
the local rbd code does not implement features required by a
discovered rbd image.

Change the error code returned in this case to ENXIO.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2fd82b9e92c2a718ae81fc987b4468ceeee6979b 09-Nov-2012 Alex Elder <elder@inktank.com> rbd: get rid of RBD_MAX_SEG_NAME_LEN

RBD_MAX_SEG_NAME_LEN represents the maximum length of an rbd object
name (i.e., one of the objects providing storage backing an rbd
image).

Another symbol, MAX_OBJ_NAME_SIZE, is used in the osd client code to
define the maximum length of any object name in an osd request.

Right now they disagree, with RBD_MAX_SEG_NAME_LEN being too big.

There's no real benefit at this point to defining the rbd object
name length limit separate from any other object name, so just
get rid of RBD_MAX_SEG_NAME_LEN and use MAX_OBJ_NAME_SIZE in its
place.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
42382b709bd1d143b9f0fa93e0a3a1f2f4210707 16-Nov-2012 Alex Elder <elder@inktank.com> rbd: do not allow remove of mounted-on image

There is no check in rbd_remove() to see if anybody holds open the
image being removed. That's not cool.

Add a simple open count that goes up and down with opens and closes
(releases) of the device, and don't allow an rbd image to be removed
if the count is non-zero.

Protect the updates of the open count value with ctl_mutex to ensure
the underlying rbd device doesn't get removed while concurrently
being opened.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
9e15b77d9af3b63dfbff14e695336dfca88c22b2 31-Oct-2012 Alex Elder <elder@inktank.com> rbd: get additional info in parent spec

When a layered rbd image has a parent, that parent is identified
only by its pool id, image id, and snapshot id. Images that have
been mapped also record *names* for those three id's.

Add code to look up these names for parent images so they match
mapped images more closely. Skip doing this for an image if it
already has its pool name defined (this will be the case for images
mapped by the user).

It is possible that an the name of a parent image can't be
determined, even if the image id is valid. If this occurs it
does not preclude correct operation, so don't treat this as
an error.

On the other hand, defined pools will always have both an id and a
name. And any snapshot of an image identified as a parent for a
clone image will exist, and will have a name (if not it indicates
some other internal error). So treat failure to get these bits
of information as errors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
86b00e0da6be7bbc16412f126c5b548ac5d91d50 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: get parent spec for version 2 images

Add support for getting the the information identifying the parent
image for rbd images that have them. The child image holds a
reference to its parent image specification structure. Create a new
entry "parent" in /sys/bus/rbd/image/N/ to report the identifying
information for the parent image, if any.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a92ffdf8a9b09f8fae9a8f418f87f30a5e459570 31-Oct-2012 Alex Elder <elder@inktank.com> rbd: allow null image name

Format 2 parent images are partially identified by their image id,
but it may not be possible to determine their image name. The name
is not strictly needed for correct operation, so we won't be
treating it as an error if we don't know it. Handle this case
gracefully in rbd_name_show().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2c0d0a10ea89456781218f458f6bf72e99d87d2a 31-Oct-2012 Alex Elder <elder@inktank.com> rbd: allow null image name

We will know the image id for format 2 parent images, but won't
initially know its image name. Avoid making the query for an image
id in rbd_dev_image_id() if it's already known.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
83a06263625b823afa3a842ddbf53473c22f24b2 30-Oct-2012 Alex Elder <elder@inktank.com> rbd: encapsulate last part of probe

Group the activities that now take place after an rbd_dev_probe()
call into a single function, and move the call to that function
into rbd_dev_probe() itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c53d589337e9a211413484a604c76072e8474dc0 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: define rbd_dev_{create,destroy}() helpers

Encapsulate the creation/initialization and destruction of rbd
device structures. The rbd_client and the rbd_spec structures
provided on creation hold references whose ownership is transferred
to the new rbd_device structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bd4ba6554dcbae652b8b27a44f5a7795c9f3178a 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: consolidate rbd_dev init in rbd_add()

Group the allocation and initialization of fields of the rbd device
structure created in rbd_add(). Move the grouped code down later in
the function, just prior to the call to rbd_dev_probe(). This is
for the most part simple code movement.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9d3997fdf4c82adfb37a4886a21eaa513ee071b6 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: don't pass rbd_dev to rbd_get_client()

The only reason rbd_dev is passed to rbd_get_client() is so its
rbd_client field can get assigned. Instead, just return the
rbd_client pointer as a result and have the caller do the
assignment.

Change rbd_put_client() so it takes an rbd_client structure,
so follows the more typical symmetry with rbd_get_client().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
859c31df9cee9d1e1308b3b024b61355e6a629a5 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: fill rbd_spec in rbd_add_parse_args()

Pass the address of an rbd_spec structure to rbd_add_parse_args().
Use it to hold the information defining the rbd image to be mapped
in an rbd_add() call.

Use the result in the caller to initialize the rbd_dev->id field.

This means rbd_dev is no longer needed in rbd_add_parse_args(),
so get rid of it.

Now that this transformation of rbd_add_parse_args() is complete,
correct and expand on the its header documentation to reflect the
new reality.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8b8fb99c5c93a0bdfe7b0c0c9fd2d41a3244555e 27-Oct-2012 Alex Elder <elder@inktank.com> rbd: add reference counting to rbd_spec

With layered images we'll share rbd_spec structures, so add a
reference count to it. It neatens up some code also.

A silly get/put pair is added to the alloc routine just to avoid
"defined but not used" warnings. It will go away soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0d7dbfce9d6e3a57a6946fadf7f92b1792b8acc0 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: define image specification structure

Group the fields that uniquely specify an rbd image into a new
reference-counted rbd_spec structure. This structure will be used
to describe the desired image when mapping an image, and when
probing parent images in layered rbd devices. Replace the set of
fields in the rbd device structure with a pointer to a dynamically
allocated rbd_spec.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
dc79b113d6db48ee6ee7decf0216feee48757f01 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: have rbd_add_parse_args() return error

Change the interface to rbd_add_parse_args() so it returns an
error code rather than a pointer. Return the ceph_options result
via a pointer whose address is passed as an argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4e9afeba7aa9ca925a79d9ce82f1a8e79e14c291 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: pass and populate rbd_options structure

Have the caller pass the address of an rbd_options structure to
rbd_add_parse_args(), to be initialized with the information
gleaned as a result of the parse.

I know, this is another near-reversal of a recent change...

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
819d52bf72b61a8455024ff7863eed5d681e73c7 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: remove snap_name arg from rbd_add_parse_args()

The snapshot name returned by rbd_add_parse_args() just gets saved
in the rbd_dev eventually. So just do that inside that function and
do away with the snap_name argument, both in rbd_add_parse_args()
and rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f28e565a1b15eef62618db4011d9e320089a4214 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: remove options args from rbd_add_parse_args()

They "options" argument to rbd_add_parse_args() (and it's partner
options_size) is now only needed within the function, so there's no
need to have the caller allocate and pass the options buffer. Just
allocate the options buffer within the function using dup_token().

Also distinguish between failures due to failed memory allocation
and failing because a required argument was missing.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e5c35534042f4b5957a32bba651222c91247beba 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: get rid of snap_name_len

The value returned in the "snap_name_len" argument to
rbd_add_parse_args() is never actually used, so get rid of it.

The snap_name_len recorded in rbd_dev_v2_snap_name() is not
useful either, so get rid of that too.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0ddebc0c6c518ae42c376151e34d9d4b84443ba5 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: do all argument parsing in one place

This patch makes rbd_add_parse_args() be the single place all
argument parsing occurs for an image map request:
- Move the ceph_parse_options() call into that function
- Use local variables rather than parameters to hold the list
of monitor addresses supplied
- Rather than returning it, pass the snapshot name (and its
length) back via parameters
- Have the function return a ceph_options structure pointer

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
78cea76e0580befaf561c6989f4fc985bc66c8f7 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: move ceph_parse_options() call up

Move option parsing out of rbd_get_client() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
daba5fdb4c469838dcee4b8dd4fecf7be69fa218 27-Oct-2012 Alex Elder <elder@inktank.com> rbd: rename snap_exists field

A Boolean field "snap_exists" in an rbd mapping is used to indicate
whether a mapped snapshot has been removed from an image's snapshot
context, to stop sending requests for that snapshot as soon as we
know it's gone.

Generalize the interpretation of this field so it applies to
non-snapshot (i.e. "head") mappings. That is, define its value
to be false until the mapping has been set, and then define it to be
true for both snapshot mappings or head mappings.

Rename the field "exists" to reflect the broader interpretation.
The rbd_mapping structure is on its way out, so move the field
back into the rbd_device structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
971f839a7670197366c04e99472943532caeb0dc 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: move snap info out of rbd_mapping struct

Moving the snap_id and snap_name fields into the separate
rbd_mapping structure was misguided. (And in time, perhaps
we'll do away with that structure altogether...)

Move these fields back into struct rbd_device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
86992098e7fdb97d01feb51495a952b264a55b7c 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: make pool_id a 64 bit value

If a format 2 image has a parent, its pool id will be specified
using a 64-bit value. Change the pool id we save for an image to
match that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
41f38c2b2f8b66b176a0e548ef06294343a7bfa2 26-Oct-2012 Alex Elder <elder@inktank.com> rbd: remove snapshots on error in rbd_add()

If rbd_dev_snaps_update() has ever been called for an rbd device
structure there could be snapshot structures on its snaps list.
In rbd_add(), this function is called but a subsequent error
path neglected to clean up any of these snapshots.

Add a call to rbd_remove_all_snaps() in the appropriate spot to
remedy this. Change a couple of error labels to be a little
clearer while there.

Drop the leading underscores from the function name; there's nothing
special about that function that they might signify. As suggested
in review, the leading underscores in __rbd_remove_snap_dev() have
been removed as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f7760dad286829682a8d36f4563ab20a65732414 21-Oct-2012 Alex Elder <elder@inktank.com> rbd: simplify rbd_rq_fn()

When processing a request, rbd_rq_fn() makes clones of the bio's in
the request's bio chain and submits the results to osd's to be
satisfied. If a request bio straddles the boundary between objects
backing the rbd image, it must be represented by two cloned bio's,
one for the first part (at the end of one object) and one for the
second (at the beginning of the next object).

This has been handled by a function bio_chain_clone(), which
includes an interface only a mother could love, and which has
been found to have other problems.

This patch defines two new fairly generic bio functions (one which
replaces bio_chain_clone()) to help out the situation, and then
revises rbd_rq_fn() to make use of them.

First, bio_clone_range() clones a portion of a single bio, starting
at a given offset within the bio and including only as many bytes
as requested. As a convenience, a request to clone the entire bio
is passed directly to bio_clone().

Second, bio_chain_clone_range() performs a similar function,
producing a chain of cloned bio's covering a sub-range of the
source chain. No bio_pair structures are used, and if successful
the result will represent exactly the specified range.

Using bio_chain_clone_range() makes bio_rq_fn() a little easier
to understand, because it avoids the need to pass very much
state information between consecutive calls. By avoiding the need
to track a bio_pair structure, it also eliminates the problem
described here: http://tracker.newdream.net/issues/2933

Note that a block request (and therefore the complete length of
a bio chain processed in rbd_rq_fn()) is an unsigned int, while
the result of rbd_segment_length() is u64. This change makes
this range trunctation explicit, and trips a bug if the the
segment boundary is too far off.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
069a4b5690a952e74157fd489833c71c73f012b3 22-Oct-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_device->rbd_opts

The rbd_device structure has an embedded rbd_options structure.
Such a structure is needed to work with the generic ceph argument
parsing code, but there's no need to keep it around once argument
parsing is done.

Use a local variable to hold the rbd options used in parsing in
rbd_get_client(), and just transfer its content (it's just a
read_only flag) into the field in the rbd_mapping sub-structure
that requires that information.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
e5cfeed281a842a37c9da84bad2911c9b470347e 21-Oct-2012 Alex Elder <elder@inktank.com> rbd: simplify rbd_merge_bvec()

The aim of this patch is to make what's going on rbd_merge_bvec() a
bit more obvious than it was before. This was an issue when a
recent btrfs bug led us to question whether the merge function was
working correctly.

Use "obj" rather than "chunk" to indicate the units whose boundaries
we care about we call (rados) "objects".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
d4b125e9eb43babd14538ba61718e3db71a98d29 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: increase maximum snapshot name length

Change RBD_MAX_SNAP_NAME_LEN to be based on NAME_MAX. That is a
practical limit for the length of a snapshot name (based on the
presence of a directory using the name under /sys/bus/rbd to
represent the snapshot).

The /sys entry is created by prefixing it with "snap_"; define that
prefix symbolically, and take its length into account in defining
the snapshot name length limit.

Enforce the limit in rbd_add_parse_args(). Also delete a dout()
call in that function that was not meant to be committed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
db2388b6ee40a949084e4cdddc3b0a4357068a62 21-Oct-2012 Alex Elder <elder@inktank.com> rbd: verify rbd image order value

This adds a verification that an rbd image's object order is
within the upper and lower bounds supported by this implementation.

It must be at least 9 (SECTOR_SHIFT), because the Linux bio system
assumes that minimum granularity.

It also must be less than 32 (at the moment anyway) because there
exist spots in the code that store the size of a "segment" (object
backing an rbd image) in a signed int variable, which can be 32 bits
including the sign. We should be able to relax this limit once
we've verified the code uses 64-bit types where needed.

Note that the CLI tool already limits the order to the range 12-25.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4634246db8cb2e5117ef7c682efcc383fa3354f8 11-Oct-2012 Alex Elder <elder@inktank.com> rbd: consolidate rbd_do_op() calls

The two calls to rbd_do_op() from rbd_rq_fn() differ only in the
value passed for the snapshot id and the snapshot context.

For reads the snapshot always comes from the mapping, and for writes
the snapshot id is always CEPH_NOSNAP.

The snapshot context is always null for reads. For writes, the
snapshot context always comes from the rbd header, but it is
acquired under protection of header semaphore and could change
thereafter, so we can't simply use what's available inside
rbd_do_op().

Eliminate the snapid parameter from rbd_do_op(), and set it
based on the I/O direction inside that function instead. Always
pass the snapshot context acquired in the caller, but reset it
to a null pointer inside rbd_do_op() if the operation is a read.

As a result, there is no difference in the read and write calls
to rbd_do_op() made in rbd_rq_fn(), so just call it unconditionally.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ff2e4bb5b32f89c455979a4222a9b78007cde254 11-Oct-2012 Alex Elder <elder@inktank.com> rbd: drop rbd_do_op() opcode and flags

The only callers of rbd_do_op() are in rbd_rq_fn(), where call one
is used for writes and the other used for reads. The request passed
to rbd_do_op() already encodes the I/O direction, and that
information can be used inside the function to set the opcode and
flags value (rather than passing them in as arguments).

So get rid of the opcode and flags arguments to rbd_do_op().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
13f4042c05b6a1a638ccab3f0cabdb84993803a2 11-Oct-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_req_{read,write}()

Both rbd_req_read() and rbd_req_write() are simple wrapper routines
for rbd_do_op(), and each is only called once. Replace each wrapper
call with a direct call to rbd_do_op(), and get rid of the wrapper
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
be466c1cc36621590ef17b05a6d342dfd33f7280 22-Oct-2012 Alex Elder <elder@inktank.com> rbd: fix read-only option name

The name of the "read-only" mapping option was inadvertently changed
in this commit:

f84344f3 rbd: separate mapping info in rbd_dev

Revert that hunk to return it to what it should be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a0ea3a40fd20b8c66381f747c454f89d6d1f50d4 11-Oct-2012 Alex Elder <elder@inktank.com> rbd: zero return code in rbd_dev_image_id()

When rbd_dev_probe() calls rbd_dev_image_id() it expects to get
a 0 return code if successful, but it is getting a positive value.

The reason is that rbd_dev_image_id() returns the value it gets from
rbd_req_sync_exec(), which returns the number of bytes read in as a
result of the request. (This ultimately comes from
ceph_copy_from_page_vector() in rbd_req_sync_op()).

Force the return value to 0 when successful in rbd_dev_image_id().
Do the same in rbd_dev_v2_object_prefix().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
b213e0b1a62637b2a9395a34349b13d73ca2b90a 11-Oct-2012 Alex Elder <elder@inktank.com> rbd: fix bug in rbd_dev_id_put()

In rbd_dev_id_put(), there's a loop that's intended to determine
the maximum device id in use. But it isn't doing that at all,
the effect of how it's written is to simply use the just-put id
number, which ignores whole purpose of this function.

Fix the bug.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
35152979e6181b1fbb4b61c3ff641c14df53ad66 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: activate v2 image support

Now that v2 images support is fully implemented, have
rbd_dev_v2_probe() return 0 to indicate a successful probe.

(Note that an image that implements layering will fail
the probe early because of the feature chekc.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d889140c4a1c5edb6a7bd90392b9d878bfaccfb6 09-Oct-2012 Alex Elder <elder@inktank.com> rbd: implement feature checks

Version 2 images have two sets of feature bit fields. The first
indicates features possibly used by the image. The second indicates
features that the client *must* support in order to use the image.

When an image (or snapshot) is first examined, we need to make sure
that the local implementation supports the image's required
features. If not, fail the probe for the image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
117973fb4c91f3fd913127577e9f71b3aa6cb556 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: define rbd_dev_v2_refresh()

Define a new function rbd_dev_v2_refresh() to update/refresh the
snapshot context for a format version 2 rbd image. This function
will update anything that is not fixed for the life of an rbd
image--at the moment this is mainly the snapshot context and (for
a base mapping) the size.

Update rbd_refresh_header() so it selects which function to use
based on the image format.

Rename __rbd_refresh_header() to be rbd_dev_v1_refresh()
to be consistent with the naming of its version 2 counterpart.
Similarly rename rbd_refresh_header() to be rbd_dev_refresh().

Unrelated--we use rbd_image_format_valid() here. Delete the other
use of it, which was primarily put in place to ensure that function
was referenced at the time it was defined.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9478554ae5d21d65e948a3eff4ee2a8ad30d70e9 09-Oct-2012 Alex Elder <elder@inktank.com> rbd: define rbd_update_mapping_size()

Encapsulate the code that handles updating the size of a mapping
after an rbd image has been refreshed. This is done in anticipation
of the next patch, which will make this common code for format 1 and
2 images.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6cae3717cddaf8e5e96e304733dca66e40d56f89 25-Sep-2012 Sage Weil <sage@inktank.com> rbd: BUG on invalid layout

This shouldn't actually be possible because the layout struct is
constructed from the RBD header and validated then.

[elder@inktank.com: converted BUG() call to equivalent rbd_assert()]

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
6e14b1a6c3b8d7e48ece68733d2dac0464611ee4 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: update remaining header fields for v2

There are three fields that are not yet updated for format 2 rbd
image headers: the version of the header object; the encryption
type; and the compression type. There is no interface defined for
fetching the latter two, so just initialize them explicitly to 0 for
now.

Change rbd_dev_v2_snap_context() so the caller can be supplied the
version for the header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b8b1e2db52de61f575981d0c23da785a7c5b4a77 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: get snapshot name for a v2 image

Define rbd_dev_v2_snap_name() to fetch the name for a particular
snapshot in a format 2 rbd image.

Define rbd_dev_v2_snap_info() to to be a wrapper for getting the
name, size, and features for a particular snapshot, using an
interface that matches the equivalent function for version 1 images.

Define rbd_dev_snap_info() wrapper function and use it to call the
appropriate function for getting the snapshot name, size, and
features, dependent on the rbd image format.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
35d489f94651ac19be55661732a7ee15c6304a55 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: get the snapshot context for a v2 image

Fetch the snapshot context for an rbd format 2 image by calling
the "get_snapcontext" method on its header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b1b5402aa9c4a9aeb8431886e41b0a1d127318d1 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: get image features for a v2 image

The features values for an rbd format 2 image are fetched from the
server using a "get_features" method. The same method is used for
getting the features for a snapshot, so structure this addition with
a generic helper routine that can get this information for either.

The server will provide two 64-bit feature masks, one representing
the features potentially in use for this image (or its snapshot),
and one representing features that must be supported by the client
in order to work with the image.

For the time being, neither of these is really used so we keep
things simple and just record the first feature vector. Once we
start using these feature masks, what we record and what we expose
to the user will most likely change.

Signed-off-by: Alex Elder <elder@inktank.com>
1e1301998ee80d9a8cc09297906293f16f8a6064 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: get the object prefix for a v2 rbd image

The object prefix of an rbd format 2 image is fetched from the
server using a "get_object_prefix" method.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9d475de5d12af8ac4c2101807e0a889ac7389c5a 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: add code to get the size of a v2 rbd image

The size of an rbd format 2 image is fetched from the server using a
"get_size" method. The same method is used for getting the size of
a snapshot, so structure this addition with a generic helper routine
that we can get this information for either.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a30b71b999c92071befec73434f4e67fd4b4734b 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: lay out header probe infrastructure

This defines a new function rbd_dev_probe() as a top-level
function for populating detailed information about an rbd device.

It first checks for the existence of a format 2 rbd image id object.
If it exists, the image is assumed to be a format 2 rbd image, and
another function rbd_dev_v2() is called to finish populating
header data for that image. If it does not exist, it is assumed to
be an old (format 1) rbd image, and calls a similar function
rbd_dev_v1() to populate its header information.

A new field, rbd_dev->format, is defined to record which version
of the rbd image format the device represents. For a valid mapped
rbd device it will have one of two values, 1 or 2.

So far, the format 2 images are not really supported; this is
laying out the infrastructure for fleshing out that support.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cd892126c617b3837b6088bf6c097ad2def4de83 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: encapsulate code that gets snapshot info

Create a function that encapsulates looking up the name, size and
features related to a given snapshot, which is indicated by its
index in an rbd device's snapshot context array of snapshot ids.

This interface will be used to hide differences between the format 1
and format 2 images.

At the moment this (looking up the name anyway) is slightly less
efficient than what's done currently, but we may be able to optimize
this a bit later on by cacheing the last lookup if it proves to be a
problem.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
34b131849feb359f183907b467e9aa4d652b1baa 14-Jul-2012 Alex Elder <elder@inktank.com> rbd: add an rbd features field

Record the features values for each rbd image and each of its
snapshots. This is really something that only becomes meaningful
for version 2 images, so this is just putting in place code
that will form common infrastructure.

It may be useful to expand the sysfs entries--and therefore the
information we maintain--for the image and for each snapshot.
But I'm going to hold off doing that until we start making
active use of the feature bits.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c8d184250d8a47b1a958affcffe3ffdd85644301 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: don't use index in __rbd_add_snap_dev()

Pass the snapshot id and snapshot size rather than an index
to __rbd_add_snap_dev() to specify values for a new snapshot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
02cdb02ceab1f3dd9ac2bc899fc51f0e0e744782 10-Aug-2012 Alex Elder <elder@inktank.com> rbd: kill create_snap sysfs entry

Josh proposed the following change, and I don't think I could
explain it any better than he did:

From: Josh Durgin <josh.durgin@inktank.com>
Date: Tue, 24 Jul 2012 14:22:11 -0700
To: ceph-devel <ceph-devel@vger.kernel.org>
Message-ID: <500F1203.9050605@inktank.com>

Right now the kernel still has one piece of rbd management
duplicated from the rbd command line tool: snapshot creation.
There's nothing special about snapshot creation that makes it
advantageous to do from the kernel, so I'd like to remove the
create_snap sysfs interface. That is,
/sys/bus/rbd/devices/<id>/create_snap
would be removed.

Does anyone rely on the sysfs interface for creating rbd
snapshots? If so, how hard would it be to replace with:

rbd snap create pool/image@snap

Is there any benefit to the sysfs interface that I'm missing?

Josh

This patch implements this proposal, removing the code that
implements the "snap_create" sysfs interface for rbd images.
As a result, quite a lot of other supporting code goes away.

Suggested-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
589d30e0b3e649e2660f9a67be88e235b28bc319 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: define rbd_dev_image_id()

New format 2 rbd images are permanently identified by a unique image
id. Each rbd image also has a name, but the name can be changed.
A format 2 rbd image will have an object--whose name is based on the
image name--which maps an image's name to its image id.

Create a new function rbd_dev_image_id() that checks for the
existence of the image id object, and if it's found, records the
image id in the rbd_device structure.

Create a new rbd device attribute (/sys/bus/rbd/<num>/image_id) that
makes this information available.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f8d4de6e1c939d56f1ee0a21ad677401846f990c 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: support data returned from OSD methods

An OSD object method call can be made using rbd_req_sync_exec().
Until now this has only been used for creating a new RBD snapshot,
and that has only required sending data out, not receiving anything
back from the OSD.

We will now need to get data back from an OSD on a method call, so
add parameters to rbd_req_sync_exec() that allow a buffer into which
returned data should be placed to be specified, along with its size.

Previously, rbd_req_sync_exec() passed a null pointer and zero
size to rbd_req_sync_op(); change this so the new inbound buffer
information is provided instead.

Rename the "buf" and "len" parameters in rbd_req_sync_op() to
make it more obvious they are describing inbound data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3cb4a687c72bd16c95f514933d68884eacac4e4e 26-Jun-2012 Alex Elder <elder@inktank.com> rbd: pass flags to rbd_req_sync_exec()

In order to allow both read requests and write requests to be
initiated using rbd_req_sync_exec(), add an OSD flags value
which can be passed down to rbd_req_sync_op(). Rename the "data"
and "len" parameters to be more clear that they represent data
that is outbound.

At this point, this function is still only used (and only works) for
write requests.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3ee4001e0c875ce8ebcdf5ea305e9a105b3687bd 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: set up watch before announcing disk

We're ready to handle header object (refresh) events at the point we
call rbd_bus_add_dev(). Set up the watch request on the rbd image
header just after that, and after we've registered the devices for
the snapshots for the initial snapshot context. Do this before
announce the disk as available for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12f029448c3d73e0f30bc5aee5964442aa95c0f4 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: set initial capacity in rbd_init_disk()

Move the setting of the initial capacity for an rbd image mapping
into rb_init_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
86ff77bb68c6cda783b195a260f68fd5d32f7aaf 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: drop dev registration check for new snap

By the time rbd_dev_snaps_register() gets called during rbd device
initialization, the main device will have already been registered.
Similarly, a header refresh will only occur for an rbd device whose
Linux device is registered. There is therefore no need to verify
the main device is registered when registering a snapshot device.

For the time being, turn the check into a WARN_ON(), but it can
eventually just go away.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0f308a3188b37f36bc5a078f5fe039a41714476e 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: call rbd_init_disk() sooner

Call rbd_init_disk() from rbd_add() as soon as we have the major
device number for the mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
85ae8926751db5e09b9a12ee44609ee9e74b7aad 27-Jul-2012 Alex Elder <elder@inktank.com> rbd: defer setting device id

Hold off setting the device id and formatting the device name
in rbd_add() until just before it's needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
05fd6f6f8c7b07e746d513e4cf862675b70aac59 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: read the header before registering device

Read the rbd header information and call rbd_dev_set_mapping()
earlier--before registering the block device or setting up the sysfs
entries for the image. The sysfs entries provide users access to
some information that's only available after doing the rbd header
initialization, so this will make sure it's valid right away.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
5ed1617731a1e9201c3541a9c05ce3ec73975589 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: call set_snap() before snap_devs_update()

rbd_header_set_snap() is a simple initialization routine for an rbd
device's mapping. It has to be called after the snapshot context
for the rbd_dev has been updated, but can be done before snapshot
devices have been registered.

Change the name to rbd_dev_set_mapping() to better reflect its
purpose, and call it a little sooner, before registering snapshot
devices.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
304f68086f8206da7c5930a9cb0207c91d1983a6 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: defer registering snapshot devices

When a new snapshot is found in an rbd device's updated snapshot
context, __rbd_add_snap_dev() is called to create and insert an
entry in the rbd devices list of snapshots. In addition, a Linux
device is registered to represent the snapshot.

For version 2 rbd images, it will be undesirable to initialize the
device right away. So in anticipation of that, this patch separates
the insertion of a snapshot entry in the snaps list from the
creation of devices for those snapshots.

To do this, create a new function rbd_dev_snaps_register() which
traverses the list of snapshots and calls rbd_register_snap_dev()
on any that have not yet been registered.

Rename rbd_dev_snap_devs_update() to be rbd_dev_snaps_update()
to better reflect that only the entry in the snaps list and not
the snapshot's device is affected by the function.

For now, call rbd_dev_snaps_register() immediately after each
call to rbd_dev_snaps_update().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3fcf2581c2c3c910aa46f6d205e502a97243ca2c 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: assign header name later

Move the assignment of the header name for an rbd image a bit later,
outside rbd_add_parse_args() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e86924a8092fda66b859f12a4d7d37a4a458d74a 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: use snaps list in rbd_snap_by_name()

An rbd_dev structure maintains a list of current snapshots that have
already been fully initialized. The entries on the list have type
struct rbd_snap, and each entry contains a copy of information
that's found in the rbd_dev's snapshot context and header.

The only caller of snap_by_name() is rbd_header_set_snap(). In that
call site any positive return value (the index in the snapshot
array) is ignored, so there's no need to return the index in
the snapshot context's id array when it's found.

rbd_header_set_snap() also has only one caller--rbd_add()--and that
call is made after a call to rbd_dev_snap_devs_update(). Because
the rbd_snap structures are initialized in that function, the
current snapshot list can be used instead of the snapshot context to
look up a snapshot's information by name.

Change snap_by_name() so it uses the snapshot list rather than the
rbd_dev's snapshot context in looking up snapshot information.
Return 0 if it's found rather than the snapshot id.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cd789ab9cacbda1aad43304b89cff29004b793ea 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: don't register snapshots in bus_add_dev()

When rbd_bus_add_dev() is called (one spot--in rbd_add()), the rbd
image header has not even been read yet. This means that the list
of snapshots will be empty at the time of the call. As a result,
there is no need for the code that calls rbd_register_snap_dev()
for each entry in that list--so get rid of it.

Once the header has been read (just after returning), a call will
be made to rbd_dev_snap_devs_update(), which will then find every
snapshot in the context to be new and will therefore call
rbd_register_snap_dev() via __rbd_add_snap_dev() accomplishing
the same thing.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4bb1f1ed0063870f34ae5783cda08924964bac0b 24-Aug-2012 Alex Elder <elder@inktank.com> rbd: move locking out of rbd_header_set_snap()

Move the calls to get the header semaphore out of
rbd_header_set_snap() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1fcdb8aa1f58af72eb8206ba97fab2df77df2b14 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: simplify rbd_init_disk() a bit

This just simplifies a few things in rbd_init_disk(), now that the
previous patch has moved a bunch of initialization code out if it.
Done separately to facilitate review.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2ac4e75d89e9df8eea6390a759eac2b6df0ebff6 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: do some header initialization earlier

Move some of the code that initializes an rbd header out of
rbd_init_disk() and into its caller.

Move the code at the end of rbd_init_disk() that sets the device
capacity and activates the Linux device out of that function and
into the caller, ensuring we still have the disk size available
where we need it.

Update rbd_free_disk() so it still aligns well as an inverse of
rbd_init_disk(), moving the rbd_header_free() call out to its
caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8836b995fd192dba23d312d2a4fba68dd8ca7183 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: simplify snap_by_name() interface

There is only one caller of snap_by_name(), and it passes two values
to be assigned, both of which are found within an rbd device
structure.

Change the interface so it just passes the address of the rbd_dev,
and make the assignments to its fields directly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4e1105a299adf7ac421d42a8be05205f51610f3c 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: set mapping name with the rest

With the exception of the snapshot name, all of the mapping-specific
fields in an rbd device structure are set in rbd_header_set_snap().

Pass the snapshot name to be assigned into rbd_header_set_snap()
to keep all of the mapping assignments together.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3feeb8946739d980fb0922bf68363552a493a49c 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: return snap name from rbd_add_parse_args()

This is the first of two patches aimed at isolating the code that
sets the mapping information into a single spot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
99c1f08f6459cfa6fe1f5fb68706b437e006be2e 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: record mapped size

Add the size of the mapped image to the set of mapping-specific
fields in an rbd_device, and use it when setting the capacity of the
disk.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
f84344f334df8f1d41eba7cfa7eb1024da25e1fe 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: separate mapping info in rbd_dev

Several fields in a struct rbd_dev are related to what is mapped, as
opposed to the actual base rbd image. If the base image is mapped
these are almost unneeded, but if a snapshot is mapped they describe
information about that snapshot.

In some contexts this can be a little bit confusing. So group these
mapping-related field into a structure to make it clear what they
are describing.

This also includes a minor change that rearranges the fields in the
in-core image header structure so that invariant fields are at the
top, followed by those that change.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
c9aadfe7860f83ee17e55fe17398f3fe948a0a84 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_image_header->total_snaps

The "total_snaps" field in an rbd header structure is never any
different from the value of "num_snaps" stored within a snapshot
context. Avoid any confusion by just using the value held within
the snapshot context, and get rid of the "total_snaps" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
98cec111c08d8d52168e28648b4a1c2f5011cd70 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_dev->q

A copy of rbd_dev->disk->queue is held in rbd_dev->q, but it's
never actually used. So get just get rid of the field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9fcbb80024795dc1a282f11af593ae5aa3d1b67e 24-Aug-2012 Alex Elder <elder@inktank.com> rbd: rename __rbd_init_snaps_header()

The name __rbd_init_snaps_header() doesn't really convey what that
function does very well. Its purpose is to scan a new snapshot
context and either create or destroy snapshot device entries so
that local host's view is consistent with the reality maintained
on the OSDs. This patch just changes the name of this function,
to be rbd_dev_snap_devs_update(). Still not perfect, but I think
better.

Also add some dynamic debug statements to this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
e28393082dd3991156d12a9e64b9584cef28fe25 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: rename rbd_id_get()

This should have been done as part of this commit:

commit de71a2970d57463d3d965025e33ec3adcf391248
Author: Alex Elder <elder@inktank.com>
Date: Tue Jul 3 16:01:19 2012 -0500
rbd: rename rbd_device->id

rbd_id_get() is assigning the rbd_dev->dev_id field. Change the
name of that function as well as rbd_id_put() and rbd_id_max
to reflect what they are affecting.

Add some dynamic debug statements related to rbd device id activity.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
aafb230ebc3bcdbbd1781f56e482ec75f7f1f263 06-Sep-2012 Alex Elder <elder@inktank.com> rbd: define rbd_assert()

Define rbd_assert() and use it in place of various BUG_ON() calls
now present in the code. By default assertion checking is enabled;
we want to do this differently at some point.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
65ccfe21dd8fb402547bb1c50bbc2737c4ef37b8 09-Aug-2012 Alex Elder <elder@inktank.com> rbd: split up rbd_get_segment()

There are two places where rbd_get_segment() is called. One, in
rbd_rq_fn(), only needs to know the length within a segment that an
I/O request should be. The other, in rbd_do_op(), also needs the
name of the object and the offset within it for the I/O request.

Split out rbd_segment_name() into three dedicated functions:
- rbd_segment_name() allocates and formats the name of the
object for a segment containing a given rbd image offset
- rbd_segment_offset() computes the offset within a segment for
a given rbd image offset
- rbd_segment_length() computes the length to use for I/O within
a segment for a request, not to exceed the end of a segment
object.

In the new functions be a bit more careful, checking for possible
error conditions:
- watch for errors or overflows returned by snprintf()
- catch (using BUG_ON()) potential overflow conditions
when computing segment length

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
df111be6310fc41d059a485368e3c51a684859c2 09-Aug-2012 Alex Elder <elder@inktank.com> rbd: check for overflow in rbd_get_num_segments()

It is possible in rbd_get_num_segments() for an overflow to occur
when adding the offset and length. This is easily avoided.

Since the function returns an int and the one caller is already
prepared to handle errors, have it return -ERANGE if overflow would
occur.

The overflow check would not work if a zero-length request was
being tested, so short-circuit that case, returning 0 for the
number of segments required. (This condition might be avoided
elsewhere already, I don't know.)

Have the caller end the request if either an error or 0 is returned.
The returned value is passed to __blk_end_request_all(), meaning
a 0 length request is not treated an error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
38f5f65e9d25b0a7270c337a35c724ca3d56f4d8 09-Aug-2012 Alex Elder <elder@inktank.com> rbd: drop needless test in rbd_rq_fn()

There's a test for null rq pointer inside the while loop in
rbd_rq_fn() that's not needed. That same test already occurred
in the immediatly preceding loop condition test.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
542582fce1700c01b12e7945aaf173074e008e3e 09-Aug-2012 Alex Elder <elder@inktank.com> rbd: bio_chain_clone() cleanups

In bio_chain_clone(), at the end of the function the bi_next field
of the tail of the new bio chain is nulled. This isn't necessary,
because if "tail" is non-null, its value will be the last bio
structure allocated at the top of the while loop in that function.
And before that structure is added to the end of the new chain, its
bi_next pointer is always made null.

While touching that function, clean a few other things:
- define each local variable on its own line
- move the definition of "tmp" to an inner scope
- move the modification of gfpmask closer to where it's used
- rearrange the logic that sets the chain's tail pointer

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
84d34dcc116e117a41c6fc8be13430529fc2d9e7 10-Aug-2012 Alex Elder <elder@inktank.com> rbd: kill notify_timeout option

The "notify_timeout" rbd device option is never used, so get rid of
it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
cc0538b62c839c2df7b9f8378bb37e3b35faa608 10-Aug-2012 Alex Elder <elder@inktank.com> rbd: add read_only rbd map option

Add the ability to map an rbd image read-only, by specifying either
"read_only" or "ro" as an option on the rbd "command line." Also
allow the inverse to be explicitly specified using "read_write" or
"rw".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
f8c3892911145db7cf895cc26f53ad73dd4e7b1f 10-Aug-2012 Alex Elder <elder@inktank.com> rbd: move rbd_opts to struct rbd_device

The rbd options don't really apply to the ceph client. So don't
store a pointer to it in the ceph_client structure, and put them
(a struct, not a pointer) into the rbd_dev structure proper.

Pass the rbd device structure to rbd_client_create() so it can
assign rbd_dev->rbdc if successful, and have it return an error code
instead of the rbd client pointer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
621901d652db10ad8ceddd25dd4b883978a873e1 24-Aug-2012 Alex Elder <elder@inktank.com> rbd: more cleanup in rbd_header_from_disk()

This just rearranges things a bit more in rbd_header_from_disk()
so that the snapshot sizes are initialized right after the buffer
to hold them is allocated and doing a little further consolidation
that follows from that. Also adds a few simple comments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
f785cc1dbe90b561b8ded92df4fe9732bdc54859 24-Aug-2012 Alex Elder <elder@inktank.com> rbd: kill incore snap_names_len

The only thing the on-disk snap_names_len field is needed is to
size the buffer allocated to hold a copy of the snapshot names
for an rbd image.

So don't bother saving it in the in-core rbd_image_header structure.
Just use a local variable to hold the required buffer size while
it's needed.

Move the code that actually copies the snapshot names up closer
to where the required length is saved.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
58c17b0e1b2278824aedc5d1201f6a43a38d6a48 24-Aug-2012 Alex Elder <elder@inktank.com> rbd: don't over-allocate space for object prefix

In rbd_header_from_disk() the object prefix buffer is sized based on
the maximum size it's block_name equivalent on disk could be.

Instead, only allocate enough to hold null-terminated string from
the on-disk header--or the maximum size of no NUL is found.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
1f7ba3311530993801d6877889efff0382bcd641 10-Aug-2012 Alex Elder <elder@inktank.com> rbd: handle locking inside __rbd_client_find()

There is only caller of __rbd_client_find(), and it somewhat
clumsily gets the appropriate lock and gets a reference to the
existing ceph_client structure if it's found.

Instead, have that function handle its own locking, and acquire the
reference if found while it holds the lock. Drop the underscores
from the name because there's no need to signify anything special
about this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
523f32582f30768ab4e56b71b276fc1ea71f48cc 30-Aug-2012 Alex Elder <elder@inktank.com> rbd: add new snapshots at the tail

This fixes a bug that went in with this commit:

commit f6e0c99092cca7be00fca4080cfc7081739ca544
Author: Alex Elder <elder@inktank.com>
Date: Thu Aug 2 11:29:46 2012 -0500
rbd: simplify __rbd_init_snaps_header()

The problem is that a new rbd snapshot needs to go either after an
existing snapshot entry, or at the *end* of an rbd device's snapshot
list. As originally coded, it is placed at the beginning. This was
based on the assumption the list would be empty (so it wouldn't
matter), but in fact if multiple new snapshots are added to an empty
list in one shot the list will be non-empty after the first one is
added.

This addresses http://tracker.newdream.net/issues/3063

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
843a0d0879935742bb7270c9dc8d94abb8b39cee 01-Sep-2012 Alex Elder <elder@inktank.com> rbd: rename block_name -> object_prefix

In the on-disk image header structure there is a field "block_name"
which represents what we now call the "object prefix" for an rbd
image. Rename this field "object_prefix" to be consistent with
modern usage.

This appears to be the only remaining vestige of the use of "block"
in symbols that represent objects in the rbd code.

This addresses http://tracker.newdream.net/issues/1761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
4156d998409be065aa8141b6bd2c6f18be1b21e9 02-Aug-2012 Alex Elder <elder@inktank.com> rbd: separate reading header from decoding it

Right now rbd_read_header() both reads the header object for an rbd
image and decodes its contents. It does this repeatedly if needed,
in order to ensure a complete and intact header is obtained.

Separate this process into two steps--reading of the raw header
data (in new function, rbd_dev_v1_header_read()) and separately
decoding its contents (in rbd_header_from_disk()). As a result,
the latter function no longer requires its allocated_snaps argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
103a150f0cc57576b1c4b80bf07af60a14349eee 02-Aug-2012 Alex Elder <elder@inktank.com> rbd: expand rbd_dev_ondisk_valid() checks

Add checks on the validity of the snap_count and snap_names_len
field values in rbd_dev_ondisk_valid(). This eliminates the
need to do them in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
28cb775de1bd1bcc62c43f767ab81b7b9cfb6678 27-Jul-2012 Alex Elder <elder@inktank.com> rbd: return earlier in rbd_header_from_disk()

The only caller of rbd_header_from_disk() is rbd_read_header().
It passes as allocated_snaps the number of snapshots it will
have received from the server for the snapshot context that
rbd_header_from_disk() is to interpret. The first time through
it provides 0--mainly to extract the number of snapshots from
the snapshot context header--so that it can allocate an
appropriately-sized buffer to receive the entire snapshot
context from the server in a second request.

rbd_header_from_disk() will not fill in the array of snapshot ids
unless the number in the snapshot matches the number the caller
had allocated.

This patch adjusts that logic a little further to be more efficient.
rbd_read_header() doesn't even examine the snapshot context unless
the snapshot count (stored in header->total_snaps) matches the
number of snapshots allocated. So rbd_header_from_disk() doesn't
need to allocate or fill in the snapshot context field at all in
that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6a52325f61760c6a7d7f3ea9736029bc9f63e7f3 20-Jul-2012 Alex Elder <elder@inktank.com> rbd: rearrange rbd_header_from_disk()

This just moves code around for the most part. It was pulled out as
a separate patch to avoid cluttering up some upcoming patches which
are more substantive. The point is basically to group everything
related to initializing the snapshot context together.

The only functional change is that rbd_header_from_disk() now
ensures the (in-core) header it is passed is zero-filled. This
allows a simpler error handling path in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d2bb24e506596ad0c10e4a7f4b2fca88cc75c0bc 27-Jul-2012 Alex Elder <elder@inktank.com> rbd: use sizeof (object) instead of sizeof (type)

Fix a few spots in rbd_header_from_disk() to use sizeof (object)
rather than sizeof (type). Use a local variable to record sizes
to shorten some lines and improve readability.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d78fd7ae03136c0610bee33eeebb4ffe67c752d5 27-Jul-2012 Alex Elder <elder@inktank.com> rbd: ensure invalid pointers are made null

Fix a number of spots where a pointer value that is known to
have become invalid but was not reset to null.

Also, toss in a change so we use sizeof (object) rather than
sizeof (type).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0f1d3f938527f319d830ef3082c218c77cfd159f 02-Aug-2012 Alex Elder <elder@inktank.com> rbd: make snap_names_len a u64

The snap_names_len field of an rbd_image_header structure is defined
with type size_t. That field is used as both the source and target
of 64-bit byte-order swapping operations though, so it's best to
define it with type u64 instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
3593815022998ab75379f64735ff67b3ea388cbe 02-Aug-2012 Alex Elder <elder@inktank.com> rbd: simplify __rbd_init_snaps_header()

The purpose of __rbd_init_snaps_header() is to compare a new
snapshot context with an rbd device's list of existing snapshots.
It updates the list by adding any new snapshots or removing any
that are not present in the new snapshot context.

The code as written is a little confusing, because it traverses both
the existing snapshot list and the set of snapshots in the snapshot
context in reverse. This was done based on an assumption about
snapshots that is not true--namely that a duplicate snapshot name
could cause an error in intepreting things if they were not
processed in ascending order.

These precautions are not necessary, because:
- all snapshots are uniquely identified by their snapshot id
- a new snapshot cannot be created if the rbd device has another
snapshot with the same name
(It is furthermore not currently possible to rename a snapshot.)

This patch re-implements __rbd_init_snaps_header() so it passes
through both the existing snapshot list and the entries in the
snapshot context in forward order. It still does the same thing
as before, but I find the logic considerably easier to understand.

By going forward through the names in the snapshot context, there
is no longer a need for the rbd_prev_snap_name() helper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
340c7a2b2c9a2da640af28a8c196356484ac8b50 10-Aug-2012 Alex Elder <elder@inktank.com> rbd: drop dev reference on error in rbd_open()

If a read-only rbd device is opened for writing in rbd_open(), it
returns without dropping the just-acquired device reference.

Fix this by moving the read-only check before getting the reference.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
1fe5e9932156f6122c3b1ff6ba7541c27c86718c 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: create rbd_refresh_helper()

Create a simple helper that handles the common case of calling
__rbd_refresh_header() while holding the ctl_mutex.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
b813623ab95d0b4bbeb22e160bd5461965d0c571 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: return obj version in __rbd_refresh_header()

Add a new parameter to __rbd_refresh_header() through which the
version of the header object is passed back to the caller. In most
cases this isn't needed. The main motivation is to normalize
(almost) all calls to __rbd_refresh_header() so they are all
wrapped immediately by mutex_lock()/mutex_unlock().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ccece235d3737221e7a1118fdbd8474112adac84 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: fixes in rbd_header_from_disk()

This fixes a few issues in rbd_header_from_disk():
- There is a check intended to catch overflow, but it's wrong in
two ways.
- First, the type we don't want to overflow is size_t, not
unsigned int, and there is now a SIZE_MAX we can use for
use with that type.
- Second, we're allocating the snapshot ids and snapshot
image sizes separately (each has type u64; on disk they
grouped together as a rbd_image_header_ondisk structure).
So we can use the size of u64 in this overflow check.
- If there are no snapshots, then there should be no snapshot
names. Enforce this, and issue a warning if we encounter a
header with no snapshots but a non-zero snap_names_len.
- When saving the snapshot names into the header, be more direct
in defining the offset in the on-disk structure from which
they're being copied by using "snap_count" rather than "i"
in the array index.
- If an error occurs, the "snapc" and "snap_names" fields are
freed at the end of the function. Make those fields be null
pointers after they're freed, to be explicit that they are
no longer valid.
- Finally, move the definition of the local variable "i" to the
innermost scope in which it's needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
913d2fdcf60576cbbd82969fcb2bc78a4d59ba33 26-Jun-2012 Alex Elder <elder@inktank.com> rbd: always pass ops array to rbd_req_sync_op()

All of the callers of rbd_req_sync_op() except one pass a non-null
"ops" pointer. The only one that does not is rbd_req_sync_read(),
which passes CEPH_OSD_OP_READ as its "opcode" and, CEPH_OSD_FLAG_READ
for "flags".

By allocating the ops array in rbd_req_sync_read() and moving the
special case code for the null ops pointer into it, it becomes
clear that much of that code is not even necessary.

In addition, the "opcode" argument to rbd_req_sync_op() is never
actually used, so get rid of that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d67d4be56a3ec8d07b1f29aab6095b363085b028 14-Jul-2012 Alex Elder <elder@inktank.com> rbd: pass null version pointer in add_snap()

rbd_header_add_snap() passes the address of a version variable to
rbd_req_sync_exec(), but it ignores the result. Just pass a null
pointer instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
57cfc1060f35ac2345cb37ea474f9644ac5cfd75 26-Jun-2012 Alex Elder <elder@inktank.com> rbd: make rbd_create_rw_ops() return a pointer

Either rbd_create_rw_ops() will succeed, or it will fail because a
memory allocation failed. Have it just return a valid pointer or
null rather than stuffing a pointer into a provided address and
returning an errno.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4e891e0af0f0011c90067373c46d7228568ec079 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: have __rbd_add_snap_dev() return a pointer

It's not obvious whether the snapshot pointer whose address is
provided to __rbd_add_snap_dev() will be assigned by that function.
Change it to return the snapshot, or a pointer-coded errno in the
event of a failure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
070c633f60c23a89c226eb696f4a17b08a164b10 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop "object_name" from rbd_req_sync_unwatch()

rbd_req_sync_unwatch() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already. So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
7f0a24d8552db422640e810414c43579bb3d9fb9 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop "object_name" from rbd_req_sync_notify_ack()

rbd_req_sync_notify_ack() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already. So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
4cb162508afade6d24d58e30be2bbaed80cf84d5 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop "object_name" from rbd_req_sync_notify()

rbd_req_sync_notify() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already. So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0e6f322d550a104b2065288c9f6426d5c0414b76 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop "object_name" from rbd_req_sync_watch()

rbd_req_sync_watch() is only called in one place, and in that place
it passes rbd_dev->header_name as the value of the "object_name"
parameter. This value is available within the function already.

Having the extra parameter leaves the impression the object name
could take on different values, but it does not.

So get rid of the parameter. We can always add it back again if
we find we want to watch some other object in the future.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
14e7085d8460bf45e7145524a13802f1f4f9d81f 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop rbd_dev parameter in snap functions

Both rbd_register_snap_dev() and __rbd_remove_snap_dev() have
rbd_dev parameters that are unused. Remove them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ed63f4fd9a88218ee709e8f57c36c0c5f219a7ad 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop rbd_header_from_disk() gfp_flags parameter

The function rbd_header_from_disk() is only called in one spot, and
it passes GFP_KERNEL as its value for the gfp_flags parameter.

Just drop that parameter and substitute GFP_KERNEL everywhere within
that function it had been used. (If we find we need the parameter
again in the future it's easy enough to add back again.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9a5d690b08478fc2358d885703014853e44a357e 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: snapc is unused in rbd_req_sync_read()

The "snapc" parameter to in rbd_req_sync_read() is not used, so
get rid of it.

Reported-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
de71a2970d57463d3d965025e33ec3adcf391248 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: rename rbd_device->id

The "id" field of an rbd device structure represents the unique
client-local device id mapped to the underlying rbd image. Each rbd
image will have another id--the image id--and each snapshot has its
own id as well. The simple name "id" no longer conveys the
information one might like to have.

Rename the device "id" field in struct rbd_dev to be "dev_id" to
make it a little more obvious what we're dealing with without having
to think more about context.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
8e94af8e2b582e5915abc171a28130881d1c26e4 25-Jul-2012 Alex Elder <elder@inktank.com> rbd: encapsulate header validity test

If an rbd image header is read and it doesn't begin with the
expected magic information, a warning is displayed. This is
a fairly simple test, but it could be extended at some point.
Fix the comparison so it actually looks at the "text" field
rather than the front of the structure.

In any case, encapsulate the validity test in its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bd919d45aa61c19d9ed82548d6deb06bcae31153 14-Jul-2012 Alex Elder <elder@inktank.com> rbd: clean up a few dout() calls

There was a dout() call in rbd_do_request() that was reporting
the reporting the offset as the length and vice versa. While
fixing that I did a quick scan of other dout() calls and fixed
a couple of other minor things.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a05932905695f8c6c06d353ecd52c8e5d607cc77 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: simplify __rbd_remove_all_snaps()

This just replaces a while loop with list_for_each_entry_safe()
in __rbd_remove_all_snaps().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a66f8c97a31fd7b2cfd7b86d4789858dbfbedffb 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop extra header_rwsem init

In commit c666601a there was inadvertently added an extra
initialization of rbd_dev->header_rwsem. This gets rid of the
duplicate.

Reported-by: Guangliang Zhao <gzhao@suse.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9e15dc735a7a0418be14e2deab44ddee369af857 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: kill rbd_image_header->snap_seq

The snap_seq field in an rbd_image_header structure held the value
from the rbd image header when it was last refreshed. We now
maintain this value in the snapc->seq field. So get rid of the
other one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
505cbb9bedc8c609c31d86ff4f8f656e5a0f9c49 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: set snapc->seq only when refreshing header

In rbd_header_add_snap() there is code to set snapc->seq to the
just-added snapshot id. This is the only remnant left of the
use of that field for recording which snapshot an rbd_dev was
associated with. That functionality is no longer supported,
so get rid of that final bit of code.

Doing so means we never actually set snapc->seq any more. On the
server, the snapshot context's sequence value represents the highest
snapshot id ever issued for a particular rbd image. So we'll make
it have that meaning here as well. To do so, set this value
whenever the rbd header is (re-)read. That way it will always be
consistent with the rest of the snapshot context we maintain.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
78dc447d3ca3701206a1dd813c901556a3fad451 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: preserve snapc->seq in rbd_header_set_snap()

In rbd_header_set_snap(), there is logic to make the snap context's
seq field get set to a particular snapshot id, or 0 if there is no
snapshot for the rbd image.

This seems to be an artifact of how the current snapshot id for an
rbd_dev was recorded before the rbd_dev->snap_id field began to be
used for that purpose.

There's no need to update the value of snapc->seq here any more, so
stop doing it. Tidy up a few local variables in that function
while we're at it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
75fe9e19816d6ed3e90f1bd3b741f99bf030e848 19-Jul-2012 Alex Elder <elder@inktank.com> rbd: don't use snapc->seq that way

In what appears to be an artifact of a different way of encoding
whether an rbd image maps a snapshot, __rbd_refresh_header() has
code that arranges to update the seq value in an rbd image's
snapshot context to point to the first entry in its snapshot
array if that's where it was pointing initially.

We now use rbd_dev->snap_id to record the snapshot id--using the
special value CEPH_NOSNAP to indicate the rbd_dev is not mapping a
snapshot at all.

There is therefore no need to check for this case, nor to update the
seq value, in __rbd_refresh_header(). Just preserve the seq value
that rbd_read_header() provides (which, at the moment, is nothing).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
a71b891bc7d77a070e723c8c53d1dd73cf931555 06-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: send header version when notifying

Previously the original header version was sent. Now, we update it
when the header changes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
d1d25646543134d756a02ffe4e02073faa761f2c 05-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: use reference counting for the snap context

This prevents a race between requests with a given snap context and
header updates that free it. The osd client was already expecting the
snap context to be reference counted, since it get()s it in
ceph_osdc_build_request and put()s it when the request completes.

Also remove the second down_read()/up_read() on header_rwsem in
rbd_do_request, which wasn't actually preventing this race or
protecting any other data.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
93a24e084d67ba2fcb9a4c289135825b623ec864 05-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: set image size when header is updated

The image may have been resized.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
a51aa0c042fa39946dd017d5f91a073300a71577 05-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: expose the correct size of the device in sysfs

If an image was mapped to a snapshot, the size of the head version
would be shown. Protect capacity with header_rwsem, since it may
change.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
474ef7ce832d471148f63a9d07f67fc5564834f1 22-Nov-2011 Josh Durgin <josh.durgin@inktank.com> rbd: only reset capacity when pointing to head

Snapshots cannot be resized, and the new capacity of head should not
be reflected by the snapshot.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
e88a36ec961b8c1899c59c5e4ae35a318c0209d3 22-Nov-2011 Josh Durgin <josh.durgin@inktank.com> rbd: return errors for mapped but deleted snapshot

When a snapshot is deleted, the OSD will return ENOENT when reading
from it. This is normally interpreted as a hole by rbd, which will
return zeroes. To minimize the time in which this can happen, stop
requests early when we are notified that our snapshot no longer
exists.

[elder@inktank.com: updated __rbd_init_snaps_header() logic]

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
d1f57ea66369b5c34bd42f104b8070db409447f9 26-Jun-2012 Alex Elder <elder@inktank.com> rbd: kill num_reply parameters

Several functions include a num_reply parameter, but it is never
used. Just get rid of it everywhere--it seems to be something
that never got fully implemented.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
43ae47011232c1e792d77e78db4a7d0ab05032be 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: option symbol renames

Use the name "ceph_opts" consistently (rather than just "opt") for
pointers to a ceph_options structure.

Change the few spots that don't use "rbd_opts" for a rbd_options
pointer to match the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
aded07ea9f7de26e352f679910ac057212ea35e3 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: more symbol renames

Rename variables named "obj" which represent object names so they're
consistently named "object_name".

Rename the "cls" and "method" parameters in rbd_req_sync_exec()
to be "class_name" and "method_name", and make similar changes
to the names of local variables in that function representing
the lengths of those names.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0bed54dc9af4ab75547739a27b64f0fe0aa98756 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: rename some fields in struct rbd_dev

An rbd image is not a single object, but a logical construct made up
of an aggregation of objects.

Rename some fields in struct rbd_dev, in hopes of reinforcing this.
obj --> image_name
obj_len --> image_name_len
obj_md_name --> header_name

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
0ce1a7941341cc63c8352b7df50020e5485bc43a 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: use rbd_dev consistently

Most variables that represent a struct rbd_device are named
"rbd_dev", but in some cases "dev" is used instead. Change all the
"dev" references so they use "rbd_dev" consistently, to make it
clear from the name that we're working with an RBD device (as
opposed to, for example, a struct device). Similarly, change the
name of the "dev" field in struct rbd_notify_info to be "rbd_dev".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
820a5f3e94b9f8ea8c0c6125ce34b237ed67b1dc 10-Jul-2012 Alex Elder <elder@inktank.com> rbd: dynamically allocate snapshot name

There is no need to impose a small limit the length of the snapshot
name recorded for an rbd image in a struct rbd_dev. Remove the
limitation by allocating space for the snapshot name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
bf3e5ae1129ef18a702c14fbaac27aeb2fe25e62 10-Jul-2012 Alex Elder <elder@inktank.com> rbd: dynamically allocate image name

There is no need to impose a small limit the length of the rbd image
name recorded in a struct rbd_dev. Remove the limitation by
allocating space for the image name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
cb8627c76db699e3a085596aa80503fb0973c041 10-Jul-2012 Alex Elder <elder@inktank.com> rbd: dynamically allocate image header name

There is no need to impose a small limit the length of the header
name recorded for an rbd image in a struct rbd_dev. Remove the
limitation by allocating space for the header name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
849b4260d482f7d4be5565b2044901a25f80e2c6 10-Jul-2012 Alex Elder <elder@inktank.com> rbd: dynamically allocate object prefix

There is no need to impose a small limit the length of the object
prefix recorded for an rbd image in a struct rbd_image_header.
Remove the limitation by allocating space for the object prefix
dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
d22f76e703040c2cc4ad13dd7747845b9844d360 12-Jul-2012 Alex Elder <elder@inktank.com> rbd: dynamically allocate pool name

There is no need to impose a small limit the length of the pool name
recorded for an rbd image in a struct rbd_device. Remove the
limitation by allocating space for the pool name ynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
9bb2f334b9b5f2eb6030b7988b7f2302c3115bbb 12-Jul-2012 Alex Elder <elder@inktank.com> rbd: create pool_id device attribute

Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that
provides the id for the pool the rbd image is assocatied with. This
is in addition to the pool name already provided.

Rename the "poolid" field in struct rbd_device to be "pool_id".

Update the documentation to reflect the addition of this new entry.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ca1e49a6afe87ea4e2d3e73e10d1d3c0fad2aa3f 11-Jul-2012 Alex Elder <elder@inktank.com> rbd: rename rbd_dev->block_name

Each rbd image has a name that forms the basis of all data objects
backing the device. Old (format 1) images refer to this name as the
"block name," while new (format 2) images use the term "object
prefix" for this.

Change the field name in the in-core rbd image header structure to
reflect the more modern usage. We intentionally keep the the name
"block_name" in the on-disk definition for format 1 image headers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ea3352f4aa4fc32397d9a535780315e0f2bfee15 10-Jul-2012 Alex Elder <elder@inktank.com> rbd: define dup_token()

Define a new function dup_token(), to be used during argument
parsing for making dynamically-allocated copies of tokens being
parsed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
ad4f232f28e8059fa1de51f3127d8a6a2759ef16 03-Jul-2012 Alex Elder <elder@inktank.com> rbd: drop a useless local variable

In rbd_req_sync_notify_ack(), a local variable was needlessly being
used to hold a null pointer. Just pass NULL instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
6a3ca4f18873f950895cb64ddefafb51a732e3f7 06-Jun-2012 Dan Carpenter <dan.carpenter@oracle.com> rbd: endian bug in rbd_req_cb()

Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16

These are set in osd_req_encode_op() and they are le16.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 895cfcc810e53d7d36639969c71efb9087221167)
236df3755d944c33120d77e84b5ff6f3917eb307 06-Jun-2012 Yan, Zheng <zheng.z.yan@intel.com> rbd: Fix ceph_snap_context size calculation

ceph_snap_context->snaps is an u64 array

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit f9f9a1904467816452fc70740165030e84c2c659)
895cfcc810e53d7d36639969c71efb9087221167 06-Jun-2012 Dan Carpenter <dan.carpenter@oracle.com> rbd: endian bug in rbd_req_cb()

Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16

These are set in osd_req_encode_op() and they are le16.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
f9f9a1904467816452fc70740165030e84c2c659 06-Jun-2012 Yan, Zheng <zheng.z.yan@intel.com> rbd: Fix ceph_snap_context size calculation

ceph_snap_context->snaps is an u64 array

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
263c6ca007a6693fb724a24c5a55716c49d33573 05-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: rename __rbd_update_snaps to __rbd_refresh_header

This function rereads the entire header and handles any changes in
it, not just changes in snapshots.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
3591538fb272d2432d112d47d7e0ddd0be4cded2 06-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: fix snapshot size type

Snapshot sizes should be the same type as regular image sizes. This
only affects their displayed size in sysfs, not the reported size of
an actual block device sizes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
b06e6a6be796bc365a19b0ac5176b553c13abf2f 22-Nov-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: remove conditional snapid parameters

The snapid parameters passed to rbd_do_op() and rbd_req_sync_op()
are now always either a valid snapid or an explicit CEPH_NOSNAP.

[elder@dreamhost.com: Rephrased the description]

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
77dfe99fe3cb0b2b0545e19e2d57b7a9134ee3c0 21-Nov-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: store snapshot id instead of index

When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
403f24d3d51760a8b9368d595fa5f48c309f1a0f 05-Dec-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: protect read of snapshot sequence number

This is updated whenever a snapshot is added or deleted, and the
snapc pointer is changed with every refresh of the header.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
50f7c4c967d0b5acd8e7ba6ab654dc4a7ac869ac 20-Apr-2012 Xi Wang <xi.wang@gmail.com> rbd: fix integer overflow in rbd_header_from_disk()

ondisk->snap_count is read from disk via rbd_req_sync_read() and thus
needs validation. Otherwise, a bogus `snap_count' could overflow the
kmalloc() size, leading to memory corruption.

Also use `u32' consistently for `snap_count'.

[elder@dreamhost.com: changed to use UINT_MAX rather than ULONG_MAX]

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
f8ad495a8a0277b88c59bf38319e5e944aaf5a4a 20-Apr-2012 Dan Carpenter <dan.carpenter@oracle.com> rbd: use gfp_flags parameter in rbd_header_from_disk()

We should use the gfp_flags that the caller specified instead of
GFP_KERNEL here.

There is only one caller and it uses GFP_KERNEL, so this change is
just a cleanup and doesn't change how the code works.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
3469ac1aa3a2f1e2586a412923c414779a0af854 08-May-2012 Sage Weil <sage@inktank.com> ceph: drop support for preferred_osd pgs

This was an ill-conceived feature that has been removed from Ceph. Do
this gracefully:

- reject attempts to specify a preferred_osd via the ioctl
- stop exposing this information via virtual xattrs
- always fill in -1 for requests, in case we talk to an older server
- don't calculate preferred_osd placements/pgids

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
cd9d9f5df6098c50726200d4185e9e8da32785b3 04-Apr-2012 Alex Elder <elder@dreamhost.com> rbd: don't hold spinlock during messenger flush

A recent change made changes to the rbd_client_list be protected by
a spinlock. Unfortunately in rbd_put_client(), the lock is taken
before possibly dropping the last reference to an rbd_client, and on
the last reference that eventually calls flush_workqueue() which can
sleep.

The problem was flagged by a debug spinlock warning:
BUG: spinlock wrong CPU on CPU#3, rbd/27814

The solution is to move the spinlock acquisition and release inside
rbd_client_release(), which is the spot where it's really needed for
protecting the removal of the rbd_client from the client list.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
c666601a935b94cc0f3310339411b6940de751ba 22-Nov-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: move snap_rwsem to the device, rename to header_rwsem

A new temporary header is allocated each time the header changes, but
only the changed properties are copied over. We don't need a new
semaphore for each header update.

This addresses http://tracker.newdream.net/issues/2174

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
32eec68d2f233e8a6ae1cd326022f6862e2b9ce3 08-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: don't drop the rbd_id too early

Currently an rbd device's id is released when it is removed, but it
is done before the code is run to clean up sysfs-related files (such
as /sys/bus/rbd/devices/1).

It's possible that an rbd is still in use after the rbd_remove()
call has been made. It's essentially the same as an active inode
that stays around after it has been removed--until its final close
operation. This means that the id shows up as free for reuse at a
time it should not be.

The effect of this was seen by Jens Rehpoehler, who:
- had a filesystem mounted on an rbd device
- unmapped that filesystem (without unmounting)
- found that the mount still worked properly
- but hit a panic when he attempted to re-map a new rbd device

This re-map attempt found the previously-unmapped id available.
The subsequent attempt to reuse it was met with a panic while
attempting to (re-)install the sysfs entry for the new mapped
device.

Fix this by holding off "putting" the rbd id, until the rbd_device
release function is called--when the last reference is finally
dropped.

Note: This fixes: http://tracker.newdream.net/issues/1907

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
593a9e7b34fa62d703b473ae923a9681556cdf74 07-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: small changes

Here is another set of small code tidy-ups:
- Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
names throughout. Tell the blk_queue system our physical
block size, in the (unlikely) event we want to use something
other than the default.
- Delete the definition of struct rbd_info, which is never used.
- Move the definition of dev_to_rbd() down in its source file,
just above where it gets first used, and change its name to
dev_to_rbd_dev().
- Replace an open-coded operation in rbd_dev_release() to use
dev_to_rbd_dev() instead.
- Calculate the segment size for a given rbd_device just once in
rbd_init_disk().
- Use the '%zd' conversion specifier in rbd_snap_size_show(),
since the value formatted is a size_t.
- Switch to the '%llu' conversion specifier in rbd_snap_id_show().
since the value formatted is unsigned.

Signed-off-by: Alex Elder <elder@dreamhost.com>
00f1f36ffa29a6b8e0529770bce2a44585ab3af0 07-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: do some refactoring

A few blocks of code are rearranged a bit here:
- In rbd_header_from_disk():
- Don't bother computing snap_count until we're sure the
on-disk header starts with a good signature.
- Move a few independent lines of code so they are *after* a
check for a failed memory allocation.
- Get rid of unnecessary local variable "ret".
- Make a few other changes in rbd_read_header(), similar to the
above--just moving things around a bit while preserving the
functionality.
- In rbd_rq_fn(), just assign rq in the while loop's controlling
expression rather than duplicating it before and at the end of
the loop body. This allows the use of "continue" rather than
"goto next" in a number of spots.
- Rearrange the logic in snap_by_name(). End result is the same.

Signed-off-by: Alex Elder <elder@dreamhost.com>
fed4c143ba8f08c8bddfdc7c69738e691a06d565 07-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: fix module sysfs setup/teardown code

Once rbd_bus_type is registered, it allows an "add" operation via
the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
way establishes a connection between the device and rbd_root_dev.
But rbd_root_dev is not registered until after the rbd_bus_type
registration is complete. This could (in principle anyway) result
in an invalid state.

Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
two initializations and never be faced with this scenario.

In addition, unregister the device in the event the bus registration
fails at module init time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
7ef3214af220515b8fe223ec92ec017d2e5607a7 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: don't allocate mon_addrs buffer in rbd_add()

The mon_addrs buffer in rbd_add is used to hold a copy of the
monitor IP addresses supplied via /sys/bus/rbd/add. That is
passed to rbd_get_client(), which never modifies it (nor do
any of the functions it gets passed to thereafter)--the mon_addr
parameter to rbd_get_client() is a pointer to constant data, so it
can't be modifed. Furthermore, rbd_get_client() has the length of
the mon_addrs buffer and that is used to ensure nothing goes beyond
its end.

Based on all this, there is no reason that a buffer needs to
be used to hold a copy of the mon_addrs provided via
/sys/bus/rbd/add. Instead, the location within that passed-in
buffer can be provided, along with the length of the "token"
therein which represents the monitor IP's.

A small change to rbd_add_parse_args() allows the address within the
buffer to be passed back, and the length is already returned. This
now means that, at least from the perspective of this interface,
there is no such thing as a list of monitor addresses that is too
long.

Signed-off-by: Alex Elder <elder@dreamhost.com>
5214ecc45cf9b9f1365b189bcb63e441e3865334 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: have rbd_parse_args() report found mon_addrs size

The argument parsing routine already computes the size of the
mon_addrs buffer it extracts from the "command." Pass it to the
caller so it can use it to provide the length to rbd_get_client().

Signed-off-by: Alex Elder <elder@dreamhost.com>
81a897937827a81107861d50c77b4d04ff8b43a2 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: do a few checks at build time

This is a bit gratuitous, but there are a few things that can be
verified at build time rather than run time, so do that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
e28fff268e7d40ea7a936478c97ce41b6c22815f 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: don't use sscanf() in rbd_add_parse_args()

Make use of a few simple helper routines to parse the arguments
rather than sscanf(). This will treat both missing and too-long
arguments as invalid input (rather than silently truncating the
input in the too-long case). In time this can also be used by
rbd_add() to use the passed-in buffer in place, rather than copying
its contents into new buffers.

It appears to me that the sscanf() previously used would not
correctly handle a supplied snapshot--the two final "%s" conversion
specifications were not separated by a space, and I'm not sure
how sscanf() handles that situation. It may not be well-defined.
So that may be a bug this change fixes (but I didn't verify that).

The sizes of the mon_addrs and options buffers are now passed to
rbd_add_parse_args(), so they can be supplied to copy_token().

Signed-off-by: Alex Elder <elder@dreamhost.com>
a725f65e52de73defb3c7033c471c48c56ca6cdd 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: encapsulate argument parsing for rbd_add()

Move the code that parses the arguments provided to rbd_add() (which
are supplied via /sys/bus/rbd/add) into a separate function.

Also rename the "mon_dev_name" variable in rbd_add() to be
"mon_addrs". The variable represents a list of one or more
comma-separated monitor IP addresses, each with an optional port
number. I think "mon_addrs" captures that notion a little better.

Signed-off-by: Alex Elder <elder@dreamhost.com>
27cc25943fb359241546e7bb7a3ab1c2f35796a2 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: simplify error handling in rbd_add()

If a couple pointers are initialized to NULL then a single
"out_nomem" label can be used for all of the memory allocation
failure cases in rbd_add().

Also, get rid of the "irc" local variable there. There is no
real need for "rc" to be type ssize_t, and it can be used in
the spot "irc" was.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
60571c7d556b10db7e555bd4b6765647af5c41e8 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: reduce memory used for rbd_dev fields

The length of the string containing the monitor address
specification(s) will never exceed the length of the string passed
in to rbd_add(). The same holds true for the ceph + rbd options
string. So reduce the amount of memory allocated for these to
that length rather than the maximum (1024 bytes).

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
d720bcb0a8f246eb441ba9d4f341bc16746556c6 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: have rbd_get_client() return a rbd_client

Since rbd_get_client() currently returns an error code. It assigns
the rbd_client field of the rbd_device structure it is passed if
successful. Instead, have it return the created rbd_client
structure and return a pointer-coded error if there is an error.
This makes the assignment of the client pointer more obvious at the
call site.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
f0f8cef5a30504eaeba5588a9115b46c824d91a3 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: a few simple changes

Here are a few very simple cleanups:
- Add a "RBD_" prefix to the two driver name string definitions.
- Move the definition of struct rbd_request below struct rbd_req_coll
to avoid the need for an empty declaration of the latter.
- Move and group the definitions of rbd_root_dev_release() and
rbd_root_dev, as well as rbd_bus_type and rbd_bus_attrs[],
close to the top of the file. Arrange the latter so
rbd_bus_type.bus_attrs can be initialized statically.
- Get rid of an unnecessary local variable in rbd_open().
- Rework some hokey logic in rbd_bus_add_dev(), so the value of
"ret" at the end is either 0 or -ENOENT to avoid the need for
the code duplication that was there.
- Rename a goto target in rbd_add().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
432b858749631dc011ac919dace4b0705ba8cecf 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: rename "node_lock"

The spinlock used to protect rbd_client_list is named "node_lock".
Rename it to "rbd_client_list_lock" to make it more obvious what
it's for.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
bc534d86be71aaf8dfac46131420ab1c47387d42 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: move ctl_mutex lock inside rbd_client_create()

Since rbd_client_create() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
d97081b0c7bdb55371994cc6690217bf393eb63e 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: move ctl_mutex lock inside rbd_get_client()

Since rbd_get_client() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Furthermore, within rbd_get_client(), it appears the mutex only
needs to be held while calling rbd_client_create(). (Moving
the lock inside that function will wait for the next patch.)

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
e6994d3ddedf1a9f35cb43655bb4b5810e71199a 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: release client list lock sooner

In rbd_get_client(), if a client is reused, a number of things
get done while still holding the list lock unnecessarily.

This just moves a few things that need no lock protection outside
the lock.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
d184f6bfde1428ad4a690d49b28afc9ab4d57b35 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: restore previous rbd id sequence behavior

It used to be that selecting a new unique identifier for an added
rbd device required searching all existing ones to find the highest
id is used. A recent change made that unnecessary, but made it
so that id's used were monotonically non-decreasing. It's a bit
more pleasant to have smaller rbd id's though, and this change
makes ids get allocated as they were before--each new id is one more
than the maximum currently in use.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
499afd5b8e742792fda6bd7730c738ad83aecf6b 02-Feb-2012 Alex Elder <elder@dreamhost.com> rbd: tie rbd_dev_list changes to rbd_id operations

The only time entries are added to or removed from the global
rbd_dev_list is exactly when a "put" or "get" operation is being
performed on a rbd_dev's id. So just move the list management code
into get/put routines.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
e124a82f3c4efc2cc2bae68a2bf30020fb8c4fc2 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: protect the rbd_dev_list with a spinlock

The rbd_dev_list is just a simple list of all the current
rbd_devices. Using the ctl_mutex as a concurrency guard is
overkill. Instead, use a spinlock for that specific purpose.

This also reduces the window that the ctl_mutex needs to be held in
rbd_add().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
1ddbe94eda58597cb6dd464b455cb62d3f68be7b 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: rework calculation of new rbd id's

In order to select a new unique identifier for an added rbd device,
the list of all existing ones is searched and a value one greater
than the highest id is used.

The list search can be avoided by using an atomic variable that
keeps track of the current highest id. Using a get/put model for
id's we can limit the boundless growth of id numbers a bit by
arranging to reuse the current highest id once it gets released.
Add these calls to "put" the id when an rbd is getting removed.

Note that this changes the pattern of device id's used--new values
will never be below the highest one seen so far (even if there
exists an unused lower one). I assert this is OK because the key
property of an rbd id is its uniqueness, not its magnitude.

Regardless, a follow-on patch will restore the old way of doing
things, I just think this commit just makes the incremental change
to atomics a little easier to understand.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
b7f23c361b65a0bdcc81acd8d38471b7810df3ff 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: encapsulate new rbd id selection

Move the loop that finds a new unique rbd id to use into
its own helper function.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
cc9d734c3d1b39c6a557673469aea26364060226 22-Nov-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: use a single value of snap_name to mean no snap

There's already a constant for this anyway.

Since rbd_header_set_snap() is only used to set the rbd device
snap_name field, just do that within that function rather than
having it take the snap_name as an argument.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

v2: Changed interface rbd_header_set_snap() so it explicitly updates
the snap_name in the rbd_device. Also added a BUILD_BUG_ON()
to verify the size of the snap_name field is sufficient for
SNAP_HEAD_NAME.
1dbb439913f0fc0bc30d36411a4a3b3202c0aab1 24-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: do not duplicate ceph_client pointer in rbd_device

The rbd_device structure maintains a duplicate copy of the
ceph_client pointer maintained in its rbd_client structure. There
appears to be no good reason for this, and its presence presents a
risk of them getting out of synch or otherwise misused. So kill it
off, and use the rbd_client copy only.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
ee57741c5209154b8ef124bcaa2496da1b69a988 24-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: make ceph_parse_options() return a pointer

ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful. With this interface is not evident at call sites that
the pointer is always initialized. Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2107978668de13da484f7abc3f03516494c7fca9 24-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: a few small cleanups

Some minor cleanups in "drivers/block/rbd.c:
- Use the more meaningful "RBD_MAX_OBJ_NAME_LEN" in place if "96"
in the definition of RBD_MAX_MD_NAME_LEN.
- Use DEFINE_SPINLOCK() to define and initialize node_lock.
- Drop a needless (char *) cast in parse_rbd_opts_token().
- Make a few minor formatting changes.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
d23a4b3fd6ef70b80411b39b8c8bc548a219ce70 29-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: fix safety of rbd_put_client()

The rbd_client structure uses a kref to arrange for cleaning up and
freeing an instance when its last reference is dropped. The cleanup
routine is rbd_client_release(), and one of the things it does is
delete the rbd_client from rbd_client_list. It acquires node_lock
to do so, but the way it is done is still not safe.

The problem is that when attempting to reuse an existing rbd_client,
the structure found might already be in the process of getting
destroyed and cleaned up.

Here's the scenario, with "CLIENT" representing an existing
rbd_client that's involved in the race:

Thread on CPU A | Thread on CPU B
--------------- | ---------------
rbd_put_client(CLIENT) | rbd_get_client()
kref_put() | (acquires node_lock)
kref->refcount becomes 0 | __rbd_client_find() returns CLIENT
calls rbd_client_release() | kref_get(&CLIENT->kref);
| (releases node_lock)
(acquires node_lock) |
deletes CLIENT from list | ...and starts using CLIENT...
(releases node_lock) |
and frees CLIENT | <-- but CLIENT gets freed here

Fix this by having rbd_put_client() acquire node_lock. The result
could still be improved, but at least it avoids this problem.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
97bb59a03dd6767fcc00be09b0c6d9e5294eeea6 24-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: fix a memory leak in rbd_get_client()

If an existing rbd client is found to be suitable for use in
rbd_get_client(), the rbd_options structure is not being
freed as it should. Fix that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
0e805a1d857799352e51e8697c0b1a30aec16913 12-Jan-2012 Alex Elder <elder@dreamhost.com> rbd: initialize snap_rwsem in rbd_add()

New rbd device structures get initialized in rbd_add(). Many of
the fields rely on being initially zero-filled. However we lockdep
was noticing that the rw_semaphore embedded in the header field
was not getting properly initialized. Fix that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
51703306b3b9ea7c05728040998521e47358147b 25-Oct-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: remove buggy rollback functionality

This doesn't interact with resizing well, since it doesn't set the
size of the device to the size at the snapshot. It's also an expensive
operation to be synchronous. Rollback can still be done with the
userspace rbd tool.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
81e759fbf7715514c32e563789db1d9d47fd55fb 15-Nov-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: return an error when an invalid header is read

This protects against opening future rbd images that have incompatible format changes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
6ab00d465a1c8c02c2216f8220727282f3aa50b5 09-Aug-2011 Sage Weil <sage@newdream.net> libceph: create messenger with client

This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>
699324871fcc3650f2023c5e36cb119a92d7894b 27-Jul-2011 Justin P. Mattock <justinmattock@gmail.com> treewide: remove extra semicolons from various parts of the kernel

This is a resend from the original, changing the title from PATCH to
RFC(since this is a review for commit, and I should have put that the first go around).
and also removing some of the commit's with ia64 and bash since it is significant.
let me know if I might have missed anything etc..

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
029bcbd8b076fd19787b8c73e58dd0a6f2c0caf1 22-Jul-2011 Josh Durgin <josh.durgin@dreamhost.com> rbd: set blk_queue request sizes to object size

This improves performance since more requests can be merged.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
79e3057c4c9d32b88e6745fd220d91b0a8b2030b 13-Jul-2011 Yehuda Sadeh <yehuda@hq.newdream.net> rbd: cancel watch request when releasing the device

We were missing this cleanup, so when a device was released
the osd didn't clean up its watchers list, so following notifications
could be slow as osd needed to timeout on the client.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
9db4b3e32778400555d5cc6fb61d4058902d37f7 20-Apr-2011 Sage Weil <sage@newdream.net> rbd: handle online resize of underlying rbd image

If we get a notification that the image header has changed, check for
a change in the image size.

Signed-off-by: Sage Weil <sage@newdream.net>
aedfec59eed37d1ff7ce09b303b668234e9a7f8e 13-May-2011 Sage Weil <sage@newdream.net> rbd: use snprintf for disk->disk_name

Signed-off-by: Sage Weil <sage@newdream.net>
916d4d672779de8e42346fff338617c7b841e8e5 13-May-2011 Sage Weil <sage@newdream.net> rbd: cleanup: make kfree match kmalloc

Signed-off-by: Sage Weil <sage@newdream.net>
13143d2d1cffd243a6d778000b02ab4938ac751a 13-May-2011 Sage Weil <sage@newdream.net> rbd: warn on update_snaps failure on notify

Signed-off-by: Sage Weil <sage@newdream.net>
1fec70932d867416ffe620dd17005f168cc84eb5 13-May-2011 Yehuda Sadeh <yehuda@hq.newdream.net> rbd: fix split bio handling

The rbd driver currently splits bios when they span an object boundary.
However, the blk_end_request expects the completions to roll up the results
in block device order, and the split rbd/ceph ops can complete in any
order. This patch adds a struct rbd_req_coll to track completion of split
requests and ensures that the results are passed back up to the block layer
in order.

This fixes errors where the file system gets completion of a read operation
that spans an object boundary before the data has actually arrived. The
bug is easily reproduced with iozone with a working set larger than
available RAM.

Reported-by: Fyodor Ustinov <ufm@ufm.su>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
11f770027b5c0de16544f3ec82b5c6f9f8d5a644 13-May-2011 Sage Weil <sage@newdream.net> rbd: fix leak of ops struct

The ops vector must be freed by the rbd_do_request caller.

Signed-off-by: Sage Weil <sage@newdream.net>
4ad12621e442b7a072e81270808f617cb65c5672 03-May-2011 Sage Weil <sage@newdream.net> libceph: fix ceph_osdc_alloc_request error checks

ceph_osdc_alloc_request returns NULL on failure.

Signed-off-by: Sage Weil <sage@newdream.net>
59c2be1e4d42c0d4949cecdeef3f37070a1fbc13 21-Mar-2011 Yehuda Sadeh <yehuda@hq.newdream.net> rbd: use watch/notify for changes in rbd header

Send notifications when we change the rbd header (e.g. create a snapshot)
and wait for such notifications. This allows synchronizing the snapshot
creation between different rbd clients/rools.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
766fc43973b16f9becb6b7402b3e052dbb84adee 07-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net> rbd: fix cleanup when trying to mount inexistent image

Previously we didn't clean up the sysfs entry that was just
created.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
dfc5606dc51381186de765243bab340c8e021868 19-Nov-2010 Yehuda Sadeh <yehuda@hq.newdream.net> rbd: replace the rbd sysfs interface

The new interface creates directories per mapped image
and under each it creates a subdir per available snapshot.
This allows keeping a cleaner interface within the sysfs
guidelines. The ABI documentation was updated too.

Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
85b5aaa624aac568b8a3a88dbe4de6628c7cc527 11-Oct-2010 Dan Carpenter <error27@gmail.com> rbd: passing wrong variable to bvec_kunmap_irq()

We should be passing "buf" here insead of "bv". This is tricky because
it's not the same as kmap() and kunmap(). GCC does warn about it if you
compile on i386 with CONFIG_HIGHMEM.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
b8d0638a98aa4a42ff322234b882487cd74e5c52 11-Oct-2010 Dan Carpenter <error27@gmail.com> rbd: null vs ERR_PTR

ceph_alloc_page_vector() returns ERR_PTR(-ENOMEM) on errors.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
f4cf3deef4c474381e8fee2e6099d49edd9105cb 27-Sep-2010 Yehuda Sadeh <yehuda@hq.newdream.net> block: rbd: removing unnecessary test

rbd_get_segment() can't return a negative value, we don't need to check
the return output.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
28f259b7cd78eb29d38b7ae6b475d656e08fd348 25-Sep-2010 Vasiliy Kulikov <segooon@gmail.com> block: rbd: fixed may leaks

rbd_client_create() doesn't free rbdc, this leads to many leaks.

seg_len in rbd_do_op() is unsigned, so (seg_len < 0) makes no sense.
Also if fixed check fails then seg_name is leaked.

Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
602adf400201636e95c3fed9f31fba54a3d7e844 13-Aug-2010 Yehuda Sadeh <yehuda@hq.newdream.net> rbd: introduce rados block device (rbd), based on libceph

The rados block device (rbd), based on osdblk, creates a block device
that is backed by objects stored in the Ceph distributed object storage
cluster. Each device consists of a single metadata object and data
striped over many data objects.

The rbd driver supports read-only snapshots.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>