History log of /drivers/md/raid10.c
Revision Date Author Comments
65c3f18b9032f7237fc74403ce3a92176eaebd8c 03-Jul-2012 NeilBrown <neilb@suse.de> md/raid10: fix failure when trying to repair a read error.

commit 055d3747dbf00ce85c6872ecca4d466638e80c22 upstream.

commit 58c54fcca3bac5bf9290cfed31c76e4c4bfbabaf
md/raid10: handle further errors during fix_read_error better.

in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
But we were passing the IO size in bytes!!!
This resulted in bio_add_page failing, an empty request being sent
down, and a consequent BUG_ON in scsi_lib.

[fix missing space in error message at same time]

This fix is suitable for 3.1.y and later.

Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
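
As a minimal standalone sketch of the unit mismatch (the helper name is
illustrative, not the kernel function):

    #include <stdio.h>

    /* Stand-in for r10_sync_page_io: expects a length in 512-byte sectors. */
    static int sync_io_sectors(long sectors)
    {
        if (sectors > 8) {  /* one 4 KiB page holds at most 8 sectors */
            printf("%ld 'sectors' - bio_add_page would fail\n", sectors);
            return 0;
        }
        printf("%ld sectors - OK\n", sectors);
        return 1;
    }

    int main(void)
    {
        long s = 6;                  /* intended IO: 6 sectors */
        sync_io_sectors(s);          /* correct: pass sectors */
        sync_io_sectors(s << 9);     /* the bug: 3072 bytes passed as sectors */
        return 0;
    }
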
04e0f69d135f6cf10534282fc09cf3efd6973c5b 03-Jul-2012 NeilBrown <neilb@suse.de> md/raid10: Don't try to recover unmatched (and unused) chunks.

commit fc448a18ae6219af9a73257b1fbcd009efab4a81 upstream.

If a RAID10 has an odd number of chunks - as might happen when there
are an odd number of devices - the last chunk has no pair and so is
not mirrored. We don't store data there, but when recovering the last
device in an array we try to recover that last chunk from a
non-existent location. This results in an error, and the recovery
aborts.

When we get to that last chunk we should just stop - there is nothing
more to do anyway.

This bug has been present since the introduction of RAID10, so the
patch is appropriate for any -stable kernel.

Reported-by: Christian Balzer <chibi@gol.com>
Tested-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
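
The shape of the fix is roughly this guard in the recovery loop (a
sketch based on the description above, not the exact patch text):

    if (sect >= mddev->resync_max_sectors) {
            /* last chunk is not mirrored - nothing more to recover */
            continue;
    }
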
584b886aee3aff5fe7eb21e30f779eb5cd1daa36 31-May-2012 NeilBrown <neilb@suse.de> md: raid1/raid10: fix problem with merge_bvec_fn

commit aba336bd1d46d6b0404b06f6915ed76150739057 upstream.

The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.

However we were only setting that when a device was hot-added,
not when a device was present from the start.

This bug was introduced in 3.4 so the patch is suitable for 3.4.y
kernels. However there are conflicts in raid10.c so a separate
patch will be needed for 3.4.y.

Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
b0d634d5683f0b186b242ce6a4f3b041edb8b956 19-May-2012 NeilBrown <neilb@suse.de> md/raid10: fix transcription error in calc_sectors conversion.

The old code was
    sector_div(stride, fc);
the new code was
    sector_div(size, conf->near_copies);

'size' is right (the stride variable wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.

Signed-off-by: NeilBrown <neilb@suse.de>
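
For clarity, the corrected conversion divides by far_copies (a sketch
of the fixed line):

    sector_div(size, conf->far_copies);
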
6508fdbf40a92fd7c19d32780ea33ce8e8362b93 17-May-2012 NeilBrown <neilb@suse.de> md/raid10: set dev_sectors properly when resizing devices in array.

raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so needs to be updated carefully.

However raid10_resize isn't updating it at all!

To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf. So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.

Signed-off-by: NeilBrown <neilb@suse.de>
f4380a915823dbed0bf8e3cf502ebcf2b7c7f833 12-Apr-2012 majianpeng <majianpeng@gmail.com> md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.

If r1bio->sectors % 8 != 0, then the memcmp and a later
memcpy will omit the last bio_vec.

This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
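
A minimal standalone sketch of the rounding problem (PAGE_SHIFT of 12
assumed, i.e. 8 sectors per page; not the kernel code):

    #include <stdio.h>

    #define PAGE_SHIFT 12   /* 4 KiB pages -> 8 sectors of 512 bytes each */

    int main(void)
    {
        int sectors = 13;   /* not a multiple of 8 */
        int vcnt_old = sectors >> (PAGE_SHIFT - 9);        /* truncates: 1 */
        int vcnt_new = (sectors + ((1 << (PAGE_SHIFT - 9)) - 1))
                       >> (PAGE_SHIFT - 9);                /* rounds up: 2 */
        printf("old vcnt=%d (last partial page skipped), new vcnt=%d\n",
               vcnt_old, vcnt_new);
        return 0;
    }
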
5020ad7d143ccfcf8149974096220d59e5572120 01-Apr-2012 NeilBrown <neilb@suse.de> md/raid1,raid10: don't compare excess byte during consistency check.

When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.

In most cases we read a whole page, but in some cases with
bad blocks or odd-sized devices we might read less than that.

This bug has been present "forever" but at worst it might cause
a report of too many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
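
A standalone sketch of the off-by-length compare (illustrative only):

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        char a[PAGE_SIZE] = {0}, b[PAGE_SIZE] = {0};
        int sectors_read = 3;      /* short read: only 3 * 512 bytes valid */
        b[3000] = 1;               /* garbage in the unread tail of b */
        /* Old behaviour: whole-page compare reports a spurious mismatch. */
        printf("whole page: %s\n",
               memcmp(a, b, PAGE_SIZE) ? "mismatch" : "match");
        /* Fixed behaviour: compare only the bytes actually read. */
        printf("read bytes: %s\n",
               memcmp(a, b, sectors_read << 9) ? "mismatch" : "match");
        return 0;
    }
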
006a09a0ae0a494473a8cd82c8d1d653e37e6663 18-Mar-2012 NeilBrown <neilb@suse.de> md/raid10 - support resizing some RAID10 arrays.

'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space on the
component devices.

This is not supported for arrays with a 'far' layout. However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.

Signed-off-by: NeilBrown <neilb@suse.de>
050b66152f87c79e8d66aed0e7996f9336462d5f 18-Mar-2012 NeilBrown <neilb@suse.de> md/raid10: handle merge_bvec_fn in member devices.

Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem. There is no locking around calls
to ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
dafb20fa34320a472deb7442f25a0c086e0feb33 18-Mar-2012 NeilBrown <neilb@suse.de> md: tidy up rdev_for_each usage.

md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicit list_for_each_entry.

So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>
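
The resulting macro pair looks roughly like this (a sketch of the
md.h shapes of that era, not a verbatim quote):

    #define rdev_for_each(rdev, mddev)                                \
            list_for_each_entry(rdev, &((mddev)->disks), same_set)

    #define rdev_for_each_safe(rdev, tmp, mddev)                      \
            list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)
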
d6b42dcb995e6acd7cc276774e751ffc9f0ef4bf 18-Mar-2012 NeilBrown <neilb@suse.de> md/raid1,raid10: avoid deadlock during resync/recovery.

If RAID1 or RAID10 is used under LVM or some other stacking
block device, it is possible to enter a deadlock during
resync or recovery.
This can happen if the upper level block device creates
two requests to the RAID1 or RAID10. The first request gets
processed, blocks recovery, and queues requests for the underlying
devices in current->bio_list. A resync request then starts
which will wait for those requests and block new IO.

But then the second request to the RAID1/10 will be attempted
and it cannot progress until the resync request completes,
which cannot progress until the underlying device requests complete,
which are on a queue behind that second request.

So allow that second request to proceed even though there is
a resync request about to start.

This is suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Ray Morris <support@bettercgi.com>
Tested-by: Ray Morris <support@bettercgi.com>
Signed-off-by: NeilBrown <neilb@suse.de>
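
The relaxed barrier check has roughly this shape (a sketch following
the description above; field names as in raid10.c of that era, macro
signature simplified):

    /* Let a request through if the submitter already has bios queued in
     * current->bio_list: blocking it could deadlock against the requests
     * queued behind it. */
    wait_event_lock_irq(conf->wait_barrier,
                        !conf->barrier ||
                        (conf->nr_pending &&
                         current->bio_list &&
                         !bio_list_empty(current->bio_list)),
                        conf->resync_lock);
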
dc10c643e8a8d008fd16dd6706e9e0018eadf8d2 18-Mar-2012 NeilBrown <neilb@suse.de> md: allow re-add to failed arrays.

When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.

However there may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.

So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only array types that check for a failed array).

Signed-off-by: NeilBrown <neilb@suse.de>
547414d19fd72376ff2ecc42aac8d7a051f03d26 13-Mar-2012 NeilBrown <neilb@suse.de> md/raid10: remove unnecessary smp_mb() from end_sync_write

Recent commit 4ca40c2ce099e4f1ce3 (md/raid10: Allow replacement device ...)
added an smp_mb in end_sync_write.
This was to close a possible race with raid10_remove_disk.
However there is no such race, as we never attempt to remove a
disk while resync (or recovery) is happening, so the smp_mb is just
noise.

Signed-off-by: NeilBrown <neilb@suse.de>
7a90484825680e7831856105f5fef654e6c02701 05-Mar-2012 NeilBrown <neilb@suse.de> md/raid10: fix assembling of arrays with replacement devices.

commit 56a2559bb654a (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status of the device, but it didn't remove the
old unconditional setting of 'rdev'. So it was largely ineffective.

So remove that now.

Signed-off-by: NeilBrown <neilb@suse.de>
fae8cc5ed0714953b1ad7cf86f030d2177278424 14-Feb-2012 NeilBrown <neilb@suse.de> md/raid10: fix handling of error on last working device in array.

If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.

This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
was the last usable device. This is efficient but a little racy
and will sometimes retry when it should not.

2/ in handle_read_error we are careful to exclude any device from
retry which we tried to mark as faulty (that might have failed if
it was the last device). This is race-free but less efficient.

Signed-off-by: NeilBrown <neilb@suse.de>
b7044d41b5a09ce9082699f74c8f10e0fe59f704 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: If there is a spare and a want_replacement device, start replacement.

When attempting to add a spare to a RAID10 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
56a2559bb654ae2555b2ae3b29c837615d0c45c9 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: recognise replacements when assembling array.

If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
4ca40c2ce099e4f1ce35445994f49836662596c8 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: Allow replacement device to replace old drive.

When recovery finishes and spare_active is called, check for a
replacement that might have just become fully synced and mark it
as such, marking the original as failed.

Then when the original is removed, move the replacement into
its position.

This means that 'replacement' can spontaneously become NULL in some
situations. Make sure we check for those.
It also means that 'rdev' and 'replacement' could appear to be
identical - check for that too.

Signed-off-by: NeilBrown <neilb@suse.de>
24afd80d99f80a79d8824d2805114b8b067e9823 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: handle recovery of replacement devices.

If there is a replacement device, then recover to it,
reading from any drives - maybe the one being replaced, maybe not.

Signed-off-by: NeilBrown <neilb@suse.de>
9ad1aefc8ae8d2e482b4cc4b7199e2354148bbdc 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: Handle replacement devices during resync.

If we need to resync an array which has replacement devices,
we always write any block checked to every replacement.

If the resync was bitmap-based resync we will then complete the
replacement normally.
If it was a full resync, we mark the replacements as fully recovered
when the resync finishes so no further recovery is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
475b0321a4df381f64db10ddd750a8b7bb82d88b 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: writes should get directed to replacement as well as original.

When writing, we need to submit two writes, one to the original,
and one to the replacement - if there is one.

If the write to the replacement results in a write error we just
fail the device. We only try to record write errors to the
original.

This only handles writing new data. Writing for resync/recovery
will come later.

Signed-off-by: NeilBrown <neilb@suse.de>
c8ab903ea9d7309044910c33dc087418be84f9b5 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: allow removal of failed replacement devices.

Enhance raid10_remove_disk to be able to remove ->replacement
as well as ->rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
abbf098e6e1e23d5d247b9eaaf325e67f67b0328 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: preferentially read from replacement device if possible.

When reading (for array reads, not for recovery etc) we read from the
replacement device if it has recovered far enough.
This requires storing the chosen rdev in the 'r10_bio' so we can make
sure to drop the ref on the right device when the read finishes.

Signed-off-by: NeilBrown <neilb@suse.de>
96c3fd1f3802371610c620cff03f9d825707e80e 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: change read_balance to return an rdev

It makes more sense to return an rdev than just an index as
read_balance() gets a reference to the rdev and so returning
the pointer makes this more idiomatic.

This will be needed in a future patch when we might return
a 'replacement' rdev instead of the main rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
69335ef3bc5b766f34db2d688be1d35313138bca 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: prepare data structures for handling replacement.

Allow each slot in the RAID10 to have 2 devices, the want_replacement
and the replacement.

Also allow an r10bio to have 2 bios, and for resync/recovery allocate the
second bio if there are any replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
b8321b68d1445f308324517e45fb0a5c2b48e271 23-Dec-2011 NeilBrown <neilb@suse.de> md: change hot_remove_disk to take an rdev rather than a number.

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
056075c76417b112b4924e7b6386fdc6dfc9ac03 03-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com> md: Add module.h to all files using it implicitly

A pending cleanup will mean that module.h won't be implicitly
included everywhere anymore. Make sure the modular drivers in the md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
7fcc7c8acf0fba44d19a713207af7e58267c1179 30-Oct-2011 NeilBrown <neilb@suse.de> md/raid10: Fix bug when activating a hot-spare.

This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional. It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
2bb77736ae5dca0a189829fbb7379d43364a9dac
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
d890fa2b0586b6177b119643ff66932127d58afa 26-Oct-2011 NeilBrown <neilb@suse.de> md: Fix some bugs in recovery_disabled handling.

In 3.0 we changed the way recovery_disabled was handled so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
1/ one place in raid1 was missed and still sets to '1'.
2/ We didn't explicitly set the conf-> value at array creation
time.
It defaulted to '0' just like the mddev value does so they
could appear equal and thus disable recovery.
This did not affect normal 'md' as it calls bind_rdev_to_array
which changes the mddev value. However the dmraid interface
doesn't call this and so doesn't change ->recovery_disabled; so at
array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less than the mddev value, so
they will only be the same when explicitly set that way.

Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
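
A minimal standalone sketch of the initialisation fix (the field
layout here is illustrative):

    #include <stdio.h>

    struct mddev { int recovery_disabled; };
    struct conf  { int recovery_disabled; };

    int main(void)
    {
        struct mddev mddev = { 0 };   /* dmraid path: value never changed */
        struct conf conf;
        /* Old: conf started at 0 too, so recovery looked disabled.
         * New: start one below, so equality only happens when set. */
        conf.recovery_disabled = mddev.recovery_disabled - 1;
        printf("recovery disabled? %s\n",
               conf.recovery_disabled == mddev.recovery_disabled
               ? "yes" : "no");
        return 0;
    }
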
34db0cd60f8a1f4ab73d118a8be3797c20388223 11-Oct-2011 NeilBrown <neilb@suse.de> md: add proper write-congestion reporting to RAID1 and RAID10.

RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread. This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.

However writeback requires devices to appear congested after a
while so it can make some guesstimate of throughput. The infinite
queue defeats that (note that RAID5 already has a finite queue so
it doesn't suffer from this problem).

So impose a limit on the number of pending write requests. By default
it is 1024 which seems to be generally suitable. Make it configurable
via a module option just in case someone finds a regression.

Signed-off-by: NeilBrown <neilb@suse.de>
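
The knob has roughly this shape (sketch; the parameter name follows
the description above, permissions are an assumption):

    static int max_queued_requests = 1024;
    module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
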
84fc4b56db85cb9e05326424049973a2036c9940 11-Oct-2011 NeilBrown <neilb@suse.de> md: rename "mdk_personality" to "md_personality"

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>
e879a8793f915aa7933364d962d2435bd71de462 11-Oct-2011 NeilBrown <neilb@suse.de> md/raid10: typedef removal: conf_t -> struct r10conf

Signed-off-by: NeilBrown <neilb@suse.de>
e373ab109172abc2d821bd3b5c1b400acddef5a5 11-Oct-2011 NeilBrown <neilb@suse.de> md/raid0: typedef removal: raid0_conf_t -> struct r0conf

Signed-off-by: NeilBrown <neilb@suse.de>
0f6d02d580ca77ee4be085c29c5fe5b879df24d9 11-Oct-2011 NeilBrown <neilb@suse.de> md: remove typedefs: mirror_info_t -> struct mirror_info

Signed-off-by: NeilBrown <neilb@suse.de>
9f2c9d12bcc53fcb3b787023723754e84d1aef8b 11-Oct-2011 NeilBrown <neilb@suse.de> md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio

Signed-off-by: NeilBrown <neilb@suse.de>
fd01b88c75a718020ff77e7f560d33835e9b58de 11-Oct-2011 NeilBrown <neilb@suse.de> md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>
3cb03002000f133f9f97269edefd73611eafc873 11-Oct-2011 NeilBrown <neilb@suse.de> md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>
01f96c0a9922cd9919baf9d16febdf7016177a12 21-Sep-2011 NeilBrown <neilb@suse.de> md: Avoid waking up a thread after it has been freed.

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
disappearing either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protection.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
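
A standalone sketch of the changed contract (userspace stand-in types,
locking omitted; only the pointer-to-pointer idea is the point):

    #include <stdio.h>

    struct md_thread { const char *name; };

    static void stop_thread(struct md_thread *t)
    {
        printf("stopping %s\n", t->name);
    }

    /* Take the address of the thread pointer so the caller's copy is
     * cleared before the thread is torn down - nothing can wake a freed
     * thread through a stale pointer. */
    static void md_unregister_thread(struct md_thread **threadp)
    {
        struct md_thread *t = *threadp;
        if (!t)
            return;
        *threadp = NULL;
        stop_thread(t);
    }

    int main(void)
    {
        struct md_thread worker = { "raid10d" };
        struct md_thread *thread = &worker;
        md_unregister_thread(&thread);
        printf("thread pointer now %s\n", thread ? "set" : "NULL");
        return 0;
    }
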
5a7bbad27a410350e64a2d7f5ec18fc73836c14f 12-Sep-2011 Christoph Hellwig <hch@infradead.org> block: remove support for bio remapping from ->make_request

There is very little benefit in letting a ->make_request
instance update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same by calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
079fa166a2874985ae58b2e21e26e1cbc91127d4 10-Sep-2011 NeilBrown <neilb@suse.de> md/raid1,10: Remove use-after-free bug in make_request.

A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.

To detect if we need to submit another write request we test:
if (sectors_handled < (bio->bi_size >> 9)) {

However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.

So move the **_write_done call until after the test against
bio->bi_size.

This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

Reported-by: Bruno Wolff III <bruno@wolff.to>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: NeilBrown <neilb@suse.de>
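
The corrected ordering, roughly (a sketch; only the call order
matters here):

    /* Test against bio->bi_size first; one_write_done() may drop the
     * last reference, after which 'bio' must not be touched. */
    if (sectors_handled < (bio->bi_size >> 9)) {
            /* ... queue another r10_bio for the remaining sectors ... */
    }
    one_write_done(r10_bio);
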
19d5f834d6aff7efb1c9353523865c5bce869470 10-Sep-2011 NeilBrown <neilb@suse.de> md/raid10: unify handling of write completion.

A write can complete at two different places:
1/ when the last member-device write completes, through
raid10_end_write_request
2/ in make_request() when we remove the initial bias from ->remaining.

These two should do exactly the same thing and the comment says they
do, but they don't.

So factor the correct code out into a function and call it in both
places. This makes the code much more similar to RAID1.

The difference is only significant if there is an error, and errors
usually take a while, so it is unlikely that there will be an error
already when make_request is completing, so this is unlikely to cause
real problems.

Signed-off-by: NeilBrown <neilb@suse.de>
58c54fcca3bac5bf9290cfed31c76e4c4bfbabaf 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: handle further errors during fix_read_error better.

If we find more read/write errors we should record a bad block before
failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
5e5702898e93eee7d69b6efde109609a89a61001 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Handle read errors during recovery better.

Currently when we get a read error during recovery, we simply abort
the recovery.

Instead, repeat the read in page-sized blocks.
On successful reads, write to the target.
On read errors, record a bad block on the destination,
and only if that fails do we abort the recovery.

As we now retry reads we need to know where we read from. This was in
bi_sector but that can be changed during a read attempt.
So store the correct from_addr and to_addr in the r10_bio for later
access.


Signed-off-by: NeilBrown <neilb@suse.de>
e684e41db3bad44f1262341300b827c0d94ae220 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: simplify read error handling during recovery.

If a read error is detected during recovery the code currently
fails the read device.
This isn't really necessary. recovery_request_write will signal
a write error to end_sync_write and it will record a write
error on the destination device which will record a bad block
there or kick it from the array.

So just remove this call to do md_error.

Signed-off-by: NeilBrown <neilb@suse.de>
1a0b7cd82657a590f163b090bd9123a3a6b9aae4 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: record bad blocks due to write errors during resync/recovery.

If we get a write error during resync/recovery don't fail the device
but instead record a bad block. If that fails we can then fail the
device.

Signed-off-by: NeilBrown <neilb@suse.de>
f84ee364dd15af11cada1e673f94128f62db189e 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: attempt to fix read errors during resync/check

We already attempt to fix read errors found during normal IO
and a 'repair' process.
It is best to try to repair them at any time they are found,
so move a test so that during sync and check a read error will
be corrected by over-writing with good data.

If both (all) devices have known bad blocks in the sync section we
won't try to fix even though the bad blocks might not overlap. That
should be considered later.

Also if we hit a read error during recovery we don't try to fix it.
It would only be possible to fix if there were at least three copies
of data, which is not very common with RAID10. But it should still
be considered later.

Signed-off-by: NeilBrown <neilb@suse.de>
bd870a16c5946d86126f7203db3c73b71de0a1d8 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Handle write errors by updating badblock log.

When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we try writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
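
A standalone sketch of the per-block retry (write_ok is a stand-in
for the real write path):

    #include <stdio.h>

    /* Pretend sector 132 is the one that keeps failing. */
    static int write_ok(long sector) { return sector != 132; }

    int main(void)
    {
        long first = 128, blocks = 8;
        /* Retry the failed range one block at a time and record only
         * the blocks that still fail. */
        for (long s = first; s < first + blocks; s++) {
            if (!write_ok(s))
                printf("log bad block at sector %ld\n", s);
        }
        return 0;
    }
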
749c55e942d91cb27045fe2eb313aa5afe68ae0b 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: clear bad-block record when write succeeds.

If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
d4432c23be957ff061f7b23fd60e8506cb472a55 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid writing to known bad blocks on known bad drives.

Writing to known bad blocks on drives that have seen a write error
is asking for trouble. So try to avoid these blocks.

Signed-off-by: NeilBrown <neilb@suse.de>
e875ecea266a543e643b19e44cf472f1412708f9 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: record bad blocks as needed during recovery.

When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>
40c356ce5ad1a6be817825e1da1bc7494349cc6d 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid reading known bad blocks during resync/recovery.

During resync/recovery limit the size of the request to avoid
reading into a bad block that does not start at-or-before the current
read address.

Similarly if there is a bad block at this address, don't allow the
current request to extend beyond the end of that bad block.

Now that we don't ever read from known bad blocks, it is safe to allow
devices with those blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
8dbed5cebdf6796bf2618457b3653cf820934366 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10 - avoid reading from known bad blocks - part 3

When attempting to repair a read error, don't read from
devices with a known bad block.

As we are only reading PAGE_SIZE blocks, we don't try to
narrow down to smaller regions in the hope that only part of this
page is bad - it isn't worth the effort.

Signed-off-by: NeilBrown <neilb@suse.de>
7399c31bc92a26bb8388a73f8e14acadcc512fe5 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid reading from known bad blocks - part 2

When redirecting a read error to a different device, we must
again avoid bad blocks and possibly split the request.

Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com>

Signed-off-by: NeilBrown <neilb@suse.de>
856e08e23762dfb92ffc68fd0a8d228f9e152160 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid reading from known bad blocks - part 1

This patch just covers the basic read path:
1/ read_balance needs to check for badblocks, and return not only
the chosen slot, but also how many good blocks are available
there.
2/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'

On read error we currently just fail the request if another target
cannot handle the whole request. Next patch refines that a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
560f8e5532d63a314271bfb99d3d1d53c938ed14 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Split handle_read_error out from raid10d.

raid10d() is too big and is about to get bigger, so split
handle_read_error() out as a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>
1294b9c973251a5e68b62c9b40dd914517bda675 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: simplify/reindent some loops.

When a loop ends with a large if, it can be neater to change the
if to invert the condition and just 'continue'.
Then the body of the if can be indented to a lower level.

Signed-off-by: NeilBrown <neilb@suse.de>
de393cdea66cbd63c90725663f400c76faf1b255 28-Jul-2011 NeilBrown <neilb@suse.de> md: make it easier to wait for bad blocks to be acknowledged.

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't been, we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see the race above).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>
34b343cff4354ab9864be83be88405fd53d928a0 28-Jul-2011 NeilBrown <neilb@suse.de> md: don't allow arrays to contain devices with bad blocks.

As no personality understands bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
cbea21703b2484f83faef040ed1de30114794392 27-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: move rdev->corrected_errors counting

Read errors are considered corrected if the write-back and re-read
cycle finishes without further problems. Thus moving the rdev->
corrected_errors counting after the re-read looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
700c72138938cf428c74379806886c6b017d6295 27-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Improve decision on whether to fail a device with a read error.

Normally we would fail a device with a READ error. However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.

The current test for deciding whether the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.

Signed-off-by: NeilBrown <neilb@suse.de>
2bb77736ae5dca0a189829fbb7379d43364a9dac 27-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Make use of new recovery_disabled handling

When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart. This is misleading and wrong.

Instead, make use of the new recovery_disabled handling and mark
the target device as having recovery disabled.

Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.

Signed-off-by: NeilBrown <neilb@suse.de>
8bda470e8ebde35f9349e98ecbce4dfb508a60fa 27-Jul-2011 Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> md/raid: use printk_ratelimited instead of printk_ratelimit

As per printk_ratelimit comment, it should not be used.

Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
c65060ad4274f70048d62e0a86332cd3fd23f28d 18-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: share pages between read and write bio's during recovery

When performing a recovery, only the first 2 slots in r10_bio are in
use, for read and write respectively. However the pages in the write
bio are never used and are just replaced by the read bio's pages when
the read completes.

Get rid of those unused pages and share read pages properly.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
778ca01852e6cc9ff335119b37a1938a978df384 18-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: factor out common bio handling code

When a normal-write or sync-read/write bio completes, we need to
find out which disk number the bio belongs to. Factor that common
code out into a separate function.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2c4193df379bb89114ff60d4b0fa66131abe6a75 18-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: get rid of duplicated conditional expression

Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than 0. Thus condition '>= first' always implies
'>= 0' so the latter is not needed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
ab9d47e990c12c11cc95ed1247a3782234a7e33a 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: reformat some loops with less indenting.

When a loop ends with an 'if' with a large body, it is neater
to make the if 'continue' on the inverse condition, and then
the body is indented less.

Apply this pattern 3 times, and wrap some other long lines.

Signed-off-by: NeilBrown <neilb@suse.de>
f17ed07c853d5d772515f565a7fc68f9098d6d69 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: remove unused variable.

This variable 'disk' is never used - how odd.

Signed-off-by: NeilBrown <neilb@suse.de>
a8830bcaf3206f15e29efcd9e04becd96a0722e9 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: make more use of 'slot' in raid10d.

Now that we have a 'slot' variable, make better use of it to simplify
some code a little.

Signed-off-by: NeilBrown <neilb@suse.de>
7c4e06ff2b6a4c09638551dfde76f37f9fca5c0c 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: some tidying up in fix_read_error

Currently the rdev on which a read error happened could be removed
before we perform the fix_read_error handling. This requires extra tests
for NULL.

So delay the rdev_dec_pending call until after the call to
fix_read_error so that we can be sure that the rdev still exists.

This allows an 'if' clause to be removed so the body gets re-indented
back one level.

Signed-off-by: NeilBrown <neilb@suse.de>
56d9912106b0974ffb6dd264c80c7e816677e998 11-May-2011 NeilBrown <neilb@suse.de> md: simplify raid10 read_balance

raid10 read balance has two different loops for looking through
possible devices to choose the best.
Collapse those into one loop and generally make the code more
readable.

Signed-off-by: NeilBrown <neilb@suse.de>
c3b328ac846bcf6b9a62c5563380a81ab723006d 18-Apr-2011 NeilBrown <neilb@suse.de> md: fix up raid1/raid10 unplugging.

We just need to make sure that an unplug event wakes up the md
thread, which is exactly what mddev_check_plugged does.

Also remove some plug-related code that is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
e1dfa0a29737142c32f00a3bac0f609dc85b4a82 18-Apr-2011 NeilBrown <neilb@suse.de> md: use new plugging interface for RAID IO.

md/raid submits a lot of IO from the various raid threads.
So add start/finish plug calls to those so that some
plugging happens.

Signed-off-by: NeilBrown <neilb@suse.de>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
a91a2785b200864aef2270ed6a3babac7a253a20 17-Mar-2011 Martin K. Petersen <martin.petersen@oracle.com> block: Require subsystems to explicitly allocate bio_set integrity mempool

MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
7eaceaccab5f40bbfda044629a6298616aeaed50 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com> block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
da9cf5050a2e3dbc3cf26a8d908482eb4485ed49 21-Feb-2011 NeilBrown <neilb@suse.de> md: avoid spinlock problem in blk_throtl_exit

blk_throtl_exit assumes that ->queue_lock still exists,
so make sure that it does.
To do this, we stop redirecting ->queue_lock to conf->device_lock
and leave it pointing where it is initialised - __queue_lock.

As the blk_plug functions check the ->queue_lock is held, we now
take that spin_lock explicitly around the plug functions. We don't
need the locking, just the warning removal.

This is needed for any kernel with the blk_throtl code, which is
2.6.37 and later.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
02214dc5461c36da26a34014cab4e1bb484edba2 04-Feb-2011 Krzysztof Wojcik <krzysztof.wojcik@intel.com> FIX: md: process hangs at wait_barrier after 0->10 takeover

The following symptoms were observed:
1. After a raid0->raid10 takeover operation we have an array with 2
missing disks.
When we add a disk for rebuild, the recovery process starts as
expected but does not finish - it stops at about 90% and the
md126_resync process hangs in "D" state.
2. Similar behaviour occurs when we have a mounted raid0 array and
execute a takeover to raid10. When we then try to unmount the array,
the umount process hangs in "D" state.

In both scenarios the processes hang in the same function -
wait_barrier in raid10.c.
The process waits in the "wait_event_lock_irq" macro until the
"!conf->barrier" condition becomes true, which in these scenarios
never happens.

The reason was that at the end of level_store, after calling
pers->run, we call mddev_resume. For RAID10 this calls
pers->quiesce(mddev, 0), which calls lower_barrier.
However raise_barrier hadn't been called on that 'conf' yet,
so conf->barrier becomes negative, which is bad.

This patch sets conf->barrier=1 after the takeover operation,
which prevents the barrier from going negative when lower_barrier()
is called.

Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
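
A minimal standalone sketch of the barrier arithmetic (illustrative
only):

    #include <stdio.h>

    int main(void)
    {
        int barrier = 0;   /* fresh conf after a raid0->raid10 takeover */
        /* Old: mddev_resume -> quiesce(mddev, 0) -> lower_barrier() */
        barrier--;         /* goes negative: waiters never proceed */
        printf("without fix: barrier = %d\n", barrier);
        /* Fix: takeover leaves the array quiesced (barrier = 1), so the
         * pending lower_barrier() restores it to 0. */
        barrier = 1;
        barrier--;
        printf("with fix:    barrier = %d\n", barrier);
        return 0;
    }
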
ccebd4c4159462c96397ae9af9c667bb394d7b70 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com> md-new-param-to_sync_page_io

Add new parameter to 'sync_page_io'.

The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
067032bc628598606056412594042564fcf09e22 13-Jan-2011 Joe Perches <joe@perches.com> md: Fix single printks with multiple KERN_<level>s

Noticed-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>
589a594be1fb8815b3f18e517be696c48664f728 09-Dec-2010 NeilBrown <neilb@suse.de> md: protect against NULL reference when waiting to start a raid10.

When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.

Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called. This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.

So fix this at both ends:
- md_thread should only call the handler if THREAD_WAKEUP has been
set.
- raid10 should call md_unregister_thread before setting things
to NULL just like all the other raid modules do.

This is applicable to 2.6.35 and later.

Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
a167f663243662aa9153c01086580a11cde9ffdc 26-Oct-2010 NeilBrown <neilb@suse.de> md: use separate bio pool for each md device.

bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.

So allocate a local bio pool for each md array and use that rather
than the common pool.

This pool is used both for regular IO and metadata updates.

Signed-off-by: NeilBrown <neilb@suse.de>
2b193363ef68667ad717a6723165e0dccf99470f 27-Oct-2010 NeilBrown <neilb@suse.de> md: change type of first arg to sync_page_io.

Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
so just pass the rdev instead of the bdev.

Signed-off-by: NeilBrown <neilb@suse.de>
6746557f0325a66f57d179126426e38a8ea66945 26-Oct-2010 NeilBrown <neilb@suse.de> md: use bio_kmalloc rather than bio_alloc when failure is acceptable.

bio_alloc can never fail (as it uses a mempool) but can block
indefinitely, especially if the caller is holding a reference to a
previously allocated bio.

So these two places, which both handle failure and hold multiple
bios, should not use bio_alloc; they should use bio_kmalloc.

Signed-off-by: NeilBrown <neilb@suse.de>
4e78064f42ad474ce9c31760861f7fb0cfc22532 18-Oct-2010 NeilBrown <neilb@suse.de> md: Fix possible deadlock with multiple mempool allocations.

It is not safe to allocate from a mempool while holding an item
previously allocated from that mempool as that can deadlock when the
mempool is close to exhaustion.

So don't use a bio list to collect the bios to write to multiple
devices in raid1 and raid10.
Instead queue each bio as it becomes available so an unplug will
activate all previously allocated bios and so a new bio has a chance
of being allocated.

This means we must set the 'remaining' count to '1' before submitting
any requests, then when all are submitted, decrement 'remaining' and
possibly handle the write completion at that point.

Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
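
The submission pattern described above looks roughly like this
(a sketch; write_done is a stand-in for the completion helper):

    /* Bias 'remaining' so the completion path cannot fire while bios
     * are still being queued; drop the bias once all are submitted. */
    atomic_set(&r10_bio->remaining, 1);
    for (i = 0; i < copies; i++)
            generic_make_request(bio[i]);   /* each completion decrements */
    if (atomic_dec_and_test(&r10_bio->remaining))
            write_done(r10_bio);
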
57dab0bdf689d42972975ec646d862b0900a4bf3 19-Oct-2010 NeilBrown <neilb@suse.de> md: use sector_t in bitmap_get_counter

bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.

Signed-off-by: NeilBrown <neilb@suse.de>
e9c7469bb4f502dafc092166201bea1ad5fc0fbf 03-Sep-2010 Tejun Heo <tj@kernel.org> md: implement REQ_FLUSH/FUA support

This patch converts md to support REQ_FLUSH/FUA instead of the now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.

* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe reconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2c7d46ec192e4f2b350f67a0e185b9bce646cd6b 18-Aug-2010 NeilBrown <neilb@suse.de> md raid-1/10 Fix bio_rw bit manipulations again

commit 7b6d91daee5cac6402186ff224c3af39d79f4a0e changed the behaviour
of a few variables in raid1 and raid10 from flags to bit-sets, but
left them as type 'bool' so they did not work.

Change them (back) to unsigned long.
(historical note: see 1ef04fefe2241087d9db7e9615c3f11b516e36cf)

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jiri Slaby <jslaby@suse.cz> and many others
6b9656205469269c050963c71fca1998b247a560 18-Aug-2010 NeilBrown <neilb@suse.de> md: provide appropriate return value for spare_active functions.

md_check_recovery expects ->spare_active to return 'true' if any
spares were activated, but none of them do, so the consequent change
in 'degraded' is not notified through sysfs.

So count the number of spares activated, subtract it from 'degraded'
just once, and return it.

Reported-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
e6ffbcb6cd0ac471223df24ae77eb486c1ee68cc 18-Aug-2010 Adrian Drzewiecki <adriand@vmware.com> md: Notify sysfs when RAID1/5/10 disk is In_sync.

When RAID1 is done syncing disks, it'll update the state
of synced rdevs to In_sync. But it neglected to notify
sysfs that the attribute changed. So any programs that
are waiting for an rdev's state to change will not be
woken.

(raid5/raid10 added by neilb)

Signed-off-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
7b6d91daee5cac6402186ff224c3af39d79f4a0e 07-Aug-2010 Christoph Hellwig <hch@lst.de> block: unify flags for struct bio and struct request

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superfluous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
51e9ac77035a3dfcb6fc0a88a0d80b6f99b5edb1 07-Aug-2010 NeilBrown <neilb@suse.de> md/raid10: fix deadlock with unaligned read during resync

If the 'bio_split' path in raid10-read is used while
resync/recovery is happening it is possible to deadlock.
Fix this by elevating ->nr_waiting for the duration of both
parts of the split request.

This fixes a bug that has been present since 2.6.22
but has only started manifesting recently for unknown reasons.
It is suitable for any -stable kernel since then.

Reported-by: Justin Bronder <jsbronder@gentoo.org>
Tested-by: Justin Bronder <jsbronder@gentoo.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
f73ea87375a1b2bf6c0be82bb9a3cb9d5ee7a407 16-Jun-2010 Maciej Trela <maciej.trela@intel.com> md: fix raid10 takeover: use new_layout for setup_conf

Use mddev->new_layout in setup_conf.
Also use new_chunk, and don't set ->degraded in takeover(). That
gets set in run()

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
e93f68a1fc6244c05ad8fae28e75835ec74ab34e 15-Jun-2010 NeilBrown <neilb@suse.de> md: fix handling of array level takeover that re-arranges devices.

Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over. However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared too high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspended.
It can also update some symlinks in sysfs which previously were not
updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
0544a21db02c1d8883158fd6f323364f830a120a 24-Jun-2010 Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com> md: raid10: Fix null pointer dereference in fix_read_error()

Such a NULL pointer dereference can occur when the driver is fixing
read errors/bad blocks and the disk is physically removed, causing a
system crash. This patch checks whether rcu_dereference() returns a
valid rdev before accessing it in fix_read_error().

Cc: stable@kernel.org
Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com>
Signed-off-by: Rob Becker <rbecker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
af3a2cd6b8a479345786e7fe5e199ad2f6240e56 08-May-2010 NeilBrown <neilb@suse.de> md: Fix read balancing in RAID1 and RAID10 on drives > 2TB

read_balance uses an "unsigned long" for a sector number which
will get truncated beyond 2TB.
This will cause read-balancing to be non-optimal, and can cause
data to be read from the 'wrong' branch during a resync. This has a
very small chance of returning wrong data.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
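
A standalone sketch of the truncation (using uint32_t to model a
32-bit kernel's unsigned long):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* A sector just past 2TB: 2TB / 512 bytes = 2^32 sectors. */
        unsigned long long sector = (1ULL << 32) + 100;
        uint32_t truncated = (uint32_t)sector;   /* 32-bit unsigned long */
        printf("full sector: %llu, truncated: %u\n",
               sector, (unsigned)truncated);
        return 0;
    }
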
128595ed6ff2c7358ae253a560d47a0af463bc99 03-May-2010 NeilBrown <neilb@suse.de> md/raid10: tidy up printk messages.

All raid10 printk messages now start
md/raid10:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
21a52c6d05c15f862797736393915bfa8cd40ee9 01-Apr-2010 NeilBrown <neilb@suse.de> md: pass mddev to make_request functions rather than request_queue

We used to pass the personality make_request function directly
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
a lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
490773268cf64f68da2470e07b52c7944da6312d 25-Mar-2010 NeilBrown <neilb@suse.de> md: move io accounting out of personalities into md_make_request

While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.

Signed-off-by: NeilBrown <neilb@suse.de>
dab8b29248b3f14f456651a2a6ee9b8fd16d1b3c 08-Mar-2010 Trela, Maciej <Maciej.Trela@intel.com> md: Add support for Raid0->Raid10 takeover


Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
84707f38e767ac470fd82af6c45a8cafe2bd1b9a 16-Mar-2010 NeilBrown <neilb@suse.de> md: don't use mddev->raid_disks in raid0 or raid10 while array is active.

In a subsequent patch we will make it possible to change
mddev->raid_disks while a RAID0 or RAID10 array is active. This is
part of the process of reshaping such an array.

This means that we cannot use this value while processing requests
(it is OK to use it during initialisation as we are locked against
changes then).
Both RAID0 and RAID10 have the same value stored in the private data
structure, so use that value instead.

Signed-off-by: NeilBrown <neilb@suse.de>
7b92813c3c0b6990f14838e3985fb385d2655d0c 08-Mar-2010 H Hartley Sweeten <hartleys@visionengravers.com> drivers/md: Remove unnecessary casts of void *

void pointers do not need to be cast to other pointer types.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include
those headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
627a2d3c29427637f4c5d31ccc7fcbd8d312cd71 08-Mar-2010 NeilBrown <neilb@suse.de> md: deal with merge_bvec_fn in component devices better.

If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to. Currently this is done by setting
max_sector to 1 PAGE, however this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.

So instead set max_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.
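
For illustration, the settings described above would look roughly like
this in a personality's run method (a sketch, not the exact patch):

    if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
            /* never build multi-segment or page-crossing IO */
            blk_queue_max_segments(mddev->queue, 1);
            blk_queue_segment_boundary(mddev->queue,
                                       PAGE_CACHE_SIZE - 1);
    }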

This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
086fa5ff0854c676ec333760f4c0154b3b242616 26-Feb-2010 Martin K. Petersen <martin.petersen@oracle.com> block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors

The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.

Also introduce a temporary wrapper for backwards compatibility. This can
be removed after the merge window is closed.
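
The temporary wrapper is presumably just a pass-through, along these
lines (sketch):

    static inline void blk_queue_max_sectors(struct request_queue *q,
                                             unsigned int max)
    {
            blk_queue_max_hw_sectors(q, max);
    }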

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
0efb9e6191e1d3d34c1db90b829b742bc36d532e 13-Dec-2009 NeilBrown <neilb@suse.de> md: add MODULE_DESCRIPTION for all md related modules.

Suggested by Oren Held <orenhe@il.ibm.com>

Signed-off-by: NeilBrown <neilb@suse.de>
1e50915fe0bbf7a46db0fa7e1e604d3fc95f057d 13-Dec-2009 Robert Becker <Rob.Becker@riverbed.com> raid: improve MD/raid10 handling of correctable read errors.

We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts. Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.
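
A sketch of that decay-and-threshold logic (condensed; seconds_since()
is a hypothetical helper and the field names are approximations):

    /* halve the error count for each hour since the last error */
    hours = seconds_since(rdev->last_read_error) / 3600;
    if (hours >= 8 * sizeof(unsigned int))
            atomic_set(&rdev->read_errors, 0);
    else
            atomic_set(&rdev->read_errors,
                       atomic_read(&rdev->read_errors) >> hours);

    atomic_inc(&rdev->read_errors);
    if (atomic_read(&rdev->read_errors) > max_read_errors)
            md_error(mddev, rdev);          /* fail the raid device */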

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
67b8dc4b06b0e97df55fd76e209f34f9a52e820e 13-Dec-2009 Robert Becker <Rob.Becker@riverbed.com> md/raid10: print more useful messages on device failure.

When we get a read error on a device in a RAID10, and attempting to
repair the error fails, print more useful messages about why it
failed.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
9cd30fdc33cde9ae4ac55a1ccbbb89f3f7b9b2f2 13-Dec-2009 NeilBrown <neilb@suse.de> md: remove needless setting of thread->timeout in raid10_quiesce

As bitmap_create and bitmap_destroy already set thread->timeout
as appropriate, there is no need to do it in raid10_quiesce.
There is a possible need to wake the thread after the timeout
has been set low, but it is better to do that where the timeout
is actually set low, in bitmap_create.

Signed-off-by: NeilBrown <neilb@suse.de>
1b04be96f6910ee415287bf0d5309c7d4c94bd2b 13-Dec-2009 NeilBrown <neilb@suse.de> md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.

This removes a lot of multiplications by HZ.

Signed-off-by: NeilBrown <neilb@suse.de>
42a04b5078ce73a32f85762551d5703c5bd646a1 13-Dec-2009 NeilBrown <neilb@suse.de> md: move offset, daemon_sleep and chunksize out of bitmap structure

... and into bitmap_info. These are all configuration parameters
that need to be set before the bitmap is created.

Signed-off-by: NeilBrown <neilb@suse.de>
a2826aa92e2e14db372eda01d333267258944033 13-Dec-2009 NeilBrown <neilb@suse.de> md: support barrier requests on all personalities.

Previously barriers were only supported on RAID1. This is because
other levels require synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.
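
In outline, the flow is as follows (a sketch with hypothetical helper
names, not the patch's actual functions):

    send_zero_length_barriers(mddev);       /* one per active rdev */
    wait_for_completion_of_barriers(mddev);
    if (bio->bi_size) {                     /* original request not empty */
            bio->bi_rw &= ~(1 << BIO_RW_BARRIER);   /* clear barrier flag */
            generic_make_request(bio);
            send_zero_length_barriers(mddev);       /* errors ignored here */
    }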

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
ed9bfdf1a40952fd0f8094ec77f876b84ead69af 16-Oct-2009 NeilBrown <neilb@suse.de> md: raid1/raid10: handle allocation errors during array setup.

Both raid1 and raid10 create a mempool during startup.
If the 'alloc' function for this mempool fails, unplug_slaves
is called.
If that happens when the pool is being initialised, unplug_slaves
will try to use the 'conf' structure that isn't filled in yet, and
badness will happen.

So ensure that unplug_slaves doesn't get called unless we know
that the conf structure is fully initialised.

Signed-off-by: NeilBrown <neilb@suse.de>
1d9d52416c0445019ccc1f0fddb9a227456eb61b 16-Oct-2009 NeilBrown <neilb@suse.de> md/raid1/raid10: add a cond_resched

During 'check' of a raid1 or raid10 it is possible for the management
thread to spend a lot of time running 'memcmp' on blocks from
different devices, so make sure the thread has a chance to schedule.
raid5d already has a cond_resched (in process_stripe).
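
The change itself is tiny; in outline (array names illustrative):

    for (i = 0; i < vcnt; i++) {            /* compare corresponding pages */
            if (memcmp(page_address(ppages[i]),
                       page_address(spages[i]), PAGE_SIZE))
                    mismatch = 1;
            cond_resched();                 /* give the scheduler a chance */
    }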

Reported-By: Lee Howard <faxguy@howardsilvan.com>
Signed-off-by: NeilBrown <neilb@suse.de>
1ef04fefe2241087d9db7e9615c3f11b516e36cf 20-Sep-2009 Dmitry Monakhov <rjevskiy@gmail.com> md: raid-1/10: fix RW bits manipulation

Recently Jens changed the bio_rw_flagged() logic in the following
commit, 1f98a13f623e0ef666690a18c1250335fc6d7ef1. Now it returns
bool instead of int. This broke the raid1/raid10 RW bits manipulation logic.
One visible result is a BUG_ON triggering due to an empty barrier
at scsi_lib.c:1108, scsi_setup_fs_cmnd().

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NeilBrown <neilb@suse.de>
3fa841d7e7266f6fcc1b3885b905f5153ba897d8 23-Sep-2009 NeilBrown <neilb@suse.de> md: report device as congested when suspended

This should stop writeback from coming when the device is temporarily
suspended.

Signed-off-by: NeilBrown <neilb@suse.de>
0da3c6194ec2f32617b272df4505a1cf022faea5 23-Sep-2009 NeilBrown <neilb@suse.de> md: Improve name of threads created by md_register_thread

The management threads for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusing.

So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
a9f326ebf22a0de776815240fb76dabe139397ea 23-Sep-2009 NeilBrown <neilb@suse.de> md: remove sparse warning "symbol xxx shadows an earlier one"

Rename some variables and remove some duplicate definitions
to avoid these warnings. None of them are actual errors.

Signed-off-by: NeilBrown <neilb@suse.de>
1f98a13f623e0ef666690a18c1250335fc6d7ef1 11-Sep-2009 Jens Axboe <jens.axboe@oracle.com> bio: first step in sanitizing the bio->bi_rw flag testing

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ac5e7113e74872928844d00085bd47c988f12728 03-Aug-2009 Andre Noll <maan@systemlinux.org> md: Push down data integrity code to personalities.

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
8f6c2e4b325a8e9f8f47febb2fd0ed4fae7d45a9 01-Jul-2009 Martin K. Petersen <martin.petersen@oracle.com> md: Use new topology calls to indicate alignment and I/O sizes

Switch MD over to the new disk_stack_limits() function which checks for
aligment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
8c6ac868b107ed50a46204f6d14e2ad9443ff146 18-Jun-2009 Andre Noll <maan@systemlinux.org> md: Push down reconstruction log message to personality code.

Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
9d8f0363623b3da12c43007cf77f5e1a4e8a5964 18-Jun-2009 Andre Noll <maan@systemlinux.org> md: Make mddev->chunk_size sector-based.

This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics. Since

is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
= is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.
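
In other words, the stored value changes only by a shift; as an
illustration (variable name hypothetical):

    mddev->chunk_sectors = chunk_size_in_bytes >> 9;  /* bytes -> sectors */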

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
964e7913b0d25b988e27a7cd9378bc55cc572bb4 16-Jun-2009 raz ben yehuda <raziebe@gmail.com> md: raid10: chunk size check in run

Have raid10 check the chunk size in its run method instead of in md.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
070ec55d07157a3041f92654135c3c6e2eaaf901 16-Jun-2009 NeilBrown <neilb@suse.de> md: remove mddev_to_conf "helper" macro

Having a macro just to cast a void* isn't really helpful.
I would much rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.
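
The macro in question, and its open-coded replacement (as described
above):

    /* before: */
    #define mddev_to_conf(mddev) ((conf_t *) mddev->private)
    conf_t *conf = mddev_to_conf(mddev);

    /* after: */
    conf_t *conf = mddev->private;  /* no cast needed from void * */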

Signed-off-by: NeilBrown <neilb@suse.de>
ae03bf639a5027d27270123f5f6e3ee6a412781d 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com> block: Use accessor functions for queue limits

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
18055569127253755d01733f6ecc004ed02f88d0 06-May-2009 NeilBrown <neilb@suse.de> md/raid10: don't clear bitmap during recovery if array will still be degraded.

If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).

This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.

This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.

This patch is suitable for all active -stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
8f3d8ba20e67991b531e9c0227dcd1f99271a32c 07-Apr-2009 Christoph Hellwig <hch@lst.de> block: move bio list helpers into bio.h

It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
b522adcde9c4d3fb7b579cfa9160d8bde7744be8 31-Mar-2009 Dan Williams <dan.j.williams@intel.com> md: 'array_size' sysfs attribute

Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
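
A sketch of the checks such a helper has to make (not necessarily the
exact code of strict_blocks_to_sectors()):

    static int strict_blocks_to_sectors(const char *buf, sector_t *sectors)
    {
            unsigned long long blocks;
            sector_t new;

            if (strict_strtoull(buf, 10, &blocks) < 0)
                    return -EINVAL;
            if (blocks & 1ULL << (8 * sizeof(blocks) - 1))
                    return -EINVAL;         /* blocks * 2 would overflow */

            new = blocks * 2;               /* 1K blocks -> 512B sectors */
            if ((unsigned long long)new != blocks * 2)
                    return -EINVAL;         /* sector_t overflow */

            *sectors = new;
            return 0;
    }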

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
1f403624bde3c678a166984b1e6a727a0ce06f2b 31-Mar-2009 Dan Williams <dan.j.williams@intel.com> md: centralize ->array_sectors modifications

Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
80c3a6ce4ba4470379b9e6a4d9bcd9d2ee26ae03 18-Mar-2009 Dan Williams <dan.j.williams@intel.com> md: add 'size' as a personality method

In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested. For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
409c57f3801701dfee27a28103dda4831306cb20 31-Mar-2009 NeilBrown <neilb@suse.de> md: enable suspend/resume of md devices.

To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.
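
A condensed sketch of that mechanism (locking and RCU details
omitted):

    static int md_make_request(struct request_queue *q, struct bio *bio)
    {
            mddev_t *mddev = q->queuedata;
            int rv;

            wait_event(mddev->sb_wait, !mddev->suspended);
            atomic_inc(&mddev->active_io);  /* count in-transit requests */
            rv = mddev->pers->make_request(q, bio);
            if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
                    wake_up(&mddev->sb_wait);       /* suspend can proceed */
            return rv;
    }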

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>
58c0fed400603a802968b23ddf78f029c5a84e41 31-Mar-2009 Andre Noll <maan@systemlinux.org> md: Make mddev->size sector-based.

This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows us to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
43b2e5d86d8bdd77386226db0bc961529492c043 31-Mar-2009 NeilBrown <neilb@suse.de> md: move md_k.h from include/linux/raid/ to drivers/md/

It really is nicer to keep related code together.

Signed-off-by: NeilBrown <neilb@suse.de>
bff61975b3d6c18ee31457cc5b4d73042f44915f 31-Mar-2009 NeilBrown <neilb@suse.de> md: move lots of #include lines out of .h files and into .c

This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.

Signed-off-by: NeilBrown <neilb@suse.de>
ef740c372dfd80e706dbf955d4e4aedda6c0c148 31-Mar-2009 Christoph Hellwig <hch@lst.de> md: move headers out of include/linux/raid/

Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
73d5c38a9536142e062c35997b044e89166e063b 25-Feb-2009 NeilBrown <neilb@suse.de> md: avoid races when stopping resync.

There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler change.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed. So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
78200d45cde2a79c0d0ae0407883bb264caa3c18 25-Feb-2009 NeilBrown <neilb@suse.de> md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery.

For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
They both walk through the device addresses in order.

For raid10 it can be quite different. resync follows the 'array'
address, and makes sure all copies are the same. Recovery walks
through 'device' addresses and recreates each missing block.

The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(When present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.

In particular, it can cause bitmap-directed recovery of a raid10 to
not recover some of the blocks that need to be recovered.

So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
09b4068a7fe442efc40e9dcbcf5ff37c3338ab15 25-Feb-2009 NeilBrown <neilb@suse.de> md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery.

When doing recovery on a raid10 with a write-intent bitmap, we only
need to recovery chunks that are flagged in the bitmap.

However if we choose to skip a chunk as it isn't flagged, the code
currently skips the whole raid10-chunk, and thus it might not recover
some blocks that need recovering.

This patch fixes it.

In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.

This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.



Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
159ec1fc060ab22b157a62364045f5e98749c4d3 08-Jan-2009 Cheng Renquan <crquan@gmail.com> md: use list_for_each_entry macro directly

The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.

But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough; this could save a temp
variable (tmp) in every function that used rdev_for_each.

In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
saving many tmp vars; the safe version is kept only where the loop
will call list_del to delete an entry.
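
After the change, the macro is a thin wrapper, roughly:

    #define rdev_for_each(rdev, tmp, mddev) \
            list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)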

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
a53a6c85756339f82ff19e001e90cfba2d6299a8 06-Nov-2008 NeilBrown <neilb@suse.de> md: fix bug in raid10 recovery.

Adding a spare to a raid10 doesn't cause recovery to start.
This is due to a silly typo in
commit 6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda
and so is a bug in 2.6.27 and .28-rc.

Thanks to Thomas Backlund for bisecting to find this.

Cc: Thomas Backlund <tmb@mandriva.org>
Cc: stable@kernel.org

Signed-off-by: NeilBrown <neilb@suse.de>
255707274ea25d486b7de060a30ba4ac50593408 15-Oct-2008 Stephen Rothwell <sfr@canb.auug.org.au> md: build failure due to missing delay.h

Today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'

Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
4bbf3771ca40d0aaec8316d0e7476b16010288e5 13-Oct-2008 NeilBrown <neilb@suse.de> md: Relax minimum size restrictions on chunk_size.

Currently, the 'chunk_size' of an array must be at least PAGE_SIZE.

This means that moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.

For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.

So remove the test on chunk_size from common code, and add it in just
the places where it is needed: raid10 and raid4/5/6.
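
A sketch of the per-personality test that moves into raid10's run
method (message text approximate):

    if (mddev->chunk_size < PAGE_SIZE) {
            printk(KERN_ERR "md/raid10: chunk size must be"
                   " at least PAGE_SIZE(%ld).\n", PAGE_SIZE);
            return -EINVAL;
    }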

Signed-off-by: NeilBrown <neilb@suse.de>
6feef531f55cf4a20fd9eb39f5352e5745203603 09-Oct-2008 Denis ChengRq <crquan@gmail.com> block: mark bio_split_pool static

Since all bio_split calls refer to the same single bio_split_pool, the bio_split
function can use bio_split_pool directly instead of the mempool_t parameter;

then the mempool_t parameter can be removed from the bio_split param list, and
bio_split_pool is only referred to in fs/bio.c, so it can be marked static.

Signed-off-by: Denis ChengRq <crquan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
074a7aca7afa6f230104e8e65eba3420263714a5 25-Aug-2008 Tejun Heo <tj@kernel.org> block: move stats from disk to part0

Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similarly. It handles part0 stats
automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_flight() is implemented which automatically updates
part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
c9959059161ddd7bf4670cf47367033d6b2f79c4 25-Aug-2008 Tejun Heo <tj@kernel.org> block: fix diskstats access

There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters. It's unclear
whether the underbarred ones assume that preemption is disabled on
entry as some callers don't do that.

This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access). diskstats access
should always be enclosed between the two functions. As such, there's
no need for the versions which disables preemption. They're removed
and double underbars ones are renamed to drop the underbars. As an
extra argument is added, there's no danger of using the old version
unconverted.

disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now have a @cpu
argument to help RT.
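
Typical accounting then follows this pattern (a sketch based on the
description above):

    cpu = disk_stat_lock();         /* get_cpu() + rcu_read_lock() */
    disk_stat_inc(cpu, disk, ios[rw]);
    disk_stat_add(cpu, disk, sectors[rw], bio_sectors(bio));
    disk_stat_unlock();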

This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others. Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
960e739d9e9f1c2346d8bdc65299ee2e1ed42218 15-Aug-2008 Jens Axboe <jens.axboe@oracle.com> block: raid fixups for removal of bi_hw_segments

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
5df97b91b5d7ed426034fcc84cb6e7cf682b8838 15-Aug-2008 Mikulas Patocka <mpatocka@redhat.com> drop vmerge accounting

Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
0310fa216decc3ecfab41f327638fa48a81f3735 05-Aug-2008 NeilBrown <neilb@suse.de> Allow raid10 resync to happen in larger chunks.

The raid10 resync/recovery code currently limits the amount of
in-flight resync IO to 2Meg. This was copied from raid1 where
it seems quite adequate. However for raid10, some layouts require
a bit of seeking to perform a resync, and allowing a larger buffer
size means that the seeking can be significantly reduced.

There is probably no real need to limit the amount of in-flight
IO at all. Any shortage of memory will naturally reduce the
amount of buffer space available down to a set minimum, and any
concurrent normal IO will quickly cause resync IO to back off.

The only problem would be that normal IO has to wait for all resync IO
to finish, so a very large amount of resync IO could cause unpleasant
latency when normal IO starts up.

So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
available) which seems to be a good amount. Also reduce the amount
of memory reserved as there is no need to keep 2Meg just for resync if
memory is tight.

Thanks to Keld for the suggestion.

Cc: Keld Jørn Simonsen <keld@dkuug.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
388667bed591b2359713bb17d5de0cf56e961447 25-Jul-2008 Arthur Jones <ajones@riverbed.com> md: raid10: wake up frozen array

When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect. This causes the array to
remain frozen for eternity. We add a wake_up
to allow the array to de-freeze. This code is
nearly identical to the raid1 code, which has
this fix already.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
f233ea5c9e0d8b95e4283bf6a3436b88f6fd3586 21-Jul-2008 Andre Noll <maan@systemlinux.org> md: Make mddev->array_size sector-based.

This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
cc371e66e340f35eed8dc4651c7c18e754c7fb26 03-Jul-2008 Alasdair G Kergon <agk@redhat.com> Add bvec_merge_data to handle stacked devices and ->merge_bvec()

When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.

The following bio fields are used:
bio->bi_sector
bio->bi_bdev
bio->bi_size
bio->bi_rw using bio_data_dir()

This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up. (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)
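
The new structure simply carries copies of those four fields, and the
merge_bvec_fn signature takes it instead of a bio:

    struct bvec_merge_data {
            struct block_device *bi_bdev;
            sector_t bi_sector;
            unsigned int bi_size;
            unsigned long bi_rw;
    };

    typedef int (merge_bvec_fn) (struct request_queue *,
                                 struct bvec_merge_data *,
                                 struct bio_vec *);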

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
199050ea1ff2270174ee525b73bc4c3323098897 28-Jun-2008 Neil Brown <neilb@notabene.brown> rationalise return value for ->hot_add_disk method.

For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown <neilb@suse.de>
6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda 28-Jun-2008 Neil Brown <neilb@notabene.brown> Support adding a spare to a live md array with external metadata.

i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill a vacant slot in an md array.

Signed-off-by: Neil Brown <neilb@suse.de>
8c2e870a625bd336b2e7a65a97c1836acef07322 28-Jun-2008 Neil Brown <neilb@notabene.brown> Ensure interrupted recovery completed properly (v1 metadata plus bitmap)

If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.

Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keeps track of how in-sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recover all blocks.

This fix is already in place for raid1, but not raid5/6 or raid10.

So without this fix, a raid1 or raid4/5/6 array with version 1.x
metadata and a write-intent bitmap, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.

If you might have an array like that, issuing
echo repair > /sys/block/mdXX/md/sync_action

will make sure recovery completes properly.

Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
dfc7064500061677720fa26352963c772d3ebe6b 23-May-2008 NeilBrown <neilb@suse.de> md: restart recovery cleanly after device failure.

When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them, which loses important state
information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.

Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e7e72bf641b1fc7b9df6f40bd2c36dfccd8d647c 15-May-2008 Neil Brown <neilb@suse.de> Remove blkdev warning triggered by using md

As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.

For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock use
q->__queue_lock. So always initialise that lock when the queue is allocated.

With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.

Thanks to Dan Williams for review and fixing the silly bugs.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cb6969e8cdef39e613b1755eff595f830b89bc82 07-May-2008 Harvey Harrison <harvey.harrison@gmail.com> misc: fix integer as NULL pointer warnings

drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6bfe0b499082fd3950429017cd8ebf2a6c458aa5 30-Apr-2008 Dan Williams <dan.j.williams@intel.com> md: support blocking writes to an array on device failure

Allows a userspace metadata handler to take action upon detecting a device
failure.

Based on an original patch by Neil Brown.

Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d7a420c9472a95c46600a0345434b7b166e0b9c7 28-Apr-2008 Nick Andrew <nick@nick-andrew.net> raid: remove leading TAB on printk messages

MD drivers use one printk() call to print 2 log messages and the second line
may be prefixed by a TAB character. It may also output a trailing space
before newline. klogd (I think) turns the TAB character into the 2 characters
'^I' when logging to a file. This looks ugly.

Instead of a leading TAB to indicate continuation, prefix both output lines
with 'raid:' or similar. Also remove any trailing space in the vicinity of
the affected code and consistently end the sentences with a period.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a07e6ab41be179cf1ed728a4f41368435508b550 04-Mar-2008 K.Tanaka <k-tanaka@ce.jp.nec.com> md: the md RAID10 resync thread could cause a md RAID10 array deadlock

This message describes another issue about md RAID10 found by testing the
2.6.24 md RAID10 using the new scsi fault injection framework.

Abstract:

When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.

In this case, the raid array has already been broken and it may not matter. But
I think a stall is not preferable. If it occurs, even shutdown or reboot will
fail because of busy resources.

The deadlock mechanism:

The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering. The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finishing a BIO in
end_sync_write().

If building a BIO fails for some reason in sync_request(), the "remaining"
should be decremented if it has already been incremented. I found a case
where this decrement is forgotten. This causes a md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.

For example, this problem would be reproduced in the following case:

Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
[>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec

In this case, sdf1 is recovering; sdb1 and sde1 are disabled.
An additional error with detaching sdd will cause a deadlock.

md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
[=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec

2739 ? S< 0:17 [md0_raid10]
28608 ? D< 0:00 [md0_resync]
28629 pts/1 Ss 0:00 bash
28830 pts/1 R+ 0:00 ps ax
31819 ? D< 0:00 [kjournald]

The resync thread keeps working, but actually it is deadlocked.

Patch:
With this patch, the remaining counter is decremented when needed.
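
In sketch form, the invariant the fix restores (variable names
illustrative, not the actual patch):

    /* when a recovery BIO is successfully built in sync_request(): */
    atomic_inc(&r10_bio->remaining);

    /* the failure path must undo it, otherwise end_sync_write() can
     * never bring the count to zero and md_done_sync() is never called: */
    if (!bio_built)
            atomic_dec(&r10_bio->remaining);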

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1c830532f6b44d10a1743ccd00e990c6b83396f5 04-Mar-2008 NeilBrown <neilb@suse.de> md: fix possible raid1/raid10 deadlock on read error during resync

Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.

If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.

This patch fixes the problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8ed3a19563b6c05b7625649b1769ddb063d53253 04-Mar-2008 Keld Simonsen <keld@dkuug.dk> md: don't attempt read-balancing for raid10 'far' layouts

This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.

Some advantages:

The fastest part, which is the outer sectors of the disks involved, will be
used. The outer blocks of a disk may be as much as 100 % faster than the
inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better, as
they will work as raid0, the sequential read rate will be number of disks
involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the first disk which is working, and has the
lowest block address for the logical block will be used.

Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a35e63efa1fb18c6f20f38e3ddf3f8ffbcf0f6e7 04-Mar-2008 NeilBrown <neilb@suse.de> md: fix deadlock in md/raid1 and md/raid10 when handling a read error

When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.

This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).

However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.
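
Condensed, the freeze path then looks roughly like this (sketch of the
raid1 version):

    static void freeze_array(conf_t *conf)
    {
            spin_lock_irq(&conf->resync_lock);
            conf->barrier++;
            conf->nr_waiting++;
            wait_event_lock_irq(conf->wait_barrier,
                                conf->nr_pending == conf->nr_queued + 1,
                                conf->resync_lock,
                                flush_pending_writes(conf));
            spin_unlock_irq(&conf->resync_lock);
    }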

Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d089c6af10c2be5988f03667d6d22fe6085fbe5e 06-Feb-2008 NeilBrown <neilb@suse.de> md: change ITERATE_RDEV to rdev_for_each

As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c620727779f7cc8ea96efb71f0651a26349e59c1 06-Feb-2008 NeilBrown <neilb@suse.de> md: allow a maximum extent to be set for resyncing

This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.

Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b47490c9bc73d0b34e4c194db40de183e592e446 06-Feb-2008 NeilBrown <neilb@suse.de> md: Update md bitmap during resync.

Currently an md array with a write-intent bitmap does not update that bitmap
to reflect successful partial resync. Rather the entire bitmap is updated
when the resync completes.

This is because there is no guarantee that resync requests will complete in
order, and tracking each request individually is unnecessarily burdensome.

However there is value in regularly updating the bitmap, so add code to
periodically pause while all pending sync requests complete, then update the
bitmap. Doing this only every few seconds (the same as the bitmap update
time) does not noticeably affect resync performance.

[snitzer@gmail.com: export bitmap_cond_end_sync]
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2ad8b1ef11c98c5603580878aebf9f1bc74129e4 07-Nov-2007 Alan D. Brunelle <Alan.Brunelle@hp.com> Add UNPLUG traces to all appropriate places

Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
fd5d806266935179deda1502101624832eacd01f 16-Oct-2007 Jens Axboe <jens.axboe@oracle.com> block: convert blkdev_issue_flush() to use empty barriers

Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
6712ecf8f648118c3363c142196418f89a510b90 27-Sep-2007 NeilBrown <neilb@suse.de> Drop 'size' argument from bio_endio and bi_end_io

As bi_end_io is only called once when the request is complete,
the 'size' argument is now redundant. Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.

While we are at it, change bi_end_io to return void.
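
The prototype before and after:

    /* before */
    void bio_endio(struct bio *bio, unsigned int bytes_done, int error);
    /* after */
    void bio_endio(struct bio *bio, int error);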

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
f6f953aa99d456aff44ffdb1c77061d1a010eae2 31-Jul-2007 Arne Redlich <agr@powerkom-dd.de> md: handle writes to broken raid10 arrays gracefully

When writing to a broken array, raid10 currently happily emits empty bio
lists. IOW, the master bio will never be completed, sending writers to
UNINTERRUPTIBLE_SLEEP forever.

Signed-off-by: Arne Redlich <agr@powerkom-dd.de>
Acked-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14e713446aaca97dbe590fe845f7dcbd74ddbee2 31-Jul-2007 Maik Hampel <m.hampel@gmx.de> md: raid10: fix use-after-free of bio

In case of read errors raid10d tries to print a nice error message,
unfortunately using data from an already put bio.

Signed-off-by: Maik Hampel <m.hampel@gmx.de>
Acked-By: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
165125e1e480f9510a5ffcfbfee4e3ee38c05f23 24-Jul-2007 Jens Axboe <jens.axboe@oracle.com> [BLOCK] Get rid of request_queue_t typedef

Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
4ad1366376bfef32ec0ffa12d1faa483d6f330bd 17-Jul-2007 NeilBrown <neilb@suse.de> md: change bitmap_unplug and others to void functions

bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.

write_page returns an error which is not consistently checked. It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.

When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.

bitmap_update_sb returns an error that is never checked.

So make these 'void' and be consistent about checking the bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
af03b8e4e81c3789e597632268940edd11ffe870 16-Jun-2007 NeilBrown <neilb@suse.de> md: fix two raid10 bugs

1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
garbage can get synced on top of good data.

2/ We round the wrong way in part of the device size calculation, which
can cause confusion.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
64a742bc61f9115b0bb270fa081e5b5b9c35dcd0 01-Mar-2007 NeilBrown <neilb@suse.de> [PATCH] md: fix raid10 recovery problem.

There are two errors that can lead to recovery problems with raid10
when used in 'far' mode (not the default).

Due to a '>' instead of '>=', the wrong block is located, which would result in
garbage being written to some random location, quite possible outside the
range of the device, causing the newly reconstructed device to fail.

The device size calculation had some rounding errors (it didn't round when it
should) and so recovery would go a few blocks too far which would again cause
a write to a random block address and probably a device error.

The code for working with device sizes was fairly confused and spread out, so
this has been tidied up a bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e3881a6816b45668df60a426e5c3431ece1539a7 11-Jan-2007 Lars Ellenberg <Lars.Ellenberg@linbit.com> [PATCH] md: pass down BIO_RW_SYNC in raid{1,10}

md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
introducing additional latency.

Fixing this in raid1 and raid10 seems to be straightforward enough.

For our particular usage case in DRBD, passing this flag improved some
initialization time from ~5 minutes to ~5 seconds.

Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Lars Ellenberg <lars@linbit.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
802ba064c49f655d20fed563f2a4924c8256ea10 13-Dec-2006 NeilBrown <neilb@suse.de> [PATCH] md: Don't assume that READ==0 and WRITE==1 - use the names explicitly

Thanks Jens for alerting me to this.

Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
969b755aadf7bcf3df5991a127a103acd0145a52 28-Oct-2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] md: fix printk format warnings, seen on powerpc64:

drivers/md/raid1.c:1479: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid10.c:1475: warning: long long unsigned int format, long unsigned int arg (arg 4)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2e333e89860431d22816c1bdaa2ea72c2753396e 21-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: fix calculation of ->degraded for multipath and raid10

Two less-used md personalities have bugs in the calculation of ->degraded (the
extent to which the array is degraded).

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0d12922823408b26f83b15cae4a4feff4bd22f28 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: define ->congested_fn for raid1, raid10, and multipath

raid1, raid10 and multipath don't report their 'congested' status through
bdi_*_congested, but should.

This patch adds the appropriate functions which just check the 'congested'
status of all active members (with appropriate locking).

raid1 read_balance should be modified to prefer devices where
bdi_read_congested returns false. Then we could use the '&' branch rather
than the '|' branch. However, that would need some benchmarking first
to make sure it is actually a good idea.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
c04be0aa82ff535e3676ab3e573957bdeef41879 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: Improve locking around error handling

The error handling routines don't use proper locking, and so two concurrent
errors could trigger a problem.

So:
- use test-and-set and test-and-clear to synchronise
the In_sync bits with the ->degraded count
- use the spinlock to protect updates to the
degraded count (could use an atomic_t but that
would be a bigger change in code, and isn't
really justified)
- remove unnecessary locking in raid5

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
76186dd8b73d2b7b9b4c8629b89c845e97009801 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: remove 'working_disks' from raid10 state

It isn't needed as mddev->degraded contains equivalent info.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
850b2b420cd5b363ed4cf48a8816d656c8b5251b 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: replace magic numbers in sb_dirty with well defined bit flags

Instead of magic numbers (0,1,2,3) in sb_dirty, we now have
some named flags:
MD_CHANGE_DEVS
Some device state has changed requiring superblock update
on all devices.
MD_CHANGE_CLEAN
The array has transitioned from 'clean' to 'dirty' or back,
requiring a superblock update on active devices, but possibly
not on spares.
MD_CHANGE_PENDING
A superblock update is underway.

We wait for an update to complete by waiting for all flags to be clear. A
flag can be set at any time, even during an update, without risk that the
change will be lost.
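
In use this looks something like the following (a sketch; the exact
wait condition is an assumption):

    set_bit(MD_CHANGE_DEVS, &mddev->flags);   /* request a superblock update */
    md_wakeup_thread(mddev->thread);
    /* an update is complete only when every change bit has cleared */
    wait_event(mddev->sb_wait, mddev->flags == 0);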

Stop exporting md_update_sb - it isn't needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6814d5368d68341ec6b5e4ecd10ea5947130775a 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: factor out part of raid10d into a separate function.

raid10d has too many nested blocks, so take the fix_read_error functionality
out into a separate function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d69504325978c461b51b03cca49626026970307b 10-Jul-2006 NeilBrown <neilb@suse.de> [PATCH] md: include sector number in messages about corrected read errors

This is generally useful, but particularly helps see if it is the same sector
that always needs correcting, or different ones.

[akpm@osdl.org: fix printk warnings]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8838832830d2c6c28ae2db93188ae90652eb7fc2 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: Calculate correct array size for raid10 in new offset mode

The size calculation made assumptions which the new offset mode didn't
follow. This gets the size right in all cases.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
c93983bf517c100a31e40ef087e19bd3d7aa2d28 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: support stripe/offset mode in raid10

The "industry standard" DDF format allows for a stripe/offset layout where
data is duplicated on different stripes. e.g.

A B C D
D A B C
E F G H
H E F G

(columns are drives, rows are stripes, LETTERS are chunks of data).

This is similar to raid10's 'far' mode, but not quite the same. So enhance
'far' mode with a 'far/offset' option which follows the layout of DDF's
stripe/offset.
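
One way to read the diagram: with n drives and two copies, each chunk
keeps its raid0 position for copy 0, and copy 1 lands on the next stripe,
shifted one drive to the right (illustrative arithmetic, not kernel code):

    copy 0 of chunk c -> drive  c % n,       stripe 2 * (c / n)
    copy 1 of chunk c -> drive (c + 1) % n,  stripe 2 * (c / n) + 1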

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
5fd6c1dce06ec24ef3de20fe0c7ecf2ba9fe5ef9 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: allow checkpoint of recovery with version-1 superblock

For a while we have had checkpointing of resync. The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.

Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8932c2e0dcae52e73430878fd8a7a7800176eada 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: remove arbitrary limit on chunk size

The largest chunk size the code can support without substantial surgery is
2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day,
the 'chunksize' should change to a sector-shift instead of a byte-count. Then
no limit would be needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e0a33270ed0e8e00cbb882a33d21e1f92aac0ceb 01-May-2006 NeilBrown <neilb@suse.de> [PATCH] md: Fixed refcounting/locking when attempting read error correction in raid10

We need to hold a reference to rdevs while reading and writing to attempt to
correct read errors. This reference must be taken under an rcu lock.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
df30d0f4ca3c41b60068232d6de9d58be88436f0 01-May-2006 NeilBrown <neilb@suse.de> [PATCH] md: Avoid oops when attempting to fix read errors on raid10

We should add to the counter for the rdev *after* checking if the rdev is
NULL!!!

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b6385483840e903d99cb753593faea215ae8d324 02-Apr-2006 Eric Sesterhenn <snakebyte@gmx.de> BUG_ON() Conversion in md/raid10.c

This changes if() BUG(); constructs to BUG_ON(), which is
cleaner and can be better optimized away.
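
For example (illustrative):

    /* before */
    if (!page)
            BUG();
    /* after */
    BUG_ON(!page);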

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
29fc7e3e70a05e9eea28afb6707a39c1a53e2f66 03-Feb-2006 NeilBrown <neilb@suse.de> [PATCH] md: Assorted little md fixes

- version-1 superblock
+ The default_bitmap_offset is in sectors, not bytes.
+ the 'size' field in the superblock is in sectors, not KB
- raid0_run should return a negative number on error, not '1'
- raid10_read_balance should not return a valid 'disk' number if
->rdev turned out to be NULL
- kmem_cache_destroy doesn't like being passed a NULL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
858119e159384308a5dde67776691a2ebf70df0f 14-Jan-2006 Arjan van de Ven <arjan@infradead.org> [PATCH] Unlinline a bunch of other functions

Remove the "inline" keyword from a bunch of big functions in the kernel with
the goal of shrinking it by 30kb to 40kb

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4dbcdc751cb25ffca3a8374cbc5ab6de961cc545 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: count corrected read errors per drive

Store this total in the superblock (as appropriate), and make it available to
userspace via sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d9d166c2a9d5d01af34396793950aa695883eed4 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: allow array level to be set textually via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
f188593ee7af8c71755d2df269a7a5f62c4b695e 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: fix typo in comment

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1345b1d8adbdeceb1c871d9a4af5e2a700b341c6 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: define and use safe_put_page for md

md sometimes calls put_page on NULL pointers (treating it like kfree). This is
not safe, so define and use a 'safe_put_page' which checks for NULL.
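
The helper is a one-liner (sketch):

    static inline void safe_put_page(struct page *p)
    {
            if (p)
                    put_page(p);
    }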

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
097426f689f179747f3cd6b4749eb2a6b605702d 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: fix possible problem in raid1/raid10 error overwriting

The code to overwrite/reread for addressing read errors in raid1/raid10
currently assumes that the read will not alter the buffer which could be used
to write to the next device. This is not a safe assumption to make.

So we split the loops into an overwrite loop and a separate re-read loop, so
that the writing is complete before reading is attempted.

Cc: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2604b703b6b3db80e3c75ce472a54dfd0b7bf9f4 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: remove personality numbering from md

md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).

These personalities have fairly artificial 'numbers'. The numbers
are used to:
1- provide an index into an array where the various personalities
are recorded
2- identify the module (via an alias) which implements a particular
personality.

Neither of these uses really justifies the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely). Module identification can be done using an alias
based on level rather than 'personality' number.

The current 'raid5' module supports two levels (4 and 5) but only one
personality. This slight awkwardness (which was handled in the mapping from
level to personality) can be better addressed by allowing raid5 to register
two personalities.

With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.

This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
a24a8dd858e0ba50f06a9fd8f61fe8c4fe7a8d8e 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: break out of a loop that doesn't need to run to completion

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
9ffae0cf3ea02f75d163922accfd3e592d87adde 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: convert md to use kzalloc throughout

Replace multiple kmalloc/memset pairs with kzalloc calls.
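
For example (illustrative):

    /* before */
    conf = kmalloc(sizeof(conf_t), GFP_KERNEL);
    if (conf)
            memset(conf, 0, sizeof(conf_t));
    /* after */
    conf = kzalloc(sizeof(conf_t), GFP_KERNEL);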

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2d1f3b5d1b2cd11a162eb29645df749ec0036413 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: clean up 'page' related names in md

Substitute:

page_cache_get -> get_page
page_cache_release -> put_page
PAGE_CACHE_SHIFT -> PAGE_SHIFT
PAGE_CACHE_SIZE -> PAGE_SIZE
PAGE_CACHE_MASK -> PAGE_MASK
__free_page -> put_page

because we aren't using the page cache, we are just using pages.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0eb3ff12aa8a12538ef681dc83f4361636a0699f 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: raid10 read-error handling - resync and read-only

Add in correct read-error handling for resync and read-only situations.

When read-only, we don't over-write, so we need to mark the failed drive in
the r10_bio so we don't re-try it. During resync, we always read all blocks,
so if there is a read error, we simply over-write it with the good block that
we found (assuming we found one).

Note that the recovery case still isn't handled in an interesting way. There
is nothing useful to do for the 2-copies case. If there are 3 or more copies,
then we could try reading from one of the non-missing copies, but this is a
bit complicated and would very rarely be used, so I'm leaving it for now.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4443ae10ca15d07922ceda622f03db8865fa3d13 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: auto-correct correctable read errors in raid10

Largely just a cross-port from raid1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
18f08819f42b647783e4f6ea99141623881bf182 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: support check-without-repair of raid10 arrays

Also keep a count of the number of errors found.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6cce3b23f6f8e974c00af7a9b88f1d413ba368a8 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: write intent bitmap support for raid10

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0a27ec96b6fb1abf867e36d7b0b681d67588767a 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: improve raid10 "IO Barrier" concept

raid10 needs to put up a barrier to new requests while it does resync or other
background recovery. The code for this is currently open-coded, slightly
obscured by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.
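
In outline (a sketch of intent, not the exact code):

    raise_barrier() - block new normal IO and wait for pending IO to drain
    lower_barrier() - drop the barrier once the resync/recovery IO is issued
    wait_barrier()  - normal IO waits here while a barrier is raised
    allow_barrier() - normal IO has completed; a waiting barrier may proceed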

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
22dfdf5212e5864b844f629736fb993d4611f190 28-Nov-2005 NeilBrown <neilb@suse.de> [PATCH] md: improve read speed to raid10 arrays using 'far copies'

raid10 has two different layouts. One uses near-copies (so multiple
copies of a block are at the same or similar offsets of different
devices) and the other uses far-copies (so multiple copies of a block
are stored a greatly different offsets on different devices). The point
of far-copies is that it allows the first section (normally the first half)
to be laid out in normal raid0 style, and thus provide raid0 sequential
read performance.

Unfortunately, the read balancing in raid10 makes some poor decisions
for far-copies arrays and you don't get the desired performance. So
turn off that bad bit of read_balance for far-copies arrays.

With this patch, read speed of an 'f2' array is comparable with a raid0
with the same number of devices, though write speed is of course still
very slow.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b2d444d7ad975d555bb919601bcdc0e58975a40e 09-Nov-2005 NeilBrown <neilb@suse.de> [PATCH] md: convert 'faulty' and 'in_sync' fields to bits in 'flags' field

This has the advantage of removing the confusion caused by 'rdev_t' and
'mddev_t' both having 'in_sync' fields.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d6065f7bf8bec170c9c56524a250093ce73ca5d9 09-Nov-2005 Suzanne Wood <suzannew@cs.pdx.edu> [PATCH] md: provide proper rcu_dereference / rcu_assign_pointer annotations in md

Acked-by: <paulmck@us.ibm.com>
Signed-off-by: Suzanne Wood <suzannew@cs.pdx.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
a362357b6cd62643d4dda3b152639303d78473da 01-Nov-2005 Jens Axboe <axboe@suse.de> [BLOCK] Unify the seperate read/write io stat fields into arrays

Instead of having ->read_sectors and ->write_sectors, combine the two
into ->sectors[2] and similar for the other fields. This saves a branch
several places in the io path, since we don't have to care for what the
actual io direction is. On my x86-64 box, that's 200 bytes less text in
just the core (not counting the various drivers).
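
For example (illustrative field names):

    /* before: one field per direction, so a branch on every update */
    if (rw == WRITE)
            stats->write_sectors += nsect;
    else
            stats->read_sectors += nsect;
    /* after: index by direction, no branch */
    stats->sectors[rw] += nsect;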

Signed-off-by: Jens Axboe <axboe@suse.de>
dd0fc66fb33cd610bc1a5db8a5e232d34879b4d7 07-Oct-2005 Al Viro <viro@ftp.linux.org.uk> [PATCH] gfp flags annotations - part 1

- added typedef unsigned int __nocast gfp_t;

- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
87fc767b832ef5a681a0ff9d203c3289bc3be2bf 10-Sep-2005 NeilBrown <neilb@suse.de> [PATCH] md: fix BUG when raid10 rebuilds without enough drives

This shouldn't be a BUG. We should cope.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6d508242b231cb6e6803faaef54456abe846edb8 10-Sep-2005 NeilBrown <neilb@suse.de> [PATCH] md: fix raid10 assembly when too many devices are missing

If you try to assemble an array with too many missing devices, raid10 will now
reject the attempt, instead of allowing it.

Also check when hot-adding a drive and refuse the hot-add if the array is
beyond hope.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e5dcdd80a60627371f40797426273048630dc8ca 10-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: fail IO request to md that require a barrier.

md does not yet support BIO_RW_BARRIER, so be honest about it and fail
(-EOPNOTSUPP) any such requests.
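
A sketch of the check at the top of make_request (2.6-era bio API):

    if (unlikely(bio_barrier(bio))) {
            bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
            return 0;
    }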

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
3ec67ac1a399d576d48b0736096bcce7721fe3cf 10-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: fix minor error in raid10 read-balancing calculation.

'this_sector' is a virtual (array) address while 'head_position' is a physical
(device) address, so subtraction doesn't make any sense. devs[slot].addr
should be used instead of this_sector.

However, this patch doesn't make much practical difference to the read
balancing due to the effects of later code.
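
Schematically (illustrative):

    /* before: array address minus device address - meaningless */
    new_distance = abs(this_sector - conf->mirrors[disk].head_position);
    /* after: compare like with like */
    new_distance = abs(r10_bio->devs[slot].addr -
                       conf->mirrors[disk].head_position);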

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
990a8baf568ca1d0ae65e59783ff821794118d07 22-Jun-2005 Jesper Juhl <juhl-lkml@dif.dk> [PATCH] md: remove unneeded NULL checks before kfree

This patch removes some unneeded checks of pointers being NULL before
calling kfree() on them. kfree() handles NULL pointers just fine, checking
first is pointless.
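
For example (illustrative):

    /* before */
    if (conf->mirrors)
            kfree(conf->mirrors);
    /* after: kfree(NULL) is a no-op */
    kfree(conf->mirrors);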

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
3d310eb7b3df1252e8595d059d982b0a9825a137 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: fix deadlock due to md thread processing delayed requests.

Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.

The current code schedules this up and queues the write request for later
handling by the md_thread.

However some personalities (Raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.

So this patch changes things so that the process submitting the request waits
for the superblock to be written, and then submits the request itself.

This fixes a recently-created deadlock in raid5/raid6

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
57afd89f98a990747445f01c458ecae64263b2f8 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: improve the interface to sync_request

1/ change the return value (which is number-of-sectors synced)
from 'int' to 'sector_t'.
The number of sectors is usually small enough to fit
in an int, but if resync needs to abort, it may want to return
the total number of remaining sectors, which could be large.
Also, errors cannot be returned as negative numbers now, so use
0 instead.
2/ Add a 'skipped' return parameter to allow the array to report
that it skipped the sectors. This allows md to take this into account
in the speed calculations.
Currently there is no important skipping, but the bitmap-based-resync
that is coming will use this.
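
The resulting signature looks like this (illustrative; parameter names
as used by the md personalities of that era):

    static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
                                 int *skipped, int go_faster);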

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
06d91a5fe0b50c9060e70bdf7786f8a3c66249db 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: improve locking on 'safemode' and move superblock writes

When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which could cause
problems later.

With this patch, the superblock write is always done by the helper thread, and
write requests are delayed until that write completes.

Also, the locking around marking the array dirty and writing the superblock is
improved to avoid possible races.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fca4d848f0e6fafdc2b25f8a0cf1e76935f13ac2 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: merge md_enter_safemode into md_check_recovery

md_enter_safemode checks if it is time to mark the md superblock as 'clean'.
i.e. if all writes have completed and a suitable delay has passed.

This is currently called from md_handle_safemode, which in turn is called
(almost) every time md_check_recovery is called, and from the end of
md_do_sync, which causes the mddev->thread to run, which will always call
md_check_recovery as well.

So it doesn't need to be a separate function and fits quite well into
md_check_recovery.

The "almost" is because multipathd calls md_check_recovery but not
md_handle_safemode. This is OK because the code from md_enter_safemode is a
no-op if mddev->safemode == 0, which it always is for a multipathd (providing
we don't allow it to be set to 2 on a signal...)

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
7a5febe9ffeecd1e78c5b505260ccc1ef18021b4 17-May-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: set the unplug_fn and issue_flush_fn for md devices *after* committed to creation

If we set them too early, they may still be in place and possibly get called
even though the array didn't get set up properly.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fbd568a3e61a7decb8a754ad952aaa5b5c82e9e5 01-May-2005 Paul E. McKenney <paulmck@us.ibm.com> [PATCH] Change synchronize_kernel to _rcu and _sched

This patch changes calls to synchronize_kernel(), deprecated in the earlier
"Deprecate synchronize_kernel, GPL replacement" patch to instead call the new
synchronize_rcu() and synchronize_sched() APIs.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!