History log of /drivers/md/raid10.c
Revision Date Author Comments
65c3f18b9032f7237fc74403ce3a92176eaebd8c 03-Jul-2012 NeilBrown <neilb@suse.de> md/raid10: fix failure when trying to repair a read error.

commit 055d3747dbf00ce85c6872ecca4d466638e80c22 upstream.

commit 58c54fcca3bac5bf9290cfed31c76e4c4bfbabaf
md/raid10: handle further errors during fix_read_error better.

in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
But we were passing the IO size in bytes!!!
This resulted in bio_add_page failing, an empty request being sent
down, and a consequent BUG_ON in scsi_lib.

[fix missing space in error message at same time]

This fix is suitable for 3.1.y and later.

Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
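
As a minimal standalone sketch of the unit mismatch (the helper name is
illustrative, not the kernel function):

    #include <stdio.h>

    /* Stand-in for r10_sync_page_io: expects a length in 512-byte sectors. */
    static int sync_io_sectors(long sectors)
    {
        if (sectors > 8) {  /* one 4 KiB page holds at most 8 sectors */
            printf("%ld 'sectors' - bio_add_page would fail\n", sectors);
            return 0;
        }
        printf("%ld sectors - OK\n", sectors);
        return 1;
    }

    int main(void)
    {
        long s = 6;                  /* intended IO: 6 sectors */
        sync_io_sectors(s);          /* correct: pass sectors */
        sync_io_sectors(s << 9);     /* the bug: 3072 bytes passed as sectors */
        return 0;
    }
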
04e0f69d135f6cf10534282fc09cf3efd6973c5b 03-Jul-2012 NeilBrown <neilb@suse.de> md/raid10: Don't try to recover unmatched (and unused) chunks.

commit fc448a18ae6219af9a73257b1fbcd009efab4a81 upstream.

If a RAID10 has an odd number of chunks - as might happen when there
are an odd number of devices - the last chunk has no pair and so is
not mirrored. We don't store data there, but when recovering the last
device in an array we try to recover that last chunk from a
non-existent location. This results in an error, and the recovery
aborts.

When we get to that last chunk we should just stop - there is nothing
more to do anyway.

This bug has been present since the introduction of RAID10, so the
patch is appropriate for any -stable kernel.

Reported-by: Christian Balzer <chibi@gol.com>
Tested-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
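
The shape of the fix is roughly this guard in the recovery loop (a
sketch based on the description above, not the exact patch text):

    if (sect >= mddev->resync_max_sectors) {
            /* last chunk is not mirrored - nothing more to recover */
            continue;
    }
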
584b886aee3aff5fe7eb21e30f779eb5cd1daa36 31-May-2012 NeilBrown <neilb@suse.de> md: raid1/raid10: fix problem with merge_bvec_fn

commit aba336bd1d46d6b0404b06f6915ed76150739057 upstream.

The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.

However we were only setting that when a device was hot-added,
not when a device was present from the start.

This bug was introduced in 3.4 so the patch is suitable for 3.4.y
kernels. However there are conflicts in raid10.c so a separate
patch will be needed for 3.4.y.

Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
b0d634d5683f0b186b242ce6a4f3b041edb8b956 19-May-2012 NeilBrown <neilb@suse.de> md/raid10: fix transcription error in calc_sectors conversion.

The old code was
    sector_div(stride, fc);
the new code was
    sector_div(size, conf->near_copies);

'size' is right (the stride variable wasn't really needed), but
'fc' means 'far_copies', and that is an important difference.

Signed-off-by: NeilBrown <neilb@suse.de>
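
For clarity, the corrected conversion divides by far_copies (a sketch
of the fixed line):

    sector_div(size, conf->far_copies);
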
6508fdbf40a92fd7c19d32780ea33ce8e8362b93 17-May-2012 NeilBrown <neilb@suse.de> md/raid10: set dev_sectors properly when resizing devices in array.

raid10 stores dev_sectors in 'conf' separately from the one in
'mddev' because it can have a very significant effect on block
addressing and so needs to be updated carefully.

However raid10_resize isn't updating it at all!

To update it correctly, we need to make sure it is a proper
multiple of the chunksize taking various details of the layout
in to account.
This calculation is currently done in setup_conf. So split it
out from there and call it from raid10_resize as well.
Then set conf->dev_sectors properly.

Signed-off-by: NeilBrown <neilb@suse.de>
f4380a915823dbed0bf8e3cf502ebcf2b7c7f833 12-Apr-2012 majianpeng <majianpeng@gmail.com> md/raid1,raid10: Fix calculation of 'vcnt' when processing error recovery.

If r1bio->sectors % 8 != 0, then the memcmp and a later
memcpy will omit the last bio_vec.

This is suitable for any stable kernel since 3.1 when bad-block
management was introduced.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
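
A minimal standalone sketch of the rounding problem (PAGE_SHIFT of 12
assumed, i.e. 8 sectors per page; not the kernel code):

    #include <stdio.h>

    #define PAGE_SHIFT 12   /* 4 KiB pages -> 8 sectors of 512 bytes each */

    int main(void)
    {
        int sectors = 13;   /* not a multiple of 8 */
        int vcnt_old = sectors >> (PAGE_SHIFT - 9);        /* truncates: 1 */
        int vcnt_new = (sectors + ((1 << (PAGE_SHIFT - 9)) - 1))
                       >> (PAGE_SHIFT - 9);                /* rounds up: 2 */
        printf("old vcnt=%d (last partial page skipped), new vcnt=%d\n",
               vcnt_old, vcnt_new);
        return 0;
    }
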
5020ad7d143ccfcf8149974096220d59e5572120 01-Apr-2012 NeilBrown <neilb@suse.de> md/raid1,raid10: don't compare excess byte during consistency check.

When comparing two pages read from different legs of a mirror, only
compare the bytes that were read, not the whole page.

In most cases we read a whole page, but in some cases with
bad blocks or odd-sized devices we might read less than that.

This bug has been present "forever" but at worst it might cause
a report of too many mismatches and generate a little bit
extra resync IO, so there is no need to back-port to -stable
kernels.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
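
A standalone sketch of the off-by-length compare (illustrative only):

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        char a[PAGE_SIZE] = {0}, b[PAGE_SIZE] = {0};
        int sectors_read = 3;      /* short read: only 3 * 512 bytes valid */
        b[3000] = 1;               /* garbage in the unread tail of b */
        /* Old behaviour: whole-page compare reports a spurious mismatch. */
        printf("whole page: %s\n",
               memcmp(a, b, PAGE_SIZE) ? "mismatch" : "match");
        /* Fixed behaviour: compare only the bytes actually read. */
        printf("read bytes: %s\n",
               memcmp(a, b, sectors_read << 9) ? "mismatch" : "match");
        return 0;
    }
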
006a09a0ae0a494473a8cd82c8d1d653e37e6663 18-Mar-2012 NeilBrown <neilb@suse.de> md/raid10 - support resizing some RAID10 arrays.

'resizing' an array in this context means making use of extra
space that has become available in component devices, not adding new
devices.
It also includes shrinking the array to take up less space on the
component devices.

This is not supported for arrays with a 'far' layout. However
for 'near' and 'offset' layout arrays, adding and removing space at
the end of the devices is easy to support, and this patch provides
that support.

Signed-off-by: NeilBrown <neilb@suse.de>
050b66152f87c79e8d66aed0e7996f9336462d5f 18-Mar-2012 NeilBrown <neilb@suse.de> md/raid10: handle merge_bvec_fn in member devices.

Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.

So enhance the raid10 merge_bvec_fn to check that function in children
as well.

This introduces a small problem. There is no locking around calls
to ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.

Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.

Signed-off-by: NeilBrown <neilb@suse.de>
dafb20fa34320a472deb7442f25a0c086e0feb33 18-Mar-2012 NeilBrown <neilb@suse.de> md: tidy up rdev_for_each usage.

md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.

Consequently some places use this safe version without needing it, and
many use an explicit list_for_each_entry.

So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.

Signed-off-by: NeilBrown <neilb@suse.de>
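
The resulting macro pair looks roughly like this (a sketch of the
md.h shapes of that era, not a verbatim quote):

    #define rdev_for_each(rdev, mddev)                                \
            list_for_each_entry(rdev, &((mddev)->disks), same_set)

    #define rdev_for_each_safe(rdev, tmp, mddev)                      \
            list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)
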
d6b42dcb995e6acd7cc276774e751ffc9f0ef4bf 18-Mar-2012 NeilBrown <neilb@suse.de> md/raid1,raid10: avoid deadlock during resync/recovery.

If RAID1 or RAID10 is used under LVM or some other stacking
block device, it is possible to enter a deadlock during
resync or recovery.
This can happen if the upper level block device creates
two requests to the RAID1 or RAID10. The first request gets
processed, blocks recovery, and queues requests for the underlying
devices in current->bio_list. A resync request then starts
which will wait for those requests and block new IO.

But then the second request to the RAID1/10 will be attempted
and it cannot progress until the resync request completes,
which cannot progress until the underlying device requests complete,
which are on a queue behind that second request.

So allow that second request to proceed even though there is
a resync request about to start.

This is suitable for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Ray Morris <support@bettercgi.com>
Tested-by: Ray Morris <support@bettercgi.com>
Signed-off-by: NeilBrown <neilb@suse.de>
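
The relaxed barrier check has roughly this shape (a sketch following
the description above; field names as in raid10.c of that era, macro
signature simplified):

    /* Let a request through if the submitter already has bios queued in
     * current->bio_list: blocking it could deadlock against the requests
     * queued behind it. */
    wait_event_lock_irq(conf->wait_barrier,
                        !conf->barrier ||
                        (conf->nr_pending &&
                         current->bio_list &&
                         !bio_list_empty(current->bio_list)),
                        conf->resync_lock);
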
dc10c643e8a8d008fd16dd6706e9e0018eadf8d2 18-Mar-2012 NeilBrown <neilb@suse.de> md: allow re-add to failed arrays.

When an array is failed (some data inaccessible) then there is no
point attempting to add a spare as it could not possibly be recovered.

However there may be value in re-adding a recently removed device.
e.g. if there is a write-intent-bitmap and it is clear, then access
to the data could be restored by this action.

So don't reject a re-add to a failed array for RAID10 and RAID5 (the
only array types that check for a failed array).

Signed-off-by: NeilBrown <neilb@suse.de>
547414d19fd72376ff2ecc42aac8d7a051f03d26 13-Mar-2012 NeilBrown <neilb@suse.de> md/raid10: remove unnecessary smp_mb() from end_sync_write

Recent commit 4ca40c2ce099e4f1ce3 (md/raid10: Allow replacement device ...)
added an smp_mb in end_sync_write.
This was to close a possible race with raid10_remove_disk.
However there is no such race, as we never attempt to remove a
disk while resync (or recovery) is happening, so the smp_mb is just
noise.

Signed-off-by: NeilBrown <neilb@suse.de>
7a90484825680e7831856105f5fef654e6c02701 05-Mar-2012 NeilBrown <neilb@suse.de> md/raid10: fix assembling of arrays with replacement devices.

commit 56a2559bb654a (md/raid10: recognise replacements ...)
changed 'run' to set ->replacement or ->rdev depending on the
'Replacement' status of the device, but it didn't remove the
old unconditional setting of 'rdev'. So it was largely ineffective.

So remove that now.

Signed-off-by: NeilBrown <neilb@suse.de>
fae8cc5ed0714953b1ad7cf86f030d2177278424 14-Feb-2012 NeilBrown <neilb@suse.de> md/raid10: fix handling of error on last working device in array.

If we get a read error on the last working device in a RAID10 which
contains the target block, then we don't fail the device (which is
good) but we don't abort retries, which is wrong.
We end up in an infinite loop retrying the read on the one device.

This patch fixes the problem in two places:
1/ in raid10_end_read_request we don't even ask for a retry if this
was the last usable device. This is efficient but a little racy
and will sometimes retry when it should not.

2/ in handle_read_error we are careful to exclude any device from
retry which we tried to mark as faulty (that might have failed if
it was the last device). This is race-free but less efficient.

Signed-off-by: NeilBrown <neilb@suse.de>
b7044d41b5a09ce9082699f74c8f10e0fe59f704 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: If there is a spare and a want_replacement device, start replacement.

When attempting to add a spare to a RAID10 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
56a2559bb654ae2555b2ae3b29c837615d0c45c9 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: recognise replacements when assembling array.

If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
4ca40c2ce099e4f1ce35445994f49836662596c8 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: Allow replacement device to replace old drive.

When recovery finishes and spare_active is called, check for a
replacement that might have just become fully synced and mark it
as such, marking the original as failed.

Then when the original is removed, move the replacement into
its position.

This means that 'replacement' can spontaneously become NULL in some
situations. Make sure we check for those.
It also means that 'rdev' and 'replacement' could appear to be
identical - check for that too.

Signed-off-by: NeilBrown <neilb@suse.de>
24afd80d99f80a79d8824d2805114b8b067e9823 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: handle recovery of replacement devices.

If there is a replacement device, then recover to it,
reading from any drives - maybe the one being replaced, maybe not.

Signed-off-by: NeilBrown <neilb@suse.de>
9ad1aefc8ae8d2e482b4cc4b7199e2354148bbdc 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: Handle replacement devices during resync.

If we need to resync an array which has replacement devices,
we always write any block checked to every replacement.

If the resync was bitmap-based resync we will then complete the
replacement normally.
If it was a full resync, we mark the replacements as fully recovered
when the resync finishes so no further recovery is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
475b0321a4df381f64db10ddd750a8b7bb82d88b 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: writes should get directed to replacement as well as original.

When writing, we need to submit two writes, one to the original,
and one to the replacement - if there is one.

If the write to the replacement results in a write error we just
fail the device. We only try to record write errors to the
original.

This only handles writing new data. Writing for resync/recovery
will come later.

Signed-off-by: NeilBrown <neilb@suse.de>
c8ab903ea9d7309044910c33dc087418be84f9b5 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: allow removal of failed replacement devices.

Enhance raid10_remove_disk to be able to remove ->replacement
as well as ->rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
abbf098e6e1e23d5d247b9eaaf325e67f67b0328 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: preferentially read from replacement device if possible.

When reading (for array reads, not for recovery etc) we read from the
replacement device if it has recovered far enough.
This requires storing the chosen rdev in the 'r10_bio' so we can make
sure to drop the ref on the right device when the read finishes.

Signed-off-by: NeilBrown <neilb@suse.de>
96c3fd1f3802371610c620cff03f9d825707e80e 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: change read_balance to return an rdev

It makes more sense to return an rdev than just an index as
read_balance() gets a reference to the rdev and so returning
the pointer makes this more idiomatic.

This will be needed in a future patch when we might return
a 'replacement' rdev instead of the main rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
69335ef3bc5b766f34db2d688be1d35313138bca 23-Dec-2011 NeilBrown <neilb@suse.de> md/raid10: prepare data structures for handling replacement.

Allow each slot in the RAID10 to have 2 devices, the want_replacement
and the replacement.

Also allow an r10bio to have 2 bios, and for resync/recovery allocate the
second bio if there are any replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
b8321b68d1445f308324517e45fb0a5c2b48e271 23-Dec-2011 NeilBrown <neilb@suse.de> md: change hot_remove_disk to take an rdev rather than a number.

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
056075c76417b112b4924e7b6386fdc6dfc9ac03 03-Jul-2011 Paul Gortmaker <paul.gortmaker@windriver.com> md: Add module.h to all files using it implicitly

A pending cleanup will mean that module.h won't be implicitly
included everywhere anymore. Make sure the modular drivers in the md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
7fcc7c8acf0fba44d19a713207af7e58267c1179 30-Oct-2011 NeilBrown <neilb@suse.de> md/raid10: Fix bug when activating a hot-spare.

This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional. It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
2bb77736ae5dca0a189829fbb7379d43364a9dac
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
d890fa2b0586b6177b119643ff66932127d58afa 26-Oct-2011 NeilBrown <neilb@suse.de> md: Fix some bugs in recovery_disabled handling.

In 3.0 we changed the way recovery_disabled was handled so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
1/ one place in raid1 was missed and still sets to '1'.
2/ We didn't explicitly set the conf-> value at array creation
time.
It defaulted to '0' just like the mddev value does so they
could appear equal and thus disable recovery.
This did not affect normal 'md' as it calls bind_rdev_to_array
which changes the mddev value. However the dmraid interface
doesn't call this and so doesn't change ->recovery_disabled; so at
array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less than the mddev value, so
they will only be the same when explicitly set that way.

Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
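
A minimal standalone sketch of the initialisation fix (the field
layout here is illustrative):

    #include <stdio.h>

    struct mddev { int recovery_disabled; };
    struct conf  { int recovery_disabled; };

    int main(void)
    {
        struct mddev mddev = { 0 };   /* dmraid path: value never changed */
        struct conf conf;
        /* Old: conf started at 0 too, so recovery looked disabled.
         * New: start one below, so equality only happens when set. */
        conf.recovery_disabled = mddev.recovery_disabled - 1;
        printf("recovery disabled? %s\n",
               conf.recovery_disabled == mddev.recovery_disabled
               ? "yes" : "no");
        return 0;
    }
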
34db0cd60f8a1f4ab73d118a8be3797c20388223 11-Oct-2011 NeilBrown <neilb@suse.de> md: add proper write-congestion reporting to RAID1 and RAID10.

RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread. This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.

However writeback requires devices to appear congested after a
while so it can make some guesstimate of throughput. The infinite
queue defeats that (note that RAID5 already has a finite queue so
it doesn't suffer from this problem).

So impose a limit on the number of pending write requests. By default
it is 1024 which seems to be generally suitable. Make it configurable
via a module option just in case someone finds a regression.

Signed-off-by: NeilBrown <neilb@suse.de>
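
The knob has roughly this shape (sketch; the parameter name follows
the description above, permissions are an assumption):

    static int max_queued_requests = 1024;
    module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
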
84fc4b56db85cb9e05326424049973a2036c9940 11-Oct-2011 NeilBrown <neilb@suse.de> md: rename "mdk_personality" to "md_personality"

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>
e879a8793f915aa7933364d962d2435bd71de462 11-Oct-2011 NeilBrown <neilb@suse.de> md/raid10: typedef removal: conf_t -> struct r10conf

Signed-off-by: NeilBrown <neilb@suse.de>
e373ab109172abc2d821bd3b5c1b400acddef5a5 11-Oct-2011 NeilBrown <neilb@suse.de> md/raid0: typedef removal: raid0_conf_t -> struct r0conf

Signed-off-by: NeilBrown <neilb@suse.de>
0f6d02d580ca77ee4be085c29c5fe5b879df24d9 11-Oct-2011 NeilBrown <neilb@suse.de> md: remove typedefs: mirror_info_t -> struct mirror_info

Signed-off-by: NeilBrown <neilb@suse.de>
9f2c9d12bcc53fcb3b787023723754e84d1aef8b 11-Oct-2011 NeilBrown <neilb@suse.de> md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio

Signed-off-by: NeilBrown <neilb@suse.de>
fd01b88c75a718020ff77e7f560d33835e9b58de 11-Oct-2011 NeilBrown <neilb@suse.de> md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>
3cb03002000f133f9f97269edefd73611eafc873 11-Oct-2011 NeilBrown <neilb@suse.de> md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>
01f96c0a9922cd9919baf9d16febdf7016177a12 21-Sep-2011 NeilBrown <neilb@suse.de> md: Avoid waking up a thread after it has been freed.

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
disappearing either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protection.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
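
A standalone sketch of the changed contract (userspace stand-in types,
locking omitted; only the pointer-to-pointer idea is the point):

    #include <stdio.h>

    struct md_thread { const char *name; };

    static void stop_thread(struct md_thread *t)
    {
        printf("stopping %s\n", t->name);
    }

    /* Take the address of the thread pointer so the caller's copy is
     * cleared before the thread is torn down - nothing can wake a freed
     * thread through a stale pointer. */
    static void md_unregister_thread(struct md_thread **threadp)
    {
        struct md_thread *t = *threadp;
        if (!t)
            return;
        *threadp = NULL;
        stop_thread(t);
    }

    int main(void)
    {
        struct md_thread worker = { "raid10d" };
        struct md_thread *thread = &worker;
        md_unregister_thread(&thread);
        printf("thread pointer now %s\n", thread ? "set" : "NULL");
        return 0;
    }
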
5a7bbad27a410350e64a2d7f5ec18fc73836c14f 12-Sep-2011 Christoph Hellwig <hch@infradead.org> block: remove support for bio remapping from ->make_request

There is very little benefit in letting a ->make_request
instance update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same by calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
079fa166a2874985ae58b2e21e26e1cbc91127d4 10-Sep-2011 NeilBrown <neilb@suse.de> md/raid1,10: Remove use-after-free bug in make_request.

A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.

To detect if we need to submit another write request we test:
if (sectors_handled < (bio->bi_size >> 9)) {

However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.

So move the **_write_done call until after the test against
bio->bi_size.

This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

Reported-by: Bruno Wolff III <bruno@wolff.to>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: NeilBrown <neilb@suse.de>
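
The corrected ordering, roughly (a sketch; only the call order
matters here):

    /* Test against bio->bi_size first; one_write_done() may drop the
     * last reference, after which 'bio' must not be touched. */
    if (sectors_handled < (bio->bi_size >> 9)) {
            /* ... queue another r10_bio for the remaining sectors ... */
    }
    one_write_done(r10_bio);
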
19d5f834d6aff7efb1c9353523865c5bce869470 10-Sep-2011 NeilBrown <neilb@suse.de> md/raid10: unify handling of write completion.

A write can complete at two different places:
1/ when the last member-device write completes, through
raid10_end_write_request
2/ in make_request() when we remove the initial bias from ->remaining.

These two should do exactly the same thing and the comment says they
do, but they don't.

So factor the correct code out into a function and call it in both
places. This makes the code much more similar to RAID1.

The difference is only significant if there is an error, and errors
usually take a while, so it is unlikely that there will be an error
already when make_request is completing, so this is unlikely to cause
real problems.

Signed-off-by: NeilBrown <neilb@suse.de>
58c54fcca3bac5bf9290cfed31c76e4c4bfbabaf 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: handle further errors during fix_read_error better.

If we find more read/write errors we should record a bad block before
failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
5e5702898e93eee7d69b6efde109609a89a61001 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Handle read errors during recovery better.

Currently when we get a read error during recovery, we simply abort
the recovery.

Instead, repeat the read in page-sized blocks.
On successful reads, write to the target.
On read errors, record a bad block on the destination,
and only if that fails do we abort the recovery.

As we now retry reads we need to know where we read from. This was in
bi_sector but that can be changed during a read attempt.
So store the correct from_addr and to_addr in the r10_bio for later
access.


Signed-off-by: NeilBrown <neilb@suse.de>
e684e41db3bad44f1262341300b827c0d94ae220 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: simplify read error handling during recovery.

If a read error is detected during recovery the code currently
fails the read device.
This isn't really necessary. recovery_request_write will signal
a write error to end_sync_write and it will record a write
error on the destination device which will record a bad block
there or kick it from the array.

So just remove this call to do md_error.

Signed-off-by: NeilBrown <neilb@suse.de>
1a0b7cd82657a590f163b090bd9123a3a6b9aae4 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: record bad blocks due to write errors during resync/recovery.

If we get a write error during resync/recovery don't fail the device
but instead record a bad block. If that fails we can then fail the
device.

Signed-off-by: NeilBrown <neilb@suse.de>
f84ee364dd15af11cada1e673f94128f62db189e 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: attempt to fix read errors during resync/check

We already attempt to fix read errors found during normal IO
and a 'repair' process.
It is best to try to repair them at any time they are found,
so move a test so that during sync and check a read error will
be corrected by over-writing with good data.

If both (all) devices have known bad blocks in the sync section we
won't try to fix even though the bad blocks might not overlap. That
should be considered later.

Also if we hit a read error during recovery we don't try to fix it.
It would only be possible to fix if there were at least three copies
of data, which is not very common with RAID10. But it should still
be considered later.

Signed-off-by: NeilBrown <neilb@suse.de>
bd870a16c5946d86126f7203db3c73b71de0a1d8 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Handle write errors by updating badblock log.

When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we try writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
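
A standalone sketch of the per-block retry (write_ok is a stand-in
for the real write path):

    #include <stdio.h>

    /* Pretend sector 132 is the one that keeps failing. */
    static int write_ok(long sector) { return sector != 132; }

    int main(void)
    {
        long first = 128, blocks = 8;
        /* Retry the failed range one block at a time and record only
         * the blocks that still fail. */
        for (long s = first; s < first + blocks; s++) {
            if (!write_ok(s))
                printf("log bad block at sector %ld\n", s);
        }
        return 0;
    }
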
749c55e942d91cb27045fe2eb313aa5afe68ae0b 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: clear bad-block record when write succeeds.

If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
d4432c23be957ff061f7b23fd60e8506cb472a55 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid writing to known bad blocks on known bad drives.

Writing to known bad blocks on drives that have seen a write error
is asking for trouble. So try to avoid these blocks.

Signed-off-by: NeilBrown <neilb@suse.de>
e875ecea266a543e643b19e44cf472f1412708f9 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: record bad blocks as needed during recovery.

When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>
40c356ce5ad1a6be817825e1da1bc7494349cc6d 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid reading known bad blocks during resync/recovery.

During resync/recovery limit the size of the request to avoid
reading into a bad block that does not start at-or-before the current
read address.

Similarly if there is a bad block at this address, don't allow the
current request to extend beyond the end of that bad block.

Now that we don't ever read from known bad blocks, it is safe to allow
devices with those blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
8dbed5cebdf6796bf2618457b3653cf820934366 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10 - avoid reading from known bad blocks - part 3

When attempting to repair a read error, don't read from
devices with a known bad block.

As we are only reading PAGE_SIZE blocks, we don't try to
narrow down to smaller regions in the hope that only part of this
page is bad - it isn't worth the effort.

Signed-off-by: NeilBrown <neilb@suse.de>
7399c31bc92a26bb8388a73f8e14acadcc512fe5 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid reading from known bad blocks - part 2

When redirecting a read error to a different device, we must
again avoid bad blocks and possibly split the request.

Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com>

Signed-off-by: NeilBrown <neilb@suse.de>
856e08e23762dfb92ffc68fd0a8d228f9e152160 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: avoid reading from known bad blocks - part 1

This patch just covers the basic read path:
1/ read_balance needs to check for badblocks, and return not only
the chosen slot, but also how many good blocks are available
there.
2/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'

On read error we currently just fail the request if another target
cannot handle the whole request. Next patch refines that a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
560f8e5532d63a314271bfb99d3d1d53c938ed14 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Split handle_read_error out from raid10d.

raid10d() is too big and is about to get bigger, so split
handle_read_error() out as a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>
1294b9c973251a5e68b62c9b40dd914517bda675 28-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: simplify/reindent some loops.

When a loop ends with a large if, it can be neater to change the
if to invert the condition and just 'continue'.
Then the body of the if can be indented to a lower level.

Signed-off-by: NeilBrown <neilb@suse.de>
de393cdea66cbd63c90725663f400c76faf1b255 28-Jul-2011 NeilBrown <neilb@suse.de> md: make it easier to wait for bad blocks to be acknowledged.

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't been, we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see the race above).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>
34b343cff4354ab9864be83be88405fd53d928a0 28-Jul-2011 NeilBrown <neilb@suse.de> md: don't allow arrays to contain devices with bad blocks.

As no personality understands bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
cbea21703b2484f83faef040ed1de30114794392 27-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: move rdev->corrected_errors counting

Read errors are considered corrected if the write-back and re-read
cycle finishes without further problems. Thus moving the rdev->
corrected_errors counting after the re-read looks more reasonable
IMHO.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
700c72138938cf428c74379806886c6b017d6295 27-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Improve decision on whether to fail a device with a read error.

Normally we would fail a device with a READ error. However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.

The current test for deciding whether the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.

Signed-off-by: NeilBrown <neilb@suse.de>
2bb77736ae5dca0a189829fbb7379d43364a9dac 27-Jul-2011 NeilBrown <neilb@suse.de> md/raid10: Make use of new recovery_disabled handling

When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart. This is misleading and wrong.

Instead, make use of the new recovery_disabled handling and mark
the target device as having recovery disabled.

Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.

Signed-off-by: NeilBrown <neilb@suse.de>
8bda470e8ebde35f9349e98ecbce4dfb508a60fa 27-Jul-2011 Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> md/raid: use printk_ratelimited instead of printk_ratelimit

As per printk_ratelimit comment, it should not be used.

Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
c65060ad4274f70048d62e0a86332cd3fd23f28d 18-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: share pages between read and write bio's during recovery

When performing a recovery, only the first 2 slots in r10_bio are in
use, for read and write respectively. However the pages in the write
bio are never used and are just replaced by the read bio's pages when
the read completes.

Get rid of those unused pages and share read pages properly.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
778ca01852e6cc9ff335119b37a1938a978df384 18-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: factor out common bio handling code

When a normal-write or sync-read/write bio completes, we need to
find out which disk number the bio belongs to. Factor that common
code out into a separate function.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2c4193df379bb89114ff60d4b0fa66131abe6a75 18-Jul-2011 Namhyung Kim <namhyung@gmail.com> md/raid10: get rid of duplicated conditional expression

Variable 'first' is initialized to zero and updated to @rdev->raid_disk
only if it is greater than 0. Thus condition '>= first' always implies
'>= 0' so the latter is not needed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
ab9d47e990c12c11cc95ed1247a3782234a7e33a 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: reformat some loops with less indenting.

When a loop ends with an 'if' with a large body, it is neater
to make the if 'continue' on the inverse condition, and then
the body is indented less.

Apply this pattern 3 times, and wrap some other long lines.

Signed-off-by: NeilBrown <neilb@suse.de>
f17ed07c853d5d772515f565a7fc68f9098d6d69 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: remove unused variable.

This variable 'disk' is never used - how odd.

Signed-off-by: NeilBrown <neilb@suse.de>
a8830bcaf3206f15e29efcd9e04becd96a0722e9 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: make more use of 'slot' in raid10d.

Now that we have a 'slot' variable, make better use of it to simplify
some code a little.

Signed-off-by: NeilBrown <neilb@suse.de>
7c4e06ff2b6a4c09638551dfde76f37f9fca5c0c 11-May-2011 NeilBrown <neilb@suse.de> md/raid10: some tidying up in fix_read_error

Currently the rdev on which a read error happened could be removed
before we perform the fix_read_error handling. This requires extra tests
for NULL.

So delay the rdev_dec_pending call until after the call to
fix_read_error so that we can be sure that the rdev still exists.

This allows an 'if' clause to be removed so the body gets re-indented
back one level.

Signed-off-by: NeilBrown <neilb@suse.de>
56d9912106b0974ffb6dd264c80c7e816677e998 11-May-2011 NeilBrown <neilb@suse.de> md: simplify raid10 read_balance

raid10 read balance has two different loops for looking through
possible devices to choose the best.
Collapse those into one loop and generally make the code more
readable.

Signed-off-by: NeilBrown <neilb@suse.de>
c3b328ac846bcf6b9a62c5563380a81ab723006d 18-Apr-2011 NeilBrown <neilb@suse.de> md: fix up raid1/raid10 unplugging.

We just need to make sure that an unplug event wakes up the md
thread, which is exactly what mddev_check_plugged does.

Also remove some plug-related code that is no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
e1dfa0a29737142c32f00a3bac0f609dc85b4a82 18-Apr-2011 NeilBrown <neilb@suse.de> md: use new plugging interface for RAID IO.

md/raid submits a lot of IO from the various raid threads.
So add start/finish plug calls to those so that some
plugging happens.

Signed-off-by: NeilBrown <neilb@suse.de>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
a91a2785b200864aef2270ed6a3babac7a253a20 17-Mar-2011 Martin K. Petersen <martin.petersen@oracle.com> block: Require subsystems to explicitly allocate bio_set integrity mempool

MD and DM create a new bio_set for every metadevice. Each bio_set has an
integrity mempool attached regardless of whether the metadevice is
capable of passing integrity metadata. This is a waste of memory.

Instead we defer the allocation decision to MD and DM since we know at
metadevice creation time whether integrity passthrough is needed or not.

Automatic integrity mempool allocation can then be removed from
bioset_create() and we make an explicit integrity allocation for the
fs_bio_set.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
7eaceaccab5f40bbfda044629a6298616aeaed50 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com> block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So let's kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
da9cf5050a2e3dbc3cf26a8d908482eb4485ed49 21-Feb-2011 NeilBrown <neilb@suse.de> md: avoid spinlock problem in blk_throtl_exit

blk_throtl_exit assumes that ->queue_lock still exists,
so make sure that it does.
To do this, we stop redirecting ->queue_lock to conf->device_lock
and leave it pointing where it is initialised - __queue_lock.

As the blk_plug functions check the ->queue_lock is held, we now
take that spin_lock explicitly around the plug functions. We don't
need the locking, just the warning removal.

This is needed for any kernel with the blk_throtl code, which is
2.6.37 and later.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
02214dc5461c36da26a34014cab4e1bb484edba2 04-Feb-2011 Krzysztof Wojcik <krzysztof.wojcik@intel.com> FIX: md: process hangs at wait_barrier after 0->10 takeover

The following symptoms were observed:
1. After a raid0->raid10 takeover operation we have an array with 2
missing disks.
When we add a disk for rebuild, the recovery process starts as
expected but does not finish - it stops at about 90% and the
md126_resync process hangs in "D" state.
2. Similar behaviour occurs when we have a mounted raid0 array and
execute a takeover to raid10. When we then try to unmount the array,
the umount process hangs in "D" state.

In both scenarios the processes hang in the same function -
wait_barrier in raid10.c.
The process waits in the "wait_event_lock_irq" macro until the
"!conf->barrier" condition becomes true, which in these scenarios
never happens.

The reason was that at the end of level_store, after calling
pers->run, we call mddev_resume. For RAID10 this calls
pers->quiesce(mddev, 0), which calls lower_barrier.
However raise_barrier hadn't been called on that 'conf' yet,
so conf->barrier becomes negative, which is bad.

This patch sets conf->barrier=1 after the takeover operation,
which prevents the barrier from going negative when lower_barrier()
is called.

Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
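
A minimal standalone sketch of the barrier arithmetic (illustrative
only):

    #include <stdio.h>

    int main(void)
    {
        int barrier = 0;   /* fresh conf after a raid0->raid10 takeover */
        /* Old: mddev_resume -> quiesce(mddev, 0) -> lower_barrier() */
        barrier--;         /* goes negative: waiters never proceed */
        printf("without fix: barrier = %d\n", barrier);
        /* Fix: takeover leaves the array quiesced (barrier = 1), so the
         * pending lower_barrier() restores it to 0. */
        barrier = 1;
        barrier--;
        printf("with fix:    barrier = %d\n", barrier);
        return 0;
    }
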
ccebd4c4159462c96397ae9af9c667bb394d7b70 13-Jan-2011 Jonathan Brassow <jbrassow@redhat.com> md-new-param-to_sync_page_io

Add new parameter to 'sync_page_io'.

The new parameter allows us to distinguish between metadata and data
operations. This becomes important later when we add the ability to
use separate devices for data and metadata.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
067032bc628598606056412594042564fcf09e22 13-Jan-2011 Joe Perches <joe@perches.com> md: Fix single printks with multiple KERN_<level>s

Noticed-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: NeilBrown <neilb@suse.de>
589a594be1fb8815b3f18e517be696c48664f728 09-Dec-2010 NeilBrown <neilb@suse.de> md: protect against NULL reference when waiting to start a raid10.

When we fail to start a raid10 for some reason, we call
md_unregister_thread to kill the thread that was created.

Unfortunately md_thread() will then make one call into the handler
(raid10d) even though md_wakeup_thread has not been called. This is
not safe and as md_unregister_thread is called after mddev->private
has been set to NULL, it will definitely cause a NULL dereference.

So fix this at both ends:
- md_thread should only call the handler if THREAD_WAKEUP has been
set.
- raid10 should call md_unregister_thread before setting things
to NULL just like all the other raid modules do.

This is applicable to 2.6.35 and later.

Cc: stable@kernel.org
Reported-by: "Citizen" <citizen_lee@thecus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
a167f663243662aa9153c01086580a11cde9ffdc 26-Oct-2010 NeilBrown <neilb@suse.de> md: use separate bio pool for each md device.

bio_clone and bio_alloc allocate from a common bio pool.
If an md device is stacked with other devices that use this pool, or under
something like swap which uses the pool, then the multiple calls on
the pool can cause deadlocks.

So allocate a local bio pool for each md array and use that rather
than the common pool.

This pool is used both for regular IO and metadata updates.

Signed-off-by: NeilBrown <neilb@suse.de>
2b193363ef68667ad717a6723165e0dccf99470f 27-Oct-2010 NeilBrown <neilb@suse.de> md: change type of first arg to sync_page_io.

Currently sync_page_io takes a 'bdev'.
Every caller passes 'rdev->bdev'.
We will soon want another field out of the rdev in sync_page_io,
so just pass the rdev instead of the bdev.

Signed-off-by: NeilBrown <neilb@suse.de>
6746557f0325a66f57d179126426e38a8ea66945 26-Oct-2010 NeilBrown <neilb@suse.de> md: use bio_kmalloc rather than bio_alloc when failure is acceptable.

bio_alloc can never fail (as it uses a mempool) but can block
indefinitely, especially if the caller is holding a reference to a
previously allocated bio.

So these two places, which both handle failure and hold multiple
bios, should not use bio_alloc; they should use bio_kmalloc.

Signed-off-by: NeilBrown <neilb@suse.de>
4e78064f42ad474ce9c31760861f7fb0cfc22532 18-Oct-2010 NeilBrown <neilb@suse.de> md: Fix possible deadlock with multiple mempool allocations.

It is not safe to allocate from a mempool while holding an item
previously allocated from that mempool as that can deadlock when the
mempool is close to exhaustion.

So don't use a bio list to collect the bios to write to multiple
devices in raid1 and raid10.
Instead queue each bio as it becomes available so an unplug will
activate all previously allocated bios and so a new bio has a chance
of being allocated.

This means we must set the 'remaining' count to '1' before submitting
any requests, then when all are submitted, decrement 'remaining' and
possibly handle the write completion at that point.

Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
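
The submission pattern described above looks roughly like this
(a sketch; write_done is a stand-in for the completion helper):

    /* Bias 'remaining' so the completion path cannot fire while bios
     * are still being queued; drop the bias once all are submitted. */
    atomic_set(&r10_bio->remaining, 1);
    for (i = 0; i < copies; i++)
            generic_make_request(bio[i]);   /* each completion decrements */
    if (atomic_dec_and_test(&r10_bio->remaining))
            write_done(r10_bio);
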
57dab0bdf689d42972975ec646d862b0900a4bf3 19-Oct-2010 NeilBrown <neilb@suse.de> md: use sector_t in bitmap_get_counter

bitmap_get_counter returns the number of sectors covered
by the counter in a pass-by-reference variable.
In some cases this can be very large, so make it a sector_t
for safety.

Signed-off-by: NeilBrown <neilb@suse.de>
e9c7469bb4f502dafc092166201bea1ad5fc0fbf 03-Sep-2010 Tejun Heo <tj@kernel.org> md: implement REQ_FLUSH/FUA support

This patch converts md to support REQ_FLUSH/FUA instead of the now
deprecated REQ_HARDBARRIER. In the core part (md.c), the following
changes are notable.

* Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
processing of other requests and thus there is no reason to mark the
queue congested while FLUSH/FUA is in progress.

* REQ_FLUSH/FUA failures are final and its users don't need retry
logic. Retry logic is removed.

* Preflush needs to be issued to all member devices but FUA writes can
be handled the same way as other writes - their processing can be
deferred to request_queue of member devices. md_barrier_request()
is renamed to md_flush_request() and simplified accordingly.

For linear, raid0 and multipath, the core changes are enough. raid1,
5 and 10 need the following conversions.

* raid1: Handling of FLUSH/FUA bio's can simply be deferred to
request_queues of member devices. Barrier related logic removed.

* raid5: Queue draining logic dropped. FUA bit is propagated through
biodrain and stripe reconstruction such that all the updated parts
of the stripe are written out with FUA writes if any of the dirtying
writes was FUA. preread_active_stripes handling in make_request()
is updated as suggested by Neil Brown.

* raid10: FUA bit needs to be propagated to write clones.

linear, raid0, 1, 5 and 10 tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2c7d46ec192e4f2b350f67a0e185b9bce646cd6b 18-Aug-2010 NeilBrown <neilb@suse.de> md raid-1/10 Fix bio_rw bit manipulations again

commit 7b6d91daee5cac6402186ff224c3af39d79f4a0e changed the behaviour
of a few variables in raid1 and raid10 from flags to bit-sets, but
left them as type 'bool' so they did not work.

Change them (back) to unsigned long.
(historical note: see 1ef04fefe2241087d9db7e9615c3f11b516e36cf)

Signed-off-by: NeilBrown <neilb@suse.de>
Reported-by: Jiri Slaby <jslaby@suse.cz> and many others
6b9656205469269c050963c71fca1998b247a560 18-Aug-2010 NeilBrown <neilb@suse.de> md: provide appropriate return value for spare_active functions.

md_check_recovery expects ->spare_active to return 'true' if any
spares were activated, but none of them do, so the consequent change
in 'degraded' is not notified through sysfs.

So count the number of spares activated, subtract it from 'degraded'
just once, and return it.

Reported-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
e6ffbcb6cd0ac471223df24ae77eb486c1ee68cc 18-Aug-2010 Adrian Drzewiecki <adriand@vmware.com> md: Notify sysfs when RAID1/5/10 disk is In_sync.

When RAID1 is done syncing disks, it'll update the state
of synced rdevs to In_sync. But it neglected to notify
sysfs that the attribute changed. So any programs that
are waiting for an rdev's state to change will not be
woken.

(raid5/raid10 added by neilb)

Signed-off-by: Adrian Drzewiecki <adriand@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
7b6d91daee5cac6402186ff224c3af39d79f4a0e 07-Aug-2010 Christoph Hellwig <hch@lst.de> block: unify flags for struct bio and struct request

Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superfluous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
51e9ac77035a3dfcb6fc0a88a0d80b6f99b5edb1 07-Aug-2010 NeilBrown <neilb@suse.de> md/raid10: fix deadlock with unaligned read during resync

If the 'bio_split' path in raid10-read is used while
resync/recovery is happening it is possible to deadlock.
Fix this by elevating ->nr_waiting for the duration of both
parts of the split request.

This fixes a bug that has been present since 2.6.22
but has only started manifesting recently for unknown reasons.
It is suitable for any -stable kernel since then.

Reported-by: Justin Bronder <jsbronder@gentoo.org>
Tested-by: Justin Bronder <jsbronder@gentoo.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
f73ea87375a1b2bf6c0be82bb9a3cb9d5ee7a407 16-Jun-2010 Maciej Trela <maciej.trela@intel.com> md: fix raid10 takeover: use new_layout for setup_conf

Use mddev->new_layout in setup_conf.
Also use new_chunk, and don't set ->degraded in takeover(). That
gets set in run()

Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
e93f68a1fc6244c05ad8fae28e75835ec74ab34e 15-Jun-2010 NeilBrown <neilb@suse.de> md: fix handling of array level takeover that re-arranges devices.

Most array level changes leave the list of devices largely unchanged,
possibly causing one at the end to become redundant.
However conversions between RAID0 and RAID10 need to renumber
all devices (except 0).

This renumbering is currently being done in the ->run method when the
new personality takes over. However this is too late as the common
code in md.c might already have invalidated some of the devices if
they had a ->raid_disk number that appeared too high.

Moving it into the ->takeover method is too early as the array is
still active at that time and wrong ->raid_disk numbers could cause
confusion.

So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
the new raid_disk number.
Now the common code knows exactly which devices need to be renumbered,
and which can be invalidated, and can do it all at a convenient time
when the array is suspended.
It can also update some symlinks in sysfs which previously were not
updated correctly.

Reported-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
0544a21db02c1d8883158fd6f323364f830a120a 24-Jun-2010 Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com> md: raid10: Fix null pointer dereference in fix_read_error()

Such a NULL pointer dereference can occur when the driver is fixing
read errors/bad blocks and the disk is physically removed, causing a
system crash. This patch checks whether rcu_dereference() returns a
valid rdev before accessing it in fix_read_error().

Cc: stable@kernel.org
Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com>
Signed-off-by: Rob Becker <rbecker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
af3a2cd6b8a479345786e7fe5e199ad2f6240e56 08-May-2010 NeilBrown <neilb@suse.de> md: Fix read balancing in RAID1 and RAID10 on drives > 2TB

read_balance uses an "unsigned long" for a sector number which
will get truncated beyond 2TB.
This will cause read-balancing to be non-optimal, and can cause
data to be read from the 'wrong' branch during a resync. This has a
very small chance of returning wrong data.

Reported-by: Jordan Russell <jr-list-2010@quo.to>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
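
A standalone sketch of the truncation (using uint32_t to model a
32-bit kernel's unsigned long):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* A sector just past 2TB: 2TB / 512 bytes = 2^32 sectors. */
        unsigned long long sector = (1ULL << 32) + 100;
        uint32_t truncated = (uint32_t)sector;   /* 32-bit unsigned long */
        printf("full sector: %llu, truncated: %u\n",
               sector, (unsigned)truncated);
        return 0;
    }
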
128595ed6ff2c7358ae253a560d47a0af463bc99 03-May-2010 NeilBrown <neilb@suse.de> md/raid10: tidy up printk messages.

All raid10 printk messages now start
md/raid10:md-device-name:

Signed-off-by: NeilBrown <neilb@suse.de>
21a52c6d05c15f862797736393915bfa8cd40ee9 01-Apr-2010 NeilBrown <neilb@suse.de> md: pass mddev to make_request functions rather than request_queue

We used to pass the personality make_request function directly
to the block layer so the first argument had to be a queue.
But now we have the intermediary md_make_request so it makes
a lot more sense to pass a struct mddev_s.
It makes it possible to have an mddev without its own queue too.

Signed-off-by: NeilBrown <neilb@suse.de>
490773268cf64f68da2470e07b52c7944da6312d 25-Mar-2010 NeilBrown <neilb@suse.de> md: move io accounting out of personalities into md_make_request

While I generally prefer letting personalities do as much as possible,
given that we have a central md_make_request anyway we may as well use
it to simplify code.
Also this centralises knowledge of ->gendisk which will help later.

Signed-off-by: NeilBrown <neilb@suse.de>
dab8b29248b3f14f456651a2a6ee9b8fd16d1b3c 08-Mar-2010 Trela, Maciej <Maciej.Trela@intel.com> md: Add support for Raid0->Raid10 takeover


Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
84707f38e767ac470fd82af6c45a8cafe2bd1b9a 16-Mar-2010 NeilBrown <neilb@suse.de> md: don't use mddev->raid_disks in raid0 or raid10 while array is active.

In a subsequent patch we will make it possible to change
mddev->raid_disks while a RAID0 or RAID10 array is active. This is
part of the process of reshaping such an array.

This means that we cannot use this value while processing requests
(it is OK to use it during initialisation as we are locked against
changes then).
Both RAID0 and RAID10 have the same value stored in the private data
structure, so use that value instead.

Signed-off-by: NeilBrown <neilb@suse.de>
7b92813c3c0b6990f14838e3985fb385d2655d0c 08-Mar-2010 H Hartley Sweeten <hartleys@visionengravers.com> drivers/md: Remove unnecessary casts of void *

void pointers do not need to be cast to other pointer types.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: NeilBrown <neilb@suse.de>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include
those headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have a fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
627a2d3c29427637f4c5d31ccc7fcbd8d312cd71 08-Mar-2010 NeilBrown <neilb@suse.de> md: deal with merge_bvec_fn in component devices better.

If a component device has a merge_bvec_fn then as we never call it
we must ensure we never need to. Currently this is done by setting
max_sector to 1 PAGE, however this does not stop a bio being created
with several sub-page iovecs that would violate the merge_bvec_fn.

So instead set max_segments to 1 and set the segment boundary to the
same as a page boundary to ensure there is only ever one single-page
segment of IO requested at a time.
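
For illustration, the settings described above would look roughly like
this in a personality's run method (a sketch, not the exact patch):

    if (rdev->bdev->bd_disk->queue->merge_bvec_fn) {
            /* never build multi-segment or page-crossing IO */
            blk_queue_max_segments(mddev->queue, 1);
            blk_queue_segment_boundary(mddev->queue,
                                       PAGE_CACHE_SIZE - 1);
    }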

This can particularly be an issue when 'xen' is used as it is
known to submit multiple small buffers in a single bio.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
086fa5ff0854c676ec333760f4c0154b3b242616 26-Feb-2010 Martin K. Petersen <martin.petersen@oracle.com> block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors

The block layer calling convention is blk_queue_<limit name>.
blk_queue_max_sectors predates this practice, leading to some confusion.
Rename the function to appropriately reflect that its intended use is to
set max_hw_sectors.

Also introduce a temporary wrapper for backwards compatibility. This can
be removed after the merge window is closed.
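
The temporary wrapper is presumably just a pass-through, along these
lines (sketch):

    static inline void blk_queue_max_sectors(struct request_queue *q,
                                             unsigned int max)
    {
            blk_queue_max_hw_sectors(q, max);
    }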

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
0efb9e6191e1d3d34c1db90b829b742bc36d532e 13-Dec-2009 NeilBrown <neilb@suse.de> md: add MODULE_DESCRIPTION for all md related modules.

Suggested by Oren Held <orenhe@il.ibm.com>

Signed-off-by: NeilBrown <neilb@suse.de>
1e50915fe0bbf7a46db0fa7e1e604d3fc95f057d 13-Dec-2009 Robert Becker <Rob.Becker@riverbed.com> raid: improve MD/raid10 handling of correctable read errors.

We've noticed severe lasting performance degradation of our raid
arrays when we have drives that yield large amounts of media errors.
The raid10 module will queue each failed read for retry, and also
will attempt call fix_read_error() to perform the read recovery.
Read recovery is performed while the array is frozen, so repeated
recovery attempts can degrade the performance of the array for
extended periods of time.

With this patch I propose adding a per md device max number of
corrected read attempts. Each rdev will maintain a count of
read correction attempts in the rdev->read_errors field (not
used currently for raid10). When we enter fix_read_error()
we'll check to see when the last read error occurred, and
divide the read error count by 2 for every hour since the
last read error. If at that point our read error count
exceeds the read error threshold, we'll fail the raid device.
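
A sketch of that decay-and-threshold logic (condensed; seconds_since()
is a hypothetical helper and the field names are approximations):

    /* halve the error count for each hour since the last error */
    hours = seconds_since(rdev->last_read_error) / 3600;
    if (hours >= 8 * sizeof(unsigned int))
            atomic_set(&rdev->read_errors, 0);
    else
            atomic_set(&rdev->read_errors,
                       atomic_read(&rdev->read_errors) >> hours);

    atomic_inc(&rdev->read_errors);
    if (atomic_read(&rdev->read_errors) > max_read_errors)
            md_error(mddev, rdev);          /* fail the raid device */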

In addition in this patch I add sysfs nodes (get/set) for
the per md max_read_errors attribute, the rdev->read_errors
attribute, and added some printk's to indicate when
fix_read_error fails to repair an rdev.

For testing I used debugfs->fail_make_request to inject
IO errors to the rdev while doing IO to the raid array.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
67b8dc4b06b0e97df55fd76e209f34f9a52e820e 13-Dec-2009 Robert Becker <Rob.Becker@riverbed.com> md/raid10: print more useful messages on device failure.

When we get a read error on a device in a RAID10, and attempting to
repair the error fails, print more useful messages about why it
failed.

Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
9cd30fdc33cde9ae4ac55a1ccbbb89f3f7b9b2f2 13-Dec-2009 NeilBrown <neilb@suse.de> md: remove needless setting of thread->timeout in raid10_quiesce

As bitmap_create and bitmap_destroy already set thread->timeout
as appropriate, there is no need to do it in raid10_quiesce.
There is a possible need to wake the thread after the timeout
has been set low, but it is better to do that where the timeout
is actually set low, in bitmap_create.

Signed-off-by: NeilBrown <neilb@suse.de>
1b04be96f6910ee415287bf0d5309c7d4c94bd2b 13-Dec-2009 NeilBrown <neilb@suse.de> md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.

This removes a lot of multiplications by HZ.

Signed-off-by: NeilBrown <neilb@suse.de>
42a04b5078ce73a32f85762551d5703c5bd646a1 13-Dec-2009 NeilBrown <neilb@suse.de> md: move offset, daemon_sleep and chunksize out of bitmap structure

... and into bitmap_info. These are all configuration parameters
that need to be set before the bitmap is created.

Signed-off-by: NeilBrown <neilb@suse.de>
a2826aa92e2e14db372eda01d333267258944033 13-Dec-2009 NeilBrown <neilb@suse.de> md: support barrier requests on all personalities.

Previously barriers were only supported on RAID1. This is because
other levels require synchronisation across all devices and so needed
a different approach.
Here is that approach.

When a barrier arrives, we send a zero-length barrier to every active
device. When that completes - and if the original request was not
empty - we submit the barrier request itself (with the barrier flag
cleared) and then submit a fresh load of zero length barriers.

The barrier request itself is asynchronous, but any subsequent
request will block until the barrier completes.

The reason for clearing the barrier flag is that a barrier request is
allowed to fail. If we pass a non-empty barrier through a striping
raid level it is conceivable that part of it could succeed and part
could fail. That would be way too hard to deal with.
So if the first run of zero length barriers succeed, we assume all is
sufficiently well that we send the request and ignore errors in the
second run of barriers.
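
In outline, the flow is as follows (a sketch with hypothetical helper
names, not the patch's actual functions):

    send_zero_length_barriers(mddev);       /* one per active rdev */
    wait_for_completion_of_barriers(mddev);
    if (bio->bi_size) {                     /* original request not empty */
            bio->bi_rw &= ~(1 << BIO_RW_BARRIER);   /* clear barrier flag */
            generic_make_request(bio);
            send_zero_length_barriers(mddev);       /* errors ignored here */
    }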

RAID5 needs extra care as write requests may not have been submitted
to the underlying devices yet. So we flush the stripe cache before
proceeding with the barrier.

Note that the second set of zero-length barriers are submitted
immediately after the original request is submitted. Thus when
a personality finds mddev->barrier to be set during make_request,
it should not return from make_request until the corresponding
per-device request(s) have been queued.

That will be done in later patches.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Andre Noll <maan@systemlinux.org>
ed9bfdf1a40952fd0f8094ec77f876b84ead69af 16-Oct-2009 NeilBrown <neilb@suse.de> md: raid1/raid10: handle allocation errors during array setup.

Both raid1 and raid10 create a mempool during startup.
If the 'alloc' function for this mempool fails, unplug_slaves
is called.
If that happens when the pool is being initialised, unplug_slaves
will try to use the 'conf' structure that isn't filled in yet, and
badness will happen.

So ensure that unplug_slaves doesn't get called unless we know
that the conf structure is fully initialised.

Signed-off-by: NeilBrown <neilb@suse.de>
1d9d52416c0445019ccc1f0fddb9a227456eb61b 16-Oct-2009 NeilBrown <neilb@suse.de> md/raid1/raid10: add a cond_resched

During 'check' of a raid1 or raid10 it is possible for the management
thread to spend a lot of time running 'memcmp' on blocks from
different devices, so make sure the thread has a chance to schedule.
raid5d already has a cond_resched (in process_stripe).
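
The change itself is tiny; in outline (array names illustrative):

    for (i = 0; i < vcnt; i++) {            /* compare corresponding pages */
            if (memcmp(page_address(ppages[i]),
                       page_address(spages[i]), PAGE_SIZE))
                    mismatch = 1;
            cond_resched();                 /* give the scheduler a chance */
    }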

Reported-By: Lee Howard <faxguy@howardsilvan.com>
Signed-off-by: NeilBrown <neilb@suse.de>
1ef04fefe2241087d9db7e9615c3f11b516e36cf 20-Sep-2009 Dmitry Monakhov <rjevskiy@gmail.com> md: raid-1/10: fix RW bits manipulation

Recently Jens changed the bio_rw_flagged() logic in the following
commit, 1f98a13f623e0ef666690a18c1250335fc6d7ef1. Now it returns
bool instead of int. This broke the raid1/raid10 RW bits manipulation logic.
One visible result is a BUG_ON triggering due to an empty barrier
at scsi_lib.c:1108, scsi_setup_fs_cmnd().

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: NeilBrown <neilb@suse.de>
3fa841d7e7266f6fcc1b3885b905f5153ba897d8 23-Sep-2009 NeilBrown <neilb@suse.de> md: report device as congested when suspended

This should stop writeback from coming when the device is temporarily
suspended.

Signed-off-by: NeilBrown <neilb@suse.de>
0da3c6194ec2f32617b272df4505a1cf022faea5 23-Sep-2009 NeilBrown <neilb@suse.de> md: Improve name of threads created by md_register_thread

The management threads for raid4,5,6 arrays are all called
mdX_raid5, independent of the actual raid level, which is wrong and
can be confusing.

So change md_register_thread to use the name from the personality
unless no alternate name (like 'resync' or 'reshape') is given.

This is simpler and more correct.

Cc: Jinzc <zhenchengjin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
a9f326ebf22a0de776815240fb76dabe139397ea 23-Sep-2009 NeilBrown <neilb@suse.de> md: remove sparse warning "symbol xxx shadows an earlier one"

Rename some variables and remove some duplicate definitions
to avoid these warnings. None of them are actual errors.

Signed-off-by: NeilBrown <neilb@suse.de>
1f98a13f623e0ef666690a18c1250335fc6d7ef1 11-Sep-2009 Jens Axboe <jens.axboe@oracle.com> bio: first step in sanitizing the bio->bi_rw flag testing

Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ac5e7113e74872928844d00085bd47c988f12728 03-Aug-2009 Andre Noll <maan@systemlinux.org> md: Push down data integrity code to personalities.

This patch replaces md_integrity_check() by two new public functions:
md_integrity_register() and md_integrity_add_rdev() which are both
personality-independent.

md_integrity_register() is called from the ->run and ->hot_remove
methods of all personalities that support data integrity. The
function iterates over the component devices of the array and
determines if all active devices are integrity capable and if their
profiles match. If this is the case, the common profile is registered
for the mddev via blk_integrity_register().

The second new function, md_integrity_add_rdev() is called from the
->hot_add_disk methods, i.e. whenever a new device is being added
to a raid array. If the new device does not support data integrity,
or has a profile different from the one already registered, data
integrity for the mddev is disabled.

For raid0 and linear, only the call to md_integrity_register() from
the ->run method is necessary.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
8f6c2e4b325a8e9f8f47febb2fd0ed4fae7d45a9 01-Jul-2009 Martin K. Petersen <martin.petersen@oracle.com> md: Use new topology calls to indicate alignment and I/O sizes

Switch MD over to the new disk_stack_limits() function which checks for
aligment and adjusts preferred I/O sizes when stacking.

Also indicate preferred I/O sizes where applicable.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
8c6ac868b107ed50a46204f6d14e2ad9443ff146 18-Jun-2009 Andre Noll <maan@systemlinux.org> md: Push down reconstruction log message to personality code.

Currently, the md layer checks in analyze_sbs() if the raid level
supports reconstruction (mddev->level >= 1) and if reconstruction is
in progress (mddev->recovery_cp != MaxSector).

Move that printk into the personality code of those raid levels that
care (levels 1, 4, 5, 6, 10).

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
9d8f0363623b3da12c43007cf77f5e1a4e8a5964 18-Jun-2009 Andre Noll <maan@systemlinux.org> md: Make mddev->chunk_size sector-based.

This patch renames the chunk_size field to chunk_sectors with the
implied change of semantics. Since

is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
= is_power_of_2(chunk_sectors)

these bits don't need an adjustment for the shift.
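
In other words, the stored value changes only by a shift; as an
illustration (variable name hypothetical):

    mddev->chunk_sectors = chunk_size_in_bytes >> 9;  /* bytes -> sectors */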

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
964e7913b0d25b988e27a7cd9378bc55cc572bb4 16-Jun-2009 raz ben yehuda <raziebe@gmail.com> md: raid10: chunk size check in run

Have raid10 check the chunk size in its run method instead of in md.

Signed-off-by: raziebe@gmail.com
Signed-off-by: NeilBrown <neilb@suse.de>
070ec55d07157a3041f92654135c3c6e2eaaf901 16-Jun-2009 NeilBrown <neilb@suse.de> md: remove mddev_to_conf "helper" macro

Having a macro just to cast a void* isn't really helpful.
I would much rather see that we are simply de-referencing ->private,
than have to know what the macro does.

So open code the macro everywhere and remove the pointless cast.
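
The macro in question, and its open-coded replacement (as described
above):

    /* before: */
    #define mddev_to_conf(mddev) ((conf_t *) mddev->private)
    conf_t *conf = mddev_to_conf(mddev);

    /* after: */
    conf_t *conf = mddev->private;  /* no cast needed from void * */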

Signed-off-by: NeilBrown <neilb@suse.de>
ae03bf639a5027d27270123f5f6e3ee6a412781d 22-May-2009 Martin K. Petersen <martin.petersen@oracle.com> block: Use accessor functions for queue limits

Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
18055569127253755d01733f6ecc004ed02f88d0 06-May-2009 NeilBrown <neilb@suse.de> md/raid10: don't clear bitmap during recovery if array will still be degraded.

If we have a raid10 with multiple missing devices, and we recover just
one of these to a spare, then we risk (depending on the bitmap and
array chunk size) clearing bits of the bitmap for which recovery isn't
complete (because a device is still missing).

This can lead to a subsequent "re-add" being recovered without
any IO happening, which would result in loss of data.

This patch takes the safe approach of not clearing bitmap bits
if the array will still be degraded.

This patch is suitable for all active -stable kernels.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
8f3d8ba20e67991b531e9c0227dcd1f99271a32c 07-Apr-2009 Christoph Hellwig <hch@lst.de> block: move bio list helpers into bio.h

It's used by DM and MD and generally useful, so move the bio list
helpers into bio.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
b522adcde9c4d3fb7b579cfa9160d8bde7744be8 31-Mar-2009 Dan Williams <dan.j.williams@intel.com> md: 'array_size' sysfs attribute

Allow userspace to set the size of the array according to the following
semantics:

1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
a) If size is set before the array is running, do_md_run will fail
if size is greater than the default size
b) A reshape attempt that reduces the default size to less than the set
array size should be blocked
2/ once userspace sets the size the kernel will not change it
3/ writing 'default' to this attribute returns control of the size to the
kernel and reverts to the size reported by the personality

Also, convert locations that need to know the default size from directly
reading ->array_sectors to <pers>_size. Resync/reshape operations
always follow the default size.

Finally, fixup other locations that read a number of 1k-blocks from
userspace to use strict_blocks_to_sectors() which checks for unsigned
long long to sector_t overflow and blocks to sectors overflow.
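
A sketch of the checks such a helper has to make (not necessarily the
exact code of strict_blocks_to_sectors()):

    static int strict_blocks_to_sectors(const char *buf, sector_t *sectors)
    {
            unsigned long long blocks;
            sector_t new;

            if (strict_strtoull(buf, 10, &blocks) < 0)
                    return -EINVAL;
            if (blocks & 1ULL << (8 * sizeof(blocks) - 1))
                    return -EINVAL;         /* blocks * 2 would overflow */

            new = blocks * 2;               /* 1K blocks -> 512B sectors */
            if ((unsigned long long)new != blocks * 2)
                    return -EINVAL;         /* sector_t overflow */

            *sectors = new;
            return 0;
    }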

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
1f403624bde3c678a166984b1e6a727a0ce06f2b 31-Mar-2009 Dan Williams <dan.j.williams@intel.com> md: centralize ->array_sectors modifications

Get personalities out of the business of directly modifying
->array_sectors. Lays groundwork to introduce policy on when
->array_sectors can be modified.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
80c3a6ce4ba4470379b9e6a4d9bcd9d2ee26ae03 18-Mar-2009 Dan Williams <dan.j.williams@intel.com> md: add 'size' as a personality method

In preparation for giving userspace control over ->array_sectors we need
to be able to retrieve the 'default' size, and the 'anticipated' size
when a reshape is requested. For personalities that do not reshape emit
a warning if anything but the default size is requested.

In the raid5 case we need to update ->previous_raid_disks to make the
new 'default' size available.

Reviewed-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
409c57f3801701dfee27a28103dda4831306cb20 31-Mar-2009 NeilBrown <neilb@suse.de> md: enable suspend/resume of md devices.

To be able to change the 'level' of an md/raid array, we need to
suspend the device so that no requests are active - then move some
pointers around etc.

The code already keeps counts of active requests and the ->quiesce
function can be used to wait until those counts hit zero.
However the quiesce function blocks new requests once they are all
ready 'inside' the personality module, and that is too late if we want
to replace the personality modules.

So make all md requests come in through a common md_make_request
function that keeps track of how many requests have entered the
modules but may not yet be on the internal reference counts.
Allow md_make_request to be blocked when we want to suspend the
device, and make it possible to wait for all those in-transit requests
to be added to internal lists so that ->quiesce can wait for them.
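
A condensed sketch of that mechanism (locking and RCU details
omitted):

    static int md_make_request(struct request_queue *q, struct bio *bio)
    {
            mddev_t *mddev = q->queuedata;
            int rv;

            wait_event(mddev->sb_wait, !mddev->suspended);
            atomic_inc(&mddev->active_io);  /* count in-transit requests */
            rv = mddev->pers->make_request(q, bio);
            if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
                    wake_up(&mddev->sb_wait);       /* suspend can proceed */
            return rv;
    }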

There is still a problem that when a request completes, we drop the
ref count inside the personality code so there is a short time between
when the refcount hits zero, and when the personality code is no
longer being used.
The personality code never blocks (schedule or spinlock) between
dropping the refcount and exiting the routine, so this should be safe
(as put_module calls synchronize_sched() before unmapping the module
code).

Signed-off-by: NeilBrown <neilb@suse.de>
58c0fed400603a802968b23ddf78f029c5a84e41 31-Mar-2009 Andre Noll <maan@systemlinux.org> md: Make mddev->size sector-based.

This patch renames the "size" field of struct mddev_s to "dev_sectors"
and stores the number of 512-byte sectors instead of the number of
1K-blocks in it.

All users of that field, including raid levels 1,4-6,10, are adjusted
accordingly. This simplifies the code a bit because it allows us to get
rid of a couple of divisions/multiplications by two.

In order to make checkpatch happy, some minor coding style issues
have also been addressed. In particular, size_store() now uses
strict_strtoull() instead of simple_strtoull().

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
43b2e5d86d8bdd77386226db0bc961529492c043 31-Mar-2009 NeilBrown <neilb@suse.de> md: move md_k.h from include/linux/raid/ to drivers/md/

It really is nicer to keep related code together.

Signed-off-by: NeilBrown <neilb@suse.de>
bff61975b3d6c18ee31457cc5b4d73042f44915f 31-Mar-2009 NeilBrown <neilb@suse.de> md: move lots of #include lines out of .h files and into .c

This makes the includes more explicit, and is preparation for moving
md_k.h to drivers/md/md.h

Remove include/raid/md.h as its only remaining use was to #include
other files.

Signed-off-by: NeilBrown <neilb@suse.de>
ef740c372dfd80e706dbf955d4e4aedda6c0c148 31-Mar-2009 Christoph Hellwig <hch@lst.de> md: move headers out of include/linux/raid/

Move the headers with the local structures for the disciplines and
bitmap.h into drivers/md/ so that they are more easily grepable for
hacking and not far away. md.h is left where it is for now as there
are some uses from the outside.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.de>
73d5c38a9536142e062c35997b044e89166e063b 25-Feb-2009 NeilBrown <neilb@suse.de> md: avoid races when stopping resync.

There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler change.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed. So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
78200d45cde2a79c0d0ae0407883bb264caa3c18 25-Feb-2009 NeilBrown <neilb@suse.de> md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery.

For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
They both walk through the device addresses in order.

For raid10 it can be quite different. resync follows the 'array'
address, and makes sure all copies are the same. Recovery walks
through 'device' addresses and recreates each missing block.

The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(When present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.

In particular, it can cause bitmap-directed recovery of a raid10 to
not recover some of the blocks that need to be recovered.

So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.


Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
09b4068a7fe442efc40e9dcbcf5ff37c3338ab15 25-Feb-2009 NeilBrown <neilb@suse.de> md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery.

When doing recovery on a raid10 with a write-intent bitmap, we only
need to recovery chunks that are flagged in the bitmap.

However if we choose to skip a chunk as it isn't flagged, the code
currently skips the whole raid10-chunk, and thus it might not recover
some blocks that need recovering.

This patch fixes it.

In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.

This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.



Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
159ec1fc060ab22b157a62364045f5e98749c4d3 08-Jan-2009 Cheng Renquan <crquan@gmail.com> md: use list_for_each_entry macro directly

The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
list_for_each_entry_safe, from <linux/list.h>, it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.

But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough; this could save a temp
variable (tmp) in every function that used rdev_for_each.

In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
saving many tmp vars; the safe version is kept only where the loop
will call list_del to delete an entry.
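
After the change, the macro is a thin wrapper, roughly:

    #define rdev_for_each(rdev, tmp, mddev) \
            list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)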

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
a53a6c85756339f82ff19e001e90cfba2d6299a8 06-Nov-2008 NeilBrown <neilb@suse.de> md: fix bug in raid10 recovery.

Adding a spare to a raid10 doesn't cause recovery to start.
This is due to a silly typo in
commit 6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda
and so is a bug in 2.6.27 and .28-rc.

Thanks to Thomas Backlund for bisecting to find this.

Cc: Thomas Backlund <tmb@mandriva.org>
Cc: stable@kernel.org

Signed-off-by: NeilBrown <neilb@suse.de>
255707274ea25d486b7de060a30ba4ac50593408 15-Oct-2008 Stephen Rothwell <sfr@canb.auug.org.au> md: build failure due to missing delay.h

Today's linux-next build (powerpc ppc64_defconfig) failed like this:

drivers/md/raid1.c: In function 'sync_request':
drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid1.o] Error 1
make[3]: *** Waiting for unfinished jobs....
drivers/md/raid10.c: In function 'sync_request':
drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
make[3]: *** [drivers/md/raid10.o] Error 1
drivers/md/md.c: In function 'md_do_sync':
drivers/md/md.c:5915: error: implicit declaration of function 'msleep'

Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
unnecessary #includes, #defines, and function declarations"). I added
the following patch.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NeilBrown <neilb@suse.de>
4bbf3771ca40d0aaec8316d0e7476b16010288e5 13-Oct-2008 NeilBrown <neilb@suse.de> md: Relax minimum size restrictions on chunk_size.

Currently, the 'chunk_size' of an array must be at least PAGE_SIZE.

This means that moving an array to a machine with a larger PAGE_SIZE, or
changing the kernel to use a larger PAGE_SIZE, can stop an array from
working.

For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
process works on whole pages at a time, and assumes them to be wholly
within a stripe. For other raid personalities, this restriction is
not needed at all and can be dropped.

So remove the test on chunk_size from common code, and add it in just
the places where it is needed: raid10 and raid4/5/6.
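
A sketch of the per-personality test that moves into raid10's run
method (message text approximate):

    if (mddev->chunk_size < PAGE_SIZE) {
            printk(KERN_ERR "md/raid10: chunk size must be"
                   " at least PAGE_SIZE(%ld).\n", PAGE_SIZE);
            return -EINVAL;
    }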

Signed-off-by: NeilBrown <neilb@suse.de>
6feef531f55cf4a20fd9eb39f5352e5745203603 09-Oct-2008 Denis ChengRq <crquan@gmail.com> block: mark bio_split_pool static

Since all bio_split calls refer to the same single bio_split_pool, the bio_split
function can use bio_split_pool directly instead of the mempool_t parameter;

then the mempool_t parameter can be removed from the bio_split param list, and
bio_split_pool is only referred to in fs/bio.c, so it can be marked static.

Signed-off-by: Denis ChengRq <crquan@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
074a7aca7afa6f230104e8e65eba3420263714a5 25-Aug-2008 Tejun Heo <tj@kernel.org> block: move stats from disk to part0

Move stats related fields - stamp, in_flight, dkstats - from disk to
part0 and unify stat handling such that...

* part_stat_*() now updates part0 together if the specified partition
is not part0. ie. part_stat_*() are now essentially all_stat_*().

* {disk|all}_stat_*() are gone.

* part_round_stats() is updated similarly. It handles part0 stats
automatically and disk_round_stats() is killed.

* part_{inc|dec}_in_flight() is implemented which automatically updates
part0 stats for parts other than part0.

* disk_map_sector_rcu() is updated to return part0 if no part matches.
Combined with the above changes, this makes NULL special case
handling in callers unnecessary.

* Separate stats show code paths for disk are collapsed into part
stats show code paths.

* Rename disk_stat_lock/unlock() to part_stat_lock/unlock()

While at it, reposition stat handling macros a bit and add missing
parentheses around macro parameters.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
c9959059161ddd7bf4670cf47367033d6b2f79c4 25-Aug-2008 Tejun Heo <tj@kernel.org> block: fix diskstats access

There are two variants of stat functions - ones prefixed with double
underbars which don't care about preemption and ones without which
disable preemption before manipulating per-cpu counters. It's unclear
whether the underbarred ones assume that preemption is disabled on
entry as some callers don't do that.

This patch unifies diskstats access by implementing disk_stat_lock()
and disk_stat_unlock() which take care of both RCU (for partition
access) and preemption (for per-cpu counter access). diskstats access
should always be enclosed between the two functions. As such, there's
no need for the versions which disables preemption. They're removed
and double underbars ones are renamed to drop the underbars. As an
extra argument is added, there's no danger of using the old version
unconverted.

disk_stat_lock() uses get_cpu() and returns the cpu index and all
diskstat functions which access per-cpu counters now have a @cpu
argument to help RT.
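
Typical accounting then follows this pattern (a sketch based on the
description above):

    cpu = disk_stat_lock();         /* get_cpu() + rcu_read_lock() */
    disk_stat_inc(cpu, disk, ios[rw]);
    disk_stat_add(cpu, disk, sectors[rw], bio_sectors(bio));
    disk_stat_unlock();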

This change adds RCU or preemption operations at some places but also
collapses several preemption ops into one at others. Overall, the
performance difference should be negligible as all involved ops are
very lightweight per-cpu ones.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
960e739d9e9f1c2346d8bdc65299ee2e1ed42218 15-Aug-2008 Jens Axboe <jens.axboe@oracle.com> block: raid fixups for removal of bi_hw_segments

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
5df97b91b5d7ed426034fcc84cb6e7cf682b8838 15-Aug-2008 Mikulas Patocka <mpatocka@redhat.com> drop vmerge accounting

Remove hw_segments field from struct bio and struct request. Without virtual
merge accounting they have no purpose.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
0310fa216decc3ecfab41f327638fa48a81f3735 05-Aug-2008 NeilBrown <neilb@suse.de> Allow raid10 resync to happen in larger chunks.

The raid10 resync/recovery code currently limits the amount of
in-flight resync IO to 2Meg. This was copied from raid1 where
it seems quite adequate. However for raid10, some layouts require
a bit of seeking to perform a resync, and allowing a larger buffer
size means that the seeking can be significantly reduced.

There is probably no real need to limit the amount of in-flight
IO at all. Any shortage of memory will naturally reduce the
amount of buffer space available down to a set minimum, and any
concurrent normal IO will quickly cause resync IO to back off.

The only problem would be that normal IO has to wait for all resync IO
to finish, so a very large amount of resync IO could cause unpleasant
latency when normal IO starts up.

So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
available) which seems to be a good amount. Also reduce the amount
of memory reserved as there is no need to keep 2Meg just for resync if
memory is tight.

Thanks to Keld for the suggestion.

Cc: Keld Jørn Simonsen <keld@dkuug.dk>
Signed-off-by: NeilBrown <neilb@suse.de>
388667bed591b2359713bb17d5de0cf56e961447 25-Jul-2008 Arthur Jones <ajones@riverbed.com> md: raid10: wake up frozen array

When rescheduling a bio in raid10, we wake up
the md thread, but if the array is frozen, this
will have no effect. This causes the array to
remain frozen for eternity. We add a wake_up
to allow the array to de-freeze. This code is
nearly identical to the raid1 code, which has
this fix already.

Signed-off-by: Arthur Jones <ajones@riverbed.com>
Signed-off-by: NeilBrown <neilb@suse.de>
f233ea5c9e0d8b95e4283bf6a3436b88f6fd3586 21-Jul-2008 Andre Noll <maan@systemlinux.org> md: Make mddev->array_size sector-based.

This patch renames the array_size field of struct mddev_s to array_sectors
and converts all instances to use units of 512 byte sectors instead of 1k
blocks.

Signed-off-by: Andre Noll <maan@systemlinux.org>
Signed-off-by: NeilBrown <neilb@suse.de>
cc371e66e340f35eed8dc4651c7c18e754c7fb26 03-Jul-2008 Alasdair G Kergon <agk@redhat.com> Add bvec_merge_data to handle stacked devices and ->merge_bvec()

When devices are stacked, one device's merge_bvec_fn may need to perform
the mapping and then call one or more functions for its underlying devices.

The following bio fields are used:
bio->bi_sector
bio->bi_bdev
bio->bi_size
bio->bi_rw using bio_data_dir()

This patch creates a new struct bvec_merge_data holding a copy of those
fields to avoid having to change them directly in the struct bio when
going down the stack only to have to change them back again on the way
back up. (And then when the bio gets mapped for real, the whole
exercise gets repeated, but that's a problem for another day...)
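
The new structure simply carries copies of those four fields, and the
merge_bvec_fn signature takes it instead of a bio:

    struct bvec_merge_data {
            struct block_device *bi_bdev;
            sector_t bi_sector;
            unsigned int bi_size;
            unsigned long bi_rw;
    };

    typedef int (merge_bvec_fn) (struct request_queue *,
                                 struct bvec_merge_data *,
                                 struct bio_vec *);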

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
199050ea1ff2270174ee525b73bc4c3323098897 28-Jun-2008 Neil Brown <neilb@notabene.brown> rationalise return value for ->hot_add_disk method.

For all array types but linear, ->hot_add_disk returns 1 on
success, 0 on failure.
For linear, it returns 0 on success and -errno on failure.

This doesn't cause a functional problem because the ->hot_add_disk
function of linear is used quite differently to the others.
However it is confusing.

So convert all to return 0 for success or -errno on failure
and fix call sites to match.

Signed-off-by: Neil Brown <neilb@suse.de>
6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda 28-Jun-2008 Neil Brown <neilb@notabene.brown> Support adding a spare to a live md array with external metadata.

i.e. extend the 'md/dev-XXX/slot' attribute so that you can
tell a device to fill a vacant slot in an md array.

Signed-off-by: Neil Brown <neilb@suse.de>
8c2e870a625bd336b2e7a65a97c1836acef07322 28-Jun-2008 Neil Brown <neilb@notabene.brown> Ensure interrupted recovery completed properly (v1 metadata plus bitmap)

If, while assembling an array, we find a device which is not fully
in-sync with the array, it is important to set the "fullsync" flags.
This is an exact analog to the setting of this flag in hot_add_disk
methods.

Currently, only v1.x metadata supports having devices in an array
which are not fully in-sync (it keeps track of how in-sync they are).
The 'fullsync' flag only makes a difference when a write-intent bitmap
is being used. In this case it tells recovery to ignore the bitmap
and recover all blocks.

This fix is already in place for raid1, but not raid5/6 or raid10.

So without this fix, a raid1 or raid4/5/6 array with version 1.x
metadata and a write-intent bitmap, that is stopped in the middle
of a recovery, will appear to complete the recovery instantly
after it is reassembled, but the recovery will not be correct.

If you might have an array like that, issuing
echo repair > /sys/block/mdXX/md/sync_action

will make sure recovery completes properly.

Cc: <stable@kernel.org>
Signed-off-by: Neil Brown <neilb@suse.de>
dfc7064500061677720fa26352963c772d3ebe6b 23-May-2008 NeilBrown <neilb@suse.de> md: restart recovery cleanly after device failure.

When we get any IO error during a recovery (rebuilding a spare), we abort
the recovery and restart it.

For RAID6 (and multi-drive RAID1) it may not be best to restart at the
beginning: when multiple failures can be tolerated, the recovery may be
able to continue and re-doing all that has already been done doesn't make
sense.

We already have the infrastructure to record where a recovery is up to
and restart from there, but it is not being used properly.
This is because:
- We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
which causes the recovery not to be checkpointed.
- We remove spares and then re-add them, which loses important state
information.

The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
needed. If there is an error, the relevant drive will be marked as
Faulty, and that is enough to ensure correct handling of the error. So we
first remove MD_RECOVERY_ERR, changing some of the uses of it to
MD_RECOVERY_INTR.

Then we cause the attempt to remove a non-faulty device from an array to
fail (unless recovery is impossible as the array is too degraded). Then
when remove_and_add_spares attempts to remove the devices on which
recovery can continue, it will fail, they will remain in place, and
recovery will continue on them as desired.

Issue: If we are halfway through rebuilding a spare and another drive
fails, and a new spare is immediately available, do we want to:
1/ complete the current rebuild, then go back and rebuild the new spare or
2/ restart the rebuild from the start and rebuild both devices in
parallel.

Both options can be argued for. The code currently takes option 2 as
a/ this requires least code change
b/ this results in a minimally-degraded array in minimal time.

Cc: "Eivind Sarto" <ivan@kasenna.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e7e72bf641b1fc7b9df6f40bd2c36dfccd8d647c 15-May-2008 Neil Brown <neilb@suse.de> Remove blkdev warning triggered by using md

As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.

For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock use
q->__queue_lock. So always initialise that lock when the queue is allocated.

With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.

Thanks to Dan Williams for review and fixing the silly bugs.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Alistair John Strachan <alistair@devzero.co.uk>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jacek Luczak <difrost.kernel@gmail.com>
Cc: Prakash Punnoor <prakash@punnoor.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cb6969e8cdef39e613b1755eff595f830b89bc82 07-May-2008 Harvey Harrison <harvey.harrison@gmail.com> misc: fix integer as NULL pointer warnings

drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6bfe0b499082fd3950429017cd8ebf2a6c458aa5 30-Apr-2008 Dan Williams <dan.j.williams@intel.com> md: support blocking writes to an array on device failure

Allows a userspace metadata handler to take action upon detecting a device
failure.

Based on an original patch by Neil Brown.

Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->external

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d7a420c9472a95c46600a0345434b7b166e0b9c7 28-Apr-2008 Nick Andrew <nick@nick-andrew.net> raid: remove leading TAB on printk messages

MD drivers use one printk() call to print 2 log messages and the second line
may be prefixed by a TAB character. It may also output a trailing space
before newline. klogd (I think) turns the TAB character into the 2 characters
'^I' when logging to a file. This looks ugly.

Instead of a leading TAB to indicate continuation, prefix both output lines
with 'raid:' or similar. Also remove any trailing space in the vicinity of
the affected code and consistently end the sentences with a period.

Signed-off-by: Nick Andrew <nick@nick-andrew.net>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a07e6ab41be179cf1ed728a4f41368435508b550 04-Mar-2008 K.Tanaka <k-tanaka@ce.jp.nec.com> md: the md RAID10 resync thread could cause a md RAID10 array deadlock

This message describes another issue about md RAID10 found by testing the
2.6.24 md RAID10 using the new scsi fault injection framework.

Abstract:

When a scsi error results in disabling a disk during RAID10 recovery, the
resync threads of md RAID10 could stall.

In this case, the raid array has already been broken and it may not matter. But
I think a stall is not preferable. If it occurs, even shutdown or reboot will
fail because of busy resources.

The deadlock mechanism:

The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
be handled when recovering. The "remaining" counter is incremented when
building a BIO in sync_request() and is decremented when finishing a BIO in
end_sync_write().

If building a BIO fails for some reason in sync_request(), the "remaining"
should be decremented if it has already been incremented. I found a case
where this decrement is forgotten. This causes a md_do_sync() deadlock
because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
end_sync_write() never calls md_done_sync() because of the "remaining" counter
mismatch.

For example, this problem would be reproduced in the following case:

Personalities : [raid10]
md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
[>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec

In this case, sdf1 is recovering; sdb1 and sde1 are disabled.
An additional error with detaching sdd will cause a deadlock.

md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
[=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec

2739 ? S< 0:17 [md0_raid10]
28608 ? D< 0:00 [md0_resync]
28629 pts/1 Ss 0:00 bash
28830 pts/1 R+ 0:00 ps ax
31819 ? D< 0:00 [kjournald]

The resync thread keeps working, but actually it is deadlocked.

Patch:
With this patch, the remaining counter is decremented when needed.
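
In sketch form, the invariant the fix restores (variable names
illustrative, not the actual patch):

    /* when a recovery BIO is successfully built in sync_request(): */
    atomic_inc(&r10_bio->remaining);

    /* the failure path must undo it, otherwise end_sync_write() can
     * never bring the count to zero and md_done_sync() is never called: */
    if (!bio_built)
            atomic_dec(&r10_bio->remaining);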

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1c830532f6b44d10a1743ccd00e990c6b83396f5 04-Mar-2008 NeilBrown <neilb@suse.de> md: fix possible raid1/raid10 deadlock on read error during resync

Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.

If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.

This patch fixes the problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8ed3a19563b6c05b7625649b1769ddb063d53253 04-Mar-2008 Keld Simonsen <keld@dkuug.dk> md: don't attempt read-balancing for raid10 'far' layouts

This patch changes the disk to be read for layout "far > 1" to always be the
disk with the lowest block address.

Thus the chunks to be read will always be (for a fully functioning array) from
the first band of stripes, and the raid will then work as a raid0 consisting
of the first band of stripes.

Some advantages:

The fastest part, which is the outer sectors of the disks involved, will be
used. The outer blocks of a disk may be as much as 100 % faster than the
inner blocks.

Average seek time will be smaller, as seeks will always be confined to the
first part of the disks.

Mixed disks with different performance characteristics will work better, as
they will work as raid0, the sequential read rate will be number of disks
involved times the IO rate of the slowest disk.

If a disk is malfunctioning, the first disk which is working, and has the
lowest block address for the logical block will be used.

Signed-off-by: Keld Simonsen <keld@dkuug.dk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a35e63efa1fb18c6f20f38e3ddf3f8ffbcf0f6e7 04-Mar-2008 NeilBrown <neilb@suse.de> md: fix deadlock in md/raid1 and md/raid10 when handling a read error

When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.

This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).

However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.
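
Condensed, the freeze path then looks roughly like this (sketch of the
raid1 version):

    static void freeze_array(conf_t *conf)
    {
            spin_lock_irq(&conf->resync_lock);
            conf->barrier++;
            conf->nr_waiting++;
            wait_event_lock_irq(conf->wait_barrier,
                                conf->nr_pending == conf->nr_queued + 1,
                                conf->resync_lock,
                                flush_pending_writes(conf));
            spin_unlock_irq(&conf->resync_lock);
    }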

Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
problem.

Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d089c6af10c2be5988f03667d6d22fe6085fbe5e 06-Feb-2008 NeilBrown <neilb@suse.de> md: change ITERATE_RDEV to rdev_for_each

As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c620727779f7cc8ea96efb71f0651a26349e59c1 06-Feb-2008 NeilBrown <neilb@suse.de> md: allow a maximum extent to be set for resyncing

This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.

Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b47490c9bc73d0b34e4c194db40de183e592e446 06-Feb-2008 NeilBrown <neilb@suse.de> md: Update md bitmap during resync.

Currently an md array with a write-intent bitmap does not update that bitmap
to reflect successful partial resync. Rather the entire bitmap is updated
when the resync completes.

This is because there is no guarantee that resync requests will complete in
order, and tracking each request individually is unnecessarily burdensome.

However there is value in regularly updating the bitmap, so add code to
periodically pause while all pending sync requests complete, then update the
bitmap. Doing this only every few seconds (the same as the bitmap update
time) does not noticeably affect resync performance.

[snitzer@gmail.com: export bitmap_cond_end_sync]
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Mike Snitzer" <snitzer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2ad8b1ef11c98c5603580878aebf9f1bc74129e4 07-Nov-2007 Alan D. Brunelle <Alan.Brunelle@hp.com> Add UNPLUG traces to all appropriate places

Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
fd5d806266935179deda1502101624832eacd01f 16-Oct-2007 Jens Axboe <jens.axboe@oracle.com> block: convert blkdev_issue_flush() to use empty barriers

Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
6712ecf8f648118c3363c142196418f89a510b90 27-Sep-2007 NeilBrown <neilb@suse.de> Drop 'size' argument from bio_endio and bi_end_io

As bi_end_io is only called once when the request is complete,
the 'size' argument is now redundant. Remove it.

Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.

While we are at it, change bi_end_io to return void.
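
The prototype before and after:

    /* before */
    void bio_endio(struct bio *bio, unsigned int bytes_done, int error);
    /* after */
    void bio_endio(struct bio *bio, int error);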

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
f6f953aa99d456aff44ffdb1c77061d1a010eae2 31-Jul-2007 Arne Redlich <agr@powerkom-dd.de> md: handle writes to broken raid10 arrays gracefully

When writing to a broken array, raid10 currently happily emits empty bio
lists. IOW, the master bio will never be completed, sending writers to
UNINTERRUPTIBLE_SLEEP forever.

Signed-off-by: Arne Redlich <agr@powerkom-dd.de>
Acked-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14e713446aaca97dbe590fe845f7dcbd74ddbee2 31-Jul-2007 Maik Hampel <m.hampel@gmx.de> md: raid10: fix use-after-free of bio

In case of read errors raid10d tries to print a nice error message,
unfortunately using data from an already put bio.

Signed-off-by: Maik Hampel <m.hampel@gmx.de>
Acked-By: NeilBrown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
165125e1e480f9510a5ffcfbfee4e3ee38c05f23 24-Jul-2007 Jens Axboe <jens.axboe@oracle.com> [BLOCK] Get rid of request_queue_t typedef

Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
4ad1366376bfef32ec0ffa12d1faa483d6f330bd 17-Jul-2007 NeilBrown <neilb@suse.de> md: change bitmap_unplug and others to void functions

bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.

write_page returns an error which is not consistently checked. It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.

When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.

bitmap_update_sb returns an error that is never checked.

So make these 'void' and be consistent about checking the bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
af03b8e4e81c3789e597632268940edd11ffe870 16-Jun-2007 NeilBrown <neilb@suse.de> md: fix two raid10 bugs

1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
garbage can get synced on top of good data.

2/ We round the wrong way in part of the device size calculation, which
can cause confusion.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
64a742bc61f9115b0bb270fa081e5b5b9c35dcd0 01-Mar-2007 NeilBrown <neilb@suse.de> [PATCH] md: fix raid10 recovery problem.

There are two errors that can lead to recovery problems with raid10
when used in 'far' mode (not the default).

Due to a '>' instead of '>=', the wrong block is located, which would result in
garbage being written to some random location, quite possible outside the
range of the device, causing the newly reconstructed device to fail.

The device size calculation had some rounding errors (it didn't round when it
should) and so recovery would go a few blocks too far which would again cause
a write to a random block address and probably a device error.

The code for working with device sizes was fairly confused and spread out, so
this has been tidied up a bit.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e3881a6816b45668df60a426e5c3431ece1539a7 11-Jan-2007 Lars Ellenberg <Lars.Ellenberg@linbit.com> [PATCH] md: pass down BIO_RW_SYNC in raid{1,10}

md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
introducing additional latency.

Fixing this in raid1 and raid10 seems to be straightforward enough.

For our particular usage case in DRBD, passing this flag improved some
initialization time from ~5 minutes to ~5 seconds.

Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Lars Ellenberg <lars@linbit.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
802ba064c49f655d20fed563f2a4924c8256ea10 13-Dec-2006 NeilBrown <neilb@suse.de> [PATCH] md: Don't assume that READ==0 and WRITE==1 - use the names explicitly

Thanks Jens for alerting me to this.

Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: <raziebe@gmail.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
969b755aadf7bcf3df5991a127a103acd0145a52 28-Oct-2006 Randy Dunlap <randy.dunlap@oracle.com> [PATCH] md: fix printk format warnings, seen on powerpc64:

drivers/md/raid1.c:1479: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid10.c:1475: warning: long long unsigned int format, long unsigned int arg (arg 4)

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2e333e89860431d22816c1bdaa2ea72c2753396e 21-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: fix calculation of ->degraded for multipath and raid10

Two less-used md personalities have bugs in the calculation of ->degraded (the
extent to which the array is degraded).

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0d12922823408b26f83b15cae4a4feff4bd22f28 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: define ->congested_fn for raid1, raid10, and multipath

raid1, raid10 and multipath don't report their 'congested' status through
bdi_*_congested, but should.

This patch adds the appropriate functions which just check the 'congested'
status of all active members (with appropriate locking).

raid1 read_balance should be modified to prefer devices where
bdi_read_congested returns false. Then we could use the '&' branch rather
than the '|' branch. However, that would need some benchmarking first
to make sure it is actually a good idea.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
c04be0aa82ff535e3676ab3e573957bdeef41879 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: Improve locking around error handling

The error handling routines don't use proper locking, and so two concurrent
errors could trigger a problem.

So:
- use test-and-set and test-and-clear to synchronise
the In_sync bits with the ->degraded count
- use the spinlock to protect updates to the
degraded count (could use an atomic_t but that
would be a bigger change in code, and isn't
really justified)
- remove unnecessary locking in raid5

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
76186dd8b73d2b7b9b4c8629b89c845e97009801 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: remove 'working_disks' from raid10 state

It isn't needed as mddev->degraded contains equivalent info.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
850b2b420cd5b363ed4cf48a8816d656c8b5251b 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: replace magic numbers in sb_dirty with well defined bit flags

Instead of magic numbers (0,1,2,3) in sb_dirty, we now have
some named flags:
MD_CHANGE_DEVS
Some device state has changed requiring superblock update
on all devices.
MD_CHANGE_CLEAN
The array has transitioned from 'clean' to 'dirty' or back,
requiring a superblock update on active devices, but possibly
not on spares.
MD_CHANGE_PENDING
A superblock update is underway.

We wait for an update to complete by waiting for all flags to be clear. A
flag can be set at any time, even during an update, without risk that the
change will be lost.
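
In use this looks something like the following (a sketch; the exact
wait condition is an assumption):

    set_bit(MD_CHANGE_DEVS, &mddev->flags);   /* request a superblock update */
    md_wakeup_thread(mddev->thread);
    /* an update is complete only when every change bit has cleared */
    wait_event(mddev->sb_wait, mddev->flags == 0);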

Stop exporting md_update_sb - it isn't needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6814d5368d68341ec6b5e4ecd10ea5947130775a 03-Oct-2006 NeilBrown <neilb@suse.de> [PATCH] md: factor out part of raid10d into a separate function.

raid10d has too many nested blocks, so take the fix_read_error functionality
out into a separate function.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d69504325978c461b51b03cca49626026970307b 10-Jul-2006 NeilBrown <neilb@suse.de> [PATCH] md: include sector number in messages about corrected read errors

This is generally useful, but particularly helps see if it is the same sector
that always needs correcting, or different ones.

[akpm@osdl.org: fix printk warnings]
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8838832830d2c6c28ae2db93188ae90652eb7fc2 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: Calculate correct array size for raid10 in new offset mode

The size calculation made assumptions which the new offset mode didn't
follow. This gets the size right in all cases.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
c93983bf517c100a31e40ef087e19bd3d7aa2d28 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: support stripe/offset mode in raid10

The "industry standard" DDF format allows for a stripe/offset layout where
data is duplicated on different stripes. e.g.

A B C D
D A B C
E F G H
H E F G

(columns are drives, rows are stripes, LETTERS are chunks of data).

This is similar to raid10's 'far' mode, but not quite the same. So enhance
'far' mode with a 'far/offset' option which follows the layout of DDF's
stripe/offset.
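
One way to read the diagram: with n drives and two copies, each chunk
keeps its raid0 position for copy 0, and copy 1 lands on the next stripe,
shifted one drive to the right (illustrative arithmetic, not kernel code):

    copy 0 of chunk c -> drive  c % n,       stripe 2 * (c / n)
    copy 1 of chunk c -> drive (c + 1) % n,  stripe 2 * (c / n) + 1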

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
5fd6c1dce06ec24ef3de20fe0c7ecf2ba9fe5ef9 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: allow checkpoint of recovery with version-1 superblock

For a while we have had checkpointing of resync. The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.

Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8932c2e0dcae52e73430878fd8a7a7800176eada 26-Jun-2006 NeilBrown <neilb@suse.de> [PATCH] md: remove arbitrary limit on chunk size

The largest chunk size the code can support without substantial surgery is
2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day,
the 'chunksize' should change to a sector-shift instead of a byte-count. Then
no limit would be needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e0a33270ed0e8e00cbb882a33d21e1f92aac0ceb 01-May-2006 NeilBrown <neilb@suse.de> [PATCH] md: Fixed refcounting/locking when attempting read error correction in raid10

We need to hold a reference to rdevs while reading and writing to attempt to
correct read errors. This reference must be taken under an rcu lock.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
df30d0f4ca3c41b60068232d6de9d58be88436f0 01-May-2006 NeilBrown <neilb@suse.de> [PATCH] md: Avoid oops when attempting to fix read errors on raid10

We should add to the counter for the rdev *after* checking if the rdev is
NULL!!!

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b6385483840e903d99cb753593faea215ae8d324 02-Apr-2006 Eric Sesterhenn <snakebyte@gmx.de> BUG_ON() Conversion in md/raid10.c

This changes if() BUG(); constructs to BUG_ON(), which is
cleaner and can be better optimized away.
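
For example (illustrative):

    /* before */
    if (!page)
            BUG();
    /* after */
    BUG_ON(!page);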

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
29fc7e3e70a05e9eea28afb6707a39c1a53e2f66 03-Feb-2006 NeilBrown <neilb@suse.de> [PATCH] md: Assorted little md fixes

- version-1 superblock
+ The default_bitmap_offset is in sectors, not bytes.
+ the 'size' field in the superblock is in sectors, not KB
- raid0_run should return a negative number on error, not '1'
- raid10_read_balance should not return a valid 'disk' number if
->rdev turned out to be NULL
- kmem_cache_destroy doesn't like being passed a NULL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
858119e159384308a5dde67776691a2ebf70df0f 14-Jan-2006 Arjan van de Ven <arjan@infradead.org> [PATCH] Unlinline a bunch of other functions

Remove the "inline" keyword from a bunch of big functions in the kernel with
the goal of shrinking it by 30kb to 40kb

Signed-off-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4dbcdc751cb25ffca3a8374cbc5ab6de961cc545 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: count corrected read errors per drive

Store this total in the superblock (as appropriate), and make it available to
userspace via sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d9d166c2a9d5d01af34396793950aa695883eed4 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: allow array level to be set textually via sysfs

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
f188593ee7af8c71755d2df269a7a5f62c4b695e 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: fix typo in comment

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1345b1d8adbdeceb1c871d9a4af5e2a700b341c6 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: define and use safe_put_page for md

md sometimes calls put_page on NULL pointers (treating it like kfree). This is
not safe, so define and use a 'safe_put_page' which checks for NULL.
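
The helper is a one-liner (sketch):

    static inline void safe_put_page(struct page *p)
    {
            if (p)
                    put_page(p);
    }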

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
097426f689f179747f3cd6b4749eb2a6b605702d 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: fix possible problem in raid1/raid10 error overwriting

The code to overwrite/reread for addressing read errors in raid1/raid10
currently assumes that the read will not alter the buffer which could be used
to write to the next device. This is not a safe assumption to make.

So we split the loops into an overwrite loop and a separate re-read loop, so
that the writing is complete before reading is attempted.

Cc: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2604b703b6b3db80e3c75ce472a54dfd0b7bf9f4 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: remove personality numbering from md

md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).

These personalities have fairly artificial 'numbers'. The numbers
are used to:
1- provide an index into an array where the various personalities
are recorded
2- identify the module (via an alias) which implements a particular
personality.

Neither of these uses really justifies the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely). Module identification can be done using an alias
based on level rather than 'personality' number.

The current 'raid5' module supports two levels (4 and 5) but only one
personality. This slight awkwardness (which was handled in the mapping from
level to personality) can be better addressed by allowing raid5 to register
two personalities.

With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.

This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
a24a8dd858e0ba50f06a9fd8f61fe8c4fe7a8d8e 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: break out of a loop that doesn't need to run to completion

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
9ffae0cf3ea02f75d163922accfd3e592d87adde 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: convert md to use kzalloc throughout

Replace multiple kmalloc/memset pairs with kzalloc calls.
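
For example (illustrative):

    /* before */
    conf = kmalloc(sizeof(conf_t), GFP_KERNEL);
    if (conf)
            memset(conf, 0, sizeof(conf_t));
    /* after */
    conf = kzalloc(sizeof(conf_t), GFP_KERNEL);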

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2d1f3b5d1b2cd11a162eb29645df749ec0036413 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: clean up 'page' related names in md

Substitute:

page_cache_get -> get_page
page_cache_release -> put_page
PAGE_CACHE_SHIFT -> PAGE_SHIFT
PAGE_CACHE_SIZE -> PAGE_SIZE
PAGE_CACHE_MASK -> PAGE_MASK
__free_page -> put_page

because we aren't using the page cache, we are just using pages.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0eb3ff12aa8a12538ef681dc83f4361636a0699f 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: raid10 read-error handling - resync and read-only

Add in correct read-error handling for resync and read-only situations.

When read-only, we don't over-write, so we need to mark the failed drive in
the r10_bio so we don't re-try it. During resync, we always read all blocks,
so if there is a read error, we simply over-write it with the good block that
we found (assuming we found one).

Note that the recovery case still isn't handled in an interesting way. There
is nothing useful to do for the 2-copies case. If there are 3 or more copies,
then we could try reading from one of the non-missing copies, but this is a
bit complicated and would very rarely be used, so I'm leaving it for now.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4443ae10ca15d07922ceda622f03db8865fa3d13 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: auto-correct correctable read errors in raid10

Largely just a cross-port from raid1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
18f08819f42b647783e4f6ea99141623881bf182 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: support check-without-repair of raid10 arrays

Also keep a count of the number of errors found.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6cce3b23f6f8e974c00af7a9b88f1d413ba368a8 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: write intent bitmap support for raid10

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0a27ec96b6fb1abf867e36d7b0b681d67588767a 06-Jan-2006 NeilBrown <neilb@suse.de> [PATCH] md: improve raid10 "IO Barrier" concept

raid10 needs to put up a barrier to new requests while it does resync or other
background recovery. The code for this is currently open-coded, slightly
obscured by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.
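
In outline (a sketch of intent, not the exact code):

    raise_barrier() - block new normal IO and wait for pending IO to drain
    lower_barrier() - drop the barrier once the resync/recovery IO is issued
    wait_barrier()  - normal IO waits here while a barrier is raised
    allow_barrier() - normal IO has completed; a waiting barrier may proceed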

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
22dfdf5212e5864b844f629736fb993d4611f190 28-Nov-2005 NeilBrown <neilb@suse.de> [PATCH] md: improve read speed to raid10 arrays using 'far copies'

raid10 has two different layouts. One uses near-copies (so multiple
copies of a block are at the same or similar offsets of different
devices) and the other uses far-copies (so multiple copies of a block
are stored a greatly different offsets on different devices). The point
of far-copies is that it allows the first section (normally the first half)
to be laid out in normal raid0 style, and thus provide raid0 sequential
read performance.

Unfortunately, the read balancing in raid10 makes some poor decisions
for far-copies arrays and you don't get the desired performance. So
turn off that bad bit of read_balance for far-copies arrays.

With this patch, read speed of an 'f2' array is comparable with a raid0
with the same number of devices, though write speed is of course still
very slow.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b2d444d7ad975d555bb919601bcdc0e58975a40e 09-Nov-2005 NeilBrown <neilb@suse.de> [PATCH] md: convert 'faulty' and 'in_sync' fields to bits in 'flags' field

This has the advantage of removing the confusion caused by 'rdev_t' and
'mddev_t' both having 'in_sync' fields.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d6065f7bf8bec170c9c56524a250093ce73ca5d9 09-Nov-2005 Suzanne Wood <suzannew@cs.pdx.edu> [PATCH] md: provide proper rcu_dereference / rcu_assign_pointer annotations in md

Acked-by: <paulmck@us.ibm.com>
Signed-off-by: Suzanne Wood <suzannew@cs.pdx.edu>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
a362357b6cd62643d4dda3b152639303d78473da 01-Nov-2005 Jens Axboe <axboe@suse.de> [BLOCK] Unify the seperate read/write io stat fields into arrays

Instead of having ->read_sectors and ->write_sectors, combine the two
into ->sectors[2] and similar for the other fields. This saves a branch
several places in the io path, since we don't have to care for what the
actual io direction is. On my x86-64 box, that's 200 bytes less text in
just the core (not counting the various drivers).
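
For example (illustrative field names):

    /* before: one field per direction, so a branch on every update */
    if (rw == WRITE)
            stats->write_sectors += nsect;
    else
            stats->read_sectors += nsect;
    /* after: index by direction, no branch */
    stats->sectors[rw] += nsect;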

Signed-off-by: Jens Axboe <axboe@suse.de>
dd0fc66fb33cd610bc1a5db8a5e232d34879b4d7 07-Oct-2005 Al Viro <viro@ftp.linux.org.uk> [PATCH] gfp flags annotations - part 1

- added typedef unsigned int __nocast gfp_t;

- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
87fc767b832ef5a681a0ff9d203c3289bc3be2bf 10-Sep-2005 NeilBrown <neilb@suse.de> [PATCH] md: fix BUG when raid10 rebuilds without enough drives

This shouldn't be a BUG. We should cope.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6d508242b231cb6e6803faaef54456abe846edb8 10-Sep-2005 NeilBrown <neilb@suse.de> [PATCH] md: fix raid10 assembly when too many devices are missing

If you try to assemble an array with too many missing devices, raid10 will now
reject the attempt, instead of allowing it.

Also check when hot-adding a drive and refuse the hot-add if the array is
beyond hope.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e5dcdd80a60627371f40797426273048630dc8ca 10-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: fail IO request to md that require a barrier.

md does not yet support BIO_RW_BARRIER, so be honest about it and fail
(-EOPNOTSUPP) any such requests.
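
A sketch of the check at the top of make_request (2.6-era bio API):

    if (unlikely(bio_barrier(bio))) {
            bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
            return 0;
    }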

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
3ec67ac1a399d576d48b0736096bcce7721fe3cf 10-Sep-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: fix minor error in raid10 read-balancing calculation.

'this_sector' is a virtual (array) address while 'head_position' is a physical
(device) address, so subtraction doesn't make any sense. devs[slot].addr
should be used instead of this_sector.

However, this patch doesn't make much practical difference to the read
balancing due to the effects of later code.
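
Schematically (illustrative):

    /* before: array address minus device address - meaningless */
    new_distance = abs(this_sector - conf->mirrors[disk].head_position);
    /* after: compare like with like */
    new_distance = abs(r10_bio->devs[slot].addr -
                       conf->mirrors[disk].head_position);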

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
990a8baf568ca1d0ae65e59783ff821794118d07 22-Jun-2005 Jesper Juhl <juhl-lkml@dif.dk> [PATCH] md: remove unneeded NULL checks before kfree

This patch removes some unneeded checks of pointers being NULL before
calling kfree() on them. kfree() handles NULL pointers just fine, checking
first is pointless.
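
For example (illustrative):

    /* before */
    if (conf->mirrors)
            kfree(conf->mirrors);
    /* after: kfree(NULL) is a no-op */
    kfree(conf->mirrors);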

Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
3d310eb7b3df1252e8595d059d982b0a9825a137 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: fix deadlock due to md thread processing delayed requests.

Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.

The current code schedules this up and queues the write request for later
handling by the md_thread.

However some personalities (Raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.

So this patch changes things so that the process submitting the request waits
for the superblock to be written, and then submits the request itself.

This fixes a recently-created deadlock in raid5/raid6

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
57afd89f98a990747445f01c458ecae64263b2f8 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: improve the interface to sync_request

1/ change the return value (which is number-of-sectors synced)
from 'int' to 'sector_t'.
The number of sectors is usually small enough to fit
in an int, but if resync needs to abort, it may want to return
the total number of remaining sectors, which could be large.
Also, errors cannot be returned as negative numbers now, so use
0 instead.
2/ Add a 'skipped' return parameter to allow the array to report
that it skipped the sectors. This allows md to take this into account
in the speed calculations.
Currently there is no important skipping, but the bitmap-based-resync
that is coming will use this.
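
The resulting signature looks like this (illustrative; parameter names
as used by the md personalities of that era):

    static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
                                 int *skipped, int go_faster);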

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
06d91a5fe0b50c9060e70bdf7786f8a3c66249db 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: improve locking on 'safemode' and move superblock writes

When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which could cause
problems later.

With this patch, the superblock write is always done by the helper thread, and
write requests are delayed until that write completes.

Also, the locking around marking the array dirty and writing the superblock is
improved to avoid possible races.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fca4d848f0e6fafdc2b25f8a0cf1e76935f13ac2 22-Jun-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: merge md_enter_safemode into md_check_recovery

md_enter_safemode checks if it is time to mark the md superblock as 'clean'.
i.e. if all writes have completed and a suitable delay has passed.

This is currently called from md_handle_safemode, which in turn is called
(almost) every time md_check_recovery is called, and from the end of
md_do_sync, which causes the mddev->thread to run, which will always call
md_check_recovery as well.

So it doesn't need to be a separate function and fits quite well into
md_check_recovery.

The "almost" is because multipathd calls md_check_recovery but not
md_handle_safemode. This is OK because the code from md_enter_safemode is a
no-op if mddev->safemode == 0, which it always is for a multipathd (providing
we don't allow it to be set to 2 on a signal...)

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
7a5febe9ffeecd1e78c5b505260ccc1ef18021b4 17-May-2005 NeilBrown <neilb@cse.unsw.edu.au> [PATCH] md: set the unplug_fn and issue_flush_fn for md devices *after* committed to creation

If we set them too early, they may still be in place and possibly get called
even though the array didn't get set up properly.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
fbd568a3e61a7decb8a754ad952aaa5b5c82e9e5 01-May-2005 Paul E. McKenney <paulmck@us.ibm.com> [PATCH] Change synchronize_kernel to _rcu and _sched

This patch changes calls to synchronize_kernel(), deprecated in the earlier
"Deprecate synchronize_kernel, GPL replacement" patch to instead call the new
synchronize_rcu() and synchronize_sched() APIs.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!