d3f940223c31cffd977864cdc13c69e89f03ce55 |
|
12-Jun-2012 |
majianpeng <majianpeng@gmail.com> |
md/raid5: Do not add data_offset before call to is_badblock commit 6c0544e255dd6582a9899572e120fb55d9f672a4 upstream. In chunk_aligned_read() we are adding data_offset before calling is_badblock. But is_badblock also adds data_offset, so that is bad. So move the addition of data_offset to after the call to is_badblock. This bug was introduced by commit 31c176ecdf3563140e639 md/raid5: avoid reading from known bad blocks. which first appeared in 3.0. So this patch is suitable for any -stable kernel from 3.0.y onwards. However it will need minor revision for most of those (as the comment didn't appear until recently). Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: ignored missing comment] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
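The shape of the fix in chunk_aligned_read(), sketched under the assumption that the era's bio field (bi_sector) and is_badblock() signature are as in 3.0-era raid5.c; illustrative, not the verbatim diff:

    /* is_badblock() adds rdev->data_offset itself, so check against the
     * array sector first and only then translate to the on-disk sector */
    if (is_badblock(rdev, align_bi->bi_sector, bio_sectors(align_bi),
                    &first_bad, &bad_sectors)) {
            /* a bad block overlaps the read: give up on the
             * cache-bypass path and go through the stripe cache */
            bio_put(align_bi);
            return 0;
    }
    align_bi->bi_sector += rdev->data_offset;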
|
5bbbd747918d241b41f3220ff26323d7ed943c52 |
|
03-Jul-2012 |
Shaohua Li <shli@kernel.org> |
raid5: delayed stripe fix commit fab363b5ff502d1b39ddcfec04271f5858d9f26e upstream. There is no locking around setting the STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but the two bits are related. A delayed stripe can be moved to the hold list only when the preread active stripe count is below IO_THRESHOLD. If a stripe has both bits set, it sits in the delayed list while the preread count is not 0, so it can never leave the delayed list. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
8d9369807370331cebf3e237b95ecce068af80f1 |
|
02-Jul-2012 |
majianpeng <majianpeng@gmail.com> |
md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev commit 1850753d2e6d9ca7856581ca5d3cf09521e6a5d7 upstream. In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement nr_pending so we lose the reference we hold on the rdev. So atomic_inc it first to maintain the reference. This bug was introduced by commit 73e92e51b7969ef5477d md/raid5. Don't write to known bad block on doubtful devices. which appeared in 3.0, so this patch is suitable for stable kernels since then. Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
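A minimal sketch of the corrected ordering in ops_run_io() (illustrative):

    if (rdev && test_bit(Blocked, &rdev->flags)) {
            /* md_wait_for_blocked_rdev() drops nr_pending, so take a
             * reference first to keep holding the rdev across the wait */
            atomic_inc(&rdev->nr_pending);
            md_wait_for_blocked_rdev(rdev, conf->mddev);
    }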
|
c6d2e084c7411f61f2b446d94989e5aaf9879b0f |
|
01-Apr-2012 |
majianpeng <majianpeng@gmail.com> |
md/raid5: Fix a bug about judging if the operation is syncing or replacing When a raid5 array is created with --assume-clean and 'check' or 'repair' is echoed to sync_action, the component disks perform no IO, yet the check/resync completes faster than normal. The cause is the judgement in analyse_stripe(): if (do_recovery || sh->sector >= conf->mddev->recovery_cp) s->syncing = 1; else s->replacing = 1; During check or repair, recovery_cp == MaxSector, so syncing is set to zero, not one. This bug was introduced by commit 9a3e1101b827 md/raid5: detect and handle replacements during recovery. so this patch is suitable for 3.3-stable. Cc: stable@vger.kernel.org Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
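A sketch of the corrected test in analyse_stripe(), assuming the fix adds a check of MD_RECOVERY_REQUESTED so that a user-requested check/repair also counts as syncing:

    if (do_recovery ||
        sh->sector >= conf->mddev->recovery_cp ||
        test_bit(MD_RECOVERY_REQUESTED, &(conf->mddev->recovery)))
            s->syncing = 1;     /* resync/check/repair: read everything */
    else
            s->replacing = 1;   /* replacement: read only what's replaced */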
|
18b9837ea0dc3cf844c6c4196871ce91d047bddb |
|
01-Apr-2012 |
NeilBrown <neilb@suse.de> |
md/raid5: fix handling of bad blocks during recovery. 1/ We can only treat a known-bad-block like a read-error if we have the data that belongs in that block. So fix that test. 2/ If we cannot recover a stripe due to insufficient data, don't tell "md_done_sync" that the sync failed unless we really did fail something. If we successfully record bad blocks, that is success. Reported-by: "majianpeng" <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
dafb20fa34320a472deb7442f25a0c086e0feb33 |
|
18-Mar-2012 |
NeilBrown <neilb@suse.de> |
md: tidy up rdev_for_each usage. md.h has an 'rdev_for_each()' macro for iterating the rdevs in an mddev. However it uses the 'safe' version of list_for_each_entry, and so requires the extra variable, but doesn't include 'safe' in the name, which is useful documentation. Consequently some places use this safe version without needing it, and many use an explicit list_for_each_entry. So: - rename rdev_for_each to rdev_for_each_safe - create a new rdev_for_each which uses the plain list_for_each_entry, - use the 'safe' version only where needed, and convert all other list_for_each_entry calls to use rdev_for_each. Signed-off-by: NeilBrown <neilb@suse.de>
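A sketch of the resulting md.h macros, assuming the rdevs hang off mddev->disks via the 'same_set' list head as in the md.h of that era:

    /* plain iterator: fine as long as no rdev is removed while iterating */
    #define rdev_for_each(rdev, mddev) \
            list_for_each_entry(rdev, &((mddev)->disks), same_set)

    /* 'safe' variant: needs the extra 'tmp' cursor but tolerates removal */
    #define rdev_for_each_safe(rdev, tmp, mddev) \
            list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)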
|
dc10c643e8a8d008fd16dd6706e9e0018eadf8d2 |
|
18-Mar-2012 |
NeilBrown <neilb@suse.de> |
md: allow re-add to failed arrays. When an array is failed (some data inaccessible) then there is no point attempting to add a spare, as it could not possibly be recovered. However there may be value in re-adding a recently removed device. e.g. if there is a write-intent-bitmap and it is clear, then access to the data could be restored by this action. So don't reject a re-add to a failed array for RAID10 and RAID5 (the only array types that check for a failed array). Signed-off-by: NeilBrown <neilb@suse.de>
|
41fe75f60bcd4d698daed3e54bb099227358ce58 |
|
13-Mar-2012 |
majianpeng <majianpeng@gmail.com> |
md/raid5: use atomic_dec_return() instead of atomic_dec() and atomic_read(). Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
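The pattern being replaced, sketched (the preread_active_stripes counter is one of the spots this applies to):

    /* before: two atomic ops; the separate read can also observe a value
     * already changed by another CPU */
    atomic_dec(&conf->preread_active_stripes);
    if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD)
            md_wakeup_thread(conf->mddev->thread);

    /* after: one atomic op that returns the post-decrement value */
    if (atomic_dec_return(&conf->preread_active_stripes) < IO_THRESHOLD)
            md_wakeup_thread(conf->mddev->thread);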
|
9d4c7d8799c418816342e263479fa010b182183e |
|
13-Mar-2012 |
NeilBrown <neilb@suse.de> |
md/raid5: removed unused 'added_devices' variable. commit 908f4fbd265733 removed the last user of this variable, so we should discard it completely. Signed-off-by: NeilBrown <neilb@suse.de>
|
1e3fa9bd5061778fb5cf4648e4e8321e8cbbb95b |
|
13-Mar-2012 |
NeilBrown <neilb@suse.de> |
md/raid5: make sure reshape_position is cleared on error path. Leaving a valid reshape_position value in place could be confusing. Signed-off-by: NeilBrown <neilb@suse.de>
|
3a6de2924af602f9c1b5a5154438c37f2d712dfa |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Mark device want_replacement when we see a write error. Now that WantReplacement drives are replaced cleanly, mark a drive as WantReplacement when we see a write error. It might get failed soon so the WantReplacement flag is irrelevant, but if the write error is recorded in the bad block log, we still want to activate any spare that might be available. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
7bfec5f35c68121e7b1849f3f4166dd96c8da5b3 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: If there is a spare and a want_replacement device, start replacement. When attempting to add a spare to a RAID[456] array, also consider adding it as a replacement for a want_replacement device. This requires that common md code attempt hot_add even when the array is not formally degraded. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
17045f52ac76d9cd1a120e52af5d83b570af4ba8 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: recognise replacements when assembling array. If a Replacement is seen, file it as such. If we see two replacements (or two normal devices) for the one slot, abort. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
dd054fce88d33da1aa81d018db75b91b102a6959 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: handle activation of replacement device when recovery completes. When recovery completes - as reported by a call to ->spare_active, we clear In_sync on the original and set it on the replacement. Then when the original gets removed we move the replacement from 'replacement' to 'rdev'. This could race with other code that is looking at these pointers, so we use memory barriers and careful ordering to ensure that a reader might see one device twice, but never no devices. Then the readers guard against using both devices, which could only happen when writing. Signed-off-by: NeilBrown <neilb@suse.de>
|
9a3e1101b827a59ac9036a672f5fa8d5279d0fe2 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: detect and handle replacements during recovery. During recovery we want to write to the replacement but not the original. So we have two new flags - R5_NeedReplace if this stripe has a replacement that needs to be written at some stage - R5_WantReplace if NeedReplace, and the data is available, and a 'sync' has been requested on this stripe. We also distinguish between 'sync and replace' which need to read all other devices, and 'replace' which only needs to read the devices being replaced. Note that during resync we always write to any replacement device. It might not need to be written to, but as we don't read to compare, we have to write to be sure. Signed-off-by: NeilBrown <neilb@suse.de>
|
977df36255ab0ea78b048cbc9055300c586dcc91 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: writes should get directed to replacement as well as original. When writing, we need to submit two writes, one to the original, and one to the replacement - if there is a replacement. If the write to the replacement results in a write error, we just fail the device. We only try to record write errors to the original. When writing for recovery, we shouldn't write to the original. This will be addressed in a subsequent patch that generally addresses recovery. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
657e3e4d88461a5ab660dd87f8f773f55e748da4 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: allow removal for failed replacement devices. Enhance raid5_remove_disk to be able to remove ->replacement as well as ->rdev. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
14a75d3e07c784c004b4b44b34af996b8e4ac453 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: preferentially read from replacement device if possible. If a replacement device is present and has been recovered far enough, then use it for reading into the stripe cache. If we get an error we don't try to repair it, we just fail the device. A replacement device that gives errors does not sound sensible. This requires removing the setting of R5_ReadError when we get a read error during a read that bypasses the cache. It was probably a bad idea anyway as we don't know that every block in the read caused an error, and it could cause ReadError to be set for the replacement device, which is bad. Signed-off-by: NeilBrown <neilb@suse.de>
|
995c4275a7e14b8752f301e4570831a108ae4303 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: remove redundant bio initialisations. We currently initialise some fields of a bio when preparing a stripe_head, and again just before submitting the request. Remove the duplication by only setting the fields that lower level devices don't touch in raid5_build_block, and only set the changeable fields in ops_run_io. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
671488cc25f7c194c7c7a9f258bab1df17a6ff69 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: allow each slot to have an extra replacement device Just enhance data structures to record a second device per slot to be used as a 'replacement' device, replacing the original. We also have a second bio in each slot in each stripe_head. This will only be used when writing to the array - we need to write to both the original and the replacement at the same time, so will need two bios. For now, only try using the replacement drive for aligned-reads. In this case, we prefer the replacement if it has been recovered far enough, otherwise use the original. This includes a small enhancement. Previously we would only do aligned reads if the target device was fully recovered. Now we also do them if it has recovered far enough. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
b8321b68d1445f308324517e45fb0a5c2b48e271 |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md: change hot_remove_disk to take an rdev rather than a number. Soon an array will be able to have multiple devices with the same raid_disk number (an original and a replacement). So removing a device based on the number won't work. So pass the actual device handle instead. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
908f4fbd265733310c17ecc906299846b5dac44a |
|
23-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: be more thorough in calculating 'degraded' value. When an array is being reshaped to change the number of devices, the two halves can be differently degraded. e.g. one could be missing a device and the other not. So we need to be more careful about calculating the 'degraded' attribute. Instead of just inc/dec at appropriate times, perform a full re-calculation examining both possible cases. This doesn't happen often so it is not a big cost, and we already have most of the code to do it. Signed-off-by: NeilBrown <neilb@suse.de>
|
30d7a4836847bdb10b32c78a4879d4aebe0f193b |
|
22-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: ensure correct assessment of drives during degraded reshape. While reshaping a degraded array (as when reshaping a RAID0 by first converting it to a degraded RAID4) we currently get confused about which devices are in_sync. In most cases we get it right, but in the region that is being reshaped we need to treat non-failed devices as in-sync when we have the data but haven't actually written it out yet. Reported-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
5d8c71f9e5fbdd95650be00294d238e27a363b5c |
|
09-Dec-2011 |
Adam Kwolek <adam.kwolek@intel.com> |
md: raid5 crash during degradation A NULL pointer access causes a crash in the raid5 module. Signed-off-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
9283d8c5af4cdcb809e655acdf4be368afec8b58 |
|
08-Dec-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: never wait for bad-block acks on failed device. Once a device is failed we really want to completely ignore it. It should go away soon anyway. In particular the presence of bad blocks on it should not cause us to block as we won't be trying to write there anyway. So as soon as we can check if a device is Faulty, do so and pretend that it is already gone if it is Faulty. Signed-off-by: NeilBrown <neilb@suse.de>
|
257a4b42af7586fab4eaec7f04e6896b86551843 |
|
08-Nov-2011 |
Dan Williams <dan.j.williams@intel.com> |
md/raid5: STRIPE_ACTIVE has lock semantics, add barriers All updates that occur under STRIPE_ACTIVE should be globally visible when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but clear_bit() does not. This is suitable for 3.1-stable. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
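A sketch of the barrier-correct bit operations in handle_stripe(), assuming the fix switches to the _lock/_unlock bitop variants, which carry acquire/release semantics:

    if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state)) {
            /* already being handled; make sure it is handled again
             * when the current pass finishes */
            set_bit(STRIPE_HANDLE, &sh->state);
            return;
    }

    /* all stripe updates happen here, inside the acquire/release pair */

    /* release semantics: updates above are visible before the bit clears */
    clear_bit_unlock(STRIPE_ACTIVE, &sh->state);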
|
9a3f530f39f4490eaa18b02719fb74ce5f4d2d86 |
|
08-Nov-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: abort any pending parity operations when array fails. When the number of failed devices exceeds the allowed number we must abort any active parity operations (checks or updates) as they are no longer meaningful, and can lead to a BUG_ON in handle_parity_checks6. This bug was introduced by commit 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 in 2.6.29. Reported-by: Manish Katiyar <mkatiyar@gmail.com> Tested-by: Manish Katiyar <mkatiyar@gmail.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
|
056075c76417b112b4924e7b6386fdc6dfc9ac03 |
|
03-Jul-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
md: Add module.h to all files using it implicitly A pending cleanup will mean that module.h won't be implicitly included everywhere anymore. Make sure the modular drivers in the md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
d890fa2b0586b6177b119643ff66932127d58afa |
|
26-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: Fix some bugs in recovery_disabled handling. In 3.0 we changed the way recovery_disabled was handled so that instead of testing against zero, we test an mddev-> value against a conf-> value. Two problems: 1/ one place in raid1 was missed and still sets to '1'. 2/ We didn't explicitly set the conf-> value at array creation time. It defaulted to '0' just like the mddev value does so they could appear equal and thus disable recovery. This did not affect normal 'md' as it calls bind_rdev_to_array which changes the mddev value. However the dmraid interface doesn't call this and so doesn't change ->recovery_disabled; so at array start all recovery is incorrectly disabled. So initialise the 'conf' value to one less than the mddev value, so the two will only be the same when explicitly set that way. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
355840e7a7e56bb2834fd3b0da64da5465f8aeaa |
|
26-Oct-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: fix bug that could result in reads from a failed device. This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9 which was in 2.6.36. There is a small window of time between when a device fails and when it is removed from the array. During this time we might still read from it, but we won't write to it - so it is possible that we could read stale data. We didn't need the test of 'Faulty' before because the test on In_sync is sufficient. Since we started allowing reads from the early part of non-In_sync devices we need a test on Faulty too. This is suitable for any kernel from 2.6.36 onwards, though the patch might need a bit of tweaking in 3.0 and earlier. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
|
84fc4b56db85cb9e05326424049973a2036c9940 |
|
11-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: rename "mdk_personality" to "md_personality" "mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>
|
d1688a6d5515f1900af76a963b4bb6d9a6554cfa |
|
11-Oct-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: typedef removal: raid5_conf_t -> struct r5conf Signed-off-by: NeilBrown <neilb@suse.de>
|
e373ab109172abc2d821bd3b5c1b400acddef5a5 |
|
11-Oct-2011 |
NeilBrown <neilb@suse.de> |
md/raid0: typedef removal: raid0_conf_t -> struct r0conf Signed-off-by: NeilBrown <neilb@suse.de>
|
fd01b88c75a718020ff77e7f560d33835e9b58de |
|
11-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: remove typedefs: mddev_t -> struct mddev Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
|
3cb03002000f133f9f97269edefd73611eafc873 |
|
11-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: removing typedefs: mdk_rdev_t -> struct md_rdev The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
|
bdc04e6b15f70a8f96d8cdfe21df159a6466b49a |
|
07-Oct-2011 |
NeilBrown <neilb@suse.de> |
md: remove some old DEBUGging code. This code is not really helpful and is hard to maintain, so just discard it. Signed-off-by: NeilBrown <neilb@suse.de>
|
db298e1946c074c83d97f1c959fbc0def2af2c86 |
|
07-Oct-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: convert two macros into inline functions. More type-safety. Easier to read. Signed-off-by: NeilBrown <neilb@suse.de>
|
e4f869d9de18bc8272df8d0ab764178aa24bdf33 |
|
07-Oct-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: remove pointless NULL test. In the 'abort' branch of run(), 'conf' cannot possibly be NULL, so remove the test. Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
01f96c0a9922cd9919baf9d16febdf7016177a12 |
|
21-Sep-2011 |
NeilBrown <neilb@suse.de> |
md: Avoid waking up a thread after it has been freed. Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappearing either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protection. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org
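A sketch of the reworked function (using the later 'struct md_thread' name; the tree at the time still spelled it mdk_thread_t):

    void md_unregister_thread(struct md_thread **threadp)
    {
            struct md_thread *thread = *threadp;

            if (!thread)
                    return;
            /* Locking ensures that mddev_unlock does not wake up a
             * non-existent thread */
            spin_lock(&pers_lock);
            *threadp = NULL;
            spin_unlock(&pers_lock);

            kthread_stop(thread->tsk);
            kfree(thread);
    }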
|
5a7bbad27a410350e64a2d7f5ec18fc73836c14f |
|
12-Sep-2011 |
Christoph Hellwig <hch@infradead.org> |
block: remove support for bio remapping from ->make_request There is very little benefit in letting a ->make_request instance update the bio's device and sector and loop around it in __generic_make_request when we can achieve the same by calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
43220aa0f22cd3ce5b30246d50ccd696d119edea |
|
30-Aug-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: fix a hang on device failure. Waiting for a 'blocked' rdev to become unblocked in the raid5d thread cannot work with internal metadata as it is the raid5d thread which will clear the blocked flag. This wasn't a problem in 3.0 and earlier as we only set the blocked flag when external metadata was used then. However we now set it always, so we need to be more careful. Signed-off-by: NeilBrown <neilb@suse.de>
|
b84db560ead5417b5594349512baf8837959df4f |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Clear bad blocks on successful write. On a successful write to a known bad block, flag the sh so that raid5d can remove the known bad block from the list. Signed-off-by: NeilBrown <neilb@suse.de>
|
73e92e51b7969ef5477dd28fe2ae4d77675896f4 |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5. Don't write to known bad block on doubtful devices. If a device has seen write errors, don't write to any known bad blocks on that device. Signed-off-by: NeilBrown <neilb@suse.de>
|
bc2607f393bd4fb844c1886a02af929ca0372056 |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: write errors should be recorded as bad blocks if possible. When a write error is detected, don't mark the device as failed immediately but rather record the fact for handle_stripe to deal with. Handle_stripe then attempts to record a bad block. Only if that fails does the device get marked as faulty. Signed-off-by: NeilBrown <neilb@suse.de>
|
7f0da59bdc2f65795a57009d78f7753d3aea1de3 |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: use bad-block log to improve handling of uncorrectable read errors. If we get an uncorrectable read error - record a bad block rather than failing the device. And if these errors (which may be due to known bad blocks) cause recovery to be impossible, record a bad block on the recovering devices, or abort the recovery. As we might abort a recovery without failing a device we need to teach RAID5 about recovery_disabled handling. Signed-off-by: NeilBrown <neilb@suse.de>
|
31c176ecdf3563140e6395249eda51a18130d9f6 |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: avoid reading from known bad blocks. There are two times that we might read in raid5: 1/ when a read request fits within a chunk on a single working device. In this case, if there is any bad block in the range of the read, we simply fail the cache-bypass read and perform the read through the stripe cache. 2/ when reading into the stripe cache. In this case we mark as failed any device which has a bad block in that strip (1 page wide). Note that we will both avoid reading and avoid writing. This is correct (as we will never read from the block, there is no point writing), but not optimal (as writing could 'fix' the error) - that will be addressed later. If we have not seen any write errors on the device yet, we treat a bad block like a recent read error. This will encourage an attempt to fix the read error which will either generate a write error, or will ensure good data is stored there. We don't yet forget the bad block in that case. That comes later. Now that we honour bad blocks when reading we can allow devices with bad blocks into the array. Signed-off-by: NeilBrown <neilb@suse.de>
|
de393cdea66cbd63c90725663f400c76faf1b255 |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md: make it easier to wait for bad blocks to be acknowledged. It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested in. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" we also clear "BlockedBadBlocks" in case it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty *or* it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>
|
34b343cff4354ab9864be83be88405fd53d928a0 |
|
28-Jul-2011 |
NeilBrown <neilb@suse.de> |
md: don't allow arrays to contain devices with bad blocks. As no personality understand bad block lists yet, we must reject any device that is known to contain bad blocks. As the personalities get taught, these tests can be removed. This only applies to raid1/raid5/raid10. For linear/raid0/multipath/faulty the whole concept of bad blocks doesn't mean anything so there is no point adding the checks. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
8cfa7b0f67b4d899efc7f39eb7e172fd79237811 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Avoid BUG caused by multiple failures. While preparing to write a stripe we keep the parity block or blocks locked (R5_LOCKED) - towards the end of schedule_reconstruction. If the array is discovered to have failed before this write completes we can leave those blocks LOCKED, and init_stripe will notice that a free stripe still has a locked block and will complain. So clear the R5_LOCKED flag in handle_failed_stripe, and demote the 'BUG' to a 'WARN_ON'. Signed-off-by: NeilBrown <neilb@suse.de>
|
ddd5115fe5594f5aae3c7f0008a5327bb1d19397 |
|
27-Jul-2011 |
Namhyung Kim <namhyung@gmail.com> |
md/raid5: move rdev->corrected_errors counting Read errors are considered to be corrected if the write-back and re-read cycle finishes without further problems. Thus moving the rdev->corrected_errors counting after the re-reading looks more reasonable IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
36fad858a7404a9656122a9e560a224ae2a00979 |
|
27-Jul-2011 |
Namhyung Kim <namhyung@gmail.com> |
md: introduce link/unlink_rdev() helpers There are places where sysfs links to rdev are handled in the same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
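A sketch of the two helpers (era-appropriate typedefs; close to, but not necessarily verbatim, the upstream md.h additions):

    static inline int sysfs_link_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
    {
            char nm[20];
            sprintf(nm, "rd%d", rdev->raid_disk);   /* e.g. "rd3" */
            return sysfs_create_link(&mddev->kobj, &rdev->kobj, nm);
    }

    static inline void sysfs_unlink_rdev(mddev_t *mddev, mdk_rdev_t *rdev)
    {
            char nm[20];
            sprintf(nm, "rd%d", rdev->raid_disk);
            sysfs_remove_link(&mddev->kobj, nm);
    }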
|
8bda470e8ebde35f9349e98ecbce4dfb508a60fa |
|
27-Jul-2011 |
Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> |
md/raid: use printk_ratelimited instead of printk_ratelimit As per printk_ratelimit comment, it should not be used. Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de> Signed-off-by: NeilBrown <neilb@suse.de>
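The conversion pattern, sketched (the message text is illustrative):

    /* before: printk_ratelimit() uses one global ratelimit state, which
     * its own kernel-doc says not to rely on */
    if (printk_ratelimit())
            printk(KERN_WARNING "md/raid:%s: read error corrected\n",
                   mdname(conf->mddev));

    /* after: printk_ratelimited() keeps per-callsite ratelimit state */
    printk_ratelimited(KERN_WARNING "md/raid:%s: read error corrected\n",
                       mdname(conf->mddev));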
|
acfe726bdd0000a9be1b308b29fad1e9ae62178c |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: finalise new merged handle_stripe. handle_stripe5() and handle_stripe6() are now virtually identical. So discard one and rename the other to 'analyse_stripe()'. It always returns 0, so change it to 'void' and remove the 'done' variable in handle_stripe(). Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
474af965fe0005b334cabdb2904a7d712c21489b |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: move some more common code into handle_stripe The RAID6 version of this code is usable for RAID5 providing: - we test "conf->max_degraded" rather than "2" as appropriate - we make sure s->failed_num[1] is meaningful (and not '-1') when s->failed > 1 The 'return 1' must become 'goto finish' in the new location. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
84789554e96c0263ad8aa9be91397ece1f88c768 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: move more common code into handle_stripe Apart from 'prexor' which can only be set for RAID5, and 'qd_idx' which can only be meaningful for RAID6, these two chunks of code are nearly the same. So combine them into one adding a test to call either handle_parity_checks5 or handle_parity_checks6 as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
c8ac1803ff0af5aa614587ac0c66d46b7a3bdfcc |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 RAID6 is only allowed to choose 'reconstruct-write' while RAID5 also allows 'read-modify-write'. Apart from this difference, handle_stripe_dirtying[56] are nearly identical. So resolve these differences and create just one function. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
93b3dbce6456a79c545b45e86ccc2244e923cc99 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: unite fetch_block5 and fetch_block6 Provided that ->failed_num[1] is not a valid device number (which is easily achieved) fetch_block6 provides all the functionality of fetch_block5. So remove the latter and rename the former to simply "fetch_block". Then handle_stripe_fill5 and handle_stripe_fill6 become the same and can similarly be united. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
5d35e09cae47bbae2739f432658860680de21866 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: rearrange a test in fetch_block6. Next patch will unite fetch_block5 and fetch_block6. First I want to make the differences a little more clear. For RAID6 if we are writing at all and there is a failed device, then we need to load or compute every block so we can do a reconstruct-write. This case isn't needed for RAID5 - we will do a read-modify-write in that case. So make that test a separate test in fetch_block6 rather than merged with two other tests. Make a similar change in fetch_block5 so the one bit that is not needed for RAID6 is clearly separate. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
c5a3100062cf277d3edd4e6f4a1f1e403524b464 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: move more code into common handle_stripe The difference between the RAID5 and RAID6 code here is easily resolved using conf->max_degraded. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
3687c061886dd0bfec07e131ad12f916ef0abc62 |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Move code for finishing a reconstruction into handle_stripe. Prior to commit ab69ae12ceef7 the code in handle_stripe5 and handle_stripe6 to "Finish reconstruct operations initiated by the expansion process" was identical. That commit added an identical stanza of code to each function, but in different places. That was careless. The raid5 code was correct, so move that out into handle_stripe and remove raid6 version. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
86c374ba9f6726a79a032ede741dc66d219b166e |
|
27-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Remove stripe_head_state arg from handle_stripe_expansion. This arg is only used to differentiate between RAID5 and RAID6 but that is not needed. For RAID5, raid5_compute_sector will set qd_idx to "~0" so j will certainly not equal qd_idx, so there is no need for a guard on that condition. So remove the guard and remove the arg from the declaration and callers of handle_stripe_expansion. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
cc94015a9eac5d511fe9b716624d8fdf9c6e64b2 |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: move stripe_head_state and more code into handle_stripe. By defining the 'stripe_head_state' in 'handle_stripe', we can move some common code out of handle_stripe[56]() and into handle_stripe. This means that all accesses for stripe_head_state in handle_stripe[56] need to be 's->' instead of 's.', but the compiler should inline those functions and just use a direct stack reference, and future patches will hoist most of this code up into handle_stripe() so we will revert to "s.". Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
c5709ef6a094c72b56355590bfa55cc107e98376 |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: add some more fields to stripe_head_state Adding these three fields will allow more common code to be moved into handle_stripe(). Struct field rearrangement by Namhyung Kim. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
f2b3b44deee1524ca4f006048e0569f47eefdb74 |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: unify stripe_head_state and r6_state 'struct stripe_head_state' stores state about the 'current' stripe that is passed around while handling the stripe. For RAID6 there is an extension structure: r6_state, which is also passed around. There is no value in keeping these separate, so move the fields from the latter into the former. This means that all code now needs to treat s->failed_num as a small array, but this is a small cost. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
82e5a1718b9d0401b826341b9023766d04cb82f2 |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: move common code into handle_stripe There is common code at the start of handle_stripe5 and handle_stripe6. Move it into handle_stripe. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
c4c1663be46b2ab94e59d3e0c583a8f6b188ff0c |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: replace sh->lock with an 'active' flag. sh->lock is now mainly used to ensure that two threads aren't running in the locked part of handle_stripe[56] at the same time. That can more neatly be achieved with an 'active' flag which we set while running handle_stripe. If we find the flag is set, we simply requeue the stripe for later by setting STRIPE_HANDLE. For safety we take ->device_lock while examining the state of the stripe and creating a summary in 'stripe_head_state / r6_state'. This possibly isn't needed but as shared fields like ->toread, ->towrite are checked it is safer for now at least. We leave the label after the old 'unlock' called "unlock" because it will disappear in a few patches, so renaming seems pointless. This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE later, but that is not a problem. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
cbe47ec559c33a68b5ee002051b848d1531a8adb |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Protect some more code with ->device_lock. Other places that change or follow dev->towrite and dev->written take the device_lock as well as the sh->lock. So it should really be held in these places too. Also, doing so will allow sh->lock to be discarded. with merged fixes by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
83206d66b65118d995c38746f21edc2bb8564b49 |
|
26-Jul-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: Remove use of sh->lock in sync_request This is the start of a series of patches to remove sh->lock. sync_request takes sh->lock before setting STRIPE_SYNCING to ensure there is no race with testing it in handle_stripe[56]. Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early in handle_stripe[56] (after getting the same lock) and perform the same set/clear operations if it was set. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
|
ffd96e35c16a99fdb490cc5723b8e32135ae5883 |
|
18-Jul-2011 |
Namhyung Kim <namhyung@gmail.com> |
md/raid5: get rid of duplicated call to bio_data_dir() In raid5::make_request(), once bio_data_dir(@bi) is read it never is (and couldn't be) changed. Use the result always. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
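The pattern, sketched (the patch caches the value at the top of make_request(); illustrative):

    const int rw = bio_data_dir(bi);    /* computed once */

    if (rw == WRITE)
            md_write_start(mddev, bi);  /* instead of re-testing
                                         * bio_data_dir(bi) at each use */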
|
6ce328462c1145a217ba1f27b882743be1407759 |
|
18-Jul-2011 |
Namhyung Kim <namhyung@gmail.com> |
md/raid5: use kmem_cache_zalloc() Replace kmem_cache_alloc + memset(,0,) with kmem_cache_zalloc. I think it's not harmful since @conf->slab_cache already knows the actual size of struct stripe_head. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
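The replacement, sketched (sizes per grow_one_stripe() of that era; illustrative):

    /* before: allocate, then clear the whole object by hand */
    sh = kmem_cache_alloc(conf->slab_cache, GFP_KERNEL);
    if (sh)
            memset(sh, 0, sizeof(*sh) +
                   (conf->pool_size - 1) * sizeof(struct r5dev));

    /* after: allocation and zeroing in one call */
    sh = kmem_cache_zalloc(conf->slab_cache, GFP_KERNEL);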
|
fcde90759a985d8bfa4391346a821cc12fc16207 |
|
14-Jun-2011 |
Namhyung Kim <namhyung@gmail.com> |
md/raid5: remove unusual use of bio_iovec_idx() In the bio_for_each_segment loop, bvl always points to the current bio_vec, so it is the same as bio_iovec_idx(, i). Let's get rid of the latter. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
b062962edb086011e94ec4d9eb3f6a6d814f2a8f |
|
14-Jun-2011 |
Namhyung Kim <namhyung@gmail.com> |
md/raid5: fix FUA request handling in ops_run_io() Commit e9c7469bb4f5 ("md: implement REQ_FLUSH/FUA support") introduced the R5_WantFUA flag and set rw to WRITE_FUA in that case. However the remaining code still checks whether rw is exactly the same as WRITE or not, so a FUAed write ends up being treated as a READ. Fix it. This bug has been present since 2.6.37 and the fix is suitable for any -stable kernel since then. It is not clear why this has not caused more problems. Cc: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
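A sketch of the bug and fix in ops_run_io() (flag names per the raid5.h of that era; illustrative):

    if (test_and_clear_bit(R5_Wantwrite, &sh->dev[i].flags)) {
            if (test_and_clear_bit(R5_WantFUA, &sh->dev[i].flags))
                    rw = WRITE_FUA;         /* WRITE | REQ_FUA */
            else
                    rw = WRITE;
    } else
            rw = READ;

    /* bug: 'rw == WRITE' is false for WRITE_FUA, so the bio was set up
     * as a read; testing the direction bit handles both write kinds */
    if (rw & WRITE)
            bi->bi_end_io = raid5_end_write_request;
    else
            bi->bi_end_io = raid5_end_read_request;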
|
9b2dc8b665932a8e681a7ab3237f60475e75e161 |
|
13-Jun-2011 |
Namhyung Kim <namhyung@gmail.com> |
md/raid5: fix raid5_set_bi_hw_segments The @bio->bi_phys_segments consists of the active stripe count in the lower 16 bits and the processed stripe count in the upper 16 bits. So the logical-OR operator should be a bitwise one. This bug has been present since 2.6.27 and the fix is suitable for any -stable kernel since then. Fortunately the bad code is only used on error paths and is relatively unlikely to be hit. Cc: stable@kernel.org Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
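The one-character bug, sketched:

    /* bi_phys_segments packs two 16-bit counts:
     * active stripes (low 16 bits), processed stripes (high 16 bits) */
    static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
    {
            /* bug: logical '||' collapses the expression to 0 or 1 */
            /* bio->bi_phys_segments = raid5_bi_phys_segments(bio) || (cnt << 16); */

            /* fix: bitwise '|' preserves both 16-bit fields */
            bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
    }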
|
d6b212f4b19da5301e6b6eca562e5c7a2a6e8c8d |
|
09-Jun-2011 |
Jonathan Brassow <jbrassow@redhat.com> |
MD: raid5 do not set fullsync Add a check to determine whether a device needs a full resync or whether a partial resync will do. RAID5 was assuming that if a device was not In_sync, it must undergo a full resync. We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'. If it is, we can safely skip the full resync and rely on the bitmap for partial recovery instead. This is the legitimate purpose of 'saved_raid_disk', from md.h: int saved_raid_disk; /* role that device used to have in the array and could again if we did a partial resync from the bitmap */ Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
b098636cf04c89db4036fedc778da0acc666ad1a |
|
11-May-2011 |
NeilBrown <neilb@suse.de> |
md: allow resync_start to be set while an array is active. The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1, RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: writing 'frozen' to 'sync_action'; writing the new size to 'component_size' (this will set resync_start); writing 'none' to 'resync_start'; writing 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value can be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>
|
6f8d0c77cef5849433dd7beb0bd97e573cc4a6a3 |
|
11-May-2011 |
NeilBrown <neilb@suse.de> |
md: make error_handler functions more uniform and correct. - there is no need to test_bit Faulty, as that was already done in md_error which is the only caller of these functions. - MD_CHANGE_DEVS should be set *after* faulty is set to ensure metadata is updated correctly. - spinlock should be held while updating ->degraded. Signed-off-by: NeilBrown <neilb@suse.de>
|
aeb878b0967550bb56606ae21172bdcbd6afe052 |
|
10-Apr-2011 |
Jesper Juhl <jj@chaosbits.net> |
md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). There's a small typo in a comment in drivers/md/raid5.c - 'Of course' is misspelled as 'Ofcourse'. This patch fixes the spelling error. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
d76c8420c3cf8e468901b0bd58306637335c98ea |
|
21-Apr-2011 |
Randy Dunlap <randy.dunlap@oracle.com> |
raid5: fix build error, sector_t usage Change <sectors> from unsigned long long to sector_t. This matches its source field. ERROR: "__udivdi3" [drivers/md/raid456.ko] undefined! Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
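Context for the fix: on 32-bit builds, a plain '/' on a 64-bit integer emits a call to libgcc's __udivdi3, which the kernel does not provide; 64-bit sector arithmetic must go through helpers. An illustrative (not verbatim) fragment:

    sector_t sectors = mddev->dev_sectors;

    /* sector_div() divides 'sectors' in place and returns the remainder,
     * avoiding the __udivdi3 libcall on 32-bit builds */
    sector_div(sectors, mddev->chunk_sectors);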
|
3b71bd9337b404baab5c894e066be6b6bf51b1c3 |
|
20-Apr-2011 |
NeilBrown <neilb@suse.de> |
md: Fix dev_sectors on takeover from raid0 to raid4/5 A raid0 array doesn't set 'dev_sectors' as each device might contribute a different number of sectors. So when converting to a RAID4 or RAID5 we need to set dev_sectors as they need the number. We have already verified that in fact all devices do contribute the same number of sectors, so use that number. Signed-off-by: NeilBrown <neilb@suse.de>
|
2b7da309ffe602d222558cee4d7e407b96e34b3a |
|
20-Apr-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: remove setting of ->queue_lock We previously needed to set ->queue_lock to match the raid5 device_lock so we could safely use queue_flag_* operations (e.g. for plugging), which test that the ->queue_lock is in fact locked. However that need has completely gone away and is unlikely to come back, so remove this now-pointless setting. Signed-off-by: NeilBrown <neilb@suse.de>
|
7c13edc87510f665da3094174e1fd633e06649f4 |
|
18-Apr-2011 |
NeilBrown <neilb@suse.de> |
md: incorporate new plugging into raid5. In raid5 plugging is used for 2 things: 1/ collecting writes that require a bitmap update 2/ collecting writes in the hope that we can create full stripes - or at least more-full. We now release these different sets of stripes when plug_cnt is zero. Also in make_request, we call mddev_check_plug to hopefully increase plug_cnt, and wake up the thread at the end if plugging wasn't achieved for some reason. Signed-off-by: NeilBrown <neilb@suse.de>
|
482c083492ddaa32ef5864bae3d143dc8bcdf7d1 |
|
18-Apr-2011 |
NeilBrown <neilb@suse.de> |
md - remove old plugging code. md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches will restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>
|
e1dfa0a29737142c32f00a3bac0f609dc85b4a82 |
|
18-Apr-2011 |
NeilBrown <neilb@suse.de> |
md: use new plugging interface for RAID IO. md/raid submits a lot of IO from the various raid threads. So add start/finish plug calls to those threads so that some plugging happens. Signed-off-by: NeilBrown <neilb@suse.de>
|
7eaceaccab5f40bbfda044629a6298616aeaed50 |
|
10-Mar-2011 |
Jens Axboe <jaxboe@fusionio.com> |
block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So let's kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
da9cf5050a2e3dbc3cf26a8d908482eb4485ed49 |
|
21-Feb-2011 |
NeilBrown <neilb@suse.de> |
md: avoid spinlock problem in blk_throtl_exit blk_throtl_exit assumes that ->queue_lock still exists, so make sure that it does. To do this, we stop redirecting ->queue_lock to conf->device_lock and leave it pointing where it is initialised - __queue_lock. As the blk_plug functions check the ->queue_lock is held, we now take that spin_lock explicitly around the plug functions. We don't need the locking, just the warning removal. This is needed for any kernel with the blk_throtl code, which is 2.6.37 and later. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
|
50da08409654e036c4c964a473567a61a654cb83 |
|
31-Jan-2011 |
NeilBrown <neilb@suse.de> |
md: don't abort checking spares as soon as one cannot be added. As spares can be added manually before a reshape starts, we need to find them all to mark some of them as in_sync. Previously we would abort looking for spares when we found an unallocated spare that could not be added to the array (implying there was no room for new spares). However already-added spares could be later in the list, so we need to keep searching. Signed-off-by: NeilBrown <neilb@suse.de>
|
469518a3455c79619e9231aeffeffa2e2989f738 |
|
31-Jan-2011 |
NeilBrown <neilb@suse.de> |
md: fix the test for finding spares in raid5_start_reshape. As spares can be added to the array before the reshape is started, we need to find and count them when checking there are enough. The array could have been degraded, so we need to check all devices, not just those outside the range of devices in the array before the reshape. So instead of checking the index, check the In_sync flag, as that reliably tells whether the device is a spare, for this purpose. Signed-off-by: NeilBrown <neilb@suse.de>
|
87a8dec91e15954f0cf86be6c21741d991d83621 |
|
31-Jan-2011 |
NeilBrown <neilb@suse.de> |
md: simplify some 'if' conditionals in raid5_start_reshape. There are two consecutive 'if' statements. if (mddev->delta_disks >= 0) .... if (mddev->delta_disks > 0) The code in the second is equally valid if delta_disks == 0, and these two statements are the only place that 'added_devices' is used. So make them a single if statement, make added_devices a local variable, and re-indent it all. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>
|
1a940fcee31ec6c18c2f24dbdad31d54e4c35048 |
|
13-Jan-2011 |
NeilBrown <neilb@suse.de> |
md/raid5: handle manually-added spares in start_reshape. It is possible to manually add spares to specific slots before starting a reshape. raid5_start_reshape should recognise this possibility and include it in the accounting. Signed-off-by: NeilBrown <neilb@suse.de>
|
75d3da43cb74d2e5fb87816dbfecb839cd97c7f4 |
|
13-Jan-2011 |
NeilBrown <neilb@suse.de> |
md: Don't let implementation detail of curr_resync leak out through sysfs. mddev->curr_resync has artificial values of '1' and '2' which are used by the code which ensures only one resync is happening at a time on any given device. These values are internal and should never be exposed to user-space (except when translated appropriately as in the 'pending' status in /proc/mdstat). Unfortunately they can be, as ->curr_resync is assigned to ->curr_resync_completed and that value is directly visible through sysfs. So change the assignments to ->curr_resync_completed to get the same value from elsewhere in a form that doesn't have the magic '1' or '2' values. Signed-off-by: NeilBrown <neilb@suse.de>
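The shape of the fix in md_do_sync(), sketched (assuming 'j' holds the actual resync position in that function):

    /* before: can copy the magic 1/2 placeholders into a sysfs-visible field */
    mddev->curr_resync_completed = mddev->curr_resync;

    /* after: use the real position, which never holds the magic values */
    mddev->curr_resync_completed = j;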
|
43c73ca43b3e03bb228ff9350b6b44d0e560f262 |
|
13-Jan-2011 |
Jonathan Brassow <jbrassow@redhat.com> |
md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer With the module parameter 'start_dirty_degraded' set, raid5_spare_active() previously called sysfs_notify_dirent() with a NULL argument (rdev->sysfs_state) when a rebuild finished. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
067032bc628598606056412594042564fcf09e22 |
|
13-Jan-2011 |
Joe Perches <joe@perches.com> |
md: Fix single printks with multiple KERN_<level>s Noticed-by: Russell King <linux@arm.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
a167f663243662aa9153c01086580a11cde9ffdc |
|
26-Oct-2010 |
NeilBrown <neilb@suse.de> |
md: use separate bio pool for each md device. bio_clone and bio_alloc allocate from a common bio pool. If an md device is stacked with other devices that use this pool, or under something like swap which uses the pool, then the multiple calls on the pool can cause deadlocks. So allocate a local bio pool for each md array and use that rather than the common pool. This pool is used both for regular IO and metadata updates. Signed-off-by: NeilBrown <neilb@suse.de>
|
57dab0bdf689d42972975ec646d862b0900a4bf3 |
|
19-Oct-2010 |
NeilBrown <neilb@suse.de> |
md: use sector_t in bitmap_get_counter bitmap_get_counter returns the number of sectors covered by the counter in a pass-by-reference variable. In some cases this can be very large, so make it a sector_t for safety. Signed-off-by: NeilBrown <neilb@suse.de>
|
e9c7469bb4f502dafc092166201bea1ad5fc0fbf |
|
03-Sep-2010 |
Tejun Heo <tj@kernel.org> |
md: implement REQ_FLUSH/FUA support This patch converts md to support REQ_FLUSH/FUA instead of the now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe reconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
6b9656205469269c050963c71fca1998b247a560 |
|
18-Aug-2010 |
NeilBrown <neilb@suse.de> |
md: provide appropriate return value for spare_active functions. md_check_recovery expects ->spare_active to return 'true' if any spares were activated, but none of them do, so the consequent change in 'degraded' is not notified through sysfs. So count the number of spares activated, subtract it from 'degraded' just once, and return it. Reported-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
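A sketch of the counted variant for raid5 (era typedefs; the ->degraded update is made under the device_lock, in line with the error_handler cleanup noted above; illustrative, not the verbatim diff):

    static int raid5_spare_active(mddev_t *mddev)
    {
            raid5_conf_t *conf = mddev->private;
            unsigned long flags;
            int i, count = 0;

            for (i = 0; i < conf->raid_disks; i++) {
                    mdk_rdev_t *rdev = conf->disks[i].rdev;
                    if (rdev && !test_bit(Faulty, &rdev->flags)
                        && !test_and_set_bit(In_sync, &rdev->flags)) {
                            count++;
                            sysfs_notify_dirent(rdev->sysfs_state);
                    }
            }
            spin_lock_irqsave(&conf->device_lock, flags);
            mddev->degraded -= count;       /* subtract just once */
            spin_unlock_irqrestore(&conf->device_lock, flags);
            return count;
    }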
|
e6ffbcb6cd0ac471223df24ae77eb486c1ee68cc |
|
18-Aug-2010 |
Adrian Drzewiecki <adriand@vmware.com> |
md: Notify sysfs when RAID1/5/10 disk is In_sync. When RAID1 is done syncing disks, it'll update the state of synced rdevs to In_sync. But it neglected to notify sysfs that the attribute changed. So any programs that are waiting for an rdev's state to change will not be woken. (raid5/raid10 added by neilb) Signed-off-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
7b6d91daee5cac6402186ff224c3af39d79f4a0e |
|
07-Aug-2010 |
Christoph Hellwig <hch@lst.de> |
block: unify flags for struct bio and struct request Remove the current bio flags and reuse the request flags for the bio, too. This allows the type of I/O to be traced more easily from the filesystem down to the block driver. There were two flags in the bio that were missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've renamed two request flags that had a superfluous RW in them. Note that the flags are in bio.h despite having the REQ_ name - as blkdev.h includes bio.h that is the only way to go for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
9f7c2220017771253d7d10b3cc017cb79eeac0fb |
|
25-Jul-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: export raid5 unplugging interface. Also remove remaining accesses to ->queue and ->gendisk when ->queue is NULL (As it is in a DM target). Signed-off-by: NeilBrown <neilb@suse.de>
|
252ac5221a71be72b7e7c7b7482af91e9c962e8c |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/plug: optionally use plugger to unplug an array during resync/recovery. If an array doesn't have a 'queue' then md_do_sync cannot unplug it. In that case it will have a 'plugger', so make that available to the mddev, and use it to unplug the array if needed. Signed-off-by: NeilBrown <neilb@suse.de>
|
2ac8740151b082f045e58010eb92560c3a23a0e9 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: add simple plugging infrastructure. md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NeilBrown <neilb@suse.de>
|
11d8a6e3719519fbc0e2c9d61b6fa931b84bf813 |
|
26-Jul-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: export is_congested test The dm module will need this for dm-raid45. Also only access ->queue->backing_dev_info->congested_fn if ->queue actually exists. It won't in a dm target. Signed-off-by: NeilBrown <neilb@suse.de>
|
4a5add49951e698073011855d1a8a7306bc9308d |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
raid5: Don't set read-ahead when there is no queue dm-raid456 does not provide a 'queue' for raid5 to use, so we must make raid5 stop depending on the queue. First: read_ahead. dm handles read-ahead adjustment fully in userspace, so simply don't do any readahead adjustments if there is no queue. Also re-arrange code slightly so all the accesses to ->queue are together. Finally, move the blk_queue_merge_bvec function into the 'if' as the ->split_io setting in dm-raid456 has the same effect. Signed-off-by: NeilBrown <neilb@suse.de>
|
f4be6b43f1ac60dff00ef0923ee43b0e08872947 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk We will shortly allow md devices with no gendisk (they are attached to a dm-target instead). That will cause mdname() to return 'mdX'. There is one place where mdname really needs to be unique: when creating the name for a slab cache. So in that case, if there is no gendisk, use the address of the mddev formatted in hex to provide a unique name. Signed-off-by: NeilBrown <neilb@suse.de>
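A sketch of the naming rule (format strings are illustrative; the point is that the pointer value of the mddev is unique even when mdname() is not):

    if (mddev->gendisk)
            snprintf(conf->cache_name[0], sizeof(conf->cache_name[0]),
                     "raid5-%s", mdname(mddev));
    else
            snprintf(conf->cache_name[0], sizeof(conf->cache_name[0]),
                     "raid5-%p", mddev);    /* hex address as fallback name */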
|
c41d4ac40df0d01bf9c383ff28f194d1df2d4fd9 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: factor out code for changing size of stripe cache. Separate the actual 'change' code from the sysfs interface so that it can eventually be called internally. Signed-off-by: NeilBrown <neilb@suse.de>
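In outline, the split looks like this (raid5_set_cache_size stands for the factored-out helper; the parsing call shown is the modern name, the era used strict_strtoul):

    static int raid5_set_cache_size(struct mddev_stub *mddev, int size);

    /* The sysfs store method is reduced to parse-and-delegate, so the
     * helper can also be called internally later. */
    static ssize_t
    stripe_cache_size_store(struct mddev_stub *mddev, const char *page, size_t len)
    {
            unsigned long new;
            int err;

            if (kstrtoul(page, 10, &new))
                    return -EINVAL;
            err = raid5_set_cache_size(mddev, new);
            return err ? err : len;
    }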
|
00bcb4ac7ee7e557a491b614219142cea0ef16f4 |
|
01-Jun-2010 |
NeilBrown <neilb@suse.de> |
md: reduce dependence on sysfs. We will want md devices to live as dm targets where sysfs is not visible. So allow md to not connect to sysfs. Signed-off-by: NeilBrown <neilb@suse.de>
|
3424bf6a772cff606fc4bc24a3639c937afb547f |
|
17-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: don't include 'spare' drives when reshaping to fewer devices. There are few situations where it would make any sense to add a spare when reducing the number of devices in an array, but it is conceivable: A 6 drive RAID6 with two missing devices could be reshaped to a 5 drive RAID6, and a spare could become available just in time for the reshape, but not early enough to have been recovered first. 'freezing' recovery can make this easy to do without any races. However doing such a thing is a bad idea. md will not record the partially-recovered state of the 'spare' and when the reshape finishes it will think that the spare is still a spare. The easiest way to avoid this confusion is to simply disallow it. Signed-off-by: NeilBrown <neilb@suse.de>
|
2f115882499f3e5eca33d1df07b8876cc752a1ff |
|
17-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: add a missing 'continue' in a loop. As the comment says, the tail of this loop only applies to devices that are not fully in sync, so if In_sync was set, we should avoid the rest of the loop. This bug will hardly ever cause an actual problem. The worst it can do is allow an array to be assembled that is dirty and degraded, which is not generally a good idea (without warning the sysadmin first). This will only happen if the array is RAID4 or a RAID5/6 in an intermediate state during a reshape and so has one drive that is all 'parity' - no data - while some other device has failed. This is certainly possible, but not at all common. Signed-off-by: NeilBrown <neilb@suse.de>
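The shape of the fix, in outline (the surrounding loop body is elided):

    list_for_each_entry(rdev, &mddev->disks, same_set) {
            if (test_bit(In_sync, &rdev->flags))
                    continue;       /* the missing statement: fully in-sync
                                     * devices skip the rest of the loop */
            /* tail of loop: handling for not-fully-in-sync devices */
    }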
|
415e72d034c50520ddb7ff79e7d1792c1306f0c9 |
|
17-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: Allow recovered part of partially recovered devices to be in-sync During a recovery or reshape the early part of some devices might be in-sync while the later parts are not. When we know we are looking at an early part it is good to treat that part as in-sync for stripe calculations. This is particularly important for a reshape which suffers device failure. Treating the data as in-sync can mean the difference between data-safety and data-loss. Signed-off-by: NeilBrown <neilb@suse.de>
|
674806d62fb02a22eea948c9f1b5e58e0947b728 |
|
16-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: More careful check for "has array failed". When we are reshaping an array, the device failure combinations that cause us to decide that the array has failed are more subtle. In particular, any 'spare' will be fully in-sync in the section of the array that has already been reshaped, thus failures that affect only that section are less critical. So encode this subtlety in a new function and call it as appropriate. The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6 conversion where the last two devices failed. This resulted in: good good good good incomplete good good failed failed while converting a 5-drive RAID6 to 8 drive RAID5. The incomplete device caused the whole array to look bad, but as it was actually good for the section that had been converted to 8 drives, all the data was actually safe. Reported-by: Terry Morris <tbmorris@tbmorris.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
70fffd0bfab1558a8c64c5e903dea1fb84cd9f6b |
|
16-Jun-2010 |
NeilBrown <neilb@suse.de> |
md: Don't update ->recovery_offset when reshaping an array to fewer devices. When an array is reshaped to have fewer devices, the reshape proceeds from the end of the devices to the beginning. If a device happens to be non-In_sync (which is possible but rare) we would normally update the ->recovery_offset as the reshape progresses. However that would be wrong as the recovery_offset records that the early part of the device is in_sync, while in fact it would only be the later part that is in_sync, and in any case the offset number would be measured from the wrong end of the device. Relatedly, if after a reshape a spare is discovered to not be recovered all the way to the end, do not allow spare_active to incorporate it in the array. This becomes relevant in the following sample scenario: A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined operation. The RAID5->RAID6 conversion will cause a 5th drive to be included as a spare, then the 5-drive -> 6-drive reshape will effectively rebuild that spare as it progresses. The 6th drive is treated as in_sync the whole time as there is never any case where we might consider reading from it but must not because there is no valid data. If we interrupt this reshape part-way through and reverse it to return to a 5-drive RAID6 (or even a 4-drive RAID5), we don't want to update the recovery_offset - as that would be wrong - and we don't want to include that spare as active in the 5-drive RAID6 when the reversed reshape completes, as it will still be mostly out-of-sync. Signed-off-by: NeilBrown <neilb@suse.de>
|
e4e11e385d1e5516ac76c956d6c25e6c2fa1b8d0 |
|
16-Jun-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: avoid oops when number of devices is reduced then increased. The entries in the stripe_cache maintained by raid5 are enlarged when we increase the number of devices in the array, but not shrunk when we reduce the number of devices. So if entries are added after reducing the number of devices, we must ensure to initialise the whole entry, not just the part that is currently relevant. Otherwise if we enlarge the array again, we will reference uninitialised values. As grow_buffers/shrink_buffers now want to use a count that is stored explicitly in the raid_conf, they should get it from there rather than being passed it as a parameter. Signed-off-by: NeilBrown <neilb@suse.de>
|
55af6bb509d3ef2696faddd4a734bf024794b337 |
|
26-May-2010 |
Akinobu Mita <akinobu.mita@gmail.com> |
md: convert cpu notifier to return encapsulated errno value By the previous modification, the cpu notifier can return an encapsulated errno value. This converts the cpu notifiers for raid5. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
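The conversion pattern, sketched with a hypothetical allocation helper (notifier_from_errno() is the real kernel helper this series relies on; it wraps a -Exxx value so the notifier core can recover it):

    case CPU_UP_PREPARE:
    case CPU_UP_PREPARE_FROZEN:
            err = alloc_percpu_resources(conf, cpu);   /* hypothetical helper */
            if (err)
                    return notifier_from_errno(err);   /* was: return NOTIFY_BAD; */
            break;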
|
7b0bb5368a7195606eca475d9f4e291ab7227052 |
|
28-Apr-2010 |
Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> |
md/raid6: Fix raid-6 read-error correction in degraded state Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
|
0c55e02259115c151e4835dd417cf41467bb02e2 |
|
03-May-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: improve consistency of error messages. Many 'printk' messages from the raid456 module mention 'raid5' even though it may be a 'raid6' or even 'raid4' array. This can cause confusion. Also the actual array name is not always reported, and when it is, it is not reported consistently. So change all the messages to start: md/raid:%s: where '%s' becomes e.g. md3 to identify the particular array. Signed-off-by: NeilBrown <neilb@suse.de>
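The convention in practice; mdname() is the real helper that yields e.g. "md3", while the message text and variables here are illustrative:

    printk(KERN_ERR "md/raid:%s: not enough operational devices "
           "(%d/%d failed)\n", mdname(mddev), degraded, raid_disks);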
|
f1b29bcae116409db5e543622aadab43041c9ae9 |
|
02-May-2010 |
Dan Williams <dan.j.williams@intel.com> |
md/raid4: permit raid0 takeover For consistency allow raid4 to takeover raid0 in addition to raid5 (with a raid4 layout). Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
21a52c6d05c15f862797736393915bfa8cd40ee9 |
|
01-Apr-2010 |
NeilBrown <neilb@suse.de> |
md: pass mddev to make_request functions rather than request_queue We used to pass the personality make_request function direct to the block layer so the first argument had to be a queue. But now we have the intermediary md_make_request so it makes a lot more sense to pass a struct mddev_s. It makes it possible to have an mddev without its own queue too. Signed-off-by: NeilBrown <neilb@suse.de>
|
b821eaa572fd737faaf6928ba046e571526c36c6 |
|
29-Mar-2010 |
NeilBrown <neilb@suse.de> |
md: remove ->changed and related code. We set ->changed to 1 and call check_disk_change at the end of md_open so that bd_invalidated would be set and thus partition rescan would happen appropriately. Now that we call revalidate_disk directly, which sets bd_invalidated, that indirection is no longer needed and can be removed. Signed-off-by: NeilBrown <neilb@suse.de>
|
490773268cf64f68da2470e07b52c7944da6312d |
|
25-Mar-2010 |
NeilBrown <neilb@suse.de> |
md: move io accounting out of personalities into md_make_request While I generally prefer letting personalities do as much as possible, given that we have a central md_make_request anyway we may as well use it to simplify code. Also this centralises knowledge of ->gendisk which will help later. Signed-off-by: NeilBrown <neilb@suse.de>
|
2b7f22284d71975e37a82db154386348eec0e52c |
|
25-Mar-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: small tidyup in raid5_align_endio Diving through ->queue to find mddev is unnecessarily complex - there is an easier path to finding mddev, so use that. Signed-off-by: NeilBrown <neilb@suse.de>
|
a78d38a1a16c8e009aa512caa71d483757fefc1c |
|
22-Mar-2010 |
NeilBrown <neilb@suse.de> |
md: add support for raid5 to raid4 conversion This is unlikely to be wanted, but we may as well provide it for completeness. Signed-off-by: NeilBrown <neilb@suse.de>
|
54071b3808ee3dc8624d9d6f1b06c4fd5308fa3b |
|
08-Mar-2010 |
Trela Maciej <Maciej.Trela@intel.com> |
md: Add support for RAID0->RAID5 takeover Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
7b92813c3c0b6990f14838e3985fb385d2655d0c |
|
08-Mar-2010 |
H Hartley Sweeten <hartleys@visionengravers.com> |
drivers/md: Remove unnecessary casts of void * void pointers do not need to be cast to other pointer types. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
a64c876fd357906a1f7193723866562ad290654c |
|
14-Apr-2010 |
NeilBrown <neilb@suse.de> |
md: manage redundancy group in sysfs when changing level. Some levels expect the 'redundancy group' to be present, others don't. So when we change level of an array we might need to add or remove this group. This requires fixing up the current practice of overloading ->private to indicate (when ->pers == NULL) that something needs to be removed. So create a new ->to_remove to fill that role. When changing levels, we may need to add or remove attributes. When changing RAID5 -> RAID6, we both add and remove the same thing. It is important to catch this and optimise it out as the removal is delayed until a lock is released, so trying to add immediately would cause problems. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
|
87aa63000c484bfb9909989316f615240dfee018 |
|
28-Apr-2010 |
Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> |
md/raid6: Fix raid-6 read-error correction in degraded state Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
|
6e3b96ed610e5a1838e62ddae9fa0c3463f235fa |
|
22-Apr-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: fix previous patch. The previous patch changed stripe and chunk_number to sector_t but mistakenly did not update all of the divisions to use sector_div(). This patch changes all those divisions (actually the '%' operator) to sector_div. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
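sector_div() is the real kernel helper here: it divides a sector_t in place and evaluates to the remainder, which works on 32-bit builds where a plain 64-bit '%' would not. Variable names below are illustrative:

    sector_t chunk_number = r_sector;           /* 64-bit dividend */
    u32 rem = sector_div(chunk_number, sectors_per_chunk);
    /* chunk_number now holds the quotient; rem is the old '%' result */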
|
35f2a591192d0a5d9f7fc696869c76f0b8e49c3d |
|
20-Apr-2010 |
NeilBrown <neilb@suse.de> |
md/raid5: allow for more than 2^31 chunks. With many large drives and small chunk sizes it is possible to create a RAID5 with more than 2^31 chunks. Make sure this works. Reported-by: Brett King <king.br@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
|
5a0e3ad6af8660be21ca98a971cd00f331318c05 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the following. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surroundings. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have a fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
8a78362c4eefc1deddbefe2c7f38aabbc2429d6b |
|
26-Feb-2010 |
Martin K. Petersen <martin.petersen@oracle.com> |
block: Consolidate phys_segment and hw_segment limits Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
a29d8b8e2d811a24bbe49215a0f0c536b72ebc18 |
|
02-Feb-2010 |
Tejun Heo <tj@kernel.org> |
percpu: add __percpu sparse annotations to what's left Add __percpu sparse annotations to places which didn't make it in one of the previous patches. All conversions are trivial. These annotations are to make sparse consider percpu variables to be in a different address space and warn if accessed without going through percpu accessors. This patch doesn't affect normal builds. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Neil Brown <neilb@suse.de>
|
ef286f6fa673cd7fb367e1b145069d8dbfcc6081 |
|
09-Feb-2010 |
NeilBrown <neilb@suse.de> |
md: fix some lockdep issues between md and sysfs. ====== This fix is related to http://bugzilla.kernel.org/show_bug.cgi?id=15142 but does not address that exact issue. ====== sysfs does not like attributes being removed while they are being accessed (i.e. read or written) and waits for the access to complete. As accessing some md attributes takes the same lock that is held while removing those attributes a deadlock can occur. This patch addresses 3 issues in md that could lead to this deadlock. Two relate to calling flush_scheduled_work while the lock is held. This is probably a bad idea in general and as we use schedule_work to delete various sysfs objects it is particularly bad. In one case flush_scheduled_work is called from md_alloc (called by md_probe) called from do_md_run which holds the lock. This call is only present to ensure that ->gendisk is set. However we can be sure that gendisk is always set (though possibly we couldn't when that code was originally written). This is because do_md_run is called in three different contexts: 1/ from md_ioctl. This requires that md_open has succeeded, and it fails if ->gendisk is not set. 2/ from writing a sysfs attribute. This can only happen if the mddev has been registered in sysfs which happens in md_alloc after ->gendisk has been set. 3/ from autorun_array which is only called by autorun_devices, which checks for ->gendisk to be set before calling autorun_array. So the call to md_probe in do_md_run can be removed, and the check on ->gendisk can also go. In the other case flush_scheduled_work is being called in do_md_stop, purportedly to wait for all md_delayed_delete calls (which delete the component rdevs) to complete. However there really isn't any need to wait for them - they have already been disconnected in all important ways. The third issue is that raid5->stop() removes some attribute names while the lock is held. There is already some infrastructure in place to delay attribute removal until after the lock is released (using schedule_work). So extend that infrastructure to remove the raid5_attrs_group. This does not address all lockdep issues related to the sysfs "s_active" lock. The rest can be addressed by splitting that lockdep context between symlinks and non-symlinks which hopefully will happen. Signed-off-by: NeilBrown <neilb@suse.de>
|
9eb07c259207d048e3ee8be2a77b2a4680b1edd4 |
|
08-Feb-2010 |
NeilBrown <neilb@suse.de> |
md: fix 'degraded' calculation when starting a reshape. This code was written long ago when it was not possible to reshape a degraded array. Now it is, so the current level of degraded-ness needs to be taken into account. Also newly added devices should only reduce degradedness if they are deemed to be in-sync. In particular, if you convert a RAID5 to a RAID6, and increase the number of devices at the same time, then the 5->6 conversion will make the array degraded so the current code will produce a wrong value for 'degraded' - "-1" to be precise. If the reshape runs to completion end_reshape will calculate a correct new value for 'degraded', but if a device fails during the reshape an incorrect decision might be made based on the incorrect value of "degraded". This patch is suitable for 2.6.32-stable and if they are still open, 2.6.31-stable and 2.6.30-stable as well. Cc: stable@kernel.org Reported-by: Michael Evans <mjevans1983@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
0efb9e6191e1d3d34c1db90b829b742bc36d532e |
|
13-Dec-2009 |
NeilBrown <neilb@suse.de> |
md: add MODULE_DESCRIPTION for all md related modules. Suggested by Oren Held <orenhe@il.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de>
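For example, the raid456 module gained a line of roughly this form (the exact wording as committed may differ; MODULE_DESCRIPTION itself is the real macro):

    MODULE_DESCRIPTION("RAID4/5/6 (striping with parity) personality for MD");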
|
729a18663a30a9c8076e3adc2b3e4c866974f935 |
|
13-Dec-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: don't complete make_request on barrier until writes are scheduled The post-barrier-flush is sent by md as soon as make_request on the barrier write completes. For raid5, the data might not be in the per-device queues yet. So for barrier requests, wait for any pre-reading to be done so that the request will be in the per-device queues. We use the 'preread_active' count to check that nothing is still in the preread phase, and delay the decrement of this count until after write requests have been submitted to the underlying devices. Signed-off-by: NeilBrown <neilb@suse.de>
|
a2826aa92e2e14db372eda01d333267258944033 |
|
13-Dec-2009 |
NeilBrown <neilb@suse.de> |
md: support barrier requests on all personalities. Previously barriers were only supported on RAID1. This is because other levels require synchronisation across all devices and so needed a different approach. Here is that approach. When a barrier arrives, we send a zero-length barrier to every active device. When that completes - and if the original request was not empty - we submit the barrier request itself (with the barrier flag cleared) and then submit a fresh load of zero length barriers. The barrier request itself is asynchronous, but any subsequent request will block until the barrier completes. The reason for clearing the barrier flag is that a barrier request is allowed to fail. If we pass a non-empty barrier through a striping raid level it is conceivable that part of it could succeed and part could fail. That would be way too hard to deal with. So if the first run of zero length barriers succeeds, we assume all is sufficiently well that we send the request and ignore errors in the second run of barriers. RAID5 needs extra care as write requests may not have been submitted to the underlying devices yet. So we flush the stripe cache before proceeding with the barrier. Note that the second set of zero-length barriers are submitted immediately after the original request is submitted. Thus when a personality finds mddev->barrier to be set during make_request, it should not return from make_request until the corresponding per-device request(s) have been queued. That will be done in later patches. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Andre Noll <maan@systemlinux.org>
|
8553fe7ec731e4f997245e14319572cb15118018 |
|
13-Dec-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: remove some sparse warnings. qd_idx is previously declared and given exactly the same value! Signed-off-by: NeilBrown <neilb@suse.de>
|
c148ffdcda00b6599b70f8b65e6a1fadd1dbb127 |
|
13-Nov-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: Allow dirty-degraded arrays to be assembled when only parity is degraded. Normally it is not safe to allow a raid5 that is both dirty and degraded to be assembled without explicit request from the admin, as it can cause hidden data corruption. This is because 'dirty' means that the parity cannot be trusted, and 'degraded' means that the parity needs to be used. However, if the device that is missing contains only parity, then there is no issue and assembly can continue. This particularly applies when a RAID5 is being converted to a RAID6 and there is an unclean shutdown while the conversion is happening. So check for whether the degraded space only contains parity, and in that case, allow the assembly. Signed-off-by: NeilBrown <neilb@suse.de>
|
7ef90146a14c2bb1de2e22399f147ebec5b74f0b |
|
13-Nov-2009 |
NeilBrown <neilb@suse.de> |
Don't unconditionally set in_sync on newly added device in raid5_reshape When a reshape finds that it can add spare devices into the array, those devices might already be 'in_sync' if they are beyond the old size of the array, or they might not if they are within the array. The first case happens when we change an N-drive RAID5 to an N+1-drive RAID5. The second happens when we convert an N-drive RAID5 to an N+1-drive RAID6. So set the flag more carefully. Also, ->recovery_offset is only meaningful when the flag is clear, so only set it in that case. This change needs the preceding two to ensure that the non-in_sync device doesn't get evicted from the array when it is stopped, in the case where v0.90 metadata is used. Signed-off-by: NeilBrown <neilb@suse.de>
|
8dee7211467a56b7eb4e4359efb0aa4a72e1b6f3 |
|
06-Nov-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: make sure curr_sync_completed is uptodate when reshape starts This value is visible through sysfs and is used by mdadm when it manages a reshape (backing up data that is about to be rearranged). So it is important that it is always correct. Currently it does not get updated properly when a reshape starts, which can cause problems when assembling an array that is in the middle of being reshaped. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
|
6629542e79255e0dbef8ec82eaf644e1b2546c3c |
|
20-Oct-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: kill a gcc-4.0.1 'uninitialized variable' warning Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
5dd33c9a4c29015f6d87568d33521c98931a387e |
|
16-Oct-2009 |
NeilBrown <neilb@suse.de> |
md/async: don't pass a memory pointer as a page pointer. md/raid6 passes a list of 'struct page *' to the async_tx routines, which then either DMA map them for offload, or take the page_address for CPU based calculations. For RAID6 we sometimes leave 'blanks' in the list of pages. For CPU based calcs, we want to treat these as a page of zeros. For offloaded calculations, we simply don't pass a page to the hardware. Currently the 'blanks' are encoded as a pointer to raid6_empty_zero_page. This is a 4096 byte memory region, not a 'struct page'. This is mostly handled correctly but is rather ugly. So change the code to pass and expect a NULL pointer for the blanks. When taking page_address of a page, we need to check for a NULL and in that case use raid6_empty_zero_page. Signed-off-by: NeilBrown <neilb@suse.de>
|
5e5e3e78ed9038b8f7112835d07084eefb9daa47 |
|
16-Oct-2009 |
NeilBrown <neilb@suse.de> |
md: Fix handling of raid5 array which is being reshaped to fewer devices. When a raid5 (or raid6) array is being reshaped to have fewer devices, conf->raid_disks is the latter and hence smaller number of devices. However sometimes we want to use a number which is the total number of currently required devices - the larger of the 'old' and 'new' sizes. Before we implemented reducing the number of devices, this was always 'new' i.e. ->raid_disks. Now we need max(raid_disks, previous_raid_disks) in those places. This particularly affects assembling an array that was shutdown while in the middle of a reshape to fewer devices. md.c needs a similar fix when interpreting the md metadata. Signed-off-by: NeilBrown <neilb@suse.de>
|
e4424fee1815f996dccd36be44d68ca160ec3e1b |
|
16-Oct-2009 |
NeilBrown <neilb@suse.de> |
md: fix problems with RAID6 calculations for DDF. Signed-off-by: NeilBrown <neilb@suse.de>
|
417b8d4ac868cf58d6c68f52d72f7648413e0edc |
|
16-Oct-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid456: downlevel multicore operations to raid_run_ops The percpu conversion allowed a straightforward handoff of stripe processing to the async subsystem that initially showed some modest gains (+4%). However, this model is too simplistic and leads to stripes bouncing between raid5d and the async thread pool for every invocation of handle_stripe(). As reported by Holger this can fall into a pathological situation severely impacting throughput (6x performance loss). By downleveling the parallelism to raid_run_ops the pathological stripe_head bouncing is eliminated. This version still exhibits an average 11% throughput loss for: mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6 echo 1024 > /sys/block/md0/md/stripe_cache_size dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 ...but the results are at least stable and can be used as a base for further multicore experimentation. Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
f5efd45ae597c96ed017afad5662b67d55b402a0 |
|
16-Oct-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid5: initialize conf->device_lock earlier Deallocating a raid5_conf_t structure requires taking 'device_lock'. Ensure it is initialized before it is used, i.e. initialize the lock before attempting any further initializations that might fail. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
1442577bf6da1a31ae6555212202be9a2cec5642 |
|
16-Oct-2009 |
NeilBrown <neilb@suse.de> |
Revert "md: do not progress the resync process if the stripe was blocked" This reverts commit df10cfbc4d7ab93260d997df754219d390d62a9d. This patch was based on a misunderstanding and risks introducing a busy-wait loop. So revert it. Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
3fa841d7e7266f6fcc1b3885b905f5153ba897d8 |
|
23-Sep-2009 |
NeilBrown <neilb@suse.de> |
md: report device as congested when suspended This should stop writeback from coming in when the device is temporarily suspended. Signed-off-by: NeilBrown <neilb@suse.de>
|
0da3c6194ec2f32617b272df4505a1cf022faea5 |
|
23-Sep-2009 |
NeilBrown <neilb@suse.de> |
md: Improve name of threads created by md_register_thread The management threads for raid4/5/6 arrays are all called mdX_raid5, independent of the actual raid level, which is wrong and can be confusing. So change md_register_thread to use the name from the personality unless an alternate name (like 'resync' or 'reshape') is given. This is simpler and more correct. Cc: Jinzc <zhenchengjin@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
a9f326ebf22a0de776815240fb76dabe139397ea |
|
23-Sep-2009 |
NeilBrown <neilb@suse.de> |
md: remove sparse warning "symbol xxx shadows an earlier one" Rename some variables and remove some duplicate definitions to avoid these warnings. None of them are actual errors. Signed-off-by: NeilBrown <neilb@suse.de>
|
6c910a78e495b4c1778a8b136b37fe3c05712730 |
|
16-Sep-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: cleanup ops_run_compute6_2 Neil says: "It is correct as it stands, but the fact that every branch in the 'if' part ends with a 'return' isn't immediately obvious, so it is clearer if we are explicit about the if / then / else structure." Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
2d6e4ecc87d20299bcb249dd62efbd73496744c3 |
|
16-Sep-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: eliminate BUG_ON with side effect As pointed out by Neil it should be possible to build a driver with all BUG_ON statements deleted. It's bad form to have a BUG_ON with a side effect. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
1f98a13f623e0ef666690a18c1250335fc6d7ef1 |
|
11-Sep-2009 |
Jens Axboe <jens.axboe@oracle.com> |
bio: first step in sanitizing the bio->bi_rw flag testing Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
0403e3827788d878163f9ef0541b748b0f88ca5d |
|
09-Sep-2009 |
Dan Williams <dan.j.williams@intel.com> |
dmaengine: add fence support Some engines optimize operation by reading ahead in the descriptor chain such that descriptor2 may start execution before descriptor1 completes. If descriptor2 depends on the result from descriptor1 then a fence is required (on descriptor2) to disable this optimization. The async_tx api could implicitly identify dependencies via the 'depend_tx' parameter, but that would constrain cases where the dependency chain only specifies a completion order rather than a data dependency. So, provide an ASYNC_TX_FENCE to explicitly identify data dependencies. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
07a3b417dc3d00802bd7b4874c3e811f0b015a7d |
|
30-Aug-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid456: distribute raid processing over multiple cores Now that the resources to handle stripe_head operations are allocated percpu it is possible for raid5d to distribute stripe handling over multiple cores. This conversion also adds a call to cond_resched() in the non-multicore case to prevent one core from getting monopolized for raid operations. Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
b774ef491b4edf6876077014ecbb87f10c69c10f |
|
30-Aug-2009 |
Yuri Tikhonov <yur@emcraft.com> |
md/raid6: remove synchronous infrastructure These routines have been replaced by their asynchronous counterparts. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 |
|
30-Aug-2009 |
Yuri Tikhonov <yur@emcraft.com> |
md/raid6: asynchronous handle_stripe6 1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to raid_run_ops 2/ Implement a handler for sh->reconstruct_state similar to the raid5 case (adds handling of Q parity) 3/ Prevent handle_parity_checks6 from running concurrently with 'compute' operations 4/ Hook up raid_run_ops Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
d82dfee0ad8f240fef1b28e2258891c07da57367 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: asynchronous handle_parity_check6 [ Based on an original patch by Yuri Tikhonov ] Implement the state machine for handling the RAID-6 parities check and repair functionality. Note that the raid6 case does not need to check for new failures, like raid5, as it will always writeback the correct disks. The raid5 case can be updated to check zero_sum_result to avoid getting confused by new failures rather than retrying the entire check operation. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
a9b39a741a7e3b262b9f51fefb68e17b32756999 |
|
30-Aug-2009 |
Yuri Tikhonov <yur@emcraft.com> |
md/raid6: asynchronous handle_stripe_dirtying6 In the synchronous implementation of stripe dirtying we processed a degraded stripe with one call to handle_stripe_dirtying6(). I.e. compute the missing blocks from the other drives, then copy in the new data and reconstruct the parities. In the asynchronous case we do not perform stripe operations directly. Instead, operations are scheduled with flags to be later serviced by raid_run_ops. So, for the degraded case the final reconstruction step can only be carried out after all blocks have been brought up to date by being read, or computed. Like the raid5 case schedule_reconstruction() sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and through operation chaining can handle compute and reconstruct in a single raid_run_ops pass. [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating] Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
5599becca4bee7badf605e41fd5bcde76d51f2a4 |
|
30-Aug-2009 |
Yuri Tikhonov <yur@emcraft.com> |
md/raid6: asynchronous handle_stripe_fill6 Modify handle_stripe_fill6 to work asynchronously by introducing fetch_block6 as the raid6 analog of fetch_block5 (schedule compute operations for missing/out-of-sync disks). [dan.j.williams@intel.com: compute D+Q in one pass] Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
c0f7bddbe60f43578dccf4ffb8d4bff88f625ea7 |
|
30-Aug-2009 |
Yuri Tikhonov <yur@emcraft.com> |
md/raid5,6: common schedule_reconstruction for raid5/6 Extend schedule_reconstruction5 for reuse by the raid6 path. Add support for generating Q and BUG() if a request is made to perform 'prexor'. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
ac6b53b6e6acab27e4f3e2383f9ac1f0d7c6200b |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: asynchronous raid6 operations [ Based on an original patch by Yuri Tikhonov ] The raid_run_ops routine uses the asynchronous offload api and the stripe_operations member of a stripe_head to carry out xor+pq+copy operations asynchronously, outside the lock. The operations performed by RAID-6 are the same as in the RAID-5 case except for no support of STRIPE_OP_PREXOR operations. All the others are supported: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct The flow is the same as in the RAID-5 case, and reuses some routines, namely: 1/ ops_complete_postxor (renamed to ops_complete_reconstruct) 2/ ops_complete_compute (updated to set up to 2 targets uptodate) 3/ ops_run_check (renamed to ops_run_check_p for xor parity checks) [neilb@suse.de: fixes to get it to pass mdadm regression suite] Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Ilya Yanok <yanok@emcraft.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
4e7d2c0aefb77f7b24942e5af042a083be4d60bb |
|
30-Aug-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid5: factor out mark_uptodate from ops_complete_compute5 ops_complete_compute5 can be reused in the raid6 path if it is updated to generically handle a second target. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
ad283ea4a3ce82cda2efe33163748a397b31b1eb |
|
30-Aug-2009 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: add sum check flags Replace the flat zero_sum_result with a collection of flags to contain the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead of DMA_ since these flags will be used on non-dma-zero-sum enabled platforms. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
d6f38f31f3ad4b0dd33fe970988f14e7c65ef702 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid5,6: add percpu scribble region for buffer lists Use percpu memory rather than stack for storing the buffer lists used in parity calculations. Include space for dma address conversions and pass that to async_tx via the async_submit_ctl.scribble pointer. [ Impact: move memory pressure from stack to heap ] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
36d1c6476be51101778882897b315bd928c8c7b5 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: move the spare page to a percpu allocation In preparation for asynchronous handling of raid6 operations move the spare page to a percpu allocation to allow multiple simultaneous synchronous raid6 recovery operations. Make this allocation cpu hotplug aware to maximize allocation efficiency. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
1a67dde0abba36421a1257d01ba9de2f6d1c160a |
|
13-Aug-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: Properly remove excess drives after shrinking a raid5/6 We were removing the drives from the array, but not removing symlinks from /sys/.... and not marking the device as having been removed. Signed-off-by: NeilBrown <neilb@suse.de>
|
a639755cf885e437b2fe4168d35157fa90d530ab |
|
13-Aug-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: make sure a reshape restarts at the correct address. This "if" didn't allow for the possibility that the number of devices doesn't change, and so sector_nr isn't set correctly in that case. So change '>' to '>='. Signed-off-by: NeilBrown <neilb@suse.de>
|
67ac6011db5d2b0c853d573ff474b25c85dfb644 |
|
13-Aug-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: allow new reshape modes to be restarted in the middle. md/raid5 doesn't allow a reshape to restart if it involves writing over the same part of disk that it would be reading from. This happens at the beginning of a reshape that increases the number of devices, at the end of a reshape that decreases the number of devices, and continuously for a reshape that does not change the number of devices. The current code is correct for the "increase number of devices" case as the critical section at the start is handled by userspace performing a backup. It does not work for reducing the number of devices, or the no-change case. For 'reducing', we need to invert the test. For no-change we cannot really be sure things will be safe, so simply require the array to be read-only, which is how the user-space code which carefully starts such arrays works. Signed-off-by: NeilBrown <neilb@suse.de>
|
449aad3e25358812c43afc60918c5ad3819488e7 |
|
03-Aug-2009 |
NeilBrown <neilb@suse.de> |
md: Use revalidate_disk to effect changes in size of device. As revalidate_disk calls check_disk_size_change, it will cause any capacity change of a gendisk to be propagated to the blockdev inode. So use that instead of mucking about with locks and i_size_write. Also add a call to revalidate_disk in do_md_run and a few other places where the gendisk capacity is changed. Signed-off-by: NeilBrown <neilb@suse.de>
|
64bd660b51b2da92e99a5e97349f6558349f11c5 |
|
03-Aug-2009 |
NeilBrown <neilb@suse.de> |
md: allow raid5_quiesce to work properly when reshape is happening. The ->quiesce method is not supposed to stop resync/recovery/reshape, just normal IO. But in raid5 we don't have a way to know which stripes are being used for normal IO and which for resync etc, so we need to wait for all stripes to be idle to be sure that all writes have completed. However reshape keeps at least some stripe busy for an extended period of time, so a call to raid5_quiesce can block for several seconds needlessly. So arrange for reshape etc to pause briefly while raid5_quiesce is trying to quiesce the array so that the active_stripes count can drop to zero. Signed-off-by: NeilBrown <neilb@suse.de>
|
e516402c0d4fc02be4af9fa8c18954d4f9deb44e |
|
03-Aug-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: set reshape_position correctly when reshape starts. As the internal reshape_progress counter is the main driver for reshape, the fact that reshape_position sometimes starts with the wrong value has minimal effect. It is visible in sysfs and that is all. Signed-off-by: NeilBrown <neilb@suse.de>
|
95fc17aac45300f45968aacd97a536ddd8db8101 |
|
30-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: release spare page at ->stop() Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
a11034b4282515fd7d9f6fdc0a1380781da461c3 |
|
14-Jul-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: release spare page at ->stop() Add missing call to safe_put_page from stop() by unifying open coded raid5_conf_t de-allocation under free_conf(). Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
e62e58a5ffdc98ac28d8dbd070c857620d541f99 |
|
01-Jul-2009 |
NeilBrown <neilb@suse.de> |
md: use interruptible wait when duration is controlled by userspace. User space can set various limits on an md array so that resync waits when it gets to a certain point, or so that I/O is blocked for a short while. When md is waiting against one of these limits, it should use an interruptible wait so as not to add to the load average, and so as not to trigger a warning if the wait goes on for too long. Signed-off-by: NeilBrown <neilb@suse.de>
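The change in pattern: a TASK_INTERRUPTIBLE wait neither counts toward the load average nor trips the hung-task warning. Sketched with an illustrative condition (mddev->recovery_wait is a real md waitqueue; the predicate is a stand-in):

    wait_event_interruptible(mddev->recovery_wait,
                             resync_allowed_to_advance(mddev));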
|
a5c308d4d1659b1f4833b863394e3e24cdbdfc6e |
|
01-Jul-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: suspend shouldn't affect read requests. md allows writes to regions of an array to be suspended temporarily. This allows user-space to participate in aspects of reshape. In particular, data can be copied with no risk of a race. We should not be blocking read requests though, so don't. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
|
8f6c2e4b325a8e9f8f47febb2fd0ed4fae7d45a9 |
|
01-Jul-2009 |
Martin K. Petersen <martin.petersen@oracle.com> |
md: Use new topology calls to indicate alignment and I/O sizes Switch MD over to the new disk_stack_limits() function which checks for alignment and adjusts preferred I/O sizes when stacking. Also indicate preferred I/O sizes where applicable. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
48606a9f2fc034f0b308d088c1f7ab6d407c462c |
|
18-Jun-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: correctly update sync_completed when we reach max_resync At the end of reshape_request we update curr_resync_completed if we are about to pause due to reaching resync_max. However we update it to the wrong value. We need to add the "reshape_sectors" that have just been reshaped. Signed-off-by: NeilBrown <neilb@suse.de>
|
7a3ab908948b6296ee7e81d42f7c176361c51975 |
|
17-Jun-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid5: add missing call to schedule() after prepare_to_wait() In the unlikely event that reshape progresses past the current request while it is waiting for a stripe we need to schedule() before retrying for 2 reasons: 1/ Prevent list corruption from duplicated list_add() calls without intervening list_del(). 2/ Give the reshape code a chance to make some progress to resolve the conflict. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
8c6ac868b107ed50a46204f6d14e2ad9443ff146 |
|
18-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md: Push down reconstruction log message to personality code. Currently, the md layer checks in analyze_sbs() if the raid level supports reconstruction (mddev->level >= 1) and if reconstruction is in progress (mddev->recovery_cp != MaxSector). Move that printk into the personality code of those raid levels that care (levels 1, 4, 5, 6, 10). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
50ac168a6e0a061bf5346d53aa9e7beb94c97527 |
|
18-Jun-2009 |
NeilBrown <neilb@suse.de> |
md: merge reconfig and check_reshape methods. The difference between these two methods is artificial. Both check that a pending reshape is valid, and perform any aspect of it that can be done immediately. 'reconfig' handles chunk size and layout. 'check_reshape' handles raid_disks. So make them just one method. Signed-off-by: NeilBrown <neilb@suse.de>
|
597a711b69cfff95c4b8f6069037e7ad3fc71f56 |
|
18-Jun-2009 |
NeilBrown <neilb@suse.de> |
md: remove unnecessary arguments from ->reconfig method. Passing the new layout and chunksize as args is not necessary as the mddev has fields for new_chunk and new_layout. This is preparation for combining the check_reshape and reconfig methods. Signed-off-by: NeilBrown <neilb@suse.de>
|
01ee22b496c41384eaa6dcae983c86d8bc32fbb8 |
|
18-Jun-2009 |
NeilBrown <neilb@suse.de> |
md: raid5: check stripe cache is large enough in start_reshape In reshape cases that do not change the number of devices, start_reshape is called without first calling check_reshape. Currently, the check that the stripe_cache is large enough is only done in check_reshape. It should be in start_reshape too. Signed-off-by: NeilBrown <neilb@suse.de>
|
cdc2ae6d6a30df8fd92c5e300d0e3005e13eb6b0 |
|
18-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md: fix some comments. 1/ Raid5 has learned to take over also raid4 and raid6 arrays. 2/ new_chunk in mdp_superblock_1 is in sectors, not bytes. Signed-off-by: NeilBrown <neilb@suse.de>
|
0ba459d26260d4d13346c76642f461b2bf607eef |
|
18-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md/raid5: Use is_power_of_2() in raid5_reconfig()/raid6_reconfig(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
09c9e5fa1b93ad5b81c9dcf8ce3a5b9ae2ac31e4 |
|
18-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md: convert conf->chunk_size and conf->prev_chunk to sectors. This kills some more shifts. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
664e7c413f1e90eceb0b2596dd73a0832faec058 |
|
18-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md: Convert mddev->new_chunk to sectors. A straight-forward conversion which gets rid of some multiplications/divisions/shifts. The patch also introduces a couple of new ones, most of which are due to conf->chunk_size still being represented in bytes. This will be cleaned up in subsequent patches. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
9d8f0363623b3da12c43007cf77f5e1a4e8a5964 |
|
18-Jun-2009 |
Andre Noll <maan@systemlinux.org> |
md: Make mddev->chunk_size sector-based. This patch renames the chunk_size field to chunk_sectors with the implied change of semantics. Since is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9) = is_power_of_2(chunk_sectors) these bits don't need an adjustment for the shift. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
740da44918680a0c72411ae4ccdd1861069afcc4 |
|
16-Jun-2009 |
raz ben yehuda <raziebe@gmail.com> |
md: raid5: chunk size check in setup_conf have raid5 check chunk size in run/reshape method instead of in md Signed-off-by: raziebe@gmail.com Signed-off-by: NeilBrown <neilb@suse.de>
|
070ec55d07157a3041f92654135c3c6e2eaaf901 |
|
16-Jun-2009 |
NeilBrown <neilb@suse.de> |
md: remove mddev_to_conf "helper" macro Having a macro just to cast a void* isn't really helpful. I would much rather see that we are simply de-referencing ->private, than have to know what the macro does. So open code the macro everywhere and remove the pointless cast. Signed-off-by: NeilBrown <neilb@suse.de>
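The before/after, as the message describes it:

    /* Before: */  raid5_conf_t *conf = mddev_to_conf(mddev);  /* macro hiding a cast */
    /* After:  */  raid5_conf_t *conf = mddev->private;        /* dereference is explicit */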
|
0e6e0271a210817e202c8a4bfffbde3e3c0616d1 |
|
09-Jun-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: fix bug in reshape code when chunk_size decreases. Now that we support changing the chunksize, we calculate "reshape_sectors" to be the max of the number of sectors in the old and new chunk sizes. However there is one place where we still use 'chunksize' rather than 'reshape_sectors'. This causes a reshape that reduces the size of chunks to freeze. Signed-off-by: NeilBrown <neilb@suse.de>
|
a8c906ca3f63d05f0d25490cf82276f73c6fe095 |
|
09-Jun-2009 |
NeilBrown <neilb@suse.de> |
md/raid5 - avoid deadlocks in get_active_stripe during reshape md has functionality to 'quiesce' an array so that all pending IO completes and no new IO starts. This is used to achieve a stable state before making internal changes. Currently this quiescing applies equally to normal IO, resync IO, and reshape IO. However there is a problem with applying it to reshape IO. Reshape can have multiple 'stripe_heads' that must be active together. If the quiesce comes between allocating the first and the last of such a collection, then we deadlock, as the last will not be allocated until the quiesce is lifted, the quiesce will not be lifted until the first (which has been allocated) gets used, and that first cannot be used until the last is allocated. It is not necessary to inhibit reshape IO when a quiesce is requested. Those places in the code that require a full quiesce will ensure the reshape thread is not running at all. So allow reshape requests to get access to new stripe_heads without being blocked by a 'quiesce'. This only affects in-place reshapes (i.e. where the array does not grow or shrink) and these are only newly supported. So this patch is not needed in earlier kernels. Signed-off-by: NeilBrown <neilb@suse.de>
|
f001a70cdc61c01452d42e8b32fd7c7842ef62d5 |
|
09-Jun-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: use conf->raid_disks in preference to mddev->raid_disks mddev->raid_disks can be changed at any time by a request from user-space. It is a suggestion as to what number of raid_disks is desired. conf->raid_disks can only be changed by the raid5 module with suitable locks in place. It is a statement as to the current number of raid_disks. There are two places where the latter should be used, but the former is used. This can lead to a crash when reshaping an array. This patch changes those uses of mddev-> to conf->. Signed-off-by: NeilBrown <neilb@suse.de>
|
a08abd8ca890a377521d65d493d174bebcaf694b |
|
03-Jun-2009 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: structify submission arguments, add scribble Prepare the api for the arrival of a new parameter, 'scribble'. This will allow callers to identify scratchpad memory for dma address or page address conversions. As this adds yet another parameter, take this opportunity to convert the common submission parameters (flags, dependency, callback, and callback argument) into an object that is passed by reference. Also, take this opportunity to fix up the kerneldoc and add notes about the relevant ASYNC_TX_* flags for each routine. [ Impact: moves api pass-by-value parameters to a pass-by-reference struct ] Signed-off-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
88ba2aa586c874681c072101287e15d40de7e6e2 |
|
10-Apr-2009 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: kill ASYNC_TX_DEP_ACK flag In support of inter-channel chaining async_tx utilizes an ack flag to gate whether a dependent operation can be chained to another. While the flag is not set the chain can be considered open for appending. Setting the ack flag closes the chain and flags the descriptor for garbage collection. The ASYNC_TX_DEP_ACK flag essentially means "close the chain after adding this dependency". Since each operation can only have one child the api now implicitly sets the ack flag at dependency submission time. This removes an unnecessary management burden from clients of the api. [ Impact: clean up and enforce one dependency per operation ] Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
ed37d83e6aa218192fb28bb6b82498d2a8c74070 |
|
27-May-2009 |
NeilBrown <neilb@suse.de> |
md: raid5: change incorrect usage of 'min' macro to 'min_t' A recent patch to raid5.c used min on an int and a sector_t. This isn't allowed. So change it to min_t(sector_t,x,y). Signed-off-by: NeilBrown <neilb@suse.de>
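The fix is a one-liner of this shape:

    -	len = min(a, b);             /* int vs sector_t: mixed types */
    +	len = min_t(sector_t, a, b); /* force a common type */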
|
848b3182365fdf5a05bcd5ed949071cac2c894b3 |
|
25-May-2009 |
NeilBrown <neilb@suse.de> |
md: raid5: avoid sector values going negative when testing reshape progress. As sector_t is unsigned, we cannot afford to let 'safepos' etc go negative. So replace a -= b; by a -= min(b,a); Signed-off-by: NeilBrown <neilb@suse.de>
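In sketch form (variable names hypothetical):

    /* sector_t is unsigned, so a plain subtraction can wrap around;
     * clamp the amount subtracted instead: */
    safepos -= min(gap, safepos);   /* rather than: safepos -= gap; */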
|
ae03bf639a5027d27270123f5f6e3ee6a412781d |
|
22-May-2009 |
Martin K. Petersen <martin.petersen@oracle.com> |
block: Use accessor functions for queue limits Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
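A representative conversion (one example, not the full sweep):

    -	if (sectors > q->max_hw_sectors)
    +	if (sectors > queue_max_hw_sectors(q))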
|
c03f6a19699fce4389888a45db8245969311d868 |
|
17-Apr-2009 |
NeilBrown <neilb@suse.de> |
md: update sync_completed and reshape_position even more often. There are circumstances when a user-space process might need to "oversee" a resync/reshape process. For example when doing an in-place reshape of a raid5, it is prudent to take a backup of each section before reshaping it as this is the only way to provide safety against an unplanned shutdown (i.e. crash/power failure). The sync_max sysfs value can be used to stop the resync from advancing beyond a particular point. So user-space can: suspend IO to the first section and back it up set 'sync_max' to the end of the section wait for 'sync_completed' to reach that point resume IO on the first section and move on to the next section. However this process requires the kernel and user-space to run in lock-step which could introduce unnecessary delays. It would be better if a 'double buffered' approach could be used with userspace and kernel space working on different sections with the 'next' section always ready when the 'current' section is finished. One problem with implementing this is that sync_completed is only guaranteed to be updated when the sync process reaches sync_max. (it is updated on a time basis at other times, but it is hard to rely on that). This defeats some of the double buffering. With this patch, sync_completed (and reshape_position) get updated as the current position approaches sync_max, so there is room for userspace to advance sync_max early without losing updates. To be precise, sync_completed is updated when the current sync position reaches half way between the current value of sync_completed and the value of sync_max. This will usually be a good time for user space to update sync_max. If sync_max does not get updated, the updates to sync_completed (together with associated metadata updates) will occur at an exponentially increasing frequency which will get unreasonably fast (one update every page) immediately before the process hits sync_max and stops. So the update rate will be unreasonably fast only for an insignificant period of time. Signed-off-by: NeilBrown <neilb@suse.de>
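The midpoint rule sketched in C (names hypothetical, surrounding details elided):

    /* report progress once the sync position passes the midpoint
     * between the last reported position and sync_max */
    if (position >= completed + (sync_max - completed) / 2) {
            mddev->curr_resync_completed = position;
            /* ... schedule a metadata update and wake sysfs pollers ... */
    }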
|
acb180b0e335ad88acfed6c8a33e39c05b95dc49 |
|
14-Apr-2009 |
NeilBrown <neilb@suse.de> |
md: improve usefulness and accuracy of sysfs file md/sync_completed. The sync_completed file reports how much of a resync (or recovery or reshape) has been completed. However due to the possibility of out-of-order completion of writes, it is not certain to be accurate. We have an internal value - mddev->curr_resync_completed - which is an accurate value (though it might not always be quite so uptodate). So: - make curr_resync_completed be uptodate a little more often, particularly when raid5 reshape updates status in the metadata - report curr_resync_completed in the sysfs file - allow poll/select to report all updates to md/sync_completed. This makes sync_completed usable by any external metadata handler that wants to record this status information in its metadata. Signed-off-by: NeilBrown <neilb@suse.de>
|
099f53cb50e45ef617a9f1d63ceec799e489418b |
|
08-Apr-2009 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: rename zero_sum to val 'zero_sum' does not properly describe the operation of generating parity and checking that it validates against an existing buffer. Change the name of the operation to 'val' (for 'validate'). This is in anticipation of the p+q case where it is a requirement to identify the target parity buffers separately from the source buffers, because the target parity buffers will not have corresponding pq coefficients. Reviewed-by: Andre Noll <maan@systemlinux.org> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
c8f517c444e4f9f55b5b5ca202b8404691a35805 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5 revise rules for when to update metadata during reshape We currently update the metadata: 1/ every 3Megabytes 2/ When the place we will write new-layout data to is recorded in the metadata as still containing old-layout data. Rule one exists to avoid having to re-do too much reshaping in the face of a crash/restart. So it should really be time based rather than size based. So change it to "every 10 seconds". Rule two turns out to be too harsh when restriping an array 'in-place', as in that case the metadata must be updated for every stripe. For the in-place update, it can only possibly be safe from a crash if some user-space program takes a backup of e.g. every few hundred stripes before allowing them to be reshaped. In that case, the constant metadata update is pointless. So only update the metadata if the new metadata will report that the end of the 'old-layout' data is beyond where we are currently writing 'new-layout' data. Signed-off-by: NeilBrown <neilb@suse.de>
|
b0f9ec047b79a92e8b8a9dfbf97537c8fbef234a |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: minor code cleanups in make_request. ... and to be certain that make_request doesn't wait forever, add a 'wake_up' when ->reshape_progress has been set to MaxSector. Signed-off-by: NeilBrown <neilb@suse.de>
|
2cffc4a01dd90a502324e3453d7b245d6d16e1c2 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: remove CONFIG_MD_RAID_RESHAPE config option. This was only needed when the code was experimental. Most of it is well tested now, so the option is no longer useful. Signed-off-by: NeilBrown <neilb@suse.de>
|
ab69ae12ceef7f23c578a3c230144e94a167a821 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: be more careful about write ordering when reshaping. When we are reshaping an array, it is very important that we read the data from a particular sector offset before writing new data at that offset. In most cases when growing or shrinking an array we read long before we even consider writing. But when restriping an array without changing its size, there is a small possibility that we might have some data available to write before the read has happened at the same location. This would require some stripes to be in cache already. To guard against this small possibility, we check, before writing, that the 'old' stripe at the same location is not in the process of being read. And we ensure that we mark all 'source' stripes as such before allowing new 'destination' stripes to proceed. Signed-off-by: NeilBrown <neilb@suse.de>
|
88ce4930e2b80378d45506ce2c3bb5820e156e85 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: allow layout and chunksize to be changed on active array. If an array has 3 or more devices, we allow the chunksize or layout to be changed and when a reshape starts, we use these as the 'new' values. Signed-off-by: NeilBrown <neilb@suse.de>
|
7a6613810785872b7c028fba22fc0bae1c91733d |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: reshape using largest of old and new chunk size This ensures that even when old and new stripes are overlapping, we will try to read all of the old before having to write any of the new. Signed-off-by: NeilBrown <neilb@suse.de>
|
e183eaedd53807e33f02ee80573e2833890e1f21 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: prepare for allowing reshape to change layout Add prev_algo to raid5_conf_t along the same lines as prev_chunk and previous_raid_disks. Signed-off-by: NeilBrown <neilb@suse.de>
|
784052ecc6ade6b6acf4f67e4ada8e5f2e6df446 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: prepare for allowing reshape to change chunksize. Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to remember what the chunk size was before the reshape that is currently underway. This seems like duplication with "chunk_size" and "new_chunk" in mddev_t, and to some extent it is, but there are differences. The values in mddev_t are always defined and often the same. The prev* values are only defined if a reshape is underway. Also (and more significantly) the raid5_conf_t values will be changed at the same time (inside an appropriate lock) that the reshape is started by setting reshape_position. In contrast, the new_chunk value is set when the sysfs file is written which could be well before the reshape starts. Signed-off-by: NeilBrown <neilb@suse.de>
|
86b42c713be3e5f6807aa14b4cbdb005d35c64d5 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape. During a raid5 reshape, we have some stripes in the cache that are 'before' the reshape (and are still to be processed) and some that are 'after'. They are currently differentiated by having different ->disks values as the only reshape currently supported involves changing the number of disks. However we will soon support reshapes that do not change the number of disks (e.g. layout or chunk size changes). So make the difference more explicit with a 'generation' number. Signed-off-by: NeilBrown <neilb@suse.de>
|
ec32a2bd35bd6b933a5db6542c48210ce069a376 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: allow number of drives in raid5 to be reduced When reshaping a raid5 to have fewer devices, we work from the end of the array to the beginning. md_do_sync gives addresses to sync_request that go from the beginning to the end. So largely ignore them and use the internal state variable "reshape_progress" to keep track of what to do next. Never allow the number of devices to be reduced below the minimum (4 for raid6, 3 otherwise). We require that the size of the array has already been reduced before the array is reshaped to a smaller size. This is because simply reducing the size is an easily reversible operation, while the reshape is immediately destructive and so is not reversible for the blocks at the ends of the devices. Thus to reshape an array to have fewer devices, you must first write an appropriately small size to md/array_size. When the reshape finishes, we remove any drives that are no longer needed and fix up ->degraded. Signed-off-by: NeilBrown <neilb@suse.de>
|
fef9c61fdfabf97a307c2cf3621a6949f0a4b995 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: change reshape-progress measurement to cope with reshaping backwards. When reducing the number of devices in a raid4/5/6, the reshape process has to start at the end of the array and work down to the beginning. So we need to handle expand_progress and expand_lo differently. This patch renames "expand_progress" and "expand_lo" to avoid the implication that anything is getting bigger (expand->reshape) and every place they are used, we make sure that they are used the right way depending on whether delta_disks is positive or negative. Signed-off-by: NeilBrown <neilb@suse.de>
|
cea9c22800773cecb1d41f4a6139f9eb6a95368b |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: add explicit method to signal the end of a reshape. Currently raid5 (the only module that supports restriping) notices that the reshape has finished by sync_request being given a large value, and handles any cleanup then. This patch changes it so md_check_recovery calls into an explicit finish_reshape method as well. The clean-up from sync_request can do things that need to be done promptly, typically things local to the raid5_conf_t structure. The "finish_reshape" method is called under the mddev_lock so it can do things involving reconfiguring the device. This allows us to get rid of md_set_array_sectors_locked, which would have caused a deadlock if you tried to stop an array while a reshape was happening. Signed-off-by: NeilBrown <neilb@suse.de>
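A sketch of the new hook (the surrounding struct is abridged):

    struct mdk_personality {
            /* ... */
            void (*finish_reshape)(mddev_t *mddev);
    };

    /* in md_check_recovery(), called under mddev_lock: */
    if (mddev->pers->finish_reshape)
            mddev->pers->finish_reshape(mddev);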
|
7ec0547838976d088dfb9cb0adb073e6e8a15aa3 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: enhance raid5_size to work correctly with negative delta_disks This is the first of four patches which combine to allow md/raid5 to reduce the number of devices in the array by restriping the data over a subset of the devices. If the number of disks in a raid4/5/6 is being reduced, then the default size must be based on the new number, not the old number of devices. In general, it should be based on the smaller of new and old. Signed-off-by: NeilBrown <neilb@suse.de>
|
34e04e87fb8b2c62c9e8868f41c8179d0e15f51a |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: drop qd_idx from r6_state We now have this value in stripe_head so we don't need to duplicate it. Signed-off-by: NeilBrown <neilb@suse.de>
|
f701d589aa34d7531183c9ac6f7713ba14212b02 |
|
31-Mar-2009 |
Dan Williams <dan.j.williams@intel.com> |
md/raid6: move raid6 data processing to raid6_pq.ko Move the raid6 data processing routines into a standalone module (raid6_pq) to prepare them to be called from async_tx wrappers and other non-md drivers/modules. This precludes a circular dependency of raid456 needing the async modules for data processing while those modules in turn depend on raid456 for the base level synchronous raid6 routines. To support this move: 1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h 2/ The raid6_call, recovery calls, and table symbols are exported 3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to compile Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
|
18b0033491f584a2d79697da714b1ef9d6b27d22 |
|
31-Mar-2009 |
Andre Noll <maan@systemlinux.org> |
md: raid5 run(): Fix max_degraded for raid level 4. raid4 allows only one failed disk. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
b522adcde9c4d3fb7b579cfa9160d8bde7744be8 |
|
31-Mar-2009 |
Dan Williams <dan.j.williams@intel.com> |
md: 'array_size' sysfs attribute Allow userspace to set the size of the array according to the following semantics: 1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0) a) If size is set before the array is running, do_md_run will fail if size is greater than the default size b) A reshape attempt that reduces the default size to less than the set array size should be blocked 2/ once userspace sets the size the kernel will not change it 3/ writing 'default' to this attribute returns control of the size to the kernel and reverts to the size reported by the personality Also, convert locations that need to know the default size from directly reading ->array_sectors to <pers>_size. Resync/reshape operations always follow the default size. Finally, fixup other locations that read a number of 1k-blocks from userspace to use strict_blocks_to_sectors() which checks for unsigned long long to sector_t overflow and blocks to sectors overflow. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
1f403624bde3c678a166984b1e6a727a0ce06f2b |
|
31-Mar-2009 |
Dan Williams <dan.j.williams@intel.com> |
md: centralize ->array_sectors modifications Get personalities out of the business of directly modifying ->array_sectors. Lays groundwork to introduce policy on when ->array_sectors can be modified. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
80c3a6ce4ba4470379b9e6a4d9bcd9d2ee26ae03 |
|
18-Mar-2009 |
Dan Williams <dan.j.williams@intel.com> |
md: add 'size' as a personality method In preparation for giving userspace control over ->array_sectors we need to be able to retrieve the 'default' size, and the 'anticipated' size when a reshape is requested. For personalities that do not reshape emit a warning if anything but the default size is requested. In the raid5 case we need to update ->previous_raid_disks to make the new 'default' size available. Reviewed-by: Andre Noll <maan@systemlinux.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
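The method has roughly this shape (a sketch based on the semantics described above and in the 'array_size' entry):

    /* sectors == 0 and raid_disks == 0 request the 'default' size;
     * non-zero values ask for the size anticipated after a reshape */
    sector_t (*size)(mddev_t *mddev, sector_t sectors, int raid_disks);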
|
fc9739c6d626ee79a148ec367d143b0601299a9d |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: add takeover support for converting raid6 back into raid5 If a raid6 is still in the layout that comes from converting raid5 into a raid6, this will allow us to convert it back again. Signed-off-by: NeilBrown <neilb@suse.de>
|
e9d4758f6e93488dc719a1445ce54659a570938f |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: add takeover support for raid4 -> raid5 conversion. Signed-off-by: NeilBrown <neilb@suse.de>
|
b3546035277847028df650b147469fc943cf5c71 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. 2-drive raid5's aren't very interesting. But if you are converting a raid1 into a raid5, you will at least temporarily have one. And that is a good time to set the layout/chunksize for the new RAID5 if you aren't happy with the defaults. layout and chunksize don't actually affect the placement of data on a 2-drive raid5, so we just do some internal book-keeping. Signed-off-by: NeilBrown <neilb@suse.de>
|
d562b0c4313e3ddea402a400371afa47ddf679f9 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: add ->takeover method for raid5 to be able to take over raid1 The RAID1 must have two drives and be a suitable size to be a multiple of a chunksize that isn't too small. Signed-off-by: NeilBrown <neilb@suse.de>
|
245f46c2c221ef09c7db892f0e3fc2149be42052 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: add ->takeover method to support changing the personality managing an array Implement this for RAID6 to be able to 'takeover' a RAID5 array. The new RAID6 will use a layout which places Q on the last device, and that device will be missing. If there are any available spares, one will immediately have Q recovered onto it. Signed-off-by: NeilBrown <neilb@suse.de>
|
e0cf8f045b2023b0b3f919ee93eb94345f648434 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: md_unregister_thread should cope with being passed NULL Mostly md_unregister_thread is only called when we know that the thread is NULL, but sometimes we need to check first. It is safer to put the check inside md_unregister_thread itself. Signed-off-by: NeilBrown <neilb@suse.de>
|
91adb56473febeeb3ef657bb5147ddd355465700 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: refactor raid5 "run" .. so that the code to create the private data structures is separate. This will help with future code to change the level of an active array. Signed-off-by: NeilBrown <neilb@suse.de>
|
67cc2b8165857ba019920d1f00d64bcc4140075d |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: finish support for DDF/raid6 DDF requires RAID6 calculations over different devices in a different order. For md/raid6, we calculate over just the data devices, starting immediately after the 'Q' block. For ddf/raid6 we calculate over all devices, using zeros in place of the P and Q blocks. This requires unfortunately complex loops... Signed-off-by: NeilBrown <neilb@suse.de>
|
99c0fb5f92828ae96909d390f2df137b89093b37 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: Add support for new layouts for raid5 and raid6. DDF uses different layouts for P and Q blocks than current md/raid6 so add those that are missing. Also add support for RAID6 layouts that are identical to various raid5 layouts with the simple addition of one device to hold all of the 'Q' blocks. Finally add 'raid5' layouts to match raid4. These last two will allow online level conversion. Note that this does not provide correct support for DDF/raid6 yet as the order in which data blocks are summed to produce the Q block is significant and different between current md code and DDF requirements. Signed-off-by: NeilBrown <neilb@suse.de>
|
911d4ee8536d89ea8a6cd3e96b1c95a3ebc5ea66 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: simplify raid5_compute_sector interface Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass a 'struct stripe_head *' and fill in the relevant fields. This is more extensible. Signed-off-by: NeilBrown <neilb@suse.de>
|
d0dabf7e577411c2bf6b616c751544dc241213d4 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid6: remove expectation that Q device is immediately after P device. Code currently assumes that the devices in a raid6 stripe are 0 1 ... N-1 P Q in some rotated order. We will shortly add new layouts in which this strict pattern is broken. So remove this expectation. We still assume that the data disks are roughly in-order. However P and Q can be inserted anywhere within that order. Signed-off-by: NeilBrown <neilb@suse.de>
|
112bf8970dbdfc00bd4667da5996e57c2ce58066 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument This is similar to the recent change to get_active_stripe. There is no functional change, just some rearrangement to make future patches cleaner. Signed-off-by: NeilBrown <neilb@suse.de>
|
b5663ba405fe3e51176ddb6c91a5e186590c26b5 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md/raid5: simplify interface for init_stripe and get_active_stripe Rather than passing 'pd_idx' and 'disks' to these functions, just pass 'previous' which tells whether to use the 'previous' or 'current' geometry during a reshape, and let init_stripe calculate disks and pd_idx and anything else it might need. This is not a substantial simplification and even adds a division. However we will shortly be adding more complexity to init_stripe to handle more interesting 'reshape' activities, and without this change, the interface to these functions would get very complex. Signed-off-by: NeilBrown <neilb@suse.de>
|
58c0fed400603a802968b23ddf78f029c5a84e41 |
|
31-Mar-2009 |
Andre Noll <maan@systemlinux.org> |
md: Make mddev->size sector-based. This patch renames the "size" field of struct mddev_s to "dev_sectors" and stores the number of 512-byte sectors instead of the number of 1K-blocks in it. All users of that field, including raid levels 1,4-6,10, are adjusted accordingly. This simplifies the code a bit because it allows to get rid of a couple of divisions/multiplications by two. In order to make checkpatch happy, some minor coding style issues have also been addressed. In particular, size_store() now uses strict_strtoull() instead of simple_strtoull(). Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
43b2e5d86d8bdd77386226db0bc961529492c043 |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: move md_k.h from include/linux/raid/ to drivers/md/ It really is nicer to keep related code together.. Signed-off-by: NeilBrown <neilb@suse.de>
|
bff61975b3d6c18ee31457cc5b4d73042f44915f |
|
31-Mar-2009 |
NeilBrown <neilb@suse.de> |
md: move lots of #include lines out of .h files and into .c This makes the includes more explicit, and is preparation for moving md_k.h to drivers/md/md.h Remove include/raid/md.h as its only remaining use was to #include other files. Signed-off-by: NeilBrown <neilb@suse.de>
|
ef740c372dfd80e706dbf955d4e4aedda6c0c148 |
|
31-Mar-2009 |
Christoph Hellwig <hch@lst.de> |
md: move headers out of include/linux/raid/ Move the headers with the local structures for the disciplines and bitmap.h into drivers/md/ so that they are more easily grepable for hacking and not far away. md.h is left where it is for now as there are some uses from the outside. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>
|
159ec1fc060ab22b157a62364045f5e98749c4d3 |
|
08-Jan-2009 |
Cheng Renquan <crquan@gmail.com> |
md: use list_for_each_entry macro directly The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to list_for_each_entry_safe, from <linux/list.h>, so it should be defined to use list_for_each_entry_safe, instead of reinventing the wheel. But some of those calls don't really need the safe version; a direct list_for_each_entry is enough, and saves a temp variable (tmp) in every function that used rdev_for_each. In this patch, most rdev_for_each loops are replaced by list_for_each_entry, saving many tmp variables; the safe version is kept only where list_del is called to delete an entry. Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
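In sketch form, the macro collapses to the generic helper:

    /* define in terms of <linux/list.h> instead of open-coding it */
    #define rdev_for_each(rdev, tmp, mddev) \
            list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)

    /* loops that never call list_del can use the plain form and
     * drop the tmp variable entirely: */
    list_for_each_entry(rdev, &mddev->disks, same_set)
            /* ... */;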
|
4bbf3771ca40d0aaec8316d0e7476b16010288e5 |
|
13-Oct-2008 |
NeilBrown <neilb@suse.de> |
md: Relax minimum size restrictions on chunk_size. Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE. This means that moving an array to a machine with a larger PAGE_SIZE, or changing the kernel to use a larger PAGE_SIZE, can stop an array from working. For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync process works on whole pages at a time, and assumes them to be wholly within a stripe. For other raid personalities, this restriction is not needed at all and can be dropped. So remove the test on chunk_size from common code, and add it in just the places where it is needed: raid10 and raid4/5/6. Signed-off-by: NeilBrown <neilb@suse.de>
|
d710e13812600037a723a673dc5c96a071de98d3 |
|
13-Oct-2008 |
NeilBrown <neilb@suse.de> |
md: remove space after function name in declaration and call. Having function (args) instead of function(args) makes it harder to search for calls of particular functions. So remove all those spaces. Signed-off-by: NeilBrown <neilb@suse.de>
|
fb4d8c76e56a887b9eee99fbc55fe82b18625d30 |
|
13-Oct-2008 |
NeilBrown <neilb@suse.de> |
md: Remove unnecessary #includes, #defines, and function declarations. A lot of cruft has gathered over the years. Time to remove it. Signed-off-by: NeilBrown <neilb@suse.de>
|
074a7aca7afa6f230104e8e65eba3420263714a5 |
|
25-Aug-2008 |
Tejun Heo <tj@kernel.org> |
block: move stats from disk to part0 Move stats related fields - stamp, in_flight, dkstats - from disk to part0 and unify stat handling such that... * part_stat_*() now updates part0 together if the specified partition is not part0. ie. part_stat_*() are now essentially all_stat_*(). * {disk|all}_stat_*() are gone. * part_round_stats() is updated similarly. It handles part0 stats automatically and disk_round_stats() is killed. * part_{inc|dec}_in_flight() is implemented which automatically updates part0 stats for parts other than part0. * disk_map_sector_rcu() is updated to return part0 if no part matches. Combined with the above changes, this makes NULL special case handling in callers unnecessary. * Separate stats show code paths for disk are collapsed into part stats show code paths. * Rename disk_stat_lock/unlock() to part_stat_lock/unlock() While at it, reposition stat handling macros a bit and add missing parentheses around macro parameters. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
c9959059161ddd7bf4670cf47367033d6b2f79c4 |
|
25-Aug-2008 |
Tejun Heo <tj@kernel.org> |
block: fix diskstats access There are two variants of stat functions - ones prefixed with double underbars which don't care about preemption and ones without which disable preemption before manipulating per-cpu counters. It's unclear whether the underbarred ones assume that preemtion is disabled on entry as some callers don't do that. This patch unifies diskstats access by implementing disk_stat_lock() and disk_stat_unlock() which take care of both RCU (for partition access) and preemption (for per-cpu counter access). diskstats access should always be enclosed between the two functions. As such, there's no need for the versions which disables preemption. They're removed and double underbars ones are renamed to drop the underbars. As an extra argument is added, there's no danger of using the old version unconverted. disk_stat_lock() uses get_cpu() and returns the cpu index and all diskstat functions which access per-cpu counters now has @cpu argument to help RT. This change adds RCU or preemption operations at some places but also collapses several preemption ops into one at others. Overall, the performance difference should be negligible as all involved ops are very lightweight per-cpu ones. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
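Usage pattern, as a sketch (macro names per the description above; the exact argument order is an assumption):

    cpu = disk_stat_lock();          /* get_cpu() + rcu_read_lock() */
    disk_stat_inc(cpu, disk, ios[rw]);
    disk_stat_add(cpu, disk, sectors[rw], nsect);
    disk_stat_unlock();              /* put_cpu() + rcu_read_unlock() */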
|
5b99c2ffa980528a197f26c7d876cceeccce8dd5 |
|
15-Aug-2008 |
Jens Axboe <jens.axboe@oracle.com> |
block: make bi_phys_segments an unsigned int instead of short raid5 can overflow with more than 255 stripes, and we can increase it to an int for free on both 32 and 64-bit archs due to the padding. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
960e739d9e9f1c2346d8bdc65299ee2e1ed42218 |
|
15-Aug-2008 |
Jens Axboe <jens.axboe@oracle.com> |
block: raid fixups for removal of bi_hw_segments Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
ac4090d24c6a26211bc4523d920376e054d4f3f8 |
|
05-Aug-2008 |
NeilBrown <neilb@suse.de> |
Don't let a blocked_rdev interfere with read request in raid5/6 When we have externally managed metadata, we need to mark a failed device as 'Blocked' and not allow any writes until that device has been marked as faulty in the metadata and the Blocked flag has been removed. However it is perfectly OK to allow read requests when there is a Blocked device, and with a readonly array, there may not be any metadata-handler watching for blocked devices. So in raid5/raid6 only allow a Blocked device to interfere with write requests or resync. Read requests go through untouched. raid1 and raid10 already differentiate between read and write properly. Signed-off-by: NeilBrown <neilb@suse.de>
|
dba034eef2456d2a9f9a76806846c97acf6c3ad1 |
|
05-Aug-2008 |
NeilBrown <neilb@suse.de> |
Fail safely when trying to grow an array with a write-intent bitmap. We cannot currently change the size of a write-intent bitmap. So if we change the size of an array which has such a bitmap, it tries to set bits beyond the end of the bitmap. For now, simply reject any request to change the size of an array which has a bitmap. mdadm can remove the bitmap and add a new one after the array has changed size. Signed-off-by: NeilBrown <neilb@suse.de>
|
df10cfbc4d7ab93260d997df754219d390d62a9d |
|
29-Jul-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: do not progress the resync process if the stripe was blocked handle_stripe will take no action on a stripe when waiting for userspace to unblock the array, so do not report completed sectors. Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
2339788376e2d69a9154130e4dacd5b21ce63094 |
|
24-Jul-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: fix merge error The original STRIPE_OP_IO removal patch had the following hunk: - for (i = conf->raid_disks; i--; ) { + for (i = conf->raid_disks; i--; ) set_bit(R5_Wantwrite, &sh->dev[i].flags); - if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - } However it appears the hunk became broken after merging: - for (i = conf->raid_disks; i--; ) { + for (i = conf->raid_disks; i--; ) set_bit(R5_Wantwrite, &sh->dev[i].flags); set_bit(R5_LOCKED, &dev->flags); s.locked++; - if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - } Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
c9f21aaff1d1fb5629325130af469532d19beb93 |
|
23-Jul-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: move async_tx_issue_pending_all outside spin_lock_irq Some dma drivers need to call spin_lock_bh in their device_issue_pending routines. This change avoids: WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85() Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
f233ea5c9e0d8b95e4283bf6a3436b88f6fd3586 |
|
21-Jul-2008 |
Andre Noll <maan@systemlinux.org> |
md: Make mddev->array_size sector-based. This patch renames the array_size field of struct mddev_s to array_sectors and converts all instances to use units of 512 byte sectors instead of 1k blocks. Signed-off-by: Andre Noll <maan@systemlinux.org> Signed-off-by: NeilBrown <neilb@suse.de>
|
7a1fc53c5adb910751a9b212af90302eb4ffb527 |
|
10-Jul-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: ensure all blocks are uptodate or locked when syncing Remove the dubious attempt to prefer 'compute' over 'read'. Not only is it wrong given commit c337869d (md: do not compute parity unless it is on a failed drive), but it can trigger a BUG_ON in handle_parity_checks5(). Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
cc371e66e340f35eed8dc4651c7c18e754c7fb26 |
|
03-Jul-2008 |
Alasdair G Kergon <agk@redhat.com> |
Add bvec_merge_data to handle stacked devices and ->merge_bvec() When devices are stacked, one device's merge_bvec_fn may need to perform the mapping and then call one or more functions for its underlying devices. The following bio fields are used: bio->bi_sector bio->bi_bdev bio->bi_size bio->bi_rw using bio_data_dir() This patch creates a new struct bvec_merge_data holding a copy of those fields to avoid having to change them directly in the struct bio when going down the stack only to have to change them back again on the way back up. (And then when the bio gets mapped for real, the whole exercise gets repeated, but that's a problem for another day...) Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Neil Brown <neilb@suse.de> Cc: Milan Broz <mbroz@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
b5470dc5fc18a8ff6517c3bb538d1479e58ecb02 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: resolve external metadata handling deadlock in md_allow_write md_allow_write() marks the metadata dirty while holding mddev->lock and then waits for the write to complete. For externally managed metadata this causes a deadlock as userspace needs to take the lock to communicate that the metadata update has completed. Change md_allow_write() in the 'external' case to start the 'mark active' operation and then return -EAGAIN. The expected side effects while waiting for userspace to write 'active' to 'array_state' are holding off reshape (code currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and cause updates to 'raid_disks' to fail. Except for 'stripe_cache_size' changes these failures can be mitigated by coordinating with mdmon. md_write_start() still prevents writes from occurring until the metadata handler has had a chance to take action as it unconditionally waits for MD_CHANGE_CLEAN to be cleared. [neilb@suse.de: return -EAGAIN, try GFP_NOIO] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
1fe797e67fb07d605b82300934d0de67068a0aca |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: rationalize raid5 function names From: Dan Williams <dan.j.williams@intel.com> Commit a4456856 refactored some of the deep code paths in raid5.c into separate functions. The names chosen at the time do not consistently indicate what is going to happen to the stripe. So, update the names, and since a stripe is a cache element use cache semantics like fill, dirty, and clean. (also, fix up the indentation in fetch_block5) Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
7b3a871ed995270268a481404454ceafe1a87478 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: handle operation chaining in raid5_run_ops From: Dan Williams <dan.j.williams@intel.com> Neil said: > At the end of ops_run_compute5 you have: > /* ack now if postxor is not set to be run */ > if (tx && !test_bit(STRIPE_OP_POSTXOR, &s->ops_run)) > async_tx_ack(tx); > > It looks odd having that test there. Would it fit in raid5_run_ops > better? The intended global interpretation is that raid5_run_ops can build a chain of xor and memcpy operations. When MD registers the compute-xor it tells async_tx to keep the operation handle around so that another item in the dependency chain can be submitted. If we are just computing a block to satisfy a read then we can terminate the chain immediately. raid5_run_ops gives a better context for this test since it cares about the entire chain. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
d8ee0728b5b30d7a6f62c399a95e953616d31f23 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: replace R5_WantPrexor with R5_WantDrain, add 'prexor' reconstruct_states From: Dan Williams <dan.j.williams@intel.com> Currently ops_run_biodrain and other locations have extra logic to determine which blocks are processed in the prexor and non-prexor cases. This can be eliminated if handle_write_operations5 flags the blocks to be processed in all cases via R5_Wantdrain. The presence of the prexor operation is tracked in sh->reconstruct_state. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
600aa10993012ff2dd5617720dac081e4f992017 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: replace STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} with 'reconstruct_states' From: Dan Williams <dan.j.williams@intel.com> Track the state of reconstruct operations (recalculating the parity block usually due to incoming writes, or as part of array expansion) Reduces the scope of the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags to only tracking whether a reconstruct operation has been requested via the ops_request field of struct stripe_head_state. This is the final step in the removal of ops.{pending,ack,complete,count}, i.e. the STRIPE_OP_{BIODRAIN,PREXOR,POSTXOR} flags only request an operation and do not track the state of the operation. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
976ea8d475675da6e86bd434328814ccbf5ae641 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: replace STRIPE_OP_COMPUTE_BLK with STRIPE_COMPUTE_RUN From: Dan Williams <dan.j.williams@intel.com> Track the state of compute operations (recalculating a block from all the other blocks in a stripe) with a state flag. Reduces the scope of the STRIPE_OP_COMPUTE_BLK flag to only tracking whether a compute operation has been requested via the ops_request field of struct stripe_head_state. Note, the compute operation that is performed in the course of doing a 'repair' operation (check the parity block, recalculate it and write it back if the check result is not zero) is tracked separately with the 'check_state' variable. Compute operations are held off while a 'check' is in progress, and moving this check out to handle_issuing_new_read_requests5 the helper routine __handle_issuing_new_read_requests5 can be simplified. This is another step towards the removal of ops.{pending,ack,complete,count}, i.e. STRIPE_OP_COMPUTE_BLK only requests an operation and does not track the state of the operation. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
83de75cc92be599850e5ef3928e07cd840833499 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: replace STRIPE_OP_BIOFILL with STRIPE_BIOFILL_RUN From: Dan Williams <dan.j.williams@intel.com> Track the state of read operations (copying data from the stripe cache to bio buffers outside the lock) with a state flag. Reduce the scope of the STRIPE_OP_BIOFILL flag to only tracking whether a biofill operation has been requested via the ops_request field of struct stripe_head_state. This is another step towards the removal of ops.{pending,ack,complete,count}, i.e. STRIPE_OP_BIOFILL only requests an operation and does not track the state of the operation. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
ecc65c9b3f9b9d740a5deade3d85b39be56401b6 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: replace STRIPE_OP_CHECK with 'check_states' From: Dan Williams <dan.j.williams@intel.com> The STRIPE_OP_* flags record the state of stripe operations which are performed outside the stripe lock. Their use in indicating which operations need to be run is straightforward; however, interpolating what the next state of the stripe should be based on a given combination of these flags is not straightforward, and has led to bugs. An easier to read implementation with minimal degrees of freedom is needed. Towards this goal, this patch introduces explicit states to replace what was previously interpolated from the STRIPE_OP_* flags. For now this only converts the handle_parity_checks5 path, removing a user of the ops.{pending,ack,complete,count} fields of struct stripe_operations. This conversion also found a remaining issue with the current code. There is a small window for a drive to fail between when we schedule a repair and when the parity calculation for that repair completes. When this happens we will writeback to 'failed_num' when we really want to write back to 'pd_idx'. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
f0e43bcdebf709d747a3effb210aff1941e819ab |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: unify raid5/6 i/o submission From: Dan Williams <dan.j.williams@intel.com> Let the raid6 path call ops_run_io to get pending i/o submitted. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
c4e5ac0a22e664eecf29249553cf16c2433f5f25 |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: use stripe_head_state in ops_run_io() From: Dan Williams <dan.j.williams@intel.com> In handle_stripe after taking sh->lock we sample some bits into 's' (struct stripe_head_state): s.syncing = test_bit(STRIPE_SYNCING, &sh->state); s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); Use these values from 's' in ops_run_io() rather than re-sampling the bits. This ensures a consistent snapshot (as seen under sh->lock) is used. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
2b7497f0e0a0b9cf21d822e427d5399b2056501a |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: kill STRIPE_OP_IO flag From: Dan Williams <dan.j.williams@intel.com> The R5_Want{Read,Write} flags already gate i/o. So, this flag is superfluous and we can unconditionally call ops_run_io(). Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
b203886edbcaac3ca427cf4dbcb50b18bdb346fd |
|
28-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: kill STRIPE_OP_MOD_DMA in raid5 offload From: Dan Williams <dan.j.williams@intel.com> This micro-optimization allowed the raid code to skip a re-read of the parity block after checking parity. It took advantage of the fact that xor-offload-engines have their own internal result buffer and can check parity without writing to memory. Remove it for the following reasons: 1/ It is a layering violation for MD to need to manage the DMA and non-DMA paths within async_xor_zero_sum 2/ Bad precedent to toggle the 'ops' flags outside the lock 3/ Hard to realize a performance gain as reads will not need an updated parity block and writes will dirty it anyways. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
199050ea1ff2270174ee525b73bc4c3323098897 |
|
28-Jun-2008 |
Neil Brown <neilb@notabene.brown> |
rationalise return value for ->hot_add_disk method. For all array types but linear, ->hot_add_disk returns 1 on success, 0 on failure. For linear, it returns 0 on success and -errno on failure. This doesn't cause a functional problem because the ->hot_add_disk function of linear is used quite differently to the others. However it is confusing. So convert all to return 0 for success or -errno on failure and fix call sites to match. Signed-off-by: Neil Brown <neilb@suse.de>
|
6c2fce2ef6b4821c21b5c42c7207cb9cf8c87eda |
|
28-Jun-2008 |
Neil Brown <neilb@notabene.brown> |
Support adding a spare to a live md array with external metadata. i.e. extend the 'md/dev-XXX/slot' attribute so that you can tell a device to fill a vacant slot in an md array. Signed-off-by: Neil Brown <neilb@suse.de>
|
0e13fe23a00ad88c737d91d94a050707c6139ce4 |
|
28-Jun-2008 |
Neil Brown <neilb@notabene.brown> |
use bio_endio instead of a call to bi_end_io Turn calls to bi->bi_end_io() into bio_endio(). Apparently bio_endio does exactly the same error processing as is hardcoded at these places. bio_endio() avoids recursion (or will soon), so it should be used. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Neil Brown <neilb@suse.de>
|
efe311431869b40d67911820a309f9a1a41306f3 |
|
28-Jun-2008 |
Neil Brown <neilb@notabene.brown> |
Don't acknowledge that stripe-expand is complete until it really is. We shouldn't acknowledge that a stripe has been expanded (when reshaping a raid5 by adding a device) until the moved data has actually been written out. However we are currently acknowledging (by calling md_done_sync) when the POST_XOR is complete and before the write. So track in s.locked whether there are pending writes, and don't call md_done_sync yet if there are. Note: we also set R5_LOCKED on devices which we are about to read from. This probably isn't technically necessary, but is usually done when writing a block, and justifies the use of s.locked here. This bug can lead to a crash if an array is stopped while a reshape is in progress. Cc: <stable@kernel.org> Signed-off-by: Neil Brown <neilb@suse.de>
|
8c2e870a625bd336b2e7a65a97c1836acef07322 |
|
28-Jun-2008 |
Neil Brown <neilb@notabene.brown> |
Ensure interrupted recovery completed properly (v1 metadata plus bitmap) If, while assembling an array, we find a device which is not fully in-sync with the array, it is important to set the "fullsync" flag. This is an exact analog to the setting of this flag in hot_add_disk methods. Currently, only v1.x metadata supports having devices in an array which are not fully in-sync (it keeps track of how in sync they are). The 'fullsync' flag only makes a difference when a write-intent bitmap is being used. In this case it tells recovery to ignore the bitmap and recover all blocks. This fix is already in place for raid1, but not raid5/6 or raid10. So without this fix, a raid1 or raid4/5/6 array with version 1.x metadata and a write-intent bitmap, that is stopped in the middle of a recovery, will appear to complete the recovery instantly after it is reassembled, but the recovery will not be correct. If you might have an array like that, issuing echo repair > /sys/block/mdXX/md/sync_action will make sure recovery completes properly. Cc: <stable@kernel.org> Signed-off-by: Neil Brown <neilb@suse.de>
|
c337869d95011495fa181536786e74aa2d7ff031 |
|
06-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: do not compute parity unless it is on a failed drive If a block is computed (rather than read) then a check/repair operation may be led to believe that the data on disk is correct, when in fact it isn't. So only compute blocks for failed devices. This issue has been around since at least 2.6.12, but has become harder to hit in recent kernels since most reads bypass the cache. echo repair > /sys/block/mdN/md/sync_action will set the parity blocks to the correct state. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
e0a115e5aa554b93150a8dc1c3fe15467708abb2 |
|
06-Jun-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: fix prexor vs sync_request race During the initial array synchronization process there is a window between when a prexor operation is scheduled to a specific stripe and when it completes for a sync_request to be scheduled to the same stripe. When this happens the prexor completes and the stripe is unconditionally marked "insync", effectively canceling the sync_request for the stripe. Prior to 2.6.23 this was not a problem because the prexor operation was done under sh->lock. The effect in older kernels being that the prexor would still erroneously mark the stripe "insync", but sync_request would be held off and re-mark the stripe as "!in_sync". Change the write completion logic to not mark the stripe "in_sync" if a prexor was performed. The effect of the change is to sometimes not set STRIPE_INSYNC. The worst this can do is cause the resync to stall waiting for STRIPE_INSYNC to be set. If this were happening, then STRIPE_SYNCING would be set and handle_issuing_new_read_requests would cause all available blocks to eventually be read, at which point prexor would never be used on that stripe any more and STRIPE_INSYNC would eventually be set. echo repair > /sys/block/mdN/md/sync_action will correct arrays that may have lost this race. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
dfc7064500061677720fa26352963c772d3ebe6b |
|
23-May-2008 |
NeilBrown <neilb@suse.de> |
md: restart recovery cleanly after device failure. When we get any IO error during a recovery (rebuilding a spare), we abort the recovery and restart it. For RAID6 (and multi-drive RAID1) it may not be best to restart at the beginning: when multiple failures can be tolerated, the recovery may be able to continue and re-doing all that has already been done doesn't make sense. We already have the infrastructure to record where a recovery is up to and restart from there, but it is not being used properly. This is because: - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR, which causes the recovery not to be checkpointed. - We remove spares and then re-add them, which loses important state information. The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't needed. If there is an error, the relevant drive will be marked as Faulty, and that is enough to ensure correct handling of the error. So we first remove MD_RECOVERY_ERR, changing some of the uses of it to MD_RECOVERY_INTR. Then we cause the attempt to remove a non-faulty device from an array to fail (unless recovery is impossible as the array is too degraded). Then when remove_and_add_spares attempts to remove the devices on which recovery can continue, it will fail, they will remain in place, and recovery will continue on them as desired. Issue: If we are halfway through rebuilding a spare and another drive fails, and a new spare is immediately available, do we want to: 1/ complete the current rebuild, then go back and rebuild the new spare or 2/ restart the rebuild from the start and rebuild both devices in parallel. Both options can be argued for. The code currently takes option 2 as a/ this requires least code change b/ this results in a minimally-degraded array in minimal time. Cc: "Eivind Sarto" <ivan@kasenna.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
6be9d4940134b36f9ed020aead36f831f19b49f1 |
|
23-May-2008 |
Bernd Schubert <bernd-schubert@gmx.de> |
md: raid5 rate limit error printk Last night we had scsi problems and a hardware raid unit was offlined during heavy i/o. While this happened we got for about 3 minutes a huge number of messages like this: Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2). I guess the high error rate is responsible for not scheduling other events - during this time the system was not pingable, and in the end other devices also ran into scsi command timeouts, causing problems on these unrelated devices as well. Signed-off-by: Bernd Schubert <bernd-schubert@gmx.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
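The fix follows the classic rate-limiting pattern, roughly (variable names such as bdn are assumptions for illustration):

    if (printk_ratelimit())
            printk(KERN_WARNING
                   "raid5:%s: read error not correctable (sector %llu on %s).\n",
                   mdname(mddev), (unsigned long long)sector, bdn);

so a flood of identical read-error messages can no longer starve the rest of the system.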
|
e7e72bf641b1fc7b9df6f40bd2c36dfccd8d647c |
|
15-May-2008 |
Neil Brown <neilb@suse.de> |
Remove blkdev warning triggered by using md As setting and clearing queue flags now requires that we hold a spinlock on the queue, and as blk_queue_stack_limits is called without that lock, get the lock inside blk_queue_stack_limits. For blk_queue_stack_limits to be able to find the right lock, each md personality needs to set q->queue_lock to point to the appropriate lock. Those personalities which didn't previously use a spin_lock use q->__queue_lock. So always initialise that lock when allocated. With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no longer cause warnings as it will be clear that the proper lock is held. Thanks to Dan Williams for review and fixing the silly bugs. Signed-off-by: NeilBrown <neilb@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Alistair John Strachan <alistair@devzero.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Jacek Luczak <difrost.kernel@gmail.com> Cc: Prakash Punnoor <prakash@punnoor.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
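In sketch form, each personality's run() now does something like:

    /* point the block layer at the lock this personality actually
     * holds, so queue-flag updates occur under the correct spinlock */
    mddev->queue->queue_lock = &conf->device_lock;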
|
c8894419acf5e56851de9741c5047bebd78acd1f |
|
12-May-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: fix raid5 'repair' operations commit bd2ab67030e9116f1e4aae1289220255412b37fd "md: close a livelock window in handle_parity_checks5" introduced a bug in handling 'repair' operations. After a repair operation completes we clear the state bits tracking this operation. However, they are cleared too early and this results in the code deciding to re-run the parity check operation. Since we have done the repair in memory the second check does not find a mismatch and thus does not do a writeback. Test results: $ echo repair > /sys/block/md0/md/sync_action $ cat /sys/block/md0/md/mismatch_cnt 51072 $ echo repair > /sys/block/md0/md/sync_action $ cat /sys/block/md0/md/mismatch_cnt 0 (also fix incorrect indentation) Cc: <stable@kernel.org> Tested-by: George Spelvin <linux@horizon.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
6bfe0b499082fd3950429017cd8ebf2a6c458aa5 |
|
30-Apr-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: support blocking writes to an array on device failure Allows a userspace metadata handler to take action upon detecting a device failure. Based on an original patch by Neil Brown. Changes: -added blocked_wait waitqueue to rdev -don't qualify Blocked with Faulty always let userspace block writes -added md_wait_for_blocked_rdev to wait for the block device to be clear, if userspace misses the notification another one is sent every 5 seconds -set MD_RECOVERY_NEEDED after clearing "blocked" -kill DoBlock flag, just test mddev->external Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
d7a420c9472a95c46600a0345434b7b166e0b9c7 |
|
28-Apr-2008 |
Nick Andrew <nick@nick-andrew.net> |
raid: remove leading TAB on printk messages MD drivers use one printk() call to print 2 log messages and the second line may be prefixed by a TAB character. It may also output a trailing space before newline. klogd (I think) turns the TAB character into the 2 characters '^I' when logging to a file. This looks ugly. Instead of a leading TAB to indicate continuation, prefix both output lines with 'raid:' or similar. Also remove any trailing space in the vicinity of the affected code and consistently end the sentences with a period. Signed-off-by: Nick Andrew <nick@nick-andrew.net> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
4ef197d87ad7d4bb326de3e9b8ecbb26f9e86253 |
|
28-Apr-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: raid5.c convert simple_strtoul to strict_strtoul strict_strtoul handles the open-coded sanity checks in raid5_store_stripe_cache_size and raid5_store_preread_threshold Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
8b3e6cdc53b7f29f7026955d6cb6902a49322a15 |
|
28-Apr-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: introduce get_priority_stripe() to improve raid456 write performance Improve write performance by preventing the delayed_list from dumping all its stripes onto the handle_list in one shot. Delayed stripes are now further delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when: * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list' * 'handle_list' is empty and i/o is being done to satisfy full stripe-width write requests * 'bypass_count' is less than 'bypass_threshold'. By default the threshold is 1, i.e. every other stripe handled is a preread stripe provided the top two conditions are false. Benchmark data: System: 2x Xeon 5150, 4x SATA, mem=1GB Baseline: 2.6.24-rc7 Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512) Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS) * patched: +13% * patched + preread_bypass_threshold = 0: +37% Changes since v1: * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and make it configurable. This defaults to fairness and modest performance gains out of the box. Changes since v2: * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not necessary, the important change was clearing STRIPE_DELAYED in add_stripe_bio and this has been moved out to make_request for the hang fix. * [neilb@suse.de]: simplify get_priority_stripe * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is sampled empty (+11%) * [dan.j.williams@intel.com]: decrement the bypass_count at the detection of stripes being naturally promoted off of hold_list +2%. Note, resetting bypass_count instead of decrementing on these events yields +4% but that is probably too aggressive. Changes since v3: * cosmetic fixups Tested-by: James W. Laferriere <babydr@baby-dragons.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
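One plausible reading of the selection policy, as a simplified sketch (the list and counter names follow the commit text; pick_next_stripe is a hypothetical stand-in for the real get_priority_stripe, which also handles the STRIPE_IO_STARTED and full-stripe-write bypass conditions):

```c
/* Illustrative only: prefer ordinary stripes, and release a held
 * (preread) stripe once enough ordinary work has bypassed it. */
static struct stripe_head *pick_next_stripe(struct raid5_private_data *conf)
{
	if (!list_empty(&conf->hold_list) &&
	    (list_empty(&conf->handle_list) ||
	     conf->bypass_count >= conf->bypass_threshold)) {
		conf->bypass_count = 0;	/* reset once a preread stripe runs */
		return list_entry(conf->hold_list.next,
				  struct stripe_head, lru);
	}
	if (!list_empty(&conf->handle_list)) {
		conf->bypass_count++;	/* hold_list bypassed again */
		return list_entry(conf->handle_list.next,
				  struct stripe_head, lru);
	}
	return NULL;			/* nothing runnable right now */
}
```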
|
e46b272b6608783ed7aa7b0594871550ce20b849 |
|
28-Apr-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
md: replace remaining __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
bd2ab67030e9116f1e4aae1289220255412b37fd |
|
11-Apr-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: close a livelock window in handle_parity_checks5 If a failure is detected after a parity check operation has been initiated, but before it completes handle_parity_checks5 will never quiesce operations on the stripe. Explicitly handle this case by "canceling" the parity check, i.e. clear the STRIPE_OP_CHECK flags and queue the stripe on the handle list again to refresh any non-uptodate blocks. Kernel versions >= 2.6.23 are susceptible. Cc: <stable@kernel.org> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
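The "cancel" amounts to dropping the in-flight check request and requeueing the stripe; a hedged sketch of the pattern (flag names from the commit text, surrounding error handling omitted):

```c
/* A failure arrived while a parity check was in flight: forget the
 * check and let handle_stripe re-read any non-uptodate blocks. */
clear_bit(STRIPE_OP_CHECK, &sh->ops.pending);
clear_bit(STRIPE_OP_CHECK, &sh->ops.ack);
set_bit(STRIPE_HANDLE, &sh->state);	/* back on the handle list */
```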
|
9ea85ebae1e05100cdb4807db4f265b0ede7aad8 |
|
20-Mar-2008 |
Andrew Morton <akpm@linux-foundation.org> |
drivers/md/raid5.c: fix printk warnings gcc-3.4.5 on sparc64: drivers/md/raid5.c: In function `raid5_end_read_request': drivers/md/raid5.c:1147: warning: long long unsigned int format, long unsigned int arg (arg 4) drivers/md/raid5.c:1164: warning: long long unsigned int format, long unsigned int arg (arg 3) drivers/md/raid5.c:1170: warning: long long unsigned int format, long unsigned int arg (arg 3) sector_t is u64, and we don't know what type the architecture uses to implement u64 (on some it is unsigned long). Cc: Neil Brown <neilb@suse.de> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
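The conventional fix is an explicit cast at each call site, since a %llu conversion cannot be paired portably with a bare sector_t argument (message text illustrative):

```c
/* sector_t may be unsigned long or u64 depending on the arch,
 * so widen explicitly before handing it to a %llu format. */
printk(KERN_ALERT "raid5: read error corrected, sector %llu.\n",
       (unsigned long long)sh->sector);
```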
|
6ed3003c19a96fe18edf8179c4be6fe14abbebbc |
|
06-Feb-2008 |
NeilBrown <neilb@suse.de> |
md: fix an occasional deadlock in raid5 raid5's 'make_request' function calls generic_make_request on underlying devices and if we run out of stripe heads, it could end up waiting for one of those requests to complete. This is bad as recursive calls to generic_make_request go on a queue and are not even attempted until make_request completes. So: don't make any generic_make_request calls in raid5 make_request until all waiting has been done. We do this by simply setting STRIPE_HANDLE instead of calling handle_stripe(). If we need more stripe_heads, raid5d will get called to process the pending stripe_heads, which will call generic_make_request from a different, non-recursive context. This change by itself causes a performance hit. So add a change so that raid5_activate_delayed is only called at unplug time, never in raid5d. This seems to bring back the performance numbers. Calling it in raid5d was sometimes too soon... Neil said: How about we queue it for 2.6.25-rc1 and then about when -rc2 comes out, we queue it for 2.6.24.y? Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Tested-by: dean gaudet <dean@arctic.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
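The heart of the fix is a one-line substitution in make_request, sketched here (release_stripe shown for context; the exact surroundings differ):

```c
/* Defer the work: raid5d will call handle_stripe() and issue the
 * lower-level requests from its own, non-recursive context. */
set_bit(STRIPE_HANDLE, &sh->state);	/* was: handle_stripe(sh); */
release_stripe(sh);
```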
|
d089c6af10c2be5988f03667d6d22fe6085fbe5e |
|
06-Feb-2008 |
NeilBrown <neilb@suse.de> |
md: change ITERATE_RDEV to rdev_for_each As this is more in line with common practice in the kernel. Also swap the args around to be more like list_for_each. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
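Usage after the rename, with the types and argument order of that era (the tmp list cursor was still part of the iterator then):

```c
mdk_rdev_t *rdev;
struct list_head *tmp;

/* was: ITERATE_RDEV(mddev, rdev, tmp) */
rdev_for_each(rdev, tmp, mddev) {
	if (test_bit(Faulty, &rdev->flags))
		continue;
	/* ... per-device work ... */
}
```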
|
c620727779f7cc8ea96efb71f0651a26349e59c1 |
|
06-Feb-2008 |
NeilBrown <neilb@suse.de> |
md: allow a maximum extent to be set for resyncing This allows userspace to control resync/reshape progress and synchronise it with other activities, such as shared access in a SAN, or backing up critical sections during a tricky reshape. Writing a number of sectors (which must be a multiple of the chunk size if such is meaningful) causes a resync to pause when it gets to that point. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
b47490c9bc73d0b34e4c194db40de183e592e446 |
|
06-Feb-2008 |
NeilBrown <neilb@suse.de> |
md: Update md bitmap during resync. Currently an md array with a write-intent bitmap does not update that bitmap to reflect successful partial resync. Rather the entire bitmap is updated when the resync completes. This is because there is no guarantee that resync requests will complete in order, and tracking each request individually is unnecessarily burdensome. However there is value in regularly updating the bitmap, so add code to periodically pause while all pending sync requests complete, then update the bitmap. Doing this only every few seconds (the same as the bitmap update time) does not noticeably affect resync performance. [snitzer@gmail.com: export bitmap_cond_end_sync] Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Mike Snitzer" <snitzer@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
0f94e87cdeaaac9f0f9a28a5dd2a5070b87cd3e8 |
|
09-Jan-2008 |
Dan Williams <dan.j.williams@intel.com> |
md: fix data corruption when a degraded raid5 array is reshaped We currently do not wait for the block from the missing device to be computed from parity before copying data to the new stripe layout. The change in the raid6 code is not technically needed as we don't delay data block recovery in the same way for raid6 yet. But making the change now is safer long-term. This bug exists in 2.6.23 and 2.6.24-rc. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
6c55be8b962f1bdc592d579e81fc27b11ea53dfc |
|
15-Nov-2007 |
Dan Williams <dan.j.williams@intel.com> |
raid5: fix unending write sequence <debug output from Joel's system> handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0 check 5: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ffcffcc0 written 0000000000000000 check 4: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fdd4e360 written 0000000000000000 check 3: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 check 2: state 0x1 toread 0000000000000000 read 0000000000000000 write 0000000000000000 written 0000000000000000 check 1: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800ff517e40 written 0000000000000000 check 0: state 0x6 toread 0000000000000000 read 0000000000000000 write fffff800fd4cae60 written 0000000000000000 locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0 for sector 7629696, rmw=0 rcw=0 </debug> These blocks were prepared to be written out, but were never handled in ops_run_biodrain(), so they remain locked forever. The operations flags are all clear which means handle_stripe() thinks nothing else needs to be done. This state suggests that the STRIPE_OP_PREXOR bit was sampled 'set' when it should not have been. This patch cleans up cases where the code looks at sh->ops.pending when it should be looking at the consistent stack-based snapshot of the operations flags. Report from Joel: Resync done. Patch fix this bug. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Joel Bertrand <joel.bertrand@systella.fr> Cc: <stable@kernel.org> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
2ad8b1ef11c98c5603580878aebf9f1bc74129e4 |
|
07-Nov-2007 |
Alan D. Brunelle <Alan.Brunelle@hp.com> |
Add UNPLUG traces to all appropriate places Added blk_unplug interface, allowing all invocations of unplugs to result in a generated blktrace UNPLUG. Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
def6ae26a9e69c3e6d0f0054524c76fd32420ecd |
|
05-Nov-2007 |
Neil Brown <neilb@suse.de> |
md: fix misapplied patch in raid5.c commit 4ae3f847e49e3787eca91bced31f8fd328d50496 ("md: raid5: fix clearing of biofill operations") did not get applied correctly, presumably due to substantial similarities between handle_stripe5 and handle_stripe6. This patch moves the chunk of new code from handle_stripe6 (where it isn't needed (yet)) to handle_stripe5. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Dan Williams" <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
4ae3f847e49e3787eca91bced31f8fd328d50496 |
|
23-Oct-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: raid5: fix clearing of biofill operations ops_complete_biofill() runs outside of spin_lock(&sh->lock) and clears the 'pending' and 'ack' bits. Since the test_and_ack_op() macro only checks against 'complete' it can get an inconsistent snapshot of pending work. Move the clearing of these bits to handle_stripe5(), under the lock. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Joel Bertrand <joel.bertrand@systella.fr> Signed-off-by: Neil Brown <neilb@suse.de> Cc: Stable <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
fd5d806266935179deda1502101624832eacd01f |
|
16-Oct-2007 |
Jens Axboe <jens.axboe@oracle.com> |
block: convert blkdev_issue_flush() to use empty barriers Then we can get rid of ->issue_flush_fn() and all the driver private implementations of that. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
6712ecf8f648118c3363c142196418f89a510b90 |
|
27-Sep-2007 |
NeilBrown <neilb@suse.de> |
Drop 'size' argument from bio_endio and bi_end_io As bi_end_io is only called once when the request is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
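The call-site change, in brief:

```c
/* Before: size reported back, and bi_end_io could be invoked as
 * parts of the bio completed. */
bio_endio(bio, bio->bi_size, error);

/* After: called exactly once per bio, so the size argument is
 * redundant, and ->bi_end_io now returns void. */
bio_endio(bio, error);
```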
|
e4d84909dd48b5e5806a5d18b881e1ca1610ba9b |
|
24-Sep-2007 |
Dan Williams <dan.j.williams@intel.com> |
raid5: fix 2 bugs in ops_complete_biofill 1/ ops_complete_biofill tried to avoid calling handle_stripe since all the state necessary to return read completions is available. However the process of determining whether more read requests are pending requires locking the stripe (to block add_stripe_bio from updating dev->toread). ops_complete_biofill can run in tasklet context, so rather than upgrading all the stripe locks from spin_lock to spin_lock_bh this patch just unconditionally reschedules handle_stripe after completing the read request. 2/ ops_complete_biofill needlessly qualified processing R5_Wantfill with dev->toread. The result being that the 'biofill' pending bit is cleared before handling the pending read-completions on dev->read. R5_Wantfill can be unconditionally handled because the 'biofill' pending bit prevents new R5_Wantfill requests from being seen by ops_run_biofill and ops_complete_biofill. Found-by: Yuri Tikhonov <yur@emcraft.com> [neilb@suse.de: simpler fix for bug 1 than moving code] Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
a2e0855182e2be26b252745b2bb7558705cb0dd2 |
|
12-Sep-2007 |
NeilBrown <neilb@suse.de> |
md: fix some bugs with growing raid5/raid6 arrays. The recent changes to raid5 to allow offload of parity calculation etc. introduced some bugs in the code for growing (i.e. adding a disk to) raid5 and raid6. This fixes them. Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
165125e1e480f9510a5ffcfbfee4e3ee38c05f23 |
|
24-Jul-2007 |
Jens Axboe <jens.axboe@oracle.com> |
[BLOCK] Get rid of request_queue_t typedef Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweep of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
eb0645a8b1f14da300f40bb9f424640cd1181fbf |
|
20-Jul-2007 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: fix kmap_atomic usage in async_memcpy Andrew Morton: [async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and ASYNC_TX_KMAP_SRC can ever be set. We'll end up using the same kmap slot for both src and dest and we get either corrupted data or a BUG. Evgeniy Polyakov: Btw, shouldn't it always be kmap_atomic() even if the flag is not set? Those pages are the usual ones returned by alloc_page(). So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and ASYNC_TX_KMAP_SRC flags. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
20c2df83d25c6a95affe6157a4c9cac4cf5ffaac |
|
20-Jul-2007 |
Paul Mundt <lethal@linux-sh.org> |
mm: Remove slab destructors from kmem_cache_create(). Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
|
f6dff381af01006ffae3c23cd2e07e30584de0ec |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: remove raid5 compute_block and compute_parity5 replaced by raid5_run_ops Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
830ea01673a397798d1281d2022615559f5001bb |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: handle_stripe5 - request io processing in raid5_run_ops I/O submission requests were already handled outside of the stripe lock in handle_stripe. Now that handle_stripe is only tasked with finding work, this logic belongs in raid5_run_ops. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
f0a50d3754c7f1b7f05f45b1c0b35d20445316b5 |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: handle_stripe5 - add request/completion logic for async expand ops When a stripe is being expanded bulk copying takes place to move the data from the old stripe to the new. Since raid5_run_ops only operates on one stripe at a time these bulk copies are handled in-line under the stripe lock. In the dma offload case we poll for the completion of the operation. After the data has been copied into the new stripe the parity needs to be recalculated across the new disks. We reuse the existing postxor functionality to carry out this calculation. By setting STRIPE_OP_POSTXOR without setting STRIPE_OP_BIODRAIN the completion path in handle stripe can differentiate expand operations from normal write operations. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
b5e98d65d34a1c11a2135ea8a9b2619dbc7216c8 |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: handle_stripe5 - add request/completion logic for async read ops When a read bio is attached to the stripe and the corresponding block is marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy the data from the stripe cache to the bio buffer. handle_stripe flags the blocks to be operated on with the R5_Wantfill flag. If new read requests arrive while raid5_run_ops is running they will not be handled until handle_stripe is scheduled to run again. Changelog: * cleanup to_read and to_fill accounting * do not fail reads that have reached the cache Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
e89f89629b5de76e504d1be75c82c4a6b2419583 |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: handle_stripe5 - add request/completion logic for async check ops Check operations are scheduled when the array is being resynced or an explicit 'check/repair' command was sent to the array. Previously check operations would destroy the parity block in the cache such that even if parity turned out to be correct the parity block would be marked !R5_UPTODATE at the completion of the check. When the operation can be carried out by a dma engine the assumption is that it can check parity as a read-only operation. If raid5_run_ops notices that the check was handled by hardware it will preserve the R5_UPTODATE status of the parity disk. When a check operation determines that the parity needs to be repaired we reuse the existing compute block infrastructure to carry out the operation. Repair operations imply an immediate write back of the data, so to differentiate a repair from a normal compute operation the STRIPE_OP_MOD_REPAIR_PD flag is added. Changelog: * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
f38e12199a94ca458e4f03c5a2c984fb80adadc5 |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: handle_stripe5 - add request/completion logic for async compute ops handle_stripe will compute a block when a backing disk has failed, or when it determines it can save a disk read by computing the block from all the other up-to-date blocks. Previously a block would be computed under the lock and subsequent logic in handle_stripe could use the newly up-to-date block. With the raid5_run_ops implementation the compute operation is carried out at a later time outside the lock. To preserve the old functionality we take advantage of the dependency chain feature of async_tx to flag the block as R5_Wantcompute and then let other parts of handle_stripe operate on the block as if it were up-to-date. raid5_run_ops guarantees that the block will be ready before it is used in another operation. However, this only works in cases where the compute and the dependent operation are scheduled at the same time. If a previous call to handle_stripe sets the R5_Wantcompute flag there is no facility to pass the async_tx dependency chain across successive calls to raid5_run_ops. The req_compute variable protects against this case. Changelog: * remove the req_compute BUG_ON Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
e33129d84130459dbb764a1a52a4bfceab3da978 |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: handle_stripe5 - add request/completion logic for async write ops After handle_stripe5 decides whether it wants to perform a read-modify-write, or a reconstruct write it calls handle_write_operations5. A read-modify-write operation will perform an xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the new data into the stripe (biodrain) and perform a postxor operation across all up-to-date blocks to generate the new parity. A reconstruct write is run when all blocks are already up-to-date in the cache so all that is needed is a biodrain and postxor. On the completion path STRIPE_OP_PREXOR will be set if the operation was a read-modify-write. The STRIPE_OP_BIODRAIN flag is used in the completion path to differentiate write-initiated postxor operations versus expansion-initiated postxor operations. Completion of a write triggers i/o to the drives. Changelog: * make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
d84e0f10d38393f617227f0c831a99c69294651f |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: common infrastructure for running operations with raid5_run_ops All the handle_stripe operations that are to be transitioned to use raid5_run_ops need a method to coherently gather work under the stripe-lock and hand that work off to raid5_run_ops. The 'get_stripe_work' routine runs under the lock to read all the bits in sh->ops.pending that do not have the corresponding bit set in sh->ops.ack. This modified 'pending' bitmap is then passed to raid5_run_ops for processing. The transition from 'ack' to 'completion' does not need similar protection as the existing release_stripe infrastructure will guarantee that handle_stripe will run again after a completion bit is set, and handle_stripe can tolerate a sh->ops.completed bit being set while the lock is held. A call to async_tx_issue_pending_all() is added to raid5d to kick the offload engines once all pending stripe operations work has been submitted. This enables batching of the submission and completion of operations. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
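A hedged sketch of the routine described above: snapshot the requested-but-unacknowledged bits while holding the lock, then hand that snapshot to raid5_run_ops (the real version also does per-bit accounting):

```c
/* Runs under spin_lock(&sh->lock); field names from the commit text. */
static unsigned long get_stripe_work(struct stripe_head *sh)
{
	unsigned long pending = sh->ops.pending & ~sh->ops.ack;

	sh->ops.ack |= pending;	/* mark these requests as in flight */
	return pending;		/* consumed by raid5_run_ops() after unlock */
}
```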
|
91c00924846a0034020451c280c76baa4299f9dc |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
md: raid5_run_ops - run stripe operations outside sh->lock When the raid acceleration work was proposed, Neil laid out the following attack plan: 1/ move the xor and copy operations outside spin_lock(&sh->lock) 2/ find/implement an asynchronous offload api The raid5_run_ops routine uses the asynchronous offload api (async_tx) and the stripe_operations member of a stripe_head to carry out xor+copy operations asynchronously, outside the lock. To perform operations outside the lock a new set of state flags is needed to track new requests, in-flight requests, and completed requests. In this new model handle_stripe is tasked with scanning the stripe_head for work, updating the stripe_operations structure, and finally dropping the lock and calling raid5_run_ops for processing. The following flags outline the requests that handle_stripe can make of raid5_run_ops: STRIPE_OP_BIOFILL - copy data into request buffers to satisfy a read request STRIPE_OP_COMPUTE_BLK - generate a missing block in the cache from the other blocks STRIPE_OP_PREXOR - subtract existing data as part of the read-modify-write process STRIPE_OP_BIODRAIN - copy data out of request buffers to satisfy a write request STRIPE_OP_POSTXOR - recalculate parity for new data that has entered the cache STRIPE_OP_CHECK - verify that the parity is correct STRIPE_OP_IO - submit i/o to the member disks (note this was already performed outside the stripe lock, but it made sense to add it as an operation type). The flow is: 1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending 2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the operation to the async_tx api 3/ async_tx triggers the completion callback routine to set sh->ops.complete and release the stripe 4/ handle_stripe runs again to finish the operation and optionally submit new operations that were previously blocked Note this patch just defines raid5_run_ops, subsequent commits (one per major operation type) modify handle_stripe to take advantage of this routine. Changelog: * removed ops_complete_biodrain in favor of ops_complete_postxor and ops_complete_write. * removed the raid5_run_ops workqueue * call bi_end_io for reads in ops_complete_biofill, saves a call to handle_stripe * explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown * fix race between async engines and bi_end_io call for reads, Neil Brown * remove unnecessary spin_lock from ops_complete_biofill * remove test_and_set/test_and_clear BUG_ONs, Neil Brown * remove explicit interrupt handling for channel switching, this feature was absorbed (i.e. it is now implicit) by the async_tx api * use return_io in ops_complete_biofill Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
45b4233caac05da0118b608a9fc2a40a9fc580cd |
|
09-Jul-2007 |
Dan Williams <dan.j.williams@intel.com> |
raid5: replace custom debug PRINTKs with standard pr_debug Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in favor of the global DEBUG definition. To get local debug messages just add '#define DEBUG' to the top of the file. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
a445685647e825c713175d180ffc8dd54d90589b |
|
09-Jul-2007 |
Dan Williams <dan.j.williams@intel.com> |
raid5: refactor handle_stripe5 and handle_stripe6 (v3) handle_stripe5 and handle_stripe6 have very deep logic paths handling the various states of a stripe_head. By introducing the 'stripe_head_state' and 'r6_state' objects, large portions of the logic can be moved to sub-routines. 'struct stripe_head_state' consumes all of the automatic variables that previously stood alone in handle_stripe5,6. 'struct r6_state' contains the handle_stripe6 specific variables like p_failed and q_failed. One of the nice side effects of the 'stripe_head_state' change is that it allows for further reductions in code duplication between raid5 and raid6. The following new routines are shared between raid5 and raid6: handle_completed_write_requests handle_requests_to_failed_array handle_stripe_expansion Changes: * v2: fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path * v3: removed the unused 'dirty' field from struct stripe_head_state * v3: coalesced open coded bi_end_io routines into return_io() Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
9bc89cd82d6f88fb0ca39b30445c329a430fd66b |
|
02-Jan-2007 |
Dan Williams <dan.j.williams@intel.com> |
async_tx: add the async_tx api The async_tx api provides methods for describing a chain of asynchronous bulk memory transfers/transforms with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. Code that is written to the api can optimize for asynchronous operation and the api will fit the chain of operations to the available offload resources. I imagine that any piece of ADMA hardware would register with the 'async_*' subsystem, and a call to async_X would be routed as appropriate, or be run in-line. - Neil Brown async_tx exploits the capabilities of struct dma_async_tx_descriptor to provide an api of the following general format: struct dma_async_tx_descriptor * async_<operation>(..., struct dma_async_tx_descriptor *depend_tx, dma_async_tx_callback cb_fn, void *cb_param) { struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>); struct dma_device *device = chan ? chan->device : NULL; int int_en = cb_fn ? 1 : 0; struct dma_async_tx_descriptor *tx = device ? device->device_prep_dma_<operation>(chan, len, int_en) : NULL; if (tx) { /* run <operation> asynchronously */ ... tx->tx_set_dest(addr, tx, index); ... tx->tx_set_src(addr, tx, index); ... async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param); } else { /* run <operation> synchronously */ ... <operation> ... async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param); } return tx; } async_tx_find_channel() returns a capable channel from its pool. The channel pool is organized as a per-cpu array of channel pointers. The async_tx_rebalance() routine is tasked with managing these arrays. In the uniprocessor case async_tx_rebalance() tries to spread responsibility evenly over channels of similar capabilities. For example if there are two copy+xor channels, one will handle copy operations and the other will handle xor. In the SMP case async_tx_rebalance() attempts to spread the operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor channel0 while cpu1 gets copy channel 1 and xor channel 1. When a dependency is specified async_tx_find_channel defaults to keeping the operation on the same channel. A xor->copy->xor chain will stay on one channel if it supports both operation types, otherwise the transaction will transition between a copy and a xor resource. Currently the raid5 implementation in the MD raid456 driver has been converted to the async_tx api. A driver for the offload engines on the Intel Xscale series of I/O processors, iop-adma, is provided in a later commit. With the iop-adma driver and async_tx, raid456 is able to offload copy, xor, and xor-zero-sum operations to hardware engines. On iop342 tiobench showed higher throughput for sequential writes (20 - 30% improvement) and sequential reads to a degraded array (40 - 55% improvement). For the other cases performance was roughly equal, +/- a few percentage points. On a x86-smp platform the performance of the async_tx implementation (in synchronous mode) was also +/- a few percentage points of the original implementation. According to 'top' on iop342 CPU utilization drops from ~50% to ~15% during a 'resync' while the speed according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s. 
The tiobench command line used for testing was: tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5 * iop342 had 1GB of memory available Details: * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making async_tx_find_channel a static inline routine that always returns NULL * when a callback is specified for a given transaction an interrupt will fire at operation completion time and the callback will occur in a tasklet. If the channel does not support interrupts then a live polling wait will be performed * the api is written as a dmaengine client that requests all available channels * In support of dependencies the api implicitly schedules channel-switch interrupts. The interrupt triggers the cleanup tasklet which causes pending operations to be scheduled on the next channel * Xor engines treat an xor destination address differently than a software xor routine. To the software routine the destination address is an implied source, whereas engines treat it as a write-only destination. This patch modifies the xor_blocks routine to take an explicit destination address to mirror the hardware. Changelog: * fixed a leftover debug print * don't allow callbacks in async_interrupt_cond * fixed xor_block changes * fixed usage of ASYNC_TX_XOR_DROP_DEST * drop dma mapping methods, suggested by Chris Leech * printk warning fixups from Andrew Morton * don't use inline in C files, Adrian Bunk * select the API when MD is enabled * BUG_ON xor source counts <= 1 * implicitly handle hardware concerns like channel switching and interrupts, Neil Brown * remove the per operation type list, and distribute operation capabilities evenly amongst the available channels * simplify async_tx_find_channel to optimize the fast path * introduce the channel_table_initialized flag to prevent early calls to the api * reorganize the code to mimic crypto * include mm.h as not all archs include it in dma-mapping.h * make the Kconfig options non-user visible, Adrian Bunk * move async_tx under crypto since it is meant as 'core' functionality, and the two may share algorithms in the future * move large inline functions into c files * checkpatch.pl fixes * gpl v2 only correction Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-By: NeilBrown <neilb@suse.de>
|
685784aaf3cd0e3ff5e36c7ecf6f441cdbf57f73 |
|
09-Jul-2007 |
Dan Williams <dan.j.williams@intel.com> |
xor: make 'xor_blocks' a library routine for use with async_tx The async_tx api tries to use a dma engine for an operation, but will fall back to an optimized software routine otherwise. Xor support is implemented using the raid5 xor routines. For organizational purposes this routine is moved to a common area. The following fixes are also made: * rename xor_block => xor_blocks, suggested by Adrian Bunk * ensure that xor.o initializes before md.o in the built-in case * checkpatch.pl fixes * mark calibrate_xor_blocks __init, Adrian Bunk Cc: Adrian Bunk <bunk@stusta.de> Cc: NeilBrown <neilb@suse.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
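The resulting library call: the destination is write-only, mirroring hardware xor engines, and the sources are passed as an array (declaration per include/linux/raid/xor.h):

```c
#include <linux/raid/xor.h>

/* xor two PAGE_SIZE source buffers into dest (dest is not an
 * implied source, unlike the old in-place raid5 xor helper) */
void *srcs[2] = { src_a, src_b };
xor_blocks(2, PAGE_SIZE, dest, srcs);
```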
|
44ce6294d07555c3d313757105fd44b78208407f |
|
10-May-2007 |
Linus Torvalds <torvalds@woody.linux-foundation.org> |
Revert "md: improve partition detection in md array" This reverts commit 5b479c91da90eef605f851508744bfe8269591a0. Quoth Neil Brown: "It causes an oops when auto-detecting raid arrays, and it doesn't seem easy to fix. The array may not be 'open' when do_md_run is called, so bdev->bd_disk might be NULL, so bd_set_size can oops. This whole approach of opening an md device before it has been assembled just seems to get more and more painful. I think I'm going to have to come up with something clever to provide both backward comparability with usage expectation, and sane integration into the rest of the kernel." Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
5b479c91da90eef605f851508744bfe8269591a0 |
|
09-May-2007 |
NeilBrown <neilb@suse.de> |
md: improve partition detection in md array md currently uses ->media_changed to make sure rescan_partitions is called on md arrays after they are assembled. However that doesn't happen until the array is opened, which is later than some people would like. So use blkdev_ioctl to do the rescan as soon as the array has been assembled. This means we can remove all the ->change infrastructure as it was only used to trigger a partition rescan. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
42b9bebe3fea3d3ce381bc6735a3fb50e6613f06 |
|
09-May-2007 |
NeilBrown <neilb@suse.de> |
md: remove the slash from the name of a kmem_cache used by raid5 SLUB doesn't like slashes as it wants to use the cache name as the name of a directory (or symlink) in sysfs. Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
5e55e2f5fc95b355d8aa649f346cff69904c8ade |
|
27-Mar-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: convert compile time warnings into runtime warnings ... still not sure why we need this .... Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
041ae52e265fc432ea5525b1c66720385c2d11f0 |
|
27-Mar-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: clear the congested_fn when stopping a raid5 If this mddev and queue got reused for another array that doesn't register a congested_fn, this function would get called incorrectly. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
3d37890baa4ca962f8a6b77525b8f3d0698eee09 |
|
27-Mar-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: allow raid4 arrays to be reshaped All that is missing is the function pointers in raid4_pers. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
6d3baf2eb8bd680b2d4f509bc3dbf4dcd6e27a40 |
|
05-Mar-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix for raid6 reshape The recent patch for raid6 reshape was missing a change that showed up in subsequent review. Many places in the raid5 code used "conf->raid_disks-1" to mean "number of data disks". With raid6 that had to be changed to "conf->raid_disks - conf->max_degraded" or similar. One place was missed. This bug means that if a raid6 reshape were aborted in the middle the recorded position would be wrong. On restart it would either fail (as the position wasn't on an appropriate boundary) or would leave a section of the array unreshaped, causing data corruption. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
f416885ef4950501dd3858d1afa1137a0c2905c5 |
|
01-Mar-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: add support for reshape of a raid6 i.e. one or more drives can be added and the array will re-stripe while on-line. Most of the interesting work was already done for raid5. This just extends it to raid6. mdadm newer than 2.6 is needed for complete safety, however any version of mdadm which supports raid5 reshape will do a good enough job in almost all cases (an 'echo repair > /sys/block/mdX/md/sync_action' is recommended after a reshape that was aborted and had to be restarted with such a version of mdadm). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
b4c4c7b8095298ff4ce20b40bf180ada070812d0 |
|
01-Mar-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: restart a (raid5) reshape that has been aborted due to a read/write error An error always aborts any resync/recovery/reshape on the understanding that it will immediately be restarted if that still makes sense. However a reshape currently doesn't get restarted. With this patch it does. To avoid restarting when it is not possible to do work, we call into the personality to check that a reshape is ok, and strengthen raid5_check_reshape to fail if there are too many failed devices. We also break some code out into a separate function: remove_and_add_spares as the indent level for that code was getting crazy. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
387bb17374c5fa057462d00d4ba941d49f45de4d |
|
08-Feb-2007 |
Neil Brown <neilb@suse.de> |
[PATCH] md: fix various bugs with aligned reads in RAID5 It is possible for raid5 to be sent a bio that is too big for an underlying device. So if it is a READ that we pass straight down to a device, it will fail and confuse RAID5. So in 'chunk_aligned_read' we check that the bio fits within the parameters for the target device and if it doesn't fit, fall back on reading through the stripe cache and making lots of one-page requests. Note that this is the earliest time we can check against the device because earlier we don't have a lock on the device, so it could change underneath us. Also, the code for handling a retry through the cache when a read fails has not been tested and was badly broken. This patch fixes that code. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Kai" <epimetreus@fastmail.fm> Cc: <stable@suse.de> Cc: <org@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
c20086de9319ac406f1e96ad459763c9f9965b18 |
|
26-Jan-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: remove unnecessary printk when raid5 gets an unaligned read. raid5_mergeable_bvec tries to ensure that raid5 never sees a read request that does not fit within just one chunk. However as we must always accept a single-page read, that is not always possible. So when "in_chunk_boundary" fails, it might be unusual, but it is not a problem and printing a message every time is a bad idea. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
2a2275d630b982e5f90206f9bc497f6695a3ec5d |
|
26-Jan-2007 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix potential memalloc deadlock in md If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held, it is possible for a deadlock to eventuate. This happens if the array was marked 'clean', and the memalloc triggers a write-out to the md device. For the writeout to succeed, the array must be marked 'dirty', and that requires getting the mddev_lock. So, before attempting a GFP_KERNEL allocation while holding the lock, make sure the array is marked 'dirty' (unless it is currently read-only). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
802ba064c49f655d20fed563f2a4924c8256ea10 |
|
13-Dec-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Don't assume that READ==0 and WRITE==1 - use the names explicitly Thanks Jens for alerting me to this. Cc: Jens Axboe <jens.axboe@oracle.com> Cc: <raziebe@gmail.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
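The pattern in question, with an illustrative call site:

```c
/* Compare against the symbolic name rather than assuming WRITE == 1: */
const int rw = bio_data_dir(bio);

if (rw == WRITE)
	md_write_start(mddev, bio);
```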
|
c2b00852fbae4f8c45c2651530ded3bd01bde814 |
|
10-Dec-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5 Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error to higher levels. While this should be sufficient, it is safer to explicitly set the error code as well - less room for confusion. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
b8c6b645563d641df91fdcfd84a9c73c91d75b61 |
|
10-Dec-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: remove some old ifdefed-out code from raid5.c There are some vestiges of old code that was used for bypassing the stripe cache on reads in raid5.c. This was never updated after the change from buffer_heads to bios, but was left as a reminder. That functionality has now been implemented in a completely different way, so the old code can go. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
b875e531fc82db592d6093594593d5cafde0a1cd |
|
10-Dec-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix innocuous bug in raid6 stripe_to_pdidx stripe_to_pdidx finds the index of the parity disk for a given stripe. It assumes raid5 in that it uses "disks-1" to determine the number of data disks. This is incorrect for raid6 but fortunately the two usages cancel each other out. The only way that 'data_disks' affects the calculation of pd_idx in raid5_compute_sector is when it is divided into the sector number. But as that sector number is calculated by multiplying in the wrong value of 'data_disks' the division produces the right value. So it is innocuous but needs to be fixed. Also change the calculation of raid_disks in compute_blocknr to make it more obviously correct (it seems at first to always use disks-1 too). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
5248861511d6aae4997a5aa7152824d87587b0b6 |
|
10-Dec-2006 |
Raz Ben-Jehuda(caro) <raziebe@gmail.com> |
[PATCH] md: enable bypassing cache for reads Call the chunk_aligned_read where appropriate. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
46031f9a38a9773021f1872abc713d62467ac22e |
|
10-Dec-2006 |
Raz Ben-Jehuda(caro) <raziebe@gmail.com> |
[PATCH] md: allow reads that have bypassed the cache to be retried on failure If a bypass-the-cache read fails, we simply try again through the cache. If it fails again it will trigger normal recovery procedures. update 1: From: NeilBrown <neilb@suse.de> 1/ chunk_aligned_read and retry_aligned_read assume that data_disks == raid_disks - 1 which is not true for raid6. So when an aligned read request bypasses the cache, we can get the wrong data. 2/ The cloned bio is being used-after-free in raid5_align_endio (to test BIO_UPTODATE). 3/ We forgot to add rdev->data_offset when submitting a bio for aligned-read. 4/ clone_bio calls blk_recount_segments and then we change bi_bdev, so we need to invalidate the segment counts. 5/ We don't de-reference the rdev when the read completes. This means we need to record the rdev so it is still available in the end_io routine. Fortunately bi_next in the original bio is unused at this point so we can stuff it in there. 6/ We leak a cloned bio if the target rdev is not usable. From: NeilBrown <neilb@suse.de> update 2: 1/ When aligned requests fail (read error) they need to be retried via the normal method (stripe cache). As we cannot be sure that we can process a single read in one go (we may not be able to allocate all the stripes needed) we store a bio-being-retried and a list of bioes-that-still-need-to-be-retried. When we find a bio that needs to be retried, we should add it to the list, not to the single-bio... 2/ We were never incrementing 'scnt' when resubmitting failed aligned requests. [akpm@osdl.org: build fix] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
f679623f50545bc0577caf2d0f8675b61162f059 |
|
10-Dec-2006 |
Raz Ben-Jehuda(caro) <raziebe@gmail.com> |
[PATCH] md: handle bypassing the read cache (assuming nothing fails) Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
23032a0eb97c8eaae8ac9d17373b53b19d0f5413 |
|
10-Dec-2006 |
Raz Ben-Jehuda(caro) <raziebe@gmail.com> |
[PATCH] md: define raid5_mergeable_bvec This will encourage read requests to be on only one device, so we will often be able to bypass the cache for read requests. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
e18b890bb0881bbab6f4f1a6cd20d9c60d66b003 |
|
07-Dec-2006 |
Christoph Lameter <clameter@sgi.com> |
[PATCH] slab: remove kmem_cache_t Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
0692c6b1cf5537b190f90fb5903f1af89a41b0a8 |
|
09-Nov-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix sizing problem with raid5-reshape and CONFIG_LBD=n I forgot to cast the size-in-blocks to (loff_t) before shifting up to a size-in-bytes. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
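The fix pattern, assuming the sizing code shifts a kilobyte count up to bytes (call site illustrative):

```c
/* Widen before shifting: with CONFIG_LBD=n the sizes are 32-bit,
 * and array_size << 10 would overflow without the cast. */
i_size_write(bdev->bd_inode, (loff_t)mddev->array_size << 10);
```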
|
52e5f9d1cf0b10b24317037dcd1c9be38ca7011c |
|
03-Oct-2006 |
Eric Sesterhenn <snakebyte@gmx.de> |
BUG_ON cleanup for drivers/md/ This changes two if() BUG(); usages to BUG_ON(); so people can disable it safely. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
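The conversion pattern (condition illustrative):

```c
/* Before: the branch body cannot be configured away cleanly. */
if (atomic_read(&sh->count) != 0)
	BUG();

/* After: a no-op when CONFIG_BUG is disabled. */
BUG_ON(atomic_read(&sh->count) != 0);
```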
|
f022b2fddd97795bc9333ffb6862eacfa95c6a95 |
|
03-Oct-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: add a ->congested_fn function for raid5/6 This is very different from other raid levels: all requests go through a 'stripe cache', which already provides congestion management. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
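A hedged sketch of what such a congested_fn looks like for raid5 (the predicate is illustrative; the real one inspects the stripe-cache state, e.g. whether any inactive stripes remain):

```c
/* Report congestion while the stripe cache cannot hand out stripes. */
static int raid5_congested(void *data, int bits)
{
	mddev_t *mddev = data;
	raid5_conf_t *conf = mddev_to_conf(mddev);

	return conf->inactive_blocked || list_empty(&conf->inactive_list);
}
```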
|
c04be0aa82ff535e3676ab3e573957bdeef41879 |
|
03-Oct-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Improve locking around error handling The error handling routines don't use proper locking, and so two concurrent errors could trigger a problem. So: - use test-and-set and test-and-clear to synchronise the In_sync bits with the ->degraded count - use the spinlock to protect updates to the degraded count (could use an atomic_t but that would be a bigger change in code, and isn't really justified) - remove unnecessary locking in raid5 Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
2d2063ceae73660d5142f4754d50a75b655fd1f9 |
|
03-Oct-2006 |
Coywolf Qi Hunt <qiyong@freeforge.net> |
[PATCH] md: remove unnecessary variable x in stripe_to_pdidx() Signed-off-by: Coywolf Qi Hunt <qiyong@freeforge.net> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
02c2de8cc835885bdff51a8bfd6c0b659b969f50 |
|
03-Oct-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: remove the working_disks and failed_disks from raid5 state data. They are not needed. conf->failed_disks is the same as mddev->degraded and conf->working_disks is conf->raid_disks - mddev->degraded. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
850b2b420cd5b363ed4cf48a8816d656c8b5251b |
|
03-Oct-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: replace magic numbers in sb_dirty with well defined bit flags Instead of magic numbers (0,1,2,3) in sb_dirty, we have some flags instead: MD_CHANGE_DEVS Some device state has changed requiring superblock update on all devices. MD_CHANGE_CLEAN The array has transitioned from 'clean' to 'dirty' or back, requiring a superblock update on active devices, but possibly not on spares. MD_CHANGE_PENDING A superblock update is underway. We wait for an update to complete by waiting for all flags to be clear. A flag can be set at any time, even during an update, without risk that the change will be lost. Stop exporting md_update_sb - it isn't needed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
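A minimal rendering of the scheme described above, as bits in a flags word (positions as in the change; the wait is sketched, not verbatim):

```c
#define MD_CHANGE_DEVS		0	/* device state changed; update all superblocks */
#define MD_CHANGE_CLEAN		1	/* clean <-> dirty transition */
#define MD_CHANGE_PENDING	2	/* superblock update underway */

set_bit(MD_CHANGE_DEVS, &mddev->flags);
md_wakeup_thread(mddev->thread);
/* an update is complete once every flag has cleared: */
wait_event(mddev->sb_wait, mddev->flags == 0);
```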
|
d69504325978c461b51b03cca49626026970307b |
|
10-Jul-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: include sector number in messages about corrected read errors This is generally useful, but particularly helps see if it is the same sector that always needs correcting, or different ones. [akpm@osdl.org: fix printk warnings] Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
ae3c20ccf84c88d45616f12122f781a900118f09 |
|
10-Jul-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix some small races in bitmap plugging in raid5 The comment gives more details, but I didn't quite have the sequencing right, so there was room for races to leave bits unset in the on-disk bitmap for short periods of time. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
7c785b7a18dc30572a49c6b75efd384269735d14 |
|
10-Jul-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix a plug/unplug race in raid5 When a device is unplugged, requests are moved from one or two (depending on whether a bitmap is in use) queues to the main request queue. So whenever requests are put on either of those queues, we should make sure the raid5 array is 'plugged'. However we don't. We currently plug the raid5 queue just before putting requests on queues, so there is room for a race. If something unplugs the queue at just the wrong time, requests will be left on the queue and nothing will want to unplug them. Normally something else will plug and unplug the queue fairly soon, but there is a risk that nothing will. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
ff4e8d9a9f46e3a7f89d14ade52fe5d53a82c022 |
|
10-Jul-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix resync speed calculation for restarted resyncs We introduced 'io_sectors' recently so we could count the sectors that cause io during resync separately from sectors which didn't cause IO - there can be a difference if a bitmap is being used to accelerate resync. However when a speed is reported, we find the number of sectors processed recently by subtracting an oldish io_sectors count from a current 'curr_resync' count. This is wrong because curr_resync counts all sectors, not just io sectors. So, add a field to mddev to store the current io_sectors separately from curr_resync, and use that in the calculations. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
0b8c9de05c2a860fe6b02fedcb48763bcee648b3 |
|
10-Jul-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: delay starting md threads until array is completely setup When an array is started we start one or two threads (two if there is a reshape or recovery that needs to be completed). We currently start these *before* the array is completely set up and in particular before queue->queuedata is set. If the thread actually starts very quickly on another CPU, we can end up dereferencing queue->queuedata and oops. This patch also makes sure we don't try to start a recovery if a reshape is being restarted. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
f4370781d83cd2e52eb515e4663155e8091e4d4e |
|
10-Jul-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: possible fix for unplug problem I have reports of a problem with raid5 which turns out to be because the raid5 device gets stuck in a 'plugged' state. This shouldn't be able to happen as 3msec after it gets plugged it should get unplugged. However it happens nonetheless. This patch fixes the problem and is a reasonable thing to do, though it might hurt performance slightly in some cases. Until I can find the real problem, we should probably have this workaround in place. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
6ab3d5624e172c553004ecc862bfeac16d9d68b7 |
|
30-Jun-2006 |
Jörn Engel <joern@wohnheim.fh-wedel.de> |
Remove obsolete #include <linux/config.h> Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
|
cfb9e32f2ff32ef5265c1c80fe68dd1a7f03a604 |
|
29-Jun-2006 |
Adrian Bunk <bunk@stusta.de> |
[PATCH] drivers/md/raid5.c: remove an unused variable Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
3285edf152cefff482f95ceb90b1bd46ac169df1 |
|
26-Jun-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Fix bug that stops raid5 resync from happening As data_disks is *less* than raid_disks, the current test here is obviously wrong. And as the difference is already available in conf->max_degraded, it makes much more sense to use that. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
52c03291a832d86c093996d0491a326de4a6b79b |
|
26-Jun-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: split reshape portion of raid5 sync_request into a separate function ... as raid5 sync_request is WAY too big. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
5fd6c1dce06ec24ef3de20fe0c7ecf2ba9fe5ef9 |
|
26-Jun-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: allow checkpoint of recovery with version-1 superblock For a while we have had checkpointing of resync. The version-1 superblock allows recovery to be checkpointed as well, and this patch implements that. Due to early carelessness we need to add a feature flag to signal that the recovery_offset field is in use, otherwise older kernels would assume that a partially recovered array is in fact fully recovered. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
16a53ecc35f2a80dc285be2e769768847d89ca37 |
|
26-Jun-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: merge raid5 and raid6 code There is a lot of commonality between raid5.c and raid6main.c. This patches merges both into one module called raid456. This saves a lot of code, and paves the way for online raid5->raid6 migrations. There is still duplication, e.g. between handle_stripe5 and handle_stripe6. This will probably be cleaned up later. Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
8932c2e0dcae52e73430878fd8a7a7800176eada |
|
26-Jun-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: remove arbitrary limit on chunk size The largest chunk size the code can support without substantial surgery is 2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day, the 'chunksize' should change to a sector-shift instead of a byte-count. Then no limit would be needed. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
78bafebd461049a419df2f4b1f25efbee73c2439 |
|
02-Apr-2006 |
Eric Sesterhenn <snakebyte@gmx.de> |
BUG_ON() Conversion in md/raid5.c This changes if () BUG(); constructs to BUG_ON(), which is cleaner and can be better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
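[Editor's note: the transformation, modelled as a standalone C program with stand-in BUG()/BUG_ON() macros; the real kernel macros differ in detail.]

    #include <stdio.h>
    #include <stdlib.h>

    #define BUG() \
        do { fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); abort(); } while (0)
    #define BUG_ON(cond) do { if (cond) BUG(); } while (0)

    int main(void)
    {
        int nr_pending = 0;

        /* old style */
        if (nr_pending < 0)
            BUG();

        /* new style: one statement, easier to grep, easier to compile out */
        BUG_ON(nr_pending < 0);
        return 0;
    }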
|
f6344757a92e5466bf8c23a74ee8f972ff35cb05 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Remove bi_end_io call out from under a spinlock raid5 overloads bi_phys_segments to count the number of blocks that the request was broken into so that it knows when the bio is completely handled. Accessing this must always be done under a spinlock. In one case we also call bi_end_io under that spinlock, which probably isn't ideal as bi_end_io could be expensive (even though it isn't allowed to sleep). So we reduce the range of the spinlock to just accessing bi_phys_segments. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
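[Editor's note: a user-space model of the narrowed lock scope — only the counter access stays under the lock, and the completion callback runs after it is dropped. Pthread locking stands in for the kernel spinlock; the structure is illustrative.]

    #include <pthread.h>
    #include <stdio.h>

    struct bio {
        int phys_segments;            /* raid5 overloads this as a refcount */
        void (*end_io)(struct bio *); /* completion callback */
    };

    static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

    static void bio_done(struct bio *b) { printf("bio %p complete\n", (void *)b); }

    static void complete_segment(struct bio *b)
    {
        int finished;

        pthread_mutex_lock(&device_lock);
        finished = (--b->phys_segments == 0);  /* only the count needs the lock */
        pthread_mutex_unlock(&device_lock);

        if (finished)
            b->end_io(b);                      /* callback runs lock-free */
    }

    int main(void)
    {
        struct bio b = { .phys_segments = 2, .end_io = bio_done };
        complete_segment(&b);
        complete_segment(&b);
        return 0;
    }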
|
b3b46be38aef5c46a4e144f1bcb8d0cc6bf3ff97 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Remove some stray semi-colons after functions called in macros wait_event_lock_irq puts a ';' after its usage of the 4th arg, so we don't need to add one. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
df8e7f7639bab3b2cc536a1d30d5593d65251778 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Improve comments about locking situation in raid5 make_request Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
e464eafdb4400c6d6576ba3840d8bd40340f8a96 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Support suspending of IO to regions of an md array This allows user-space to access data safely. This is needed for raid5 reshape as user-space needs to take a backup of the first few stripes before allowing reshape to commence. It will also be useful in cluster-aware raid1 configurations so that all cluster members can leave a section of the array untouched while a resync/recovery happens. A 'start' and 'end' of the suspended range are written to 2 sysfs attributes. Note that only one range can be suspended at a time. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
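[Editor's note: a sketch of the overlap test such a suspended window implies, with hypothetical names — the actual attribute names and the wait mechanics are not spelled out in the commit text.]

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long long suspend_lo, suspend_hi;  /* window, in sectors */

    /* a request touching the suspended range must wait */
    static bool io_must_wait(unsigned long long sector, unsigned long long len)
    {
        return sector < suspend_hi && sector + len > suspend_lo;
    }

    int main(void)
    {
        suspend_lo = 1024;
        suspend_hi = 2048;
        printf("write at 1500: %s\n", io_must_wait(1500, 8) ? "blocked" : "allowed");
        printf("write at 4096: %s\n", io_must_wait(4096, 8) ? "blocked" : "allowed");
        return 0;
    }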
|
63c70c4f3a30e77e6f445bd16eff7934a031ebd3 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Split reshape handler in check_reshape and start_reshape check_reshape checks validity and does things that can be done instantly - like adding devices to raid1. start_reshape initiates a restriping process to convert the whole array. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
b578d55fdd80140f657130abd85aebeb345755fb |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Only checkpoint expansion progress occasionally Instead of checkpointing at each stripe, only checkpoint when a new write would overwrite uncheckpointed data. Block any write to the uncheckpointed area. Arbitrarily checkpoint at least every 3Meg. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
f67055780caac6a99f43834795c43acf99eba6a6 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Checkpoint and allow restart of raid5 reshape We allow the superblock to record an 'old' and a 'new' geometry, and a position where any conversion is up to. The geometry allows for changing chunksize, layout and level as well as number of devices. When using a version-0.90 superblock, we convert the version to 0.91 while the conversion is happening so that an old kernel will refuse to assemble the array. For version-1, we use a feature bit for the same effect. When starting an array we check for an incomplete reshape and restart the reshape process if needed. If the reshape stopped at an awkward time (like when updating the first stripe) we refuse to assemble the array, and let user-space worry about it. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
292695531ae4019bb15deedc121b218d1908b648 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Final stages of raid5 expand code This patch adds raid5_reshape and end_reshape which will start and finish the reshape processes. raid5_reshape is only enabled if CONFIG_MD_RAID5_RESHAPE is set, to discourage accidental use. Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry, and make sure that you have backups, just in case. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
ccfcc3c10b2a5cb8fd3c918199a4ff904fc6fb3e |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Core of raid5 resize process This patch provides the core of the resize/expand process. sync_request notices if a 'reshape' is happening and acts accordingly. It allocates new stripe_heads for the next chunk-wide-stripe in the target geometry, marking them STRIPE_EXPANDING. Then it finds which stripe heads in the old geometry can provide data needed by these and marks them STRIPE_EXPAND_SOURCE. This causes stripe_handle to read all blocks on those stripes. Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are needed are copied into the corresponding STRIPE_EXPANDING stripe_head. Once a STRIPE_EXPANDING stripe_head is full, it is marked STRIPE_EXPAND_READY and then is written out and released. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
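[Editor's note: the three stripe states the text walks through, compressed into one illustrative enum. The names come from the commit; the comments are a simplification of the flow it describes.]

    enum stripe_expand_state {
        STRIPE_EXPANDING,       /* new-geometry stripe, blocks being filled  */
        STRIPE_EXPAND_SOURCE,   /* old-geometry stripe, all blocks read in   */
        STRIPE_EXPAND_READY,    /* new stripe full: write out and release    */
    };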
|
7ecaa1e6a1ad69862e9980b6c777e11f26c4782d |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Infrastructure to allow normal IO to continue while array is expanding We need to allow that different stripes are of different effective sizes, and use the appropriate size. Also, when a stripe is being expanded, we must block any IO attempts until the stripe is stable again. Key elements in this change are: - each stripe_head gets a 'disks' field which is part of the key, thus there can sometimes be two stripe heads of the same area of the array, but covering different numbers of devices. One of these will be marked STRIPE_EXPANDING and so won't accept new requests. - conf->expand_progress tracks how the expansion is progressing and is used to determine whether the target part of the array has been expanded yet or not. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
ad01c9e3752f4ba4f3d99c89b7370fa4983a25b5 |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Allow stripes to be expanded in preparation for expanding an array Before a RAID-5 can be expanded, we need to be able to expand the stripe-cache data structure. This requires allocating new stripes in a new kmem_cache. If this succeeds, we copy cache pages over and release the old stripes and kmem_cache. We then allocate new pages. If that fails, we leave the stripe cache at its new size. It isn't worth the effort to shrink it back again. Unfortunately this means we need two kmem_cache names as, for a short period of time, we have two kmem_caches. So they are raid5/%s and raid5/%s-alt Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
b55e6bfcd23cb2f7249095050c649f7aea813f9f |
|
27-Mar-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Split disks array out of raid5 conf structure so it is easier to grow The remainder of this batch implements raid5 reshaping. Currently the only shape change that is supported is adding a device, but it is envisioned that changing the chunksize and layout will also be supported, as well as changing the level (e.g. 1->5, 5->6). The reshape process naturally has to move all of the data in the array, and so should be used with caution. It is believed to work, and some testing does support this, but wider testing would be great for increasing my confidence. You will need a version of mdadm newer than 2.3.1 to make use of raid5 growth. This is because mdadm needs to take a copy of a 'critical section' at the start of the array in case there is a crash at an awkward moment. On restart, mdadm will restore the critical section and allow reshape to continue. I hope to release a 2.4-pre by early next week - it still needs a little more polishing. This patch: Previously the array of disk information was included in the raid5 'conf' structure which was allocated to an appropriate size. This makes it awkward to change the size of that array. So we split it off into a separate kmalloced array which will require a little extra indexing, but is much easier to grow. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
29fc7e3e70a05e9eea28afb6707a39c1a53e2f66 |
|
03-Feb-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: Assorted little md fixes - version-1 superblock + The default_bitmap_offset is in sectors, not bytes. + the 'size' field in the superblock is in sectors, not KB - raid0_run should return a negative number on error, not '1' - raid10_read_balance should not return a valid 'disk' number if ->rdev turned out to be NULL - kmem_cache_destroy doesn't like being passed a NULL. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
858119e159384308a5dde67776691a2ebf70df0f |
|
14-Jan-2006 |
Arjan van de Ven <arjan@infradead.org> |
[PATCH] Unlinline a bunch of other functions Remove the "inline" keyword from a bunch of big functions in the kernel with the goal of shrinking it by 30kb to 40kb Signed-off-by: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jeff Garzik <jgarzik@pobox.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
4dbcdc751cb25ffca3a8374cbc5ab6de961cc545 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: count corrected read errors per drive Store this total in the superblock (as appropriate), and make it available to userspace via sysfs. Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
d9d166c2a9d5d01af34396793950aa695883eed4 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: allow array level to be set textually via sysfs Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
2604b703b6b3db80e3c75ce472a54dfd0b7bf9f4 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: remove personality numbering from md md supports multiple different RAID levels, each being implemented by a 'personality' (which is often in a separate module). These personalities have fairly artificial 'numbers'. The numbers are used to: 1- provide an index into an array where the various personalities are recorded 2- identify the module (via an alias) which implements a particular personality. Neither of these uses really justifies the existence of personality numbers. The array can be replaced by a linked list which is searched (array lookup only happens very rarely). Module identification can be done using an alias based on level rather than 'personality' number. The current 'raid5' module supports two levels (4 and 5) but only one personality. This slight awkwardness (which was handled in the mapping from level to personality) can be better handled by allowing raid5 to register 2 personalities. With this change in place, the core md module does not need to have an exhaustive list of all possible personalities, so other personalities can be added independently. This patch also moves the check for chunksize being non-zero into the ->run routines for the personalities that need it, rather than having it in core-md. This has a side effect of allowing 'faulty' and 'linear' not to have a chunk-size set. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
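[Editor's note: a user-space model of the replacement scheme — a searched linked list keyed by level, which lets one module register two personalities. All names here are illustrative, not the md API.]

    #include <stdio.h>

    struct personality {
        const char *name;
        int level;
        struct personality *next;
    };

    static struct personality *pers_list;

    static void register_personality(struct personality *p)
    {
        p->next = pers_list;
        pers_list = p;
    }

    static struct personality *find_personality(int level)
    {
        struct personality *p;
        for (p = pers_list; p; p = p->next)   /* lookup is rare, a walk is fine */
            if (p->level == level)
                return p;
        return NULL;
    }

    int main(void)
    {
        /* one module can now register several levels */
        static struct personality raid5 = { "raid5", 5 };
        static struct personality raid4 = { "raid4", 4 };
        register_personality(&raid5);
        register_personality(&raid4);
        printf("level 4 -> %s\n", find_personality(4)->name);
        return 0;
    }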
|
fccddba060f2b4916a30aa27acc3d03b01bb981e |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: tidy up raid5/6 hash table code - replace open-coded hash chain with hlist macros - Fix hash-table size at one page - it is already quite generous, so there will never be a need to use multiple pages, so no need for __get_free_pages No functional change. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
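[Editor's note: a user-space model keeping only the shape of the hashed stripe lookup. The real patch uses hlist_head/hlist_node and fixes the table at one page; the table size and next-pointer chaining here are stand-ins.]

    #include <stdio.h>

    #define HASH_BUCKETS 256
    #define stripe_hash(sect) ((unsigned)((sect) % HASH_BUCKETS))

    struct stripe_head {
        unsigned long long sector;
        struct stripe_head *next;   /* stand-in for the patch's hlist_node */
    };

    static struct stripe_head *stripe_hashtbl[HASH_BUCKETS];

    static void insert_hash(struct stripe_head *sh)
    {
        unsigned h = stripe_hash(sh->sector);
        sh->next = stripe_hashtbl[h];
        stripe_hashtbl[h] = sh;
    }

    static struct stripe_head *find_stripe(unsigned long long sector)
    {
        struct stripe_head *sh;
        for (sh = stripe_hashtbl[stripe_hash(sector)]; sh; sh = sh->next)
            if (sh->sector == sector)
                return sh;
        return NULL;
    }

    int main(void)
    {
        struct stripe_head a = { .sector = 4096 };
        insert_hash(&a);
        printf("found stripe: %s\n", find_stripe(4096) ? "yes" : "no");
        return 0;
    }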
|
9ffae0cf3ea02f75d163922accfd3e592d87adde |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: convert md to use kzalloc throughout Replace multiple kmalloc/memset pairs with kzalloc calls. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
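[Editor's note: the substitution in user-space terms — calloc is to malloc+memset roughly what kzalloc is to kmalloc+memset. The struct is a placeholder.]

    #include <stdlib.h>
    #include <string.h>

    struct raid5_conf { int max_nr_stripes; /* ... */ };

    int main(void)
    {
        /* before: allocate, then zero */
        struct raid5_conf *a = malloc(sizeof(*a));
        if (a)
            memset(a, 0, sizeof(*a));

        /* after: one zeroing allocation */
        struct raid5_conf *b = calloc(1, sizeof(*b));

        free(a);
        free(b);
        return 0;
    }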
|
2d1f3b5d1b2cd11a162eb29645df749ec0036413 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: clean up 'page' related names in md Substitute: page_cache_get -> get_page page_cache_release -> put_page PAGE_CACHE_SHIFT -> PAGE_SHIFT PAGE_CACHE_SIZE -> PAGE_SIZE PAGE_CACHE_MASK -> PAGE_MASK __free_page -> put_page because we aren't using the page cache, we are just using pages. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
9910f16af35419a5382fa7850eecc220103036fa |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix up some rdev rcu locking in raid5/6 There is this "FIXME" comment with a typo in it!! that has been annoying me for days, so I just had to remove it. conf->disks[i].rdev should only be accessed if - we know we hold a reference or - the mddev->reconfig_sem is down or - we have a rcu_readlock handle_stripe was referencing rdev in three places without any of these. For the first two, get an rcu_readlock. For the last, the same access (md_sync_acct call) is made a little later after the rdev has been claimed under an rcu_readlock, if R5_Syncio is set. So just use that access... However R5_Syncio isn't really needed as the 'syncing' variable contains the same information. So use that instead. Issues, comment, and fix are identical in raid5 and raid6. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
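[Editor's note: a kernel-style sketch of the rule being enforced. The field and flag names follow the md code of this era, but the fragment is illustrative, not the patch itself.]

    rcu_read_lock();
    rdev = rcu_dereference(conf->disks[i].rdev);
    if (rdev && !test_bit(Faulty, &rdev->flags))
            atomic_inc(&rdev->nr_pending);   /* pin it before leaving the lock */
    else
            rdev = NULL;
    rcu_read_unlock();
    /* rdev (if non-NULL) may now be used safely; drop nr_pending when done */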
|
b15c2e57f0f5bf596a19e9c5571e5b07cdfc7363 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: move bitmap_create to after md array has been initialised This is important because bitmap_create uses mddev->resync_max_sectors and that doesn't have a valid value until after the array has been initialised (with pers->run()). [It doesn't make a difference for current personalities that support bitmaps, but will make a difference for raid10] This has the added advantage of meaning we can move the thread->timeout manipulation inside the bitmap.c code instead of sprinkling identical code throughout all personalities. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
6ff8d8ec06690f4011a6c3ad9e0759b9094f0601 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: allow dirty raid[456] arrays to be started at boot See patch to md.txt for more details Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
14f8d26b8ea3413b28f2cac208c9a93600fe3a80 |
|
06-Jan-2006 |
NeilBrown <neilb@suse.de> |
[PATCH] md: small cleanups for raid5 Resync code: A test that isn't needed, a 'compute_block' that makes more sense elsewhere (and then doesn't need a test), a couple of BUG_ONs to confirm the change makes sense. Printks: A few were missing KERN_*. Also fix a typo in a comment. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
5036805be7b815eb18dcce489d974f3aee4f3841 |
|
12-Dec-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: use correct size of raid5 stripe cache when measuring how full it is The raid5 stripe cache was recently changed from fixed size (NR_STRIPES) to variable size (conf->max_nr_stripes). However there are two places that still use the constant and as a result, reducing the size of the stripe cache can result in a deadlock. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
700e432d8364ce59c521abbe03a522051610ebc2 |
|
28-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix locking problem in r5/r6 bitmap_unplug actually writes data (bits) to storage, so we shouldn't be holding a spinlock... Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
96de1e663cda65ba9275afb1bc007f34e5068ba1 |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix some locking and module refcounting issues with md's use of sysfs 1/ I really should be using the __ATTR macros for defining attributes, so that the .owner field gets set properly, otherwise modules can be removed while sysfs files are open. This also involves some name changes of _show routines. 2/ Always lock the mddev (against reconfiguration) for all sysfs attribute access. This easily avoid certain races and is completely consistent with other interfaces (ioctl and /proc/mdstat both always lock against reconfiguration). 3/ raid5 attributes must check that the 'conf' structure actually exists (the array could have been stopped while an attribute file was open). 4/ A missing 'kfree' from when the raid5_conf_t was converted to have a kobject embedded, and then converted back again. Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
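[Editor's note: an approximate kernel-style sketch of points 1 and 3. The store-side function and the rest of the md_sysfs_entry plumbing are omitted, and details may differ from the actual patch.]

    static ssize_t
    raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
    {
            raid5_conf_t *conf = mddev_to_conf(mddev);
            if (conf)                      /* point 3: array may have been stopped */
                    return sprintf(page, "%d\n", conf->max_nr_stripes);
            return 0;
    }

    /* point 1: __ATTR fills in .owner so the module can't vanish
     * while the sysfs file is open */
    static struct md_sysfs_entry
    raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
                                    raid5_show_stripe_cache_size,
                                    raid5_store_stripe_cache_size);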
|
3855ad9f398de88a3290f7dd799633c4b73903ea |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: make sure a user-request sync of raid5 ignores intent bitmap A sync of raid5 usually ignores blocks which the bitmap says are in-sync. But a user-requested check or repair should not ignore these. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
b2d444d7ad975d555bb919601bcdc0e58975a40e |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: convert 'faulty' and 'in_sync' fields to bits in 'flags' field This has the advantage of removing the confusion caused by 'rdev_t' and 'mddev_t' both having 'in_sync' fields. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
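[Editor's note: a user-space model of the change, with stand-in bit macros; Faulty and In_sync are the bit names the patch introduces, the macros here are simplified versions of the kernel's.]

    #include <stdio.h>

    enum { Faulty, In_sync };   /* bit numbers in the new 'flags' word */

    #define set_bit(nr, addr)   (*(addr) |= 1UL << (nr))
    #define clear_bit(nr, addr) (*(addr) &= ~(1UL << (nr)))
    #define test_bit(nr, addr)  ((*(addr) >> (nr)) & 1UL)

    int main(void)
    {
        unsigned long flags = 0;    /* replaces two separate int fields */

        set_bit(In_sync, &flags);
        if (test_bit(Faulty, &flags))
            puts("device failed");
        else if (test_bit(In_sync, &flags))
            puts("device in sync");
        return 0;
    }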
|
ba22dcbf106338a5c46d6979f9b19564faae3d49 |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: improvements to raid5 handling of read errors Two refinements to the 'attempt-overwrite-on-read-error' mechanism. 1/ If the array is read-only, don't attempt an over-write. 2/ If there are more than max_nr_stripes read errors on a device with no success, fail the drive. This will make sure a dead drive will eventually be kicked even when we aren't trying to rewrite (which would normally kick a dead drive more quickly). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
007583c9253fed363a0bd71b039e9b40a0f6855e |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: change raid5 sysfs attribute to not create a new directory There isn't really a need for raid5 attributes to be in a subdirectory, so this patch moves them from /sys/block/mdX/md/raid5/attribute to /sys/block/mdX/md/attribute This suggests that all md personalities should co-operate about namespace usage, but that shouldn't be a problem. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
9c79197761b4c181a143dc6a6044f4e47d44bdcc |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: fix ref-counting problems with kobjects in md Thanks Greg. Cc: Greg KH <greg@kroah.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
d6065f7bf8bec170c9c56524a250093ce73ca5d9 |
|
09-Nov-2005 |
Suzanne Wood <suzannew@cs.pdx.edu> |
[PATCH] md: provide proper rcu_dereference / rcu_assign_pointer annotations in md Acked-by: <paulmck@us.ibm.com> Signed-off-by: Suzanne Wood <suzannew@cs.pdx.edu> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
9d88883e68f404d5581bd391713ceef470ea53a9 |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: teach raid5 the difference between 'check' and 'repair'. With this, raid5 can be asked to check parity without repairing it. It also keeps a count of the number of incorrect parity blocks found (mismatches) and reports them through sysfs. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
3f294f4fb6f2ba887b717674da26c21f3d57f3fc |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: add kobject/sysfs support to raid5 /sys/block/mdX/md/raid5/ contains raid5-related attributes. Currently stripe_cache_size is the number of entries in the stripe cache, and is settable. stripe_cache_active is the number of active entries, and is only readable. Signed-off-by: Neil Brown <neilb@suse.de> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
4e5314b56a7ea11c7a5f2b8418992b2f49648a25 |
|
09-Nov-2005 |
NeilBrown <neilb@suse.de> |
[PATCH] md: better handling of readerrors with raid5. This patch changes the behaviour of raid5 when it gets a read error. Instead of just failing the device, it tries to find out what should have been there, and writes it over the bad block. For some media-errors, this has a reasonable chance of fixing the error. If the write succeeds, and a subsequent read succeeds as well, raid5 decides the address is OK and continues. Instead of failing a drive on read-error, we attempt to re-write the block, and then re-read. If that all works, we allow the device to remain in the array. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
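[Editor's note: a toy illustration of 'finding out what should have been there' — in RAID-5 the missing block is the XOR of the surviving data blocks and parity. The values are made up.]

    #include <stdio.h>

    int main(void)
    {
        unsigned char d0 = 0xA5, d1 = 0x3C;     /* the readable data blocks  */
        unsigned char lost = 0x0F;              /* block the read error hid  */
        unsigned char parity = d0 ^ d1 ^ lost;  /* parity written long ago   */

        /* reconstruct from the survivors */
        unsigned char recovered = d0 ^ d1 ^ parity;
        printf("recovered 0x%02X (expected 0x%02X)\n", recovered, lost);
        /* the driver then writes 'recovered' over the bad block and
         * re-reads it; only if both succeed is the device kept */
        return 0;
    }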
|
66c006a55137cc51f8761549e2024a7593180d27 |
|
07-Nov-2005 |
Nishanth Aravamudan <nacc@us.ibm.com> |
[PATCH] drivers/md: fix-up schedule_timeout() usage Use schedule_timeout_interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
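[Editor's note: the substitution, as a kernel-style before/after fragment; HZ (one second) is just an example timeout.]

    /* before */
    set_current_state(TASK_INTERRUPTIBLE);
    schedule_timeout(HZ);

    /* after: one helper call, same behaviour */
    schedule_timeout_interruptible(HZ);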
|
a362357b6cd62643d4dda3b152639303d78473da |
|
01-Nov-2005 |
Jens Axboe <axboe@suse.de> |
[BLOCK] Unify the separate read/write io stat fields into arrays Instead of having ->read_sectors and ->write_sectors, combine the two into ->sectors[2] and similar for the other fields. This saves a branch several places in the io path, since we don't have to care for what the actual io direction is. On my x86-64 box, that's 200 bytes less text in just the core (not counting the various drivers). Signed-off-by: Jens Axboe <axboe@suse.de>
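[Editor's note: a user-space model of the unification — indexing the per-direction counters by the IO direction removes the branch. READ=0/WRITE=1 mirrors the kernel's convention; the struct is a stand-in.]

    #include <stdio.h>

    enum { READ = 0, WRITE = 1 };

    struct disk_stats {
        unsigned long long sectors[2];   /* was: read_sectors, write_sectors */
    };

    static void account_io(struct disk_stats *s, int rw, unsigned long nsect)
    {
        s->sectors[rw] += nsect;         /* no if (rw == WRITE) branch */
    }

    int main(void)
    {
        struct disk_stats st = { {0, 0} };
        account_io(&st, READ, 8);
        account_io(&st, WRITE, 16);
        printf("read %llu, write %llu\n", st.sectors[READ], st.sectors[WRITE]);
        return 0;
    }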
|
72626685dc66d455742a7f215a0535c551628b9e |
|
10-Sep-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: add write-intent-bitmap support to raid5 The most awkward part of this is delaying write requests until bitmap updates have been flushed. To achieve this, we have a sequence number (seq_flush) which is incremented each time the raid5 is unplugged. If the raid thread notices that this has changed, it flushes bitmap changes, and assigns the value of seq_flush to seq_write. When a write request arrives, it is given the number from seq_write, and that write request may not complete until seq_flush is larger than the saved seq number. We have a new queue for storing stripes which are waiting for a bitmap flush and an extra flag for stripes to record if the write was 'degraded' and so should not clear the bit in the bitmap. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
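[Editor's note: a user-space model of the two sequence numbers. It compresses the raid thread's role into two calls and ignores locking and the waiting queue, so it is a simplification of the real flow.]

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long seq_flush, seq_write;

    static void raid5_unplug(void) { seq_flush++; }           /* each unplug  */
    static void bitmap_flush(void) { seq_write = seq_flush; } /* raid thread  */

    /* a write is stamped with seq_write on arrival; it may complete
     * only once seq_flush has moved past that stamp */
    static bool write_may_complete(unsigned long stamp)
    {
        return seq_flush > stamp;
    }

    int main(void)
    {
        unsigned long stamp = seq_write;     /* incoming write request */
        printf("before unplug: %d\n", write_may_complete(stamp));  /* 0 */
        raid5_unplug();     /* raid thread notices seq_flush changed ...   */
        bitmap_flush();     /* ... and flushes the bitmap updates to disk  */
        printf("after flush:  %d\n", write_may_complete(stamp));   /* 1 */
        return 0;
    }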
|
e5dcdd80a60627371f40797426273048630dc8ca |
|
10-Sep-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: fail IO request to md that require a barrier. md does not yet support BIO_RW_BARRIER, so be honest about it and fail (-EOPNOTSUPP) any such requests. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
b1581566183f310abbd2d384a9079d4039faca05 |
|
01-Aug-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: make sure raid5/raid6 resync uses correct 'max_sectors' The default resync_max_sector is set to "mddev->size << 1". If the raid-personality-module updates mddev->size, it must update resync_max_sectors too. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
4b5c7ae83704320e2afb0912f4c42eadabc7535b |
|
27-Jul-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: when resizing an array, we need to update resync_max_sectors as well as size Without this, an attempt to 'grow' an array will claim to have synced the extra part without actually having done anything. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
3d310eb7b3df1252e8595d059d982b0a9825a137 |
|
22-Jun-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: fix deadlock due to md thread processing delayed requests. Before completing a 'write' the md superblock might need to be updated. This is best done by the md_thread. The current code schedules this update and queues the write request for later handling by the md_thread. However some personalities (Raid5/raid6) will deadlock if the md_thread tries to submit requests to its own array. So this patch changes things so the process submitting the request waits for the superblock to be written and then submits the request itself. This fixes a recently-created deadlock in raid5/raid6. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
57afd89f98a990747445f01c458ecae64263b2f8 |
|
22-Jun-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: improve the interface to sync_request 1/ change the return value (which is number-of-sectors synced) from 'int' to 'sector_t'. The number of sectors is usually easily small enough to fit in an int, but if resync needs to abort, it may want to return the total number of remaining sectors, which could be large. Also errors cannot be returned as negative numbers now, so use 0 instead 2/ Add a 'skipped' return parameter to allow the array to report that it skipped the sectors. This allows md to take this into account in the speed calculations. Currently there is no important skipping, but the bitmap-based-resync that is coming will use this. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
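[Editor's note: approximately, the signature change looks like this — types as they existed in md at the time, reconstructed from the commit text, so treat it as a sketch rather than the exact diff.]

    /* before: int return, no way to report skipped sectors */
    int      (*sync_request)(mddev_t *mddev, sector_t sector_nr, int go_faster);

    /* after: sector_t return, plus the 'skipped' out-parameter */
    sector_t (*sync_request)(mddev_t *mddev, sector_t sector_nr,
                             int *skipped, int go_faster);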
|
06d91a5fe0b50c9060e70bdf7786f8a3c66249db |
|
22-Jun-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: improve locking on 'safemode' and move superblock writes When md marks the superblock dirty before a write, it calls generic_make_request (to write the superblock) from within generic_make_request (to write the first dirty block), which could cause problems later. With this patch, the superblock write is always done by the helper thread, and write request are delayed until that write completes. Also, the locking around marking the array dirty and writing the superblock is improved to avoid possible races. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
fca4d848f0e6fafdc2b25f8a0cf1e76935f13ac2 |
|
22-Jun-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: merge md_enter_safemode into md_check_recovery md_enter_safemode checks if it is time to mark the md superblock as 'clean'. i.e. if all writes have completed and a suitable delay has passed. This is currently called from md_handle_safemode which in-turn is called (almost) every time md_check_recovery is called, and from the end of md_do_sync which causes the mddev->thread to run, which will always call md_check_recovery as well. So it doesn't need to be a separate function and fits quite well into md_check_recovery. The "almost" is because multipathd calls md_check_recovery but not md_handle_safemode. This is OK because the code from md_enter_safemode is a no-op if mddev->safemode == 0, which it always is for a multipathd (providing we don't allow it to be set to 2 on a signal...) Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
7a5febe9ffeecd1e78c5b505260ccc1ef18021b4 |
|
17-May-2005 |
NeilBrown <neilb@cse.unsw.edu.au> |
[PATCH] md: set the unplug_fn and issue_flush_fn for md devices *after* committed to creation If we set them too early, they may still be in place and possibly get called even though the array didn't get set up properly. Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
fbd568a3e61a7decb8a754ad952aaa5b5c82e9e5 |
|
01-May-2005 |
Paul E. McKenney <paulmck@us.ibm.com> |
[PATCH] Change synchronize_kernel to _rcu and _sched This patch changes calls to synchronize_kernel(), deprecated in the earlier "Deprecate synchronize_kernel, GPL replacement" patch to instead call the new synchronize_rcu() and synchronize_sched() APIs. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
|
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 |
|
17-Apr-2005 |
Linus Torvalds <torvalds@ppc970.osdl.org> |
Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!
|