History log of /fs/btrfs/scrub.c
Revision Date Author Comments
8b9456da037ab53428d6347fa2fa088933da1424 30-Jul-2014 David Sterba <dsterba@suse.cz> btrfs: remove unused members from struct scrub_warning

Signed-off-by: David Sterba <dsterba@suse.cz>
1203b6813ee84add8b4baa6d75e50ba85517e99c 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: modify clean_io_failure and make it suit direct io

We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
ffdd2018dd0bbfc0d9855ed811dba67201766a2d 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: modify repair_io_failure and make it suit direct io

The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
935e5cc935bcbf9b3d0dd59fed7dbc0f2ebca6bc 03-Sep-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix wrong disk size when writing super blocks

total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
5f546063cee93047af90cf2756e023da9f9fca51 24-Jul-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix wrong generation check of super block on a seed device

The super block generation of the seed devices is not the same as the
filesystem which sprouted from them because we don't update the super
block on the seed devices when we change that new filesystem. So we
should not use the generation of that new filesystem to check the super
block generation on the seed devices, Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
17a9be2f28595945ec9bfac0dd15b86891c1f1de 24-Jul-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix wrong fsid check of scrub

All the metadata in the seed devices has the same fsid as the fsid
of the seed filesystem which is on the seed device, so we should check
them by the current filesystem. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
707e8a071528385a87b63a72a37c2322e463c7b8 04-Jun-2014 David Sterba <dsterba@suse.cz> btrfs: use nodesize everywhere, kill leafsize

The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

text data bss dec hex filename
852418 24560 23112 900090 dbbfa btrfs.ko.before
851074 24584 23112 898770 db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
962a298f35110edd8f326814ae41a3dd306ecb64 04-Jun-2014 David Sterba <dsterba@suse.cz> btrfs: kill the key type accessor helpers

btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
9e0af23764344f7f1b68e4eefbe7dc865018b63d 15-Aug-2014 Liu Bo <bo.li.liu@oracle.com> Btrfs: fix task hang under heavy compressed write

This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
work = container_of(arg, struct btrfs_work, normal_work);

work->func() <---- (we name it work X)
for ordered_work in wq->ordered_list
ordered_work->ordered_func()
ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
btrfs_readpages()
for page that is not in page cache
__do_readpage()
submit_extent_page()
btrfs_submit_bio_hook()
btrfs_bio_wq_end_io()
submit_bio()
end_workqueue_bio() <--(ret by the 1st endio)
queue a work(named work Y) for the 2nd
also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg -- struct work_struct

kthread:
worker_thread()
pick up a work_struct from @worklist
process_one_work(arg)
worker->current_work = arg; <-- arg is A->normal_work
worker->current_func(arg)
normal_work_helper(arg)
A = container_of(arg, struct btrfs_work, normal_work);

A->func()
A->ordered_func()
A->ordered_free() <-- A gets freed

B->ordered_func()
submit_compressed_extents()
find_free_extent()
load_free_space_inode()
... <-- (the above readhead stack)
end_workqueue_bio()
btrfs_queue_work(work C)
B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
5d68da3b8ee6eb2257aa4b8d885581782278ae93 24-Jul-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: don't write any data into a readonly device when scrub

We should not write data into a readonly device especially seed device when
doing scrub, skip those devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
ced96edc48ba455b0982c3aa64d3cc3bf2d0816a 19-Jun-2014 Qu Wenruo <quwenruo@cn.fujitsu.com> btrfs: Skip scrubbing removed chunks to avoid -ENOENT.

When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.

However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.

So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
6eda71d0c030af0fc2f68aaa676e6d445600855b 09-Jun-2014 Liu Bo <bo.li.liu@oracle.com> Btrfs: fix scrub_print_warning to handle skinny metadata extents

The skinny extents are intepreted incorrectly in scrub_print_warning(),
and end up hitting the BUG() in btrfs_extent_inline_ref_size.

Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
7fb18a06644a33f0a2eba810d57d4b3b2c982f5b 25-Apr-2014 Tobias Klauser <tklauser@distanz.ch> btrfs: Remove unnecessary check for NULL

iput() already checks for the inode being NULL, thus it's unnecessary to
check before calling.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Chris Mason <clm@fb.com>
e4fbaee29272533a242f117d18712e2974520d2c 11-Apr-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: fix compile warnings on on avr32 platform

fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type

Since @rot is an int type, we should not use do_div(), fix it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
3b080b2564287be91605bfd1d5ee985696e61d3c 01-Apr-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: scrub raid56 stripes in the right way

Steps to reproduce:
# mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
# mount /dev/sda8 /mnt
# btrfs scrub start -BR /mnt
# echo $? <--unverified errors make return value be 3

This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().

This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
d458b0540ebd728b4d6ef47cc5ef0dbfd4dd361a 28-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com> btrfs: Cleanup the "_struct" suffix in btrfs_workequeue

Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
0339ef2f42bcfbb2d4021ad6f38fe20580082c85 28-Feb-2014 Qu Wenruo <quwenruo@cn.fujitsu.com> btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue.

Replace the fs_info->scrub_* with the newly created
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
32a447896c7b4c5cf5502a3ca7f5ef57d498fb03 19-Feb-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: wake up @scrub_pause_wait as much as we can

check if @scrubs_running=@scrubs_paused condition inside wait_event()
is not an atomic operation which means we may inc/dec @scrub_running/
paused at any time. Let's wake up @scrub_pause_wait as much as we can
to let commit transaction blocked less.

An example below:

Thread1 Thread2
|->scrub_blocked_if_needed() |->scrub_pending_trans_workers_inc
|->increase @scrub_paused
|->increase @scrub_running
|->wake up scrub_pause_wait list
|->scrub blocked
|->increase @scrub_paused

Thread3 is commiting transaction which is blocked at btrfs_scrub_pause().
So after Thread2 increase @scrub_paused, we meet the condition
@scrub_paused=@scrub_running, but transaction will be still blocked until
another calling to wake up @scrub_pause_wait.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
12cf93728dfba237b46001a95479829c7179cdc9 19-Feb-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: device_replace: fix deadlock for nocow case

commit cb7ab02156e4 cause a following deadlock found by
xfstests,btrfs/011:

Thread1 is commiting transaction which is blocked at
btrfs_scrub_pause().

Thread2 is calling btrfs_file_aio_write() which has held
inode's @i_mutex and commit transaction(blocked because
Thread1 is committing transaction).

Thread3 is copy_nocow_page worker which will also try to
hold inode @i_mutex, so thread3 will wait Thread1 finished.

Thread4 is waiting pending workers finished which will wait
Thread3 finished. So the problem is like this:

Thread1--->Thread4--->Thread3--->Thread2---->Thread1

Deadlock happens! we fix it by letting Thread1 go firstly,
which means we won't block transaction commit while we are
waiting pending workers finished.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
ade2e0b3eeca941a5cd486bac21599ff87f288c8 12-Jan-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: fix to search previous metadata extent item since skinny metadata

There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.

To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
7c76edb77c23db673a83793686b4a53e2eec4de4 12-Jan-2014 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: fix missing skinny metadata check in scrub_stripe()

Check if we support skinny metadata firstly and fix to use
right type to search.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
efe120a067c8674a8ae21b194f0e68f098b61ee2 20-Dec-2013 Frank Holton <fholton@gmail.com> Btrfs: convert printk to btrfs_ and fix BTRFS prefix

Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
cb7ab02156e4ba999df90e9fa8e96107683586fd 04-Dec-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: wrap repeated code into scrub_blocked_if_needed()

Just wrap same code into one function scrub_blocked_if_needed().

This make a change that we will move waiting (@workers_pending = 0)
before we can wake up commiting transaction(atomic_inc(@scrub_paused)),
we must take carefully to not deadlock here.

Thread 1 Thread 2
|->btrfs_commit_transaction()
|->set trans type(COMMIT_DOING)
|->btrfs_scrub_paused()(blocked)
|->join_transaction(blocked)

Move btrfs_scrub_paused() before setting trans type which means we can
still join a transaction when commiting_transaction is blocked.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
3cb0929ad24c95c5fd8f08eb41a702a65954b4c6 04-Dec-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: fix wrong super generation mismatch when scrubbing supers

We came a race condition when scrubbing superblocks, the story is:

In commiting transaction, we will update @last_trans_commited after
writting superblocks, if scrubber start after writting superblocks
and before updating @last_trans_commited, generation mismatch happens!

We fix this by checking @scrub_pause_req, and we won't start a srubber
until commiting transaction is finished.(after btrfs_scrub_continue()
finished.)

Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
ce3e7f1073ee4958a30e5677e868f79292bc53a6 04-Nov-2013 Valentina Giusti <valentina.giusti@microon.de> btrfs: remove unused variable from scrub_fixup_nodatasum

Signed-off-by: Valentina Giusti <valentina.giusti@microon.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <clm@fb.com>
c170bbb45febc03ac4d34ba2b8bb55e06104b7e7 25-Nov-2013 Kent Overstreet <kmo@daterainc.com> block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4f024f3797c43cb4b73cd2c50cec728842d0e49e 12-Oct-2013 Kent Overstreet <kmo@daterainc.com> block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
33879d4512c021ae65be9706608dacb36b4687b1 24-Nov-2013 Kent Overstreet <kmo@daterainc.com> block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
33ef30add16905f99bbe799ee5ccea15ba497803 03-Nov-2013 Ilya Dryomov <idryomov@gmail.com> Btrfs: do not inc uncorrectable_errors counter on ro scrubs

Currently if we discover an error when scrubbing in ro mode we a)
blindly increment the uncorrectable_errors counter, and b) spam the
dmesg with the 'unable to fixup (regular) error at ...' message, even
though a) we haven't tried to determine if the error is correctable or
not, and b) we haven't tried to fixup anything. Fix this.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
3b7a016f44d51ba8425c244f4c607f93fa213fd2 11-Oct-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: avoid unnecessary scrub workers allocation

We only allocate scrub workers if we pass all the necessary
checks, for example, there are no operation in progress.

Besides, move mutex lock protection outside of scrub_workers_get()
/scrub_workers_put().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
9b011adfe14977fcda977234609d43ca52463a3d 25-Oct-2013 Wang Shilong <wangsl.fnst@cn.fujitsu.com> Btrfs: remove scrub_super_lock holding in btrfs_sync_log()

Originally, we introduced scrub_super_lock to synchronize
tree log code with scrubbing super.

However we can replace scrub_super_lock with device_list_mutex,
because writing super will hold this mutex, this will reduce an extra
lock holding when writing supers in sync log code.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
539f358a30d5113bad81c41a2e7ba8770d6c9f6e 07-Oct-2013 Ilya Dryomov <idryomov@gmail.com> Btrfs: fix the dev-replace suspend sequence

Replace progresses strictly from lower to higher offsets, and the
progress is tracked in chunks, by storing the physical offset of the
dev_extent which is being copied in the cursor_left field of
btrfs_dev_replace_item. When we are done copying the chunk,
left_cursor is updated to point one byte past the dev_extent, so that
on resume we can skip the dev_extents that have already been copied.

There is a major bug (which goes all the way back to the inception of
dev-replace in 3.8) in the way left_cursor is bumped: the bump is done
unconditionally, without any regard to the scrub_chunk return value.
On suspend (and also on any kind of error) scrub_chunk returns early,
i.e. without completing the copy. This leads to us skipping the chunk
that hasn't been fully copied yet when resuming.

Fix this by doing the cursor_left update only if scrub_chunk ret is 0.
(On suspend scrub_chunk returns with -ECANCELED, so this fix covers
both suspend and error cases.)

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
652f25a2921eb32a3e6f88aed3c59b494c1287c4 12-Sep-2013 Josef Bacik <jbacik@fusionio.com> Btrfs: improve replacing nocow extents

Various people have hit a deadlock when running btrfs/011. This is because when
replacing nocow extents we will take the i_mutex to make sure nobody messes with
the file while we are replacing the extent. The problem is we are already
holding a transaction open, which is a locking inversion, so instead we need to
save these inodes we find and then process them outside of the transaction.

Further we can't just lock the inode and assume we are good to go. We need to
lock the extent range and then read back the extent cache for the inode to make
sure the extent really still points at the physical block we want. If it
doesn't we don't have to copy it. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
23fa76b0ba78b7d84708d9ee683587d8a5bbceef 23-Aug-2013 Stefan Behrens <sbehrens@giantdisaster.de> Btrf: cleanup: don't check for root_refs == 0 twice

btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in three more places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
118a0a251425b5ed5d5951097f9ef95f30611b03 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org> Btrfs: Format mirror_num as int

mirror_num is always "int", hence don't cast it to "unsigned long long" and
format it as a 64-bit number.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
27f9f02357f2bff96fc5e8a000c78ec5f96d42af 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org> Btrfs: Format PAGE_SIZE as unsigned long

PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to
"unsigned long long" and format it as a 64-bit number.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
c1c9ff7c94e83fae89a742df74db51156869bad5 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org> Btrfs: Remove superfluous casts from u64 to unsigned long long

u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
3cae210fa529d69cb25c2a3c491f29dab687b245 16-Jul-2013 Qu Wenruo <quwenruo@cn.fujitsu.com> btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert

Some codes still use the cpu_to_lexx instead of the
BTRFS_SETGET_STACK_FUNCS declared in ctree.h.

Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec
and other structures.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
115930cb2d444a684975cf2325759cb48ebf80cc 04-Jul-2013 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: fix wrong write offset when replacing a device

Miao Xie reported the following issue:

The filesystem was corrupted after we did a device replace.

Steps to reproduce:
# mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
# mount <device0> <mnt>
# btrfs replace start -rfB 1 <device4> <mnt>
# umount <mnt>
# btrfsck <device4>

The reason for the issue is that we changed the write offset by mistake,
introduced by commit 625f1c8dc.

We read the data from the source device at first, and then write the
data into the corresponding place of the new device. In order to
implement the "-r" option, the source location is remapped using
btrfs_map_block(). The read takes place on the mapped location, and
the write needs to take place on the unmapped location. Currently
the write is using the mapped location, and this commit changes it
back by undoing the change to the write address that the aforementioned
commit added by mistake.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
edd1400be9f983f521c7397740d810fa210ee52f 27-Jun-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix several potential problems in copy_nocow_pages_for_inode

- It makes no sense that we deal with a inode in the dead tree.
- fix the race between dio and page copy by waiting the dio completion
- avoid the page copy vs truncate/punch hole
- check if the page is in the page cache or not

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
826aa0a82c5b9d2c8016c02b552e8f82f5b1e660 27-Jun-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: cleanup the code of copy_nocow_pages_for_inode()

- It make no sense that we continue to do something after the error
happened, just go back with this patch.
- remove some check of copy_nocow_pages_for_inode(), such as page check
after write, inode check in the end of the function, because we are
sure they exist.
- remove the unnecessary goto in the return value check of the write

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
26b258919006fc2d76a50b8247d7dea73207b583 27-Jun-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix oops when recovering the file data by scrub function

We get oops while running btrfs replace start test,
------------[ cut here ]------------
kernel BUG at mm/filemap.c:608!
[SNIP]
Call Trace:
[<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
[<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
[<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
[<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
[<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
[<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
[<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
[<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
[<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
[<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
[<ffffffff8106f2f0>] kthread+0xc0/0xd0
[<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
[<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
[<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
[SNIP]
RIP [<ffffffff8111f4c5>] unlock_page+0x35/0x40
RSP <ffff88010316bb98>
---[ end trace 421e79ad0dd72c7d ]---

it is because we forgot to lock the page again after we read data to
the page. Fix it.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
f51a4a1826ff810eb9c00cadff8978b028c40756 19-Jun-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: remove btrfs_sector_sum structure

Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.

By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).

test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
d88d46c6e06cb47cd3b951287ccaf414e96560d0 10-Jun-2013 Josef Bacik <jbacik@fusionio.com> Btrfs: free csums when we're done scrubbing an extent

A user reported scrub taking up an unreasonable amount of ram as it ran. This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram. This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed. The user reported this to fix their problem. Thanks,

Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
9be3395bcd4ad4af76476ac38152b4cafa6b6159 18-May-2013 Chris Mason <chris.mason@fusionio.com> Btrfs: use a btrfs bioset instead of abusing bio internals

Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs. They are also used
to count IO failures on a per device basis.

Recently a bio tracepoint was added lead to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index. The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
625f1c8dc66d77878e1a563d6dd5722404968fbf 27-Apr-2013 Liu Bo <bo.li.liu@oracle.com> Btrfs: improve the loop of scrub_stripe

1) Right now scrub_stripe() is looping in some unnecessary cases:
* when the found extent item's objectid has been out of the dev extent's range
but we haven't finish scanning all the range within the dev extent
* when all the items has been processed but we haven't finish scanning all the
range within the dev extent

In both cases, we can just finish the loop to save costs.

2) Besides, when the found extent item's length is larger than the stripe
len(64k), we don't have to release the path and search again as it'll get at the
same key used in the last loop, we can instead increase the logical cursor in
place till all space of the extent is scanned.

3) And we use 0 as the key's offset to search btree, then get to previous item
to find a smaller item, and again have to move to the next one to get the right
item. Setting offset=-1 and previous_item() is the correct way.

4) As we won't find any checksum at offset unless this 'offset' is in a data
extent, we can just find checksum when we're really going to scrub an extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
48a3b6366f6913683563d934eb16fea67dead9c1 25-Apr-2013 Eric Sandeen <sandeen@redhat.com> btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
3173a18f70554fe7880bb2d85c7da566e364eb3c 07-Mar-2013 Josef Bacik <jbacik@fusionio.com> Btrfs: add a incompatible format change for smaller metadata extent refs

We currently store the first key of the tree block inside the reference for the
tree block in the extent tree. This takes up quite a bit of space. Make a new
key type for metadata which holds the level as the offset and completely removes
storing the btrfs_tree_block_info inside the extent ref. This reduces the size
from 51 bytes to 33 bytes per extent reference for each tree block. In practice
this results in a 30-35% decrease in the size of our extent tree, which means we
COW less and can keep more of the extent tree in memory which makes our heavy
metadata operations go much faster. This is not an automatic format change, you
must enable it at mkfs time or with btrfstune. This patch deals with having
metadata stored as either the old format or the new format so it is easy to
convert. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
b0496686ba0da69cfd2433ef55fb2d1dc7465084 14-Mar-2013 Liu Bo <bo.li.liu@oracle.com> Btrfs: cleanup unused arguments of btrfs_csum_data

Argument 'root' is no more used in btrfs_csum_data().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
d8fe29e9dea8d7d61fd140d8779326856478fc62 29-Mar-2013 Josef Bacik <jbacik@fusionio.com> Btrfs: don't drop path when printing out tree errors in scrub

A user reported a panic where we were panicing somewhere in
tree_backref_for_extent from scrub_print_warning. He only captured the trace
but looking at scrub_print_warning we drop the path right before we mess with
the extent buffer to print out a bunch of stuff, which isn't right. So fix this
by dropping the path after we use the eb if we need to. Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
87533c475187c1420794a2e164bc67a7974f1327 29-Jan-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: use bit operation for ->fs_state

There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.

Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.

Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
6f1c36055f96e80031c7fdda3fd5be826b8d7782 29-Jan-2013 Liu Bo <bo.li.liu@oracle.com> Btrfs: fix race between snapshot deletion and getting inode

While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().

And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.

Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.

(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.

Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.

So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
53b381b3abeb86f12787a6c40fee9b2f71edc23b 30-Jan-2013 David Woodhouse <David.Woodhouse@intel.com> Btrfs: RAID5 and RAID6

This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
4ded4f639533ed5f02a0f0ab20d43bb9659c91f8 14-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: fix BUG() in scrub when first superblock reading gives EIO

This fixes a very special case that can be reproduced by just
disconnecting a disk at runtime, and without unmounting the
filesystem first, start scrub on the filesystem with the
disconnected disk. All read and write EIOs are handled
correctly, only the first superblock is an exception and gives
a BUG() in a subfunction. The BUG() is correct, it would crash
later otherwise. The subfunction must not be called for
superblocks and this is what the fix changes.

Reported-by: Joeri Vanthienen <mail@joerivanthienen.be>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
af1be4f851db4f4975f0139211a6561776ef37c0 27-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: fix a scrub regression in case of write errors

This regression was introduced by the device-replace patches.
Scrub immediately stops checking those disks that have write errors.
This is nothing that happens in the real world, but it is wrong
since scrub is the tool to detect and repair defects. Fix it.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
29a8d9a0bce6a5abac1f313400c2e189e8d10e67 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()

Before this commit, btrfs_map_block() was called with REQ_WRITE
in order to retrieve the list of mirrors for a disk block.
This needs to be changed for the device replace procedure since
it makes a difference whether you are asking for read mirrors
or for locations to write to.
GET_READ_MIRRORS is introduced as a new interface to call
btrfs_map_block().
In the current commit, the functionality is not yet changed,
only the interface for GET_READ_MIRRORS is introduced and all
the places that should use this new interface are adapted.

The reason that REQ_WRITE cannot be abused anymore to retrieve
a list of read mirrors is that during a running dev replace
operation all write requests to the live filesystem are
duplicated to also write to the target drive.
Keep in mind that the target disk is only partially a valid
copy of the source disk while the operation is ongoing. All
writes go to the target disk, but not all reads would return
valid data on the target disk. Therefore it is not possible
anymore to abuse a REQ_WRITE interface to find valid mirrors
for a REQ_READ.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
8dabb7420f014ab0f9f04afae8ae046c0f48b270 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: change core code of btrfs to support the device replace operations

This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
ff023aac31198e88507d626825379b28ea481d4d 06-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: add code to scrub to copy read data to another disk

The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit adds code to scrub to allow the scrub code to copy read
data to another disk.
One goal is to be able to perform as fast as possible. Therefore the
write requests are collected until huge bios are built, and the
write process is decoupled from the read process with some kind of
flow control, of course, in order to limit the allocated memory.
The best performance on spinning disks could by reached when the
head movements are avoided as much as possible. Therefore a single
worker is used to interface the read process with the write process.
The regular scrub operation works as fast as before, it is not
negatively influenced and actually it is more or less unchanged.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
63a212abc2315972b245f93cb11ae3acf3c0b513 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: disallow some operations on the device replace target device

This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
aa1b8cd409f05e1489ec77ff219eff6ed4b801b8 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: pass fs_info instead of root

A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
3ec706c831d4c96905c287013c8228b21619a1d9 05-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree

This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
b6bfebc13218f1fc1502041a810919d3a81b8b4e 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: cleanup scrub bio and worker wait code

Just move some code into functions to make everything more readable.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
34f5c8e90b3f002672cd6b4e6e7c5b959fd981ae 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: in scrub repair code, simplify alloc error handling

In the scrub repair code, the code is changed to handle memory
allocation errors a little bit smarter. The change is to handle
it just like a read error. This simplifies the code and removes
a couple of lines of code, since the code to handle read errors
is there anyway.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cb2ced73d8c7a38b5f699e267deadf2a2cfe911c 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: in scrub repair code, optimize the reading of mirrors

In case that disk blocks need to be repaired (rewritten), the
current code at first (for simplicity reasons) reads all alternate
mirrors in the first step, afterwards selects the best one in a
second step. This is now changed to read one alternate mirror
after the other and to leave the loop early when a perfect mirror
is found.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
7a9e9987681198c56ac7f165725ca322d7a196e1 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: make the scrub page array dynamically allocated

With the modified design (in order to support the devive replace
procedure) it is necessary to alloc the page array dynamically.
The reason is that pages are reused. At first a page is used for
the bio to read the data from the filesystem, then the same page
is reused for the bio that writes the data to the target disk.
Since the read process and the write process are completely
decoupled, this requires a new concept of refcounts and get/put
functions for pages, and it requires to use newly created pages
for each read bio which are freed after the write operation
is finished.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
a36cf8b8933e4a7a7f2f2cbc3c70b097e97f7fd1 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: remove the block device pointer from the scrub context struct

The block device is removed from the scrub context state structure.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in the scrub
context struct and moved into the lower level scope of scrub_bio,
fixup and page structures where the block device context is known.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
d9d181c1ba7aa09a6d2698e8c7e75b515524d504 02-Nov-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: rename the scrub context structure

The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.

Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
be3940c0a90265654d778394cafe2e2cec674df8 11-Sep-2012 Kent Overstreet <koverstreet@google.com> btrfs: Kill some bi_idx references

For immutable bio vecs, I've been auditing and removing bi_idx
references. These were harmless, but removing them will make auditing
easier.

scrub_bio_end_io_worker() was open coding a bio_reset() - but this
doesn't appear to have been needed for anything as right after it does a
bio_put(), and perusing the code it doesn't appear anything else was
holding a reference to the bio.

The other use end_bio_extent_readpage() was just for a pr_debug() -
changed it to something that might be a bit more useful.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Stefan Behrens <sbehrens@giantdisaster.de>
69917e431210f8712fe050f47b7561e7dae89521 08-Sep-2012 Liu Bo <liub.liubo@gmail.com> Btrfs: fix a bug in parsing return value in logical resolve

In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag.

It is possible to lose our errors because
(-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true.

I'm not sure if it is on purpose, it just looks too hacky if it is.
I'd rather use a real flag and a 'ret' to catch errors.

Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
cf93dccea67ad8f5e0d9163c6a0a584550bbd7cd 02-Sep-2012 Wei Yongjun <yongjun_wei@trendmicro.com.cn> Btrfs: fix possible memory leak in scrub_setup_recheck_block()

bbio has been malloced in btrfs_map_block() and should be
freed before leaving from the error handling cases.

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
606686eeac4550d2212bf3d621a810407ef5e9bf 04-Jun-2012 Josef Bacik <josef@redhat.com> Btrfs: use rcu to protect device->name

Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
442a4f6308e694e0fa6025708bd5e4e424bbf51c 25-May-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: add device counters for detected IO and checksum errors

The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
ea9947b4395fa34666086b2fa6f686e94903e047 04-May-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: fix crash in scrub repair code when device is missing

Fix that when scrub tries to repair an I/O or checksum error and one of
the devices containing the mirror is missing, it crashes in bio_add_page
because the bdev is a NULL pointer for missing devices.

Reported-by: Marco L. Crociani <marco.crociani@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
5c84fc3c3914e9adfa6155a167c6c0c2709e6a62 30-Mar-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: don't count CRC or header errors twice while scrubbing

Each CRC or header error was counted twice, this is now fixed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
e627ee7bcd42b4e3a03ca01a8e46dcb4033c5ae0 12-Apr-2012 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: check return value of bio_alloc() properly

bio_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
b5d67f64f9bc656970dacba245410f0faedad18e 27-Mar-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: change scrub to support big blocks

Scrub used to be coded for nodesize == leafsize == sectorsize == PAGE_SIZE.
This is now changed to support sizes for nodesize and leafsize which are
N * PAGE_SIZE.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1623edebee317855c6a854366c01d1630cc537c9 27-Mar-2012 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: minor cleanup in scrub

Just a minor cleanup commit in preparation for the big block changes.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
7a3ae2f8c8c8432e65467b7fc84d5deab04061a0 23-Mar-2012 Jan Schmidt <list.btrfs@jan-o-sch.net> Btrfs: fix regression in scrub path resolving

In commit 4692cf58 we introduced new backref walking code for btrfs. This
assumes we're searching live roots, which requires a transaction context.
While scrubbing, however, we must not join a transaction because this could
deadlock with the commit path. Additionally, what scrub really wants to do
is resolving a logical address in the commit root it's currently checking.

This patch adds support for logical to path resolving on commit roots and
makes scrub use that.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
79787eaab46121d4713ed03c8fc63b9ec3eaec76 12-Mar-2012 Jeff Mahoney <jeffm@suse.com> btrfs: replace many BUG_ONs with proper error handling

btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.

This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
49b25e0540904be0bf558b84475c69d72e4de66e 01-Mar-2012 Jeff Mahoney <jeffm@suse.com> btrfs: enhance transaction abort infrastructure

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
143bede527b054a271053f41bfaca2b57baa9408 01-Mar-2012 Jeff Mahoney <jeffm@suse.com> btrfs: return void in functions without error conditions

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
7ac687d9e047b3fa335f04e18c7188db6a170334 25-Nov-2011 Cong Wang <amwang@redhat.com> btrfs: remove the second argument of k[un]map_atomic()

Signed-off-by: Cong Wang <amwang@redhat.com>
859acaf1a29bbacf6256f1159210c8d6df992b33 09-Feb-2012 Arne Jansen <sensille@gmx.net> btrfs: don't check DUP chunks twice

Because scrub enumerates the dev extent tree to find the chunks to scrub,
it currently finds each DUP chunk twice and also scrubs it twice. This
patch makes sure that scrub_chunk only checks that part of the chunk the
dev extent has been found for. This only changes the behaviour for DUP
chunks.

Reported-and-tested-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Arne Jansen <sensille@gmx.net>
4692cf58aa7b81f721c1653d48db99ea41421d58 02-Dec-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> Btrfs: new backref walking code

The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
21adbd5cbb5344a3fca6bb7ddb2ab6cb03c44546 09-Nov-2011 Stefan Behrens <sbehrens@giantdisaster.de> Btrfs: integrate integrity check module into btrfs

This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.

Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
0dc3b84a73267f47a75468f924f5d58a840e3152 18-Nov-2011 Josef Bacik <josef@redhat.com> Btrfs: fix num_workers_starting bug and other bugs in async thread

Al pointed out we have some random problems with the way we account for
num_workers_starting in the async thread stuff. First of all we need to make
sure to decrement num_workers_starting if we fail to start the worker, so make
__btrfs_start_workers do this. Also fix __btrfs_start_workers so that it
doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
failed to create a worker. Also check_pending_worker_creates needs to call
__btrfs_start_work in it's work function since it already increments
num_workers_starting.

People only start one worker at a time, so get rid of the num_workers argument
everywhere, and make btrfs_queue_worker a void since it will always succeed.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
26bdef541d26fd6a5ddffdf8949ace22f94f809f 16-Nov-2011 Dan Carpenter <dan.carpenter@oracle.com> btrfs scrub: handle -ENOMEM from init_ipath()

init_ipath() can return an ERR_PTR(-ENOMEM).

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
745c4d8e160afaf6c75e887c27ea4b75c8142b26 20-Nov-2011 Jeff Mahoney <jeffm@suse.com> btrfs: Fix up 32/64-bit compatibility for new ioctls

This patch casts to unsigned long before casting to a pointer and fixes
the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
69f4cb526bd02ae5af35846f9a710c099eec3347 11-Nov-2011 Arne Jansen <sensille@gmx.net> Btrfs: handle bio_add_page failure gracefully in scrub

Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
56d2a48f81a1bde827c625b90221fade72fe9c61 04-Nov-2011 Ilya Dryomov <idryomov@gmail.com> Btrfs: fix a potential btrfs_bio leak on scrub fixups

In case we were able to map less than we wanted (length < PAGE_SIZE
clause is true) btrfs_bio is still allocated and we have to free it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
740c3d226cbba6cd6a32adfb66809c94938f3e57 02-Nov-2011 Chris Mason <chris.mason@oracle.com> Btrfs: fix the new inspection ioctls for 32 bit compat

The new ioctls to follow backrefs are not clean for 32/64 bit
compat. This reworks them for u64s everywhere. They are brand new, so
there are no problems with changing the interface now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
6c41761fc6efe1503103a1afe03a6635c0b5d4ec 13-Apr-2011 David Sterba <dsterba@suse.cz> btrfs: separate superblock items out of fs_info

fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
7a26285eea8eb92e0088db011571d887d4551b0f 10-Jun-2011 Arne Jansen <sensille@gmx.net> btrfs: use readahead API for scrub

Scrub uses a simple tree-enumeration to bring the relevant portions
of the extent- and csum-tree into the page cache before starting the
scrub-I/O. This is now replaced by using the new readahead-API.
During readahead the scrub is being accounted as paused, so it won't
hold off transaction commits.

This change raises the average disk bandwith utilisation on my test
volume from 70% to 90%. On another volume, the time for a test run
went down from 89s to 43s.

Changes v5:
- reada1/2 are now of type struct reada_control *

Signed-off-by: Arne Jansen <sensille@gmx.net>
5da6fcbc4eb50c0f55d520750332f5a6ab13508c 04-Aug-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs: integrating raid-repair and scrub-fixup-nodatasum

This ties nodatasum fixup in scrub together with raid repair patches. While
both series are working fine alone, scrub will report uncorrectable errors
if they occur in a nodatasum extent *and* the page is in the page cache.

Previously, we would have triggered readpage to find good data and do the
repair. However, readpage wouldn't read anything in the case where the page
is up to date in the cache. So, we simply take that good data we have and
call repair_io_failure directly (unless the page in the cache is dirty).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
a1d3c4786a4b9c71c0767aa656a759968f7554b6 04-Aug-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs: btrfs_multi_bio replaced with btrfs_bio

btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
0ef8e45158f97dde4801b535e25f70f7caf01a27 13-Jun-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs scrub: add fixup code for errors on nodatasum files

This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.

Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
of "uncorrectable" eventually.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
e12fa9cd390f8e93a9144bd99bd6f6ed316fbc1e 17-Jun-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs scrub: use int for mirror_num, not u64

the rest of the code uses int mirror_num, and so should scrub

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
193ea74b2729e6ddc08fb6bde6e15a3bd4d94071 13-Jun-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs scrub: bugfix: mirror_num off by one

Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
558540c17771eaf89b1a3be39aa2c8bc837da1a6 13-Jun-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs scrub: print paths of corrupted files

While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
13db62b7a1e8c64763a93c155091620f85ff8920 13-Jun-2011 Jan Schmidt <list.btrfs@jan-o-sch.net> btrfs scrub: added unverified_errors

In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.

Userland patches carrying such conspicous events to the administrator should
already be around.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
6eef3125886df260ca0e8758d141308152226f6a 10-Jun-2011 Arne Jansen <sensille@gmx.net> btrfs: remove unneeded includes from scrub.c

Signed-off-by: Arne Jansen <sensille@gmx.net>
632dd772fcbde2ba37c0e8983bd38ef4a1eac906 10-Jun-2011 Arne Jansen <sensille@gmx.net> btrfs: reinitialize scrub workers

Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.

Signed-off-by: Arne Jansen <sensille@gmx.net>
8c51032f978bac5bec5dae0c5de4f85db97c1cc9 03-Jun-2011 Arne Jansen <sensille@gmx.net> btrfs: scrub: errors in tree enumeration

due to the semantics of btrfs_search_slot the path can point to an
invalid slot when ret > 0. This condition went unnoticed, which in
turn could have led to an incomplete scrubbing.

Signed-off-by: Arne Jansen <sensille@gmx.net>
7841cb2898f66a73062c64d0ef5733dde7279e46 31-May-2011 David Sterba <dsterba@suse.cz> btrfs: add helper for fs_info->closing

wrap checking of filesystem 'closing' flag and fix a few missing memory
barriers.

Signed-off-by: David Sterba <dsterba@suse.cz>
e7786c3ae517b2c433edc91714e86be770e9f1ce 28-May-2011 Arne Jansen <sensille@gmx.net> btrfs: scrub: add explicit plugging

With the removal of the implicit plugging scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1bc8779349d6278e2713a1ff94418c2a6746a791 28-May-2011 Arne Jansen <sensille@gmx.net> btrfs: scrub: don't reuse bios and pages

The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Maosn <chris.mason@oracle.com>
00d01bc17cc2807292303961519d9c005794eb1d 25-May-2011 Arne Jansen <sensille@gmx.net> btrfs scrub: don't coalesce pages that are logically discontiguous

scrub_page collects several pages into one bio as long as they are physically
contiguous. As we only save one logical address for the whole bio, don't
collect pages that are physically contiguous but logically discontiguous.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
8628764e1a5e1998a42b9713e9edea7753653d01 23-Mar-2011 Arne Jansen <sensille@gmx.net> btrfs: add readonly flag

setting the readonly flag prevents writes in case an error is detected

Signed-off-by: Arne Jansen <sensille@gmx.net>
96e369208e65a7d017a52361fd572df41fde8472 09-Apr-2011 Ilya Dryomov <idryomov@gmail.com> btrfs scrub: make fixups sync

btrfs scrub - make fixups sync, don't reuse fixup bios

Fixups are already sync for csum failures, this patch makes them sync
for EIO case as well.

Fixups are now sharing pages with the parent sbio - instead of
allocating a separate page to do a fixup we grab the page from the sbio
buffer.

Fixup bios are no longer reused.

struct fixup is no longer needed, instead pass [sbio pointer, index].

Originally this was added to look at the possibility of sharing the code
between drive swap and scrub, but it actually fixes a serious bug in
scrub code where errors that could be corrected were ignored and
reported as uncorrectable.

btrfs scrub - restore bios properly after media errors

The current code reallocates a bio after a media error. This is a
temporary measure introduced in v3 after a serious problem related to
bio reuse was found in v2 of scrub patchset.

Basically we did not reset bv_offset and bv_len fields of the bio_vec
structure. They are changed in case I/O error happens, for example, at
offset 512 or 1024 into the page. Also bi_flags field wasn't properly
setup before reusing the bio.

Signed-off-by: Arne Jansen <sensille@gmx.net>
a2de733c78fa7af51ba9670482fa7d392aa67c57 08-Mar-2011 Arne Jansen <sensille@gmx.net> btrfs: scrub

This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>