History log of /fs/btrfs/file-item.c
Revision Date Author Comments
6e5aafb27419f32575b27ef9d6a31e5d54661aca 04-Nov-2014 Chris Mason <clm@fb.com> Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup

If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them. But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.

This bug came from commit 0678b6185

btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Erik Berg <btrfs@slipsprogrammoer.no>
cc: stable@vger.kernel.org # 3.3 or newer
23ea8e5a07673127d05cb5cf6f9914d7a53e0847 12-Sep-2014 Miao Xie <miaox@cn.fujitsu.com> Btrfs: load checksum data once when submitting a direct read io

The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
d1b00a4711d5b953b13ccc859bc30c447c96860e 25-Jul-2014 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> btrfs: use IS_ALIGNED() for assertion in btrfs_lookup_csums_range() for simplicity

btrfs_lookup_csums_range() uses ALIGN() to check if "start"
and "end + 1" are aligned to "root->sectorsize". It's better to
replace these with IS_ALIGNED() for simplicity.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
962a298f35110edd8f326814ae41a3dd306ecb64 04-Jun-2014 David Sterba <dsterba@suse.cz> btrfs: kill the key type accessor helpers

btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
27b9a8122ff71a8cadfbffb9c4f0694300464f3b 09-Aug-2014 Filipe Manana <fdmanana@suse.com> Btrfs: fix csum tree corruption, duplicate and outdated checksums

Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

Leaf N:

slot = 0 slot = btrfs_header_nritems() - 1
|-------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
|-------------------------------------------------------------------|

Leaf N + 1:

slot = 0 slot = btrfs_header_nritems() - 1
|--------------------------------------------------------------------|
| [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
|--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1
|----------------------------------------------------------------------------------------------------|
| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] |
|----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

tmp = min((sums->len - total_bytes) >> blocksize_bits,
(next_offset - file_key.offset) >> blocksize_bits) =
min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
7ffbb598a059b73487909619d73150f99b50337a 09-Jun-2014 Filipe Manana <fdmanana@gmail.com> Btrfs: make fsync work after cloning into a file

When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
35045bf2fd7c030f2583dbd80a2015f427778bf1 09-Apr-2014 Filipe Manana <fdmanana@gmail.com> Btrfs: don't access non-existent key when csum tree is empty

When the csum tree is empty, our leaf (path->nodes[0]) has a number
of items equal to 0 and since btrfs_header_nritems() returns an
unsigned integer (and so is our local nritems variable) the following
comparison always evaluates to false:

if (path->slots[0] >= nritems - 1) {

As the casting rules lead to:

if ((u32)0 >= (u32)4294967295) {

This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf
some lines below:

btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
found_key.type != BTRFS_EXTENT_CSUM_KEY) {
found_next = 1;
goto insert;
}

So just don't access such non-existent slot and don't set found_next to 1
when the tree is empty. It's very unlikely we'll get a random key with the
objectid and type values above, which is where we could go into trouble.

If nritems is 0, just set found_next to 1 anyway as it will make us insert
a csum item covering our whole extent (or the whole leaf) when the tree is
empty.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
d2cbf2a260ab18c833f07fda66e30c4d4344162e 29-Apr-2014 Liu Bo <bo.li.liu@oracle.com> Btrfs: do not increment on bio_index one by one

'bio_index' is just a index, it's really not necessary to do increment
one by one.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
efe120a067c8674a8ae21b194f0e68f098b61ee2 20-Dec-2013 Frank Holton <fholton@gmail.com> Btrfs: convert printk to btrfs_ and fix BTRFS prefix

Convert all applicable cases of printk and pr_* to the btrfs_* macros.

Fix all uses of the BTRFS prefix.

Signed-off-by: Frank Holton <fholton@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
4f024f3797c43cb4b73cd2c50cec728842d0e49e 12-Oct-2013 Kent Overstreet <kmo@daterainc.com> block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
fae7f21cece9a4c181a8d8131870c7247e153f65 31-Oct-2013 Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)

Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
code that outputs a more descriptive warnings. Also fix the styling
warning of redundant braces that came up as a result of this fix.

Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
4277a9c3b3665f2830c55ece015163867b9414cc 15-Oct-2013 Josef Bacik <jbacik@fusionio.com> Btrfs: add an assert to btrfs_lookup_csums_range for alignment

I was hitting weird issues when trying to remove hole extents and it turned out
it was because I was sending non-aligned offsets down to
btrfs_lookup_csums_range. So add an assert for this in case somebody trips over
this in the future. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
c1c9ff7c94e83fae89a742df74db51156869bad5 20-Aug-2013 Geert Uytterhoeven <geert@linux-m68k.org> Btrfs: Remove superfluous casts from u64 to unsigned long long

u64 is "unsigned long long" on all architectures now, so there's no need to
cast it when formatting it using the "ll" length modifier.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
facc8a2247340a9735fe8cc123c5da2102f5ef1b 25-Jul-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: don't cache the csum value into the extent state tree

Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.

Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
f51a4a1826ff810eb9c00cadff8978b028c40756 19-Jun-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: remove btrfs_sector_sum structure

Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.

By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum value at one time, it improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).

test command:
# dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
48a3b6366f6913683563d934eb16fea67dead9c1 25-Apr-2013 Eric Sandeen <sandeen@redhat.com> btrfs: make static code static & remove dead code

Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
4b90c68015a7c0863292d6306501552d4ffa33ff 16-Apr-2013 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: remove unused argument of btrfs_extend_item()

Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
afe5fea72bd50b1df2e6a721ef50559427d42f2b 16-Apr-2013 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: cleanup of function where fixup_low_keys() is called

If argument 'trans' is unnecessary in the function where
fixup_low_keys() is called, 'trans' is deleted.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
e4100d987b2437596ebcf11809022b79507f3db1 05-Apr-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: improve the performance of the csums lookup

It is very likely that there are several blocks in bio, it is very
inefficient if we get their csums one by one. This patch improves
this problem by getting the csums in batch.

According to the result of the following test, the execute time of
__btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).

# dd if=<mnt>/file of=/dev/null bs=1M count=1024

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
628c8282bed73c18ff056d655d4a6a63ef55284a 18-Mar-2013 Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range()

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
b0496686ba0da69cfd2433ef55fb2d1dc7465084 14-Mar-2013 Liu Bo <bo.li.liu@oracle.com> Btrfs: cleanup unused arguments of btrfs_csum_data

Argument 'root' is no more used in btrfs_csum_data().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
82d130ff390be67d980d8b6f39e921c0b1d8d8e0 28-Mar-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix wrong return value of btrfs_lookup_csum()

If we don't find the expected csum item, but find a csum item which is
adjacent to the specified extent, we should return -EFBIG, or we should
return -ENOENT. But btrfs_lookup_csum() return -EFBIG even the csum item
is not adjacent to the specified extent. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
39847c4d3d91f487f9ab3d083ee5d0f8419f105c 28-Mar-2013 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix wrong reservation of csums

We reserve the space for csums only when we write data into a file, in
the other cases, such as tree log, log replay, we don't do reservation,
so we can use the reservation of the transaction handle just for the former.
And for the latter, we should use the tree's own reservation. But the
function - btrfs_csum_file_blocks() didn't differentiate between these
two types of the cases, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2f697dc6a648d3a16f512fe7a53281d55cce1570 04-Feb-2013 Liu Bo <bo.li.liu@oracle.com> Btrfs: extend the checksum item as much as possible

For write, we also reserve some space for COW blocks during updating
the checksum tree, and we calculate the number of blocks by checking
if the number of bytes outstanding that are going to need csums needs
one more block for csum.

When we add these checksum into the checksum tree, we use ordered sums
list.
Every ordered sum contains csums for each sector, and we'll first try
to look up an existing csum item,
a) if we don't yet have a proper csum item, then we need to insert one,
b) or if we find one but the csum item is not big enough, then we need
to extend it.

The point is we'll unlock the whole path and then insert or extend.
So others can hack in and update the tree.

Each insert or extend needs update the tree with COW on, and we may need
to insert/extend for many times.

That means what we've reserved for updating checksum tree is NOT enough
indeed.

The case is even more serious with having several write threads at the
same time, it can end up eating our reserved space quickly and starting
eating globle reserve pool instead.

I don't yet come up with a way to calculate the worse case for updating
csum, but extending the checksum item as much as possible can be helpful
in my test.

The idea behind is that it can reduce the times we insert/extend so that
it saves us precious reserved space.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
e58dd74bccb4317e39e4b675bf9c6cd133608fac 22-Jan-2013 Josef Bacik <jbacik@fusionio.com> Btrfs: put csums on the right ordered extent

I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent. This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent. Without this we could end up with csums for
bytenrs that don't have extents to cover them yet. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
315a9850da2b89c83971b26fe54a60f22bdd91ad 01-Nov-2012 Miao Xie <miaox@cn.fujitsu.com> Btrfs: fix wrong file extent length

There are two types of the file extent - inline extent and regular extent,
When we log file extents, we didn't take inline extent into account, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
221b831835421f9451182611fa25fa60f440662f 20-Sep-2012 Zach Brown <zab@zabbo.net> btrfs: fix min csum item size warnings in 32bit

commit 7ca4be45a0255ac8f08c05491c6add2dd87dd4f8 limited csum items to
PAGE_CACHE_SIZE. It used min() with incompatible types in 32bit which
generates warnings:

fs/btrfs/file-item.c: In function ‘btrfs_csum_file_blocks’:
fs/btrfs/file-item.c:717: warning: comparison of distinct pointer types lacks a cast

This uses min_t(u32,) to fix the warnings. u32 seemed reasonable
because btrfs_root->leafsize is u32 and PAGE_CACHE_SIZE is unsigned
long.

Signed-off-by: Zach Brown <zab@zabbo.net>
995e01b7af745b8aaa5e882cfb7bfd5baab3f335 13-Aug-2012 Jan Schmidt <list.btrfs@jan-o-sch.net> Btrfs: fix gcc warnings for 32bit compiles

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
c329861da40623cd838b8c9ee31a850242fd88cf 03-Aug-2012 Josef Bacik <jbacik@fusionio.com> Btrfs: don't allocate a seperate csums array for direct reads

We've been allocating a big array for csums instead of storing them in the
io_tree like we do for buffered reads because previously we were locking the
entire range, so we didn't have an extent state for each sector of the
range. But now that we do the range locking as we map the buffers we can
limit the mapping lenght to sectorsize and use the private part of the
io_tree for our csums. This allows us to avoid an extra memory allocation
for direct reads which could incur latency. Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
83eea1f1bacd5dc7b44dcf84f5fdca54fdea5453 10-Jul-2012 Liu Bo <liubo2009@cn.fujitsu.com> Btrfs: kill root from btrfs_is_free_space_inode

Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
0e721106923be82f651dd0ee504742a8a3eb089f 26-Jun-2012 Josef Bacik <jbacik@fusionio.com> Btrfs: change how we indicate we're adding csums

There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv. Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work. The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit. By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often. This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.

This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
7ca4be45a0255ac8f08c05491c6add2dd87dd4f8 01-Feb-2012 Chris Mason <chris.mason@oracle.com> Btrfs: don't use crc items bigger than 4KB

With the big metadata blocks, we can have crc items
that are much bigger than a page. There are a few
places that we try to kmalloc memory to hold the
items during a split.

Items bigger than 4KB don't really have a huge benefit
in efficiency, but they do trigger larger order allocations.
This commits changes the csums to make sure they stay under
4KB. This is not a format change, just a #define to limit
huge items.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
79787eaab46121d4713ed03c8fc63b9ec3eaec76 12-Mar-2012 Jeff Mahoney <jeffm@suse.com> btrfs: replace many BUG_ONs with proper error handling

btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.

This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
0678b61851b510ba68341dff59cd9b47e1712e91 06-Aug-2011 Mark Fasheh <mfasheh@suse.com> btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()

Unfortunately it isn't enough to just exit here - the kzalloc() happens in a
loop and the allocated items are added to a linked list whose head is passed
in from the caller.

To fix the BUG_ON() and also provide the semantic that the list passed in is
only modified on success, I create function-local temporary list that we add
items too. If no error is met, that list is spliced to the callers at the
end of the function. Otherwise the list will be walked and all items freed
before the error value is returned.

I did a simple test on this patch by forcing an error at the kzalloc() point
and verifying that when this hits (git clone seemed to exercise this), the
function throws the proper error. Unfortunately but predictably, we later
hit a BUG_ON(ret) type line that still hasn't been fixed up ;)

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
143bede527b054a271053f41bfaca2b57baa9408 01-Mar-2012 Jeff Mahoney <jeffm@suse.com> btrfs: return void in functions without error conditions

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
7ac687d9e047b3fa335f04e18c7188db6a170334 25-Nov-2011 Cong Wang <amwang@redhat.com> btrfs: remove the second argument of k[un]map_atomic()

Signed-off-by: Cong Wang <amwang@redhat.com>
6c41761fc6efe1503103a1afe03a6635c0b5d4ec 13-Apr-2011 David Sterba <dsterba@suse.cz> btrfs: separate superblock items out of fs_info

fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
ddf23b3fc6850bd4654d51ec9457fe7c77cde51e 11-Sep-2011 Josef Bacik <josef@redhat.com> Btrfs: skip locking if searching the commit root in csum lookup

It's not enough to just search the commit root, since we could be cow'ing the
very block we need to search through, which would mean that its locked and we'll
still deadlock. So use path->skip_locking as well. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2cf8572dac62cc2ff7e995173e95b6c694401b3f 26-Jul-2011 Chris Mason <chris.mason@oracle.com> Btrfs: use the commit_root for reading free_space_inode crcs

Now that we are using regular file crcs for the free space cache,
we can deadlock if we try to read the free_space_inode while we are
updating the crc tree.

This commit fixes things by using the commit_root to read the crcs. This is
safe because we the free space cache file would already be loaded if
that block group had been changed in the current transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
a65917156e345946dbde3d7effd28124c6d6a8c2 19-Jul-2011 Chris Mason <chris.mason@oracle.com> Btrfs: stop using highmem for extent_buffers

The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.

The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.

This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
d8926bb3badd36670fecf2de4a062c78bc37430b 13-Jul-2011 Mark Fasheh <mfasheh@suse.com> btrfs: don't BUG_ON btrfs_alloc_path() errors

This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:

- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)

The following functions were updated:

btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
1cd307990d6e2b4965620e339a92e0d7ae853e13 19-May-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item

Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
65a246c5ffe3b487a001de025816326939e63362 19-May-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: return error code to caller when btrfs_del_item fails

The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
a2de733c78fa7af51ba9670482fa7d392aa67c57 08-Mar-2011 Arne Jansen <sensille@gmx.net> btrfs: scrub

This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>
b3b4aa74b58bded927f579fff787fb6fa1c0393c 21-Apr-2011 David Sterba <dsterba@suse.cz> btrfs: drop unused parameter from btrfs_release_path

parameter tree root it's not used since commit
5f39d397dfbe140a14edecd4e73c34ce23c4f9ee ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
33345d01522f8152f99dc84a3e7a1a45707f387f 20-Apr-2011 Li Zefan <lizf@cn.fujitsu.com> Btrfs: Always use 64bit inode number

There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
c2db1073fdf9757e6fd8b4a59d15b6ecc7a2af8a 01-Mar-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: check return value of btrfs_alloc_path()

Adding the check on the return value of btrfs_alloc_path() to several places.
And, some of callers are modified by this change.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
db5b493ac78e46c7b6bad22cd25d8041564cd8ea 23-Mar-2011 Tsutomu Itoh <t-itoh@jp.fujitsu.com> Btrfs: cleanup some BUG_ON()

This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
ad0397a7a97f55fd7f70998ec208c5d8b90310ff 28-Jan-2011 Josef Bacik <josef@redhat.com> Btrfs: do error checking in btrfs_del_csums

Got a report of a box panicing because we got a NULL eb in read_extent_buffer.
His fs was borked and btrfs_search_path returned EIO, but we don't check for
errors so the box paniced. Yes I know this will just make something higher up
the stack panic, but that's a problem for future Josef. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2a29edc6b60a5248ccab588e7ba7dad38cef0235 26-Jan-2011 liubo <liubo2009@cn.fujitsu.com> btrfs: fix several uncheck memory allocations

To make btrfs more stable, add several missing necessary memory allocation
checks, and when no memory, return proper errno.

We've checked that some of those -ENOMEM errors will be returned to
userspace, and some will be catched by BUG_ON() in the upper callers,
and none will be ignored silently.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
4b46fce23349bfca781a32e2707a18328ca5ae22 23-May-2010 Josef Bacik <josef@redhat.com> Btrfs: add basic DIO read/write support

This provides basic DIO support for reading and writing. It does not do the
work to recover from mismatching checksums, that will come later. A few design
changes have been made from Jim's code (sorry Jim!)

1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.

2) Fallback on buffered IO for compressed or inline extents. Jim's code did
it's own buffering to make dio with compressed extents work. Now we just
fallback onto normal buffered IO.

3) Use ordered extents for the writes so that all of the

lock_extent()
lookup_ordered()

type checks continue to work.

4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.

I've tested this with fsx and everything works great. This patch depends on my
dio and filemap.c patches to work. Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
4a500fd178c89b96fa166a2d9e7855df33429841 16-May-2010 Yan, Zheng <zheng.yan@oracle.com> Btrfs: Metadata ENOSPC handling for tree log

Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
b9473439d3e84d9fc1a0a83faca69cc1b7566341 13-Mar-2009 Chris Mason <chris.mason@oracle.com> Btrfs: leave btree locks spinning more often

btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
07d400a6df4767a90d49a153fdb7f4cfa1e3f23e 06-Jan-2009 Yan Zheng <zheng.yan@oracle.com> Btrfs: tree logging checksum fixes

This patch contains following things.

1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This
struct is kmalloced so we want to keep it reasonable.

2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was
duplicated code in tree-log.c

3) Remove replay_one_csum. csum items are replayed at the same time as
replaying file extents. This guarantees we only replay useful csums.

4) nbytes accounting fix.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
d397712bcc6a759a560fd247e6053ecae091f958 06-Jan-2009 Chris Mason <chris.mason@oracle.com> Btrfs: Fix checkpatch.pl warnings

There were many, most are fixed now. struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
87b29b208c6c38f3446d2de6ece946e2459052cf 17-Dec-2008 Yan Zheng <zheng.yan@oracle.com> Btrfs: properly check free space for tree balancing

btrfs_insert_empty_items takes the space needed by the btrfs_item
structure into account when calculating the required free space.

So the tree balancing code shouldn't add sizeof(struct btrfs_item)
to the size when checking the free space. This patch removes these
superfluous additions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
dcbdd4dcb9793b00b46ab023e9330922c8c7c54c 16-Dec-2008 Chris Mason <chris.mason@oracle.com> Btrfs: delete checksum items before marking blocks free

Btrfs maintains a cache of blocks available for allocation in ram. The
code that frees extents was marking the extents free and then deleting
the checksum items.

This meant it was possible the extent would be reallocated before the
checksum item was actually deleted, leading to races and other
problems as the checksums were updated for the newly allocated extent.

The fix is to delete the checksum before marking the extent free.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
17d217fe970d34720f4f1633dca73a6aa2f3d9d1 12-Dec-2008 Yan Zheng <zheng.yan@oracle.com> Btrfs: fix nodatasum handling in balancing code

Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains following things to make data relocation
work.

1) make nodatasum/nodatacow mount option only affects new
files. Checksums and COW on data are only controlled by the
inode flags.

2) check the existence of checksum in the nodatacow checker.
If checksums exist, force COW the data extent. This ensure that
checksum for a given block is either valid or does not exist.

3) update data relocation code to properly handle the case
of checksum missing.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
459931eca5f4b8c9ad259d07cc1ca49afed54804 10-Dec-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Delete csum items when freeing extents

This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent. Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock. This is used to
remove csum bytes from the middle of an item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
d20f7043fa65659136c1a7c3c456eeeb5c6f431f 08-Dec-2008 Chris Mason <chris.mason@oracle.com> Btrfs: move data checksumming into a dedicated tree

Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.

But, this has a few problems:

* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.

* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.

* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.

* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.

* There is potentitally one copy of the checksum in each subvolume
referencing an extent.

The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:

* The checksum is against the data stored on disk, after any compression
or encryption is done.

* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.

This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.

The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
607d432da0542e84ddcd358adfddac6f68500e3d 02-Dec-2008 Josef Bacik <jbacik@redhat.com> Btrfs: add support for multiple csum algorithms

This patch gives us the space we will need in order to have different csum
algorithims at some point in the future. We save the csum algorithim type
in the superblock, and use those instead of define's.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
39be25cd89450940b0e5f8a6aad71d1ec99b17bf 10-Nov-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Use invalidatepage when writepage finds a page outside of i_size

With all the recent fixes to the delalloc locking, it is now safe
again to use invalidatepage inside the writepage code for
pages outside of i_size. This used to deadlock against some of the
code to write locked ranges of pages, but all of that has been fixed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
c8b978188c9a0fd3d535c13debd19d522b726f1f 29-Oct-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Add zlib compression support

This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
639cb58675ce9b507eed9c3d6b3335488079b21a 28-Aug-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Fix variable init during csum creation

Signed-off-by: Chris Mason <chris.mason@oracle.com>
4d1b5fb4d7075f862848dbff8873e22382abd482 20-Aug-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Lookup readpage checksums on bio submission again

This optimization had been removed because I thought it was triggering
csum errors. The real cause of the errors was elsewhere, and so
this optimization is back.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
53863232ef961778aa414b700ed88a48e8e871e6 15-Aug-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Lower contention on the csum mutex

This takes the csum mutex deeper in the call chain and releases it
more often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
3de9d6b649b4cc60687be92e71cef36d7d4e8f2f 05-Aug-2008 Chris Mason <chris.mason@oracle.com> btrfs_lookup_bio_sums seems broken, go back to the readpage_io_hook for now

Signed-off-by: Chris Mason <chris.mason@oracle.com>
6dab81574346c831ded96ae3ab0e8f9ca72c37ae 04-Aug-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Hold csum mutex while reading in sums during readpages

Signed-off-by: Chris Mason <chris.mason@oracle.com>
61b4944018449003ac5f9757f4d125dce519cf51 31-Jul-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Fix streaming read performance with checksumming on

Large streaming reads make for large bios, which means each entry on the
list async work queues represents a large amount of data. IO
congestion throttling on the device was kicking in before the async
worker threads decided a single thread was busy and needed some help.

The end result was that a streaming read would result in a single CPU
running at 100% instead of balancing the work off to other CPUs.

This patch also changes the pre-IO checksum lookup done by reads to
work on a per-bio basis instead of a per-page. This results in many
extra btree lookups on large streaming reads. Doing the checksum lookup
right before bio submit allows us to reuse searches while processing
adjacent offsets.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
bcc63abbf3e9bf948a1b0129b3e6120ec7d7f698 30-Jul-2008 Yan <zheng.yan@oracle.com> Btrfs: implement memory reclaim for leaf reference cache

The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
ed98b56a6393c5e150fd5095b9eb7fd7d3cfb041 23-Jul-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Take the csum mutex while reading checksums

Signed-off-by: Chris Mason <chris.mason@oracle.com>
e5a2217ef6ff088d08a27208929a6f9c635d672c 19-Jul-2008 Chris Mason <chris.mason@oracle.com> Fix btrfs_wait_ordered_extent_range to properly wait

Signed-off-by: Chris Mason <chris.mason@oracle.com>
7f3c74fb831fa19bafe087e817c0a5ff3883f1ea 18-Jul-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Keep extent mappings in ram until pending ordered extents are done

It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
3edf7d33f4edb1e4a9bb0a4c0a84d95fb4d22a09 18-Jul-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Handle data checksumming on bios that span multiple ordered extents

Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
e6dcd2dc9c489108648e2ed543315dd134d50a9a 17-Jul-2008 Chris Mason <chris.mason@oracle.com> Btrfs: New data=ordered implementation

The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
f2eb0a241f0e5c135d93243b0236cb1f14c305e0 02-May-2008 Sage Weil <sage@newdream.net> Btrfs: Clone file data ioctl

Add a new ioctl to clone file data

Signed-off-by: Chris Mason <chris.mason@oracle.com>
e015640f9c4fa2417dcc3bbbb3b2b61ad4059ab0 16-Apr-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Write bio checksumming outside the FS mutex

This significantly improves streaming write performance by allowing
concurrency in the data checksumming.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
eb20978f318ab5e360ef9c1b24b5dea14d0fee6a 21-Feb-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Use KM_USERN instead of KM_IRQ during data summing

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2e1a992e3103624af48f1501aaad4e4d8317f88c 20-Feb-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Make sure bio pages are adjacent during bulk csumming

Signed-off-by: Chris Mason <chris.mason@oracle.com>
6e92f5e651a34f24ab31ebdf3f113c7d23a36000 20-Feb-2008 Chris Mason <chris.mason@oracle.com> Btrfs: While doing checksums on bios, cache the extent_buffer mapping

Signed-off-by: Chris Mason <chris.mason@oracle.com>
065631f6dccea07bfad48d8981369f6d9cfd6e2b 20-Feb-2008 Chris Mason <chris.mason@oracle.com> Btrfs: checksum file data at bio submission time instead of during writepage

When we checkum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.

This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.

Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
aadfeb6e39ad6bde080cb3ab23f4da57ccb25f4a 29-Jan-2008 Chris Mason <chris.mason@oracle.com> Btrfs: Add some extra debugging around file data checksum failures

Signed-off-by: Chris Mason <chris.mason@oracle.com>
179e29e488cc74f1e9bd67bc45f70b832740e9ec 01-Nov-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Fix a number of inline extent problems that Yan Zheng reported.

The fixes do a number of things:

1) Most btrfs_drop_extent callers will try to leave the inline extents in
place. It can truncate bytes off the beginning of the inline extent if
required.

2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.

3) btrfs_truncate_in_transaction truncates inline extents

4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together

Signed-off-by: Chris Mason <chris.mason@oracle.com>
b56baf5bedccd3258643b09289f17ceab3ddea52 29-Oct-2007 Yan <yanzheng@21cn.com> Minor fix for btrfs_csum_file_block.

Execution should goto label 'insert' when 'btrfs_next_leaf' return a
non-zero value, otherwise the parameter 'slot' for
'btrfs_item_key_to_cpu' may be out of bounds. The original codes jump
to label 'insert' only when 'btrfs_next_leaf' return a negative
value.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
f578d4bd7e141dd03ca7e8695c1cc118c326e69e 25-Oct-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Optimize csum insertion to create larger items when possible

This reduces the number of calls to btrfs_extend_item and greatly lowers
the cpu usage while writing large files.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
ff79f8190b6e955ff7a71faf804a3017d526e657 15-Oct-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Add back file data checksumming

Signed-off-by: Chris Mason <chris.mason@oracle.com>
db94535db75e67fab12ccbb7f5ee548e33fed891 15-Oct-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Allow tree blocks larger than the page size

Signed-off-by: Chris Mason <chris.mason@oracle.com>
5f39d397dfbe140a14edecd4e73c34ce23c4f9ee 15-Oct-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Create extent_buffer interface for large blocksizes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
ec6b910fb330f29e169c9f502c3ac209515af8d1 11-Jul-2007 Zach Brown <zach.brown@oracle.com> Btrfs: trivial include fixups

Almost none of the files including module.h need to do so,
remove them.

Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
54aa1f4dfdacd60a19c4471220b24e581be6f774 22-Jun-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack

Signed-off-by: Chris Mason <chris.mason@oracle.com>
8c2383c3dd2cb5bb39598ce4fa97154bc591020a 18-Jun-2007 Chris Mason <chris.mason@oracle.com> Subject: Rework btrfs_file_write to only allocate while page locks are held

Signed-off-by: Chris Mason <chris.mason@oracle.com>
9ebefb180bad4914a31c4e1748ba187a30e1e990 15-Jun-2007 Chris Mason <chris.mason@oracle.com> Btrfs: patch queue: page_mkwrite

Signed-off-by: Chris Mason <chris.mason@oracle.com>
f1ace244c8c1e16eaa5c8b3b5339849651e31ede 13-Jun-2007 Aneesh <aneesh.kumar@linux.vnet.ibm.com> btrfs: Code cleanup
Attaching below is some of the code cleanups that i came across while
reading the code.

a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as
the in memory copy ext4_inode as the disk copy

Signed-off-by: Chris Mason <chris.mason@oracle.com>
6cbd55707802b98843f953d1ae6d8f5bcd9a76c0 12-Jun-2007 Chris Mason <chris.mason@oracle.com> Btrfs: add GPLv2

Signed-off-by: Chris Mason <chris.mason@oracle.com>
5af3981c1878b0657b9babd2ef7ec98c2008cf2c 12-Jun-2007 Chris Mason <chris.mason@oracle.com> Btrfs: printk fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
84f54cfa78c81991e087309a9b379f25f1ffdb10 12-Jun-2007 Chris Mason <chris.mason@oracle.com> Btrfs: 64 bit div fixes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
1de037a43edf67f3a9f66dd197195b3c08febb16 29-May-2007 Chris Mason <chris.mason@oracle.com> Btrfs: fixup various fsx failures

Signed-off-by: Chris Mason <chris.mason@oracle.com>
3a686375629da5d2e2ad019265b66ef113c87455 24-May-2007 Chris Mason <chris.mason@oracle.com> Btrfs: sparse files!

Signed-off-by: Chris Mason <chris.mason@oracle.com>
509659cde578d891445afd67d87121dd13e71596 10-May-2007 Chris Mason <chris.mason@oracle.com> Btrfs: switch to crc32c instead of sha256

Signed-off-by: Chris Mason <chris.mason@oracle.com>
236454dfffb64a95ee01c50a215153f5de61c475 19-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: many file_write fixes, inline data

Signed-off-by: Chris Mason <chris.mason@oracle.com>
a429e51371eee3c989160c003ee40bc3947c6a76 18-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: working file_write, reorganized key flags

Signed-off-by: Chris Mason <chris.mason@oracle.com>
70b2befd0c8a4064715d8b340270650cc9d15af8 17-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: rework csums and extent item ordering

Signed-off-by: Chris Mason <chris.mason@oracle.com>
b18c6685810af8e6763760711aece31ccc7a8ea8 17-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: progress on file_write

Signed-off-by: Chris Mason <chris.mason@oracle.com>
6567e837df07e43bffc08ac40858af8133a007bf 16-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: early work to file_write in big extents

Signed-off-by: Chris Mason <chris.mason@oracle.com>
d0dbc6245cefa36e19dff49c557ccf05e3063e9c 10-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: drop owner and parentid

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2da566edd8ce32ae2952b863ee634bcc2e7d63c1 02-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: csum_verify_file_block locking fix

Signed-off-by: Chris Mason <chris.mason@oracle.com>
5caf2a002901f0fde475371c4bf1c553b51884af 02-Apr-2007 Chris Mason <chris.mason@oracle.com> Btrfs: dynamic allocation of path struct

Signed-off-by: Chris Mason <chris.mason@oracle.com>
d6025579531b7ea170ba283b171ff7a6bf7d0e12 30-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: corruption hunt continues

Signed-off-by: Chris Mason <chris.mason@oracle.com>
f254e52c1ce550fdaa0d31f5e068f0d67c2485d4 29-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: verify csums on read

Signed-off-by: Chris Mason <chris.mason@oracle.com>
9773a788681db1f5c2701b7433737fdca61a14ba 27-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: byte offsets for file keys

Signed-off-by: Chris Mason <chris.mason@oracle.com>
71951f35a6e413f2bfbd41829af8cf10a890aeb6 27-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: add generation field to file extent

Signed-off-by: Chris Mason <chris.mason@oracle.com>
dee26a9f7aab7ffe1193cd1415b23a69426acc9f 26-Mar-2007 Chris Mason <chris.mason@oracle.com> btrfs_get_block, file read/write

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2e635a278354a1a7951e16cfea4c247d6d0e7c99 21-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: initial move to kernel module land

Signed-off-by: Chris Mason <chris.mason@oracle.com>
9f5fae2fe6dc35b46bf56183f11398451851cb3f 20-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: Add inode map, and the start of file extent items

Signed-off-by: Chris Mason <chris.mason@oracle.com>
1e1d27017c5986c1ea81181506042cf9cba3f6ea 16-Mar-2007 Chris Mason <chris.mason@oracle.com> Btrfs: add inode item

Signed-off-by: Chris Mason <chris.mason@oracle.com>