History log of /fs/ceph/inode.c
Revision Date Author Comments
a4483e8a424d76bc1dfacdd94e739fba29d7f83f 17-Sep-2014 Chao Yu <chao2.yu@samsung.com> ceph: remove redundant code for max file size verification

Both ceph_update_writeable_page and ceph_setattr will verify file size
with max size ceph supported.
There are two callers of ceph_update_writeable_page, ceph_write_begin and
ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in
generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no
chance to change file size when mmap. Likewise we have already verified the size
in inode_change_ok when we call ceph_setattr.
So let's remove the redundant code for max file size verification.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
508b32d8661b12da4c9ca41a9b2054e1dc92fa7e 16-Sep-2014 Yan, Zheng <zyan@redhat.com> ceph: request xattrs if xattr_version is zero

Following sequence of events can happen.
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply
doesn't contain xattrs because MDS didn't receive the cap release
message and thought client already has up-to-date xattrs.

The fix is to force sending a getattr request to MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client
does not have xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
4e217b5dc87042b0fa52b11f491c4ded1863823a 07-Jun-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: use truncate_pagecache() instead of truncate_inode_pages()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
f3ae1b97be14ff10da8f02309ba04bed2ba035bc 06-Jun-2014 Fabian Frederick <fabf@skynet.be> fs/ceph: replace pr_warning by pr_warn

Update the last pr_warning callsites in fs branch

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8d08503c130e96e3794f66fe47053051460b1584 18-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: remember subtree root dirfrag's auth MDS

remember dirfrag's auth MDS when it's different from its parent inode's
auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
3e7fbe9cebfdaac380419507908e10c499ddd25b 18-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: introduce ceph_fill_fragtree()

Move the code that updates the i_fragtree into a separate function.
Also add a simple probabilistic test to decide whether the i_fragtree
should be updated.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
d9df2783507943316b305e177e5b1c157200c76f 18-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: pre-allocate ceph_cap struct for ceph_add_cap()

So that ceph_add_cap() can be used while i_ceph_lock is locked.
This simplifies the code that handles cap import/export.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
f98a128a55ff85d0087de89f304f10bd75e792aa 17-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: update inode fields according to issued caps

Cap messages and request replies from a non-auth MDS may carry stale
information (corresponding locks are in LOCK states) even if they
have the newest inode version. So the client should update inode fields
according to issued caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
0a8a70f96fe1bd3e07c15bb86fd247e76102398a 14-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: clear directory's completeness when creating file

When creating a file, ceph_set_dentry_offset() puts the new dentry
at the end of directory's d_subdirs, then set the dentry's offset
based on directory's max offset. The offset does not reflect the
real position of the dentry in directory. Later readdir reply from
MDS may change the dentry's position/offset. This inconsistency
can cause missing/duplicate entries in readdir result if readdir
is partly satisfied by dcache_readdir().

The fix is to clear directory's completeness after creating/renaming
file. It prevents later readdir from using dcache_readdir().

Fixes: http://tracker.ceph.com/issues/8025
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
48193012873e341f08de48304e32d0499b96c60b 01-Apr-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: don't grabs open file reference for aborted request

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
5f75ce57818e4a48bdeac0b76daeb434eea26059 21-Mar-2014 Fabian Frederick <fabf@skynet.be> ceph: Remove get/set acl on symlinks

Remove unsupported symlink operations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
8c93cd610c6c5a4c0dddfc6fe906814331b3af87 08-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: update i_max_size even if inode version does not change

handle following sequence of events:
- client releases an inode with i_max_size > 0. The release message
is queued. (is not sent to the auth MDS)
- a 'lookup' request reply from non-auth MDS returns the same inode.
- client opens the inode in write mode. The version of inode trace
in 'open' request reply is equal to the cached inode's version.
- client requests new max size. The MDS ignores the request because
it does not affect client's write range

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
19913b4eac4a230dccb548931358398f45dabe4c 06-Mar-2014 Yan, Zheng <zheng.z.yan@intel.com> ceph: add get_name() NFS export callback

Use the newly introduced LOOKUPNAME MDS request to connect child
inode to its parent directory.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
752c8bdcfe88f27a17c5c9264df928fd145a4b30 05-Feb-2013 Sage Weil <sage@inktank.com> ceph: do not chain inode updates to parent fsync

The fsync(dirfd) only covers namespace operations, not inode updates.
We do not need to cover setattr variants or O_TRUNC.

Reported-by: Al Viro <viro@xeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
72466d0b92e04a7e0e5abf74c86eb352225346e4 29-Jan-2014 Sage Weil <sage@inktank.com> ceph: fix posix ACL hooks

The merge of commit 7221fe4c2ed7 ("ceph: add acl for cephfs") raced with
upstream changes in the generic POSIX ACL code (eg commit 2aeccbe957d0
"fs: add generic xattr_acl handlers" and others).

Some of the fallout was fixed in commit 4db658ea0ca ("ceph: Fix up after
semantic merge conflict"), but it was incomplete: the set_acl
inode_operation wasn't getting set, and the prototype needed to be
adjusted a bit (it doesn't take a dentry anymore).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4db658ea0ca2312b5d168230476ec7729385aefe 29-Jan-2014 Linus Torvalds <torvalds@linux-foundation.org> ceph: Fix up after semantic merge conflict

The previous ceph-client merge resulted in ceph not even building,
because there was a merge conflict that wasn't visible as an actual data
conflict: commit 7221fe4c2ed7 ("ceph: add acl for cephfs") added support
for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change
a lot of the POSIX ACL helper functions to be much more helpful to
filesystems (see for example commits 2aeccbe957d0 "fs: add generic
xattr_acl handlers", 5bf3258fd2ac "fs: make posix_acl_chmod more useful"
and 37bc15392a23 "fs: make posix_acl_create more useful")

The reason this conflict wasn't obvious was many-fold: because it was a
semantic conflict rather than a data conflict, it wasn't visible in the
git merge as a conflict. And because the VFS tree hadn't been in
linux-next, people hadn't become aware of it that way. And because I
was at jury duty this morning, I was using my laptop and as a result not
doing constant "allmodconfig" builds.

Anyway, this fixes the build and generally removes a fair chunk of the
Ceph POSIX ACL support code, since the improved helpers seem to match
really well for Ceph too. But I don't actually have any way to *test*
the end result, and I was really hoping for some ACK's for this. Oh,
well.

Not compiling certainly doesn't make things easier to test, so I'm
committing this without the acks after having waited for four hours...
Plus it's what I would have done for the merge had I noticed the
semantic conflict..

Reported-by: Dave Jones <davej@redhat.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Guangliang Zhao <lucienchao@gmail.com>
Cc: Li Wang <li.wang@ubuntykylin.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11df2dfb610d68e8050c2183c344b1002351a99d 24-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: add imported caps when handling cap export message

Version 3 cap export message includes information about the imported
caps. It allows us to add the imported caps if the corresponding cap
import message still hasn't been received.

This allows us to handle the situation where the importer MDS crashes and
the cap import message is missing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
9563f88c1fa01341d125e396edc654a8dbcab2d2 22-Nov-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: fix cache revoke race

handle following sequence of events:

- non-auth MDS revokes Fc cap. queue invalidate work
- auth MDS issues Fc cap through request reply. i_rdcache_gen gets
increased.
- invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen,
so it does nothing.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
7221fe4c2ed72804b28633c8e0217d65abb0023f 11-Nov-2013 Guangliang Zhao <lucienchao@gmail.com> ceph: add acl for cephfs

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: Li Wang <li.wang@ubuntykylin.com>
Reviewed-by: Zheng Yan <zheng.z.yan@intel.com>
9f12bd119e408388233e7aeb1152f372a8b5dcad 20-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: drop unconnected inodes

A positive dentry and its corresponding inode always accompany each other in an MDS reply.
So no need to keep inode in the cache after dropping all its aliases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
86b58d13134ef14f09f8c8f37797ccc37cf823a3 04-Dec-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: initialize inode before instantiating dentry

commit b18825a7c8 (Put a small type field into struct dentry::d_flags)
put a type field into struct dentry::d_flags. __d_instantiate() set the
field by checking inode->i_mode. So we should initialize inode before
instantiating dentry when handling mds reply.

Fixes: http://tracker.ceph.com/issues/6930
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
81c6aea5275eae453719d7f3924da07e668265c5 18-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: handle frag mismatch between readdir request and reply

If the client has outdated directory fragment information, it may request
a readdir of a non-existent directory fragment. In this case, the MDS finds
an approximate directory fragment and sends its contents back to the
client. When receiving a reply with a fragment that is different from the
requested one, the client needs to reset the 'readdir offset'.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
53e879a485f9def0e55c404dbc7187470a01602d 18-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: remove outdated frag information

If directory fragments change, fill_inode() inserts new frags into
the fragtree, but it does not remove outdated frags from the fragtree.
This patch fixes it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
ed284c49f61165c3ba1b4e6969d1cc30a769c31b 02-Sep-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: remove ceph_lookup_inode()

commit 6f60f889 (ceph: fix freeing inode vs removing session caps race)
introduced ceph_lookup_inode(). But there is already a ceph_find_inode()
which provides similar function. So remove ceph_lookup_inode(), use
ceph_find_inode() instead.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <alex.elder@linary.org>
Reviewed-by: Sage Weil <sage@inktank.com>
99ccbd229cf7453206bc858e795ec1f0345ff258 21-Aug-2013 Milosz Tanski <milosz@adfin.com> ceph: use fscache as a local presisent cache

Add support for fscache to the Ceph filesystem. This brings it on
par with some of the other network filesystems in Linux (like NFS, AFS, etc.).

In order to mount the filesystem with fscache the 'fsc' mount option must be
passed.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: Sage Weil <sage@inktank.com>
b0d7c2231015b331b942746610a05b6ea72977ab 13-Aug-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: introduce i_truncate_mutex

I encountered below deadlock when running fsstress

vmtruncate work           truncate              MDS
----------------------    ------------------    --------------------------
                          lock i_mutex
                                                <- truncate file
lock i_mutex (blocked)
                                                <- revoking Fcb (filelock to MIX)
                          send request ->
                                                handle request (xlock filelock)

At the initial time, there are some dirty pages in the page cache.
When the kclient receives the truncate message, it reduces inode size
and creates some 'out of i_size' dirty pages. vmtruncate work can't
truncate these dirty pages because it's blocked by the i_mutex. Later
when the kclient receives the cap message that revokes Fcb caps, it
can't flush all dirty pages because writepages() only flushes dirty
pages within the inode size.

When the MDS handles the 'truncate' request from kclient, it waits
for the filelock to become stable. But the filelock is stuck in
unstable state because it can't finish revoking kclient's Fcb caps.

The truncate pagecache locking has already caused lots of trouble
for us. I think it's time to simplify it by introducing a new mutex.
We use the new mutex to prevent concurrent truncate_inode_pages().
There is no need to worry about race between buffered write and
truncate_inode_pages(), because our "get caps" mechanism prevents
them from concurrent execution.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
6f60f889470aecf747610279545c054a99aadca3 23-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: fix freeing inode vs removing session caps race

remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.

We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
85ce127a9adf5ab9e9d57ddf64c858927d5e546d 21-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: wake up writer if vmtruncate work get blocked

To write data, the writer first acquires the i_mutex, then tries to get
caps. The writer may sleep while holding the i_mutex. If the MDS revokes
Fb cap in this case, vmtruncate work can't do its job because i_mutex
is locked. We should wake up the writer and let it truncate the pages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
84d08fa888e7c2d53b5bbc764db2ef02968b499c 05-Jul-2013 Al Viro <viro@zeniv.linux.org.uk> helper for reading ->d_count

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
b415bf4f9fe25f39934f5c464125e4a2dffb6d08 01-Jul-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: fix pending vmtruncate race

The locking order for pending vmtruncate is wrong, it can lead to
following race:

write                       vmtruncate work
------------------------    ----------------------
lock i_mutex
check i_truncate_pending    check i_truncate_pending
truncate_inode_pages()      lock i_mutex (blocked)
copy data to page cache
unlock i_mutex
                            truncate_inode_pages()

The fix is to take i_mutex before calling __ceph_do_pending_vmtruncate().

Fixes: http://tracker.ceph.com/issues/5453
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
0b93267252ef5fe6c6d77e3013ed6a0d766352ad 07-Apr-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: fix symlink inode operations

add getattr/setattr and xattrs related methods.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
2f276c511137d97e56b19e29865e1e6569315ccb 13-Mar-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: use i_release_count to indicate dir's completeness

Current ceph code tracks directory's completeness in two places.
ceph_readdir() checks i_release_count to decide if it can set the
I_COMPLETE flag in i_ceph_flags. All other places check the I_COMPLETE
flag. This indirection introduces locking complexity.

This patch adds a new variable i_complete_count to ceph_inode_info.
Set i_release_count's value to it when marking a directory complete.
By comparing the two variables, we know if a directory is complete

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
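
The two-counter comparison described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: the struct and the helper names (dir_state, dir_release, dir_mark_complete, dir_is_complete) are hypothetical, plain longs stand in for whatever counter type the kernel uses, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the two counters in ceph_inode_info. */
struct dir_state {
	long release_count;   /* bumped whenever cached dentries may be dropped */
	long complete_count;  /* release_count snapshot taken when marked complete */
};

static void dir_init(struct dir_state *d)
{
	d->release_count = 1;
	d->complete_count = 0;  /* differs from release_count: not complete */
}

/* The cached view was invalidated; any earlier "complete" mark is now void. */
static void dir_release(struct dir_state *d)
{
	d->release_count++;
}

/* Mark complete using the release_count observed when the readdir started. */
static void dir_mark_complete(struct dir_state *d, long seen_release_count)
{
	assert(seen_release_count <= d->release_count);
	d->complete_count = seen_release_count;
}

/* The directory is complete only while both counters still match. */
static bool dir_is_complete(const struct dir_state *d)
{
	return d->complete_count == d->release_count;
}
```

In this sketch, a readdir that started before a release cannot mark the directory complete by accident: it stores a stale snapshot, and the comparison in dir_is_complete() fails.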
3f99969f42300e52779ae0656678c2534097f2ea 01-Mar-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: acquire i_mutex in __ceph_do_pending_vmtruncate

Make __ceph_do_pending_vmtruncate() acquire the i_mutex if the caller
does not hold the i_mutex, so ceph_aio_read() can call it safely.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
a8673d61ad77ddf2118599507bd40cc345e95368 18-Feb-2013 Yan, Zheng <zheng.z.yan@intel.com> ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag

commit c6ffe10015 moved the flag that tracks if the dcache contents
for a directory are complete to dentry. The problem is there are
lots of places that use ceph_dir_{set,clear,test}_complete() while
holding i_ceph_lock. but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().

This patch basically reverts that commit. ceph_d_prune() is
called with both the dentry to prune and the parent dentry
locked, so it's safe to access the parent dentry's d_inode and
clear the I_COMPLETE flag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
79f9f99ad1e3e19d4ac300573b51289e3ee8ba86 29-Jan-2013 Sage Weil <sage@inktank.com> ceph: prepopulate inodes only when request is aborted

If r_aborted is true, we do not hold the dir i_mutex, and cannot touch
the dcache. However, we still need to update the inodes with the state
returned by the MDS.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bd2bae6a66df9261a39e47291b0a6b00cd0831e0 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com> ceph: Convert kuids and kgids before printing them.

Before printing kuid and kgids values convert them into
the initial user namespace.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
ab871b903e9095772c219b512d9eae96c4663a5d 31-Jan-2013 Eric W. Biederman <ebiederm@xmission.com> ceph: Translate inode uid and gid attributes to/from kuids and kgids.

- In fill_inode() translate uids and gids in the initial user namespace
into kuids and kgids stored in inode->i_uid and inode->i_gid.

- In ceph_setattr() if they have changed convert inode->i_uid and
inode->i_gid into initial user namespace uids and gids for
transmission.

Cc: Sage Weil <sage@inktank.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
a85f50b6ef93fbbb2ae932ce9b2376509d172796 19-Nov-2012 Yan, Zheng <zheng.z.yan@intel.com> ceph: Fix __ceph_do_pending_vmtruncate

we should set i_truncate_pending to 0 after page cache is truncated
to i_truncate_size

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2744c171dbaa0b1ec7639e7d0ff817fba9461a38 27-Sep-2012 Al Viro <viro@zeniv.linux.org.uk> ceph: don't abuse d_delete() on failure exits

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
6c5e50fa614fea5325a2973be06f7ec6f1055316 22-Aug-2012 Sage Weil <sage@inktank.com> ceph: tolerate (and warn on) extraneous dentry from mds

If the MDS gives us a dentry and we weren't prepared to handle it,
WARN_ON_ONCE instead of crashing.

Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
810339ec2fae5cbd0164b8acde7fb65652755864 03-Feb-2012 Xi Wang <xi.wang@gmail.com> ceph: avoid panic with mismatched symlink sizes in fill_inode()

Return -EINVAL rather than panic if iinfo->symlink_len and inode->i_size
do not match.

Also use kstrndup rather than kmalloc/memcpy.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
b8cd952b51034ad9f20ca147507ee68dc641c98c 13-Dec-2011 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: dereference pointer after checking for NULL

moved dereference after BUG_ON

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
6b520e0565422966cdf1c3759bd73df77b0f248c 12-Dec-2011 Al Viro <viro@zeniv.linux.org.uk> vfs: fix the stupidity with i_dentry in inode destructors

Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
be655596b3de5873f994ddbe205751a5ffb4de39 30-Nov-2011 Sage Weil <sage@newdream.net> ceph: use i_ceph_lock instead of i_lock

We have been using i_lock to protect all kinds of data structures in the
ceph_inode_info struct, including lists of inodes that we need to iterate
over while avoiding races with inode destruction. That requires grabbing
a reference to the inode with the list lock protected, but igrab() now
takes i_lock to check the inode flags.

Changing the list lock ordering would be a painful process.

However, using a ceph-specific i_ceph_lock in the ceph inode instead of
i_lock is a simple mechanical change and avoids the ordering constraints
imposed by igrab().

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
15a2015fbc692e1c97d7ce12d96e077f5ae7ea6d 06-Nov-2011 Sage Weil <sage@newdream.net> ceph: fix iput race when queueing inode work

If we queue a work item that calls iput(), make sure we ihold() before
attempting to queue work. Otherwise our queued work might miraculously run
before we notice the queue_work() succeeded and call ihold(), allowing the
inode to be destroyed.

That is, instead of

    if (queue_work(...))
        ihold();

we need to do

    ihold();
    if (!queue_work(...))
        iput();

Reported-by: Amon Ott <a.ott@m-privacy.de>
Signed-off-by: Sage Weil <sage@newdream.net>
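
The ordering rule from the commit above (take the reference first, drop it again if queueing fails) can be modelled in a small userspace sketch. Everything here is an assumption made for illustration: fake_inode, fake_ihold, fake_iput and fake_queue_work are hypothetical stand-ins, not the kernel's ihold(), iput() or queue_work(), and the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode with a pending work item. */
struct fake_inode {
	int refs;      /* reference count; the object may be freed at 0 */
	bool queued;   /* true while a work item is pending */
};

static void fake_ihold(struct fake_inode *in)
{
	in->refs++;
}

static void fake_iput(struct fake_inode *in)
{
	assert(in->refs > 0);
	in->refs--;
}

/* Mimics queue_work(): returns false if the work was already pending. */
static bool fake_queue_work(struct fake_inode *in)
{
	if (in->queued)
		return false;
	in->queued = true;
	return true;
}

/* Take the reference before queueing, so the work can never run without one. */
static void queue_inode_work(struct fake_inode *in)
{
	fake_ihold(in);
	if (!fake_queue_work(in))
		fake_iput(in);  /* already pending: that item holds its own ref */
}

/* The work item drops the reference that was taken on its behalf. */
static void run_inode_work(struct fake_inode *in)
{
	in->queued = false;
	fake_iput(in);
}
```

With this ordering, each pending work item owns exactly one reference, regardless of how many times queue_inode_work() is called before the work runs.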
c6ffe10015f4e6fba8a915318b319c43aed1836f 03-Nov-2011 Sage Weil <sage@newdream.net> ceph: use new D_COMPLETE dentry flag

We used to use a flag on the directory inode to track whether the dcache
contents for a directory were a complete cached copy. Switch to a dentry
flag CEPH_D_COMPLETE that is safely updated by ->d_prune().

Signed-off-by: Sage Weil <sage@newdream.net>
bfe8684869601dacfcb2cd69ef8cfd9045f62170 28-Oct-2011 Miklos Szeredi <mszeredi@suse.cz> filesystems: add set_nlink()

Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
83eaea22bdfc9e1cec88f81be5b64f30f6c37e8b 24-Aug-2011 Sage Weil <sage@newdream.net> Revert "ceph: don't truncate dirty pages in invalidate work thread"

This reverts commit c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6.

We need to block and truncate all pages in order to reliably invalidate
them. Otherwise, we could:

- have some uptodate pages in the cache
- queue an invalidate
- write(2) locks some pages
- invalidate_work skips them
- write(2) only overwrites part of the page
- page now dirty and uptodate
-> partial leakage of invalidated data

It's not entirely clear why we started skipping locked pages in the first
place. I just ran this through fsx and didn't see any problems.

Signed-off-by: Sage Weil <sage@newdream.net>
4f1772645296a230e04f5c53e79cfb6f841ce634 26-Jul-2011 Sage Weil <sage@newdream.net> ceph: document locking for ceph_set_dentry_offset

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
5f21c96dd5c615341963036ae8f5e4f5227a818d 26-Jul-2011 Sage Weil <sage@newdream.net> ceph: protect access to d_parent

d_parent is protected by d_lock: use it when looking up a dentry's parent
directory inode. Also take a reference and drop it in the caller to avoid
a use-after-free.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
dfabbed6fdd509dc2beb89c954bc36014a1bc7cb 26-Jul-2011 Sage Weil <sage@newdream.net> ceph: set dir complete frag after adding capability

Currently ceph_add_cap clears the complete bit if we are newly issued the
FILE_SHARED cap, which is normally the case for a newly issued cap on a new
directory. That means we clear the just-set bit. Move the check that sets
the flag to after the cap is added/updated.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2f90b852e3ae73889d7f6de6ecf429b9b6a6b103 26-Jul-2011 Sage Weil <sage@newdream.net> ceph: ignore lease mask

The lease mask is no longer used (and it changed a while back). Instead,
use a non-zero duration to indicate that there is a lease being issued.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
10556cb21a0d0b24d95f00ea6df16f599a3345b2 21-Jun-2011 Al Viro <viro@zeniv.linux.org.uk> ->permission() sanitizing: don't pass flags to ->permission()

not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2830ba7f34ebb27c4e5b8b6ef408cd6d74860890 21-Jun-2011 Al Viro <viro@zeniv.linux.org.uk> ->permission() sanitizing: don't pass flags to generic_permission()

redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
178ea73521d64ba41d7aa5488fb9f549c6d4507d 20-Jun-2011 Al Viro <viro@zeniv.linux.org.uk> kill check_acl callback of generic_permission()

its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
70b666c3b4cb2b96098d80e6f515e4bc6d37db5a 27-May-2011 Sage Weil <sage@newdream.net> ceph: use ihold when we already have an inode ref

We should use ihold whenever we already have a stable inode ref, even
when we aren't holding i_lock. This avoids adding new and unnecessary
locking dependencies.

Signed-off-by: Sage Weil <sage@newdream.net>
d3d0720d4a7a46e93e055e5b0f1a8bd612743ed6 11-May-2011 Henry C Chang <henry.cy.chang@gmail.com> ceph: do not use i_wrbuffer_ref as refcount for Fb cap

We increment i_wrbuffer_ref when taking the Fb cap. This breaks
the dirty page accounting and causes looping in
__ceph_do_pending_vmtruncate, and ceph client hangs.

This bug can be reproduced occasionally by running blogbench.

Add a new field i_wb_ref to inode and dedicate it to Fb reference
counting.

Signed-off-by: Henry C Chang <henry.cy.chang@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
fca65b4ad72d28cbb43a029114d04b89f06faadb 04-May-2011 Sage Weil <sage@newdream.net> ceph: do not call __mark_dirty_inode under i_lock

The __mark_dirty_inode helper now takes i_lock as of 250df6ed. Fix the
one ceph caller that held i_lock (__ceph_mark_dirty_caps) to return the
flags value so that the callers can do it outside of i_lock.

Signed-off-by: Sage Weil <sage@newdream.net>
ad1fee96cbaf873520064252c5dc3212c9844861 22-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: add ino32 mount option

The ino32 mount option forces the ceph fs to report 32 bit
ino values. This is useful for 64 bit kernels with 32 bit userspace.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
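
One plausible way to report a 32-bit value for a 64-bit inode number is to fold the high half into the low half. The sketch below is an assumption for illustration, not a copy of the kernel code: the name fold_ino32 is hypothetical, and both the XOR fold and the fallback value used when the result is zero are choices made here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: fold a 64-bit inode number into 32 bits by
 * XOR-ing the high half into the low half. A result of 0 is replaced
 * with a non-zero value (2 is an arbitrary choice in this sketch),
 * because many userspace tools treat inode number 0 as "no entry".
 */
static uint32_t fold_ino32(uint64_t ino)
{
	uint32_t folded = (uint32_t)(ino & 0xffffffffu) ^ (uint32_t)(ino >> 32);

	if (folded == 0)
		folded = 2;
	return folded;
}
```

Folding is lossy, so distinct 64-bit inode numbers can collide after the fold; that is the trade-off for compatibility with 32-bit userspace.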
09adc80c611bb8902daa8ccfe34dbbc009d6befe 05-Feb-2011 Sage Weil <sage@newdream.net> ceph: preserve I_COMPLETE across rename

d_move puts the renamed dentry at the end of d_subdirs, screwing with our
cached dentry directory offsets. We were just clearing I_COMPLETE to avoid
any possibility of trouble. However, assigning the renamed dentry an
offset at the end of the directory (to match its new d_subdirs position)
is sufficient to maintain correct behavior and hold onto I_COMPLETE.

This is especially important for workloads like rsync, which renames files
into place. Before, we would lose I_COMPLETE and do MDS lookups for each
file. With this patch we only talk to the MDS on create and rename.

Signed-off-by: Sage Weil <sage@newdream.net>
b545cc1505eb49247071ce9f4092665de788ca00 28-Feb-2011 Sage Weil <sage@newdream.net> ceph: do not set I_COMPLETE

Do not set the I_COMPLETE flag on directories until we resolve races with
dcache pruning.

Signed-off-by: Sage Weil <sage@newdream.net>
1c1266bb916e6a6b362d3be95f2cc7f3c41277a6 13-Jan-2011 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: fix getattr on directory when using norbytes

The norbytes mount option was broken: when doing getattr
on a directory it returned the rbytes instead of the number of
entities. This commit fixes it.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
14303d20f3ae3e6ab626c77a4aac202b3bafd377 15-Dec-2010 Sage Weil <sage@newdream.net> ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS

This implements the DIRLAYOUTHASH protocol feature, which passes the dir
layout over the wire from the MDS. This gives the client knowledge
of the correct hash function to use for mapping dentries among dir
fragments.

Note that if this feature is _not_ present on the client but is on the
MDS, the client may misdirect requests. This will result in a forward
and degrade performance. It may also result in inaccurate NFS filehandle
generation, which will prevent fh resolution when the inode is not present
in the client cache and the parent directories have been fragmented.

Signed-off-by: Sage Weil <sage@newdream.net>
6c0f3af72cb1622a66962a1180c36ef8c41be8e2 16-Nov-2010 Sage Weil <sage@newdream.net> ceph: add dir_layout to inode

Add a ceph_dir_layout to the inode, and calculate dentry hash values based
on the parent directory's specified dir_hash function. This is needed
because the old default Linux dcache hash function is extremely weak and
leads to a poor distribution of files among dir fragments.

Signed-off-by: Sage Weil <sage@newdream.net>
b74c79e99389cd79b31fcc08f82c24e492e63c7e 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: provide rcu-walk aware permission i_ops

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: icache RCU free inodes

RCU free the struct inode. This will allow:

- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.

The downside of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.

In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.

The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
b5c84bf6f6fa3a7dfdcb556023a62953574b60ee 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache remove dcache_lock

dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2fd6b7f50797f2e993eea59e0a0b8c6399c811dc 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache scale subdirs

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
b7ab39f631f505edc2bbdb86620d5493f995c9da 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache scale dentry refcount

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
451a3c24b0135bce54542009b5fde43846c7cf67 17-Nov-2010 Arnd Bergmann <arnd@arndb.de> BKL: remove extraneous #include <smp_lock.h>

The big kernel lock has been removed from all these files at some point,
leaving only the #include.

Remove this too as a cleanup.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b7495fc2ff941db6a118a93ab8d61149e3f4cef8 09-Nov-2010 Sage Weil <sage@newdream.net> ceph: make page alignment explicit in osd interface

We used to infer alignment of IOs within a page based on the file offset,
which assumed they matched. This broke with direct IO that was not aligned
to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
specified in the OSD reply, which could have been adjusted by the server.

Explicitly specify the page alignment when setting up OSD IO requests.
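
A minimal sketch of why inferring alignment from the file offset can go
wrong, assuming a 4096-byte page. DEMO_PAGE_SIZE and demo_page_offset() are
hypothetical names for illustration, not identifiers from the OSD interface.

```c
#include <stdint.h>

/* Assumed page size for this illustration only. */
#define DEMO_PAGE_SIZE 4096u

/* Offset of a value (file offset or buffer address) within its page. */
static inline unsigned int demo_page_offset(uint64_t v)
{
	return (unsigned int)(v & (DEMO_PAGE_SIZE - 1));
}
```

For a direct IO at file offset 512 from a page-aligned buffer, the offset
suggests an in-page alignment of 512 while the buffer itself starts at 0, so
the two cannot be assumed to match.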

Signed-off-by: Sage Weil <sage@newdream.net>
d8672d64b88cdb7aa8139fb6d218f40b8cbf60af 08-Nov-2010 Sage Weil <sage@newdream.net> ceph: fix update of ctime from MDS

The client can have a newer ctime than the MDS due to AUTH_EXCL and
XATTR_EXCL caps as well; update the check in ceph_fill_file_time
appropriately.

This fixes cases where ctime/mtime goes backward under the right sequence
of local updates (e.g. chmod) and mds replies (e.g. subsequent stat that
goes to the MDS).

Signed-off-by: Sage Weil <sage@newdream.net>
8bd59e0188c04f6540f00e13f633f22e4804ce06 08-Nov-2010 Sage Weil <sage@newdream.net> ceph: fix version check on racing inode updates

We may get updates on the same inode from multiple MDSs; generally we only
pay attention if the update is newer than what we already have. The
exception is when an MDS sends unstable information, in which case we
always update.

The old > check got this wrong when our version was odd (e.g. 3) and the
reply version was even (e.g. 2): the older stale (v2) info would be
applied. Fixed and clarified the comment.
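
A minimal userspace sketch of the comparison described above. It assumes the
check masks off the low bit of our cached version before comparing, and that
a reply version of 0 is never skipped; the function names are hypothetical
and not taken from inode.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only.
 * Assumption: the low bit of our version is masked before the comparison.
 * Both return true when the reply should be skipped as stale.
 */

/* Old behaviour: strict '>' lets an equal masked version through. */
static inline bool demo_skip_reply_old(uint64_t ours, uint64_t reply)
{
	return reply > 0 && (ours & ~(uint64_t)1) > reply;
}

/* Fixed behaviour: '>=' also skips a reply equal to our masked version. */
static inline bool demo_skip_reply_fixed(uint64_t ours, uint64_t reply)
{
	return reply > 0 && (ours & ~(uint64_t)1) >= reply;
}
```

With ours = 3 and reply = 2, the old check does not skip the reply, so the
stale v2 info is applied; the fixed check skips it.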

Signed-off-by: Sage Weil <sage@newdream.net>
cd045cb42a266882ac24bc21a3a8d03683c72954 04-Nov-2010 Sage Weil <sage@newdream.net> ceph: fix rdcache_gen usage and invalidate

We used to use rdcache_gen to indicate whether we "might" have cached
pages. Now we just look at the mapping to determine that. However, some
old behavior remains from that transition.

First, rdcache_gen == 0 no longer means we have no pages. That can happen
at any time (presumably when we carry FILE_CACHE). We should not reset it
to zero, and we should not check that it is zero.

That means that the only purpose for rdcache_revoking is to resolve races
between new issues of FILE_CACHE and an async invalidate. If they are
equal, we should invalidate. On success, we decrement rdcache_revoking,
so that it is no longer equal to rdcache_gen. Similarly, if we succeed
in doing a sync invalidate, set revoking = gen - 1. (This is a small
optimization to avoid doing unnecessary invalidate work and does not
affect correctness.)
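
The bookkeeping above can be sketched roughly as follows. This is a
userspace illustration under stated assumptions: struct demo_rdcache and the
demo_* helpers are hypothetical, and locking and the actual page
invalidation are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters discussed above. */
struct demo_rdcache {
	uint32_t gen;      /* assumed to advance on each new issue of FILE_CACHE */
	uint32_t revoking; /* generation an async invalidate was queued for */
};

/* The async invalidate only proceeds if no new issue raced with it. */
static inline bool demo_async_should_invalidate(const struct demo_rdcache *s)
{
	return s->revoking == s->gen;
}

/* After a successful async invalidate, make revoking != gen. */
static inline void demo_async_invalidate_done(struct demo_rdcache *s)
{
	s->revoking--;
}

/* After a successful sync invalidate, a queued async one becomes a no-op. */
static inline void demo_sync_invalidate_done(struct demo_rdcache *s)
{
	s->revoking = s->gen - 1;
}
```

If gen advances between queueing and running the async work, the equality
test fails and the freshly cached pages are left alone.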

Signed-off-by: Sage Weil <sage@newdream.net>
912a9b0319a8eb9e0834b19a25e01013ab2d6a9f 07-Nov-2010 Sage Weil <sage@newdream.net> ceph: only let auth caps update max_size

Only the auth MDS has a meaningful max_size value for us, so only update it
in fill_inode if we're being issued an auth cap. Otherwise, a random
stat result from a non-auth MDS can clobber a meaningful max_size, get
the client<->mds cap state out of sync, and make writes hang.

Specifically, even if the client re-requests a larger max_size (which it
will), the MDS won't respond because as far as it knows we already have a
sufficiently large value.
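
A minimal sketch of the rule, assuming the caller already knows whether the
reply carries an auth cap. struct demo_inode_state and demo_fill_max_size()
are hypothetical names, not the fill_inode code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-inode state holding only the field of interest. */
struct demo_inode_state {
	uint64_t max_size;
};

/* Accept max_size from a reply only when it comes with an auth cap. */
static inline void demo_fill_max_size(struct demo_inode_state *ci,
				      uint64_t reply_max_size,
				      bool issued_auth_cap)
{
	if (issued_auth_cap)
		ci->max_size = reply_max_size;
	/* Otherwise keep the value the auth MDS last gave us. */
}
```

A stat reply from a non-auth MDS then leaves the existing max_size
untouched instead of clobbering it.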

Signed-off-by: Sage Weil <sage@newdream.net>
d8b16b3d1c9d8d9124d647d05797383d35e2d645 06-Nov-2010 Sage Weil <sage@newdream.net> ceph: fix bad pointer dereference in ceph_fill_trace

We dereference *in a few lines down, but only set it on rename. It is
apparently pretty rare for this to trigger, but I have been hitting it
with clustered MDSs.

Signed-off-by: Sage Weil <sage@newdream.net>
3d14c5d2b6e15c21d8e5467dc62d33127c23a644 07-Apr-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: factor out libceph from Ceph file system

This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:

- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).

No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.

Signed-off-by: Sage Weil <sage@newdream.net>
467c525109d5d542d7d416b0c11bdd54610fe2f4 13-Sep-2010 Sage Weil <sage@newdream.net> ceph: fix dn offset during readdir_prepopulate

When adding the readdir results to the cache, ceph_set_dentry_offset was
clobbering our just-set offset. This can cause the readdir result offsets
to get out of sync with the server. Add an argument to the helper so
that it does not.

This bug was introduced by 1cd3935bedccf592d44343890251452a6dd74fc4.

Signed-off-by: Sage Weil <sage@newdream.net>
ac1f12ef569d49b013c3db86e11be7e15d66b1c3 25-Aug-2010 Dan Carpenter <error27@gmail.com> ceph: ceph_get_inode() returns an ERR_PTR

ceph_get_inode() returns an ERR_PTR and it doesn't return a NULL.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
124514918b030d74f1f3e15483b7bf3b85268082 23-Aug-2010 Sage Weil <sage@newdream.net> ceph: don't improperly set dir complete when holding EXCL cap

If we hold the EXCL cap, we cannot trust the dir stats from the MDS (num
files, subdirs) and must not incorrectly conclude that the directory is
empty. If we do, we can get bad results from lookup (bad ENOENT) and
bad readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>
2962507ca204f886967e1a089d9bec206d427c22 27-May-2010 Sage Weil <sage@newdream.net> ceph: perform lazy reads when file mode and caps permit

If the file mode is marked as "lazy," perform cached/buffered reads when
the caps permit it. Adjust the rdcache_gen and invalidation logic
accordingly so that we manage our cache based on the FILE_CACHE -or-
FILE_LAZYIO cap bits.

Signed-off-by: Sage Weil <sage@newdream.net>
03066f23452ff088ad8e2c8acdf4443043f35b51 27-Jul-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: use complete_all and wake_up_all

This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
8c696737aa61316a252c4514d09dd163f1464d33 22-Jul-2010 Sage Weil <sage@newdream.net> ceph: fix leak of dentry in ceph_init_dentry() error path

If we fail to allocate a ceph_dentry_info, don't leak the dn reference.

Signed-off-by: Sage Weil <sage@newdream.net>
d69ed05a80f23b25f06e73af9b7e701ce4900edc 21-Jun-2010 Sage Weil <sage@newdream.net> ceph: handle splice_dentry/d_materialize_unique error in readdir_prepopulate

Handle a splice_dentry failure (due to a d_materialize_unique error)
without crashing. (Also, report the error code.)

Signed-off-by: Sage Weil <sage@newdream.net>
13a4214cd9ec14d7b77e98bd3ee51f60f868a6e5 01-Jun-2010 Henry C Chang <henry_c_chang@tcloudcomputing.com> ceph: fix d_subdirs ordering problem

We misused list_move_tail() to order the dentry in d_subdirs.
This will screw up the d_subdirs order.

This bug can be reliably reproduced by:
1. mount ceph fs.
2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git
3. Run autogen.sh in ceph directory.
(Note: Errors only occur at the first time you run autogen.sh.)

Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
7e34bc524ecae3a04d8cc427ee76ddad826a937b 22-May-2010 Julia Lawall <julia@diku.dk> fs/ceph: Use ERR_CAST

Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of
the returned value is the same as the type of the enclosing function.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
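
The two rewrites can be illustrated with a userspace sketch. The ERR_PTR,
PTR_ERR, IS_ERR and ERR_CAST helpers below are simplified re-implementations
written for this example (assuming pointers fit in a long), and the demo_*
types and functions are hypothetical.

```c
#include <stdbool.h>

/* Simplified userspace stand-ins for the kernel helpers (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
static inline void *ERR_CAST(const void *ptr) { return (void *)ptr; }

/* Hypothetical types for illustration. */
struct demo_inode { int ino; };
struct demo_dentry { struct demo_inode *inode; };

static struct demo_inode demo_root = { 1 };

static inline struct demo_inode *demo_get_inode(bool fail)
{
	return fail ? ERR_PTR(-12) : &demo_root; /* -12 stands in for -ENOMEM */
}

/* Same pointer type in and out: return the error pointer as-is. */
static inline struct demo_inode *demo_get_inode_checked(bool fail)
{
	struct demo_inode *in = demo_get_inode(fail);

	if (IS_ERR(in))
		return in; /* rather than ERR_PTR(PTR_ERR(in)) */
	return in;
}

/* Different pointer type: ERR_CAST makes the intent explicit. */
static inline struct demo_dentry *demo_lookup(bool fail)
{
	static struct demo_dentry dn;
	struct demo_inode *in = demo_get_inode(fail);

	if (IS_ERR(in))
		return ERR_CAST(in); /* rather than ERR_PTR(PTR_ERR(in)) */
	dn.inode = in;
	return &dn;
}
```

The first function matches the first rule in the semantic patch (type of x
equals the return type), and the second matches the ERR_CAST rule.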

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
167c9e352deb7e25568c926c49c3eafad69cbe76 14-May-2010 Sage Weil <sage@newdream.net> ceph: use common helper for aborted dir request invalidation

We invalidate I_COMPLETE and dentry leases in two places: on aborted mds
request and on request replay. Use common helper to avoid duplicate code.

Signed-off-by: Sage Weil <sage@newdream.net>
1cd3935bedccf592d44343890251452a6dd74fc4 04-May-2010 Sage Weil <sage@newdream.net> ceph: set dn offset when spliced

We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry(). Notably, we should NOT assign an
offset when a dentry is first created and is still null.

BUG if we try to splice a non-null dentry (we shouldn't).

Signed-off-by: Sage Weil <sage@newdream.net>
1b7facc41b42c2ab904b2f88b64b1f8ca0ca6cb7 16-Apr-2010 Sage Weil <sage@newdream.net> ceph: don't clobber i_max_offset on already complete dir

This can screw up offsets assigned to new dentries and break dcache
readdir results.

Signed-off-by: Sage Weil <sage@newdream.net>
e8a7498715181ece36130335536e13733a5c3187 15-Apr-2010 Sage Weil <sage@newdream.net> ceph: skip set_dentry_offset work if directory not I_COMPLETE

Signed-off-by: Sage Weil <sage@newdream.net>
a6424e48c8d54a5795430b07c4487f1ed280df4e 29-Apr-2010 Sage Weil <sage@newdream.net> ceph: fix xattr dangling pointer / double free

If we use the xattr_blob, clear the pointer so we don't release the memory
at the bottom of the function.

Reported-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
Signed-off-by: Sage Weil <sage@newdream.net>
640ef79d27c81b7a3265a344ec1d25644dd463ad 26-Mar-2010 Cheng Renquan <crquan@gmail.com> ceph: use ceph_sb_to_client instead of ceph_client

ceph_sb_to_client and ceph_client are really identical, so we need to drop
one. The function ceph_client is easily confused with "struct ceph_client",
and ceph_sb_to_client's definition is clearer, so switch all callers to
ceph_sb_to_client.

-static inline struct ceph_client *ceph_client(struct super_block *sb)
-{
- return sb->s_fs_info;
-}

Signed-off-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
81a6cf2d30eac5d790f53cdff110892f7b18c7fe 14-May-2010 Sage Weil <sage@newdream.net> ceph: invalidate affected dentry leases on aborted requests

If we abort a request, we return to caller, but the request may still
complete. And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request. A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').

Ensure we invalidate affected dentry leases (as well as dir I_COMPLETE) so
we don't have valid (but incorrect) leases. Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.

Signed-off-by: Sage Weil <sage@newdream.net>
04d000eb358919043da538f197d63f2a5924a525 07-May-2010 Sage Weil <sage@newdream.net> ceph: fix open file counting on snapped inodes when mds returns no caps

It's possible the MDS will not issue caps on a snapped inode, in which case
an open request may not __ceph_get_fmode(), botching the open file
counting. (This is actually a server bug, but the client shouldn't BUG out
in this case.)

Signed-off-by: Sage Weil <sage@newdream.net>
c10f5e12bafde7f7a2f9b75d76f7a68d62154e91 16-Apr-2010 Sage Weil <sage@newdream.net> ceph: clear dir complete on d_move

d_move() reorders the d_subdirs list, breaking the readdir result caching.
Unless/until d_move preserves that ordering, clear CEPH_I_COMPLETE on
rename.

Signed-off-by: Sage Weil <sage@newdream.net>
9358c6d4c0264b1572554c49c4b92673ea9a5c72 30-Mar-2010 Sage Weil <sage@newdream.net> ceph: fix dentry rehashing on virtual .snap dir

If a lookup fails on the magic .snap directory, we bind it to a magic
snap directory inode in ceph_lookup_finish(). That code assumes the dentry
is unhashed, but a recent server-side change started returning NULL leases
on lookup failure, causing the .snap dentry to be hashed and NULL by
ceph_fill_trace().

This causes dentry hash chain corruption, or dies when d_rehash()
includes
BUG_ON(!d_unhashed(entry));

So, avoid processing the NULL dentry lease if the dentry matches the
snapdir name in ceph_fill_trace(). That allows the lookup completion to
properly bind it to the snapdir inode. BUG there if dentry is hashed to
be sure.

Signed-off-by: Sage Weil <sage@newdream.net>
8b218b8a4a65bf4e304ae8690cadb9100ef029c0 09-Mar-2010 Sage Weil <sage@newdream.net> ceph: fix inode removal from snap realm when racing with migration

When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero such that issued caps were non-zero and
__ceph_is_any_caps(ci) was true. This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.

Fix this by dropping any residual i_snap_realm ref in destroy_inode.

Signed-off-by: Sage Weil <sage@newdream.net>
c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6 19-Feb-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: don't truncate dirty pages in invalidate work thread

Instead of truncating the whole range of pages, we skip those
pages that are dirty or in the middle of writeback. Those pages
will be cleared later when the writeback completes.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2c27c9a57c93a0757b9b4b0e7dc1abeaf1db1ce2 18-Feb-2010 Sage Weil <sage@newdream.net> ceph: fix typo in ceph_queue_writeback debug output

Signed-off-by: Sage Weil <sage@newdream.net>
3c6f6b79a64db7f1c7abf09d693db3b0066784fb 10-Feb-2010 Sage Weil <sage@newdream.net> ceph: cleanup async writeback, truncation, invalidate helpers

Grab inode ref in helper. Make work functions static, with consistent
naming.

Signed-off-by: Sage Weil <sage@newdream.net>
3d497d858ae6e5f23a28783030aecc69074e102d 09-Feb-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: fix truncation when not holding caps

A truncation should occur when either we have the
specified caps for the file, or (in cases where we are
not the only ones referencing the file) when it is mapped
or when it is opened. The latter two cases were not
handled.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
0f26c4b21b684825a6dd41f2bc04d48ff62d72f8 29-Jan-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: remove unreachable code

We never truncate to a smaller size without contacting the MDS.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
5b1daecd59f95eb24dc629407ed80369c9929520 25-Jan-2010 Sage Weil <sage@newdream.net> ceph: properly handle aborted mds requests

Previously, if the MDS request was interrupted, we would unregister the
request and ignore any reply. This could cause the caps or other cache
state to become out of sync. (For instance, aborting dbench and doing
rm -r on clients would complain about a non-empty directory because the
client didn't realize its aborted file create request completed.)

Even if we don't unregister, we still can't process the reply normally because
we are no longer holding the caller's locks (like the dir i_mutex).

So, mark aborted operations with r_aborted, and in the reply handler, be
sure to process all the caps. Do not process the namespace changes,
though, since we no longer will hold the dir i_mutex. The dentry lease
state can also be ignored as it's more forgiving.

Signed-off-by: Sage Weil <sage@newdream.net>
4baa75ef0ed29adae03fcbbaa9aca1511a5a8cc9 08-Jan-2010 Yehuda Sadeh <yehuda@hq.newdream.net> ceph: change dentry offset and position after splice_dentry

This fixes a bug where the parent list could hold dentries whose
offsets were not monotonically increasing, which caused the ceph
dcache_readdir to skip entries.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
c4a29f26d50bea65809ca670992108a33aa2efa6 21-Dec-2009 Sage Weil <sage@newdream.net> ceph: ensure rename target dentry fails revalidation

This works around a bug in vfs_rename_dir() that rehashes the target
dentry. Ensure such dentries always fail revalidation by timing out the
dentry lease and kicking it out of the current directory lease gen.

This can be reverted when the vfs bug is fixed.

Signed-off-by: Sage Weil <sage@newdream.net>
b6c1d5b81ea0841ae9d3ce2cda319ab986b081cf 07-Dec-2009 Sage Weil <sage@newdream.net> ceph: simplify ceph_buffer interface

We never allocate the ceph_buffer and buffer separately, so use a single
constructor.

Disallow put on NULL buffer; make the caller check.

Signed-off-by: Sage Weil <sage@newdream.net>
b377ff13b31778c19203f3089d14080beb40a692 12-Nov-2009 Sage Weil <sage@newdream.net> ceph: initialize i_size/i_rbytes on snapdir

Signed-off-by: Sage Weil <sage@newdream.net>
232d4b01319767b3ffa5d08962a81c805962be49 21-Oct-2009 Sage Weil <sage@newdream.net> ceph: move directory size logic to ceph_getattr

We can't fill i_size with rbytes at the fill_file_size stage without
adding additional checks for directories. Notably, we want st_blocks
to remain 0 on directories so that 'du' still works.

Fill in i_blocks, i_size specially in ceph_getattr instead.

Signed-off-by: Sage Weil <sage@newdream.net>
355da1eb7a1f91c276b991764e951bbcd8047599 06-Oct-2009 Sage Weil <sage@newdream.net> ceph: inode operations

Inode cache and inode operations. We also include routines to
incorporate metadata structures returned by the MDS into the client
cache, and some helpers to deal with file capabilities and metadata
leases. The bulk of that work is done by fill_inode() and
fill_trace().

Signed-off-by: Sage Weil <sage@newdream.net>