History log of /fs/file_table.c
Revision Date Author Comments
a457606a6f81cfddfc9da1ef2a8bf2c65a8eb35e 12-Oct-2014 Eric Biggers <ebiggers3@gmail.com> fs/file_table.c: Update alloc_file() comment

This comment is 5 years outdated; init_file() no longer exists.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
908c7f1949cb7cc6e92ba8f18f2998e87e265b8e 08-Sep-2014 Tejun Heo <tj@kernel.org> percpu_counter: add @gfp to percpu_counter_init()

Percpu allocator now supports allocation mask. Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion. This is the one with
the most users. Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
1f7e0616cd4f5df594595153c3a01bbb16072380 06-Jun-2014 Joe Perches <joe@perches.com> fs: convert use of typedef ctl_table to struct ctl_table

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
293bc9822fa9b3c9d4b7893bcb241e085580771a 12-Feb-2014 Al Viro <viro@zeniv.linux.org.uk> new methods: ->read_iter() and ->write_iter()

Beginning to introduce those. Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
7f7f25e82d54870df24d415a7007fbd327da027b 11-Feb-2014 Al Viro <viro@zeniv.linux.org.uk> replace checking for ->read/->aio_read presence with check in ->f_mode

Since we are about to introduce new methods (read_iter/write_iter), the
tests in a bunch of places would have to grow inconveniently. Check
once (at open() time) and store results in ->f_mode as FMODE_CAN_READ
and FMODE_CAN_WRITE resp. It might end up being a temporary measure -
once everything switches from ->aio_{read,write} to ->{read,write}_iter
it might make sense to return to open-coded checks. We'll see...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
7f4b36f9bb930b3b2105a9a2cb0121fa7028c432 14-Mar-2014 Al Viro <viro@zeniv.linux.org.uk> get rid of files_defer_init()

the only thing it's doing these days is calculation of
upper limit for fs.nr_open sysctl and that can be done
statically

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
83f936c75e3689a63253d89c47a4d239c56d7410 14-Mar-2014 Al Viro <viro@zeniv.linux.org.uk> mark struct file that had write access grabbed by open()

new flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker. Allows to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4597e695b8baa3e2620da89c7593be70cf20566b 14-Mar-2014 Al Viro <viro@zeniv.linux.org.uk> get rid of DEBUG_WRITECOUNT

it only makes control flow in __fput() and friends more convoluted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dd20908a8a06b22c171f6c3fcdbdbd65bed07505 14-Mar-2014 Al Viro <viro@zeniv.linux.org.uk> don't bother with {get,put}_write_access() on non-regular files

it's pointless and actually leads to wrong behaviour in at least one
moderately convoluted case (pipe(), close one end, try to get to
another via /proc/*/fd and run into ETXTBUSY).

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
78ed8a13382b1354e95d0f2233577eba15cb8171 03-Feb-2014 Jeff Layton <jlayton@redhat.com> locks: rename locks_remove_flock to locks_remove_file

This function currently removes leases in addition to flock locks and in
a later patch we'll have it deal with file-private locks too. Rename it
to locks_remove_file to indicate that it removes locks that are
associated with a particular struct file, and not just flock locks.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
9c225f2655e36a470c4f58dbbc99244c5fc7f2d4 03-Mar-2014 Linus Torvalds <torvalds@linux-foundation.org> vfs: atomic f_pos accesses as per POSIX

Our write() system call has always been atomic in the sense that you get
the expected thread-safe contiguous write, but we haven't actually
guaranteed that concurrent writes are serialized wrt f_pos accesses, so
threads (or processes) that share a file descriptor and use "write()"
concurrently would quite likely overwrite each others data.

This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says:

"2.9.7 Thread Interactions with Regular File Operations

All of the following functions shall be atomic with respect to each
other in the effects specified in POSIX.1-2008 when they operate on
regular files or symbolic links: [...]"

and one of the effects is the file position update.

This unprotected file position behavior is not new behavior, and nobody
has ever cared. Until now. Yongzhi Pan reported unexpected behavior to
Michael Kerrisk that was due to this.

This resolves the issue with a f_pos-specific lock that is taken by
read/write/lseek on file descriptors that may be shared across threads
or processes.

Reported-by: Yongzhi Pan <panyongzhi@gmail.com>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
eee5cc2702929fd41cce28058dc6d6717f723f87 04-Oct-2013 Al Viro <viro@zeniv.linux.org.uk> get rid of s_files and files_lock

The only thing we need it for is alt-sysrq-r (emergency remount r/o)
and these days we can do just as well without going through the
list of files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
72c2d53192004845cbc19cd8a30b3212a9288140 22-Sep-2013 Al Viro <viro@zeniv.linux.org.uk> file->f_op is never NULL...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
c7314d74fcb089b127ef5753b5263ac8473f33bc 20-Oct-2013 Al Viro <viro@zeniv.linux.org.uk> nfsd regression since delayed fput()

Background: nfsd v[23] had throughput regression since delayed fput
went in; every read or write ends up doing fput() and we get a pair
of extra context switches out of that (plus quite a bit of work
in queue_work itselfi, apparently). Use of schedule_delayed_work()
gives it a chance to accumulate a bit before we do __fput() on all
of them. I'm not too happy about that solution, but... on at least
one real-world setup it reverts about 10% throughput loss we got from
switch to delayed fput.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
be49b30a98fe7e20f898fcfe7b6c082700fb96e8 11-Sep-2013 Andrew Morton <akpm@linux-foundation.org> fs/file_table.c:fput(): make comment more truthful

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
184cacabe274a0af65b3876e9cd95c9fdde069ea 30-Aug-2013 Al Viro <viro@zeniv.linux.org.uk> only regular files with FMODE_WRITE need to be on s_files

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4f5e65a1cc90bbb15b9f6cdc362922af1bcc155a 08-Jul-2013 Oleg Nesterov <oleg@redhat.com> fput: turn "list_head delayed_fput_list" into llist_head

fput() and delayed_fput() can use llist and avoid the locking.

This is unlikely path, it is not that this change can improve
the performance, but this way the code looks simpler.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
64372501e2af9b11e2ffd1ff79345dc4b1abe539 08-Jul-2013 Andrew Morton <akpm@linux-foundation.org> fs/file_table.c:fput(): add comment

A missed update to "fput: task_work_add() can fail if the caller has
passed exit_task_work()".

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
c77cecee52e9b599da1f8ffd9170d4374c99a345 14-Jun-2013 David Howells <dhowells@redhat.com> Replace a bunch of file->dentry->d_inode refs with file_inode()

Replace a bunch of file->dentry->d_inode refs with file_inode().

In __fput(), use file->f_inode instead so as not to be affected by any tricks
that file_inode() might grow.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
e7b2c4069252732d52f1de6d1f7c82d99a156659 14-Jun-2013 Oleg Nesterov <oleg@redhat.com> fput: task_work_add() can fail if the caller has passed exit_task_work()

fput() assumes that it can't be called after exit_task_work() but
this is not true, for example free_ipc_ns()->shm_destroy() can do
this. In this case fput() silently leaks the file.

Change it to fallback to delayed_fput_work if task_work_add() fails.
The patch looks complicated but it is not, it changes the code from

if (PF_KTHREAD) {
schedule_work(...);
return;
}
task_work_add(...)

to
if (!PF_KTHREAD) {
if (!task_work_add(...))
return;
/* fallback */
}
schedule_work(...);

As for shm_destroy() in particular, we could make another fix but I
think this change makes sense anyway. There could be another similar
user, it is not safe to assume that task_work_add() can't fail.

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dd37978c50bc8b354e5c4633f69387f16572fdac 02-Mar-2013 Al Viro <viro@zeniv.linux.org.uk> cache the value of file_inode() in struct file

Note that this thing does *not* contribute to inode refcount;
it's pinned down by dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
39b652527457452f09b35044fb4f8b3b0eabafdf 13-Sep-2012 Anatol Pomozov <anatol.pomozov@gmail.com> fs: Preserve error code in get_empty_filp(), part 2

Allocating a file structure in function get_empty_filp() might fail because
of several reasons:
- not enough memory for file structures
- operation is not allowed
- user is over its limit

Currently the function returns NULL in all cases and we loose the exact
reason of the error. All callers of get_empty_filp() assume that the function
can fail with ENFILE only.

Return error through pointer. Change all callers to preserve this error code.

[AV: cleaned up a bit, carved the get_empty_filp() part out into a separate commit
(things remaining here deal with alloc_file()), removed pipe(2) behaviour change]

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1afc99beaf0fca3767d9b67789a7ae91c4f7a9c9 15-Feb-2013 Al Viro <viro@zeniv.linux.org.uk> propagate error from get_empty_filp() to its callers

Based on parts from Anatol's patch (the rest is the next commit).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
496ad9aa8ef448058e36ca7a787c61f2e63f0f54 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk> new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
72651cac884b1e285fa8e8314b10e9f1b8458802 05-Dec-2012 Jan Kara <jack@suse.cz> fs: Fix imbalance in freeze protection in mark_files_ro()

File descriptors (even those for writing) do not hold freeze protection.
Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
against remount read-only. Calling mnt_drop_write_file() as we do now
results in:

[ BUG: bad unlock balance detected! ]
3.7.0-rc6-00028-g88e75b6 #101 Not tainted
-------------------------------------
kworker/1:2/79 is trying to release lock (sb_writers) at:
[<ffffffff811b33b4>] mnt_drop_write+0x24/0x30
but there are no more locks to release!

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4b2c551f77f5a0c496e2125b1d883f4b26aabf2c 09-Oct-2012 Lai Jiangshan <laijs@cn.fujitsu.com> lglock: add DEFINE_STATIC_LGLOCK()

When the lglock doesn't need to be exported we can use
DEFINE_STATIC_LGLOCK().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
0ee8cdfe6af052deb56dccd54838a1eb32fb4ca2 16-Aug-2012 Al Viro <viro@zeniv.linux.org.uk> take fget() and friends to fs/file.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4199d35cbc90c15db447d115bd96ffa5f1d60d3a 17-Mar-2011 Mimi Zohar <zohar@linux.vnet.ibm.com> vfs: move ima_file_free before releasing the file

ima_file_free(), called on __fput(), currently flags files that have
changed, so that the file is re-measured. For appraising a files's
integrity, the file's hash must be re-calculated and stored in the
'security.ima' xattr to reflect any changes.

This patch moves the ima_file_free() call to before releasing the file
in preparation of ima-appraisal measuring the file and updating the
'security.ima' xattr.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Dmitry Kasatkin <dmitry.kasatkin@intel.com>
eb04c28288bb0098d0e75d81ba2a575239de71d8 12-Jun-2012 Jan Kara <jack@suse.cz> fs: Add freezing handling to mnt_want_write() / mnt_drop_write()

Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5c33b183a36500a5b0a3c53c11c431f0fec6efc8 20-Jul-2012 Al Viro <viro@zeniv.linux.org.uk> uninline file_free_rcu()

What inline? Its only use is passing its address to call_rcu(), for fuck sake!

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4a9d4b024a3102fc083c925c242d98ac27b1c5f6 24-Jun-2012 Al Viro <viro@zeniv.linux.org.uk> switch fput to task_work_add

... and schedule_work() for interrupt/kernel_thread callers
(and yes, now it *is* OK to call from interrupt).

We are guaranteed that __fput() will be done before we return
to userland (or exit). Note that for fput() from a kernel
thread we get an async behaviour; it's almost always OK, but
sometimes you might need to have __fput() completed before
you do anything else. There are two mechanisms for that -
a general barrier (flush_delayed_fput()) and explicit
__fput_sync(). Both should be used with care (as was the
case for fput() from kernel threads all along). See comments
in fs/file_table.c for details.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
85d7d618c17a09cfd824c1ad4483c19e6f9637ff 23-Jun-2012 Al Viro <viro@zeniv.linux.org.uk> mark_files_ro(): don't bother with mntget/mntput

mnt_drop_write_file() is safe under any lock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
962830df366b66e71849040770ae6ba55a8b4aec 08-May-2012 Andi Kleen <ak@linux.intel.com> brlocks/lglocks: API cleanups

lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.

In preparation, this patch changes the API to look more like normal
function calls with pointers, not magic macros.

The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
eea62f831b8030b0eeea8314eed73b6132d1de26 08-May-2012 Andi Kleen <ak@linux.intel.com> brlocks/lglocks: turn into functions

lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason to not just use common utility
functions and put all the data into a common data structure.

Since there are at least two users it makes sense to share this code in a
library. This is also easier maintainable than a macro forest.

This will also make it later possible to dynamically allocate lglocks and
also use them in modules (this would both still need some additional, but
now straightforward, code)

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
b57ce9694ec43dcb6ef6f189d6540e4b3d2c5e7a 12-Feb-2012 Al Viro <viro@zeniv.linux.org.uk> vfs: drop_file_write_access() made static

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
8e8b87964bc8dc5c14b6543fc933b7725f07d3ac 21-Nov-2011 Miklos Szeredi <mszeredi@suse.cz> vfs: prevent remount read-only if pending removes

If there are any inodes on the super block that have been unlinked
(i_nlink == 0) but have not yet been deleted then prevent the
remounting the super block read-only.

Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
60063497a95e716c9a689af3be2687d261f115b4 27-Jul-2011 Arun Sharma <asharma@fb.com> atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
60ed8cf78f886753e454b671841c0a3a0e55e915 16-Mar-2011 Miklos Szeredi <miklos@szeredi.hu> fix cdev leak on O_PATH final fput()

__fput doesn't need a cdev_put() for O_PATH handles.

Signed-off-by: mszeredi@suse.cz
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
326be7b484843988afe57566b627fb7a70beac56 13-Mar-2011 Al Viro <viro@zeniv.linux.org.uk> Allow passing O_PATH descriptors via SCM_RIGHTS datagrams

Just need to make sure that AF_UNIX garbage collector won't
confuse O_PATHed socket on filesystem for real AF_UNIX opened
socket.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1abf0c718f15a56a0a435588d1b104c7a37dc9bd 13-Mar-2011 Al Viro <viro@zeniv.linux.org.uk> New kind of open files - "location only".

New flag for open(2) - O_PATH. Semantics:
* pathname is resolved, but the file itself is _NOT_ opened
as far as filesystem is concerned.
* almost all operations on the resulting descriptors shall
fail with -EBADF. Exceptions are:
1) operations on descriptors themselves (i.e.
close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
fcntl(fd, F_SETFD, ...))
2) fcntl(fd, F_GETFL), for a common non-destructive way to
check if descriptor is open
3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
points of pathname resolution
* closing such descriptor does *NOT* affect dnotify or
posix locks.
* permissions are checked as usual along the way to file;
no permission checks are applied to the file itself. Of course,
giving such thing to syscall will result in permission checks (at
the moment it means checking that starting point of ....at() is
a directory and caller has exec permissions on it).

fget() and fget_light() return NULL on such descriptors; use of
fget_raw() and fget_raw_light() is needed to get them. That protects
existing code from dealing with those things.

There are two things still missing (they come in the next commits):
one is handling of symlinks (right now we refuse to open them that
way; see the next commit for semantics related to those) and another
is descriptor passing via SCM_RIGHTS datagrams.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
890275b5eb79e9933d12290473eab9ac38da0051 02-Nov-2010 Mimi Zohar <zohar@linux.vnet.ibm.com> IMA: maintain i_readcount in the VFS layer

ima_counts_get() updated the readcount and invalidated the PCR,
as necessary. Only update the i_readcount in the VFS layer.
Move the PCR invalidation checks to ima_file_check(), where it
belongs.

Maintaining the i_readcount in the VFS layer, will allow other
subsystems to use i_readcount.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com>
78d2978874e4e10e97dfd4fd79db45bdc0748550 04-Feb-2011 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> CRED: Fix kernel panic upon security_file_alloc() failure.

In get_empty_filp() since 2.6.29, file_free(f) is called with f->f_cred == NULL
when security_file_alloc() returned an error. As a result, kernel will panic()
due to put_cred(NULL) call within RCU callback.

Fix this bug by assigning f->f_cred before calling security_file_alloc().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3bc0ba4305fa99b32caac8c60df84a2f14fce228 14-Dec-2010 Steven Rostedt <srostedt@redhat.com> fs: Remove unlikely() from fget_light()

There's an unlikely() in fget_light() that assumes the file ref count
will be 1. Running the annotate branch profiler on a desktop that is
performing daily tasks (running firefox, evolution, xchat and is also part
of a distcc farm), it shows that the ref count is not 1 that often.

correct incorrect % Function File Line
------- --------- - -------- ---- ----
1035099358 6209599193 85 fget_light file_table.c 315

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
518de9b39e854542de59bfb8b9f61c8f7ecf808b 26-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:

atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of atomic_t.

get_max_files() is changed to return an unsigned long. get_nr_files() is
changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7e360c38abe2c70eae3ba5a8a17f17671d8b77c5 05-Oct-2010 Eric Dumazet <eric.dumazet@gmail.com> fs: allow for more than 2^31 files

Andrew,

Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.

Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())

Thanks !

[PATCH V4] fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot. The kernel was incapable
of creating either a named pipe or unix domain socket. This comes down
to a common kernel function called unix_create1() which does:

atomic_inc(&unix_nr_socks);
if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

n = (mempages * (PAGE_SIZE / 1024)) / 10;
files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000). That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.

get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704 0 2147483648

Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
6416ccb7899960868f5016751fb81bf25213d24f 17-Aug-2010 Nick Piggin <npiggin@kernel.dk> fs: scale files_lock

fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variale in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from different cpu's list.

However loads with frequent removal of files imply short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
degenerates to contending on a single lock, which is no worse than before. When
more than one CPU are allocating files, even if they are always freed by
different CPUs, there will be more parallelism than the single-lock case.

Testing results:

On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting: locks= 25049 cpu-hits= 23174 (92.5%) node-hits= 23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64 locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed from the same CPU it was added by over 90% of the time.
It remains within the same node 95% of the time.

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

throughput
2.6.34-rc2 24.5
+patch 24.9

us sys idle IO wait (in %)
2.6.34-rc2 51.25 28.25 17.25 3.25
+patch 53.75 18.5 19 8.75

So significantly less CPU time spent in kernel code, higher idle time and
slightly higher throughput.

Single threaded performance difference was within the noise of microbenchmarks.
That is not to say penalty does not exist, the code is larger and more memory
accesses required so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ee2ffa0dfdd2db19705f2ba1c6a4c0bfe8122dd8 17-Aug-2010 Nick Piggin <npiggin@kernel.dk> fs: cleanup files_lock locking

fs: cleanup files_lock locking

Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
manipulate the per-sb files list; unexport the files_lock spinlock.

Cc: linux-kernel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2069601b3f0ea38170d4b509b89f3ca0a373bdc1 12-Aug-2010 Linus Torvalds <torvalds@linux-foundation.org> Revert "fsnotify: store struct file not struct path"

This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the
accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay
the final work in fput" that was a horribly ugly hack to make it work at
all).

The 'struct file' approach not only causes that disgusting hack, it
somehow breaks pulseaudio, probably due to some other subtlety with
f_count handling.

Fix up various conflicts due to later fsnotify work.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
58939473bacf08e7e7346673c6d70bc367bd091a 11-Aug-2010 Tony Battersby <tonyb@cybernetics.com> vfs: improve comment describing fget_light()

Improve the description of fget_light(), which is currently incorrect
about needing a prior refcnt (judging by the way it is actually used).

Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c1e5c954020e123d30b4abf4038ce501861bcf9f 28-Jul-2010 Eric Paris <eparis@redhat.com> vfs/fsnotify: fsnotify_close can delay the final work in fput

fanotify almost works like so:

user context calls fsnotify_* function with a struct file.
fsnotify takes a reference on the struct path
user context goes about it's buissiness

at some later point in time the fsnotify listener gets the struct path
fanotify listener calls dentry_open() to create a file which userspace can deal with
listener drops the reference on the struct path
at some later point the listener calls close() on it's new file

With the switch from struct path to struct file this presents a problem for
fput() and fsnotify_close(). fsnotify_close() is called when the filp has
already reached 0 and __fput() wants to do it's cleanup.

The solution presented here is a bit odd. If an event is created from a
struct file we take a reference on the file. We check however if the f_count
was already 0 and if so we take an EXTRA reference EVEN THOUGH IT WAS ZERO.
In __fput() (where we know the f_count hit 0 once) we check if the f_count is
non-zero and if so we drop that 'extra' ref and return without destroying the
file.

Signed-off-by: Eric Paris <eparis@redhat.com>
d7065da038227a4d09a244e6014e0186a6bd21d0 26-May-2010 Al Viro <viro@zeniv.linux.org.uk> get rid of the magic around f_count in aio

__aio_put_req() plays sick games with file refcount. What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one. Current code decrements f_count and if it hasn't
hit 0, everything is fine. Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test(). IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was. And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic(). Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that. aio.c converted to it, __fput()
use is gone. req->ki_file *always* contributes to refcount
now. And __fput() became static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
42e49608683ab25fbbbf9c40edb944601e543882 05-Mar-2010 Wu Fengguang <fengguang.wu@intel.com> vfs: take f_lock on modifying f_mode after open time

We'll introduce FMODE_RANDOM which will be runtime modified. So protect
all runtime modification to f_mode with f_lock to avoid races.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> [2.6.33.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
89068c576bf324ef6fbd50dfc745148f7def202c 07-Feb-2010 Al Viro <viro@zeniv.linux.org.uk> Take ima_file_free() to proper place.

Hooks: Just Say No.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
385e3ed4f0b81dd0b0b214050a30d352a71b95f7 16-Dec-2009 Roland Dreier <rdreier@cisco.com> alloc_file(): simplify handling of mnt_clone_write() errors

When alloc_file() and init_file() were combined, the error handling of
mnt_clone_write() was taken into alloc_file() in a somewhat obfuscated
way. Since we don't use the error code for anything except warning,
we might as well warn directly without an extra variable.

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
73efc4681cb5e3c8807daf106f001e7f0798d8a0 16-Dec-2009 Roland Dreier <rdreier@cisco.com> re-export alloc_file()

Commit 3d1e4631 ("get rid of init_file()") removed the export of
alloc_file() -- possibly inadvertently, since that commit mainly
consisted of deleting the lines between the end of alloc_file() and
the start of the code in init_file().

There is in fact one modular use of alloc_file() in the tree, in
drivers/infiniband/core/uverbs_main.c, so re-add the export to fix:

ERROR: "alloc_file" [drivers/infiniband/core/ib_uverbs.ko] undefined!

when CONFIG_INFINIBAND_USER_ACCESS=m.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0552f879d45cecc35d8e372a591fc5ed863bca58 16-Dec-2009 Al Viro <viro@zeniv.linux.org.uk> Untangling ima mess, part 1: alloc_file()

There are 2 groups of alloc_file() callers:
* ones that are followed by ima_counts_get
* ones giving non-regular files
So let's pull that ima_counts_get() into alloc_file();
it's a no-op in case of non-regular files.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
e81e3f4dca6c54116a24aec217d2c15c6f58ada5 04-Dec-2009 Eric Paris <eparis@redhat.com> fs: move get_empty_filp() deffinition to internal.h

All users outside of fs/ of get_empty_filp() have been removed. This patch
moves the definition from the include/ directory to internal.h so no new
users crop up and removes the EXPORT_SYMBOL. I'd love to see open intents
stop using it too, but that's a problem for another day and a smarter
developer!

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2c48b9c45579a9b5e3e74694eebf3d2451f3dbd3 08-Aug-2009 Al Viro <viro@zeniv.linux.org.uk> switch alloc_file() to passing struct path

... and have the caller grab both mnt and dentry; kill
leak in infiniband, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3d1e463158febf6e047897597722f768b15350cd 08-Aug-2009 Al Viro <viro@zeniv.linux.org.uk> get rid of init_file()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
732741274d0269718ba20c520cf72530bb038641 05-Aug-2009 Al Viro <viro@zeniv.linux.org.uk> unexport get_empty_filp()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
6c21a7fb492bf7e2c4985937082ce58ddeca84bd 22-Oct-2009 Mimi Zohar <zohar@linux.vnet.ibm.com> LSM: imbed ima calls in the security hooks

Based on discussions on LKML and LSM, where there are consecutive
security_ and ima_ calls in the vfs layer, move the ima_ calls to
the existing security_ hooks.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
8d65af789f3e2cf4cfbdbf71a0f7a61ebcd41d38 24-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com> sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
864d7c4c068f23642efe91b33be3a84afe5f71e0 26-Apr-2009 npiggin@suse.de <npiggin@suse.de> fs: move mark_files_ro into file_table.c

This function walks the s_files lock, and operates primarily on the
files in a superblock, so it better belongs here (eg. see also
fs_may_remount_ro).

[AV: ... and it shouldn't be static after that move]

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
96029c4e09ccbd73a6d0ed2b29e80bf2586ad7ef 26-Apr-2009 npiggin@suse.de <npiggin@suse.de> fs: introduce mnt_clone_write

This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
avg = 462.286
std = 5.46106

After:
avg = 453.12
std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).

[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a4e49cb69e7dc87359bbdf1613d1ed872b9c9ebe 08-Mar-2009 Tero Roponen <tero.roponen@gmail.com> trivial: remove unused variable 'path' in alloc_file()

'struct path' is not used in alloc_file().

Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
684999149002dd046269666a390458e0acb38280 06-Feb-2009 Jonathan Corbet <corbet@lwn.net> Rename struct file->f_ep_lock

This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
epoll remains the only user, but a future patch will use it to protect
f_flags as well.

Cc: Davide Libenzi <davidel@xmailserver.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
6146f0d5e47ca4047ffded0fb79b6c25359b386c 04-Feb-2009 Mimi Zohar <zohar@linux.vnet.ibm.com> integrity: IMA hooks

This patch replaces the generic integrity hooks, for which IMA registered
itself, with IMA integrity hooks in the appropriate places directly
in the fs directory.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
b6b3fdead251d432f32f2cfce2a893ab8a658110 10-Dec-2008 Eric Dumazet <dada1@cosmosbay.com> filp_cachep can be static in fs/file_table.c

Instead of creating the "filp" kmem_cache in vfs_caches_init(),
we can do it a litle be later in files_init(), so that filp_cachep
is static to fs/file_table.c

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d76b0d9b2d87cfc95686e148767cbf7d0e22bdc0 14-Nov-2008 David Howells <dhowells@redhat.com> CRED: Use creds in file structs

Attach creds to file structs and discard f_uid/f_gid.

file_operations::open() methods (such as hppfs_open()) should use file->f_cred
rather than current_cred(). At the moment file->f_cred will be current_cred()
at this point.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
86a264abe542cfececb4df129bc45a0338d8cdb9 14-Nov-2008 David Howells <dhowells@redhat.com> CRED: Wrap current->cred and a few other accessors

Wrap current->cred and a few other accessors to hide their actual
implementation.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
b6dff3ec5e116e3af6f537d4caedcad6b9e5082a 14-Nov-2008 David Howells <dhowells@redhat.com> CRED: Separate task security context from task_struct

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
233e70f4228e78eb2f80dc6650f65d3ae3dbf17c 01-Nov-2008 Al Viro <viro@ZenIV.linux.org.uk> saner FASYNC handling on file close

As it is, all instances of ->release() for files that have ->fasync()
need to remember to evict file from fasync lists; forgetting that
creates a hole and we actually have a bunch that *does* forget.

So let's keep our lives simple - let __fput() check FASYNC in
file->f_flags and call ->fasync() there if it's been set. And lose that
crap in ->release() instances - leaving it there is still valid, but we
don't have to bother anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aeb5d727062a0238a2f96c9c380fbd2be4640c6f 02-Sep-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] introduce fmode_t, do annotations

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
516e0cc5646f377ab80fcc2ee639892eccb99853 26-Jul-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] f_count may wrap around

make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9f3acc3140444a900ab280de942291959f0f615d 24-Apr-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] split linux/file.h

Initial splitoff of the low-level stuff; taken to fdtable.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ad775f5a8faa5845377f093ca11caf577404add9 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com> [PATCH] r/o bind mounts: debugging for missed calls

There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
There was also a set of potentially missed mnt_want_write()s from
dentry_open() calls.

This patch provides a very simple debugging framework to catch these kinds of
bugs. It will WARN_ON() them, but should stop us from having any oopses or
mnt_writer count imbalances.

I'm quite convinced that this is a good thing because it found bugs in the
stuff I was working on as soon as I wrote it.

[hch: made it conditional on a debug option.
But it's still a little bit too ugly]

[hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4a3fd211ccfc08a88edc824300e25a87785c6a5f 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com> [PATCH] r/o bind mounts: elevate write count for open()s

This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.

We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.

There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.

Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
aceaf78da92a53f5e1b105649a1b8c0afdb2135c 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com> [PATCH] r/o bind mounts: create helper to drop file write access

If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().

NFS does just that, and will use this in the next
patch.

AV: drop write access in __fput() only after we evict from file list.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
430e285e0817e3e18aadd814bc078d50d8af0cbf 15-Feb-2008 Dave Hansen <haveblue@us.ibm.com> [PATCH] fix up new filp allocators

Some new uses of get_empty_filp() have crept in; switched
to alloc_file() to make sure that pieces of initialization
won't be missing.

We really need to kill get_empty_filp().

[AV] fixed dentry leak on failure exit in anon_inode_getfd()

Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fc9b52cd8f5f459b88adcf67c47668425ae31a78 08-Feb-2008 Harvey Harrison <harvey.harrison@gmail.com> fs: remove fastcall, it is always empty

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cfdaf9e5f95993264b5aee7cbb9dd16977bc11ed 19-Oct-2007 Matthias Kaehlcke <matthias.kaehlcke@gmail.com> fs/file_table.c: use list_for_each_entry() instead of list_for_each()

fs/file_table.c: use list_for_each_entry() instead of list_for_each()
in fs_may_remount_ro()

Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ce8d2cdf3d2b73e346c82e6f0a46da331df6364c 17-Oct-2007 Dave Hansen <haveblue@us.ibm.com> r/o bind mounts: filesystem helpers for custom 'struct file's

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4975e45ff66845c9acc6c8619e80ef15eadf785e 17-Oct-2007 Denis Cheng <crquan@gmail.com> fs: use kmem_cache_zalloc instead

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
52d9f3b4090922f34497ace82bd062d80a465a29 17-Oct-2007 Peter Zijlstra <a.p.zijlstra@chello.nl> lib: percpu_counter_sum_positive

s/percpu_counter_sum/&_positive/

Because its consitent with percpu_counter_read*

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e63340ae6b6205fef26b40a75673d1c9c0c8bb90 08-May-2007 Randy Dunlap <randy.dunlap@oracle.com> header cleaning: don't include smp_lock.h when not used

Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.

Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0f7fc9e4d03987fe29f6dd4aa67e4c56eb7ecb05 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> [PATCH] VFS: change struct file to use struct path

This patch changes struct file to use struct path instead of having
independent pointers to struct dentry and struct vfsmount, and converts all
users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}.

Additionally, it adds two #define's to make the transition easier for users of
the f_dentry and f_vfsmnt.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
609d7fa9565c754428d2520cac2accc9052e1245 02-Oct-2006 Eric W. Biederman <ebiederm@xmission.com> [PATCH] file: modify struct fown_struct to use a struct pid

File handles can be requested to send sigio and sigurg to processes. By
tracking the destination processes using struct pid instead of pid_t we make
the interface safe from all potential pid wrap around problems.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
577c4eb09d1034d0739e3135fd2cff50588024be 27-Sep-2006 Theodore Ts'o <tytso@mit.edu> [PATCH] inode-diet: Move i_cdev into a union

Move the i_cdev pointer in struct inode into a union.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
6ab3d5624e172c553004ecc862bfeac16d9d68b7 30-Jun-2006 Jörn Engel <joern@wohnheim.fh-wedel.de> Remove obsolete #include <linux/config.h>

Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
0216bfcffe424a5473daa4da47440881b36c1f41 23-Jun-2006 Mingming Cao <cmm@us.ibm.com> [PATCH] percpu counter data type changes to suppport more than 2**31 ext3 free blocks counter

The percpu counter data type are changed in this set of patches to support
more users like ext3 who need more than 32 bit to store the free blocks
total in the filesystem.

- Generic perpcu counters data type changes. The size of the global counter
and local counter were explictly specified using s64 and s32. The global
counter is changed from long to s64, while the local counter is changed from
long to s32, so we could avoid doing 64 bit update in most cases.

- Users of the percpu counters are updated to make use of the new
percpu_counter_init() routine now taking an additional parameter to allow
users to pass the initial value of the global counter.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
5a6b7951bfcca7f45f44269ea87417c74558daf8 23-Mar-2006 Benjamin LaHaise <bcrl@linux.intel.com> [PATCH] get_empty_filp tweaks, inline epoll_init_file()

Eliminate a handful of cache references by keeping current in a register
instead of reloading (helps x86) and avoiding the overhead of a function
call. Inlining eventpoll_init_file() saves 24 bytes. Also reorder file
initialization to make writes occur more sequentially.

Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
529bf6be5c04f2e869d07bfdb122e9fd98ade714 08-Mar-2006 Dipankar Sarma <dipankar@in.ibm.com> [PATCH] fix file counting

I have benchmarked this on an x86_64 NUMA system and see no significant
performance difference on kernbench. Tested on both x86_64 and powerpc.

The way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was
constructor/destructor based. This meant that nr_files was decremented
only when the object was removed from the slab cache. This is susceptible
to slab fragmentation. With RCU based file structure, consequent batched
freeing and a test program like Serge's, we just speed this up and end up
with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844

At the same time, I see only a 2000+ objects in filp cache. The following
patch I fixes this problem.

This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate percpu counter, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.

Counting files as an when they are created and destroyed (as opposed to
inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
16f7e0fe2ecc30f30652e8185e1772cdebe39109 11-Jan-2006 Randy Dunlap <rdunlap@xenotime.net> [PATCH] capable/capability.h (fs/)

fs: Use <linux/capability.h> where capable() is used.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tim Schmielau <tim@physik3.uni-rostock.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
095975da26dba21698582e91e96be10f7417333f 08-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au> [PATCH] rcu file: use atomic primitives

Use atomic_inc_not_zero for rcu files instead of special case rcuref.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2109a2d1b175dfcffbfdac693bdbe4c4ab62f11f 07-Nov-2005 Pekka J Enberg <penberg@cs.Helsinki.FI> [PATCH] mm: rename kmem_cache_s to kmem_cache

This patch renames struct kmem_cache_s to kmem_cache so we can start using
it instead of kmem_cache_t typedef.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2f51201662b28dbf8c15fb7eb972bc51c6cc3fa5 31-Oct-2005 Eric Dumazet <dada1@cosmosbay.com> [PATCH] reduce sizeof(struct file)

Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead
in a memory location that is not anymore used at call_rcu(&f->f_rcuhead,
file_free_rcu) time, to reduce the size of this critical kernel object.

The trick I used is to move f_rcuhead and f_list in an union called f_u

The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list
becomes f_u.f_list

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ab2af1f5005069321c5d130f09cce577b03f43ef 09-Sep-2005 Dipankar Sarma <dipankar@in.ibm.com> [PATCH] files: files struct with RCU

Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure. The fdtable structure is
protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead. This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2832e9366a1fcd6f76957a42157be041240f994e 07-Sep-2005 Eric Dumazet <dada1@cosmosbay.com> [PATCH] remove file.f_maxcount

struct file cleanup: f_maxcount has an unique value (INT_MAX). Just use
the hard-wired value.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
0eeca28300df110bd6ed54b31193c83b87921443 12-Jul-2005 Robert Love <rml@novell.com> [PATCH] inotify

inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:

* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?

inotify provides a more usable, simple, powerful solution to file change
notification:

* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.

Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.

See Documentation/filesystems/inotify.txt.

Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
af4d2ecbf007b7df3db7a41eedccdc05b8006d0b 23-Jun-2005 Kirill Korotaev <dev@sw.ru> [PATCH] Fix of bogus file max limit messages

This patch fixes incorrect and bogus kernel messages that file-max limit
reached when the allocation fails

Signed-Off-By: Kirill Korotaev <dev@sw.ru>
Signed-Off-By: Denis Lunev <den@sw.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!