History log of /fs/aio.c
Revision Date Author Comments
835f252c6debd204fcd607c79975089b1ecd3472 06-Nov-2014 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

https://bugzilla.kernel.org/show_bug.cgi?id=86831

Markus reported that when shutting down mysqld (with AIO support,
on a ext3 formatted Harddrive) leads to a negative number of dirty pages
(underrun to the counter). The negative number results in a drastic reduction
of the write performance because the page cache is not used, because the kernel
thinks it is still 2 ^ 32 dirty pages open.

Add a warn trace in __dec_zone_state will catch this easily:

static inline void __dec_zone_state(struct zone *zone, enum
zone_stat_item item)
{
atomic_long_dec(&zone->vm_stat[item]);
+ WARN_ON_ONCE(item == NR_FILE_DIRTY &&
atomic_long_read(&zone->vm_stat[item]) < 0);
atomic_long_dec(&vm_stat[item]);
}

[ 21.341632] ------------[ cut here ]------------
[ 21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[ 21.355296] Modules linked in: wutbox_cp sata_mv
[ 21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[ 21.366793] Workqueue: events free_ioctx
[ 21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
(show_stack+0x20/0x24)
[ 21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
(dump_stack+0x24/0x28)
[ 21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
(warn_slowpath_common+0x84/0x9c)
[ 21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
(warn_slowpath_null+0x2c/0x34)
[ 21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
(cancel_dirty_page+0x164/0x224)
[ 21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
(truncate_inode_page+0x8c/0x158)
[ 21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
(truncate_inode_pages_range+0x11c/0x53c)
[ 21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
[<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[ 21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
(truncate_setsize+0x5c/0x74)
[ 21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
(put_aio_ring_file.isra.14+0x34/0x90)
[ 21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
[<c013b424>] (aio_free_ring+0x20/0xcc)
[ 21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
(free_ioctx+0x24/0x44)
[ 21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
(process_one_work+0x134/0x47c)
[ 21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
(worker_thread+0x130/0x414)
[ 21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
(kthread+0xd4/0xec)
[ 21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
(ret_from_fork+0x14/0x20)
[ 21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

The cause is that we set the aio ring file pages as *DIRTY* via SetPageDirty
(bypasses the VFS dirty pages increment) when init, and aio fs uses
*default_backing_dev_info* as the backing dev, which does not disable
the dirty pages accounting capability.
So truncating aio ring file will contribute to accounting dirty pages (VFS
dirty pages decrement), then error occurs.

The original goal is keeping these pages in memory (can not be reclaimed
or swapped) in life-time via marking it dirty. But thinking more, we have
already pinned pages via elevating the page's refcount, which can already
achieve the goal, so the SetPageDirty seems unnecessary.

In order to fix the issue, using the __set_page_dirty_no_writeback instead
of the nop .set_page_dirty, and dropped the SetPageDirty (don't manually
set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).

With the above change, the dirty pages accounting can work well. But as we
known, aio fs is an anonymous one, which should never cause any real write-back,
we can ignore the dirty pages (write back) accounting by disabling the dirty
pages (write back) accounting capability. So we introduce an aio private
backing dev info (disabled the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities) to
replace the default one.

Reported-by: Markus Königshaus <m.koenigshaus@wut.de>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: stable <stable@vger.kernel.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2aad2a86f6685c10360ec8a5a55eb9ab7059cb72 24-Sep-2014 Tejun Heo <tj@kernel.org> percpu_ref: add PERCPU_REF_INIT_* flags

With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
a34375ef9e65340a138fc0be287de5c940d260fc 08-Sep-2014 Tejun Heo <tj@kernel.org> percpu-refcount: add @gfp to percpu_ref_init()

Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.

This patch doesn't make any functional difference.

v2: blk-mq conversion was missing. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
6098b45b32e6baeacc04790773ced9340601d511 03-Sep-2014 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: block exit_aio() until all context requests are completed

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
2ff396be602f10b5eab8e73b24f20348fa2de159 02-Sep-2014 Jeff Moyer <jmoyer@redhat.com> aio: add missing smp_rmb() in read_events_ring

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events. After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves. This small patch fixes the
problem in testing.

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
d856f32a86b2b015ab180ab7a55e455ed8d3ccc5 24-Aug-2014 Benjamin LaHaise <bcrl@kvack.org> aio: fix reqs_available handling

As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request
leak when events are reaped by userspace") introduces a regression when
user code attempts to perform io_submit() with more events than are
available in the ring buffer. Reverting that commit would reintroduce a
regression when user space event reaping is used.

Fixing this bug is a bit more involved than the previous attempts to fix
this regression. Since we do not have a single point at which we can
count events as being reaped by user space and io_getevents(), we have
to track event completion by looking at the number of events left in the
event ring. So long as there are as many events in the ring buffer as
there have been completion events generate, we cannot call
put_reqs_available(). The code to check for this is now placed in
refill_reqs_available().

A test program from Dan and modified by me for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Dan Aloni <dan@kernelim.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: stable@vger.kernel.org # v3.16 and anything that f8567a3845ac was backported to
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
00fefb9cf2b5493a86912de55ba912bdfae4a207 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: use iovec array rather than the single one

Previously, we only offer a single iovec to handle all the read/write cases, so
the PREADV/PWRITEV request always need to alloc more iovec buffer when copying
user vectors.
If we use a tmp iovec array rather than the single one, some small PREADV/PWRITEV
workloads(vector size small than the tmp buffer) will not need to alloc more
iovec buffer when copying user vectors.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2be4e7deec2d4398a0eb2165cc04086ebfc831d2 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: fix some comments

The function comments of aio_run_iocb and aio_read_events are out of date, so
fix them here.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
8dc4379e17cddad7b2088a3f300ded50d2a6d493 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: use the macro rather than the inline magic number

Replace the inline magic number with the ready-made macro(AIO_RING_MAGIC),
just clean up.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
b53f1f82fbde8dcf34ab7d731c2a9ae6f0d8d2e2 23-Jul-2014 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: remove the needless registration of ring file's private_data

Remove the registration of ring file's private_data, we do not use
it.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
be6fb451a24582732c66e28cb0beb3f19c4289fd 22-Jul-2014 Benjamin LaHaise <bcrl@kvack.org> aio: remove no longer needed preempt_disable()

Based on feedback from Jens Axboe on 263782c1c95bbddbb022dc092fd89a36bb8d5577,
clean up get/put_reqs_available() to remove the no longer needed preempt_disable()
and preempt_enable() pair.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>
263782c1c95bbddbb022dc092fd89a36bb8d5577 14-Jul-2014 Benjamin LaHaise <bcrl@kvack.org> aio: protect reqs_available updates from changes in interrupt handlers

As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context. While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU. This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org
9a1049da9bd2cd83fe11d46433e603c193aa9c71 28-Jun-2014 Tejun Heo <tj@kernel.org> percpu-refcount: require percpu_ref to be exited explicitly

Currently, a percpu_ref undoes percpu_ref_init() automatically by
freeing the allocated percpu area when the percpu_ref is killed.
While seemingly convenient, this has the following niggles.

* It's impossible to re-init a released reference counter without
going through re-allocation.

* In the similar vein, it's impossible to initialize a percpu_ref
count with static percpu variables.

* We need and have an explicit destructor anyway for failure paths -
percpu_ref_cancel_init().

This patch removes the automatic percpu counter freeing in
percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
generic destructor now named percpu_ref_exit(). percpu_ref_destroy()
is considered but it gets confusing with percpu_ref_kill() while
"exit" clearly indicates that it's the counterpart of
percpu_ref_init().

All percpu_ref_cancel_init() users are updated to invoke
percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
added to the destruction path of all percpu_ref users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>
55c6c814ae0aa896781d3c51e4608de542624f64 28-Jun-2014 Tejun Heo <tj@kernel.org> percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()

ioctx_alloc() reaches inside percpu_ref and directly frees
->pcpu_count in its failure path, which is quite gross. percpu_ref
has been providing a proper interface to do this,
percpu_ref_cancel_init(), for quite some time now. Let's use that
instead.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
855ef0dec7271ff7be7381feaaf3f4aed80bd503 30-Apr-2014 Oleg Nesterov <oleg@redhat.com> aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()

ioctx_add_table() is the writer, it does not need rcu_read_lock() to
protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just
add the confusion.

And it doesn't need rcu_dereference() by the same reason, it must see
any updates previously done under the same ->ioctx_lock. We could use
rcu_dereference_protected() but the patch uses rcu_dereference_raw(),
the function is simple enough.

The same for kill_ioctx(), although it does not update the pointer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
4b70ac5fd9b58bfaa5f25b4ea48f528aefbf3308 30-Apr-2014 Oleg Nesterov <oleg@redhat.com> aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

On 04/30, Benjamin LaHaise wrote:
>
> > - ctx->mmap_size = 0;
> > -
> > - kill_ioctx(mm, ctx, NULL);
> > + if (ctx) {
> > + ctx->mmap_size = 0;
> > + kill_ioctx(mm, ctx, NULL);
> > + }
>
> Rather than indenting and moving the two lines changing mmap_size and the
> kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces
> the number of lines changed and avoid excessive indentation.

OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
of course, I won't argue.

The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
we reset mmap_size only for kill_ioctx(). But feel free to remove this change.

-------------------------------------------------------------------------------
Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

1. We can read ->ioctx_table only once and we do not read rcu_read_lock()
or even rcu_dereference().

This mm has no users, nobody else can play with ->ioctx_table. Otherwise
the code is buggy anyway, if we need rcu_read_lock() in a loop because
->ioctx_table can be updated then kfree(table) is obviously wrong.

2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid
munmap(), but another reason is that we simply can't do vm_munmap() unless
current->mm == mm and this is not true in general, the caller is mmput().

3. We do not really need to nullify mm->ioctx_table before return, probably
the current code does this to catch the potential problems. But in this
case RCU_INIT_POINTER(NULL) looks better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
edfbbf388f293d70bf4b7c0bc38774d05e6f711a 24-Jun-2014 Benjamin LaHaise <bcrl@kvack.org> aio: fix kernel memory disclosure in io_getevents() introduced in v3.10

A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and
Petr for disclosing this issue.

This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
f8567a3845ac05bb28f3c1b478ef752762bd39ef 24-Jun-2014 Benjamin LaHaise <bcrl@kvack.org> aio: fix aio request leak when events are reaped by userspace

The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping. Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
293bc9822fa9b3c9d4b7893bcb241e085580771a 12-Feb-2014 Al Viro <viro@zeniv.linux.org.uk> new methods: ->read_iter() and ->write_iter()

Beginning to introduce those. Just the callers for now, and it's
clumsier than it'll eventually become; once we finish converting
aio_read and aio_write instances, the things will get nicer.

For now, these guys are in parallel to ->aio_read() and ->aio_write();
they take iocb and iov_iter, with everything in iov_iter already
validated. File offset is passed in iocb->ki_pos, iov/nr_segs -
in iov_iter.

Main concerns in that series are stack footprint and ability to
split the damn thing cleanly.

[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
754320d6e166d3a12cb4810a452bde00afbd4e9a 01-May-2014 Leon Yu <chianglungyu@gmail.com> aio: fix potential leak in aio_run_iocb().

iovec should be reclaimed whenever caller of rw_copy_check_uvector() returns,
but it doesn't hold when failure happens right after aio_setup_vectored_rw().

Fix that in a such way to avoid hairy goto.

Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
fa88b6f8803c87c4ced5aac11261ced7cedaa05e 29-Apr-2014 Benjamin LaHaise <bcrl@kvack.org> aio: cleanup: flatten kill_ioctx()

There is no need to have most of the code in kill_ioctx() indented. Flatten
it.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
fb2d44838320b78e6e3b5eb2e35b70f62f262e4c 29-Apr-2014 Benjamin LaHaise <bcrl@kvack.org> aio: report error from io_destroy() when threads race in io_destroy()

As reported by Anatol Pomozov, io_destroy() fails to report an error when
it loses the race to destroy a given ioctx. Since there is a difference in
behaviour between the thread that wins the race (which blocks on outstanding
io requests) versus lthe thread that loses (which returns immediately), wire
up a return code from kill_ioctx() to the io_destroy() syscall.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Anatol Pomozov <anatol.pomozov@gmail.com>
d52a8f9ead60338306c4f03e9ce575c5f23a4b65 22-Apr-2014 Fabian Frederick <fabf@skynet.be> fs/aio.c: Remove ctx parameter in kiocb_cancel

ctx is no longer used in kiocb_cancel since

57282d8fd74407 ("aio: Kill ki_users")

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
e02ba72aabfade4c9cd6e3263e9b57bf890ad25c 15-Apr-2014 Anatol Pomozov <anatol.pomozov@gmail.com> aio: block io_destroy() until all context requests are completed

deletes aio context and all resources related to. It makes sense that
no IO operations connected to the context should be running after the context
is destroyed. As we removed io_context we have no chance to
get requests status or call io_getevents().

man page for io_destroy says that this function may block until
all context's requests are completed. Before kernel 3.11 io_destroy()
blocked indeed, but since aio refactoring in 3.11 it is not true anymore.

Here is a pseudo-code that shows a testcase for a race condition discovered
in 3.11:

initialize io_context
io_submit(read to buffer)
io_destroy()

// context is destroyed so we can free the resources
free(buffers);

// if the buffer is allocated by some other user he'll be surprised
// to learn that the buffer still filled by an outstanding operation
// from the destroyed io_context

The fix is straight-forward - add a completion struct and wait on it
in io_destroy, complete() should be called when number of in-fligh requests
reaches zero.

If two or more io_destroy() called for the same context simultaneously then
only the first one waits for IO completion, other calls behaviour is undefined.

Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and
do not see the race condition anymore.

Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
fa8a53c39f3fdde98c9eace6a9b412143f0f6ed6 28-Mar-2014 Benjamin LaHaise <bcrl@kvack.org> aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration

As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues
exist in the aio ring page migration support.

As a result, for example, we have the following problem:

thread 1 | thread 2
|
aio_migratepage() |
|-> take ctx->completion_lock |
|-> migrate_page_copy(new, old) |
| *NOW*, ctx->ring_pages[idx] == old |
|
| *NOW*, ctx->ring_pages[idx] == old
| aio_read_events_ring()
| |-> ring = kmap_atomic(ctx->ring_pages[0])
| |-> ring->head = head; *HERE, write to the old ring page*
| |-> kunmap_atomic(ring);
|
|-> ctx->ring_pages[idx] = new |
| *BUT NOW*, the content of |
| ring_pages[idx] is old. |
|-> release ctx->completion_lock |

As above, the new ring page will not be updated.

Fix this issue, as well as prevent races in aio_ring_setup() by holding
the ring_lock mutex during kioctx setup and page migration. This avoids
the overhead of taking another spinlock in aio_read_events_ring() as Tang's
and Gu's original fix did, pushing the overhead into the migration code.

Note that to handle the nesting of ring_lock inside of mmap_sem, the
migratepage operation uses mutex_trylock(). Page migration is not a 100%
critical operation in this case, so the ocassional failure can be
tolerated. This issue was reported by Sasha Levin.

Based on feedback from Linus, avoid the extra taking of ctx->completion_lock.
Instead, make page migration fully serialised by mapping->private_lock, and
have aio_free_ring() simply disconnect the kioctx from the mapping by calling
put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the
error handling logic in aio_migratepage(), and should improve robustness.

v4: always do mutex_unlock() in cases when kioctx setup fails.

Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: stable@vger.kernel.org
3dc9acb67600393249a795934ccdfc291a200e6b 19-Dec-2013 Linus Torvalds <torvalds@linux-foundation.org> aio: clean up and fix aio_setup_ring page mapping

Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages
migration") the aio ring setup code has used a special per-ring backing
inode for the page allocations, rather than just using random anonymous
pages.

However, rather than remembering the pages as it allocated them, it
would allocate the pages, insert them into the file mapping (dirty, so
that they couldn't be free'd), and then forget about them. And then to
look them up again, it would mmap the mapping, and then use
"get_user_pages()" to get back an array of the pages we just created.

Now, not only is that incredibly inefficient, it also leaked all the
pages if the mmap failed (which could happen due to excessive number of
mappings, for example).

So clean it all up, making it much more straightforward. Also remove
some left-overs of the previous (broken) mm_populate() usage that was
removed in commit d6c355c7dabc ("aio: fix race in ring buffer page
lookup introduced by page migration support") but left the pointless and
now misleading MAP_POPULATE flag around.

Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8e321fefb0e60bae4e2a28d20fc4fa30758d27c6 21-Dec-2013 Benjamin LaHaise <bcrl@kvack.org> aio/migratepages: make aio migrate pages sane

The arbitrary restriction on page counts offered by the core
migrate_page_move_mapping() code results in rather suspicious looking
fiddling with page reference counts in the aio_migratepage() operation.
To fix this, make migrate_page_move_mapping() take an extra_count parameter
that allows aio to tell the code about its own reference count on the page
being migrated.

While cleaning up aio_migratepage(), make it validate that the old page
being passed in is actually what aio_migratepage() expects to prevent
misbehaviour in the case of races.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
1881686f842065d2f92ec9c6424830ffc17d23b0 21-Dec-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix kioctx leak introduced by "aio: Fix a trinity splat"

e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference
counting to correct a bug trinity found. Unfortunately, the change lead
to kioctxes being leaked because there was no final reference count to
put. Add that reference count back in to fix things.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
d1b9432712a25eeb06114fb4b587133525a47de5 04-Dec-2013 Gu Zheng <guz.fnst@cn.fujitsu.com> aio: clean up aio ring in the fail path

Clean up the aio ring file in the fail path of aio_setup_ring
and ioctx_alloc. And maybe it can fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
ddb8c45ba15149ebd41d7586261c05f7ca37f9a1 19-Nov-2013 Sasha Levin <sasha.levin@oracle.com> aio: nullify aio->ring_pages after freeing it

After freeing ring_pages we leave it as is causing a dangling pointer. This
has already caused an issue so to help catching any issues in the future
NULL it out.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
d558023207e008a4476a3b7bb8706b2a2bf5d84f 19-Nov-2013 Sasha Levin <sasha.levin@oracle.com> aio: prevent double free in ioctx_alloc

ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring()
fails to do so it would call aio_free_ring() before returning, but
ioctx_alloc() would call aio_free_ring() again causing a double free of
the ring.

This is easily reproducible from userspace.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
7f62656be8a8ef14c168db2d98021fb9c8cc1076 13-Nov-2013 Dan Carpenter <dan.carpenter@oracle.com> aio: checking for NULL instead of IS_ERR

alloc_anon_inode() returns an ERR_PTR(), it doesn't return NULL.

Fixes: 71ad7490c1f3 ('rework aio migrate pages to use aio fs')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
71ad7490c1f32bd7829df76360f9fa17829868f3 17-Sep-2013 Benjamin LaHaise <bcrl@kvack.org> rework aio migrate pages to use aio fs

Don't abuse anon_inodes.c to host private files needed by aio;
we can bloody well declare a mini-fs of our own instead of
patching up what anon_inodes can create for us.

Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
e34ecee2ae791df674dfb466ce40692ca6218e43 11-Oct-2013 Kent Overstreet <kmo@daterainc.com> aio: Fix a trinity splat

aio kiocb refcounting was broken - it was relying on keeping track of
the number of available ring buffer entries, which it needs to do
anyways; then at shutdown time it'd wait for completions to be delivered
until the # of available ring buffer entries equalled what it was
initialized to.

Problem with that is that the ring buffer is mapped writable into
userspace, so userspace could futz with the head and tail pointers to
cause the kernel to see extra completions, and cause free_ioctx() to
return while there were still outstanding kiocbs. Which would be bad.

Fix is just to directly refcount the kiocbs - which is more
straightforward, and with the new percpu refcounting code doesn't cost
us any cacheline bouncing which was the whole point of the original
scheme.

Also clean up ioctx_alloc()'s error path and fix a bug where it wasn't
subtracting from aio_nr if ioctx_add_table() failed.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
5e9ae2e5da0beb93f8557fc92a8f4fbc05ea448f 27-Sep-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix use-after-free in aio_migratepage

Dmitry Vyukov managed to trigger a case where aio_migratepage can cause a
use-after-free during teardown of the aio ring buffer's mapping. This turns
out to be caused by access to the ioctx's ring_pages via the migratepage
operation which was not being protected by any locks during ioctx freeing.
Use the address_space's private_lock to protect use and updates of the mapping's
private_data, and make ioctx teardown unlink the ioctx from the address space.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
d9b2c8714aef102dea95544a8cd9372b21af463f 09-Sep-2013 Artem Savkov <artem.savkov@gmail.com> aio: rcu_read_lock protection for new rcu_dereference calls

Patch "aio: fix rcu sparse warnings introduced by ioctx table lookup patch"
(77d30b14d24e557f89c41980011d72428514d729 in linux-next.git) introduced a
couple of new rcu_dereference calls which are not protected by rcu_read_lock
and result in following warnings during syscall fuzzing(trinity):

[ 471.646379] ===============================
[ 471.649727] [ INFO: suspicious RCU usage. ]
[ 471.653919] 3.11.0-next-20130906+ #496 Not tainted
[ 471.657792] -------------------------------
[ 471.661235] fs/aio.c:503 suspicious rcu_dereference_check() usage!
[ 471.665968]
[ 471.665968] other info that might help us debug this:
[ 471.665968]
[ 471.672141]
[ 471.672141] rcu_scheduler_active = 1, debug_locks = 1
[ 471.677549] 1 lock held by trinity-child0/3774:
[ 471.681675] #0: (&(&mm->ioctx_lock)->rlock){+.+...}, at: [<c119ba1a>] SyS_io_setup+0x63a/0xc70
[ 471.688721]
[ 471.688721] stack backtrace:
[ 471.692488] CPU: 1 PID: 3774 Comm: trinity-child0 Not tainted 3.11.0-next-20130906+ #496
[ 471.698437] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 471.703151] 00000000 00000000 c58bbf30 c18a814b de2234c0 c58bbf58 c10a4ec6 c1b0d824
[ 471.709544] c1b0f60e 00000001 00000001 c1af61b0 00000000 cb670ac0 c3aca000 c58bbfac
[ 471.716251] c119bc7c 00000002 00000001 00000000 c119b8dd 00000000 c10cf684 c58bbfb4
[ 471.722902] Call Trace:
[ 471.724859] [<c18a814b>] dump_stack+0x4b/0x66
[ 471.728772] [<c10a4ec6>] lockdep_rcu_suspicious+0xc6/0x100
[ 471.733716] [<c119bc7c>] SyS_io_setup+0x89c/0xc70
[ 471.737806] [<c119b8dd>] ? SyS_io_setup+0x4fd/0xc70
[ 471.741689] [<c10cf684>] ? __audit_syscall_entry+0x94/0xe0
[ 471.746080] [<c18b1fcc>] syscall_call+0x7/0xb
[ 471.749723] [<c1080000>] ? task_fork_fair+0x240/0x260

Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
d6c355c7dabcd753a75bc77d150d36328a355267 09-Sep-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix race in ring buffer page lookup introduced by page migration support

Prior to the introduction of page migration support in "fs/aio: Add support
to aio ring pages migration" / 36bc08cc01709b4a9bb563b35aa530241ddc63e3,
mapping of the ring buffer pages was done via get_user_pages() while
retaining mmap_sem held for write. This avoided possible races with userland
racing an munmap() or mremap(). The page migration patch, however, switched
to using mm_populate() to prime the page mapping. mm_populate() cannot be
called with mmap_sem held.

Instead of dropping the mmap_sem, revert to the old behaviour and simply
drop the use of mm_populate() since get_user_pages() will cause the pages to
get mapped anyways. Thanks to Al Viro for spotting this issue.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
77d30b14d24e557f89c41980011d72428514d729 30-Aug-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix rcu sparse warnings introduced by ioctx table lookup patch

Sseveral sparse warnings were caused by missing rcu_dereference() annotations
for dereferencing mm->ioctx_table. Thankfully, none of those were actual bugs
as the deref was protected by a spin lock in all instances.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
79bd1bcf1ab22ea723da7d5854a9e72a350ecbf8 30-Aug-2013 Benjamin LaHaise <bcrl@kvack.org> aio: remove unnecessary debugging from aio_free_ring()

The commit 36bc08cc0170 ("fs/aio: Add support to aio ring pages migration")
added some debugging code that is not required and resulted in a build error
when 98474236f72e ("vfs: make the dentry cache use the lockref infrastructure")
was added to the tree. The code is not required, so just delete it.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
f30d704fe1244c44a984d3d1f47bc648bcc6c9f7 08-Aug-2013 Benjamin LaHaise <bcrl@kvack.org> aio: table lookup: verify ctx pointer

Another shortcoming of the table lookup patch was revealed where the pointer
was not being tested before being dereferenced. Verify this to avoid the
NULL pointer dereference.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
da90382c2ec367aac88ff6aa76afb659ee0e4235 05-Aug-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"

In the patch "aio: convert the ioctx list to table lookup v3", incorrect
handling in the ioctx_alloc() error path was introduced that lead to an
ioctx being added via ioctx_add_table() while freed when the ioctx_alloc()
call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by
only calling ioctx_add_table() as the last step in ioctx_alloc().

Also, several unnecessary rcu_dereference() calls were added that lead to
RCU warnings where the system was already protected by a spin lock for
accessing mm->ioctx_table.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
6878ea72a5d1aa6caae86449975a50b7fe9abed5 31-Jul-2013 Benjamin LaHaise <bcrl@kvack.org> aio: be defensive to ensure request batching is non-zero instead of BUG_ON()

In the event that an overflow/underflow occurs while calculating req_batch,
clamp the minimum at 1 request instead of doing a BUG_ON().

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
db446a08c23d5475e6b08c87acca79ebb20f283c 30-Jul-2013 Benjamin LaHaise <bcrl@kvack.org> aio: convert the ioctx list to table lookup v3

On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> > 32.51% [guest.kernel] [g] lookup_ioctx
> > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
> > 4.40% [guest.kernel] [g] lock_release
> > 4.19% [guest.kernel] [g] sched_clock_local
> > 3.86% [guest.kernel] [g] local_clock
> > 3.68% [guest.kernel] [g] native_sched_clock
> > 3.08% [guest.kernel] [g] sched_clock_cpu
> > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
> > 2.60% [guest.kernel] [g] memcpy
> > 2.33% [guest.kernel] [g] lock_acquired
> > 2.25% [guest.kernel] [g] lock_acquire
> > 1.84% [guest.kernel] [g] do_io_submit
> >
> > This patchs converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. This are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores 1 2 4 8 16 32
> > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms
> > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43%
> > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms
> > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75%
> > % of radix
> > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list 58892 ms
> > %rsd 0.91%
> > radix 59404 ms
> > %rsd 0.81%
> > % of radix
> > relative 100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>

I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing the ioctx. It's not finished yet, and
it needs to be tidied up, but is most of the way there.

-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo> -Kent

bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
4cd81c3dfc4a34e4a0b6fa577860077c8e5b13af 30-Jul-2013 Benjamin LaHaise <bcrl@kvack.org> aio: double aio_max_nr in calculations

With the changes to use percpu counters for aio event ring size calculation,
existing increases to aio_max_nr are now insufficient to allow for the
allocation of enough events. Double the value used for aio_max_nr to account
for the doubling introduced by the percpu slack.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
d29c445b635b3a03cf683cafcbae58a4ec1e1125 18-Mar-2013 Kent Overstreet <koverstreet@google.com> aio: Kill ki_dtor

sock_aio_dtor() is dead code - and stuff that does need to do cleanup
can simply do it before calling aio_complete().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
57282d8fd744072d6d6f18fa6ebe3cc1149015bf 13-May-2013 Kent Overstreet <koverstreet@google.com> aio: Kill ki_users

The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
8bc92afcf7f5c598001dd04e62d88f57f6e89e51 26-Feb-2013 Kent Overstreet <koverstreet@google.com> aio: Kill unneeded kiocb members

The old aio retry infrastucture needed to save the various arguments to
to aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
73a7075e3f6ec63dc359064eea6fd84f406cf2a5 10-May-2013 Kent Overstreet <koverstreet@google.com> aio: Kill aio_rw_vect_retry()

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
5ffac122dbda89fbb29885f35a5d47b0edb8936d 10-May-2013 Kent Overstreet <koverstreet@google.com> aio: Don't use ctx->tail unnecessarily

aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.

So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
bec68faaf3ba74ed0dcd5dc3a881b30aec542973 13-May-2013 Kent Overstreet <koverstreet@google.com> aio: io_cancel() no longer returns the io_event

Originally, io_event() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.

But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.

This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.

Also tweak the refcounting in kiocb_cancel() to make more sense.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
723be6e39d14254bb5bb9f422b434566d359fa6e 29-May-2013 Kent Overstreet <koverstreet@google.com> aio: percpu ioctx refcount

This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
e1bdd5f27a5b14e24a658d5511bebceb67679d83 26-Apr-2013 Kent Overstreet <koverstreet@google.com> aio: percpu reqs_available

See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before, it was being rounded up to the next
page before so this doesn't change the actual semantics.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
34e83fc618085e00dc9803286c581f51966673bd 26-Apr-2013 Kent Overstreet <koverstreet@google.com> aio: reqs_active -> reqs_available

The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.

So as prep work for that, we convert reqs_active to reqs_available.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
0c45355fc7c48c82db151bf0e7ca305d513e639e 17-Jul-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix build when migration is disabled

When "fs/aio: Add support to aio ring pages migration" was applied, it
broke the build when CONFIG_MIGRATION was disabled. Wrap the migration
code with a test for CONFIG_MIGRATION to fix this and save a few bytes
when migration is disabled.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
36bc08cc01709b4a9bb563b35aa530241ddc63e3 16-Jul-2013 Gu Zheng <guz.fnst@cn.fujitsu.com> fs/aio: Add support to aio ring pages migration

As the aio job will pin the ring pages, that will lead to mem migrated
failed. In order to fix this problem we use an anon inode to manage the aio ring
pages, and setup the migratepage callback in the anon inode's address space, so
that when mem migrating the aio ring pages will be moved to other mem node safely.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
4b30f07e74d2df673925019ad58bc59a719df3bb 04-Jul-2013 Tang Chen <tangchen@cn.fujitsu.com> aio: fix wrong comment in aio_complete()

ctx->ctx_lock should be ctx->completion_lock.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
68d70d03f8f5bd10a0e7337210b13f536fd4aeb9 19-Jun-2013 Al Viro <viro@zeniv.linux.org.uk> constify rw_verify_area()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4fcc712f5c48b1e32cdbf9b9cfba42a27b2e3160 12-Jun-2013 Kent Overstreet <koverstreet@google.com> aio: fix io_destroy() regression by using call_rcu()

There was a regression introduced by 36f5588905c1 ("aio: refcounting
cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
using RCU in the shutdown path, but the synchronize_rcu() was done in
the context of the io_destroy() syscall greatly increasing the time it
could block.

This patch switches it to call_rcu() and makes shutdown asynchronous
(more asynchronous than it was originally; before the refcount changes
io_destroy() would still wait on pending kiocbs).

Note that there's a global quota on the max outstanding kiocbs, and that
quota must be manipulated synchronously; otherwise io_setup() could
return -EAGAIN when there isn't quota available, and userspace won't
have any way of waiting until shutdown of the old kioctxs has finished
(besides busy looping).

So we release our quota before kioctx shutdown has finished, which
should be fine since the quota never corresponded to anything real
anyways.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Tested-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
03e04f048d2774aabd126fbad84729d4ba9dc40a 25-May-2013 Benjamin LaHaise <bcrl@kvack.org> aio: fix kioctx not being freed after cancellation at exit time

The recent changes overhauling fs/aio.c introduced a bug that results in
the kioctx not being freed when outstanding kiocbs are cancelled at
exit_aio() time. Specifically, a kiocb that is cancelled has its
completion events discarded by batch_complete_aio(), which then fails to
wake up the process stuck in free_ioctx(). Fix this by modifying the
wait_event() condition in free_ioctx() appropriately.

This patch was tested with the cancel operation in the thread based code
posted yesterday.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6900807c6b95dcb004902302b8ac5dbfbf6feb89 25-May-2013 Jeff Moyer <jmoyer@redhat.com> aio: fix io_getevents documentation

In reviewing man pages, I noticed that io_getevents is documented to
update the timeout that gets passed into the library call. This doesn't
happen in kernel space or in the library (even though it's documented to
do so in both places). Unless there is objection, I'd like to fix the
comments/docs to match the code (I will also update the man page upon
consensus).

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Cyril Hrubis <chrubis@suse.cz>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
41ef4eb8eef8d06bc1399e7b00c940d771554711 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: kill ki_retry

Thanks to Zach Brown's work to rip out the retry infrastructure, we don't
need this anymore - ki_retry was only called right after the kiocb was
initialized.

This also refactors and trims some duplicated code, as well as cleaning up
the refcounting/error handling a bit.

[akpm@linux-foundation.org: use fmode_t in aio_run_iocb()]
[akpm@linux-foundation.org: fix file_start_write/file_end_write tests]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8a6608907cf165b3ae658c9de2efe6af4be68bff 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: kill ki_key

ki_key wasn't actually used for anything previously - it was always 0.
Drop it to trim struct kiocb a bit.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4e23bcaeb9e8df234e47840ac2c757ab79a0b572 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: give shared kioctx fields their own cachelines

[akpm@linux-foundation.org: make reqs_active __cacheline_aligned_in_smp]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
58c85dc20a2c5ba0e53740c5dd1945e658da0e1c 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: kill struct aio_ring_info

struct aio_ring_info was kind of odd, the only place it's used is where
it's embedded in struct kioctx - there's no real need for it.

The next patch rearranges struct kioctx and puts various things on their
own cachelines - getting rid of struct aio_ring_info now makes that
reordering a bit clearer.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a1c8eae75ea3b0168fca93788db1b5aef2424921 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: kill batch allocation

Previously, allocating a kiocb required touching quite a few global
(well, per kioctx) cachelines... so batching up allocation to amortize
those was worthwhile. But we've gotten rid of some of those, and in
another couple of patches kiocb allocation won't require writing to any
shared cachelines, so that means we can just rip this code out.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3e845ce01a391d7c5d59ff2f28db5381bf02fa27 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: change reqs_active to include unreaped completions

The aio code tries really hard to avoid having to deal with the
completion ringbuffer overflowing. To do that, it has to keep track of
the number of outstanding kiocbs, and the number of completions
currently in the ringbuffer - and it's got to check that every time we
allocate a kiocb. Ouch.

But - we can improve this quite a bit if we just change reqs_active to
mean "number of outstanding requests and unreaped completions" - that
means kiocb allocation doesn't have to look at the ringbuffer, which is
a fairly significant win.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0460fef2a9215680f7f85415b57731b7e0fdf673 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: use cancellation list lazily

Cancelling kiocbs requires adding them to a per kioctx linked list,
which is one of the few things we need to take the kioctx lock for in
the fast path. But most kiocbs can't be cancelled - so if we just do
this lazily, we can avoid quite a bit of locking overhead.

While we're at it, instead of using a flag bit switch to using ki_cancel
itself to indicate that a kiocb has been cancelled/completed. This lets
us get rid of ki_flags entirely.

[akpm@linux-foundation.org: remove buggy BUG()]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
21b40200cfe961b1428a529c63c33b1f1e1b4738 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: use flush_dcache_page()

This wasn't causing problems before because it's not needed on x86, but
it is needed on other architectures.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a31ad380bed817aa25f8830ad23e1a0480fef797 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: make aio_read_evt() more efficient, convert to hrtimers

Previously, aio_read_event() pulled a single completion off the
ringbuffer at a time, locking and unlocking each time. Change it to
pull off as many events as it can at a time, and copy them directly to
userspace.

This also fixes a bug where if copying the event to userspace failed,
we'd lose the event.

Also convert it to wait_event_interruptible_hrtimeout(), which
simplifies it quite a bit.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
36f5588905c10a8c4568a210d601fe8c3c27e0f0 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: refcounting cleanup

The usage of ctx->dead was fubar - it makes no sense to explicitly check
it all over the place, especially when we're already using RCU.

Now, ctx->dead only indicates whether we've dropped the initial
refcount. The new teardown sequence is:

set ctx->dead
hlist_del_rcu();
synchronize_rcu();

Now we know no system calls can take a new ref, and it's safe to drop
the initial ref:

put_ioctx();

We also need to ensure there are no more outstanding kiocbs. This was
done incorrectly - it was being done in kill_ctx(), and before dropping
the initial refcount. At this point, other syscalls may still be
submitting kiocbs!

Now, we cancel and wait for outstanding kiocbs in free_ioctx(), after
kioctx->users has dropped to 0 and we know no more iocbs could be
submitted.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
11599ebac4a249ab3c8b9a535c21db7a51458c0a 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: make aio_put_req() lockless

Freeing a kiocb needed to touch the kioctx for three things:

* Pull it off the reqs_active list
* Decrementing reqs_active
* Issuing a wakeup, if the kioctx was in the process of being freed.

This patch moves these to aio_complete(), for a couple reasons:

* aio_complete() already has to issue the wakeup, so if we drop the
kioctx refcount before aio_complete does its wakeup we don't have to
do it twice.
* aio_complete currently has to take the kioctx lock, so it makes sense
for it to pull the kiocb off the reqs_active list too.
* A later patch is going to change reqs_active to include unreaped
completions - this will mean allocating a kiocb doesn't have to look
at the ringbuffer. So taking the decrement of reqs_active out of
kiocb_free() is useful prep work for that patch.

This doesn't really affect cancellation, since existing (usb) code that
implements a cancel function still calls aio_complete() - we just have
to make sure that aio_complete does the necessary teardown for cancelled
kiocbs.

It does affect code paths where we free kiocbs that were never
submitted; they need to decrement reqs_active and pull the kiocb off the
reqs_active list. This occurs in two places: kiocb_batch_free(), which
is going away in a later patch, and the error path in io_submit_one.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1d98ebfccc15aeea87a7c48d50d7343e1ce8daae 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: do fget() after aio_get_req()

aio_get_req() will fail if we have the maximum number of requests
outstanding, which depending on the application may not be uncommon. So
avoid doing an unnecessary fget().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
caf4167aa73bef20f5ce994b776ecb2e44226e5e 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: dprintk() -> pr_debug()

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4e179bca6718693148c7445c236bc3e0e0013ffd 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: move private stuff out of aio.h

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
906b973cf0ae8f10a6e8182845729b2ecc0da95a 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: add kiocb_cancel()

Minor refactoring, to get rid of some duplicated code

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2d68449e86168744513ca4f13477f081ce167130 08-May-2013 Kent Overstreet <koverstreet@google.com> aio: kill return value of aio_complete()

Nothing used the return value, and it probably wasn't possible to use it
safely for the locked versions (aio_complete(), aio_put_req()). Just
kill it.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
41003a7bcfed1255032e1e7c7b487e505b22e298 08-May-2013 Zach Brown <zab@redhat.com> aio: remove retry-based AIO

This removes the retry-based AIO infrastructure now that nothing in tree
is using it.

We want to remove retry-based AIO because it is fundemantally unsafe.
It retries IO submission from a kernel thread that has only assumed the
mm of the submitting task. All other task_struct references in the IO
submission path will see the kernel thread, not the submitting task.
This design flaw means that nothing of any meaningful complexity can use
retry-based AIO.

This removes all the code and data associated with the retry machinery.
The most significant benefit of this is the removal of the locking
around the unused run list in the submission path.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
91d80a84bbc8f28375cca7e65ec666577b4209ad 26-Apr-2013 Zhao Hongjiang <zhaohongjiang@huawei.com> aio: fix possible invalid memory access when DEBUG is enabled

dprintk() shouldn't access @ring after it's unmapped.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8d71db4f0890605d44815a2b2da4ca003f1bb142 20-Mar-2013 Al Viro <viro@zeniv.linux.org.uk> lift sb_start_write/sb_end_write out of ->aio_write()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2cf0966683430b6468f36ca20515a33ca7f2403c 21-Jan-2013 Al Viro <viro@zeniv.linux.org.uk> make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect

... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded
uses of asmlinkage_protect() in a bunch of syscalls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
b67bfe0d42cac56c512dd5da4b1b347a23f4b70a 28-Feb-2013 Sasha Levin <sasha.levin@oracle.com> hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
41badc15cbad0350de34408c1b0c690f9df76d4b 23-Feb-2013 Michel Lespinasse <walken@google.com> mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool

do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, however there was no equivalent code in mm_populate(), which
caused issues.

This could be fixed by introduced the same rounding in mm_populate(),
however I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bebeb3d68b24bb4132d452c5707fe321208bcbcd 23-Feb-2013 Michel Lespinasse <walken@google.com> mm: introduce mm_populate() for populating new vmas

When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.

This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was using during mlock() /
mlockall().

Signed-off-by: Michel Lespinasse <walken@google.com>
Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3ffa3c0e3f6e62f67fc2346ca60161dfb030083d 24-Jun-2012 Al Viro <viro@zeniv.linux.org.uk> aio: now fput() is OK from interrupt context; get rid of manual delayed __fput()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
e3fc629d7bb70848fbf479688a66d4e76dff46ac 31-May-2012 Al Viro <viro@zeniv.linux.org.uk> switch aio and shm to do_mmap_pgoff(), make do_mmap() static

after all, 0 bytes and 0 pages is the same thing...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ac34ebb3a67e699edcb5ac72f19d31679369dfaa 01-Jun-2012 Christopher Yeoh <cyeoh@au1.ibm.com> aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()

A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after
changes made to support CMA in an earlier patch.

Rather than having an additional check_access parameter to these
functions, the first paramater type is overloaded to allow the caller to
specify CHECK_IOVEC_ONLY which means check that the contents of the iovec
are valid, but do not check the memory that they point to. This is used
by process_vm_readv/writev where we need to validate that a iovec passed
to the syscall is valid but do not want to check the memory that it points
to at this point because it refers to an address space in another process.

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a70b52ec1aaeaf60f4739edb1b422827cb6f3893 22-May-2012 Linus Torvalds <torvalds@linux-foundation.org> vfs: make AIO use the proper rw_verify_area() area helpers

We had for some reason overlooked the AIO interface, and it didn't use
the proper rw_verify_area() helper function that checks (for example)
mandatory locking on the file, and that the size of the access doesn't
cause us to overflow the provided offset limits etc.

Instead, AIO did just the security_file_permission() thing (that
rw_verify_area() also does) directly.

This fixes it to do all the proper helper functions, which not only
means that now mandatory file locking works with AIO too, we can
actually remove lines of code.

Reported-by: Manish Honap <manish_honap_vit@yahoo.co.in>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bfce281c287a427d0841fadf5d59242757b4e620 21-Apr-2012 Al Viro <viro@zeniv.linux.org.uk> kill mm argument of vm_munmap()

it's always current->mm

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
936af1576e4c24b466380fc2b8d93352161d13b0 21-Apr-2012 Al Viro <viro@zeniv.linux.org.uk> aio: don't bother with unmapping when aio_free_ring() is coming from exit_aio()

... since exit_mmap() is coming and it will munmap() everything anyway.
In all other cases aio_free_ring() has ctx->mm == current->mm; moreover,
all other callers of vm_munmap() have mm == current->mm, so this will
allow us to get rid of mm argument of vm_munmap().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
a46ef99d80817a167477ed1c8b4d90ee0c2e726f 21-Apr-2012 Linus Torvalds <torvalds@linux-foundation.org> VM: add "vm_munmap()" helper function

Like the vm_brk() function, this is the same as "do_munmap()", except it
does the VM locking for the caller.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a2e1859adb3b05fa99f87a67df9ef2a4b7b04a13 20-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: take final put_ioctx() into callers of io_destroy()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
06af121eab543e5554f7a29538f171a382aaf855 20-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: merge aio_cancel_all() with wait_for_all_aios()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9fcf03d0d6e845ed495fc8b1ec328b473ff298b3 14-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: fix the comment in aio_kick_handler()

It should've been changed when queue_work() became
queue_delayed_work(..., 0) in there. It's always had been
about not needing a delay, not about not using specific
function...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
cd1ea261ac128479833b9f518bf788ee47ada2de 11-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: don't bother with cancel_delayed_work() in exit_aio()

__put_ioctx() will cover it anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bf50722a3c4a83aae651dc20b708308a4f119eb9 11-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: use cancel_delayed_work_sync()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9fa1cb397fa052fc9acfaf5a9f2faff31e10f6b7 11-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: aio_nr_lock is taken only synchronously now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2dd542b7aeb1c222273cf0593a718d9b44998d9f 11-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: aio_nr decrements don't need to be delayed

we can do that right in __put_ioctx(); as the result, the loop
in ioctx_alloc() can be killed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
e23754f880f10124f0a2848f9d17e361a295378e 06-Mar-2012 Al Viro <viro@zeniv.linux.org.uk> aio: don't bother with async freeing on failure in ioctx_alloc()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
e8e3c3d66fd9d1ee2250f68d778cc48c1346d228 25-Nov-2011 Cong Wang <amwang@redhat.com> fs: remove the second argument of k[un]map_atomic()

Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Cong Wang <amwang@redhat.com>
c7b285550544c22bc005ec20978472c9ac7138c6 08-Mar-2012 Al Viro <viro@ZenIV.linux.org.uk> aio: fix the "too late munmap()" race

Current code has put_ioctx() called asynchronously from aio_fput_routine();
that's done *after* we have killed the request that used to pin ioctx,
so there's nothing to stop io_destroy() waiting in wait_for_all_aios()
from progressing. As the result, we can end up with async call of
put_ioctx() being the last one and possibly happening during exit_mmap()
or elf_core_dump(), neither of which expects stray munmap() being done
to them...

We do need to prevent _freeing_ ioctx until aio_fput_routine() is done
with that, but that's all we care about - neither io_destroy() nor
exit_aio() will progress past wait_for_all_aios() until aio_fput_routine()
does really_put_req(), so the ioctx teardown won't be done until then
and we don't care about the contents of ioctx past that point.

Since actual freeing of these suckers is RCU-delayed, we don't need to
bump ioctx refcount when request goes into list for async removal.
All we need is rcu_read_lock held just over the ->ctx_lock-protected
area in aio_fput_routine().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
86b62a2cb4fc09037bbce2959d2992962396fd7f 07-Mar-2012 Al Viro <viro@ZenIV.linux.org.uk> aio: fix io_setup/io_destroy race

Have ioctx_alloc() return an extra reference, so that caller would drop it
on success and not bother with re-grabbing it on failure exit. The current
code is obviously broken - io_destroy() from another thread that managed
to guess the address io_setup() would've returned would free ioctx right
under us; gets especially interesting if aio_context_t * we pass to
io_setup() points to PROT_READ mapping, so put_user() fails and we end
up doing io_destroy() on kioctx another thread has just got freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
880641bb9da2473e9ecf6c708d993b29928c1b3c 05-Mar-2012 Jeff Moyer <jmoyer@redhat.com> aio: wake up waiters when freeing unused kiocbs

Bart Van Assche reported a hung fio process when either hot-removing
storage or when interrupting the fio process itself. The (pruned) call
trace for the latter looks like so:

fio D 0000000000000001 0 6849 6848 0x00000004
ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
Call Trace:
schedule+0x3f/0x60
io_schedule+0x8f/0xd0
wait_for_all_aios+0xc0/0x100
exit_aio+0x55/0xc0
mmput+0x2d/0x110
exit_mm+0x10d/0x130
do_exit+0x671/0x860
do_group_exit+0x44/0xb0
get_signal_to_deliver+0x218/0x5a0
do_signal+0x65/0x700
do_notify_resume+0x65/0x80
int_signal+0x12/0x17

The problem lies with the allocation batching code. It will
opportunistically allocate kiocbs, and then trim back the list of iocbs
when there is not enough room in the completion ring to hold all of the
events.

In the case above, what happens is that the pruning back of events ends
up freeing up the last active request and the context is marked as dead,
so it is thus responsible for waking up waiters. Unfortunately, the
code does not check for this condition, so we end up with a hung task.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@kernel.org> [3.2.x only]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
630d9c47274aa89bfa77fe6556d7818bdcb12992 17-Nov-2011 Paul Gortmaker <paul.gortmaker@windriver.com> fs: reduce the use of module.h wherever possible

For files only using THIS_MODULE and/or EXPORT_SYMBOL, map
them onto including export.h -- or if the file isn't even
using those, then just delete the include. Fix up any implicit
include dependencies that were being masked by module.h along
the way.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
69e4747ee9727d660b88d7e1efe0f4afcb35db1b 08-Jan-2012 Gleb Natapov <gleb@redhat.com> Unused iocbs in a batch should not be accounted as active.

Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
allocated in a batch during processing of first iocbs. All iocbs in a
batch are automatically added to ctx->active_reqs list and accounted in
ctx->reqs_active.

If one (not the last one) of iocbs submitted by an user fails, further
iocbs are not processed, but they are still present in ctx->active_reqs
and accounted in ctx->reqs_active. This causes process to stuck in a D
state in wait_for_all_aios() on exit since ctx->reqs_active will never
go down to zero. Furthermore since kiocb_batch_free() frees iocb
without removing it from active_reqs list the list become corrupted
which may cause oops.

Fix this by removing iocb from ctx->active_reqs and updating
ctx->reqs_active in kiocb_batch_free().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@kernel.org # 3.2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
080d676de095a14ecba14c0b9a91acb5bbb634df 02-Nov-2011 Jeff Moyer <jmoyer@redhat.com> aio: allocate kiocbs in batches

In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path. The reason
is that we take it for every I/O submitted (see __aio_get_req). Since we
know how many I/Os are passed to io_submit, we can preallocate the kiocbs
in batches, reducing the number of times we take and release the lock.

In my testing, I was able to reduce the amount of time spent in
_raw_spin_lock_irq by .56% (average of 3 runs). The command I used to
test this was:

aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>

I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Daniel Ehrenberg <dehrenberg@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fcf634098c00dd9cd247447368495f0b79be12d1 01-Nov-2011 Christopher Yeoh <cyeoh@au1.ibm.com> Cross Memory Attach

The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.

There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.

There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e91f90bb0bb10be9cc8efd09a3cf4ecffcad0db1 23-Mar-2011 Roland Dreier <roland@purestorage.com> aio: wake all waiters when destroying ctx

The test program below will hang because io_getevents() uses
add_wait_queue_exclusive(), which means the wake_up() in io_destroy() only
wakes up one of the threads. Fix this by using wake_up_all() in the aio
code paths where we want to make sure no one gets stuck.

// t.c -- compile with gcc -lpthread -laio t.c

#include <libaio.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static const int nthr = 2;

void *getev(void *ctx)
{
struct io_event ev;
io_getevents(ctx, 1, 1, &ev, NULL);
printf("io_getevents returned\n");
return NULL;
}

int main(int argc, char *argv[])
{
io_context_t ctx = 0;
pthread_t thread[nthr];
int i;

io_setup(1024, &ctx);

for (i = 0; i < nthr; ++i)
pthread_create(&thread[i], NULL, getev, ctx);

sleep(1);

io_destroy(ctx);

for (i = 0; i < nthr; ++i)
pthread_join(thread[i], NULL);

return 0;
}

Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cf15900e1209d5b46ec2d24643adbf561830935f 03-Mar-2011 Jens Axboe <jaxboe@fusionio.com> aio: remove request submission batching

This should be useless now that we have on-stack plugging. So lets just
kill it.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
9f5b9425468c85a901d863d241ba5c5dff9b23b8 01-Jul-2010 Shaohua Li <shaohua.li@intel.com> fs: make aio plug

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
7eaceaccab5f40bbfda044629a6298616aeaed50 10-Mar-2011 Jens Axboe <jaxboe@fusionio.com> block: remove per-queue plugging

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
7137c6bd455234bcb7560fd829e6ee49cae5fed6 25-Feb-2011 Jan Kara <jack@suse.cz> aio: fix race between io_destroy() and io_submit()

A race can occur when io_submit() races with io_destroy():

CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOs

We solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.

Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3bd9a5d734c7cc7533b27abf451416c7f50095a7 25-Feb-2011 Nick Piggin <npiggin@gmail.com> aio: fix rcu ioctx lookup

aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.

Fix the bug by atomically testing for zero refcount before incrementing.

[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d37adaa1596246929f7ab49843fd124595506175 26-Jan-2011 Tejun Heo <tj@kernel.org> fs/aio: aio_wq isn't used in memory reclaim path

aio_wq isn't used during memory reclaim. Convert to alloc_workqueue()
without WQ_MEM_RECLAIM. It's possible to use system_wq but given that
the number of work items is determined from userland and the work item
may block, enforcing strict concurrency limit would be a good idea.

Also, move fput_work to system_wq so that aio_wq is used soley to
throttle the max concurrency of aio work items and fput_work doesn't
interact with other work items.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org
27eaa1c90c608aa907336c2743d5ecf35c469440 13-Dec-2010 Namhyung Kim <namhyung@gmail.com> aio: check return value of create_workqueue()

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d3486f8b9eebcaa15ba8b72b63217e317f4f8635 13-Jan-2011 Jeff Moyer <jmoyer@redhat.com> aio: remove unused aio_run_iocbs()

aio_run_iocbs() is not used at all, so get rid of it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2e41025598ea7abb4330db98f78c5a084e25682f 13-Jan-2011 Namhyung Kim <namhyung@gmail.com> aio: remove unnecessary check

'nr >= min_nr >= 0' always satisfies 'nr >= 0' so the check is unnecesary.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7de9c6ee3ecffd99e1628e81a5ea5468f7581a1f 23-Oct-2010 Al Viro <viro@zeniv.linux.org.uk> new helper: ihold()

Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
306fb0979443419288594446a348155a8027dcf2 23-Aug-2010 Chris Mason <chris.mason@oracle.com> aio: bump i_count instead of using igrab

The aio batching code is using igrab to get an extra reference on the
inode so it can safely batch. igrab will go ahead and take the global
inode spinlock, which can be a bottleneck on large machines doing lots
of AIO.

In this case, igrab isn't required because we already have a reference
on the file handle. It is safe to just bump the i_count directly
on the inode.

Benchmarking shows this patch brings IOP/s on tons of flash up by about
2.5X.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
a0c42bac79731276c9b2f28d54f9e658fcf843a2 22-Sep-2010 Jan Kara <jack@suse.cz> aio: do not return ERESTARTSYS as a result of AIO

OCFS2 can return ERESTARTSYS from its write function when the process is
signalled while waiting for a cluster lock (and the filesystem is mounted
with intr mount option). Generally, it seems reasonable to allow
filesystems to return this error code from its IO functions. As we must
not leak ERESTARTSYS (and similar error codes) to userspace as a result of
an AIO operation, we have to properly convert it to EINTR inside AIO code
(restarting the syscall isn't really an option because other AIO could
have been already submitted by the same io_submit syscall).

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
75e1c70fc31490ef8a373ea2a4bea2524099b478 10-Sep-2010 Jeff Moyer <jmoyer@redhat.com> aio: check for multiplication overflow in do_io_submit

Tavis Ormandy pointed out that do_io_submit does not do proper bounds
checking on the passed-in iocb array:

       if (unlikely(nr < 0))
               return -EINVAL;

       if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
               return -EFAULT;                      ^^^^^^^^^^^^^^^^^^

The attached patch checks for overflow, and if it is detected, the
number of iocbs submitted is scaled down to a number that will fit in
the long.  This is an ok thing to do, as sys_io_submit is documented as
returning the number of iocbs submitted, so callers should handle a
return value of less than the 'nr' argument passed in.

Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
642b5123ac5ec40a28575e930a3e2ff595473e9d 05-Aug-2010 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> aio: fix wrong subsystem comments

- sys_io_destroy(): acutually return -EINVAL if the context pointed to
is invalidIndex: linux-2.6.33-rc4/fs/aio.c
- sys_io_getevents(): An argument specifying timeout is not `when',
but `timeout'.
- sys_io_getevents(): Should describe what is returned if this syscall
succeeds.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d7065da038227a4d09a244e6014e0186a6bd21d0 26-May-2010 Al Viro <viro@zeniv.linux.org.uk> get rid of the magic around f_count in aio

__aio_put_req() plays sick games with file refcount. What
it wants is fput() from atomic context; it's almost always
done with f_count > 1, so they only have to deal with delayed
work in rare cases when their reference happens to be the
last one. Current code decrements f_count and if it hasn't
hit 0, everything is fine. Otherwise it keeps a pointer
to struct file (with zero f_count!) around and has delayed
work do __fput() on it.

Better way to do it: use atomic_long_add_unless( , -1, 1)
instead of !atomic_long_dec_and_test(). IOW, decrement it
only if it's not the last reference, leave refcount alone
if it was. And use normal fput() in delayed work.

I've made that atomic_long_add_unless call a new helper -
fput_atomic(). Drops a reference to file if it's safe to
do in atomic (i.e. if that's not the last one), tells if
it had been able to do that. aio.c converted to it, __fput()
use is gone. req->ki_file *always* contributes to refcount
now. And __fput() became static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9d85cba718efeef9ca00ce3f7f34f5880737aa9b 26-May-2010 Jeff Moyer <jmoyer@redhat.com> aio: fix the compat vectored operations

The aio compat code was not converting the struct iovecs from 32bit to
64bit pointers, causing either EINVAL to be returned from io_getevents, or
EFAULT as the result of the I/O. This patch passes a compat flag to
io_submit to signal that pointer conversion is necessary for a given iocb
array.

A variant of this was tested by Michael Tokarev. I have also updated the
libaio test harness to exercise this code path with good success.
Further, I grabbed a copy of ltp and ran the
testcases/kernel/syscall/readv and writev tests there (compiled with -m32
on my 64bit system). All seems happy, but extra eyes on this would be
welcome.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_COMPAT=n build]
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org> [2.6.35.1]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fac046ad0b1ee2c4244ebf43a26433ef0ea29ae4 16-Dec-2009 Shaohua Li <shaohua.li@intel.com> aio: remove unused field

Don't know the reason, but it appears ki_wait field of iocb never gets used.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b9d128f1088ea5245109dfc9bbceb128b6371a77 29-Oct-2009 Jens Axboe <jens.axboe@oracle.com> block: move bdi/address_space unplug functions to backing-dev.h

There's nothing block related about them, the backing device
is used by things like NFS etc as well. This gets rid of the
need to protect such calls by CONFIG_BLOCK.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
cfb1e33eed48165763edc7a4a067cf5f74898d0b 03-Oct-2009 Jeff Moyer <jmoyer@redhat.com> aio: implement request batching

Hi,

Some workloads issue batches of small I/O, and the performance is poor
due to the call to blk_run_address_space for every single iocb. Nathan
Roberts pointed this out, and suggested that by deferring this call
until all I/Os in the iocb array are submitted to the block layer, we
can realize some impressive performance gains (up to 30% for sequential
4k reads in batches of 16).

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
385773e04806e8903e9ec683f5c4bd14926a86dc 23-Sep-2009 H Hartley Sweeten <hartleys@visionengravers.com> aio.c: move EXPORT* macros to line after function

As mentioned in Documentation/CodingStyle, move EXPORT* macro's
to the line immediately after the closing function brace line.

Also, move the __initcall() similarly.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3d2d827f5ca5e32816194119d5c980c7e04474a6 22-Sep-2009 Michael S. Tsirkin <mst@redhat.com> mm: move use_mm/unuse_mm from aio.c to mm/

Anyone who wants to do copy to/from user from a kernel thread, needs
use_mm (like what fs/aio has). Move that into mm/, to make reusing and
exporting easier down the line, and make aio use it. Next intended user,
besides aio, will be vhost-net.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
133890103b9de08904f909995973e4b5c08a780e 30-Jun-2009 Davide Libenzi <davidel@xmailserver.org> eventfd: revised interface and cleanups

Change the eventfd interface to de-couple the eventfd memory context, from
the file pointer instance.

Without such change, there is no clean way to racely free handle the
POLLHUP event sent when the last instance of the file* goes away. Also,
now the internal eventfd APIs are using the eventfd context instead of the
file*.

This patch is required by KVM's IRQfd code, which is still under
development.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
65c24491b4fef017c64e39ec64384fde5e05e0a0 19-Mar-2009 Jeff Moyer <jmoyer@redhat.com> aio: lookup_ioctx can return the wrong value when looking up a bogus context

The libaio test harness turned up a problem whereby lookup_ioctx on a
bogus io context was returning the 1 valid io context from the list
(harness/cases/3.p).

Because of that, an extra put_iocontext was done, and when the process
exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio
(since we expect a users count of 1 and instead get 0).

The problem was introduced by "aio: make the lookup_ioctx() lockless"
(commit abf137dd7712132ee56d5b3143c2ff61a72a5faa).

Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not
return with a NULL tpos at the end of the loop, even if the entry was
not found.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
87c3a86e1c220121d0ced59d1a71e78ed9abc6dd 19-Mar-2009 Davide Libenzi <davidel@xmailserver.org> eventfd: remove fput() call from possible IRQ context

Remove a source of fput() call from inside IRQ context. Myself, like Eric,
wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was
able to, with the attached test program. Independently from this, the bug is
conceptually there, so we might be better off fixing it. This patch adds an
optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd.
Playing with ->f_count directly is not pretty in general, but the alternative
here would be to add a brand new delayed fput() infrastructure, that I'm not
sure is worth it.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
002c8976ee537724b20a5e179d9b349309438836 14-Jan-2009 Heiko Carstens <heiko.carstens@de.ibm.com> [CVE-2009-0029] System call wrappers part 16

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
abf137dd7712132ee56d5b3143c2ff61a72a5faa 09-Dec-2008 Jens Axboe <jens.axboe@oracle.com> aio: make the lookup_ioctx() lockless

The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.

There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
516e0cc5646f377ab80fcc2ee639892eccb99853 26-Jul-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] f_count may wrap around

make it atomic_long_t; while we are at it, get rid of useless checks in affs,
hfs and hpfs - ->open() always has it equal to 1, ->release() - to 0.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
246bb0b1deb29726990620d8b5e55ca29f331362 25-Jul-2008 Oleg Nesterov <oleg@tv-sign.ru> kill PF_BORROWED_MM in favour of PF_KTHREAD

Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.

No functional changes yet. But this allows us to do further
fixes/cleanups.

oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
kthreads, this is wrong because of use_mm(). The problem with
PF_BORROWED_MM is that we need task_lock() to avoid races. With this
patch we can check PF_KTHREAD directly, or use a simple lockless helper:

/* The result must not be dereferenced !!! */
struct mm_struct *__get_task_mm(struct task_struct *tsk)
{
if (tsk->flags & PF_KTHREAD)
return NULL;
return tsk->mm;
}

Note also ecard_task(). It runs with ->mm != NULL, but it's the kernel
thread without PF_BORROWED_MM.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aab2545fdd6641b76af0ae96456c4ca9d1e50dad 06-Jun-2008 Oleg Nesterov <oleg@tv-sign.ru> uml: activate_mm: remove the dead PF_BORROWED_MM check

use_mm() was changed to use switch_mm() instead of activate_mm(), since
then nobody calls (and nobody should call) activate_mm() with
PF_BORROWED_MM bit set.

As Jeff Dike pointed out, we can also remove the "old != new" check, it is
always true.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c6f3a97f86a5c97be0ca255976110bb9c3cfe669 30-Apr-2008 Thomas Gleixner <tglx@linutronix.de> debugobjects: add timer specific object debugging code

Add calls to the generic object debugging infrastructure and provide fixup
functions which allow to keep the system alive when recoverable problems have
been detected by the object debugging core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
39fa00311f21318cc498b139c2cc2830dcad98ff 29-Apr-2008 Jeff Moyer <jmoyer@redhat.com> aio: fix misleading comments

The FIXME comments are inaccurate.
The locking comment over lookup_ioctx() is wrong.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
801678c5a3b4c79236970bcca27c733f5559e0d1 29-Apr-2008 Hirofumi Nakagawa <hnakagawa@miraclelinux.com> Remove duplicated unlikely() in IS_ERR()

Some drivers have duplicated unlikely() macros. IS_ERR() already has
unlikely() in itself.

This patch cleans up such pointless code.

Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jeff Garzik <jeff@garzik.org>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d5470b596abdd566339b2417e807b1198be64b97 29-Apr-2008 Adrian Bunk <bunk@kernel.org> fs/aio.c: make 3 functions static

Make the following needlessly global functions static:

- __put_ioctx()
- lookup_ioctx()
- io_submit_one()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e92adcba261fd391591bb63c1703185a04a41554 28-Apr-2008 Jeff Moyer <jmoyer@redhat.com> aio: io_getevents() should return if io_destroy() is invoked

This patch wakes up a thread waiting in io_getevents if another thread
destroys the context. This was tested using a small program that spawns a
thread to wait in io_getevents while the parent thread destroys the io context
and then waits for the getevents thread to exit. Without this patch, the
program hangs indefinitely. With the patch, the program exits as expected.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Christopher Smith <x@xman.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8d1c98b0b5c0148b519c6416e689ef6a89ffcea3 11-Apr-2008 Davide Libenzi <davidel@xmailserver.org> eventfd/kaio integration fix

Jeff Roberson discovered a race when using kaio eventfd based notifications.
When it occurs it can lead tomissed wakeups and hung userspace.

This patch fixes the race by moving the notification inside the spinlocked
section of kaio. The operation is safe since eventfd spinlock and kaio one
are unrelated.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jeff Roberson <jroberson@chesapeake.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
598af051a79d05b751fe793f1fe09fcf74763e02 11-Apr-2008 Roland McGrath <roland@redhat.com> asmlinkage_protect sys_io_getevents

Use asmlinkage_protect in sys_io_getevents, because GCC for i386 with
CONFIG_FRAME_POINTER=n can decide to clobber an argument word on the
stack, i.e. the user struct pt_regs. Here the problem is not a tail
call, but just the compiler's use of the stack when it inlines and
optimizes the body of the called function. This seems to avoid it.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6cb2a21049b8990df4576c5fce4d48d0206c22d5 20-Mar-2008 Quentin Barnes <qbarnes+linux@yahoo-inc.com> aio: bad AIO race in aio_complete() leads to process hang

My group ran into a AIO process hang on a 2.6.24 kernel with the process
sleeping indefinitely in io_getevents(2) waiting for the last wakeup to come
and it never would.

We ran the tests on x86_64 SMP. The hang only occurred on a Xeon box
("Clovertown") but not a Core2Duo ("Conroe"). On the Xeon, the L2 cache isn't
shared between all eight processors, but is L2 is shared between between all
two processors on the Core2Duo we use.

My analysis of the hang is if you go down to the second while-loop
in read_events(), what happens on processor #1:
1) add_wait_queue_exclusive() adds thread to ctx->wait
2) aio_read_evt() to check tail
3) if aio_read_evt() returned 0, call [io_]schedule() and sleep

In aio_complete() with processor #2:
A) info->tail = tail;
B) waitqueue_active(&ctx->wait)
C) if waitqueue_active() returned non-0, call wake_up()

The way the code is written, step 1 must be seen by all other processors
before processor 1 checks for pending events in step 2 (that were recorded by
step A) and step A by processor 2 must be seen by all other processors
(checked in step 2) before step B is done.

The race I believed I was seeing is that steps 1 and 2 were
effectively swapped due to the __list_add() being delayed by the L2
cache not shared by some of the other processors. Imagine:
proc 2: just before step A
proc 1, step 1: adds to ctx->wait, but is not visible by other processors yet
proc 1, step 2: checks tail and sees no pending events
proc 2, step A: updates tail
proc 1, step 3: calls [io_]schedule() and sleeps
proc 2, step B: checks ctx->wait, but sees no one waiting, skips wakeup
so proc 1 sleeps indefinitely

My patch adds a memory barrier between steps A and B. It ensures that the
update in step 1 gets seen on processor 2 before continuing. If processor 1
was just before step 1, the memory barrier makes sure that step A (update
tail) gets seen by the time processor 1 makes it to step 2 (check tail).

Before the patch our AIO process would hang virtually 100% of the time. After
the patch, we have yet to see the process ever hang.

Signed-off-by: Quentin Barnes <qbarnes+linux@yahoo-inc.com>
Reviewed-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ We should probably disallow that "if (waitqueue_active()) wake_up()"
coding pattern, because it's so often buggy wrt memory ordering ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c2ec66828fd253802abb912668f4bf9597e3c898 08-Feb-2008 Rusty Russell <rusty@rustcorp.com.au> aio: negative offset should return -EINVAL

An AIO read or write should return -EINVAL if the offset is negative.
This check matches the one in pread and pwrite.

This was found by the libaio test suite.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7adfa2ff3efa02a7a9f2632d2d2662d3e5eb5304 08-Feb-2008 Rusty Russell <rusty@rustcorp.com.au> aio: partial write should not return error code

When an AIO write gets an error after writing some data (eg. ENOSPC), it
should return the amount written already, not the error. Just like write()
is supposed to.

This was found by the libaio test suite.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-By: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fc9b52cd8f5f459b88adcf67c47668425ae31a78 08-Feb-2008 Harvey Harrison <harvey.harrison@gmail.com> fs: remove fastcall, it is always empty

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
56c4da454de1264e381256f658f61b9ef690dd21 30-Jan-2008 Harvey Harrison <harvey.harrison@gmail.com> core: remove last users of empty FASTCALL macro

FASTCALL is always empty after the x86 removal.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
e00ba3dae077f54cfd2af42e939a618caa7a3bca 05-Dec-2007 Jeff Moyer <jmoyer@redhat.com> aio: only account I/O wait time in read_events if there are active requests

On 2.6.24, top started showing 100% iowait on one CPU when a UML instance was
running (but completely idle). The UML code sits in io_getevents waiting for
an event to be submitted and completed.

Fix this by checking ctx->reqs_active before scheduling to determine whether
or not we are waiting for I/O.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6212e3a388fdda3f19fa660ef5a30edf54d1dcfd 18-Oct-2007 Alexey Dobriyan <adobriyan@gmail.com> Remove struct task_struct::io_wait

Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b
during 2.6.9 development. Commit introduced io_wait field which remained
write-only than and still remains write-only.

Also garbage collect macros which "use" io_wait.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
41d10da3717409de33d5441f2f6d8f072ab3fbb6 17-Oct-2007 Jeff Moyer <jmoyer@redhat.com> aio: account I/O wait time properly

Some months back I proposed changing the schedule() call in
read_events to an io_schedule():
http://osdir.com/ml/linux.kernel.aio.general/2006-10/msg00024.html
This was rejected as there are AIO operations that do not initiate
disk I/O. I've had another look at the problem, and the only AIO
operation that will not initiate disk I/O is IOCB_CMD_NOOP. However,
this command isn't even wired up!

Given that it doesn't work, and hasn't for *years*, I'm going to
suggest again that we do proper I/O accounting when using AIO.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
87e2831c3fa39cbf6f7ab676bb5aef039b9659e2 08-Oct-2007 Yan Zheng <yanzheng@21cn.com> AIO: fix cleanup in io_submit_one(...)

When IOCB_FLAG_RESFD flag is set and iocb->aio_resfd is incorrect,
statement 'goto out_put_req' is executed. At label 'out_put_req',
aio_put_req(..) is called, which requires 'req->ki_filp' set.

Signed-off-by: Yan Zheng<yanzheng@21cn.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9c3060bedd84144653a2ad7bea32389f65598d40 11-May-2007 Davide Libenzi <davidel@xmailserver.org> signal/timer/event: KAIO eventfd support example

This is an example about how to add eventfd support to the current KAIO code,
in order to enable KAIO to post readiness events to a pollable fd (hence
compatible with POSIX select/poll). The KAIO code simply signals the eventfd
fd when events are ready, and this triggers a POLLIN in the fd. This patch
uses a reserved for future use member of the struct iocb to pass an eventfd
file descriptor, that KAIO will use to post events every time a request
completes. At that point, an aio_getevents() will return the completed result
to a struct io_event. I made a quick test program to verify the patch, and it
runs fine here:

http://www.xmailserver.org/eventfd-aio-test.c

The test program uses poll(2), but it'd, of course, work with select and epoll
too.

This can allow to schedule both block I/O and other poll-able devices
requests, and wait for results using select/poll/epoll. In a typical
scenario, an application would submit KAIO request using aio_submit(), and
will also use epoll_ctl() on the whole other class of devices (that with the
addition of signals, timers and user events, now it's pretty much complete),
and then would:

epoll_wait(...);
for_each_event {
if (curr_event_is_kaiofd) {
aio_getevents();
dispatch_aio_events();
} else {
dispatch_epoll_event();
}
}

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
28e53bddf814485699a4142bc056fd37d4e11dd4 09-May-2007 Oleg Nesterov <oleg@tv-sign.ru> unify flush_work/flush_work_keventd and rename it to cancel_work_sync

flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this). So we can unify
flush_work_keventd and flush_work.

Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.

(akpm: this is why the earlier patches bypassed maintainers)

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a9df62c7585e6caa1e7d2425b2b14460ec3afc20 09-May-2007 Andrew Morton <akpm@osdl.org> aio: use flush_work()

Migrate AIO over to use flush_work().

Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0a31bd5f2bbb6473ef9d24f0063ca91cfa678b64 06-May-2007 Christoph Lameter <clameter@sgi.com> KMEM_CACHE(): simplify slab cache creation

This patch provides a new macro

KMEM_CACHE(<struct>, <flags>)

to simplify slab creation. KMEM_CACHE creates a slab with the name of the
struct, with the size of the struct and with the alignment of the struct.
Additional slab flags may be specified if necessary.

Example

struct test_slab {
int a,b,c;
struct list_head;
} __cacheline_aligned_in_smp;

test_slab_cache = KMEM_CACHE(test_slab, SLAB_PANIC)

will create a new slab named "test_slab" of the size sizeof(struct
test_slab) and aligned to the alignment of test slab. If it fails then we
panic.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
28defbea64622f69d65a6079bf800cedb9915a5f 28-Mar-2007 Zach Brown <zach.brown@oracle.com> [PATCH] aio: remove bare user-triggerable error printk

The user can generate console output if they cause do_mmap() to fail
during sys_io_setup(). This was seen in a regression test that does
exactly that by spinning calling mmap() until it gets -ENOMEM before
calling io_setup().

We don't need this printk at all, just remove it.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c376222960ae91d5ffb9197ee36771aaed1d9f90 10-Feb-2007 Robert P. J. Day <rpjday@mindspring.com> [PATCH] Transform kmem_cache_alloc()+memset(0) -> kmem_cache_zalloc().

Replace appropriate pairs of "kmem_cache_alloc()" + "memset(0)" with the
corresponding "kmem_cache_zalloc()" call.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Greg KH <greg@kroah.com>
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e10a4437cb37c85f2df95432025b392d98aac2aa 10-Feb-2007 Robert P. J. Day <rpjday@mindspring.com> [PATCH] Remove final references to deprecated "MAP_ANON" page protection flag

Remove the last vestiges of the long-deprecated "MAP_ANON" page protection
flag: use "MAP_ANONYMOUS" instead.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dee11c2364f51cac53df17d742a0c69097e29a4e 03-Feb-2007 Ken Chen <kenchen@google.com> [PATCH] aio: fix buggy put_ioctx call in aio_complete - v2

An AIO bug was reported that sleeping function is being called in softirq
context:

BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
Call Trace:
[<a000000100577b00>] __mutex_lock_slowpath+0x640/0x6c0
[<a000000100577ba0>] mutex_lock+0x20/0x40
[<a0000001000a25b0>] flush_workqueue+0xb0/0x1a0
[<a00000010018c0c0>] __put_ioctx+0xc0/0x240
[<a00000010018d470>] aio_complete+0x2f0/0x420
[<a00000010019cc80>] finished_one_bio+0x200/0x2a0
[<a00000010019d1c0>] dio_bio_complete+0x1c0/0x200
[<a00000010019d260>] dio_bio_end_aio+0x60/0x80
[<a00000010014acd0>] bio_endio+0x110/0x1c0
[<a0000001002770e0>] __end_that_request_first+0x180/0xba0
[<a000000100277b90>] end_that_request_chunk+0x30/0x60
[<a0000002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
[<a0000002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
[<a0000002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
[<a0000002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
[<a0000002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
[<a000000100277d20>] blk_done_softirq+0x160/0x1c0
[<a000000100083e00>] __do_softirq+0x200/0x240
[<a000000100083eb0>] do_softirq+0x70/0xc0

See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

flush_workqueue() is not allowed to be called in the softirq context.
However, aio_complete() called from I/O interrupt can potentially call
put_ioctx with last ref count on ioctx and triggers bug. It is simply
incorrect to perform ioctx freeing from aio_complete.

The bug is trigger-able from a race between io_destroy() and aio_complete().
A possible scenario:

cpu0 cpu1
io_destroy aio_complete
wait_for_all_aios { __aio_put_req
... ctx->reqs_active--;
if (!ctx->reqs_active)
return;
}
...
put_ioctx(ioctx)

put_ioctx(ctx);
__put_ioctx
bam! Bug trigger!

The real problem is that the condition check of ctx->reqs_active in
wait_for_all_aios() is incorrect that access to reqs_active is not
being properly protected by spin lock.

This patch adds that protective spin lock, and at the same time removes
all duplicate ref counting for each kiocb as reqs_active is already used
as a ref count for each active ioctx. This also ensures that buggy call
to flush_workqueue() in softirq context is eliminated.

Signed-off-by: "Ken Chen" <kenchen@google.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: <stable@kernel.org>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1ebb1101c556b1915ff041655e629a072e64dcda 30-Dec-2006 Zach Brown <zach.brown@oracle.com> [PATCH] Fix lock inversion aio_kick_handler()

lockdep found a AB BC CA lock inversion in retry-based AIO:

1) The task struct's alloc_lock (A) is acquired in process context with
interrupts enabled. An interrupt might arrive and call wake_up() which
grabs the wait queue's q->lock (B).

2) When performing retry-based AIO the AIO core registers
aio_wake_function() as the wake funtion for iocb->ki_wait. It is called
with the wait queue's q->lock (B) held and then tries to add the iocb to
the run list after acquiring the ctx_lock (C).

3) aio_kick_handler() holds the ctx_lock (C) while acquiring the
alloc_lock (A) via lock_task() and unuse_mm(). Lockdep emits a warning
saying that we're trying to connect the irq-safe q->lock to the
irq-unsafe alloc_lock via ctx_lock.

This fixes the inversion by calling unuse_mm() in the AIO kick handing path
after we've released the ctx_lock. As Ben LaHaise pointed out __put_ioctx
could set ctx->mm to NULL, so we must only access ctx->mm while we have the
lock.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Suparna Bhattacharya <suparna@in.ibm.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
90aef12e6dd609e1ad7fb70044eedc78ca55ee5e 13-Dec-2006 Jeremy Fitzhardinge <jeremy@goop.org> [PATCH] Use activate_mm() in fs/aio.c:use_mm()

activate_mm() is not the right thing to be using in use_mm(). It should be
switch_mm().

On normal x86, they're synonymous, but for the Xen patches I'm adding a
hook which assumes that activate_mm is only used the first time a new mm
is used after creation (I have another hook for dealing with dup_mm). I
think this use of activate_mm() is the only place where it could be used
a second time on an mm.

>From a quick look at the other architectures I think this is OK (most
simply implement one in terms of the other), but some are doing some
subtly different stuff between the two.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
97d2a80584b30b5cd32da411deca1986ef61877a 07-Dec-2006 Benjamin LaHaise <bcrl@kvack.org> [PATCH] aio: remove ki_retried debugging member

Remove the ki_retried member from struct kiocb. I think the idea was
bounced around a while back, but Arnaldo pointed out another reason that we
should dig it up when he pointed out that the last cacheline of struct
kiocb only contains 4 bytes. By removing the debugging member, we save
more than the 8 byte on 64 bit machines.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
b62e8ec2ac580b47c11eb76e8852ac1ec7d617cd 07-Dec-2006 Chen, Kenneth W <kenneth.w.chen@intel.com> [PATCH] aio: kill pointless ki_nbytes assignment in aio_setup_single_vector

io_submit_one assigns ki_left = ki_nbytes = iocb->aio_nbytes, then calls
down to aio_setup_iocb, then to aio_setup_single_vector. In there,
ki_nbytes is reassigned to the same value it got two call stack above it.
There is no need to do so.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
e18b890bb0881bbab6f4f1a6cd20d9c60d66b003 07-Dec-2006 Christoph Lameter <clameter@sgi.com> [PATCH] slab: remove kmem_cache_t

Replace all uses of kmem_cache_t with struct kmem_cache.

The patch was generated using the following script:

#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#

set -e

for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done

The script was run like this

sh replace kmem_cache_t "struct kmem_cache"

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
93e06b4140cc018826bce4d97b0bf7c9ba05ae6e 30-Nov-2006 Eric Sesterhenn <snakebyte@gmx.de> BUG_ON conversion for fs/aio.c

This patch converts a if () BUG(); construct to BUG_ON();
which occupies less space, uses unlikely and is safer when
BUG() is disabled.

Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
65f27f38446e1976cc98fd3004b110fedcddd189 22-Nov-2006 David Howells <dhowells@redhat.com> WorkStruct: Pass the work_struct pointer instead of context data

Pass the work_struct pointer to the work function rather than context data.
The work function can use container_of() to work out the data.

For the cases where the container of the work_struct may go away the moment the
pending bit is cleared, it is made possible to defer the release of the
structure by deferring the clearing of the pending bit.

To make this work, an extra flag is introduced into the management side of the
work_struct. This governs auto-release of the structure upon execution.

Ordinarily, the work queue executor would release the work_struct for further
scheduling or deallocation by clearing the pending bit prior to jumping to the
work function. This means that, unless the driver makes some guarantee itself
that the work_struct won't go away, the work function may not access anything
else in the work_struct or its container lest they be deallocated.. This is a
problem if the auxiliary data is taken away (as done by the last patch).

However, if the pending bit is *not* cleared before jumping to the work
function, then the work function *may* access the work_struct and its container
with no problems. But then the work function must itself release the
work_struct by calling work_release().

In most cases, automatic release is fine, so this is the default. Special
initiators exist for the non-auto-release case (ending in _NAR).


Signed-Off-By: David Howells <dhowells@redhat.com>
52bad64d95bd89e08c49ec5a071fa6dcbe5a1a9c 22-Nov-2006 David Howells <dhowells@redhat.com> WorkStruct: Separate delayable and non-delayable events.

Separate delayable work items from non-delayable work items be splitting them
into a separate structure (delayed_work), which incorporates a work_struct and
the timer_list removed from work_struct.

The work_struct struct is huge, and this limits it's usefulness. On a 64-bit
architecture it's nearly 100 bytes in size. This reduces that by half for the
non-delayable type of event.

Signed-Off-By: David Howells <dhowells@redhat.com>
691578cd558e647dd89ae5bc74a2307bce821186 03-Oct-2006 Zach Brown <zach.brown@oracle.com> [PATCH] pr_debug: aio: use size_t length modifier in pr_debug format arguments

aio: use size_t length modifier in pr_debug format arguments

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
eed4e51fb60c3863c134a5e9f6006b29805ead97 01-Oct-2006 Badari Pulavarty <pbadari@us.ibm.com> [PATCH] Add vector AIO support

This work is initially done by Zach Brown to add support for vectored aio.
These are the core changes for AIO to support
IOCB_CMD_PREADV/IOCB_CMD_PWRITEV.

[akpm@osdl.org: huge build fix]
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
027445c37282bc1ed26add45e573ad2d3e4860a5 01-Oct-2006 Badari Pulavarty <pbadari@us.ibm.com> [PATCH] Vectorize aio_read/aio_write fileop methods

This patch vectorizes aio_read() and aio_write() methods to prepare for
collapsing all aio & vectored operations into one interface - which is
aio_read()/aio_write().

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Michael Holzheu <HOLZHEU@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d6e05edc59ecd79e8badf440c0d295a979bdfa3e 26-Jun-2006 Andreas Mohr <andi@lisas.de> spelling fixes

acquired (aquired)
contiguous (contigious)
successful (succesful, succesfull)
surprise (suprise)
whether (weather)
some other misspellings

Signed-off-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
626ab0e69d376fa07599af669af8ba92d58e87c1 23-Jun-2006 Oleg Nesterov <oleg@tv-sign.ru> [PATCH] list: use list_replace_init() instead of list_splice_init()

list_splice_init(list, head) does unneeded job if it is known that
list_empty(head) == 1. We can use list_replace_init() instead.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
11b0b5abb2097a63c1081d9b7e825b987b227972 25-Mar-2006 Oliver Neukum <neukum@fachschaft.cup.uni-muenchen.de> [PATCH] use kzalloc and kcalloc in core fs code

Signed-off-by: Oliver Neukum <oliver@neukum.name>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
095975da26dba21698582e91e96be10f7417333f 08-Jan-2006 Nick Piggin <nickpiggin@yahoo.com.au> [PATCH] rcu file: use atomic primitives

Use atomic_inc_not_zero for rcu files instead of special case rcuref.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d00689af6b3b6ba9e1fdefec3bd62edc860c385d 14-Nov-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: replace locking comments with assert_spin_locked()

aio: replace locking comments with assert_spin_locked()

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
20dcae32439384b6863c626bb3b2a09bed65b33e 14-Nov-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: remove kioctx from mm_struct

Sync iocbs have a life cycle that don't need a kioctx. Their retrying, if
any, is done in the context of their owner who has allocated them on the
stack.

The sole user of a sync iocb's ctx reference was aio_complete() checking for
an elevated iocb ref count that could never happen. No path which grabs an
iocb ref has access to sync iocbs.

If we were to implement sync iocb cancelation it would be done by the owner of
the iocb using its on-stack reference.

Removing this chunk from aio_complete allows us to remove the entire kioctx
instance from mm_struct, reducing its size by a third. On a i386 testing box
the slab size went from 768 to 504 bytes and from 5 to 8 per page.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
d55b5fdaf40846221d543937b786956e27837fda 07-Nov-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: remove aio_max_nr accounting race

AIO was adding a new context's max requests to the global total before
testing if that resulting total was over the global limit. This let
innocent tasks get their new limit tested along with a racing guilty task
that was crossing the limit. This serializes the _nr accounting with a
spinlock It also switches to using unsigned long for the global totals.
Individual contexts are still limited to an unsigned int's worth of
requests by the syscall interface.

The problem and fix were verified with a simple program that spun creating
and destroying a context while holding on to another long lived context.
Before the patch a task creating a tiny context could get a spurious EAGAIN
if it raced with a task creating a very large context that overran the
limit.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8766ce41018a0cb80fbe0ce7dbf747f357c752da 23-Oct-2005 Kostik Belousov <konstantin.belousov@zoral.com.ua> [PATCH] aio syscalls are not checked by lsm

Another case of missing call to security_file_permission: aio functions
(namely, io_submit) does not check credentials with security modules.

Below is the simple patch to the problem. It seems that it is enough to
check for rights at the request submission time.

Signed-off-by: Kostik Belousov <kostikbel@gmail.com>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4faa5285283fad081443e3612ca426a311bb6c7e 18-Oct-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: revert lock_kiocb()

lock_kiocb() was introduced to serialize retrying and cancellation. In the
process of doing so it tried to sleep waiting for KIF_LOCKED while holding
the ctx_lock spinlock. Recent fixes have ensured that multiple concurrent
retries won't be attempted for a given iocb. Cancel has other problems and
has no significant in-tree users that have been complaining about it. So
for the immediate future we'll revert sleeping with the lock held and will
address proper cancellation and retry serialization in the future.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
353fb07e2043d2df12dddf4e2c39552d0ab9b026 30-Sep-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: avoid extra aio_{read,write} call when ki_left == 0

Recently aio_p{read,write} changed to perform retries internally rather
than returning -EIOCBRETRY. This inadvertantly resulted in always calling
aio_{read,write} with ki_left at 0 which would in turn immediately return
0. Harmless, but we can avoid this call by checking in the caller.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
897f15fb587fd2772b9e7ff6ec0265057f3c3975 30-Sep-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: remove unlocked task_list test and resulting race

Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick path could delete the task_list item from the wait
queue before getting the ctx_lock and putting the iocb on the run list.
The run path was testing the task_list item outside the lock so that it
could catch ki_retry methods that return -EIOCBRETRY *without* putting the
iocb on a wait queue and promising to call kick_iocb. This unlocked check
could then race with the kick path to cause both to try and put the iocb on
the run list.

The patch stops the run path from testing task_list by requring that any
ki_retry that returns -EIOCBRETRY *must* guarantee that kick_iocb() will be
called in the future. aio_p{read,write}, the only in-tree -EIOCBRETRY
users, are updated.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
998765e5588b197737d457e16f72832d8036190f 30-Sep-2005 Zach Brown <zach.brown@oracle.com> [PATCH] aio: lock around kiocbTryKick()

Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick patch could set the Kicked bit before acquiring the
ctx_lock and putting the iocb on the run list. The run path, while holding
the ctx_lock, could see this partial kick and mistake it for a kick that
was deferred while it was doing work with the run_list NULLed out. It
would then race with the kick thread to add the iocb to the run list.

This patch moves the kick setting under the ctx_lock so that only one of
the kick or run path queues the iocb on the run list, as intended.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
a464adeb7e8f1cd65ca911e20a7c02e452dc2c17 17-Sep-2005 Zach Brown <zach.brown@oracle.com> [PATCH] Add smp_mb__after_clear_bit() to unlock_kiocb()

Add smp_mb__after_clear_bit() to unlock_kiocb()

AIO's use of wait_on_bit_lock()/wake_up_bit() forgot to add a barrier
between clearing its lock bit and calling wake_up_bit() so wake_up_bit()'s
unlocked waitqueue_active() can race. This puts AIO's use in line with the
others and the comment above wake_up_bit().

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ab2af1f5005069321c5d130f09cce577b03f43ef 09-Sep-2005 Dipankar Sarma <dipankar@in.ibm.com> [PATCH] files: files struct with RCU

Patch to eliminate struct files_struct.file_lock spinlock on the reader side
and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
updates to the fdtable are done by allocating a new fdtable structure and
setting files->fdt to point to the new structure. The fdtable structure is
protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
global list of fdtable to be freed is not scalable, so we use a per-cpu list.
If keventd is already handling the current cpu's work, we use a timer to defer
queueing of that work.

Since the last publication, this patch has been re-written to avoid using
explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
premitives instead. This required that the fd information is kept in a
separate structure (fdtable) and updated atomically.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ac0b1bc1edbe81c0cb36cad7e7f5b91f4d9e12ed 09-Sep-2005 Benjamin LaHaise <bcrl@linux.intel.com> [PATCH] aio: kiocb locking to serialise retry and cancel

Implement a per-kiocb lock to serialise retry operations and cancel. This
is done using wait_on_bit_lock() on the KIF_LOCKED bit of kiocb->ki_flags.
Also, make the cancellation path lock the kiocb and subsequently release
all references to it if the cancel was successful. This version includes a
fix for the deadlock with __aio_run_iocbs.

Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
8f58202bf6b915656e116ece3bc4ace14bfe533a 09-Sep-2005 Wendy Cheng <wcheng@redhat.com> [PATCH] change io_cancel return code for no cancel case

Note that other than few exceptions, most of the current filesystem and/or
drivers do not have aio cancel specifically defined (kiob->ki_cancel field
is mostly NULL). However, sys_io_cancel system call universally sets
return code to -EAGAIN. This gives applications a wrong impression that
this call is implemented but just never works. We have customer inquires
about this issue.

Changed by Benjamin LaHaise to EINVAL instead of ENOSYS

Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1e40cd383ccc7c9f8b338c56ce28c326e25eb2fe 04-Sep-2005 Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> [PATCH] uml: fixes performance regression in activate_mm and thus exec()

Normally, activate_mm() is called from exec(), and thus it used to be a
no-op because we use a completely new "MM context" on the host (for
instance, a new process), and so we didn't need to flush any "TLB entries"
(which for us are the set of memory mappings for the host process from the
virtual "RAM" file).

Kernel threads, instead, are usually handled in a different way. So, when
for AIO we call use_mm(), things used to break and so Benjamin implemented
activate_mm(). However, that is only needed for AIO, and could slow down
exec() inside UML, so be smart: detect being called for AIO (via
PF_BORROWED_MM) and do the full flush only in that situation.

Comment also the caller so that people won't go breaking UML without
noticing. I also rely on the caller's locks for testing current->flags.

Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
c016e2257acd00a7ffd87fa1eec896138563d1aa 29-Jun-2005 S�bastien Dugu <sebastien.dugue@bull.net> [PATCH] aio-retry-fix: fix aio retry work queueing

In the case of buffered AIO, in the aio retry path (aio_run_iocb), when the
retry method returns EIOCBRETRY the kicked iocb is added to the context run
list but is never queued onto the work queue. The request therefore is
never completed.

This patch fixes that by adding the appropriate call to aio_queue_work in
aio_run_aiocb so that subsequent retries will be handled by the aio worker
thread.

Signed-off-by: S�bastien Dugu� <sebastien.dugue@bull.net>
Acked-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
954d3e95369cf73b4bc1e570729f68264a0e6fe0 01-May-2005 Ken Chen <kenneth.w.chen@intel.com> [PATCH] aio: optimize io_submit_one()

This patch optimizes io_submit_one to call aio_run_iocb() directly if
ctx->run_list is empty. When the list is empty, the operation of adding to
the list, then call to __aio_run_iocbs() is unnecessary because these
operations are done in one atomic step. ctx->run_list always has only one
element in this case. This optimization speeds up industry standard db
transaction processing benchmark by 0.2%.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
644d3a088a3b862ed0b57c286cf58a6bd338ce08 01-May-2005 Ken Chen <kenneth.w.chen@intel.com> [PATCH] aio: clean up debug code

Clean up code that was previously used for debug purpose. Remove aio_run,
aio_wakeups, iocb->ki_queued and iocb->ki_kicked. Also clean up unused
variable count in __aio_run_iocbs() and debug code in read_events().

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
4bf69b2a06090c01c27f25ea5cd1440f7bf9256f 01-May-2005 Ken Chen <kenneth.w.chen@intel.com> [PATCH] aio: ring wrapping simplification

Since the tail pointer in aio_ring structure never wrap ring size more than
once, so a simple compare is sufficient to wrap the index around. This avoid
a more expensive mod operation.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
212079cf4ee99e492a57b817e796825d423a30bb 01-May-2005 Ken Chen <kenneth.w.chen@intel.com> [PATCH] aio: remove superfluous kiocb member initialization

This patch removes superfluous kiocb member initialization in the AIO
allocation and deallocation path. For example, in really_put_req(),
right before kiocb is returned to slab, 5 variables are reset to NULL.
The same variables will be initialized at the kiocb allocation time,
so why bother reset them knowing that they will be set to valid data
at alloc time? Another example: ki_retry is initialized in __aio_get_req,
but is initialized again in io_submit_one.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Suparna Bhattacharya <suparna@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
25ee7e3832951cf5896b194f6cd929a44863f419 25-Apr-2005 Adrian Bunk <bunk@stusta.de> [PATCH] fs/aio.c: make some code static

This patch makes some needlessly global code static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!