History log of /mm/fadvise.c
Revision Date Author Comments (<<< Hide modified files) (Show modified files >>>)
ad8a1b558e6c76fb53901956d3c8f29b82a4ccfa 11-Jan-2012 Shawn Bohrer <sbohrer@rgmadvisors.com> fadvise: only initiate writeback for specified range with FADV_DONTNEED

Previously POSIX_FADV_DONTNEED would start writeback for the entire file
when the bdi was not write congested. This negatively impacts performance
if the file contains dirty pages outside of the requested range. This
change uses __filemap_fdatawrite_range() to only initiate writeback for
the requested range.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/fadvise.c
0141450f66c3c12a3aaa869748caa64241885cdf 05-Mar-2010 Wu Fengguang <fengguang.wu@intel.com> readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> [2.6.33.x]
Cc: <qbarnes+nfs@yahoo-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/fadvise.c
f7e839dd36fd940b0202cfb7d39b2a1b2dc59b1b 17-Jun-2009 Wu Fengguang <fengguang.wu@intel.com> readahead: move max_sane_readahead() calls into force_page_cache_readahead()

Impact: code simplification.

Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/fadvise.c
6673e0c3fbeaed2cd08e2fd4a4aa97382d6fedb0 14-Jan-2009 Heiko Carstens <heiko.carstens@de.ibm.com> [CVE-2009-0029] System call wrapper special cases

System calls with an unsigned long long argument can't be converted with
the standard wrappers since that would include a cast to long, which in
turn means that we would lose the upper 32 bit on 32 bit architectures.
Also semctl can't use the standard wrapper since it has a 'union'
parameter.

So we handle them as special case and add some extra wrappers instead.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
/mm/fadvise.c
e1f8e87449147ffe5ea3de64a46af7de450ce279 16-Oct-2008 Francois Cami <francois.cami@free.fr> Remove Andrew Morton's old email accounts

People can use the real name an an index into MAINTAINERS to find the
current email address.

Signed-off-by: Francois Cami <francois.cami@free.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/fadvise.c
70688e4dd1647f0ceb502bbd5964fa344c5eb411 28-Apr-2008 Nick Piggin <npiggin@suse.de> xip: support non-struct page backed memory

Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
the user mappings.

This requires the get_xip_page API to be changed to an address based one.
Improve the API layering a little bit too, while we're here.

This is required in order to support XIP filesystems on memory that isn't
backed with struct page (but memory with struct page is still supported too).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/fadvise.c
b5beb1caff4ce063e6e2dc13f23b80eeef4e9782 05-Feb-2008 Masatake YAMATO <yamato@redhat.com> check ADVICE of fadvise64_64 even if get_xip_page is given

I've written some test programs in ltp project. During writing I met an
problem which I cannot solve in user land. So I wrote a patch for linux
kernel. Please, include this patch if acceptable.

The test program tests the 4th parameter of fadvise64_64:

long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice);

My test case calls fadvise64_64 with invalid advice value and checks errno is
set to EINVAL. About the advice parameter man page says:

...
Permissible values for advice include:

POSIX_FADV_NORMAL
...
POSIX_FADV_SEQUENTIAL
...
POSIX_FADV_RANDOM
...
POSIX_FADV_NOREUSE
...
POSIX_FADV_WILLNEED
...
POSIX_FADV_DONTNEED
...
ERRORS
...
EINVAL An invalid value was specified for advice.

However, I got a bug report that the system call invocations
in my test case returned 0 unexpectedly.

I've inspected the kernel code:

asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
{
struct file *file = fget(fd);
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
pgoff_t start_index;
pgoff_t end_index;
unsigned long nrpages;
int ret = 0;

if (!file)
return -EBADF;

if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
ret = -ESPIPE;
goto out;
}

mapping = file->f_mapping;
if (!mapping || len < 0) {
ret = -EINVAL;
goto out;
}

if (mapping->a_ops->get_xip_page)
/* no bad return value, but ignore advice */
goto out;
...
out:
fput(file);
return ret;
}

I found the advice parameter is just ignored in the case
mapping->a_ops->get_xip_page is given. This behavior is different from
what is written on the man page. Is this o.k.?

get_xip_page is given if CONFIG_EXT2_FS_XIP is true.
Anyway I cannot find the easy way to detect get_xip_page
field is given or CONFIG_EXT2_FS_XIP is true from the
user space.

I propose the following patch which checks the advice parameter
even if get_xip_page is given.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/fadvise.c
d3ac7f892b7d07d61d0895caa4f6e190e43112f8 08-Dec-2006 Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_path

Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in linux/mm/.

Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
60c371bc753495f36d3a71338b46030f7fffce3b 05-Aug-2006 Andrew Morton <akpm@osdl.org> [PATCH] fadvise() make POSIX_FADV_NOREUSE a no-op

The POSIX_FADV_NOREUSE hint means "the application will use this range of the
file a single time". It seems to be intended that the implementation will use
this hint to perform drop-behind of that part of the file when the application
gets around to reading or writing it.

However for reasons which aren't obvious (or sane?) I mapped
POSIX_FADV_NOREUSE onto POSIX_FADV_WILLNEED. ie: it does readahead.

That's daft. So for now, make POSIX_FADV_NOREUSE a no-op.

This is a non-back-compatible change. If someone was using POSIX_FADV_NOREUSE
to perform readahead, they lose. The likelihood is low.

If/when we later implement POSIX_FADV_NOREUSE things will get interesting - to
do it fully we'll need to maintain file offset/length ranges and peform all
sorts of complex tricks, and managing the lifetime of those ranges' data
structures will be interesting..

A sensible implementation would probably ignore the file range and would
simply mark the entire file as needing some form of drop-behind treatment.

Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
8f72e4028a1ff968000cec4a034f45619fbd7ec4 10-Jul-2006 Andrew Morton <akpm@osdl.org> [PATCH] fadvise: remove dead comments

Cc: "Michael Kerrisk" <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
f79e2abb9bd452d97295f34376dedbec9686b986 31-Mar-2006 Andrew Morton <akpm@osdl.org> [PATCH] sys_sync_file_range()

Remove the recently-added LINUX_FADV_ASYNC_WRITE and LINUX_FADV_WRITE_WAIT
fadvise() additions, do it in a new sys_sync_file_range() syscall instead.
Reasons:

- It's more flexible. Things which would require two or three syscalls with
fadvise() can be done in a single syscall.

- Using fadvise() in this manner is something not covered by POSIX.

The patch wires up the syscall for x86.

The sycall is implemented in the new fs/sync.c. The intention is that we can
move sys_fsync(), sys_fdatasync() and perhaps sys_sync() into there later.

Documentation for the syscall is in fs/sync.c.

A test app (sync_file_range.c) is in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz.

The available-to-GPL-modules do_sync_file_range() is for knfsd: "A COMMIT can
say NFS_DATA_SYNC or NFS_FILE_SYNC. I can skip the ->fsync call for
NFS_DATA_SYNC which is hopefully the more common."

Note: the `async' writeout mode SYNC_FILE_RANGE_WRITE will turn synchronous if
the queue is congested. This is trivial to fix: add a new flag bit, set
wbc->nonblocking. But I'm not sure that we want to expose implementation
details down to that level.

Note: it's notable that we can sync an fd which wasn't opened for writing.
Same with fsync() and fdatasync()).

Note: the code takes some care to handle attempts to sync file contents
outside the 16TB offset on 32-bit machines. It makes such attempts appear to
succeed, for best 32-bit/64-bit compatibility. Perhaps it should make such
requests fail...

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
ebcf28e1c7a295f3321249dd235ad2e45938fdd9 24-Mar-2006 Andrew Morton <akpm@osdl.org> [PATCH] fadvise(): write commands

Add two new linux-specific fadvise extensions():

LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.

LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.

By combining these two operations the application may do several things:

LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.

LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.

LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.

It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.

To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().

The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent that last affected byte in the file (ie: it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.

As Ulrich notes, these two functions are somewhat abusive of the fadvise()
concept, which appears to be "set the future policy for this fd".

But these commands are a perfect fit with the fadvise() impementation, and
several of the existing fadvise() commands are synchronous and don't affect
future policy either. I think we can live with the slight incongruity.

Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
87ba81dba431232548ce29d5d224115d0c2355ac 08-Jan-2006 Valentine Barshak <vbarshak@ru.mvista.com> [PATCH] fadvise: return ESPIPE on FIFO/pipe

The patch makes posix_fadvise return ESPIPE on FIFO/pipe in order to be
fully POSIX-compliant.

Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
fe77ba6f4f97690baa4c756611a07f3cc033f6ae 24-Jun-2005 Carsten Otte <cotte@de.ibm.com> [PATCH] xip: madvice/fadvice: execute in place

Make sys_madvice/fadvice return sane with xip.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
/mm/fadvise.c
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 17-Apr-2005 Linus Torvalds <torvalds@ppc970.osdl.org> Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
/mm/fadvise.c