History log of /arch/s390/mm/pgtable.c
Revision Date Author Comments
66e9bbdb3dbb335b158bb88de2642966af816ffe 06-Oct-2014 Dominik Dingel <dingel@linux.vnet.ibm.com> s390/mm: fixing calls of pte_unmap_unlock

pte_unmap works on page table entry pointers, derefencing should be avoided.
As on s390 pte_unmap is a NOP, this is more a cleanup if we want to supply
later such function.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
dc77d344b41f3ffdd3b02317597e717b0b799f46 27-Aug-2014 Christian Borntraeger <borntraeger@de.ibm.com> KVM: s390/mm: fix up indentation of set_guest_storage_key

commit ab3f285f227f ("KVM: s390/mm: try a cow on read only pages for
key ops")' misaligned a code block. Let's fixup the indentation.

Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
c6c956b80bdf151cf41d3e7e5c54755d930a212c 01-Jul-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390/mm: support gmap page tables with less than 5 levels

Add an addressing limit to the gmap address spaces and only allocate
the page table levels that are needed for the given limit. The limit
is fixed and can not be changed after a gmap has been created.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
527e30b41d8b86e9ae7f5b740de416958c0e574e 30-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390/mm: use radix trees for guest to host mappings

Store the target address for the gmap segments in a radix tree
instead of using invalid segment table entries. gmap_translate
becomes a simple radix_tree_lookup, gmap_fault is split into the
address translation with gmap_translate and the part that does
the linking of the gmap shadow page table with the process page
table.
A second radix tree is used to keep the pointers to the segment
table entries for segments that are mapped in the guest address
space. On unmap of a segment the pointer is retrieved from the
radix tree and is used to carry out the segment invalidation in
the gmap shadow page table. As the radix tree can only store one
pointer, each host segment may only be mapped to exactly one
guest location.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
6e0a0431bf7d90ed0b8a0a974ad219617a70cc22 29-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390/mm: cleanup gmap function arguments, variable names

Make the order of arguments for the gmap calls more consistent,
if the gmap pointer is passed it is always the first argument.
In addition distinguish between guest address and user address
by naming the variables gaddr for a guest address and vmaddr for
a user address.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
9da4e3807657f3bcd12cfbb5671d80794303dde2 30-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390/mm: readd address parameter to gmap_do_ipte_notify

Revert git commit c3a23b9874c1 ("remove unnecessary parameter from
gmap_do_ipte_notify").

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
ab3f285f227fec62868037e9b1b1fd18294a83b8 19-Aug-2014 Christian Borntraeger <borntraeger@de.ibm.com> KVM: s390/mm: try a cow on read only pages for key ops

The PFMF instruction handler blindly wrote the storage key even if
the page was mapped R/O in the host. Lets try a COW before continuing
and bail out in case of errors.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
152125b7a882df36a55a8eadbea6d0edf1461ee7 24-Jul-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: implement dirty bits for large segment table entries

The large segment table entry format has block of bits for the
ACC/F values for the large page. These bits are valid only if
another bit (AV bit 0x10000) of the segment table entry is set.
The ACC/F bits do not have a meaning if the AV bit is off.
This allows to put the THP splitting bit, the segment young bit
and the new segment dirty bit into the ACC/F bits as long as
the AV bit stays off. The dirty and young information is only
available if the pmd is large.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
55e4283c3eb1d850893f645dd695c9c75d5fa1fc 25-Jul-2014 Christian Borntraeger <borntraeger@de.ibm.com> KVM: s390/mm: Fix page table locking vs. split pmd lock

commit ec66ad66a0de87866be347b5ecc83bd46427f53b (s390/mm: enable
split page table lock for PMD level) activated the split pmd lock
for s390. Turns out that we missed one place: We also have to take
the pmd lock instead of the page table lock when we reallocate the
page tables (==> changing entries in the PMD) during sie enablement.

Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
beef560b4cdfafb2211a856e1d722540f5151933 14-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/uaccess: simplify control register updates

Always switch to the kernel ASCE in switch_mm. Load the secondary
space ASCE in finish_arch_post_lock_switch after checking that
any pending page table operations have completed. The primary
ASCE is loaded in entry[64].S. With this the update_primary_asce
call can be removed from the switch_to macro and from the start
of switch_mm function. Remove the load_primary argument from
update_user_asce/clear_user_asce, rename update_user_asce to
set_user_asce and rename update_primary_asce to load_kernel_asce.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
3a801517ad49f586f2016e1b1321e6cd28a97a04 16-May-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390: correct locking for s390_enable_skey

Use the mm semaphore to serialize multiple invocations of s390_enable_skey.
The second CPU faulting on a storage key operation needs to wait for the
completion of the page table update. Taking the mm semaphore writable
has the positive side-effect that it prevents any host faults from
taking place which does have implications on keys vs PGSTE.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
a0bf4f149bbfa2e31b5f4172c817afdb7b986733 24-Mar-2014 Dominik Dingel <dingel@linux.vnet.ibm.com> KVM: s390/mm: new gmap_test_and_clear_dirty function

For live migration kvm needs to test and clear the dirty bit of guest pages.

That for is ptep_test_and_clear_user_dirty, to be sure we are not racing with
other code, we protect the pte. This needs to be done within
the architecture memory management code.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
0a61b222df75a6a69dc34816f7db2f61fee8c935 18-Oct-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390/mm: use software dirty bit detection for user dirty tracking

Switch the user dirty bit detection used for migration from the hardware
provided host change-bit in the pgste to a fault based detection method.
This reduced the dependency of the host from the storage key to a point
where it becomes possible to enable the RCP bypass for KVM guests.

The fault based dirty detection will only indicate changes caused
by accesses via the guest address space. The hardware based method
can detect all changes, even those caused by I/O or accesses via the
kernel page table. The KVM/qemu code needs to take this into account.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
934bc131efc3e4be6a52f7dd6c4dbf99635e381a 14-Jan-2014 Dominik Dingel <dingel@linux.vnet.ibm.com> KVM: s390: Allow skeys to be enabled for the current process

Introduce a new function s390_enable_skey(), which enables storage key
handling via setting the use_skey flag in the mmu context.

This function is only useful within the context of kvm.

Note that enabling storage keys will cause a one-time hickup when
walking the page table; however, it saves us special effort for cases
like clear reset while making it possible for us to be architecture
conform.

s390_enable_skey() takes the page table lock to prevent reseting
storage keys triggered from multiple vcpus.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
d4cb11340be6a1613d40d2b546cb111ea2547066 29-Jan-2014 Dominik Dingel <dingel@linux.vnet.ibm.com> KVM: s390: Clear storage keys

page_table_reset_pgste() already does a complete page table walk to
reset the pgste. Enhance it to initialize the storage keys to
PAGE_DEFAULT_KEY if requested by the caller. This will be used
for lazy storage key handling. Also provide an empty stub for
!CONFIG_PGSTE

Lets adopt the current code (diag 308) to not clear the keys.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
1e1836e84f87d12feac6dd225fcef5eba1ca724b 08-Apr-2014 Alex Thorlton <athorlton@sgi.com> mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (i.e. statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.

We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from. If many remote nodes need to do work on
that same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.

This patch is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov. The main goal here is to
add a prctl switch to allow us to disable to THP on a per mm_struct
basis.

Here's a bit of test data with the new patch in place...

First with the flag unset:

# perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 0
Set thp_disabled state to 0
Process pid = 18027

PF/
MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23

Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
1,008,986 context-switches # 0.000 M/sec [100.00%]
7,717 CPU-migrations # 0.000 M/sec [100.00%]
1,698,932 page-faults # 0.000 M/sec
355,222,544,890,379 cycles # 1.298 GHz [100.00%]
536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
148,286,797,266,411 instructions # 0.42 insns per cycle
# 3.62 stalled cycles per insn [100.00%]
27,061,793,159,503 branches # 98.867 M/sec [100.00%]
1,188,655,196 branch-misses # 0.00% of all branches

427.001706337 seconds time elapsed

Now with the flag set:

# perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 1
Set thp_disabled state to 1
Process pid = 144957

PF/
MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23

Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
534,205 context-switches # 0.000 M/sec [100.00%]
4,595 CPU-migrations # 0.000 M/sec [100.00%]
63,133,119 page-faults # 0.000 M/sec
147,977,747,269,768 cycles # 1.066 GHz [100.00%]
200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
180,916,213,503,160 instructions # 1.22 insns per cycle
# 1.11 stalled cycles per insn [100.00%]
26,999,511,005,868 branches # 194.536 M/sec [100.00%]
714,066,351 branch-misses # 0.00% of all branches

216.196778807 seconds time elapsed

As with previous versions of the patch, We're getting about a 2x
performance increase here. Here's a link to the test case I used, along
with the little wrapper to activate the flag:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

This patch (of 3):

Revert commit 8e72033f2a48 and add in code to fix up any issues caused
by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL
when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of
this code, and providing justification for the revert:

"The intent of commit 8e72033f2a48 was to guard against any future
programming errors that may result in an madvice(MADV_HUGEPAGE) on
guest mappings, which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
8e72033f2a48 was to be reverted, because that check will also prevent
a kernel crash in the case described above, it will now send a
SIGSEGV instead.

This would now also allow to do the madvise on other parts, if
needed, so it is a more flexible approach. One could also say that
it would have been better to do it this way right from the
beginning..."

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
457f2180951cdcbfb4657ddcc83b486e93497f56 21-Mar-2014 Heiko Carstens <heiko.carstens@de.ibm.com> s390/uaccess: rework uaccess code - fix locking issues

The current uaccess code uses a page table walk in some circumstances,
e.g. in case of the in atomic futex operations or if running on old
hardware which doesn't support the mvcos instruction.

However it turned out that the page table walk code does not correctly
lock page tables when accessing page table entries.
In other words: a different cpu may invalidate a page table entry while
the current cpu inspects the pte. This may lead to random data corruption.

Adding correct locking however isn't trivial for all uaccess operations.
Especially copy_in_user() is problematic since that requires to hold at
least two locks, but must be protected against ABBA deadlock when a
different cpu also performs a copy_in_user() operation.

So the solution is a different approach where we change address spaces:

User space runs in primary address mode, or access register mode within
vdso code, like it currently already does.

The kernel usually also runs in home space mode, however when accessing
user space the kernel switches to primary or secondary address mode if
the mvcos instruction is not available or if a compare-and-swap (futex)
instruction on a user space address is performed.
KVM however is special, since that requires the kernel to run in home
address space while implicitly accessing user space with the sie
instruction.

So we end up with:

User space:
- runs in primary or access register mode
- cr1 contains the user asce
- cr7 contains the user asce
- cr13 contains the kernel asce

Kernel space:
- runs in home space mode
- cr1 contains the user or kernel asce
-> the kernel asce is loaded when a uaccess requires primary or
secondary address mode
- cr7 contains the user or kernel asce, (changed with set_fs())
- cr13 contains the kernel asce

In case of uaccess the kernel changes to:
- primary space mode in case of a uaccess (copy_to_user) and uses
e.g. the mvcp instruction to access user space. However the kernel
will stay in home space mode if the mvcos instruction is available
- secondary space mode in case of futex atomic operations, so that the
instructions come from primary address space and data from secondary
space

In case of kvm the kernel runs in home space mode, but cr1 gets switched
to contain the gmap asce before the sie instruction gets executed. When
the sie instruction is finished cr1 will be switched back to contain the
user asce.

A context switch between two processes will always load the kernel asce
for the next process in cr1. So the first exit to user space is a bit
more expensive (one extra load control register instruction) than before,
however keeps the code rather simple.

In sum this means there is no need to perform any error prone page table
walks anymore when accessing user space.

The patch seems to be rather large, however it mainly removes the
the page table walk code and restores the previously deleted "standard"
uaccess code, with a couple of changes.

The uaccess without mvcos mode can be enforced with the "uaccess_primary"
kernel parameter.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
1b948d6caec4f28e3524244ca0f77c6ae8ddceef 03-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm,tlb: optimize TLB flushing for zEC12

The zEC12 machines introduced the local-clearing control for the IDTE
and IPTE instruction. If the control is set only the TLB of the local
CPU is cleared of entries, either all entries of a single address space
for IDTE, or the entry for a single page-table entry for IPTE.
Without the local-clearing control the TLB flush is broadcasted to all
CPUs in the configuration, which is expensive.

The reset of the bit mask of the CPUs that need flushing after a
non-local IDTE is tricky. As TLB entries for an address space remain
in the TLB even if the address space is detached a new bit field is
required to keep track of attached CPUs vs. CPUs in the need of a
flush. After a non-local flush with IDTE the bit-field of attached CPUs
is copied to the bit-field of CPUs in need of a flush. The ordering
of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is
such that an underindication in mm_cpumask(mm) is prevented but an
overindication in mm_cpumask(mm) is possible.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
02a8f3abb708919149cb657a5202f4603f0c38e2 03-Apr-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm,tlb: safeguard against speculative TLB creation

The principles of operations states that the CPU is allowed to create
TLB entries for an address space anytime while an ASCE is loaded to
the control register. This is true even if the CPU is running in the
kernel and the user address space is not (actively) accessed.

In theory this can affect two aspects of the TLB flush logic.
For full-mm flushes the ASCE of the dying process is still attached.
The approach to flush first with IDTE and then just free all page
tables can in theory lead to stale TLB entries. Use the batched
free of page tables for the full-mm flushes as well.

For operations that can have a stale ASCE in the control register,
e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE
to prevent invalid TLBs from being created.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
aaeff84a2dfa224611fc9fee89cb20277469c454 19-Mar-2014 Dominik Dingel <dingel@linux.vnet.ibm.com> s390/mm: remove unnecessary parameter from gmap_do_ipte_notify

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
c7c5be73ccc05da9899b313b9fa0042aae56502f 19-Mar-2014 Dominik Dingel <dingel@linux.vnet.ibm.com> s390/mm: fixing comment so that parameter name match

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ec66ad66a0de87866be347b5ecc83bd46427f53b 12-Feb-2014 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: enable split page table lock for PMD level

Add the pgtable_pmd_page_ctor/pgtable_pmd_page_dtor calls to the pmd
allocation and free functions and enable ARCH_ENABLE_SPLIT_PMD_PTLOCK
for 64 bit.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
deedabb2b4a68a63351a949b1abcf73fc97eb406 21-May-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/kvm: set guest page states to stable on re-ipl

The guest page state needs to be reset to stable for all pages
on initial program load via diagnose 0x308.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
b31288fa83b2bcc8834e1e208e9526b8bd5ce361 17-Apr-2013 Konstantin Weitz <konstantin.weitz@gmail.com> s390/kvm: support collaborative memory management

This patch enables Collaborative Memory Management (CMM) for kvm
on s390. CMM allows the guest to inform the host about page usage
(see arch/s390/mm/cmm.c). The host uses this information to avoid
swapping in unused pages in the page fault handler. Further, a CPU
provided list of unused invalid pages is processed to reclaim swap
space of not yet accessed unused pages.

[ Martin Schwidefsky: patch reordering and cleanup ]

Signed-off-by: Konstantin Weitz <konstantin.weitz@gmail.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
b4a960159e6f5254ac3c95dd183789f402431977 13-Dec-2013 Hendrik Brueckner <brueckner@linux.vnet.ibm.com> s390: Fix misspellings using 'codespell' tool

Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
e89cfa58a8358fdb4d4e79936c25222416ad415e 14-Nov-2013 Kirill A. Shutemov <kirill.shutemov@linux.intel.com> s390: handle pgtable_page_ctor() fail

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c389a250ab4cfa4a3775d9f2c45271618af6d5b2 14-Nov-2013 Kirill A. Shutemov <kirill.shutemov@linux.intel.com> mm, thp: do not access mm->pmd_huge_pte directly

Currently mm->pmd_huge_pte protected by page table lock. It will not
work with split lock. We have to have per-pmd pmd_huge_pte for proper
access serialization.

For now, let's just introduce wrapper to access mm->pmd_huge_pte.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
106078641f32a6a10d9759f809f809725695cb09 28-Oct-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm,tlb: correct tlb flush on page table upgrade

The IDTE instruction used to flush TLB entries for a specific address
space uses the address-space-control element (ASCE) to identify
affected TLB entries. The upgrade of a page table adds a new top
level page table which changes the ASCE. The TLB entries associated
with the old ASCE need to be flushed and the ASCE for the address space
needs to be replaced synchronously on all CPUs which currently use it.
The concept of a lazy ASCE update with an exception handler is broken.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
be39f1968e33ca641af120a2d659421ad2225dea 31-Oct-2013 Dominik Dingel <dingel@linux.vnet.ibm.com> s390/mm: page_table_realloc returns failure

There is a possible race between setting has_pgste and reallocation of the
page_table, change the order to fix this.
Also page_table_alloc_pgste can fail, in that case we need to backpropagte this
as -ENOMEM to the caller of page_table_realloc.

Based on a patch by Christian Borntraeger <borntraeger@de.ibm.com>.

Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
e258d719ff28ecc7a048eb8f78380e68c4b3a3f0 24-Sep-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/uaccess: always run the kernel in home space

Simplify the uaccess code by removing the user_mode=home option.
The kernel will now always run in the home space mode.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
63df41d663fc27e96571bfea86d3f9ee81289e07 06-Sep-2013 Heiko Carstens <heiko.carstens@de.ibm.com> s390: make various functions static, add declarations to header files

Make various functions static, add declarations to header files to
fix a couple of sparse findings.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
984e2a5975e538a6475f7453523896319a1cb597 06-Sep-2013 Heiko Carstens <heiko.carstens@de.ibm.com> s390/mm: add __releases()/__acquires() annotations to gmap_alloc_table()

Let sparse not incorrectly complain about unbalanced locking.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
0944fe3f4a323f436180d39402cae7f9c46ead17 23-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: implement software referenced bits

The last remaining use for the storage key of the s390 architecture
is reference counting. The alternative is to make page table entries
invalid while they are old. On access the fault handler marks the
pte/pmd as young which makes the pte/pmd valid if the access rights
allow read access. The pte/pmd invalidations required for software
managed reference bits cost a bit of performance, on the other hand
the RRBE/RRBM instructions to read and reset the referenced bits are
quite expensive as well.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
5c474a1e2265c5156e6c63f87a7e99053039b8b9 16-Aug-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: introduce ptep_flush_lazy helper

Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions,
ptep_flush_lazy and __tlb_flush_mm_lazy.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
e509861105a3c1425f3f929bd631f88340b499bf 23-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: cleanup page table definitions

Improve the encoding of the different pte types and the naming of the
page, segment table and region table bits. Due to the different pte
encoding the hugetlbfs primitives need to be adapted as well. To improve
compatability with common code make the huge ptes use the encoding of
normal ptes. The conversion between the pte and pmd encoding for a huge
pte is done with set_huge_pte_at and huge_ptep_get.
Overall the code is now easier to understand.

Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ee6ee55bb505c5bd8e64bc652281a93fb99c07b3 26-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390: fix task size check

The gmap_map_segment function uses PGDIR_SIZE in the check for the
maximum address in the tasks address space. This incorrectly limits
the amount of memory usable for a kvm guest to 4TB. The correct limit
is (1UL << 53). As the TASK_SIZE has different values (4TB vs 8PB)
dependent on the existance of the fourth page table level, create
a new define 'TASK_MAX_SIZE' for (1UL << 53).

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
3eabaee998c787e7e1565574821652548f7fc003 26-Jul-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> KVM: s390: allow sie enablement for multi-threaded programs

Improve the code to upgrade the standard 2K page tables to 4K page tables
with PGSTEs to allow the operation to happen when the program is already
multi-threaded.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
24d5dd0208ed1cd3ef6bf30a50b347ef366f21ac 27-May-2013 Christian Borntraeger <borntraeger@de.ibm.com> s390/kvm: Provide function for setting the guest storage key

From time to time we need to set the guest storage key. Lets
provide a helper function that handles the changes with all the
right locking and checking.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
6b0b50b0617fad5f2af3b928596a25f7de8dbf50 06-Jun-2013 Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> mm/THP: add pmd args to pgtable deposit and withdraw APIs

This will be later used by powerpc THP support. In powerpc we want to use
pgtable for storing the hash index values. So instead of adding them to
mm_context list, we would like to store them in the second half of pmd

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
db70ccdfb9953b984f5b95d98c50d8da335bab59 12-Jun-2013 Christian Borntraeger <borntraeger@de.ibm.com> KVM: s390: Provide function for setting the guest storage key

From time to time we need to set the guest storage key. Lets
provide a helper function that handles the changes with all the
right locking and checking.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
e86cbd8765bd2e1f9eeb209822449c9b1e5958cf 29-May-2013 Christian Borntraeger <borntraeger@de.ibm.com> s390/pgtable: Fix gmap notifier address

The address of the gmap notifier was broken, resulting in
unhandled validity intercepts in KVM. Fix the rmap->vmaddr
to be on a segment boundary.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
f8b5ff2cff232df052955ef975f7219e1faa217f 17-May-2013 Christian Borntraeger <borntraeger@de.ibm.com> s390: fix gmap_ipte_notifier vs. software dirty pages

On heavy paging load some guest cpus started to loop in gmap_ipte_notify.
This was visible as stalled cpus inside the guest. The gmap_ipte_notifier
tries to map a user page and then made sure that the pte is valid and
writable. Turns out that with the software change bit tracking the pte
can become read-only (and only software writable) if the page is clean.
Since we loop in this code, the page would stay clean and, therefore,
be never writable again.
Let us just use fixup_user_fault, that guarantees to call handle_mm_fault.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
0d0dafc1e48fd254c22f75738def870a7ffd2c3e 17-May-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/kvm: rename RCP_xxx defines to PGSTE_xxx

The RCP byte is a part of the PGSTE value, the existing RCP_xxx names
are inaccurate. As the defines describe bits and pieces of the PGSTE,
the names should start with PGSTE_. The KVM_UR_BIT and KVM_UC_BIT are
part of the PGSTE as well, give them better names as well.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
bb4b42ce0ca36af8c113587ab64b138b3cf5459c 08-May-2013 Christian Borntraeger <borntraeger@de.ibm.com> s390: fix gmap_ipte_notifier vs. software dirty pages

On heavy paging load some guest cpus started to loop in gmap_ipte_notify.
This was visible as stalled cpus inside the guest. The gmap_ipte_notifier
tries to map a user page and then made sure that the pte is valid and
writable. Turns out that with the software change bit tracking the pte
can become read-only (and only software writable) if the page is clean.
Since we loop in this code, the page would stay clean and, therefore,
be never writable again.
Let us just use fixup_user_fault, that guarantees to call handle_mm_fault.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
d3383632d4e8e9ae747f582eaee8c2e79f828ae6 17-Apr-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: add pte invalidation notifier for kvm

Add a notifier for kvm to get control before a page table entry is
invalidated. The notifier is only called for ptes of an address space
with pgstes that have been explicitly marked to require notification.
Kvm will use this to get control before prefix pages of virtual CPU
are unmapped.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ab8e5235868f99dfc779e4eaff28f53d63714ce4 16-Apr-2013 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm,gmap: segment mapping race

The gmap_map_segment function creates a special invalid segment table
entry with the address of the requested target location in the process
address space. The first access will create the connection between the
gmap segment table and the target page table of the main process.
If two threads do this concurrently both will walk the page tables and
allocate a gmap_rmap structure for the same segment table entry.
To avoid the race recheck the segment table entry after taking to page
table lock.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
c5034945ce59abacdd02c5eff29f4f54df197880 10-Sep-2012 Heiko Carstens <heiko.carstens@de.ibm.com> s390/mm,gmap: implement gmap_translate()

Implement gmap_translate() function which translates a guest absolute address
to a user space process address without establishing the guest page table
entries.

This is useful for kvm guest address translations where no memory access
is expected to happen soon (e.g. tprot exception handler).

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
9e0fdb4145205bea95c2888a195c3ead2652f120 05-Mar-2013 Heiko Carstens <heiko.carstens@de.ibm.com> s390/mm,gmap: implement gmap_translate()

Implement gmap_translate() function which translates a guest absolute address
to a user space process address without establishing the guest page table
entries.

This is useful for kvm guest address translations where no memory access
is expected to happen soon (e.g. tprot exception handler).

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
0a4ccc992978ef552dc86ac68bc1ec62cf268e2a 02-Nov-2012 Heiko Carstens <heiko.carstens@de.ibm.com> s390/mm: move kernel_page_present/kernel_map_pages to page_attr.c

Keep related functions together and move to appropriate file.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
1ae1c1d09f220ded48ee9a7d91a65e94f95c4af1 09-Oct-2012 Gerald Schaefer <gerald.schaefer@de.ibm.com> thp, s390: architecture backend for thp on s390

This implements the architecture backend for transparent hugepages
on s390.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
274023da1e8a49efa6fd9bf857f8557e5db44cdf 09-Oct-2012 Gerald Schaefer <gerald.schaefer@de.ibm.com> thp, s390: disable thp for kvm host on s390

This patch is part of the architecture backend for thp on s390. It
disables thp for kvm hosts, because there is no kvm host hugepage support
so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT,
and future thp mappings are prevented by setting VM_NOHUGEPAGE in
mm->def_flags.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9501d09fa3c4ca18971083dfb0c9aa1afc85f19c 09-Oct-2012 Gerald Schaefer <gerald.schaefer@de.ibm.com> thp, s390: thp pagetable pre-allocation for s390

This patch is part of the architecture backend for thp on s390. It
provides the pagetable pre-allocation functions
pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike
other archs, s390 has no struct page * as pgtable_t, but rather a pointer
to the page table. So instead of saving the pagetable pre- allocation
list info inside the struct page, it is being saved within the pagetable
itself.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
75077afbec1ac89178c1542b23a70d0f960b0aaf 09-Oct-2012 Gerald Schaefer <gerald.schaefer@de.ibm.com> thp, s390: thp splitting backend for s390

This patch is part of the architecture backend for thp on s390. It
provides the functions related to thp splitting, including serialization
against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb
flushing operation to serialize against gup on s390, because that wouldn't
be stopped by the disabled IRQs. So instead, smp_call_function() is
called with an empty function, which will have the expected effect.

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
41459d36cf0d57813017dae6080a879cc038e5fe 14-Sep-2012 Heiko Carstens <heiko.carstens@de.ibm.com> s390: add uninitialized_var() to suppress false positive compiler warnings

Get rid of these:

arch/s390/kernel/smp.c:134:19: warning: ‘status’ may be used uninitialized in this function [-Wuninitialized]
arch/s390/mm/pgtable.c:641:10: warning: ‘table’ may be used uninitialized in this function [-Wuninitialized]
arch/s390/mm/pgtable.c:644:12: warning: ‘page’ may be used uninitialized in this function [-Wuninitialized]
drivers/s390/cio/cio.c:1037:14: warning: ‘schid’ may be used uninitialized in this function [-Wuninitialized]

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
d1b0d842c4450e410053083db837ef16532a4139 02-Sep-2012 Heiko Carstens <heiko.carstens@de.ibm.com> s390/mm: rename addressing_mode to s390_user_mode

Renaming the globally visible variable "user_mode" to "addressing_mode" in
order to fix a name clash was not a good idea. (Commit 37fe1d73 "s390/mm:
rename user_mode variable to addressing_mode")
Looking at the code after a couple of weeks one thinks: addressing mode of
what?
So rename the variable again. This time to s390_user_mode. Which hopefully
makes more sense.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
37fe1d73a449bdebc4908d04e518f5852d6c453b 27-Jul-2012 Heiko Carstens <heiko.carstens@de.ibm.com> s390/mm: rename user_mode variable to addressing_mode

Fix name clash with user_mode() define which is also used in common code.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
0f6f281b731d20bfe75c13f85d33f3f05b440222 26-Jul-2012 Martin Schwidefsky <schwidefsky@de.ibm.com> s390/mm: downgrade page table after fork of a 31 bit process

The downgrade of the 4 level page table created by init_new_context is
currently done only in start_thread31. If a 31 bit process forks the
new mm uses a 4 level page table, including the task size of 2<<42
that goes along with it. This is incorrect as now a 31 bit process
can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade
after fork.

Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
a53c8fab3f87c995c30ac226a03af95361243144 20-Jul-2012 Heiko Carstens <heiko.carstens@de.ibm.com> s390/comments: unify copyright messages and remove file names

Remove the file name from the comment at top of many files. In most
cases the file name was wrong anyway, so it's rather pointless.

Also unify the IBM copyright statement. We did have a lot of sightly
different statements and wanted to change them one after another
whenever a file gets touched. However that never happened. Instead
people start to take the old/"wrong" statements to use as a template
for new files.
So unify all of them in one go.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2739b6d12407792f253b7a15233930338e6831c9 09-May-2012 Christian Borntraeger <borntraeger@de.ibm.com> s390/kvm: bad rss-counter state

commit c3f0327f8e9d7a503f0d64573c311eddd61f197d
mm: add rss counters consistency check
detected the following problem with kvm on s390:

BUG: Bad rss-counter state mm:00000004f73ef000 idx:0 val:-10
BUG: Bad rss-counter state mm:00000004f73ef000 idx:1 val:-5

We have to make sure that we accumulate all rss values into
the mm before we replace the mm to avoid triggering this (harmless)
bug message.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
cd94154cc6a28dd9dc271042c1a59c08d26da886 11-Apr-2012 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] fix tlb flushing for page table pages

Git commit 36409f6353fc2d7b6516e631415f938eadd92ffa "use generic RCU
page-table freeing code" introduced a tlb flushing bug. Partially revert
the above git commit and go back to s390 specific page table flush code.

For s390 the TLB can contain three types of entries, "normal" TLB
page-table entries, TLB combined region-and-segment-table (CRST) entries
and real-space entries. Linux does not use real-space entries which
leaves normal TLB entries and CRST entries. The CRST entries are
intermediate steps in the page-table translation called translation paths.
For example a 4K page access in a three-level page table setup will
create two CRST TLB entries and one page-table TLB entry. The advantage
of that approach is that a page access next to the previous one can reuse
the CRST entries and needs just a single read from memory to create the
page-table TLB entry. The disadvantage is that the TLB flushing rules are
more complicated, before any page-table may be freed the TLB needs to be
flushed.

In short: the generic RCU page-table freeing code is incorrect for the
CRST entries, in particular the check for mm_users < 2 is troublesome.

This is applicable to 3.0+ kernels.

Cc: <stable@vger.kernel.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
a0616cdebcfd575dcd4c46102d1b52fbb827fc29 28-Mar-2012 David Howells <dhowells@redhat.com> Disintegrate asm/system.h for S390

Disintegrate asm/system.h for S390.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-s390@vger.kernel.org
2320c5793790fcda80e6dcc088dbda86040235e5 17-Feb-2012 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] incorrect PageTables counter for kvm page tables

The page_table_free_pgste function is used for kvm processes to free page
tables that have the pgste extension. It calls pgtable_page_ctor instead of
pgtable_page_dtor which increases NR_PAGETABLE instead of decreasing it.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
14045ebf1e1156d966a796cacad91028e01797e5 27-Dec-2011 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] add support for physical memory > 4TB

The kernel address space of a 64 bit kernel currently uses a three level
page table and the vmemmap array has a fixed address and a fixed maximum
size. A three level page table is good enough for systems with less than
3.8TB of memory, for bigger systems four page table levels need to be
used. Each page table level costs a bit of performance, use 3 levels for
normal systems and 4 levels only for the really big systems.
To avoid bloating sparse.o too much set MAX_PHYSMEM_BITS to 46 for a
maximum of 64TB of memory.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
c86cce2a20207cbf2b3dfe97c985a1f5aa5d3798 27-Dec-2011 Christian Borntraeger <borntraeger@de.ibm.com> [S390] kvm: fix sleeping function ... at mm/page_alloc.c:2260

commit cc772456ac9b460693492b3a3d89e8c81eda5874
[S390] fix list corruption in gmap reverse mapping

added a potential dead lock:

BUG: sleeping function called from invalid context at mm/page_alloc.c:2260
in_atomic(): 1, irqs_disabled(): 0, pid: 1108, name: qemu-system-s39
3 locks held by qemu-system-s39/1108:
#0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm]
#1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298
#2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298
CPU: 0 Not tainted 3.1.3 #45
Process qemu-system-s39 (pid: 1108, task: 00000004f8b3cb30, ksp: 00000004fd5978d0)
00000004fd5979a0 00000004fd597920 0000000000000002 0000000000000000
00000004fd5979c0 00000004fd597938 00000004fd597938 0000000000617e96
0000000000000000 00000004f8b3cf58 0000000000000000 0000000000000000
000000000000000d 000000000000000c 00000004fd597988 0000000000000000
0000000000000000 0000000000100a18 00000004fd597920 00000004fd597960
Call Trace:
([<0000000000100926>] show_trace+0xee/0x144)
[<0000000000131f3a>] __might_sleep+0x12a/0x158
[<0000000000217fb4>] __alloc_pages_nodemask+0x224/0xadc
[<0000000000123086>] gmap_alloc_table+0x46/0x114
[<000000000012395c>] gmap_map_segment+0x268/0x298
[<000003e00486b014>] kvm_arch_commit_memory_region+0x44/0x6c [kvm]
[<000003e004866414>] __kvm_set_memory_region+0x3b0/0x4a4 [kvm]
[<000003e004866554>] kvm_set_memory_region+0x4c/0x6c [kvm]
[<000003e004867c7a>] kvm_vm_ioctl+0x14a/0x314 [kvm]
[<0000000000292100>] do_vfs_ioctl+0x94/0x588
[<0000000000292688>] SyS_ioctl+0x94/0xac
[<000000000061e124>] sysc_noemu+0x22/0x28
[<000003fffcd5e7ca>] 0x3fffcd5e7ca
3 locks held by qemu-system-s39/1108:
#0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm]
#1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298
#2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298

Fix this by freeing the lock on the alloc path. This is ok, since the
gmap table is never freed until we call gmap_free, so the table we are
walking cannot go.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
388186bc920d9200202e4d25de66fa95b1b8fc68 30-Oct-2011 Christian Borntraeger <borntraeger@de.ibm.com> [S390] kvm: Handle diagnose 0x10 (release pages)

Linux on System z uses a ballooner based on diagnose 0x10. (aka as
collaborative memory management). This patch implements diagnose
0x10 on the guest address space.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
499069e1a421e2a85e76846c3237f00f1a5cb435 30-Oct-2011 Carsten Otte <cotte@de.ibm.com> [S390] take mmap_sem when walking guest page table

gmap_fault needs to walk the guest page table. However, parts of
that may change if some other thread does munmap. In that case
gmap_unmap_notifier will also unmap the corresponding parts from
the guest page table. We need to take mmap_sem in order to serialize
these operations.
do_exception now calls __gmap_fault with mmap_sem held which does
not get exported to modules. The exported function, which is called
from KVM, now takes mmap_sem.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
cc772456ac9b460693492b3a3d89e8c81eda5874 30-Oct-2011 Carsten Otte <cotte@de.ibm.com> [S390] fix list corruption in gmap reverse mapping

This introduces locking via mm->page_table_lock to protect
the rmap list for guest mappings from being corrupted by concurrent
operations.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
a9162f238a84ee05b09ea4b0ebd97fb20448c28c 30-Oct-2011 Carsten Otte <cotte@de.ibm.com> [S390] fix possible deadlock in gmap_map_segment

Fix possible deadlock reported by lockdep:
qemu-system-s39/2963 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: gmap_alloc_table+0x9c/0x120
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: gmap_map_segment+0xa6/0x27c

Actually gmap_alloc_table is the only called in gmap_map_segment with
mmap_sem held, thus it's safe to simply remove the inner lock.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
e73b7fffe487c315fd1a4fa22282e3362b440a06 30-Oct-2011 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] memory leak with RCU_TABLE_FREE

The rcu page table free code uses a couple of bits in the page table
pointer passed to tlb_remove_table to discern the different page table
types. __tlb_remove_table extracts the type with an incorrect mask which
leads to memory leaks. The correct mask is ((FRAG_MASK << 4) | FRAG_MASK).

Cc: stable@kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
05873df981ca1dd32f398e7b4e19864de907e064 26-Sep-2011 Carsten Otte <cotte@de.ibm.com> [S390] gmap: always up mmap_sem properly

If gmap_unmap_segment figures that the segment was not mapped in the
first place, it need to up mmap_sem on exit.

Cc: <stable@kernel.org>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
480e5926ce3bb61ec229be2dab08bdce8abb8d2e 20-Sep-2011 Christian Borntraeger <borntraeger@de.ibm.com> [S390] kvm: fix address mode switching

598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address
spaces for kvm guest images) changed kvm to use a separate address
space for kvm guests. This address space was switched in __vcpu_run
In some cases (preemption, page fault) there is the possibility that
this address space switch is lost.
The typical symptom was a huge amount of validity intercepts or
random guest addressing exceptions.
Fix this by doing the switch in sie_loop and sie_exit and saving the
address space in the gmap structure itself. Also use the preempt
notifier.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
944291de33b26a8b403f13f5eb0cc51fb982aa1e 03-Aug-2011 Jan Glauber <jang@linux.vnet.ibm.com> [S390] missing return in page_table_alloc_pgste

Fix the following compile warning for !CONFIG_PGSTE:

CC arch/s390/mm/pgtable.o
arch/s390/mm/pgtable.c: In function ‘page_table_alloc_pgste’:
arch/s390/mm/pgtable.c:531:1: warning: no return statement in function returning non-void [-Wreturn-type]

Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
e5992f2e6c3829cd43dbc4438ee13dcd6506f7f3 24-Jul-2011 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] kvm guest address space mapping

Add code that allows KVM to control the virtual memory layout that
is seen by a guest. The guest address space uses a second page table
that shares the last level pte-tables with the process page table.
If a page is unmapped from the process page table it is automatically
unmapped from the guest page table as well.

The guest address space mapping starts out empty, KVM can map any
individual 1MB segments from the process virtual memory to any 1MB
aligned location in the guest virtual memory. If a target segment in
the process virtual memory does not exist or is unmapped while a
guest mapping exists the desired target address is stored as an
invalid segment table entry in the guest page table.
The population of the guest page table is fault driven.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
36409f6353fc2d7b6516e631415f938eadd92ffa 06-Jun-2011 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] use generic RCU page-table freeing code

Replace the s390 specific rcu page-table freeing code with the
generic variant. This requires to duplicate the definition for the
struct mmu_table_batch as s390 does not use the generic tlb flush
code.

While we are at it remove the restriction that page table fragments
can not be reused after a single fragment has been freed with rcu
and split out allocation and freeing of page tables with pgstes.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
3c5cffb66d8ea94832650fcb55194715b0229088 29-May-2011 Heiko Carstens <heiko.carstens@de.ibm.com> [S390] mm: fix mmu_gather rework

Quite a few functions that get called from the tlb gather code require that
preemption must be disabled. So disable preemption inside of the called
functions instead.
The only drawback is that rcu_table_freelist_finish() doesn't get necessarily
called on the cpu(s) that filled the free lists. So we may see a delay, until
we finally see an rcu callback. However over time this shouldn't matter.

So we get rid of lots of "BUG: using smp_processor_id() in preemptible"
messages.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
1c395176962176660bb108f90e97e1686cfe0d85 25-May-2011 Peter Zijlstra <a.p.zijlstra@chello.nl> mm: now that all old mmu_gather code is gone, remove the storage

Fold all the mmu_gather rework patches into one for submission

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
043d07084b5347a26eab0a07aa13a4a929ad9e71 23-May-2011 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] Remove data execution protection

The noexec support on s390 does not rely on a bit in the page table
entry but utilizes the secondary space mode to distinguish between
memory accesses for instructions vs. data. The noexec code relies
on the assumption that the cpu will always use the secondary space
page table for data accesses while it is running in the secondary
space mode. Up to the z9-109 class machines this has been the case.
Unfortunately this is not true anymore with z10 and later machines.
The load-relative-long instructions lrl, lgrl and lgfrl access the
memory operand using the same addressing-space mode that has been
used to fetch the instruction.
This breaks the noexec mode for all user space binaries compiled
with march=z10 or later. The only option is to remove the current
noexec support.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
f1be77bb21120b5306b56d6854db1f8eb5c3678b 31-Jan-2011 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] pgtable_list corruption

After page_table_free_rcu removed a page from the pgtable_list
page_table_free better not add it again. Otherwise a page_table_alloc
can reuse a page table fragment that is still in the rcu process.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
e05ef9bdb899e2f3798be74691842fc597d8ce60 25-Oct-2010 Christian Borntraeger <borntraeger@de.ibm.com> [S390] kvm: Fix badness at include/asm/mmu_context.h:83

commit 050eef364ad700590a605a0749f825cab4834b1e
[S390] fix tlb flushing vs. concurrent /proc accesses
broke KVM on s390x. On every schedule a
Badness at include/asm/mmu_context.h:83 appears. s390_enable_sie
replaces the mm on the __running__ task, therefore, we have to
increase the attach count of the new mm.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
80217147a3d80c8a4e48f06e2f6e965455f3fe2a 25-Oct-2010 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] lockless get_user_pages_fast()

Implement get_user_pages_fast without locking in the fastpath on s390.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
5a0e3ad6af8660be21ca98a971cd00f331318c05 24-Mar-2010 Tejun Heo <tj@kernel.org> include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
b11b53342773361f3353b285eb6a3fd6074e7997 07-Dec-2009 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] Improve address space mode selection.

Introduce user_mode to replace the two variables switch_amode and
s390_noexec. There are three valid combinations of the old values:
1) switch_amode == 0 && s390_noexec == 0
2) switch_amode == 1 && s390_noexec == 0
3) switch_amode == 1 && s390_noexec == 1
They get replaced by
1) user_mode == HOME_SPACE_MODE
2) user_mode == PRIMARY_SPACE_MODE
3) user_mode == SECONDARY_SPACE_MODE
The new kernel parameter user_mode=[primary,secondary,home] lets
you choose the address space mode the user space processes should
use. In addition the CONFIG_S390_SWITCH_AMODE config option
is removed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
52a21f2cee108ea1c8abc4fdaf64a66f21af26db 06-Oct-2009 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] fix build breakage with CONFIG_AIO=n

next-20090925 randconfig build breaks on s390x, with CONFIG_AIO=n.

arch/s390/mm/pgtable.c: In function 's390_enable_sie':
arch/s390/mm/pgtable.c:282: error: 'struct mm_struct' has no member named 'ioctx_list'
arch/s390/mm/pgtable.c:298: error: 'struct mm_struct' has no member named 'ioctx_list'
make[1]: *** [arch/s390/mm/pgtable.o] Error 1

Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
87458ff4582953d6b3bf45edeac8582849552e69 22-Sep-2009 Heiko Carstens <heiko.carstens@de.ibm.com> [S390] Change kernel_page_present coding style.

Make the inline assembly look like all others.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
50aa98bad056a17655864a4d71ebc32d95c629a7 11-Sep-2009 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] fix recursive locking on page_table_lock

Suzuki Poulose reported the following recursive locking bug on s390:

Here is the stack trace : (see Appendix I for more info)

[<0000000000406ed6>] _spin_lock+0x52/0x94
[<0000000000103bde>] crst_table_free+0x14e/0x1a4
[<00000000001ba684>] __pmd_alloc+0x114/0x1ec
[<00000000001be8d0>] handle_mm_fault+0x2cc/0xb80
[<0000000000407d62>] do_dat_exception+0x2b6/0x3a0
[<0000000000114f8c>] sysc_return+0x0/0x8
[<00000200001642b2>] 0x200001642b2

The page_table_lock is already acquired in __pmd_alloc (mm/memory.c) and
it tries to populate the pud/pgd with a new pmd allocated. If another
thread populates it before we get a chance, we free the pmd using
pmd_free().

On s390x, pmd_free(even pud_free ) is #defined to crst_table_free(),
which acquires the page_table_lock to protect the crst_table index updates.

Hence this ends up in a recursive locking of the page_table_lock.

The solution suggested by Dave Hansen is to use a new spin lock in the mmu
context to protect the access to the crst_list and the pgtable_list.

Reported-by: Suzuki Poulose <suzuki@in.ibm.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
7db11a363fc41cec170a94a3542031e5e64bb333 16-Jun-2009 Hans-Joachim Picht <hans@linux.vnet.ibm.com> [S390] pm: add kernel_page_present

Fix the following build failure caused by make allyesconfig using
CONFIG_HIBERNATION and CONFIG_DEBUG_PAGEALLOC

kernel/built-in.o: In function `saveable_page':
kernel/power/snapshot.c:897: undefined reference to `kernel_page_present'
kernel/built-in.o: In function `safe_copy_page':
kernel/power/snapshot.c:948: undefined reference to `kernel_page_present'
make: *** [.tmp_vmlinux1] Error 1

Signed-off-by: Hans-Joachim Picht <hans@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
239a64255fae8933d95273b5b92545949ca4e743 12-Jun-2009 Heiko Carstens <heiko.carstens@de.ibm.com> [S390] vmalloc: add vmalloc kernel parameter support

With the kernel parameter 'vmalloc=<size>' the size of the vmalloc area
can be specified. This can be used to increase or decrease the size of
the area. Works in the same way as on some other architectures.
This can be useful for features which make excessive use of vmalloc and
wouldn't work otherwise.
The default sizes remain unchanged: 96MB for 31 bit kernels and 1GB for
64 bit kernels.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
005f8eee6f3c8173e492d7bd4d51bda990eb468b 26-Mar-2009 Rusty Russell <rusty@rustcorp.com.au> [S390] cpumask: use mm_cpumask() wrapper

Makes code futureproof against the impending change to mm->cpu_vm_mask.

It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
702d9e584feb028ed7e2a6d2b103b8ea57622ff2 26-Mar-2009 Carsten Otte <cotte@de.ibm.com> [S390] check addressing mode in s390_enable_sie

The sie instruction requires address spaces to be switched
to run proper. This patch verifies that this is the case
in s390_enable_sie, otherwise the kernel would crash badly
as soon as the process runs into sie.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
f481bfafd36e621d6cbc62d4b25f74811410aef7 18-Mar-2009 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] make page table walking more robust

Make page table walking on s390 more robust. The current code requires
that the pgd/pud/pmd/pte loop is only done for address ranges that are
below the end address of the last vma of the address space. But this
is not always true, e.g. the generic page table walker does not guarantee
this. Change TASK_SIZE/TASK_SIZE_OF to reflect the current size of the
address space. This makes the generic page table walker happy but it
breaks the upgrade of a 3 level page table to a 4 level page table.
To make the upgrade work again another fix is required.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
abf137dd7712132ee56d5b3143c2ff61a72a5faa 09-Dec-2008 Jens Axboe <jens.axboe@oracle.com> aio: make the lookup_ioctx() lockless

The mm->ioctx_list is currently protected by a reader-writer lock,
so we always grab that lock on the read side for doing ioctx
lookups. As the workload is extremely reader biased, turn this into
an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
the rwlock and use a spinlock for providing update side exclusion.

There's usually only 1 entry on this list, so it doesn't make sense
to look into fancier data structures.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
250cf776f74b5932a1977d0489cae9206e2351dd 28-Oct-2008 Christian Borntraeger <borntraeger@de.ibm.com> [S390] pgtables: Fix race in enable_sie vs. page table ops

The current enable_sie code sets the mm->context.pgstes bit to tell
dup_mm that the new mm should have extended page tables. This bit is also
used by the s390 specific page table primitives to decide about the page
table layout - which means context.pgstes has two meanings. This can cause
any kind of bugs. For example - e.g. shrink_zone can call
ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young
will test for context.pgstes. Since enable_sie changed that value of the old
struct mm without changing the page table layout ptep_clear_flush_young will
do the wrong thing.
The solution is to split pgstes into two bits
- one for the allocation
- one for the current state

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
74b6b522ec83f9c44fc7743f2adcb24664aa8f45 21-May-2008 Christian Borntraeger <borntraeger@de.ibm.com> KVM: s390: fix locking order problem in enable_sie

There are potential locking problem in enable_sie. We take the task_lock
and the mmap_sem. As exit_mm uses the same locks vice versa, this triggers
a lockdep warning.
The second problem is that dup_mm and mmput might sleep, so we must not
hold the task_lock at that moment.

The solution is to dup the mm unconditional and use the task_lock before and
afterwards to check if we can use the new mm. dup_mm and mmput are called
outside the task_lock, but we run update_mm while holding the task_lock,
protection us against ptrace.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
402b08622d9ac6e32e25289573272e0f21bb58a7 25-Mar-2008 Carsten Otte <cotte@de.ibm.com> s390: KVM preparation: provide hook to enable pgstes in user pagetable

The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process has now a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.

Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
review feedback. Now we do have the prototype for dup_mm in
include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
call task_lock() to prevent race against ptrace modification of mm_users.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
6252d702c5311ce916caf75ed82e5c8245171c92 09-Feb-2008 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] dynamic page tables.

Add support for different number of page table levels dependent
on the highest address used for a process. This will cause a 31 bit
process to use a two level page table instead of the four level page
table that is the default after the pud has been introduced. Likewise
a normal 64 bit process will use three levels instead of four. Only
if a process runs out of the 4 tera bytes which can be addressed with
a three level page table the fourth level is dynamically added. Then
the process can use up to 8 peta byte.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
146e4b3c8b92071b18f0b2e6f47165bad4f9e825 09-Feb-2008 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] 1K/2K page table pages.

This patch implements 1K/2K page table pages for s390.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 08-Feb-2008 Martin Schwidefsky <schwidefsky@de.ibm.com> CONFIG_HIGHPTE vs. sub-page page tables.

Background: I've implemented 1K/2K page tables for s390. These sub-page
page tables are required to properly support the s390 virtualization
instruction with KVM. The SIE instruction requires that the page tables
have 256 page table entries (pte) followed by 256 page status table entries
(pgste). The pgstes are only required if the process is using the SIE
instruction. The pgstes are updated by the hardware and by the hypervisor
for a number of reasons, one of them is dirty and reference bit tracking.
To avoid wasting memory the standard pte table allocation should return
1K/2K (31/64 bit) and 2K/4K if the process is using SIE.

Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means
the s390 version for pte_alloc_one cannot return a pointer to a struct
page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one
cannot return a pointer to a pte either, since that would require more than
32 bit for the return value of pte_alloc_one (and the pte * would not be
accessible since its not kmapped).

Solution: The only solution I found to this dilemma is a new typedef: a
pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a
later patch. For everybody else it will be a (struct page *). The
additional problem with the initialization of the ptl lock and the
NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
a destructor pgtable_page_dtor. The page table allocation and free
functions need to call these two whenever a page table page is allocated or
freed. pmd_populate will get a pgtable_t instead of a struct page pointer.
To get the pgtable_t back from a pmd entry that has been installed with
pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page
call in free_pte_range and apply_to_pte_range.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3610cce87af0693603db171d5b6f6735f5e3dc5b 22-Oct-2007 Martin Schwidefsky <schwidefsky@de.ibm.com> [S390] Cleanup page table definitions.

- De-confuse the defines for the address-space-control-elements
and the segment/region table entries.
- Create out of line functions for page table allocation / freeing.
- Simplify get_shadow_xxx functions.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>