66e9bbdb3dbb335b158bb88de2642966af816ffe |
|
06-Oct-2014 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
s390/mm: fixing calls of pte_unmap_unlock pte_unmap works on page table entry pointers, dereferencing should be avoided. As on s390 pte_unmap is a NOP, this is more of a cleanup in case we later want to supply such a function. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
dc77d344b41f3ffdd3b02317597e717b0b799f46 |
|
27-Aug-2014 |
Christian Borntraeger <borntraeger@de.ibm.com> |
KVM: s390/mm: fix up indentation of set_guest_storage_key Commit ab3f285f227f ("KVM: s390/mm: try a cow on read only pages for key ops") misaligned a code block. Let's fix up the indentation. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
c6c956b80bdf151cf41d3e7e5c54755d930a212c |
|
01-Jul-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390/mm: support gmap page tables with less than 5 levels Add an addressing limit to the gmap address spaces and only allocate the page table levels that are needed for the given limit. The limit is fixed and can not be changed after a gmap has been created. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
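
The relation between the addressing limit and the number of table levels can be sketched as follows. This is a minimal illustration, not the kernel's code: the names `demo_gmap_levels` and `DEMO_*` are hypothetical, and the shifts assume the usual s390 translation layout in which the segment table spans 2 GB and each region table level adds 11 address bits.

```c
#include <assert.h>
#include <stdint.h>

/* Address bits covered below each table level (assumed s390 DAT layout). */
enum {
	DEMO_SEGMENT_SHIFT = 31,	/* segment table + page tables: 2 GB */
	DEMO_REGION3_SHIFT = 42,	/* region-third table on top: 4 TB */
	DEMO_REGION2_SHIFT = 53,	/* region-second table on top: 8 PB */
};

static_assert(DEMO_REGION3_SHIFT - DEMO_SEGMENT_SHIFT == 11 &&
	      DEMO_REGION2_SHIFT - DEMO_REGION3_SHIFT == 11,
	      "each region table level indexes 11 address bits");

/*
 * Hypothetical helper: number of table levels needed so that every
 * address up to and including 'limit' can be translated.
 */
static int demo_gmap_levels(uint64_t limit)
{
	if (limit < (UINT64_C(1) << DEMO_SEGMENT_SHIFT))
		return 2;	/* page table + segment table */
	if (limit < (UINT64_C(1) << DEMO_REGION3_SHIFT))
		return 3;	/* + region-third table */
	if (limit < (UINT64_C(1) << DEMO_REGION2_SHIFT))
		return 4;	/* + region-second table */
	return 5;		/* + region-first table */
}
```

Under these assumptions a guest limited to just under 4 TB would need three levels instead of five, which is the saving the commit describes.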
|
527e30b41d8b86e9ae7f5b740de416958c0e574e |
|
30-Apr-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390/mm: use radix trees for guest to host mappings Store the target address for the gmap segments in a radix tree instead of using invalid segment table entries. gmap_translate becomes a simple radix_tree_lookup, gmap_fault is split into the address translation with gmap_translate and the part that does the linking of the gmap shadow page table with the process page table. A second radix tree is used to keep the pointers to the segment table entries for segments that are mapped in the guest address space. On unmap of a segment the pointer is retrieved from the radix tree and is used to carry out the segment invalidation in the gmap shadow page table. As the radix tree can only store one pointer, each host segment may only be mapped to exactly one guest location. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
6e0a0431bf7d90ed0b8a0a974ad219617a70cc22 |
|
29-Apr-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390/mm: cleanup gmap function arguments, variable names Make the order of arguments for the gmap calls more consistent, if the gmap pointer is passed it is always the first argument. In addition distinguish between guest address and user address by naming the variables gaddr for a guest address and vmaddr for a user address. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
9da4e3807657f3bcd12cfbb5671d80794303dde2 |
|
30-Apr-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390/mm: readd address parameter to gmap_do_ipte_notify Revert git commit c3a23b9874c1 ("remove unnecessary parameter from gmap_do_ipte_notify"). Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
ab3f285f227fec62868037e9b1b1fd18294a83b8 |
|
19-Aug-2014 |
Christian Borntraeger <borntraeger@de.ibm.com> |
KVM: s390/mm: try a cow on read only pages for key ops The PFMF instruction handler blindly wrote the storage key even if the page was mapped R/O in the host. Let's try a COW before continuing and bail out in case of errors. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
|
152125b7a882df36a55a8eadbea6d0edf1461ee7 |
|
24-Jul-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: implement dirty bits for large segment table entries The large segment table entry format has a block of bits for the ACC/F values for the large page. These bits are valid only if another bit (AV bit 0x10000) of the segment table entry is set. The ACC/F bits do not have a meaning if the AV bit is off. This allows putting the THP splitting bit, the segment young bit and the new segment dirty bit into the ACC/F bits as long as the AV bit stays off. The dirty and young information is only available if the pmd is large. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
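
The bit reuse described above can be sketched like this. Only the AV bit value (0x10000) comes from the commit text; the position of the ACC/F block and the three software bits are assumptions made for illustration, all `DEMO_*`/`demo_*` names are hypothetical, and the "pmd is large" condition is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SEG_AV	UINT64_C(0x10000)	/* ACC/F validity bit (from the commit text) */
#define DEMO_SEG_ACCF	UINT64_C(0x0f800)	/* assumed position of the ACC/F block */
#define DEMO_SEG_DIRTY	UINT64_C(0x02000)	/* assumed software dirty bit */
#define DEMO_SEG_YOUNG	UINT64_C(0x01000)	/* assumed software young bit */
#define DEMO_SEG_SPLIT	UINT64_C(0x00800)	/* assumed THP splitting bit */

static_assert(((DEMO_SEG_DIRTY | DEMO_SEG_YOUNG | DEMO_SEG_SPLIT) &
	       ~DEMO_SEG_ACCF) == 0,
	      "software bits must live inside the ACC/F block");

/* The ACC/F block is free for software use only while AV is off. */
static bool demo_sw_bits_usable(uint64_t entry)
{
	return (entry & DEMO_SEG_AV) == 0;
}

/* Set the software dirty bit; leave the entry alone if AV is set. */
static uint64_t demo_seg_mkdirty(uint64_t entry)
{
	if (demo_sw_bits_usable(entry))
		entry |= DEMO_SEG_DIRTY;
	return entry;
}

/* Report dirty only when the bit really is a software bit. */
static bool demo_seg_dirty(uint64_t entry)
{
	return demo_sw_bits_usable(entry) && (entry & DEMO_SEG_DIRTY) != 0;
}
```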
|
55e4283c3eb1d850893f645dd695c9c75d5fa1fc |
|
25-Jul-2014 |
Christian Borntraeger <borntraeger@de.ibm.com> |
KVM: s390/mm: Fix page table locking vs. split pmd lock commit ec66ad66a0de87866be347b5ecc83bd46427f53b (s390/mm: enable split page table lock for PMD level) activated the split pmd lock for s390. Turns out that we missed one place: We also have to take the pmd lock instead of the page table lock when we reallocate the page tables (==> changing entries in the PMD) during sie enablement. Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
beef560b4cdfafb2211a856e1d722540f5151933 |
|
14-Apr-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/uaccess: simplify control register updates Always switch to the kernel ASCE in switch_mm. Load the secondary space ASCE in finish_arch_post_lock_switch after checking that any pending page table operations have completed. The primary ASCE is loaded in entry[64].S. With this the update_primary_asce call can be removed from the switch_to macro and from the start of switch_mm function. Remove the load_primary argument from update_user_asce/clear_user_asce, rename update_user_asce to set_user_asce and rename update_primary_asce to load_kernel_asce. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
3a801517ad49f586f2016e1b1321e6cd28a97a04 |
|
16-May-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390: correct locking for s390_enable_skey Use the mm semaphore to serialize multiple invocations of s390_enable_skey. The second CPU faulting on a storage key operation needs to wait for the completion of the page table update. Taking the mm semaphore writable has the positive side-effect that it prevents any host faults from taking place which does have implications on keys vs PGSTE. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
a0bf4f149bbfa2e31b5f4172c817afdb7b986733 |
|
24-Mar-2014 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
KVM: s390/mm: new gmap_test_and_clear_dirty function For live migration kvm needs to test and clear the dirty bit of guest pages. That is what ptep_test_and_clear_user_dirty is for; to be sure we are not racing with other code, we protect the pte. This needs to be done within the architecture memory management code. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
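
As a rough illustration of why test-and-clear has to be a single protected step, here is a user-space sketch. It deliberately swaps in a C11 atomic read-modify-write where the commit protects the pte by locking; the bit value and the `demo_*` names are hypothetical and the real pte layout is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software dirty bit in a stand-in "pte" word. */
#define DEMO_PTE_DIRTY UINT64_C(0x1)

/*
 * Clear the dirty bit and report whether it was set, in one atomic
 * step, so a concurrent writer cannot set the bit between the test
 * and the clear without being noticed on the next call.
 */
static bool demo_test_and_clear_dirty(_Atomic uint64_t *entry)
{
	assert(entry != NULL);
	uint64_t old = atomic_fetch_and(entry, ~DEMO_PTE_DIRTY);
	return (old & DEMO_PTE_DIRTY) != 0;
}
```

A migration loop would call such a helper once per guest page and resend every page for which it returns true.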
|
0a61b222df75a6a69dc34816f7db2f61fee8c935 |
|
18-Oct-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390/mm: use software dirty bit detection for user dirty tracking Switch the user dirty bit detection used for migration from the hardware provided host change-bit in the pgste to a fault based detection method. This reduced the dependency of the host from the storage key to a point where it becomes possible to enable the RCP bypass for KVM guests. The fault based dirty detection will only indicate changes caused by accesses via the guest address space. The hardware based method can detect all changes, even those caused by I/O or accesses via the kernel page table. The KVM/qemu code needs to take this into account. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
934bc131efc3e4be6a52f7dd6c4dbf99635e381a |
|
14-Jan-2014 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
KVM: s390: Allow skeys to be enabled for the current process Introduce a new function s390_enable_skey(), which enables storage key handling via setting the use_skey flag in the mmu context. This function is only useful within the context of kvm. Note that enabling storage keys will cause a one-time hiccup when walking the page table; however, it saves us special effort for cases like clear reset while making it possible for us to be architecture conformant. s390_enable_skey() takes the page table lock to prevent resetting storage keys triggered from multiple vcpus. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
d4cb11340be6a1613d40d2b546cb111ea2547066 |
|
29-Jan-2014 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
KVM: s390: Clear storage keys page_table_reset_pgste() already does a complete page table walk to reset the pgste. Enhance it to initialize the storage keys to PAGE_DEFAULT_KEY if requested by the caller. This will be used for lazy storage key handling. Also provide an empty stub for !CONFIG_PGSTE. Let's adapt the current code (diag 308) to not clear the keys. Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
1e1836e84f87d12feac6dd225fcef5eba1ca724b |
|
08-Apr-2014 |
Alex Thorlton <athorlton@sgi.com> |
mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"

The main motivation behind this patch is to provide a way to disable THP for jobs where the code cannot be modified, and using a malloc hook with madvise is not an option (i.e. statically allocated data). This patch allows us to do just that, without affecting other jobs running on the system.

We need to do this sort of thing for jobs where THP hurts performance, due to the possibility of increased remote memory accesses that can be created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be handed out, and the THP will be stuck on whatever node the chunk was originally referenced from. If many remote nodes need to do work on that same chunk, they'll be making remote accesses.

With THP disabled, 4K pages can be handed out to separate nodes as they're needed, greatly reducing the amount of remote accesses to memory.

This patch is based on some of my work combined with some suggestions/patches given by Oleg Nesterov. The main goal here is to add a prctl switch to allow us to disable THP on a per mm_struct basis.

Here's a bit of test data with the new patch in place...

First with the flag unset:

    # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 0
    Set thp_disabled state to 0
    Process pid = 18027

                                                                                                       PF/
                   MAX    MIN                        TOTCPU/      TOT_PF/               TOT_PF/      WSEC/
    TYPE:  CPUS   WALL   WALL    SYS   USER  TOTCPU      CPU     WALL_SEC               SYS_SEC        CPU  NODES
            512  1.120  0.060  0.000  0.110   0.110    0.000  28571428864  -9223372036854775808   55803572     23

    Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

       273719072.841402 task-clock               #  641.026 CPUs utilized            [100.00%]
              1,008,986 context-switches         #    0.000 M/sec                    [100.00%]
                  7,717 CPU-migrations           #    0.000 M/sec                    [100.00%]
              1,698,932 page-faults              #    0.000 M/sec
    355,222,544,890,379 cycles                   #    1.298 GHz                      [100.00%]
    536,445,412,234,588 stalled-cycles-frontend  #  151.02% frontend cycles idle     [100.00%]
    409,110,531,310,223 stalled-cycles-backend   #  115.17% backend cycles idle      [100.00%]
    148,286,797,266,411 instructions             #    0.42 insns per cycle
                                                 #    3.62 stalled cycles per insn   [100.00%]
     27,061,793,159,503 branches                 #   98.867 M/sec                    [100.00%]
          1,188,655,196 branch-misses            #    0.00% of all branches

          427.001706337 seconds time elapsed

Now with the flag set:

    # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 1
    Set thp_disabled state to 1
    Process pid = 144957

                                                                                                       PF/
                   MAX    MIN                        TOTCPU/      TOT_PF/               TOT_PF/      WSEC/
    TYPE:  CPUS   WALL   WALL    SYS   USER  TOTCPU      CPU     WALL_SEC               SYS_SEC        CPU  NODES
            512  0.620  0.260  0.250  0.320   0.570    0.001  51612901376          128000000000  100806448     23

    Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

       138789390.540183 task-clock               #  641.959 CPUs utilized            [100.00%]
                534,205 context-switches         #    0.000 M/sec                    [100.00%]
                  4,595 CPU-migrations           #    0.000 M/sec                    [100.00%]
             63,133,119 page-faults              #    0.000 M/sec
    147,977,747,269,768 cycles                   #    1.066 GHz                      [100.00%]
    200,524,196,493,108 stalled-cycles-frontend  #  135.51% frontend cycles idle     [100.00%]
    105,175,163,716,388 stalled-cycles-backend   #   71.07% backend cycles idle      [100.00%]
    180,916,213,503,160 instructions             #    1.22 insns per cycle
                                                 #    1.11 stalled cycles per insn   [100.00%]
     26,999,511,005,868 branches                 #  194.536 M/sec                    [100.00%]
            714,066,351 branch-misses            #    0.00% of all branches

          216.196778807 seconds time elapsed

As with previous versions of the patch, we're getting about a 2x performance increase here. Here's a link to the test case I used, along with the little wrapper to activate the flag:

    http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

This patch (of 3):

Revert commit 8e72033f2a48 and add in code to fix up any issues caused by the revert.

The revert is necessary because hugepage_madvise would return -EINVAL when VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch set.

Here's a snip of an e-mail from Gerald detailing the original purpose of this code, and providing justification for the revert:

"The intent of commit 8e72033f2a48 was to guard against any future programming errors that may result in an madvise(MADV_HUGEPAGE) on guest mappings, which would crash the kernel.

Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f2a48 was to be reverted, because that check will also prevent a kernel crash in the case described above, it will now send a SIGSEGV instead.

This would now also allow to do the madvise on other parts, if needed, so it is a more flexible approach. One could also say that it would have been better to do it this way right from the beginning..."

Signed-off-by: Alex Thorlton <athorlton@sgi.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
457f2180951cdcbfb4657ddcc83b486e93497f56 |
|
21-Mar-2014 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/uaccess: rework uaccess code - fix locking issues

The current uaccess code uses a page table walk in some circumstances, e.g. in case of the in-atomic futex operations or if running on old hardware which doesn't support the mvcos instruction.

However it turned out that the page table walk code does not correctly lock page tables when accessing page table entries. In other words: a different cpu may invalidate a page table entry while the current cpu inspects the pte. This may lead to random data corruption.

Adding correct locking however isn't trivial for all uaccess operations. Especially copy_in_user() is problematic since that requires holding at least two locks, but must be protected against ABBA deadlock when a different cpu also performs a copy_in_user() operation.

So the solution is a different approach where we change address spaces:

User space runs in primary address mode, or access register mode within vdso code, like it currently already does.

The kernel usually also runs in home space mode, however when accessing user space the kernel switches to primary or secondary address mode if the mvcos instruction is not available or if a compare-and-swap (futex) instruction on a user space address is performed. KVM however is special, since that requires the kernel to run in home address space while implicitly accessing user space with the sie instruction.

So we end up with:

User space:
- runs in primary or access register mode
- cr1 contains the user asce
- cr7 contains the user asce
- cr13 contains the kernel asce

Kernel space:
- runs in home space mode
- cr1 contains the user or kernel asce
  -> the kernel asce is loaded when a uaccess requires primary or secondary address mode
- cr7 contains the user or kernel asce, (changed with set_fs())
- cr13 contains the kernel asce

In case of uaccess the kernel changes to:
- primary space mode in case of a uaccess (copy_to_user) and uses e.g. the mvcp instruction to access user space. However the kernel will stay in home space mode if the mvcos instruction is available
- secondary space mode in case of futex atomic operations, so that the instructions come from primary address space and data from secondary space

In case of kvm the kernel runs in home space mode, but cr1 gets switched to contain the gmap asce before the sie instruction gets executed. When the sie instruction is finished cr1 will be switched back to contain the user asce.

A context switch between two processes will always load the kernel asce for the next process in cr1. So the first exit to user space is a bit more expensive (one extra load control register instruction) than before, however keeps the code rather simple.

In sum this means there is no need to perform any error prone page table walks anymore when accessing user space.

The patch seems to be rather large, however it mainly removes the page table walk code and restores the previously deleted "standard" uaccess code, with a couple of changes.

The uaccess without mvcos mode can be enforced with the "uaccess_primary" kernel parameter.

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
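
The mode selection this commit message describes can be summarised in a small decision function. This is only a sketch of the rules as stated in the text: the enum and function names are hypothetical, and `have_mvcos` is assumed to mean "mvcos is available and has not been disabled with the uaccess_primary parameter".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the address space modes named in the text. */
enum demo_addr_mode {
	DEMO_MODE_HOME,		/* normal kernel mode */
	DEMO_MODE_PRIMARY,	/* e.g. mvcp based copy to/from user */
	DEMO_MODE_SECONDARY,	/* futex: code from primary, data from secondary */
};

enum demo_uaccess_op {
	DEMO_UACCESS_COPY,		/* copy_to_user() style access */
	DEMO_UACCESS_FUTEX_ATOMIC,	/* compare-and-swap on a user address */
};

/* Which mode the kernel runs in while performing the given access. */
static enum demo_addr_mode demo_uaccess_mode(enum demo_uaccess_op op,
					     bool have_mvcos)
{
	if (op == DEMO_UACCESS_FUTEX_ATOMIC)
		return DEMO_MODE_SECONDARY;

	/* Only plain copies remain at this point. */
	assert(op == DEMO_UACCESS_COPY);
	return have_mvcos ? DEMO_MODE_HOME : DEMO_MODE_PRIMARY;
}
```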
|
1b948d6caec4f28e3524244ca0f77c6ae8ddceef |
|
03-Apr-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm,tlb: optimize TLB flushing for zEC12 The zEC12 machines introduced the local-clearing control for the IDTE and IPTE instruction. If the control is set only the TLB of the local CPU is cleared of entries, either all entries of a single address space for IDTE, or the entry for a single page-table entry for IPTE. Without the local-clearing control the TLB flush is broadcast to all CPUs in the configuration, which is expensive. The reset of the bit mask of the CPUs that need flushing after a non-local IDTE is tricky. As TLB entries for an address space remain in the TLB even if the address space is detached a new bit field is required to keep track of attached CPUs vs. CPUs in need of a flush. After a non-local flush with IDTE the bit-field of attached CPUs is copied to the bit-field of CPUs in need of a flush. The ordering of operations on cpu_attach_mask, attach_count and mm_cpumask(mm) is such that an underindication in mm_cpumask(mm) is prevented but an overindication in mm_cpumask(mm) is possible. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
02a8f3abb708919149cb657a5202f4603f0c38e2 |
|
03-Apr-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm,tlb: safeguard against speculative TLB creation The Principles of Operation states that the CPU is allowed to create TLB entries for an address space anytime while an ASCE is loaded to the control register. This is true even if the CPU is running in the kernel and the user address space is not (actively) accessed. In theory this can affect two aspects of the TLB flush logic. For full-mm flushes the ASCE of the dying process is still attached. The approach to flush first with IDTE and then just free all page tables can in theory lead to stale TLB entries. Use the batched free of page tables for the full-mm flushes as well. For operations that can have a stale ASCE in the control register, e.g. a delayed update_user_asce in switch_mm, load the kernel ASCE to prevent invalid TLBs from being created. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
aaeff84a2dfa224611fc9fee89cb20277469c454 |
|
19-Mar-2014 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
s390/mm: remove unnecessary parameter from gmap_do_ipte_notify Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
c7c5be73ccc05da9899b313b9fa0042aae56502f |
|
19-Mar-2014 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
s390/mm: fixing comment so that parameter name match Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
ec66ad66a0de87866be347b5ecc83bd46427f53b |
|
12-Feb-2014 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: enable split page table lock for PMD level Add the pgtable_pmd_page_ctor/pgtable_pmd_page_dtor calls to the pmd allocation and free functions and enable ARCH_ENABLE_SPLIT_PMD_PTLOCK for 64 bit. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
deedabb2b4a68a63351a949b1abcf73fc97eb406 |
|
21-May-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/kvm: set guest page states to stable on re-ipl The guest page state needs to be reset to stable for all pages on initial program load via diagnose 0x308. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
b31288fa83b2bcc8834e1e208e9526b8bd5ce361 |
|
17-Apr-2013 |
Konstantin Weitz <konstantin.weitz@gmail.com> |
s390/kvm: support collaborative memory management This patch enables Collaborative Memory Management (CMM) for kvm on s390. CMM allows the guest to inform the host about page usage (see arch/s390/mm/cmm.c). The host uses this information to avoid swapping in unused pages in the page fault handler. Further, a CPU provided list of unused invalid pages is processed to reclaim swap space of not yet accessed unused pages. [ Martin Schwidefsky: patch reordering and cleanup ] Signed-off-by: Konstantin Weitz <konstantin.weitz@gmail.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
b4a960159e6f5254ac3c95dd183789f402431977 |
|
13-Dec-2013 |
Hendrik Brueckner <brueckner@linux.vnet.ibm.com> |
s390: Fix misspellings using 'codespell' tool Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
e89cfa58a8358fdb4d4e79936c25222416ad415e |
|
14-Nov-2013 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
s390: handle pgtable_page_ctor() fail Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
c389a250ab4cfa4a3775d9f2c45271618af6d5b2 |
|
14-Nov-2013 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm, thp: do not access mm->pmd_huge_pte directly Currently mm->pmd_huge_pte is protected by the page table lock. It will not work with split lock. We have to have per-pmd pmd_huge_pte for proper access serialization. For now, let's just introduce a wrapper to access mm->pmd_huge_pte. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
106078641f32a6a10d9759f809f809725695cb09 |
|
28-Oct-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm,tlb: correct tlb flush on page table upgrade The IDTE instruction used to flush TLB entries for a specific address space uses the address-space-control element (ASCE) to identify affected TLB entries. The upgrade of a page table adds a new top level page table which changes the ASCE. The TLB entries associated with the old ASCE need to be flushed and the ASCE for the address space needs to be replaced synchronously on all CPUs which currently use it. The concept of a lazy ASCE update with an exception handler is broken. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
be39f1968e33ca641af120a2d659421ad2225dea |
|
31-Oct-2013 |
Dominik Dingel <dingel@linux.vnet.ibm.com> |
s390/mm: page_table_realloc returns failure There is a possible race between setting has_pgste and reallocation of the page_table, change the order to fix this. Also page_table_alloc_pgste can fail, in that case we need to propagate this back as -ENOMEM to the caller of page_table_realloc. Based on a patch by Christian Borntraeger <borntraeger@de.ibm.com>. Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
e258d719ff28ecc7a048eb8f78380e68c4b3a3f0 |
|
24-Sep-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/uaccess: always run the kernel in home space Simplify the uaccess code by removing the user_mode=home option. The kernel will now always run in the home space mode. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
63df41d663fc27e96571bfea86d3f9ee81289e07 |
|
06-Sep-2013 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390: make various functions static, add declarations to header files Make various functions static, add declarations to header files to fix a couple of sparse findings. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
984e2a5975e538a6475f7453523896319a1cb597 |
|
06-Sep-2013 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/mm: add __releases()/__acquires() annotations to gmap_alloc_table() Let sparse not incorrectly complain about unbalanced locking. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
0944fe3f4a323f436180d39402cae7f9c46ead17 |
|
23-Jul-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: implement software referenced bits The last remaining use for the storage key of the s390 architecture is reference counting. The alternative is to make page table entries invalid while they are old. On access the fault handler marks the pte/pmd as young which makes the pte/pmd valid if the access rights allow read access. The pte/pmd invalidations required for software managed reference bits cost a bit of performance, on the other hand the RRBE/RRBM instructions to read and reset the referenced bits are quite expensive as well. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
5c474a1e2265c5156e6c63f87a7e99053039b8b9 |
|
16-Aug-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: introduce ptep_flush_lazy helper Isolate the logic of IDTE vs. IPTE flushing of ptes in two functions, ptep_flush_lazy and __tlb_flush_mm_lazy. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
e509861105a3c1425f3f929bd631f88340b499bf |
|
23-Jul-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: cleanup page table definitions Improve the encoding of the different pte types and the naming of the page, segment table and region table bits. Due to the different pte encoding the hugetlbfs primitives need to be adapted as well. To improve compatibility with common code make the huge ptes use the encoding of normal ptes. The conversion between the pte and pmd encoding for a huge pte is done with set_huge_pte_at and huge_ptep_get. Overall the code is now easier to understand. Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
ee6ee55bb505c5bd8e64bc652281a93fb99c07b3 |
|
26-Jul-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390: fix task size check The gmap_map_segment function uses PGDIR_SIZE in the check for the maximum address in the task's address space. This incorrectly limits the amount of memory usable for a kvm guest to 4TB. The correct limit is (1UL << 53). As the TASK_SIZE has different values (4TB vs 8PB) dependent on the existence of the fourth page table level, create a new define 'TASK_MAX_SIZE' for (1UL << 53). Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
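
The effect of the two limits can be checked with a little arithmetic. The sketch below is illustrative only: `demo_range_fits` and the `DEMO_*` constants are hypothetical stand-ins, with the 4TB value taken from the commit text as the limit that PGDIR_SIZE imposed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_OLD_LIMIT		(UINT64_C(1) << 42)	/* 4TB, the too-small limit */
#define DEMO_TASK_MAX_SIZE	(UINT64_C(1) << 53)	/* 8PB, the corrected limit */

static_assert(DEMO_TASK_MAX_SIZE / DEMO_OLD_LIMIT == 2048,
	      "the corrected limit is 2^11 times larger");

/*
 * Hypothetical range check: does [start, start + len) fit below max?
 * Written so that start + len is never computed and cannot overflow.
 */
static bool demo_range_fits(uint64_t start, uint64_t len, uint64_t max)
{
	return len <= max && start <= max - len;
}
```

With these definitions a 1GB mapping that starts at 5TB is rejected against the old limit and accepted against the new one.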
|
3eabaee998c787e7e1565574821652548f7fc003 |
|
26-Jul-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
KVM: s390: allow sie enablement for multi-threaded programs Improve the code to upgrade the standard 2K page tables to 4K page tables with PGSTEs to allow the operation to happen when the program is already multi-threaded. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
24d5dd0208ed1cd3ef6bf30a50b347ef366f21ac |
|
27-May-2013 |
Christian Borntraeger <borntraeger@de.ibm.com> |
s390/kvm: Provide function for setting the guest storage key From time to time we need to set the guest storage key. Let's provide a helper function that handles the changes with all the right locking and checking. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
6b0b50b0617fad5f2af3b928596a25f7de8dbf50 |
|
06-Jun-2013 |
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> |
mm/THP: add pmd args to pgtable deposit and withdraw APIs This will be later used by powerpc THP support. In powerpc we want to use pgtable for storing the hash index values. So instead of adding them to mm_context list, we would like to store them in the second half of pmd. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
db70ccdfb9953b984f5b95d98c50d8da335bab59 |
|
12-Jun-2013 |
Christian Borntraeger <borntraeger@de.ibm.com> |
KVM: s390: Provide function for setting the guest storage key From time to time we need to set the guest storage key. Let's provide a helper function that handles the changes with all the right locking and checking. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
e86cbd8765bd2e1f9eeb209822449c9b1e5958cf |
|
29-May-2013 |
Christian Borntraeger <borntraeger@de.ibm.com> |
s390/pgtable: Fix gmap notifier address The address of the gmap notifier was broken, resulting in unhandled validity intercepts in KVM. Fix the rmap->vmaddr to be on a segment boundary. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
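The fix above amounts to rounding an address down to the start of its segment. A minimal sketch of that rounding, assuming 1 MB segments (the size named in the gmap mapping commit further down); `SEGMENT_SIZE`, `SEGMENT_MASK` and `segment_base()` are illustrative names, not the kernel's own code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: segments are 1 MB, so the low 20 address bits are the offset. */
#define SEGMENT_SIZE (UINT64_C(1) << 20)
#define SEGMENT_MASK (~(SEGMENT_SIZE - 1))

/* Round an address down to the start of the segment that contains it. */
static uint64_t segment_base(uint64_t vmaddr)
{
    uint64_t base = vmaddr & SEGMENT_MASK;

    /* The result never exceeds the input and lies in the same segment. */
    assert(base <= vmaddr && vmaddr - base < SEGMENT_SIZE);
    return base;
}
```

An address that is already aligned comes back unchanged, so the helper can be applied unconditionally.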
|
f8b5ff2cff232df052955ef975f7219e1faa217f |
|
17-May-2013 |
Christian Borntraeger <borntraeger@de.ibm.com> |
s390: fix gmap_ipte_notifier vs. software dirty pages On heavy paging load some guest cpus started to loop in gmap_ipte_notify. This was visible as stalled cpus inside the guest. The gmap_ipte_notifier tries to map a user page and then makes sure that the pte is valid and writable. It turns out that with the software change bit tracking the pte can become read-only (and only software writable) if the page is clean. Since we loop in this code, the page would stay clean and, therefore, never be writable again. Let us just use fixup_user_fault, which guarantees a call to handle_mm_fault. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
0d0dafc1e48fd254c22f75738def870a7ffd2c3e |
|
17-May-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/kvm: rename RCP_xxx defines to PGSTE_xxx The RCP byte is a part of the PGSTE value, the existing RCP_xxx names are inaccurate. As the defines describe bits and pieces of the PGSTE, the names should start with PGSTE_. The KVM_UR_BIT and KVM_UC_BIT are part of the PGSTE as well, give them better names as well. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
|
bb4b42ce0ca36af8c113587ab64b138b3cf5459c |
|
08-May-2013 |
Christian Borntraeger <borntraeger@de.ibm.com> |
s390: fix gmap_ipte_notifier vs. software dirty pages On heavy paging load some guest cpus started to loop in gmap_ipte_notify. This was visible as stalled cpus inside the guest. The gmap_ipte_notifier tries to map a user page and then makes sure that the pte is valid and writable. It turns out that with the software change bit tracking the pte can become read-only (and only software writable) if the page is clean. Since we loop in this code, the page would stay clean and, therefore, never be writable again. Let us just use fixup_user_fault, which guarantees a call to handle_mm_fault. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
d3383632d4e8e9ae747f582eaee8c2e79f828ae6 |
|
17-Apr-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: add pte invalidation notifier for kvm Add a notifier for kvm to get control before a page table entry is invalidated. The notifier is only called for ptes of an address space with pgstes that have been explicitly marked to require notification. Kvm will use this to get control before prefix pages of virtual CPU are unmapped. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
ab8e5235868f99dfc779e4eaff28f53d63714ce4 |
|
16-Apr-2013 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm,gmap: segment mapping race The gmap_map_segment function creates a special invalid segment table entry with the address of the requested target location in the process address space. The first access will create the connection between the gmap segment table and the target page table of the main process. If two threads do this concurrently, both will walk the page tables and allocate a gmap_rmap structure for the same segment table entry. To avoid the race, recheck the segment table entry after taking the page table lock. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
c5034945ce59abacdd02c5eff29f4f54df197880 |
|
10-Sep-2012 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/mm,gmap: implement gmap_translate() Implement gmap_translate() function which translates a guest absolute address to a user space process address without establishing the guest page table entries. This is useful for kvm guest address translations where no memory access is expected to happen soon (e.g. tprot exception handler). Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
9e0fdb4145205bea95c2888a195c3ead2652f120 |
|
05-Mar-2013 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/mm,gmap: implement gmap_translate() Implement gmap_translate() function which translates a guest absolute address to a user space process address without establishing the guest page table entries. This is useful for kvm guest address translations where no memory access is expected to happen soon (e.g. tprot exception handler). Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
|
0a4ccc992978ef552dc86ac68bc1ec62cf268e2a |
|
02-Nov-2012 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/mm: move kernel_page_present/kernel_map_pages to page_attr.c Keep related functions together and move to appropriate file. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
1ae1c1d09f220ded48ee9a7d91a65e94f95c4af1 |
|
09-Oct-2012 |
Gerald Schaefer <gerald.schaefer@de.ibm.com> |
thp, s390: architecture backend for thp on s390 This implements the architecture backend for transparent hugepages on s390. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
274023da1e8a49efa6fd9bf857f8557e5db44cdf |
|
09-Oct-2012 |
Gerald Schaefer <gerald.schaefer@de.ibm.com> |
thp, s390: disable thp for kvm host on s390 This patch is part of the architecture backend for thp on s390. It disables thp for kvm hosts, because there is no kvm host hugepage support so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT, and future thp mappings are prevented by setting VM_NOHUGEPAGE in mm->def_flags. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
9501d09fa3c4ca18971083dfb0c9aa1afc85f19c |
|
09-Oct-2012 |
Gerald Schaefer <gerald.schaefer@de.ibm.com> |
thp, s390: thp pagetable pre-allocation for s390 This patch is part of the architecture backend for thp on s390. It provides the pagetable pre-allocation functions pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike other archs, s390 has no struct page * as pgtable_t, but rather a pointer to the page table. So instead of saving the pagetable pre-allocation list info inside the struct page, it is being saved within the pagetable itself. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
75077afbec1ac89178c1542b23a70d0f960b0aaf |
|
09-Oct-2012 |
Gerald Schaefer <gerald.schaefer@de.ibm.com> |
thp, s390: thp splitting backend for s390 This patch is part of the architecture backend for thp on s390. It provides the functions related to thp splitting, including serialization against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb flushing operation to serialize against gup on s390, because that wouldn't be stopped by the disabled IRQs. So instead, smp_call_function() is called with an empty function, which will have the expected effect. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
41459d36cf0d57813017dae6080a879cc038e5fe |
|
14-Sep-2012 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390: add uninitialized_var() to suppress false positive compiler warnings Get rid of these: arch/s390/kernel/smp.c:134:19: warning: ‘status’ may be used uninitialized in this function [-Wuninitialized] arch/s390/mm/pgtable.c:641:10: warning: ‘table’ may be used uninitialized in this function [-Wuninitialized] arch/s390/mm/pgtable.c:644:12: warning: ‘page’ may be used uninitialized in this function [-Wuninitialized] drivers/s390/cio/cio.c:1037:14: warning: ‘schid’ may be used uninitialized in this function [-Wuninitialized] Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
d1b0d842c4450e410053083db837ef16532a4139 |
|
02-Sep-2012 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/mm: rename addressing_mode to s390_user_mode Renaming the globally visible variable "user_mode" to "addressing_mode" in order to fix a name clash was not a good idea. (Commit 37fe1d73 "s390/mm: rename user_mode variable to addressing_mode") Looking at the code after a couple of weeks one thinks: addressing mode of what? So rename the variable again. This time to s390_user_mode. Which hopefully makes more sense. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
37fe1d73a449bdebc4908d04e518f5852d6c453b |
|
27-Jul-2012 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/mm: rename user_mode variable to addressing_mode Fix name clash with user_mode() define which is also used in common code. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
0f6f281b731d20bfe75c13f85d33f3f05b440222 |
|
26-Jul-2012 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
s390/mm: downgrade page table after fork of a 31 bit process The downgrade of the 4 level page table created by init_new_context is currently done only in start_thread31. If a 31 bit process forks the new mm uses a 4 level page table, including the task size of 2<<42 that goes along with it. This is incorrect as now a 31 bit process can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade after fork. Cc: stable@vger.kernel.org Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
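To make the sizes in this message concrete: a task size of 2<<42 bytes is 8 TB, while a 31 bit process must stay below 1<<31 bytes (2 GB). The sketch below is an illustration under those two figures only; `mapping_allowed()` and the two macro names are hypothetical and do not come from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TASK_SIZE_31BIT  (UINT64_C(1) << 31) /* 2 GB limit of a 31 bit process */
#define TASK_SIZE_FORKED (UINT64_C(2) << 42) /* the "2<<42" quoted in the message */

/* True if the range [addr, addr + len) lies entirely below task_size. */
static bool mapping_allowed(uint64_t addr, uint64_t len, uint64_t task_size)
{
    assert(len > 0);
    /* Written without addr + len so the check cannot overflow. */
    return addr < task_size && len <= task_size - addr;
}
```

With the larger task size a mapping at 3 GB passes the check, which is the incorrect behavior the commit describes; with the 31 bit limit the same mapping is rejected.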
|
a53c8fab3f87c995c30ac226a03af95361243144 |
|
20-Jul-2012 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
s390/comments: unify copyright messages and remove file names Remove the file name from the comment at top of many files. In most cases the file name was wrong anyway, so it's rather pointless. Also unify the IBM copyright statement. We did have a lot of slightly different statements and wanted to change them one after another whenever a file gets touched. However that never happened. Instead people started to take the old/"wrong" statements to use as a template for new files. So unify all of them in one go. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
2739b6d12407792f253b7a15233930338e6831c9 |
|
09-May-2012 |
Christian Borntraeger <borntraeger@de.ibm.com> |
s390/kvm: bad rss-counter state commit c3f0327f8e9d7a503f0d64573c311eddd61f197d mm: add rss counters consistency check detected the following problem with kvm on s390: BUG: Bad rss-counter state mm:00000004f73ef000 idx:0 val:-10 BUG: Bad rss-counter state mm:00000004f73ef000 idx:1 val:-5 We have to make sure that we accumulate all rss values into the mm before we replace the mm to avoid triggering this (harmless) bug message. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
cd94154cc6a28dd9dc271042c1a59c08d26da886 |
|
11-Apr-2012 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] fix tlb flushing for page table pages Git commit 36409f6353fc2d7b6516e631415f938eadd92ffa "use generic RCU page-table freeing code" introduced a tlb flushing bug. Partially revert the above git commit and go back to s390 specific page table flush code. For s390 the TLB can contain three types of entries, "normal" TLB page-table entries, TLB combined region-and-segment-table (CRST) entries and real-space entries. Linux does not use real-space entries which leaves normal TLB entries and CRST entries. The CRST entries are intermediate steps in the page-table translation called translation paths. For example a 4K page access in a three-level page table setup will create two CRST TLB entries and one page-table TLB entry. The advantage of that approach is that a page access next to the previous one can reuse the CRST entries and needs just a single read from memory to create the page-table TLB entry. The disadvantage is that the TLB flushing rules are more complicated, before any page-table may be freed the TLB needs to be flushed. In short: the generic RCU page-table freeing code is incorrect for the CRST entries, in particular the check for mm_users < 2 is troublesome. This is applicable to 3.0+ kernels. Cc: <stable@vger.kernel.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
a0616cdebcfd575dcd4c46102d1b52fbb827fc29 |
|
28-Mar-2012 |
David Howells <dhowells@redhat.com> |
Disintegrate asm/system.h for S390 Disintegrate asm/system.h for S390. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-s390@vger.kernel.org
|
2320c5793790fcda80e6dcc088dbda86040235e5 |
|
17-Feb-2012 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] incorrect PageTables counter for kvm page tables The page_table_free_pgste function is used for kvm processes to free page tables that have the pgste extension. It calls pgtable_page_ctor instead of pgtable_page_dtor which increases NR_PAGETABLE instead of decreasing it. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
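The effect of calling the constructor on the free path can be shown with a toy counter. This is a simplified model, not the kernel code: `nr_pagetable`, `table_ctor()`, `table_dtor()` and the three wrappers are hypothetical stand-ins for NR_PAGETABLE, pgtable_page_ctor, pgtable_page_dtor and the alloc/free paths.

```c
#include <assert.h>

/* Hypothetical stand-in for the NR_PAGETABLE statistic. */
static long nr_pagetable;

/* The constructor accounts for a new page table, the destructor undoes it. */
static void table_ctor(void) { nr_pagetable++; }
static void table_dtor(void) { assert(nr_pagetable > 0); nr_pagetable--; }

static void table_alloc(void)      { table_ctor(); }
static void table_free_buggy(void) { table_ctor(); } /* bug: counts up on free */
static void table_free_fixed(void) { table_dtor(); } /* fix: counts down on free */
```

In this model every alloc/free pair on the buggy path leaves the counter two too high, whereas the fixed path returns it to its starting value.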
|
14045ebf1e1156d966a796cacad91028e01797e5 |
|
27-Dec-2011 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] add support for physical memory > 4TB The kernel address space of a 64 bit kernel currently uses a three level page table and the vmemmap array has a fixed address and a fixed maximum size. A three level page table is good enough for systems with less than 3.8TB of memory, for bigger systems four page table levels need to be used. Each page table level costs a bit of performance, use 3 levels for normal systems and 4 levels only for the really big systems. To avoid bloating sparse.o too much set MAX_PHYSMEM_BITS to 46 for a maximum of 64TB of memory. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
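The 64TB figure follows directly from MAX_PHYSMEM_BITS: 2^46 bytes is 64 TB. The sketch below does that arithmetic; the helper name `addressable_tb()` is hypothetical, and the 42 bit figure used in the note after it is an assumption about what a three-level table spans, consistent with the "less than 3.8TB" limit quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PHYSMEM_BITS 46 /* value named in the commit message */

/* Number of terabytes (2^40 bytes) addressable with the given bit count. */
static uint64_t addressable_tb(unsigned int bits)
{
    assert(bits >= 40 && bits < 64);
    return (UINT64_C(1) << bits) >> 40;
}
```

`addressable_tb(MAX_PHYSMEM_BITS)` gives 64, and `addressable_tb(42)` gives 4, leaving a little under 4 TB for memory once fixed areas such as the vmemmap array are subtracted.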
|
c86cce2a20207cbf2b3dfe97c985a1f5aa5d3798 |
|
27-Dec-2011 |
Christian Borntraeger <borntraeger@de.ibm.com> |
[S390] kvm: fix sleeping function ... at mm/page_alloc.c:2260 commit cc772456ac9b460693492b3a3d89e8c81eda5874 [S390] fix list corruption in gmap reverse mapping added a potential dead lock: BUG: sleeping function called from invalid context at mm/page_alloc.c:2260 in_atomic(): 1, irqs_disabled(): 0, pid: 1108, name: qemu-system-s39 3 locks held by qemu-system-s39/1108: #0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm] #1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298 #2: (&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298 CPU: 0 Not tainted 3.1.3 #45 Process qemu-system-s39 (pid: 1108, task: 00000004f8b3cb30, ksp: 00000004fd5978d0) 00000004fd5979a0 00000004fd597920 0000000000000002 0000000000000000 00000004fd5979c0 00000004fd597938 00000004fd597938 0000000000617e96 0000000000000000 00000004f8b3cf58 0000000000000000 0000000000000000 000000000000000d 000000000000000c 00000004fd597988 0000000000000000 0000000000000000 0000000000100a18 00000004fd597920 00000004fd597960 Call Trace: ([<0000000000100926>] show_trace+0xee/0x144) [<0000000000131f3a>] __might_sleep+0x12a/0x158 [<0000000000217fb4>] __alloc_pages_nodemask+0x224/0xadc [<0000000000123086>] gmap_alloc_table+0x46/0x114 [<000000000012395c>] gmap_map_segment+0x268/0x298 [<000003e00486b014>] kvm_arch_commit_memory_region+0x44/0x6c [kvm] [<000003e004866414>] __kvm_set_memory_region+0x3b0/0x4a4 [kvm] [<000003e004866554>] kvm_set_memory_region+0x4c/0x6c [kvm] [<000003e004867c7a>] kvm_vm_ioctl+0x14a/0x314 [kvm] [<0000000000292100>] do_vfs_ioctl+0x94/0x588 [<0000000000292688>] SyS_ioctl+0x94/0xac [<000000000061e124>] sysc_noemu+0x22/0x28 [<000003fffcd5e7ca>] 0x3fffcd5e7ca 3 locks held by qemu-system-s39/1108: #0: (&kvm->slots_lock){+.+.+.}, at: [<000003e004866542>] kvm_set_memory_region+0x3a/0x6c [kvm] #1: (&mm->mmap_sem){++++++}, at: [<0000000000123790>] gmap_map_segment+0x9c/0x298 #2: 
(&(&mm->page_table_lock)->rlock){+.+.+.}, at: [<00000000001237a8>] gmap_map_segment+0xb4/0x298 Fix this by freeing the lock on the alloc path. This is ok, since the gmap table is never freed until we call gmap_free, so the table we are walking cannot go. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
388186bc920d9200202e4d25de66fa95b1b8fc68 |
|
30-Oct-2011 |
Christian Borntraeger <borntraeger@de.ibm.com> |
[S390] kvm: Handle diagnose 0x10 (release pages) Linux on System z uses a ballooner based on diagnose 0x10 (also known as collaborative memory management). This patch implements diagnose 0x10 on the guest address space. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
499069e1a421e2a85e76846c3237f00f1a5cb435 |
|
30-Oct-2011 |
Carsten Otte <cotte@de.ibm.com> |
[S390] take mmap_sem when walking guest page table gmap_fault needs to walk the guest page table. However, parts of that may change if some other thread does munmap. In that case gmap_unmap_notifier will also unmap the corresponding parts from the guest page table. We need to take mmap_sem in order to serialize these operations. do_exception now calls __gmap_fault with mmap_sem held which does not get exported to modules. The exported function, which is called from KVM, now takes mmap_sem. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
cc772456ac9b460693492b3a3d89e8c81eda5874 |
|
30-Oct-2011 |
Carsten Otte <cotte@de.ibm.com> |
[S390] fix list corruption in gmap reverse mapping This introduces locking via mm->page_table_lock to protect the rmap list for guest mappings from being corrupted by concurrent operations. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
a9162f238a84ee05b09ea4b0ebd97fb20448c28c |
|
30-Oct-2011 |
Carsten Otte <cotte@de.ibm.com> |
[S390] fix possible deadlock in gmap_map_segment Fix possible deadlock reported by lockdep: qemu-system-s39/2963 is trying to acquire lock: (&mm->mmap_sem){++++++}, at: gmap_alloc_table+0x9c/0x120 but task is already holding lock: (&mm->mmap_sem){++++++}, at: gmap_map_segment+0xa6/0x27c Actually gmap_alloc_table is only called from gmap_map_segment with mmap_sem held, thus it's safe to simply remove the inner lock. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
e73b7fffe487c315fd1a4fa22282e3362b440a06 |
|
30-Oct-2011 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] memory leak with RCU_TABLE_FREE The rcu page table free code uses a couple of bits in the page table pointer passed to tlb_remove_table to discern the different page table types. __tlb_remove_table extracts the type with an incorrect mask which leads to memory leaks. The correct mask is ((FRAG_MASK << 4) | FRAG_MASK). Cc: stable@kernel.org Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
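The idea of carrying a type in the low bits of a table pointer, and recovering it with the mask named above, can be sketched as follows. This is an illustration only: the value chosen for `FRAG_MASK` is an assumption (the real value depends on the kernel configuration), and `tag_table()`, `table_type()` and `table_addr()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: FRAG_MASK is 0x03 for this sketch. */
#define FRAG_MASK 0x03u
/* The mask the commit message names as correct. */
#define TYPE_MASK ((FRAG_MASK << 4) | FRAG_MASK)

/* Store a type in the low bits of a sufficiently aligned table address. */
static uintptr_t tag_table(uintptr_t table, unsigned int type)
{
    assert((table & TYPE_MASK) == 0); /* address bits must not collide */
    assert((type & ~TYPE_MASK) == 0); /* type must fit in the mask */
    return table | type;
}

/* Recover the type bits from a tagged value. */
static unsigned int table_type(uintptr_t tagged)
{
    return (unsigned int)(tagged & TYPE_MASK);
}

/* Recover the table address by clearing the type bits. */
static uintptr_t table_addr(uintptr_t tagged)
{
    return tagged & ~(uintptr_t)TYPE_MASK;
}
```

A type stored in the upper nibble, such as 0x20, is invisible to a narrower mask like `FRAG_MASK` alone (used here purely as a hypothetical example of a wrong mask), so a table tagged that way would be misclassified on the free path.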
|
05873df981ca1dd32f398e7b4e19864de907e064 |
|
26-Sep-2011 |
Carsten Otte <cotte@de.ibm.com> |
[S390] gmap: always up mmap_sem properly If gmap_unmap_segment figures that the segment was not mapped in the first place, it needs to up mmap_sem on exit. Cc: <stable@kernel.org> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
480e5926ce3bb61ec229be2dab08bdce8abb8d2e |
|
20-Sep-2011 |
Christian Borntraeger <borntraeger@de.ibm.com> |
[S390] kvm: fix address mode switching 598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address spaces for kvm guest images) changed kvm to use a separate address space for kvm guests. This address space was switched in __vcpu_run. In some cases (preemption, page fault) there is the possibility that this address space switch is lost. The typical symptom was a huge amount of validity intercepts or random guest addressing exceptions. Fix this by doing the switch in sie_loop and sie_exit and saving the address space in the gmap structure itself. Also use the preempt notifier. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
944291de33b26a8b403f13f5eb0cc51fb982aa1e |
|
03-Aug-2011 |
Jan Glauber <jang@linux.vnet.ibm.com> |
[S390] missing return in page_table_alloc_pgste Fix the following compile warning for !CONFIG_PGSTE: CC arch/s390/mm/pgtable.o arch/s390/mm/pgtable.c: In function ‘page_table_alloc_pgste’: arch/s390/mm/pgtable.c:531:1: warning: no return statement in function returning non-void [-Wreturn-type] Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
e5992f2e6c3829cd43dbc4438ee13dcd6506f7f3 |
|
24-Jul-2011 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] kvm guest address space mapping Add code that allows KVM to control the virtual memory layout that is seen by a guest. The guest address space uses a second page table that shares the last level pte-tables with the process page table. If a page is unmapped from the process page table it is automatically unmapped from the guest page table as well. The guest address space mapping starts out empty, KVM can map any individual 1MB segments from the process virtual memory to any 1MB aligned location in the guest virtual memory. If a target segment in the process virtual memory does not exist or is unmapped while a guest mapping exists the desired target address is stored as an invalid segment table entry in the guest page table. The population of the guest page table is fault driven. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
36409f6353fc2d7b6516e631415f938eadd92ffa |
|
06-Jun-2011 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] use generic RCU page-table freeing code Replace the s390 specific rcu page-table freeing code with the generic variant. This requires duplicating the definition of struct mmu_table_batch, as s390 does not use the generic tlb flush code. While we are at it, remove the restriction that page table fragments cannot be reused after a single fragment has been freed with rcu, and split out allocation and freeing of page tables with pgstes. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
3c5cffb66d8ea94832650fcb55194715b0229088 |
|
29-May-2011 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
[S390] mm: fix mmu_gather rework Quite a few functions that get called from the tlb gather code require that preemption must be disabled. So disable preemption inside of the called functions instead. The only drawback is that rcu_table_freelist_finish() doesn't necessarily get called on the cpu(s) that filled the free lists. So we may see a delay until we finally see an rcu callback. However over time this shouldn't matter. So we get rid of lots of "BUG: using smp_processor_id() in preemptible" messages. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
|
1c395176962176660bb108f90e97e1686cfe0d85 |
|
25-May-2011 |
Peter Zijlstra <a.p.zijlstra@chello.nl> |
mm: now that all old mmu_gather code is gone, remove the storage Fold all the mmu_gather rework patches into one for submission Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reported-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
043d07084b5347a26eab0a07aa13a4a929ad9e71 |
|
23-May-2011 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] Remove data execution protection The noexec support on s390 does not rely on a bit in the page table entry but utilizes the secondary space mode to distinguish between memory accesses for instructions vs. data. The noexec code relies on the assumption that the cpu will always use the secondary space page table for data accesses while it is running in the secondary space mode. Up to the z9-109 class machines this has been the case. Unfortunately this is not true anymore with z10 and later machines. The load-relative-long instructions lrl, lgrl and lgfrl access the memory operand using the same addressing-space mode that has been used to fetch the instruction. This breaks the noexec mode for all user space binaries compiled with march=z10 or later. The only option is to remove the current noexec support. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
f1be77bb21120b5306b56d6854db1f8eb5c3678b |
|
31-Jan-2011 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] pgtable_list corruption After page_table_free_rcu removed a page from the pgtable_list page_table_free better not add it again. Otherwise a page_table_alloc can reuse a page table fragment that is still in the rcu process. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
e05ef9bdb899e2f3798be74691842fc597d8ce60 |
|
25-Oct-2010 |
Christian Borntraeger <borntraeger@de.ibm.com> |
[S390] kvm: Fix badness at include/asm/mmu_context.h:83 commit 050eef364ad700590a605a0749f825cab4834b1e [S390] fix tlb flushing vs. concurrent /proc accesses broke KVM on s390x. On every schedule a Badness at include/asm/mmu_context.h:83 appears. s390_enable_sie replaces the mm on the __running__ task, therefore, we have to increase the attach count of the new mm. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
80217147a3d80c8a4e48f06e2f6e965455f3fe2a |
|
25-Oct-2010 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] lockless get_user_pages_fast() Implement get_user_pages_fast without locking in the fastpath on s390. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
5a0e3ad6af8660be21ca98a971cd00f331318c05 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities so that they include those headers directly instead of assuming availability. As this conversion needs to touch a large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the following. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. 
This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
b11b53342773361f3353b285eb6a3fd6074e7997 |
|
07-Dec-2009 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] Improve address space mode selection. Introduce user_mode to replace the two variables switch_amode and s390_noexec. There are three valid combinations of the old values: 1) switch_amode == 0 && s390_noexec == 0 2) switch_amode == 1 && s390_noexec == 0 3) switch_amode == 1 && s390_noexec == 1 They get replaced by 1) user_mode == HOME_SPACE_MODE 2) user_mode == PRIMARY_SPACE_MODE 3) user_mode == SECONDARY_SPACE_MODE The new kernel parameter user_mode=[primary,secondary,home] lets you choose the address space mode the user space processes should use. In addition the CONFIG_S390_SWITCH_AMODE config option is removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
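The replacement of the two old variables by user_mode, as described in the entry above, can be sketched as a small mapping function. This is only an illustration: the function name, the enum values and the -1 return for the combination the commit message does not list (s390_noexec set without switch_amode) are assumptions, not the kernel's actual code.

```c
#include <assert.h>

/* Mode names taken from the commit message; the numeric values are illustrative. */
enum user_mode {
    HOME_SPACE_MODE,
    PRIMARY_SPACE_MODE,
    SECONDARY_SPACE_MODE,
};

/*
 * Hypothetical helper: translate the old (switch_amode, s390_noexec) pair
 * into the new user_mode value. Returns -1 for the pair that is not one of
 * the three valid combinations (noexec without switch_amode).
 */
static int user_mode_from_legacy(int switch_amode, int s390_noexec)
{
    if (!switch_amode)
        return s390_noexec ? -1 : HOME_SPACE_MODE;
    return s390_noexec ? SECONDARY_SPACE_MODE : PRIMARY_SPACE_MODE;
}

/* Usage: the three valid combinations listed in the commit message. */
void user_mode_self_check(void)
{
    assert(user_mode_from_legacy(0, 0) == HOME_SPACE_MODE);
    assert(user_mode_from_legacy(1, 0) == PRIMARY_SPACE_MODE);
    assert(user_mode_from_legacy(1, 1) == SECONDARY_SPACE_MODE);
}
```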
52a21f2cee108ea1c8abc4fdaf64a66f21af26db |
|
06-Oct-2009 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] fix build breakage with CONFIG_AIO=n next-20090925 randconfig build breaks on s390x, with CONFIG_AIO=n. arch/s390/mm/pgtable.c: In function 's390_enable_sie': arch/s390/mm/pgtable.c:282: error: 'struct mm_struct' has no member named 'ioctx_list' arch/s390/mm/pgtable.c:298: error: 'struct mm_struct' has no member named 'ioctx_list' make[1]: *** [arch/s390/mm/pgtable.o] Error 1 Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
87458ff4582953d6b3bf45edeac8582849552e69 |
|
22-Sep-2009 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
[S390] Change kernel_page_present coding style. Make the inline assembly look like all others. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
50aa98bad056a17655864a4d71ebc32d95c629a7 |
|
11-Sep-2009 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] fix recursive locking on page_table_lock Suzuki Poulose reported the following recursive locking bug on s390: Here is the stack trace: (see Appendix I for more info) [<0000000000406ed6>] _spin_lock+0x52/0x94 [<0000000000103bde>] crst_table_free+0x14e/0x1a4 [<00000000001ba684>] __pmd_alloc+0x114/0x1ec [<00000000001be8d0>] handle_mm_fault+0x2cc/0xb80 [<0000000000407d62>] do_dat_exception+0x2b6/0x3a0 [<0000000000114f8c>] sysc_return+0x0/0x8 [<00000200001642b2>] 0x200001642b2 The page_table_lock is already acquired in __pmd_alloc (mm/memory.c) and it tries to populate the pud/pgd with the newly allocated pmd. If another thread populates it before we get a chance, we free the pmd using pmd_free(). On s390x, pmd_free() (and even pud_free()) is #defined to crst_table_free(), which acquires the page_table_lock to protect the crst_table index updates. Hence this ends up as recursive locking of the page_table_lock. The solution suggested by Dave Hansen is to use a new spin lock in the mmu context to protect the access to the crst_list and the pgtable_list. Reported-by: Suzuki Poulose <suzuki@in.ibm.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
7db11a363fc41cec170a94a3542031e5e64bb333 |
|
16-Jun-2009 |
Hans-Joachim Picht <hans@linux.vnet.ibm.com> |
[S390] pm: add kernel_page_present Fix the following build failure caused by make allyesconfig using CONFIG_HIBERNATION and CONFIG_DEBUG_PAGEALLOC kernel/built-in.o: In function `saveable_page': kernel/power/snapshot.c:897: undefined reference to `kernel_page_present' kernel/built-in.o: In function `safe_copy_page': kernel/power/snapshot.c:948: undefined reference to `kernel_page_present' make: *** [.tmp_vmlinux1] Error 1 Signed-off-by: Hans-Joachim Picht <hans@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
239a64255fae8933d95273b5b92545949ca4e743 |
|
12-Jun-2009 |
Heiko Carstens <heiko.carstens@de.ibm.com> |
[S390] vmalloc: add vmalloc kernel parameter support With the kernel parameter 'vmalloc=<size>' the size of the vmalloc area can be specified. This can be used to increase or decrease the size of the area. Works in the same way as on some other architectures. This can be useful for features which make excessive use of vmalloc and wouldn't work otherwise. The default sizes remain unchanged: 96MB for 31 bit kernels and 1GB for 64 bit kernels. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
005f8eee6f3c8173e492d7bd4d51bda990eb468b |
|
26-Mar-2009 |
Rusty Russell <rusty@rustcorp.com.au> |
[S390] cpumask: use mm_cpumask() wrapper Makes code futureproof against the impending change to mm->cpu_vm_mask. It's also a chance to use the new cpumask_ ops which take a pointer (the older ones are deprecated, but there's no hurry for arch code). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
702d9e584feb028ed7e2a6d2b103b8ea57622ff2 |
|
26-Mar-2009 |
Carsten Otte <cotte@de.ibm.com> |
[S390] check addressing mode in s390_enable_sie The sie instruction requires address spaces to be switched to run properly. This patch verifies that this is the case in s390_enable_sie, otherwise the kernel would crash badly as soon as the process runs into sie. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
f481bfafd36e621d6cbc62d4b25f74811410aef7 |
|
18-Mar-2009 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] make page table walking more robust Make page table walking on s390 more robust. The current code requires that the pgd/pud/pmd/pte loop is only done for address ranges that are below the end address of the last vma of the address space. But this is not always true, e.g. the generic page table walker does not guarantee this. Change TASK_SIZE/TASK_SIZE_OF to reflect the current size of the address space. This makes the generic page table walker happy but it breaks the upgrade of a 3 level page table to a 4 level page table. To make the upgrade work again another fix is required. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
abf137dd7712132ee56d5b3143c2ff61a72a5faa |
|
09-Dec-2008 |
Jens Axboe <jens.axboe@oracle.com> |
aio: make the lookup_ioctx() lockless The mm->ioctx_list is currently protected by a reader-writer lock, so we always grab that lock on the read side for doing ioctx lookups. As the workload is extremely reader biased, turn this into an rcu hlist so we can make lookup_ioctx() lockless. Get rid of the rwlock and use a spinlock for providing update side exclusion. There's usually only 1 entry on this list, so it doesn't make sense to look into fancier data structures. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
|
250cf776f74b5932a1977d0489cae9206e2351dd |
|
28-Oct-2008 |
Christian Borntraeger <borntraeger@de.ibm.com> |
[S390] pgtables: Fix race in enable_sie vs. page table ops The current enable_sie code sets the mm->context.pgstes bit to tell dup_mm that the new mm should have extended page tables. This bit is also used by the s390 specific page table primitives to decide about the page table layout - which means context.pgstes has two meanings. This can cause all kinds of bugs. For example, shrink_zone can call ptep_clear_flush_young while enable_sie is running. ptep_clear_flush_young will test for context.pgstes. Since enable_sie changed that value of the old struct mm without changing the page table layout, ptep_clear_flush_young will do the wrong thing. The solution is to split pgstes into two bits - one for the allocation - one for the current state. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
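The split into two bits described in the entry above can be modelled in a few lines. This is a simplified sketch, not the kernel's data structures: the struct, its field names and both helpers are hypothetical, chosen only to show why a request bit and a state bit must be kept apart.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm context; field names are assumptions. */
struct mm_ctx_sketch {
    bool alloc_pgste; /* request: an mm duplicated from this one gets pgstes */
    bool has_pgste;   /* state: this mm's page tables actually carry pgstes */
};

/* Models a dup_mm-like copy: the new mm's layout follows the request bit. */
static struct mm_ctx_sketch dup_ctx_sketch(const struct mm_ctx_sketch *old)
{
    struct mm_ctx_sketch new_ctx = {
        .alloc_pgste = old->alloc_pgste,
        .has_pgste = old->alloc_pgste,
    };
    return new_ctx;
}

/* Page table primitives consult only the state bit, never the request bit. */
static bool primitive_uses_pgste(const struct mm_ctx_sketch *ctx)
{
    return ctx->has_pgste;
}

/* Usage: requesting pgstes does not change how the old mm is interpreted. */
void pgste_split_self_check(void)
{
    struct mm_ctx_sketch old_ctx = { .alloc_pgste = false, .has_pgste = false };

    old_ctx.alloc_pgste = true;               /* enable_sie sets the request */
    assert(!primitive_uses_pgste(&old_ctx));  /* old layout is unchanged */

    struct mm_ctx_sketch new_ctx = dup_ctx_sketch(&old_ctx);
    assert(primitive_uses_pgste(&new_ctx));   /* new mm has extended tables */
}
```

With a single shared bit, the first assertion would fail: a concurrent page table primitive would treat the old, unextended page tables as if they already carried pgstes.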
74b6b522ec83f9c44fc7743f2adcb24664aa8f45 |
|
21-May-2008 |
Christian Borntraeger <borntraeger@de.ibm.com> |
KVM: s390: fix locking order problem in enable_sie There are two potential locking problems in enable_sie. We take the task_lock and the mmap_sem. As exit_mm takes the same locks in the opposite order, this triggers a lockdep warning. The second problem is that dup_mm and mmput might sleep, so we must not hold the task_lock at that moment. The solution is to dup the mm unconditionally and use the task_lock before and afterwards to check if we can use the new mm. dup_mm and mmput are called outside the task_lock, but we run update_mm while holding the task_lock, protecting us against ptrace. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
|
402b08622d9ac6e32e25289573272e0f21bb58a7 |
|
25-Mar-2008 |
Carsten Otte <cotte@de.ibm.com> |
s390: KVM preparation: provide hook to enable pgstes in user pagetable The SIE instruction on s390 uses the 2nd half of the page table page to virtualize the storage keys of a guest. This patch offers the s390_enable_sie function, which reorganizes the page tables of a single-threaded process to reserve space in the page table: s390_enable_sie makes sure that the process is single threaded and then uses dup_mm to create a new mm with reorganized page tables. The old mm is freed and the process has now a page status extended field after every page table. Code that wants to exploit pgstes should SELECT CONFIG_PGSTE. This patch has a small common code hit, namely making dup_mm non-static. Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's review feedback. Now we do have the prototype for dup_mm in include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now call task_lock() to prevent race against ptrace modification of mm_users. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Avi Kivity <avi@qumranet.com>
|
6252d702c5311ce916caf75ed82e5c8245171c92 |
|
09-Feb-2008 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] dynamic page tables. Add support for different numbers of page table levels depending on the highest address used for a process. This will cause a 31 bit process to use a two level page table instead of the four level page table that is the default after the pud has been introduced. Likewise a normal 64 bit process will use three levels instead of four. Only if a process runs out of the 4 terabytes which can be addressed with a three level page table is the fourth level dynamically added. Then the process can use up to 8 petabytes. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
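The relationship between the highest address and the number of levels in the entry above can be sketched as follows. The 4 terabyte and 8 petabyte limits come from the commit message; the 2 gigabyte limit for two levels is assumed here from the size of a 31 bit address space. The function name and the -1 return value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TWO_LEVEL_LIMIT   (1ULL << 31) /* 2 GB, a 31 bit address space (assumed) */
#define THREE_LEVEL_LIMIT (1ULL << 42) /* 4 TB, per the commit message */
#define FOUR_LEVEL_LIMIT  (1ULL << 53) /* 8 PB, per the commit message */

/*
 * Hypothetical helper: smallest number of page table levels that can reach
 * highest_addr, or -1 if the address is beyond what four levels can map.
 */
static int levels_for_address(uint64_t highest_addr)
{
    if (highest_addr < TWO_LEVEL_LIMIT)
        return 2;
    if (highest_addr < THREE_LEVEL_LIMIT)
        return 3;
    if (highest_addr < FOUR_LEVEL_LIMIT)
        return 4;
    return -1;
}

/* Usage: crossing the 4 TB boundary is what triggers the fourth level. */
void levels_self_check(void)
{
    assert(levels_for_address(THREE_LEVEL_LIMIT - 1) == 3);
    assert(levels_for_address(THREE_LEVEL_LIMIT) == 4);
}
```

Note that this computes the minimum needed to reach an address; per the commit message, a normal 64 bit process starts with three levels regardless of how low its highest address is.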
146e4b3c8b92071b18f0b2e6f47165bad4f9e825 |
|
09-Feb-2008 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] 1K/2K page table pages. This patch implements 1K/2K page table pages for s390. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|
2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 |
|
08-Feb-2008 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
CONFIG_HIGHPTE vs. sub-page page tables. Background: I've implemented 1K/2K page tables for s390. These sub-page page tables are required to properly support the s390 virtualization instruction with KVM. The SIE instruction requires that the page tables have 256 page table entries (pte) followed by 256 page status table entries (pgste). The pgstes are only required if the process is using the SIE instruction. The pgstes are updated by the hardware and by the hypervisor for a number of reasons, one of them is dirty and reference bit tracking. To avoid wasting memory the standard pte table allocation should return 1K/2K (31/64 bit) and 2K/4K if the process is using SIE. Problem: Page size on s390 is 4K, page table size is 1K or 2K. That means the s390 version of pte_alloc_one cannot return a pointer to a struct page. Trouble is that with the CONFIG_HIGHPTE feature on x86 pte_alloc_one cannot return a pointer to a pte either, since that would require more than 32 bits for the return value of pte_alloc_one (and the pte * would not be accessible since it is not kmapped). Solution: The only solution I found to this dilemma is a new typedef: a pgtable_t. For s390 pgtable_t will be a (pte *) - to be introduced with a later patch. For everybody else it will be a (struct page *). The additional problem with the initialization of the ptl lock and the NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and a destructor pgtable_page_dtor. The page table allocation and free functions need to call these two whenever a page table page is allocated or freed. pmd_populate will get a pgtable_t instead of a struct page pointer. To get the pgtable_t back from a pmd entry that has been installed with pmd_populate a new function pmd_pgtable is added. It replaces the pmd_page call in free_pte_range and apply_to_pte_range.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
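The 1K/2K and 2K/4K figures in the entry above follow from simple arithmetic, sketched here. The 4 and 8 byte entry sizes are assumptions inferred from those figures (1K / 256 and 2K / 256); the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PTRS_PER_TABLE 256u   /* 256 ptes per page table, per the commit message */
#define S390_PAGE_SIZE 4096u  /* 4K pages, per the commit message */

/*
 * Hypothetical helper: bytes needed for one page table. entry_size is 4 on
 * 31 bit and 8 on 64 bit (assumed); a process using SIE needs 256 pgstes
 * after the 256 ptes, which doubles the size.
 */
static size_t page_table_bytes(size_t entry_size, int uses_sie)
{
    size_t bytes = PTRS_PER_TABLE * entry_size;

    return uses_sie ? 2 * bytes : bytes;
}

/* How many such page tables fit into one 4K page. */
static size_t tables_per_page(size_t entry_size, int uses_sie)
{
    return S390_PAGE_SIZE / page_table_bytes(entry_size, uses_sie);
}

/* Usage: a 64 bit process without SIE packs two 2K tables into one page. */
void page_table_size_self_check(void)
{
    assert(page_table_bytes(8, 0) == 2048);
    assert(tables_per_page(8, 0) == 2);
}
```

Because more than one page table can share a 4K page, a struct page pointer no longer identifies a single page table, which is the motivation the commit message gives for introducing pgtable_t.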
3610cce87af0693603db171d5b6f6735f5e3dc5b |
|
22-Oct-2007 |
Martin Schwidefsky <schwidefsky@de.ibm.com> |
[S390] Cleanup page table definitions. - De-confuse the defines for the address-space-control-elements and the segment/region table entries. - Create out of line functions for page table allocation / freeing. - Simplify get_shadow_xxx functions. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
|