371528caec553785c37f73fa3926ea0de84f986f |
|
24-Feb-2012 |
Anton Vorontsov <anton.vorontsov@linaro.org> |
mm: memcg: Correct unregistring of events attached to the same eventfd There is an issue when memcg unregisters events that were attached to the same eventfd: - On the first call mem_cgroup_usage_unregister_event() removes all events attached to a given eventfd, and if there were no events left, thresholds->primary would become NULL; - Since there were several events registered, cgroups core will call mem_cgroup_usage_unregister_event() again, but now kernel will oops, as the function doesn't expect that threshold->primary may be NULL. That's a good question whether mem_cgroup_usage_unregister_event() should actually remove all events in one go, but nowadays it can't do any better as cftype->unregister_event callback doesn't pass any private event-associated cookie. So, let's fix the issue by simply checking for threshold->primary. FWIW, w/o the patch the following oops may be observed: BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0 Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs RIP: 0010:[<ffffffff810be32c>] [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0 RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246 Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0) Call Trace: [<ffffffff8107092b>] cgroup_event_remove+0x2b/0x60 [<ffffffff8103db94>] process_one_work+0x174/0x450 [<ffffffff8103e413>] worker_thread+0x123/0x2d0 Cc: stable <stable@vger.kernel.org> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
82b3f2a7171731cce62f25058d25afb91a14710c |
|
04-Feb-2012 |
Andrew Morton <akpm@linux-foundation.org> |
mm/memcontrol.c: fix warning with CONFIG_NUMA=n mm/memcontrol.c: In function 'memcg_check_events': mm/memcontrol.c:779: warning: unused variable 'do_numainfo' Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
701b259f446be2f3625fb852bceb93afe76e206d |
|
25-Jan-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Davem says: 1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet. 2) tg3 header length computation correction from Eric Dumazet. 3) More build and reference counting fixes for socket memory cgroup code from Glauber Costa. 4) module.h snuck back into a core header after all the hard work we did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer. 5) Fix PHY naming regression and add some new PCI IDs in stmmac, from Alessandro Rubini. 6) Netlink message generation fix in new team driver, should only advertise the entries that changed during events, from Jiri Pirko. 7) SRIOV VF registration and unregistration fixes, and also add a missing PCI ID, from Roopa Prabhu. 8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka. 9) ftgmac100/ftmac100 build fix, missing interrupt.h include. 10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun. 11) Off by one fix in netem packet scheduler, from Vijay Subramanian. 12) TCP loss detection fix from Yuchung Cheng. 13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu. 14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger. 15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC congestion control modules, fix from Neal Cardwell. 16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław. 17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri. 18) Statistics bug fixes in mlx4 from Eugenia Emantayev. 19) rds locking bug fix during info dumps, from your's truly. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits) rds: Make rds_sock_lock BH rather than IRQ safe. netprio_cgroup.h: dont include module.h from other includes net: flow_dissector.c missing include linux/export.h team: send only changed options/ports via netlink net/hyperv: fix possible memory leak in do_set_multicast() drivers/net: dsa/mv88e6xxx.c files need linux/module.h stmmac: added PCI identifiers llc: Fix race condition in llc_ui_recvmsg stmmac: fix phy naming inconsistency dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches. tg3: fix ipv6 header length computation skge: add byte queue limit support mv643xx_eth: Add Rx Discard and Rx Overrun statistics bnx2x: fix compilation error with SOE in fw_dump bnx2x: handle CHIP_REVISION during init_one bnx2x: allow user to change ring size in ISCSI SD mode bnx2x: fix Big-Endianess in ethtool -t bnx2x: fixed ethtool statistics for MF modes bnx2x: credit-leakage fixup on vlan_mac_del_all macvlan: fix a possible use after free ...
|
6568d4a9c9ff16d6c4f0b14dfea567806ce579e4 |
|
20-Jan-2012 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: memcg: update the correct soft limit tree during migration end_migration() passes the old page instead of the new page to commit the charge. This page descriptor is not used for committing itself, though, since we also pass the (correct) page_cgroup descriptor. But it's used to find the soft limit tree through the page's zone, so the soft limit tree of the old page's zone is updated instead of that of the new page's, which might get slightly out of date until the next charge reaches the ratelimit point. This glitch has been present since 5564e88 ("memcg: condense page_cgroup-to-page lookup points"). This fixes a bug that I introduced in 2.6.38. It's benign enough (to my knowledge) that we probably don't want this for stable. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
376be5ff8a6a36efadd131860cf26841f366d44c |
|
20-Jan-2012 |
Glauber Costa <glommer@parallels.com> |
net: fix socket memcg build with !CONFIG_NET There is still a build bug with the sock memcg code, that triggers with !CONFIG_NET, that survived my series of randconfig builds. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Randy Dunlap <rdunlap@xenotime.net> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
319d3b9c97b5e3191e419bb95496bf08ec50f096 |
|
15-Jan-2012 |
Glauber Costa <glommer@parallels.com> |
net: move sock_update_memcg outside of CONFIG_INET Although only used currently for tcp sockets, this function is now used in common sock code (for sock_clone()) Commit 475f1b52645a29936b9df1d8fcd45f7e56bd4a9f moved the declaration of sock_update_clone() to inside sock.c, but this only fixes the problem when CONFIG_CGROUP_MEM_RES_CTLR_KMEM is also not defined. This patch here is verified to fix both problems, although reverting the previous one is not necessary. Signed-off-by: Glauber Costa <glommer@parallels.com> CC: David S. Miller <davem@davemloft.net> CC: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
90b3feaec8ffb167abd8903bf111605c2f035aa8 |
|
13-Jan-2012 |
Hugh Dickins <hughd@google.com> |
memcg: fix mem_cgroup_print_bad_page If DEBUG_VM, mem_cgroup_print_bad_page() is called whenever bad_page() shows a "Bad page state" message, removes page from circulation, adds a taint and continues. This is at a very low level, often when a spinlock is held (sometimes when page table lock is held, for example). We want to recover from this badness, not make it worse: we must not kmalloc memory here, we must not do a cgroup path lookup via dubious pointers. No doubt that code was useful to debug a particular case at one time, and may be again, but take it out of the mainline kernel. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
12d27107867fc7216e8faaff0b894b0f162dcf75 |
|
13-Jan-2012 |
Hugh Dickins <hughd@google.com> |
memcg: fix split_huge_page_refcounts() This patch started off as a cleanup: __split_huge_page_refcounts() has to cope with two scenarios, when the hugepage being split is already on LRU, and when it is not; but why does it have to split that accounting across three different sites? Consolidate it in lru_add_page_tail(), handling evictable and unevictable alike, and use standard add_page_to_lru_list() when accounting is needed (when the head is not yet on LRU). But a recent regression in -next, I guess the removal of PageCgroupAcctLRU test from mem_cgroup_split_huge_fixup(), makes this now a necessary fix: under load, the MEM_CGROUP_ZSTAT count was wrapping to a huge number, messing up reclaim calculations and causing a freeze at rmdir of cgroup. Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that count - this has not been the only such incident. Document that lru_add_page_tail() is for Transparent HugePages by #ifdef around it. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3ed28fa1080c73747ce17f2025b28b062fb5aa7f |
|
13-Jan-2012 |
Bob Liu <lliubbo@gmail.com> |
memcg: cleanup for_each_node_state() We already have for_each_node(node) define in nodemask.h, better to use it. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
38c5d72f3ebe5ddd57d2f08dc035070fc6c9a287 |
|
13-Jan-2012 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: simplify LRU handling by new rule Now, at LRU handling, memory cgroup needs to do complicated works to see valid pc->mem_cgroup, which may be overwritten. This patch is for relaxing the protocol. This patch guarantees - when pc->mem_cgroup is overwritten, page must not be on LRU. By this, LRU routine can believe pc->mem_cgroup and don't need to check bits on pc->flags. This new rule may adds small overheads to swapin. But in most case, lru handling gets faster. After this patch, PCG_ACCT_LRU bit is obsolete and removed. [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree] [akpm@linux-foundation.org: clean up code comment] [hughd@google.com: fix NULL mem_cgroup_try_charge] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4e5f01c2b9b94321992acb09c35d34f5ee5bb274 |
|
13-Jan-2012 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: clear pc->mem_cgroup if necessary. This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup and reducing atomic ops/complexity in memcg LRU handling. In some cases, pages are added to lru before charge to memcg and pages are not classfied to memory cgroup at lru addtion. Now, the lru where the page should be added is determined a bit in page_cgroup->flags and pc->mem_cgroup. I'd like to remove the check of flag. To handle the case pc->mem_cgroup may contain stale pointers if pages are added to LRU before classification. This patch resets pc->mem_cgroup to root_mem_cgroup before lru additions. [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build] [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build] [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal] [hughd@google.com: stop oops in mem_cgroup_reset_owner()] [hughd@google.com: fix page migration to reset_owner] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
36b62ad539498d00c2d280a151abad5f7630fa73 |
|
13-Jan-2012 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: simplify corner case handling of LRU. This patch simplifies LRU handling of racy case (memcg+SwapCache). At charging, SwapCache tend to be on LRU already. So, before overwriting pc->mem_cgroup, the page must be removed from LRU and added to LRU later. This patch does spin_lock(zone->lru_lock); if (PageLRU(page)) remove from LRU overwrite pc->mem_cgroup if (PageLRU(page)) add to new LRU. spin_unlock(zone->lru_lock); And guarantee all pages are not on LRU at modifying pc->mem_cgroup. This patch also unfies lru handling of replace_page_cache() and swapin. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
dc67d50465f249bb357bf85b3ed1f642eb00130a |
|
13-Jan-2012 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: simplify page cache charging This patch is a clean up. No functional/logical changes. Because of commit ef6a3c6311 ("mm: add replace_page_cache_page() function") , FUSE uses replace_page_cache() instead of add_to_page_cache(). Then, mem_cgroup_cache_charge() is not called against FUSE's pages from splice. So now, mem_cgroup_cache_charge() gets pages that are not on the LRU with the exception of PageSwapCache pages. For checking, WARN_ON_ONCE(PageLRU(page)) is added. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
de077d222d5ca6108cab119a09593344c12100ab |
|
13-Jan-2012 |
David Rientjes <rientjes@google.com> |
oom, memcg: fix exclusion of memcg threads after they have detached their mm The oom killer relies on logic that identifies threads that have already been oom killed when scanning the tasklist and, if found, deferring until such threads have exited. This is done by checking for any candidate threads that have the TIF_MEMDIE bit set. For memcg ooms, candidate threads are first found by calling task_in_mem_cgroup() since the oom killer should not defer if there's an oom killed thread in another memcg. Unfortunately, task_in_mem_cgroup() excludes threads if they have detached their mm in the process of exiting so TIF_MEMDIE is never detected for such conditions. This is different for global, mempolicy, and cpuset oom conditions where a detached mm is only excluded after checking for TIF_MEMDIE and deferring, if necessary, in select_bad_process(). The fix is to return true if a task has a detached mm but is still in the memcg or its hierarchy that is currently oom. This will allow the oom killer to appropriately defer rather than kill unnecessarily or, in the worst case, panic the machine if nothing else is available to kill. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c3cecc683446ad54ca587d7123bd3ce94bd7b8e1 |
|
13-Jan-2012 |
Michal Hocko <mhocko@suse.cz> |
memcg: free entries in soft_limit_tree if allocation fails If we are not able to allocate tree nodes for all NUMA nodes then we should release those that were allocated. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9fb4b7cc0724f178d4b24a2a566ea1e7cb120b82 |
|
13-Jan-2012 |
Bob Liu <lliubbo@gmail.com> |
page_cgroup: add helper function to get swap_cgroup There are multiple places which need to get the swap_cgroup address, so add a helper function: static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent, struct swap_cgroup_ctrl **ctrl); to simplify the code. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
40f23a21a8501c1b2c65c50c19b516488ac31313 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: remove unneeded checks from uncharge_page() mem_cgroup_uncharge_page() is only called on either freshly allocated pages without page->mapping or on rmapped PageAnon() pages. There is no need to check for a page->mapping that is not an anon_vma. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7a0524cfc8f9f585471a31b1282a9ce4a1a7d444 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: remove unneeded checks from newpage_charge() All callsites pass in freshly allocated pages and a valid mm. As a result, all checks pertaining to the page's mapcount, page->mapping or the fallback to init_mm are unneeded. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cfa449461e67b60df986170eecb089831fa9e49a |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: lookup_page_cgroup (almost) never returns NULL Pages have their corresponding page_cgroup descriptors set up before they are used in userspace, and thus managed by a memory cgroup. The only time where lookup_page_cgroup() can return NULL is in the CONFIG_DEBUG_VM-only page sanity checking code that executes while feeding pages into the page allocator for the first time. Remove the NULL checks against lookup_page_cgroup() results from all callsites where we know that corresponding page_cgroup descriptors must be allocated, and add a comment to the callsite that actually does have to check the return value. [hughd@google.com: stop oops in mem_cgroup_update_page_stat()] Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0e574a932d2cab8eb3b02d21feb59f2c09154738 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: clean up fault accounting The fault accounting functions have a single, memcg-internal user, so they don't need to be global. In fact, their one-line bodies can be directly folded into the caller. And since faults happen one at a time, use this_cpu_inc() directly instead of this_cpu_add(foo, 1). Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
72835c86ca15d0126354b73d5f29ce9194931c9b |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: unify remaining mem_cont, mem, etc. variable names to memcg Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f53d7ce32e13dbd09573b176e6521a04c2c77803 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: shorten preempt-disabled section around event checks Only the ratelimit checks themselves have to run with preemption disabled, the resulting actions - checking for usage thresholds, updating the soft limit tree - can and should run with preemption enabled. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reported-by: Yong Zhang <yong.zhang0@gmail.com> Tested-by: Yong Zhang <yong.zhang0@gmail.com> Reported-by: Luis Henriques <henrix@camandro.org> Tested-by: Luis Henriques <henrix@camandro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e94c8a9cbce1aee4af9e1285802785481b7f93c5 |
|
13-Jan-2012 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: make mem_cgroup_split_huge_fixup() more efficient In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle page_cgroup modifcations. It takes move_lock_page_cgroup() and modifies page_cgroup and LRU accounting jobs and called HPAGE_PMD_SIZE - 1 times. But thinking again, - compound_lock() is held at move_accout...then, it's not necessary to take move_lock_page_cgroup(). - LRU is locked and all tail pages will go into the same LRU as head is now on. - page_cgroup is contiguous in huge page range. This patch fixes mem_cgroup_split_huge_fixup() as to be called once per hugepage and reduce costs for spliting. [akpm@linux-foundation.org: fix typo, per Michal] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
925b7673cce39116ce61e7a06683a4a0dad1e72a |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: make per-memcg LRU lists exclusive Now that all code that operated on global per-zone LRU lists is converted to operate on per-memory cgroup LRU lists instead, there is no reason to keep the double-LRU scheme around any longer. The pc->lru member is removed and page->lru is linked directly to the per-memory cgroup LRU lists, which removes two pointers from a descriptor that exists for every page frame in the system. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Ying Han <yinghan@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6290df545814990ca2663baf6e894669132d5f73 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: collect LRU list heads into struct lruvec Having a unified structure with a LRU list set for both global zones and per-memcg zones allows to keep that code simple which deals with LRU lists and does not care about the container itself. Once the per-memcg LRU lists directly link struct pages, the isolation function and all other list manipulations are shared between the memcg case and the global LRU case. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ad2b8e601099a23dffffb53f91c18d874fe98854 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty root_mem_cgroup, lacking a configurable limit, was never subject to limit reclaim, so the pages charged to it could be kept off its LRU lists. They would be found on the global per-zone LRU lists upon physical memory pressure and it made sense to avoid uselessly linking them to both lists. The global per-zone LRU lists are about to go away on memcg-enabled kernels, with all pages being exclusively linked to their respective per-memcg LRU lists. As a result, pages of the root_mem_cgroup must also be linked to its LRU lists again. This is purely about the LRU list, root_mem_cgroup is still not charged. The overhead is temporary until the double-LRU scheme is going away completely. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
5660048ccac8735d9bc0a46325a02e6a6518b5b2 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: move memcg hierarchy reclaim to generic reclaim code Memory cgroup limit reclaim and traditional global pressure reclaim will soon share the same code to reclaim from a hierarchical tree of memory cgroups. In preparation of this, move the two right next to each other in shrink_zone(). The mem_cgroup_hierarchical_reclaim() polymath is split into a soft limit reclaim function, which still does hierarchy walking on its own, and a limit (shrinking) reclaim function, which relies on generic reclaim code to walk the hierarchy. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
527a5ec9a53471d855291ba9f1fdf1dd4e12a184 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: per-priority per-zone hierarchy scan generations Memory cgroup limit reclaim currently picks one memory cgroup out of the target hierarchy, remembers it as the last scanned child, and reclaims all zones in it with decreasing priority levels. The new hierarchy reclaim code will pick memory cgroups from the same hierarchy concurrently from different zones and priority levels, it becomes necessary that hierarchy roots not only remember the last scanned child, but do so for each zone and priority level. Until now, we reclaimed memcgs like this: mem = mem_cgroup_iter(root) for each priority level: for each zone in zonelist: reclaim(mem, zone) But subsequent patches will move the memcg iteration inside the loop over the zones: for each priority level: for each zone in zonelist: mem = mem_cgroup_iter(root) reclaim(mem, zone) And to keep with the original scan order - memcg -> priority -> zone - the last scanned memcg has to be remembered per zone and per priority level. Furthermore, global reclaim will be switched to the hierarchy walk as well. Different from limit reclaim, which can just recheck the limit after some reclaim progress, its target is to scan all memcgs for the desired zone pages, proportional to the memcg size, and so reliably detecting a full hierarchy round-trip will become crucial. Currently, the code relies on one reclaimer encountering the same memcg twice, but that is error-prone with concurrent reclaimers. Instead, use a generation counter that is increased every time the child with the highest ID has been visited, so that reclaimers can stop when the generation changes. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9f3a0d0933de079665ec1b498947ffbf805b0018 |
|
13-Jan-2012 |
Johannes Weiner <jweiner@redhat.com> |
mm: memcg: consolidate hierarchy iteration primitives The memcg naturalization series: Memory control groups are currently bolted onto the side of traditional memory management in places where better integration would be preferrable. To reclaim memory, for example, memory control groups maintain their own LRU list and reclaim strategy aside from the global per-zone LRU list reclaim. But an extra list head for each existing page frame is expensive and maintaining it requires additional code. This patchset disables the global per-zone LRU lists on memory cgroup configurations and converts all its users to operate on the per-memory cgroup lists instead. As LRU pages are then exclusively on one list, this saves two list pointers for each page frame in the system: page_cgroup array size with 4G physical memory vanilla: allocated 31457280 bytes of page_cgroup patched: allocated 15728640 bytes of page_cgroup At the same time, system performance for various workloads is unaffected: 100G sparse file cat, 4G physical memory, 10 runs, to test for code bloat in the traditional LRU handling and kswapd & direct reclaim paths, without/with the memory controller configured in vanilla: 71.603(0.207) seconds patched: 71.640(0.156) seconds vanilla: 79.558(0.288) seconds patched: 77.233(0.147) seconds 100G sparse file cat in 1G memory cgroup, 10 runs, to test for code bloat in the traditional memory cgroup LRU handling and reclaim path vanilla: 96.844(0.281) seconds patched: 94.454(0.311) seconds 4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M swap on SSD, 10 runs, to test for regressions in kswapd & direct reclaim using per-memcg LRU lists with multiple memcgs and multiple allocators within each memcg vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ] patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ] 16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M swap on SSD, 10 runs, to test for regressions in hierarchical memcg setups vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ] patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ] This patch: There are currently two different implementations of iterating over a memory cgroup hierarchy tree. Consolidate them into one worker function and base the convenience looping-macros on top of it. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ab936cbcd02072a34b60d268f94440fd5cf1970b |
|
13-Jan-2012 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: add mem_cgroup_replace_page_cache() to fix LRU issue Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a function replace_page_cache_page(). This function replaces a page in the radix-tree with a new page. WHen doing this, memory cgroup needs to fix up the accounting information. memcg need to check PCG_USED bit etc. In some(many?) cases, 'newpage' is on LRU before calling replace_page_cache(). So, memcg's LRU accounting information should be fixed, too. This patch adds mem_cgroup_replace_page_cache() and removes the old hooks. In that function, old pages will be unaccounted without touching res_counter and new page will be accounted to the memcg (of old page). WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid races with LRU handling. Background: replace_page_cache_page() is called by FUSE code in its splice() handling. Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated page and may be on LRU. LRU mis-accounting will be critical for memory cgroup because rmdir() checks the whole LRU is empty and there is no account leak. If a page is on the other LRU than it should be, rmdir() will fail. This bug was added in March 2011, but no bug report yet. I guess there are not many people who use memcg and FUSE at the same time with upstream kernels. The result of this bug is that admin cannot destroy a memcg because of account leak. So, no panic, no deadlock. And, even if an active cgroup exist, umount can succseed. So no problem at shutdown. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
38e5781bbf8e82c1635ea845e0d07b2228a5ac7a |
|
09-Jan-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: igmp: Avoid zero delay when receiving odd mixture of IGMP queries netdev: make net_device_ops const bcm63xx: make ethtool_ops const usbnet: make ethtool_ops const net: Fix build with INET disabled. net: introduce netif_addr_lock_nested() and call if when appropriate net: correct lock name in dev_[uc/mc]_sync documentations. net: sk_update_clone is only used in net/core/sock.c 8139cp: fix missing napi_gro_flush. pktgen: set correct max and min in pktgen_setup_inject() smsc911x: Unconditionally include linux/smscphy.h in smsc911x.h asix: fix infinite loop in rx_fixup() net: Default UDP and UNIX diag to 'n'. r6040: fix typo in use of MCR0 register bits net: fix sock_clone reference mismatch with tcp memcontrol
|
db0c2bf69aa095d4a6de7b1145f29fe9a7c0f6a3 |
|
09-Jan-2012 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup * 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cgroup: fix to allow mounting a hierarchy by name cgroup: move assignement out of condition in cgroup_attach_proc() cgroup: Remove task_lock() from cgroup_post_fork() cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end() cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static cgroup: only need to check oldcgrp==newgrp once cgroup: remove redundant get/put of task struct cgroup: remove redundant get/put of old css_set from migrate cgroup: Remove unnecessary task_lock before fetching css_set on migration cgroup: Drop task_lock(parent) on cgroup_fork() cgroups: remove redundant get/put of css_set from css_set_check_fetched() resource cgroups: remove bogus cast cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task() cgroup, cpuset: don't use ss->pre_attach() cgroup: don't use subsys->can_attach_task() or ->attach_task() cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() cgroup: improve old cgroup handling in cgroup_attach_proc() cgroup: always lock threadgroup during migration threadgroup: extend threadgroup_lock() to cover exit and exec threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem ... Fix up conflict in kernel/cgroup.c due to commit e0197aae59e5: "cgroups: fix a css_set not found bug in cgroup_attach_proc" that already mentioned that the bug is fixed (differently) in Tejun's cgroup patchset. This one, in other words.
|
f3f511e1ce6f1a6f0a5bb8320e9f802e76f6b999 |
|
05-Jan-2012 |
Glauber Costa <glommer@parallels.com> |
net: fix sock_clone reference mismatch with tcp memcontrol Sockets can also be created through sock_clone. Because it copies all data in the sock structure, it also copies the memcg-related pointer, and all should be fine. However, since we now use reference counts in socket creation, we are left with some sockets that have no reference counts. It matters when we destroy them, since it leads to a mismatch. Signed-off-by: Glauber Costa <glommer@parallels.com> CC: David S. Miller <davem@davemloft.net> CC: Greg Thelen <gthelen@google.com> CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com> CC: Laurent Chavey <chavey@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
abb434cb0539fb355c1c921f8fd761efbbac3462 |
|
23-Dec-2011 |
David S. Miller <davem@davemloft.net> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: net/bluetooth/l2cap_core.c Just two overlapping changes, one added an initialization of a local variable, and another change added a new local variable. Signed-off-by: David S. Miller <davem@davemloft.net>
|
65c64ce8ee642eb330a4c4d94b664725f2902b44 |
|
22-Dec-2011 |
Glauber Costa <glommer@parallels.com> |
Partial revert "Basic kernel memory functionality for the Memory Controller" This reverts commit e5671dfae59b165e2adfd4dfbdeab11ac8db5bda. After a follow up discussion with Michal, it was agreed it would be better to leave the kmem controller with just the tcp files, deferring the behavior of the other general memory.kmem.* files for a later time, when more caches are controlled. This is because generic kmem files are not used by tcp accounting and it is not clear how other slab caches would fit into the scheme. We are reverting the original commit so we can track the reference. Part of the patch is kept, because it was used by the later tcp code. Conflicts are shown in the bottom. init/Kconfig is removed from the revert entirely. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> CC: Kirill A. Shutemov <kirill@shutemov.name> CC: Paul Menage <paul@paulmenage.org> CC: Greg Thelen <gthelen@google.com> CC: Johannes Weiner <jweiner@redhat.com> CC: David S. Miller <davem@davemloft.net> Conflicts: Documentation/cgroups/memory.txt mm/memcontrol.c Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
a41c58a6665cc995e237303b05db42100b71b65e |
|
20-Dec-2011 |
Hillf Danton <dhillf@gmail.com> |
memcg: keep root group unchanged if creation fails If the request is to create non-root group and we fail to meet it, we should leave the root unchanged. Signed-off-by: Hillf Danton <dhillf@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2f7ee5691eecb67c8108b92001a85563ea336ac5 |
|
13-Dec-2011 |
Tejun Heo <tj@kernel.org> |
cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach() Currently, there's no way to pass multiple tasks to cgroup_subsys methods necessitating the need for separate per-process and per-task methods. This patch introduces cgroup_taskset which can be used to pass multiple tasks and their associated cgroups to cgroup_subsys methods. Three methods - can_attach(), cancel_attach() and attach() - are converted to use cgroup_taskset. This unifies passed parameters so that all methods have access to all information. Conversions in this patchset are identical and don't introduce any behavior change. -v2: documentation updated as per Paul Menage's suggestion. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: James Morris <jmorris@namei.org>
/mm/memcontrol.c
|
d1a4c0b37c296e600ffe08edb0db2dc1b8f550d7 |
|
11-Dec-2011 |
Glauber Costa <glommer@parallels.com> |
tcp memory pressure controls This patch introduces memory pressure controls for the tcp protocol. It uses the generic socket memory pressure code introduced in earlier patches, and fills in the necessary data in cg_proto struct. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
e1aab161e0135aafcd439be20b4f35e4b0922d95 |
|
11-Dec-2011 |
Glauber Costa <glommer@parallels.com> |
socket: initial cgroup code. The goal of this work is to move the memory pressure tcp controls to a cgroup, instead of just relying on global conditions. To avoid excessive overhead in the network fast paths, the code that accounts allocated memory to a cgroup is hidden inside a static_branch(). This branch is patched out until the first non-root cgroup is created. So when nobody is using cgroups, even if it is mounted, no significant performance penalty should be seen. This patch handles the generic part of the code, and has nothing tcp-specific. Signed-off-by: Glauber Costa <glommer@parallels.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtsu.com> CC: Kirill A. Shutemov <kirill@shutemov.name> CC: David S. Miller <davem@davemloft.net> CC: Eric W. Biederman <ebiederm@xmission.com> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
e5671dfae59b165e2adfd4dfbdeab11ac8db5bda |
|
11-Dec-2011 |
Glauber Costa <glommer@parallels.com> |
Basic kernel memory functionality for the Memory Controller This patch lays down the foundation for the kernel memory component of the Memory Controller. As of today, I am only laying down the following files: * memory.independent_kmem_limit * memory.kmem.limit_in_bytes (currently ignored) * memory.kmem.usage_in_bytes (always zero) Signed-off-by: Glauber Costa <glommer@parallels.com> CC: Kirill A. Shutemov <kirill@shutemov.name> CC: Paul Menage <paul@paulmenage.org> CC: Greg Thelen <gthelen@google.com> CC: Johannes Weiner <jweiner@redhat.com> CC: Michal Hocko <mhocko@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
/mm/memcontrol.c
|
32aaeffbd4a7457bf2f7448b33b5946ff2a960eb |
|
07-Nov-2011 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
|
4799401fef9d5951b2da384c5eb08034c48e08a0 |
|
02-Nov-2011 |
Steven Rostedt <srostedt@redhat.com> |
memcg: Fix race condition in memcg_check_events() with this_cpu usage Various code in memcontrol.c () calls this_cpu_read() on the calculations to be done from two different percpu variables, or does an open-coded read-modify-write on a single percpu variable. Disable preemption throughout these operations so that the writes go to the correct palces. [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a61ed3cec51cfd4877855c24890ab8d3e2b143e3 |
|
02-Nov-2011 |
Johannes Weiner <jweiner@redhat.com> |
memcg: close race between charge and putback There is a potential race between a thread charging a page and another thread putting it back to the LRU list: charge: putback: SetPageCgroupUsed SetPageLRU PageLRU && add to memcg LRU PageCgroupUsed && add to memcg LRU The order of setting one flag and checking the other is crucial, otherwise the charge may observe !PageLRU while the putback observes !PageCgroupUsed and the page is not linked to the memcg LRU at all. Global memory pressure may fix this by trying to isolate and putback the page for reclaim, where that putback would link it to the memcg LRU again. Without that, the memory cgroup is undeletable due to a charge whose physical page can not be found and moved out. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: Ying Han <yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9b272977e3b99a8699361d214b51f98c8a9e0e7b |
|
02-Nov-2011 |
Johannes Weiner <jweiner@redhat.com> |
memcg: skip scanning active lists based on individual size Reclaim decides to skip scanning an active list when the corresponding inactive list is above a certain size in comparison to leave the assumed working set alone while there are still enough reclaim candidates around. The memcg implementation of comparing those lists instead reports whether the whole memcg is low on the requested type of inactive pages, considering all nodes and zones. This can lead to an oversized active list not being scanned because of the state of the other lists in the memcg, as well as an active list being scanned while its corresponding inactive list has enough pages. Not only is this wrong, it's also a scalability hazard, because the global memory state over all nodes and zones has to be gathered for each memcg and zone scanned. Make these calculations purely based on the size of the two LRU lists that are actually affected by the outcome of the decision. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0a619e58703b86d53d07e938eade9a91a4a863c6 |
|
02-Nov-2011 |
Igor Mammedov <imammedo@redhat.com> |
memcg: do not expose uninitialized mem_cgroup_per_node to world If somebody is touching data too early, it might be easier to diagnose a problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to understand why mem_cgroup_per_zone is [un|partly]initialized. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
715a5ee82ab3c07430f748630044354132add5ad |
|
02-Nov-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix oom schedule_timeout() Before calling schedule_timeout(), task state should be changed. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c0ff4b8540a5c158b8e5bafb7d767298b67b0b92 |
|
02-Nov-2011 |
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> |
memcg: rename mem variable to memcg The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses "struct mem_cgroup *memcg". Rename all mem variables to memcg in source file. Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4356f21d09283dc6d39a6f7287a65ddab61e2808 |
|
01-Nov-2011 |
Minchan Kim <minchan.kim@gmail.com> |
mm: change isolate mode from #define to bitwise type Change ISOLATE_XXX macro with bitwise isolate_mode_t type. Normally, macro isn't recommended as it's type-unsafe and making debugging harder as symbol cannot be passed throught to the debugger. Quote from Johannes " Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly." This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b9e15bafdf1aa20791cdefdcbf1ccf7d7aa03aaa |
|
26-May-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
mm: Add export.h for EXPORT_SYMBOL to active symbol exporters These files were getting <linux/module.h> via an implicit include path, but we want to crush those out of existence since they cost time during compiles of processing thousands of lines of headers for no reason. Give them the lightweight header that just contains the EXPORT_SYMBOL infrastructure. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
/mm/memcontrol.c
|
185efc0f9a1f2d6ad6d4782c5d9e529f3290567f |
|
15-Sep-2011 |
Johannes Weiner <jweiner@redhat.com> |
memcg: Revert "memcg: add memory.vmscan_stat" Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add memory.vmscan_stat"). The implementation of per-memcg reclaim statistics violates how memcg hierarchies usually behave: hierarchically. The reclaim statistics are accounted to child memcgs and the parent hitting the limit, but not to hierarchy levels in between. Usually, hierarchical statistics are perfectly recursive, with each level representing the sum of itself and all its children. Since this exports statistics to userspace, this may lead to confusion and problems with changing things after the release, so revert it now, we can try again later. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
23751be0094012eb6b4756fa80ca54b3eb83069f |
|
26-Aug-2011 |
Johannes Weiner <jweiner@redhat.com> |
memcg: fix hierarchical oom locking Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than counter") tried to oom lock the hierarchy and roll back upon encountering an already locked memcg. The code is confused when it comes to detecting a locked memcg, though, so it would fail and rollback after locking one memcg and encountering an unlocked second one. The result is that oom-locking hierarchies fails unconditionally and that every oom killer invocation simply goes to sleep on the oom waitqueue forever. The tasks practically hang forever without anyone intervening, possibly holding locks that trip up unrelated tasks, too. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
5af12d0efdbd9967cc71a0a10c4025c4255a6254 |
|
26-Aug-2011 |
Johannes Weiner <jweiner@redhat.com> |
memcg: pin execution to current cpu while draining stock Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without pages") added a drain_local_stock() call to a preemptible section. The draining task looks up the cpu-local stock twice to set the draining-flag, then to drain the stock and clear the flag again. If the task is migrated to a different CPU in between, noone will clear the flag on the first stock and it will be forever undrainable. Its charge can not be recovered and the cgroup can not be deleted anymore. Properly pin the task to the executing CPU while draining stocks. Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9f50fad65b87a8776ae989ca059ad6c17925dfc3 |
|
09-Aug-2011 |
Michal Hocko <mhocko@suse.cz> |
Revert "memcg: get rid of percpu_charge_mutex lock" This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc. The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE bit operations is sufficient but that is not true. Johannes Weiner has reported a crash during parallel memory cgroup removal: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70 Oops: 0000 [#1] PREEMPT SMP Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3 RIP: 0010:[<ffffffff81083b70>] css_is_ancestor+0x20/0x70 RSP: 0018:ffff880077b09c88 EFLAGS: 00010202 Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310) Call Trace: [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40 [<ffffffff810feccf>] drain_all_stock+0x11f/0x170 [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0 [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20 [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500 [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0 [<ffffffff81114e7b>] do_rmdir+0xfb/0x110 [<ffffffff81114ea6>] sys_rmdir+0x16/0x20 [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b We are crashing because we try to dereference cached memcg when we are checking whether we should wait for draining on the cache. The cache is already cleaned up, though. There is also a theoretical chance that the cached memcg gets freed between we test for the FLUSHING_CACHED_CHARGE and dereference it in mem_cgroup_same_or_subtree: CPU0 CPU1 CPU2 mem=stock->cached stock->cached=NULL clear_bit test_and_set_bit test_bit() ... <preempted> mem_cgroup_destroy use after free The percpu_charge_mutex protected from this race because sync draining is exclusive. It is safer to revert now and come up with a more parallel implementation later. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
aa3b189551ad8e5cc1d9c663735c131650238278 |
|
04-Aug-2011 |
Hugh Dickins <hughd@google.com> |
tmpfs: convert mem_cgroup shmem to radix-swap Remove mem_cgroup_shmem_charge_fallback(): it was only required when we had to move swappage to filecache with GFP_NOWAIT. Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by moving its call out from shmem_add_to_page_cache() to two of thats three callers. But leave it doing mem_cgroup_uncharge_cache_page() on error: although asymmetrical, it's easier for all 3 callers to handle. These two changes would also be appropriate if anyone were to start using shmem_read_mapping_page_gfp() with GFP_NOWAIT. Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test radix_tree_exceptional_entry() to get what it needs for itself. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8521fc50d433507a7cdc96bec280f9e5888a54cc |
|
27-Jul-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: get rid of percpu_charge_mutex lock percpu_charge_mutex protects from multiple simultaneous per-cpu charge caches draining because we might end up having too many work items. At least this was the case until commit 26fe61684449 ("memcg: fix percpu cached charge draining frequency") when we introduced a more targeted draining for async mode. Now that also sync draining is targeted we can safely remove mutex because we will not send more work than the current number of CPUs. FLUSHING_CACHED_CHARGE protects from sending the same work multiple times and stock->nr_pages == 0 protects from pointless sending a work if there is obviously nothing to be done. This is of course racy but we can live with it as the race window is really small (we would have to see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still non-zero). The only remaining place where we can race is synchronous mode when we rely on FLUSHING_CACHED_CHARGE test which might have been set by other drainer on the same group but we should wait in that case as well. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3e92041d68b40c47faa34c7dc08fc650a6c36adc |
|
27-Jul-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: add mem_cgroup_same_or_subtree() helper We are checking whether a given two groups are same or at least in the same subtree of a hierarchy at several places. Let's make a helper for it to make code easier to read. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d38144b7a5f8d0a5e05d549177191374c6911009 |
|
27-Jul-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: unify sync and async per-cpu charge cache draining Currently we have two ways how to drain per-CPU caches for charges. drain_all_stock_sync will synchronously drain all caches while drain_all_stock_async will asynchronously drain only those that refer to a given memory cgroup or its subtree in hierarchy. Targeted async draining has been introduced by 26fe6168 (memcg: fix percpu cached charge draining frequency) to reduce the cpu workers number. sync draining is currently triggered only from mem_cgroup_force_empty which is triggered only by userspace (mem_cgroup_force_empty_write) or when a cgroup is removed (mem_cgroup_pre_destroy). Although these are not usually frequent operations it still makes some sense to do targeted draining as well, especially if the box has many CPUs. This patch unifies both methods to use the single code (drain_all_stock) which relies on the original async implementation and just adds flush_work to wait on all caches that are still under work for the sync mode. We are using FLUSHING_CACHED_CHARGE bit check to prevent from waiting on a work that we haven't triggered. Please note that both sync and async functions are currently protected by percpu_charge_mutex so we cannot race with other drainers. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d1a05b6973c7cb33144fa965d73facc708ffc37d |
|
27-Jul-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: do not try to drain per-cpu caches without pages drain_all_stock_async tries to optimize a work to be done on the work queue by excluding any work for the current CPU because it assumes that the context we are called from already tried to charge from that cache and it's failed so it must be empty already. While the assumption is correct we can optimize it even more by checking the current number of pages in the cache. This will also reduce a work on other CPUs with an empty stock. For the current CPU we can simply call drain_local_stock rather than deferring it to the work queue. [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
82f9d486e59f588c7d100865c36510644abda356 |
|
27-Jul-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: add memory.vmscan_stat The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim in...") says it adds scanning stats to memory.stat file. But it doesn't because we considered we needed to make a concensus for such new APIs. This patch is a trial to add memory.scan_stat. This shows - the number of scanned pages(total, anon, file) - the number of rotated pages(total, anon, file) - the number of freed pages(total, anon, file) - the number of elaplsed time (including sleep/pause time) for both of direct/soft reclaim. The biggest difference with oringinal Ying's one is that this file can be reset by some write, as # echo 0 ...../memory.scan_stat Example of output is here. This is a result after make -j 6 kernel under 300M limit. [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat scanned_pages_by_limit 9471864 scanned_anon_pages_by_limit 6640629 scanned_file_pages_by_limit 2831235 rotated_pages_by_limit 4243974 rotated_anon_pages_by_limit 3971968 rotated_file_pages_by_limit 272006 freed_pages_by_limit 2318492 freed_anon_pages_by_limit 962052 freed_file_pages_by_limit 1356440 elapsed_ns_by_limit 351386416101 scanned_pages_by_system 0 scanned_anon_pages_by_system 0 scanned_file_pages_by_system 0 rotated_pages_by_system 0 rotated_anon_pages_by_system 0 rotated_file_pages_by_system 0 freed_pages_by_system 0 freed_anon_pages_by_system 0 freed_file_pages_by_system 0 elapsed_ns_by_system 0 scanned_pages_by_limit_under_hierarchy 9471864 scanned_anon_pages_by_limit_under_hierarchy 6640629 scanned_file_pages_by_limit_under_hierarchy 2831235 rotated_pages_by_limit_under_hierarchy 4243974 rotated_anon_pages_by_limit_under_hierarchy 3971968 rotated_file_pages_by_limit_under_hierarchy 272006 freed_pages_by_limit_under_hierarchy 2318492 freed_anon_pages_by_limit_under_hierarchy 962052 freed_file_pages_by_limit_under_hierarchy 1356440 elapsed_ns_by_limit_under_hierarchy 351386416101 scanned_pages_by_system_under_hierarchy 0 scanned_anon_pages_by_system_under_hierarchy 0 scanned_file_pages_by_system_under_hierarchy 0 rotated_pages_by_system_under_hierarchy 0 rotated_anon_pages_by_system_under_hierarchy 0 rotated_file_pages_by_system_under_hierarchy 0 freed_pages_by_system_under_hierarchy 0 freed_anon_pages_by_system_under_hierarchy 0 freed_file_pages_by_system_under_hierarchy 0 elapsed_ns_by_system_under_hierarchy 0 total_xxxx is for hierarchy management. This will be useful for further memcg developments and need to be developped before we do some complicated rework on LRU/softlimit management. This patch adds a new struct memcg_scanrecord into scan_control struct. sc->nr_scanned at el is not designed for exporting information. For example, nr_scanned is reset frequentrly and incremented +2 at scanning mapped pages. To avoid complexity, I added a new param in scan_control which is for exporting scanning score. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Andrew Bresticker <abrestic@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
108b6a78463bb8c7163e4f9779f36ad8bbade334 |
|
27-Jul-2011 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix behavior of mem_cgroup_resize_limit() Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to memsw.limit") introduced "memsw_is_minimum" flag, which becomes true when mem_limit == memsw_limit. The flag is checked at the beginning of reclaim, and "noswap" is set if the flag is true, because using swap is meaningless in this case. This works well in most cases, but when we try to shrink mem_limit, which is the same as memsw_limit now, we might fail to shrink mem_limit because swap doesn't used. This patch fixes this behavior by: - check MEM_CGROUP_RECLAIM_SHRINK at the begining of reclaim - If it is set, don't set "noswap" flag even if memsw_is_minimum is true. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1af8efe965676ab30d6c8a5b1fccc9229f339a3b |
|
27-Jul-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: change memcg_oom_mutex to spinlock memcg_oom_mutex is used to protect memcg OOM path and eventfd interface for oom_control. None of the critical sections which it protects sleep (eventfd_signal works from atomic context and the rest are simple linked list resp. oom_lock atomic operations). Mutex is also too heavyweight for those code paths because it triggers a lot of scheduling. It also makes makes convoying effects more visible when we have a big number of oom killing because we take the lock mutliple times during mem_cgroup_handle_oom so we have multiple places where many processes can sleep. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
79dfdaccd1d5b40ff7cf4a35a0e63696ebb78b4d |
|
27-Jul-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: make oom_lock 0 and 1 based rather than counter Commit 867578cb ("memcg: fix oom kill behavior") introduced a oom_lock counter which is incremented by mem_cgroup_oom_lock when we are about to handle memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if oom_lock > 1 to prevent from multiple oom kills at the same time. The counter is then decremented by mem_cgroup_oom_unlock called from the same function. This works correctly but it can lead to serious starvations when we have many processes triggering OOM and many CPUs available for them (I have tested with 16 CPUs). Consider a process (call it A) which gets the oom_lock (the first one that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other processes that are blocked on the mutex. While A releases the mutex and calls mem_cgroup_out_of_memory others will wake up (one after another) and increase the counter and fall into sleep (memcg_oom_waitq). Once A finishes mem_cgroup_out_of_memory it takes the mutex again and decreases oom_lock and wakes other tasks (if releasing memory by somebody else - e.g. killed process - hasn't done it yet). A testcase would look like: Assume malloc XXX is a program allocating XXX Megabytes of memory which touches all allocated pages in a tight loop # swapoff SWAP_DEVICE # cgcreate -g memory:A # cgset -r memory.oom_control=0 A # cgset -r memory.limit_in_bytes= 200M # for i in `seq 100` # do # cgexec -g memory:A malloc 10 & # done The main problem here is that all processes still race for the mutex and there is no guarantee that we will get counter back to 0 for those that got back to mem_cgroup_handle_oom. In the end the whole convoy in/decreases the counter but we do not get to 1 that would enable killing so nothing useful can be done. The time is basically unbounded because it highly depends on scheduling and ordering on mutex (I have seen this taking hours...). This patch replaces the counter by a simple {un}lock semantic. As mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have to make sure that nobody else races with us which is guaranteed by the memcg_oom_mutex. We have to be careful while locking subtrees because we can encounter a subtree which is already locked: hierarchy: A / \ B \ /\ \ C D E B - C - D tree might be already locked. While we want to enable locking E subtree because OOM situations cannot influence each other we definitely do not want to allow locking A. Therefore we have to refuse lock if any subtree is already locked and clear up the lock for all nodes that have been set up to the failure point. On the other hand we have to make sure that the rest of the world will recognize that a group is under OOM even though it doesn't have a lock. Therefore we have to introduce under_oom variable which is incremented and decremented for the whole subtree when we enter resp. leave mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need be updated under memcg_oom_mutex because its users only check a single group and they use atomic operations for that. This can be checked easily by the following test case: # cgcreate -g memory:A # cgset -r memory.use_hierarchy=1 A # cgset -r memory.oom_control=1 A # cgset -r memory.limit_in_bytes= 100M # cgset -r memory.memsw.limit_in_bytes= 100M # cgcreate -g memory:A/B # cgset -r memory.oom_control=1 A/B # cgset -r memory.limit_in_bytes=20M # cgset -r memory.memsw.limit_in_bytes=20M # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A While B gets oom_lock A will not get it. Both of them go into sleep and wait for an external action. We can make the limit higher for A to enforce waking it up # cgset -r memory.memsw.limit_in_bytes=300M A # cgset -r memory.limit_in_bytes=300M A malloc in A has to wake up even though it doesn't have oom_lock. Finally, the unlock path is very easy because we always unlock only the subtree we have locked previously while we always decrement under_oom. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bb2a0de92c891b8feeedc0178acb3ae009d899a8 |
|
27-Jul-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: consolidate memory cgroup lru stat functions In mm/memcontrol.c, there are many lru stat functions as.. mem_cgroup_zone_nr_lru_pages mem_cgroup_node_nr_file_lru_pages mem_cgroup_nr_file_lru_pages mem_cgroup_node_nr_anon_lru_pages mem_cgroup_nr_anon_lru_pages mem_cgroup_node_nr_unevictable_lru_pages mem_cgroup_nr_unevictable_lru_pages mem_cgroup_node_nr_lru_pages mem_cgroup_nr_lru_pages mem_cgroup_get_local_zonestat Some of them are under #ifdef MAX_NUMNODES >1 and others are not. This seems bad. This patch consolidates all functions into mem_cgroup_zone_nr_lru_pages() mem_cgroup_node_nr_lru_pages() mem_cgroup_nr_lru_pages() For these functions, "which LRU?" information is passed by a mask. example: mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON)) And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON. example: mem_cgroup_nr_lru_pages(mem, ALL_LRU) BTW, considering layout of NUMA memory placement of counters, this patch seems to be better. Now, when we gather all LRU information, we scan in following orer for_each_lru -> for_each_node -> for_each_zone. This means we'll touch cache lines in different node in turn. After patch, we'll scan for_each_node -> for_each_zone -> for_each_lru(mask) Then, we'll gather information in the same cacheline at once. [akpm@linux-foundation.org: fix warnigns, build error] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1f4c025b5a5520fd2571244196b1b01ad96d18f6 |
|
27-Jul-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() Each memory cgroup has a 'swappiness' value which can be accessed by get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages() and swappiness is passed by argument. It's propagated by scan_control. get_swappiness() is a static function but some planned updates will need to get swappiness from files other than memcontrol.c This patch exports get_swappiness() as mem_cgroup_swappiness(). With this, we can remove the argument of swapiness from try_to_free... and drop swappiness from scan_control. only memcg uses it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
453a9bf347f1e22a5bb3605ced43b2366921221d |
|
09-Jul-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix numa scan information update to be triggered by memory event commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin order") adds an numa node round-robin for memcg. But the information is updated once per 10sec. This patch changes the update trigger from jiffies to memcg's event count. After this patch, numa scan information will be updated when we see 1024 events of pagein/pageout under a memcg. [akpm@linux-foundation.org: attempt to repair code layout] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4d0c066d29f030d47d19678f8008933e67dd3b72 |
|
09-Jul-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix reclaimable lru check in memcg Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is used for checking whether the memcg contains reclaimable pages or not. If no pages in it, the routine skips it. But, mem_cgroup_local_usage() contains Unevictable pages and cannot handle "noswap" condition correctly. This doesn't work on a swapless system. This patch adds test_mem_cgroup_reclaimable() and replaces mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() see LRU counter and returns correct answer to the caller. And this new function has "noswap" argument and can see only FILE LRU if necessary. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix kerneldoc layout] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
072441e21ddcd1140606b7d4ef6eab579a86b0b3 |
|
28-Jun-2011 |
Hugh Dickins <hughd@google.com> |
mm: move shmem prototypes to shmem_fs.h Before adding any more global entry points into shmem.c, gather such prototypes into shmem_fs.h. Remove mm's own declarations from swap.h, but for now leave the ones in mm.h: because shmem_file_setup() and shmem_zero_setup() are called from various places, and we should not force other subsystems to update immediately. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
fbc29a25e484be073e7d762c9f7f1d4bf8aecc48 |
|
16-Jun-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: avoid percpu cached charge draining at softlimit Based on Michal Hocko's comment. We are not draining per cpu cached charges during soft limit reclaim because background reclaim doesn't care about charges. It tries to free some memory and charges will not give any. Cached charges might influence only selection of the biggest soft limit offender but as the call is done only after the selection has been already done it makes no change. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
26fe616844491a41a1abc02e29f7a9d1ec2f8ddb |
|
16-Jun-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix percpu cached charge draining frequency For performance, memory cgroup caches some "charge" from res_counter into per cpu cache. This works well but because it's cache, it needs to be flushed in some cases. Typical cases are 1. when someone hit limit. 2. when rmdir() is called and need to charges to be 0. But "1" has problem. Recently, with large SMP machines, we see many kworker runs because of flushing memcg's cache. Bad things in implementation are that even if a cpu contains a cache for memcg not related to a memcg which hits limit, drain code is called. This patch does A) check percpu cache contains a useful data or not. B) check other asynchronous percpu draining doesn't run. C) don't call local cpu callback. (*)This patch avoid changing the calling condition with hard-limit. When I run "cat 1Gfile > /dev/null" under 300M limit memcg, [Before] 13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat 58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1 60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1 4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0 57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1 61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1 62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1 63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1 [After] 2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat 2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top 1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0 4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0 [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Michal Hocko <mhocko@suse.cz> Tested-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7ae534d074e01e54d5cfbc9734b73fdfc855501f |
|
16-Jun-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix wrong check of noswap with softlimit Hierarchical reclaim doesn't swap out if memsw and resource limits are thye same (memsw_is_minimum == true) because we would hit mem+swap limit anyway (during hard limit reclaim). If it comes to the soft limit we shouldn't consider memsw_is_minimum at all because it doesn't make much sense. Either the soft limit is bellow the hard limit and then we cannot hit mem+swap limit or the direct reclaim takes a precedence. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8957712710e045044e3c44375c6a87d7ffa17d51 |
|
16-Jun-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
mm: memory.numa_stat: fix file permission Commit 406eb0c9ba76 ("memcg: add memory.numastat api for numa statistics") adds memory.numa_stat file for memory cgroup. But the file permissions are wrong. [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat ---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat This patch fixes the permission as [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a433658c30974fc87ba3ff52d7e4e6299762aa3d |
|
16-Jun-2011 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
vmscan,memcg: memcg aware swap token Currently, memcg reclaim can disable swap token even if the swap token mm doesn't belong in its memory cgroup. It's slightly risky. If an admin creates very small mem-cgroup and silly guy runs contentious heavy memory pressure workload, every tasks are going to lose swap token and then system may become unresponsive. That's bad. This patch adds 'memcg' parameter into disable_swap_token(). and if the parameter doesn't match swap token, VM doesn't disable it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel<riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
456f998ec817ebfa254464be4f089542fa390645 |
|
27-May-2011 |
Ying Han <yinghan@google.com> |
memcg: add the pagefault count into memcg stats Two new stats in per-memcg memory.stat which tracks the number of page faults and number of major page faults. "pgfault" "pgmajfault" They are different from "pgpgin"/"pgpgout" stat which count number of pages charged/discharged to the cgroup and have no meaning of reading/ writing page to disk. It is valuable to track the two stats for both measuring application's performance as well as the efficiency of the kernel page reclaim path. Counting pagefaults per process is useful, but we also need the aggregated value since processes are monitored and controlled in cgroup basis in memcg. Functional test: check the total number of pgfault/pgmajfault of all memcgs and compare with global vmstat value: $ cat /proc/vmstat | grep fault pgfault 1070751 pgmajfault 553 $ cat /dev/cgroup/memory.stat | grep fault pgfault 1071138 pgmajfault 553 total_pgfault 1071142 total_pgmajfault 553 $ cat /dev/cgroup/A/memory.stat | grep fault pgfault 199 pgmajfault 0 total_pgfault 199 total_pgmajfault 0 Performance test: run page fault test(pft) wit 16 thread on faulting in 15G anon pages in 16G container. There is no regression noticed on the "flt/cpu/s" Sample output from pft: TAG pft:anon-sys-default: Gb Thr CLine User System Wall flt/cpu/s fault/wsec 15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260 +-------------------------------------------------------------------------+ N Min Max Median Avg Stddev x 10 16682.962 17344.027 16913.524 16928.812 166.5362 + 10 16695.568 16923.896 16820.604 16824.652 84.816568 No difference proven at 95.0% confidence [akpm@linux-foundation.org: fix build] [hughd@google.com: shmem fix] Signed-off-by: Ying Han <yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
406eb0c9ba765eb066406fd5ce9d5e2b169a4d5a |
|
27-May-2011 |
Ying Han <yinghan@google.com> |
memcg: add memory.numastat api for numa statistics The new API exports numa_maps per-memcg basis. This is a piece of useful information where it exports per-memcg page distribution across real numa nodes. One of the usecases is evaluating application performance by combining this information w/ the cpu allocation to the application. The output of the memory.numastat tries to follow w/ simiar format of numa_maps like: total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... And we have per-node: total = file + anon + unevictable $ cat /dev/cgroup/memory/memory.numa_stat total=250020 N0=87620 N1=52367 N2=45298 N3=64735 file=225232 N0=83402 N1=46160 N2=40522 N3=55148 anon=21053 N0=3424 N1=6207 N2=4776 N3=6646 unevictable=3735 N0=794 N1=0 N2=0 N3=2941 Signed-off-by: Ying Han <yinghan@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1bac180bd29e03989f50054af97b53b8d37a364a |
|
27-May-2011 |
Ying Han <yinghan@google.com> |
memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages() The caller of the function has been renamed to zone_nr_lru_pages(), and this is just fixing up in the memcg code. The current name is easily to be mis-read as zone's total number of pages. Signed-off-by: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4fd14ebf6e3b66423dfac2bc9defda7b83ee07b3 |
|
27-May-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: remove unused retry signal from reclaim If the memcg reclaim code detects the target memcg below its limit it exits and returns a guaranteed non-zero value so that the charge is retried. Nowadays, the charge side checks the memcg limit itself and does not rely on this non-zero return value trick. This patch removes it. The reclaim code will now always return the true number of pages it reclaimed on its own. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel<riel@redhat.com> Acked-by: Ying Han<yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
889976dbcb1218119fdd950fb7819084e37d7d37 |
|
27-May-2011 |
Ying Han <yinghan@google.com> |
memcg: reclaim memory from nodes in round-robin order Presently, memory cgroup's direct reclaim frees memory from the current node. But this has some troubles. Usually when a set of threads works in a cooperative way, they tend to operate on the same node. So if they hit limits under memcg they will reclaim memory from themselves, damaging the active working set. For example, assume 2 node system which has Node 0 and Node 1 and a memcg which has 1G limit. After some work, file cache remains and the usages are Node 0: 1M Node 1: 998M. and run an application on Node 0, it will eat its foot before freeing unnecessary file caches. This patch adds round-robin for NUMA and adds equal pressure to each node. When using cpuset's spread memory feature, this will work very well. But yes, a better algorithm is needed. [akpm@linux-foundation.org: comment editing] [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons] Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
39cc98f1f8aa949afeea89f424c7494b0785d7da |
|
27-May-2011 |
Michal Hocko <mhocko@suse.cz> |
memcg: remove pointless next_mz nullification in mem_cgroup_soft_limit_reclaim() next_mz is assigned to NULL if __mem_cgroup_largest_soft_limit_node selects the same mz. This doesn't make much sense as we assign to the variable right in the next loop. Compiler will probably optimize this out but it is little bit confusing for the code reading. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0ae5e89c60c9eb87da36a2614836bc434b0ec2ad |
|
27-May-2011 |
Ying Han <yinghan@google.com> |
memcg: count the soft_limit reclaim in global background reclaim The global kswapd scans per-zone LRU and reclaims pages regardless of the cgroup. It breaks memory isolation since one cgroup can end up reclaiming pages from another cgroup. Instead we should rely on memcg-aware target reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under memory pressure. In the global background reclaim, we do soft reclaim before scanning the per-zone LRU. However, the return value is ignored. This patch is the first step to skip shrink_zone() if soft_limit reclaim does enough work. This is part of the effort which tries to reduce reclaiming pages in global LRU in memcg. The per-memcg background reclaim patchset further enhances the per-cgroup targetting reclaim, which I should have V4 posted shortly. Try running multiple memory intensive workloads within seperate memcgs. Watch the counters of soft_steal in memory.stat. $ cat /dev/cgroup/A/memory.stat | grep 'soft' soft_steal 240000 soft_scan 240000 total_soft_steal 240000 total_soft_scan 240000 This patch: In the global background reclaim, we do soft reclaim before scanning the per-zone LRU. However, the return value is ignored. We would like to skip shrink_zone() if soft_limit reclaim does enough work. Also, we need to make the memory pressure balanced across per-memcg zones, like the logic vm-core. This patch is the first step where we start with counting the nr_scanned and nr_reclaimed from soft_limit reclaim into the global scan_control. Signed-off-by: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f780bdb7c1c73009cb57adcf99ef50027d80bf3c |
|
27-May-2011 |
Ben Blum <bblum@andrew.cmu.edu> |
cgroups: add per-thread subsystem callbacks Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a2c8990aed5ab000491732b07c8c4465d1b389b8 |
|
25-May-2011 |
Michal Hocko <mhocko@suse.cz> |
memsw: remove noswapaccount kernel parameter The noswapaccount parameter has been deprecated since 2.6.38 without any complaints from users so we can remove it. swapaccount=0|1 can be used instead. As we are removing the parameter we can also clean up swapaccount because it doesn't have to accept an empty string anymore (to match noswapaccount) and so we can push = into __setup macro rather than checking "=1" resp. "=0" strings Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
25985edcedea6396277003854657b5f3cb31a628 |
|
31-Mar-2011 |
Lucas De Marchi <lucas.demarchi@profusion.mobi> |
Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
/mm/memcontrol.c
|
5a6475a4e162200f43855e2d42bbf55bcca1a9f2 |
|
24-Mar-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix leak on wrong LRU with FUSE fs/fuse/dev.c::fuse_try_move_page() does (1) remove a page by ->steal() (2) re-add the page to page cache (3) link the page to LRU if it was not on LRU at (1) This implies the page is _on_ LRU when it's added to radix-tree. So, the page is added to memory cgroup while it's on LRU. because LRU is lazy and no one flushs it. This is the same behavior as SwapCache and needs special care as - remove page from LRU before overwrite pc->mem_cgroup. - add page to LRU after overwrite pc->mem_cgroup. And we need to taking care of pagevec. If PageLRU(page) is set before we add PCG_USED bit, the page will not be added to memcg's LRU (in short period). So, regardlress of PageLRU(page) value before commit_charge(), we need to check PageLRU(page) after commit_charge(). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=30432 Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Balbir Singh <balbir@in.ibm.com> Reported-by: Daniel Poelzleithner <poelzi@poelzi.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4be4489feae6da890765cc1bdc1af5e4f8c4b75f |
|
24-Mar-2011 |
Andrew Morton <akpm@linux-foundation.org> |
mm/memcontrol.c: suppress uninitialized-var warning with older gcc's mm/memcontrol.c: In function 'mem_cgroup_force_empty': mm/memcontrol.c:2280: warning: 'flags' may be used uninitialized in this function It's a false positive. Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7a159cc9d7987cdb4853f8711f5f89e01cfffe42 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: use native word page statistics counters The statistic counters are in units of pages, there is no reason to make them 64-bit wide on 32-bit machines. Make them native words. Since they are signed, this leaves 31 bit on 32-bit machines, which can represent roughly 8TB assuming a page size of 4k. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e9f8974f2f559b00c87ccfba67bca3903f913d50 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: break out event counters from other stats For increasing and decreasing per-cpu cgroup usage counters it makes sense to use signed types, as single per-cpu values might go negative during updates. But this is not the case for only-ever-increasing event counters. All the counters have been signed 64-bit so far, which was enough to count events even with the sign bit wasted. This patch: - divides s64 counters into signed usage counters and unsigned monotonically increasing event counters. - converts unsigned event counters into 'unsigned long' rather than 'u64'. This matches the type used by the /proc/vmstat event counters. The next patch narrows the signed usage counters type (on 32-bit CPUs, that is). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7ec99d6213b579a84c85ad37f2aa8ded4857c53c |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: unify charge/uncharge quantities to units of pages There is no clear pattern when we pass a page count and when we pass a byte count that is a multiple of PAGE_SIZE. We never charge or uncharge subpage quantities, so convert it all to page counts. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7ffd4ca7a2cdd7a18f0b499a4e9e0e7cf36ba018 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: convert uncharge batching from bytes to page granularity We never uncharge subpage quantities. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
11c9ea4e80fc3be83485667204c68d0a732f3757 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: convert per-cpu stock from bytes to page granularity We never keep subpage quantities in the per-cpu stock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e7018b8d27e0c9aa2200e5b393e0fe9093c6565c |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: keep only one charge cancelling function We have two charge cancelling functions: one takes a page count, the other a page size. The second one just divides the parameter by PAGE_SIZE and then calls the first one. This is trivial, no need for an extra function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bf1ff2635a5fda207fc870df348bfc766e8dcd4d |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: remove memcg->reclaim_param_lock The reclaim_param_lock is only taken around single reads and writes to integer variables and is thus superfluous. Drop it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4dc03de1b29901b61cb27e4cab44a7f578dc0fc9 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: charged pages always have valid per-memcg zone info page_cgroup_zoneinfo() will never return NULL for a charged page, remove the check for it in mem_cgroup_get_reclaim_stat_from_page(). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6b3ae58efca06623c197fd6d91ded4aa3a8fe039 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: remove direct page_cgroup-to-page pointer In struct page_cgroup, we have a full word for flags but only a few are reserved. Use the remaining upper bits to encode, depending on configuration, the node or the section, to enable page_cgroup-to-page lookups without a direct pointer. This saves a full word for every page in a system with memory cgroups enabled. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
5564e88ba6fd2f6dcd83a592771810cd84b5ae80 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: condense page_cgroup-to-page lookup points The per-cgroup LRU lists string up 'struct page_cgroup's. To get from those structures to the page they represent, a lookup is required. Currently, the lookup is done through a direct pointer in struct page_cgroup, so a lot of functions down the callchain do this lookup by themselves instead of receiving the page pointer from their callers. The next patch removes this pointer, however, and the lookup is no longer that straight-forward. In preparation for that, this patch only leaves the non-optional lookups when coming directly from the LRU list and passes the page down the stack. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
de3638d9cdc89ac899225996b8dcedbcbc53bdd2 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: fold __mem_cgroup_move_account into caller It is one logical function, no need to have it split up. Also, get rid of some checks from the inner function that ensured the sanity of the outer function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
97a6c37b34f46feed2544bd40891ee6dd0fd1554 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: change page_cgroup_zoneinfo signature Instead of passing a whole struct page_cgroup to this function, let it take only what it really needs from it: the struct mem_cgroup and the page. This has the advantage that reading pc->mem_cgroup is now done at the same place where the ordering rules for this pointer are enforced and explained. It is also in preparation for removing the pc->page backpointer. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ad324e94475a04cfcdfdb11ad20f8ea81268e411 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: no uncharged pages reach page_cgroup_zoneinfo This patch series removes the direct page pointer from struct page_cgroup, which saves 20% of per-page memcg memory overhead (Fedora and Ubuntu enable memcg per default, openSUSE apparently too). The node id or section number is encoded in the remaining free bits of pc->flags which allows calculating the corresponding page without the extra pointer. I ran, what I think is, a worst-case microbenchmark that just cats a large sparse file to /dev/null, because it means that walking the LRU list on behalf of per-cgroup reclaim and looking up pages from page_cgroups is happening constantly and at a high rate. But it made no measurable difference. A profile reported a 0.11% share of the new lookup_cgroup_page() function in this benchmark. This patch: All callsites check PCG_USED before passing pc->mem_cgroup, so the latter is never NULL. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f212ad7cf9c73f8a7fa160e223dcb3f074441a72 |
|
24-Mar-2011 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: add memcg sanity checks at allocating and freeing pages Add checks at allocating or freeing a page whether the page is used (iow, charged) from the view point of memcg. This check may be useful in debugging a problem and we did similar checks before the commit 52d4b9ac(memcg: allocate all page_cgroup at boot). This patch adds some overheads at allocating or freeing memory, so it's enabled only when CONFIG_DEBUG_VM is enabled. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
af4a662144884a7dbb19acbef70878b3b955f928 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: remove NULL check from lookup_page_cgroup() result The page_cgroup array is set up before even fork is initialized. I seriously doubt that this code executes before the array is alloc'd. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c14f35c70e068392ccae0b2d6f755baea5eed4d6 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: remove impossible conditional when committing No callsite ever passes a NULL pointer for a struct mem_cgroup * to the committing function. There is no need to check for it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3403968d7a7dc373901cad0cad56b3afcb09cc50 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: remove unused page flag bitfield defines These definitions have been unused since '4b3bde4 memcg: remove the overhead associated with the root cgroup'. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9d11ea9f163a14920487bdda77461e64d600fd48 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: simplify the way memory limits are checked Since transparent huge pages, checking whether memory cgroups are below their limits is no longer enough, but the actual amount of chargeable space is important. To not have more than one limit-checking interface, replace memory_cgroup_check_under_limit() and memory_cgroup_check_margin() with a single memory_cgroup_margin() that returns the chargeable space and leaves the comparison to the callsite. Soft limits are now checked the other way round, by using the already existing function that returns the amount by which soft limits are exceeded: res_counter_soft_limit_excess(). Also remove all the corresponding functions on the res_counter side that are now no longer used. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b7c6167848fa36e32f1874b95c1edc02881cd040 |
|
24-Mar-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: soft limit reclaim should end at limit not below Soft limit reclaim continues until the usage is below the current soft limit, but the documented semantics are actually that soft limit reclaim will push usage back until the soft limits are met again. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
56039efa18f2530fc23e8ef19e716b65ee2a1d1e |
|
24-Mar-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix ugly initialization of return value is in caller Remove initialization of vaiable in caller of memory cgroup function. Actually, it's return value of memcg function but it's initialized in caller. Some memory cgroup uses following style to bring the result of start function to the end function for avoiding races. mem_cgroup_start_A(&(*ptr)) /* Something very complicated can happen here. */ mem_cgroup_end_A(*ptr) In some calls, *ptr should be initialized to NULL be caller. But it's ugly. This patch fixes that *ptr is initialized by _start function. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
033193275b3ffcfe7f3fde7b569f3d207f6cd6a0 |
|
23-Mar-2011 |
Dave Hansen <dave@linux.vnet.ibm.com> |
pagewalk: only split huge pages when necessary Right now, if a mm_walk has either ->pte_entry or ->pmd_entry set, it will unconditionally split any transparent huge pages it runs in to. In practice, that means that anyone doing a cat /proc/$pid/smaps will unconditionally break down every huge page in the process and depend on khugepaged to re-collapse it later. This is fairly suboptimal. This patch changes that behavior. It teaches each ->pmd_entry handler (there are five) that they must break down the THPs themselves. Also, the _generic_ code will never break down a THP unless a ->pte_entry handler is actually set. This means that the ->pmd_entry handlers can now choose to deal with THPs without breaking them down. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric B Munson <emunson@mgebm.net> Tested-by: Eric B Munson <emunson@mgebm.net> Cc: Michael J Wolf <mjwolf@us.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3f58a82943337fb6e79acfa5346719a97d3c0b98 |
|
23-Mar-2011 |
Minchan Kim <minchan.kim@gmail.com> |
memcg: move memcg reclaimable page into tail of inactive list The rotate_reclaimable_page function moves just written out pages, which the VM wanted to reclaim, to the end of the inactive list. That way the VM will find those pages first next time it needs to free memory. This patch applies the rule in memcg. It can help to prevent unnecessary working page eviction of memcg. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ef6a3c63112e865d632ff7c478ba7c7160cad0d1 |
|
23-Mar-2011 |
Miklos Szeredi <mszeredi@suse.cz> |
mm: add replace_page_cache_page() function This function basically does: remove_from_page_cache(old); page_cache_release(old); add_to_page_cache_locked(new); Except it does this atomically, so there's no possibility for the "add" to fail because of a race. If memory cgroups are enabled, then the memory cgroup charge is also moved from the old page to the new. This function is currently used by fuse to move pages into the page cache on read, instead of copying the page contents. [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3751d60430fe4c26460a5ca8ad8672d32f93bcb1 |
|
02-Feb-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix event counting breakage from recent THP update Changes in e401f1761 ("memcg: modify accounting function for supporting THP better") adds nr_pages to support multiple page size in memory_cgroup_charge_statistics. But counting the number of event nees abs(nr_pages) for increasing counters. This patch fixes event counting. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8493ae439f7038b502df1d687e61dde54c27ca92 |
|
02-Feb-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: never OOM when charging huge pages Huge page coverage should obviously have less priority than the continued execution of a process. Never kill a process when charging it a huge page fails. Instead, give up after the first failed reclaim attempt and fall back to regular pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
19942822df65ee4a47c2e6d6d70cace1b7f01710 |
|
02-Feb-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: prevent endless loop when charging huge pages to near-limit group If reclaim after a failed charging was unsuccessful, the limits are checked again, just in case they settled by means of other tasks. This is all fine as long as every charge is of size PAGE_SIZE, because in that case, being below the limit means having at least PAGE_SIZE bytes available. But with transparent huge pages, we may end up in an endless loop where charging and reclaim fail, but we keep going because the limits are not yet exceeded, although not allowing for a huge page. Fix this up by explicitely checking for enough room, not just whether we are within limits. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9221edb7120e2dc3ae90f1c58514979f7ba40e46 |
|
02-Feb-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: prevent endless loop when charging huge pages The charging code can encounter a charge size that is bigger than a regular page in two situations: one is a batched charge to fill the per-cpu stocks, the other is a huge page charge. This code is distributed over two functions, however, and only the outer one is aware of huge pages. In case the charging fails, the inner function will tell the outer function to retry if the charge size is bigger than regular pages--assuming batched charging is the only case. And the outer function will retry forever charging a huge page. This patch makes sure the inner function can distinguish between batch charging and a single huge page charge. It will only signal another attempt if batch charging failed, and go into regular reclaim when it is called on behalf of a huge page. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
552b372ba9db85751e7db2998f07cca2e51f5865 |
|
02-Feb-2011 |
Michal Hocko <mhocko@suse.cz> |
memsw: deprecate noswapaccount kernel parameter and schedule it for removal noswapaccount couldn't be used to control memsw for both on/off cases so we have added swapaccount[=0|1] parameter. This way we can turn the feature in two ways noswapaccount resp. swapaccount=0. We have kept the original noswapaccount but I think we should remove it after some time as it just makes more command line parameters without any advantages and also the code to handle parameters is uglier if we want both parameters. Signed-off-by: Michal Hocko <mhocko@suse.cz> Requested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
fceda1bf498677501befc7da72fd2e4de7f18466 |
|
02-Feb-2011 |
Michal Hocko <mhocko@suse.cz> |
memsw: handle swapaccount kernel parameter correctly __setup based kernel command line parameters handlers which are handled in obsolete_checksetup are provided with the parameter value including = (more precisely everything right after the parameter name). This means that the current implementation of swapaccount[=1|0] doesn't work at all because if there is a value for the parameter then we are testing for "0" resp. "1" but we are getting "=0" resp. "=1" and if there is no parameter value we are getting an empty string rather than NULL. The original noswapccount parameter, which doesn't care about the value, works correctly. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
52dbb9050936fd33ceb45f10529dbc992507c058 |
|
26-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix race at move_parent around compound_order() A fix up mem_cgroup_move_parent() which use compound_order() in asynchronous manner. This compound_order() may return unknown value because we don't take lock. Use PageTransHuge() and HPAGE_SIZE instead of it. Also clean up for mem_cgroup_move_parent(). - remove unnecessary initialization of local variable. - rename charge_size -> page_size - remove unnecessary (wrong) comment. - added a comment about THP. Note: Current design take compound_page_lock() in caller of move_account(). This should be revisited when we implement direct move_task of hugepage without splitting. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3d37c4a9199920964ffdfaec6335d93b9dcf9ca5 |
|
26-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: bugfix check mem_cgroup_disabled() at split fixup mem_cgroup_disabled() should be checked at splitting. If disabled, no heavy work is necesary. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
01c88e2d6b7330c0cc5867fe2297e7d826e1337d |
|
26-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix account leak at failure of memsw acconting Commit 4b53433468 ("memcg: clean up try_charge main loop") removes a cancel of charge at case: memory charge-> success. mem+swap charge-> failure. This leaks usage of memory. Fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: <stable@kernel.org> [2.6.36+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8dba474f034c322d96ada39cb20cac711d80dcb2 |
|
26-Jan-2011 |
Jesper Juhl <jj@chaosbits.net> |
mm/memcontrol.c: fix uninitialized variable use in mem_cgroup_move_parent() In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps to the 'put_back' label ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge); if (ret || !parent) goto put_back; where we'll if (charge > PAGE_SIZE) compound_unlock_irqrestore(page, flags); but, we have not assigned anything to 'flags' at this point, nor have we called 'compound_lock_irqsave()' (which is what sets 'flags'). The 'put_back' label should be moved below the call to compound_unlock_irqrestore() as per this patch. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
713735b4233fad3ae35b5cad656baa41413887ca |
|
20-Jan-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: correctly order reading PCG_USED and pc->mem_cgroup The placement of the read-side barrier is confused: the writer first sets pc->mem_cgroup, then PCG_USED. The read-side barrier has to be between testing PCG_USED and reading pc->mem_cgroup. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
987eba66e0e6aa654d60881a14731a353ee0acb4 |
|
20-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix rmdir, force_empty with THP Now, when THP is enabled, memcg's rmdir() function is broken because move_account() for THP page is not supported. This will cause account leak or -EBUSY issue at rmdir(). This patch fixes the issue by supporting move_account() THP pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ece35ca810326946ddc930c43356312ad5de44d4 |
|
20-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix LRU accounting with THP memory cgroup's LRU stat should take care of size of pages because Transparent Hugepage inserts hugepage into LRU. If this value is the number wrong, memory reclaim will not work well. Note: only head page of THP's huge page is linked into LRU. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ca3e021417eed30ec2b64ce88eb0acf64aa9bc29 |
|
20-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix USED bit handling at uncharge in THP Now, under THP: at charge: - PageCgroupUsed bit is set to all page_cgroup on a hugepage. ....set to 512 pages. at uncharge - PageCgroupUsed bit is unset on the head page. So, some pages will remain with "Used" bit. This patch fixes that Used bit is set only to the head page. Used bits for tail pages will be set at splitting if necessary. This patch adds this lock order: compound_lock() -> page_cgroup_move_lock(). [akpm@linux-foundation.org: fix warning] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e401f1761c0b01966e36e41e2c385d455a7b44ee |
|
20-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: modify accounting function for supporting THP better mem_cgroup_charge_statisics() was designed for charging a page but now, we have transparent hugepage. To fix problems (in following patch) it's required to change the function to get the number of pages as its arguments. The new function gets following as argument. - type of page rather than 'pc' - size of page which is accounted. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
50de1dd967d4ba3b8a90ebe7a4f5feca24191317 |
|
14-Jan-2011 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix memory migration of shmem swapcache In the current implementation mem_cgroup_end_migration() decides whether the page migration has succeeded or not by checking "oldpage->mapping". But if we are tring to migrate a shmem swapcache, the page->mapping of it is NULL from the begining, so the check would be invalid. As a result, mem_cgroup_end_migration() assumes the migration has succeeded even if it's not, so "newpage" would be freed while it's not uncharged. This patch fixes it by passing mem_cgroup_end_migration() the result of the page migration. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
17295c88a160c6eea3fcf46cec9d08a0fcb02db9 |
|
14-Jan-2011 |
Jesper Juhl <jj@chaosbits.net> |
memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then followed by memset() to zero the memory. This can be more efficiently achieved by using kzalloc() and vzalloc(). There's also one situation where we can use kzalloc_node() - this is what's new in this version of the patch. Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
dfe076b0971a783469bc2066e85d46e23c8acb1c |
|
14-Jan-2011 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix deadlock between cpuset and memcg Commit b1dd693e ("memcg: avoid deadlock between move charge and try_charge()") can cause another deadlock about mmap_sem on task migration if cpuset and memcg are mounted onto the same mount point. After the commit, cgroup_attach_task() has sequence like: cgroup_attach_task() ss->can_attach() cpuset_can_attach() mem_cgroup_can_attach() down_read(&mmap_sem) (1) ss->attach() cpuset_attach() mpol_rebind_mm() down_write(&mmap_sem) (2) up_write(&mmap_sem) cpuset_migrate_mm() do_migrate_pages() down_read(&mmap_sem) up_read(&mmap_sem) mem_cgroup_move_task() mem_cgroup_clear_mc() up_read(&mmap_sem) We can cause deadlock at (2) because we've already aquire the mmap_sem at (1). But the commit itself is necessary to fix deadlocks which have existed before the commit like: Ex.1) move charge | try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() | down_write(&mmap_sem) mc.moving_task = current | .. mem_cgroup_precharge_mc() | __mem_cgroup_try_charge() mem_cgroup_count_precharge() | prepare_to_wait() down_read(&mmap_sem) | if (mc.moving_task) -> cannot aquire the lock | -> true | schedule() | -> move charge should wake it up Ex.2) move charge | try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() | mc.moving_task = current | mem_cgroup_precharge_mc() | mem_cgroup_count_precharge() | down_read(&mmap_sem) | .. | up_read(&mmap_sem) | | down_write(&mmap_sem) mem_cgroup_move_task() | .. mem_cgroup_move_charge() | __mem_cgroup_try_charge() down_read(&mmap_sem) | prepare_to_wait() -> cannot aquire the lock | if (mc.moving_task) | -> true | schedule() | -> move charge should wake it up This patch fixes all of these problems by: 1. revert the commit. 2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge() has released the mmap_sem. 3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in mem_cgroup_move_charge() and, if it has failed to aquire the lock, cancel all extra charges, wake up all waiters, and retry trylock. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
043d18b1e5bdfc4870b8a19d00f0d5c636a5c231 |
|
14-Jan-2011 |
Minchan Kim <minchan.kim@gmail.com> |
memcg: remove unnecessary return from void-returning mem_cgroup_del_lru_list() Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f3e8eb70b1807d1b30aa6972af0cf30077c40112 |
|
14-Jan-2011 |
Johannes Weiner <hannes@cmpxchg.org> |
memcg: fix unit mismatch in memcg oom limit calculation Adding the number of swap pages to the byte limit of a memory control group makes no sense. Convert the pages to bytes before adding them. The only user of this code is the OOM killer, and the way it is used means that the error results in a higher OOM badness value. Since the cgroup limit is the same for all tasks in the cgroup, the error should have no practical impact at the moment. But let's not wait for future or changing users to trip over it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
dbd4ea78f002df283c95d9774837041735fa1bf9 |
|
14-Jan-2011 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: add lock to synchronize page accounting and migration Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page accounting and migration code. This reworks the locking scheme of _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable. 1. If pages are being migrated from a memcg, then updates to that memcg page statistics are protected by grabbing PCG_MOVE_LOCK using move_lock_page_cgroup(). In an upcoming commit, memcg dirty page accounting will be updating memcg page accounting (specifically: num writeback pages) from IRQ context (softirq). Avoid a deadlocking nested spin lock attempt by disabling irq on the local processor when grabbing the PCG_MOVE_LOCK. 2. lock for update_page_stat is used only for avoiding race with move_account(). So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem is between mem_cgroup_update_page_stat() and mem_cgroup_move_account_page(). Trade-off: * Changing lock_page_cgroup() to always disable IRQ (or local_bh) has some impacts on performance and I think it's bad to disable IRQ when it's not necessary. * adding a new lock makes move_account() slower. Score is here. Performance Impact: moving a 8G anon process. Before: real 0m0.792s user 0m0.000s sys 0m0.780s After: real 0m0.854s user 0m0.000s sys 0m0.842s This score is bad but planned patches for optimization can reduce this impact. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2a7106f2cb0768d00fe8c1eb42a754a7d8518f08 |
|
14-Jan-2011 |
Greg Thelen <gthelen@google.com> |
memcg: create extensible page stat update routines Replace usage of the mem_cgroup_update_file_mapped() memcg statistic update routine with two new routines: * mem_cgroup_inc_page_stat() * mem_cgroup_dec_page_stat() As before, only the file_mapped statistic is managed. However, these more general interfaces allow for new statistics to be more easily added. New statistics are added with memcg dirty page accounting. Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Andrea Righi <arighi@develer.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
37c2ac7872a9387542616f658d20ac25f5bdb32e |
|
14-Jan-2011 |
Andrea Arcangeli <aarcange@redhat.com> |
thp: compound_trans_order Read compound_trans_order safe. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2c888cfbc1b45508a44763d85ba2e8ac43faff5f |
|
14-Jan-2011 |
Rik van Riel <riel@redhat.com> |
thp: fix anon memory statistics with transparent hugepages Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU statistics, so the Active(anon) and Inactive(anon) statistics in /proc/meminfo are correct. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
152c9ccb75548c027fa3103efa4fa4e19a345449 |
|
14-Jan-2011 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
thp: transhuge-memcg: commit tail pages at charge By this patch, when a transparent hugepage is charged, not only the head page but also all the tail pages are committed, IOW pc->mem_cgroup and pc->flags of tail pages are set. Without this patch: - Tail pages are not linked to any memcg's LRU at splitting. This causes many problems, for example, the charged memcg's directory can never be rmdir'ed because it doesn't have enough pages to scan to make the usage decrease to 0. - "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes in root cgroup is calculated by the stat not by res_counter(since 2.6.32), it would be incorrect too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ec1685109f1314a30919489ef2800ed626a38c1e |
|
14-Jan-2011 |
Andrea Arcangeli <aarcange@redhat.com> |
thp: memcg compound Teach memcg to charge/uncharge compound pages. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ebb76ce16daf6908dc030dec1c00827d37129fe5 |
|
29-Dec-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix wrong VM_BUG_ON() in try_charge()'s mm->owner check At __mem_cgroup_try_charge(), VM_BUG_ON(!mm->owner) is checked. But as commented in mem_cgroup_from_task(), mm->owner can be NULL in some racy case. This check of VM_BUG_ON() is bad. A possible story to hit this is at swapoff()->try_to_unuse(). It passes mm_struct to mem_cgroup_try_charge_swapin() while mm->owner is NULL. If we can't get proper mem_cgroup from swap_cgroup information, mm->owner is used as charge target and we see NULL. Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a42c390cfa0c2612459d7226ba11612847ca3a64 |
|
24-Nov-2010 |
Michal Hocko <mhocko@suse.cz> |
cgroups: make swap accounting default behavior configurable Swap accounting can be configured by CONFIG_CGROUP_MEM_RES_CTLR_SWAP configuration option and then it is turned on by default. There is a boot option (noswapaccount) which can disable this feature. This makes it hard for distributors to enable the configuration option as this feature leads to a bigger memory consumption and this is a no-go for general purpose distribution kernel. On the other hand swap accounting may be very usuful for some workloads. This patch adds a new configuration option which controls the default behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED). If the option is selected then the feature is turned on by default. It also adds a new boot parameter swapaccount[=1|0] which enhances the original noswapaccount parameter semantic by means of enable/disable logic (defaults to 1 if no value is provided to be still consistent with noswapaccount). The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well) Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b1dd693e5b9348bd68a80e679e03cf9c0973b01b |
|
24-Nov-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: avoid deadlock between move charge and try_charge() __mem_cgroup_try_charge() can be called under down_write(&mmap_sem)(e.g. mlock does it). This means it can cause deadlock if it races with move charge: Ex.1) move charge | try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() | down_write(&mmap_sem) mc.moving_task = current | .. mem_cgroup_precharge_mc() | __mem_cgroup_try_charge() mem_cgroup_count_precharge() | prepare_to_wait() down_read(&mmap_sem) | if (mc.moving_task) -> cannot aquire the lock | -> true | schedule() Ex.2) move charge | try charge --------------------------------------+------------------------------ mem_cgroup_can_attach() | mc.moving_task = current | mem_cgroup_precharge_mc() | mem_cgroup_count_precharge() | down_read(&mmap_sem) | .. | up_read(&mmap_sem) | | down_write(&mmap_sem) mem_cgroup_move_task() | .. mem_cgroup_move_charge() | __mem_cgroup_try_charge() down_read(&mmap_sem) | prepare_to_wait() -> cannot aquire the lock | if (mc.moving_task) | -> true | schedule() To avoid this deadlock, we do all the move charge works (both can_attach() and attach()) under one mmap_sem section. And after this patch, we set/clear mc.moving_task outside mc.lock, because we use the lock only to check mc.from/to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
112bc2e120a94a511858918d6866a4978f9c500e |
|
24-Nov-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: fix false positive VM_BUG on non-SMP Fix this: kernel BUG at mm/memcontrol.c:2155! invalid opcode: 0000 [#1] last sysfs file: Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs EIP: 0060:[<c10731b2>] EFLAGS: 00000246 CPU: 0 EIP is at mem_cgroup_move_account+0xe2/0xf0 EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000 ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000) Stack: 00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000 08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50 c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340 Call Trace: [<c1074d37>] ? mem_cgroup_move_charge_pte_range+0xe7/0x130 [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130 [<c106c75e>] ? walk_page_range+0xee/0x1d0 [<c10725d6>] ? mem_cgroup_move_task+0x66/0x90 [<c1074c50>] ? mem_cgroup_move_charge_pte_range+0x0/0x130 [<c1072570>] ? mem_cgroup_move_task+0x0/0x90 [<c1042616>] ? cgroup_attach_task+0x136/0x200 [<c1042878>] ? cgroup_tasks_write+0x48/0xc0 [<c1041e9e>] ? cgroup_file_write+0xde/0x220 [<c101398d>] ? do_page_fault+0x17d/0x3f0 [<c108a79d>] ? alloc_fd+0x2d/0xd0 [<c1041dc0>] ? cgroup_file_write+0x0/0x220 [<c1077ba2>] ? vfs_write+0x92/0xc0 [<c1077c81>] ? sys_write+0x41/0x70 [<c1140e3d>] ? syscall_call+0x7/0xb Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b <0f> 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3 EIP: [<c10731b2>] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30 ---[ end trace 7daa1582159b6532 ]--- lock_page_cgroup and unlock_page_cgroup are implemented using bit_spinlock. bit_spinlock doesn't touch the bit if we are on non-SMP machine, so we can't use the bit to check whether the lock was taken. Let's introduce is_page_cgroup_locked based on bit_spin_is_locked instead of PageCgroupLocked to fix it. [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d2e61b8dc99fdb36e0fd176e25365f69afda4ff9 |
|
11-Nov-2010 |
Dan Carpenter <error27@gmail.com> |
memcg: null dereference on allocation failure The original code had a null dereference if alloc_percpu() failed. This was introduced in commit 711d3d2c9bc3 ("memcg: cpu hotplug aware percpu count updates") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
26174efd42100eefac67732c0c12f41a205fa335 |
|
28-Oct-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: generic filestat update interface This patch extracts the core logic from mem_cgroup_update_file_mapped() as mem_cgroup_update_file_stat() and adds a wrapper. As a planned future update, memory cgroup has to count dirty pages to implement dirty_ratio/limit. And more, the number of dirty pages is required to kick flusher thread to start writeback. (Now, no kick.) This patch is preparation for it and makes other statistics implementation clearer. Just a clean up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1489ebad8b5b20300562f634f279cb9c435fd90b |
|
28-Oct-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: cpu hotplug aware quick acount_move detection An event counter MEM_CGROUP_ON_MOVE is used for quick check whether file stat update can be done in async manner or not. Now, it use percpu counter and for_each_possible_cpu to update. This patch replaces for_each_possible_cpu to for_each_online_cpu and adds necessary synchronization logic at CPU HOTPLUG. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
711d3d2c9bc3fb7cb5116352fecdb5b4adb6db6e |
|
28-Oct-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: cpu hotplug aware percpu count updates Now, memcgroup's per cpu coutner uses for_each_possible_cpu() to get the value. It's better to use for_each_online_cpu() and a cpu hotplug handler. This patch only handles statistics counter. MEM_CGROUP_ON_MOVE will be handled in another patch. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7d74b06f240f1bd1b4b68dd6fe84164d8bf4e315 |
|
28-Oct-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: use for_each_mem_cgroup In memory cgroup management, we sometimes have to walk through subhierarchy of cgroup to gather informaiton, or lock something, etc. Now, to do that, mem_cgroup_walk_tree() function is provided. It calls given callback function per cgroup found. But the bad thing is that it has to pass a fixed style function and argument, "void*" and it adds much type casting to memcontrol.c. To make the code clean, this patch replaces walk_tree() with for_each_mem_cgroup_tree(iter, root) An iterator style call. The good point is that iterator call doesn't have to assume what kind of function is called under it. A bad point is that it may cause reference-count leak if a caller use "break" from the loop by mistake. I think the benefit is larger. The modified code seems straigtforward and easy to read because we don't have misterious callbacks and pointer cast. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
32047e2a85f06633ee4c53e2d0346fbcd34e480b |
|
28-Oct-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: avoid lock in updating file_mapped (Was fix race in file_mapped accouting flag management At accounting file events per memory cgroup, we need to find memory cgroup via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup() for guarantee pc->mem_cgroup is not overwritten while we make use of it. But, considering the context which page-cgroup for files are accessed, we can use alternative light-weight mutual execusion in the most case. At handling file-caches, the only race we have to take care of is "moving" account, IOW, overwriting page_cgroup->mem_cgroup. (See comment in the patch) Unlike charge/uncharge, "move" happens not so frequently. It happens only when rmdir() and task-moving (with a special settings.) This patch adds a race-checker for file-cache-status accounting v.s. account moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added. The routine for account move 1. Increment it before start moving 2. Call synchronize_rcu() 3. Decrement it after the end of moving. By this, file-status-counting routine can check it needs to call lock_page_cgroup(). In most case, I doesn't need to call it. Following is a perf data of a process which mmap()/munmap 32MB of file cache in a minute. Before patch: 28.25% mmap mmap [.] main 22.64% mmap [kernel.kallsyms] [k] page_fault 9.96% mmap [kernel.kallsyms] [k] mem_cgroup_update_file_mapped 3.67% mmap [kernel.kallsyms] [k] filemap_fault 3.50% mmap [kernel.kallsyms] [k] unmap_vmas 2.99% mmap [kernel.kallsyms] [k] __do_fault 2.76% mmap [kernel.kallsyms] [k] find_get_page After patch: 30.00% mmap mmap [.] main 23.78% mmap [kernel.kallsyms] [k] page_fault 5.52% mmap [kernel.kallsyms] [k] mem_cgroup_update_file_mapped 3.81% mmap [kernel.kallsyms] [k] unmap_vmas 3.26% mmap [kernel.kallsyms] [k] find_get_page 3.18% mmap [kernel.kallsyms] [k] __do_fault 3.03% mmap [kernel.kallsyms] [k] filemap_fault 2.40% mmap [kernel.kallsyms] [k] handle_mm_fault 2.40% mmap [kernel.kallsyms] [k] do_page_fault This patch reduces memcg's cost to some extent. (mem_cgroup_update_file_mapped is called by both of map/unmap) Note: It seems some more improvements are required..but no idea. maybe removing set/unset flag is required. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0c270f8f9988fb0d93ea214fdcff7ab90eb3d894 |
|
28-Oct-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix race in file_mapped accouting flag management Presently memory cgroup accounts file-mapped by counter and flag. counter is working in the same way with zone_stat but FileMapped flag only exists in memcg (for helping move_account). This flag can be updated wrongly in a case. Assume CPU0 and CPU1 and a thread mapping a page on CPU0, another thread unmapping it on CPU1. CPU0 CPU1 rmv rmap (mapcount 1->0) add rmap (mapcount 0->1) lock_page_cgroup() memcg counter+1 (some delay) set MAPPED FLAG. unlock_page_cgroup() lock_page_cgroup() memcg counter-1 clear MAPPED flag In the above sequence counter is properly updated but FLAG is not. This means that representing a state by a flag which is maintained by counter needs some special care. To handle this, when clearing a flag, this patch check mapcount directly and clear the flag only when mapcount == 0. (if mapcount >0, someone will make it to zero later and flag will be cleared.) Reverse case, dec-after-inc cannot be a problem because page_table_lock() works well for it. (IOW, to make above sequence, 2 processes should touch the same page at once with map/unmap.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ad4ca5f4b70236dab5e457ff6567d36f75d2e7c5 |
|
07-Oct-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: fix thresholds with use_hierarchy == 1 We need to check parent's thresholds if parent has use_hierarchy == 1 to be sure that parent's threshold events will be triggered even if parent itself is not active (no MEM_CGROUP_EVENTS). Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
13d7e3a2dba6a79589ed34dc0b9114d7b5ff9eab |
|
11-Aug-2010 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: convert to use zone_to_nid() from bare zone->zone_pgdat->node_id We have zone_to_nid(). this patch convert all existing users of zone->zone_pgdat->node_id. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
00918b6ab89df8984ca06397cb77994dabd73f9b |
|
11-Aug-2010 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: remove nid and zid argument from mem_cgroup_soft_limit_reclaim() mem_cgroup_soft_limit_reclaim() has zone, nid and zid argument. but nid and zid can be calculated from zone. So remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
14fec79680f7cc4617d6ba69324e63d4a732986c |
|
11-Aug-2010 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: mem_cgroup_shrink_node_zone() doesn't need sc.nodemask Currently mem_cgroup_shrink_node_zone() call shrink_zone() directly. thus it doesn't need to initialize sc.nodemask because shrink_zone() doesn't use it at all. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f75ca962037ffd639a44fd88933cd9b84c4c4411 |
|
11-Aug-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: avoid css_get() Now, memory cgroup increments css(cgroup subsys state)'s reference count per a charged page. And the reference count is kept until the page is uncharged. But this has 2 bad effect. 1. Because css_get/put calls atomic_inc()/dec, heavy call of them on large smp will not scale well. 2. Because css's refcnt cannot be in a state as "ready-to-release", cgroup's notify_on_release handler can't work with memcg. 3. css's refcnt is atomic_t, it means smaller than 32bit. Maybe too small. This has been a problem since the 1st merge of memcg. This is a trial to remove css's refcnt per a page. Even if we remove refcnt, pre_destroy() does enough synchronization as - check res->usage == 0. - check no pages on LRU. This patch removes css's refcnt per page. Even after this patch, at the 1st look, it seems css_get() is still called in try_charge(). But the logic is. - If a memcg of mm->owner is cached one, consume_stock() will work. At success, return immediately. - If consume_stock returns false, css_get() is called and go to slow path which may be blocked. At the end of slow path, css_put() is called and restart from the start if necessary. So, in the fast path, we don't call css_get() and can avoid access to shared counter. This patch can make the most possible case fast. Here is a result of multi-threaded page fault benchmark. [Before] 25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c 9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave 8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm <=====(*) 7.83% multi-fault-all [kernel.kallsyms] [k] down_read_trylock 5.38% multi-fault-all [kernel.kallsyms] [k] __css_put 5.29% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask 4.92% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq 4.24% multi-fault-all [kernel.kallsyms] [k] up_read 3.53% multi-fault-all [kernel.kallsyms] [k] css_put 2.11% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault 1.76% multi-fault-all [kernel.kallsyms] [k] __rmqueue 1.64% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge [After] 28.41% multi-fault-all [kernel.kallsyms] [k] clear_page_c 10.08% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irq 9.58% multi-fault-all [kernel.kallsyms] [k] down_read_trylock 9.38% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave 5.86% multi-fault-all [kernel.kallsyms] [k] __alloc_pages_nodemask 5.65% multi-fault-all [kernel.kallsyms] [k] up_read 2.82% multi-fault-all [kernel.kallsyms] [k] handle_mm_fault 2.64% multi-fault-all [kernel.kallsyms] [k] mem_cgroup_add_lru_list 2.48% multi-fault-all [kernel.kallsyms] [k] __mem_cgroup_commit_charge Then, 8.02% of try_get_mem_cgroup_from_mm() disappears because this patch removes css_tryget() in it. (But yes, this is an extreme case.) Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
158e0a2d1b3cffed8b46cbc56393a1394672ef79 |
|
11-Aug-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: use find_lock_task_mm() in memory cgroups oom When the OOM killer scans task, it check a task is under memcg or not when it's called via memcg's context. But, as Oleg pointed out, a thread group leader may have NULL ->mm and task_in_mem_cgroup() may do wrong decision. We have to use find_lock_task_mm() in memcg as generic OOM-Killer does. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
73045c47b6facbdf4656e6763c8cb469de4337e2 |
|
11-Aug-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: remove mem from arg of charge_common mem_cgroup_charge_common() is always called with @mem = NULL, so it's meaningless. This patch removes it. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bd0d24bfe8a8f8d2400569740874a67d164d40a9 |
|
11-Aug-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: remove redundant code - try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we don't have to call them in task_in_mem_cgroup(). - *mz is not used in __mem_cgroup_uncharge_common(). - we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration() after we've cleared PCG_MIGRATION of @oldpage. - remove empty comment. - remove redundant empty line in mem_cgroup_cache_charge(). Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2bd9bb206b338888b226e70139a25a67d10007f0 |
|
11-Aug-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: clean up waiting move acct Now, for checking a memcg is under task-account-moving, we do css_tryget() against mc.to and mc.from. But this is just complicating things. This patch makes the check easier. This patch adds a spinlock to move_charge_struct and guard modification of mc.to and mc.from. By this, we don't have to think about complicated races arount this not-critical path. [balbir@linux.vnet.ibm.com: don't crash on a null memcg being passed] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4b53433468c87794b523e4683fbd4e8e8aca1f63 |
|
11-Aug-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: clean up try_charge main loop mem_cgroup_try_charge() has a big loop in it and seems to be hard to read. Most of routines are for slow path. This patch moves codes out from the loop and make it clear what's done. Summary: - refactoring a function to detect a memcg is under acccount move or not. - refactoring a function to wait for the end of moving task acct. - refactoring a main loop('s slow path) as a function and make it clear why we retry or quit by return code. - add fatal_signal_pending() check for bypassing charge loops. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cc8e970c3ce4d98afa8eb02dbd2526ce57f7611a |
|
10-Aug-2010 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: add mm_vmscan_memcg_isolate tracepoint Memcg also need to trace page isolation information as global reclaim. This patch does it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a63d83f427fbce97a6cea0db2e64b0eb8435cd10 |
|
10-Aug-2010 |
David Rientjes <rientjes@google.com> |
oom: badness heuristic rewrite This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
25edde0332916ae706ccf83de688be57bcc844b7 |
|
10-Aug-2010 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
vmscan: kill prev_priority completely Since 2.6.28 zone->prev_priority is unused. Then it can be removed safely. It reduce stack usage slightly. Now I have to say that I'm sorry. 2 years ago, I thought prev_priority can be integrate again, it's useful. but four (or more) times trying haven't got good performance number. Thus I give up such approach. The rest of this changelog is notes on prev_priority and why it existed in the first place and why it might be not necessary any more. This information is based heavily on discussions between Andrew Morton, Rik van Riel and Kosaki Motohiro who is heavily quotes from. Historically prev_priority was important because it determined when the VM would start unmapping PTE pages. i.e. there are no balances of note within the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there is a potential risk of unnecessarily increasing minor faults as a large amount of read activity of use-once pages could push mapped pages to the end of the LRU and get unmapped. There is no proof this is still a problem but currently it is not considered to be. Active files are not deactivated if the active file list is smaller than the inactive list reducing the liklihood that file-mapped pages are being pushed off the LRU and referenced executable pages are kept on the active list to avoid them getting pushed out by read activity. Even if it is a problem, prev_priority prev_priority wouldn't works nowadays. First of all, current vmscan still a lot of UP centric code. it expose some weakness on some dozens CPUs machine. I think we need more and more improvement. The problem is, current vmscan mix up per-system-pressure, per-zone-pressure and per-task-pressure a bit. example, prev_priority try to boost priority to other concurrent priority. but if the another task have mempolicy restriction, it is unnecessary, but also makes wrong big latency and exceeding reclaim. per-task based priority + prev_priority adjustment make the emulation of per-system pressure. but it have two issue 1) too rough and brutal emulation 2) we need per-zone pressure, not per-system. Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about 2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer. but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the system have higher memory pressure than priority==0 (1/4096*10,000 > 2). prev_priority can't solve such multithreads workload issue. In other word, prev_priority concept assume the sysmtem don't have lots threads." Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michael Rubin <mrubin@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4d845ebf4cf9e985b1704b1f08b37f744b4ede13 |
|
30-Jun-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix wake up in oom wait queue OOM-waitqueue should be waken up when oom_disable is canceled. This is a fix for 3c11ecf448eff8f1 ("memcg: oom kill disable and oom status"). How to test: Create a cgroup A... 1. set memory.limit and memory.memsw.limit to be small value 2. echo 1 > /cgroup/A/memory.oom_control, this disables oom-kill. 3. run a program which must cause OOM. A program executed in 3 will sleep by oom_waiqueue in memcg. Then, how to wake it up is problem. 1. echo 0 > /cgroup/A/memory.oom_control (enable OOM-killer) 2. echo big mem > /cgroup/A/memory.memsw.limit_in_bytes(allow more swap) etc.. Without the patch, a task in slept can not be waken up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2c488db27b614816024e7994117f599337de0f34 |
|
26-May-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: clean up memory thresholds Introduce struct mem_cgroup_thresholds. It helps to reduce number of checks of thresholds type (memory or mem+swap). [akpm@linux-foundation.org: repair comment] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
907860ed381a31b0102f362df67c1c5cae6ef050 |
|
26-May-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
cgroups: make cftype.unregister_event() void-returning Since we are unable to handle an error returned by cftype.unregister_event() properly, let's make the callback void-returning. mem_cgroup_unregister_event() has been rewritten to be a "never fail" function. On mem_cgroup_usage_register_event() we save old buffer for thresholds array and reuse it in mem_cgroup_usage_unregister_event() to avoid allocation. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ac39cf8cb86c45eeac6a592ce0d58f9021a97235 |
|
26-May-2010 |
akpm@linux-foundation.org <akpm@linux-foundation.org> |
memcg: fix mis-accounting of file mapped racy with migration FILE_MAPPED per memcg of migrated file cache is not properly updated, because our hook in page_add_file_rmap() can't know to which memcg FILE_MAPPED should be counted. Basically, this patch is for fixing the bug but includes some big changes to fix up other messes. Now, at migrating mapped file, events happen in following sequence. 1. allocate a new page. 2. get memcg of an old page. 3. charge ageinst a new page before migration. But at this point, no changes to new page's page_cgroup, no commit for the charge. (IOW, PCG_USED bit is not set.) 4. page migration replaces radix-tree, old-page and new-page. 5. page migration remaps the new page if the old page was mapped. 6. Here, the new page is unlocked. 7. memcg commits the charge for newpage, Mark the new page's page_cgroup as PCG_USED. Because "commit" happens after page-remap, we can count FILE_MAPPED at "5", because we should avoid to trust page_cgroup->mem_cgroup. if PCG_USED bit is unset. (Note: memcg's LRU removal code does that but LRU-isolation logic is used for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is not on LRU or page_cgroup->mem_cgroup is NULL.) We can lose file_mapped accounting information at 5 because FILE_MAPPED is updated only when mapcount changes 0->1. So we should catch it. BTW, historically, above implemntation comes from migration-failure of anonymous page. Because we charge both of old page and new page with mapcount=0, we can't catch - the page is really freed before remap. - migration fails but it's freed before remap or .....corner cases. New migration sequence with memcg is: 1. allocate a new page. 2. mark PageCgroupMigration to the old page. 3. charge against a new page onto the old page's memcg. (here, new page's pc is marked as PageCgroupUsed.) 4. page migration replaces radix-tree, page table, etc... 5. At remapping, new page's page_cgroup is now makrked as "USED" We can catch 0->1 event and FILE_MAPPED will be properly updated. And we can catch SWAPOUT event after unlock this and freeing this page by unmap() can be caught. 7. Clear PageCgroupMigration of the old page. So, FILE_MAPPED will be correctly updated. Then, for what MIGRATION flag is ? Without it, at migration failure, we may have to charge old page again because it may be fully unmapped. "charge" means that we have to dive into memory reclaim or something complated. So, it's better to avoid charge it again. Before this patch, __commit_charge() was working for both of the old/new page and fixed up all. But this technique has some racy condtion around FILE_MAPPED and SWAPOUT etc... Now, the kernel use MIGRATION flag and don't uncharge old page until the end of migration. I hope this change will make memcg's page migration much simpler. This page migration has caused several troubles. Worth to add a flag for simplification. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
315c1998e10527ff364a9883048455e609bc7232 |
|
26-May-2010 |
Phil Carmody <ext-phil.2.carmody@nokia.com> |
mm: memcontrol - uninitialised return value Only an out of memory error will cause ret to be set. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
5407a56257b6ade44fd9bcac972c99845b7413cd |
|
26-May-2010 |
Phil Carmody <ext-phil.2.carmody@nokia.com> |
mm: remove unnecessary use of atomic The bottom 4 hunks are atomically changing memory to which there are no aliases as it's freshly allocated, so there's no need to use atomic operations. The other hunks are just atomic_read and atomic_set, and do not involve any read-modify-write. The use of atomic_{read,set} doesn't prevent a read/write or write/write race, so if a race were possible (I'm not saying one is), then it would still be there even with atomic_set. See: http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/ Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
87946a72283be3de936adc754b7007df7d3e6aeb |
|
26-May-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: move charge of file pages This patch adds support for moving charge of file pages, which include normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting bit 1 of <target cgroup>/memory.move_charge_at_immigrate. Unlike the case of anonymous pages, file pages(and swaps) in the range mmapped by the task will be moved even if the task hasn't done page fault, i.e. they might not be the task's "RSS", but other task's "RSS" that maps the same file. And mapcount of the page is ignored(the page can be moved even if page_mapcount(page) > 1). So, conditions that the page/swap should be met to be moved is that it must be in the range mmapped by the target task and it must be charged to the old cgroup. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix warning] Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
90254a65833b67502d14736410b3857a15535c67 |
|
26-May-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: clean up move charge This patch cleans up move charge code by: - define functions to handle pte for each types, and make is_target_pte_for_mc() cleaner. - instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function that checks the bit. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3c11ecf448eff8f12922c498b8274ce98587eb74 |
|
26-May-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: oom kill disable and oom status This adds a feature to disable oom-killer for memcg, if disabled, of course, tasks under memcg will stop. But now, we have oom-notifier for memcg. And the world around memcg is not under out-of-memory. memcg's out-of-memory just shows memcg hits limit. Then, administrator or management daemon can recover the situation by - kill some process - enlarge limit, add more swap. - migrate some tasks - remove file cache on tmps (difficult ?) Unlike oom-killer, you can take enough information before killing tasks. (by gcore, or, ps etc.) [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9490ff275606da012d5b373342a49610ad61cb81 |
|
26-May-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: oom notifier Considering containers or other resource management softwares in userland, event notification of OOM in memcg should be implemented. Now, memcg has "threshold" notifier which uses eventfd, we can make use of it for oom notification. This patch adds oom notification eventfd callback for memcg. The usage is very similar to threshold notifier, but control file is memory.oom_control and no arguments other than eventfd is required. % cgroup_event_notifier /cgroup/A/memory.oom_control dummy (About cgroup_event_notifier, see Documentation/cgroup/) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
dc98df5a1b7be402a0e1c71f1b89ccf249ac15ee |
|
26-May-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: oom wakeup filter memcg's oom waitqueue is a system-wide wait_queue (for handling hierarchy.) So, it's better to add custom wake function and do filtering in wake up path. This patch adds a filtering feature for waking up oom-waiters. Hierarchy is properly handled. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f39d01be4c59a61a08d0cb53f615e7016b85d339 |
|
20-May-2010 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits) vlynq: make whole Kconfig-menu dependant on architecture add descriptive comment for TIF_MEMDIE task flag declaration. EEPROM: max6875: Header file cleanup EEPROM: 93cx6: Header file cleanup EEPROM: Header file cleanup agp: use NULL instead of 0 when pointer is needed rtc-v3020: make bitfield unsigned PCI: make bitfield unsigned jbd2: use NULL instead of 0 when pointer is needed cciss: fix shadows sparse warning doc: inode uses a mutex instead of a semaphore. uml: i386: Avoid redefinition of NR_syscalls fix "seperate" typos in comments cocbalt_lcdfb: correct sections doc: Change urls for sparse Powerpc: wii: Fix typo in comment i2o: cleanup some exit paths Documentation/: it's -> its where appropriate UML: Fix compiler warning due to missing task_struct declaration UML: add kernel.h include to signal.c ...
|
747388d78a0ae768fd82b55c4ed38aa646a72364 |
|
11-May-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix css_is_ancestor() RCU locking Some callers (in memcontrol.c) calls css_is_ancestor() without rcu_read_lock. Because css_is_ancestor() has to access RCU protected data, it should be under rcu_read_lock(). This makes css_is_ancestor() itself does safe access to RCU protected area. (At least, "root" can have refcnt==0 if it's not an ancestor of "child". So, we need rcu_read_lock().) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7f0f15464185a92f9d8791ad231bcd7bf6df54e4 |
|
11-May-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix css_id() RCU locking for real Commit ad4ba375373937817404fd92239ef4cadbded23b ("memcg: css_id() must be called under rcu_read_lock()") modifies memcontol.c for fixing RCU check message. But Andrew Morton pointed out that the fix doesn't seems sane and it was just for hidining lockdep messages. This is a patch for do proper things. Checking again, all places, accessing without rcu_read_lock, that commit fixies was intentional.... all callers of css_id() has reference count on it. So, it's not necessary to be under rcu_read_lock(). Considering again, we can use rcu_dereference_check for css_id(). We know css->id is valid if css->refcnt > 0. (css->id never changes and freed after css->refcnt going to be 0.) This patch makes use of rcu_dereference_check() in css_id/depth and remove unnecessary rcu-read-lock added by the commit. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
91bc482ec5a615e8ecebc106aaf7d0c267d511de |
|
07-May-2010 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: create rcu_my_thread_group_empty() wrapper memcg: css_id() must be called under rcu_read_lock() cgroup: Check task_lock in task_subsys_state() sched: Fix an RCU warning in print_task() cgroup: Fix an RCU warning in alloc_css_id() cgroup: Fix an RCU warning in cgroup_path() KEYS: Fix an RCU warning in the reading of user keys KEYS: Fix an RCU warning
|
ad4ba375373937817404fd92239ef4cadbded23b |
|
23-Apr-2010 |
Paul E. McKenney <paulmck@linux.vnet.ibm.com> |
memcg: css_id() must be called under rcu_read_lock() This patch fixes task_in_mem_cgroup(), mem_cgroup_uncharge_swapcache(), mem_cgroup_move_swap_account(), and is_target_pte_for_mc() to protect calls to css_id(). An additional RCU lockdep splat was reported for memcg_oom_wake_function(), however, this function is not yet in mainline as of 2.6.34-rc5. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org>
/mm/memcontrol.c
|
93d5c9be1ddd57d4063ce463c9ac2be1e5ee14f1 |
|
23-Apr-2010 |
Andrea Arcangeli <aarcange@redhat.com> |
memcg: fix prepare migration If a signal is pending (task being killed by sigkill) __mem_cgroup_try_charge will write NULL into &mem, and css_put will oops on null pointer dereference. BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff810fc6cc>] mem_cgroup_prepare_migration+0x7c/0xc0 PGD a5d89067 PUD a5d8a067 PMD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/platform/microcode/firmware/microcode/loading CPU 0 Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc acpi_cpufreq pcspkr sg [last unloaded: microcode] Pid: 5299, comm: largepages Tainted: G W 2.6.34-rc3 #3 Penryn1600SLI-110dB/To Be Filled By O.E.M. RIP: 0010:[<ffffffff810fc6cc>] [<ffffffff810fc6cc>] mem_cgroup_prepare_migration+0x7c/0xc0 [nishimura@mxp.nes.nec.co.jp: fix merge issues] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6c9468e9eb1252eaefd94ce7f06e1be9b0b641b1 |
|
23-Apr-2010 |
Jiri Kosina <jkosina@suse.cz> |
Merge branch 'master' into for-next
|
8725d5416213a145ccc9c236dbd26830ba409e00 |
|
06-Apr-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix race in file_mapped accounting Presently, memcg's FILE_MAPPED accounting has following race with move_account (happens at rmdir()). increment page->mapcount (rmap.c) mem_cgroup_update_file_mapped() move_account() lock_page_cgroup() check page_mapped() if page_mapped(page)>1 { FILE_MAPPED -1 from old memcg FILE_MAPPED +1 to old memcg } ..... overwrite pc->mem_cgroup unlock_page_cgroup() lock_page_cgroup() FILE_MAPPED + 1 to pc->mem_cgroup unlock_page_cgroup() Then, old memcg (-1 file mapped) new memcg (+2 file mapped) This happens because move_account see page_mapped() which is not guarded by lock_page_cgroup(). This patch adds FILE_MAPPED flag to page_cgroup and move account information based on it. Now, all checks are synchronous with lock_page_cgroup(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e7bbcdf3747e3919c31cfa87853c69d178bce548 |
|
23-Mar-2010 |
Dan Carpenter <error27@gmail.com> |
memcontrol: fix potential null deref There was a potential null deref introduced in c62b1a3b31b5 ("memcg: use generic percpu instead of private implementation"). Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
5cfb80a73b5a52fb19d8b0611203e4dd58e8e9a2 |
|
23-Mar-2010 |
Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> |
memcg: disable move charge in no mmu case In commit 02491447 ("memcg: move charges of anonymous swap"), I tried to disable move charge feature in no mmu case by enclosing all the related functions with "#ifdef CONFIG_MMU", but the commit places these ifdefs in wrong place. (it seems that it's mangled while handling some fixes...) This patch fixes it up. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
320cc51d90832231cece478f0db6550ef367f8f3 |
|
15-Mar-2010 |
Greg Thelen <gthelen@google.com> |
mm: fix typo in refill_stock() comment Change refill_stock() comment: s/consumt_stock()/consume_stock()/ Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
/mm/memcontrol.c
|
867578cbccb0893cc14fc29c670f7185809c90d6 |
|
11-Mar-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix oom kill behavior In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a0a4db548edcce067c1201ef25cf2bc29f32dca4 |
|
11-Mar-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
cgroups: remove events before destroying subsystem state objects Events should be removed after rmdir of cgroup directory, but before destroying subsystem state objects. Let's take reference to cgroup directory dentry to do that. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d2265e6fa3f220ea5fd37522d13390e9675adcf7 |
|
11-Mar-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg : share event counter rather than duplicate Memcg has 2 eventcountes which counts "the same" event. Just usages are different from each other. This patch tries to reduce event counter. Now logic uses "only increment, no reset" counter and masks for each checks. Softlimit chesk was done per 1000 evetns. So, the similar check can be done by !(new_counter & 0x3ff). Threshold check was done per 100 events. So, the similar check can be done by (!new_counter & 0x7f) ALL event checks are done right after EVENT percpu counter is updated. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
430e48631e72aeab74d844c57b441f98a2e36eee |
|
11-Mar-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: update threshold and softlimit at commit Presently, move_task does "batched" precharge. Because res_counter or css's refcnt are not-scalable jobs for memcg, try_charge_().. tend to be done in batched manner if allowed. Now, softlimit and threshold check their event counter in try_charge, but the charge is not a per-page event. And event counter is not updated at charge(). Moreover, precharge doesn't pass "page" to try_charge() and softlimit tree will be never updated until uncharge() causes an event." So the best place to check the event counter is commit_charge(). This is per-page event by its nature. This patch move checks to there. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c62b1a3b31b5e27a6c5c2e91cc5ce05fdb6344d0 |
|
11-Mar-2010 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: use generic percpu instead of private implementation When per-cpu counter for memcg was implemneted, dynamic percpu allocator was not very good. But now, we have good one and useful macros. This patch replaces memcg's private percpu counter implementation with generic dynamic percpu allocator. The benefits are - We can remove private implementation. - The counters will be NUMA-aware. (Current one is not...) - This patch makes sizeof struct mem_cgroup smaller. Then, struct mem_cgroup may be fit in page size on small config. - About basic performance aspects, see below. [Before] # size mm/memcontrol.o text data bss dec hex filename 24373 2528 4132 31033 7939 mm/memcontrol.o [page-fault-throuput test on 8cpu/SMP in root cgroup] # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8 Performance counter stats for './multi-fault-fork 8' (5 runs): 45878618 page-faults ( +- 0.110% ) 602635826 cache-misses ( +- 0.105% ) 61.005373262 seconds time elapsed ( +- 0.004% ) Then cache-miss/page fault = 13.14 [After] #size mm/memcontrol.o text data bss dec hex filename 23913 2528 4132 30573 776d mm/memcontrol.o # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8 Performance counter stats for './multi-fault-fork 8' (5 runs): 48179400 page-faults ( +- 0.271% ) 588628407 cache-misses ( +- 0.136% ) 61.004615021 seconds time elapsed ( +- 0.004% ) Then cache-miss/page fault = 12.22 Text size is reduced. This performance improvement is not big and will be invisible in real world applications. But this result shows this patch has some good effect even on (small) SMP. Here is a test program I used. 1. fork() processes on each cpus. 2. do page fault repeatedly on each process. 3. after 60secs, kill all childredn and exit. (3 is necessary for getting stable data, this is improvement from previous one.) #define _GNU_SOURCE #include <stdio.h> #include <sched.h> #include <sys/mman.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <signal.h> #include <stdlib.h> /* * For avoiding contention in page table lock, FAULT area is * sparse. If FAULT_LENGTH is too large for your cpus, decrease it. */ #define FAULT_LENGTH (2 * 1024 * 1024) #define PAGE_SIZE 4096 #define MAXNUM (128) void alarm_handler(int sig) { } void *worker(int cpu, int ppid) { void *start, *end; char *c; cpu_set_t set; int i; CPU_ZERO(&set); CPU_SET(cpu, &set); sched_setaffinity(0, sizeof(set), &set); start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (start == MAP_FAILED) { perror("mmap"); exit(1); } end = start + FAULT_LENGTH; pause(); //fprintf(stderr, "run%d", cpu); while (1) { for (c = (char*)start; (void *)c < end; c += PAGE_SIZE) *c = 0; madvise(start, FAULT_LENGTH, MADV_DONTNEED); } return NULL; } int main(int argc, char *argv[]) { int num, i, ret, pid, status; int pids[MAXNUM]; if (argc < 2) return 0; setpgid(0, 0); signal(SIGALRM, alarm_handler); num = atoi(argv[1]); pid = getpid(); for (i = 0; i < num; ++i) { ret = fork(); if (!ret) { worker(i, pid); exit(0); } pids[i] = ret; } sleep(1); kill(-pid, SIGALRM); sleep(60); for (i = 0; i < num; i++) kill(pids[i], SIGKILL); for (i = 0; i < num; i++) waitpid(pids[i], &status, 0); return 0; } Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6a6135b64fda39d931a79090f4da37f1c6da4a8c |
|
11-Mar-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: typo in comment to mem_cgroup_print_oom_info() s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/ Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2e72b6347c9459e6cff5634ddc815485bae6985f |
|
11-Mar-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: implement memory thresholds It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. To register a threshold application need: - create an eventfd; - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to cgroup.event_control. Application will be notified through eventfd when memory usage crosses threshold in any direction. It's applicable for root and non-root cgroup. It uses stats to track memory usage, simmilar to soft limits. It checks if we need to send event to userspace on every 100 page in/out. I guess it's good compromise between performance and accuracy of thresholds. [akpm@linux-foundation.org: coding-style fixes] [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
378ce724bc2a0ef1243e11c09d58a70bb6be007a |
|
11-Mar-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: rework usage of stats by soft limit Instead of incrementing counter on each page in/out and comparing it with constant, we set counter to constant, decrement counter on each page in/out and compare it with zero. We want to make comparing as fast as possible. On many RISC systems (probably not only RISC) comparing with zero is more effective than comparing with a constant, since not every constant can be immediate operand for compare instruction. Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT, since really it's not a generic counter. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
104f39284e830f425085886ef72c49aee6631575 |
|
11-Mar-2010 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: extract mem_group_usage() from mem_cgroup_read() Helper to get memory or mem+swap usage of the cgroup. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Vladislav Buzov <vbuzov@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Alexander Shishkin <virtuoso@slind.org> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
483c30b514bd3037fa3f19fa42327c94c10f51c8 |
|
11-Mar-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: improve performance in moving swap charge Try to reduce overheads in moving swap charge by: - Adds a new function(__mem_cgroup_put), which takes "count" as a arg and decrement mem->refcnt by "count". - Removed res_counter_uncharge, css_put, and mem_cgroup_put from the path of moving swap account, and consolidate all of them into mem_cgroup_clear_mc. We cannot do that about mc.to->refcnt. These changes reduces the overhead from 1.35sec to 0.9sec to move charges of 1G anonymous memory(including 500MB swap) in my test environment. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
024914477e15ef8b17f271ec47f1bb8a589f0806 |
|
11-Mar-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: move charges of anonymous swap This patch is another core part of this move-charge-at-task-migration feature. It enables moving charges of anonymous swaps. To move the charge of swap, we need to exchange swap_cgroup's record. In current implementation, swap_cgroup's record is protected by: - page lock: if the entry is on swap cache. - swap_lock: if the entry is not on swap cache. This works well in usual swap-in/out activity. But this behavior make the feature of moving swap charge check many conditions to exchange swap_cgroup's record safely. So I changed modification of swap_cgroup's recored(swap_cgroup_record()) to use xchg, and define a new function to cmpxchg swap_cgroup's record. This patch also enables moving charge of non pte_present but not uncharged swap caches, which can be exist on swap-out path, by getting the target pages via find_get_page() as do_mincore() does. [kosaki.motohiro@jp.fujitsu.com: fix ia64 build] [akpm@linux-foundation.org: fix typos] Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8033b97c9b5ef063e3f4bf2efe1cd0a22093aaff |
|
11-Mar-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: avoid oom during moving charge This move-charge-at-task-migration feature has extra charges on "to"(pre-charges) and "from"(left-over charges) during moving charge. This means unnecessary oom can happen. This patch tries to avoid such oom. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
854ffa8d104e44111fec96764c0e0cb29223d54c |
|
11-Mar-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: improve performance in moving charge Try to reduce overheads in moving charge by: - Instead of calling res_counter_uncharge() against the old cgroup in __mem_cgroup_move_account() everytime, call res_counter_uncharge() at the end of task migration once. - removed css_get(&to->css) from __mem_cgroup_move_account() because callers should have already called css_get(). And removed css_put(&to->css) too, which was called by callers of move_account on success of move_account. - Instead of calling __mem_cgroup_try_charge(), i.e. res_counter_charge(), repeatedly, call res_counter_charge(PAGE_SIZE * count) in can_attach() if possible. - Instead of calling css_get()/css_put() repeatedly, make use of coalesce __css_get()/__css_put() if possible. These changes reduces the overhead from 1.7sec to 0.6sec to move charges of 1G anonymous memory in my test environment. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4ffef5feff4e4240e767d2f1144b1634a41762e3 |
|
11-Mar-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: move charges of anonymous page This patch is the core part of this move-charge-at-task-migration feature. It implements functions to move charges of anonymous pages mapped only by the target task. Implementation: - define struct move_charge_struct and a valuable of it(mc) to remember the count of pre-charges and other information. - At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge() repeatedly and count up mc.precharge. - At attach(), parse the page table, find a target page to be move, and call mem_cgroup_move_account() about the page. - Cancel all precharges if mc.precharge > 0 on failure or at the end of task move. [akpm@linux-foundation.org: a little simplification] Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7dc74be032bfcaa2f9d9e4296ff5bbddfa9e2f19 |
|
11-Mar-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: add interface to move charge at task migration In current memcg, charges associated with a task aren't moved to the new cgroup at task migration. Some users feel this behavior to be strange. These patches are for this feature, that is, for charging to the new cgroup and, of course, uncharging from the old cgroup at task migration. This patch adds "memory.move_charge_at_immigrate" file, which is a flag file to determine whether charges should be moved to the new cgroup at task migration or not and what type of charges should be moved. This patch also adds read and write handlers of the file. This patch also adds no-op handlers for this feature. These handlers will be implemented in later patches. And you cannot write any values other than 0 to move_charge_at_immigrate yet. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
648bcc771145172a14bc35eeb849ed08f6aa4f1e |
|
05-Mar-2010 |
Thiago Farina <tfransosi@gmail.com> |
mm/memcontrol.c: fix "integer as NULL pointer" sparse warning mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer Signed-off-by: Thiago Farina <tfransosi@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
fce66477578d081f19aef5ea218664ff7758c33a |
|
16-Jan-2010 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: ensure list is empty at rmdir Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on success. But this doesn't guarantee memcg's LRU is really empty, because there are some cases in which !PageCgrupUsed pages exist on memcg's LRU. For example: - Pages can be uncharged by its owner process while they are on LRU. - race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common(). So there can be a case in which the usage is zero but some of the LRUs are not empty. OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with rmdir, accesses the mem_cgroup, so this access can cause a problem if it races with rmdir because the mem_cgroup might have been freed by rmdir. Actually, I saw a bug which seems to be caused by this race. [1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 [1530745.950651] IP: [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0 [1530745.950651] Oops: 0002 [#1] SMP [1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map [1530745.950651] CPU 3 [1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table] [1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G M 2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065] [1530745.950651] RIP: 0010:[<ffffffff810fbc11>] [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] RSP: 0018:ffff8803863ddcb8 EFLAGS: 00010002 [1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0 [1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238 [1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643 [1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000 [1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310 [1530745.950651] FS: 0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000 [1530745.950651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0 [1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0) [1530745.950651] Stack: [1530745.950651] ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a [1530745.950651] <0> 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000 [1530745.950651] <0> 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046 [1530745.950651] Call Trace: [1530745.950651] [<ffffffff810c739a>] release_pages+0x142/0x1e7 [1530745.950651] [<ffffffff810c778f>] ? pagevec_move_tail+0x6e/0x112 [1530745.950651] [<ffffffff810c781e>] pagevec_move_tail+0xfd/0x112 [1530745.950651] [<ffffffff810c78a9>] lru_add_drain+0x76/0x94 [1530745.950651] [<ffffffff810dba0c>] exit_mmap+0x6e/0x145 [1530745.950651] [<ffffffff8103f52d>] mmput+0x5e/0xcf [1530745.950651] [<ffffffff81043ea8>] exit_mm+0x11c/0x129 [1530745.950651] [<ffffffff8108fb29>] ? audit_free+0x196/0x1c9 [1530745.950651] [<ffffffff81045353>] do_exit+0x1f5/0x6b7 [1530745.950651] [<ffffffff8106133f>] ? up_read+0x2b/0x2f [1530745.950651] [<ffffffff8137d187>] ? lockdep_sys_exit_thunk+0x35/0x67 [1530745.950651] [<ffffffff81045898>] do_group_exit+0x83/0xb0 [1530745.950651] [<ffffffff810458dc>] sys_exit_group+0x17/0x1b [1530745.950651] [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b [1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 <48> ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b [1530745.950651] RIP [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80 [1530745.950651] RSP <ffff8803863ddcb8> [1530745.950651] CR2: 0000000000000230 [1530745.950651] ---[ end trace c3419c1bb8acc34f ]--- [1530745.950651] Fixing recursive fault but reboot is needed! The problem here is pages on LRU may contain pointer to stale memcg. To make res->usage to be 0, all pages on memcg must be uncharged or moved to another(parent) memcg. Moved page_cgroup have already removed from original LRU, but uncharged page_cgroup contains pointer to memcg withou PCG_USED bit. (This asynchronous LRU work is for improving performance.) If PCG_USED bit is not set, page_cgroup will never be added to memcg's LRU. So, about pages not on LRU, they never access stale pointer. Then, what we have to take care of is page_cgroup _on_ LRU list. This patch fixes this problem by making mem_cgroup_force_empty() visit all LRUs before exiting its loop and guarantee there are no pages on its LRU. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d4220f987cf473c65a342ca69e3eb13dea919a49 |
|
16-Dec-2009 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits) HWPOISON: Remove stray phrase in a comment HWPOISON: Try to allocate migration page on the same node HWPOISON: Don't do early filtering if filter is disabled HWPOISON: Add a madvise() injector for soft page offlining HWPOISON: Add soft page offline support HWPOISON: Undefine short-hand macros after use to avoid namespace conflict HWPOISON: Use new shake_page in memory_failure HWPOISON: Use correct name for MADV_HWPOISON in documentation HWPOISON: mention HWPoison in Kconfig entry HWPOISON: Use get_user_page_fast in hwpoison madvise HWPOISON: add an interface to switch off/on all the page filters HWPOISON: add memory cgroup filter memcg: add accessor to mem_cgroup.css memcg: rename and export try_get_mem_cgroup_from_page() HWPOISON: add page flags filter mm: export stable page flags HWPOISON: limit hwpoison injector to known page types HWPOISON: add fs/device filters HWPOISON: return 0 to indicate success reliably HWPOISON: make semantics of IGNORED/DELAYED clear ...
|
aa20d489ceb024f91aae084ee00c47fc6a12255c |
|
16-Dec-2009 |
Bob Liu <lliubbo@gmail.com> |
memcg: code clean, remove unused variable in mem_cgroup_resize_limit() Variable `progress' isn't used in mem_cgroup_resize_limit() any more. Remove it. [akpm@linux-foundation.org: cleanup] Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9ab322caa347c4b580bcaf08f2253ea4cbd9e9ad |
|
16-Dec-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: remove memcg_tasklist memcg_tasklist was introduced at commit 7f4d454d(memcg: avoid deadlock caused by race between oom and cpuset_attach) instead of cgroup_mutex to fix a deadlock problem. The cgroup_mutex, which was removed by the commit, in mem_cgroup_out_of_memory() was originally introduced at commit c7ba5c9e (Memory controller: OOM handling). IIUC, the intention of this cgroup_mutex was to prevent task move during select_bad_process() so that situations like below can be avoided. Assume cgroup "foo" has exceeded its limit and is about to trigger oom. 1. Process A, which has been in cgroup "baa" and uses large memory, is just moved to cgroup "foo". Process A can be the candidates for being killed. 2. Process B, which has been in cgroup "foo" and uses large memory, is just moved from cgroup "foo". Process B can be excluded from the candidates for being killed. But these race window exists anyway even if we hold a lock, because __mem_cgroup_try_charge() decides wether it should trigger oom or not outside of the lock. So the original cgroup_mutex in mem_cgroup_out_of_memory and thus current memcg_tasklist has no use. And IMHO, those races are not so critical for users. This patch removes it and make codes simpler. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d31f56dbf8bafaacb0c617f9a6f137498d5c7aed |
|
16-Dec-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: avoid oom-killing innocent task in case of use_hierarchy task_in_mem_cgroup(), which is called by select_bad_process() to check whether a task can be a candidate for being oom-killed from memcg's limit, checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs to). But this check return true(it's false positive) when: <some path>/aa use_hierarchy == 0 <- hitting limit <some path>/aa/00 use_hierarchy == 1 <- the task belongs to This leads to killing an innocent task in aa/00. This patch is a fix for this bug. And this patch also fixes the arg for mem_cgroup_print_oom_info(). We should print information of mem_cgroup which the task being killed, not current, belongs to. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
57f9fd7d25ac9a0d7e3a4ced580e780ab4524e3b |
|
16-Dec-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: cleanup mem_cgroup_move_parent() mem_cgroup_move_parent() calls try_charge first and cancel_charge on failure. IMHO, charge/uncharge(especially charge) is high cost operation, so we should avoid it as far as possible. This patch tries to delay try_charge in mem_cgroup_move_parent() by re-ordering checks it does. And this patch renames mem_cgroup_move_account() to __mem_cgroup_move_account(), changes the return value of __mem_cgroup_move_account() from int to void, and adds a new wrapper(mem_cgroup_move_account()), which checks whether a @pc is valid for moving account and calls __mem_cgroup_move_account(). This patch removes the last caller of trylock_page_cgroup(), so removes its definition too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a3032a2c15c6967f9f0c0c28375b1a5c833a3112 |
|
16-Dec-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: add mem_cgroup_cancel_charge() There are some places calling both res_counter_uncharge() and css_put() to cancel the charge and the refcnt we have got by mem_cgroup_tyr_charge(). This patch introduces mem_cgroup_cancel_charge() and call it in those places. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d8046582d5ee24448800e71c6933fdb6813aa062 |
|
16-Dec-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: make memcg's file mapped consistent with global VM In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED And in global VM, mapped shared memory is accounted into FILE_MAPPED. But memcg doesn't. fix it. Note: page_is_file_cache() just checks SwapBacked or not. So, we need to check PageAnon. Cc: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cdec2e4265dfa09490601b00aeabd8a8d4af30f0 |
|
16-Dec-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: coalesce charging via percpu storage This is a patch for coalescing access to res_counter at charging by percpu caching. At charge, memcg charges 64pages and remember it in percpu cache. Because it's cache, drain/flush if necessary. This version uses public percpu area. 2 benefits for using public percpu area. 1. Sum of stocked charge in the system is limited to # of cpus not to the number of memcg. This shows better synchonization. 2. drain code for flush/cpuhotplug is very easy (and quick) The most important point of this patch is that we never touch res_counter in fast path. The res_counter is system-wide shared counter which is modified very frequently. We shouldn't touch it as far as we can for avoiding false sharing. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 cache miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 cache miss/faults [ + coalescing uncharge patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 cache miss/faults [ + coalescing uncharge patch + this patch ] 34224709 page-faults # 0.072 M/sec ( +- 0.173% ) 34.69 cache miss/faults Changelog (since Oct/2): - updated comments - replaced get_cpu_var() with __get_cpu_var() if possible. - removed mutex for system-wide drain. adds a counter instead of it. - removed CONFIG_HOTPLUG_CPU Changelog (old): - rebased onto the latest mmotm - moved charge size check before __GFP_WAIT check for avoiding unnecesary - added asynchronous flush routine. - fixed bugs pointed out by Nishimura-san. [akpm@linux-foundation.org: tweak comments] [nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
569b846df54ffb2827b83ce3244c5f032394cba4 |
|
16-Dec-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: coalesce uncharge during unmap/truncate In massive parallel enviroment, res_counter can be a performance bottleneck. One strong techinque to reduce lock contention is reducing calls by coalescing some amount of calls into one. Considering charge/uncharge chatacteristic, - charge is done one by one via demand-paging. - uncharge is done by - in chunk at munmap, truncate, exit, execve... - one by one via vmscan/paging. It seems we have a chance to coalesce uncharges for improving scalability at unmap/truncation. This patch is a for coalescing uncharge. For avoiding scattering memcg's structure to functions under /mm, this patch adds memcg batch uncharge information to the task. A reason for per-task batching is for making use of caller's context information. We do batched uncharge (deleyed uncharge) when truncation/unmap occurs but do direct uncharge when uncharge is called by memory reclaim (vmscan.c). The degree of coalescing depends on callers - at invalidate/trucate... pagevec size - at unmap ....ZAP_BLOCK_SIZE (memory itself will be freed in this degree.) Then, we'll not coalescing too much. On x86-64 8cpu server, I tested overheads of memcg at page fault by running a program which does map/fault/unmap in a loop. Running a task per a cpu by taskset and see sum of the number of page faults in 60secs. [without memcg config] 40156968 page-faults # 0.085 M/sec ( +- 0.046% ) 27.67 cache-miss/faults [root cgroup] 36659599 page-faults # 0.077 M/sec ( +- 0.247% ) 31.58 miss/faults [in a child cgroup] 18444157 page-faults # 0.039 M/sec ( +- 0.133% ) 69.96 miss/faults [child with this patch] 27133719 page-faults # 0.057 M/sec ( +- 0.155% ) 47.16 miss/faults We can see some amounts of improvement. (root cgroup doesn't affected by this patch) Another patch for "charge" will follow this and above will be improved more. Changelog(since 2009/10/02): - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes) - some clean up and commentary/description updates. - added initialize code to copy_process(). (possible bug fix) Changelog(old): - fixed !CONFIG_MEM_CGROUP case. - rebased onto the latest mmotm + softlimit fix patches. - unified patch for callers - added commetns. - make ->do_batch as bool. - removed css_get() at el. We don't need it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cd9b45b78a61e8df250e69385c74e729e5b66abf |
|
16-Dec-2009 |
Kirill A. Shutemov <kirill@shutemov.name> |
memcg: fix memory.memsw.usage_in_bytes for root cgroup A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum of the usage of pages and swapents in the cgroup. Presently the root cgroup's memsw.usage_in_bytes shows the wrong value - the number of swapents are not added. So take MEM_CGROUP_STAT_SWAPOUT into account. Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d324236b3333e87c8825b35f2104184734020d35 |
|
16-Dec-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
memcg: add accessor to mem_cgroup.css So that an outside user can free the reference count grabbed by try_get_mem_cgroup_from_page(). CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk> CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> CC: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
/mm/memcontrol.c
|
e42d9d5d47961fb5db0be65b56dd52fe7b2421f1 |
|
16-Dec-2009 |
Wu Fengguang <fengguang.wu@intel.com> |
memcg: rename and export try_get_mem_cgroup_from_page() So that the hwpoison injector can get mem_cgroup for arbitrary page and thus know whether it is owned by some mem_cgroup task(s). [AK: Merged with latest git tree] CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk> CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> CC: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com>
/mm/memcontrol.c
|
407f9c8b0889ced1dbe2f9157e4e60c61329d5c9 |
|
15-Dec-2009 |
Hugh Dickins <hugh.dickins@tiscali.co.uk> |
ksm: mem cgroup charge swapin copy But ksm swapping does require one small change in mem cgroup handling. When do_swap_page()'s call to ksm_might_need_to_copy() does indeed substitute a duplicate page to accommodate a different anon_vma (or a the !PageSwapCache check in mem_cgroup_try_charge_swapin(). That was returning success without charging, on the assumption that pte_same() would fail after, which is not the case here. Originally I proposed that success, so that an unshrinkable mem cgroup at its limit would not fail unnecessarily; but that's a minor point, and there are plenty of other places where we may fail an overallocation which might later prove unnecessary. So just go ahead and do what all the other exceptions do: proceed to charge current mm. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Chris Wright <chrisw@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
af901ca181d92aac3a7dc265144a9081a86d8f39 |
|
14-Nov-2009 |
André Goddard Rosa <andre.goddard@gmail.com> |
tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
/mm/memcontrol.c
|
21ae2956ce289f61f11863cc67080f9a28101ae0 |
|
07-Oct-2009 |
Uwe Kleine-König <u.kleine-koenig@pengutronix.de> |
tree-wide: fix typos "aquire" -> "acquire", "cumsumed" -> "consumed" This patch was generated by git grep -E -i -l '[Aa]quire' | xargs -r perl -p -i -e 's/([Aa])quire/$1cquire/' and the cumsumed was found by checking the diff for aquire. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
/mm/memcontrol.c
|
ef8745c1e7fc5413d760b3b958f3fd3a0beaad72 |
|
02-Oct-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: reduce check for softlimit excess In charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly and it takes res_counter's spin_lock every time. This patch removes unnecessary calls for res_count_soft_limit_excess. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4e649152cbaa1aedd01821d200ab9d597fe469e4 |
|
02-Oct-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: some modification to softlimit under hierarchical memory reclaim. This patch clean up/fixes for memcg's uncharge soft limit path. Problems: Now, res_counter_charge()/uncharge() handles softlimit information at charge/uncharge and softlimit-check is done when event counter per memcg goes over limit. Now, event counter per memcg is updated only when memory usage is over soft limit. Here, considering hierarchical memcg management, ancesotors should be taken care of. Now, ancerstors(hierarchy) are handled in charge() but not in uncharge(). This is not good. Prolems: 1. memcg's event counter incremented only when softlimit hits. That's bad. It makes event counter hard to be reused for other purpose. 2. At uncharge, only the lowest level rescounter is handled. This is bug. Because ancesotor's event counter is not incremented, children should take care of them. 3. res_counter_uncharge()'s 3rd argument is NULL in most case. ops under res_counter->lock should be small. No "if" sentense is better. Fixes: * Removed soft_limit_xx poitner and checks in charge and uncharge. Do-check-only-when-necessary scheme works enough well without them. * make event-counter of memcg incremented at every charge/uncharge. (per-cpu area will be accessed soon anyway) * All ancestors are checked at soft-limit-check. This is necessary because ancesotor's event counter may never be modified. Then, they should be checked at the same time. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
26251eaf98e26dc2ce2dc26d63bc502700760704 |
|
02-Oct-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix refcnt going negative __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz" with incremnted mz->mem->css's refcnt. Then, the caller of this function has to call css_put(mz->mem->css). But, mz can be !NULL even if "not found" i.e. without css_get(). By this, css->refcnt will go down to minus. This may cause various things...one of results will be initite-loop in css_tryget() as this. INFO: RCU detected CPU 0 stall (t=10000 jiffies) sending NMI to all CPUs: NMI backtrace for cpu 0 CPU 0: <snip> <<EOE>> <IRQ> [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10 [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0 [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70 [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0 [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370 [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130 [<ffffffff81063a26>] update_process_times+0x46/0x70 [<ffffffff81085930>] tick_sched_timer+0x60/0x160 [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160 [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150 [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0 [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20 <EOI> [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180 [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180 [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180 [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110 [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330 [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60 Above shows CPU0 caught in css_tryget()'s inifinite loop because of bad refcnt. This is a fix to set mz=NULL at the top of retry path. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1dd3a27326d307952f8ad2499478c84dc7311517 |
|
24-Sep-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: show swap usage in stat file We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It would be useful for users to show swap usage in memory.stat file, because they don't need calculate memsw.usage - res.usage to know swap usage. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0c3e73e84fe3f64cf1c2e8bb4e91e8901cbcdc38 |
|
24-Sep-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: improve resource counter scalability Reduce the resource counter overhead (mostly spinlock) associated with the root cgroup. This is a part of the several patches to reduce mem cgroup overhead. I had posted other approaches earlier (including using percpu counters). Those patches will be a natural addition and will be added iteratively on top of these. The patch stops resource counter accounting for the root cgroup. The data for display is derived from the statisitcs we maintain via mem_cgroup_charge_statistics (which is more scalable). What happens today is that, we do double accounting, once using res_counter_charge() and once using memory_cgroup_charge_statistics(). For the root, since we don't implement limits any more, we don't need to track every charge via res_counter_charge() and check for limit being exceeded and reclaim. The main mem->res usage_in_bytes can be derived by summing the cache and rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need additional data about swapped out memory. This patch adds a MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed recursively when hierarchy is enabled. The tests results I see on a 24 way show that 1. The lock contention disappears from /proc/lock_stats 2. The results of the test are comparable to running with cgroup_disable=memory. Here is a sample of my program runs Without Patch Performance counter stats for '/home/balbir/parallel_pagefault': 7192804.124144 task-clock-msecs # 23.937 CPUs 424691 context-switches # 0.000 M/sec 267 CPU-migrations # 0.000 M/sec 28498113 page-faults # 0.004 M/sec 5826093739340 cycles # 809.989 M/sec 408883496292 instructions # 0.070 IPC 7057079452 cache-references # 0.981 M/sec 3036086243 cache-misses # 0.422 M/sec 300.485365680 seconds time elapsed With cgroup_disable=memory Performance counter stats for '/home/balbir/parallel_pagefault': 7182183.546587 task-clock-msecs # 23.915 CPUs 425458 context-switches # 0.000 M/sec 203 CPU-migrations # 0.000 M/sec 92545093 page-faults # 0.013 M/sec 6034363609986 cycles # 840.185 M/sec 437204346785 instructions # 0.072 IPC 6636073192 cache-references # 0.924 M/sec 2358117732 cache-misses # 0.328 M/sec 300.320905827 seconds time elapsed With this patch applied Performance counter stats for '/home/balbir/parallel_pagefault': 7191619.223977 task-clock-msecs # 23.955 CPUs 422579 context-switches # 0.000 M/sec 88 CPU-migrations # 0.000 M/sec 91946060 page-faults # 0.013 M/sec 5957054385619 cycles # 828.333 M/sec 1058117350365 instructions # 0.178 IPC 9161776218 cache-references # 1.274 M/sec 1920494280 cache-misses # 0.267 M/sec 300.218764862 seconds time elapsed Data from Prarit (kernel compile with make -j64 on a 64 CPU/32G machine) For a single run Without patch real 27m8.988s user 87m24.916s sys 382m6.037s With patch real 4m18.607s user 84m58.943s sys 50m52.682s With config turned off real 4m54.972s user 90m13.456s sys 50m19.711s NOTE: The data looks counterintuitive due to the increased performance with the patch, even over the config being turned off. We probably need more runs, but so far all testing has shown that the patches definitely help. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4e41695356fb4e0b153be1440ad027e46e0a7ea2 |
|
24-Sep-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memory controller: soft limit reclaim on contention Implement reclaim from groups over their soft limit Permit reclaim from memory cgroups on contention (via the direct reclaim path). memory cgroup soft limit reclaim finds the group that exceeds its soft limit by the largest number of pages and reclaims pages from it and then reinserts the cgroup into its correct place in the rbtree. Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long loops in case all swap is turned off. The code has been refactored and the loop check (loop < 2) has been enhanced for soft limits. For soft limits, we try to do more targetted reclaim. Instead of bailing out after two loops, the routine now reclaims memory proportional to the size by which the soft limit is exceeded. The proportion has been empirically determined. [akpm@linux-foundation.org: build fix] [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling] [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
75822b4495b62e8721e9b88e3cf9e653a0c85b73 |
|
24-Sep-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memory controller: soft limit refactor reclaim flags Refactor mem_cgroup_hierarchical_reclaim() Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into flags, so that new parameters don't have to be passed as we make the reclaim routine more flexible Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f64c3f54940d6929a2b6dcffaab942bd62be2e66 |
|
24-Sep-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memory controller: soft limit organize cgroups Organize cgroups over soft limit in a RB-Tree Introduce an RB-Tree for storing memory cgroups that are over their soft limit. The overall goal is to 1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded. We are careful about updates, updates take place only after a particular time interval has passed 2. We remove the node from the RB-Tree when the usage goes below the soft limit The next set of patches will exploit the RB-Tree to get the group that is over its soft limit by the largest amount and reclaim from it, when we face memory contention. [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
296c81d89f4f14269f7346f81442910158c0a83a |
|
24-Sep-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memory controller: soft limit interface Add an interface to allow get/set of soft limits. Soft limits for memory plus swap controller (memsw) is currently not supported. Resource counters have been enhanced to support soft limits and new type RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be directly set and do not need any reclaim or checks before setting them to a newer value. Kamezawa-San raised a question as to whether soft limit should belong to res_counter. Since all resources understand the basic concepts of hard and soft limits, it is justified to add soft limits here. Soft limits are a generic resource usage feature, even file system quotas support soft limits. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
261fb61a8bf6d3bd964ae6f1e6af49585d30db51 |
|
24-Sep-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: add comments explaining memory barriers Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge(). [akpm@linux-foundation.org: coding-style fixes] Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4b3bde4c983de36c59e6c1a24701f6fe816f9f55 |
|
24-Sep-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: remove the overhead associated with the root cgroup Change the memory cgroup to remove the overhead associated with accounting all pages in the root cgroup. As a side-effect, we can no longer set a memory hard limit in the root cgroup. A new flag to track whether the page has been accounted or not has been added as well. Flags are now set atomically for page_cgroup, pcg_default_flags is now obsolete and removed. [akpm@linux-foundation.org: fix a few documentation glitches] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
be367d09927023d081f9199665c8500f69f14d22 |
|
24-Sep-2009 |
Ben Blum <bblum@google.com> |
cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time Alter the ss->can_attach and ss->attach functions to be able to deal with a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a pre-patch to cgroup-procs-writable.patch.) Currently, new mode of the attach function can only tell the subsystem about the old cgroup of the threadgroup leader. No subsystem currently needs that information for each thread that's being moved, but if one were to be added (for example, one that counts tasks within a group) this bit would need to be reworked a bit to tell the subsystem the right information. [hidave.darkstar@gmail.com: fix build] Signed-off-by: Ben Blum <bblum@google.com> Signed-off-by: Paul Menage <menage@google.com> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Matt Helsley <matthltc@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Dave Young <hidave.darkstar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b7c46d151cb82856a429709d1227ba1648028232 |
|
22-Sep-2009 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: drop unneeded double negations Remove double negations where the operand is already boolean. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
887032670d47366a8c8f25396ea7c14b7b2cc620 |
|
30-Jul-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
cgroup avoid permanent sleep at rmdir After commit ec64f51545fffbc4cb968f0cea56341a4b07e85a ("cgroup: fix frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg) doesn't return -EBUSY by temporary ref counts. That commit expects all refs after pre_destroy() is temporary but...it wasn't. Then, rmdir can wait permanently. This patch tries to fix that and change followings. - set CGRP_WAIT_ON_RMDIR flag before pre_destroy(). - clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case. if there are sleeping ones, wakes them up. - rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set. Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: Paul Menage <menage@google.com> Acked-by: Balbir Sigh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0 |
|
09-Jul-2009 |
Jens Axboe <jens.axboe@oracle.com> |
Fix congestion_wait() sync/async vs read/write confusion Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke the bdi congestion wait queue logic, causing us to wait on congestion for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
/mm/memcontrol.c
|
2ffebca6aa7e1687905c842dd8c5c1e811e574e7 |
|
18-Jun-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix lru rotation in isolate_pages Try to fix memcg's lru rotation sanity: make memcg use the same logic as the global LRU does. Now, at __isolate_lru_page() retruns -EBUSY, the page is rotated to the tail of LRU in global LRU's isolate LRU pages. But in memcg, it's not handled. This makes memcg do the same behavior as global LRU and rotate LRU in the page is busy. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
22a668d7c3ef833e7d67e9cef587ecc78069d532 |
|
18-Jun-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix behavior under memory.limit equals to memsw.limit A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the user just want to limit the total size of applications, in other words, not very interested in memory usage itself. In this case, swap-out will be done only by global-LRU. But, under current implementation, memory.limit_in_bytes is checked at first and try_to_free_page() may do swap-out. But, that swap-out is useless for memsw.limit_in_bytes and the thread may hit limit again. This patch tries to fix the current behavior at memory.limit == memsw.limit case. And documentation is updated to explain the behavior of this special case. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8a9478ca7f4bcb8945cec7f95d52dae2d5e50cbd |
|
18-Jun-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix swap accounting This patch fixes mis-accounting of swap usage in memcg. In the current implementation, memcg's swap account is uncharged only when swap is completely freed. But there are several cases where swap cannot be freed cleanly. For handling that, this patch changes that memcg uncharges swap account when swap has no references other than cache. By this, memcg's swap entry accounting can be fully synchronous with the application's behavior. This patch also changes memcg's hooks for swap-out. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
338c843108bf5030d6765f4405126e70f8b77845 |
|
18-Jun-2009 |
Li Zefan <lizf@cn.fujitsu.com> |
memcg: remove some redundant checks We don't need to check do_swap_account in the case that the function which checks do_swap_account will never get called if do_swap_account == 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d69b042f3d7406ddba560143b1796020df760800 |
|
18-Jun-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: add file-based RSS accounting Add file RSS tracking per memory cgroup We currently don't track file RSS, the RSS we report is actually anon RSS. All the file mapped pages, come in through the page cache and get accounted there. This patch adds support for accounting file RSS pages. It should 1. Help improve the metrics reported by the memory resource controller 2. Will form the basis for a future shared memory accounting heuristic that has been proposed by Kamezawa. Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat to "anon_rss". We however, add "mapped_file" data and hope to educate the end user through documentation. [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.cn> Cc: Paul Menage <menage@google.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
56e49d218890f49b0057710a4b6fef31f5ffbfec |
|
17-Jun-2009 |
Rik van Riel <riel@redhat.com> |
vmscan: evict use-once pages first When the file LRU lists are dominated by streaming IO pages, evict those pages first, before considering evicting other pages. This should be safe from deadlocks or performance problems because only three things can happen to an inactive file page: 1) referenced twice and promoted to the active list 2) evicted by the pageout code 3) under IO, after which it will get evicted or promoted The pages freed in this way can either be reused for streaming IO, or allocated for something else. If the pages are used for streaming IO, this pageout pattern continues. Otherwise, we will fall back to the normal pageout pattern. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Elladan <elladan@eskimo.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
46f7e602fb32e02145ef14f8c0ca6d399f0a96b9 |
|
28-May-2009 |
Nikanth Karthikesan <knikanth@suse.de> |
memcg: fix build warning and avoid checking for mem != null again and again Fix build warning, "mem_cgroup_is_obsolete defined but not used" when CONFIG_DEBUG_VM is not set. Also avoid checking for !mem again and again. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e767e0561d7fd2333df1921f1ab4176211f9036b |
|
28-May-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix deadlock between lock_page_cgroup and mapping tree_lock mapping->tree_lock can be acquired from interrupt context. Then, following dead lock can occur. Assume "A" as a page. CPU0: lock_page_cgroup(A) interrupted -> take mapping->tree_lock. CPU1: take mapping->tree_lock -> lock_page_cgroup(A) This patch tries to fix above deadlock by moving memcg's hook to out of mapping->tree_lock. charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock. After this patch, lock_page_cgroup() is not called under mapping->tree_lock. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ae3abae64f177586be55b04a7fb7047a34b21a3e |
|
01-May-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix mem_cgroup_shrink_usage() Current mem_cgroup_shrink_usage() has two problems. 1. It doesn't call mem_cgroup_out_of_memory and doesn't update last_oom_jiffies, so pagefault_out_of_memory invokes global OOM. 2. Considering hierarchy, shrinking has to be done from the mem_over_limit, not from the memcg which the page would be charged to. mem_cgroup_try_charge_swapin() does all of these things properly, so we use it and call cancel_charge_swapin when it succeeded. The name of "shrink_usage" is not appropriate for this behavior, so we change it too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.cn> Cc: Paul Menage <menage@google.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c0bd3f63ce01a1757dbce6373122a05fbf99ced7 |
|
01-May-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix try_get_mem_cgroup_from_swapcache() This is a bugfix for commit 3c776e64660028236313f0e54f3a9945764422df ("memcg: charge swapcache to proper memcg"). Used bit of swapcache is solid under page lock, but considering move_account, pc->mem_cgroup is not. We need lock_page_cgroup() anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a8031cb00e286600ea08bd00a6812dbfec412376 |
|
13-Apr-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: remove warning when CONFIG_DEBUG_VM=n mm/memcontrol.c:318: warning: `mem_cgroup_is_obsolete' defined but not used [akpm@linux-foundation.org: simplify as suggested by Balbir] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
83aae4c737866da3280f51fd15da58eddd788397 |
|
03-Apr-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: cleanup cache_charge Current mem_cgroup_cache_charge is a bit complicated especially in the case of shmem's swap-in. This patch cleans it up by using try_charge_swapin and commit_charge_swapin. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a3b2d692690aef228e493b1beaafe5364cab3237 |
|
03-Apr-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
cgroups: use css id in swap cgroup for saving memory v5 Try to use CSS ID for records in swap_cgroup. By this, on 64bit machine, size of swap_cgroup goes down to 2 bytes from 8bytes. This means, when 2GB of swap is equipped, (assume the page size is 4096bytes) From size of swap_cgroup = 2G/4k * 8 = 4Mbytes. To size of swap_cgroup = 2G/4k * 2 = 1Mbytes. Reduction is large. Of course, there are trade-offs. This CSS ID will add overhead to swap-in/swap-out/swap-free. But in general, - swap is a resource which the user tend to avoid use. - If swap is never used, swap_cgroup area is not used. - Reading traditional manuals, size of swap should be proportional to size of memory. Memory size of machine is increasing now. I think reducing size of swap_cgroup makes sense. Note: - ID->CSS lookup routine has no locks, it's under RCU-Read-Side. - memcg can be obsolete at rmdir() but not freed while refcnt from swap_cgroup is available. Changelog v4->v5: - reworked on to memcg-charge-swapcache-to-proper-memcg.patch Changlog ->v4: - fixed not configured case. - deleted unnecessary comments. - fixed NULL pointer bug. - fixed message in dmesg. [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3c776e64660028236313f0e54f3a9945764422df |
|
03-Apr-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: charge swapcache to proper memcg memcg_test.txt says at 4.1: This swap-in is one of the most complicated work. In do_swap_page(), following events occur when pte is unchanged. (1) the page (SwapCache) is looked up. (2) lock_page() (3) try_charge_swapin() (4) reuse_swap_page() (may call delete_swap_cache()) (5) commit_charge_swapin() (6) swap_free(). Considering following situation for example. (A) The page has not been charged before (2) and reuse_swap_page() doesn't call delete_from_swap_cache(). (B) The page has not been charged before (2) and reuse_swap_page() calls delete_from_swap_cache(). (C) The page has been charged before (2) and reuse_swap_page() doesn't call delete_from_swap_cache(). (D) The page has been charged before (2) and reuse_swap_page() calls delete_from_swap_cache(). memory.usage/memsw.usage changes to this page/swp_entry will be Case (A) (B) (C) (D) Event Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 =========================================== (3) +1/+1 +1/+1 +1/+1 +1/+1 (4) - 0/ 0 - -1/ 0 (5) 0/-1 0/ 0 -1/-1 0/ 0 (6) - 0/-1 - 0/-1 =========================================== Result 1/ 1 1/ 1 1/ 1 1/ 1 In any cases, charges to this page should be 1/ 1. In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means charges to the memcg("foo") to which the "current" belongs. OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa") to which the page has been charged. So, if the "foo" and "baa" is different(for example because of task move), this charge will be moved from "baa" to "foo". I think this is an unexpected behavior. This patch fixes this by modifying mem_cgroup_try_get_from_swapcache() to return the memcg to which the swapcache has been charged if PCG_USED bit is set. IIUC, checking PCG_USED bit of swapcache is safe under page lock. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c137b5ece4b111e46981aae7da77315b9909809f |
|
03-Apr-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: remove mem_cgroup_calc_mapped_ratio() Currently, mem_cgroup_calc_mapped_ratio() is unused at all. it can be removed and KAMEZAWA-san suggested it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e222432bfa7dcf6ec008622a978c9f284ed5e3a9 |
|
03-Apr-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: show memcg information during OOM Add RSS and swap to OOM output from memcg Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg. Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review. Sample output ------------- Task in /a/x killed as a result of limit of /a memory: usage 1048576kB, limit 1048576kB, failcnt 4183 memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0 [akpm@linux-foundation.org: compilation fix] [akpm@linux-foundation.org: fix kerneldoc and whitespace] [akpm@linux-foundation.org: add printk facility level] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0b7f569e45bb6be142d87017030669a6a7d327a1 |
|
03-Apr-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix OOM killer under memcg This patch tries to fix OOM Killer problems caused by hierarchy. Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to kill a task in memcg. But, when hierarchy is used, it's broken and correct task cannot be killed. For example, in following cgroup /groupA/ hierarchy=1, limit=1G, 01 nolimit 02 nolimit All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to groupA's 1Gbytes but OOM Killer just kills tasks in groupA. This patch provides makes the bad process be selected from all tasks under hierarchy. BTW, currently, oom_jiffies is updated against groupA in above case. oom_jiffies of tree should be updated. To see how oom_jiffies is used, please check mem_cgroup_oom_called() callers. [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: const fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
81d39c20f5ee2437d71709beb82597e2a38efbbc |
|
03-Apr-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm As pointed out, shrinking memcg's limit should return -EBUSY after reasonable retries. This patch tries to fix the current behavior of shrink_usage. Before looking into "shrink should return -EBUSY" problem, we should fix hierarchical reclaim code. It compares current usage and current limit, but it only makes sense when the kernel reclaims memory because hit limits. This is also a problem. What this patch does are. 1. add new argument "shrink" to hierarchical reclaim. If "shrink==true", hierarchical reclaim returns immediately and the caller checks the kernel should shrink more or not. (At shrinking memory, usage is always smaller than limit. So check for usage < limit is useless.) 2. For adjusting to above change, 2 changes in "shrink"'s retry path. 2-a. retry_count depends on # of children because the kernel visits the children under hierarchy one by one. 2-b. rather than checking return value of hierarchical_reclaim's progress, compares usage-before-shrink and usage-after-shrink. If usage-before-shrink <= usage-after-shrink, retry_count is decremented. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
14067bb3e24b96d92e22d19c18c0119edf5575e5 |
|
03-Apr-2009 |
KAMEZAWA Hiroyuki <kamzawa.hiroyu@jp.fujitsu.com> |
memcg: hierarchical stat Clean up memory.stat file routine and show "total" hierarchical stat. This patch does - renamed get_all_zonestat to be get_local_zonestat. - remove old mem_cgroup_stat_desc, which is only for per-cpu stat. - add mcs_stat to cover both of per-cpu/per-lru stat. - add "total" stat of hierarchy (*) - add a callback system to scan all memcg under a root. == "total" is added. [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat cache 0 rss 0 pgpgin 0 pgpgout 0 inactive_anon 0 active_anon 0 inactive_file 0 active_file 0 unevictable 0 hierarchical_memory_limit 50331648 hierarchical_memsw_limit 9223372036854775807 total_cache 65536 total_rss 192512 total_pgpgin 218 total_pgpgout 155 total_inactive_anon 0 total_active_anon 135168 total_inactive_file 61440 total_active_file 4096 total_unevictable 0 == (*) maybe the user can do calc hierarchical stat by his own program in userland but if it can be written in clean way, it's worth to be shown, I think. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
04046e1a0a34286382e913f8fc461440c21d88e8 |
|
03-Apr-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: use CSS ID Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy. Assume folloing tree. group_A (ID=3) /01 (ID=4) /0A (ID=7) /02 (ID=10) group_B (ID=5) and task in group_A/01/0A hits limit at group_A. reclaim will be done in following order (round-robin). group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10) -> group_A -> ..... Round robin by ID. The last visited cgroup is recorded and restart from it when it start reclaim again. (More smart algorithm can be implemented..) No cgroup_mutex or hierarchy_mutex is required. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ec64f51545fffbc4cb968f0cea56341a4b07e85a |
|
03-Apr-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
cgroup: fix frequent -EBUSY at rmdir In following situation, with memory subsystem, /groupA use_hierarchy==1 /01 some tasks /02 some tasks /03 some tasks /04 empty When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim is triggered and the kernel walks tree under groupA. In this case, rmdir /groupA/04 fails with -EBUSY frequently because of temporal refcnt from the kernel. In general. cgroup can be rmdir'd if there are no children groups and no tasks. Frequent fails of rmdir() is not useful to users. (And the reason for -EBUSY is unknown to users.....in most cases) This patch tries to modify above behavior, by - retries if css_refcnt is got by someone. - add "return value" to pre_destroy() and allows subsystem to say "we're really busy!" Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
299b4eaa302138426d5a9ecd954de1f565d76c94 |
|
29-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: NULL pointer dereference at rmdir on some NUMA systems N_POSSIBLE doesn't means there is memory...and force_empty can visit invalid node which have no pgdat. To visit all valid nodes, N_HIGH_MEMORY should be used. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7bcc1bb1232de6efc0b85e0c7fe38e90b2436318 |
|
29-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: get/put parents at create/free The lifetime of struct cgroup and struct mem_cgroup is different and mem_cgroup has its own reference count for handling references from swap_cgroup. This causes strange problem that the parent mem_cgroup dies while child mem_cgroup alive, and this problem causes a bug in case of use_hierarchy==1 because res_counter_uncharge climbs up the tree. This patch is for avoiding it by getting the parent at create, and putting it at freeing. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by; KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
068b38c1fa7a9210608f27ac521897ccc5f9b726 |
|
15-Jan-2009 |
Li Zefan <lizf@cn.fujitsu.com> |
memcg: fix a race when setting memory.swappiness (suppose: memcg->use_hierarchy == 0 and memcg->swappiness == 60) echo 10 > /memcg/0/swappiness | mem_cgroup_swappiness_write() | ... | echo 1 > /memcg/0/use_hierarchy | mkdir /mnt/0/1 | sub_memcg->swappiness = 60; memcg->swappiness = 10; | In the above scenario, we end up having 2 different swappiness values in a single hierarchy. We should hold cgroup_lock() when cheking cgrp->children list. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0eb253e223c88b982461e59154fcad1b82597592 |
|
15-Jan-2009 |
Li Zefan <lizf@cn.fujitsu.com> |
memcg: fix section mismatch At system boot when creating the top cgroup, mem_cgroup_create() calls enable_swap_cgroup() which is marked as __init, so mark mem_cgroup_create() as __ref to avoid false section mismatch warning. Reported-by: Rakib Mullick <rakib.mullick@gmail.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by; KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4d1c627389c8ba6d9e703208567ffcdbd356f682 |
|
15-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: make oom less frequently In previous implementation, mem_cgroup_try_charge checked the return value of mem_cgroup_try_to_free_pages, and just retried if some pages had been reclaimed. But now, try_charge(and mem_cgroup_hierarchical_reclaim called from it) only checks whether the usage is less than the limit. This patch tries to change the behavior as before to cause oom less frequently. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c268e9946d7dc30ac4e55cdc3f43c8af1ae8153c |
|
15-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix hierarchical reclaim If root_mem has no children, last_scaned_child is set to root_mem itself. But after some children added to root_mem, mem_cgroup_get_next_node can mem_cgroup_put the root_mem although root_mem has not been mem_cgroup_get. This patch fixes this behavior by: - Set last_scanned_child to NULL if root_mem has no children or DFS search has returned to root_mem itself(root_mem is not a "child" of root_mem). Make mem_cgroup_get_first_node return root_mem in this case. There are no mem_cgroup_get/put for root_mem. - Rename mem_cgroup_get_next_node to __mem_cgroup_get_next_node, and mem_cgroup_get_first_node to mem_cgroup_get_next_node. Make mem_cgroup_hierarchical_reclaim call only new mem_cgroup_get_next_node. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
40d58138f832a48208cdce57d6572a033b1f7a23 |
|
15-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix error path of mem_cgroup_move_parent There is a bug in error path of mem_cgroup_move_parent. Extra refcnt got from try_charge should be dropped, and usages incremented by try_charge should be decremented in both error paths: A: failure at get_page_unless_zero B: failure at isolate_lru_page This bug makes this parent directory unremovable. In case of A, rmdir doesn't return, because res.usage doesn't go down to 0 at mem_cgroup_force_empty even after all the pc in lru are removed. In case of B, rmdir fails and returns -EBUSY, because it has extra ref counts even after res.usage goes down to 0. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bd112db872c2f69993c86f458467acb4a14da010 |
|
15-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: fix mem_cgroup_get_reclaim_stat_from_page In case of swapin, a new page is added to lru before it is charged, so page->pc->mem_cgroup points to NULL or last mem_cgroup the page was charged before. In the latter case, if the mem_cgroup has already freed by rmdir, the area pointed to by page->pc->mem_cgroup may have invalid data. Actually, I saw general protection fault. general protection fault: 0000 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map CPU 4 Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_region_hash dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac lp sg ide_cd_mod cdrom button serio_raw acpi_memhotplug parport_pc e1000 rtc_cmos parport rtc_core rtc_lib i2c_i801 i2c_core shpchp pcspkr ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd [last unloaded: microcode] Pid: 26038, comm: page01 Tainted: G W 2.6.28-rc9-mm1-mmotm-2008-12-22-16-14-f2ab3dea #1 RIP: 0010:[<ffffffff8028e710>] [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42 RSP: 0000:ffff8801ee457da8 EFLAGS: 00010002 RAX: 32353438312021c8 RBX: 0000000000000000 RCX: 32353438312021c8 RDX: 0000000000000000 RSI: ffff8800cb0b1000 RDI: ffff8801164d1d28 RBP: ffff880110002cb8 R08: ffff88010f2eae23 R09: 0000000000000001 R10: ffff8800bc514b00 R11: ffff880110002c00 R12: 0000000000000000 R13: ffff88000f484100 R14: 0000000000000003 R15: 00000000001200d2 FS: 00007f8a261726f0(0000) GS:ffff88010f2eaa80(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f8a25d22000 CR3: 00000001ef18c000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process page01 (pid: 26038, threadinfo ffff8801ee456000, task ffff8800b585b960) Stack: ffffe200071ee568 ffff880110001f00 0000000000000000 ffffffff8028ea17 ffff88000f484100 0000000000000000 0000000000000020 00007f8a25d22000 ffff8800bc514b00 ffffffff8028ec34 0000000000000000 0000000000016fd8 Call Trace: [<ffffffff8028ea17>] ? ____pagevec_lru_add+0xc1/0x13c [<ffffffff8028ec34>] ? drain_cpu_pagevecs+0x36/0x89 [<ffffffff802a4f8c>] ? swapin_readahead+0x78/0x98 [<ffffffff8029a37a>] ? handle_mm_fault+0x3d9/0x741 [<ffffffff804da654>] ? do_page_fault+0x3ce/0x78c [<ffffffff804d7a42>] ? trace_hardirqs_off_thunk+0x3a/0x3c [<ffffffff804d860f>] ? page_fault+0x1f/0x30 Code: cc 55 48 8d af b8 0d 00 00 48 89 f7 53 89 d3 e8 39 85 02 00 48 63 d3 48 ff 44 d5 10 45 85 e4 74 05 48 ff 44 d5 00 48 85 c0 74 0e <48> ff 44 d0 10 45 85 e4 74 04 48 ff 04 d0 5b 5d 41 5c c3 41 54 RIP [<ffffffff8028e710>] update_page_reclaim_stat+0x2f/0x42 RSP <ffff8801ee457da8> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2cb378c862777d050c20db903b119a029845fdcb |
|
08-Jan-2009 |
Paul Menage <menage@google.com> |
cgroups: use hierarchy_mutex in memory controller Update the memory controller to use its hierarchy_mutex rather than calling cgroup_lock() to protected against cgroup_mkdir()/cgroup_rmdir() from occurring in its hierarchy. Signed-off-by: Paul Menage <menage@google.com> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b5a84319a4343a0db753436fd8147e61eaafa7ea |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix shmem's swap accounting Now, you can see following even when swap accounting is enabled. 1. Create Group 01, and 02. 2. allocate a "file" on tmpfs by a task under 01. 3. swap out the "file" (by memory pressure) 4. Read "file" from a task in group 02. 5. the charge of "file" is moved to group 02. This is not ideal behavior. This is because SwapCache which was loaded by read-ahead is not taken into account.. This is a patch to fix shmem's swapcache behavior. - remove mem_cgroup_cache_charge_swapin(). - Add SwapCache handler routine to mem_cgroup_cache_charge(). By this, shmem's file cache is charged at add_to_page_cache() with GFP_NOWAIT. - pass the page of swapcache to shrink_mem_cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
544122e5e0ee27d5aac4a441f7746712afbf248c |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix LRU accounting for SwapCache Now, a page can be deleted from SwapCache while do_swap_page(). memcg-fix-swap-accounting-leak-v3.patch handles that, but, LRU handling is still broken. (above behavior broke assumption of memcg-synchronized-lru patch.) This patch is a fix for LRU handling (especially for per-zone counters). At charging SwapCache, - Remove page_cgroup from LRU if it's not used. - Add page cgroup to LRU if it's not linked to. Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
54595fe2652f04dc8f5b985312c7cef5aa7bf722 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: use css_tryget in memcg From:KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> css_tryget() newly is added and we can know css is alive or not and get refcnt of css in very safe way. ("alive" here means "rmdir/destroy" is not called.) This patch replaces css_get() to css_tryget(), where I cannot explain why css_get() is safe. And removes memcg->obsolete flag. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a7ba0eef3af51cd1b6fc4028e4705b3ea2ea9469 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix double free and make refcnt sane 1. Fix double-free BUG in error route of mem_cgroup_create(). mem_cgroup_free() itself frees per-zone-info. 2. Making refcnt of memcg simple. Add 1 refcnt at creation and call free when refcnt goes down to 0. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
03f3c433648a97ae7c86be789edba67690f6ea60 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix swap accounting leak Fix swapin charge operation of memcg. Now, memcg has hooks to swap-out operation and checks SwapCache is really unused or not. That check depends on contents of struct page. I.e. If PageAnon(page) && page_mapped(page), the page is recoginized as still-in-use. Now, reuse_swap_page() calles delete_from_swap_cache() before establishment of any rmap. Then, in followinig sequence (Page fault with WRITE) try_charge() (charge += PAGESIZE) commit_charge() (Check page_cgroup is used or not..) reuse_swap_page() -> delete_from_swapcache() -> mem_cgroup_uncharge_swapcache() (charge -= PAGESIZE) ...... New charge is uncharged soon.... To avoid this, move commit_charge() after page_mapcount() goes up to 1. By this, try_charge() (usage += PAGESIZE) reuse_swap_page() (may usage -= PAGESIZE if PCG_USED is set) commit_charge() (If page_cgroup is not marked as PCG_USED, add new charge.) Accounting will be correct. Changelog (v2) -> (v3) - fixed invalid charge to swp_entry==0. - updated documentation. Changelog (v1) -> (v2) - fixed comment. [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
42e9abb628def2c335a4ecf130bb6c88d916d885 |
|
08-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: change try_to_free_pages to hierarchical_reclaim mem_cgroup_hierarchicl_reclaim() works properly even when !use_hierarchy now (by memcg-hierarchy-avoid-unnecessary-reclaim.patch), so, instead of try_to_free_mem_cgroup_pages(), it should be used in many cases. The only exception is force_empty. The group has no children in this case. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7f4d454dee2e0bdd21bafd413d1c53e443a26540 |
|
08-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: avoid deadlock caused by race between oom and cpuset_attach mpol_rebind_mm(), which can be called from cpuset_attach(), does down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be called under cgroup_mutex. OTOH, page fault path does down_read(mm->mmap_sem) and calls mem_cgroup_try_charge_xxx(), which may eventually calls mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls cgroup_lock(). This means cgroup_lock() can be called under down_read(mm->mmap_sem). If those two paths race, deadlock can happen. This patch avoid this deadlock by: - remove cgroup_lock() from mem_cgroup_out_of_memory(). - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task() (->attach handler of memory cgroup) and mem_cgroup_out_of_memory. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a5e924f5f8abf97944e625d74967cc9452cfbce8 |
|
08-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: remove mem_cgroup_try_charge After previous patch, mem_cgroup_try_charge is not used by anyone, so we can remove it. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3bb4edf24b26358eccfc69ac8b9a9c36ccc312da |
|
08-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: don't trigger oom at page migration I think triggering OOM at mem_cgroup_prepare_migration would be just a bit overkill. Returning -ENOMEM would be enough for mem_cgroup_prepare_migration. The caller would handle the case anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
fee7b548e6f2bd4bfd03a1a45d3afd593de7d5e9 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: show real limit under hierarchy mode Show "real" limit of memcg. This helps my debugging and maybe useful for users. While testing hierarchy like this mount -t cgroup none /cgroup -t memory mkdir /cgroup/A set use_hierarchy==1 to "A" mkdir /cgroup/A/01 mkdir /cgroup/A/01/02 mkdir /cgroup/A/01/03 mkdir /cgroup/A/01/03/04 mkdir /cgroup/A/08 mkdir /cgroup/A/08/01 .... and set each own limit to them, "real" limit of each memcg is unclear. This patch shows real limit by checking all ancestors. Changelog: (v1) -> (v2) - remove "if" and use "min(a,b)" Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c772be939e078afd2505ede7d596a30f8f61de95 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: fix calculation of active_ratio Currently, inactive_ratio of memcg is calculated at setting limit. because page_alloc.c does so and current implementation is straightforward porting. However, memcg introduced hierarchy feature recently. In hierarchy restriction, memory limit is not only decided memory.limit_in_bytes of current cgroup, but also parent limit and sibling memory usage. Then, The optimal inactive_ratio is changed frequently. So, everytime calculation is better. Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a7885eb8ad465ec9db99ac5b5e6680f0ca8e11c8 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: swappiness Currently, /proc/sys/vm/swappiness can change swappiness ratio for global reclaim. However, memcg reclaim doesn't have tuning parameter for itself. In general, the optimal swappiness depend on workload. (e.g. hpc workload need to low swappiness than the others.) Then, per cgroup swappiness improve administrator tunability. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2733c06ac864ed40b9dfbbd5270f3f16949bd4a1 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: protect prev_priority Currently, mem_cgroup doesn't have own lock and almost its member doesn't need. (e.g. mem_cgroup->info is protected by zone lock, mem_cgroup->stat is per cpu variable) However, there is one explict exception. mem_cgroup->prev_priorit need lock, but doesn't protect. Luckly, this is NOT bug because prev_priority isn't used for current reclaim code. However, we plan to use prev_priority future again. Therefore, fixing is better. In addition, we plan to reuse this lock for another member. Then "reclaim_param_lock" name is better than "prev_priority_lock". Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7f016ee8b6a9a43f768e6252021f169abec4fa1f |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: show reclaim stat Add the following four fields to memory.stat file: - inactive_ratio - recent_rotated_anon - recent_rotated_file - recent_scanned_anon - recent_scanned_file Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9439c1c95b5c25b8031b2a7eb7e1590eb84be7f5 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: remove mem_cgroup_cal_reclaim() Now, get_scan_ratio() return correct value although memcg reclaim. Then, mem_cgroup_calc_reclaim() can be removed. So, memcg reclaim get the same capability of anon/file reclaim balancing as global reclaim now. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3e2f41f1f64744f7942980d93cc93dd3e5924560 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: add zone_reclaim_stat Introduce mem_cgroup_per_zone::reclaim_stat member and its statics collecting function. Now, get_scan_ratio() can calculate correct value on memcg reclaim. [hugh@veritas.com: avoid reclaim_stat oops when disabled] Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a3d8e0549d913e30968fa02e505dfe02c0a23e0d |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: add mem_cgroup_zone_nr_pages() Introduce mem_cgroup_zone_nr_pages(). It is called by zone_nr_pages() helper function. This patch doesn't have any behavior change. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
14797e2363c2b2f1ce139fd1c5a215e4e05aa1d9 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: add inactive_anon_is_low() The inactive_anon_is_low() is key component of active/inactive anon balancing on reclaim. However current inactive_anon_is_low() function only consider global reclaim. Therefore, we need following ugly scan_global_lru() condition. if (lru == LRU_ACTIVE_ANON && (!scan_global_lru(sc) || inactive_anon_is_low(zone))) { shrink_active_list(nr_to_scan, zone, sc, priority, file); return 0; it cause that memcg reclaim always deactivate pages when shrink_list() is called. To make mem_cgroup_inactive_anon_is_low() improve active/inactive anon balancing of memcgroup. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: "Pekka Enberg" <penberg@cs.helsinki.fi> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
549927620b04a8f8073ce2ee2a8977f209af2ee5 |
|
08-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
memcg: add null check to page_cgroup_zoneinfo() If CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, page_cgroup::mem_cgroup can be NULL. Therefore null checking is better. A later patch uses this function. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
670ec2f170301425fc4fdfa63d40652071fe85f6 |
|
08-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: hierarchy avoid unnecessary reclaim If hierarchy is not used, no tree-walk is necessary. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a7fe942e94b2f66aa0f11d37699c0ec8155d3ad1 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: swapout refcnt fix css's refcnt is dropped before end of following access. Hold it until end of access. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b85a96c0b6cb79c67e7b01b66368f2e31579d7c5 |
|
08-Jan-2009 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: memory swap controller: fix limit check There are scatterd calls of res_counter_check_under_limit(), and most of them don't take mem+swap accounting into account. define mem_cgroup_check_under_limit() and avoid direct use of res_counter_check_limit(). Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f9717d28d673468883df8ac34b47268719ac5a3d |
|
08-Jan-2009 |
Nikanth Karthikesan <knikanth@suse.de> |
memcg: check group leader fix Remove unnecessary codes (...fragments of not-implemented functionalilty...) Reported-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2c26fdd70c3094fa3e84caf9ef434911933d5477 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: revert gfp mask fix My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch changed gfp_mask of callers of charge to be GFP_HIGHUSER_MOVABLE for showing what will happen at memory reclaim. But in recent discussion, it's NACKed because it sounds ugly. This patch is for reverting it and add some clean up to gfp_mask of callers of charge. No behavior change but need review before generating HUNK in deep queue. This patch also adds explanation to meaning of gfp_mask passed to charge functions in memcontrol.h. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
887007561ae58628f03aa9046949747c04f63be8 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix reclaim result checks check_under_limit logic was wrong and this check should be against mem_over_limit rather than mem. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a636b327f731143ccc544b966cfd8de6cb6d72c6 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: avoid unnecessary system-wide-oom-killer Current mmtom has new oom function as pagefault_out_of_memory(). It's added for select bad process rathar than killing current. When memcg hit limit and calls OOM at page_fault, this handler called and system-wide-oom handling happens. (means kernel panics if panic_on_oom is true....) To avoid overkill, check memcg's recent behavior before starting system-wide-oom. And this patch also fixes to guarantee "don't accnout against process with TIF_MEMDIE". This is necessary for smooth OOM. [akpm@linux-foundation.org: build fix] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jan Blunck <jblunck@suse.de> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
18f59ea7de08db2449ba99185e8d8cc30e7acac5 |
|
08-Jan-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: memory cgroup hierarchy feature selector Don't enable multiple hierarchy support by default. This patch introduces a features element that can be set to enable the nested depth hierarchy feature. This feature can only be enabled when the cgroup for which the feature this is enabled, has no children. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6d61ef409d6ba168972f7c2f8c35baaade636a58 |
|
08-Jan-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: memory cgroup hierarchical reclaim This patch introduces hierarchical reclaim. When an ancestor goes over its limit, the charging routine points to the parent that is above its limit. The reclaim process then starts from the last scanned child of the ancestor and reclaims until the ancestor goes below its limit. [akpm@linux-foundation.org: coding-style fixes] [d-nishimura@mtf.biglobe.ne.jp: mem_cgroup_from_res_counter should handle both mem->res and mem->memsw] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
28dbc4b6a01fb579a9441c7b81e3d3413dc452df |
|
08-Jan-2009 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcg: memory cgroup resource counters for hierarchy Add support for building hierarchies in resource counters. Cgroups allows us to build a deep hierarchy, but we currently don't link the resource counters belonging to the memory controller control groups, in the same fashion as the corresponding cgroup entries in the cgroup hierarchy. This patch provides the infrastructure for resource counters that have the same hiearchy as their cgroup counter parts. These set of patches are based on the resource counter hiearchy patches posted by Pavel Emelianov. NOTE: Building hiearchies is expensive, deeper hierarchies imply charging the all the way up to the root. It is known that hiearchies are expensive, so the user needs to be careful and aware of the trade-offs before creating very deep ones. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f8d665422603ee1b8ed04dcad4242f14d623c941 |
|
08-Jan-2009 |
Hirokazu Takahashi <taka@valinux.co.jp> |
memcg: add mem_cgroup_disabled() We check mem_cgroup is disabled or not by checking mem_cgroup_subsys.disabled. I think it has more references than expected, now. replacing if (mem_cgroup_subsys.disabled) with if (mem_cgroup_disabled()) give us good look, I think. [kamezawa.hiroyu@jp.fujitsu.com: fix typo] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
08e552c69c6930d64722de3ec18c51844d06ee28 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: synchronized LRU A big patch for changing memcg's LRU semantics. Now, - page_cgroup is linked to mem_cgroup's its own LRU (per zone). - LRU of page_cgroup is not synchronous with global LRU. - page and page_cgroup is one-to-one and statically allocated. - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc); - SwapCache is handled. And, when we handle LRU list of page_cgroup, we do following. pc = lookup_page_cgroup(page); lock_page_cgroup(pc); .....................(1) mz = page_cgroup_zoneinfo(pc); spin_lock(&mz->lru_lock); .....add to LRU spin_unlock(&mz->lru_lock); unlock_page_cgroup(pc); But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct. This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, above sequence will be written as spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU mem_cgroup_add/remove/etc_lru() { pc = lookup_page_cgroup(page); mz = page_cgroup_zoneinfo(pc); if (PageCgroupUsed(pc)) { ....add to LRU } spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU This is much simpler. (*) We're safe even if we don't take lock_page_cgroup(pc). Because.. 1. When pc->mem_cgroup can be modified. - at charge. - at account_move(). 2. at charge the PCG_USED bit is not set before pc->mem_cgroup is fixed. 3. at account_move() the page is isolated and not on LRU. Pros. - easy for maintenance. - memcg can make use of laziness of pagevec. - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup. - LRU status of memcg will be synchronized with global LRU's one. - # of locks are reduced. - account_move() is simplified very much. Cons. - may increase cost of LRU rotation. (no impact if memcg is not configured.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8c7c6e34a1256a5082d38c8e9bd1474476912715 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: mem+swap controller core This patch implements per cgroup limit for usage of memory+swap. However there are SwapCache, double counting of swap-cache and swap-entry is avoided. Mem+Swap controller works as following. - memory usage is limited by memory.limit_in_bytes. - memory + swap usage is limited by memory.memsw_limit_in_bytes. This has following benefits. - A user can limit total resource usage of mem+swap. Without this, because memory resource controller doesn't take care of usage of swap, a process can exhaust all the swap (by memory leak.) We can avoid this case. And Swap is shared resource but it cannot be reclaimed (goes back to memory) until it's used. This characteristic can be trouble when the memory is divided into some parts by cpuset or memcg. Assume group A and group B. After some application executes, the system can be.. Group A -- very large free memory space but occupy 99% of swap. Group B -- under memory shortage but cannot use swap...it's nearly full. Ability to set appropriate swap limit for each group is required. Maybe someone wonder "why not swap but mem+swap ?" - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means to move account from memory to swap...there is no change in usage of mem+swap. In other words, when we want to limit the usage of swap without affecting global LRU, mem+swap limit is better than just limiting swap. Accounting target information is stored in swap_cgroup which is per swap entry record. Charge is done as following. map - charge page and memsw. unmap - uncharge page/memsw if not SwapCache. swap-out (__delete_from_swap_cache) - uncharge page - record mem_cgroup information to swap_cgroup. swap-in (do_swap_page) - charged as page and memsw. record in swap_cgroup is cleared. memsw accounting is decremented. swap-free (swap_free()) - if swap entry is freed, memsw is uncharged by PAGE_SIZE. There are people work under never-swap environments and consider swap as something bad. For such people, this mem+swap controller extension is just an overhead. This overhead is avoided by config or boot option. (see Kconfig. detail is not in this patch.) TODO: - maybe more optimization can be don in swap-in path. (but not very safe.) But we just do simple accounting at this stage. [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex] [hugh@veritas.com: memswap controller core swapcache fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c077719be8e9e6b55702117513d1b5f41d80404a |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: mem+swap controller Kconfig Config and control variable for mem+swap controller. This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP (memory resource controller swap extension.) For accounting swap, it's obvious that we have to use additional memory to remember "who uses swap". This adds more overhead. So, it's better to offer "choice" to users. This patch adds 2 choices. This patch adds 2 parameters to enable swap extension or not. - CONFIG - boot option Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d13d144309d2e5a3e6ad978b16c1d0226ddc9231 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: handle swap caches SwapCache support for memory resource controller (memcg) Before mem+swap controller, memcg itself should handle SwapCache in proper way. This is cut-out from it. In current memcg, SwapCache is just leaked and the user can create tons of SwapCache. This is a leak of account and should be handled. SwapCache accounting is done as following. charge (anon) - charged when it's mapped. (because of readahead, charge at add_to_swap_cache() is not sane) uncharge (anon) - uncharged when it's dropped from swapcache and fully unmapped. means it's not uncharged at unmap. Note: delete from swap cache at swap-in is done after rmap information is established. charge (shmem) - charged at swap-in. this prevents charge at add_to_page_cache(). uncharge (shmem) - uncharged when it's dropped from swapcache and not on shmem's radix-tree. at migration, check against 'old page' is modified to handle shmem. Comparing to the old version discussed (and caused troubles), we have advantages of - PCG_USED bit. - simple migrating handling. So, situation is much easier than several months ago, maybe. [hugh@veritas.com: memcg: handle swap caches build fix] Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c1e862c1f5ad34771b6d0a528cf681e0dcad7c86 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: new force_empty to free pages under group By memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of memory usage and force_empty is removed. This patch adds "force_empty" again, in reasonable manner. memory.force_empty file works when #echo 0 (or some) > memory.force_empty and have following function. 1. only works when there are no task in this cgroup. 2. free all page under this cgroup as much as possible. 3. page which cannot be freed will be moved up to parent. 4. Then, memcg will be empty after above echo returns. This is much better behavior than old "force_empty" which just forget all accounts. This patch also check signal_pending() and above "echo" can be stopped by "Ctrl-C". [akpm@linux-foundation.org: cleanup] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c8dad2bb6307f5b00f804a686917105206a4d5c9 |
|
08-Jan-2009 |
Jan Blunck <jblunck@suse.de> |
memcg: reduce size of mem_cgroup by using nr_cpu_ids As Jan Blunck <jblunck@suse.de> pointed out, allocating per-cpu stat for memcg to the size of NR_CPUS is not good. This patch changes mem_cgroup's cpustat allocation not based on NR_CPUS but based on nr_cpu_ids. Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
f817ed48535ac6510ebae7c4116f24a5f9268834 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: move all acccounting to parent at rmdir() This patch provides a function to move account information of a page between mem_cgroups and rewrite force_empty to make use of this. This moving of page_cgroup is done under - lru_lock of source/destination mem_cgroup is held. - lock_page_cgroup() is held. Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should confirm pc->mem_cgroup is still valid or not. Typical code can be following. (while page is not under lock_page()) mem = pc->mem_cgroup; mz = page_cgroup_zoneinfo(pc) spin_lock_irqsave(&mz->lru_lock); if (pc->mem_cgroup == mem) ...../* some list handling */ spin_unlock_irqrestore(&mz->lru_lock); Of course, better way is lock_page_cgroup(pc); .... unlock_page_cgroup(pc); But you should confirm the nest of lock and avoid deadlock. If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock, you don't have to worry about what pc->mem_cgroup points to. moved pages are added to head of lru, not to tail. Expected users of this routine is: - force_empty (rmdir) - moving tasks between cgroup (for moving account information.) - hierarchy (maybe useful.) force_empty(rmdir) uses this move_account and move pages to its parent. This "move" will not cause OOM (I added "oom" parameter to try_charge().) If the parent is busy (not enough memory), force_empty calls try_to_free_page() and reduce usage. Purpose of this behavior is - Fix "forget all" behavior of force_empty and avoid leak of accounting. - By "moving first, free if necessary", keep pages on memory as much as possible. Adding a switch to change behavior of force_empty to - free first, move if necessary - free all, if there is mlocked/busy pages, return -EBUSY. is under consideration. (I'll add if someone requtests.) This patch also removes memory.force_empty file, a brutal debug-only interface. Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
01b1ae63c2270cbacfd43fea94578c17950eb548 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: simple migration handling Now, management of "charge" under page migration is done under following manner. (Assume migrate page contents from oldpage to newpage) before - "newpage" is charged before migration. at success. - "oldpage" is uncharged at somewhere(unmap, radix-tree-replace) at failure - "newpage" is uncharged. - "oldpage" is charged if necessary (*1) But (*1) is not reliable....because of GFP_ATOMIC. This patch tries to change behavior as following by charge/commit/cancel ops. before - charge PAGE_SIZE (no target page) success - commit charge against "newpage". failure - commit charge against "oldpage". (PCG_USED bit works effectively to avoid double-counting) - if "oldpage" is obsolete, cancel charge of PAGE_SIZE. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bced0520fe462bb94021dcabd32e99630c171be2 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix gfp_mask of callers of charge Fix misuse of gfp_kernel. Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL. I think that this is from the fact that page_cgroup *was* dynamically allocated. But now, we allocate all page_cgroup at boot. And mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE + specified GFP_RECLAIM_MASK. * This is because we just want to reduce memory usage. "Where we should reclaim from ?" is not a problem in memcg. This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible. Note: This patch is not for fixing behavior but for showing sane information in source code. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7a81b88cb53e335ff7d019e6398c95792c817d93 |
|
08-Jan-2009 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: introduce charge-commit-cancel style of functions There is a small race in do_swap_page(). When the page swapped-in is charged, the mapcount can be greater than 0. But, at the same time some process (shares it ) call unmap and make mapcount 1->0 and the page is uncharged. CPUA CPUB mapcount == 1. (1) charge if mapcount==0 zap_pte_range() (2) mapcount 1 => 0. (3) uncharge(). (success) (4) set page's rmap() mapcount 0=>1 Then, this swap page's account is leaked. For fixing this, I added a new interface. - charge account to res_counter by PAGE_SIZE and try to free pages if necessary. - commit register page_cgroup and add to LRU if necessary. - cancel uncharge PAGE_SIZE because of do_swap_page failure. CPUA (1) charge (always) (2) set page's rmap (mapcount > 0) (3) commit charge was necessary or not after set_pte(). This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting. Usual mem_cgroup_charge_common() does charge -> commit at a time. And this patch also adds following function to clarify all charges. - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge() called against newly allocated anon pages. - mem_cgroup_charge_migrate_fixup() called only from remove_migration_ptes(). we'll have to rewrite this later.(this patch just keeps old behavior) This function will be removed by additional patch to make migration clearer. Good for clarifying "what we do" Then, we have 4 following charge points. - newpage - swap-in - add-to-cache. - migration. [akpm@linux-foundation.org: add missing inline directives to stubs] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d38d2a7582012ecf53aac33683ca5c689093cf65 |
|
06-Jan-2009 |
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> |
mm: make mem_cgroup_resize_limit() static Sparse output following warnings. mm/memcontrol.c:782:5: warning: symbol 'mem_cgroup_resize_limit' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
94b6da5ab8293b04a300ba35c72eddfa94db8b02 |
|
22-Oct-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix page_cgroup allocation page_cgroup_init() is called from mem_cgroup_init(). But at this point, we cannot call alloc_bootmem(). (and this caused panic at boot.) This patch moves page_cgroup_init() to init/main.c. Time table is following: == parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this. .... cgroup_init_early() # "early" init of cgroup. .... setup_arch() # memmap is allocated. ... page_cgroup_init(); mem_init(); # we cannot call alloc_bootmem after this. .... cgroup_init() # mem_cgroup is initialized. == Before page_cgroup_init(), mem_map must be initialized. So, I added page_cgroup_init() to init/main.c directly. (*) maybe this is not very clean but - cgroup_init_early() is too early - in cgroup_init(), we have to use vmalloc instead of alloc_bootmem(). use of vmalloc area in x86-32 is important and we should avoid very large vmalloc() in x86-32. So, we want to use alloc_bootmem() and added page_cgroup_init() directly to init/main.c [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration] [akpm@linux-foundation.org: fix build] Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
52d4b9ac0b985168009c2a57098324e67bae171f |
|
19-Oct-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: allocate all page_cgroup at boot Allocate all page_cgroup at boot and remove page_cgroup poitner from struct page. This patch adds an interface as struct page_cgroup *lookup_page_cgroup(struct page*) All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported. Remove page_cgroup pointer reduces the amount of memory by - 4 bytes per PAGE_SIZE. - 8 bytes per PAGE_SIZE if memory controller is disabled. (even if configured.) On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory. On my x86-64 server with 48GB of memory, this saves 96MB of memory. I think this reduction makes sense. By pre-allocation, kmalloc/kfree in charge/uncharge are removed. This means - we're not necessary to be afraid of kmalloc faiulre. (this can happen because of gfp_mask type.) - we can avoid calling kmalloc/kfree. - we can avoid allocating tons of small objects which can be fragmented. - we can know what amount of memory will be used for this extra-lru handling. I added printk message as "allocated %ld bytes of page_cgroup" "please try cgroup_disable=memory option if you don't want" maybe enough informative for users. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c05555b572921c464d064d9267f7f7bc06d424fa |
|
19-Oct-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: atomic ops for page_cgroup->flags This patch makes page_cgroup->flags to be atomic_ops and define functions (and macros) to access it. Before trying to modify memory resource controller, this atomic operation on flags is necessary. Most of flags in this patch is for LRU and modfied under mz->lru_lock but we'll add another flags which is not for LRU soon. For example, we'll place LOCK bit on flags field. We need atomic operation to modify LRU bit without LOCK. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
addb9efebb2ee2202d324e75b593b39868528f68 |
|
19-Oct-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: optimize per-cpu statistics Some obvious optimization to memcg. I found mem_cgroup_charge_statistics() is a little big (in object) and does unnecessary address calclation. This patch is for optimization to reduce the size of this function. And res_counter_charge() is 'likely' to succeed. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b7abea9630bc8ffc663a751e46680db25c4cdf8d |
|
19-Oct-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: make page->mapping NULL before uncharge This patch tries to make page->mapping to be NULL before mem_cgroup_uncharge_cache_page() is called. "page->mapping == NULL" is a good check for "whether the page is still radix-tree or not". This patch also adds BUG_ON() to mem_cgroup_uncharge_cache_page(); Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7b854121eb3e5ba0241882ff939e2c485228c9c5 |
|
19-Oct-2008 |
Lee Schermerhorn <Lee.Schermerhorn@hp.com> |
Unevictable LRU Page Statistics Report unevictable pages per zone and system wide. Kosaki Motohiro added support for memory controller unevictable statistics. [riel@redhat.com: fix printk in show_free_areas()] [akpm@linux-foundation.org: fix units in /proc/vmstats] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
894bc310419ac95f4fa4142dc364401a7e607f65 |
|
19-Oct-2008 |
Lee Schermerhorn <Lee.Schermerhorn@hp.com> |
Unevictable LRU Infrastructure When the system contains lots of mlocked or otherwise unevictable pages, the pageout code (kswapd) can spend lots of time scanning over these pages. Worse still, the presence of lots of unevictable pages can confuse kswapd into thinking that more aggressive pageout modes are required, resulting in all kinds of bad behaviour. Infrastructure to manage pages excluded from reclaim--i.e., hidden from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to maintain "unevictable" pages on a separate per-zone LRU list, to "hide" them from vmscan. Kosaki Motohiro added the support for the memory controller unevictable lru list. Pages on the unevictable list have both PG_unevictable and PG_lru set. Thus, PG_unevictable is analogous to and mutually exclusive with PG_active--it specifies which LRU list the page is on. The unevictable infrastructure is enabled by a new mm Kconfig option [CONFIG_]UNEVICTABLE_LRU. A new function 'page_evictable(page, vma)' in vmscan.c tests whether or not a page may be evictable. Subsequent patches will add the various !evictable tests. We'll want to keep these tests light-weight for use in shrink_active_list() and, possibly, the fault path. To avoid races between tasks putting pages [back] onto an LRU list and tasks that might be moving the page from non-evictable to evictable state, the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()' -- tests the "evictability" of a page after placing it on the LRU, before dropping the reference. If the page has become unevictable, putback_lru_page() will redo the 'putback', thus moving the page to the unevictable list. This way, we avoid "stranding" evictable pages on the unevictable list. [akpm@linux-foundation.org: fix fallout from out-of-order merge] [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build] [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check] [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework] [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c] [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure] [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch] [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4f98a2fee8acdb4ac84545df98cccecfd130f8db |
|
19-Oct-2008 |
Rik van Riel <riel@redhat.com> |
vmscan: split LRU lists into anon & file sets Split the LRU lists in two, one set for pages that are backed by real file systems ("file") and one for pages that are backed by memory and swap ("anon"). The latter includes tmpfs. The advantage of doing this is that the VM will not have to scan over lots of anonymous pages (which we generally do not want to swap out), just to find the page cache pages that it should evict. This patch has the infrastructure and a basic policy to balance how much we scan the anon lists and how much we scan the file lists. The big policy changes are in separate patches. [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset] [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru] [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page] [hugh@veritas.com: memcg swapbacked pages active] [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED] [akpm@linux-foundation.org: fix /proc/vmstat units] [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration] [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo] [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b69408e88bd86b98feb7b9a38fd865e1ddb29827 |
|
19-Oct-2008 |
Christoph Lameter <cl@linux-foundation.org> |
vmscan: Use an indexed array for LRU variables Currently we are defining explicit variables for the inactive and active list. An indexed array can be more generic and avoid repeating similar code in several places in the reclaim code. We are saving a few bytes in terms of code size: Before: text data bss dec hex filename 4097753 573120 4092484 8763357 85b7dd vmlinux After: text data bss dec hex filename 4097729 573120 4092484 8763333 85b7c5 vmlinux Having an easy way to add new lru lists may ease future work on the reclaim code. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
31a78f23bac0069004e69f98808b6988baccb6b6 |
|
29-Sep-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
mm owner: fix race between swapoff and exit There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a10cebf56ca7e7c034d1b6646230c6553e478967 |
|
22-Sep-2008 |
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> |
memcg: check under limit at shrink_usage Current memory cgroup(both in mainline and -mm) doesn't account swap caches as memory(swap cache support is dropped temporarily now). So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that have been moved to swap cache. But this makes mem_cgroup_shrink_usage fail easily if most of the pages are anon/shmem, and then shmem_getpage returns -ENOMEM and the process will be killed. This patch adds res_counter_check_under_limit to avoid these cases. BTW, even if swap cache support is enabled again, if a process is moved to another cgroup, which has been just made, between precharge and shrink_usage in shmem_getpage, shrink_usage may fail just because there is no pages to reclaim. So this change would make sense anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9623e078c1f4692a91531af2f639ec8aff8f0472 |
|
13-Aug-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: fix oops in mem_cgroup_shrink_usage Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs: yes, of course, loop0 has no mm: other entry points check but this didn't. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4ef1b0fd61333b3b81ebe29283898c6c84b15c9f |
|
30-Jul-2008 |
Li Zefan <lizf@cn.fujitsu.com> |
memcg: remove redundant check in move_task() It's guaranteed by cgroup that old_cgrp != cgrp. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
628f42355389cfb596ca3a5a5f64fb9054a2a06a |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: limit change shrink usage Shrinking memory usage at limit change. [akpm@linux-foundation.org: coding-style fixes] Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cede86acd8bd5d2205dec28db8ac86410a3a19e8 |
|
25-Jul-2008 |
Li Zefan <lizf@cn.fujitsu.com> |
memcg: clean up checking of the disabled flag Those checks are unnecessary, because when the subsystem is disabled it can't be mounted, so those functions won't get called. The check is needed in functions which will be called in other places except cgroup. [hugh@veritas.com: further checking of disabled flag] Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
accf163e6ab729f1fc5fffaa0310e498270bf4e7 |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: remove a redundant check Because of remove refcnt patch, it's very rare case to that mem_cgroup_charge_common() is called against a page which is accounted. mem_cgroup_charge_common() is called when. 1. a page is added into file cache. 2. an anon page is _newly_ mapped. A racy case is that a newly-swapped-in anonymous page is referred from prural threads in do_swap_page() at the same time. (a page is not Locked when mem_cgroup_charge() is called from do_swap_page.) Another case is shmem. It charges its page before calling add_to_page_cache(). Then, mem_cgroup_charge_cache() is called twice. This case is handled in mem_cgroup_cache_charge(). But this check may be too hacky... Signed-off-by : KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b76734e5e34e1889ab9fc5f3756570b1129f0f50 |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: add hints for branch Showing brach direction for obvious conditions. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c9b0ed51483cc2fc42bb801b6675c4231b0e4634 |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: helper function for relcaim from shmem. A new call, mem_cgroup_shrink_usage() is added for shmem handling and relacing non-standard usage of mem_cgroup_charge/uncharge. Now, shmem calls mem_cgroup_charge() just for reclaim some pages from mem_cgroup. In general, shmem is used by some process group and not for global resource (like file caches). So, it's reasonable to reclaim pages from mem_cgroup where shmem is mainly used. [hugh@veritas.com: shmem_getpage release page sooner] [hugh@veritas.com: mem_cgroup_shrink_usage css_put] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
69029cd550284e32de13d6dd2f77b723c8a0e444 |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: remove refcnt from page_cgroup memcg: performance improvements Patch Description 1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed) 2/5 ... swapcache handling patch 3/5 ... add helper function for shmem's memory reclaim patch 4/5 ... optimize by likely/unlikely ppatch 5/5 ... remove redundunt check patch (shmem handling is fixed.) Unix bench result. == 2.6.26-rc2-mm1 + memory resource controller Execl Throughput 2915.4 lps (29.6 secs, 3 samples) C Compiler Throughput 1019.3 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 5796.0 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 1097.7 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 565.3 lpm (60.0 secs, 3 samples) File Read 1024 bufsize 2000 maxblocks 1022128.0 KBps (30.0 secs, 3 samples) File Write 1024 bufsize 2000 maxblocks 544057.0 KBps (30.0 secs, 3 samples) File Copy 1024 bufsize 2000 maxblocks 346481.0 KBps (30.0 secs, 3 samples) File Read 256 bufsize 500 maxblocks 319325.0 KBps (30.0 secs, 3 samples) File Write 256 bufsize 500 maxblocks 148788.0 KBps (30.0 secs, 3 samples) File Copy 256 bufsize 500 maxblocks 99051.0 KBps (30.0 secs, 3 samples) File Read 4096 bufsize 8000 maxblocks 2058917.0 KBps (30.0 secs, 3 samples) File Write 4096 bufsize 8000 maxblocks 1606109.0 KBps (30.0 secs, 3 samples) File Copy 4096 bufsize 8000 maxblocks 854789.0 KBps (30.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 126145.2 lpm (30.0 secs, 3 samples) INDEX VALUES TEST BASELINE RESULT INDEX Execl Throughput 43.0 2915.4 678.0 File Copy 1024 bufsize 2000 maxblocks 3960.0 346481.0 875.0 File Copy 256 bufsize 500 maxblocks 1655.0 99051.0 598.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 854789.0 1473.8 Shell Scripts (8 concurrent) 6.0 1097.7 1829.5 ========= FINAL SCORE 991.3 == 2.6.26-rc2-mm1 + this set == Execl Throughput 3012.9 lps (29.9 secs, 3 samples) C Compiler Throughput 981.0 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 5872.0 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 1120.3 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 578.0 lpm (60.0 secs, 3 samples) File Read 1024 bufsize 2000 maxblocks 1003993.0 KBps (30.0 secs, 3 samples) File Write 1024 bufsize 2000 maxblocks 550452.0 KBps (30.0 secs, 3 samples) File Copy 1024 bufsize 2000 maxblocks 347159.0 KBps (30.0 secs, 3 samples) File Read 256 bufsize 500 maxblocks 314644.0 KBps (30.0 secs, 3 samples) File Write 256 bufsize 500 maxblocks 151852.0 KBps (30.0 secs, 3 samples) File Copy 256 bufsize 500 maxblocks 101000.0 KBps (30.0 secs, 3 samples) File Read 4096 bufsize 8000 maxblocks 2033256.0 KBps (30.0 secs, 3 samples) File Write 4096 bufsize 8000 maxblocks 1611814.0 KBps (30.0 secs, 3 samples) File Copy 4096 bufsize 8000 maxblocks 847979.0 KBps (30.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 128148.7 lpm (30.0 secs, 3 samples) INDEX VALUES TEST BASELINE RESULT INDEX Execl Throughput 43.0 3012.9 700.7 File Copy 1024 bufsize 2000 maxblocks 3960.0 347159.0 876.7 File Copy 256 bufsize 500 maxblocks 1655.0 101000.0 610.3 File Copy 4096 bufsize 8000 maxblocks 5800.0 847979.0 1462.0 Shell Scripts (8 concurrent) 6.0 1120.3 1867.2 ========= FINAL SCORE 1004.6 This patch: Remove refcnt from page_cgroup(). After this, * A page is charged only when !page_mapped() && no page_cgroup is assigned. * Anon page is newly mapped. * File page is added to mapping->tree. * A page is uncharged only when * Anon page is fully unmapped. * File page is removed from LRU. There is no change in behavior from user's view. This patch also removes unnecessary calls in rmap.c which was used only for refcnt mangement. [akpm@linux-foundation.org: fix warning] [hugh@veritas.com: fix shmem_unuse_inode charging] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e8589cc189f96b87348ae83ea4db38eaac624135 |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: better migration handling This patch changes page migration under memory controller to use a different algorithm. (thanks to Christoph for new idea.) Before: - page_cgroup is migrated from an old page to a new page. After: - a new page is accounted , no reuse of page_cgroup. Pros: - We can avoid compliated lock depndencies and races in migration. Cons: - new param to mem_cgroup_charge_common(). - mem_cgroup_getref() is added for handling ref_cnt ping-pong. This version simplifies complicated lock dependency in page migraiton under memory resource controller. new refcnt sequence is following. a mapped page: prepage_migration() ..... +1 to NEW page try_to_unmap() ..... all refs to OLD page is gone. move_pages() ..... +1 to NEW page if page cache. remap... ..... all refs from *map* is added to NEW one. end_migration() ..... -1 to New page. page's mapcount + (page_is_cache) refs are added to NEW one. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
508b7be0a5b06b64203512ed9b34191cddc83f56 |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: avoid unnecessary initialization * remove over-killing initialization (in fast path) * makeing the condition for PAGE_CGROUP_FLAG_ACTIVE be more obvious. Signed-off-by: KAMEAZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
a181b0e888a1d917edcab57cd73ccf7d8e75a46c |
|
25-Jul-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: make global var read_mostly mem_cgroup_subsys and page_cgroup_cache should be read_mostly and MEM_CGROUP_RECLAIM_RETRIES can be just a fixed number. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
856c13aa1ff6136c1968414fdea5938ea9d5ebf2 |
|
25-Jul-2008 |
Paul Menage <menage@google.com> |
cgroup files: convert res_counter_write() to be a cgroups write_string() handler Currently res_counter_write() is a raw file handler even though it's ultimately taking a number, since in some cases it wants to pre-process the string when converting it to a number. This patch converts res_counter_write() from a raw file handler to a write_string() handler; this allows some of the boilerplate copying/locking/checking to be removed, and simplies the cleanup path, since these functions are now performed by the cgroups framework. [lizf@cn.fujitsu.com: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
55e462b05b5df4fd113c4a304c4f487d44b0898e |
|
01-May-2008 |
Balaji Rao <balajirrao@gmail.com> |
memcg: simple stats for memory resource controller Implement trivial statistics for the memory resource controller. Signed-off-by: Balaji Rao <balajirrao@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1faf8e40a8ab12ae1f7f474965e6fb031e43f8d6 |
|
29-Apr-2008 |
Li Zefan <lizf@cn.fujitsu.com> |
memcg: remove redundant initialization in mem_cgroup_create() *mem has been zeroed, that means mem->info has already been filled with 0. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
33327948782bcef89c78eb47af86b6a2df9fd4a5 |
|
29-Apr-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcgroup: use vmalloc for mem_cgroup allocation On ia64, this kmalloc() requires order-4 pages. But this is not necessary to be physically contiguous. For big mem_cgroup, vmalloc is better. For small ones, kmalloc is used. [akpm@linux-foundation.org: simplification] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4a56d02e34baedbea5eb1fd558f2b856b8c7db1e |
|
29-Apr-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcgroup: make the memory controller more desktop responsive This patch makes the memory controller more responsive on my desktop. 1. Set all cached pages as inactive. We were by default marking all pages as active, thus forcing us to go through two passes for reclaiming pages 2. Remove congestion_wait(), since we already have that logic in do_try_to_free_pages() Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3eae90c3cdd4e762d0f4f5e939c98780fccded57 |
|
29-Apr-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: remove redundant function calls remove_list/add_list uses page_cgroup_zoneinfo() in it. So, it's called twice before and after lock. mz = page_cgroup_zoneinfo(); lock(); mz = page_cgroup_zoneinfo(); .... unlock(); And address of mz never changes. This is not good. This patch fixes this behavior. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
29f2a4dac856e9433a502b05b40e8e90385d8e27 |
|
29-Apr-2008 |
Pavel Emelyanov <xemul@openvz.org> |
memcgroup: implement failcounter reset This is a very common requirement from people using the resource accounting facilities (not only memcgroup but also OpenVZ beancounters). They want to put the cgroup in an initial state without re-creating it. For example after re-configuring a group people want to observe how this new configuration fits the group needs without saving the previous failcnt value. Merge two resets into one mem_cgroup_reset() function to demonstrate how multiplexing work. Besides, I have plans to move the files, that correspond to res_counter to the res_counter.c file and somehow "import" them into controller. I don't know how to make it gracefully yet, but merging resets of max_usage and failcnt in one function will be there for sure. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
85cc59db12724e1248f5e4841e61339cf485d5c7 |
|
29-Apr-2008 |
Pavel Emelyanov <xemul@openvz.org> |
memcgroup: use triggers in force_empty and max_usage files These two files are essentially event callbacks. They do not care about the contents of the string, but only about the fact of the write itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b6ac57d50a375aa2f267e1b2b56c46564a936d00 |
|
29-Apr-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memcgroup: move memory controller allocations to their own slabs Move the memory controller data structure page_cgroup to its own slab cache. It saves space on the system, allocations are not necessarily pushed to order of 2 and should provide performance benefits. Users who disable the memory controller can also double check that the memory controller is not allocating page_cgroup's. NOTE: Hugh Dickins brought up the issue of whether we want to mark page_cgroup as __GFP_MOVABLE or __GFP_RECLAIMABLE. I don't think there is an easy answer at the moment. page_cgroup's are associated with user pages, they can be reclaimed once the user page has been reclaimed, so it might make sense to mark them as __GFP_RECLAIMABLE. For now, I am leaving the marking to default values that the slab allocator uses. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c84872e168d10926acd2dee975d19172eef79252 |
|
29-Apr-2008 |
Pavel Emelyanov <xemul@openvz.org> |
memcgroup: add the max_usage member on the res_counter This field is the maximal value of the usage one since the counter creation (or since the latest reset). To reset this to the usage value simply write anything to the appropriate cgroup file. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cf475ad28ac35cc9ba612d67158f29b73b38b05d |
|
29-Apr-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
cgroups: add an owner to the mm_struct Remove the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct, to exist independent of the memory controller i.e., without adding mem_cgroup's for each controller, to mm_struct. A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option. This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_cgroup_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner. I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler. This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off. After the thread group leader exits, it's moved to init_css_state by cgroup_exit(), thus all future charges from runnings threads would be redirected to the init_css_set's subsystem. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: David Rientjes <rientjes@google.com>, Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c27e8818a09bbdfe7c07c629cb2c27e1a742e156 |
|
29-Apr-2008 |
Paul Menage <menage@google.com> |
CGroup API files: drop mem_cgroup_force_empty() This function isn't needed - a NULL pointer in the cftype read function will result in the same EINVAL response to userspace. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c64745cf0f34f2cb08fc28c93d844e583d0d591d |
|
29-Apr-2008 |
Paul Menage <menage@google.com> |
CGroup API files: use cgroup map for memcontrol stats file Remove the seq_file boilerplate used to construct the memcontrol stats map, and instead use the new map representation for cgroup control files Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2c3daa722b624eaf0c5ea60e4f180bd0684542e2 |
|
29-Apr-2008 |
Paul Menage <menage@google.com> |
CGroup API files: use read_u64 in memory controller Update the memory controller to use read_u64 for its limit/usage/failcnt control files, calling the new res_counter_read_u64() function. Signed-off-by: Paul Menage <menage@google.com> Cc: "Li Zefan" <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
41e3355de052693c7a0cad74b845148d262edadf |
|
09-Apr-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memcg: fix node_state handling This should be N_NORMAL_MEMORY. N_NORMAL_MEMORY is "true" if a node has memory for the kernel. N_HIGH_MEMORY is "true" if a node has memory for HIGHMEM. (If CONFIG_HIGHMEM=n, always "true") This check is used for testing whether we can use kmalloc_node() on a node. Then, if there is a node which only contains HIGHMEM, the system will call kmalloc_node() which doesn't contain memory for the kernel. If it happens under SLUB, the kernel will panic. I think this only happens on x86_32-numa. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4077960e2a38ec59096ff993cd080056e17f3707 |
|
04-Apr-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
memory controller: make memory resource control aware of boot options A boot option for the memory controller was discussed on lkml. It is a good idea to add it, since it saves memory for people who want to turn off the memory controller. By default the option is on for the following two reasons: 1. It provides compatibility with the current scheme where the memory controller turns on if the config option is enabled 2. It allows for wider testing of the memory controller, once the config option is enabled We still allow the create, destroy callbacks to succeed, since they are not aware of boot options. We do not populate the directory will memory resource controller specific files. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
52ea27eb4cd5f250f33638029a134ff03c5e6bbb |
|
20-Mar-2008 |
Pavel Emelyanov <xemul@openvz.org> |
memcgroup: fix check for thread being a group leader in memcgroup The check t->pid == t->pid is not the blessed way to check whether a task is a group leader. This is not about the code beautifulness only, but about pid namespaces fixes - both the tgid and the pid fields on the task_struct are (slowly :( ) becoming deprecated. Besides, the thread_group_leader() macro makes only one dereference :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
fb59e9f1e9786635ea12e12bf6adbb132e10f979 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: fix oops on NULL lru list While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list. I couldn't see what racing tasks on other cpus were doing, but surmise that another must have been in mem_cgroup_charge_common on the same page, between its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc which I'd almost changed to a kmalloc). Normally such a race cannot happen, the ref_cnt prevents it, the final uncharge cannot race with the initial charge. But force_empty buggers the ref_cnt, that's what it's all about; and thereafter forced pages are vulnerable to races such as this (just think of a shared page also mapped into an mm of another mem_cgroup than that just emptied). And remain vulnerable until they're freed indefinitely later. This patch just fixes the oops by moving the unlock_page_cgroups down below adding to and removing from the list (only possible given the previous patch); and while we're at it, we might as well make it an invariant that page->page_cgroup is always set while pc is on lru. But this behaviour of force_empty seems highly unsatisfactory to me: why have a ref_cnt if we always have to cope with it being violated (as in the earlier page migration patch). We may prefer force_empty to move pages to an orphan mem_cgroup (could be the root, but better not), from which other cgroups could recover them; we might need to reverse the locking again; but no time now for such concerns. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9b3c0a07e0fca35e36751680de3e4c76dbff5df3 |
|
04-Mar-2008 |
Hirokazu Takahashi <taka@valinux.co.jp> |
memcg: simplify force_empty and move_lists As for force_empty, though this may not be the main topic here, mem_cgroup_force_empty_list() can be implemented simpler. It is possible to make the function just call mem_cgroup_uncharge_page() instead of releasing page_cgroups by itself. The tip is to call get_page() before invoking mem_cgroup_uncharge_page(), so the page won't be released during this function. Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges, the page might have been reassigned to an lru of a different mem_cgroup, and now be emptied from that; but Hugh claims that's okay, the end state is the same as when it hasn't gone to another list. And once force_empty stops taking lock_page_cgroup within mz->lru_lock, mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while holding page_cgroup lock (but still has to use try_lock_page_cgroup). Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2680eed723b664d83e6181ae275fac0ec8fa05ff |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: fix mem_cgroup_move_lists locking Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went into page freeing, I've hit it from time to time in testing on some machines, sometimes only after many days. Recently found a machine which could usually produce it within a few hours, which got me there at last. The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the arrangement of structures was such that you got page_cgroups from the lru list neatly put on to SLUB's freelist. Kamezawa-san identified the same hole independently. The main problem was that it was missing the lock_page_cgroup it needs to safely page_get_page_cgroup; but it's tricky to go beyond that too, and I couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for comments on the constraints. This patch immediately gets replaced by a simpler one from Hirokazu-san; but is it just foolish pride that tells me to put this one on record, in case we need to come back to it later? Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6d48ff8bcfd403ec8d3ef7a56538ea9e6f773b9c |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: css_put after remove_list mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from it, and before removing page_cgroup from one of its lru lists: isn't there a danger that struct mem_cgroup memory could be freed and reused before completing that, so corrupting something? Never seen it, and for all I know there may be other constraints which make it impossible; but let's be defensive and reverse the ordering there. mem_cgroup_force_empty_list is safe because there's an extra css_get around all its works; but even so, change its ordering the same way round, to help get in the habit of doing it like this. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
b9c565d5a29a795f970b4a1340393d8fc6722fb9 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: remove clear_page_cgroup and atomics Remove clear_page_cgroup: it's an unhelpful helper, see for example how mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it (serious races from that? I'm not sure). Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be atomic: it's always manipulated under lock_page_cgroup, except where force_empty unilaterally reset it to 0 (and how does uncharge's atomic_dec_and_test protect against that?). Simplify this page_cgroup locking: if you've got the lock and the pc is attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to check that pc->page matches page (we're on the way to finding why sometimes it doesn't, but this patch doesn't fix that). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d5b69e38f8cdb1e41cc022305c86c9739bf1ffdb |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: memcontrol uninlined and static More cleanup to memcontrol.c, this time changing some of the code generated. Let the compiler decide what to inline (except for page_cgroup_locked which is only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc. was quite a waste since bit_spin_lock etc. are inlines in a header file; made mem_cgroup_force_empty and mem_cgroup_write_strategy static. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8869b8f6e09a1b49bf915eb03f663f2e4e8fbcd4 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: memcontrol whitespace cleanups Sorry, before getting down to more important changes, I'd like to do some cleanup in memcontrol.c. This patch doesn't change the code generated, but cleans up whitespace, moves up a double declaration, removes an unused enum, removes void returns, removes misleading comments, that kind of thing. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8289546e573d5ff681cdf0fc7a1184cca66fdb55 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: remove mem_cgroup_uncharge Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page, (a trivial wrapper around it) and mem_cgroup_end_migration (which does the same as mem_cgroup_uncharge_page). And it often ends up having to lock just to let its caller unlock. Remove it (but leave the silly locking until a later patch). Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7e924aafa4b03ff71de34af8553d9a1ebc86c071 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: mem_cgroup_charge never NULL My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to mem_cgroup_charge_common. It seemed convenient at the time, but hard to justify now: there's a perfectly appropriate swappage to charge and uncharge instead, this is not on any hot path through shmem_getpage, and no performance hit was observed from the slight extra overhead. So revert that NULL page handling from mem_cgroup_charge_common; and make it clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that was a helper I found more of a hindrance to understanding. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9442ec9df40d952b0de185ae5638a74970388e01 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: bad page if page_cgroup when free Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it were set here, it'd likely cause corruption when the page is reused. Don't use page_assign_page_cgroup to clear it: that should be private to memcontrol.c, and always called with the lock taken; and memmap_init_zone doesn't need it either - like page->mapping and other pointers throughout the kernel, Linux assumes pointers in zeroed structures are NULL pointers. Instead use page_reset_bad_cgroup, added to memcontrol.h for this only. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
427d5416f317681498337ab19218d195edea02d6 |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: move_lists on page not page_cgroup Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup: it's more convenient if it acts upon the page itself not the page_cgroup; and in a later patch this becomes important to handle within memcontrol.c. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bd845e38c7a7251a95a8f2c38aa7fb87140b771d |
|
04-Mar-2008 |
Hugh Dickins <hugh@veritas.com> |
memcg: mm_match_cgroup not vm_match_cgroup vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
2dda81ca31dc73e695ff8b83351f7aaefbef192a |
|
24-Feb-2008 |
Li Zefan <lizf@cn.fujitsu.com> |
memcgroup: return negative error code in mem_cgroup_create() Cgroup requires the subsystem to return negative error code on error in the create method. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
7fde4c3eb7ee68828d76a2148ed6d70b6a794add |
|
24-Feb-2008 |
Li Zefan <lizf@cn.fujitsu.com> |
memcgroup: remove a useless VM_BUG_ON() Remove this VM_BUG_ON(), as Balbir stated: We used to have a for loop with !list_empty() as a termination condition and VM_BUG_ON(!pc) is a spill over. With the new loop, VM_BUG_ON(!pc) does not make sense. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Balbir Singh <balbir@in.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
60c12b1202a60eabb1c61317e5d2678fcea9893f |
|
09-Feb-2008 |
David Rientjes <rientjes@google.com> |
memcontrol: add vm_match_cgroup() mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer is pointing to a specific cgroup. Instead of returning the pointer, we can just do the test itself in a new macro: vm_match_cgroup(mm, cgroup) returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it returns zero. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3c541e14bfa553133c3473a6ed3e4c0583ea2285 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller remove control_type feature Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt that control_type might not be a good thing to implement right away. We can add this flexibility at a later point when required. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
072c56c13e1302fcdc39961dc64e76485731ad67 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: per-zone-lock for cgroup Now, lru is per-zone. Then, lru_lock can be (should be) per-zone, too. This patch implementes per-zone lru lock. lru_lock is placed into mem_cgroup_per_zone struct. lock can be accessed by mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone); &mz->lru_lock or mz = page_cgroup_zoneinfo(page_cgroup); &mz->lru_lock Signed-off-by: KAMEZAWA hiroyuki <kmaezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
1ecaab2bd221251a3fd148abb08e8b877f1e93c8 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: per zone lru for cgroup This patch implements per-zone lru for memory cgroup. This patch makes use of mem_cgroup_per_zone struct for per zone lru. LRU can be accessed by mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone); &mz->active_list &mz->inactive_list or mz = page_cgroup_zoneinfo(page_cgroup); &mz->active_list &mz->inactive_list Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cc38108e1ba7f3b9e12b82d0236fa3730c2e0439 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: calculate the number of pages to be scanned per cgroup Define function for calculating the number of scan target on each Zone/LRU. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6c48a1d040a9a9eaa4acdd7d4cb3885e04bf8413 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: remember reclaim priority in memory cgroup Functions to remember reclaim priority per cgroup (as zone->prev_priority) [akpm@linux-foundation.org: build fixes] [akpm@linux-foundation.org: more build fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
5932f3671bb2dd873c5ac443cbf5dc2cd167ae94 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: calculate active/inactive imbalance per cgroup Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
58ae83db2a40dea15d4277d499a11dadc823c388 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: calculate mapper_ratio per cgroup Define function for calculating mapped_ratio in memory cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
6d12e2d8ddbe653d80ea4f71578481c1bc933025 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: per-zone active inactive counter This patch adds per-zone status in memory cgroup. These values are often read (as per-zone value) by page reclaiming. In current design, per-zone stat is just a unsigned long value and not an atomic value because they are modified only under lru_lock. (So, atomic_ops is not necessary.) This patch adds ACTIVE and INACTIVE per-zone status values. For handling per-zone status, this patch adds struct mem_cgroup_per_zone { ... } and some helper functions. This will be useful to add per-zone objects in mem_cgroup. This patch turns memory controller's early_init to be 0 for calling kmalloc() in initialization. Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c0149530d0bb356c933a09f3c8103ea02f452d8a |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
per-zone and reclaim enhancements for memory controller: nid/zid helper function for cgroup Add macro to get node_id and zone_id of page_cgroup. Will be used in per-zone-xxx patches and others. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
df878fb04dea044378274d40d063279a9cb787fb |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memory cgroup enhancements: implicit force_empty() at rmdir Add pre_destroy handler for mem_cgroup and try to make mem_cgroup empty at rmdir(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d2ceb9b7ddedbb2e8e590bc6ce33c854043016f9 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memory cgroup enhancements: add memory.stat file Show accounted information of memory cgroup by memory.stat file [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix printk warning] Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
d52aa412d43827033a8e2ce4415ef6e8f8d53635 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memory cgroup enhancements: add status accounting function for memory cgroup Add statistics account infrastructure for memory controller. All account information is stored per-cpu and caller will not have to take lock or use atomic ops. This will be used by memory.stat file later. CACHE includes swapcache now. I'd like to divide it to PAGECACHE and SWAPCACHE later. This patch adds 3 functions for accounting. * __mem_cgroup_stat_add() ... for usual routine. * __mem_cgroup_stat_add_safe ... for calling under irq_disabled section. * mem_cgroup_read_stat() ... for reading stat value. * renamed PAGECACHE to CACHE (because it may include swapcache *now*) [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible] [akpm@linux-foundation.org: uninline things] [akpm@linux-foundation.org: remove dead code] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: Kirill Korotaev <dev@sw.ru> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Paul Menage <menage@google.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3564c7c45156b358efe921ab2e4e516dad92c94c |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memory cgroup enhancements: remember "a page is on active list of cgroup or not" Remember page_cgroup is on active_list or not in page_cgroup->flags. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
82369553d6d3bc67c54129a02e0bc0b5b88f3045 |
|
07-Feb-2008 |
Hugh Dickins <hugh@veritas.com> |
memcgroup: fix hang with shmem/tmpfs The memcgroup regime relies upon a cgroup reclaiming pages from itself within add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs rely upon using add_to_page_cache while holding a spinlock: when it cannot wait. The consequence is that when a cgroup reaches its limit, shmem_getpage just hangs - unless there is outside memory pressure too, neither kswapd nor radix_tree_preload get it out of the retry loop. In most cases we can mem_cgroup_cache_charge the page waitably first, to attach the page_cgroup in advance, so add_to_page_cache will do no more than increment a count; then mem_cgroup_uncharge_page after (in both success and failure cases) to balance the books again. And where there used to be a congestion_wait for kswapd (recently made redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page to go through a cycle of allocation and freeing, without accounting to any particular page, and without updating the statistics vector. This brings the cgroup below its limit so the next try usually succeeds. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3be91277e754c7db04eae145ba622b3a3e3ad96d |
|
07-Feb-2008 |
Hugh Dickins <hugh@veritas.com> |
memcgroup: tidy up mem_cgroup_charge_common Tidy up mem_cgroup_charge_common before extending it. Adjust some comments, but mainly clean up its loop: I've an aversion to loops full of continues, then a break or a goto at the bottom. And the is_atomic test should be on the __GFP_WAIT bit, not GFP_ATOMIC bits. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ac44d354d5c9ced49b1165d6496f134501134219 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller use rcu_read_lock() in mem_cgroup_cache_charge() Hugh Dickins noticed that we were using rcu_dereference() without rcu_read_lock() in the cache charging routine. The patch below fixes this problem Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
217bc3194d57150549e9234e6ddfee30de28cc78 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memory cgroup enhancements: remember "a page is charged as page cache" Add a flag to page_cgroup to remember "this page is charged as cache." cache here includes page caches and swap cache. This is useful for implementing precise accounting in memory cgroup. TODO: distinguish page-cache and swap-cache Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
cc8475822f8a4b17e9b76e7fadb6b9a341860422 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
memory cgroup enhancements: force_empty interface for dropping all account in empty cgroup This patch adds an interface "memory.force_empty". Any write to this file will drop all charges in this cgroup if there is no task under. %echo 1 > /....../memory.force_empty will drop all charges of memory cgroup if cgroup's tasks is empty. This is useful to invoke rmdir() against memory cgroup successfully. Tested and worked well on x86_64/fake-NUMA system. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
436c6541b13a73790646eb11429bdc8ee50eec41 |
|
07-Feb-2008 |
Hugh Dickins <hugh@veritas.com> |
memcgroup: fix zone isolation OOM mem_cgroup_charge_common shows a tendency to OOM without good reason, when a memhog goes well beyond its rss limit but with plenty of swap available. Seen on x86 but not on PowerPC; seen when the next patch omits swapcache from memcgroup, but we presume it can happen without. mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM avoidance. Already it has to scan beyond the nr_to_scan limit when it finds a !LRU page or an active page when handling inactive or an inactive page when handling active. It needs to do exactly the same when it finds a page from the wrong zone (the x86 tests had two zones, the PowerPC tests had only one). Don't increment scan and then decrement it in these cases, just move the incrementation down. Fix recent off-by-one when checking against nr_to_scan. Cut out "Check if the meta page went away from under us", presumably left over from early debugging: no amount of such checks could save us if this list really were being updated without locking. This change does make the unlimited scan while holding two spinlocks even worse - bad for latency and bad for containment; but that's a separate issue which is better left to be fixed a little later. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Pavel Emelianov <xemul@openvz.org> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ff7283fa3a66823933991ad55a558a3a01d5ab27 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
bugfix for memory cgroup controller: avoid !PageLRU page in mem_cgroup_isolate_pages This patch makes mem_cgroup_isolate_pages() to be - ignore !PageLRU pages. - fixes the bug that isolation makes no progress if page_zone(page) != zone page once find. (just increment scan in this case.) kswapd and memory migration removes a page from list when it handles a page for reclaiming/migration. Because __isolate_lru_page() doesn't moves page !PageLRU pages, it will be safe to avoid touching !PageLRU() page and its page_cgroup. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
ae41be374293e70e1ed441d986afcc6e744ef9d9 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
bugfix for memory cgroup controller: migration under memory controller fix While using memory control cgroup, page-migration under it works as following. == 1. uncharge all refs at try to unmap. 2. charge regs again remove_migration_ptes() == This is simple but has following problems. == The page is uncharged and charged back again if *mapped*. - This means that cgroup before migration can be different from one after migration - If page is not mapped but charged as page cache, charge is just ignored (because not mapped, it will not be uncharged before migration) This is memory leak. == This patch tries to keep memory cgroup at page migration by increasing one refcnt during it. 3 functions are added. mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup mem_cgroup_page_migration() --- copy page->page_cgroup from old page to new page. During migration - old page is under PG_locked. - new page is under PG_locked, too. - both old page and new page is not on LRU. These 3 facts guarantee that page_cgroup() migration has no race. Tested and worked well in x86_64/fake-NUMA box. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
9175e0311ec9e6d1bf1f6dfecf9268baf08765e6 |
|
07-Feb-2008 |
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> |
bugfix for memory controller: add helper function for assigning cgroup to page This patch adds following functions. - clear_page_cgroup(page, pc) - page_cgroup_assign_new_page_group(page, pc) Mainly for cleanup. A manner "check page->cgroup again after lock_page_cgroup()" is implemented in straight way. A comment in mem_cgroup_uncharge() will be removed by force-empty patch Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
4c4a22148909e4c003562ea7ffe0a06e26919e3c |
|
07-Feb-2008 |
David Rientjes <rientjes@google.com> |
memcontrol: move oom task exclusion to tasklist scan Creates a helper function to return non-zero if a task is a member of a memory controller: int task_in_mem_cgroup(const struct task_struct *task, const struct mem_cgroup *mem); When the OOM killer is constrained by the memory controller, the exclusion of tasks that are not a member of that controller was previously misplaced and appeared in the badness scoring function. It should be excluded during the tasklist scan in select_bad_process() instead. [akpm@linux-foundation.org: build fix] Cc: Christoph Lameter <clameter@sgi.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
3062fc67dad01b1d2a15d58c709eff946389eca4 |
|
07-Feb-2008 |
David Rientjes <rientjes@google.com> |
memcontrol: move mm_cgroup to header file Inline functions must preceed their use, so mm_cgroup() should be defined in linux/memcontrol.h. include/linux/memcontrol.h:48: warning: 'mm_cgroup' declared inline after being called include/linux/memcontrol.h:48: warning: previous declaration of 'mm_cgroup' was here [akpm@linux-foundation.org: build fix] [akpm@linux-foundation.org: nuther build fix] Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
e1a1cd590e3fcb0d2e230128daf2337ea55387dc |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: make charging gfp mask aware Nick Piggin pointed out that swap cache and page cache addition routines could be called from non GFP_KERNEL contexts. This patch makes the charging routine aware of the gfp context. Charging might fail if the cgroup is over it's limit, in which case a suitable error is returned. This patch was tested on a Powerpc box. I am still looking at being able to test the path, through which allocations happen in non GFP_KERNEL contexts. [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
bed7161a519a2faef53e1bce1b47595e297c1d14 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: make page_referenced() cgroup aware Make page_referenced() cgroup aware. Without this patch, page_referenced() can cause a page to be skipped while reclaiming pages. This patch ensures that other cgroups do not hold pages in a particular cgroup hostage. It is required to ensure that shared pages are freed from a cgroup when they are not actively referenced from the cgroup that brought them in Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8697d33194faae6fdd6b2e799f6308aa00cfdf67 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: add switch to control what type of pages to limit Choose if we want cached pages to be accounted or not. By default both are accounted for. A new set of tunables are added. echo -n 1 > mem_control_type switches the accounting to account for only mapped pages echo -n 3 > mem_control_type switches the behaviour back [bunk@kernel.org: mm/memcontrol.c: clenups] [akpm@linux-foundation.org: fix sparc32 build] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
c7ba5c9e8176704bfac0729875fa62798037584d |
|
07-Feb-2008 |
Pavel Emelianov <xemul@openvz.org> |
Memory controller: OOM handling Out of memory handling for cgroups over their limit. A task from the cgroup over limit is chosen using the existing OOM logic and killed. TODO: 1. As discussed in the OLS BOF session, consider implementing a user space policy for OOM handling. [akpm@linux-foundation.org: fix build due to oom-killer changes] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
0eea10301708c64a6b793894c156e21ddd15eb64 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller improve user interface Change the interface to use bytes instead of pages. Page sizes can vary across platforms and configurations. A new strategy routine has been added to the resource counters infrastructure to format the data as desired. Suggested by David Rientjes, Andrew Morton and Herbert Poetzl Tested on a UML setup with the config for memory control enabled. [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
66e1707bc34609f626e2e7b4fe7e454c9748bad5 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: add per cgroup LRU and reclaim Add the page_cgroup to the per cgroup LRU. The reclaim algorithm has been modified to make the isolate_lru_pages() as a pluggable component. The scan_control data structure now accepts the cgroup on behalf of which reclaims are carried out. try_to_free_pages() has been extended to become cgroup aware. [akpm@linux-foundation.org: fix warning] [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member] [bunk@kernel.org: make do_try_to_free_pages() static] [hugh@veritas.com: memcgroup: fix try_to_free order] [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
67e465a77ba658635309ee00b367bec6555ea544 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: task migration Allow tasks to migrate from one cgroup to the other. We migrate mm_struct's mem_cgroup only when the thread group id migrates. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8a9f3ccd24741b50200c3f33d62534c7271f3dfc |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: memory accounting Add the accounting hooks. The accounting is carried out for RSS and Page Cache (unmapped) pages. There is now a common limit and accounting for both. The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap() time. Page cache is accounted at add_to_page_cache(), __delete_from_page_cache(). Swap cache is also accounted for. Each page's page_cgroup is protected with the last bit of the page_cgroup pointer, this makes handling of race conditions involving simultaneous mappings of a page easier. A reference count is kept in the page_cgroup to deal with cases where a page might be unmapped from the RSS of all tasks, but still lives in the page cache. Credits go to Vaidyanathan Srinivasan for helping with reference counting work of the page cgroup. Almost all of the page cache accounting code has help from Vaidyanathan Srinivasan. [hugh@veritas.com: fix swapoff breakage] [akpm@linux-foundation.org: fix locking] Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: <Valdis.Kletnieks@vt.edu> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
78fb74669e80883323391090e4d26d17fe29488f |
|
07-Feb-2008 |
Pavel Emelianov <xemul@openvz.org> |
Memory controller: accounting setup Basic setup routines, the mm_struct has a pointer to the cgroup that it belongs to and the the page has a page_cgroup associated with it. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|
8cdea7c05454260c0d4d83503949c358eb131d17 |
|
07-Feb-2008 |
Balbir Singh <balbir@linux.vnet.ibm.com> |
Memory controller: cgroups setup Setup the memory cgroup and add basic hooks and controls to integrate and work with the cgroup. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Kirill Korotaev <dev@sw.ru> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: David Rientjes <rientjes@google.com> Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/mm/memcontrol.c
|