History log of /kernel/cgroup.c
Revision Date Author Comments
4ac39048706e38815cd6784244a4e1eebc72eff0 03-Jun-2015 Christian Poetzsch <christian.potzsch@imgtec.com> Fix generic cgroup subsystem permission checks

In 53b5e2f generic cgroup subsystem permission checks have been added.
When this is been done within procs_write an empty taskset is added to
the tasks css set. When a task later on migrates to a new group we see a
dmesg warning cause the mg_node isn't empty (cgroup.c:2086). Cause this
happens all the time this spams dmesg.

I am not really familiar with this code, but it looks to me like adding
the taskset is just a temporary action in this context. Therefore this
taskset should be removed after the actual check. This is what this fix
does.

This problem was seen and the fix tested on x86 using l-mr1 and master.

Change-Id: I9894d39e8b5692ef65149002b07e65a84a33ffea
Signed-off-by: Christian Poetzsch <christian.potzsch@imgtec.com>
53b5e2f0b1cebfd3493ba8ba3bb9d14924a9fa11 13-Jul-2011 Colin Cross <ccross@android.com> cgroup: Add generic cgroup subsystem permission checks

Rather than using explicit euid == 0 checks when trying to move
tasks into a cgroup via CFS, move permission checks into each
specific cgroup subsystem. If a subsystem does not specify a
'allow_attach' handler, then we fall back to doing our checks
the old way.

Use the 'allow_attach' handler for the 'cpu' cgroup to allow
non-root processes to add arbitrary processes to a 'cpu' cgroup
if it has the CAP_SYS_NICE capability set.

This version of the patch adds a 'allow_attach' handler instead
of reusing the 'can_attach' handler. If the 'can_attach' handler
is reused, a new cgroup that implements 'can_attach' but not
the permission checks could end up with no permission checks
at all.

Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c
Original-Author: San Mehat <san@google.com>
Signed-off-by: Colin Cross <ccross@android.com>
63351df85f7ece011618955054c1b0d15b393691 26-Jan-2015 Ruchi Kandoi <kandoiruchi@google.com> Fix compile errors in accordance with changes from 3.14 to 3.18

Signed-off-by: Ruchi Kandoi<kandoiruchi@google.com>
16474253cb671233c662ad89a88fee802dc36e15 07-Nov-2014 Rom Lemarchand <romlem@android.com> cgroup: refactor allow_attach function into common code

move cpu_cgroup_allow_attach to a common subsys_cgroup_allow_attach.
This allows any process with CAP_SYS_NICE to move tasks across cgroups if
they use this function as their allow_attach handler.

Bug: 18260435
Change-Id: I6bb4933d07e889d0dc39e33b4e71320c34a2c90f
Signed-off-by: Rom Lemarchand <romlem@android.com>
e756c7b698604f11a979f2781d06eb7b80aba363 25-Sep-2014 Zefan Li <lizefan@huawei.com> Revert "cgroup: remove redundant variable in cgroup_mount()"

This reverts commit 0c7bf3e8cab7900e17ce7f97104c39927d835469.

If there are child cgroups in the cgroupfs and then we umount it,
the superblock will be destroyed but the cgroup_root will be kept
around. When we mount it again, cgroup_mount() will find this
cgroup_root and allocate a new sb for it.

So with this commit we will be trapped in a dead loop in the case
described above, because kernfs_pin_sb() keeps returning NULL.

Currently I don't see how we can avoid using both pinned_sb and
new_sb, so just revert it.

Cc: Al Viro <viro@ZenIV.linux.org.uk>
Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2aad2a86f6685c10360ec8a5a55eb9ab7059cb72 24-Sep-2014 Tejun Heo <tj@kernel.org> percpu_ref: add PERCPU_REF_INIT_* flags

With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC : start ref in atomic mode
* PERCPU_REF_INIT_DEAD : start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
0c7bf3e8cab7900e17ce7f97104c39927d835469 20-Sep-2014 Zefan Li <lizefan@huawei.com> cgroup: remove redundant variable in cgroup_mount()

Both pinned_sb and new_sb indicate if a new superblock is needed,
so we can just remove new_sb.

Note now we must check if kernfs_tryget_sb() returns NULL, because
when it returns NULL, kernfs_mount() may still re-use an existing
superblock, which is just allocated by another concurent mount.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
3e2cd91ab92665148616a80dc0745c499d2746a7 20-Sep-2014 Zefan Li <lizefan@huawei.com> cgroup: fix missing unlock in cgroup_release_agent()

The patch 971ff4935538: "cgroup: use a per-cgroup work for release
agent" from Sep 18, 2014, leads to the following static checker
warning:

kernel/cgroup.c:5310 cgroup_release_agent()
warn: 'mutex:&cgroup_mutex' is sometimes locked here and sometimes unlocked.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a25eb52e81a40e986179a790fbb5a1f02f482b7a 19-Sep-2014 Zefan Li <lizefan@huawei.com> cgroup: remove CGRP_RELEASABLE flag

We call put_css_set() after setting CGRP_RELEASABLE flag in
cgroup_task_migrate(), but in other places we call it without setting
the flag. I don't see the necessity of this flag.

Moreover once the flag is set, it will never be cleared, unless writing
to the notify_on_release control file, so it can be quite confusing
if we look at the output of debug.releasable.

# mount -t cgroup -o debug xxx /cgroup
# mkdir /cgroup/child
# cat /cgroup/child/debug.releasable
0 <-- shows 0 though the cgroup is empty
# echo $$ > /cgroup/child/tasks
# cat /cgroup/child/debug.releasable
0
# echo $$ > /cgroup/tasks && echo $$ > /cgroup/child/tasks
# cat /proc/child/debug.releasable
1 <-- shows 1 though the cgroup is not empty

This patch removes the flag, and now debug.releasable shows if the
cgroup is empty or not.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
006f4ac49742b5f70ef7e39176fd42a500144ccc 18-Sep-2014 Zefan Li <lizefan@huawei.com> cgroup: simplify proc_cgroup_show()

Use the ONE macro instead of REG, and we can simplify proc_cgroup_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
971ff49355387fef41d1327434d8939721a4eb35 18-Sep-2014 Zefan Li <lizefan@huawei.com> cgroup: use a per-cgroup work for release agent

Instead of using a global work to schedule release agent on removable
cgroups, we change to use a per-cgroup work to do this, which makes
the code much simpler.

v2: use a dedicated work instead of reusing css->destroy_work. (Tejun)

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
eb4aec84d6bdf98d00cedb41c18000f7a31e648a 18-Sep-2014 Zefan Li <lizefan@huawei.com> cgroup: fix unbalanced locking

cgroup_pidlist_start() holds cgrp->pidlist_mutex and then calls
pidlist_array_load(), and cgroup_pidlist_stop() releases the mutex.

It is wrong that we release the mutex in the failure path in
pidlist_array_load(), because cgroup_pidlist_stop() will be called
no matter if cgroup_pidlist_start() returns errno or not.

Fixes: 4bac00d16a8760eae7205e41d2c246477d42a210
Cc: <stable@vger.kernel.org> # 3.14+
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
0c8fc2c1210556434835adfb2274f41704853e8a 17-Sep-2014 Li Zefan <lizefan@huawei.com> cgroup: remove bogus comments

We never grab cgroup mutex in fork and exit paths no matter whether
notify_on_release is set or not.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
244bb9a6336d2aa53526261ec35c593ebd5c1a33 17-Sep-2014 Li Zefan <lizefan@huawei.com> cgroup: remove redundant code in cgroup_rmdir()

We no longer clear kn->priv in cgroup_rmdir(), so we don't need
to get an extra refcnt.

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6213daab2547fdc0d02a86abf3ac209ac6881ae3 17-Sep-2014 Li Zefan <lizefan@huawei.com> cgroup: remove some useless forward declarations

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a34375ef9e65340a138fc0be287de5c940d260fc 08-Sep-2014 Tejun Heo <tj@kernel.org> percpu-refcount: add @gfp to percpu_ref_init()

Percpu allocator now supports allocation mask. Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.

This patch doesn't make any functional difference.

v2: blk-mq conversion was missing. Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
aa32362f011c6e863132b16c1761487166a4bad2 04-Sep-2014 Li Zefan <lizefan@huawei.com> cgroup: check cgroup liveliness before unbreaking kernfs

When cgroup_kn_lock_live() is called through some kernfs operation and
another thread is calling cgroup_rmdir(), we'll trigger the warning in
cgroup_get().

------------[ cut here ]------------
WARNING: CPU: 1 PID: 1228 at kernel/cgroup.c:1034 cgroup_get+0x89/0xa0()
...
Call Trace:
[<c16ee73d>] dump_stack+0x41/0x52
[<c10468ef>] warn_slowpath_common+0x7f/0xa0
[<c104692d>] warn_slowpath_null+0x1d/0x20
[<c10bb999>] cgroup_get+0x89/0xa0
[<c10bbe58>] cgroup_kn_lock_live+0x28/0x70
[<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230
[<c10be5b2>] cgroup_tasks_write+0x12/0x20
[<c10bb7b0>] cgroup_file_write+0x40/0x130
[<c11aee71>] kernfs_fop_write+0xd1/0x160
[<c1148e58>] vfs_write+0x98/0x1e0
[<c114934d>] SyS_write+0x4d/0xa0
[<c16f656b>] sysenter_do_call+0x12/0x12
---[ end trace 6f2e0c38c2108a74 ]---

Fix this by calling css_tryget() instead of cgroup_get().

v2:
- move cgroup_tryget() right below cgroup_get() definition. (Tejun)

Cc: <stable@vger.kernel.org> # 3.15+
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a4189487da1b4f8260c6006b9dc47c3c4107a5ae 04-Sep-2014 Li Zefan <lizefan@huawei.com> cgroup: delay the clearing of cgrp->kn->priv

Run these two scripts concurrently:

for ((; ;))
{
mkdir /cgroup/sub
rmdir /cgroup/sub
}

for ((; ;))
{
echo $$ > /cgroup/sub/cgroup.procs
echo $$ > /cgroup/cgroup.procs
}

A kernel bug will be triggered:

BUG: unable to handle kernel NULL pointer dereference at 00000038
IP: [<c10bbd69>] cgroup_put+0x9/0x80
...
Call Trace:
[<c10bbe19>] cgroup_kn_unlock+0x39/0x50
[<c10bbe91>] cgroup_kn_lock_live+0x61/0x70
[<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230
[<c10be5b2>] cgroup_tasks_write+0x12/0x20
[<c10bb7b0>] cgroup_file_write+0x40/0x130
[<c11aee71>] kernfs_fop_write+0xd1/0x160
[<c1148e58>] vfs_write+0x98/0x1e0
[<c114934d>] SyS_write+0x4d/0xa0
[<c16f656b>] sysenter_do_call+0x12/0x12

We clear cgrp->kn->priv in the end of cgroup_rmdir(), but another
concurrent thread can access kn->priv after the clearing.

We should move the clearing to css_release_work_fn(). At that time
no one is holding reference to the cgroup and no one can gain a new
reference to access it.

v2:
- move RCU_INIT_POINTER() into the else block. (Tejun)
- remove the cgroup_parent() check. (Tejun)
- update the comment in css_tryget_online_from_dir().

Cc: <stable@vger.kernel.org> # 3.15+
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
251f8c0364f99fc21fcc7b07e4ec6b4f3250d841 25-Aug-2014 Dongsheng Yang <yangds.fnst@cn.fujitsu.com> cgroup: fix a typo in comment.

There is no function named cgroup_enable_task_cg_links().
Instead, the correct function name in this comment should
be cgroup_enabled_task_cg_lists().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
fa8137be6ba632041e725e4623258ba27a2cf9be 08-Aug-2014 Vivek Goyal <vgoyal@redhat.com> cgroup: Display legacy cgroup files on default hierarchy

Kernel command line parameter cgroup__DEVEL__legacy_files_on_dfl forces
legacy cgroup files to show up on default hierarhcy if susbsystem does
not have any files defined for default hierarchy.

But this seems to be working only if legacy files are defined in
ss->legacy_cftypes. If one adds some cftypes later using
cgroup_add_legacy_cftypes(), these files don't show up on default
hierarchy. Update the function accordingly so that the dynamically
added legacy files also show up in the default hierarchy if the target
subsystem is also using the base legacy files for the default
hierarchy.

tj: Patch description and comment updates.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
71b1fb5c4473a5b1e601d41b109bdfe001ec82e0 18-Aug-2014 Alban Crequy <alban.crequy@collabora.co.uk> cgroup: reject cgroup names with '\n'

/proc/<pid>/cgroup contains one cgroup path on each line. If cgroup names are
allowed to contain "\n", applications cannot parse /proc/<pid>/cgroup safely.

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
5de4fa13c4df302db41e80ca679df24fdad0d661 15-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test

cgrp_dfl_root_inhibit_ss_mask determines which subsystems are not
supported on the default hierarchy and is currently initialized
statically and just includes the debug subsystem. Now that there's
cgroup_subsys->dfl_files, we can easily tell which subsystems support
the default hierarchy or not.

Let's initialize cgrp_dfl_root_inhibit_ss_mask by testing whether
cgroup_subsys->dfl_files is NULL. After all, subsystems with NULL
->dfl_files aren't useable on the default hierarchy anyway.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
05ebb6e60f044a9cef2549b6204559276500f363 15-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core

cgroup now distinguishes cftypes for the default and legacy
hierarchies more explicitly by using separate arrays and
CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE should be and are used only
inside cgroup core proper. Let's make it clear that the flags are
internal by prefixing them with double underscores.

CFTYPE_INSANE is renamed to __CFTYPE_NOT_ON_DFL for consistency. The
two flags are also collected and assigned bits >= 16 so that they
aren't mixed with the published flags.

v2: Convert the extra ones in cgroup_exit_cftypes() which are added by
revision to the previous patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a8ddc8215e1a4cd9dc5d6210811cfc381a489ec2 15-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: distinguish the default and legacy hierarchies when handling cftypes

Until now, cftype arrays carried files for both the default and legacy
hierarchies and the files which needed to be used on only one of them
were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This
gets confusing very quickly and we may end up exposing interface files
to the default hierarchy without thinking it through.

This patch makes cgroup core provide separate sets of interfaces for
cftype handling so that the cftypes for the default and legacy
hierarchies are clearly distinguished. The previous two patches
renamed the existing ones so that they clearly indicate that they're
for the legacy hierarchies. This patch adds the interface for the
default hierarchy and apply them selectively depending on the
hierarchy type.

* cftypes added through cgroup_subsys->dfl_cftypes and
cgroup_add_dfl_cftypes() only show up on the default hierarchy.

* cftypes added through cgroup_subsys->legacy_cftypes and
cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.

* cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
same array for the cases where the interface files are identical on
both types of hierarchies.

* This makes all the existing subsystem interface files legacy-only by
default and all subsystems will have no interface file created when
enabled on the default hierarchy. Each subsystem should explicitly
review and compose the interface for the default hierarchy.

* A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
makes subsystems which haven't decided the interface files for the
default hierarchy to present the legacy files on the default
hierarchy so that its behavior on the default hierarchy can be
tested. As the awkward name suggests, this is for development only.

* memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
array isn't used on the default hierarchy. The flag is removed.

v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.

v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2cf669a58dc08fa065a8bd0dca866c0e6cb358cc 15-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes()

Currently, cftypes added by cgroup_add_cftypes() are used for both the
unified default hierarchy and legacy ones and subsystems can mark each
file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
appear only on one of them. This is quite hairy and error-prone.
Also, we may end up exposing interface files to the default hierarchy
without thinking it through.

cgroup_subsys will grow two separate cftype addition functions and
apply each only on the hierarchies of the matching type. This will
allow organizing cftypes in a lot clearer way and encourage subsystems
to scrutinize the interface which is being exposed in the new default
hierarchy.

In preparation, this patch adds cgroup_add_legacy_cftypes() which
currently is a simple wrapper around cgroup_add_cftypes() and replaces
all cgroup_add_cftypes() usages with it.

While at it, this patch drops a completely spurious return from
__hugetlb_cgroup_file_init().

This patch doesn't introduce any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
5577964e64692e17cc498854b7e0833e6532cd64 15-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes

Currently, cgroup_subsys->base_cftypes is used for both the unified
default hierarchy and legacy ones and subsystems can mark each file
with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
only on one of them. This is quite hairy and error-prone. Also, we
may end up exposing interface files to the default hierarchy without
thinking it through.

cgroup_subsys will grow two separate cftype arrays and apply each only
on the hierarchies of the matching type. This will allow organizing
cftypes in a lot clearer way and encourage subsystems to scrutinize
the interface which is being exposed in the new default hierarchy.

In preparation, this patch renames cgroup_subsys->base_cftypes to
cgroup_subsys->legacy_cftypes. This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
a14c6874be09a05a48093df8df87ca021f310332 15-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[]

Currently cgroup_base_files[] contains the cgroup core interface files
for both legacy and default hierarchies with each file tagged with
CFTYPE_INSANE and CFTYPE_ONLY_ON_DFL. This is difficult to read.

Let's separate it out to two separate tables, cgroup_dfl_base_files[]
and cgroup_legacy_base_files[], and use the appropriate one in
cgroup_mkdir() depending on the hierarchy type. This makes tagging
each file unnecessary.

This patch doesn't introduce any behavior changes.

v2: cgroup_dfl_base_files[] was missing the termination entry
triggering WARN in cgroup_init_cftypes() for 0day kernel testing
robot. Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jet Chen <jet.chen@intel.com>
7b9a6ba56e9519ed5413a002dc0b0f01aa598bb5 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: clean up sane_behavior handling

After the previous patch to remove sane_behavior support from
non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to
indicate the default hierarchy while parsing mount options. This
patch makes the following cleanups around it.

* Don't show it in the mount option. Eventually the default hierarchy
will be assigned a different filesystem type.

* As sane_behavior is no longer effective on non-default hierarchies
and the default hierarchy doesn't accept any mount options,
parse_cgroupfs_options() can consider sane_behavior mount option as
indicating the default hierarchy and fail if any other options are
specified with it. While at it, remove one of the double blank
lines in the function.

* cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell
whether to mount the default hierarchy or not.

* As CGROUP_ROOT_SANE_BEHAVIOR's only role now is indicating whether
to select the default hierarchy or not during mount, it doesn't need
to be set in the default hierarchy itself. cgroup_init_early()
updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
aa6ec29bee8692ce232132f1a1ea2a1f9196610e 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: remove sane_behavior support on non-default hierarchies

sane_behavior has been used as a development vehicle for the default
unified hierarchy. Now that the default hierarchy is in place, the
flag became redundant and confusing as its usage is allowed on all
hierarchies. There are gonna be either the default hierarchy or
legacy ones. Let's make that clear by removing sane_behavior support
on non-default hierarchies.

This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
cgroup_on_dfl() with sane_behavior specific part dropped.

On the default and legacy hierarchies w/o sane_behavior, this
shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
c1d5d42efdb3e0470c1cfd2fcb50bc3eae813283 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: make interface file "cgroup.sane_behavior" legacy-only

"cgroup.sane_behavior" is added to help distinguishing whether
sane_behavior is in effect or not. We now have the default hierarchy
where the flag is always in effect and are planning to remove
supporting sane behavior on the legacy hierarchies making this file on
the default hierarchy rather pointless. Let's make it legacy only and
thus always zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
7450e90bbb8d834c190cc8100d1cc41888358c7c 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: remove CGRP_ROOT_OPTION_MASK

cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
reason to mask the flags. Remove CGRP_ROOT_OPTION_MASK.

This doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
af0ba6789c8e43518635606d0af1ff475ba7471a 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: implement cgroup_subsys->depends_on

Currently, the blkio subsystem attributes all of writeback IOs to the
root. One of the issues is that there's no way to tell who originated
a writeback IO from block layer. Those IOs are usually issued
asynchronously from a task which didn't have anything to do with
actually generating the dirty pages. The memory subsystem, when
enabled, already keeps track of the ownership of each dirty page and
it's desirable for blkio to piggyback instead of adding its own
per-page tag.

blkio piggybacking on memory is an implementation detail which
preferably should be handled automatically without requiring explicit
userland action. To achieve that, this patch implements
cgroup_subsys->depends_on which contains the mask of subsystems which
should be enabled together when the subsystem is enabled.

The previous patches already implemented the support for enabled but
invisible subsystems and cgroup_subsys->depends_on can be easily
implemented by updating cgroup_refresh_child_subsys_mask() so that it
calculates cgroup->child_subsys_mask considering
cgroup_subsys->depends_on of the explicitly enabled subsystems.

Documentation/cgroups/unified-hierarchy.txt is updated to explain that
subsystems may not become immediately available after being unused
from userland and that dependency could be a factor in it. As
subsystems may already keep residual references, this doesn't
significantly change how subsystem rebinding can be used.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
b4536f0cab2b18414e26101a2b9d484c5cbea0c0 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: implement cgroup_subsys->css_reset()

cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

The previous patches added support for explicitly and implicitly
enabled subsystems and showing/hiding their interface files. An
explicitly enabled subsystem may become implicitly enabled if it's
turned off through "cgroup.subtree_control" but there are subsystems
depending on it. In such cases, the subsystem, as it's turned off
when seen from userland, shouldn't enforce any resource control.
Also, the subsystem may be explicitly turned on later again and its
interface files should be as close to the intial state as possible.

This patch adds cgroup_subsys->css_reset() which is invoked when a css
is hidden. The callback should disable resource control and reset the
state to the vanilla state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
f63070d350e3562baa6196f1043e01cd8da2509a 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: make interface files visible iff enabled on cgroup->subtree_control

cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

The preceding patch distinguished cgroup->subtree_control and
->child_subsys_mask where the former is the subsystems explicitly
configured by the userland and the latter is all enabled subsystems
currently is equal to the former but will include subsystems
implicitly enabled through dependency.

Subsystems which are enabled due to dependency shouldn't be visible to
userland. This patch updates cgroup_subtree_control_write() and
create_css() such that interface files are not created for implicitly
enabled subsytems.

* @visible paramter is added to create_css(). Interface files are
created only when true.

* If an already implicitly enabled subsystem is turned on through
"cgroup.subtree_control", the existing css should be used. css
draining is skipped.

* cgroup_subtree_control_write() computes the new target
cgroup->child_subsys_mask and create/kill or show/hide csses
accordingly.

As the two subsystem masks are still kept identical, this patch
doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
667c24917144e34880f821486bf0a6e4d05a3a14 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: introduce cgroup->subtree_control

cgroup is implementing support for subsystem dependency which would
require a way to enable a subsystem even when it's not directly
configured through "cgroup.subtree_control".

Previously, cgroup->child_subsys_mask directly reflected
"cgroup.subtree_control" and the enabled subsystems in the child
cgroups. This patch adds cgroup->subtree_control which
"cgroup.subtree_control" operates on. cgroup->child_subsys_mask is
now calculated from cgroup->subtree_control by
cgroup_refresh_child_subsys_mask(), which sets it identical to
cgroup->subtree_control for now.

This will allow using cgroup->child_subsys_mask for all the enabled
subsystems including the implicit ones and ->subtree_control for
tracking the explicitly requested ones. This patch keeps the two
masks identical and doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
c29adf24e0c443fb4433efb6a62bd91fdb739c34 09-Jul-2014 Tejun Heo <tj@kernel.org> cgroup: reorganize cgroup_subtree_control_write()

Make the following two reorganizations to
cgroup_subtree_control_write(). These are to prepare for future
changes and shouldn't cause any functional difference.

* Move availability above css offlining wait.

* Move cgrp->child_subsys_mask update above new css creation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
3a32bd72d77058d768dbb38183ad517f720dd1bc 30-Jun-2014 Li Zefan <lizefan@huawei.com> cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()

We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.

Run two instances of this script concurrently:

for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}

After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.

This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().

To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of cgroup root.

v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock assosiated with it.
- adjust/add comments.

tj: Updated comments. Renamed @sb to @pinned_sb.

Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
970317aa48c6ef66cd023c039c2650c897bad927 30-Jun-2014 Li Zefan <lizefan@huawei.com> cgroup: fix mount failure in a corner case

# cat test.sh
#! /bin/bash

mount -t cgroup -o cpu xxx /cgroup
umount /cgroup

mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup

It's because the cgroupfs_root of the first mount was under destruction
asynchronously.

Fix this by delaying and then retrying mount for this case.

v3:
- put the refcnt immediately after getting it. (Tejun)

v2:
- use percpu_ref_tryget_live() rather that introducing
percpu_ref_alive(). (Tejun)
- adjust comment.

tj: Updated the comment a bit.

Cc: <stable@vger.kernel.org> # 3.15
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
9a1049da9bd2cd83fe11d46433e603c193aa9c71 28-Jun-2014 Tejun Heo <tj@kernel.org> percpu-refcount: require percpu_ref to be exited explicitly

Currently, a percpu_ref undoes percpu_ref_init() automatically by
freeing the allocated percpu area when the percpu_ref is killed.
While seemingly convenient, this has the following niggles.

* It's impossible to re-init a released reference counter without
going through re-allocation.

* In the similar vein, it's impossible to initialize a percpu_ref
count with static percpu variables.

* We need and have an explicit destructor anyway for failure paths -
percpu_ref_cancel_init().

This patch removes the automatic percpu counter freeing in
percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
generic destructor now named percpu_ref_exit(). percpu_ref_destroy()
is considered but it gets confusing with percpu_ref_kill() while
"exit" clearly indicates that it's the counterpart of
percpu_ref_init().

All percpu_ref_cancel_init() users are updated to invoke
percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
added to the destruction path of all percpu_ref users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>
99bae5f94185c2cc65701e95c54e31e2f4345b88 12-Jun-2014 Li Zefan <lizefan@huawei.com> cgroup: fix broken css_has_online_children()

After running:

# mount -t cgroup cpu xxx /cgroup && mkdir /cgroup/sub && \
rmdir /cgroup/sub && umount /cgroup

I found the cgroup root still existed:

# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 1 1 1
...

It turned out css_has_online_children() is broken.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Sigend-off-by: Tejun Heo <tj@kernel.org>
c731ae1d0f02a300697a8b1564780ad28a6c2013 05-Jun-2014 Li Zefan <lizefan@huawei.com> cgroup: disallow disabled controllers on the default hierarchy

After booting with cgroup_disable=memory, I still saw memcg files
in the default hierarchy, and I can write to them, though it won't
take effect.

# dmesg
...
Disabling memory control group subsystem
...
# mount -t cgroup -o __DEVEL__sane_behavior xxx /cgroup
# ls /cgroup
...
memory.failcnt memory.move_charge_at_immigrate
memory.force_empty memory.numa_stat
memory.limit_in_bytes memory.oom_control
...
# cat /cgroup/memory.usage_in_bytes
0

tj: Minor comment update.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1f779fb28aa07350d72976d304591d216ca86f0e 04-Jun-2014 Li Zefan <lizefan@huawei.com> cgroup: don't destroy the default root

The default root is allocated and initialized at boot phase, so we
shouldn't destroy the default root when it's umounted, otherwise
it will lead to disaster.

Just try mount and then umount the default root, and the kernel will
crash immediately.

v2:
- No need to check for CSS_NO_REF in cgroup_get/put(). (Tejun)
- Better call cgroup_put() for the default root in kill_sb(). (Tejun)
- Add a comment.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
c9482a5bdcc09be9096f40e858c5fe39c389cd52 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com> kernfs: move the last knowledge of sysfs out from kernfs

There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
26fc9cd200ec839e0b3095e05ae018f27314e7aa 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com> kernfs: move the last knowledge of sysfs out from kernfs

There is still one residue of sysfs remaining: the sb_magic
SYSFS_MAGIC. However this should be kernfs user specific,
so this patch moves it out. Kerrnfs user should specify their
magic number while mouting.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
5533e0114425dcdb878f11b291f2727af8667a7c 15-May-2014 Tejun Heo <tj@kernel.org> cgroup: disallow debug controller on the default hierarchy

The debug controller, as its name suggests, exposes cgroup core
internals to userland to aid debugging. Unfortunately, except for the
name, there's no provision to prevent its usage in production
configurations and the controller is widely enabled and mounted
leaking internal details to userland. Like most other debug
information, the information exposed by debug isn't interesting even
for debugging itself once the related parts are working reliably.

This controller has no reason for existing. This patch implements
cgrp_dfl_root_inhibit_ss_mask which can suppress specific subsystems
on the default hierarchy and adds the debug subsystem to it so that it
can be gradually deprecated as usages move towards the unified
hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
f3d4650015301d1c880df4523f7e7ef320a38aab 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: convert cgroup_has_live_children() into css_has_online_children()

Now that cgroup liveliness and css onliness are the same state,
convert cgroup_has_live_children() into css_has_online_children() so
that it can be used for actual csses too. The function now uses
css_for_each_child() for iteration and is published.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
184faf32328c65c9d86b19577b8d8b90bdd2cd2e 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: use CSS_ONLINE instead of CGRP_DEAD

Use CSS_ONLINE on the self css to indicate whether a cgroup has been
killed instead of CGRP_DEAD. This will allow re-using css online test
for cgroup liveliness test. This doesn't introduce any functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
c2931b70a32c705b9bd5762f5044f9eac8a52bb3 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: iterate cgroup_subsys_states directly

Currently, css_next_child() is implemented as finding the next child
cgroup which has the css enabled, which used to be the only way to do
it as only cgroups participated in sibling lists and thus could be
iteratd. This works as long as what's required during iteration is
not missing online csses; however, it turns out that there are use
cases where offlined but not yet released csses need to be iterated.
This is difficult to implement through cgroup iteration the unified
hierarchy as there may be multiple dying csses for the same subsystem
associated with single cgroup.

After the recent changes, the cgroup self and regular csses behave
identically in how they're linked and unlinked from the sibling lists
including assertion of CSS_RELEASED and css_next_child() can simply
switch to iterating csses directly. This both simplifies the logic
and ensures that all visible non-released csses are included in the
iteration whether there are multiple dying csses for a subsystem or
not.

As all other iterators depend on css_next_child() for sibling
iteration, this changes behaviors of all css iterators. Add and
update explanations on the css states which are included in traversal
to all iterators.

As css iteration could always contain offlined csses, this shouldn't
break any of the current users and new usages which need iteration of
all on and offline csses can make use of the new semantics.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
de3f034182ecbf0efbcef7ab8b253c6c3049a592 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: introduce CSS_RELEASED and reduce css iteration fallback window

css iterations allow the caller to drop RCU read lock. As long as the
caller keeps the current position accessible, it can simply re-grab
RCU read lock later and continue iteration. This is achieved by using
CGRP_DEAD to detect whether the current positions next pointer is safe
to dereference and if not re-iterate from the beginning to the next
position using ->serial_nr.

CGRP_DEAD is used as the marker to invalidate the next pointer and the
only requirement is that the marker is set before the next sibling
starts its RCU grace period. Because CGRP_DEAD is set at the end of
cgroup_destroy_locked() but the cgroup is unlinked when the reference
count reaches zero, we currently have a rather large window where this
fallback re-iteration logic can be triggered.

This patch introduces CSS_RELEASED which is set when a css is unlinked
from its sibling list. This still keeps the re-iteration logic
working while drastically reducing the window of its activation.
While at it, rewrite the comment in css_next_child() to reflect the
new flag and better explain the synchronization.

This will also enable iterating csses directly instead of through
cgroups.

v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by
CSS_NO_REF.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0cb51d71c1fa9234afe4213089844be76ec1765a 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: move cgroup->serial_nr into cgroup_subsys_state

We're moving towards using cgroup_subsys_states as the fundamental
structural blocks. All csses including the cgroup->self and actual
ones now form trees through css->children and ->sibling which follow
the same rules as what cgroup->children and ->sibling followed. This
patch moves cgroup->serial_nr which is used to implement css iteration
into css.

Note that all csses, regardless of their types, allocate their serial
numbers from the same monotonically increasing counter. This doesn't
affect the ordering needed by css iteration or cause any other
material behavior changes. This will be used to update css iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1fed1b2e36ba1aa0257004a97e75bbdb70f216b5 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: link all cgroup_subsys_states in their sibling lists

Currently, while all csses have ->children and ->sibling, only the
self csses of cgroups make use of them. This patch makes all other
csses to link themselves on the sibling lists too. This will be used
to update css iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d5c419b68e368fdd9f1857bf8d4bb4480edb9b80 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: move cgroup->sibling and ->children into cgroup_subsys_state

We're moving towards using cgroup_subsys_states as the fundamental
structural blocks. Let's move cgroup->sibling and ->children into
cgroup_subsys_state. This is pure move without functional change and
only cgroup->self's fields are actually used. Other csses will make
use of the fields later.

While at it, update init_and_link_css() so that it zeroes the whole
css before initializing it and remove explicit zeroing of ->flags.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d51f39b05ce0008118c45945e681b20484990571 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup->parent

cgroup->parent is redundant as cgroup->self.parent can also be used to
determine the parent cgroup and we're moving towards using
cgroup_subsys_states as the fundamental structural blocks. This patch
introduces cgroup_parent() which follows cgroup->self.parent and
removes cgroup->parent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5c9d535b893f30266ea29fe377cb9b002fcd76aa 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: remove css_parent()

cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.

This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.

v2: New usage from device_cgroup.c converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
3b514d24e200fcdcde0a57c354a51d3677a86743 16-May-2014 Tejun Heo <tj@kernel.org> cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css

9395a4500404 ("cgroup: enable refcnting for root csses") enabled
reference counting for root csses (cgroup_subsys_states) so that
cgroup's self csses can be used to manage the lifetime of the
containing cgroups.

Unfortunately, this change was incorrect. During early init,
cgrp_dfl_root self css refcnt is used. percpu_ref can't initialized
during early init and its initialization is deferred till
cgroup_init() time. This means that cpu was using percpu_ref which
wasn't properly initialized. Due to the way percpu variables are laid
out on x86, this didn't blow up immediately on x86 but ended up
incrementing and decrementing the percpu variable at offset zero,
whatever it may be; however, on other archs, this caused fault and
early boot failure.

As cgroup self csses for root cgroups of non-dfl hierarchies need
working refcounting, we can't revert 9395a4500404. This patch adds
CSS_NO_REF which explicitly inhibits reference counting on the css and
sets it on all normal (non-self) csses and cgroup_dfl_root self css.

v2: cgrp_dfl_root.self is the offending one. Set the flag on it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Warren <swarren@nvidia.com>
Tested-by: Stephen Warren <swarren@nvidia.com>
Fixes: 9395a4500404 ("cgroup: enable refcnting for root csses")
9d755d33f0db8c9b49438f71b38a56e375b34360 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: use cgroup->self.refcnt for cgroup refcnting

Currently cgroup implements refcnting separately using atomic_t
cgroup->refcnt. The destruction paths of cgroup and css are rather
complex and bear a lot of similiarities including the use of RCU and
bouncing to a work item.

This patch makes cgroup use the refcnt of self css for refcnting
instead of using its own. This makes cgroup refcnting use css's
percpu refcnt and share the destruction mechanism.

* css_release_work_fn() and css_free_work_fn() are updated to handle
both csses and cgroups. This is a bit messy but should do until we
can make cgroup->self a full css, which currently can't be done
thanks to multiple hierarchies.

* cgroup_destroy_locked() now performs
percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp).

* Negative refcnt sanity check in cgroup_get() is no longer necessary
as percpu_ref already handles it.

* Similarly, as a cgroup which hasn't been killed will never be
released regardless of its refcnt value and percpu_ref has sanity
check on kill, cgroup_is_dead() sanity check in cgroup_put() is no
longer necessary.

* As whether a refcnt reached zero or not can only be decided after
the reference count is killed, cgroup_root->cgrp's refcnting can no
longer be used to decide whether to kill the root or not. Let's
make cgroup_kill_sb() explicitly initiate destruction if the root
doesn't have any children. This makes sense anyway as unmounted
cgroup hierarchy without any children should be destroyed.

While this is a bit messy, this will allow pushing more bookkeeping
towards cgroup->self and thus handling cgroups and csses in more
uniform way. In the very long term, it should be possible to
introduce a base subsystem and convert the self css to a proper one
making things whole lot simpler and unified.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9395a4500404e05173eda9a2d198b6fa500e90c5 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: enable refcnting for root csses

Currently, css_get(), css_tryget() and css_tryget_online() are noops
for root csses as an optimization; however, we're planning to use css
refcnts to track of cgroup lifetime too and root cgroups also need to
be reference counted. Since css has been converted to percpu_refcnt,
the overhead of refcnting is miniscule and this optimization isn't too
meaningful anymore. Furthermore, controllers which optimize the root
cgroup often never even invoke these functions in their hot paths.

This patch enables refcnting for root csses too. This makes CSS_ROOT
flag unused and removes it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
25e15d835036a70a53dcc993beaa036f8919a373 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: bounce css release through css->destroy_work

css release is planned to do more and would require process context.
Bounce it through css->destroy_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
249f3468a282dcbad53484c821bebb447f14ee03 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_destory_css_killed()

cgroup_destroy_css_killed() is cgroup destruction stage which happens
after all csses are offlined. After the recent updates, it no longer
does anything other than putting the base reference. This patch
removes the function and makes cgroup_destroy_locked() put the base
ref at the end isntead.

This also makes cgroup->nr_css unnecessary. Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
4e4e28472365f8c7a7c55f6b5706f68bc40c5b13 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: move cgroup->sibling unlinking to cgroup_put()

Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
cgroup_put(). This is later but still before the RCU grace period, so
it doesn't break css_next_child() although there now is a larger
window in which a dead cgroup is visible during css iteration. As css
iteration always could have included offline csses, this doesn't
affect correctness; however, it does make css_next_child() fall back
to reiterting mode more often. This also makes cgroup_put() directly
take cgroup_mutex, which limits where it can be called from. These
are not immediately problematic and will be dealt with later.

This change enables simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9e4173e1f24fa3bd562f13b92ee34c7dfb1db7c9 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked()

Currently, check_for_release() on the parent of a destroyed cgroup is
invoked from cgroup_destroy_css_killed(). This is because this is
where the destroyed cgroup can be removed from the parent's children
list. check_for_release() tests the emptiness of the list directly,
so invoking it before removing the cgroup from the list makes it think
that the parent still has children even when it no longer does.

This patch updates check_for_release() to use
cgroup_has_live_children() instead of directly testing ->children
emptiness and moves check_for_release(parent) earlier to the end of
cgroup_destroy_locked(). As cgroup_has_live_children() ignores
cgroups marked DEAD, check_for_release() functions correctly as long
as it's called after asserting DEAD.

This makes release notification slightly more timely and more
importantly enables further simplification of cgroup destruction path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cbc125efada6d8c2555dd35e938694eb9b7cd791 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked()

We're expecting another user.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9d800df12d31734a6853915e9d2deb5d6747985f 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: rename cgroup->dummy_css to ->self and move it to the top

cgroup->dummy_css is used as the placeholder css when performing css
oriended operations on the cgroup. We're gonna shift more cgroup
management to this css. Let's rename it to ->self and move it to the
top.

This is pure rename and field relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a015edd26e28afe225cdd04f25794bd2b3bbe2da 14-May-2014 Tejun Heo <tj@kernel.org> cgroup: use restart_syscall() for mount retries

cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
which is being destroyed. The retry currently loops inside
cgroup_mount() proper. This patch makes it return with
restart_syscall() instead so that retry travels out to userland
boundary.

This slightly simplifies the logic and more importantly makes the
retry logic behave better when the wait for some reason becomes
lengthy or infinite by allowing the operation to be suspended or
terminated from userland.

v2: The original patch forgot to free memory allocated for @opts.
Fixed. Caught by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
8353da1f91f12a3079ecc849226f371242d2807c 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_tree_mutex

cgroup_tree_mutex was introduced to work around the circular
dependency between cgroup_mutex and kernfs active protection - some
kernfs file and directory operations needed cgroup_mutex putting
cgroup_mutex under active protection but cgroup also needs to be able
to access cgroup hierarchies and cftypes to determine which
kernfs_nodes need to be removed. cgroup_tree_mutex nested above both
cgroup_mutex and kernfs active protection and used to protect the
hierarchy and cftypes. While this worked, it added a lot of double
lockings and was generally cumbersome.

kernfs provides a mechanism to opt out of active protection and cgroup
was already using it for removal and subtree_control. There's no
reason to mix both methods of avoiding circular locking dependency and
the preceding cgroup_kn_lock_live() changes applied it to all relevant
cgroup kernfs operations making it unnecessary to nest cgroup_mutex
under kernfs active protection. The previous patch reversed the
original lock ordering and put cgroup_mutex above kernfs active
protection.

After these changes, all cgroup_tree_mutex usages are now accompanied
by cgroup_mutex making the former completely redundant. This patch
removes cgroup_tree_mutex and all its usages.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
01f6474ce04fffd6282b569ac0a31f4b98d4c82a 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: nest kernfs active protection under cgroup_mutex

After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
longer nested below kernfs active protection. The two don't have any
relationship now.

This patch nests kernfs active protection under cgroup_mutex. All
cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
temporary cgroup_mutex releases over kernfs operations are removed,
and cgroup_add/rm_cftypes() grab both mutexes.

This makes cgroup_tree_mutex redundant, which will be removed by the
next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e76ecaeef65c497153ceacf59c2e21c070d43f64 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods

Make __cgroup_procs_write() and cgroup_release_agent_write() use
cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
cgroup_lock_live_group(). This puts the operations under both
cgroup_tree_mutex and cgroup_mutex protection without circular
dependency from kernfs active protection. Also, this means that
cgroup_mutex is no longer nested below kernfs active protection.
There is no longer any place where the two locks interact.

This leaves cgroup_lock_live_group() without any user. Removed.

This will help simplifying cgroup locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a9746d8da786bc79b3b4ae1baa0fbbc4b795c1b7 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock()

cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
share the logic to break active protection so that they can grab
cgroup_tree_mutex which nests above active protection and/or remove
self. Factor out this logic into cgroup_kn_lock_live() and
cgroup_kn_unlock().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cfc79d5bec04cdf26cd207d3e73d8bd59fd780a8 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: move cgroup->kn->priv clearing to cgroup_rmdir()

The ->priv field of a cgroup directory kernfs_node points back to the
cgroup. This field is RCU cleared in cgroup_destroy_locked() for
non-kernfs accesses from css_tryget_from_dir() and
cgroupstats_build().

As these are only applicable to cgroups which finished creation
successfully and fully initialized cgroups are always removed by
cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().

This will help simplifying cgroup locking and shouldn't introduce any
behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ddab2b6e0e516098efe6d3b69585bb25b1408d61 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write()

Move cgroup_lock_live_group() invocation upwards to right below
cgroup_tree_mutex in cgroup_subtree_control_write(). This is to help
the planned locking simplification.

This doesn't make any userland-visible behavioral changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b3bfd983ca94cf1393accc11e90123c83909babb 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: collapse cgroup_create() into croup_mkdir()

cgroup_mkdir() is the sole user of cgroup_create(). Let's collapse
the latter into the former. This will help simplifying locking.
While at it, remove now stale comment about inode locking.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ba0f4d761503bd9ce4c7458b56bfd7c3fdb51e86 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: reorganize cgroup_create()

Reorganize cgroup_create() so that all paths share unlock out path.

* All err_* labels are renamed to out_* as they're now shared by both
success and failure paths.

* @err renamed to @ret for the similar reason as above and so that
it's more consistent with other functions.

* cgroup memory allocation moved after locking so that freeing failed
cgroup happens before unlocking. While this moves more code inside
critical section, memory allocations inside cgroup locking are
already pretty common and this is unlikely to make any noticeable
difference.

* While at it, replace a stray @parent->root dereference with @root.

This reorganization will help simplifying locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b7fc5ad235936379fae67a9f7b50bb53487a1a3a 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup->control_kn

Now that cgroup_subtree_control_write() has access to the associated
kernfs_open_file and thus the kernfs_node, there's no need to cache it
in cgroup->control_kn on creation. Remove cgroup->control_kn and use
@of->kn directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
acbef755f40e204b8a6503fa79958d51a898762a 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write()

cgroup_tasks_write() and cgroup_procs_write() are currently using
cftype->write_u64(). This patch converts them to use cftype->write()
instead. This allows access to the associated kernfs_open_file which
will be necessary to implement the planned kernfs active protection
manipulation for these files.

This shifts buffer parsing to attach_task_by_pid() and makes it return
@nbytes on success. Let's rename it to __cgroup_procs_write() to
clearly indicate that this is a write handler implementation.

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
6770c64e5c8da4705d1f0973bdeb5c2bf4f3a404 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: replace cftype->trigger() with cftype->write()

cftype->trigger() is pointless. It's trivial to ignore the input
buffer from a regular ->write() operation. Convert all ->trigger()
users to ->write() and remove ->trigger().

This patch doesn't introduce any visible behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
451af504df0c62f695a69b83c250486e77c66378 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: replace cftype->write_string() with cftype->write()

Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.

* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.

* Should return @nbytes on success instead of 0.

* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.

cftype->write_string() has no user left after the conversions and
removed.

While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

This patch doesn't introduce any visible behavior changes.

v2: netprio was missing from conversion. Converted.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
b41686401e501430ffe93b575ef7959d2ecc6f2e 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: implement cftype->write()

During the recent conversion to kernfs, cftype's seq_file operations
are updated so that they are directly mapped to kernfs operations and
thus can fully access the associated kernfs and cgroup contexts;
however, write path hasn't seen similar updates and none of the
existing write operations has access to, for example, the associated
kernfs_open_file.

Let's introduce a new operation cftype->write() which maps directly to
the kernfs write operation and has access to all the arguments and
contexts. This will replace ->write_string() and ->trigger() and ease
manipulation of kernfs active protection from cgroup file operations.

Two accessors - of_cft() and of_css() - are introduced to enable
accessing the associated cgroup context from cftype->write() which
only takes kernfs_open_file for the context information. The
accessors for seq_file operations - seq_cft() and seq_css() - are
rewritten to wrap the of_ accessors.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ec903c0c858e4963a9e0724bdcadfa837253341c 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: rename css_tryget*() to css_tryget_online*()

Unlike the more usual refcnting, what css_tryget() provides is the
distinction between online and offline csses instead of protection
against upping a refcnt which already reached zero. cgroup is
planning to provide actual tryget which fails if the refcnt already
reached zero. Let's rename the existing trygets so that they clearly
indicate that they're onliness.

I thought about keeping the existing names as-are and introducing new
names for the planned actual tryget; however, given that each
controller participates in the synchronization of the online state, it
seems worthwhile to make it explicit that these functions are about
on/offline state.

Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
to css_tryget_online_from_dir(). This is pure rename.

v2: cgroup_freezer grew new usages of css_tryget(). Update
accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
46cfeb043b04f5878154bea36714709d46028495 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: use release_agent_path_lock in cgroup_release_agent_show()

release_path is now protected by release_agent_path_lock to allow
accessing it without grabbing cgroup_mutex; however,
cgroup_release_agent_show() was still grabbing cgroup_mutex. Let's
convert it to release_agent_path_lock so that we don't have to worry
about this one for the planned locking updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
7d331fa985d3a39d5b8cb60caf016d3e53e57c91 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write()

After waiting for a child to finish offline,
cgroup_subtree_control_write() jumps up to retry from after the input
parsing and active protection breaking. This retry makes the
scheduled locking update - removal of cgroup_tree_mutex - more
difficult. Let's simplify it by returning with restart_syscall() for
retries.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d37167ab7b3d67d53519585a44c47416e6758ed2 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: update and fix parsing of "cgroup.subtree_control"

I was confused that strsep() was equivalent to strtok_r() in skipping
over consecutive delimiters. strsep() just splits at the first
occurrence of one of the delimiters which makes the parsing very
inflexible, which makes allowing multiple whitespace chars as
delimters kinda moot. Let's just be consistently strict and require
list of tokens separated by spaces. This is what
Documentation/cgroups/unified-hierarchy.txt describes too.

Also, parsing may access beyond the end of the string if the string
ends with spaces or is zero-length. Make sure it skips zero-length
tokens. Note that this also ensures that the parser doesn't puke on
multiple consecutive spaces.

v2: Add zero-length token skipping.

v3: Added missing space after "==". Spotted by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0ab7a60dea71c285dfbb65e344d895b9c4f7bcb9 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: css_release() shouldn't clear cgroup->subsys[]

c1a71504e971 ("cgroup: don't recycle cgroup id until all csses' have
been destroyed") made cgroup ID persist until a cgroup is released and
add cgroup->subsys[] clearing to css_release() so that css_from_id()
doesn't return a css which has already been released which happens
before cgroup release; however, the right change here was updating
offline_css() to clear cgroup->subsys[] which was done by e32978031016
("cgroup: cgroup->subsys[] should be cleared after the css is
offlined") instead of clearing it from css_release().

We're now clearing cgroup->subsys[] twice. This is okay for
traditional hierarchies as a css's lifetime is the same as its
cgroup's; however, this confuses unified hierarchy and turning on and
off a controller repeatedly using "cgroup.subtree_control" can lead to
an oops like the following which happens because cgroup->subsys[] is
incorrectly cleared asynchronously by css_release().

BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08
IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
PGD 1170d067 PUD f0ab067 PMD 0
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
RIP: 0010:[<ffffffff81130c11>] [<ffffffff81130c11>] kill_css+0x21/0x1c0
RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
Stack:
ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
Call Trace:
[<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
[<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
[<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
[<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
[<ffffffff811f35ad>] SyS_write+0x4d/0xc0
[<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
RIP [<ffffffff81130c11>] kill_css+0x21/0x1c0
RSP <ffff88000e199dc8>
CR2: 0000000000000008
---[ end trace e7aae1f877c4e1b4 ]---

Remove the unnecessary cgroup->subsys[] clearing from css_release().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
54504e977ceee0bea6fbe8b632eceea771b18c6c 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: cgroup_idr_lock should be bh

cgroup_idr_remove() can be invoked from bh leading to lockdep
detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0cee8b7786467907e12d1d8f872e6dc73bc95204 13-May-2014 Tejun Heo <tj@kernel.org> cgroup: fix offlining child waiting in cgroup_subtree_control_write()

cgroup_subtree_control_write() waits for offline to complete
child-by-child before enabling a controller; however, it has a couple
bugs.

* It doesn't initialize the wait_queue_t. This can lead to infinite
hang on the following schedule() among other things.

* It forgets to pin the child before releasing cgroup_tree_mutex and
performing schedule(). The child may already be gone by the time it
wakes up and invokes finish_wait(). Pin the child being waited on.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5024ae29cd285ce9e736776414da645d3a91828c 08-May-2014 Tejun Heo <tj@kernel.org> cgroup: introduce task_css_is_root()

Determining the css of a task usually requires RCU read lock as that's
the only thing which keeps the returned css accessible till its
reference is acquired; however, testing whether a task belongs to the
root can be performed without dereferencing the returned css by
comparing the returned pointer against the root one in init_css_set[]
which never changes.

Implement task_css_is_root() which can be invoked in any context.
This will be used by the scheduled cgroup_freezer change.

v2: cgroup no longer supports modular controllers. No need to export
init_css_set. Pointed out by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
60106946ca7f63396680130b25511ccf6b7d5bff 05-May-2014 Fabian Frederick <fabf@skynet.be> kernel/cgroup.c: fix 2 kernel-doc warnings

Fix typo and variable name.

tj: Updated @cgrp argument description in cgroup_destroy_css_killed()

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Tejun Heo <tj@kernel.org>
15a4c835e4ed3e60dd68727cd1907e3dd89563f4 04-May-2014 Tejun Heo <tj@kernel.org> cgroup, memcg: implement css->id and convert css_from_id() to use it

Until now, cgroup->id has been used to identify all the associated
csses and css_from_id() takes cgroup ID and returns the matching css
by looking up the cgroup and then dereferencing the css associated
with it; however, now that the lifetimes of cgroup and css are
separate, this is incorrect and breaks on the unified hierarchy when a
controller is disabled and enabled back again before the previous
instance is released.

This patch adds css->id which is a subsystem-unique ID and converts
css_from_id() to look up by the new css->id instead. memcg is the
only user of css_from_id() and also converted to use css->id instead.

For traditional hierarchies, this shouldn't make any functional
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
ddfcadab35dda6e5bc23ccf1c3055ecb63a71e49 04-May-2014 Tejun Heo <tj@kernel.org> cgroup: update init_css() into init_and_link_css()

init_css() takes the cgroup the new css belongs to as an argument and
initializes the new css's ->cgroup and ->parent pointers but doesn't
acquire the matching reference counts. After the previous patch,
create_css() puts init_css() and reference acquisition right next to
each other. Let's move reference acquistion into init_css() and
rename the function to init_and_link_css(). This makes sense and is
easier to follow. This makes the root csses to hold a reference on
cgrp_dfl_root.cgrp, which is harmless.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a2bed8209a3afc3b2cf1c28383fb48155c1fea46 04-May-2014 Tejun Heo <tj@kernel.org> cgroup: use RCU free in create_css() failure path

Currently, when create_css() fails in the middle, the half-initialized
css is freed by invoking cgroup_subsys->css_free() directly. This
patch updates the function so that it invokes RCU free path instead.
As the RCU free path puts the parent css and owning cgroup, their
references are now acquired right after a new css is successfully
allocated.

This doesn't make any visible difference now but is to enable
implementing css->id and RCU protected lookup by such IDs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
6fa4918d03c39351aef3573ac3e7958d6a5ad9b6 04-May-2014 Tejun Heo <tj@kernel.org> cgroup: protect cgroup_root->cgroup_idr with a spinlock

Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
ends up requiring cgroup_put() to be invoked under sleepable context.
This is okay for now but is an unusual requirement and we'll soon add
css->id which will have the same problem but won't be able to simply
grab cgroup_mutex as removal will have to happen from css_release()
which can't sleep.

Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
which protects the idr operations with the lock and use them for
cgroup_root->cgroup_idr. cgroup_put() no longer needs to grab
cgroup_mutex and css_from_id() is updated to always require RCU read
lock instead of either RCU read lock or cgroup_mutex, which doesn't
affect the existing users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
7d699ddb2b181a2c76e5ea18b1bdf102c4bebe4b 04-May-2014 Tejun Heo <tj@kernel.org> cgroup, memcg: allocate cgroup ID from 1

Currently, cgroup->id is allocated from 0, which is always assigned to
the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
invalid IDs and ends up incrementing all IDs by one.

It's reasonable to reserve 0 for special purposes. This patch updates
cgroup core so that ID 0 is not used and the root cgroups get ID 1.
The ID incrementing is removed form memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Li Zefan <lizefan@huawei.com>
69dfa00ccb72a37f3810687ca110e5a8154c6eed 04-May-2014 Tejun Heo <tj@kernel.org> cgroup: make flags and subsys_masks unsigned int

There's no reason to use atomic bitops for cgroup_subsys_state->flags,
cgroup_root->flags and various subsys_masks. This patch updates those
to use bitwise and/or operations instead and converts them form
unsigned long to unsigned int.

This makes the fields occupy (marginally) smaller space and makes it
clear that they don't require atomicity.

This patch doesn't cause any behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ed3d261b53f51c9505822d757d1800c79fb68788 26-Apr-2014 Joe Perches <joe@perches.com> cgroup: Use more current logging style

Use pr_fmt and remove embedded prefixes.
Realign modified multi-line statements to open parenthesis.
Convert embedded function name to "%s: ", __func__

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a2a1f9eaf945c46b5b2bc0e439cba68888e3d540 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com> cgroup: replace pr_warning with preferred pr_warn

As suggested by scripts/checkpatch.pl, substitude all pr_warning()
with pr_warn().

No functional change.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
f8719ccf7bc0858384c7e93d8c57fe69ae8c9eac 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com> cgroup: remove orphaned cgroup_pidlist_seq_operations

6612f05b88fa309c9 ("cgroup: unify pidlist and other file handling")
has removed the only user of cgroup_pidlist_seq_operations :
cgroup_pidlist_open().

This patch removes it.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2f0edc04e702fc07d29621f9e361b9120a7594d0 26-Apr-2014 Jianyu Zhan <nasa4836@gmail.com> cgroup: clean up obsolete comment for parse_cgroupfs_options()

1d5be6b287c8efc87 ("cgroup: move module ref handling into
rebind_subsystems()") makes parse_cgroupfs_options() no longer takes
refcounts on subsystems.

And unified hierachy makes parse_cgroupfs_options not need to call
with cgroup_mutex held to protect the cgroup_subsys[].

So this patch removes BUG_ON() and the comment. As the comment
doesn't contain useful information afterwards, the whole comment is
removed.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
842b597ee0a7e1aa5a3148164ffdba00ec17f614 26-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: implement cgroup.populated for the default hierarchy

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

* It delivers events by forking and execing a userland binary
specified as the release_agent. This is a long deprecated method of
notification delivery. It's extremely heavy, slow and cumbersome to
integrate with larger infrastructure.

* There is single monitoring point at the root. There's no way to
delegate management of a subtree.

* The event isn't recursive. It triggers when a cgroup doesn't have
any tasks or child cgroups. Events for internal nodes trigger only
after all children are removed. This again makes it impossible to
delegate management of a subtree.

* Events are filtered from the kernel side. "notify_on_release" file
is used to subscribe to or suppress release event. This is
unnecessarily complicated and probably done this way because event
delivery itself was expensive.

This patch implements interface file "cgroup.populated" which can be
used to monitor whether the cgroup's subhierarchy has tasks in it or
not. Its value is 0 if there is no task in the cgroup and its
descendants; otherwise, 1, and kernfs_notify() notificaiton is
triggers when the value changes, which can be monitored through poll
and [di]notify.

This is a lot ligther and simpler and trivially allows delegating
management of subhierarchy - subhierarchy monitoring can block further
propgation simply by putting itself or another process in the root of
the subhierarchy and monitor events that it's interested in from there
without interfering with monitoring higher in the tree.

v2: Patch description updated as per Serge.

v3: "cgroup.subtree_populated" renamed to "cgroup.populated". The
subtree_ prefix was a bit confusing because
"cgroup.subtree_control" uses it to denote the tree rooted at the
cgroup sans the cgroup itself while the populated state includes
the cgroup itself.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Lennart Poettering <lennart@poettering.net>
f8f22e53a262ebee37fc98004f16b066cf5bc125 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: implement dynamic subtree controller enable/disable on the default hierarchy

cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.

This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.

root - A - B - C
\ D

root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.

The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.

When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.

* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.

* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.

* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.

* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.

* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.

* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".

This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.

v2: Non-root cgroups now also have "cgroup.controllers".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
f817de98513d060023be4fa1d061b29a6515273e 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: prepare migration path for unified hierarchy

Unified hierarchy implementation would require re-migrating tasks onto
the same cgroup on the default hierarchy to reflect updated effective
csses. Update cgroup_migrate_prepare_dst() so that it accepts NULL as
the destination cgrp. When NULL is specified, the destination is
considered to be the cgroup on the default hierarchy associated with
each css_set.

After this change, the identity check in cgroup_migrate_add_src()
isn't sufficient for noop detection as the associated csses may change
without any cgroup association changing. The only way to tell whether
a migration is noop or not is testing whether the source and
destination csets are identical. The noop check in
cgroup_migrate_add_src() is removed and cset identity test is added to
cgroup_migreate_prepare_dst(). If it's detected that source and
destination csets are identical, the cset is removed removed from
@preloaded_csets and all the migration nodes are cleared which makes
cgroup_migrate() ignore the cset.

Also, make the function append the destination css_sets to
@preloaded_list so that destination css_sets always come after source
css_sets.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
7fd8c565d8a501486d63d7ee07fd6582e97db437 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: update subsystem rebind restrictions

Because the default root couldn't have any non-root csses attached to
it, rebinding away from it was always allowed; however, the default
hierarchy will soon host the unified hierarchy and have non-root csses
so the rebind restrictions need to be updated accordingly.

Instead of special casing rebinding from the default hierarchy and
then checking whether the source hierarchy has children cgroups, which
implies non-root csses for !dfl hierarchies, simply check whether the
source hierarchy has non-root csses for the subsystem using
css_next_child().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
6803c006282768ec850760766a6e4eb1a6ff87df 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: add css_set->dfl_cgrp

To implement the unified hierarchy behavior, we'll need to be able to
determine the associated cgroup on the default hierarchy from css_set.
Let's add css_set->dfl_cgrp so that it can be accessed conveniently
and efficiently.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
bd53d617b34c781dac8e22dbc75e8f182d918ecf 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: allow cgroup creation and suppress automatic css creation in the unified hierarchy

Now that effective css handling has been added and iterators updated
accordingly, it's safe to allow cgroup creation in the default
hierarchy. Unblock cgroup creation in the default hierarchy.

As the default hierarchy will implement explicit enabling and
disabling of controllers on each cgroup, suppress automatic css
enabling on cgroup creation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e32978031016f56be977a9a856ba4d9f447db51f 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: cgroup->subsys[] should be cleared after the css is offlined

After a css finishes offlining, offline_css() mistakenly performs
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], css) which just sets the
cgroup->subsys[] pointer to the current value. The intention was to
clear it after offline is complete, not reassign the same value.

Update it to assign NULL instead of the current value. This makes
cgroup_css() to return NULL once offline is complete. All the
existing users of the function either can handle NULL return already
or guarantee that the css doesn't get offlined.

While this is a bugfix, as css lifetime is currently tied to the
cgroup it belongs to, this bug doesn't cause any actual problems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3ebb2b6ef38875b866ec0118bfae7bc52afd0166 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: teach css_task_iter about effective csses

Currently, css_task_iter iterates tasks associated with a css by
visiting each css_set associated with the owning cgroup and walking
tasks of each of them. This works fine for !unified hierarchies as
each cgroup has its own css for each associated subsystem on the
hierarchy; however, on the planned unified hierarchy, a cgroup may not
have csses associated and its tasks would be considered associated
with the matching css of the nearest ancestor which has the subsystem
enabled.

This means that on the default unified hierarchy, just walking all
tasks associated with a cgroup isn't enough to walk all tasks which
are associated with the specified css. If any of its children doesn't
have the matching css enabled, task iteration should also include all
tasks from the subtree. We already added cgroup->e_csets[] to list
all css_sets effectively associated with a given css and walk css_sets
on that list instead to achieve such iteration.

This patch updates css_task_iter iteration such that it walks css_sets
on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
requested on an non-dummy css. Thanks to the previous iteration
update, this change can be achieved with the addition of
css_task_iter->ss and minimal updates to css_advance_task_iter() and
css_task_iter_start().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0f0a2b4fa6210147131082999f1f16d7fb79abf8 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: reorganize css_task_iter

This patch reorganizes css_task_iter so that adding effective css
support is easier.

* s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency

* ->origin_css is used to determine whether the iteration reached the
last css_set. Replace it with explicit ->cset_head so that
css_advance_task_iter() doesn't have to know the termination
condition directly.

* css_task_iter_next() currently assumes that it's walking list of
cgrp_cset_link and reaches into the current cset through the current
link to determine the termination conditions for task walking. As
this won't always be true for effective css walking, add
->tasks_head and ->mg_tasks_head and use them to control task
walking so that css_task_iter_next() doesn't have to know how
css_sets are being walked.

This patch doesn't make any behavior changes. The iteration logic
stays unchanged after the patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3b281afbc3a06cd69c54e6db1a04a8e73997723f 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: make css_next_child() skip missing csses

css_next_child() walks the children of the specified css. It does
this by finding the next cgroup and then returning the requested css.
On the default unified hierarchy, a cgroup may not have a css
associated with it even if the hierarchy has the subsystem enabled.
This patch updates css_next_child() so that it skips children without
the requested css associated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2d8f243a5e6efa57fb7c46fe83fafa45b33d0ec2 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: implement cgroup->e_csets[]

On the default unified hierarchy, a cgroup may be associated with
csses of its ancestors, which means that a css of a given cgroup may
be associated with css_sets of descendant cgroups. This means that we
can't walk all tasks associated with a css by iterating the css_sets
associated with the cgroup as there are css_sets which are pointing to
the css but linked on the descendants.

This patch adds per-subsystem list heads cgroup->e_csets[]. Any
css_set which is pointing to a css is linked to
css->cgroup->e_csets[$SUBSYS_ID] through
css_set->e_cset_node[$SUBSYS_ID]. The lists are protected by
css_set_rwsem and will allow us to walk all css_sets associated with a
given css so that we can find out all associated tasks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
aec3dfcb2e43892180ee053e8c260dcdeccf4392 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: introduce effective cgroup_subsys_state

In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.

When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.

This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.

For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.

While at it, fix incorrect locking comment for for_each_css().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
f392e51cd6ae6f6ee5b9b6d611cdc282b4c1711e 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: update cgroup->subsys_mask to ->child_subsys_mask and restore cgroup_root->subsys_mask

944196278d3d ("cgroup: move ->subsys_mask from cgroupfs_root to
cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for
the unified hierarhcy; however, it turns out that carrying the
subsys_mask of the children in the parent, instead of itself, is a lot
more natural. This patch restores cgroup_root->subsys_mask and morphs
cgroup->subsys_mask into cgroup->child_subsys_mask.

* Uses of root->cgrp.subsys_mask are restored to root->subsys_mask.

* Remove automatic setting and clearing of cgrp->subsys_mask and
instead just inherit ->child_subsys_mask from the parent during
cgroup creation. Note that this doesn't affect any current
behaviors.

* Undo __kill_css() separation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ea8fd3b47ff4ed4b1b5942bf3e0cb8d8f590ec59 23-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: cgroup_apply_cftypes() shouldn't skip the default hierarhcy

cgroup_apply_cftypes() skip creating or removing files if the
subsystem is attached to the default hierarchy, which led to missing
files in the root of the default hierarchy.

Skipping made sense when the default hierarchy was dummy; however, now
that the default hierarchy is full functional and planned to be used
as the unified hierarchy, it shouldn't be skipped over.

Reported-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e37a06f10994c2ba86f54d8f96734f2483a869b8 17-Apr-2014 Li Zefan <lizefan@huawei.com> cgroup: fix the retry path of cgroup_mount()

If we hit the retry path, we'll call parse_cgroupfs_options() again,
but the string we pass to it has been modified by the previous call
to this function.

This bug can be observed by:

# mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \
mount -t cgroup -o name=foo,cpuset xxx /mnt
mount: wrong fs type, bad option, bad superblock on xxx,
missing codepage or helper program, or other error
...

The second mount passed "name=foo,cpuset" to the parser, and then it
hit the retry path and call the parser again, but this time the string
passed to the parser is "name=foo".

To fix this, we avoid calling parse_cgroupfs_options() again in this
case.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
49957f8e2a43035a97d05bddefa394492a969c0d 07-Apr-2014 Tejun Heo <tj@kernel.org> cgroup: newly created dirs and files should be owned by the creator

While converting cgroup to kernfs, 2bd59d48ebfb ("cgroup: convert to
kernfs") accidentally dropped the logic which makes newly created
cgroup dirs and files owned by the current uid / gid. This broke
cases where cgroup subtree management is delegated to !root as the sub
manager wouldn't be able to create more than single level of hierarchy
or put tasks into child cgroups it created.

Among other things, this breaks user session management in systemd and
one of the symptoms was 90s hang during shutdown. User session
systemd running as the user creates a sub-service to initiate shutdown
and tries to put kill(1) into it but fails because cgroup.procs is
owned by root. This leads to 90s hang during shutdown.

Implement cgroup_kn_set_ugid() which sets a kn's uid and gid to those
of the caller and use it from file and dir creation paths.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
c6b3d5bcd67c75961a1e8b9564d1475c0f194a84 04-Apr-2014 Li Zefan <lizefan@huawei.com> cgroup: fix top cgroup refcnt leak

As mount() and kill_sb() is not a one-to-one match, If we mount the same
cgroupfs in serveral mount points, and then umount all of them, kill_sb()
will be called only once.

Try:
# mount -t cgroup -o cpuacct xxx /cgroup
# mount -t cgroup -o cpuacct xxx /cgroup2
# cat /proc/cgroups | grep cpuacct
cpuacct 2 1 1
# umount /cgroup
# umount /cgroup2
# cat /proc/cgroups | grep cpuacct
cpuacct 2 1 1

You'll see cgroupfs will never be freed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1ec41830e087cda1f62dda4182c2b62811eb0ffc 28-Mar-2014 Li Zefan <lizefan@huawei.com> cgroup: remove useless argument from cgroup_exit()

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
e8604cb43690b781f9a7ad4a770f3e10259fe939 28-Mar-2014 Li Zefan <lizefan@huawei.com> cgroup: fix spurious lockdep warning in cgroup_exit()

cgroup_exit() is called in fork and exit path. If it's called in the
failure path during fork, PF_EXITING isn't set, and then lockdep will
complain.

Fix this by removing cgroup_exit() in that failure path. cgroup_fork()
does nothing that needs cleanup.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
01a971406177c2ca9834be6914a67e20f463a3e6 23-Mar-2014 Monam Agarwal <monamagarwal123@gmail.com> cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c

This patch replaces rcu_assign_pointer(x, NULL) with
RCU_INIT_POINTER(x, NULL)

The rcu_assign_pointer() ensures that the initialization of a
structure is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to
initialize. So, rcu_assign_pointer(p, NULL) can be safely converted
to RCU_INIT_POINTER(p, NULL)

Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
e1b2dc176f2d5be7952c47a4e4e8f3b06a90db1c 20-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: break kernfs active_ref protection in cgroup directory operations

cgroup_tree_mutex should nest above the kernfs active_ref protection;
however, cgroup_create() and cgroup_rename() were grabbing
cgroup_tree_mutex while under kernfs active_ref protection. This has
actualy possibility to lead to deadlocks in case these operations race
against cgroup_rmdir() which invokes kernfs_remove() on directory
kernfs_node while holding cgroup_tree_mutex.

Neither cgroup_create() or cgroup_rename() requires active_ref
protection. The former already has enough synchronization through
cgroup_lock_live_group() and the latter doesn't care, so this can be
fixed by updating both functions to break all active_ref protections
before grabbing cgroup_tree_mutex.

While this patch fixes the immediate issue, it probably needs further
work in the long term - kernfs directories should enable lockdep
annotations and maybe the better way to handle this is marking
directory nodes as not needing active_ref protection rather than
breaking it in each operation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1b9aba49eab5e85b0d3de8ba630cda6d68546297 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: fix cgroup_taskset walking order

cgroup_taskset is used to track and iterate target tasks while
migrating a task or process and should guarantee that the first task
iterated is the task group leader if a process is being migrated.

b3dc094e9390 ("cgroup: use css_set->mg_tasks to track target tasks
during migration") replaced flex array cgroup_taskset->tc_array with
css_set->mg_tasks list to remove process size limit and dynamic
allocation during migration; unfortunately, it incorrectly used list
operations which don't preserve order breaking the guarantee that
cgroup_taskset_first() returns the leader for a process target.

Fix it by using order preserving list operations. Note that as
multiple src_csets may map to a single dst_cset, the iteration order
may change across cgroup_task_migrate(); however, the leader is still
guaranteed to be the first entry.

The switch to list_splice_tail_init() at the end of cgroup_migrate()
isn't strictly necessary. Let's still do it for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
8cbbf2c972c4444cad36f61cd571714c39b8cf04 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: implement CFTYPE_ONLY_ON_DFL

This cftype flag makes the file only appear on the default hierarchy.
This will later be used for cgroup.controllers file.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a2dd424750807f83632a2a70293961086fd8f870 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: make cgrp_dfl_root mountable

cgrp_dfl_root will be used as the default unified hierarchy. This
patch makes cgrp_dfl_root mountable by making the following changes.

* cgroup_init_early() now initializes cgrp_dfl_root w/
CGRP_ROOT_SANE_BEHAVIOR. The default hierarchy is always sane.

* parse_cgroupfs_options() and cgroup_mount() are updated such that
cgrp_dfl_root is mounted if sane_behavior is specified w/o any
subsystems.

* rebind_subsystems() now populates the root directory of
cgrp_dfl_root. Note that the function still guarantees success of
rebinding subsystems to cgrp_dfl_root. If populating fails while
rebinding to cgrp_dfl_root, it whines but ignores the error.

* For backward compatibility, the default hierarchy shows up in
/proc/$PID/cgroup only after it's explicitly mounted so that
userland which doesn't make use of it doesn't see any change.

* "current_css_set_cg_links" file of debug cgroup now treats the
default hierarchy the same as other hierarchies. This is visible to
userland. Given that it's for debug controller, this should be
fine.

* While at it, implement cgroup_on_dfl() which tests whether a give
cgroup is on the default hierarchy or not.

The above changes make cgrp_dfl_root mostly equivalent to other
controllers but the actual unified hierarchy behaviors are not
implemented yet. Let's plug child cgroup creation in cgrp_dfl_root
from create_cgroup() for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
4d3bb511b5f9980fc3e9ae5939ebc475b231d3fc 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: drop const from @buffer of cftype->write_string()

cftype->write_string() just passes on the writeable buffer from kernfs
and there's no reason to add const restriction on the buffer. The
only thing const achieves is unnecessarily complicating parsing of the
buffer. Drop const from @buffer.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
3dd06ffa9df99aa88f4e01eaa0c9d3cb362430b0 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: rename cgroup_dummy_root and related names

The dummy root will be repurposed to serve as the default unified
hierarchy. Let's rename things in preparation.

* s/cgroup_dummy_root/cgrp_dfl_root/
* s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore
* s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity

This is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
944196278d3dea0cece1636de417b56897d9a108 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: move ->subsys_mask from cgroupfs_root to cgroup

cgroupfs_root->subsys_mask represents the controllers attached to the
hierarchy. This patch moves the field to cgroup. Subsystem
initialization and rebinding updates the top cgroup's subsys_mask.
For !root cgroups, the subsys_mask bits are set from create_css() and
cleared from kill_css(), which effectively means that all cgroups will
have the same subsys_mask as the top cgroup.

While this doesn't make any difference now, this will help
implementation of the default unified hierarchy where !root cgroups
may have subsets of the top_cgroup's subsys_mask.

While at it, __kill_css() is split out of kill_css(). The former
doesn't care about the subsys_mask while the latter becomes noop if
the controller is already killed and clears the matching bit if not
before proceeding to killing the css. This will be used later by the
default unified hierarchy implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5df3603229e520442502ff7fc5715c77bbf61912 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding

Currently, while rebinding, cgroup_dummy_root serves as the anchor
point. In addition to the target root, rebind_subsystems() takes
@added_mask and @removed_mask. The subsystems specified in the former
are expected to be on the dummy root and then moved to the target
root. The ones in the latter are moved from non-dummy root to dummy.
Now that the dummy root is a fully functional one and we're planning
to use it for the default unified hierarchy, this level of distinction
between dummy and non-dummy roots is quite awkward.

This patch updates rebind_subsystems() to take the target root and one
subsystem mask and move the specified subsystmes to the target root
which may or may not be the dummy root. IOW, unbinding now becomes
moving the subsystems to the dummy root and binding to non-dummy root.
This makes the dummy root mostly equivalent to other hierarchies in
terms of the mechanism of moving subsystems around; however, we still
retain all the semantical restrictions so that this patch doesn't
introduce any visible behavior differences. Another noteworthy detail
is that rebind_subsystems() guarantees that moving a subsystem to the
dummy root never fails so that valid unmounting attempts always
succeed.

This unifies binding and unbinding of subsystems. The invocation
points of ->bind() were inconsistent between the two and now moved
after whole rebinding is complete. This doesn't break the current
users and generally makes more sense.

All rebind_subsystems() users are converted accordingly. Note that
cgroup_remount() now makes two calls to rebind_subsystems() to bind
and then unbind the requested subsystems.

This will allow repurposing of the dummy hierarchy as the default
unified hierarchy and shouldn't make any userland visible behavior
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
985ed670144c25058f235276f69d687de1b7c7ba 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root

cgroup_dummy_root is used to host controllers which aren't attached to
any other hierarchy. The root is minimally set up during kernfs
bootstrap and didn't go through full hierarchy initialization. We're
planning to use cgroup_dummy_root for the default unified hierarchy
and thus want it to be fully functional.

Replace the special initialization, which was collected into
cgroup_init() by the previous patch, with an invocation of
cgroup_setup_root(). This simplifies the init path and makes
cgroup_dummy_root a full hierarchy with its own kernfs_root and all.

As this puts the dummy hierarchy on the cgroup_roots list, rename
for_each_active_root() to for_each_root() and update its users to skip
the dummy root for now.

This patch doesn't cause any userland visible behavior changes at this
point.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
172a2c0685ff3bc0b7a611b308aac0694de34594 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: reorganize cgroup bootstrapping

* Fields of init_css_set and css_set_count are now set using
initializer instead of programmatically from cgroup_init_early().

* init_cgroup_root() now also takes @opts and performs the optional
part of initialization too. The leftover part of
cgroup_root_from_opts() is collapsed into its only caller -
cgroup_mount().

* Initialization of cgroup_root_count and linking of init_css_set are
moved from cgroup_init_early() to to cgroup_init(). None of the
early_init users depends on init_css_set being linked.

* Subsystem initializations are moved after dummy hierarchy init and
init_css_set linking.

These changes reorganize the bootstrap logic so that the dummy
hierarchy can share the usual hierarchy init path and be made more
normal. These changes don't make noticeable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5d77381fd8aa631a8fda718c395da1319afb5d2d 19-Mar-2014 Tejun Heo <tj@kernel.org> cgroup: relocate setting of CGRP_DEAD

In cgroup_destroy_locked(), move setting of CGRP_DEAD above
invocations of kill_css(). This doesn't make any visible behavior
difference now but will be used to inhibit manipulating controller
enable states of a dying cgroup on the unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3eb59ec64fc7a3f4576da23f811b39331b830ba2 18-Mar-2014 Li Zefan <lizefan@huawei.com> cgroup: fix a failure path in create_css()

If online_css() fails, we should remove cgroup files belonging
to css->ss.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
952aaa125428fae883670a2c2e40ea8044ca1eaa 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: update cgroup_transfer_tasks() to either succeed or fail

cgroup_transfer_tasks() can currently fail in the middle due to memory
allocation failure. When that happens, the function just aborts and
returns error code and there's no way to tell how many actually got
migrated at the point of failure and or to revert the partial
migration.

Update it to use cgroup_migrate{_add_src|prepare_dst|migrate|finish}()
so that the function either succeeds or fails as a whole as long as
->can_attach() doesn't fail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0e1d768f1b1873272ec4e8dc1482bb5281855017 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: drop task_lock() protection around task->cgroups

For optimization, task_lock() is additionally used to protect
task->cgroups. The optimization is pretty dubious as either
css_set_rwsem is grabbed anyway or PF_EXITING already protects
task->cgroups. It adds only overhead and confusion at this point.
Let's drop task_[un]lock() and update comments accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
eaf797abc53b0ab3f0a02d4ef873a565fcce6daa 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: update how a newly forked task gets associated with css_set

When a new process is forked, cgroup_fork() associates it with the
css_set of its parent but doesn't link it into it. After the new
process is linked to tasklist, cgroup_post_fork() does the linking.

This is problematic for cgroup_transfer_tasks() as there's no way to
tell whether there are tasks which are pointing to a css_set but not
linked yet. It is impossible to implement an operation which transfer
all tasks of a cgroup to another and the current
cgroup_transfer_tasks() can easily be tricked into leaving a newly
forked process behind if it gets called between cgroup_fork() and
cgroup_post_fork().

Let's make association with a css_set and linking atomic by moving it
to cgroup_post_fork(). cgroup_fork() sets child->cgroups to
init_css_set as a placeholder and cgroup_post_fork() is updated to
perform both the association with the parent's cgroup and linking
there. This means that a newly created task will point to
init_css_set without holding a ref to it much like what it does on the
exit path. Empty cg_list is used to indicate that the task isn't
holding a ref to the associated css_set.

This fixes an actual bug with cgroup_transfer_tasks(); however, I'm
not marking it for -stable. The whole thing is broken in multiple
other ways which require invasive updates to fix and I don't think
it's worthwhile to bother with backporting this particular one.
Fortunately, the only user is cpuset and these bugs don't crash the
machine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1958d2d53dadbb1c9aaf0b37741f13a60098b243 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: split process / task migration into four steps

Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.

This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.

This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.

The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.

Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ceb6a081f6f52d17ec9e46e271cc26a1eb8a7573 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: separate out cset_group_from_root() from task_cgroup_from_root()

This will be used by the planned migration path update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b3dc094e93905ae9c1bc0815402ad8e5b203d068 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: use css_set->mg_tasks to track target tasks during migration

Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.

* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.

* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.

Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.

To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.

Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.

To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.

This resolves the above listed issues with moderate additional
complexity.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
c75611282cf1bf717c1866e7a7eb4d0743815187 25-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: add css_set->mg_tasks

Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.

* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.

* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.

Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.

To resolve the situation, we're going to use task->cg_list during
migration too. Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set. Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.

In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks. Note that ->mg_tasks
isn't used yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
532de3fc72adc2a6525c4d53c07bf81e1732083d 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: update cgroup_enable_task_cg_lists() to grab siglock

Currently, there's nothing preventing cgroup_enable_task_cg_lists()
from missing set PF_EXITING and race against cgroup_exit(). Depending
on the timing, cgroup_exit() may finish with the task still linked on
css_set leading to list corruption. Fix it by grabbing siglock in
cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be
visible.

This whole on-demand cg_list optimization is extremely fragile and has
ample possibility to lead to bugs which can cause things like
once-a-year oops during boot. I'm wondering whether the better
approach would be just adding "cgroup_disable=all" handling which
disables the whole cgroup rather than tempting fate with this
on-demand craziness.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
dc5736ed7aaf942caaac0c15af74a018e04ec79d 17-Feb-2014 Li Zefan <lizefan@huawei.com> cgroup: add a validation check to cgroup_add_cftyps()

Fengguang reported this bug:

BUG: unable to handle kernel NULL pointer dereference at 0000003c
IP: [<cc90b4ad>] cgroup_cfts_commit+0x27/0x1c1
...
Call Trace:
[<cc9d1129>] ? kmem_cache_alloc_trace+0x33f/0x3b7
[<cc90c6fc>] cgroup_add_cftypes+0x8f/0xca
[<cd78b646>] cgroup_init+0x6a/0x26a
[<cd764d7d>] start_kernel+0x4d7/0x57a
[<cd7642ef>] i386_start_kernel+0x92/0x96

This happens in a corner case. If CGROUP_SCHED=y but CFS_BANDWIDTH=n &&
FAIR_GROUP_SCHED=n && RT_GROUP_SCHED=n, we have:

cpu_files[] = {
{ } /* terminate */
}

When we pass cpu_files to cgroup_apply_cftypes(), as cpu_files[0].ss
is NULL, we'll access NULL pointer.

The bug was introduced by commit de00ffa56ea3132c6013fc8f07133b8a1014cf53
("cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()").

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6534fd6c15858fe4ce4ae568106225e68d5afa81 14-Feb-2014 Li Zefan <lizefan@huawei.com> cgroup: fix memory leak in cgroup_mount()

We should free the memory allocated in parse_cgroupfs_options() before
calling this function again.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
bad34660344f37db8b55ce8bc139bddc7d83af1b 14-Feb-2014 Li Zefan <lizefan@huawei.com> cgroup: fix locking in cgroupstats_build()

css_set_lock has been converted to css_set_rwsem, and rwsem can't nest
inside rcu_read_lock.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
430af8ad9dad82d775d688155e1db1da385d3e7a 13-Feb-2014 Fengguang Wu <fengguang.wu@intel.com> cgroup: fix coccinelle warnings

kernel/cgroup.c:2256:1-3: WARNING: PTR_RET can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: coccinelle/api/ptr_ret.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
8541fecc04a91842f023cbfe2c376d4de3b5047e 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: unexport functions

With module support gone, a lot of functions no longer need to be
exported. Unexport them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9db8de3722d184b8a431afd6bef803d6867ac889 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: cosmetic updates to cgroup_attach_task()

cgroup_attach_task() is planned to go through restructuring. Let's
tidy it up a bit in preparation.

* Update cgroup_attach_task() to receive the target task argument in
@leader instead of @tsk.

* Rename @tsk to @task.

* Rename @retval to @ret.

This is purely cosmetic.

v2: get_nr_threads() was using uninitialized @task instead of @leader.
Fixed. Reported by Dan Carpenter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
bc668c7519ff8b4681af80e92f463bec7bf7cf9e 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size()

The two functions don't have any users left. Remove them along with
cgroup_taskset->cur_cgrp.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cb0f1fe9ba47c202a98a9d41ad5c12c0ac7732e9 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: move css_set_rwsem locking outside of cgroup_task_migrate()

Instead of repeatedly locking and unlocking css_set_rwsem inside
cgroup_task_migrate(), update cgroup_attach_task() to grab it outside
of the loop and update cgroup_task_migrate() to use
put_css_set_locked().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
89c5509b0d71d1609761bf72d33333ab206dac9f 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit()

put_css_set() is performed in two steps - it first tries to put
without grabbing css_set_rwsem if such put wouldn't make the count
zero. If that fails, it puts after write-locking css_set_rwsem. This
patch separates out the second phase into put_css_set_locked() which
should be called with css_set_rwsem locked.

Also, put_css_set_taskexit() is droped and put_css_set() is made to
take @taskexit. There are only a handful users of these functions.
No point in providing different variants.

put_css_locked() will be used by later changes. This patch doesn't
introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
889ed9ceaa97bb02bf5d7349e24639f7fc5f4fa0 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: remove css_scan_tasks()

css_scan_tasks() doesn't have any user left. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
96d365e0b86ee7ec6366c99669687e54c9f145e3 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem

Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks(). The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end. It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.

We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users(). As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.

While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.

Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e406d1cfff6ab189c8676072d211809c94fecaf0 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks()

Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the
first task in the cgroup and then tranfers it. This achieves the same
result without using css_scan_tasks() which is scheduled to be
removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
07bc356ed2950048d33d667e933e1b913c6e6b6d 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count()

cgroup_task_count() read-locks css_set_lock and walks all tasks to
count them and then returns the result. The only thing all the users
want is determining whether the cgroup is empty or not. This patch
implements cgroup_has_tasks() which tests whether cgroup->cset_links
is empty, replaces all cgroup_task_count() usages and unexports it.

Note that the test isn't synchronized. This is the same as before.
The test has always been racy.

This will help planned css_set locking update.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
afeb0f9fd425239aa477c842480f240bfb6325b3 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: relocate cgroup_enable_task_cg_lists()

Move it above so that prototype isn't necessary. Let's also move the
definition of use_task_css_set_links next to it.

This is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
56fde9e01de45bcfabbb444d33e8bdd8388d2da0 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: enable task_cg_lists on the first cgroup mount

Tasks are not linked on their css_sets until cgroup task iteration is
actually used. This is to avoid incurring overhead on the fork and
exit paths for systems which have cgroup compiled in but don't use it.

This lazy binding also affects the task migration path. It has to be
careful so that it doesn't link tasks to css_sets when task_cg_lists
linking is not enabled yet. Unfortunately, this conditional linking
in the migration path interferes with planned migration updates.

This patch moves the lazy binding a bit earlier, to the first cgroup
mount. It's a clear indication that cgroup is being used on the
system and task_cg_lists linking is highly likely to be enabled soon
anyway through "tasks" and "cgroup.procs" files.

This allows cgroup_task_migrate() to always link @tsk->cg_list. Note
that it may still race with cgroup_post_fork() but who wins that race
is inconsequential.

While at it, make use_task_css_set_links a bool, add sanity checks in
cgroup_enable_task_cg_lists() and css_task_iter_start(), and update
the former so that it's guaranteed and assumes to run only once.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
35585573055f37837eb752ee22eb5523682ca742 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: drop CGRP_ROOT_SUBSYS_BOUND

Before kernfs conversion, due to the way super_block lookup works,
cgroup roots were created and made visible before being fully
initialized. This in turn required a special flag to mark that the
root hasn't been fully initialized so that the destruction path can
tell fully bound ones from half initialized.

That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the
kernfs conversion as the lookup and creation of new root are atomic
w.r.t. cgroup_mutex. This patch removes the flag and passes the
requests subsystem mask to cgroup_setup_root() so that it can set the
respective mask bits as subsystems are bound.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d3ba07c3aa9ae3e03329b0a7f1a067c0647aa2af 13-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: disallow xattr, release_agent and name if sane_behavior

Disallow more mount options if sane_behavior. Note that xattr used to
generate warning.

While at it, simplify option check in cgroup_mount() and update
sane_behavior comment in cgroup.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1a11533fbd71792e8c5d36f6763fbce8df0d231d 13-Feb-2014 Tejun Heo <tj@kernel.org> Revert "cgroup: use an ordered workqueue for cgroup destruction"

This reverts commit ab3f5faa6255a0eb4f832675507d9e295ca7e9ba.
Explanation from Hugh:

It's because more thorough testing, by others here, found that it
wasn't always solving the problem: so I asked Tejun privately to
hold off from sending it in, until we'd worked out why not.

Most of our testing being on a v3,11-based kernel, it was perfectly
possible that the problem was merely our own e.g. missing Tejun's
8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups").

But that turned out not to be enough to fix it either. Then Filipe
pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
before we ever get to put the offline on to the workqueue: by the
time we get to the workqueue, the ordering has already been lost.

So, thanks for the Acks, but I'm afraid that this ordered workqueue
solution is just not good enough: we should simply forget that patch
and provide a different answer."

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
776f02fa4e1ad70557c0318c70ce928e0642bee0 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroupfs_root->refcnt

Currently, cgroupfs_root and its ->top_cgroup are separated reference
counted and the latter's is ignored. There's no reason to do this
separately. This patch removes cgroupfs_root->refcnt and destroys
cgroupfs_root when the top_cgroup is released.

* cgroup_put() updated to ignore cgroup_is_dead() test for top
cgroups. cgroup_free_fn() updated to handle root destruction when
releasing a top cgroup.

* As root destruction is now bounced through cgroup destruction, it is
asynchronous. Update cgroup_mount() so that it waits for pending
release which is currently implemented using msleep(). Converting
this to proper wait_queue isn't hard but likely unnecessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3c9c825b8b50de7dbb015e6bfc04bb2da79364d9 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t

root->number_of_cgroups is currently an integer protected with
cgroup_mutex. Except for sanity checks and proc reporting, the only
place it's used is to check whether the root has any child during
remount; however, this is a bit flawed as the counter is not
decremented when the cgroup is unlinked but when it's released,
meaning that there could be an extended period where all cgroups are
removed but remount is still not allowed because some internal objects
are lingering. While not perfect either, it'd be better to use
emptiness test on root->top_cgroup.children.

This patch updates cgroup_remount() to test top_cgroup's children
instead, which makes number_of_cgroups only actual usage statistics
printing in proc implemented in proc_cgroupstats_show(). Let's
shorten its name and make it an atomic_t so that we don't have to
worry about its synchronization. It's purely auxiliary at this point.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e61734c55c24cdf11b07e52a74aec4dc4a7f4bd0 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup->name

cgroup->name handling became quite complicated over time involving
dedicated struct cgroup_name for RCU protection. Now that cgroup is
on kernfs, we can drop all of it and simply use kernfs_name/path() and
friends. Replace cgroup->name and all related code with kernfs
name/path constructs.

* Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
of kernfs counterparts, which involves semantic changes.
pr_cont_cgroup_name() and pr_cont_cgroup_path() added.

* cgroup->name handling dropped from cgroup_rename().

* All users of cgroup_name/path() updated to the new semantics. Users
which were formatting the string just to printk them are converted
to use pr_cont_cgroup_name/path() instead, which simplifies things
quite a bit. As cgroup_name() no longer requires RCU read lock
around it, RCU lockings which were protecting only cgroup_name() are
removed.

v2: Comment above oom_info_lock updated as suggested by Michal.

v3: dummy_top doesn't have a kn associated and
pr_cont_cgroup_name/path() ended up calling the matching kernfs
functions with NULL kn leading to oops. Test for NULL kn and
print "/" if so. This issue was reported by Fengguang Wu.

v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
6f30558f37bfbd428e3854c2c34b5c32117c8f7e 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: make cgroup hold onto its kernfs_node

cgroup currently releases its kernfs_node when it gets removed. While
not buggy, this makes cgroup->kn access rules complicated than
necessary and leads to things like get/put protection around
kernfs_remove() in cgroup_destroy_locked(). In addition, we want to
use kernfs_name/path() and friends but also want to be able to
determine a cgroup's name between removal and release.

This patch makes cgroup hold onto its kernfs_node until freed so that
cgroup->kn is always accessible.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
21a2d3430ba8c188af405a5c2eb9c06bdcb6add6 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: simplify dynamic cftype addition and removal

Dynamic cftype addition and removal using cgroup_add/rm_cftypes()
respectively has been quite hairy due to vfs i_mutex. As i_mutex
nests outside cgroup_mutex, cgroup_mutex has to be released and
regrabbed on each iteration through the hierarchy complicating the
process. Now that i_mutex is no longer in play, it can be simplified.

* Just holding cgroup_tree_mutex is enough. No need to meddle with
cgroup_mutex.

* No reason to play the unlock - relock - check serial_nr dancing.
Everything can be atomically while holding cgroup_tree_mutex.

* cgroup_cfts_prepare() is replaced with direct locking of
cgroup_tree_mutex.

* cgroup_cfts_commit() no longer fiddles with locking. It just
applies the cftypes change to the existing cgroups in the hierarchy.
Renamed to cgroup_cfts_apply().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0adb070426dde2fd0b84e7f4f5cefcd8f0b24410 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: remove cftype_set

cftype_set was added primarily to allow registering the same cftype
array more than once for different subsystems. Nobody uses or needs
such thing and it's already broken because each cftype has ->ss
pointer which is initialized during registration.

Let's add list_head ->node to cftype and use the first cftype entry in
the array to link them instead of allocating separate cftype_set.
While at it, trigger WARN if cft seems previously initialized during
registration.

This simplifies cftype handling a bit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
80b13586997d8e584caa772bd99e2a3e55ac6abe 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: relocate cgroup_rm_cftypes()

cftype handling is about to be revamped. Relocate cgroup_rm_cftypes()
above cgroup_add_cftypes() in preparation. This is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
86bf4b68759141459864ebd36ac3038a9cda895b 12-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: warn if "xattr" is specified with "sane_behavior"

Mount option "xattr" is no longer necessary as it's enabled by default
on kernfs. Warn if "xattr" is specified with "sane_behavior" so that
the option can be removed in the future.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2bd59d48ebfb3df41ee56938946ca0dd30887312 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: convert to kernfs

cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.

Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.

Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.

* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.

* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.

* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.

While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.

Overall
-------

* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.

* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.

* All vfs plumbing around dentry, inode and bdi removed.

* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.

Synchronization and object lifetime
-----------------------------------

* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.

* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.

* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.

* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.

File and directory handling
---------------------------

* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.

* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.

* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.

* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.

* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.

* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.

* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.

v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.

As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.

- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.

- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?

v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.

v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
f2e85d574e881ff3c597518c1ab48c86f9109880 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: relocate functions in preparation of kernfs conversion

Relocate cgroup_init/exit_root_id(), cgroup_free_root(),
cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs
conversion.

These are pure relocations to make kernfs conversion easier to follow.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
59f5296b51b86718dd6eecf0a268b2f1a1ec0a2d 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: misc preps for kernfs conversion

* Un-inline seq_css(). After kernfs conversion, the function will
need to dereference internal data structures.

* Add cgroup_get/put_root() and replace direct super_block->s_active
manipulatinos with them. These will be converted to kernfs_root
refcnting.

* Add cgroup_get/put() and replace dget/put() on cgrp->dentry with
them. These will be converted to kernfs refcnting.

* Update current_css_set_cg_links_read() to use cgroup_name() instead
of reaching into the dentry name. The end result is the same.

These changes don't make functional differences but will make
transition to kernfs easier.

v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b1664924062393bb048203bd4622e0b1c9e1d328 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: introduce cgroup_ino()

mm/memory-failure.c::hwpoison_filter_task() has been reaching into
cgroup to extract the associated ino to be used as a filtering
criterion. This is an implementation detail which shouldn't be
depended upon from outside cgroup proper and is about to change with
the scheduled kernfs conversion.

This patch introduces a proper interface to determine the associated
ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
instead of reaching directly into cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
2da440a26ce4743bd3e71ba964ba3f983d09bba5 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: introduce cgroup_init/exit_cftypes()

Factor out cft->ss initialization into cgroup_init_cftypes() from
cgroup_add_cftypes() and add cft->ss clearing to cgroup_rm_cftypes()
through cgroup_exit_cftypes().

This doesn't make any meaningful difference now but the two new
functions will be expanded during kernfs transition.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5f46990787e2721b4db190ddc8af6fdbe8f010d7 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: update the meaning of cftype->max_write_len

cftype->max_write_len is used to extend the maximum size of writes.
It's interpreted in such a way that the actual maximum size is one
less than the specified value. The default size is defined by
CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its
value is decremented by 1 and then compared for equality with max
size, which means that the actual default size is
CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.

There's no point in having a limit that low. Update its definition so
that it means the actual string length sans termination and anything
below PAGE_SIZE-1 is treated as PAGE_SIZE-1.

.max_write_len for "release_agent" is updated to PATH_MAX-1 and
cgroup_release_agent_write() is updated so that the redundant strlen()
check is removed and it uses strlcpy() instead of strcpy().
.max_write_len initializations in blk-throttle.c and cfq-iosched.c are
no longer necessary and removed. The one in cpuset is kept unchanged
as it's an approximated value to begin with.

This will also make transition to kernfs smoother.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
de00ffa56ea3132c6013fc8f07133b8a1014cf53 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()

Currently, cgroup_subsys->base_cftypes registration is different from
dynamic cftypes registartion. Instead of going through
cgroup_add_cftypes(), cgroup_init_subsys() invokes
cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset
which doesn't involve dynamic allocation.

While avoiding dynamic allocation is somewhat nice, having two
separate paths for cftypes registration is nasty, especially as we're
planning to add more operations during cftypes registration.

This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset
and registers base_cftypes using cgroup_add_cftypes(). This is done
as a separate step in cgroup_init() instead of a part of
cgroup_init_subsys(). This is because cgroup_init_subsys() can be
called very early during boot when kmalloc() isn't available yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
8d7e6fb0a1db970ac3589f87af0f2a20ef46654b 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: update cgroup name handling

Straightforward updates to cgroup name handling in preparation of
kernfs conversion.

* cgroup_alloc_name() is updated to take const char * isntead of
dentry * for name source.

* cgroup name formatting is separated out into cgroup_file_name().
While at it, buffer length protection is added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d427dfeb120b92c0c5e2ca9d1ec6952de67ebad9 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: factor out cgroup_setup_root() from cgroup_mount()

Factor out new root initialization into cgroup_setup_root() from
cgroup_mount(). This makes it easier to follow and will ease kernfs
conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
8e30e2b8ba0ee58aa0f442d0b4a3cac1a4f2efb5 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: restructure locking and error handling in cgroup_mount()

cgroup is scheduled to be converted to kernfs. After conversion,
cgroup_mount() won't use the sget() machinery for finding out existing
super_blocks but instead would do that directly. It'll search the
existing cgroupfs_roots for a matching one and create a new one iff a
match doesn't exist. To ease such conversion, this patch restructures
locking and error handling of the function.

cgroup_tree_mutex and cgroup_mutex are grabbed from the get-go and
held until return. For now, due to the way vfs locks nest outside
cgroup mutexes, the two cgroup mutexes are temporarily dropped across
sget() and inode mutex locking, which looks quite ridiculous; however,
these will be removed through kernfs conversion and structuring the
code this way makes the conversion less painful.

The error goto labels are consolidated to two. This looks unwieldy
now but the next patch will factor out creation of new root into a
separate function with accompanying error handling and it'll look a
lot better.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
4ac0601744eb86e982fbdadde35f1945f7ce5882 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: release cgroup_mutex over file removals

Now that cftypes and all tree modification operations are protected by
cgroup_tree_mutex, we can drop cgroup_mutex while deleting files and
directories. Drop cgroup_mutex over removals.

This doesn't make any noticeable difference now but is to help kernfs
conversion. In kernfs, removals are sync points which drain in-flight
operations as those operations would grab cgroup_mutex, trying to
delete under cgroup_mutex would deadlock. This can be resolved by
just holding the outer cgroup_tree_mutex which nests outside both
kernfs active reference and cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ace2bee8135a3dc725958b8d08c55ee9df813d39 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: introduce cgroup_tree_mutex

Currently cgroup uses combination of inode->i_mutex'es and
cgroup_mutex for synchronization. With the scheduled kernfs
conversion, i_mutex'es will be removed. Unfortunately, just using
cgroup_mutex isn't possible. All kernfs file and syscall operations,
most of which require grabbing cgroup_mutex, will be called with
kernfs active ref held and, if we try to perform kernfs removals under
cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
target node.

Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
stuff used during hierarchy changing operations - cftypes and all the
operations which may affect the cgroupfs. It also covers css
association and iteration. This allows cgroup_css(), for_each_css()
and other css iterators to be called under cgroup_tree_mutex. The new
mutex will nest above both kernfs's active ref protection and
cgroup_mutex. By protecting tree modifications with a separate outer
mutex, we can get rid of the forementioned deadlock condition.

Actual file additions and removals now require cgroup_tree_mutex
instead of cgroup_mutex. Currently, cgroup_tree_mutex is never used
without cgroup_mutex; however, we'll soon add hierarchy modification
sections which are only protected by cgroup_tree_mutex. In the
future, we might want to make the locking more granular by better
splitting the coverages of the two mutexes. For now, this should do.

v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5a17f543ed6808e9085063277fe46795dea484bd 11-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: improve css_from_dir() into css_tryget_from_dir()

css_from_dir() returns the matching css (cgroup_subsys_state) given a
dentry and subsystem. The function doesn't pin the css before
returning and requires the caller to be holding RCU read lock or
cgroup_mutex and handling pinning on the caller side.

Given that users of the function are likely to want to pin the
returned css (both existing users do) and that getting and putting
css's are very cheap, there's no reason for the interface to be tricky
like this.

Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
the found css and return it only if pinning succeeded. The callers
are updated so that they no longer do RCU locking and pinning around
the function and just use the returned css.

This will also ease converting cgroup to kernfs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
0ab02ca8f887908152d1a96db5130fc661d36a1e 11-Feb-2014 Li Zefan <lizefan@huawei.com> cgroup: protect modifications to cgroup_idr with cgroup_mutex

Setup cgroupfs like this:
# mount -t cgroup -o cpuacct xxx /cgroup
# mkdir /cgroup/sub1
# mkdir /cgroup/sub2

Then run these two commands:
# for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } &
# for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } &

After seconds you may see this warning:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
idr_remove called for id=6 which is not allocated.
...
Call Trace:
[<ffffffff8156063c>] dump_stack+0x7a/0x96
[<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
[<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
[<ffffffff81300aa7>] sub_remove+0x87/0x1b0
[<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
[<ffffffff81300bf5>] idr_remove+0x25/0xd0
[<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
[<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
[<ffffffff8107cdbc>] process_one_work+0x26c/0x550
[<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
[<ffffffff81085f96>] kthread+0xe6/0xf0
[<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
---[ end trace 2d1577ec10cf80d0 ]---

It's because allocating/removing cgroup ID is not properly synchronized.

The bug was introduced when we converted cgroup_ida to cgroup_idr.
While synchronization is already done inside ida_simple_{get,remove}(),
users are responsible for concurrent calls to idr_{alloc,remove}().

tj: Refreshed on top of b58c89986a77 ("cgroup: fix error return from
cgroup_create()").

Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr")
Cc: <stable@vger.kernel.org> #3.12+
Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
3417ae1f5f59bbf36c3defbbf2a76c5ca498db2a 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_root_mutex

cgroup_root_mutex was added to avoid deadlock involving namespace_sem
via cgroup_show_options(). It added a lot of overhead for the small
purpose of it and, because it's nested under cgroup_mutex, it has very
limited usefulness. The previous patch made cgroup_show_options() not
use cgroup_root_mutex, so nobody needs it anymore. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
69e943b7d3c2dcca1087e03e556ac6cb0d4433b4 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: update locking in cgroup_show_options()

cgroup_show_options() grabs cgroup_root_mutex to protect the options
changing while printing; however, holding root_mutex or not doesn't
really make much difference for the function. subsys_mask can be
atomically tested and most of the options aren't allowed to change
anyway once mounted.

The only field which needs synchronization is ->release_agent_path.
This patch introduces a dedicated spinlock to synchronize accesses to
the field and drops cgroup_root_mutex locking from
cgroup_show_options(). The next patch will remove cgroup_root_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
aec25020f5d4b69aea5317551d1cb7043f6b04fb 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: rename cgroup_subsys->subsys_id to ->id

It's no longer referenced outside cgroup core, so renaming is easy.
Let's rename it for consistency & brevity.

This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
073219e995b4a3f8cf1ce8228b7ef440b6994ac0 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: clean up cgroup_subsys names and initialization

cgroup_subsys is a bit messier than it needs to be.

* The name of a subsys can be different from its internal identifier
defined in cgroup_subsys.h. Most subsystems use the matching name
but three - cpu, memory and perf_event - use different ones.

* cgroup_subsys_id enums are postfixed with _subsys_id and each
cgroup_subsys is postfixed with _subsys. cgroup.h is widely
included throughout various subsystems, it doesn't and shouldn't
have claim on such generic names which don't have any qualifier
indicating that they belong to cgroup.

* cgroup_subsys->subsys_id should always equal the matching
cgroup_subsys_id enum; however, we require each controller to
initialize it and then BUG if they don't match, which is a bit
silly.

This patch cleans up cgroup_subsys names and initialization by doing
the followings.

* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
cgroup_subsys with _cgrp_subsys.

* With the above, renaming subsys identifiers to match the userland
visible names doesn't cause any naming conflicts. All non-matching
identifiers are renamed to match the official names.

cpu_cgroup -> cpu
mem_cgroup -> memory
perf -> perf_event

* controllers no longer need to initialize ->subsys_id and ->name.
They're generated in cgroup core and set automatically during boot.

* Redundant cgroup_subsys declarations removed.

* While updating BUG_ON()s in cgroup_init_early(), convert them to
WARN()s. BUGging that early during boot is stupid - the kernel
can't print anything, even through serial console and the trap
handler doesn't even link stack frame properly for back-tracing.

This patch doesn't introduce any behavior changes.

v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
classid handling into core").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
3ed80a62bf959d34ebd4d553b026fbe7e6fbcc54 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: drop module support

With module supported dropped from net_prio, no controller is using
cgroup module support. None of actual resource controllers can be
built as a module and we aren't gonna add new controllers which don't
control resources. This patch drops module support from cgroup.

* cgroup_[un]load_subsys() and cgroup_subsys->module removed.

* As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
cgroup_subsys.h now uses IS_ENABLED() directly.

* enum cgroup_subsys_id now exactly matches the list of enabled
controllers as ordered in cgroup_subsys.h.

* cgroup_subsys[] is now a contiguously occupied array. Size
specification is no longer necessary and dropped.

* for_each_builtin_subsys() is removed and for_each_subsys() is
updated to not require any locking.

* module ref handling is removed from rebind_subsystems().

* Module related comments dropped.

v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs
classid handling into core").

v3: Added {} around the if (need_forkexit_callback) block in
cgroup_post_fork() for readability as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
48573a893303986e3b0b2974d6fb11f3d1bb7064 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: fix locking in cgroup_cfts_commit()

cgroup_cfts_commit() walks the cgroup hierarchy that the target
subsystem is attached to and tries to apply the file changes. Due to
the convolution with inode locking, it can't keep cgroup_mutex locked
while iterating. It currently holds only RCU read lock around the
actual iteration and then pins the found cgroup using dget().

Unfortunately, this is incorrect. Although the iteration does check
cgroup_is_dead() before invoking dget(), there's nothing which
prevents the dentry from going away inbetween. Note that this is
different from the usual css iterations where css_tryget() is used to
pin the css - css_tryget() tests whether the css can be pinned and
fails if not.

The problem can be solved by simply holding cgroup_mutex instead of
RCU read lock around the iteration, which actually reduces LOC.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
b58c89986a77a23658682a100eb15d8edb571ebb 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: fix error return from cgroup_create()

cgroup_create() was returning 0 after allocation failures. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
eb46bf89696972b856a9adb6aebd5c7b65c266e4 08-Feb-2014 Tejun Heo <tj@kernel.org> cgroup: fix error return value in cgroup_mount()

When cgroup_mount() fails to allocate an id for the root, it didn't
set ret before jumping to unlock_drop ending up returning 0 after a
failure. Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
ab3f5faa6255a0eb4f832675507d9e295ca7e9ba 07-Feb-2014 Hugh Dickins <hughd@google.com> cgroup: use an ordered workqueue for cgroup destruction

Sometimes the cleanup after memcg hierarchy testing gets stuck in
mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

There may turn out to be several causes, but a major cause is this: the
workitem to offline parent can get run before workitem to offline child;
parent's mem_cgroup_reparent_charges() circles around waiting for the
child's pages to be reparented to its lrus, but it's holding cgroup_mutex
which prevents the child from reaching its mem_cgroup_reparent_charges().

Just use an ordered workqueue for cgroup_destroy_wq.

tj: Committing as the temporary fix until the reverse dependency can
be removed from memcg. Comment updated accordingly.

Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
Suggested-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Tejun Heo <tj@kernel.org>
dd4b0a4676907481256d16d5de0851b315a6f22c 18-Jan-2014 SeongJae Park <sj38.park@gmail.com> cgroup: trivial style updates

* Place newline before function opening brace in cgroup_kill_sb().

* Insert space before assignment in attach_task_by_pid()

tj: merged two patches into one.

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
c1a71504e9715812a2d15e7c03b5aa147ae70ded 17-Dec-2013 Li Zefan <lizefan@huawei.com> cgroup: don't recycle cgroup id until all csses' have been destroyed

Hugh reported this bug:

> CONFIG_MEMCG_SWAP is broken in 3.13-rc. Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1 # let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.

Instead of removing cgroup id right after all the csses have been
offlined, we should do that after csses have been destroyed.

To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.

tj: Updated comment to note planned changes for cgrp->id.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
10bf2f7e7db993273419ca9f51f5934e4cf71768 12-Dec-2013 Vladimir Davydov <vdavydov@parallels.com> cgroup: fix fail path in cgroup_load_subsys()

Calling cgroup_unload_subsys() from cgroup_load_subsys() after
online_css() failure will result in a NULL ptr dereference on attempt to
offline_css(), because online_css() only assigns css to cgroup on
success. Let's fix that by skipping calls to offline_css() and
css_free() in cgroup_unload_subsys() if there is no css, and freeing css
in cgroup_load_subsys() on online_css() failure.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0be8669dd530f60cf3f59f084518570c1dfb47ee 09-Dec-2013 Wei Yongjun <yongjun_wei@trendmicro.com.cn> cgroup: fix missing unlock on error in cgroup_load_subsys()

Add the missing unlock before return from function cgroup_load_subsys()
in the error handling case.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b85d20404cef6493f9d2edbafe83f9c72aece9a8 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: remove for_each_root_subsys()

After the previous patch which introduced for_each_css(),
for_each_root_subsys() only has two users left. This patch replaces
it with for_each_subsys() + explicit subsys_mask testing and remove
for_each_root_subsys() along with cgroupfs_root->subsys_list handling.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1c6727af4b495a9ec74c46d1fc08e508e675899d 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: implement for_each_css()

There are enough places where css's of a cgroup are iterated, which
currently uses for_each_root_subsys() + explicit cgroup_css(). This
patch implements for_each_css() and replaces the above combination
with it.

This patch doesn't introduce any behavior changes.

v2: Updated to apply cleanly on top of v2 of "cgroup: fix css leaks on
online_css() failure"

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
c81c925ad9b0460a14ec35b52c61158da0733d51 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: factor out cgroup_subsys_state creation into create_css()

Now that all opertations to create a css (cgroup_subsys_state) are
collected into a single loop in cgroup_create(), it's easy to factor
it out into its own function. Factor out css creation into
create_css(). This makes the code easier to follow and will enable
decoupling css creation from cgroup creation which is necessary for
the planned unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9d403e99238ed6d7151a2c07db6cf8f6932ef3d5 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: combine css handling loops in cgroup_create()

Now that css operations in cgroup_create() are back-to-back, there
isn't much point in allocating css's in one loop and onlining them in
another. Merge the two loops so that a css is allocated and onlined
on each iteration.

css_ar[] is no longer necessary and replaced with a single pointer.
This also simplifies the error handling path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0d80255e42b54419cfc6b10a3ec74b60fe04b8d7 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: reorder operations in cgroup_create()

cgroup_create() currently does the followings.

1. alloc cgroup
2. alloc css's
3. create the directory and commit to cgroup creation
4. online css's
5. create cgroup and css files

The sequence performs allocations before other operations but it
doesn't buy anything because each of the above steps may fail and
should be unrollable. Reorganize the sequence such that cgroup
operations are done before css operations.

1. alloc cgroup
2. create the directory and files and commit to cgroup creation
3. alloc css's
4. create files for and online css's

This simplifies the code a bit and enables further simplification and
separating out css creation from cgroup creation which is necessary
for the planned unified hierarchy where css's will be created and
destroyed dynamically across the lifetime of a cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
780cd8b347b52584704438e599274f25c0a0fd13 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: make for_each_subsys() useable under cgroup_root_mutex

We want to use for_each_subsys() in cgroupfs_root handling where only
cgroup_root_mutex is held. The only way cgroup_subsys[] can change is
through module load/unload, make cgroup_[un]load_subsys() grab
cgroup_root_mutex too and update the lockdep annotation in
for_each_subsys() to allow either cgroup_mutex or cgroup_root_mutex.

* Lockdep annotation is moved from inner 'if' condition to outer 'for'
init caluse. There's no reason to execute the assertion every loop.

* Loop index @i is renamed to @ssid. Indices iterating through subsys
will be [re]named to @ssid gradually.

v2: cgroup_assert_mutex_or_root_locked() caused build failure if
!CONFIG_LOCKEDP. Conditionalize its definition. The build failure
was reported by kbuild test bot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot <fengguang.wu@intel.com>
87fb54f1b5a447662854f176eeb1ba92d5ffc1d5 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: css iterations and css_from_dir() are safe under cgroup_mutex

Currently, all css iterations and css_from_dir() require RCU read lock
whether the caller is holding cgroup_mutex or not, which is
unnecessarily restrictive. They are all safe to use under
cgroup_mutex without holding RCU read lock.

Factor out cgroup_assert_mutex_or_rcu_locked() from css_from_id() and
apply it to all css iteration functions and css_from_dir().

v2: cgroup_assert_mutex_or_rcu_locked() definition doesn't need to be
inside CONFIG_PROVE_RCU ifdef as rcu_lockdep_assert() is always
defined and conditionalized. Move it outside of the ifdef block.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
266ccd505e8acb98717819cef9d91d66c7b237cc 06-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroup_create() error handling path

ae7f164a09 ("cgroup: move cgroup->subsys[] assignment to
online_css()") moved cgroup->subsys[] assignements later in
cgroup_create() but didn't update error handling path accordingly
leading to the following oops and leaking later css's after an
online_css() failure. The oops is from cgroup destruction path being
invoked on the partially constructed cgroup which is not ready to
handle empty slots in cgrp->subsys[] array.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
PGD a780a067 PUD aadbe067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
Hardware name:
task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
RIP: 0010:[<ffffffff810eeaa8>] [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
RSP: 0018:ffff8800a781bd98 EFLAGS: 00010282
RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
FS: 00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
Stack:
ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
Call Trace:
[<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
[<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
[<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
[<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
[<ffffffff8169e569>] system_call_fastpath+0x16/0x1b

This patch moves reference bumping inside online_css() loop, clears
css_ar[] as css's are brought online successfully, and updates
err_destroy path so that either a css is fully online and destroyed by
cgroup_destroy_locked() or the error path frees it. This creates a
duplicate css free logic in the error path but it will be cleaned up
soon.

v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
invoked with a cgroup which doesn't have all css's populated.
Update cgroup_destroy_locked() so that it skips NULL css's.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Reported-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: stable@vger.kernel.org # v3.12+
6612f05b88fa309c91a345690411217959bb2486 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: unify pidlist and other file handling

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs. With the previous
changes, the difference between pidlist and other files are very
small. Both are served by seq_file in a pretty standard way with the
only difference being !pidlist files use single_open().

This patch adds cftype->seq_start(), ->seq_next and ->seq_stop() and
implements the matching cgroup_seqfile_start/next/stop() which either
emulates single_open() behavior or invokes cftype->seq_*() operations
if specified. This allows using single seq_operations for both
pidlist and other files and makes cgroup_pidlist_operations and
cgorup_pidlist_open() no longer necessary. As cgroup_pidlist_open()
was the only user of cftype->open(), the method is dropped together.

This brings cftype file interface very close to kernfs interface and
mapping shouldn't be too difficult. Once converted to kernfs, most of
the plumbing code including cgroup_seqfile_*() will be removed as
kernfs provides those facilities.

This patch does not introduce any behavior changes.

v2: Refreshed on top of the updated "cgroup: introduce struct
cgroup_pidlist_open_file".

v3: Refreshed on top of the updated "cgroup: attach cgroup_open_file
to all cgroup files".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2da8ca822d49c8b8781800ad155aaa00e7bb5f1a 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: replace cftype->read_seq_string() with cftype->seq_show()

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs. This patch
replaces cftype->read_seq_string() with cftype->seq_show() which is
not limited to single_open() operation and will map directcly to
kernfs seq_file interface.

The conversions are mechanical. As ->seq_show() doesn't have @css and
@cft, the functions which make use of them are converted to use
seq_css() and seq_cft() respectively. In several occassions, e.f. if
it has seq_string in its name, the function name is updated to fit the
new method better.

This patch does not introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
7da112792753d71aed44b918395892a1fc53048a 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: attach cgroup_open_file to all cgroup files

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs. This patch
attaches cgroup_open_file, which used to be attached to pidlist files,
to all cgroup files, introduces seq_css/cft() accessors to determine
the cgroup_subsys_state and cftype associated with a given cgroup
seq_file, exports them as public interface.

This doesn't cause any behavior changes but unifies cgroup file
handling across different file types and will help converting them to
kernfs seq_show() interface.

v2: Li pointed out that the original patch was using
single_open_size() incorrectly assuming that the size param is
private data size. Fix it by allocating @of separately and
passing it to single_open() and explicitly freeing it in the
release path. This isn't the prettiest but this path is gonna be
restructured by the following patches pretty soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5d22444f427511fe7c4733b233680309af5b071c 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: generalize cgroup_pidlist_open_file

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs. This patch renames
cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
only contains a field to identify the specific file, ->cfe, and an
opaque ->priv pointer. When cgroup is converted to kernfs, this will
be replaced by kernfs_open_file which contains about the same
information.

As whether the file is "cgroup.procs" or "tasks" should now be
determined from cgroup_open_file->cfe, the cftype->private for the two
files now carry the file type and cgroup_pidlist_start() reads the
type through cfe->type->private. This makes the distinction between
cgroup_tasks_open() and cgroup_procs_open() unnecessary.
cgroup_pidlist_open() is now directly used as the open method.

This patch doesn't make any behavior changes.

v2: Refreshed on top of the updated "cgroup: introduce struct
cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
896f5199631560202885715da1b2f018632084e5 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: unify read path so that seq_file is always used

With the recent removal of cftype->read() and ->read_map(), only three
operations are remaining, ->read_u64(), ->read_s64() and
->read_seq_string(). Currently, the first two are handled directly
while the last is handled through seq_file.

It is trivial to serve the first two through the seq_file path too.
This patch restructures read path so that all operations are served
through cgroup_seqfile_show(). This makes all cgroup files seq_file -
single_open/release() are now used by default,
cgroup_seqfile_operations is dropped, and cgroup_file_operations uses
seq_read() for read.

This simplifies the code and makes the read path easy to convert to
use kernfs.

Note that, while cgroup_file_operations uses seq_read() for read, it
still uses generic_file_llseek() for seeking instead of seq_lseek().
This is different from cgroup_seqfile_operations but shouldn't break
anything and brings the seeking behavior aligned with kernfs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a742c59de66ea080afa3edaf3428b3cdd5aa87cd 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: unify cgroup_write_X64() and cgroup_write_string()

cgroup_write_X64() and cgroup_write_string() both implement about the
same buffering logic. Unify the two into cgroup_file_write() which
always allocates dynamic buffer for simplicity and uses kstrto*()
instead of simple_strto*().

This patch doesn't make any visible behavior changes except for
possibly different error value from kstrsto*().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
6e0755b08dd6a3b5260fafc6969268c2ba261300 05-Dec-2013 Tejun Heo <tj@kernel.org> cgroup: remove cftype->read(), ->read_map() and ->write()

In preparation of conversion to kernfs, cgroup file handling is being
consolidated so that it can be easily mapped to the seq_file based
interface of kernfs.

After recent updates, ->read() and ->read_map() don't have any user
left and ->write() never had any user. Remove them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
afb2bc14e1c989cf0635bd04edb5ff55b8c1c7bd 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: don't guarantee cgroup.procs is sorted if sane_behavior

For some reason, tasks and cgroup.procs guarantee that the result is
sorted. This is the only reason this whole pidlist logic is necessary
instead of just iterating through sorted member tasks. We can't do
anything about the existing interface but at least ensure that such
expectation doesn't exist for the new interface so that pidlist logic
may be removed in the distant future.

This patch scrambles the sort order if sane_behavior so that the
output is usually not sorted in the new interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
045023658ca1e30dc0bb1f148b42c95b740d3e02 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_pidlist->use_count

After the recent changes, pidlist ref is held only between
cgroup_pidlist_start() and cgroup_pidlist_stop() during which
cgroup->pidlist_mutex is also held. IOW, the reference count is
redundant now. While in use, it's always one and pidlist_mutex is
held - holding the mutex has exactly the same protection.

This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
so that pidlist_mutex is not released inbetween and drops
pidlist->use_count.

This patch shouldn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
4bac00d16a8760eae7205e41d2c246477d42a210 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: load and release pidlists from seq_file start and stop respectively

Currently, pidlists are reference counted from file open and release
methods. This means that holding onto an open file may waste memory
and reads may return data which is very stale. Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show. This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration. This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

The previous patches implemented delayed release and restructured
pidlist handling so that pidlists can be loaded and released from
seq_file start / stop. This patch actually moves pidlist load to
start and release to stop.

This means that pidlist is pinned only between start and stop and may
go away between two consecutive read calls if the two calls are apart
by more than CGROUP_PIDLIST_DESTROY_DELAY. cgroup_pidlist_start()
thus can't re-use the stored cgroup_pid_list_open_file->pidlist
directly. During start, it's only used as a hint indicating whether
this is the first start after open or not and pidlist is always looked
up or created.

pidlist_mutex locking and reference counting are moved out of
pidlist_array_load() so that pidlist_array_load() can perform lookup
and creation atomically. While this enlarges the area covered by
pidlist_mutex, given how the lock is used, it's highly unlikely to be
noticeable.

v2: Refreshed on top of the updated "cgroup: introduce struct
cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
069df3b7aeb3f4e926c4da9630c92010909af512 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_pidlist->rwsem

cgroup_pidlist locking is needlessly complicated. It has outer
cgroup->pidlist_mutex to protect the list of pidlists associated with
a cgroup and then each pidlist has rwsem to synchronize updates and
reads. Given that the only read access is from seq_file operations
which are always invoked back-to-back, the rwsem is a giant overkill.
All it does is adding unnecessary complexity.

This patch removes cgroup_pidlist->rwsem and protects all accesses to
pidlists belonging to a cgroup with cgroup->pidlist_mutex.
pidlist->rwsem locking is removed if it's nested inside
cgroup->pidlist_mutex; otherwise, it's replaced with
cgroup->pidlist_mutex locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e6b817103d168a76e4044ebcdbc08225d77a81cb 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: refactor cgroup_pidlist_find()

Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
separate out finding proper to cgroup_pidlist_find(). Also, move
locking to the caller.

This patch is preparation for pidlist restructure and doesn't
introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
62236858f3cc558135d1ab6b2af44d22c489bb7f 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: introduce struct cgroup_pidlist_open_file

For pidlist files, seq_file->private pointed to the loaded
cgroup_pidlist; however, pidlist loading is planned to be moved to
cgroup_pidlist_start() for kernfs conversion and seq_file->private
needs to carry more information from open to allow that.

This patch introduces struct cgroup_pidlist_open_file which contains
type, cgrp and pidlist and updates pidlist seq_file->private to point
to it using seq_open_private() and seq_release_private(). Note that
this eventually will be replaced by kernfs_open_file.

While this patch makes more information available to seq_file
operations, they don't use it yet and this patch doesn't introduce any
behavior changes except for allocation of the extra private struct.

v2: use __seq_open_private() instead of seq_open_private() for brevity
as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b1a21367314f36a819c0676e0999f34db12ee6ed 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: implement delayed destruction for cgroup_pidlist

Currently, pidlists are reference counted from file open and release
methods. This means that holding onto an open file may waste memory
and reads may return data which is very stale. Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show. This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration. This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

This patch implements delayed release of pidlist. As pidlists could
be lingering on cgroup removal waiting for the timer to expire, cgroup
free path needs to queue the destruction work item immediately and
flush. As those work items are self-destroying, each work item can't
be flushed directly. A new workqueue - cgroup_pidlist_destroy_wq - is
added to serve as flush domain.

Note that this patch just adds delayed release on top of the current
implementation and doesn't change where pidlist is loaded and
released. Following patches will make those changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b9f3cecaba592d4e98cd155c91b1414323ed51e8 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: remove cftype->release()

Now that pidlist files don't use cftype->release(), it doesn't have
any user left. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ac1e69aa78e2e799a8d5475595ba744bdf29b597 29-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: don't skip seq_open on write only opens on pidlist files

Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
if the file is opened write-only, which is a sensible optimization as
pidlist loading can be costly and there often are occasions where
tasks or cgroup.procs is opened write-only. However, pidlist init and
release are planned to be moved to cgroup_pidlist_start/stop()
respectively which would make this optimization unnecessary.

This patch removes the optimization and always fully initializes
pidlist files regardless of open mode. This will help moving pidlist
handling to start/stop by unifying rw paths and removes the need for
specifying cftype->release() in addition to .release in
cgroup_pidlist_operations as file->f_op is now always overridden. As
pidlist files were the only user of cftype->release(), the next patch
will remove the method.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
e605b36575e896edd8161534550c9ea021b03bc0 28-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroup_subsys_state leak for seq_files

If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file->f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().

Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818ba42 ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release(). The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.

Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.12
b36824c75c7855585d6476eef2b234f6e0e68872 23-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: unexport cgroup_css() and remove __file_cft()

Now that cgroup_event is made memcg specific, the temporarily exported
functions are no longer necessary. Unexport cgroup_css() and remove
__file_cft() which doesn't have any user left.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
fba94807837850e211f8975e1970e23e7804ff4d 23-Nov-2013 Tejun Heo <tj@kernel.org> cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg

cgroup_event is being moved from cgroup core to memcg and the
implementation is already moved by the previous patch. This patch
moves the data fields and callbacks.

* cgroup->event_list[_lock] are moved to mem_cgroup.

* cftype->[un]register_event() are moved to cgroup_event. This makes
it impossible for individual cftype definitions to specify their
event callbacks. This is worked around by simply hard-coding
filename to event callback mapping in cgroup_write_event_control().
This is awkward and inflexible, which is actually desirable given
that we don't want to grow more usages of this feature.

* eventfd_ctx declaration is removed from cgroup.h, which makes
vmpressure.h miss eventfd_ctx declaration. Include eventfd.h from
vmpressure.h.

v2: Use file name from dentry instead of cftype. This will allow
removing all cftype handling in the function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
79bd9814e5ec9a288d6599f53aeac0b548fdfe52 23-Nov-2013 Tejun Heo <tj@kernel.org> cgroup, memcg: move cgroup_event implementation to memcg

cgroup_event is way over-designed and tries to build a generic
flexible event mechanism into cgroup - fully customizable event
specification for each user of the interface. This is utterly
unnecessary and overboard especially in the light of the planned
unified hierarchy as there's gonna be single agent. Simply generating
events at fixed points, or if that's too restrictive, configureable
cadence or single set of configureable points should be enough.

Thankfully, memcg is the only user and gets to keep it. Replacing it
with something simpler on sane_behavior is strongly recommended.

This patch moves cgroup_event and "cgroup.event_control"
implementation to mm/memcontrol.c. Clearing of events on cgroup
destruction is moved from cgroup_destroy_locked() to
mem_cgroup_css_offline(), which shouldn't make any noticeable
difference.

cgroup_css() and __file_cft() are exported to enable the move;
however, this will soon be reverted once the event code is updated to
be memcg specific.

Note that "cgroup.event_control" will now exist only on the hierarchy
with memcg attached to it. While this change is visible to userland,
it is unlikely to be noticeable as the file has never been meaningful
outside memcg.

Aside from the above change, this is pure code relocation.

v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
poll.h inclusion moved from cgroup.c to memcontrol.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
e5fca243abae1445afbfceebda5f08462ef869d3 22-Nov-2013 Tejun Heo <tj@kernel.org> cgroup: use a dedicated workqueue for cgroup destruction

Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
b26d4cd385fc51e8844e2cdf9ba2051f5bba11a5 26-Oct-2013 Al Viro <viro@zeniv.linux.org.uk> consolidate simple ->d_delete() instances

Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ea84753c98a7ac6b74e530b64c444a912b3835ca 12-Oct-2013 Anjana V Kumar <anjanavk12@gmail.com> cgroup: fix to break the while loop in cgroup_attach_task() correctly

Both Anjana and Eunki reported a stall in the while_each_thread loop
in cgroup_attach_task().

It's because, when we attach a single thread to a cgroup, if the cgroup
is exiting or is already in that cgroup, we won't break the loop.

If the task is already in the cgroup, the bug can lead to another thread
being attached to the cgroup unexpectedly:

# echo 5207 > tasks
# cat tasks
5207
# echo 5207 > tasks
# cat tasks
5207
5215

What's worse, if the task to be attached isn't the leader of the thread
group, we might never exit the loop, hence cpu stall. Thanks for Oleg's
analysis.

This bug was introduced by commit 081aa458c38ba576bdd4265fc807fa95b48b9e79
("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()")

[ lizf: - fixed the first continue, pointed out by Oleg,
- rewrote changelog. ]

Cc: <stable@vger.kernel.org> # 3.9+
Reported-by: Eunki Kim <eunki_kim@samsung.com>
Reported-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2ff2a7d03bbe472ed44a8380dbdbea490d81c59d 23-Sep-2013 Li Zefan <lizefan@huawei.com> cgroup: kill css_id

The only user of css_id was memcg, and it has been convered to use
cgroup->id, so kill css_id.

Signed-off-by: Li Zefan <lizefan@huwei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
58b79a91f57efec9457de8ff93a4cc4fb8daf753 06-Sep-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroup post-order descendant walk of empty subtree

bd8815a6d8 ("cgroup: make css_for_each_descendant() and friends
include the origin css in the iteration") updated cgroup descendant
iterators to include the origin css; unfortuantely, it forgot to drop
special case handling in css_next_descendant_post() for empty subtree
leading to failure to visit the origin css without any child.

Fix it by dropping the special case handling and always returning the
leftmost descendant on the first iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
4e10f3c98888ee88ea2543aa636db6410fa47477 30-Aug-2013 Al Viro <viro@zeniv.linux.org.uk> Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c

kernel/cgroup.c is the only place in the tree that relies on eventfd.h
pulling file.h; move that include there. Switch from eventfd_fget()/fput()
to fdget()/fdput(), while we are at it - eventfd_ctx_fileget() will fail
on non-eventfd descriptors just fine, no need to do that check twice...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
bb78a92f47696b2da49f2692b6a9fa56d07c444a 29-Aug-2013 Hugh Dickins <hughd@google.com> cgroup: fix rmdir EBUSY regression in 3.11

On 3.11-rc we are seeing cgroup directories left behind when they should
have been removed. Here's a trivial reproducer:

cd /sys/fs/cgroup/memory
mkdir parent parent/child; rmdir parent/child parent
rmdir: failed to remove `parent': Device or resource busy

It's because cgroup_destroy_locked() (step 1 of destruction) leaves
cgroup on parent's children list, letting cgroup_offline_fn() (step 2 of
destruction) remove it; but step 2 is run by work queue, which may not
yet have removed the children when parent destruction checks the list.

Fix that by checking through a non-empty list of children: if every one
of them has already been marked CGRP_DEAD, then it's safe to proceed:
those children are invisible to userspace, and should not obstruct rmdir.

(I didn't see any reason to keep the cgrp->children checks under the
unrelated css_set_lock, so moved them out.)

tj: Flattened nested ifs a bit and updated comment so that it's
correct on both for-3.11-fixes and for-3.12.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
d1625964da51bda61306ad3ec45307a799c21f08 27-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroup_css() invocation in css_from_id()

ca8bdcaff0 ("cgroup: make cgroup_css() take cgroup_subsys * instead
and allow NULL subsys") missed one conversion in css_from_id(), which
was newly added. As css_from_id() doesn't have any user yet, this
doesn't break anything other than generating a build warning.

Convert it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
7c918cbbd829669bf70ffcc45962d5d992942243 27-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()

cgroup_event will be moved to its only user - memcg. Replace
__d_cgrp() usage with css_from_dir(), which is already exported. This
also simplifies the code a bit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
7941cb027dccedec3c047271554ddcf4be2e0697 27-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup

Currently, each registered cgroup_event holds an extra reference to
the cgroup. This is a bit weird as events are subsystem specific and
will also be incorrect in the planned unified hierarchy as css
(cgroup_subsys_state) may come and go dynamically across the lifetime
of a cgroup. Holding onto cgroup won't prevent the target css from
going away.

Update cgroup_event to hold onto the css the traget file belongs to
instead of cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
9fa4db334c7d9570aec7a5121e84fae99aae1d04 27-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: implement CFTYPE_NO_PREFIX

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix. This patch adds CFTYPE_NO_ which
disables the automatic prefix. This is to work around historical
baggages and shouldn't be used for new files.

This will be used to move "cgroup.event_control" from cgroup core to
memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Glauber Costa <glommer@gmail.com>
ca8bdcaff0d77990fb69e0f946018c96a70851cc 27-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys

cgroup_css() is no longer used in hot paths. Make it take struct
cgroup_subsys * and allow the users to specify NULL subsys to obtain
the dummy_css. This removes open-coded NULL subsystem testing in a
couple users and generally simplifies the code.

After this patch, css_from_dir() also allows NULL @ss and returns the
matching dummy_css. This behavior change doesn't affect its only user
- perf.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
35cf083619da5677f83e9a8eae813f0b413d7082 27-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax

cgroup_css_from_dir() will grow another user. In preparation, make
the following changes.

* All css functions are prefixed with just "css_", rename it to
css_from_dir().

* Take dentry * instead of file * as dentry is what ultimately
identifies a cgroup and file may not always be available. Note that
the function now checkes whether @dentry->d_inode is NULL as the
caller now may specify a negative dentry.

* Make it take cgroup_subsys * instead of integer subsys_id. This
simplifies the function and allows specifying no subsystem for
cgroup->dummy_css.

* Make return section a bit less verbose.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
6e6eab0efdf48fb2d8d7aee904d7740acb4661c6 15-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroup_write_event_control()

81eeaf0411 ("cgroup: make cftype->[un]register_event() deal with
cgroup_subsys_state inst ead of cgroup") updated the cftype event
methods to take @css (cgroup_subsys_state) instead of @cgroup;
however, it incorrectly used @css passed to
cgroup_write_event_control(), which the dummy_css for the cgroup as
the file is a cgroup core file. This leads to oops on event
registration.

Fix it by using the css matching the event target file. Note that
cgroup_write_event_control() now disallows cgroup core files from
being event sources. This is for simplicity and doesn't matter as
cgroup_event will be moved and made specific to memcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
0bfb4aa67cef4982adc70590a31624d7b35a0bda 15-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: fix subsystem file accesses on the root cgroup

105347ba5 ("cgroup: make cgroup_file_open() rcu_read_lock() around
cgroup_css() and add cfent->css") added cfent->css to cache the
associted cgroup_subsys_state across file operations.

A cfent is associated with single css throughout its lifetime and the
origimal commit initialized the cache pointer during cgroup_add_file()
and verified that it matches the actual one in cgroup_file_open().
While this works fine for !root cgroups, it's broken for root cgroups
as files in a root cgroup are created before the css's are associated
with the cgroup and thus cgroup_css() call in cgroup_add_file()
returns NULL associating all cfents in the root cgroup with NULL css.
This makes cgroup_file_open() trigger WARN and fail with -ENODEV for
all !core subsystem files in the root cgroups.

There's no reason to initialize cfent->css separately from
cgroup_add_file(). As the association never changes,
cgroup_file_open() can set it unconditionally every time and
containing the logic in cgroup_file_open() makes more sense anyway as
the only reason it's necessary is file->private_data being already
occupied.

Fix it by setting cfent->css unconditionally from cgroup_file_open().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1cb650b91ba582f6737457b7d22e368585596d2c 19-Aug-2013 Li Zefan <lizefan@huawei.com> cgroup: change cgroup_from_id() to css_from_id()

Now we want cgroup core to always provide the css to use to the
subsystems, so change this API to css_from_id().

Uninline css_from_id(), because it's getting bigger and cgroup_css()
has been unexported.

While at it, remove the #ifdef, and shuffle the order of the args.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
930913a31289202d232186b82854b26d7fb7cf4d 16-Aug-2013 Li Zhong <zhong@linux.vnet.ibm.com> cgroup: use css_get() in cgroup_create() to check CSS_ROOT

It seems that the root css doesn't have refcnt allocated(not needed?),
and would cause the booting error attached.

This patch tries to use css_get() to not increase the refcnt if parent
is root.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740
PGD 0
Oops: 0002 [#1]
Modules linked in:
CPU: 0 PID: 1 Comm: systemd Not tainted 3.11.0-rc5-next-20130815+ #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
task: ffff88007f868000 ti: ffff88007f864000 task.ti: ffff88007f864000
RIP: 0010:[<ffffffff810b37cc>] [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740
RSP: 0018:ffff88007f865df8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff81a46ee0 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81a415c0
RBP: ffff88007f865ec8 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88007ce6d060 R11: 0000000000000000 R12: ffff88007ce6d000
R13: ffff88007ce6d060 R14: ffffffff81a46d80 R15: ffff88007c6e8018
FS: 00007f13dbf6f840(0000) GS:ffffffff81a23000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000007b7e5000 CR4: 00000000000006b0
Stack:
ffffffff810b380d 0000000000000002 ffff88007f865e18 ffffffff81167069
ffff88007f865ed8 ffffffff8116a3f5 ffff880037454400 ffff88007c6e8018
ffff88007c6e8028 ffff88007c6e8328 ffff88007c6e8000 ffff88007ce6d000
Call Trace:
[<ffffffff810b380d>] ? cgroup_mkdir+0x3bd/0x740
[<ffffffff81167069>] ? lookup_hash+0x19/0x20
[<ffffffff8116a3f5>] ? kern_path_create+0x95/0x170
[<ffffffff8116ce3e>] vfs_mkdir+0x9e/0xf0
[<ffffffff8116d7a0>] SyS_mkdirat+0x60/0xe0
[<ffffffff8116d839>] SyS_mkdir+0x19/0x20
[<ffffffff814c960d>] tracesys+0xcf/0xd4
Code: ad 70 ff ff ff 48 89 9d 60 ff ff ff 4d 89 d5 4c 8b bd 68 ff ff ff 4c 8b 65 88 eb 50 0f 1f 00 48 8b 43 18 a8 03 0f 85 6c 03 00 00 <ff> 00 e8 1d 0a fb ff 85 c0 74 0d 80 3d f0 45 a1 00 00 0f 84 4c
RIP [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740
RSP <ffff88007f865df8>
CR2: 0000000000000000
---[ end trace a4b14b49bc46fd60 ]---

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
0c21ead136a900c36f1ab74fd7d09a306dc31324 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: RCU protect each cgroup_subsys_state release

With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.

When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup. When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup. After a grace period, all css's are freed along
with the cgroup itself.

This patch moves the RCU grace period and css freeing from cgroup
release path to css release path. css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period. css_free_work_fn() is updated to free the css
directly.

The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.

This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3c14f8b44fafaa60519440bea1591e495b928327 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: move subsys file removal to kill_css()

With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. This patch moves subsys file removal from
cgroup_destroy_locked() to kill_css().

While this changes the order of destruction operations, the changes
shouldn't be noticeable to cgroup subsystems or userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
edae0c3358947f8be5ca99f762d89e0c38e1f5d5 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: factor out kill_css()

Factor out css ref killing from cgroup_destroy_locked() into
kill_css(). We're gonna add more to the path and the factored out
function will eventually be called from other places too.

While at it, replace open coded percpu_ref_get() with css_get() for
consistency. This shouldn't cause any functional difference as the
function is not used for root cgroups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
09a503ea3a816b285b0b402b7f785eaec0c7a7e1 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: decouple cgroup_subsys_state destruction from cgroup destruction

Currently, css (cgroup_subsys_state) lifetime is tied to that of the
associated cgroup. css's are created when the associated cgroup is
created and destroyed when it gets destroyed. Also, individual css's
aren't RCU protected but the whole cgroup is. With the planned
unified hierarchy, css's will need to be dynamically created and
destroyed within the lifetime of a cgroup.

To enable such usages, this patch decouples css destruction from
cgroup destruction - offline_css() invocation and the final css_put()
are moved from cgroup_destroy_css_killed() to css_killed_work_fn().
Now each css is individually offlined and put as its reference count
is killed instead of waiting for all css's attached to the cgroup to
finish refcnt killing and then proceeding to offlining and putting
them together.

While this changes the order of destruction operations, the changes
shouldn't be noticeable to cgroup subsystems or userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
f20104de55a212a9742d8df1807f1f29dc95b748 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: replace cgroup->css_kill_cnt with ->nr_css

Currently, css (cgroup_subsys_state) lifetime is tied to that of the
associated cgroup. With the planned unified hierarchy, css's will be
dynamically created and destroyed within the lifetime of a cgroup. To
enable such usages, css's will be individually RCU protected instead
of being tied to the cgroup.

cgroup->css_kill_cnt is used during cgroup destruction to wait for css
reference count disable; however, this model doesn't work once css's
lifetimes are managed separately from cgroup's. This patch replaces
it with cgroup->nr_css which is an cgroup_mutex protected integer
counting the number of attached css's. The count is incremented from
online_css() and decremented after refcnt kill is confirmed. If the
count reaches zero and the cgroup is marked dead, the second stage of
cgroup destruction is kicked off. If a cgroup doesn't have any css
attached at the time of rmdir, cgroup_destroy_locked() now invokes the
second stage directly as no css kill confirmation would happen.

cgroup_offline_fn() - the second step of cgroup destruction - is
renamed to cgroup_destroy_css_killed() and now expects to be called
with cgroup_mutex held.

While this patch changes how css destruction is punted to work items,
it shouldn't change any visible behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
223dbc38d2a8745a93749dc75ed909e274ce075d 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item

css (cgroup_subsys_state) offlining, which requires process context,
will be moved to ref kill confirmation. In preparation, bounce
css_killed handling through css->destroy_work.

css_ref_killed_fn() is renamed to css_killed_ref_fn() so that it's
consistent with the new css_killed_work_fn().

This patch adds an additional work item bouncing but doesn't change
the actual logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ae7f164a09408bf21ab3c82a9e80a3ff37aa9e3e 14-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: move cgroup->subsys[] assignment to online_css()

Currently, css (cgroup_subsys_state) lifetime is tied to that of the
associated cgroup. With the planned unified hierarchy, css's will be
dynamically created and destroyed within the lifetime of a cgroup. To
enable such usages, css's will be individually RCU protected instead
of being tied to the cgroup.

In preparation, this patch moves cgroup->subsys[] assignment from
init_css() to online_css(). As this means that a newly initialized
css should be remembered separately and that cgroup_css() returns NULL
between init and online, cgroup_create() is updated so that it stores
newly created css's in a local array css_ar[] and
cgroup_init/load_subsys() are updated to use local variable @css
instead of using cgroup_css(). This change also slightly simplifies
error path of cgroup_create().

While this patch changes when cgroup->subsys[] is initialized, this
change isn't visible to subsystems or userland.

v2: This patch wasn't updated accordingly after the previous "cgroup:
reorganize css init / exit paths" was updated leading to missing a
css_ar[] conversion in cgroup_create() and thus boot failure. Fix
it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
623f926b050e12b0f5e3a2f4d11c36e4ddd63541 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: reorganize css init / exit paths

css (cgroup_subsys_state) lifetime management is about to be
restructured. In prepartion, make the following mostly trivial
changes.

* init_cgroup_css() is renamed to init_css() so that it's consistent
with other css handling functions.

* alloc_css_id(), online_css() and offline_css() updated to take @css
instead of cgroups and subsys IDs.

This patch doesn't make any functional changes.

v2: v1 merged two for_each_root_subsys() loops in cgroup_create() but
Li Zefan pointed out that it breaks error path. Dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
73e80ed8007fc48a6deeb295ba37159fad274bd2 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: add __rcu modifier to cgroup->subsys[]

For the planned unified hierarchy, each css (cgroup_subsys_state) will
be RCU protected so that it can be created and destroyed individually
while allowing RCU accesses. Previous changes ensured that all
cgroup->subsys[] accesses use the cgroup_css() accessor. This patch
adds __rcu modifier to cgroup->subsys[], add matching RCU dereference
in cgroup_css() and convert all assignments to either
rcu_assign_pointer() or RCU_INIT_POINTER().

This change prepares for the actual RCUfication of css's and doesn't
introduce any visible behavior change. The conversion is verified
with sparse and all accesses are properly RCU annotated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
105347ba5da3e87facce2337c50cd5df93cc6bec 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_file_open() rcu_read_lock() around cgroup_css() and add cfent->css

For the planned unified hierarchy, each css (cgroup_subsys_state) will
be RCU protected so that it can be created and destroyed individually
while allowing RCU accesses, and cgroup_css() will soon require either
holding cgroup_mutex or RCU read lock.

This patch updates cgroup_file_open() such that it acquires the
associated css under rcu_read_lock(). While cgroup_file_css() usages
in other file operations are safe due to the reference from open,
cgroup_css() wouldn't know that and will still trigger warnings. It'd
be cleanest to store the acquired css in file->prvidate_data for
further file operations but that's already used by seqfile. This
patch instead adds cfent->css to cache the associated css. Note that
while this field is initialized during cfe init, it should only be
considered valid while the file is open.

This patch doesn't change visible behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b77d7b6088377998ebf65eaea5e51008c2d75e94 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: cgroup_css_from_dir() now should be called with RCU read locked

cgroup->subsys[] will become RCU protected and thus all cgroup_css()
usages should either be under RCU read lock or cgroup_mutex. This
patch updates cgroup_css_from_dir() which returns the matching
cgroup_subsys_state given a directory file and subsys_id so that it
requires RCU read lock and updates its sole user
perf_cgroup_connect().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
0ae78e0bf10ac38ab53548e18383afc9997eca22 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: add cgroup_subsys_state->parent

With the planned unified hierarchy, css's (cgroup_subsys_state) will
be RCU protected and allowed to be attached and detached dynamically
over the course of a cgroup's lifetime. This means that css's will
stay accessible after being detached from its cgroup - the matching
pointer in cgroup->subsys[] cleared - for ref draining and RCU grace
period.

cgroup core still wants to guarantee that the parent css is never
destroyed before its children and css_parent() always returns the
parent regardless of the state of the child css as long as it's
accessible.

This patch makes css's hold onto their parents and adds css->parent so
that the parent css is never detroyed before its children and can be
determined without consulting the cgroups.

cgroup->dummy_css is also updated to point to the parent dummy_css;
however, it doesn't need to worry about object lifetime as the parent
cgroup is already pinned by the child.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
35ef10da65d43211f4cd7e7822cbb3becdfc0ae1 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: rename cgroup_subsys_state->dput_work and its callback function

css (cgroup_subsys_state) will become RCU protected and there will be
two stages which require punting to work item during release. To
prepare for using the work item for multiple times, rename
css->dput_work to css->destroy_work and css_dput_fn() to
css_free_work_fn() and move work item initialization from css init to
right before the actual usage.

This reorganization doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
40e93b39cd5b6a347333a95152ce37deef37bbd0 13-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: always use cgroup_css()

cgroup_css() is the accessor for cgroup->subsys[] but is not used
consistently. cgroup->subsys[] will become RCU protected and
cgroup_css() will grow synchronization sanity checks. In preparation,
make all cgroup->subsys[] dereferences use cgroup_css() consistently.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
bd8815a6d802fc16a7a106e170593aa05dc17e72 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make css_for_each_descendant() and friends include the origin css in the iteration

Previously, all css descendant iterators didn't include the origin
(root of subtree) css in the iteration. The reasons were maintaining
consistency with css_for_each_child() and that at the time of
introduction more use cases needed skipping the origin anyway;
however, given that css_is_descendant() considers self to be a
descendant, omitting the origin css has become more confusing and
looking at the accumulated use cases rather clearly indicates that
including origin would result in simpler code overall.

While this is a change which can easily lead to subtle bugs, cgroup
API including the iterators has recently gone through major
restructuring and no out-of-tree changes will be applicable without
adjustments making this a relatively acceptable opportunity for this
type of change.

The conversions are mostly straight-forward. If the iteration block
had explicit origin handling before or after, it's moved inside the
iteration. If not, if (pos == origin) continue; is added. Some
conversions add extra reference get/put around origin handling by
consolidating origin handling and the rest. While the extra ref
operations aren't strictly necessary, this shouldn't cause any
noticeable difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
95109b627ba6a043c181fa5fa45d1c754dd44fbc 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: unexport cgroup_css()

cgroup_css() no longer has any user left outside cgroup.c proper and
we don't want subsystems to grow new usages of the function. cgroup
core should always provide the css to use to the subsystems, which
will make dynamic creation and destruction of css's across the
lifetime of a cgroup much more manageable than exposing the cgroup
directly to subsystems and let them dereference css's from it.

Make cgroup_css() a static function in cgroup.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d99c8727e7bbc01b70e2c57e6127bfab26b868fd 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup

cgroup is in the process of converting to css (cgroup_subsys_state)
from cgroup as the principal subsystem interface handle. This is
mostly to prepare for the unified hierarchy support where css's will
be created and destroyed dynamically but also helps cleaning up
subsystem implementations as css is usually what they are interested
in anyway.

cgroup_taskset which is used by the subsystem attach methods is the
last cgroup subsystem API which isn't using css as the handle. Update
cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

The conversions are pretty mechanical. One exception is
cpuset::cgroup_cs(), which lost its last user and got removed.

This patch shouldn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
81eeaf0411204f52af8ef78ff107cfca2fcfec1d 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state instead of cgroup

cgroup is in the process of converting to css (cgroup_subsys_state)
from cgroup as the principal subsystem interface handle. This is
mostly to prepare for the unified hierarchy support where css's will
be created and destroyed dynamically but also helps cleaning up
subsystem implementations as css is usually what they are interested
in anyway.

cftype->[un]register_event() is among the remaining couple interfaces
which still use struct cgroup. Convert it to cgroup_subsys_state.
The conversion is mostly mechanical and removes the last users of
mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed.

v2: indentation update as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
72ec7029937f0518eff21b8762743c31591684f5 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make task iterators deal with cgroup_subsys_state instead of cgroup

cgroup is in the process of converting to css (cgroup_subsys_state)
from cgroup as the principal subsystem interface handle. This is
mostly to prepare for the unified hierarchy support where css's will
be created and destroyed dynamically but also helps cleaning up
subsystem implementations as css is usually what they are interested
in anyway.

This patch converts task iterators to deal with css instead of cgroup.
Note that under unified hierarchy, different sets of tasks will be
considered belonging to a given cgroup depending on the subsystem in
question and making the iterators deal with css instead cgroup
provides them with enough information about the iteration.

While at it, fix several function comment formats in cpuset.c.

This patch doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
e535837b1dae17b5a2d76ea1bc22ac1a79354624 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: remove struct cgroup_scanner

cgroup_scan_tasks() takes a pointer to struct cgroup_scanner as its
sole argument and the only function of that struct is packing the
arguments of the function call which are consisted of five fields.
It's not too unusual to pack parameters into a struct when the number
of arguments gets excessive or the whole set needs to be passed around
a lot, but neither holds here making it just weird.

Drop struct cgroup_scanner and pass the params directly to
cgroup_scan_tasks(). Note that struct cpuset_change_nodemask_arg was
added to cpuset.c to pass both ->cs and ->newmems pointer to
cpuset_change_nodemask() using single data pointer.

This doesn't make any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
c59cd3d840b1b0a8f996cbbd9132128dcaabbeb9 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_task_iter remember the cgroup being iterated

Currently all cgroup_task_iter functions require @cgrp to be passed
in, which is superflous and increases chance of usage error. Make
cgroup_task_iter remember the cgroup being iterated and drop @cgrp
argument from next and end functions.

This patch doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
0942eeeef68f9493c1bcb1a52baf612b73fcf9fb 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: rename cgroup_iter to cgroup_task_iter

cgroup now has multiple iterators and it's quite confusing to have
something which walks over tasks of a single cgroup named cgroup_iter.
Let's rename it to cgroup_task_iter.

While at it, reformat / update comments and replace the overview
comment above the interface function decls with proper function
comments. Such overview can be useful but function comments should be
more than enough here.

This is pure rename and doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
d515876e9d951d8cf7fc7c90db2967664bdc89ee 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: relocate cgroup_advance_iter()

For some reason, cgroup_advance_iter() is standing lonely all away
from its iter comrades. Relocate it.

This is cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
492eb21b98f88e411a8bb43d6edcd7d7022add10 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup

cgroup is currently in the process of transitioning to using css
(cgroup_subsys_state) as the primary handle instead of cgroup in
subsystem API. For hierarchy iterators, this is beneficial because

* In most cases, css is the only thing subsystems care about anyway.

* On the planned unified hierarchy, iterations for different
subsystems will need to skip over different subtrees of the
hierarchy depending on which subsystems are enabled on each cgroup.
Passing around css makes it unnecessary to explicitly specify the
subsystem in question as css is intersection between cgroup and
subsystem

* For the planned unified hierarchy, css's would need to be created
and destroyed dynamically independent from cgroup hierarchy. Having
cgroup core manage css iteration makes enforcing deref rules a lot
easier.

Most subsystem conversions are straight-forward. Noteworthy changes
are

* blkio: cgroup_to_blkcg() is no longer used. Removed.

* freezer: cgroup_freezer() is no longer used. Removed.

* devices: cgroup_to_devcgroup() is no longer used. Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
f48e3924dca268c677c4e338e5d91ad9e6fe6b9e 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: always use cgroup_next_child() to walk the children list

There are several places where the children list is accessed directly.
This patch converts those places to use cgroup_next_child(). This
will help updating the hierarchy iterators to use @css instead of
@cgrp.

While cgroup_next_child() can be heavy in pathological cases - e.g. a
lot of dead children, this shouldn't cause any noticeable behavior
differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3b287a505ef4024634beb12a93773254909d5dae 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: convert cgroup_next_sibling() to cgroup_next_child()

cgroup is transitioning to using css (cgroup_subsys_state) as the main
subsys interface handle instead of cgroup and the iterators will be
updated to use css too. The iterators need to walk the cgroup
hierarchy and return the css's matching the origin css, which is a bit
cumbersome to open code.

This patch converts cgroup_next_sibling() to cgroup_next_child() so
that it can handle all steps of direct child iteration. This will be
used to update iterators to take @css instead of @cgrp. In addition
to the new iteration init handling, cgroup_next_child() is
restructured so that the different branches share the end of iteration
condition check.

This patch doesn't change any behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
182446d087906de40e514573a92a97b203695f71 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: pass around cgroup_subsys_state instead of cgroup in file methods

cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup.
Please see the previous commit which converts the subsystem methods
for rationale.

This patch converts all cftype file operations to take @css instead of
@cgroup. cftypes for the cgroup core files don't have their subsytem
pointer set. These will automatically use the dummy_css added by the
previous patch and can be converted the same way.

Most subsystem conversions are straight forwards but there are some
interesting ones.

* freezer: update_if_frozen() is also converted to take @css instead
of @cgroup for consistency. This will make the code look simpler
too once iterators are converted to use css.

* memory/vmpressure: mem_cgroup_from_css() needs to be exported to
vmpressure while mem_cgroup_from_cont() can be made static.
Updated accordingly.

* cpu: cgroup_tg() doesn't have any user left. Removed.

* cpuacct: cgroup_ca() doesn't have any user left. Removed.

* hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
Removed.

* net_cls: cgrp_cls_state() doesn't have any user left. Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
67f4c36f83455b253445b2cb28ac9a2c4f85d99a 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: add cgroup->dummy_css

cgroup subsystem API is being converted to use css
(cgroup_subsys_state) as the main handle, which makes things a bit
awkward for subsystem agnostic core features - the "cgroup.*"
interface files and various iterations - a bit awkward as they don't
have a css to use.

This patch adds cgroup->dummy_css which has NULL ->ss and whose only
role is pointing back to the cgroup. This will be used to support
subsystem agnostic features on the coming css based API.

css_parent() is updated to handle dummy_css's. Note that css will
soon grow its own ->parent field and css_parent() will be made
trivial.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
f7d58818ba4249f04a83b73aaac135640050bb4f 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: pin cgroup_subsys_state when opening a cgroupfs file

Previously, each file read/write operation relied on the inode
reference count pinning the cgroup and simply checked whether the
cgroup was marked dead before proceeding to invoke the per-subsystem
callback. This was rather silly as it didn't have any synchronization
or css pinning around the check and the cgroup may be removed and all
css refs drained between the DEAD check and actual method invocation.

This patch pins the css between open() and release() so that it is
guaranteed to be alive for all file operations and remove the silly
DEAD checks from cgroup_file_read/write().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2bb566cb68dfafad328af666ebadf0e49accd6ca 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: add subsys backlink pointer to cftype

cgroup is transitioning to using css (cgroup_subsys_state) instead of
cgroup as the primary subsystem handle. The cgroupfs file interface
will be converted to use css's which requires finding out the
subsystem from cftype so that the matching css can be determined from
the cgroup.

This patch adds cftype->ss which points to the subsystem the file
belongs to. The field is initialized while a cftype is being
registered. This makes it unnecessary to explicitly specify the
subsystem for other cftype handling functions. @ss argument dropped
from various cftype handling functions.

This patch shouldn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
eb95419b023abacb415e2a18fea899023ce7624d 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods

cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup *
in subsystem implementations for the following reasons.

* With unified hierarchy, subsystems will be dynamically bound and
unbound from cgroups and thus css's (cgroup_subsys_state) may be
created and destroyed dynamically over the lifetime of a cgroup,
which is different from the current state where all css's are
allocated and destroyed together with the associated cgroup. This
in turn means that cgroup_css() should be synchronized and may
return NULL, making it more cumbersome to use.

* Differing levels of per-subsystem granularity in the unified
hierarchy means that the task and descendant iterators should behave
differently depending on the specific subsystem the iteration is
being performed for.

* In majority of the cases, subsystems only care about its part in the
cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
often obtain the matching css pointer from the cgroup and don't
bother with the cgroup pointer itself. Passing around css fits
much better.

This patch converts all cgroup_subsys methods to take @css instead of
@cgroup. The conversions are mostly straight-forward. A few
noteworthy changes are

* ->css_alloc() now takes css of the parent cgroup rather than the
pointer to the new cgroup as the css for the new cgroup doesn't
exist yet. Knowing the parent css is enough for all the existing
subsystems.

* In kernel/cgroup.c::offline_css(), unnecessary open coded css
dereference is replaced with local variable access.

This patch shouldn't cause any behavior differences.

v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
with local variable @css as suggested by Li Zefan.

Rebased on top of new for-3.12 which includes for-3.11-fixes so
that ->css_free() invocation added by da0a12caff ("cgroup: fix a
leak when percpu_ref_init() fails") is converted too. Suggested
by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Steven Rostedt <rostedt@goodmis.org>
72c97e54e0f043d33b246d7460ae0a36c4b8c643 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: add subsystem pointer to cgroup_subsys_state

Currently, given a cgroup_subsys_state, there's no way to find out
which subsystem the css is for, which we'll need to convert the cgroup
controller API to primarily use @css instead of @cgroup. This patch
adds cgroup_subsys_state->ss which points to the subsystem the @css
belongs to.

While at it, remove the comment about accessing @css->cgroup to
determine the hierarchy. cgroup core will provide API to traverse
hierarchy of css'es and we don't want subsystems to directly walk
cgroup hierarchies anymore.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
8af01f56a03e9cbd91a55d688fce1315021efba8 09-Aug-2013 Tejun Heo <tj@kernel.org> cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/

The names of the two struct cgroup_subsys_state accessors -
cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
The former clashes with the type name and the latter doesn't even
indicate it's somehow related to cgroup.

We're about to revamp large portion of cgroup API, so, let's rename
them so that they're less awkward. Most per-controller usages of the
accessors are localized in accessor wrappers and given the amount of
scheduled changes, this isn't gonna add any noticeable headache.

Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
to task_css(). This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b395890a092d8ecbe54f005179e3dec4b6bf752a 01-Aug-2013 Li Zefan <lizefan@huawei.com> cgroup: rename cgroup_pidlist->mutex

It's a rw_semaphore not a mutex.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
876ede8b2b9880615be0de3ec7b8afd0a1786e76 01-Aug-2013 Li Zefan <lizefan@huawei.com> cgroup: restructure the failure path in cgroup_write_event_control()

It uses a single label and checks the validity of each pointer. This
is err-prone, and actually we had a bug because one of the check was
insufficient.

Use multi lables as we do in other places.

v2:
- drop initializations of local variables.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
4e96ee8e981b5140a2bcc5fff0d5c0eef39a62ee 31-Jul-2013 Li Zefan <lizefan@huawei.com> cgroup: convert cgroup_ida to cgroup_idr

This enables us to lookup a cgroup by its id.

v4:
- add a comment for idr_remove() in cgroup_offline_fn().

v3:
- on success, idr_alloc() returns the id but not 0, so fix the BUG_ON()
in cgroup_init().
- pass the right value to idr_alloc() so that the id for dummy cgroup is 0.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
6f4b7e632d78c2d91502211c430722cc66428492 31-Jul-2013 Li Zefan <lizefan@huawei.com> cgroup: more naming cleanups

Constantly use @cset for css_set variables and use @cgrp as cgroup
variables.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
e0798ce27346edb8aa369b5b39af5a47fdf2b25c 31-Jul-2013 Li Zefan <lizefan@huawei.com> cgroup: remove struct cgroup_seqfile_state

We can use struct cfent instead.

v2:
- remove cgroup_seqfile_release().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2a4ac63333584b2791986cf2270f5ba9a4b97606 31-Jul-2013 Li Zefan <lizefan@huawei.com> cgroup: remove sparse tags from offline_css()

This should have been removed in commit d7eeac1913ff
("cgroup: hold cgroup_mutex before calling css_offline").

While at it, update the comments.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
da0a12caffad2eeadea429f83818408e7b77379a 31-Jul-2013 Li Zefan <lizefan@huawei.com> cgroup: fix a leak when percpu_ref_init() fails

ss->css_free() is not called when perfcpu_ref_init() fails.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a698b4488ab98deef6c3beeba3e27fea17650132 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: remove gratuituous BUG_ON()s from rebind_subsystems()

rebind_subsystems() performs santiy checks even on subsystems which
aren't specified to be added or removed and the checks aren't all that
useful given that these are in a very cold path while the violations
they check would trip up in much hotter paths.

Let's remove these from rebind_subsystems().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1d5be6b287c8efc879fbe578e2b7bc8f7a38f313 12-Jul-2013 Tejun Heo <tj@kernel.org> cgroup: move module ref handling into rebind_subsystems()

Module ref handling in cgroup is rather weird.
parse_cgroupfs_options() grabs all the modules for the specified
subsystems. A module ref is kept if the specified subsystem is newly
bound to the hierarchy. If not, or the operation fails, the refs are
dropped. This scatters module ref handling across multiple functions
making it difficult to track. It also make the function nasty to use
for dynamic subsystem binding which is necessary for the planned
unified hierarchy.

There's nothing which requires the subsystem modules to be pinned
between parse_cgroupfs_options() and rebind_subsystems() in both mount
and remount paths. parse_cgroupfs_options() can just parse and
rebind_subsystems() can handle pinning the subsystems that it wants to
bind, which is a natural part of its task - binding - anyway.

Move module ref handling into rebind_subsystems() which makes the code
a lot simpler - modules are gotten iff it's gonna be bound and put iff
unbound or binding fails.

v2: Li pointed out that if a controller module is unloaded between
parsing and binding, rebind_subsystems() won't notice the missing
controller as it only iterates through existing controllers. Fix
it by updating rebind_subsystems() to compare @added_mask to
@pinned and fail with -ENOENT if they don't match.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
786e1448d9c5d2a469bcc9d2aecacd418ee1aca0 14-Jul-2013 Al Viro <viro@zeniv.linux.org.uk> cgroup: we can use simple_lookup() now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
913ffdb54366f94eec65c656cae8c6e00e1ab1b0 12-Jul-2013 Tejun Heo <tj@kernel.org> cgroup: replace task_cgroup_path_from_hierarchy() with task_cgroup_path()

task_cgroup_path_from_hierarchy() was added for the planned new users
and none of the currently planned users wants to know about multiple
hierarchies. This patch drops the multiple hierarchy part and makes
it always return the path in the first non-dummy hierarchy.

As unified hierarchy will always have id 1, this is guaranteed to
return the path for the unified hierarchy if mounted; otherwise, it
will return the path from the hierarchy which happens to occupy the
lowest hierarchy id, which will usually be the first hierarchy mounted
after boot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jan Kaluža <jkaluza@redhat.com>
f172e67cf9d842bc646d0f66792e38435a334b1e 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: move number_of_cgroups test out of rebind_subsystems() into cgroup_remount()

rebind_subsystems() currently fails if the hierarchy has any !root
cgroups; however, on the planned unified hierarchy,
rebind_subsystems() will be used while populated. Move the test to
cgroup_remount(), which is the only place the test is necessary
anyway.

As it's impossible for the other two callers of rebind_subsystems() to
have populated hierarchy, this doesn't make any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3126121fb30941552b1a806c7c2e686bde57e270 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: make rebind_subsystems() handle file additions and removals with proper error handling

Currently, creating and removing cgroup files in the root directory
are handled separately from the actual subsystem binding and unbinding
which happens in rebind_subsystems(). Also, rebind_subsystems() users
aren't handling file creation errors properly. Let's integrate
top_cgroup file handling into rebind_subsystems() so that it's simpler
to use and everyone handles file creation errors correctly.

* On a successful return, rebind_subsystems() is guaranteed to have
created all files of the new subsystems and deleted the ones
belonging to the removed subsystems. After a failure, no file is
created or removed.

* cgroup_remount() no longer needs to make explicit populate/clear
calls as it's all handled by rebind_subsystems(), and it gets proper
error handling automatically.

* cgroup_mount() has been updated such that the root dentry and cgroup
are linked before rebind_subsystems(). Also, the init_cred dancing
and base file handling are moved right above rebind_subsystems()
call and proper error handling for the base files is added. While
at it, add a comment explaining what's going on with the cred thing.

* cgroup_kill_sb() calls rebind_subsystems() to unbind all subsystems
which now implies removing all subsystem files which requires the
directory's i_mutex. Grab it. This means that files on the root
cgroup are removed earlier - they used to be deleted from generic
super_block cleanup from vfs. This doesn't lead to any functional
difference and it's cleaner to do the clean up explicitly for all
files.

Combined with the previous changes, this makes all cgroup file
creation errors handled correctly.

v2: Added comment on init_cred.

v3: Li spotted that cgroup_mount() wasn't freeing tmp_links after base
file addition failure. Fix it by adding free_tmp_links error
handling label.

v4: v3 introduced build bugs which got noticed by Fengguang's awesome
kbuild test robot. Fixed, and shame on me.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
b420ba7db15659253d4f286a0ba479d336371999 12-Jul-2013 Tejun Heo <tj@kernel.org> cgroup: use for_each_subsys() instead of for_each_root_subsys() in cgroup_populate/clear_dir()

rebind_subsystems() will be updated to handle file creations and
removals with proper error handling and to do that will need to
perform file operations before actually adding the subsystem to the
hierarchy.

To enable such usage, update cgroup_populate/clear_dir() to use
for_each_subsys() instead of for_each_root_subsys() so that they
operate on all subsystems specified by @subsys_mask whether that
subsystem is currently bound to the hierarchy or not.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
bee550994f6b0c1179bd3ccea58dc5c2c4ccf842 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: update error handling in cgroup_populate_dir()

cgroup_populate_dir() didn't use to check whether the actual file
creations were successful and could return success with only subset of
the requested files created, which is nasty.

This patch udpates cgroup_populate_dir() so that it either succeeds
with all files or fails with no file.

v2: The original patch also converted for_each_root_subsys() usages to
for_each_subsys() without explaining why. That part has been
moved to a separate patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
628f7cd47ab758cae0353d1a6decf3d1459dca24 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: separate out cgroup_base_files[] handling out of cgroup_populate/clear_dir()

cgroup_populate/clear_dir() currently take @base_files and adds and
removes, respectively, cgroup_base_files[] to the directory. File
additions and removals are being reorganized for proper error handling
and more dynamic handling for the unified hierarchy, and mixing base
and subsys file handling into the same functions gets a bit confusing.

This patch moves base file handling out of cgroup_populate/clear_dir()
into their users - cgroup_mount(), cgroup_create() and
cgroup_destroy_locked().

Note that this changes the behavior of base file removal. If
@base_files is %true, cgroup_clear_dir() used to delete files
regardless of cftype until there's no files left. Now, only files
with matching cfts are removed. As files can only be created by the
base or registered cftypes, this shouldn't result in any behavior
difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9ccece80ae19ed42439fc0ced76858f189cd41e8 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroup_add_cftypes() error handling

cgroup_add_cftypes() uses cgroup_cfts_commit() to actually create the
files; however, both functions ignore actual file creation errors and
just assume success. This can lead to, for example, blkio hierarchy
with some of the cgroups with only subset of interface files populated
after cfq-iosched is loaded under heavy memory pressure, which is
nasty.

This patch updates cgroup_cfts_commit() and cgroup_add_cftypes() to
guarantee that all files are created on success and no file is created
on failure.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b1f28d3109349899e87377e89f9d8ab5bc95ec57 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: fix error path of cgroup_addrm_files()

cgroup_addrm_files() mishandled error return value from
cgroup_add_file() and returns error iff the last file fails to create.
As we're in the process of cleaning up file add/rm error handling and
will reliably propagate file creation failures, there's no point in
keeping adding files after a failure.

Replace the broken error collection logic with immediate error return.
While at it, add lockdep assertions and function comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
8f89140ae41ccd9c63344e6823faa862aa7435e3 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: minor updates around cgroup_clear_directory()

* Rename it to cgroup_clear_dir() and make it take the pointer to the
target cgroup instead of the the dentry. This makes the function
consistent with its counterpart - cgroup_populate_dir().

* Move cgroup_clear_directory() invocation from cgroup_d_remove_dir()
to cgroup_remount() so that the function doesn't have to determine
the cgroup pointer back from the dentry. cgroup_d_remove_dir() now
only deals with vfs, which is slightly cleaner.

This patch doesn't introduce any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
c7ba8287cd11f2fc9e2feee9e1fac34b7293658f 29-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy

0ce6cba357 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when
comparing mount options") only updated the remount path but
CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options
while mounting an existing hierarchy. As option mismatch triggers a
warning but doesn't fail the mount without sane_behavior, this only
triggers a spurious warning message.

Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new
and existing root options.

Signed-off-by: Tejun Heo <tj@kernel.org>
0ce6cba35777cf96a54ce0d5856dc962566b8717 28-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options

1672d04070 ("cgroup: fix cgroupfs_root early destruction path")
introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
subsys binding on a new root; however, this broke remounts.
cgroup_remount() doesn't allow changing root options via remount and
CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
makes the function reject all remounts.

Fix it by putting the options part in the lower 16 bits of root->flags
and masking the comparions. While at it, make cgroup_remount() emit
an error message explaining why it's rejecting a remount request, so
that it's less of a mystery.

Signed-off-by: Tejun Heo <tj@kernel.org>
e2bd416f6246d11be29999c177d2534943a5c2df 28-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts()

eb178d06332 ("cgroup: grab cgroup_mutex in
drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
grab cgroup_mutex to make lockdep assertion in for_each_subsys()
happy. Unfortunately, cgroup_remount() calls the function while
holding cgroup_mutex in its failure path leading to the following
deadlock.

# mount -t cgroup -o remount,memory,blkio cgroup blkio

cgroup: option changes via remount are deprecated (pid=525 comm=mount)

=============================================
[ INFO: possible recursive locking detected ]
3.10.0-rc4-work+ #1 Not tainted
---------------------------------------------
mount/525 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0

but task is already holding lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(cgroup_mutex);
lock(cgroup_mutex);

*** DEADLOCK ***

May be due to missing lock nesting notation

4 locks held by mount/525:
#0: (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
#1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
#2: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
#3: (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200

stack backtrace:
CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
Call Trace:
[<ffffffff81c24bb1>] dump_stack+0x19/0x1b
[<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
[<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
[<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
[<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
[<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
[<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
[<ffffffff811e9d41>] do_mount+0x5f1/0xa30
[<ffffffff811ea203>] SyS_mount+0x83/0xc0
[<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b

Fix it by moving the drop_parsed_module_refcounts() invocation outside
cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
a4ea1cc90604df08d471ae84eb9627319d10c844 22-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: always use RCU accessors for protected accesses

kernel/cgroup.c still has places where a RCU pointer is set and
accessed directly without going through RCU_INIT_POINTER() or
rcu_dereference_protected(). They're all properly protected accesses
so nothing is broken but it leads to spurious sparse RCU address space
warnings.

Substitute direct accesses with RCU_INIT_POINTER() and
rcu_dereference_protected(). Note that %true is specified as the
extra condition for all derference updates. This isn't ideal as all
it does is suppressing warning without actually policing
synchronization rules; however, most are scheduled to be removed
pretty soon along with css_id itself, so no reason to be more
elaborate.

Combined with the previous changes, this removes all RCU related
sparse warnings from cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by; Li Zefan <lizefan@huawei.com>
a8ad805cfde00be8fe3b3dae8890996dbeb91e2c 22-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: fix RCU accesses around task->cgroups

There are several places in kernel/cgroup.c where task->cgroups is
accessed and modified without going through proper RCU accessors.
None is broken as they're all lock protected accesses; however, this
still triggers sparse RCU address space warnings.

* Consistently use task_css_set() for task->cgroups dereferencing.

* Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
exit.

* Remove unnecessary rcu_dereference_raw() from cset->subsys[]
dereference in cgroup_exit().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>
eb178d063324d9c30f673db3877b892a48ade21e 26-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: grab cgroup_mutex in drop_parsed_module_refcounts()

This isn't strictly necessary as all subsystems specified in
@subsys_mask are guaranteed to be pinned; however, it does spuriously
trigger lockdep warning. Let's grab cgroup_mutex around it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1672d040709b789671c0502e7aac9d632c2f9175 26-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: fix cgroupfs_root early destruction path

cgroupfs_root used to have ->actual_subsys_mask in addition to
->subsys_mask. a8a648c4ac ("cgroup: remove
cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
essentially temporary and doesn't belong in cgroupfs_root; however,
the patch made it impossible to tell whether a cgroupfs_root actually
has the subsystems bound or just have the bits set leading to the
following BUG when trying to mount with subsystems which are already
mounted elsewhere.

kernel BUG at kernel/cgroup.c:1038!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
...
CPU: 1 PID: 7973 Comm: mount Tainted: G W 3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
RIP: 0010:[<ffffffff81249b29>] [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
...
Call Trace:
[<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
[<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
[<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
[<ffffffff81257169>] cpuset_mount+0xd9/0x110
[<ffffffff813d2580>] mount_fs+0xb0/0x2d0
[<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
[<ffffffff814070b5>] do_new_mount+0x145/0x2c0
[<ffffffff814085d6>] do_mount+0x356/0x3c0
[<ffffffff8140873d>] SyS_mount+0xfd/0x140
[<ffffffff854eb600>] tracesys+0xdd/0xe2

We still want rebind_subsystems() to take added/removed masks, so
let's fix it by marking whether a cgroupfs_root has finished binding
or not. Also, document what's going on around ->subsys_mask
initialization so that similar mistakes aren't repeated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Li Zefan <lizefan@huawei.com>
fc76df706123602214da494ba98bccea83e2cfff 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy

Before 1a57423166 ("cgroup: make hierarchy_id use cyclic idr"),
hierarchy IDs were allocated from 0. As the dummy hierarchy was
always the one first initialized, it got assigned 0 and all other
hierarchies from 1. The patch accidentally changed the minimum
useable ID to 2.

Let's restore ID 0 for dummy_root and while at it reserve 1 for
unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
30159ec7a9db7f3c91e2b27e66389c49302efd5c 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: implement for_each_[builtin_]subsys()

There are quite a few places where all loaded [builtin] subsys are
iterated. Implement for_each_[builtin_]subsys() and replace manual
iterations with those to simplify those places a bit. The new
iterators automatically skip NULL subsystems. This shouldn't cause
any functional difference.

Iteration loops which scan all subsystems and then skipping modular
ones explicitly are converted to use for_each_builtin_subsys().

While at it, reorder variable declarations and adjust whitespaces a
bit in the affected functions.

v2: Add lockdep_assert_held() in for_each_subsys() and add comments
about synchronization as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
82fe9b0da0d50e2795a49c268676fd132cbc3eea 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: move init_css_set initialization inside cgroup_mutex

cgroup_init() was doing init_css_set initialization outside
cgroup_mutex, which is fine but we want to add lockdep annotation on
subsystem iterations and cgroup_init() will trigger it spuriously.
Move init_css_set initialization inside cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5549c497913ad860d3dff4386c6423268bb85693 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: s/for_each_subsys()/for_each_root_subsys()/

for_each_subsys() walks over subsystems attached to a hierarchy and
we're gonna add iterators which walk over all available subsystems.
Rename for_each_subsys() to for_each_root_subsys() so that it's more
appropriately named and for_each_subsys() can be used to iterate all
subsystems.

While at it, remove unnecessary underbar prefix from macro arguments,
put them inside parentheses, and adjust indentation for the two
for_each_*() macros.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b326f9d0dbd066b0aafbe88e6011a680a36de6e8 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: clean up find_css_set() and friends

find_css_set() passes uninitialized on-stack template[] array to
find_existing_css_set() which sets the entries for all subsystems.
Passing around an uninitialized array is a bit icky and we want to
introduce an iterator which only iterates loaded subsystems. Let's
initialize it on definition.

While at it, also make the following cosmetic cleanups.

* Convert to proper /** comments.

* Reorder variable declarations.

* Replace comment on synchronization with lockdep_assert_held().

This patch doesn't make any functional differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a8a648c4acee2095262f7fa65b0d8a68a03c32e4 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: remove cgroup->actual_subsys_mask

cgroup curiously has two subsystem masks, ->subsys_mask and
->actual_subsys_mask. The latter only exists because the new target
subsys_mask is passed into rebind_subsystems() via @root>subsys_mask.
rebind_subsystems() needs to know what the current mask is to decide
how to reach the target mask so ->actual_subsys_mask is used as the
temp location to remember the current state.

Adding a temporary field to a permanent data structure is rather silly
and can be misleading. Update rebind_subsystems() to take @added_mask
and @removed_mask instead and remove @root->actual_subsys_mask.

This patch shouldn't introduce any behavior changes.

v2: Comment and description updated as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
9871bf9550d25e488cd2f0ce958d3f59f17fa720 25-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: prefix global variables with "cgroup_"

Global variable names in kernel/cgroup.c are asking for trouble -
subsys, roots, rootnode and so on. Rename them to have "cgroup_"
prefix.

* s/subsys/cgroup_subsys/

* s/rootnode/cgroup_dummy_root/

* s/dummytop/cgroup_cummy_top/

* s/roots/cgroup_roots/

* s/root_count/cgroup_root_count/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
03c78cbebb323fc97295ff97dc5e009d56371d57 14-Jun-2013 Li Zefan <lizefan@huawei.com> cgroup: rename cont to cgrp

Cont is short for container. control group was named process container
at first, but then people found container already has a meaning in
linux kernel.

Clean up the leftover variable name @cont.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
00356bd5f0f5e04183fb15805eb29e97c2fc20ac 18-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: clean up cgroup_serial_nr_cursor

cgroup_serial_nr_cursor was created atomic64_t because I thought it
was never gonna used for anything other than assigning unique numbers
to cgroups and didn't want to worry about synchronization; however,
now we're using it as an event-stamp to distinguish cgroups created
before and after certain point which assumes that it's protected by
cgroup_mutex.

Let's make it clear by making it a u64. Also, rename it to
cgroup_serial_nr_next and make it point to the next nr to allocate so
that where it's pointing to is clear and more conventional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
e8c82d20a9f729cf4b9f73043f7fd4e0872bebfd 18-Jun-2013 Li Zefan <lizefan@huawei.com> cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()

We used root->allcg_list to iterate cgroup hierarchy because at that time
cgroup_for_each_descendant_pre() hasn't been invented.

tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the
assignment right above releasing cgroup_mutex and explain what's
going on there.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
794611a1dfcb055d7d41ce133378dd8197d73e38 18-Jun-2013 Li Zefan <lizefan@huawei.com> cgroup: make serial_nr_cursor available throughout cgroup.c

The next patch will use it to determine if a cgroup is newly created
while we're iterating the cgroup hierarchy.

tj: Rephrased the comment on top of cgroup_serial_nr_cursor.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
f57947d27711451a7739a25bba6cddc8a385e438 18-Jun-2013 Li Zefan <lizefan@huawei.com> cgroup: fix memory leak in cgroup_rm_cftypes()

The memory allocated in cgroup_add_cftypes() should be freed. The
effect of this bug is we leak a bit memory everytime we unload
cfq-iosched module if blkio cgroup is enabled.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1c8158eeae0f37d0eee9f1fbe68080df6a408df2 18-Jun-2013 Li Zefan <lizefan@huawei.com> cgroup: fix umount vs cgroup_event_remove() race

commit 5db9a4d99b0157a513944e9a44d29c9cec2e91dc
Author: Tejun Heo <tj@kernel.org>
Date: Sat Jul 7 16:08:18 2012 -0700

cgroup: fix cgroup hierarchy umount race

This commit fixed a race caused by the dput() in css_dput_fn(), but
the dput() in cgroup_event_remove() can also lead to the same BUG().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
084457f284abf6789d90509ee11dae383842b23b 18-Jun-2013 Li Zefan <lizefan@huawei.com> cgroup: fix umount vs cgroup_cfts_commit() race

cgroup_cfts_commit() uses dget() to keep cgroup alive after cgroup_mutex
is dropped, but dget() won't prevent cgroupfs from being umounted. When
the race happens, vfs will see some dentries with non-zero refcnt while
umount is in process.

Keep running this:
mount -t cgroup -o blkio xxx /cgroup
umount /cgroup

And this:
modprobe cfq-iosched
rmmod cfs-iosched

After a while, the BUG() in shrink_dcache_for_umount_subtree() may
be triggered:

BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
6db8e85c5c1f89cd0183b76dab027c81009f129f 14-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: disallow rename(2) if sane_behavior

cgroup's rename(2) isn't a proper migration implementation - it can't
move the cgroup to a different parent in the hierarchy. All it can do
is swapping the name string for that cgroup. This isn't useful and
can mislead users to think that cgroup supports proper cgroup-level
migration. Disallow rename(2) if sane_behavior.

v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs
return value when ->rename is not implemented as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d3daf28da16a30af95bfb303189a634a87606725 14-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: use percpu refcnt for cgroup_subsys_states

A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().

- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
ea15f8ccdb430af1e8bc9b4e19a230eb4c356777 14-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: split cgroup destruction into two steps

Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item. The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent. The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name. As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine. A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both. This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes. INIT_WORK() is now performed right before queueing the work
item.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
455050d23e1bfc47ca98e943ad5b2f3a9bbe45fb 14-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: reorder the operations in cgroup_destroy_locked()

This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list. This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
6f3d828f0fb7fdaffc6f32cb8a1cb7fcf8824598 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: remove cgroup->count and use

cgroup->count tracks the number of css_sets associated with the cgroup
and used only to verify that no css_set is associated when the cgroup
is being destroyed. It's superflous as the destruction path can
simply check whether cgroup->cset_links is empty instead.

Drop cgroup->count and check ->cset_links directly from
cgroup_destroy_locked().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ddd69148bdc45e5e3e55bfde3571daecd5a96d75 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: drop unnecessary RCU dancing from __put_css_set()

__put_css_set() does RCU read access on @cgrp across dropping
@cgrp->count so that it can continue accessing @cgrp even if the count
reached zero and destruction of the cgroup commenced. Given that both
sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
is unnecessary. Just making cgroup_destroy_locked() grab css_set_lock
while checking @cgrp->count is enough.

Remove the RCU read locking from __put_css_set() and make
cgroup_destroy_locked() read-lock css_set_lock when checking
@cgrp->count. This will also allow removing @cgrp->count.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
54766d4a1d3d6f84ff8fa475cd8f165c0a0000eb 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: rename CGRP_REMOVED to CGRP_DEAD

We will add another flag indicating that the cgroup is in the process
of being killed. REMOVING / REMOVED is more difficult to distinguish
and cgroup_is_removing()/cgroup_is_removed() are a bit awkward. Also,
later percpu_ref usage will involve "kill"ing the refcnt.

s/CGRP_REMOVED/CGRP_DEAD/
s/cgroup_is_removed()/cgroup_is_dead()

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
f4f4be2bd2889c69a8698edef8dbfd4f6759aa87 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: use kzalloc() instead of kmalloc()

There's no point in using kmalloc() instead of the clearing variant
for trivial stuff. We can live dangerously elsewhere. Use kzalloc()
instead and drop 0 inits.

While at it, do trivial code reorganization in cgroup_file_open().

This patch doesn't introduce any functional changes.

v2: I was caught in the very distant past where list_del() didn't
poison and the initial version converted list_del()s to
list_del_init()s too. Li and Kent took me out of the stasis
chamber.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
69d0206c793a17431eacee2694ee7a4b25df76b7 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: bring some sanity to naming around cg_cgroup_link

cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides. The naming around this mapping is already confusing and struct
cg_cgroup_link exacerbates the situation quite a bit.

>From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list. From css_set side, it starts off ->cg_links and
runs through ->cg_link_list. This is rather reversed as
cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets. This patch cleans it up a bit.

* s/cgroup->css_sets/cgroup->cset_links/
s/css_set->cg_links/css_set->cgrp_links/
s/cgroup_iter->cg_link/cgroup_iter->cset_link/

* s/cg_cgroup_link/cgrp_cset_link/

* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/

* s/init_css_set_link/init_cgrp_cset_link/
s/free_cg_links/free_cgrp_cset_links/
s/allocate_cg_links/allocate_cgrp_cset_links/

* s/cgl[12]/link[12]/ in compare_css_sets()

* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
adustments.

* Comment and whiteline adjustments.

After the changes, we have

list_for_each_entry(link, &cont->cset_links, cset_link) {
struct css_set *cset = link->cset;

instead of

list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
struct css_set *cset = link->cg;

This patch is purely cosmetic.

v2: Fix broken sentences in the patch description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
5abb8855734fd7b3fa7f91c13916d0e35d99763c 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: consistently use @cset for struct css_set variables

cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but made much worse by the fact that there
are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion - @[old_]cg are struct
css_set while @cg[12] are struct cgroup.

It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough. Let's give it some sanity by
uniformly using @cset for all struct css_set variables.

* s/cg/cset/ for all css_set variables.

* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/. The same for the ones
prefixed with "new".

* s/cg/cgrp/ for cgroup variables in compare_css_sets().

* s/css/cset/ for the cgroup variable in task_cgroup_from_root().

* Whiteline adjustments.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
3fc3db9a3ae0ce108badf31a4a00e41b4236f5fc 13-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: remove now unused css_depth()

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d5c56ced775f6bdc32b689b01c9c4f9b66e18610 04-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: clean up the cftype array for the base cgroup files

* Rename it from files[] (really?) to cgroup_base_files[].

* Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
used inconsistently. Just use "cgroup." directly.

* Collect insane files at the end. Note that only the insane ones are
missing "cgroup." prefix.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cc5943a7816ba6c00639837a62131386619548dc 04-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: mark "notify_on_release" and "release_agent" cgroup files insane

The empty cgroup notification mechanism currently implemented in
cgroup is tragically outdated. Forking and execing userland process
stopped being a viable notification mechanism more than a decade ago.
We're gonna have a saner mechanism. Let's make it clear that this
abomination is going away.

Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
f12dc020149fad7087e119e54cffea668272bf7d 04-Jun-2013 Tejun Heo <tj@kernel.org> cgroup: mark "tasks" cgroup file as insane

Some resources controlled by cgroup aren't per-task and cgroup core
allowing threads of a single thread_group to be in different cgroups
forced memcg do explicitly find the group leader and use it. This is
gonna be nasty when transitioning to unified hierarchy and in general
we don't want and won't support granularity finer than processes.

Mark "tasks" with CFTYPE_INSANE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: cgroups@vger.kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
2a0ff3fbe39bc93f719ff857e5a359d9780579ff 26-May-2013 Jeff Liu <jeff.liu@oracle.com> cgroup: warn about mismatching options of a new mount of an existing hierarchy

With the new __DEVEL__sane_behavior mount option was introduced,
if the root cgroup is alive with no xattr function, to mount a
new cgroup with xattr will be rejected in terms of design which
just fine. However, if the root cgroup does not mounted with
__DEVEL__sane_hehavior, to create a new cgroup with xattr option
will succeed although after that the EA function does not works
as expected but will get ENOTSUPP for setting up attributes under
either cgroup. e.g.

setfattr: /cgroup2/test: Operation not supported

Instead of keeping silence in this case, it's better to drop a log
entry in warning level. That would be helpful to understand the
reason behind the scene from the user's perspective, and this is
essentially an improvement does not break the backward compatibilities.

With this fix, above mount attemption will keep up works as usual but
the following line cound be found at the system log:

[ ...] cgroup: new mount options do not match the existing superblock

tj: minor formatting / message updates.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
75501a6d59e989e5c286716e5b3b66ace4660e83 24-May-2013 Tejun Heo <tj@kernel.org> cgroup: update iterators to use cgroup_next_sibling()

This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.

The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped. Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle. Comments are updated
accordingly.

This makes the iterators easier to use and will simplify controllers.

Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
53fa5261747a90746531e8a1c81eeb78fedc2f71 24-May-2013 Tejun Heo <tj@kernel.org> cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()

Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.

It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.

This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.

This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.

Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.

v2: Typo fix as per Serge.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
bdc7119f1bdd0632d42f435941dc290216a436e7 24-May-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_is_removed() static

cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state. Make it static.

While at it, make it return bool.

Signed-off-by: Tejun Heo <tj@kernel.org>
7805d000db30a3787a4c969bab6ae4d8a5fd8ce6 24-May-2013 Tejun Heo <tj@kernel.org> cgroup: fix a subtle bug in descendant pre-order walk

When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty. This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.

There's no reason to have the early exit path. Remove it along with
the later assumption that the subtree isn't empty. This simplifies
the code a bit and fixes the subtle bug.

While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
857a2beb09ab83e9a8185821ae16db7dfbe8b837 15-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: implement task_cgroup_path_from_hierarchy()

kdbus folks want a sane way to determine the cgroup path that a given
task belongs to on a given hierarchy, which is a reasonble thing to
expect from cgroup core.

Implement task_cgroup_path_from_hierarchy().

v2: Dropped unnecessary NULL check on the return value of
task_cgroup_from_root() as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <greg@kroah.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <daniel@zonque.org>
1a574231669f8c3065c83974e9557fcbbd94b8a6 14-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: make hierarchy_id use cyclic idr

We want to be able to lookup a hierarchy from its id and cyclic
allocation is a whole lot simpler with idr. Convert to idr and use
idr_alloc_cyclc().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
54e7b4eb15fc4354d5ada5469e3db4a220ddb3ed 14-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: drop hierarchy_id_lock

Now that hierarchy_id alloc / free are protected by the cgroup
mutexes, there's no need for this separate lock. Drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
fa3ca07e96185aa1496b405472399a2a2a336a17 14-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: refactor hierarchy_id handling

We're planning to converting hierarchy_ida to an idr and use it to
look up hierarchy from its id. As we want the mapping to happen
atomically with cgroupfs_root registration, this patch refactors
hierarchy_id init / exit so that ida operations happen inside
cgroup_[root_]mutex.

* s/init_root_id()/cgroup_init_root_id()/ and make it return 0 or
-errno like a normal function.

* Move hierarchy_id initialization from cgroup_root_from_opts() into
cgroup_mount() block where the root is confirmed to be used and
being registered while holding both mutexes.

* Split cgroup_drop_id() into cgroup_exit_root_id() and
cgroup_free_root(), so that ID release can happen before dropping
the mutexes in cgroup_kill_sb(). The latter expects hierarchy_id to
be exited before being invoked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d6cbf35dac8a3dadb9103379820c96d7c85df3d9 14-May-2013 Li Zefan <lizefan@huawei.com> cgroup: initialize xattr before calling d_instantiate()

cgroup_create_file() calls d_instantiate(), which may decide to look
at the xattrs on the file. Smack always does this and SELinux can be
configured to do so.

But cgroup_add_file() didn't initialize xattrs before calling
cgroup_create_file(), which finally leads to dereferencing NULL
dentry->d_fsdata.

This bug has been there since cgroup xattr was introduced.

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Ivan Bulatovic <combuster@archlinux.us>
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
8d8b97ba499cb69fccb5fd9f2b439e3265fc3f27 20-Apr-2013 Al Viro <viro@zeniv.linux.org.uk> take cgroup_open() and cpuset_open() to fs/proc/base.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
6d2488f64a240191f0733c1f32d73607916b01b7 30-Apr-2013 Michal Hocko <mhocko@suse.cz> cgroup: remove css_get_next

Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in the place.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ying Han <yinghan@google.com>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7ef70e48735e17d2be5c8e8f85052842b16b923a 26-Apr-2013 Li Zefan <lizefan@huawei.com> cgroup: restore the call to eventfd->poll()

I mistakenly removed the call to eventfd->poll() while I was actually
intending to remove the return value...

Calling evenfd->poll() will hook cgroup_event_wake() to the poll
waitqueue, which will be called to unregister eventfd when rmdir a
cgroup or close eventfd.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
cc20e01cd607282d48f8ea538aba10fa850a4312 26-Apr-2013 Li Zefan <lizefan@huawei.com> cgroup: fix use-after-free when umounting cgroupfs

Try:
# mount -t cgroup xxx /cgroup
# mkdir /cgroup/sub && rmdir /cgroup/sub && umount /cgroup

And you might see this:

ida_remove called for id=1 which is not allocated.

It's because cgroup_kill_sb() is called to destroy root->cgroup_ida
and free cgrp->root before ida_simple_removed() is called. What's
worse is we're accessing cgrp->root while it has been freed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
712317ad97f41e738e1a19aa0a6392a78a84094e 19-Apr-2013 Li Zefan <lizefan@huawei.com> cgroup: fix broken file xattrs

We should store file xattrs in struct cfent instead of struct cftype,
because cftype is a type while cfent is object instance of cftype.

For example each cgroup has a tasks file, and each tasks file is
associated with a uniq cfent, but all those files share the same
struct cftype.

Alexey Kodanev reported a crash, which can be reproduced:

# mount -t cgroup -o xattr /sys/fs/cgroup
# mkdir /sys/fs/cgroup/test
# setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
# rmdir /sys/fs/cgroup/test
# umount /sys/fs/cgroup
oops!

In this case, simple_xattrs_free() will free the same struct simple_xattrs
twice.

tj: Dropped unused local variable @cft from cgroup_diput().

Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
05fb22ec5456a472a5eadcaacb3e51eca1f8c79c 15-Apr-2013 Li Zefan <lizefan@huawei.com> cgroup: remove cgrp->top_cgroup

It's not used, and it can be retrieved via cgrp->root->top_cgroup.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
873fe09ea5df6ccf6bb34811d8c9992aacb67598 15-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: introduce sane_behavior mount option

It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.

As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility. This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors. As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.

Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default. Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.

This patch introduces the mount option and changes the following
behaviors in cgroup core.

* Mount options "noprefix" and "clone_children" are disallowed. Also,
cgroupfs file cgroup.clone_children is not created.

* When mounting an existing superblock, mount options should match.
This is currently pretty crazy. If one mounts a cgroup, creates a
subdirectory, unmounts it and then mount it again with different
option, it looks like the new options are applied but they aren't.

* Remount is disallowed.

The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.

v2: Dropped unnecessary explicit file permission setting sane_behavior
cftype entry as suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
25a7e6848db76e22677aff202d9c4ef3503be15b 15-Apr-2013 Tejun Heo <tj@kernel.org> move cgroupfs_root to include/linux/cgroup.h

While controllers shouldn't be accessing cgroupfs_root directly, it
being hidden inside kern/cgroup.c makes somethings pretty silly. This
makes routing hierarchy-wide settings which need to be visible to
controllers cumbersome.

We're gonna add another hierarchy-wide setting which needs to be
accessed from controllers. Move cgroupfs_root and its flags to the
header file so that we can access root settings with inline helpers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
9343862945fdd3dd5cb7648bb24cabe40faa9ad9 15-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix

There's no reason to be using bitops, which tends to be more
cumbersome, to handle root flags. Convert them to masks. Also, as
they'll be moved to include/linux/cgroup.h and it's generally a good
idea, add CGRP_ prefix.

Note that flags are assigned from (1 << 1). The first bit will be
used by a flag which will be added soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
da1f296fd2bfd5ad3c53d72a1ece593e821cf374 14-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: make cgroup_path() not print double slashes

While reimplementing cgroup_path(), 65dff759d2 ("cgroup: fix
cgroup_path() vs rename() race") introduced a bug where the path of a
non-root cgroup would have two slahses at the beginning, which is
caused by treating the root cgroup which has the name '/' like
non-root cgroups.

$ grep systemd /proc/self/cgroup
1:name=systemd://user/root/1

Fix it by special casing root cgroup case and not looping over it in
the normal path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
26d5bbe5ba2073fc7ef9e69a55543b2376f5bad0 12-Apr-2013 Tejun Heo <tj@kernel.org> Revert "cgroup: remove bind() method from cgroup_subsys."

This reverts commit 84cfb6ab484b442d5115eb3baf9db7d74a3ea626. There
are scheduled changes which make use of the removed callback.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rami Rosen <ramirose@gmail.com>
Cc: Li Zefan <lizefan@huawei.com>
78574cf981cd3d9ae9f6adbd466a772310ec24ff 09-Apr-2013 Li Zefan <lizefan@huawei.com> cgroup: implement cgroup_is_descendant()

A couple controllers want to determine whether two cgroups are in
ancestor/descendant relationship. As it's more likely that the
descendant is the primary subject of interest and there are other
operations focusing on the descendants, let's ask is_descendent rather
than is_ancestor.

Implementation is trivial as the previous patch guarantees that all
ancestors of a cgroup stay accessible as long as the cgroup is
accessible.

tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
rewrote descriptions.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
415cf07a1c1c65249773330434878ae7bcd92d0f 08-Apr-2013 Li Zefan <lizefan@huawei.com> cgroup: make sure parent won't be destroyed before its children

Suppose we rmdir a cgroup and there're still css refs, this cgroup won't
be freed. Then we rmdir the parent cgroup, and the parent is freed
immediately due to css ref draining to 0. Now it would be a disaster if
the still-alive child cgroup tries to access its parent.

Make sure this won't happen.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
84cfb6ab484b442d5115eb3baf9db7d74a3ea626 10-Apr-2013 Rami Rosen <ramirose@gmail.com> cgroup: remove bind() method from cgroup_subsys.

The bind() method of cgroup_subsys is not used in any of the
controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
devices, perf, hugetlb, cpu and cpuacct)

tj: Removed the entry on ->bind() from
Documentation/cgroups/cgroups.txt. Also updated a couple
paragraphs which were suggesting that dynamic re-binding may be
implemented. It's not gonna.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
479f614110b889d5783acdaec865ede3cdb96b97 29-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: Kill subsys.active flag

The only user was cpuacct.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/5155385A.4040207@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2219449a65ace0290cd9c2260ff337e326b8be8a 07-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_lock_is_held()

We don't want controllers to assume that the information is officially
available and do funky things with it.

The only user is task_subsys_state_check() which uses it to verify RCU
access context. We can move cgroup_lock_is_held() inside
CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
to conditionally exposing cgroup_mutex.

Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
and use lockdep_is_held() directly on the mutex in
task_subsys_state_check().

While at it, add parentheses around macro arguments in
task_subsys_state_check().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
47cfcd0922454e49f4923b1e2d31a5bf199237c3 07-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: kill cgroup_[un]lock()

Now that locking interface is unexported, there's no reason to keep
around these thin wrappers. Kill them and use mutex operations
directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b9777cf8d7c7854c3c38bd6621d993b85c2afcdf 07-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: unexport locking interface and cgroup_attach_task()

Now that all external cgroup_lock() users are gone, we can finally
unexport the locking interface and prevent future abuse of
cgroup_mutex.

Make cgroup_[un]lock() and cgroup_lock_live_group() static. Also,
cgroup_attach_task() doesn't have any user left and can't be used
without locking interface anyway. Make it static too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
7ae1bad99e27b8838d480a24edf4646a2fc547df 07-Apr-2013 Tejun Heo <tj@kernel.org> cgroup: relocate cgroup_lock_live_group() and cgroup_attach_task_all()

cgroup_lock_live_group() and cgroup_attach_task() are scheduled to be
made static. Relocate the former and cgroup_attach_task_all() so that
we don't need forward declarations.

This patch is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
8cc9934520e7f752fe45d5199664d741ba24a932 07-Apr-2013 Tejun Heo <tj@kernel.org> cgroup, cpuset: replace move_member_tasks_to_cpuset() with cgroup_transfer_tasks()

When a cpuset becomes empty (no CPU or memory), its tasks are
transferred with the nearest ancestor with execution resources. This
is implemented using cgroup_scan_tasks() with a callback which grabs
cgroup_mutex and invokes cgroup_attach_task() on each task.

Both cgroup_mutex and cgroup_attach_task() are scheduled to be
unexported. Implement cgroup_transfer_tasks() in cgroup proper which
is essentially the same as move_member_tasks_to_cpuset() except that
it takes cgroups instead of cpusets and @to comes before @from like
normal functions with those arguments, and replace
move_member_tasks_to_cpuset() with it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
1e2ccd1c0f67c3f958d6139de2496787b9a57182 01-Apr-2013 Kevin Wilson <wkevils@gmail.com> cgroup: remove unused parameter in cgroup_task_migrate().

This patch removes unused parameter from cgroup_task_migrate().

Signed-off-by: Kevin Wilson <wkevils@gmail.com>
Acked-by: Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
081aa458c38ba576bdd4265fc807fa95b48b9e79 13-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()

These two functions share most of the code.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
3ac1707a13a3da9cfc8f242a15b2fae6df2c5f88 12-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: fix an off-by-one bug which may trigger BUG_ON()

The 3rd parameter of flex_array_prealloc() is the number of elements,
not the index of the last element.

The effect of the bug is, when opening cgroup.procs, a flex array will
be allocated and all elements of the array is allocated with
GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to
allocate memory for it, it'll trigger a BUG_ON().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
14a40ffccd6163bbcd1d6f32b28a88ffe6149fc6 19-Mar-2013 Tejun Heo <tj@kernel.org> sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY

PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself. Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.

What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks. In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.

This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.

This will allow simplifying workqueue worker CPU affinity management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
80f36c2a1a612ca419e5b864a7e4808e797d9feb 12-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: remove useless code in cgroup_write_event_control()

eventfd_poll() never returns POLLHUP.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6ee211ad0a22869af81eef10845922ac4dcb2d38 12-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: don't bother to resize pid array

When we open cgroup.procs, we'll allocate an buffer and store all tasks'
tgid in it, and then duplicate entries will be stripped. If that results
in a much smaller pid list, we'll re-allocate a smaller buffer.

But we've already sucessfully allocated memory and reading the procs
file is a short period and the memory will be freed very soon, so why
bother to re-allocate memory.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
d7eeac1913ff86a17f891cb4b73f03d4b94907d0 12-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: hold cgroup_mutex before calling css_offline()

cpuset no longer nests cgroup_mutex inside cpu_hotplug lock, so
we don't have to release cgroup_mutex before calling css_offline().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6dc01181eac16192dc4a5d1b310b78e2e97c003c 12-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: remove unused variables in cgroup_destroy_locked()

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
e7b2dcc52b0e2d598a469f01cc460ccdde6869f2 12-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: remove cgroup_is_descendant()

It was used by ns cgroup, and ns cgroup was removed long ago.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
7d8e0bf56a66bab08d2f316dd87e56c08cecb899 05-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: avoid accessing modular cgroup subsys structure without locking

subsys[i] is set to NULL in cgroup_unload_subsys() at modular unload,
and that's protected by cgroup_mutex, and then the memory *subsys[i]
resides will be freed.

So this is unsafe without any locking:

if (!ss || ss->module)
...

v2:
- add a comment for enum cgroup_subsys_id
- simplify the comment in cgroup_exit()

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
f50daa704f36a6544a902c52b6cf37b0493dfc5d 01-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: no need to check css refs for release notification

We no longer fail rmdir() when there're still css refs, so we don't
need to check css refs in check_for_release().

This also voids a bug. cgroup_has_css_refs() accesses subsys[i]
without cgroup_mutex, so it can race with cgroup_unload_subsys().

cgroup_has_css_refs()
...
if (ss == NULL || ss->root != cgrp->root)

if ss pointers to net_cls_subsys, and cls_cgroup module is unloaded
right after the former check but before the latter, the memory that
net_cls_subsys resides has become invalid.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
65dff759d2948cf18e2029fc5c0c595b8b7da3a5 01-Mar-2013 Li Zefan <lizefan@huawei.com> cgroup: fix cgroup_path() vs rename() race

rename() will change dentry->d_name. The result of this race can
be worse than seeing partially rewritten name, but we might access
a stale pointer because rename() will re-allocate memory to hold
a longer name.

As accessing dentry->name must be protected by dentry->d_lock or
parent inode's i_mutex, while on the other hand cgroup-path() can
be called with some irq-safe spinlocks held, we can't generate
cgroup path using dentry->d_name.

Alternatively we make a copy of dentry->d_name and save it in
cgrp->name when a cgroup is created, and update cgrp->name at
rename().

v5: use flexible array instead of zero-size array.
v4: - allocate root_cgroup_name and all root_cgroup->name points to it.
- add cgroup_name() wrapper.
v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
v2: make cgrp->name RCU safe.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
b67bfe0d42cac56c512dd5da4b1b347a23f4b70a 28-Feb-2013 Sasha Levin <sasha.levin@oracle.com> hlist: drop the node parameter from iterators

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d228d9ec2c9a119ce15c6446ebeec05786ab3287 28-Feb-2013 Tejun Heo <tj@kernel.org> cgroup: convert to idr_alloc()

Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c897ff68beec97768c31f9de97de93f77ff2d87e 28-Feb-2013 Tejun Heo <tj@kernel.org> cgroup: don't use idr_remove_all()

idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated. Drop its usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
496ad9aa8ef448058e36ca7a787c61f2e63f0f54 23-Jan-2013 Al Viro <viro@zeniv.linux.org.uk> new helper: file_inode(file)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
f169007b2773f285e098cb84c74aac0154d65ff7 18-Feb-2013 Li Zefan <lizefan@huawei.com> cgroup: fail if monitored file and event_control are in different cgroup

If we pass fd of memory.usage_in_bytes of cgroup A to cgroup.event_control
of cgroup B, then we won't get memory usage notification from A but B!

What's worse, if A and B are in different mount hierarchy, we'll end up
accessing NULL pointer!

Disallow this kind of invalid usage.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Tejun Heo <tj@kernel.org>
810cbee4fad570ff167132d4ecf247d99c48f71d 18-Feb-2013 Li Zefan <lizefan@huawei.com> cgroup: fix cgroup_rmdir() vs close(eventfd) race

commit 205a872bd6f9a9a09ef035ef1e90185a8245cc58 ("cgroup: fix lockdep
warning for event_control") solved a deadlock by introducing a new
bug.

Move cgrp->event_list to a temporary list doesn't mean you can traverse
this list locklessly, because at the same time cgroup_event_wake() can
be called and remove the event from the list. The result of this race
is disastrous.

We adopt the way how kvm irqfd code implements race-free event removal,
which is now described in the comments in cgroup_event_wake().

v3:
- call eventfd_signal() no matter it's eventfd close or cgroup removal
that removes the cgroup event.

Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
71b5707e119653039e6e95213f00479668c79b75 24-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: fix exit() vs rmdir() race

In cgroup_exit() put_css_set_taskexit() is called without any lock,
which might lead to accessing a freed cgroup:

thread1 thread2
---------------------------------------------
exit()
cgroup_exit()
put_css_set_taskexit()
atomic_dec(cgrp->count);
rmdir();
/* not safe !! */
check_for_release(cgrp);

rcu_read_lock() can be used to make sure the cgroup is alive.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
9ed8a659703876a9fe96ab86d1b296c2f0084242 24-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: remove bogus comments in cgroup_diput()

Since commit 48ddbe194623ae089cc0576e60363f2d2e85662a
("cgroup: make css->refcnt clearing on cgroup removal optional"),
each css holds a ref on cgroup's dentry, so cgroup_diput() won't be
called until all css' refs go down to 0, which invalids the comments.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
be44562613851235d801d41d5b3976dc4333f622 24-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: remove synchronize_rcu() from cgroup_diput()

Free cgroup via call_rcu(). The actual work is done through
workqueue.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
86a3db5643c7d29bb36ca85c7a4bb67ad4d88d77 24-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: remove duplicate RCU free on struct cgroup

When destroying a cgroup, though in cgroup_diput() we've called
synchronize_rcu(), we then still have to free it via call_rcu().

The story is, long ago to fix a race between reading /proc/sched_debug
and freeing cgroup, the code was changed to utilize call_rcu(). See
commit a47295e6bc42ad35f9c15ac66f598aa24debd4e2 ("cgroups: make
cgroup_path() RCU-safe")

As we've fixed cpu cgroup that cpu_cgroup_offline_css() is used
to unregister a task_group so there won't be concurrent access
to this task_group after synchronize_rcu() in diput(). Now we can
just kfree(cgrp).

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
fe1c06ca7523baa668c1eaf1e1016fa64753c32e 24-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: initialize cgrp->dentry before css_alloc()

With this change, we're guaranteed that cgroup_path() won't see NULL
cgrp->dentry, and thus we can remove the NULL check in it.

(Well, it's not strictly true, because dummptop.dentry is always NULL
but we already handle that separately.)

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
b5d646f5d5a135064232ff3a140a47a5b84bc911 24-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: remove a NULL check in cgroup_exit()

init_task.cgroups is initialized at boot phase, and whenver a ask
is forked, it's cgroups pointer is inherited from its parent, and
it's never set to NULL afterwards.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2739d3cce9816805fe26774fea2527d5b16e924d 21-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: fix bogus kernel warnings when cgroup_create() failed

If cgroup_create() failed and cgroup_destroy_locked() is called to
do cleanup, we'll see a bunch of warnings:

cgroup_addrm_files: failed to remove 2MB.limit_in_bytes, err=-2
cgroup_addrm_files: failed to remove 2MB.usage_in_bytes, err=-2
cgroup_addrm_files: failed to remove 2MB.max_usage_in_bytes, err=-2
cgroup_addrm_files: failed to remove 2MB.failcnt, err=-2
cgroup_addrm_files: failed to remove prioidx, err=-2
cgroup_addrm_files: failed to remove ifpriomap, err=-2
...

We failed to remove those files, because cgroup_create() has failed
before creating those cgroup files.

To fix this, we simply don't warn if cgroup_rm_file() can't find the
cft entry.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
130e3695a3edf6bf21464f2826720a79a6afdee0 14-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: remove synchronize_rcu() from rebind_subsystems()

Nothing's protected by RCU in rebind_subsystems(), and I can't think
of a reason why it is needed.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
5d65bc0ca1bceb73204dab943922ba3c83276a8c 14-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}()

These 2 syncronize_rcu()s make attaching a task to a cgroup
quite slow, and it can't be ignored in some situations.

A real case from Colin Cross: Android uses cgroups heavily to
manage thread priorities, putting threads in a background group
with reduced cpu.shares when they are not visible to the user,
and in a foreground group when they are. Some RPCs from foreground
threads to background threads will temporarily move the background
thread into the foreground group for the duration of the RPC.
This results in many calls to cgroup_attach_task.

In cgroup_attach_task() it's task->cgroups that is protected by RCU,
and put_css_set() calls kfree_rcu() to free it.

If we remove this synchronize_rcu(), there can be threads in RCU-read
sections accessing their old cgroup via current->cgroups with
concurrent rmdir operation, but this is safe.

# time for ((i=0; i<50; i++)) { echo $$ > /mnt/sub/tasks; echo $$ > /mnt/tasks; }

real 0m2.524s
user 0m0.008s
sys 0m0.004s

With this patch:

real 0m0.004s
user 0m0.004s
sys 0m0.000s

tj: These synchronize_rcu()s are utterly confused. synchornize_rcu()
necessarily has to come between two operations to guarantee that
the changes made by the former operation are visible to all rcu
readers before proceeding to the latter operation. Here,
synchornize_rcu() are at the end of attach operations with nothing
beyond it. Its only effect would be delaying completion of
write(2) to sysfs tasks/procs files until all rcu readers see the
change, which doesn't mean anything.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Colin Cross <ccross@google.com>
0ac801fe07374148714b5ef53df90ac5b1673c0c 10-Jan-2013 Li Zefan <lizefan@huawei.com> cgroup: use new hashtable implementation

Switch cgroup to use the new hashtable implementation. No functional changes.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
12a9d2fef1d35770d3cdc2cd1faabb83c45bc0fa 07-Jan-2013 Tejun Heo <tj@kernel.org> cgroup: implement cgroup_rightmost_descendant()

Implement cgroup_rightmost_descendant() which returns the right most
descendant of the specified cgroup. This can be used to skip the
cgroup's subtree while iterating with
cgroup_for_each_descendant_pre().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
8ec7d50f1ed2b072d23ce810f86df09ee0568d4b 18-Dec-2012 Tao Ma <boyu.mt@taobao.com> kernel: remove reference to feature-removal-schedule.txt

In commit 9c0ece069b32 ("Get rid of Documentation/feature-removal.txt"),
Linus removed feature-removal-schedule.txt from Documentation, but there
is still some reference to this file. So remove them.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f33fddc2b9573d8359f1007d4bbe5cd587a0c093 06-Dec-2012 Gao feng <gaofeng@cn.fujitsu.com> cgroup_rm_file: don't delete the uncreated files

in cgroup_add_file,when creating files for cgroup,
some of creation may be skipped. So we need to avoid
deleting these uncreated files in cgroup_rm_file,
otherwise the warning msg will be triggered.

"cgroup_addrm_files: failed to remove memory_pressure_enabled, err=-2"

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@redhat.com>
Cc: stable@vger.kernel.org
7083d0378a1746f2b45729cae494c6b92e75d73f 03-Dec-2012 Gao feng <gaofeng@cn.fujitsu.com> cgroup: remove subsystem files when remounting cgroup

cgroup_clear_directroy is called by cgroup_d_remove_dir
and cgroup_remount.

when we call cgroup_remount to remount the cgroup,the subsystem
may be unlinked from cgroupfs_root->subsys_list in rebind_subsystem,this
subsystem's files will not be removed in cgroup_clear_directroy.
And the system will panic when we try to access these files.

this patch removes subsystems's files before rebind_subsystems,
if rebind_subsystems failed, repopulate these removed files.

With help from Tejun.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
879a3d9dbbde823ac77d39131e7a287f31b8296f 30-Nov-2012 Gao feng <gaofeng@cn.fujitsu.com> cgroup: use cgroup_addrm_files() in cgroup_clear_directory()

cgroup_clear_directory() incorrectly invokes cgroup_rm_file() on each
cftset of the target subsystems, which only removes the first file of
each set. This leaves dangling files after subsystems are removed
from a cgroup root via remount.

Use cgroup_addrm_files() to remove all files of target subsystems.

tj: Move cgroup_addrm_files() prototype decl upwards next to other
global declarations. Commit message updated.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1f869e8711d18aaf6e2979922bc9377ad394b82f 30-Nov-2012 Glauber Costa <glommer@parallels.com> cgroup: warn about broken hierarchies only after css_online

If everything goes right, it shouldn't really matter if we are spitting
this warning after css_alloc or css_online. If we fail between then,
there are some ill cases where we would previously see the message and
now we won't (like if the files fail to be created).

I believe it really shouldn't matter: this message is intended in spirit
to be shown when creation succeeds, but with insane settings.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
9718ceb3431acdeee9bdec5c18e18266333970aa 28-Nov-2012 Greg Thelen <gthelen@google.com> cgroup: list_del_init() on removed events

Use list_del_init() rather than list_del() to remove events from
cgrp->event_list. No functional change. This is just defensive
coding.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
205a872bd6f9a9a09ef035ef1e90185a8245cc58 28-Nov-2012 Greg Thelen <gthelen@google.com> cgroup: fix lockdep warning for event_control

The cgroup_event_wake() function is called with the wait queue head
locked and it takes cgrp->event_list_lock. However, in cgroup_rmdir()
remove_wait_queue() was being called after taking
cgrp->event_list_lock. Correct the lock ordering by using a temporary
list to obtain the event list to remove from the wait queue.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Aaron Durbin <adurbin@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
fddfb02ad0d0d3b479c2a26a8ae7e6411b34706b 28-Nov-2012 Li Zhong <zhong@linux.vnet.ibm.com> cgroup: move list add after list head initilization

2243076ad1 ("cgroup: initialize cgrp->allcg_node in
init_cgroup_housekeeping()") initializes cgrp->allcg_node in
init_cgroup_housekeeping(). Then in init_cgroup_root(), we should
call init_cgroup_housekeeping() before adding it to &root->allcg_list;
otherwise, we are initializing an entry already in a list.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
d0b2fdd2a51203f04ea0a5d716e033c15e0231af 20-Nov-2012 Tao Ma <boyu.mt@taobao.com> cgroup: remove obsolete guarantee from cgroup_task_migrate.

'guarantee' is already removed from cgroup_task_migrate, so remove
the corresponding comments. Some other typos in cgroup are also
changed.

Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
0a950f65e1e64f4e82b4b5507773848ea88bcb8e 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: add cgroup->id

With the introduction of generic cgroup hierarchy iterators, css_id is
being phased out. It was unnecessarily complex, id'ing the wrong
thing (cgroups need IDs, not CSSes) and has other oddities like not
being available at ->css_alloc().

This patch adds cgroup->id, which is a simple per-hierarchy
ida-allocated ID which is assigned before ->css_alloc() and released
after ->css_free().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
033fa1c5f5f73833598a0beb022c0048fb769dad 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup, cpuset: remove cgroup_subsys->post_clone()

Currently CGRP_CPUSET_CLONE_CHILDREN triggers ->post_clone(). Now
that clone_children is cpuset specific, there's no reason to have this
rather odd option activation mechanism in cgroup core. cpuset can
check the flag from its ->css_allocate() and take the necessary
action.

Move cpuset_post_clone() logic to the end of cpuset_css_alloc() and
remove cgroup_subsys->post_clone().

Loosely based on Glauber's "generalize post_clone into post_create"
patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Glauber Costa <glommer@parallels.com>
Original-patch: <1351686554-22592-2-git-send-email-glommer@parallels.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
2260e7fc1f18ad815324605c1ce7d5c6fd9b19a2 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/

clone_children is only meaningful for cpuset and will stay that way.
Rename the flag to reflect that and update documentation. Also, drop
clone_children() wrapper in cgroup.c. The thin wrapper is used only a
few times and one of them will go away soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
92fb97487a7e41b222c1417cabd1d1ab7cc3a48c 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()

Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are. Also, update documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b1929db42f8a649d9a9e397119f628c27fd4021f 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: allow ->post_create() to fail

There could be cases where controllers want to do initialization
operations which may fail from ->post_create(). This patch makes
->post_create() return -errno to indicate failure and online_css()
relay such failures.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
4b8b47eb0001a89f271d671d959b235faa8f03e2 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: update cgroup_create() failure path

cgroup_create() was ignoring failure of cgroupfs files. Update it
such that, if file creation fails, it rolls back by calling
cgroup_destroy_locked() and returns failure.

Note that error out goto labels are renamed. The labels are a bit
confusing but will become better w/ later cgroup operation renames.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b8a2df6a5b576d04f78f54abf254c283527d8bbc 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: use mutex_trylock() when grabbing i_mutex of a new cgroup directory

All cgroup directory i_mutexes nest outside cgroup_mutex; however, new
directory creation is a special case. A new cgroup directory is
created while holding cgroup_mutex. Populating the new directory
requires both the new directory's i_mutex and cgroup_mutex. Because
all directory i_mutexes nest outside cgroup_mutex, grabbing both
requires releasing cgroup_mutex first, which isn't a good idea as the
new cgroup isn't yet ready to be manipulated by other cgroup
opreations.

This is worked around by grabbing the new directory's i_mutex while
holding cgroup_mutex before making it visible. As there's no other
user at that point, grabbing the i_mutex under cgroup_mutex can't lead
to deadlock.

cgroup_create_file() was using I_MUTEX_CHILD to tell lockdep not to
worry about the reverse locking order; however, this creates pseudo
locking dependency cgroup_mutex -> I_MUTEX_CHILD, which isn't true -
all directory i_mutexes are still nested outside cgroup_mutex. This
pseudo locking dependency can lead to spurious lockdep warnings.

Use mutex_trylock() instead. This will always succeed and lockdep
doesn't create any locking dependency for it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
d19e19de48aa0b90c56cd93c8a46ebac46273429 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: simplify cgroup_load_subsys() failure path

Now that cgroup_unload_subsys() can tell whether the root css is
online or not, we can safely call cgroup_unload_subsys() after idr
init failure in cgroup_load_subsys().

Replace the manual unrolling and invoke cgroup_unload_subsys() on
failure. This drops cgroup_mutex inbetween but should be safe as the
subsystem will fail try_module_get() and thus can't be mounted
inbetween. As this means that cgroup_unload_subsys() can be called
before css_sets are rehashed, remove BUG_ON() on %NULL
css_set->subsys[] from cgroup_unload_subsys().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
a31f2d3ff7fe20cbe2a143515a7d7c408b29dd0d 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers

New helpers on/offline_css() respectively wrap ->post_create() and
->pre_destroy() invocations. online_css() sets CSS_ONLINE after
->post_create() is complete and offline_css() invokes ->pre_destroy()
iff CSS_ONLINE is set and clears it while also handling the temporary
dropping of cgroup_mutex.

This patch doesn't introduce any behavior change at the moment but
will be used to improve cgroup_create() failure path and allow
->post_create() to fail.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
42809dd4225b2f3127a4804314a1b33608620d96 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: separate out cgroup_destroy_locked()

Separate out cgroup_destroy_locked() from cgroup_destroy(). This will
be later used in cgroup_create() failure path.

While at it, add lockdep asserts on i_mutex and cgroup_mutex, and move
@d and @parent assignments to their declarations.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
02ae7486d05ae6df8395409a4945b2420f1e35c2 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: fix harmless bugs in cgroup_load_subsys() fail path and cgroup_unload_subsys()

* If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[]
before calilng ->destroy() making CSS inaccessible to the callback,
and didn't unlink ss->sibling. As no modular controller uses
->use_id, this doesn't cause any actual problems.

* cgroup_unload_subsys() was forgetting to free idr, call
->pre_destroy() and clear ->active. As there currently is no
modular controller which uses ->use_id, ->pre_destroy() or ->active,
this doesn't cause any actual problems.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
648bb56d076bde31113f09a7d24d95bc8d4155ac 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: lock cgroup_mutex in cgroup_init_subsys()

Make cgroup_init_subsys() grab cgroup_mutex while initializing a
subsystem so that all helpers and callbacks are called under the
context they expect. This isn't strictly necessary as
cgroup_init_subsys() doesn't race with anybody but will allow adding
lockdep assertions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
b48c6a80a0f7584056f768268fc12b87745a393f 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: trivial cleanup for cgroup_init/load_subsys()

Consistently use @css and @dummytop in these two functions instead of
referring to them indirectly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
38b53abaa3e0c7e750ef73eee919cf42eee6b134 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: make CSS_* flags bit masks instead of bit positions

Currently, CSS_* flags are defined as bit positions and manipulated
using atomic bitops. There's no reason to use atomic bitops for them
and bit positions are clunkier to deal with than bit masks. Make
CSS_* bit masks instead and use the usual C bitwise operators to
access them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
febfcef60d4f9457785b45aab548bc7ee5ea158f 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: cgroup->dentry isn't a RCU pointer

cgroup->dentry is marked and used as a RCU pointer; however, it isn't
one - the final dentry put doesn't go through call_rcu(). cgroup and
dentry share the same RCU freeing rule via synchronize_rcu() in
cgroup_diput() (kfree_rcu() used on cgrp is unnecessary). If cgrp is
accessible under RCU read lock, so is its dentry and dereferencing
cgrp->dentry doesn't need any further RCU protection or annotation.

While not being accurate, before the previous patch, the RCU accessors
served a purpose as memory barriers - cgroup->dentry used to be
assigned after the cgroup was made visible to cgroup_path(), so the
assignment and dereferencing in cgroup_path() needed the memory
barrier pair. Now that list_add_tail_rcu() happens after
cgroup->dentry is assigned, this no longer is necessary.

Remove the now unnecessary and misleading RCU annotations from
cgroup->dentry. To make up for the removal of rcu_dereference_check()
in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts
the dereference rule of @cgrp, not cgrp->dentry.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
4e139afc22cb98d0d032ffce0285bfcc73ca5217 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: create directory before linking while creating a new cgroup

While creating a new cgroup, cgroup_create() links the newly allocated
cgroup into various places before trying to create its directory.
Because cgroup life-cycle is tied to the vfs objects, this makes it
impossible to use cgroup_rmdir() for rolling back creation - the
removal logic depends on having full vfs objects.

This patch moves directory creation above linking and collect linking
operations to one place. This allows directory creation failure to
share error exit path with css allocation failures and any failure
sites afterwards (to be added later) can use cgroup_rmdir() logic to
undo creation.

Note that this also makes the memory barriers around cgroup->dentry,
which currently is misleadingly using RCU operations, unnecessary.
This will be handled in the next patch.

While at it, locking BUG_ON() on i_mutex is converted to
lockdep_assert_held().

v2: Patch originally removed %NULL dentry check in cgroup_path();
however, Li pointed out that this patch doesn't make it
unnecessary as ->create() may call cgroup_path(). Drop the
change for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
28fd6f30ac3efd9170ae1ac89f3521d53b5eb83a 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: open-code cgroup_create_dir()

The operation order of cgroup creation is about to change and
cgroup_create_dir() is more of a hindrance than a proper abstraction.
Open-code it by moving the parent nlink adjustment next to self nlink
adjustment in cgroup_create_file() and the rest to cgroup_create().

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2243076ad128d0cc196cf6597e6ddcf6bc907676 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping()

Not strictly necessary but it's annoying to have uninitialized
list_head around.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
175431635ec09b1d1bba04979b006b99e8305a83 19-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: remove incorrect dget/dput() pair in cgroup_create_dir()

cgroup_create_dir() does weird dancing with dentry refcnt. On
success, it gets and then puts it achieving nothing. On failure, it
puts but there isn't no matching get anywhere leading to the following
oops if cgroup_create_file() fails for whatever reason.

------------[ cut here ]------------
kernel BUG at /work/os/work/fs/dcache.c:552!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
CPU 2
Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs
RIP: 0010:[<ffffffff811d9c0c>] [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
RSP: 0018:ffff88001a3ebef8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403
RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58
RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001
R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea
R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60
FS: 00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080)
Stack:
ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000
ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8
ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8
Call Trace:
[<ffffffff811cc889>] done_path_create+0x19/0x50
[<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80
[<ffffffff811d2009>] sys_mkdir+0x19/0x20
[<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b
Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41
RIP [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
RSP <ffff88001a3ebef8>
---[ end trace 1277bcfd9561ddb0 ]---

Fix it by dropping the unnecessary dget/dput() pair.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
17cf22c33e1f1b5e435469c84e43872579497653 02-Mar-2010 Eric W. Biederman <ebiederm@xmission.com> pidns: Use task_active_pid_ns where appropriate

The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.

Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.

So I have used task_active_pid_ns everywhere I can.

In fork since the pid has not yet been attached to the
process I use ns_of_pid, to achieve the same effect.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
574bd9f7c7c1d32f52dea5488018a6ff79031e22 09-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: implement generic child / descendant walk macros

Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants. This patch adds the following
three macros.

* cgroup_for_each_child() - walk immediate children of a cgroup.

* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
in pre-order tree traversal.

* cgroup_for_each_descendant_post() - visit all descendants of a
cgroup in post-order tree traversal.

All three only require the user to hold RCU read lock during
traversal. Verifying that each iterated cgroup is online is the
responsibility of the user. When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate state
updates to descendants in reliable way. See comments for details.

v2: s/config/state/ in commit message and comments per Michal. More
documentation on synchronization rules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
eb6fd5040ee799009173daa49c3e7aa0362167c9 09-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: use rculist ops for cgroup->children

Use RCU safe list operations for cgroup->children. This will be used
to implement cgroup children / descendant walking which can be used by
controllers.

Note that cgroup_create() now puts a new cgroup at the end of the
->children list instead of head. This isn't strictly necessary but is
done so that the iteration order is more conventional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
a8638030f668884720b8f4456448d0ce33952b05 09-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: add cgroup_subsys->post_create()

Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocatinos successfully and is
considered "live" by cgroup.

This becomes a problem later when we add generic descendants walking
to cgroup which can be used by controllers as controllers don't have a
synchronization point where it can synchronize against new cgroups
appearing in such walks.

This patch adds ->post_create(). It's called after all ->create()
succeeded and the cgroup is linked into the generic cgroup hierarchy.
This plays the counterpart of ->pre_destroy().

When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance. It will be explained with the descendant iterators.

v2: Added a paragraph about its future use w/ descendant iterators per
Michal.

v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
316eb661f125397d46f16f94e3c81ad3dc4c1233 08-Nov-2012 Tao Ma <boyu.mt@taobao.com> cgroup: set 'start' with the right value in cgroup_path.

'start' is set to buf + buflen and do the '--' immediately.
Just set it to 'buf + buflen - 1' directly.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
bcf6de1b9129531215d26dd9af8331e84973bc52 05-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: make ->pre_destroy() return void

All ->pre_destory() implementations return 0 now, which is the only
allowed return value. Make it return void.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
b25ed609d0eecf077db607e88ea70bae83b6adb2 05-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()

CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
destruction rollback somewhat working. cgroup_rmdir() used to drain
CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
helpers were used to allow the task performing rmdir to wait for the
next relevant event.

Unfortunately, the wait is visible to controllers too and the
mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
sleep at rmdir").

Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
unnecessary. Remove it and all the mechanisms supporting it. Note
that memcontrol.c changes are essentially revert of 887032670d
("cgroup avoid permanent sleep at rmdir").

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
1a90dd508b0b00e382fd61a46f55dc889ac21b39 05-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy()

Because ->pre_destroy() could fail and can't be called under
cgroup_mutex, cgroup destruction did something very ugly.

1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.

2. Release cgroup_mutex and call ->pre_destroy().

3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
otherwise.

4. Continue destroying.

In addition to being ugly, it has been always broken in various ways.
For example, memcg ->pre_destroy() expects the cgroup to be inactive
after it's done but tasks can be attached and detached between #2 and
#3 and the conditions that memcg verified in ->pre_destroy() might no
longer hold by the time control reaches #3.

Now that ->pre_destroy() is no longer allowed to fail. We can switch
to the following.

1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.

2. Deactivate CSS's and mark the cgroup removed thus preventing any
further operations which can invalidate the verification from #1.

3. Release cgroup_mutex and call ->pre_destroy().

4. Re-grab cgroup_mutex and continue destroying.

After this change, controllers can safely assume that ->pre_destroy()
will only be called only once for a given cgroup and, once
->pre_destroy() is called, the cgroup will stay dormant till it's
destroyed.

This removes the only reason ->pre_destroy() can fail - new task being
attached or child cgroup being created inbetween. Error out path is
removed and ->pre_destroy() invocation is open coded in
cgroup_rmdir().

v2: cgroup_call_pre_destroy() removal moved to this patch per Michal.
Commit message updated per Glauber.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
976c06bcccc50573997609fa7ec842479bd96ffb 05-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

This patch makes cgroup_create() fail if @parent is marked removed.
This is to prepare for further updates to cgroup_rmdir() path.

Note that this change isn't strictly necessary. cgroup can only be
created via mkdir and the removed marking and dentry removal happen
without releasing cgroup_mutex, so cgroup_create() can never race with
cgroup_rmdir(). Even after the scheduled updates to cgroup_rmdir(),
cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
rendering the added liveliness check unnecessary.

Do it anyway such that locking is contained inside cgroup proper and
we don't get nasty surprises if we ever grow another caller of
cgroup_create().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
e93160803ffda2e67d9ff9cacb63bb6868c8398f 05-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: kill CSS_REMOVED

CSS_REMOVED is one of the several contortions which were necessary to
support css reference draining on cgroup removal. All css->refcnts
which need draining should be deactivated and verified to equal zero
atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts
needed to be re-activated and css_tryget() shouldn't fail in the
process.

This was achieved by letting css_tryget() busy-loop until either the
refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
(committing to removal).

Now that css refcnt draining is no longer used, there's no need for
atomic rollback mechanism. css_tryget() simply can look at the
reference count and fail if it's deactivated - it's never getting
re-activated.

This patch removes CSS_REMOVED and updates __css_tryget() to fail if
the refcnt is deactivated. As deactivation and removal are a single
step now, they no longer need to be protected against css_tryget()
happening from irq context. Remove local_irq_disable/enable() from
cgroup_rmdir().

Note that this removes css_is_removed() whose only user is VM_BUG_ON()
in memcontrol.c. We can replace it with a check on the refcnt but
given that the only use case is a debug assert, I think it's better to
simply unexport it.

v2: Comment updated and explanation on local_irq_disable/enable()
added per Michal Hocko.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
ed95779340b50e362245c81b5dec0d11a1debfa8 05-Nov-2012 Tejun Heo <tj@kernel.org> cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
handling") removed the last user of __DEPRECATED_clear_css_refs. This
patch removes __DEPRECATED_clear_css_refs and mechanisms to support
it.

* Conditionals dependent on __DEPRECATED_clear_css_refs removed.

* cgroup_clear_css_refs() can no longer fail. All that needs to be
done are deactivating refcnts, setting CSS_REMOVED and putting the
base reference on each css. Remove cgroup_clear_css_refs() and the
failure path, and open-code the loops into cgroup_rmdir().

This patch keeps the two for_each_subsys() loops separate while open
coding them. They can be merged now but there are scheduled changes
which need them to be separate, so keep them separate to reduce the
amount of churn.

local_irq_save/restore() from cgroup_clear_css_refs() are replaced
with local_irq_disable/enable() for simplicity. This is safe as
cgroup_rmdir() is always called with IRQ enabled. Note that this IRQ
switching is necessary to ensure that css_tryget() isn't called from
IRQ context on the same CPU while lower context is between CSS
deactivation and setting CSS_REMOVED as css_tryget() would hang
forever in such cases waiting for CSS to be re-activated or
CSS_REMOVED set. This will go away soon.

v2: cgroup_call_pre_destroy() removal dropped per Michal. Commit
message updated to explain local_irq_disable/enable() conversion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
d87838321124061f6c935069d97f37010fa417e6 19-Oct-2012 Tejun Heo <tj@kernel.org> Revert "cgroup: Remove task_lock() from cgroup_post_fork()"

This reverts commit 7e3aa30ac8c904a706518b725c451bb486daaae9.

The commit incorrectly assumed that fork path always performed
threadgroup_change_begin/end() and depended on that for
synchronization against task exit and cgroup migration paths instead
of explicitly grabbing task_lock().

threadgroup_change is not locked when forking a new process (as
opposed to a new thread in the same process) and even if it were it
wouldn't be effective as different processes use different threadgroup
locks.

Revert the incorrect optimization.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20121008020000.GB2575@localhost>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
9bb71308b8133d643648776243e4d5599b1c193d 19-Oct-2012 Tejun Heo <tj@kernel.org> Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"

This reverts commit 7e381b0eb1e1a9805c37335562e8dc02e7d7848c.

The commit incorrectly assumed that fork path always performed
threadgroup_change_begin/end() and depended on that for
synchronization against task exit and cgroup migration paths instead
of explicitly grabbing task_lock().

threadgroup_change is not locked when forking a new process (as
opposed to a new thread in the same process) and even if it were it
wouldn't be effective as different processes use different threadgroup
locks.

Revert the incorrect optimization.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <20121008020000.GB2575@localhost>
Acked-by: Li Zefan <lizefan@huawei.com>
Bitterly-Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: stable@vger.kernel.org
1f5320d5972aa50d3e8d2b227b636b370e608359 04-Oct-2012 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> cgroup: notify_on_release may not be triggered in some cases

notify_on_release must be triggered when the last process in a cgroup is
move to another. But if the first(and only) process in a cgroup is moved to
another, notify_on_release is not triggered.

# mkdir /cgroup/cpu/SRC
# mkdir /cgroup/cpu/DST
#
# echo 1 >/cgroup/cpu/SRC/notify_on_release
# echo 1 >/cgroup/cpu/DST/notify_on_release
#
# sleep 300 &
[1] 8629
#
# echo 8629 >/cgroup/cpu/SRC/tasks
# echo 8629 >/cgroup/cpu/DST/tasks
-> notify_on_release for /SRC must be triggered at this point,
but it isn't.

This is because put_css_set() is called before setting CGRP_RELEASABLE
in cgroup_task_migrate(), and is a regression introduce by the
commit:74a1166d(cgroups: make procs file writable), which was merged
into v3.0.

Cc: Ben Blum <bblum@andrew.cmu.edu>
Cc: <stable@vger.kernel.org> # v3.0.x and later
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
5edee61edeaaebafe584f8fb7074c1ef4658596b 17-Oct-2012 Tejun Heo <tj@kernel.org> cgroup: cgroup_subsys->fork() should be called after the task is added to css_set

cgroup core has a bug which violates a basic rule about event
notifications - when a new entity needs to be added, you add that to
the notification list first and then make the new entity conform to
the current state. If done in the reverse order, an event happening
inbetween will be lost.

cgroup_subsys->fork() is invoked way before the new task is added to
the css_set. Currently, cgroup_freezer is the only user of ->fork()
and uses it to make new tasks conform to the current state of the
freezer. If FROZEN state is requested while fork is in progress
between cgroup_fork_callbacks() and cgroup_post_fork(), the child
could escape freezing - the cgroup isn't frozen when ->fork() is
called and the freezer couldn't see the new task on the css_set.

This patch moves cgroup_subsys->fork() invocation to
cgroup_post_fork() after the new task is added to the css_set.
cgroup_fork_callbacks() is removed.

Because now a task may be migrated during cgroup_subsys->fork(),
freezer_fork() is updated so that it adheres to the usual RCU locking
and the rather pointless comment on why locking can be different there
is removed (if it doesn't make anything simpler, why even bother?).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
8c7f6edbda01f1b1a2e60ad61f14fe38023e433b 13-Sep-2012 Tejun Heo <tj@kernel.org> cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

Currently, cgroup hierarchy support is a mess. cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children. blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup. Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy. Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support. The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
subsystems, so that implementing proper hierarchy support on those
doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
subsystems to implement proper hierarchy support.

For now, start with a single warning message. We can whine louder
later on.

v2: Fixed a typo spotted by Michal. Warning message updated.

v3: Updated memcg part so that it doesn't generate warning in the
cases where .use_hierarchy=false doesn't make the behavior
different from root.use_hierarchy=true. Fixed a typo spotted by
Glauber.

v4: Check ->broken_hierarchy after cgroup creation is complete so that
->create() can affect the result per Michal. Dropped unnecessary
memcg root handling per Michal.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
8a8e04df4747661daaee77e98e102d99c9e09b98 12-Sep-2012 Daniel Wagner <daniel.wagner@bmw-carit.de> cgroup: Assign subsystem IDs during compile time

WARNING: With this change it is impossible to load external built
controllers anymore.

In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.

By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.

A close look is necessary on the RCU part which was introduces by
following patch:

commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
Author: Herbert Xu <herbert@gondor.apana.org.au> Mon May 24 09:12:34 2010
Committer: David S. Miller <davem@davemloft.net> Mon May 24 09:12:34 2010

cls_cgroup: Store classid in struct sock

Tis code was added to init_cgroup_cls()

/* We can't use rcu_assign_pointer because this is an int. */
smp_wmb();
net_cls_subsys_id = net_cls_subsys.subsys_id;

respectively to exit_cgroup_cls()

net_cls_subsys_id = -1;
synchronize_rcu();

and in module version of task_cls_classid()

rcu_read_lock();
id = rcu_dereference(net_cls_subsys_id);
if (id >= 0)
classid = container_of(task_subsys_state(p, id),
struct cgroup_cls_state, css)->classid;
rcu_read_unlock();

Without an explicit explaination why the RCU part is needed. (The
rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
in a later commit, but that is a minor detail.)

So here is my pondering why it was introduced and why it safe to
remove it now. Note that this code was copied over to net_prio the
reasoning holds for that subsystem too.

The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).

So this code makes sure that only after module is loaded and the
subsystem registered, the id is assigned.

Before unregistering the module all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronized_rcu(). Any new readers wont call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.

The new code relies on the same trick, but it looks at the subsys
pointer return by task_subsys_state() (remember the id is constant
and therefore we allways have a valid index into the subsys
array).

No precautions need to be taken during module loading
module. Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystem() which is called after
the module init() function will assigned subsys[net_cls_subsys_id] the
newly loaded module subsystem pointer.

When the subsystem is about to be removed, rebind_subsystem() will
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then it calls synchronize_rcu(). All old readers have left by then
the critical section. Any new reader wont access the subsystem
anymore. At this point we are safe to unregister the subsystem. No
synchronize_rcu() call is needed.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
80f4c87774721e864d5a5a1f7aca3e95fd90e194 12-Sep-2012 Daniel Wagner <daniel.wagner@bmw-carit.de> cgroup: Do not depend on a given order when populating the subsys array

The *_subsys_id will be used as index to access the subsys. Therefore
we need to care we populate the subsystem at the correct position by
using designated initialization.

With this change we are able to interleave builtin and modules in the subsys
array.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
5fc0b02544b3b9bd3db5a8156b5f3e7350f8e797 12-Sep-2012 Daniel Wagner <daniel.wagner@bmw-carit.de> cgroup: Wrap subsystem selection macro

Before we are able to define all subsystem ids at compile time we need
a more fine grained control what gets defined when we include
cgroup_subsys.h. For example we define the enums for the subsystems or
to declare for struct cgroup_subsys (builtin subsystem) by including
cgroup_subsys.h and defining SUBSYS accordingly.

Currently, the decision if a subsys is used is defined inside the
header by testing if CONFIG_*=y is true. By moving this test outside
of cgroup_subsys.h we are able to control it on the include level.

This is done by introducing IS_SUBSYS_ENABLED which then is defined
according the task, e.g. is CONFIG_*=y or CONFIG_*=m.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
be45c900fdc2c66baad5a7703fb8136991d88aeb 13-Sep-2012 Daniel Wagner <daniel.wagner@bmw-carit.de> cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT

CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
looping over the subsys array looking either at the builtin or the
module subsystems. Since all the builtin subsystems have an id which
is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
are sorted.

We are about to change id assignment to happen only at compile time
later in this series. That means we can't rely on the above trick
since all ids will always be defined at compile time. Furthermore,
ordering the builtin subsystems and the module subsystems is not
really necessary.

So we need a different way to know which subsystem is a builtin or a
module one. We can use the subsys[]->module pointer for this. Any
place where we need to know if a subsys is module we just check for
the pointer. If it is NULL then the subsystem is a builtin one.

With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
enum. Though we need to introduce a temporary placeholder so that we
don't get a compilation error when only CONFIG_CGROUP is selected and
no single controller. An empty enum definition is not valid. Later in
this series we are able to remove the placeholder again.

And with this change we get a fix for this:

kernel/cgroup.c: In function ‘cgroup_load_subsys’:
kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]

when CONFIG_CGROUP=y and no built in controller was enabled.

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org
a1a71b45a66fd3c3c453b55fbd180f8fccdd1daa 23-Aug-2012 Aristeu Rozanski <aris@redhat.com> cgroup: rename subsys_bits to subsys_mask

In a previous discussion, Tejun Heo suggested to rename references to
subsys_bits (added_bits, removed_bits, etc) by something more meaningful.

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
03b1cde6b22f625ae832b939bc7379ec1466aec5 23-Aug-2012 Aristeu Rozanski <aris@redhat.com> cgroup: add xattr support

This is one of the items in the plumber's wish list.

For use cases:

>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart

v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support

Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
13af07df9b7e49f1987cf36aa048dc6c49d0f93d 23-Aug-2012 Aristeu Rozanski <aris@redhat.com> cgroup: revise how we re-populate root directory

When remounting cgroupfs with some subsystems added to it and some
removed, cgroup will remove all the files in root directory and then
re-popluate it.

What I'm doing here is, only remove files which belong to subsystems that
are to be unbinded, and only create files for newly-added subsystems.
The purpose is to have all other files untouched.

This is a preparation for cgroup xattr support.

v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going thru the list of files, get the file list using the
subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring

Original-patch-by: Li Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
9249e17fe094d853d1ef7475dd559a2cc7e23d42 25-Jun-2012 David Howells <dhowells@redhat.com> VFS: Pass mount flags to sget()

Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b 10-Jun-2012 Al Viro <viro@zeniv.linux.org.uk> stop passing nameidata to ->lookup()

Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ce27e317ba22b359bde02216afab934dac3af095 03-Jul-2012 Tejun Heo <tj@kernel.org> cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode

While refactoring cgroup file removal path, 05ef1d7c4a "cgroup:
introduce struct cfent" incorrectly changed the @dir argument of
simple_unlink() to the inode of the file being deleted instead of that
of the containing directory.

The effect of this bug is minor - ctime and mtime of the parent
weren't properly updated on file deletion.

Fix it by using @cgrp->dentry->d_inode instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org
5db9a4d99b0157a513944e9a44d29c9cec2e91dc 08-Jul-2012 Tejun Heo <tj@kernel.org> cgroup: fix cgroup hierarchy umount race

48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code. As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned and thus preventing premature release of
superblock.

Unfortunately, after 48ddbe1946, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().

This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is. If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().

Before 48ddbe1946, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.

Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe1946
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Tested-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
7db5b3ca0ecdb2e8fad52a4770e4e320e61c77a6 08-Jul-2012 Tejun Heo <tj@kernel.org> Revert "cgroup: superblock can't be released with active dentries"

This reverts commit fa980ca87d15bb8a1317853f257a505990f3ffde. The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup. While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.

Revert the incorrect fix. The next commit will describe the race and
fix it correctly.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: shyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
8e3bbf42c6d73881956863cc3305456afe2bc4ea 14-Jun-2012 Salman Qazi <sqazi@google.com> cgroups: Account for CSS_DEACT_BIAS in __css_put

When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count. This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
6be96a5c905178637ec06a5d456a76b2b304fca3 07-Jun-2012 Li Zefan <lizefan@huawei.com> cgroup: remove hierarchy_mutex

It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.

No one used hierarchy_mutex since that, so remove it.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
967db0ea65b0bf8507a7643ac8f296c4f2c0a834 07-Jun-2012 Salman Qazi <sqazi@google.com> cgroup: make sure that decisions in __css_put are atomic

__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions. This is prone
to races, as someone else may decrement ref count between
our decrement and our decision. Instead, we should base our
decisions on the value that we decremented the ref count to.

(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel. Having
said that, it's still incorrect by inspection).

Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
91c63734f6908425903aed69c04035592f18d398 30-May-2012 Johannes Weiner <hannes@cmpxchg.org> kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite

Library functions should not grab locks when the callsites can do it,
even if the lock nests like the rcu read-side lock does.

Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fa980ca87d15bb8a1317853f257a505990f3ffde 24-May-2012 Tejun Heo <tj@kernel.org> cgroup: superblock can't be released with active dentries

48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.

Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().

Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
14a590c3f987977d7b09ec926481ee0238c08eee 12-Mar-2012 Eric W. Biederman <ebiederm@xmission.com> userns: Convert cgroup permission checks to use uid_eq

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
c4c27fbdda4e8ba87806c415b6d15266b07bce4b 21-Apr-2012 Mike Galbraith <mgalbraith@suse.de> cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads

Allowing kthreadd to be moved to a non-root group makes no sense, it being
a global resource, and needlessly leads unsuspecting users toward trouble.

1. An RT workqueue worker thread spawned in a task group with no rt_runtime
allocated is not schedulable. Simple user error, but harmful to the box.

2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset,
rendering the cpuset immortal.

Save the user some unexpected trouble, just say no.

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
86f82d561864e902c70282b6f17cf590c0f34691 10-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_subsys->populate()

With memcg converted, cgroup_subsys->populate() doesn't have any user
left. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
48ddbe194623ae089cc0576e60363f2d2e85662a 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: make css->refcnt clearing on cgroup removal optional

Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.

This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).

Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.

Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.

http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251

Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
28b4c27b8e6bb6d7ff2875281a8484f8898a87ef 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: use negative bias on css->refcnt to block css_tryget()

When a cgroup is about to be removed, cgroup_clear_css_refs() is
called to check and ensure that there are no active css references.

This is currently achieved by dropping the refcnt to zero iff it has
only the base ref. If all css refs could be dropped to zero, ref
clearing is successful and CSS_REMOVED is set on all css. If not, the
base ref is restored. While css ref is zero w/o CSS_REMOVED set, any
css_tryget() attempt on it busy loops so that they are atomic
w.r.t. the whole css ref clearing.

This does work but dropping and re-instating the base ref is somewhat
hairy and makes it difficult to add more logic to the put path as
there are two of them - the regular css_put() and the reversible base
ref clearing.

This patch updates css ref clearing such that blocking new
css_tryget() and putting the base ref are separate operations.
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
css_tryget() busy loops while refcnt is negative. After all css refs
are deactivated, if they were all one, ref clearing succeeded and
CSS_REMOVED is set and the base ref is put using the regular
css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
and the original postive values are restored.

css_refcnt() accessor which always returns the unbiased positive
reference counts is added and used to simplify refcnt usages. While
at it, relocate and reformat comments in cgroup_has_css_refs().

This separates css->refcnt deactivation and putting the base ref,
which enables the next patch to make ref clearing optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
79578621b4847afdef48d19a28d00e3b188c37e1 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: implement cgroup_rm_cftypes()

Implement cgroup_rm_cftypes() which removes an array of cftypes from a
subsystem. It can be called whether the target subsys is attached or
not. cgroup core will remove the specified file from all existing
cgroups.

This will be used to improve sub-subsys modularity and will be helpful
for unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
05ef1d7c4a5b6d09cadd2b8e9b3c395363d1a89c 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: introduce struct cfent

This patch adds cfent (cgroup file entry) which is the association
between a cgroup and a file. This is in-cgroup representation of
files under a cgroup directory. This simplifies walking walking
cgroup files and thus cgroup_clear_directory(), which is now
implemented in two parts - cgroup_rm_file() and a loop around it.

cgroup_rm_file() will be used to implement cftype removal and cfent is
scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
"sever" semantics).

v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
if the file had openers at the time of removal. Moved to
cgroup_diput().

- cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
wasn't empty after removing all files. This triggered
spuriously if some files were open during directory clearing.
Removed.

v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
spuriously triggered for root cgroups because they don't go
through cgroup_clear_directory() on unmount. Don't trigger WARN
for root cgroups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
f6ea93723d0049ae5fbbb5292cb10c51d7a80801 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: relocate __d_cgrp() and __d_cft()

Move the two macros upwards as they'll be used earlier in the file.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
db0416b64977cb0f382175608286cc80c7573e38 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: remove cgroup_add_file[s]()

No controller is using cgroup_add_files[s](). Unexport them, and
convert cgroup_add_files() to handle NULL entry terminated array
instead of taking count explicitly and continue creation on failure
for internal use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
4baf6e33251b37f111e21289f8ee71fe4cce236e 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: convert all non-memcg controllers to the new cftype interface

Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.

This is functionally identical transformation. There shouldn't be any
visible behavior change.

memcg is rather special and will be converted separately.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
6e6ff25bd548a0c5bf5163e4f63e2793dcefbdcb 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: merge cft_release_agent cftype array into the base files array

Now that cftype can express whether a file should only be on root,
cft_release_agent can be merged into the base files cftypes array.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
8e3f6541d45e1a002801e56a19530a90f775deba 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: implement cgroup_add_cftypes() and friends

Currently, cgroup directories are populated by subsys->populate()
callback explicitly creating files on each cgroup creation. This
level of flexibility isn't needed or desirable. It provides largely
unused flexibility which call for abuses while severely limiting what
the core layer can do through the lack of structure and conventions.

Per each cgroup file type, the only distinction that cgroup users is
making is whether a cgroup is root or not, which can easily be
expressed with flags.

This patch introduces cgroup_add_cftypes(). These deal with cftypes
instead of individual files - controllers indicate that certain types
of files exist for certain subsystem. Newly added CFTYPE_*_ON_ROOT
flags indicate whether a cftype should be excluded or created only on
the root cgroup.

cgroup_add_cftypes() can be called any time whether the target
subsystem is currently attached or not. cgroup core will create files
on the existing cgroups as necessary.

Also, cgroup_subsys->base_cftypes is added to ease registration of the
base files for the subsystem. If non-NULL on subsys init, the cftypes
pointed to by ->base_cftypes are automatically registered on subsys
init / load.

Further patches will convert the existing users and remove the file
based interface. Note that this interface allows dynamic addition of
files to an active controller. This will be used for sub-controller
modularity and unified hierarchy in the longer term.

This patch implements the new mechanism but doesn't apply it to any
user.

v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
cgroup_subsys->base_cftypes, which works better for cgroup_subsys
which is loaded as module.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
b0ca5a84fc3ad8f75bb2b7ac3dc6a77151cd3906 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: build list of all cgroups under a given cgroupfs_root

Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
going through cgroup->allcg_node. The list is protected by
cgroup_mutex and will be used to improve cgroup file handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
ff4c8d503e2583b485da46d0ec3dcc29639ab889 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()

cgroup_populate_dir() currently clears all files and then repopulate
the directory; however, the clearing part is only useful when it's
called from cgroup_remount(). Relocate the invocation to
cgroup_remount().

This is to prepare for further cgroup file handling updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
8b5a5a9dbca914d1f7d70276024d1525a3c94081 01-Apr-2012 Tejun Heo <tj@kernel.org> cgroup: deprecate remount option changes

This patch marks the following features for deprecation.

* Rebinding subsys by remount: Never reached useful state - only works
on empty hierarchies.

* release_agent update by remount: release_agent itself will be
replaced with conventional fsnotify notification.

v2: Lennart pointed out that "name=" is necessary for mounts w/o any
controller attached. Drop "name=" deprecation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
8f121918f2e49f852de1acdc5255cc1ef440d85b 30-Mar-2012 Tejun Heo <tj@kernel.org> cgroup: cgroup_attach_task() could return -errno after success

61d1d219c4 "cgroup: remove extra calls to find_existing_css_set" made
cgroup_task_migrate() return void. An unfortunate side effect was
that cgroup_attach_task() was depending on that function's return
value to clear its @retval on the success path. On cgroup mounts
without any subsystem with ->can_attach() callback,
cgroup_attach_task() ended up returning @retval without initializing
it on success.

For some reason, gcc failed to warn about it and it didn't cause
cgroup_attach_task() to return non-zero value in many cases, probably
due to difference in register allocation. When the problem
materializes, systemd fails to populate /systemd cgroup mount and
fails to boot.

Fix it by initializing @retval to zero on declaration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiri Kosina <jkosina@suse.cz>
LKML-Reference: <alpine.LNX.2.00.1203282354440.25526@pobox.suse.cz>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizefan@huawei.com>
ca464d69b19120a826aa2534de2511a6f542edf5 22-Mar-2012 Hugh Dickins <hughd@google.com> memcg: let css_get_next() rely upon rcu_read_lock()

Remove lock and unlock around css_get_next()'s call to idr_get_next().
memcg iterators (only users of css_get_next) already did rcu_read_lock(),
and its comment demands that; but add a WARN_ON_ONCE to make sure of it.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
42aee6c495e07dba7410b863a360db6bb9ec6d66 22-Mar-2012 Hugh Dickins <hughd@google.com> cgroup: revert ss_id_lock to spinlock

Commit c1e2ee2dc436 ("memcg: replace ss->id_lock with a rwlock") has now
been seen to cause the unfair behavior we should have expected from
converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose
get_new_cssid() is waiting for the wlock, while there are 19 tasks using
the rlock in css_get_next() to get on with their memcg workload (in an
artificial test, admittedly). Yet lib/idr.c was made suitable for RCU
way back: revert that commit, restoring ss->id_lock to a spinlock.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
48fde701aff662559b38d9a609574068f22d00fe 09-Jan-2012 Al Viro <viro@zeniv.linux.org.uk> switch open-coded instances of d_make_root() to new helper

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3ce3230a0cff484e5130153f244d4fb8a56b3a8b 08-Feb-2012 Frederic Weisbecker <fweisbec@gmail.com> cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list

Walking through the tasklist in cgroup_enable_task_cg_list() inside
an RCU read side critical section is not enough because:

- RCU is not (yet) safe against while_each_thread()

- If we use only RCU, a forking task that has passed cgroup_post_fork()
without seeing use_task_css_set_links == 1 is not guaranteed to have
its child immediately visible in the tasklist if we walk through it
remotely with RCU. In this case it will be missing in its css_set's
task list.

Thus we need to traverse the list (unfortunately) under the
tasklist_lock. It makes us safe against while_each_thread() and also
make sure we see all forked task that have been added to the tasklist.

As a secondary effect, reading and writing use_task_css_set_links are
now well ordered against tasklist traversing and modification. The new
layout is:

CPU 0 CPU 1

use_task_css_set_links = 1 write_lock(tasklist_lock)
read_lock(tasklist_lock) add task to tasklist
do_each_thread() { write_unlock(tasklist_lock)
add thread to css set links if (use_task_css_set_links)
} while_each_thread() add thread to css set links
read_unlock(tasklist_lock)

If CPU 0 traverse the list after the task has been added to the tasklist
then it is correctly added to the css set links. OTOH if CPU 0 traverse
the tasklist before the new task had the opportunity to be added to the
tasklist because it was too early in the fork process, then CPU 1
catches up and add the task to the css set links after it added the task
to the tasklist. The right value of use_task_css_set_links is guaranteed
to be visible from CPU 1 due to the LOCK/UNLOCK implicit barrier properties:
the read_unlock on CPU 0 makes the write on use_task_css_set_links happening
and the write_lock on CPU 1 make the read of use_task_css_set_links that comes
afterward to return the correct value.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
9a4b430451bb6d8d6b7dcdfbee0e1330b7c475a6 08-Feb-2012 Frederic Weisbecker <fweisbec@gmail.com> cgroup: Remove wrong comment on cgroup_enable_task_cg_list()

Remove the stale comment about RCU protection. Many callers
(all of them?) of cgroup_enable_task_cg_list() don't seem
to be in an RCU read side critical section. Besides, RCU is
not helpful to protect against while_each_thread().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
761b3ef50e1c2649cffbfa67a4dcb2dcdb7982ed 31-Jan-2012 Li Zefan <lizf@cn.fujitsu.com> cgroup: remove cgroup_subsys argument from callbacks

The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.

Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().

So we reduce a few lines of code, though the shrinking of object size
is minimal.

16 files changed, 113 insertions(+), 162 deletions(-)

text data bss dec hex filename
5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig
5486170 656987 7039960 13183117 c9288d vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
61d1d219c4c0761059236a46867bc49943c4d29d 30-Jan-2012 Mandeep Singh Baines <msb@chromium.org> cgroup: remove extra calls to find_existing_css_set

In cgroup_attach_proc, we indirectly call find_existing_css_set 3
times. It is an expensive call so we want to call it a minimum
of times. This patch only calls it once and stores the result so
that it can be used later on when we call cgroup_task_migrate.

This required modifying cgroup_task_migrate to take the new css_set
(which we obtained from find_css_set) as a parameter. The nice side
effect of this is that cgroup_task_migrate is now identical for
cgroup_attach_task and cgroup_attach_proc. It also now returns a
void since it can never fail.

Changes in V5:
* https://lkml.org/lkml/2012/1/20/344 (Tejun Heo)
* Remove css_set_refs
Changes in V4:
* https://lkml.org/lkml/2011/12/22/421 (Li Zefan)
* Avoid GFP_KERNEL (sleep) in rcu_read_lock by getting css_set in
a separate loop not under an rcu_read_lock
Changes in V3:
* https://lkml.org/lkml/2011/12/22/13 (Li Zefan)
* Fixed earlier bug by creating a seperate patch to remove tasklist_lock
Changes in V2:
* https://lkml.org/lkml/2011/12/20/372 (Tejun Heo)
* Move find_css_set call into loop which creates the flex array
* Author
* Kill css_set_refs and use group_size instead
* Fix an off-by-one error in counting css_set refs
* Add a retval check in out_list_teardown

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
fb5d2b4cfc24963d0e8a7df57de1ecffa10a04cf 04-Jan-2012 Mandeep Singh Baines <msb@chromium.org> cgroup: replace tasklist_lock with rcu_read_lock

We can replace the tasklist_lock in cgroup_attach_proc with an
rcu_read_lock().

Changes in V4:
* https://lkml.org/lkml/2011/12/23/284 (Frederic Weisbecker)
* Minimize size of rcu_read_lock critical section
* Add comment
* https://lkml.org/lkml/2011/12/26/136 (Li Zefan)
* Split into two patches
Changes in V3:
* https://lkml.org/lkml/2011/12/22/419 (Frederic Weisbecker)
* Add an rcu_read_lock to protect against exit
Changes in V2:
* https://lkml.org/lkml/2011/12/22/86 (Tejun Heo)
* Use a goto instead of returning -EAGAIN

Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
b78949ebfb563c29808a9d0a772e3adb5561bc80 04-Jan-2012 Mandeep Singh Baines <msb@chromium.org> cgroup: simplify double-check locking in cgroup_attach_proc

To keep the complexity of the double-check locking in one place, move
the thread_group_leader check up into attach_task_by_pid(). This
allows us to use a goto instead of returning -EAGAIN.

While at it, convert a couple of returns to gotos and use rcu for the
!pid case also in order to simplify the logic.

Changes in V2:
* https://lkml.org/lkml/2011/12/22/86 (Tejun Heo)
* Use a goto instead of returning -EAGAIN

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
245282557c49754af3dbcc732316e814340d6bce 20-Jan-2012 Li Zefan <lizf@cn.fujitsu.com> cgroup: move struct cgroup_pidlist out from the header file

It's internally used only.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
34c80b1d93e6e20ca9dea0baf583a5b5510d92d4 09-Dec-2011 Al Viro <viro@zeniv.linux.org.uk> vfs: switch ->show_options() to struct dentry *

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
0d19ea866562e46989412a0676412fa0983c9ce7 27-Dec-2011 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix to allow mounting a hierarchy by name

If we mount a hierarchy with a specified name, the name is unique,
and we can use it to mount the hierarchy without specifying its
set of subsystem names. This feature is documented is
Documentation/cgroups/cgroups.txt section 2.3

Here's an example:

# mount -t cgroup -o cpuset,name=myhier xxx /cgroup1
# mount -t cgroup -o name=myhier xxx /cgroup2

But it was broken by commit 32a8cf235e2f192eb002755076994525cdbaa35a
(cgroup: make the mount options parsing more accurate)

This fixes the regression.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
305f3c8b20ba1ca94829329acdbf22e689304dba 04-Jan-2012 Dan Carpenter <dan.carpenter@oracle.com> cgroup: move assignement out of condition in cgroup_attach_proc()

Gcc complains about this: "kernel/cgroup.c:2179:4: warning: suggest
parentheses around assignment used as truth value [-Wparentheses]"

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
a5e7ed3287e45f2eafbcf9e7e6fdc5a0191acf40 26-Jul-2011 Al Viro <viro@zeniv.linux.org.uk> cgroup: propagate mode_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
18bb1db3e7607e4a997d50991a6f9fa5b0f8722c 26-Jul-2011 Al Viro <viro@zeniv.linux.org.uk> switch vfs_mkdir() and ->mkdir() to umode_t

vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
7e3aa30ac8c904a706518b725c451bb486daaae9 23-Dec-2011 Frederic Weisbecker <fweisbec@gmail.com> cgroup: Remove task_lock() from cgroup_post_fork()

cgroup_post_fork() is protected between threadgroup_change_begin()
and threadgroup_change_end() against concurrent changes of the
child's css_set in cgroup_task_migrate(). Also the child can't
exit and call cgroup_exit() at this stage, this means it's css_set
can't be changed with init_css_set concurrently.

For these reasons, we don't need to hold task_lock() on the child
because it's css_set can only remain stable in this place.

Let's remove the lock there.

v2: Update comment to explain that we are safe against
cgroup_exit()

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
c6ca57500c23d57a4ccec9874b6a3c99c297e1b5 27-Dec-2011 Kirill A. Shutemov <kirill@shutemov.name> cgroup: add sparse annotation to cgroup_iter_start() and cgroup_iter_end()

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
1c6c3fad81787e8cb4c85ddfd573b0d8442fe630 27-Dec-2011 Kirill A. Shutemov <kirill@shutemov.name> cgroup: mark cgroup_rmdir_waitq and cgroup_attach_proc() as static

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
892a2b90ba15cb7dbee40979f23fdb492913abf8 22-Dec-2011 Mandeep Singh Baines <msb@chromium.org> cgroup: only need to check oldcgrp==newgrp once

In cgroup_attach_proc it is now sufficient to only check that
oldcgrp==newcgrp once. Now that we are using threadgroup_lock()
during the migrations, oldcgrp will not change.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
b07ef7741122a83575499c11417e514877941e76 22-Dec-2011 Mandeep Singh Baines <msb@chromium.org> cgroup: remove redundant get/put of task struct

threadgroup_lock() guarantees that the target threadgroup will
remain stable - no new task will be added, no new PF_EXITING
will be set and exec won't happen.

Changes in V2:
* https://lkml.org/lkml/2011/12/20/369 (Tejun Heo)
* Undo incorrect removal of get/put from attach_task_by_pid()
* Author
* Remove a comment which is made stale by this change

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
026085ef5ae07c3197f2baacc091ce067b86ed11 22-Dec-2011 Mandeep Singh Baines <msb@chromium.org> cgroup: remove redundant get/put of old css_set from migrate

We can now assume that the css_set reference held by the task
will not go away for an exiting task. PF_EXITING state can be
trusted throughout migration by checking it after locking
threadgroup.

Changes in V4:
* https://lkml.org/lkml/2011/12/20/368 (Tejun Heo)
* Fix typo in commit message
* Undid the rename of css_set_check_fetched
* https://lkml.org/lkml/2011/12/20/427 (Li Zefan)
* Fix comment in cgroup_task_migrate()
Changes in V3:
* https://lkml.org/lkml/2011/12/20/255 (Frederic Weisbecker)
* Fixed to put error in retval
Changes in V2:
* https://lkml.org/lkml/2011/12/19/289 (Tejun Heo)
* Updated commit message

-tj: removed stale patch description about dropped function rename.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
c84cdf75ccb2845f690579e838f13f7e744e3d23 21-Dec-2011 Frederic Weisbecker <fweisbec@gmail.com> cgroup: Remove unnecessary task_lock before fetching css_set on migration

When we fetch the css_set of the tasks on cgroup migration, we don't need
anymore to synchronize against cgroup_exit() that could swap the old one
with init_css_set. Now that we are using threadgroup_lock() during
the migrations, we don't need to worry about it anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
7e381b0eb1e1a9805c37335562e8dc02e7d7848c 21-Dec-2011 Frederic Weisbecker <fweisbec@gmail.com> cgroup: Drop task_lock(parent) on cgroup_fork()

We don't need to hold the parent task_lock() on the
parent in cgroup_fork() because we are already synchronized
against the two places that may change the parent css_set
concurrently:

- cgroup_exit(), but the parent obviously can't exit concurrently
- cgroup migration: we are synchronized against threadgroup_lock()

So we can safely remove the task_lock() there.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Containers <containers@lists.linux-foundation.org>
Cc: Cgroups <cgroups@vger.kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
29e21368b9baf9c4b25060d65062da2dda926c70 15-Dec-2011 Mandeep Singh Baines <msb@chromium.org> cgroups: remove redundant get/put of css_set from css_set_check_fetched()

We already have a reference to all elements in newcg_list.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: Paul Menage <paul@paulmenage.org>
e0197aae59e55c06db172bfbe1a1cdb8c0e1cab3 15-Dec-2011 Mandeep Singh Baines <msb@chromium.org> cgroups: fix a css_set not found bug in cgroup_attach_proc

There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch()
is not called for the PF_EXITING case, find_existing_css_set() will return
NULL inside cgroup_task_migrate() causing a BUG.

This bug is easy to reproduce. Create a zombie and echo its pid to
cgroup.procs.

$ cat zombie.c
\#include <unistd.h>

int main()
{
if (fork())
pause();
return 0;
}
$

We are hitting this bug pretty regularly on ChromeOS.

This bug is already fixed by Tejun Heo's cgroup patchset which is
targetted for the next merge window:

https://lkml.org/lkml/2011/11/1/356

I've create a smaller patch here which just fixes this bug so that a
fix can be merged into the current release and stable.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Downstream-Bug-Report: http://crosbug.com/23953
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: stable@kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Olof Johansson <olofj@chromium.org>
494c167cf76d02000adf740c215adc69a824ecc9 13-Dec-2011 Tejun Heo <tj@kernel.org> cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()

These three methods are no longer used. Kill them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
2f7ee5691eecb67c8108b92001a85563ea336ac5 13-Dec-2011 Tejun Heo <tj@kernel.org> cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()

Currently, there's no way to pass multiple tasks to cgroup_subsys
methods necessitating the need for separate per-process and per-task
methods. This patch introduces cgroup_taskset which can be used to
pass multiple tasks and their associated cgroups to cgroup_subsys
methods.

Three methods - can_attach(), cancel_attach() and attach() - are
converted to use cgroup_taskset. This unifies passed parameters so
that all methods have access to all information. Conversions in this
patchset are identical and don't introduce any behavior change.

-v2: documentation updated as per Paul Menage's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
134d33737f9015761c3832f6b268fae6274aac7f 13-Dec-2011 Tejun Heo <tj@kernel.org> cgroup: improve old cgroup handling in cgroup_attach_proc()

cgroup_attach_proc() behaves differently from cgroup_attach_task() in
the following aspects.

* All hooks are invoked even if no task is actually being moved.

* ->can_attach_task() is called for all tasks in the group whether the
new cgrp is different from the current cgrp or not; however,
->attach_task() is skipped if new equals new. This makes the calls
asymmetric.

This patch improves old cgroup handling in cgroup_attach_proc() by
looking up the current cgroup at the head, recording it in the flex
array along with the task itself, and using it to remove the above two
differences. This will also ease further changes.

-v2: nr_todo renamed to nr_migrating_tasks as per Paul Menage's
suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
cd3d095275374220921fcf0d4e0c16584b26ddbc 13-Dec-2011 Tejun Heo <tj@kernel.org> cgroup: always lock threadgroup during migration

Update cgroup to take advantage of the fack that threadgroup_lock()
guarantees stable threadgroup.

* Lock threadgroup even if the target is a single task. This
guarantees that when the target tasks stay stable during migration
regardless of the target type.

* Remove PF_EXITING early exit optimization from attach_task_by_pid()
and check it in cgroup_task_migrate() instead. The optimization was
for rather cold path to begin with and PF_EXITING state can be
trusted throughout migration by checking it after locking
threadgroup.

* Don't add PF_EXITING tasks to target task array in
cgroup_attach_proc(). This ensures that task migration is performed
only for live tasks.

* Remove -ESRCH failure path from cgroup_task_migrate(). With the
above changes, it's guaranteed to be called only for live tasks.

After the changes, only live tasks are migrated and they're guaranteed
to stay alive until migration is complete. This removes problems
caused by exec and exit racing against cgroup migration including
symmetry among cgroup attach methods and different cgroup methods
racing each other.

v2: Oleg pointed out that one more PF_EXITING check can be removed
from cgroup_attach_proc(). Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
257058ae2b971646b96ab3a15605ac69186e562a 13-Dec-2011 Tejun Heo <tj@kernel.org> threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem

Make the following renames to prepare for extension of threadgroup
locking.

* s/signal->threadgroup_fork_lock/signal->group_rwsem/
* s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
* s/threadgroup_fork_read_unlock()/threadgroup_change_end()/
* s/threadgroup_fork_write_lock()/threadgroup_lock()/
* s/threadgroup_fork_write_unlock()/threadgroup_unlock()/

This patch doesn't cause any behavior change.

-v2: Rename threadgroup_change_done() to threadgroup_change_end() per
KAMEZAWA's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
e25e2cbb4c6679bed5f52fb0f2cc381688297901 13-Dec-2011 Tejun Heo <tj@kernel.org> cgroup: add cgroup_root_mutex

cgroup wants to make threadgroup stable while modifying cgroup
hierarchies which will introduce locking dependency on
cred_guard_mutex from cgroup_mutex. This unfortunately completes
circular dependency.

A. cgroup_mutex -> cred_guard_mutex -> s_type->i_mutex_key -> namespace_sem
B. namespace_sem -> cgroup_mutex

B is from cgroup_show_options() and this patch breaks it by
introducing another mutex cgroup_root_mutex which nests inside
cgroup_mutex and protects cgroupfs_root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
c1e2ee2dc436574880758b3836fc96935b774c32 02-Nov-2011 Andrew Bresticker <abrestic@google.com> memcg: replace ss->id_lock with a rwlock

While back-porting Johannes Weiner's patch "mm: memcg-aware global
reclaim" for an internal effort, we noticed a significant performance
regression during page-reclaim heavy workloads due to high contention of
the ss->id_lock. This lock protects idr map, and serializes calls to
idr_get_next() in css_get_next() (which is used during the memcg hierarchy
walk).

Since idr_get_next() is just doing a look up, we need only serialize it
with respect to idr_remove()/idr_get_new(). By making the ss->id_lock a
rwlock, contention is greatly reduced and performance improves.

Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
each core (one file + container per core) in parallel on a NUMA machine.
Result is the time for the test to complete in 1 of the containers.
Both kernels included Johannes' memcg-aware global reclaim patches.

Before rwlock patch: 1710.778s
After rwlock patch: 152.227s

Signed-off-by: Andrew Bresticker <abrestic@google.com>
Cc: Paul Menage <menage@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
77ceab8ea590d7dc6c8f055ce43dfebd74428107 02-Nov-2011 Ben Blum <bblum@andrew.cmu.edu> cgroups: don't attach task to subsystem if migration failed

If a task has exited to the point it has called cgroup_exit() already,
then we can't migrate it to another cgroup anymore.

This can happen when we are attaching a task to a new cgroup between the
call to ->can_attach_task() on subsystems and the migration that is
eventually tried in cgroup_task_migrate().

In this case cgroup_task_migrate() returns -ESRCH and we don't want to
attach the task to the subsystems because the attachment to the new cgroup
itself failed.

Fix this by only calling ->attach_task() on the subsystems if the cgroup
migration succeeded.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
33ef6b6984403a688189317ef46bb3caab3b70e0 02-Nov-2011 Ben Blum <bblum@andrew.cmu.edu> cgroups: more safe tasklist locking in cgroup_attach_proc

Fix unstable tasklist locking in cgroup_attach_proc.

According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
not sufficient to guarantee the tasklist is stable w.r.t. de_thread and
exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures
proper exclusion.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cdcc136ffd264849a943acb42c36ffe9b458f811 25-Jul-2009 Thomas Gleixner <tglx@linutronix.de> locking, sched, cgroups: Annotate release_list_lock as raw

The release_list_lock can be taken in atomic context and therefore
cannot be preempted on -rt - annotate it.

In mainline this change documents the low level nature of
the lock - otherwise there's no functional difference. Lockdep
and Sparse checking will work as usual.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
60063497a95e716c9a689af3be2687d261f115b4 27-Jul-2011 Arun Sharma <asharma@fb.com> atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3bfa784a6539f91a27d7ffdd408efdb638e3bebd 19-Jun-2011 Al Viro <viro@zeniv.linux.org.uk> kill file_permission() completely

convert the last remaining caller to inode_permission()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d8bf4ca9ca9576548628344c9725edd3786e90b1 08-Jul-2011 Michal Hocko <mhocko@suse.cz> rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check

Since ca5ecddf (rcu: define __rcu address space modifier for sparse)
rcu_dereference_check use rcu_read_lock_held as a part of condition
automatically so callers do not have to do that as well.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2ce9738bac1b386f46e8478fd2c263460e7c2b09 02-Jun-2011 eparis@redhat <eparis@redhat> cgroupfs: use init_cred when populating new cgroupfs mount

We recently found that in some configurations SELinux was blocking the ability
for cgroupfs to be mounted. The reason for this is because cgroupfs creates
files and directories during the get_sb() call and also uses lookup_one_len()
during that same get_sb() call. This is a problem since the security
subsystem cannot initialize the superblock and the inodes in that filesystem
until after the get_sb() call returns. Thus we leave the inodes in
an unitialized state during get_sb(). For the vast majority of filesystems
this is not an issue, but since cgroupfs uses lookup_on_len() it does
search permission checks on the directories in the path it walks. Since the
inode security state is not set up SELinux does these checks as if the inodes
were 'unlabeled.'

Many 'normal' userspace process do not have permission to interact with
unlabeled inodes. The solution presented here is to do the permission checks
of path walk and inode creation as the kernel rather than as the task that
called mount. Since the kernel has permission to read/write/create
unlabeled inodes the get_sb() call will complete successfully and the SELinux
code will be able to initialize the superblock and those inodes created during
the get_sb() call.

This appears to be the same solution used by other filesystems such as devtmpfs
to solve the same issue and should thus have no negative impact on other LSMs
which currently work.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: James Morris <jmorris@namei.org>
a77aea92010acf54ad785047234418d5d68772e2 27-May-2011 Daniel Lezcano <daniel.lezcano@free.fr> cgroup: remove the ns_cgroup

The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
leads to some problems:

* cgroup creation is out-of-control
* cgroup name can conflict when pids are looping
* it is not possible to have a single process handling a lot of
namespaces without falling in a exponential creation time
* we may want to create a namespace without creating a cgroup

The ns_cgroup was replaced by a compatibility flag 'clone_children',
where a newly created cgroup will copy the parent cgroup values.
The userspace has to manually create a cgroup and add a task to
the 'tasks' file.

This patch removes the ns_cgroup as suggested in the following thread:

https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

The 'cgroup_clone' function is removed because it is no longer used.

This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
printk warning users that the feature is planned for removal. Since that
time we have heard from XXX users who were affected by this.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d846687d7f84e45f23ecf3846dbb43312a1206dd 27-May-2011 Ben Blum <bblum@andrew.cmu.edu> cgroups: use flex_array in attach_proc

Convert cgroup_attach_proc to use flex_array.

The cgroup_attach_proc implementation requires a pre-allocated array to
store task pointers to atomically move a thread-group, but asking for a
monolithic array with kmalloc() may be unreliable for very large groups.
Using flex_array provides the same functionality with less risk of
failure.

This is a post-patch for cgroup-procs-write.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
74a1166dfe1135dcc168d35fa5261aa7e087011b 27-May-2011 Ben Blum <bblum@andrew.cmu.edu> cgroups: make procs file writable

Make procs file writable to move all threads by tgid at once.

Add functionality that enables users to move all threads in a threadgroup
at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
current implementation makes use of a per-threadgroup rwsem that's taken
for reading in the fork() path to prevent newly forking threads within the
threadgroup from "escaping" while the move is in progress.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f780bdb7c1c73009cb57adcf99ef50027d80bf3c 27-May-2011 Ben Blum <bblum@andrew.cmu.edu> cgroups: add per-thread subsystem callbacks

Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
for cgroups's subsystem interface. Unlike can_attach and attach, these
are for per-thread operations, to be called potentially many times when
attaching an entire threadgroup.

Also, the old "bool threadgroup" interface is removed, as replaced by
this. All subsystems are modified for the new interface - of note is
cpuset, which requires from/to nodemasks for attach to be globally scoped
(though per-cpuset would work too) to persist from its pre_attach to
attach_task and attach.

This is a pre-patch for cgroup-procs-writable.patch.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
025cea99db3fb110ebc8ede5ff647833fab9574f 15-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com> cgroup,rcu: convert call_rcu(__free_css_id_cb) to kfree_rcu()

The rcu callback __free_css_id_cb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(__free_css_id_cb).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
f2da1c40dc003939f616f27a103b2592f1424b07 15-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com> cgroup,rcu: convert call_rcu(free_cgroup_rcu) to kfree_rcu()

The rcu callback free_cgroup_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_cgroup_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
30088ad815802f850f26114920ccf9effd4bc520 15-Mar-2011 Lai Jiangshan <laijs@cn.fujitsu.com> cgroup,rcu: convert call_rcu(free_css_set_rcu) to kfree_rcu()

The rcu callback free_css_set_rcu() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(free_css_set_rcu).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
25985edcedea6396277003854657b5f3cb31a628 31-Mar-2011 Lucas De Marchi <lucas.demarchi@profusion.mobi> Fix common misspellings

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
8d2587970b8bdf7c8d9208e3f4bb93182aef1a0f 23-Mar-2011 Phil Carmody <ext-phil.2.carmody@nokia.com> cgroups: if you list_empty() a head then don't list_del() it

list_del() leaves poison in the prev and next pointers. The next
list_empty() will compare those poisons, and say the list isn't empty.
Any list operations that assume the node is on a list because of such a
check will be fooled into dereferencing poison. One needs to INIT the
node after the del, and fortunately there's already a wrapper for that -
list_del_init().

Some of the dels are followed by deallocations, so can be ignored, and one
can be merged with an add to make a move. Apart from that, I erred on the
side of caution in making nodes list_empty()-queriable.

Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e5d1367f17ba6a6fed5fd8b74e4d5720923e0c25 14-Feb-2011 Stephane Eranian <eranian@google.com> perf: Add cgroup support

This kernel patch adds the ability to filter monitoring based on
container groups (cgroups). This is for use in per-cpu mode only.

The cgroup to monitor is passed as a file descriptor in the pid
argument to the syscall. The file descriptor must be opened to
the cgroup name in the cgroup filesystem. For instance, if the
cgroup name is foo and cgroupfs is mounted in /cgroup, then the
file descriptor is opened to /cgroup/foo. Cgroup mode is
activated by passing PERF_FLAG_PID_CGROUP in the flags argument
to the syscall.

For instance to measure in cgroup foo on CPU1 assuming
cgroupfs is mounted under /cgroup:

struct perf_event_attr attr;
int cgroup_fd, fd;

cgroup_fd = open("/cgroup/foo", O_RDONLY);
fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP);
close(cgroup_fd);

Signed-off-by: Stephane Eranian <eranian@google.com>
[ added perf_cgroup_{exit,attach} ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
d41d5a01631af821d3a3447e6613a316f5ee6c25 07-Feb-2011 Peter Zijlstra <a.p.zijlstra@chello.nl> cgroup: Fix cgroup_subsys::exit callback

Make the ::exit method act like ::attach, it is after all very nearly
the same thing.

The bug had no effect on correctness - fixing it is an optimization for
the scheduler. Also, later perf-cgroups patches rely on it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Menage <menage@google.com>
LKML-Reference: <1297160655.13327.92.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
c72a04e34735ec3f19f4788b7f95017310b5e1eb 14-Jan-2011 Al Viro <viro@ZenIV.linux.org.uk> cgroup_fs: fix cgroup use of simple_lookup()

cgroup can't use simple_lookup(), since that'd override its desired ->d_op.

Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3ec762ad8be364c2fadfe0d6b2cc6d4d3b5e1b54 14-Jan-2011 Li Zefan <lizf@cn.fujitsu.com> cgroups: Fix a lockdep warning at cgroup removal

Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry
lock, which caused a lockdep warning.

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
0df6a63f8735a7c8a877878bc215d4312e41ef81 21-Dec-2010 Al Viro <viro@zeniv.linux.org.uk> switch cgroup

switching it to s_d_op allows to kill the cgroup_lookup() kludge.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fb045adb99d9b7c562dc7fef834857f78249daa1 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache reduce branches in lookup path

Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.

Patched with:

git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
dc0474be3e27463d4d4a2793f82366eed906f223 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache rationalise dget variants

dget_locked was a shortcut to avoid the lazy lru manipulation when we already
held dcache_lock (lru manipulation was relatively cheap at that point).
However, how that the lru lock is an innermost one, we never hold it at any
caller, so the lock cost can now be avoided. We already have well working lazy
dcache LRU, so it should be fine to defer LRU manipulations to scan time.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
b5c84bf6f6fa3a7dfdcb556023a62953574b60ee 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache remove dcache_lock

dcache_lock no longer protects anything. remove it.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2fd6b7f50797f2e993eea59e0a0b8c6399c811dc 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache scale subdirs

Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).

Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
b7ab39f631f505edc2bbdb86620d5493f995c9da 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: dcache scale dentry refcount

Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
fe15ce446beb3a33583af81ffe6c9d01a75314ed 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> fs: change d_delete semantics

Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.

This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
5adcee1d8d32d7f305f6f5aaefdbf8f35adca177 07-Jan-2011 Nick Piggin <npiggin@kernel.dk> cgroup fs: avoid switching ->d_op on live dentry

Switching d_op on a live dentry is racy in general, so avoid it. In this case
it is a negative dentry, which is safer, but there are still concurrent ops
which may be called on d_op in that case (eg. d_revalidate). So in general
a filesystem may not do this. Fix cgroupfs so as not to do this.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
f7e835710ab5f6e43933c983f38f2d2e262b718c 26-Jul-2010 Al Viro <viro@zeniv.linux.org.uk> convert cgroup and cpuset

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
f4a2589feaef0a9b737a3e582b37ee96695bb25f 28-Oct-2010 Evgeny Kuznetsov <ext-eugeny.kuznetsov@nokia.com> cgroups: add check for strcpy destination string overflow

Function "strcpy" is used without check for maximum allowed source string
length and could cause destination string overflow. Check for string
length is added before using "strcpy". Function now is return error if
source string length is more than a maximum.

akpm: presently considered NotABug, but add the check for general
future-safeness and robustness.

Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
32a8cf235e2f192eb002755076994525cdbaa35a 28-Oct-2010 Daniel Lezcano <daniel.lezcano@free.fr> cgroup: make the mount options parsing more accurate

Current behavior:
=================

(1) When we mount a cgroup, we can specify the 'all' option which
means to enable all the cgroup subsystems. This is the default option
when no option is specified.

(2) If we want to mount a cgroup with a subset of the supported cgroup
subsystems, we have to specify a subsystems name list for the mount
option.

(3) If we specify another option like 'noprefix' or 'release_agent',
the actual code wants the 'all' or a subsystem name option specified
also. Not critical but a bit not friendly as we should assume (1) in
this case.

(4) Logically, the 'all' option is mutually exclusive with a subsystem
name, but this is not detected.

In other words:
succeed : mount -t cgroup -o all,freezer cgroup /cgroup
=> is it 'all' or 'freezer' ?
fails : mount -t cgroup -o noprefix cgroup /cgroup
=> succeed if we do '-o noprefix,all'

The following patches consolidate a bit the mount options check.

New behavior:
=============

(1) untouched
(2) untouched
(3) the 'all' option will be by default when specifying other than
a subsystem name option
(4) raises an error

In other words:
fails : mount -t cgroup -o all,freezer cgroup /cgroup
succeed : mount -t cgroup -o noprefix cgroup /cgroup

For the sake of lisibility, the if ... then ... else ... if ...
indentation when parsing the options has been changed to:
if ... then
...
continue
fi

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
97978e6d1f2da0073416870410459694fbdbfd9b 28-Oct-2010 Daniel Lezcano <daniel.lezcano@free.fr> cgroup: add clone_children control file

The ns_cgroup is a control group interacting with the namespaces. When a
new namespace is created, a corresponding cgroup is automatically created
too. The cgroup name is the pid of the process who did 'unshare' or the
child of 'clone'.

This cgroup is tied with the namespace because it prevents a process to
escape the control group and use the post_clone callback, so the child
cgroup inherits the values of the parent cgroup.

Unfortunately, the more we use this cgroup and the more we are facing
problems with it:

(1) when a process unshares, the cgroup name may conflict with a
previous cgroup with the same pid, so unshare or clone return -EEXIST

(2) the cgroup creation is out of control because there may have an
application creating several namespaces where the system will
automatically create several cgroups in his back and let them on the
cgroupfs (eg. a vrf based on the network namespace).

(3) the mix of (1) and (2) force an administrator to regularly check
and clean these cgroups.

This patchset removes the ns_cgroup by adding a new flag to the cgroup and
the cgroupfs mount option. It enables the copy of the parent cgroup when
a child cgroup is created. We can then safely remove the ns_cgroup as
this flag brings a compatibility. We have now to manually create and add
the task to a cgroup, which is consistent with the cgroup framework.

This patch:

Sent as an answer to a previous thread around the ns_cgroup.

https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html

It adds a control file 'clone_children' for a cgroup. This control file
is a boolean specifying if the child cgroup should be a clone of the
parent cgroup or not. The default value is 'false'.

This flag makes the child cgroup to call the post_clone callback of all
the subsystem, if it is available.

At present, the cpuset is the only one which had implemented the
post_clone callback.

The option can be set at mount time by specifying the 'clone_children'
mount option.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jamal Hadi Salim <hadi@cyberus.ca>
Cc: Matt Helsley <matthltc@us.ibm.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
85fe4025c616a7c0ed07bc2fc8c5371b07f3888c 23-Oct-2010 Christoph Hellwig <hch@lst.de> fs: do not assign default i_ino in new_inode

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves. For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
38d018dba3f725b969f196550d92a6ec1c092428 24-Feb-2010 Jan Blunck <jblunck@infradead.org> BKL: Remove BKL from cgroup

The BKL is only used in remount_fs and get_sb that are both protected by
the superblocks s_umount rw_semaphore. Therefore it is safe to remove the
BKL entirely.

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
db71922217a214e5c9268448e537b54fc1f301ea 15-Aug-2010 Jan Blunck <jblunck@infradead.org> BKL: Explicitly add BKL around get_sb/fill_super

This patch is a preparation necessary to remove the BKL from do_new_mount().
It explicitly adds calls to lock_kernel()/unlock_kernel() around
get_sb/fill_super operations for filesystems that still uses the BKL.

I've read through all the code formerly covered by the BKL inside
do_kern_mount() and have satisfied myself that it doesn't need the BKL
any more.

do_kern_mount() is already called without the BKL when mounting the rootfs
and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
afs_mntpt_follow_link(). Both later functions are actually the filesystems
follow_link inode operation. vfs_kern_mount() is calling the specified
get_sb function and lets the filesystem do its job by calling the given
fill_super function.

Therefore I think it is safe to push down the BKL from the VFS to the
low-level filesystems get_sb/fill_super operation.

[arnd: do not add the BKL to those file systems that already
don't use it elsewhere]

Signed-off-by: Jan Blunck <jblunck@infradead.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
31583bb0cf6cc40f2a468a4d2f3b9cbefd24f891 10-Sep-2010 Michael S. Tsirkin <mst@redhat.com> cgroups: fix API thinko

Add cgroup_attach_task_all()

The existing cgroup_attach_task_current_cg() API is called by a thread to
attach another thread to all of its cgroups; this is unsuitable for cases
where a privileged task wants to attach itself to the cgroups of a less
privileged one, since the call must be made from the context of the target
task.

This patch adds a more generic cgroup_attach_task_all() API that allows
both the source task and to-be-moved task to be specified.
cgroup_attach_task_current_cg() becomes a specialization of the more
generic new function.

[menage@google.com: rewrote changelog]
[akpm@linux-foundation.org: address reviewer comments]
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ben Blum <bblum@google.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
73457f0f836956747e0394320be2163c050e96ef 06-Aug-2010 Michael S. Tsirkin <mst@redhat.com> cgroups: fix API thinko

cgroup_attach_task_current_cg API that have upstream is backwards: we
really need an API to attach to the cgroups from another process A to
the current one.

In our case (vhost), a priveledged user wants to attach it's task to cgroups
from a less priveledged one, the API makes us run it in the other
task's context, and this fails.

So let's make the API generic and just pass in 'from' and 'to' tasks.
Add an inline wrapper for cgroup_attach_task_current_cg to avoid
breaking bisect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
2c392b8c3450ceb69ba1b93cb0cddb3998fb8cdc 24-Feb-2010 Arnd Bergmann <arnd@arndb.de> cgroups: __rcu annotations

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
e400c28524af2d344b1663b27bf28984fa959a0e 11-Aug-2010 Dan Carpenter <error27@gmail.com> cgroups: save space for the terminator

The original code didn't leave enough space for a NULL terminator. These
strings are copied with strcpy() into fixed length buffers in
cgroup_root_from_opts().

Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ben Blum <bblum@andrew.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
676db4af043014e852f67ba0349dae0071bd11f3 05-Aug-2010 Greg KH <gregkh@suse.de> cgroupfs: create /sys/fs/cgroup to mount cgroupfs on

We really shouldn't be asking userspace to create new root filesystems.
So follow along with all of the other in-kernel filesystems, and provide
a mount point in sysfs.

For cgroupfs, this should be in /sys/fs/cgroup/ This change provides
that mount point when the cgroup filesystem is registered in the kernel.

Acked-by: Paul Menage <menage@google.com>
Acked-by: Dhaval Giani <dhaval.giani@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
d7926ee38f5c6e0bbebe712304f99a4c67e40f84 30-May-2010 Sridhar Samudrala <samudrala.sridhar@gmail.com> cgroups: Add an API to attach a task to current task's cgroup

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
94b3dd0f7bb393d93e84a173b1df9b8b64c83ac4 04-Jun-2010 Greg Thelen <gthelen@google.com> cgroups: alloc_css_id() increments hierarchy depth

Child groups should have a greater depth than their parents. Prior to
this change, the parent would incorrectly report zero memory usage for
child cgroups when use_hierarchy is enabled.

test script:
mount -t cgroup none /cgroups -o memory
cd /cgroups
mkdir cg1

echo 1 > cg1/memory.use_hierarchy
mkdir cg1/cg11

echo $$ > cg1/cg11/tasks
dd if=/dev/zero of=/tmp/foo bs=1M count=1

echo
echo CHILD
grep cache cg1/cg11/memory.stat

echo
echo PARENT
grep cache cg1/memory.stat

echo $$ > tasks
rmdir cg1/cg11 cg1
cd /
umount /cgroups

Using fae9c79, a recent patch that changed alloc_css_id() depth computation,
the parent incorrectly reports zero usage:
root@ubuntu:~# ./test
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s

CHILD
cache 1048576
total_cache 1048576

PARENT
cache 0
total_cache 0

With this patch, the parent correctly includes child usage:
root@ubuntu:~# ./test
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s

CHILD
cache 1052672
total_cache 1052672

PARENT
cache 0
total_cache 1052672

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: <stable@kernel.org> [2.6.34.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
907860ed381a31b0102f362df67c1c5cae6ef050 26-May-2010 Kirill A. Shutemov <kirill@shutemov.name> cgroups: make cftype.unregister_event() void-returning

Since we are unable to handle an error returned by
cftype.unregister_event() properly, let's make the callback
void-returning.

mem_cgroup_unregister_event() has been rewritten to be a "never fail"
function. On mem_cgroup_usage_register_event() we save old buffer for
thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
avoid allocation.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
747388d78a0ae768fd82b55c4ed38aa646a72364 11-May-2010 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> memcg: fix css_is_ancestor() RCU locking

Some callers (in memcontrol.c) calls css_is_ancestor() without
rcu_read_lock. Because css_is_ancestor() has to access RCU protected
data, it should be under rcu_read_lock().

This makes css_is_ancestor() itself does safe access to RCU protected
area. (At least, "root" can have refcnt==0 if it's not an ancestor of
"child". So, we need rcu_read_lock().)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7f0f15464185a92f9d8791ad231bcd7bf6df54e4 11-May-2010 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> memcg: fix css_id() RCU locking for real

Commit ad4ba375373937817404fd92239ef4cadbded23b ("memcg: css_id() must be
called under rcu_read_lock()") modifies memcontol.c for fixing RCU check
message. But Andrew Morton pointed out that the fix doesn't seems sane
and it was just for hidining lockdep messages.

This is a patch for do proper things. Checking again, all places,
accessing without rcu_read_lock, that commit fixies was intentional....
all callers of css_id() has reference count on it. So, it's not necessary
to be under rcu_read_lock().

Considering again, we can use rcu_dereference_check for css_id(). We know
css->id is valid if css->refcnt > 0. (css->id never changes and freed
after css->refcnt going to be 0.)

This patch makes use of rcu_dereference_check() in css_id/depth and remove
unnecessary rcu-read-lock added by the commit.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a93d2f1744206827ccf416e2cdc5018aa503314e 07-May-2010 Changli Gao <xiaosuo@gmail.com> sched, wait: Use wrapper functions

epoll should not touch flags in wait_queue_t. This patch introduces a new
function __add_wait_queue_exclusive(), for the users, who use wait queue as a
LIFO queue.

__add_wait_queue_tail_exclusive() is introduced too instead of
add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
it is a duplicate of __remove_wait_queue(), disliked by users, and with less
users.

Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <containers@lists.linux-foundation.org>
LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fae9c791703606636c1220e47f6690660042ce7f 22-Apr-2010 Li Zefan <lizf@cn.fujitsu.com> cgroup: Fix an RCU warning in alloc_css_id()

With CONFIG_PROVE_RCU=y, a warning can be triggered:

# mount -t cgroup -o memory xxx /mnt
# mkdir /mnt/0

...
kernel/cgroup.c:4442 invoked rcu_dereference_check() without protection!
...

This is a false-positive. It's safe to directly access parent_css->id.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
9a9686b634acc5cb6b7c601c171ae64af0318a24 22-Apr-2010 Li Zefan <lizf@cn.fujitsu.com> cgroup: Fix an RCU warning in cgroup_path()

with CONFIG_PROVE_RCU=y, a warning can be triggered:

# mount -t cgroup -o debug xxx /mnt
# cat /proc/$$/cgroup

...
kernel/cgroup.c:1649 invoked rcu_dereference_check() without protection!
...

This is a false-positive, because cgroup_path() can be called
with either rcu_read_lock() held or cgroup_mutex held.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
9d34706f42f9b8c15185423d9af98d37ba21d011 23-Mar-2010 Li Zefan <lizf@cn.fujitsu.com> cgroups: remove duplicate include

commit e6a1105b ("cgroups: subsystem module loading interface") and commit
c50cc752 ("sched, cgroups: Fix module export") result in duplicate
including of module.h

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
88393161210493e317ae391696ee8ef463cb3c23 16-Mar-2010 Thomas Weber <swirl@gmx.li> Fix typos in comments

[Ss]ytem => [Ss]ystem
udpate => update
paramters => parameters
orginal => original

Signed-off-by: Thomas Weber <swirl@gmx.li>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
a0a4db548edcce067c1201ef25cf2bc29f32dca4 11-Mar-2010 Kirill A. Shutemov <kirill@shutemov.name> cgroups: remove events before destroying subsystem state objects

Events should be removed after rmdir of cgroup directory, but before
destroying subsystem state objects. Let's take reference to cgroup
directory dentry to do that.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4ab78683c17d739c2a2077141dcf81a02b7fb57e 11-Mar-2010 Kirill A. Shutemov <kirill@shutemov.name> cgroups: fix race between userspace and kernelspace

Notify userspace about cgroup removing only after rmdir of cgroup
directory to avoid race between userspace and kernelspace.

eventfd are used to notify about two types of event:
- control file-specific, like crossing memory threshold;
- cgroup removing.

To understand what really happen, userspace can check if the cgroup still
exists. To avoid race beetween userspace and kernelspace we have to
notify userspace about cgroup removing only after rmdir of cgroup
directory.

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0dea116876eefc9c7ca9c5d74fe665481e499fa3 11-Mar-2010 Kirill A. Shutemov <kirill@shutemov.name> cgroup: implement eventfd-based generic API for notifications

This patchset introduces eventfd-based API for notifications in cgroups
and implements memory notifications on top of it.

It uses statistics in memory controler to track memory usage.

Output of time(1) on building kernel on tmpfs:

Root cgroup before changes:
make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes:
make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds):
make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds):
make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 thresholds, never crossed):
make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 thresholds, never crossed):
make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total

This patch:

Introduce the write-only file "cgroup.event_control" in every cgroup.

To register new notification handler you need:
- create an eventfd;
- open a control file to be monitored. Callbacks register_event() and
unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
Interpretation of args is defined by control file implementation;

eventfd will be woken up by control file implementation or when the
cgroup is removed.

To unregister notification handler just close eventfd.

If you need notification functionality for a control file you have to
implement callbacks register_event() and unregister_event() in the
struct cftype.

[kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b70cc5fdb445a6929a01e9c406593265b136c99d 11-Mar-2010 Li Zefan <lizf@cn.fujitsu.com> cgroups: clean up cgroup_pidlist_find() a bit

Don't call get_pid_ns() before we locate/alloc the ns.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
67523c48aa74d5637848edeccf285af1c60bf14a 11-Mar-2010 Ben Blum <bblum@andrew.cmu.edu> cgroups: blkio subsystem as module

Modify the Block I/O cgroup subsystem to be able to be built as a module.
As the CFQ disk scheduler optionally depends on blk-cgroup, config options
in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
enhanced to support the new module dependency.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cf5d5941fda647fe3d2f2d00cf9e0245236a5f08 11-Mar-2010 Ben Blum <bblum@andrew.cmu.edu> cgroups: subsystem module unloading

Provides support for unloading modular subsystems.

This patch adds a new function cgroup_unload_subsys which is to be used
for removing a loaded subsystem during module deletion. Reference
counting of the subsystems' modules is moved from once (at load time) to
once per attached hierarchy (in parse_cgroupfs_options and
rebind_subsystems) (i.e., 0 or 1).

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e6a1105ba08b265023dd71a4174fb4a29ebc7083 11-Mar-2010 Ben Blum <bblum@andrew.cmu.edu> cgroups: subsystem module loading interface

Add interface between cgroups subsystem management and module loading

This patch implements rudimentary module-loading support for cgroups -
namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
module initcall, and a struct module pointer in struct cgroup_subsys.

Several functions that might be wanted by modules have had EXPORT_SYMBOL
added to them, but it's unclear exactly which functions want it and which
won't.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aae8aab40367036931608fdaf9e2dc568b516f19 11-Mar-2010 Ben Blum <bblum@andrew.cmu.edu> cgroups: revamp subsys array

This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree. This is
mainly useful for classifiers and subsystems that hook into components
that are already modules. cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear. Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.

Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves. In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy). The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored. Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.

This patch:

Make subsys[] able to be dynamically populated to support modular
subsystems

This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.

Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d7b9fff711d5e8db8c844161c684017e556c38a0 11-Mar-2010 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> cgroup: introduce coalesce css_get() and css_put()

Current css_get() and css_put() increment/decrement css->refcnt one by
one.

This patch add a new function __css_get(), which takes "count" as a arg
and increment the css->refcnt by "count". And this patch also add a new
arg("count") to __css_put() and change the function to decrement the
css->refcnt by "count".

These coalesce version of __css_get()/__css_put() will be used to improve
performance of memcg's moving charge feature later, where instead of
calling css_get()/css_put() repeatedly, these new functions will be used.

No change is needed for current users of css_get()/css_put().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2468c7234b366eeb799ee0648cb58f9cba394a54 11-Mar-2010 Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> cgroup: introduce cancel_attach()

Add cancel_attach() operation to struct cgroup_subsys. cancel_attach()
can be used when can_attach() operation prepares something for the subsys,
but we should rollback what can_attach() operation has prepared if attach
task fails after we've succeeded in can_attach().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c50cc75271759373bd89a036eec4d4269b291616 25-Feb-2010 Ingo Molnar <mingo@elte.hu> sched, cgroups: Fix module export

I have exported it in d11c563 - but cgroups.c did not have module.h included ...

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
d11c563dd20ff35da5652c3e1c989d9e10e1d6d0 23-Feb-2010 Paul E. McKenney <paulmck@linux.vnet.ibm.com> sched: Use lockdep-based checking on rcu_dereference()

Update the rcu_dereference() usages to take advantage of the new
lockdep-based checking.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
[ -v2: fix allmodconfig missing symbol export build failure on x86 ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
4528fd0595847c2078b59f24800e751c2d6b7e41 02-Feb-2010 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix to return errno in a failure path

In cgroup_create(), if alloc_css_id() returns failure, the errno is not
propagated to userspace, so mkdir will fail silently.

To trigger this bug, we mount blkio (or memory subsystem), and create more
then 65534 cgroups. (The number of cgroups is limited to 65535 if a
subsystem has use_id == 1)

# mount -t cgroup -o blkio xxx /mnt
# for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done
# mkdir /mnt/65534
(should return ENOSPC)
#

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bd4f490a079730aadfaf9a728303ea0135c01945 08-Jan-2010 Dave Anderson <anderson@redhat.com> cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput()

The LTP cgroup test suite generates a "kernel BUG at kernel/cgroup.c:790!"
here in cgroup_diput():

/*
* if we're getting rid of the cgroup, refcount should ensure
* that there are no pidlists left.
*/
BUG_ON(!list_empty(&cgrp->pidlists));

The cgroup pidlist rework in 2.6.32 generates the BUG_ON, which is caused
when pidlist_array_load() calls cgroup_pidlist_find():

(1) if a matching cgroup_pidlist is found, it down_write's the mutex of the
pre-existing cgroup_pidlist, and increments its use_count.
(2) if no matching cgroup_pidlist is found, then a new one is allocated, it
down_write's its mutex, and the use_count is set to 0.
(3) the matching, or new, cgroup_pidlist gets returned back to pidlist_array_load(),
which increments its use_count -- regardless whether new or pre-existing --
and up_write's the mutex.

So if a matching list is ever encountered by cgroup_pidlist_find() during
the life of a cgroup directory, it results in an inflated use_count value,
preventing it from ever getting released by cgroup_release_pid_array().
Then if the directory is subsequently removed, cgroup_diput() hits the
BUG_ON() when it finds that the directory's cgroup is still populated with
a pidlist.

The patch simply removes the use_count increment when a matching pidlist
is found by cgroup_pidlist_find(), because it gets bumped by the calling
pidlist_array_load() function while still protected by the list's mutex.

Signed-off-by: Dave Anderson <anderson@redhat.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Paul Menage <menage@google.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
478988d3b28e98a31e0cfe24e011e28ba0f3cf0d 27-Oct-2009 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> cgroup: fix strstrip() misuse

cgroup_write_X64() and cgroup_write_string() ignore the return value of
strstrip(). it makes small inconsistent behavior.

example:
=========================
# cd /mnt/cgroup/hoge
# cat memory.swappiness
60
# echo "59 " > memory.swappiness
# cat memory.swappiness
59
# echo " 58" > memory.swappiness
bash: echo: write error: Invalid argument

This patch fixes it.

Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3dece8347df6a16239fab10dadb370854f1c969c 02-Oct-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> cgroup: catch bad css refcnt at css_put

__css_put() doesn't check a bug as refcnt goes to minus.
I think it should be caught. This patch adds a check for it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
828c09509b9695271bcbdc53e9fc9a6a737148d2 02-Oct-2009 Alexey Dobriyan <adobriyan@gmail.com> const: constify remaining file_operations

[akpm@linux-foundation.org: fix KVM]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
be367d09927023d081f9199665c8500f69f14d22 24-Sep-2009 Ben Blum <bblum@google.com> cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time

Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)

Currently, new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.

[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c378369d8b4fa516ff2b1e79c3eded4e0e955ebb 24-Sep-2009 Ben Blum <bblum@google.com> cgroups: change css_set freeing mechanism to be under RCU

Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d1d9fd3308fdef6b4bf564fa3d6cfe35b68b50bc 24-Sep-2009 Ben Blum <bblum@google.com> cgroups: use vmalloc for large cgroups pidlist allocations

Separates all pidlist allocation requests to a separate function that
judges based on the requested size whether or not the array needs to be
vmalloced or can be gotten via kmalloc, and similar for kfree/vfree.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
72a8cb30d10d4041c455a7054607a7d519167c87 24-Sep-2009 Ben Blum <bblum@google.com> cgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces

Previously there was the problem in which two processes from different pid
namespaces reading the tasks or procs file could result in one process
seeing results from the other's namespace. Rather than one pidlist for
each file in a cgroup, we now keep a list of pidlists keyed by namespace
and file type (tasks versus procs) in which entries are placed on demand.
Each pidlist has its own lock, and that the pidlists themselves are passed
around in the seq_file's private pointer means we don't have to touch the
cgroup or its master list except when creating and destroying entries.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
102a775e3647628727ae83a9a6abf0564c3ca7cb 24-Sep-2009 Ben Blum <bblum@google.com> cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file. Those are now separated into a new struct
cgroup_pidlist, of which two are had, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8f3ff20862cfcb85500a2bb55ee64622bd59fd0c 24-Sep-2009 Paul Menage <menage@google.com> cgroups: revert "cgroups: fix pid namespace bug"

The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.

The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved. The attach interface may need to be revised if this becomes the
case.

Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count. This scheme should introduce no extra
overhead in the fork path when there's no contention.

The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.

This patch:

Revert

commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
Author: Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700

cgroups: fix pid namespace bug

This is in preparation for some clashing cgroups changes that subsume the
original commit's functionaliy.

The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches. I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2c6ab6d200827e1c41dc71fff3a2ac7473f51777 24-Sep-2009 Paul Menage <menage@google.com> cgroups: allow cgroup hierarchies to be created with no bound subsystems

This patch removes the restriction that a cgroup hierarchy must have at
least one bound subsystem. The mount option "none" is treated as an
explicit request for no bound subsystems.

A hierarchy with no subsystems can be useful for plain task tracking, and
is also a step towards the support for multiply-bindable subsystems.

As part of this change, the hierarchy id is no longer calculated from the
bitmask of subsystems in the hierarchy (since this is not guaranteed to be
unique) but is allocated via an ida. Reference counts on cgroups from
css_set objects are now taken explicitly one per hierarchy, rather than
one per subsystem.

Example usage:

mount -t cgroup -o none,name=foo cgroup /mnt/cgroup

Based on the "no-op"/"none" subsystem concept proposed by
kamezawa.hiroyu@jp.fujitsu.com

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7717f7ba92de485bce8293419a20ffef130f4286 24-Sep-2009 Paul Menage <menage@google.com> cgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup

Currently the cgroups code makes the assumption that the subsystem
pointers in a struct css_set uniquely identify the hierarchy->cgroup
mappings associated with the css_set; and there's no way to directly
identify the associated set of cgroups other than by indirecting through
the appropriate subsystem state pointers.

This patch removes the need for that assumption by adding a back-pointer
from struct cg_cgroup_link object to its associated cgroup; this allows
the set of cgroups to be determined by traversing the cg_links list in
the struct css_set.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fe6934354f8e287275500cd6ec73826d4d6ad457 24-Sep-2009 Paul Menage <menage@google.com> cgroups: move the cgroup debug subsys into cgroup.c to access internal state

While it's architecturally clean to have the cgroup debug subsystem be
completely independent of the cgroups framework, it limits its usefulness
for debugging the contents of internal data structures. Move the debug
subsystem code into the scope of all the cgroups data structures to make
more detailed debugging possible.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c6d57f3312a6619d47c5557b5f6154a74d04ff80 24-Sep-2009 Paul Menage <menage@google.com> cgroups: support named cgroups hierarchies

To simplify referring to cgroup hierarchies in mount statements, and to
allow disambiguation in the presence of empty hierarchies and
multiply-bindable subsystems this patch adds support for naming a new
cgroup hierarchy via the "name=" mount option

A pre-existing hierarchy may be specified by either name or by subsystems;
a hierarchy's name cannot be changed by a remount operation.

Example usage:

# To create a hierarchy called "foo" containing the "cpu" subsystem
mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1

# To mount the "foo" hierarchy on a second location
mount -t cgroup -oname=foo cgroup /mnt/cgroup2

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
34f77a90f79fca31802c2e942bd73f7f557fe28c 24-Sep-2009 Xiaotian Feng <dfeng@redhat.com> cgroups: make unlock sequence in cgroup_get_sb consistent

Make the last unlock sequence consistent with previous unlock sequeue.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
88e9d34c727883d7d6f02cf1475b3ec98b8480c7 23-Sep-2009 James Morris <jmorris@namei.org> seq_file: constify seq_operations

Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.

This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.

Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6e1d5dcc2bbbe71dbf010c747e15739bef6b7218 22-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com> const: mark remaining inode_operations as const

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b87221de6a4934eda856475a0065688d12973a04 22-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com> const: mark remaining super_operations const

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d993831fa7ffeb89e994f046f93eeb09ec91df08 12-Jun-2009 Jens Axboe <jens.axboe@oracle.com> writeback: add name to backing_dev_info

This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
887032670d47366a8c8f25396ea7c14b7b2cc620 30-Jul-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> cgroup avoid permanent sleep at rmdir

After commit ec64f51545fffbc4cb968f0cea56341a4b07e85a ("cgroup: fix
frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg)
doesn't return -EBUSY by temporary ref counts. That commit expects all
refs after pre_destroy() is temporary but...it wasn't. Then, rmdir can
wait permanently. This patch tries to fix that and change followings.

- set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
- clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
if there are sleeping ones, wakes them up.
- rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.

Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Balbir Sigh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
096b7fe012d66ed55e98bc8022405ede0cc80e96 30-Jul-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix pid namespace bug

The bug was introduced by commit cc31edceee04a7b87f2be48f9489ebb72d264844
("cgroups: convert tasks file to use a seq_file with shared pid array").

We cache a pid array for all threads that are opening the same "tasks"
file, but the pids in the array are always from the namespace of the
last process that opened the file, so all other threads will read pids
from that namespace instead of their own namespaces.

To fix it, we maintain a list of pid arrays, which is keyed by pid_ns.
The list will be of length 1 at most time.

Reported-by: Paul Menage <menage@google.com>
Idea-by: Paul Menage <menage@google.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f9ab5b5b0f5be506640321d710b0acd3dca6154a 18-Jun-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: forbid noprefix if mounting more than just cpuset subsystem

The 'noprefix' option was introduced for backwards-compatibility of
cpuset, but actually it can be used when mounting other subsystems.

This results in possibility of name collision, and now the collision can
really happen, because we have 'stat' file in both memory and cpuacct
subsystem:

# mount -t cgroup -o noprefix,memory,cpuacct xxx /mnt

Cgroup will happily mount the 2 subsystems, but only 'stat' file of memory
subsys can be seen.

We don't want users to use nopreifx, and also want to avoid name
collision, so we change to allow noprefix only if mounting just the cpuset
subsystem.

[akpm@linux-foundation.org: fix shift for cpuset_subsys_id >= 32]
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
337eb00a2c3a421999c39c94ce7e33545ee8baa7 12-May-2009 Alessio Igor Bogani <abogani@texware.it> Push BKL down into ->remount_fs()

[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
6f5bbff9a1b7d6864a495763448a363bbfa96324 06-May-2009 Al Viro <viro@zeniv.linux.org.uk> Convert obvious places to deactivate_locked_super()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
0b7f569e45bb6be142d87017030669a6a7d327a1 03-Apr-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> memcg: fix OOM killer under memcg

This patch tries to fix OOM Killer problems caused by hierarchy.
Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
kill a task in memcg.

But, when hierarchy is used, it's broken and correct task cannot
be killed. For example, in following cgroup

/groupA/ hierarchy=1, limit=1G,
01 nolimit
02 nolimit
All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
groupA's 1Gbytes but OOM Killer just kills tasks in groupA.

This patch provides makes the bad process be selected from all tasks
under hierarchy. BTW, currently, oom_jiffies is updated against groupA
in above case. oom_jiffies of tree should be updated.

To see how oom_jiffies is used, please check mem_cgroup_oom_called()
callers.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: const fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0670e08bdfc67272f8c3087030417465629b8073 03-Apr-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: don't change release_agent when remount failed

Remount can fail in either case:
- wrong mount options is specified, or option 'noprefix' is changed.
- a to-be-added subsys is already mounted/active.

When using remount to change 'release_agent', for the above former failure
case, remount will return errno with release_agent unchanged, but for the
latter case, remount will return EBUSY with relase_agent changed, which is
unexpected I think:

# mount -t cgroup -o cpu xxx /cgrp1
# mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
# cat /cgrp2/release_agent
agent1
# mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
mount: /cgrp2 not mounted already, or bad option
# cat /cgrp2/release_agent
agent1 <-- ok
# mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
mount: /cgrp2 is busy
# cat /cgrp2/release_agent
agent2 <-- unexpected!

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
099fca3225b39f7a3ed853036038054172b55581 03-Apr-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: show correct file mode

We have some read-only files and write-only files, but currently they are
all set to 0644, which is counter-intuitive and cause trouble for some
cgroup tools like libcgroup.

This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's
own files' file mode, and for the most cases cft->mode can be default to 0
and cgroup will figure out proper mode.

Acked-by: Paul Menage <menage@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
66bdc9cfc77ba89a9ee6c82d28375b646ab4bb1d 03-Apr-2009 Jesper Juhl <jj@chaosbits.net> kernel/cgroup.c: kfree(NULL) is legal

Reduces object file size a bit:

Before:
$ size kernel/cgroup.o
text data bss dec hex filename
21593 7804 4924 34321 8611 kernel/cgroup.o
After:
$ size kernel/cgroup.o
text data bss dec hex filename
21537 7744 4924 34205 859d kernel/cgroup.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ec64f51545fffbc4cb968f0cea56341a4b07e85a 03-Apr-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> cgroup: fix frequent -EBUSY at rmdir

In following situation, with memory subsystem,

/groupA use_hierarchy==1
/01 some tasks
/02 some tasks
/03 some tasks
/04 empty

When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
is triggered and the kernel walks tree under groupA. In this case,
rmdir /groupA/04 fails with -EBUSY frequently because of temporal
refcnt from the kernel.

In general. cgroup can be rmdir'd if there are no children groups and
no tasks. Frequent fails of rmdir() is not useful to users.
(And the reason for -EBUSY is unknown to users.....in most cases)

This patch tries to modify above behavior, by
- retries if css_refcnt is got by someone.
- add "return value" to pre_destroy() and allows subsystem to
say "we're really busy!"

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
38460b48d06440de46b34cb778bd6c4855030754 03-Apr-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> cgroup: CSS ID support

Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.

This patch attaches unique ID to each css and provides following.

- css_lookup(subsys, id)
returns pointer to struct cgroup_subysys_state of id.
- css_get_next(subsys, id, rootid, depth, foundid)
returns the next css under "root" by scanning

When cgroup_subsys->use_id is set, an id for css is maintained.

The cgroup framework only parepares
- css_id of root css for subsys
- id is automatically attached at creation of css.
- id is *not* freed automatically. Because the cgroup framework
don't know lifetime of cgroup_subsys_state.
free_css_id() function is provided. This must be called by subsys.

There are several reasons to develop this.
- Saving space .... For example, memcg's swap_cgroup is array of
pointers to cgroup. But it is not necessary to be very fast.
By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
reduce much amount of memory usage.

- Scanning without lock.
CSS_ID provides "scan id under this ROOT" function. By this, scanning
css under root can be written without locks.
ex)
do {
rcu_read_lock();
next = cgroup_get_next(subsys, id, root, &found);
/* check sanity of next here */
css_tryget();
rcu_read_unlock();
id = found + 1
} while(...)

Characteristics:
- Each css has unique ID under subsys.
- Lifetime of ID is controlled by subsys.
- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
- Allowed ID is 1-65535, ID 0 is UNUSED ID.

Design Choices:
- scan-by-ID v.s. scan-by-tree-walk.
As /proc's pid scan does, scan-by-ID is robust when scanning is done
by following kind of routine.
scan -> rest a while(release a lock) -> conitunue from interrupted
memcg's hierarchical reclaim does this.

- When subsys->use_id is set, # of css in the system is limited to
65535.

[bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
313e924c0852943e67335fad9d2608701f0dfe8e 03-Apr-2009 Grzegorz Nosek <root@localdomain.pl> cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups

The ns_proxy cgroup allows moving processes to child cgroups only one
level deep at a time. This commit relaxes this restriction and makes it
possible to attach tasks directly to grandchild cgroups, e.g.:

($pid is in the root cgroup)
echo $pid > /cgroup/CG1/CG2/tasks

Previously this operation would fail with -EPERM and would have to be
performed as two steps:
echo $pid > /cgroup/CG1/tasks
echo $pid > /cgroup/CG1/CG2/tasks

Also, the target cgroup no longer needs to be empty to move a task there.

Signed-off-by: Grzegorz Nosek <root@localdomain.pl>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a3ec947c85ec339884b30ef6a08133e9311fdae1 04-Mar-2009 Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> vfs: simple_set_mnt() should return void

simple_set_mnt() is defined as returning 'int' but always returns 0.
Callers assume simple_set_mnt() never fails and don't properly cleanup if
it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev()
should:

up_write(sb->s_unmount);
deactivate_super(sb);

if simple_set_mnt() fails.

Since simple_set_mnt() never fails, would be cleaner if it did not
return anything.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
3ba13d179e8c24c68eac32b93593a6b10fcd1572 20-Feb-2009 Al Viro <viro@zeniv.linux.org.uk> constify dentry_operations: rest

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
67e055d144c5b2acdc1c63811fde031263bf92c5 18-Feb-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix possible use after free

In cgroup_kill_sb(), root is freed before sb is detached from the list, so
another sget() may find this sb and call cgroup_test_super(), which will
access the root that has been freed.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cfebe563bd0a3ff97e1bc167123120d59c7a84db 11-Feb-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix lockdep subclasses overflow

I enabled all cgroup subsystems when compiling kernel, and then:
# mount -t cgroup -o net_cls xxx /mnt
# mkdir /mnt/0

This showed up immediately:
BUG: MAX_LOCKDEP_SUBCLASSES too low!
turning off the locking correctness validator.

It's caused by the cgroup hierarchy lock:
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (ss->root == root)
mutex_lock_nested(&ss->hierarchy_mutex, i);
}

Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but
MAX_LOCKDEP_SUBCLASSES is 8.

This patch uses different lockdep keys for different subsystems.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
839ec5452ebfd5905b9c69b20ceb640903a8ea1a 29-Jan-2009 Paul Menage <menage@google.com> cgroup: fix root_count when mount fails due to busy subsystem

root_count was being incremented in cgroup_get_sb() after all error
checking was complete, but decremented in cgroup_kill_sb(), which can be
called on a superblock that we gave up on due to an error. This patch
changes cgroup_kill_sb() to only decrement root_count if the root was
previously linked into the list of roots.

Signed-off-by: Paul Menage <menage@google.com>
Tested-by: Serge Hallyn <serue@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
804b3c28a4e4fa1c224571bf76edb534b9c4b1ed 29-Jan-2009 Paul Menage <menage@google.com> cgroups: add cpu_relax() calls in css_tryget() and cgroup_clear_css_refs()

css_tryget() and cgroup_clear_css_refs() contain polling loops; these
loops should have cpu_relax calls in them to reduce cross-cache traffic.

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1404f06565ee89e0ce04d4a5859c00b0e3a0dc8d 29-Jan-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix lock inconsistency in cgroup_clone()

I fixed a bug in cgroup_clone() in Linus' tree in commit 7b574b7
("cgroups: fix a race between cgroup_clone and umount") without noticing
there was a cleanup patch in -mm tree that should be rebased (now commit
104cbd5, "cgroups: use task_lock() for access tsk->cgroups safe in
cgroup_clone()"), thus resulted in lock inconsistency.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
baef99a08a2e23d9386b47e53fa5f0d44fc98f66 29-Jan-2009 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> cgroups: use hierarchy mutex in creation failure path

Now, cgrp->sibling is handled under hierarchy mutex.
error route should do so, too.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e7c5ec9193d32b9559a3bb8893ceedbda85201ff 08-Jan-2009 Paul Menage <menage@google.com> cgroups: add css_tryget()

Add css_tryget(), that obtains a counted reference on a CSS. It is used
in situations where the caller has a "weak" reference to the CSS, i.e.
one that does not protect the cgroup from removal via a reference count,
but would instead be cleaned up by a destroy() callback.

css_tryget() will return true on success, or false if the cgroup is being
removed.

This is similar to Kamezawa Hiroyuki's patch from a week or two ago, but
with the difference that in the event of css_tryget() racing with a
cgroup_rmdir(), css_tryget() will only return false if the cgroup really
does get removed.

This implementation is done by biasing css->refcnt, so that a refcnt of 1
means "releasable" and 0 means "released or releasing". In the event of a
race, css_tryget() distinguishes between "released" and "releasing" by
checking for the CSS_REMOVED flag in css->flags.

Signed-off-by: Paul Menage <menage@google.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
999cd8a450f8f93701669a61cac4d3b19eca07e8 08-Jan-2009 Paul Menage <menage@google.com> cgroups: add a per-subsystem hierarchy_mutex

These patches introduce new locking/refcount support for cgroups to
reduce the need for subsystems to call cgroup_lock(). This will
ultimately allow the atomicity of cgroup_rmdir() (which was removed
recently) to be restored.

These three patches give:

1/3 - introduce a per-subsystem hierarchy_mutex which a subsystem can
use to prevent changes to its own cgroup tree

2/3 - use hierarchy_mutex in place of calling cgroup_lock() in the
memory controller

3/3 - introduce a css_tryget() function similar to the one recently
proposed by Kamezawa, but avoiding spurious refcount failures in
the event of a race between a css_tryget() and an unsuccessful
cgroup_rmdir()

Future patches will likely involve:

- using hierarchy mutex in place of cgroup_lock() in more subsystems
where appropriate

- restoring the atomicity of cgroup_rmdir() with respect to cgroup_create()

This patch:

Add a hierarchy_mutex to the cgroup_subsys object that protects changes to
the hierarchy observed by that subsystem. It is taken by the cgroup
subsystem (in addition to cgroup_mutex) for the following operations:

- linking a cgroup into that subsystem's cgroup tree
- unlinking a cgroup from that subsystem's cgroup tree
- moving the subsystem to/from a hierarchy (including across the
bind() callback)

Thus if the subsystem holds its own hierarchy_mutex, it can safely
traverse its own hierarchy.

Signed-off-by: Paul Menage <menage@google.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a47295e6bc42ad35f9c15ac66f598aa24debd4e2 08-Jan-2009 Paul Menage <menage@google.com> cgroups: make cgroup_path() RCU-safe

Fix races between /proc/sched_debug by freeing cgroup objects via an RCU
callback. Thus any cgroup reference obtained from an RCU-safe source will
remain valid during the RCU section. Since dentries are also RCU-safe,
this allows us to traverse up the tree safely.

Additionally, make cgroup_path() check for a NULL cgrp->dentry to avoid
trying to report a path for a partially-created cgroup.

[lizf@cn.fujitsu.com: call deactive_super() in cgroup_diput()]
Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e7b80bb695a5b64c92e314838e083b2f3bdf29b2 08-Jan-2009 Gowrishankar M <gowrishankar.m@in.ibm.com> cgroups: skip processes from other namespaces when listing a cgroup

Once tasks are populated from system namespace inside cgroup, container
replaces other namespace task with 0 while listing tasks, inside
container.

Though this is expected behaviour from container end, there is no use of
showing unwanted 0s.

In this patch, we check if a process is in same namespace before loading
into pid array.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Gowrishankar M <gowrishankar.m@in.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c12f65d4396e05c51ce3af7f159ead98574a587c 08-Jan-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: introduce link_css_set() to remove duplicate code

Add a common function link_css_set() to link a css_set to a cgroup.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
33a68ac1c1b695216e873ee12e819adbe73e4d9f 08-Jan-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: add inactive subsystems to rootnode.subsys_list

Though for an inactive hierarchy, we have subsys->root == &rootnode, but
rootnode's subsys_list is always empty.

This conflicts with the code in find_css_set():

for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
...
if (ss->root->subsys_list.next == &ss->sibling) {
...
}
}
if (list_empty(&rootnode.subsys_list)) {
...
}

The above code assumes rootnode.subsys_list links all inactive
hierarchies.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e5f6a8609bab0c2d7543ab1505105e011832afd7 08-Jan-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: make root_list contains active hierarchies only

Don't link rootnode to the root list, so root_list contains active
hierarchies only as the comment indicates. And rename for_each_root() to
for_each_active_root().

Also remove redundant check in cgroup_kill_sb().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7534432dcc3c654a8671b6b0cdffd1dbdbc73074 08-Jan-2009 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: remove rcu_read_lock() in cgroupstats_build()

cgroup_iter_* do not need rcu_read_lock().

In cgroup_enable_task_cg_lists(), do_each_thread() and while_each_thread()
are protected by RCU, it's OK, for write_lock(&css_set_lock) implies
rcu_read_lock() in non-RT kernel.

If we need explicit rcu_read_lock(), we should add rcu_read_lock() in
cgroup_enable_task_cg_lists(), not cgroup_iter_*.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
77efecd9e0526327548152df715ab8644ecb5ba0 08-Jan-2009 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: call find_css_set() safely in cgroup_attach_task()

In cgroup_attach_task(), tsk maybe exit when we call find_css_set(). and
find_css_set() will access to invalid css_set.

This patch increases the count before get_css_set(), and decreases it
after find_css_set().

NOTE:

css_set's refcount is also taskcount, after this patch applied, taskcount
may be off-by-one WHEN cgroup_lock() is not held. but I reviewed other
code which use taskcount, they are still correct. No regression found by
reviewing and simply testing.

So I do not use two counters in css_set. (one counter for taskcount, the
other for refcount. like struct mm_struct) If this fix cause regression,
we will use two counters in css_set.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
104cbd55377029e70fc2cee01089e84b9c36e5dc 08-Jan-2009 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: use task_lock() for access tsk->cgroups safe in cgroup_clone()

Use task_lock() protect tsk->cgroups and get_css_set(tsk->cgroups).

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b2aa30f7bb381e04c93eed106089ba55553955f1 08-Jan-2009 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: don't put struct cgroupfs_root protected by RCU

We don't access struct cgroupfs_root in fast path, so we should not put
struct cgroupfs_root protected by RCU

But the comment in struct cgroup_subsys.root confuse us.

struct cgroup_subsys.root is used in these places:

1 find_css_set(): if (ss->root->subsys_list.next == &ss->sibling)
2 rebind_subsystems(): if (ss->root != &rootnode)
rcu_assign_pointer(ss->root, root);
rcu_assign_pointer(subsys[i]->root, &rootnode);
3 cgroup_has_css_refs(): if (ss->root != cgrp->root)
4 cgroup_init_subsys(): ss->root = &rootnode;
5 proc_cgroupstats_show(): ss->name, ss->root->subsys_bits,
ss->root->number_of_cgroups, !ss->disabled);
6 cgroup_clone(): root = subsys->root;
if ((root != subsys->root) ||

All these place we have held cgroup_lock() or we don't dereference to
struct cgroupfs_root. It's means wo don't need RCU when use struct
cgroup_subsys.root, and we should not put struct cgroupfs_root protected
by RCU.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019f634ce5904c19eba4e86f51b1a119a53a9f1 08-Jan-2009 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: fix cgroup_iter_next() bug

We access res->cgroups without the task_lock(), so res->cgroups may be
changed. it's unreliable, and "if (l == &res->cgroups->tasks)" may be
false forever.

We don't need add any lock for fixing this bug. we just access to struct
css_set by struct cg_cgroup_link, not by struct task_struct.

Since we hold css_set_lock, struct cg_cgroup_link is reliable.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b12b533fa523e94e0cc9dc23274ae4f9439f1313 08-Jan-2009 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: add lock for child->cgroups in cgroup_post_fork()

When cgroup_post_fork() is called, child is seen by find_task_by_vpid(),
so child->cgroups maybe be changed, It'll incorrect.

child->cgroups<old>'s refcnt is decreased
child->cgroups<new>'s refcnt is increased
but child->cg_list is added to child->cgroups<old>'s list.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
75139b8274c3e30354daea623f14b43a482a0bb5 08-Jan-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: remove some redundant NULL checks

- In cgroup_clone(), if vfs_mkdir() returns successfully,
dentry->d_fsdata will be the pointer to the newly created
cgroup and won't be NULL.

- a cgroup file's dentry->d_fsdata won't be NULL, guaranteed
by cgroup_add_file().

- When walking through the subsystems of a cgroup_fs (using
for_each_subsys), cgrp->subsys[ss->subsys_id] won't be NULL,
guaranteed by cgroup_create().

(Also remove 2 unused variables in cgroup_rmdir().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e5991371ee0d1c0ce19e133c6f9075b49c5b4ae8 06-Jan-2009 Hugh Dickins <hugh@veritas.com> mm: remove cgroup_mm_owner_callbacks

cgroup_mm_owner_callbacks() was brought in to support the memrlimit
controller, but sneaked into mainline ahead of it. That controller has
now been shelved, and the mm_owner_changed() args were inadequate for it
anyway (they needed an mm pointer instead of a task pointer).

Remove the dead code, and restore mm_update_next_owner() locking to how it
was before: taking mmap_sem there does nothing for memcontrol.c, now the
only user of mm->owner.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
56ff5efad96182f4d3cb3dc6b07396762c658f16 09-Dec-2008 Al Viro <viro@zeniv.linux.org.uk> zero i_uid/i_gid on inode allocation

... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.

i_mode is not worth the effort; it has no common default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
7b574b7b0124ed344911f5d581e9bc2d83bbeb19 04-Jan-2009 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix a race between cgroup_clone and umount

The race is calling cgroup_clone() while umounting the ns cgroup subsys,
and thus cgroup_clone() might access invalid cgroup_fs, or kill_sb() is
called after cgroup_clone() created a new dir in it.

The BUG I triggered is BUG_ON(root->number_of_cgroups != 1);

------------[ cut here ]------------
kernel BUG at kernel/cgroup.c:1093!
invalid opcode: 0000 [#1] SMP
...
Process umount (pid: 5177, ti=e411e000 task=e40c4670 task.ti=e411e000)
...
Call Trace:
[<c0493df7>] ? deactivate_super+0x3f/0x51
[<c04a3600>] ? mntput_no_expire+0xb3/0xdd
[<c04a3ab2>] ? sys_umount+0x265/0x2ac
[<c04a3b06>] ? sys_oldumount+0xd/0xf
[<c0403911>] ? sysenter_do_call+0x12/0x31
...
EIP: [<c0456e76>] cgroup_kill_sb+0x23/0xe0 SS:ESP 0068:e411ef2c
---[ end trace c766c1be3bf944ac ]---

Cc: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
20ca9b3f4c6dfa0af8dd5b18a64df17eb994b54d 23-Dec-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: avoid accessing uninitialized data in failure path

If cgroup_get_rootdir() failed, free_cg_links() will be called in the
failure path, but tmp_cg_links hasn't been initialized at that time.

I introduced this bug in the 2.6.27 merge window.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e368d3a836797ddf193b1ec18c97407a791d2451 23-Dec-2008 Sharyathi Nagesh <sharyath@in.ibm.com> cgroups: suppress bogus warning messages

Remove spurious warning messages that are thrown onto the console during
cgroup operations.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Sharyathi Nagesh <sharyathi@in.ibm.com>
Acked-by: Serge E. Hallyn <serge@hallyn.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
307257cf475aac25db30b669987f13d90c934e3a 15-Dec-2008 Paul Menage <menage@google.com> cgroups: fix a race between rmdir and remount

When a cgroup is removed, it's unlinked from its parent's children list,
but not actually freed until the last dentry on it is released (at which
point cgrp->root->number_of_cgroups is decremented).

Currently rebind_subsystems checks for the top cgroup's child list being
empty in order to rebind subsystems into or out of a hierarchy - this can
result in the set of subsystems bound to a hierarchy being
removed-but-not-freed cgroup.

The simplest fix for this is to forbid remounts that change the set of
subsystems on a hierarchy that has removed-but-not-freed cgroups. This
bug can be reproduced via:

mkdir /mnt/cg
mount -t cgroup -o ns,freezer cgroup /mnt/cg
mkdir /mnt/cg/foo
sleep 1h < /mnt/cg/foo &
rmdir /mnt/cg/foo
mount -t cgroup -o remount,ns,devices,freezer cgroup /mnt/cg
kill $!

Though the above will cause oops in -mm only but not mainline, but the bug
can cause memory leak in mainline (and even oops)

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
33d283bef23132c48195eafc21449f8ba88fce6b 20-Nov-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix a serious bug in cgroupstats

Try this, and you'll get oops immediately:
# cd Documentation/accounting/
# gcc -o getdelays getdelays.c
# mount -t cgroup -o debug xxx /mnt
# ./getdelays -C /mnt/tasks

Because a normal file's dentry->d_fsdata is a pointer to struct cftype,
not struct cgroup.

After the patch, it returns EINVAL if we try to get cgroupstats
from a normal file.

Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3fa59dfbc3b223f02c26593be69ce6fc9a940405 20-Nov-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> cgroup: fix potential deadlock in pre_destroy

As Balbir pointed out, memcg's pre_destroy handler has potential deadlock.

It has following lock sequence.

cgroup_mutex (cgroup_rmdir)
-> pre_destroy -> mem_cgroup_pre_destroy-> force_empty
-> cpu_hotplug.lock. (lru_add_drain_all->
schedule_work->
get_online_cpus)

But, cpuset has following.
cpu_hotplug.lock (call notifier)
-> cgroup_mutex. (within notifier)

Then, this lock sequence should be fixed.

Considering how pre_destroy works, it's not necessary to holding
cgroup_mutex() while calling it.

As a side effect, we don't have to wait at this mutex while memcg's
force_empty works.(it can be long when there are tons of pages.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
c69e8d9c01db2adc503464993c358901c9af9de4 14-Nov-2008 David Howells <dhowells@redhat.com> CRED: Use RCU to access another task's creds and to release a task's own creds

Use RCU to access another task's creds and to release a task's own creds.
This means that it will be possible for the credentials of a task to be
replaced without another task (a) requiring a full lock to read them, and (b)
seeing deallocated memory.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
b6dff3ec5e116e3af6f537d4caedcad6b9e5082a 14-Nov-2008 David Howells <dhowells@redhat.com> CRED: Separate task security context from task_struct

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
76aac0e9a17742e60d408be1a706e9aaad370891 14-Nov-2008 David Howells <dhowells@redhat.com> CRED: Wrap task credential accesses in the core kernel

Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.

Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-audit@redhat.com
Cc: containers@lists.linux-foundation.org
Cc: linux-mm@kvack.org
Signed-off-by: James Morris <jmorris@namei.org>
24eb089950ce44603b30a3145a2c8520e2b55bb1 06-Nov-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: fix invalid cgrp->dentry before cgroup has been completely removed

This fixes an oops when reading /proc/sched_debug.

A cgroup won't be removed completely until finishing cgroup_diput(), so we
shouldn't invalidate cgrp->dentry in cgroup_rmdir(). Otherwise, when a
group is being removed while cgroup_path() gets called, we may trigger
NULL dereference BUG.

The bug can be reproduced:

# cat test.sh
#!/bin/sh
mount -t cgroup -o cpu xxx /mnt
for (( ; ; ))
{
mkdir /mnt/sub
rmdir /mnt/sub
}
# ./test.sh &
# cat /proc/sched_debug

BUG: unable to handle kernel NULL pointer dereference at 00000038
IP: [<c045a47f>] cgroup_path+0x39/0x90
...
Call Trace:
[<c0420344>] ? print_cfs_rq+0x6e/0x75d
[<c0421160>] ? sched_debug_show+0x72d/0xc1e
...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org> [2.6.26.x, 2.6.27.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2077776641b6ffb0049f13018d2e162340ec51c7 21-Oct-2008 Stephen Rothwell <sfr@canb.auug.org.au> cgroup: remove unused variable

/scratch/sfr/next/kernel/cgroup.c: In function 'cgroup_tasks_start':
/scratch/sfr/next/kernel/cgroup.c:2107: warning: unused variable 'i'

Introduced in commit cc31edceee04a7b87f2be48f9489ebb72d264844 "cgroups:
convert tasks file to use a seq_file with shared pid array".

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cc31edceee04a7b87f2be48f9489ebb72d264844 19-Oct-2008 Paul Menage <menage@google.com> cgroups: convert tasks file to use a seq_file with shared pid array

Rather than pre-generating the entire text for the "tasks" file each
time the file is opened, we instead just generate/update the array of
process ids and use a seq_file to report these to userspace. All open
file handles on the same "tasks" file can share a pid array, which may
be updated any time that no thread is actively reading the array. By
sharing the array, the potential for userspace to DoS the system by
opening many handles on the same "tasks" file is removed.

[Based on a patch by Lai Jiangshan, extended to use seq_file]

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
146aa1bd0511f88ddb4e92fafa2b8aad4f2f65f3 19-Oct-2008 Lai Jiangshan <laijs@cn.fujitsu.com> cgroups: fix probable race with put_css_set[_taskexit] and find_css_set

put_css_set_taskexit may be called when find_css_set is called on other
cpu. And the race will occur:

put_css_set_taskexit side find_css_set side

|
atomic_dec_and_test(&kref->refcount) |
/* kref->refcount = 0 */ |
....................................................................
| read_lock(&css_set_lock)
| find_existing_css_set
| get_css_set
| read_unlock(&css_set_lock);
....................................................................
__release_css_set |
....................................................................
| /* use a released css_set */
|

[put_css_set is the same. But in the current code, all put_css_set are
put into cgroup mutex critical region as the same as find_css_set.]

[akpm@linux-foundation.org: repair comments]
[menage@google.com: eliminate race in css_set refcounting]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9363b9f23c9cc36cc8ef6c05fdf879ee4a96ae92 16-Oct-2008 Balbir Singh <balbir@linux.vnet.ibm.com> memrlimit: cgroup mm owner callback changes to add task info

This patch adds an additional field to the mm_owner callbacks. This field
is required to get to the mm that changed. Hold mmap_sem in write mode
before calling the mm_owner_changed callback

[hugh@veritas.com: fix mmap_sem deadlock]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
31a78f23bac0069004e69f98808b6988baccb6b6 29-Sep-2008 Balbir Singh <balbir@linux.vnet.ibm.com> mm owner: fix race between swapoff and exit

There's a race between mm->owner assignment and swapoff, more easily
seen when task slab poisoning is turned on. The condition occurs when
try_to_unuse() runs in parallel with an exiting task. A similar race
can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
or ptrace or page migration.

CPU0 CPU1
try_to_unuse
looks at mm = task0->mm
increments mm->mm_users
task 0 exits
mm->owner needs to be updated, but no
new owner is found (mm_users > 1, but
no other task has task->mm = task0->mm)
mm_update_next_owner() leaves
mmput(mm) decrements mm->mm_users
task0 freed
dereferencing mm->owner fails

The fix is to notify the subsystem via mm_owner_changed callback(),
if no new owner is found, by specifying the new task as NULL.

Jiri Slaby:
mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
must be set after that, so as not to pass NULL as old owner causing oops.

Daisuke Nishimura:
mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
and its callers need to take account of this situation to avoid oops.

Hugh Dickins:
Lockdep warning and hang below exec_mmap() when testing these patches.
exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
so exec_mmap() now needs to do the same. And with that repositioning,
there's now no point in mm_need_new_owner() allowing for NULL mm.

Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
55b6fd0162ace1e0f1b52c8c092565c115127ef6 30-Jul-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: uninline cgroup_has_css_refs()

It's not small enough, and has 2 call sites.

text data bss dec hex filename
12813 1676 4832 19321 4b79 cgroup.o.orig
12775 1676 4832 19283 4b53 cgroup.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
36553434f475a84b653e25e74490ee8df43b86d5 30-Jul-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: remove duplicate code in allocate_cg_link()

- just call free_cg_links() in allocate_cg_links()
- the list will get initialized in allocate_cg_links(), so don't init
it twice

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5a3eb9f6b7c598529f832b8baa6458ab1cbab2c6 30-Jul-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix possible memory leak

There's a leak if copy_from_user() returns failure.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3f8206d496e9e9495afb1d4e70d29712b4d403c9 26-Jul-2008 Al Viro <viro@zeniv.linux.org.uk> [PATCH] get rid of indirect users of namei.h

fs.h needs path.h, not namei.h; nfs_fs.h doesn't need it at all.
Several places in the tree needed direct include.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
96930a6365c99c160138a395566e360b27348b8f 26-Jul-2008 Adrian Bunk <bunk@kernel.org> make cgroup_seqfile_release() static

cgroup_seqfile_release() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e885dcde75685e09f23cffae1f6d5169c105b8a0 25-Jul-2008 Serge E. Hallyn <serue@us.ibm.com> cgroup_clone: use pid of newly created task for new cgroup

cgroup_clone creates a new cgroup with the pid of the task. This works
correctly for unshare, but for clone cgroup_clone is called from
copy_namespaces inside copy_process, which happens before the new pid is
created. As a result, the new cgroup was created with current's pid.
This patch:

1. Moves the call inside copy_process to after the new pid
is created
2. Passes the struct pid into ns_cgroup_clone (as it is not
yet attached to the task)
3. Passes a name from ns_cgroup_clone() into cgroup_clone()
so as to keep cgroup_clone() itself simpler
4. Uses pid_vnr() to get the process id value, so that the
pid used to name the new cgroup is always the pid as it
would be known to the task which did the cloning or
unsharing. I think that is the most intuitive thing to
do. This way, task t1 does clone(CLONE_NEWPID) to get
t2, which does clone(CLONE_NEWPID) to get t3, then the
cgroup for t3 will be named for the pid by which t2 knows
t3.

(Thanks to Dan Smith for finding the main bug)

Changelog:
June 11: Incorporate Paul Menage's feedback: don't pass
NULL to ns_cgroup_clone from unshare, and reduce
patch size by using 'nodename' in cgroup_clone.
June 10: Original version

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge Hallyn <serge@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Tested-by: Dan Smith <danms@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
af351026aafc8da16518a02b41c66d3e0c1cdef4 25-Jul-2008 Paul Menage <menage@google.com> cgroup files: turn attach_task_by_pid directly into a cgroup write handler

This patch changes attach_task_by_pid() to take a u64 rather than a
string; as a result it can be called directly as a control groups
write_u64 handler, and cgroup_common_file_write() can be removed.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6379c106152388f7ea45d6dda63edda0e9181fc8 25-Jul-2008 Paul Menage <menage@google.com> cgroup files: move notify_on_release file to separate write handler

This patch moves the write handler for the cgroups notify_on_release
file into a separate handler. This handler requires no cgroups locking
since it relies on atomic bitops for synchronization.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
84eea842886ac35020be6043e04748ed22014359 25-Jul-2008 Paul Menage <menage@google.com> cgroups: misc cleanups to write_string patchset

This patch contains cleanups suggested by reviewers for the recent
write_string() patchset:

- pair cgroup_lock_live_group() with cgroup_unlock() in cgroup.c for
clarity, rather than directly unlocking cgroup_mutex.

- make the return type of cgroup_lock_live_group() a bool

- use a #define'd constant for the local buffer size in read/write functions

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e788e066c651b1bbf4a927dc95395c1aa13be436 25-Jul-2008 Paul Menage <menage@google.com> cgroup files: move the release_agent file to use typed handlers

Adds cgroup_release_agent_write() and cgroup_release_agent_show()
methods to handle writing/reading the path to a cgroup hierarchy's
release agent. As a result, cgroup_common_file_read() is now unnecessary.

As part of the change, a previously-tolerated race in
cgroup_release_agent() is avoided by copying the current
release_agent_path prior to calling call_usermode_helper().

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
db3b14978abc02041046ed8353f0899cb58ffffc 25-Jul-2008 Paul Menage <menage@google.com> cgroup files: add write_string cgroup control file method

This patch adds a write_string() method for cgroups control files. The
semantics are that a buffer is copied from userspace to kernelspace
and the handler function invoked on that buffer. The buffer is
guaranteed to be nul-terminated, and no longer than max_write_len
(defaulting to 64 bytes if unspecified). Later patches will convert
existing raw file write handlers in control group subsystems to use
this method.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8947f9d5b361ce927be6d5c11fed57905b7a4100 25-Jul-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: annotate two variables with __read_mostly

- need_forkexit_callback will be read only after system boot.
- use_task_css_set_links will be read only after it's set.

And these 2 variables are checked when a new process is forked.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
71cbb949d17d4d776abd547135feb7f3282405c8 25-Jul-2008 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> cgroup: list_for_each cleanup

--------------------------
while() {
list_entry();
...
}
--------------------------

is equivalent to following code.

--------------------------
list_for_each_entry(){
...
}
--------------------------

later can review easily more.

this patch is just clean up.
it doesn't have any behavor change.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7e9abd89cbdf9b73d327d8173343abce9022609b 25-Jul-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: use read lock to guard find_existing_css_set()

The function does not modify anything (except the temporary css template), so
it's sufficient to hold read lock.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5c02b575780d0d785815a1e7b79a98edddee895a 23-May-2008 Cedric Le Goater <clg@fr.ibm.com> cgroups: remove node_ prefix_from ns subsystem

This is a slight change in the namespace cgroup subsystem api.

The change is that previously when cgroup_clone() was called (currently
only from the unshare path in ns_proxy cgroup, you'd get a new group named
"node_$pid" whereas now you'll get a group named after just your pid.)

The only users who would notice it are those who are using the ns_proxy
cgroup subsystem to auto-create cgroups when namespaces are unshared -
something of an experimental feature, which I think really needs more
complete container/namespace support in order to be useful. I suspect the
only users are Cedric and Serge, or maybe a few others on
containers@lists.linux-foundation.org. And in fact it would only be
noticed by the users who make the assumption about how the name is
generated, rather than getting it from the /proc/<pid>/cgroups file for
the process in question.

Whether the change is actually needed or not I'm fairly agnostic on, but I
guess it is more elegant to just use the pid as the new group name rather
than adding a fairly arbitrary "node_" prefix on the front.

[menage@google.com: provided changelog]
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Paul Menage" <menage@google.com>
Cc: "Serge E. Hallyn" <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e4ad08fe64afca4ef79ecc4c624e6e871688da0d 30-Apr-2008 Miklos Szeredi <mszeredi@suse.cz> mm: bdi: add separate writeback accounting capability

Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().

Misc cleanups:

- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cf475ad28ac35cc9ba612d67158f29b73b38b05d 29-Apr-2008 Balbir Singh <balbir@linux.vnet.ibm.com> cgroups: add an owner to the mm_struct

Remove the mem_cgroup member from mm_struct and instead adds an owner.

This approach was suggested by Paul Menage. The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined. It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.

A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.

This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.

I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.

This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.

After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>,
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
29486df325e1fe6e1764afcb19e3370804c2b002 29-Apr-2008 Serge E. Hallyn <serue@us.ibm.com> cgroups: introduce cft->read_seq()

Introduce a read_seq() helper in cftype, which uses seq_file to print out
lists. Use it in the devices cgroup. Also split devices.allow into two
files, so now devices.deny and devices.allow are the ones to use to manipulate
the whitelist, while devices.list outputs the cgroup's current whitelist.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
28fd5dfc12bde391981dfdcf20755952b6e916af 29-Apr-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: remove the css_set linked-list

Now we can run through the hash table instead of running through the
linked-list.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e8d55fdeb882cfcb5e8db5a5ce16edfba78aafc5 29-Apr-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: simplify init_subsys()

We are at system boot and there is only 1 cgroup group (i,e, init_css_set), so
we don't need to run through the css_set linked list. Neither do we need to
run through the task list, since no processes have been created yet.

Also referring to a comment in cgroup.h:

struct css_set
{
...
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
* during subsystem registration (at boot time).
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
}

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
472b1053f3c319cc60bfb2a0bb062fed77a93eb6 29-Apr-2008 Li Zefan <lizf@cn.fujitsu.com> cgroups: use a hash table for css_set finding

When we attach a process to a different cgroup, the css_set linked-list will
be run through to find a suitable existing css_set to use. This patch
implements a hash table for better performance.

The following benchmarks have been tested:

For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
task in each, and then move an additional task through each cgroup in
turn.

Here is a test result:

N Loop orig - Time(s) hash - Time(s)
----------------------------------------------
1 10000 1.201231728 1.196311177
5 2000 1.065743872 1.040566424
10 1000 0.991054735 0.986876440
50 200 0.976554203 0.969608733
100 100 0.998504680 0.969218270
500 20 1.157347764 0.962602963
1000 10 1.619521852 1.085140172

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d447ea2f30ec60370ddb99a668e5ac12995f043d 29-Apr-2008 Pavel Emelyanov <xemul@openvz.org> cgroups: add the trigger callback to struct cftype

Trigger callback can be used to receive a kick-up from the user space. The
string written is ignored.

The cftype->private is used for multiplexing events.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
46ae220bea40bd1cf4abec2d5cdfb4f9396c7115 29-Apr-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: switch to proc_create()

There is a race between create_proc_entry() and the assignment of file ops.
proc_create() is invented to fix it.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
06a119204d3e1e67d393e996ed987b0df7998381 29-Apr-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: annotate cgroup_init_subsys with __init

It is called by cgroup_init() and cgroup_init_early() only, which are
annotated with __init.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e73d2c61d1fcbd3621688ae457b49509c8d4c601 29-Apr-2008 Paul Menage <menage@google.com> CGroups _s64 files: add cgroups read_s64/write_s64 file methods

These patches add cgroups read_s64 and write_s64 control file methods (the
signed equivalent of read_u64/write_u64) and use them to implement the
cpu.rt_runtime_us control file in the CFS cgroup subsystem.

This patch:

These are the signed equivalents of the read_u64/write_u64 methods

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3116f0e3df0a67ad56f15dd4c5f6cefb04bb4a98 29-Apr-2008 Paul Menage <menage@google.com> CGroup API files: move "releasable" to cgroup_debug subsystem

The "releasable" control file provided by the cgroup framework exports the
state of a per-cgroup flag that's related to the notify-on-release feature.
This isn't really generally useful, unless you're trying to debug this
particular feature of cgroups.

This patch moves the "releasable" file to the cgroup_debug subsystem.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9179656961adcea3c25403365597e486d851ac5e 29-Apr-2008 Paul Menage <menage@google.com> CGroup API files: add cgroup map data type

Adds a new type of supported control file representation, a map from strings
to u64 values.

Each map entry is printed as a line in a similar format to /proc/vmstat, i.e.
"$key $value\n"

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b7269dfc826fbf554c9e6a9eaa4e6ff95fa08656 29-Apr-2008 Paul Menage <menage@google.com> CGroup API files: strip all trailing whitespace in cgroup_write_u64

This removes the need for people to remember to pass the -n flag to echo when
writing values to cgroup control files.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f4c753b7eacc277e506066abdda351cbc1cf8e6a 29-Apr-2008 Paul Menage <menage@google.com> CGroup API files: rename read/write_uint methods to read_write_u64

Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.

[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4fe91d518e4958af7edebbeb112a3272b2be232d 29-Apr-2008 Paul Jackson <pj@sgi.com> cgroup: fix sparse warning of shadow symbol in cgroup.c

Fix a code warning: symbol 'p' shadows an earlier one

This is a reincarnation of Harvey Harrison's patch:
cpuset: sparse warnings in cpuset.c

Independently, Cliff Wickman moved the affected code,
from kernel/cpuset.c to kernel/cgroup.c, in his patch:
cpusets: update_cpumask revision

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3df91fe30a1547af7e794c6e8cca76f4932c6ad7 29-Apr-2008 Adrian Bunk <bunk@kernel.org> make cgroup_enable_task_cg_lists() static

Make the needlessly global cgroup_enable_task_cg_lists() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
0e04388f0189fa1f6812a8e1cb6172136eada87e 17-Apr-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix a race condition in manipulating tsk->cg_list

When I ran a test program to fork mass processes and at the same time
'cat /cgroup/tasks', I got the following oops:

------------[ cut here ]------------
kernel BUG at lib/list_debug.c:72!
invalid opcode: 0000 [#1] SMP
Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72)
...
Call Trace:
[<c044a5f9>] ? cgroup_exit+0x55/0x94
[<c0427acf>] ? do_exit+0x217/0x5ba
[<c0427ed7>] ? do_group_exit+0.65/0x7c
[<c0427efd>] ? sys_exit_group+0xf/0x11
[<c0404842>] ? syscall_call+0x7/0xb
[<c05e0000>] ? init_cyrix+0x2fa/0x479
...
EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4
---[ end trace caffb7332252612b ]---
Fixing recursive fault but reboot is needed!

After digging into the code and debugging, I finlly found out a race
situation:

do_exit()
->cgroup_exit()
->if (!list_empty(&tsk->cg_list))
list_del(&tsk->cg_list);

cgroup_iter_start()
->cgroup_enable_task_cg_list()
->list_add(&tsk->cg_list, ..);

In this case the list won't be deleted though the process has exited.

We got two bug reports in the past, which seem to be the same bug as
this one:
http://lkml.org/lkml/2008/3/5/332
http://lkml.org/lkml/2007/10/17/224

Actually sometimes I got oops on list_del, sometimes oops on list_add.
And I can change my test program a bit to trigger other oops.

The patch has been tested both on x86_32 and x86_64.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b6c3006d204a0b86e1ebe02ca38f9f071a03c7ef 11-Apr-2008 Paul Menage <menage@google.com> cgroups: include hierarchy ids in /proc/<pid>/cgroup

Extend the /proc/<pid>/cgroup file to include the appropriate hierarchy ID on
each line.

Currently this ID isn't really needed since a hierarchy can be completely
identified by the set of subsystems bound to it, but this is likely to change
in the near future in order to support stateless subsystems and
merging/rebinding of subsystems. Getting this change into 2.6.25 reduces the
need for an API change later.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8bab8dded67d026c39367bbd5e27d2f6c556c38e 04-Apr-2008 Paul Menage <menage@google.com> cgroups: add cgroup support for enabling controllers at boot time

The effects of cgroup_disable=foo are:

- foo isn't auto-mounted if you mount all cgroups in a single hierarchy
- foo isn't visible as an individually mountable subsystem

As a result there will only ever be one call to foo->create(), at init time;
all processes will stay in this group, and the group will never be mounted on
a visible hierarchy. Any additional effects (e.g. not allocating metadata)
are up to the foo subsystem.

This doesn't handle early_init subsystems (their "disabled" bit isn't set be,
but it could easily be extended to do so if any of the early_init systems
wanted it - I think it would just involve some nastier parameter processing
since it would occur before the command-line argument parser had been run.

Hugh said:

Ballpark figures, I'm trying to get this question out rather than
processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
to the affected paths, booting with cgroup_disable=memory cuts that back to
1% overhead (due to slightly bigger struct page).

I'm no expert on distros, they may have no interest whatever in
CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
without it, or apply the cgroup_disable=memory patches.

Unix bench's execl test result on x86_64 was

== just after boot without mounting any cgroup fs.==
mem_cgorup=off : Execl Throughput 43.0 3150.1 732.6
mem_cgroup=on : Execl Throughput 43.0 2932.6 682.0
==

[lizf@cn.fujitsu.com: fix boot option parsing]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9dce07f1a441b77a15631cf0ed0238e0baa7ed64 29-Mar-2008 Al Viro <viro@ftp.linux.org.uk> NULL noise: fs/*, mm/*, kernel/*

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b6abdb0e6ca5c2c0a7caa4131da2af0750927e72 04-Mar-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix default notify_on_release setting

The documentation says the default value of notify_on_release of a child
cgroup is inherited from its parent, which is reasonable, but the
implementation just sets the flag disabled.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bc231d2a048010d5e0b49ac7fddbfa822fc41109 24-Feb-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: remove dead code in cgroup_get_rootdir()

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
68db38f1537a44097e264f28bda751d6b919cd53 24-Feb-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: remove duplicate code in find_css_set()

The list head res->tasks gets initialized twice in find_css_set().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8d53d55d27754508e58e9ac18a4a445b110434bf 24-Feb-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix subsys bitops

Cgroup uses unsigned long for subsys bitops, not unsigned long long.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
f777073848ba3708d68d87e43f104f83316187d7 24-Feb-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix memory leak in cgroup_get_sb()

opts.release_agent is not kfree()ed in all necessary places.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a043e3b2c63445512c5592cbe3c8694f3c655e81 24-Feb-2008 Li Zefan <lizf@cn.fujitsu.com> cgroup: fix comments

fix:
- comments about need_forkexit_callback
- comments about release agent
- typo and comment style, etc.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
73507f335f406ff31ceb97b39fa76eaee00f4f26 07-Feb-2008 Pavel Emelyanov <xemul@openvz.org> Handle pid namespaces in cgroups code

There's one place that works with task pids - its the "tasks" file in cgroups.
The read/write handlers assume, that the pid values go to/come from the user
space and thus it is a virtual pid, i.e. the pid as it is seen from inside a
namespace.

Tune the code accordingly.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
956db3ca0606e78456786ef19fd4dc7a5151a6e1 07-Feb-2008 Cliff Wickman <cpw@sgi.com> hotplug cpu: move tasks in empty cpusets to parent

This patch corrects a situation that occurs when one disables all the cpus in
a cpuset.

Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
which is incorrect because it may then overlap its cpu-exclusive sibling.

Tasks of an empty cpuset should be moved to the cpuset which is the parent of
their current cpuset. Or if the parent cpuset has no cpus, to its parent,
etc.

And the empty cpuset should be released (if it is flagged notify_on_release).

Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
iterate through all tasks in the cpu-less cpuset. We are deliberately
avoiding a walk of the tasklist.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
31a7df01fd0cd786f60873a921aecafac148c290 07-Feb-2008 Cliff Wickman <cpw@sgi.com> cgroups: mechanism to process each task in a cgroup

Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
calling a test function and a process function for each. And call the process
function without holding the css_set_lock lock.

The idea is David Rientjes', predicting that such a function will make it much
easier in the future to extend things that require access to each task in a
cgroup without holding the lock,

[akpm@linux-foundation.org: cleanup]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4fca88c87b7969c698912e2de9b1b31088c777cb 07-Feb-2008 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> memory cgroup enhancements: add- pre_destroy() handler

Add a handler "pre_destroy" to cgroup_subsys. It is called before
cgroup_rmdir() checks all subsys's refcnt.

I think this is useful for subsys which have some extra refs even if there
are no tasks in cgroup. By adding pre_destroy(), the kernel keeps the rule
"destroy() against subsystem is called only when refcnt=0." and allows css
ref to be used by other objects than tasks.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Menage <menage@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e9685a03c8c3162cfa9ff02d254ea5c848f9facb 07-Feb-2008 Adrian Bunk <bunk@kernel.org> kernel/cgroup.c: make 2 functions static

cgroup_is_releasable() and notify_on_release() should be static,
not global inline.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8dc4f3e17dd5f7e59ce568155ccd8974af879315 07-Feb-2008 Paul Menage <menage@google.com> cgroups: move cgroups destroy() callbacks to cgroup_diput()

Move the calls to the cgroup subsystem destroy() methods from
cgroup_rmdir() to cgroup_diput(). This allows control file reads and
writes to access their subsystem state without having to be concerned with
locking against cgroup destruction - the control file dentry will keep the
cgroup and its subsystem state objects alive until the file is closed.

The documentation is updated to reflect the changed semantics of destroy();
additionally the locking comments for destroy() and some other methods were
clarified and decrustified.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
622d42cac9ed42098aa50c53994f625abfa3d473 07-Feb-2008 Paul Jackson <pj@sgi.com> cgroup simplify space stripping

Simplify the space stripping code in cgroup file write.

[akpm@linux-foundation.org: s/BUG_ON/BUILD_BUG_ON/]
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e18f6318e5dab189efd4cb0bbfcbd923cc373e3c 07-Feb-2008 Paul Jackson <pj@sgi.com> cgroup brace coding style fix

Coding style fix - one line conditionals don't get braces.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3cdeed2986b09fcc77b4812ca10dbc057e4e5f8c 07-Feb-2008 Adrian Bunk <bunk@kernel.org> kernel/cgroup.c: remove dead code

This patch removes dead code spotted by the Coverity checker
(look at the "(nbytes >= PATH_MAX)" check).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cfe36bde59bc1ae868e775ad82386c3acaabb738 15-Nov-2007 Diego Calleja <diegocg@gmail.com> Improve cgroup printks

When I boot with the 'quiet' parameter, I see on the screen:

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 39.036026] Initializing cgroup subsys cpuacct
[ 39.036080] Initializing cgroup subsys debug
[ 39.036118] Initializing cgroup subsys ns

This patch lowers the priority of those messages, adds a "cgroup: " prefix
to another couple of printks and kills the useless reference to the source
file.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3bdf590eac36ac5930deb9552febee3ff18cd2d1 24-Oct-2007 Jeff Garzik <jeff@garzik.org> cgroup: kill unused variable

Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
bd89aabc6761de1c35b154fe6f914a445d301510 19-Oct-2007 Paul Menage <menage@google.com> Control groups: Replace "cont" with "cgrp" and other misc renaming

Replace "cont" with "cgrp" and other misc renaming

This patch finishes some of the names that got missed in the great
"task containers" -> "control groups" rename. Primarily it renames
the local variable "cont" to "cgrp" in a number of places, and renames
the CONT_* enum members to CGRP_*.

This patch is not intended to have any effect on the generated code;
the output of "objdump -d kernel/cgroup.o" is unchanged.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
69cccb887a35f3761b7ea62cdd2bee602868c735 19-Oct-2007 Pavel Emelyanov <xemul@openvz.org> Use task_pid_nr() instead of pid_nr(task_pid())

There are two places that do so - the cgroups subsystem and the autofs
code.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Ian Kent <raven@themaw.net>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
846c7bb055747989891f5cd2bb6e8d56243ba1e7 19-Oct-2007 Balbir Singh <balbir@linux.vnet.ibm.com> Add cgroupstats

This patch is inspired by the discussion at
http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics
as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The
patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward
ported)

This patch implements per cgroup statistics infrastructure and re-uses
code from the taskstats interface. A new set of cgroup operations are
registered with commands and attributes. It should be very easy to
*extend* per cgroup statistics, by adding members to the cgroupstats
structure.

The current model for cgroupstats is a pull, a push model (to post
statistics on interesting events), should be very easy to add. Currently
user space requests for statistics by passing the cgroup file
descriptor. Statistics about the state of all the tasks in the cgroup
is returned to user space.

TODO's/NOTE:

This patch provides an infrastructure for implementing cgroup statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, accumulation of taskstats
into cgroup statistics in the future.

Sample output

# ./cgroupstats -C /cgroup/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0

# ./cgroupstats -C /cgroup/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0

If the approach looks good, I'll enhance and post the user space utility for
the same

Feedback, comments, test results are always welcome!

[akpm@linux-foundation.org: build fix]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
81a6a5cdd2c5cd70874b88afe524ab09e9e869af 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: automatic userspace notification of idle cgroups

Add the following files to the cgroup filesystem:

notify_on_release - configures/reports whether the cgroup subsystem should
attempt to run a release script when this cgroup becomes unused

release_agent - configures/reports the release agent to be used for this
hierarchy (top level in each hierarchy only)

releasable - reports whether this cgroup would have been auto-released if
notify_on_release was true and a release agent was configured (mainly useful
for debugging)

To avoid locking issues, invoking the userspace release agent is done via a
workqueue task; cgroups that need to have their release agents invoked by
the workqueue task are linked on to a list.

[pj@sgi.com: Need to include kmod.h]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
817929ec274bcfe771586d338bb31d1659615686 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: shared cgroup subsystem group arrays

Replace the struct css_set embedded in task_struct with a pointer; all tasks
that have the same set of memberships across all hierarchies will share a
css_set object, and will be linked via their css_sets field to the "tasks"
list_head in the css_set.

Assuming that many tasks share the same cgroup assignments, this reduces
overall space usage and keeps the size of the task_struct down (three pointers
added to task_struct compared to a non-cgroups kernel, no matter how many
subsystems are registered).

[akpm@linux-foundation.org: fix a printk]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
a424316ca154317367c7ddf89997d1c80e4a8051 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: add procfs interface

Add:

/proc/cgroups - general system info

/proc/*/cgroup - per-task cgroup membership info

[a.p.zijlstra@chello.nl: cgroups: bdi init hooks]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
697f41610863c9264a7ae26dac9a387c9dda8c84 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: add cgroup_clone() interface

Add support for cgroup_clone(), a way to create new cgroups intended to
be used for systems such as namespace unsharing. A new subsystem callback,
post_clone(), is added to allow subsystems to automatically configure cloned
cgroups.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
b4f48b6363c81ca743ef46943ef23fd72e60f679 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: add fork()/exit() hooks

This adds the necessary hooks to the fork() and exit() paths to ensure
that new children inherit their parent's cgroup assignments, and that
exiting processes release reference counts on their cgroups.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
355e0c48b757b7fcc79ccb98fda8105ed37a1598 19-Oct-2007 Paul Menage <menage@google.com> Add cgroup write_uint() helper method

Add write_uint() helper method for cgroup subsystems

This helper is analagous to the read_uint() helper method for
reporting u64 values to userspace. It's designed to reduce the amount
of boilerplate requierd for creating new cgroup subsystems.

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bbcb81d09104f0d440974b994c1fc508ccbe9503 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: add tasks file interface

Add the per-directory "tasks" file for cgroupfs mounts; this allows the
user to determine which tasks are members of a cgroup by reading a
cgroup's "tasks", and to move a task into a cgroup by writing its pid to
its "tasks".

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ddbcc7e8e50aefe467c01cac3dec71f118cd8ac2 19-Oct-2007 Paul Menage <menage@google.com> Task Control Groups: basic task cgroup framework

Generic Process Control Groups
--------------------------

There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.

This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.

The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:

- the userspace APIs are (somewhat) normalised

- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.

- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel

This patch:

Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>