
Using Linux CGroups with Docker containers

2014-07-17

Introduction to Control Groups

Starting at the source of all information: the Linux kernel documentation for cgroups, cgroups.txt [1].

Control Groups provide a mechanism for aggregating/partitioning sets of
tasks, and all their future children, into hierarchical groups with
specialized behaviour.

Definitions:

A *cgroup* associates a set of tasks with a set of parameters for one
or more subsystems.

A *subsystem* is a module that makes use of the task grouping
facilities provided by cgroups to treat groups of tasks in
particular ways. A subsystem is typically a "resource controller" that
schedules a resource or applies per-cgroup limits, but it may be
anything that wants to act on a group of processes, e.g. a
virtualization subsystem.

A *hierarchy* is a set of cgroups arranged in a tree, such that
every task in the system is in exactly one of the cgroups in the
hierarchy, and a set of subsystems; each subsystem has system-specific
state attached to each cgroup in the hierarchy. Each hierarchy has
an instance of the cgroup virtual filesystem associated with it.

At any one time there may be multiple active hierarchies of task
cgroups. Each hierarchy is a partition of all tasks in the system.

User-level code may create and destroy cgroups by name in an
instance of the cgroup virtual file system, specify and query to
which cgroup a task is assigned, and list the task PIDs assigned to
a cgroup. Those creations and assignments only affect the hierarchy
associated with that instance of the cgroup file system.

On their own, the only use for cgroups is for simple job
tracking. The intention is that other subsystems hook into the generic
cgroup support to provide new attributes for cgroups, such as
accounting/limiting the resources which processes in a cgroup can
access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allow
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.

I’ve quoted that in its entirety because it’s important to keep the definitions in mind; I’m going to refer to them several times during this post. Take special note of the last sentence: "... cpusets allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup." CGroups need to be combined with cpusets and other subsystems (controllers) to restrict compute resources. Make sure you check out the other cgroup controller documentation at [2].

Creating Control Groups

There are a couple of different ways to get cgroups working.
The first step is to ensure that you have the required software installed.

[root@testvm ~]# yum install libcgroup
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirror.mel.bkb.net.au
* extras: mirror.mel.bkb.net.au
* updates: ftp.swin.edu.au
Setting up Install Process
Resolving Dependencies
--> Running transaction check
--> Package libcgroup.x86_64 0:0.40.rc1-5.el6_5.1 will be installed
--> Finished Dependency Resolution
[snip]
Running Transaction
Installing : libcgroup-0.40.rc1-5.el6_5.1.x86_64 1/1
Verifying : libcgroup-0.40.rc1-5.el6_5.1.x86_64 1/1
Installed:
libcgroup.x86_64 0:0.40.rc1-5.el6_5.1
Complete!

Now you should be able to determine which cgroup subsystems (controllers) are available using the lssubsys command:

[root@testvm ~]# lssubsys -a
cpuset
ns
cpu
cpuacct
memory
devices
freezer
net_cls
blkio
perf_event
net_prio

You can now create the cgroup hierarchy with the required subsystems.

There are two ways to do this: either directly via the mount command or, my preference, via the /etc/cgconfig.conf file.
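For reference, the mount-based approach looks something like the following (a minimal sketch for a single subsystem; it assumes the mount point doesn't already exist, and the mount won't persist across reboots):

[root@testvm ~]# mkdir -p /cgroup/cpuset
[root@testvm ~]# mount -t cgroup -o cpuset cpuset /cgroup/cpuset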

The default cgconfig.conf looks like this:

mount {
cpuset = /cgroup/cpuset;
cpu = /cgroup/cpu;
cpuacct = /cgroup/cpuacct;
memory = /cgroup/memory;
devices = /cgroup/devices;
freezer = /cgroup/freezer;
net_cls = /cgroup/net_cls;
blkio = /cgroup/blkio;
}

Changes to the cgconfig.conf file are refreshed via:

[root@testvm ~]# service cgconfig restart
Stopping cgconfig service: [ OK ]
Starting cgconfig service: [ OK ]
[root@testvm ~]# ls -l /cgroup/
total 0
drwxr-xr-x 2 root root 0 Jul 16 09:05 blkio
drwxr-xr-x 2 root root 0 Jul 16 09:05 cpu
drwxr-xr-x 2 root root 0 Jul 16 09:05 cpuacct
drwxr-xr-x 2 root root 0 Jul 16 09:05 cpuset
drwxr-xr-x 2 root root 0 Jul 16 09:05 devices
drwxr-xr-x 2 root root 0 Jul 16 09:05 freezer
drwxr-xr-x 2 root root 0 Jul 16 09:05 memory
drwxr-xr-x 2 root root 0 Jul 16 09:05 net_cls
[root@testvm ~]# ls -l /cgroup/cpuset/
total 0
--w--w--w- 1 root root 0 Jul 16 09:05 cgroup.event_control
-rw-r--r-- 1 root root 0 Jul 16 09:05 cgroup.procs
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.cpus
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Jul 16 09:05 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.memory_pressure_enabled
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.mems
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 Jul 16 09:05 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 Jul 16 09:05 notify_on_release
-rw-r--r-- 1 root root 0 Jul 16 09:05 release_agent
-rw-r--r-- 1 root root 0 Jul 16 09:05 tasks

As you can see from the above, the control structures have been created and populated according to our definition.

For the specific details on the control structure, check out the Resource Management Guide [3], but for this post I'll provide some examples.

Example 1 - Restrict a group of processes to the same CPU

Start by creating a cgroup called group-mycpuburn:

[root@testvm ~]# cgcreate -g cpuset:/group-mycpuburn

Please review the man page for cgcreate, as there are a number of options that allow for non-root management of the cgroup; the -a and -t options in particular will be of interest.
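For example, a sketch that would let a hypothetical non-root user geoff administer the group (-a) and move tasks into it (-t):

[root@testvm ~]# cgcreate -t geoff:geoff -a geoff:geoff -g cpuset:/group-mycpuburn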

Now we’re ready to do some interesting stuff.

For this scenario I have two CPU-intensive processes that I want to restrict to a single CPU. The code for the 'application' is:

#include <stdio.h>

int main(int argc, char **argv)
{
    int i = 0;

    /* Spin forever; the wrap-around just keeps the counter bounded. */
    while (1)
    {
        i++;
        if (i > 1000)
            i = 0;
    }
}

which I’ve compiled to a utility called loop. The actual program is irrelevant; all that matters is that it consumes a lot of CPU.
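Compiling it is trivial; a sketch, assuming the source is saved as loop.c (even if the optimizer simplifies the loop body, an empty infinite loop still pegs a CPU):

[root@testvm ~]# gcc -o loop loop.c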

On this system I have two CPUs, which are fully occupied when I run two copies of the application, e.g.

[root@testvm ~]# top
top - 21:31:31 up 2:47, 2 users, load average: 0.51, 0.38, 0.19
Tasks: 87 total, 3 running, 84 sleeping, 0 stopped, 0 zombie
Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061624k total, 505408k used, 7556216k free, 11520k buffers
Swap: 4194296k total, 0k used, 4194296k free, 373820k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1508 root 20 0 3920 364 288 R 99.9 0.0 0:17.33 loop
1509 root 20 0 3920 364 288 R 99.5 0.0 0:13.97 loop

This is not the outcome I want.

[root@testvm ~]# cgset -r cpuset.cpus='0' group-mycpuburn
[root@testvm ~]# cgset -r cpuset.mems='0' group-mycpuburn

The above commands state that the group-mycpuburn group can only use a single CPU (cpu 0) and, as this is not a NUMA system, it can use the only memory node, node 0.

Please be aware that cpuset.cpus and cpuset.mems are mandatory fields; you must define both before adding tasks or your cgroup will not work the way you expect.
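cgset is just a front end for the cgroup virtual filesystem, so the equivalent direct approach (a sketch, with a quick sanity check) is:

[root@testvm ~]# echo 0 > /cgroup/cpuset/group-mycpuburn/cpuset.cpus
[root@testvm ~]# cat /cgroup/cpuset/group-mycpuburn/cpuset.cpus
0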

Now we can place the PIDs into the group we created:

[root@testvm ~]# cgclassify -g cpuset:group-mycpuburn 1508 1509

Reviewing top again shows that our processes are only using one CPU, each getting a proportional 50% share of it:

1509 root 20 0 3920 364 288 R 49.9 0.0 21:09.05 loop
1508 root 20 0 3920 364 288 R 49.6 0.0 21:12.87 loop

Note: A child process inherits its parent's cgroup, so you don't need to assign children manually.
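You can confirm group membership at any time by reading the group's tasks file (or /proc/<pid>/cgroup for an individual process):

[root@testvm ~]# cat /cgroup/cpuset/group-mycpuburn/tasks
1508
1509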

What happens when you stop and start the parent process? You either need to manually re-assign it to the cgroup or automate the addition. It turns out the cgroup rules engine daemon (cgred) can assist in a limited way. For more information on this daemon, please refer to the configuration file /etc/cgrules.conf and its man page.
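A sketch of what such a rule might look like in /etc/cgrules.conf (the format is user[:process] subsystems destination); this one pins any process named loop, run by any user, into our cpuset group, and assumes the cgred service is then restarted:

*:loop        cpuset        group-mycpuburn/

[root@testvm ~]# service cgred restart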

CGroups and Docker - what happens when you start a docker container?

Firstly, let’s get ready to run a docker container. The docker-io package needs to be installed, and then you can either build a base image or download one. I’m going to download one for simplicity here.
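Installing docker on this CentOS 6 box is straightforward (a sketch; it assumes the EPEL repository, which provides the docker-io package, is enabled):

[root@testvm ~]# yum install docker-io
[root@testvm ~]# service docker start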

[root@testvm ~]# docker pull centos
Pulling repository centos
1a7dc42f78ba: Download complete
cd934e0010d5: Download complete
511136ea3c5a: Download complete
34e94e67e63a: Download complete
[root@testvm ~]# docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
centos centos7 1a7dc42f78ba 6 days ago 236.4 MB
centos latest 1a7dc42f78ba 6 days ago 236.4 MB
centos centos6 cd934e0010d5 7 days ago 206.9 MB

What does our cgroup setup look like before we start docker?

[root@testvm ~]# cgsnapshot
# Configuration file generated by cgsnapshot
mount {
cpuset = /cgroup/cpuset;
cpu = /cgroup/cpu;
cpuacct = /cgroup/cpuacct;
memory = /cgroup/memory;
devices = /cgroup/devices;
freezer = /cgroup/freezer;
net_cls = /cgroup/net_cls;
blkio = /cgroup/blkio;
}

Nothing special.

Example 2 - starting a docker container

Firstly, I need my container to have my application in it. The following Dockerfile gets that done for me:

[root@testvm ~]# ls burncentos6
Dockerfile loop
[root@testvm ~]# cat burncentos6/Dockerfile
FROM centos:centos6
MAINTAINER Geoff O'Callaghan
ADD loop /loop
ENTRYPOINT /loop
[root@testvm ~]# docker build burncentos6
Sending build context to Docker daemon 9.728 kB
Sending build context to Docker daemon
Step 0 : FROM centos:centos6
---> cd934e0010d5
Step 1 : MAINTAINER Geoff O'Callaghan
---> Using cache
---> 3ff5b17bf373
Step 2 : ADD loop /loop
---> ee36984270ee
Removing intermediate container 9dc4201fc10b
Step 3 : ENTRYPOINT /loop
---> Running in ad2c8e56e270
---> c62ddcab64e1
Removing intermediate container ad2c8e56e270
Successfully built c62ddcab64e1
[root@testvm ~]# docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
<none> <none> c62ddcab64e1 11 seconds ago 206.9 MB
centos centos7 1a7dc42f78ba 6 days ago 236.4 MB
centos latest 1a7dc42f78ba 6 days ago 236.4 MB
centos centos6 cd934e0010d5 7 days ago 206.9 MB
[root@testvm ~]# docker run -d c62ddcab64e1
0f26e2bc3ddab019b4d26aa1531e6bba2b4f35aa7a75805ec4bc99a40879487a
[root@testvm ~]# docker run -d c62ddcab64e1
7045f7a339908ec2ffbaa70eb7780f8911f0cd6f9c57ad143aad443f14e0ed81
[root@testvm ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7045f7a33990 c62ddcab64e1 /bin/sh -c /loop 26 minutes ago Up 26 minutes distracted_colden
0f26e2bc3dda c62ddcab64e1 /bin/sh -c /loop 27 minutes ago Up 27 minutes ecstatic_goldstine
[root@testvm ~]# top
top - 06:31:52 up 1:19, 3 users, load average: 1.03, 0.69, 0.51
Tasks: 106 total, 3 running, 103 sleeping, 0 stopped, 0 zombie
Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061624k total, 987512k used, 7074112k free, 32796k buffers
Swap: 4194296k total, 0k used, 4194296k free, 769456k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2942 root 20 0 3920 364 288 R 99.8 0.0 1:17.41 loop
3027 root 20 0 3920 360 288 R 99.8 0.0 0:16.69 loop

So let’s look at the cgroups. Note this output is edited for clarity:

[root@testvm ~]# cgsnapshot
# Configuration file generated by cgsnapshot
mount {
cpuset = /cgroup/cpuset;
cpu = /cgroup/cpu;
cpuacct = /cgroup/cpuacct;
memory = /cgroup/memory;
devices = /cgroup/devices;
freezer = /cgroup/freezer;
net_cls = /cgroup/net_cls;
blkio = /cgroup/blkio;
}
group docker {
cpu {
cpu.rt_period_us="1000000";
cpu.rt_runtime_us="0";
cpu.cfs_period_us="100000";
cpu.cfs_quota_us="-1";
cpu.shares="1024";
}
}
group docker/7045f7a339908ec2ffbaa70eb7780f8911f0cd6f9c57ad143aad443f14e0ed81 {
cpu {
cpu.rt_period_us="1000000";
cpu.rt_runtime_us="0";
cpu.cfs_period_us="100000";
cpu.cfs_quota_us="-1";
cpu.shares="1024";
}
}
group docker/0f26e2bc3ddab019b4d26aa1531e6bba2b4f35aa7a75805ec4bc99a40879487a {
cpu {
cpu.rt_period_us="1000000";
cpu.rt_runtime_us="0";
cpu.cfs_period_us="100000";
cpu.cfs_quota_us="-1";
cpu.shares="1024";
}
}
.......
group docker {
memory {
memory.memsw.failcnt="0";
memory.memsw.limit_in_bytes="9223372036854775807";
memory.memsw.max_usage_in_bytes="3538944";
memory.move_charge_at_immigrate="0";
memory.swappiness="60";
memory.use_hierarchy="0";
memory.failcnt="0";
memory.soft_limit_in_bytes="9223372036854775807";
memory.limit_in_bytes="9223372036854775807";
memory.max_usage_in_bytes="3538944";
}
}
group docker/7045f7a339908ec2ffbaa70eb7780f8911f0cd6f9c57ad143aad443f14e0ed81 {
memory {
memory.memsw.failcnt="0";
memory.memsw.limit_in_bytes="9223372036854775807";
memory.memsw.max_usage_in_bytes="5505024";
memory.move_charge_at_immigrate="0";
memory.swappiness="60";
memory.use_hierarchy="0";
memory.failcnt="0";
memory.soft_limit_in_bytes="9223372036854775807";
memory.limit_in_bytes="9223372036854775807";
memory.max_usage_in_bytes="5505024";
}
}
group docker/0f26e2bc3ddab019b4d26aa1531e6bba2b4f35aa7a75805ec4bc99a40879487a {
memory {
memory.memsw.failcnt="0";
memory.memsw.limit_in_bytes="9223372036854775807";
memory.memsw.max_usage_in_bytes="3653632";
memory.move_charge_at_immigrate="0";
memory.swappiness="60";
memory.use_hierarchy="0";
memory.failcnt="0";
memory.soft_limit_in_bytes="9223372036854775807";
memory.limit_in_bytes="9223372036854775807";
memory.max_usage_in_bytes="3653632";
}
}

… and more.

As you can see, docker has created a docker group and sub-groups for each of the launched containers, allowing you to set performance and access restrictions on a per-container basis - fantastic. But how do we use them?

Docker clearly understands and builds a cgroup structure, but the secret to manipulating these groups comes down to exploring the docker run command itself, which you are encouraged to do.

docker run --cpuset="" can be used to create the cpuset as we did manually above.

docker run -c can be used to set a relative CPU share weighting between containers.

[root@testvm ~]# docker run -d --cpuset="0" c62ddcab64e1
84263a674eae9bf64f241c281abba21ac62afd005cf7a4f7364cb1886d49a1c4
[root@testvm ~]# docker run -d --cpuset="0" c62ddcab64e1
e72376e23ae7d0a1dc8fb886f56e8614e98aeb1c4bfbe5e505647970cfb0e950
top - 07:38:41 up 2:26, 3 users, load average: 1.64, 1.73, 1.80
Tasks: 105 total, 3 running, 102 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.1%us, 0.0%sy, 0.0%ni, 49.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061624k total, 1026656k used, 7034968k free, 50524k buffers
Swap: 4194296k total, 0k used, 4194296k free, 787832k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9220 root 20 0 3920 360 288 R 49.9 0.0 0:28.05 loop
9306 root 20 0 3920 360 288 R 49.9 0.0 0:27.05 loop

As you can see, the docker containers are now sharing the same CPU at 50% each. This is visible in the cgsnapshot segment below, where you can see cpuset.cpus set appropriately.

group docker/e72376e23ae7d0a1dc8fb886f56e8614e98aeb1c4bfbe5e505647970cfb0e950 {
cpuset {
cpuset.memory_spread_slab="0";
cpuset.memory_spread_page="0";
cpuset.memory_migrate="0";
cpuset.sched_relax_domain_level="-1";
cpuset.sched_load_balance="1";
cpuset.mem_hardwall="0";
cpuset.mem_exclusive="0";
cpuset.cpu_exclusive="0";
cpuset.mems="0";
cpuset.cpus="0";
}
}
group docker/84263a674eae9bf64f241c281abba21ac62afd005cf7a4f7364cb1886d49a1c4 {
cpuset {
cpuset.memory_spread_slab="0";
cpuset.memory_spread_page="0";
cpuset.memory_migrate="0";
cpuset.sched_relax_domain_level="-1";
cpuset.sched_load_balance="1";
cpuset.mem_hardwall="0";
cpuset.mem_exclusive="0";
cpuset.cpu_exclusive="0";
cpuset.mems="0";
cpuset.cpus="0";
}
}

Now, what if we want to change the relative weights of these docker containers? This is most easily shown by again restricting the cpuset to a single CPU and adding the shares option (-c):

[root@testvm ~]# docker run -d --cpuset="0" -c 100 c62ddcab64e1
9f0fb473334a4c6ef418c9dda219ed3b93566a303340b5d3695531315d4dcc8f
[root@testvm ~]# docker run -d --cpuset="0" -c 900 c62ddcab64e1
70cfd8b7e660d0ff6c5a5d86c45d04723b0558aab0c40b1dfdeb235a9df8d4dc
top - 07:44:25 up 2:32, 3 users, load average: 1.47, 1.72, 1.78
Tasks: 105 total, 3 running, 102 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.4%us, 0.0%sy, 0.0%ni, 79.4%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061624k total, 1030236k used, 7031388k free, 52124k buffers
Swap: 4194296k total, 0k used, 4194296k free, 789752k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9521 root 20 0 3920 360 288 R 89.7 0.0 0:23.36 loop
9436 root 20 0 3920 360 288 R 10.0 0.0 0:06.29 loop

There you have it: the proportional weighting for the containers is as expected (roughly 90/10, matching the 900/100 share split).
Obviously docker suffers from the same issues as normal processes when it comes to cgroups: if you customize the cgroup configuration manually and the container goes away and comes back, the manual configuration will be lost.
If you can utilize the docker-supplied support for cgroups to meet your requirements, then great. But what if you can't?
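One manual workaround, as a sketch only (and one you'd have to repeat every time the container restarts): look up the container's main PID and classify it into your own cgroup. Using one of the containers started earlier (the PID returned here is illustrative):

[root@testvm ~]# docker inspect --format '{{ .State.Pid }}' 70cfd8b7e660
9521
[root@testvm ~]# cgclassify -g cpuset:group-mycpuburn 9521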

Bring on the docker management frameworks :-)

References

  1. cgroups.txt https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
  2. cgroup subsystems https://www.kernel.org/doc/Documentation/cgroups/00-INDEX
  3. Resource Management Guide https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html