
Containers: Deep dive

☕ 18 min read

Containers and compatibility

Before diving deeper into containers I want to highlight that containers DO NOT ensure compatibility across all possible host-guest combinations; that is one of the biggest misconceptions about them. This doesn't only apply to different CPU architectures or to mixing Windows with Linux. Even with host and guest running the same distro, it is impossible to guarantee compatibility between kernel versions, because you don't know what the user will try to execute. New syscalls can be added, or existing ones can change behaviour over time, so if your application does anything more advanced than opening files and shipping them across the network, it is likely that you will run into this problem [1].

To illustrate the idea here is a simple PoC.

Create two VMs with Vagrant using generic/centos7 and generic/centos8 respectively.
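
A minimal sketch of that setup:

# One directory per box (the layout is just a convention)
mkdir centos7 && cd centos7 && vagrant init generic/centos7 && vagrant up && cd ..
mkdir centos8 && cd centos8 && vagrant init generic/centos8 && vagrant up && cd ..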

Now create a container with quay.io/fatherlinux/fedora14 image and create a user inside the container.
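
A sketch of the test, assuming podman as the container manager:

# Inside each VM
sudo podman run -it quay.io/fatherlinux/fedora14 bash
# Now, inside the container, try to create a user
useradd poc-user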

Outcome using Centos7:

Incompatibility outcome using Centos7

Outcome using Centos8:
Incompatibility outcome using Centos8

Note that this is an incompatibility with SELinux, so it works fine if we disable it [2]:

Disabling SELinux
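
For reference, this is how SELinux can be temporarily switched to permissive mode on the host:

# Check the current mode, relax it, then retry the useradd inside the container
getenforce        # -> Enforcing
sudo setenforce 0
getenforce        # -> Permissive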

What kernel features made them possible?

Container managers like docker (let's ignore for now that it is just a wrapper around containerd) and podman just create processes using a few kernel features; without those features containers wouldn't be possible:

chroot

It was the first approach to isolating some resources. From the chroot syscall man page [3]:

chroot() changes the root directory of the calling process to that specified in path. This directory will be used for pathnames beginning with /. The root directory is inherited by all children of the calling process.

Keep in mind that this feature is sometimes treated as a security measure, but under certain conditions you can escape. As shown in the sketch after the list, all you need is [4]:

  1. Root privileges
  2. Be able to run the following syscalls:
  • mkdir()
  • chroot()
  • chdir()
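
A classic sketch of that escape, written as a perl one-liner (assuming perl is available inside the jail):

# Run as root inside the jail
perl -e '
  mkdir ".esc";           # create a subdirectory
  chroot ".esc";          # chroot into it WITHOUT chdir: cwd is now outside the new root
  chdir ".." for 1 .. 64; # walk up to the real filesystem root
  chroot ".";             # chroot to the real root
  exec "/bin/sh";         # shell outside the jail
'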

It is important to understand the differences between the syscall and the CLI command. The syscall by itself doesn't update the working directory; on the other hand, we can check the source code of the command [5]:

if (chroot (newroot) != 0)
  die (EXIT_CANCELED, errno, _("cannot change root directory to %s"),
       quoteaf (newroot));

if (! skip_chdir && chdir ("/"))
  die (EXIT_CANCELED, errno, _("cannot chdir to root directory"));

It first calls chroot() and then changes the current directory with chdir("/").

There are other, more complex ways of breaking out of the jail, like creating your own /dev/hda or patching the kernel at runtime [6] [7].

namespaces

Namespaces are a Linux kernel feature released in kernel version 2.6.24 in 2008. They provide processes with their own system view, thus isolating independent processes from each other. [8] [9] [10] [11]

Namespaces are created using the clone syscall; from the man pages:

These system calls create a new (“child”) process, in a manner similar to fork(2). By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).

For testing purposes you can also use the unshare command, which uses the unshare syscall (who would have known) to disassociate parts of the process execution context. Let's see what each namespace is in charge of.

Process isolation (PID namespace)

PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID. PID namespaces allow containers to provide functionality such as suspending/resuming the set of processes in the container and migrating the container to a new host while the processes inside the container maintain the same PIDs [12].

There are two details that may be good to know:

  • If the “init” process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This also applies to containers, therefore, if the process with PID 1 inside a container ends, the container stops.
  • Only signals for which the “init” process has established a signal handler can be sent to the “init” process by other members of the PID namespace. This restriction applies even to privileged processes and prevents other members of the PID namespace from accidentally killing the “init” process.

From the explanation above you may have noticed that PIDs are actually not unique; they are only unique within a given namespace. That can be seen in the kernel's pid.h [13]:

struct upid {
    int nr;
    struct pid_namespace *ns;
};

struct pid
{
    refcount_t count;
    unsigned int level;
    spinlock_t lock;
    /* lists of tasks that use this pid */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct hlist_head inodes;
    /* wait queue for pidfd notifications */
    wait_queue_head_t wait_pidfd;
    struct rcu_head rcu;
    struct upid numbers[1];
};

You could try it out with the following command:

sudo unshare --pid --fork /bin/bash -c 'echo $$'

Note that the ps command won't work as expected because it reads the /proc filesystem to list processes, and that filesystem hasn't been isolated.
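
unshare can also mount a fresh /proc for us, after which ps behaves as expected:

# --mount-proc implies a new mount namespace with a private /proc
sudo unshare --pid --fork --mount-proc /bin/bash -c 'ps aux'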

Network interfaces (net namespace)

The NET namespace allows assigning network interfaces to isolated processes. Even the loopback interface is unique to each namespace.

By default the namespace comes with no interfaces created, so to get connectivity you need to create them [14]. Here is how you create the namespace with unshare:

sudo unshare --net /bin/bash
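
Inside the new namespace only the loopback device exists, and it starts out down:

sudo unshare --net ip link show
# 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN ...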

Unix Timesharing System (uts namespace)

UTS is a simple one: it enables a process to hold a different host and domain name. For instance, run:

sudo unshare -u /bin/bash -c 'hostname foo && hostname && sleep 10'

Observe how your hostname doesn’t change in the main namespace.

User namespace

On Linux systems users may or may not have privileges to access resources, based on their effective user ID (UID). The user namespace is a kernel feature allowing per-process virtualization of this attribute. This means that you may appear to have root privileges inside a user namespace while being unprivileged outside of it.

These mappings are stored in /proc/PID/uid_map and /proc/PID/gid_map. By default they are empty, and according to the docs, if a user ID has no mapping inside the namespace, then system calls that return user IDs return the value defined in the file /proc/sys/kernel/overflowuid, which on a standard system defaults to the value 65534.

User namespace

Let’s check it:

unshare -U id
uid=65534(nobody) gid=65534(nobody) groups=65534(nogroup)
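
You can also ask unshare to write the mapping for you, so that your user becomes root inside the namespace:

# --map-root-user (-r) maps the current uid/gid to 0 inside the namespace
unshare --user --map-root-user bash -c 'id; cat /proc/self/uid_map'
# uid=0(root) gid=0(root) groups=0(root)
#          0       1000          1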

Mount (mnt namespace)

According to the documentation mount namespaces provide isolation of the list of mounts seen by the processes in each namespace instance. Thus, the processes in each of the mount namespace instances will see distinct single-directory hierarchies.

For example we could create a mount inside the namespace and that mount shouldn’t be accessible from outside:

Mount namespace
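
A sketch of that PoC using two terminals:

# Terminal 1: private mount namespace with a tmpfs mounted on /mnt
sudo unshare --mount bash
mount -t tmpfs tmpfs /mnt
touch /mnt/only-visible-here

# Terminal 2 (host): the mount doesn't exist here
ls /mnt
findmnt /mnt    # no output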

Interprocess Communication (IPC)

Linux supports a number of Inter-Process Communication (IPC) mechanisms, such as signals, pipes or System V IPC (message queues, semaphores, shared memory) [?]. All these mechanisms get isolated when using this specific namespace.

Here is a simple PoC creating a shared memory segment:

IPC namespace
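
A sketch of such a PoC:

# Host: create a shared memory segment and list it
ipcmk -M 4096
ipcs -m             # the new segment shows up
# New IPC namespace: the segment is not visible
sudo unshare --ipc ipcs -m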

Check windsock.io for an in-depth example [15].

cgroups

Control groups (cgroups) are a Linux kernel mechanism for fine-grained control of resources. While there are currently two versions of cgroups, most distributions and mechanisms use version 1, as it has been in the kernel since 2.6.24. It brings four features that are highly related to each other [16]:

  • Resource limiting: lets you run programs on the system within certain boundaries for CPU, RAM, device I/O and device groups.
  • Prioritization: instead of limiting resources, it specifies which processes get more resources than others with lower priority.
  • Accounting: turned off by default because of the additional resource utilization. It allows checking which resources are being used by each cgroup.
  • Process control: also known as the freezer, it allows taking a 'snapshot' of a particular process and moving it. This is often used on HPC clusters [17].

Let's have a small example using cpu shares. The cpu.shares value basically specifies what portion of cpu time each group may use, the default value being 1024. Here is an example:

CPU share

Take a look at how the one with a cpu share of 256 gets 256/(256+1024) of the cpu usage on each core, or what is the same, one fifth of the cpu usage. You can also check the cgroup configuration of a container under /sys/fs/cgroup/cpu/docker/<container-id>/cpu.shares.
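
A sketch of that experiment; the progrium/stress image is an assumption, any CPU-hungry workload works:

# 4:1 share ratio between two containers competing for the same cores
docker run -d --name high --cpu-shares 1024 progrium/stress --cpu 2
docker run -d --name low  --cpu-shares 256  progrium/stress --cpu 2
# Inspect the resulting cgroup v1 value for one of them
cat "/sys/fs/cgroup/cpu/docker/$(docker inspect -f '{{.Id}}' low)/cpu.shares"   # -> 256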

Let’s make another POC on how we can manage access to tty devices [18]:

cgroups TTY PoC

The current terminal does not close because the cgroup only applies when open() gets called, and we are using an already opened descriptor.
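
A minimal sketch of that PoC on cgroup v1 (the cgroup name is made up):

# Deny open-for-write on /dev/tty (char device 5:0) for the current shell
sudo mkdir /sys/fs/cgroup/devices/poc
echo "c 5:0 w" | sudo tee /sys/fs/cgroup/devices/poc/devices.deny
echo $$ | sudo tee /sys/fs/cgroup/devices/poc/cgroup.procs
echo hello > /dev/tty   # -> Operation not permitted; the already-open fd keeps working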

Containers architecture (runtime, …)

From a user perspective it may look like docker or podman runs and handles all container-related stuff. Nothing could be further from the truth: in the case of docker there are more tools involved before a container is even summoned. Here is a small diagram [19]:

Containers runtime

As you can see in the image, the docker client only interacts with the docker daemon API; the docker daemon then communicates with containerd, which finally makes the calls to runc that spawn containers. Other container managers like podman communicate directly with runc.

Here is an example on how to create each of these processes by hand.

Interacting with docker daemon

You can manually interact with docker daemon API using curl. For instance let’s create a container:

Dockerd API
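
A sketch of the same calls with curl against the daemon's unix socket; the API version v1.41 is an assumption, adjust it to your daemon:

# Create a container from an image...
curl -s --unix-socket /var/run/docker.sock \
     -H "Content-Type: application/json" \
     -d '{"Image": "nginx:latest"}' \
     "http://localhost/v1.41/containers/create?name=api-poc"
# ...and start it
curl -s -X POST --unix-socket /var/run/docker.sock \
     "http://localhost/v1.41/containers/api-poc/start"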

Check the docker engine API documentation [20] for better understanding.

Interacting with containerd

To start containers directly with containerd there is a CLI tool, ctr [21]; here is how we would create a container with it [22]:

ctr POC
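
Roughly, the steps look like this (image and container name are arbitrary):

# Pull an image, then create and start a container/task from it
sudo ctr images pull docker.io/library/nginx:latest
sudo ctr run -d docker.io/library/nginx:latest poc-nginx
sudo ctr tasks ls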

It can also be invoked through gRPC and protobuf (that’s actually what ctr does for us). Here is an example using grpcurl:

Containerd grpcurl

It is also interesting that containerd namespaces are implemented in the runtime itself, not in k8s.

Interacting with runc

runc is one of the low-level runtimes, alongside gVisor or Firecracker; these lack image integrity, bundle history logs and some other features that high-level runtimes like containerd support. That is the main reason why containerd is built on top of runc [23].

Creating a runc container involves a couple more steps [19]. runc needs a config file that specifies all the parameters of how it should run the container:

runc config

Additionally it needs a root filesystem (a rootfs directory by default); see the sketch below.
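
A minimal sketch of the whole flow (bundle directory and container name are made up):

# Reuse a rootfs exported from an existing image
mkdir -p bundle/rootfs && cd bundle
docker export "$(docker create busybox)" | tar -C rootfs -xf -
# Generate a default config.json, then tweak namespaces, mounts, args...
runc spec
sudo runc run poc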

Finally you can configure hooks to run before or after various lifecycle events [51] like prestart or poststop.

What are images in reality?

To understand what images really are, let's see what happens when we pull one, captured with BurpSuite (nginx:latest is used for the example).

Before anything else, docker first checks that the registry is available:

Image pull check registry

Before the next requests, it gets a token:

Image pull get token

Afterwards it checks that the specified tag exists for that image, with a HEAD request:

Image pull check tag

Then according to OCI standard [24] it gets a list of available manifests [25]:

Image pull list available manifests

If docker finds a suitable one, it pulls the given manifest using the digest specified in the manifest list:

Image pull manifest
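
That conversation can be replayed by hand; the endpoints below are Docker Hub's public ones, and jq is assumed to be installed:

# Grab a pull token, then request the manifest list for nginx:latest
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/nginx:pull" | jq -r .token)
curl -s -H "Authorization: Bearer $TOKEN" \
     -H "Accept: application/vnd.docker.distribution.manifest.list.v2+json" \
     https://registry-1.docker.io/v2/library/nginx/manifests/latest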

At this stage docker knows where to locate everything it needs to download the image, starting with the image config [26]:

Image pull image config

Finally it downloads each layer's blob:

Image pulling all layers

Look how it doesn't follow any specific order.

You may also find images stored in tarballs [27], which is pretty similar to how we just downloaded the image from the registry, only stored in a single tar file. Note that the docker save command creates a tarball following the old specification [28]; to generate OCI tarballs you can use podman save --format oci-archive [29].

Also, you should be aware that every command in a Dockerfile generates a new layer. For instance, let's analyse the following Dockerfile:

FROM alpine

ENV foo=bar
ENV foo=bar2

RUN echo 'How many layers do I have?' > /poc.txt

Logically it may seem that only one of the ENV commands would create a layer. Nevertheless, it creates two; here is the result of running podman history:

podman history

As you may have noticed, they have a size of 0 bytes, which makes sense because they only contain config information:

podman inspect config

Advanced images

When creating images there are some features that you can use to get higher quality images.

Multi-Stage builds

One of the keys to creating high quality images is reducing the number of layers and the overall size. A good practice is to clean up any artifacts you don't need before moving to the next layer [30].

To help with that you can use FROM multiple times, each time referring to a different base image. Each FROM statement begins a new stage in the image being built, and you can copy any artifact from a previous stage using COPY:

COPY --from=0 /foo /bar

Stages can be named and referenced by name:

# syntax=docker/dockerfile:1
FROM golang:1.16 AS builder
...
COPY --from=builder /foo /bar

You can even copy from an external image or use a previous stage as a new one, as in the sketch below.
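
Putting it together, a hypothetical sketch (file and stage names are made up) that builds in one stage, copies only the artifact into a clean base, and also copies a file from an external image:

cat > Dockerfile <<'EOF'
# syntax=docker/dockerfile:1
FROM golang:1.16 AS builder
WORKDIR /src
COPY main.go .
RUN CGO_ENABLED=0 go build -o /out/app main.go

FROM alpine
COPY --from=builder /out/app /usr/local/bin/app        # artifact from a named stage
COPY --from=nginx:latest /etc/nginx/nginx.conf /poc/   # file from an external image
ENTRYPOINT ["app"]
EOF
docker build -t multistage-poc .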

Exec vs shell

There are two different ways of executing commands on docker images [31]:

  • RUN <command>: shell form, the command is run in a shell, which by default is /bin/sh -c on Linux.
  • RUN ["executable", "param1", "param2"]: exec form which uses exec family of functions.

You can see how shell form can access environment variables:

RUN PoC 1

It doesn’t work using exec form:

RUN PoC 2

However you can emulate shell behaviour:

RUN PoC 3
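
The three behaviours can be reproduced with a minimal sketch like this (variable and file names are made up):

cat > Dockerfile <<'EOF'
FROM alpine
ENV NAME=world
RUN echo "hello $NAME"                      # shell form -> hello world
RUN ["echo", "hello $NAME"]                 # exec form -> hello $NAME, no expansion
RUN ["/bin/sh", "-c", "echo hello $NAME"]   # exec form emulating the shell
EOF
docker build --progress=plain --no-cache .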

Multiple services

A container’s main running process is specified with ENTRYPOINT and/or CMD. As I already explained you can only have one process, which matches the idea of having a single service per container.

Nevertheless, you may want to have multiple services in a single container. You can accomplish that with a process manager, which lets you handle services without the need for a full-fledged init system such as sysvinit, upstart or systemd. For instance:

# syntax=docker/dockerfile:1
FROM ubuntu:latest
RUN apt-get update && apt-get install -y supervisor
RUN mkdir -p /var/log/supervisor
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
COPY my_first_process my_first_process
COPY my_second_process my_second_process
CMD ["/usr/bin/supervisord"]
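
The referenced supervisord.conf could be as small as this sketch (program names and paths are the placeholders from the Dockerfile above):

cat > supervisord.conf <<'EOF'
[supervisord]
nodaemon=true

[program:my_first_process]
command=/my_first_process

[program:my_second_process]
command=/my_second_process
EOF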

Understanding different storage drivers

Storage drivers determine how containers and images are stored, and therefore each one has unique performance characteristics. Keep in mind that changing the default storage driver means you can only use the containers and images generated with that specific driver [32].

AUFS

According to docker documentation:

The aufs storage driver was the preferred storage driver for Docker 18.06 and older, when running on Ubuntu 14.04 on kernel 3.13

It is just a UnionFS implementation, allowing several directories to be merged and presented as a single unified view [33] [34].

Here is an example:
AUFS

Have a look at how it handles files with the same absolute path and name: it merges branches (what you would call layers in container terminology) from right to left, so it keeps the latest version to be merged.

The diagram below shows a Docker container based on the ubuntu:latest image.

Docker aufs

You can install the aufs tooling with apt install -y aufs-tools.
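
A minimal sketch of such a merge done by hand (paths are made up):

# The leftmost branch has the highest precedence on name collisions
mkdir /tmp/branch1 /tmp/branch2 /tmp/merged
echo v1 > /tmp/branch1/file
echo v2 > /tmp/branch2/file
sudo mount -t aufs -o br=/tmp/branch1:/tmp/branch2 none /tmp/merged
cat /tmp/merged/file   # -> v1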

BTRFS

Docker's btrfs storage driver leverages many Btrfs features for image and container management. Among these features are block-level operations, thin provisioning, copy-on-write snapshots, and ease of administration. You can easily combine multiple physical block devices into a single Btrfs filesystem [35].

You can configure Docker to use btrfs as follows:

BRTFS
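
A sketch of that setup following the docker documentation [35]; the device name /dev/xvdf is an assumption:

sudo systemctl stop docker
sudo mkfs.btrfs -f /dev/xvdf
sudo mount /dev/xvdf /var/lib/docker
echo '{ "storage-driver": "btrfs" }' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker
docker info | grep -i 'storage driver'   # -> btrfs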

You can install the btrfs tooling with apt install -y btrfs-progs.

Docker stores information about image layers and container layers in /var/lib/docker/btrfs/subvolumes/. This directory contains one directory per layer, with the unified filesystem built from a layer plus all its parent layers. Only the base layer of an image is stored as a true subvolume. All the other layers are stored as snapshots, which only contain the differences introduced in that layer. You can create snapshots of snapshots as shown in the diagram below.

Nested snapshots

On disk, snapshots look and feel just like subvolumes, but in reality they are much smaller and more space-efficient. Copy-on-write is used to maximize storage efficiency and minimize layer size, and writes in the container's writable layer are managed at the block level. The following image shows a subvolume and its snapshot sharing data.

Snapshots CoW

I'd rather watch how things work than read documentation, so I created a 3-layer image, adding a file in each layer. Then I checked that a file that shows up twice, but refers to the same data, is located at the same offset on disk [36]:

Btrfs CoW

It may also be interesting how easy it is to increase the available storage of the mount without interrupting services; here is a simple PoC [?]:
Add volume
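
A sketch of that PoC; again, the device name is an assumption:

# Grow the pool under /var/lib/docker with the daemon still running
sudo btrfs device add /dev/xvdg /var/lib/docker
sudo btrfs balance start /var/lib/docker    # optionally spread data over the new device
sudo btrfs filesystem show /var/lib/docker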

ZFS

ZFS is a next generation filesystem that supports many advanced storage technologies such as volume management, snapshots, checksumming, compression and deduplication, replication and more. However, it is complex to manage, so it is not recommended unless you are already familiar with it.

You can configure Docker to use zfs as follows:
ZFS Setup
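
A sketch following the docker documentation; pool and device names are assumptions:

sudo systemctl stop docker
sudo zpool create -f zpool-docker -m /var/lib/docker /dev/xvdf
echo '{ "storage-driver": "zfs" }' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker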

The base layer of an image is a ZFS filesystem. Each child layer is a ZFS clone based on a ZFS snapshot of the layer below it (similarly to btrfs snapshots) as shown in the diagram below:

ZFS Snapshot

Reading files behaves pretty much the same as with BTRFS.

OVERLAY

The overlay driver is the default in docker and podman. It is another union filesystem, like the AUFS one we already covered. OverlayFS needs at least three directories:

  • One or more lower directories; files in these directories are merged into the mount destination. If multiple lower directories contain files with the same name, it behaves the same way as AUFS. In container terminology these are the image layers.
  • An upper directory; all changes made to files inside the mounted filesystem are stored in it. This is the container layer of a container.
  • A workdir directory, which is required and used to prepare files before they are switched to the overlay destination in an atomic action (the workdir needs to be on the same filesystem as the upperdir) [?].

Here is a small diagram to illustrate how it works:

Overlay diagram

Let's create a simple PoC. First of all, let's set up the environment [37] [38] [39]:

Overlay Setup

Now check what happens when new files are created or we modify already existing ones:

Overlay PoC
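
Roughly what those screenshots show, as a reproducible sketch (directory names are made up):

mkdir lower1 lower2 upper work merged
echo from-lower1 > lower1/a
echo from-lower2 > lower2/a
echo old > lower1/b
sudo mount -t overlay overlay \
     -o lowerdir=lower1:lower2,upperdir=upper,workdir=work merged
cat merged/a          # -> from-lower1 (the leftmost lower dir wins)
echo new > merged/b   # copy-up: the change lands in upper/, lower1/b is untouched
ls upper              # -> b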

Logging

Logging drivers capture the output from a container's stdout/stderr. Taking this into consideration, you should build your images so that logs are sent to stdout/stderr instead of files. For instance, here is a fragment of the source code of the official Apache image [40] where they redirect their logs:

sed -ri \
-e 's!^(\s*CustomLog)\s+\S+!\1 /proc/self/fd/1!g' \
-e 's!^(\s*ErrorLog)\s+\S+!\1 /proc/self/fd/2!g' \
-e 's!^(\s*TransferLog)\s+\S+!\1 /proc/self/fd/1!g' \
-e 's!^(\s*User)\s+daemon\s*$!\1 www-data!g' \
-e 's!^(\s*Group)\s+daemon\s*$!\1 www-data!g' \
"$HTTPD_PREFIX/conf/httpd.conf" \
"$HTTPD_PREFIX/conf/extra/httpd-ssl.conf" \
; \

There are multiple drivers depending on your needs (see the sketch after the list):

  • Local
  • Logentries
  • JSON File
  • Graylog Extended Format
  • Syslog
  • Amazon CloudWatch logs
  • ETW logging
  • Fluentd
  • Google Cloud
  • Journald
  • Splunk
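
As a small sketch, a driver can be selected per container with --log-driver (json-file with rotation options here):

docker run -d --log-driver json-file \
       --log-opt max-size=10m --log-opt max-file=3 nginx:latest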

Security

Seccomp

Seccomp is a mechanism in the Linux kernel that allows a process to make a one-way transition to a restricted state where it can only perform a limited set of system calls. If a process attempts any other system call, it is killed via a SIGKILL signal [41]. To achieve this we can use the prctl syscall.

There are two modes available:

  • Strict: Only allows read, write, _exit and sigreturn syscalls to be executed.
  • Filter: when installing the filter you specify which syscalls are allowed, in BPF format. You can either block or allow by default, adding a custom list of syscalls with the given action to execute when they are called.

We can check that in the prctl man pages [42]:
[?]

You may be wondering how to know which syscalls are being called by your application. This can be accomplished in several ways; I'll showcase two of them (see the sketch below):

  • strace: Trace system calls and signals
  • oci-seccomp-bpf-hook[43] [44]: Provides an OCI hook to generate seccomp profiles by tracing the syscalls made by the container. The generated profile would allow all the syscalls made and deny every other syscall.

I don't recommend running oci-seccomp-bpf-hook to generate profiles in production environments, since it requires CAP_SYS_ADMIN to run.
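
Here is a sketch of both approaches; the annotation key comes from the hook's documentation [43] [44] and assumes the hook is installed:

# 1. strace: summarise the syscalls made by a process
strace -fc ls /
# 2. oci-seccomp-bpf-hook: record a container's syscalls into a profile, then enforce it
sudo podman run --annotation io.containers.trace-syscall="of:/tmp/ls.json" fedora ls /
sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls /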

AppArmor

AppArmor is a Linux kernel security module that supplements the standard Linux user and group based permissions, confining programs to a limited set of resources [45].

By default all docker containers run with the same profile [46]. However creating custom profiles or using already existing ones [47] for services running in your containers is a good hardening method.

To run containers with a custom profile use --security-opt apparmor=your_profile.
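
A minimal sketch, assuming you have written a profile named docker-nginx-custom:

# Load (or replace) the profile in the kernel, then confine the container with it
sudo apparmor_parser -r -W ./docker-nginx-custom
docker run --security-opt apparmor=docker-nginx-custom nginx:latest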

On RedHat distros you can enable SELinux, which is a similar approach to AppArmor [48].

Distroless

In order to make it harder to get code execution inside your containers, your services can run on 'distroless' images [49], which do not contain package managers, shells or any other programs you would expect to find in a standard Linux distribution. Keep in mind that this isn't perfect: an attacker may be able to upload files (hence upload a shell interpreter), or the service may be vulnerable to other kinds of attack vectors [50].
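
For illustration, a hypothetical multi-stage build (file names are placeholders) shipping a static Go binary on a distroless base [49]:

cat > Dockerfile <<'EOF'
FROM golang:1.16 AS builder
WORKDIR /src
COPY main.go .
RUN CGO_ENABLED=0 go build -o /app main.go

FROM gcr.io/distroless/static   # no shell, no package manager
COPY --from=builder /app /app
ENTRYPOINT ["/app"]
EOF
docker build -t distroless-poc .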

References

  1. The limits of compatibility and supportability with containers
  2. Red Hat Bugzilla - Bug 1096123
  3. chroot
  4. Breaking out of CHROOT Jailed Shell Environment
  5. coreutils/chroot.c
  6. Using Chroot Securely
  7. Breaking Out of and Securing chroot Jails
  8. Digging into linux namespaces - part 1
  9. Digging into linux namespaces - part 2
  10. Steve Ovens
  11. Namespaces in operation, part 1: namespaces overview
  12. PID namespaces man page
  13. pid.h
  14. Connecting multiple namespaces creating a LAN
  15. Windsock.io IPC namespace
  16. RedHat cgroups
  17. Cgroup freezer subsystem
  18. Linux insides - cgroups
  19. The tool that really runs your containers deep dive into runc and oci specifications
  20. Docker engine API documentation
  21. ctr doc
  22. Getting started with containerd
  23. Digging into runtimes - runc
  24. OCI - distribution spec
  25. Docker registry manifest specification
  26. OCI - image spec
  27. OCI - tarball spec
  28. moby - old tarball spec
  29. podman save
  30. multistage builds
  31. image building execution modes
  32. Docker storage drivers
  33. Linux AUFS
  34. AUFS man pages
  35. BTRFS docker driver
  36. BTRFS CoW
  37. programster - overlayfs
  38. Practical look into overlayfs
  39. Linux overlay filesystem docker
  40. Apache official docker image
  41. Sysdig - seccomp
  42. seccomp man pages
  43. OCI-seccomp-bpf
  44. Generate seccomp profiles
  45. Kubernetes apparmor tutorial
  46. Default docker AppArmor
  47. AppArmor profiles
  48. RedHat SELinux
  49. distroless containers
  50. RedHat: why distroless containers arent security solution you think they are
  51. OCI - runtime spec

Written by ITasahobby, InTernet lover