Containers and compatibility
Before diving deeper into containers I want to highlight that containers DO NOT guarantee compatibility across every possible host-guest combination; that is one of the biggest misconceptions about containers. This doesn't apply only to different CPU architectures or to mixing Windows with Linux. Even when host and guest run the same distro, it is impossible to guarantee compatibility between kernel versions, because you don't know what the user will try to execute. New syscalls can be added, or existing ones can change behaviour over time, so if your application does anything more advanced than opening files and shipping them across the network, you are likely to run into this problem.
To illustrate the idea here is a simple PoC.
Create two VMs with Vagrant using
Now create a container from the quay.io/fatherlinux/fedora14 image and create a user inside the container.
Outcome using CentOS 7:
Outcome using CentOS 8:
Note that this is an incompatibility with SELinux, so it works fine if we disable it:
What kernel features made them possible?
Container managers like docker (let's ignore for now that it is just a wrapper around containerd) or podman simply create processes using a handful of kernel features; without those features containers wouldn't be possible:
It was the first approach to isolating some resources. From the chroot syscall man page:
chroot() changes the root directory of the calling process to that specified in path. This directory will be used for pathnames beginning with /. The root directory is inherited by all children of the calling process.
Keep in mind that this feature is sometimes treated as a security measure, but under some conditions you can escape it, for instance with:
- Root privileges
- The ability to run the following syscalls:
It is important to understand the difference between the syscall and the CLI command. The syscall by itself doesn't change the working directory; on the other hand, we can check the source code of the command:
It first calls chroot() and then changes the current directory with chdir().
Namespaces are a Linux kernel feature released in kernel version 2.6.24 in 2008. They provide processes with their own system view, thus isolating independent processes from each other.    
Namespaces are created using the clone syscall. From the man pages:
These system calls create a new (“child”) process, in a manner similar to fork(2). By contrast with fork(2), these system calls provide more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using these system calls, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors and the table of signal handlers. These system calls also allow the new child process to be placed in separate namespaces(7).
For testing purposes you can also use the unshare command, which uses the unshare syscall (who would have guessed) to disassociate parts of the process execution context. Let's see what each namespace is in charge of.
Process isolation (PID namespace)
PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID. PID namespaces allow containers to provide functionality such as suspending/resuming the set of processes in the container and migrating the container to a new host while the processes inside the container maintain the same PIDs.
There are two details that may be good to know:
- If the “init” process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal. This also applies to containers, therefore, if the process with PID 1 inside a container ends, the container stops.
- Only signals for which the “init” process has established a signal handler can be sent to the “init” process by other members of the PID namespace. This restriction applies even to privileged processes and prevents other members of the PID namespace from accidentally killing the “init” process.
From the explanation above you may have noticed that PIDs are actually not unique; rather, they are only unique within a given namespace, as can be seen in the kernel
You could try it out with the following command:
sudo unshare --pid --fork /bin/bash -c 'echo $$'
The ps command wouldn't work because it uses the /proc filesystem to list processes, and that filesystem hasn't been isolated.
Network interfaces (net namespace)
The NET namespace allows assigning network interfaces to isolated processes. Even the loopback interface is unique to each namespace.
By default the namespace comes with no interfaces configured, so to get connectivity you need to create them. Here is how you create one with
sudo unshare --net /bin/bash
Unix Timesharing System (uts namespace)
UTS is a simple one. It enables a process to hold a different host and domain name. For instance run:
sudo unshare -u /bin/bash -c 'hostname foo && hostname && sleep 10'
Observe how your hostname doesn’t change in the main namespace.
On Linux systems users may or may not have privileges to access resources based on their effective user ID (UID). The user namespace is a kernel feature allowing per-process virtualization of this attribute. This means that you may appear to have root privileges inside a user namespace while being unprivileged outside of it.
These mappings are stored in /proc/PID/gid_map. By default it is empty, and according to the docs, if a user ID has no mapping inside the namespace, then system calls that return user IDs return the value defined in the file /proc/sys/kernel/overflowuid, which on a standard system defaults to 65534.
Let’s check it:
unshare -U id
uid=65534(nobody) gid=65534(nobody) groups=65534(nogroup)
Mount (mnt namespace)
According to the documentation mount namespaces provide isolation of the list of mounts seen by the processes in each namespace instance. Thus, the processes in each of the mount namespace instances will see distinct single-directory hierarchies.
For example, we could create a mount inside the namespace, and that mount wouldn't be accessible from outside:
Interprocess Communication (IPC)
Linux supports a number of Inter-Process Communication (IPC) mechanisms such as signals, pipes or System V IPC (message queues, semaphores, shared memory) [?]. All these mechanisms get isolated when using this specific namespace.
Here is a simple PoC creating a shared memory segment:
Check windsock.io for an in-depth example.
Control groups (cgroups) are a Linux kernel mechanism for fine-grained control of resources. While there are currently two versions of cgroups, most distributions still use version 1, which has been in the kernel since 2.6.24. It provides four closely related features:
- Resource limiting: allows confining programs running on the system within certain boundaries for CPU, RAM, device I/O and device groups.
- Prioritization: instead of limiting resources, it specifies which processes get more resources than others with lower priority.
- Accounting: turned off by default because of its additional resource utilization. It allows checking which resources are being used by each cgroup.
- Process control: also known as the freezer, it allows taking a 'snapshot' of a particular process and moving it. This is often used on HPC clusters.
Let's have a small example using cpu shares. They basically specify what portion of CPU time each group may use, the default value being 1024. Here is an example:
Take a look at how the one with a cpushare of 256 gets 256/(256+1024) of the CPU time on each core, or equivalently one fifth. You could also check the cgroup configuration of the container under
Let's make another PoC on how we can manage access to TTY devices:
The current terminal does not close because the cgroup restriction only applies when open() gets called, and we are using an already-opened descriptor.
Containers architecture (runtime, …)
From a user perspective it may look like docker or podman runs and handles all container-related stuff. Nothing could be further from the truth; in the case of docker, more tools are involved before a container is even summoned. Here is a small diagram:
As you can see in the image, the docker client only interacts with the docker daemon API; the docker daemon then communicates with containerd, which finally makes calls to runc to spawn containers. Other container managers like podman communicate directly with runc.
Here is an example on how to create each of these processes by hand.
Interacting with docker daemon
You can manually interact with the docker daemon API using curl. For instance, let's create a container:
Check the docker engine API documentation  for better understanding.
Interacting with containerd
It can also be invoked through gRPC and protobuf (that's actually what ctr does for us). Here is an example using
It is also interesting that namespaces are implemented in the runtime, not in k8s.
Interacting with runc
It is one of the low-level runtimes, like gVisor or Firecracker, which lack image integrity, a bundle history log and some other features that high-level runtimes like containerd support. That is the main reason why containerd is built on top of runc.
Creating a runc container involves a couple more steps; runc needs a config file that specifies all the parameters on how it should run the container:
Additionally it needs a root filesystem (rootfs directory by default).
Finally you can configure hooks to run before or after various lifecycle events  like
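For reference, runc spec generates a template config.json. Here is a trimmed sketch of the fields involved; the values are illustrative and the hook path is hypothetical:

```json
{
  "ociVersion": "1.0.2",
  "process": { "args": ["sh"], "cwd": "/" },
  "root": { "path": "rootfs", "readonly": true },
  "hooks": {
    "prestart": [{ "path": "/usr/bin/oci-hook" }]
  }
}
```

The root.path entry is where runc expects the rootfs directory mentioned above, relative to the bundle.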
What are images in reality?
To understand what images really are, let's see what happens when we pull one, captured with BurpSuite (nginx:latest is used for the example).
Before anything else, docker first checks that the registry is available:
Before the next requests, it gets a token:
Afterwards it checks that the specified tag exists for that image with a
If docker finds a suitable one, it pulls the given manifest with the digest specified in the manifest list:
At this stage docker knows where to locate everything it needs to download the image, starting with the image config:
Finally it downloads the layers' blobs:
Notice how it doesn't follow any specific order.
You may also find images stored in tarballs, which is pretty similar to how we just downloaded the image from the registry, only stored in a single tar file. Note that the docker save command creates a tarball following the old specification; to generate OCI tarballs you can use podman save --format oci-archive
Also, you should be aware that every command in a Dockerfile generates a new layer. For instance, let's analyse the following Dockerfile:
Logically it may seem that only one of the ENV commands would create a layer. Nevertheless it creates two; here is the result of running
As you may have noticed, they have a size of 0 bytes, which makes sense because they only contain config information:
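The Dockerfile under discussion is not reproduced here, but a minimal one of the same shape (hypothetical, with an assumed base image) shows the effect: each of the two ENV instructions produces its own zero-byte entry in docker history, since they only modify image config, not the filesystem:

```dockerfile
FROM alpine:3.15
ENV FOO=1
ENV BAR=2
```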
When creating images there are some features you can use to produce higher-quality images.
One of the keys to creating high-quality images is reducing the number of layers and their total size. A good practice is to clean up any artifacts you don't need before moving to the next layer.
To help with that situation you can use FROM multiple times, each time referring to a different base image. Each FROM statement begins a new stage in the image being built; you can copy artifacts from a previous stage using COPY --from.
Stages can be named and referenced by name:
# syntax=docker/dockerfile:1
FROM golang:1.16 AS builder
...
COPY --from=builder /foo /bar
You can even copy from an external image or use a previous stage as a new one.
Exec vs shell
There are two different ways of executing commands in docker images:
RUN <command>: shell form; the command is run in a shell, which by default is /bin/sh -c on Linux.
RUN ["executable", "param1", "param2"]: exec form, which uses the exec family of functions.
You can see how the shell form can access environment variables:
It doesn't work using the exec form:
However, you can emulate shell behaviour:
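The three cases above can be condensed into one hypothetical Dockerfile snippet (variable name is illustrative):

```dockerfile
ENV NAME=world
RUN echo "hello $NAME"                  # shell form: the shell expands $NAME
RUN ["echo", "hello $NAME"]             # exec form: prints the literal string $NAME
RUN ["sh", "-c", "echo hello $NAME"]    # exec form emulating shell behaviour
```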
A container's main running process is specified with CMD. As I already explained, you can only have one main process, which matches the idea of having a single service per container.
Nevertheless, you may want to run multiple services in a single container. You can accomplish that by using a process manager, which lets you handle several services without needing a full-fledged init system such as systemd. For instance:
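As a sketch of the idea (the actual example is not shown here), a supervisord configuration could manage two services as the container's entrypoint; the program names and paths below are hypothetical:

```ini
; /etc/supervisor/conf.d/services.conf (illustrative paths)
[supervisord]
nodaemon=true                 ; stay in the foreground as the container's PID 1

[program:web]
command=/usr/sbin/nginx -g "daemon off;"

[program:worker]
command=/usr/local/bin/worker --queue default
```

The container's CMD would then simply launch supervisord, which forwards signals to and reaps its child services.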
Understanding different storage drivers
Storage drivers specify how containers and images are stored, and therefore each has unique performance characteristics. Keep in mind that changing the default storage driver implies that you can only use containers and images generated with that specific driver.
According to docker documentation:
The aufs storage driver was the preferred storage driver for Docker 18.06 and older, when running on Ubuntu 14.04 on kernel 3.13.
Here is an example:
Have a look at how it handles files with the same absolute path and name: it merges branches (what you would call layers in container terminology) from right to left, so the latest version to be merged is kept.
The diagram below shows a Docker container based on the
You can install aufs with apt install -y aufs-tools.
Docker’s btrfs storage driver leverages many Btrfs features for image and container management. Among these features are block-level operations, thin provisioning, copy-on-write snapshots, and ease of administration. You can easily combine multiple physical block devices into a single Btrfs filesystem .
You can configure Docker to use btrfs as follows:
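Assuming /var/lib/docker already sits on a Btrfs filesystem, the driver is selected in /etc/docker/daemon.json (restart the daemon afterwards):

```json
{
  "storage-driver": "btrfs"
}
```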
You can install the btrfs tools with apt install -y btrfs-progs.
Docker stores information about image layers and container layers in /var/lib/docker/btrfs/subvolumes/. This directory contains one directory per layer, with the unified filesystem built from a layer plus all its parent layers. Only the base layer of an image is stored as a true subvolume. All the other layers are stored as snapshots, which only contain the differences introduced in that layer. You can create snapshots of snapshots as shown in the diagram below.
On disk, snapshots look and feel just like subvolumes, but in reality they are much smaller and more space-efficient. Copy-on-write is used to maximize storage efficiency and minimize layer size, and writes in the container’s writable layer are managed at the block level. The following image shows a subvolume and its snapshot sharing data.
I'd rather watch how things work than read documentation, so I created a three-layer image, adding a file in each of the layers. Then I show how a file that appears twice but refers to the same data is located at the same offset on disk:
It may also be interesting to see how easy it is to increase the mount's available storage without interrupting services; here is a simple PoC [?]:
ZFS is a next-generation filesystem that supports many advanced storage technologies like volume management, snapshots, checksumming, compression and deduplication, replication and more. However, it is complex to manage, so it is not recommended unless you are familiar with it.
You can configure Docker to use zfs as follows:
The base layer of an image is a ZFS filesystem. Each child layer is a ZFS clone based on a ZFS snapshot of the layer below it (similarly to btrfs snapshots) as shown in the diagram below:
Reading files behaves pretty much the same as with BTRFS.
The overlay driver is the default on docker and podman. It is another union filesystem, like the AUFS driver we already covered. OverlayFS needs at least three directories:
- One or more lower directories; files in these directories are merged into the mount destination. If multiple lower directories contain files with the same name, it behaves the same way as AUFS. These correspond to the image layers.
- The upper directory; all changes made to files inside the mounted filesystem are stored in it. This corresponds to the container layer.
- The workdir directory is required, and is used to prepare files before they are switched to the overlay destination in an atomic action (the workdir needs to be on the same filesystem as the upperdir) [?].
Here is a small diagram to illustrate how it works:
Now check what happens when new files are created or already existing ones are modified:
Logging drivers capture output from a container's stdout/stderr. With this in mind, you should build your images so that logs are sent to stdout/stderr instead of to files. For instance, here is how the official Apache image redirects its logs:
There are multiple drivers depending on your needs:
- JSON File
- Graylog Extended Format
- Amazon CloudWatch logs
- ETW logging
- Google Cloud
Seccomp is a mechanism in the Linux kernel that allows a process to make a one-way transition to a restricted state where it can only perform a limited set of system calls. If a process attempts any other system call, it is killed via a SIGKILL signal. To achieve this we can use the
There are two modes available:
- Strict: only allows the read, write, _exit and sigreturn syscalls to be executed.
- Filter: when calling it you specify which syscalls are allowed, in BPF format. You can either block or allow by default, adding a custom list of syscalls with the action to execute when they are called.
We can check that in the prctl man pages:
You may be wondering how to know which syscalls are made by your application; this can be accomplished in several ways, and I'll showcase two of them:
strace: Trace system calls and signals
oci-seccomp-bpf-hook : Provides an OCI hook to generate seccomp profiles by tracing the syscalls made by the container. The generated profile would allow all the syscalls made and deny every other syscall.
I don't recommend running oci-seccomp-bpf-hook to generate profiles in production environments since it requires
AppArmor is a Linux kernel security module that supplements the standard Linux user and group based permissions to confine programs to a limited set of resources.
By default all docker containers run with the same profile. However, creating custom profiles or using already existing ones for the services running in your containers is a good hardening method.
To run containers with a custom profile use
On Red Hat distros you can enable SELinux, which is a similar approach to AppArmor.
To make it harder to get code execution inside your containers, your services can run on 'distroless' images, which do not contain package managers, shells or any other programs you would expect to find in a standard Linux distribution. Keep in mind that this isn't perfect; an attacker may be able to upload files (hence upload a shell interpreter), or you may be exposed to other kinds of attack vectors.
- The limits of compatibility and supportability with containers
- Red Hat Bugzilla - Bug 1096123
- Breaking out of CHROOT Jailed Shell Environment
- Using Chroot Securely
- Breaking Out of and Securing chroot Jails
- Digging into linux namespaces - part 1
- Digging into linux namespaces - part 2
- Steve Ovens
- Namespaces in operation, part 1: namespaces overview
- PID namespaces man page
- Connecting multiple namespaces creating a LAN
- Windsock.io IPC namespace
- RedHat cgroups
- Cgroup freezer subsystem
- Linux insides - cgroups
- The tool that really runs your containers deep dive into runc and oci specifications
- Docker engine API documentation
- ctr doc
- Getting started with containerd
- Digging into runtimes - runc
- OCI - distribution spec
- Docker registry manifest specification
- OCI - image spec
- OCI - tarball spec
- moby - old tarball spec
- podman save
- multistage builds
- image building execution modes
- Docker storage drivers
- Linux AUFS
- AUFS man pages
- BTRFS docker driver
- BTRFS CoW
- programster - overlayfs
- Practical look into overlayfs
- Linux overlay filesystem docker
- Apache official docker image
- Sysdig - seccomp
- seccomp man pages
- Generate seccomp profiles
- Kubernetes apparmor tutorial
- Default docker AppArmor
- AppArmor profiles
- RedHat SELinux
- distroless containers
- Red Hat: why distroless containers aren't the security solution you think they are
- OCI - runtime spec