Introduction to Seccomp

lance

2024-04-02

转载防失效

Containers

Over the years, the way we build our applications has changed from a monolithic paradigm to a Microservices paradigm. Containerization gained strength, and with it the success of Docker and Kubernetes.

Briefly, a container is a set of processes. If you have multiple containers, you will have multiple sets of processes which are isolated from each other using namespaces and cgroups. Namespaces are responsible for isolating processes (which processes can my process see?) and cgroups are responsible for reserving resources for the processes (how much CPU can my process use?).

In the image above, we have an example of the isolation done by namespace. The ubuntu pod, pictured, is executing a sleep command. For the pod, sleep is the process with PID 1. The container doesn’t see any processes from the host, or from other containers, so it considers the sleep process as the first one. Otherwise, the host can see the processes inside all containers, and for him the PID of the sleep process is 7275.

Who is responsible for creating the namespace for each container? The answer is simple: the runtime container handles this task (Example: runc on docker). But if the process need access to hardware, how is this done? Here, we need to understand the concepts of user space and kernel space.

User space x Kernel Space

Linux divides the memory into user space and kernel space. All user processes live on the user space and the kernel processes on kernel space. When a user process needs to interact with hardware, it makes system calls, or just syscalls, to the kernel on the kernel space.

Although processes inside a container cannot see those on the host due to the isolation created by namespaces, there is no analogous solution for the kernel. That is, our containers share the same kernel space between themselves and the host. Thus, it would be possible for a container to erase file systems using syscall, or write on files that require privileges. This is far less secure than using virtual machines where each service has its own user and kernel space.

Seccomp

The good news is that it is possible to restrict the syscalls a process can do. Seccomp is a Linux kernel feature available since version 2.6.12, which limits the syscalls a process can do. The seccomp makes use of profiles which are json files that tell what is allowed and what is not allowed regarding system calls.
The runtimes already leverage seccomp profiles to limit some system calls by default. It’s possible to use its profiles on Kubernetes by enabling a feature gate on kubelet. However, it is also possible to add more security by creating customized and more restrictive profiles for a specific pod.

Seccomp on Kubernetes Pods

Using securityContext you can use seccomp profiles on your pods or containers. Here’s an example of a Pod that uses a seccomp profile (We’ll use this example later). The profile can be specified inside the Pod or the container, the difference is that when you do it inside the Pod all the containers inside it use the same profile. It is also important to set allowPrivilegeEscalation to false to prevent the container from trying to acquire more power than is allowed. Otherwise, the seccomp profile won’t be applied.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: ubuntu-deny
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: deny-list.json
  containers:
    - image: ubuntu
      name: ubuntu-deny
      command:
        - sleep
        - "90000"
      securityContext:
        allowPrivilegeEscalation: false

Seccomp is a powerful tool. But knowing which system call we should permit our process to do is a difficult task. A simple command like sleep can make a few dozen system calls as shown in the image below.

One way to get around this is using deny-type profiles. In a deny list you prohibit some calls and allow all others. This way you can ensure that your container won’t try to do stranger things with the host. And you will be able to answer faster to new threats. Here is an example of a deny list.

{
	"defaultAction": "SCMP_ACT_ALLOW",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
		{
			"names": ["clock_nanosleep"],
			"action": "SCMP_ACT_ERRNO"
		}
	]
}

This deny list restricts the use of the clock_nanosleep system call. If you look at the previous example, this call is used by the sleep command. When this profile is applied to our pod, which executes the command sleep, the pod goes directly to Error state. That is, the process does not even run.

If you want to know more about seccomp and how we do at Cisco I recommend the article “Hardening Kubernetes Containers Security with Seccomp” here on the Techblog. In this article you will learn more about the best practices for creating and managing the seccomp profiles and how to use Cisco Secure Firewall Cloud Native to enhance seccomp.

Containers

User space x Kernel Space

Seccomp

Seccomp on Kubernetes Pods

References