
Zero trust deployment with Kubernetes


Using open-source software written by unknown people can be a little scary, even more so when deploying it in a production environment at my company. In my case, I created a brand new Kubernetes cluster to host some private services on my local network, and I wanted to be sure they couldn't do anything malicious on my network.

Isolate applications and services

The first thing to do is to isolate each service inside a Kubernetes namespace. This core Kubernetes resource isolates groups of resources within a single cluster, allowing finer-grained control over access, permissions, and networking for those resources.

A Namespace can be created pretty easily with the following command:

kubectl create namespace <insert-namespace-name-here>

Then I can list Namespaces with:

kubectl get namespaces
Result:
NAME              STATUS   AGE
default           Active   4h52m
kube-system       Active   4h52m
kube-public       Active   4h52m
kube-node-lease   Active   4h52m
tiwabbit-prod     Active   2s

Ignore the Namespaces prefixed with kube-, which are reserved for the Kubernetes control plane.

I can now create resources inside my Namespace. Let's start with a Secret, for example:

kubectl \
  --namespace tiwabbit-prod \
  create secret generic \
  db-credentials \
  --from-literal=username=tiwabbit \
  --from-literal=password=mysecurepassword

If I list all Secrets in my cluster:

kubectl get secrets --all-namespaces
Result:
NAMESPACE       NAME                  TYPE                                  DATA   AGE
[...]
default         default-token-m59jl   kubernetes.io/service-account-token   3      4h58m
tiwabbit-prod   default-token-8zkr2   kubernetes.io/service-account-token   3      3m15s
tiwabbit-prod   db-credentials        Opaque                                2      49s

In theory, only Pods running inside my Namespace (tiwabbit-prod) can mount and read this Secret.
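
To consume it, a hypothetical Pod in the same Namespace can inject the Secret into its container, for example as environment variables (the Pod and container names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: db-client          # illustrative name
  namespace: tiwabbit-prod
spec:
  containers:
  - name: app
    image: alpine:latest
    command: ["sleep", "1d"]
    envFrom:
    # injects the "username" and "password" keys as environment variables
    - secretRef:
        name: db-credentials
```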

Securing with RBAC

By default, all Pods without specific configuration use the default ServiceAccount of the Namespace they run in. This ServiceAccount doesn't have any rights on the Kubernetes API, which is a great thing and should not be changed.

However, in some scenarios my application may need to call the Kubernetes API, for example to launch batch processing using a Kubernetes Job. Let's take that example: I will create a ServiceAccount with those permissions and assign it to my application Pod.

First, I need to create a new ServiceAccount:

kubectl \
  --namespace tiwabbit-prod \
  create serviceaccount \
  my-application

Then I need a Role that implements the level of permission my Pod needs. Here is the manifest:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-application-batch-creator
  namespace: tiwabbit-prod
rules:
- apiGroups: [ "" ] # "" designates the core API group, where Pods live
  resources: [ pods, pods/status, pods/log ]
  verbs: [ get, list, watch ]

- apiGroups: [ batch ]
  resources: [ jobs ]
  verbs: [ create, get, list, watch, patch, update, delete ]

Finally, I need to bind the Role to my ServiceAccount using a RoleBinding, with the following manifest:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-application-permissions
  namespace: tiwabbit-prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-application-batch-creator
subjects:
- kind: ServiceAccount
  name: my-application
  namespace: tiwabbit-prod
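
Before using it from a Pod, the binding can be sanity-checked from outside the cluster with kubectl auth can-i, impersonating the ServiceAccount:

```shell
# Should answer "yes": the Role allows creating Jobs
kubectl auth can-i create jobs \
  --namespace tiwabbit-prod \
  --as system:serviceaccount:tiwabbit-prod:my-application

# Should answer "no": nothing grants access to Secrets
kubectl auth can-i get secrets \
  --namespace tiwabbit-prod \
  --as system:serviceaccount:tiwabbit-prod:my-application
```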

Now let's create a Pod with this ServiceAccount and review the actions allowed by the Role. Here is a Deployment manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application
  namespace: tiwabbit-prod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-application
  template:
    metadata:
      labels:
        app: my-application
    spec:
      serviceAccountName: my-application
      containers:
      - command:
        - sleep
        - 1d
        image: alpine:latest
        name: alpine

Get a shell inside the Pod and follow the Kubernetes documentation to install kubectl.
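
On an Alpine image, this boils down to something like the following (the download URL is taken from the official kubectl installation instructions); once installed, kubectl automatically picks up the ServiceAccount token mounted in the Pod:

```shell
# From a shell inside the container (kubectl exec -it <pod-name> -- sh)
apk add --no-cache curl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
install -m 0755 kubectl /usr/local/bin/kubectl
```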

Let's check if I can list Pods by querying the Kubernetes API:

kubectl get pods
Result:
NAME                             READY   STATUS    RESTARTS   AGE
my-application-df5b5cb75-8twc5   1/1     Running   0          5m21s

Then I try to create a batch execution with a Job:

kubectl create job my-application-batch --image alpine:latest -- echo "Hello World"

Listing all the Jobs in the Namespace:

kubectl get jobs
Result:
NAME                   COMPLETIONS   DURATION   AGE
my-application-batch   0/1           2s         2s

Listing the Pods in the Namespace:

kubectl get pods
Result:
NAME                             READY   STATUS    RESTARTS   AGE
my-application-df5b5cb75-8twc5   1/1     Running   0          11m
my-application-batch-f2pf6       1/1     Running   0          3s

Once my Job has finished, I can delete it with:

kubectl delete job/my-application-batch

Conclusions

If the process in my Pod doesn't need to communicate with the Kubernetes API (to create, query, or delete resources), I use the default ServiceAccount, which gives zero permissions to my Pod. Otherwise, I use a custom ServiceAccount for each of my apps, with dedicated Roles and RoleBindings for each application use case.

Prevent usage of the root user and privilege escalation

By default, Kubernetes offers multiple security options that can be applied to Pods and their underlying containers. Most of them are configured in the securityContext block, as follows:

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext: {}
  containers:
  - name: my-secure-container
    image: nginx:latest
    securityContext: {}
  - name: my-second-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext: {}

Obviously, such options can also be set in the Pod template of a StatefulSet or Deployment.

Running Pods as non-root

The easiest way to secure the Pod is to overwrite the UID and GID of the user running the main process. This way, whatever the default user in the image, Kubernetes will override it. A UID and GID greater than or equal to 1000 should be used.

Three parameters can be used:

  • runAsUser: overrides the UID of the main process
  • runAsGroup: overrides the GID of the main process
  • fsGroup: if specified, the user is also added to that group, and all files and directories created on mounted volumes take that GID as group owner

Note that fsGroup can increase the mount time of an external volume, because Kubernetes ensures that every file is owned by the group defined by fsGroup, which results in a recursive chown of the whole volume. fsGroupChangePolicy can be used to change this behaviour with these values:

  • OnRootMismatch: only check the root directory, and fix the whole filesystem if its group mismatches
  • Always: it's in the option name

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 2000
    fsGroupChangePolicy: OnRootMismatch
  containers:
  - name: my-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext:
      runAsUser: 2000
      runAsGroup: 2000
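
To verify the result, I can check the identity of the main process from inside the container; the container-level settings override the Pod-level ones, and fsGroup shows up as a supplementary group:

```shell
kubectl exec my-secure-pod -- id
# Expected output, something like: uid=2000 gid=2000 groups=2000
```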

Running as non-root can be enforced with the runAsNonRoot parameter:

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: my-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext: {}

Use a read-only filesystem

Preventing all write/update/delete operations on the root filesystem stops a malicious process (or a breached application) from taking advantage of the container environment and modifying it. The readOnlyRootFilesystem option makes this possible:

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext: {}
  containers:
  - name: my-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext:
      readOnlyRootFilesystem: true
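
A quick way to confirm that the filesystem is really immutable is to attempt a write from inside the container:

```shell
# With readOnlyRootFilesystem enabled, any write should fail with
# an error like: touch: /etc/test: Read-only file system
kubectl exec my-secure-pod -- touch /etc/test
```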

If the application needs to write temporary or application data at runtime, I can still create an emptyDir volume and mount it inside the container:

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext: {}
  containers:
  - name: my-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - mountPath: /data
      name: my-data
  volumes:
  - name: my-data
    emptyDir: {}

Prevent privilege escalation

The Linux kernel exposes low-level functions that let a process change its current UID or GID, for example the setuid and setgid primitives.

Setting the allowPrivilegeEscalation option to false prevents this from happening:

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext: {}
  containers:
  - name: my-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext:
      allowPrivilegeEscalation: false
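
Under the hood, this sets the no_new_privs flag on the container's processes, which can be checked from inside the container:

```shell
# With allowPrivilegeEscalation set to false, the flag should read 1
kubectl exec my-secure-pod -- grep NoNewPrivs /proc/1/status
```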

Dropping all capabilities

As with privilege escalation, Linux grants capabilities to processes, allowing them to make privileged system calls, like binding network ports below 1024.

We want to give our process only the capabilities necessary for it to run. In the case of this example, we can have:

apiVersion: v1
kind: Pod
metadata:
  name: my-secure-pod
spec:
  securityContext: {}
  containers:
  - name: my-second-secure-container
    image: busybox:latest
    command: ["sleep", "infinity"]
    securityContext:
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
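
The resulting bounding set can be inspected in /proc from inside the container; CAP_NET_BIND_SERVICE is capability bit 10, so with everything else dropped the mask should read 0x0000000000000400:

```shell
kubectl exec my-secure-pod -- grep CapBnd /proc/1/status
```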

Network communication hardening

In the world of Pods, by default every Pod in every Namespace can talk to every other. If you have multiple applications or even multiple clients on the same Kubernetes cluster, you should not allow them to communicate (at least not by default).

That's why NetworkPolicies should be created with a deny-all default, then point-to-point connections opened when services/clients need to communicate with each other. Note that NetworkPolicies are only enforced if the cluster's network plugin supports them (Calico and Cilium do, for example).

The default deny-all config looks like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
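
The policy can be tested by spinning up a throwaway Pod outside the protected Namespace and trying to reach one of its Services (assuming here that the policy was applied in tiwabbit-prod and that my-application exposes a Service on port 8080, both illustrative):

```shell
# With the deny-all policy in place, the request should time out
kubectl run netpol-test --rm -it --image=busybox --restart=Never -- \
  wget -qO- --timeout=2 http://my-application.tiwabbit-prod.svc.cluster.local:8080
```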

Then, if a third-party application needs to access your application Pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      application: my-application
      component: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              project: my-friend-namespace
        - podSelector:
            matchLabels:
              application: his-application
      ports:
        - protocol: TCP
          port: 8080
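
One subtlety with this manifest: the namespaceSelector and podSelector above are two separate items in the from list, so they are OR'd together — traffic is allowed either from any Pod in a Namespace labeled project: my-friend-namespace, or from any Pod labeled application: his-application in my-namespace itself. To require both conditions at once (only those Pods in that Namespace), they must be merged into a single from item:

```yaml
  ingress:
    - from:
        # a single item: namespaceSelector AND podSelector must both match
        - namespaceSelector:
            matchLabels:
              project: my-friend-namespace
          podSelector:
            matchLabels:
              application: his-application
      ports:
        - protocol: TCP
          port: 8080
```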