Monday, March 26, 2018

git

  • Installing on Linux
If you want to install the basic Git tools on Linux via a binary installer, you can generally do so through the basic package-management tool that comes with your distribution. If you’re on Fedora for example (or any closely-related RPM-based distro such as RHEL or CentOS), you can use dnf:
$ sudo dnf install git-all
If you’re on a Debian-based distribution like Ubuntu, try apt-get:
$ sudo apt-get install git-all
https://git-scm.com/book/en/v2/Getting-Started-Installing-Git


  • I’m often asked how to merge only specific commits from another branch into the current one. The reason you’d want to do this is to merge specific changes you need now, leaving other code changes you’re not interested in right now behind
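A minimal sketch of that workflow (62ecb3 stands for the commit you want, as in the excerpt below):

$ git checkout master        # switch to the branch that should receive the change
$ git cherry-pick 62ecb3     # apply just that one commit as a new commit on master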

62ecb3 is now applied to the master branch and committed (as a new commit) in master. cherry-pick behaves just like merge
https://www.devroom.io/2010/06/10/cherry-picking-specific-commits-from-another-branch/


  • Why would I use this?

That is a good question. Why would you cherry-pick instead of just merging one branch into the other? The short answer is that you cherry-pick when a full branch merge is not possible due to incompatible versions or simply because you do not want everything in the other branch
https://swsblog.stanford.edu/blog/cherry-picking-small-git-lesson

System V (Sys V), systemd, Upstart, runit

  • Red Hat is the inventor and primary booster of systemd, so the best distros for playing with it are Red Hat Enterprise Linux, RHEL clones like CentOS and Scientific Linux
Experienced RH users can still use service and chkconfig in RH 7, but it's long past time to dump them in favor of native systemd utilities. systemd has outpaced them, and service and chkconfig do not support native systemd services.
Our beloved /etc/inittab is no more.
Instead, we have a /etc/systemd/system/ directory chock-full of symlinks to files in /usr/lib/systemd/system/.
/usr/lib/systemd/system/ contains init scripts; to start a service at boot it must be linked to /etc/systemd/system/.
The systemctl command does this for you when you enable a new service

# systemctl enable clamd@scan.service

ClamAV offers both systemd and SysVInit init scripts
$ yum search clamav
clamav-server-sysvinit.noarch
clamav-server-systemd.noarch

https://lcom.static.linuxfound.org/images/stories/41373/Systemd-components.png
A brief overview and history of systemd — the Linux process manager
Understanding and Using Systemd
Full system roll-back and systemd in SUSE Linux Enterprise 12

RHEL 7 Boot Process

  • There are three possible states for a service: enabled, disabled, or static
Enabled means it has a symlink in a .wants directory
Disabled means it does not.
Static means the service is missing the [Install] section in its init script, so you cannot enable or disable it.
Static services are usually dependencies of other services, and are controlled automatically.
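A quick way to see these states (the service name is just an example):

$ systemctl list-unit-files --type=service     # shows enabled / disabled / static for every service
$ systemctl is-enabled sshd.service            # query a single unit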

For anyone who cares about boot times, you can run a command to see how long every program and service takes to start up:
$ systemd-analyze blame
https://www.linux.com/learn/understanding-and-using-systemd

  • With systemd you can manage many aspects of your server, from services to mounted devices and system state

In systemd, a unit refers to any resource that the system knows how to operate on and manage.
These resources are defined using configuration files called unit files.

When a unit with a WantedBy= directive is enabled, a directory will be created within /etc/systemd/system named after the specified target with .wants appended to the end.
For instance, if the current unit has WantedBy=multi-user.target, a directory called multi-user.target.wants will be created within /etc/systemd/system

https://www.digitalocean.com/community/tutorials/understanding-systemd-units-and-unit-files
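A minimal sketch of how the [Install]/WantedBy mechanics play out (myapp.service is a made-up unit):

$ cat /etc/systemd/system/myapp.service
[Unit]
Description=Example app

[Service]
ExecStart=/usr/local/bin/myapp

[Install]
WantedBy=multi-user.target

$ sudo systemctl enable myapp.service

Enabling it creates the symlink /etc/systemd/system/multi-user.target.wants/myapp.service pointing at the unit file; disabling removes it again.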
  • The init process in Linux is the first process to start on boot, and it tracks all system services and daemons.
Various Linux distros have created different implementations of the standard init system. Currently, the most popular init system adopted by various Linux distros is systemd.
If your target system is not using the systemd init system, then you can use the Ansible service module
http://www.mydailytutorials.com/linux-services-using-ansible-systemd-and-service-modules/
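For example, ad-hoc calls with the systemd and service modules might look like this (host group and service name are invented):

$ ansible webservers -b -m systemd -a "name=nginx state=started enabled=yes"
$ ansible webservers -b -m service -a "name=nginx state=started enabled=yes"

The generic service module detects the init system on the target, so it also works on hosts that are not running systemd.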

  • SysVinit Vs Systemd Cheatsheet

systemd is a new init system.
systemd is compatible with SysV and LSB init scripts.
systemd is the first process started by the kernel and holds PID 1.
systemctl is the command-line utility and primary tool to manage systemd daemons/services (start, restart, stop, enable, disable, reload & status).
systemd uses .service files instead of the bash scripts SysVinit uses.
systemd sorts all daemons into their own Linux cgroups; you can see the hierarchy by exploring /sys/fs/cgroup/systemd (for example with systemd-cgls).

Three init systems are widely used in Linux:

    System V (Sys V): System V (Sys V) is one of the first and most traditional init systems for Unix-like operating systems.
    Upstart: Upstart is an event-based replacement for the /sbin/init daemon.
    systemd: systemd is a new init system and system manager which has been adopted by all the major Linux distributions in place of the traditional SysV init system.

https://www.2daygeek.com/sysvinit-vs-systemd-cheatsheet-systemctl-command-usage/
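A few typical equivalents, side by side (httpd used as the example service):

# SysVinit                      # systemd
service httpd start             systemctl start httpd.service
service httpd status            systemctl status httpd.service
chkconfig httpd on              systemctl enable httpd.service
chkconfig --list                systemctl list-unit-files --type=service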

  • Systemd (stylized as systemd) is a suite of software that provides fundamental building blocks for a Linux operating system. Among other features, it includes the systemd "System and Service Manager," an init system used to bootstrap the user space and to manage system processes after booting. It is a replacement for the UNIX System V and Berkeley Software Distribution (BSD) init systems and, as of 2015, the majority of Linux distributions have adopted systemd as their default init system


One of the main goals of the systemd project is unification of basic Linux configurations and service behaviors across all Linux distributions.
The name systemd adheres to the Unix convention of naming daemons by appending the letter d.
https://en.wikipedia.org/wiki/Systemd


  • In general, bootstrapping usually refers to a self-starting process that is supposed to proceed without external input. In computer technology the term (usually shortened to booting) usually refers to the process of loading the basic software into the memory of a computer after power-on or general reset, especially the operating system which will then take care of loading other software as needed.



systemd is a suite of basic building blocks for a Linux system. It provides a system and service manager that runs as PID 1 and starts the rest of the system. systemd provides aggressive parallelization capabilities, uses socket and D-Bus activation for starting services, offers on-demand starting of daemons, keeps track of processes using Linux control groups, maintains mount and automount points, and implements an elaborate transactional dependency-based service control logic. systemd supports SysV and LSB init scripts and works as a replacement for sysvinit. Other parts include a logging daemon, utilities to control basic system configuration like the hostname, date, locale, maintain a list of logged-in users and running containers and virtual machines, system accounts, runtime directories and settings, and daemons to manage simple network configuration, network time synchronization, log forwarding, and name resolution.

https://www.freedesktop.org/wiki/Software/systemd/


  • On systems based on SysVinit, init is the first process that is executed once the Linux kernel loads.

inittab is the startup configuration file for init located in /etc. It contains directions for init on what programs and scripts to run when entering a specific runlevel.
https://wiki.archlinux.org/index.php/SysVinit
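For illustration, a few classic entries from a Red Hat-style /etc/inittab (a sketch; exact contents vary by distro):

id:3:initdefault:
si::sysinit:/etc/rc.d/rc.sysinit
l3:3:wait:/etc/rc.d/rc 3

The first field is an identifier, the second the runlevels the entry applies to, the third the action (initdefault, sysinit, wait, respawn, ...), and the fourth the program to run.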



  • Contra systemd, upstart

They depend on DBus, which means that if DBus is not working system initialization cannot begin.
If makers of systemd and upstart really wanted to contribute to F/OSS community, they could just improve sysvinit, init scripts and add additional software that can be used from shell scripts to achieve the same features they provide. Which is basically what openrc did.


Contra systemd
Not portable, by design; uses Linux-specific functionality not available on other platforms.
Violates traditional UNIX principles of simplicity and the separation of kernel/initsystem/userland duties such that components are interchangeable.
"Systemd ships a growing number of useful, unified command-line interfaces for system settings and control (timedatectl, bootctl, hostnamectl, loginctl, machinectl, kernel-install, localectl)."
    "unified" is a bad replacement for complying to standards (such as POSIX) and conventions and power must go along with responsibility

Contra upstart
Can't set proper dependencies until everything has converted from SysV init files to Upstart jobs.

Contra OpenRC
This initsystem is the least familiar to most of us yet (only recently packaged, and still in Debian's NEW queue).

Contra sysvinit
init scripts are not really user edit-able configuration.
No real process supervision, e.g. daemons are not automatically restarted when they crash.
No integration with udev, services cannot be started when specific hardware is discovered.

https://wiki.debian.org/Debate/initsystem/sysvinit



  • Runit is a daemontools-inspired process supervision suite that also provides a program suitable for running as process 1. It can be used as an alternative to sysvinit or systemd, either by itself or in conjunction with OpenRC. It can also be used as a helper for supervising OpenRC services.

https://wiki.gentoo.org/wiki/Runit


  • runit is a cross-platform Unix init scheme with service supervision, a replacement for sysvinit, and other init schemes. It runs on GNU/Linux, *BSD, MacOSX, Solaris, and can easily be adapted to other Unix operating systems.

http://smarden.org/runit/
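A rough sketch of what a runit service looks like (myapp is invented, and the scan directory path differs between distros):

$ sudo mkdir -p /etc/sv/myapp
$ sudo tee /etc/sv/myapp/run <<'EOF'
#!/bin/sh
exec 2>&1
exec /usr/local/bin/myapp
EOF
$ sudo chmod +x /etc/sv/myapp/run
$ sudo ln -s /etc/sv/myapp /var/service/      # runsvdir notices the link and starts supervising the service
$ sudo sv status myapp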

  • Comparison of init systems

https://wiki.gentoo.org/wiki/Comparison_of_init_systems


  • Systemd Container

The docker daemon was designed to provide a simple means of starting, stopping and managing containers. It was not originally designed to bring up an entire Linux system or manage services for such things as start-up order, dependency checking, and failed service recovery

https://molecule.readthedocs.io/en/latest/examples.html#id4


  • CHAPTER 3. USING SYSTEMD WITH CONTAINERS

This chapter describes how you can use the systemd initialization service to work with containers in two different ways:

Starting Containers with systemd: By setting up a systemd unit file on your host computer, you can have the host automatically start, stop, check the status, and otherwise manage a container as a systemd service.

Starting services within a container using systemd: Many Linux services (Web servers, file servers, database servers, and so on) are already packaged for Red Hat Enterprise Linux to run as systemd services. If you are using the latest RHEL container image, you can set the RHEL container image to start the systemd service, then automatically start selected services within the container when the container starts up.

3.2. Starting services within a container using systemd
Start the /sbin/init process (the systemd service) to run as PID 1 within the container.
Start all systemd services that are installed and enabled within the container, in order of dependencies.
Allow systemd to restart services or kill zombie processes for services started within the container.

The general steps for building a container that is ready to be used as a systemd service are:
Install the package containing the systemd-enabled service inside the container. This can include dozens of services that come with RHEL, such as Apache Web Server (httpd), FTP server (vsftpd), Proxy server (squid), and many others. For this example, we simply install an Apache (httpd) Web server.
Use the systemctl command to enable the service inside the container.
Add data for the service to use in the container (in this example, we add a Web server test page). For a real deployment, you would probably connect to outside storage.
Expose any ports needed to access the service.
Set /sbin/init as the default process to start when the container runs


https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html/managing_containers/using_systemd_with_containers
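A rough sketch of those steps as a Dockerfile plus build/run commands (the base image name, package, and port are assumptions, not taken from the RHEL documentation verbatim):

$ cat Dockerfile
FROM registry.access.redhat.com/rhel7-init
RUN yum -y install httpd && yum clean all && systemctl enable httpd
RUN echo "Web server test page" > /var/www/html/index.html
EXPOSE 80
CMD ["/sbin/init"]

$ docker build -t httpd-systemd .
$ docker run -d -p 8080:80 --name web httpd-systemd

systemd runs as PID 1 inside the container and brings up the enabled httpd service when the container starts.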

Thursday, March 15, 2018

Docker Security

  • Docker is designed to have all containers share the same kernel and the host. This provides convenience but also amplifies the impact of any vulnerabilities present in the kernel. System administrators should ensure that the host environment is properly hardened and patched to reduce the likelihood of privilege escalation, arbitrary code execution, and denial of services.
  • https://www.cimcor.com/blog/the-top-5-security-risks-in-docker-container-deployment
  • Docker Security: Docker uses several mechanisms for security:
      ○ Linux kernel namespaces
      ○ Linux Control Groups (cgroups)
      ○ The Docker daemon
      ○ Linux capabilities (libcap)
      ○ Linux security mechanisms like AppArmor or SELinux

      Namespaces provide an isolated view of the system, where processes cannot see processes in other containers. Each container also gets its own network stack, and a container doesn't get privileged access to the sockets or interfaces of another container.

      Cgroups are a kernel feature that limits and isolates the resource usage (CPU, memory, network) of a collection of processes.
      Linux capabilities divide the privileges of root into distinct units and smaller groups of privileges.
      Images are extracted in a chrooted subprocess, the first step in a wider effort toward privilege separation. From Docker 1.10, all images are stored and accessed by the cryptographic checksums of their contents, limiting the possibility of an attacker causing a collision with an existing image.

      Docker Content Trust
      ○ Protects against untrusted images
      ○ Can enable signing checks on every managed host
      ○ Signature verification is transparent to users
      ○ Guarantees the integrity of your images when pulled
      ○ Provides trust from publisher to consumer
      ○ export DOCKER_CONTENT_TRUST=1
      ○ ~/.docker/trust/trusted-certificates/

      Security Best Practices

      Dockerfile security
      ○ Do not write secrets (users and passwords) into the Dockerfile.
      ○ Remove unnecessary setuid/setgid permissions (privilege escalation).
      ○ Download packages securely using GPG and certificates.
      ○ Try to restrict an image or container to one service.

      To disable setuid rights, add the following to the Dockerfile of your image:
      RUN find / -perm +6000 -type f -exec chmod a-s {} \; || true

      Don't run containers with the --privileged flag; it gives all capabilities to the container.
      docker run --privileged ...
      docker run --cap-drop=ALL --cap-add=CAP_NET_ADMIN

      How do we add/remove capabilities?
      ○ Use --cap-add and --cap-drop with docker run/create
      ○ Drop all capabilities which are not required
      ○ docker run --cap-drop ALL --cap-add $CAP

      Capability patterns:
      ○ Manual management within the container: docker run --cap-add ALL
      ○ Restricted capabilities with root: docker run --cap-drop ALL --cap-add $CAP
      ○ No capabilities: docker run --user

      Set a specific user; don't run your applications as root in containers.

      https://docs.docker.com/engine/security/security/
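      A sketch of that last point (image name and UID are arbitrary): bake an unprivileged account into the image with a couple of Dockerfile lines, and/or override the user at run time.

      RUN groupadd -r app && useradd -r -g app -s /sbin/nologin app
      USER app

      $ docker run --rm --user 1000:1000 --cap-drop ALL myimage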

  • SELinux was initially developed by the US National Security Agency to protect computer systems from malicious intrusion and tampering

SELinux implements what's known as MAC (Mandatory Access Control). This is implemented on top of what's already present in every Linux distribution, the DAC (Discretionary Access Control).
To understand DAC, let's first consider how traditional Linux file security works.
In a traditional security model, we have three entities: User, Group, and Other (u,g,o) who can have a combination of Read, Write, and Execute (r,w,x) permissions on a file or directory. If a user jo creates a file in their home directory, that user will have read/write access to it, and so will the jo group. The "other" entity will possibly have no access to it

-rwxrw-r--. 1 jo jo 41 Aug  6 22:45 myscript.sh
Now jo can change this access. jo can grant (and restrict) access to this file to other users and groups or change the owner of the file. These actions can leave critical files exposed to accounts that don't need this access. jo can also restrict access further to be more secure, but that's discretionary: there's no way for the system administrator to enforce it for every single file in the system.

Consider another case: when a Linux process runs, it may run as the root user or another account with superuser privileges. That means if a black-hat hacker takes control of the application, they can use that application to get access to whatever resource the user account has access to.

With SELinux, you can define what a user or process can do. It confines every process to its own domain so the process can interact with only certain types of files and other processes from allowed domains.

At any one time, SELinux can be in any of three possible modes:

    Enforcing
    Permissive
    Disabled

In enforcing mode SELinux will enforce its policy on the Linux system and make sure any unauthorized access attempts by users and processes are denied

Permissive mode is like a semi-enabled state. SELinux doesn't apply its policy in permissive mode, so no access is denied. However any policy violation is still logged in the audit logs. It's a great way to test SELinux before enforcing it.

The disabled mode is self-explanatory – the system won't be running with enhanced security.

The default policy type (SELINUXTYPE in /etc/selinux/config) is targeted. With a targeted policy, SELinux allows you to customize and fine-tune access control permissions. The other possible value is "MLS" (multilevel security), an advanced mode of protection; MLS also requires installing an additional package.

https://www.digitalocean.com/community/tutorials/an-introduction-to-selinux-on-centos-7-part-1-basic-concepts
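A few commands for checking and switching these modes (sketch):

$ getenforce                           # prints Enforcing, Permissive or Disabled
$ sestatus                             # more detail, including the loaded policy type
$ sudo setenforce 0                    # permissive until the next reboot (1 = enforcing)
$ grep ^SELINUX /etc/selinux/config    # persistent mode (SELINUX=) and policy type (SELINUXTYPE=)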


  • You’ll need to run the program while AppArmor is watching it and walk through all its normal functions
Basically, you should use the program as it would be used in normal use: start the program, stop it, reload it, and use all its features
You should design a test plan that goes through the functions the program needs to perform
https://www.howtogeek.com/118328/how-to-create-apparmor-profiles-to-lock-down-programs-on-ubuntu/


  • The state of each profile can be switched between enforcing and complaining with calls to aa-enforce and aa-complain, giving as parameter either the path of the executable or the path to the policy file.
Additionally, a profile can be entirely disabled with aa-disable or put in audit mode (to log accepted system calls too) with aa-audit.

Even though creating an AppArmor profile is rather easy, most programs do not have one. This section will show you how to create a new profile from scratch just by using the target program and letting AppArmor monitor the system calls it makes and the resources it accesses.

The most important programs that need to be confined are the network facing programs as those are the most likely targets of remote attackers.

https://debian-handbook.info/browse/stable/sect.apparmor.html
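A sketch of that workflow using the apparmor-utils tools (/usr/bin/foo is a placeholder program):

$ sudo aa-genprof /usr/bin/foo                    # exercise the program in another terminal, then answer the prompts
$ sudo aa-complain /etc/apparmor.d/usr.bin.foo    # log-only mode
$ sudo aa-enforce /etc/apparmor.d/usr.bin.foo     # enforce the finished profile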

  • SELinux (Security Enhanced Linux) is a Mandatory Access Control system built on Linux's LSM (Linux Security Modules) interface.
In practice, the kernel queries SELinux before each system call to know whether the process is authorized to do the given operation.
https://debian-handbook.info/browse/stable/sect.selinux.html


  • AppArmor, like most other LSMs, supplements rather than replaces the default Discretionary Access Control (DAC). As such it's impossible to grant a process more privileges than it had in the first place.
Ubuntu, SUSE and a number of other distributions use it by default. RHEL (and its variants) use SELinux which requires good userspace integration to work properly. SELinux attaches labels to all files, processes and objects and is therefore very flexible. However configuring SELinux is considered to be very complicated and requires a supported filesystem. AppArmor on the other hand works using file paths and its configuration can be easily adapted.
Security policies completely define what system resources individual applications can access, and with what privileges.
Access is denied by default if no profile says otherwise
Every breach of policy triggers a message in the system log, and AppArmor can be configured to notify users with real-time violation warnings popping up on the desktop.
https://wiki.archlinux.org/index.php/AppArmor

  • Enforce – In enforce mode, the system enforces the rules, reports violation attempts via syslog or auditd (only if auditd is installed), and the operation is not permitted.
Complain – In complain mode, the system doesn't enforce any rules; it only logs the violation attempts.

The following are the different types of rules that are used in profiles.

    Path entries: This has information on which files the application is allowed to access.
    Capability entries: determine the privileges a confined process is allowed to use.
    Network entries: determine the connection type, for example tcp. For a packet analyzer, the network type can be raw or packet, etc.
https://www.thegeekstuff.com/2014/03/apparmor-ubuntu
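A tiny profile sketch showing the three rule types above (/usr/bin/foo and its paths are made up):

/usr/bin/foo {
  #include <abstractions/base>

  # capability entry
  capability net_bind_service,

  # network entry
  network inet tcp,

  # path entries
  /etc/foo/** r,
  /var/log/foo.log w,
}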

  • The docker-default profile is the default for running containers. It is moderately protective while providing wide application compatibility. The profile is generated from the following template.
https://docs.docker.com/engine/security/apparmor


  • only trusted users should be allowed to control your Docker daemon. This is a direct consequence of some powerful Docker features. Specifically, Docker allows you to share a directory between the Docker host and a guest container; and it allows you to do so without limiting the access rights of the container. This means that you can start a container where the /host directory is the / directory on your host; and the container can alter your host filesystem without any restriction. This is similar to how virtualization systems allow filesystem resource sharing. Nothing prevents you from sharing your root filesystem (or even your root block device) with a virtual machine.

for example, if you instrument Docker from a web server to provision containers through an API, you should be even more careful than usual with parameter checking, to make sure that a malicious user cannot pass crafted parameters causing Docker to create arbitrary containers.
For this reason, the REST API endpoint (used by the Docker CLI to communicate with the Docker daemon) changed in Docker 0.5.2, and now uses a UNIX socket instead of a TCP socket bound on 127.0.0.1 (the latter being prone to cross-site request forgery attacks if you happen to run Docker directly on your local machine, outside of a VM). You can then use traditional UNIX permission checks to limit access to the control socket

You can also expose the REST API over HTTP if you explicitly decide to do so. However, if you do that, be aware of the above mentioned security implications. Ensure that it is reachable only from a trusted network or VPN or protected with a mechanism such as stunnel and client SSL certificates. You can also secure API endpoints with HTTPS and certificates.

The daemon is also potentially vulnerable to other inputs, such as image loading from either disk with docker load, or from the network with docker pull. As of Docker 1.3.2, images are now extracted in a chrooted subprocess on Linux/Unix platforms, being the first-step in a wider effort toward privilege separation. As of Docker 1.10.0, all images are stored and accessed by the cryptographic checksums of their contents, limiting the possibility of an attacker causing a collision with an existing image.
https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface

  • Isolate containers with a user namespace
Linux namespaces provide isolation for running processes, limiting their access to system resources without the running process being aware of the limitations

The best way to prevent privilege-escalation attacks from within a container is to configure your container’s applications to run as unprivileged users
For containers whose processes must run as the root user within the container, you can re-map this user to a less-privileged user on the Docker host.
The mapped user is assigned a range of UIDs which function within the namespace as normal UIDs from 0 to 65536, but have no privileges on the host machine itself.

The remapping itself is handled by two files: /etc/subuid and /etc/subgid
It is very important that the ranges not overlap, so that a process cannot gain access in a different namespace
User namespace remapping introduces some configuration complexity in situations where the container needs access to resources on the Docker host, such as bind mounts into areas of the filesystem that the system user cannot write to. From a security standpoint, it is best to avoid these situations.
https://docs.docker.com/engine/security/userns-remap
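A sketch of turning this on with the dedicated "dockremap" user (the UID/GID ranges are examples):

$ cat /etc/docker/daemon.json
{
  "userns-remap": "default"
}

$ grep dockremap /etc/subuid /etc/subgid
/etc/subuid:dockremap:231072:65536
/etc/subgid:dockremap:231072:65536

$ sudo systemctl restart docker

With "default", Docker creates the dockremap user and uses its /etc/subuid and /etc/subgid ranges; root inside a container then maps to an unprivileged UID (here 231072) on the host.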

  • Secure computing mode (seccomp) is a Linux kernel feature. You can use it to restrict the actions available within the container. The seccomp() system call operates on the seccomp state of the calling process. You can use this feature to restrict your application’s access.
This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled
The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility
In effect, the profile is a whitelist which denies access to system calls by default, then whitelists specific system calls.
seccomp is instrumental for running Docker containers with least privilege. It is not recommended to change the default seccomp profile.

https://docs.docker.com/engine/security/seccomp
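For example (paths and image are placeholders), you can point a container at a custom profile or, for debugging only, turn seccomp off:

$ docker run --rm --security-opt seccomp=/path/to/profile.json alpine ls
$ docker run --rm --security-opt seccomp=unconfined alpine ls      # not recommended outside debugging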


  • Kernel exploits
Unlike in a VM, the kernel is shared among all containers and the host, magnifying the importance of any vulnerabilities present in the kernel. In VMs, the situation is much better: an attacker would have to route an attack through both the VM kernel and the hypervisor before being able to touch the host kernel.

Denial-of-service attacks
All containers share kernel resources. If one container can monopolize access to certain resources, including memory and more esoteric resources such as user IDs (UIDs), it can starve out other containers on the host, resulting in a denial-of-service (DoS), whereby legitimate users are unable to access part or all of the system.

Container breakouts
An attacker who gains access to a container should not be able to gain access to other containers or the host. By default, users are not namespaced, so any process that breaks out of the container will have the same privileges on the host as it did in the container; if you were root in the container, you will be root on the host. This also means that you need to worry about potential privilege escalation attacks, whereby a user gains elevated privileges such as those of the root user, often through a bug in application code that needs to run with extra privileges.

Poisoned images
If an attacker can trick you into running his image, both the host and your data are at risk.

Compromising secrets
When a container accesses a database or service, it will likely require a secret, such as an API key or username and password. An attacker who can get access to this secret will also have access to the service. This problem becomes more acute in a microservice architecture in which containers are constantly stopping and starting, as compared to an architecture with small numbers of long-lived VMs.

Do not consider containers to offer the same level of security guarantees as VMs.


The defenses for your system should also consist of multiple layers. For example, your containers will most likely run in VMs so that if a container breakout occurs, another level of defense can prevent the attacker from getting to the host or other containers. Monitoring systems should be in place to alert admins in the case of unusual behavior. Firewalls should restrict network access to containers, limiting the external attack surface.

In regards to least privilege, you can take many steps to reduce the capabilities of containers (see the sketch after this list):

Ensure that processes in containers do not run as root, so that exploiting a vulnerability present in a process does not give the attacker root access.
Run filesystems as read-only so that attackers cannot overwrite data or save malicious scripts to file.
Cut down on the kernel calls that a container can make to reduce the potential attack surface.
Limit the resources that a container can use to avoid DoS attacks whereby a compromised container or application consumes enough resources (such as memory or CPU) to bring the host to a halt.
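A sketch pulling several of those steps into one docker run invocation (image name and limits are arbitrary):

$ docker run -d \
    --read-only --tmpfs /tmp \
    --cap-drop ALL \
    --user 1000:1000 \
    --memory 256m --cpus 0.5 --pids-limit 100 \
    myimage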



Similarly, if you have containers that process or store sensitive information, keep them on a host separate from containers handling less-sensitive information and, in particular, away from containers running applications directly exposed to end users. For example, containers processing credit-card details should be kept separate from containers running the Node.js frontend.

The vast majority of container deployments will involve VMs. Although this isn't an ideal situation, it does mean you can combine the efficiency of containers with the security of VMs.

Applying Updates

Identify images that require updating. This includes both base images and any dependent images.
Get or create an updated version of each base image. Push this version to your registry or download site.
For each dependent image, run docker build with the --no-cache argument. Again, push these images.
On each Docker host, run docker pull to ensure that it has up-to-date images.
Once you've ascertained that everything is functioning correctly, remove the old images from the hosts. If you can, also remove them from your registry.

If you use Docker Hub to build your images, note that you can set up repository links, which will kick off a build of your image when any linked image changes. By setting a link to the base image, your image will automatically get rebuilt if the base image changes.

In the past, providers have been criticized for being slow to respond. In such a situation, you can either wait or prepare your own image. Assuming that you have access to the Dockerfile and source for the image, rolling your own image may be a simple and effective temporary solution.
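A sketch of that update cycle for a single dependent image (image and registry names are placeholders):

$ docker pull debian:stable                          # refresh the base image
$ docker build --no-cache -t registry.example.com/myapp:2 .
$ docker push registry.example.com/myapp:2
$ docker pull registry.example.com/myapp:2           # on each Docker host
$ docker image rm registry.example.com/myapp:1       # remove the old image once everything works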

To safely use images, you need to have guarantees about their provenance: where they came from and who created them.
The primary tool in establishing the provenance of software or data is the secure hash.
What's to stop an attacker from modifying both the data and the hash? The best answer to this is cryptographic signing and public/private key pairs.
Secure hashes are known as digests in Docker parlance. A digest is a SHA256 hash of a filesystem layer or manifest, where a manifest is a metadata file describing the constituent parts of a Docker image.

Content trust is Docker's mechanism for allowing publishers to sign their content, completing the trusted distribution mechanism.
When a user pulls an image from a repository, she receives a certificate that includes the publisher's public key, allowing her to verify that the image came from the publisher.
When content trust is enabled, the Docker engine will only operate on images that have been signed and will refuse to run any images whose signatures or digests do not match.

By running both a registry and a Notary server, organizations can provide trusted images to users.

https://theswissbay.ch/pdf/_to_sort/O%27Reilly/docker-security.pdf


  • The Update Framework (TUF) helps developers maintain the security of a software update system, even against attackers that compromise the repository or signing keys. TUF provides a flexible framework and specification that developers can adopt into any software update system.

https://github.com/theupdateframework/tuf


  • Notary aims to make the internet more secure by making it easy for people to publish and verify content. We often rely on TLS to secure our communications with a web server, which is inherently flawed, as any compromise of the server enables malicious content to be substituted for the legitimate content.

https://github.com/theupdateframework/notary


  • This document describes basic use of the Notary CLI as a tool supporting Docker Content Trust. For more advanced use cases, you must run your own Notary service; see the "Use the Notary client for advanced users" documentation.


What is Notary
Notary is a tool for publishing and managing trusted collections of content. Publishers can digitally sign collections and consumers can verify integrity and origin of content. This ability is built on a straightforward key management and signing interface to create signed collections and configure trusted publishers.

With Notary anyone can provide trust over arbitrary collections of data. Using The Update Framework (TUF) as the underlying security framework, Notary takes care of the operations necessary to create, manage, and distribute the metadata necessary to ensure the integrity and freshness of your content
https://docs.docker.com/notary/getting_started/
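For instance, with Docker Content Trust enabled, the docker trust subcommands drive Notary behind the scenes (repository name is a placeholder):

$ export DOCKER_CONTENT_TRUST=1
$ docker trust sign registry.example.com/myapp:1            # signs the tag and pushes the trust metadata
$ docker trust inspect --pretty registry.example.com/myapp:1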


  • Use Case Examples of Notary:

Docker uses Notary to implement Docker Content Trust and all of the docker trust subcommands.
Quay is using Notary as a library, wrapping it and extending it to suit their needs. For Quay, Notary is flexible rather than single-purpose.
CloudFlare’s PAL tool uses Notary for container identity, allowing one to associate metadata such as secrets to running containers in a verifiable manner.
LinuxKit is using Notary to distribute its kernels and system packages.
https://www.cncf.io/announcement/2017/10/24/cncf-host-two-security-projects-notary-tuf-specification/



  • The Docker Bench for Security is a script that checks for dozens of common best-practices around deploying Docker containers in production. The tests are all automated, and are inspired by the CIS Docker Community Edition Benchmark v1.1.0. We are releasing this as a follow-up to our Understanding Docker Security and Best Practices blog post.

https://github.com/docker/docker-bench-security
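Running it locally is a short sequence once the repo is cloned (sketch):

$ git clone https://github.com/docker/docker-bench-security.git
$ cd docker-bench-security
$ sudo sh docker-bench-security.sh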


  • Clair is an open source project for the static analysis of vulnerabilities in appc and docker containers.

https://coreos.com/clair/docs/latest/



  • A service that analyzes docker images and applies user-defined acceptance policies to allow automated container image validation and certification. The Anchore Engine is an open source project that provides a centralized service for inspection, analysis, and certification of container images. The Anchore engine is provided as a Docker container image that can be run standalone (a docker-compose file is provided), or on an orchestration platform such as Kubernetes, Docker Swarm, Rancher or Amazon ECS
https://github.com/anchore/anchore-engine


  • Dagda is a tool to perform static analysis of known vulnerabilities, trojans, viruses, malware & other malicious threats in docker images/containers, and to monitor the docker daemon and running docker containers for detecting anomalous activities.

In order to fulfill its mission, first the known vulnerabilities as CVEs (Common Vulnerabilities and Exposures), BIDs (Bugtraq IDs), RHSAs (Red Hat Security Advisories) and RHBAs (Red Hat Bug Advisories), and the known exploits from the Offensive Security database are imported into a MongoDB to facilitate the search of these vulnerabilities and exploits while your analyses are in progress.
https://github.com/eliasgranderubio/dagda

  • An open artifact metadata API to audit and govern your software supply chain

Grafeas can greatly help you create your own container security scanning projects.
Developers can use Grafeas, described as a "component metadata API," to define metadata for virtual machines and containers. IBM's Vulnerability Advisor is also integrated into the project.
https://grafeas.io/


  • Falco is a behavioral activity monitor designed to detect anomalous activity in your applications. Powered by sysdig’s system call capture infrastructure, Falco lets you continuously monitor and detect container, application, host, and network activity

Falco is hosted by the Cloud Native Computing Foundation (CNCF) as a sandbox level project.
https://github.com/falcosecurity/falco

  • Build rules specific to your Kubernetes clusters to enforce policy across all your containers & microservices.
Runtime Security built for containers.
https://sysdig.com/opensource/falco

Banyan Collector: A framework to peek inside containers
A framework for Static Analysis of Docker container images
Have you wondered what your container images really contain? If they have the very packages that are susceptible to all kinds of attacks? Or, if they have the configuration you expect when they are run? Banyan Collector provides a powerful, extensible framework to answer all these questions and more.
https://github.com/banyanops/collector


  • Scan Docker installations for security issues and vulnerabilities.

plugin based system for discovery, audit and reporting
able to scan local and remote docker installations
plugins are easy to write
https://github.com/kost/dockscan


  • hub-detect-ws is container-based Web Service for scanning (via the file signature-based iScan) and inspecting (via the Linux package manager-based image inspector) Docker images.

https://github.com/blackducksoftware/hub-detect-ws



  • batten is an auditing framework that contains some tools to help audit and harden your Docker deployments.

Identify potential security issues, and harden your existing Docker containers using a configurable policy.
https://github.com/dockersecuritytools/batten


  • InSpec: Auditing and Testing Framework

InSpec is an open-source testing framework for infrastructure with a human- and machine-readable language for specifying compliance, security and policy requirements.
https://github.com/inspec/inspec


  • In a containerized deployment, how do you safely pass secrets (passwords, certificates, etc.) between containers in a cluster without compromising their safety? Use HashiCorp Vault to securely manage the secrets used by containers.
https://www.hashicorp.com/resources/securing-container-secrets-vault


  • HashiCorp Vault and Terraform on Google Cloud — Security Best Practices


Vault is not always an ideal solution for secrets management. If only static secrets are needed in certain contexts, you should consider Cloud KMS to encrypt secrets and store them in source code or GCS buckets. It’s perfectly fine to store secrets in source code if they are encrypted. Vault is an ideal solution for disparate teams storing secrets at a large scale or when you need some of Vault’s dynamic secret generation capability.

Vault Security Best Practices
Isolate the installation with single tenancy. You can host Vault on Google Kubernetes Engine (GKE) or Compute Engine, but it should be completely isolated in its own project, cluster, and/or private VPC/subnet. If running on Compute Engine, Vault should be the only main process running on a machine. This reduces the risk that another process running on the same machine could be compromised and interact with Vault

Use a bastion host for admin access. Vault must be accessible by humans via API, but never by SSH or RDP. Once you situate Vault on a private subnet, you can access it using a bastion host. This allows you to disable SSH access to the Vault nodes altogether.

Restrict storage access. Vault encrypts all data at rest, regardless of which storage backend you use. To avoid unauthorized access or operations, restrict access to the storage backend to only Vault.

Run in high availability mode. If using GKE, you should expose the Vault nodes via an internal load balancer. With Compute Engine however, you can use server-side redirection and avoid a load balancer altogether. In either case, Vault should be situated on a private subnet behind a bastion host.

https://medium.com/@jryancanty/hashicorp-vault-and-terraform-on-google-cloud-security-best-practices-3d94de86a3e9


  • Security Model

The overall goal of Vault's security model is to provide confidentiality, integrity, availability, accountability, authentication.
This means that data at rest and in transit must be secure from eavesdropping or tampering.

Threat Model

Eavesdropping
Client communication with Vault should be secure from eavesdropping as well as communication from Vault to its storage backend.

Tampering with data at rest or in transit.
Any tampering should be detectable

Access to data or controls without authentication or authorization.
All requests must be authenticated and checked against the applicable security policies.

Access to data or controls without accountability.
If audit logging is enabled, requests and responses must be logged before the client receives any secret material

Confidentiality of stored secrets.
In practice, this means all data at rest must be encrypted.

Availability of secret material in the face of failure.
Vault supports running in a highly available configuration to avoid loss of availability.


The following are not parts of the Vault threat model

Protecting against arbitrary control of the storage backend.
As an example, an attacker could delete or corrupt all the contents of the storage backend causing total data loss for Vault. The ability to control reads would allow an attacker to snapshot in a well-known state and rollback state changes if that would be beneficial to them.

Protecting against the leakage of the existence of secret material. An attacker that can read from the storage backend may observe that secret material exists and is stored, even if it is kept confidential.

Protecting against memory analysis of a running Vault. If an attacker is able to inspect the memory state of a running Vault instance then the confidentiality of data may be compromised.

External Threat Overview
There are 3 distinct systems:
the client, which is speaking to Vault over an API
Vault or the server more accurately, which is providing an API and serving requests.
the storage backend, which the server is utilizing to read and write data.

There is no mutual trust between the Vault client and server.
Clients use TLS to verify the identity of the server and to establish a secure communication channel.
Servers require that a client provides a client token for every request which is used to identify the client.

The storage backends used by Vault are also untrusted by design
Vault uses a security barrier for all requests made to the backend.
The security barrier automatically encrypts all data leaving Vault using a 256-bit Advanced Encryption Standard (AES) cipher in the Galois Counter Mode (GCM) with 96-bit nonces.
The nonce is randomly generated for every encrypted object.
When data is read from the security barrier the GCM authentication tag is verified during the decryption process to detect any tampering.

Internal Threat Overview
Within the Vault system, a critical security concern is an attacker attempting to gain access to secret material they are not authorized to access.
For example, GitHub users in the "engineering" team may be mapped to the "engineering" and "ops" Vault policies.
Vault then generates a client token which is a randomly generated, serialized value and maps it to the policy list.
This client token is then returned to the client.
On each request a client provides this token.
Vault then uses it to check that the token is valid and has not been revoked or expired, and generates an ACL based on the associated policies.
Vault uses a strict default deny or whitelist enforcement.
Each policy specifies a level of access granted to a path in Vault.
Although clients could be provided with root tokens or associated with the root policy, instead Vault supports the notion of "sudo" privilege.
As part of a policy, users may be granted "sudo" privileges to certain paths, so that they can still perform security sensitive operations without being granted global root access to Vault.
Lastly, Vault supports using a Two-man rule for unsealing using Shamir's Secret Sharing technique.
When Vault is started, it starts in a sealed state.
This means that the encryption key needed to read and write from the storage backend is not yet known.
The process of unsealing requires providing the master key so that the encryption key can be retrieved
The risk of distributing the master key is that a single malicious actor with access to it can decrypt the entire Vault.
Instead, Shamir's technique allows us to split the master key into multiple shares or parts.
The number of shares and the threshold needed is configurable, but by default Vault generates 5 shares, any 3 of which must be provided to reconstruct the master key
By using a secret sharing technique, we avoid the need to place absolute trust in the holder of the master key, and avoid storing the master key at all.
The master key is only retrievable by reconstructing the shares.
The shares are not useful for making any requests to Vault, and can only be used for unsealing.
Once unsealed the standard ACL mechanisms are used for all requests.
Opening a bank vault requires two factors: the key and the combination.
Similarly, Vault requires multiple shares be provided to reconstruct the master key.
Once unsealed, each safety deposit box still requires the owner to provide a key, and similarly the Vault ACL system protects all the secrets stored.
https://www.vaultproject.io/docs/internals/security.html
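The share/threshold scheme shows up directly in the CLI; a sketch of initializing and unsealing (values mirror the 5/3 default described above):

$ vault operator init -key-shares=5 -key-threshold=3
$ vault operator unseal    # run three times, each time supplying a different unseal key
$ vault status             # shows whether Vault is sealed and the configured threshold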


  • Two-man rule

The two-man rule is a control mechanism designed to achieve a high level of security for especially critical material or operations. Under this rule all access and actions require the presence of two authorized people at all times.
No-lone zone
A no-lone zone is an area that must be staffed by two or more qualified or cleared individuals. Each individual must be within visual contact of each other and in visual contact with the critical component that requires a no-lone-zone area designation. A no-lone zone may contain a cryptographic component, weapon system hardware under test, a nuclear weapon or active nuclear weapon controls.
https://en.wikipedia.org/wiki/Two-man_rule


  • Shamir's Secret Sharing

Shamir's Secret Sharing is an algorithm in cryptography created by Adi Shamir. It is a form of secret sharing, where a secret is divided into parts, giving each participant its own unique part.
To reconstruct the original secret, a minimum number of parts is required. In the threshold scheme this number is less than the total number of parts. Otherwise all participants are needed to reconstruct the original secret.
https://en.wikipedia.org/wiki/Shamir's_Secret_Sharing
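A toy illustration of the (3, 5) threshold case (numbers chosen for illustration; a real implementation works over a finite field): pick the secret S = 1234 and a random degree-2 polynomial f(x) = 1234 + 166x + 94x^2, so that f(0) = S. The five shares are the points f(1) = 1494, f(2) = 1942, f(3) = 2578, f(4) = 3402, f(5) = 4414. Any three shares determine the unique degree-2 polynomial through them (for example by Lagrange interpolation), and evaluating it at 0 recovers 1234; two shares leave the secret undetermined.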


  • How our security team handle secrets


We were distributing our secrets using Kubernetes and Hashicorp Vault
Kubernetes stores the data in plaintext in etcd, a database where it stores all configuration data.
It gives the secrets to applications running on our servers as a file.
At Monzo, only a handful of engineers have access to Vault, and they tightly control who else can use it.

These were some of the issues:
1. Having two mechanisms created confusion, but neither one was perfect
2. The way we were using Kubernetes secrets didn't work well
The secrets were readable to anyone with access to our etcd nodes. Getting into these nodes is really hard, so we didn't consider this a big problem in the past
Creating secrets and adding them to your service requires messing around with Kubernetes configuration files. Unlike deployments, we don't have any nice tooling for this.
Reading them into your service is a little fiddly, and we were lacking libraries to standardise it.
3. The way we were using Vault also had issues
It was impossible to configure or even inspect without a key ceremony (where a few engineers who have access to Vault get together and configure it). A key ceremony takes about three hours of work.
We didn't have any ways to configure Vault using code, so were using ad-hoc scripts, which takes time and is prone to errors
To let a service read from Vault, you'd have to generate a secret key for that service, which we'd then manually store as Kubernetes secret. Every time you create a Vault secret for a new service, you also have to create a Kubernetes secret

To address these issues
1. Secrets are encrypted wherever they're stored
2. It's impossible to read all the secrets at once, but easy to let services read specific secrets
3. It's easy to add brand new secrets, and hard to overwrite existing ones.
5. It's auditable.
6. It's inspectable.

Authenticating to Vault
To make sure we can access secret information securely, but without having to create new secrets that add unnecessary complexity, we decided to use its Kubernetes auth mechanism.
This lets Kubernetes applications present their service account token (like a password, attached by default to all applications in Kubernetes) to Vault. Vault will then issue its own token with permissions that are configured for that application.
In our case, we chose to let all applications get a token with a default Vault policy defining what actions are allowed.
We called this policy kubernetes-reader
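As a sketch, wiring up that Kubernetes auth flow with Vault's CLI looks roughly like this (the role and service account names are invented; kubernetes-reader is the policy mentioned above):

$ vault auth enable kubernetes
$ vault write auth/kubernetes/config \
    kubernetes_host="https://$KUBERNETES_SERVICE_HOST:443" \
    token_reviewer_jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
$ vault write auth/kubernetes/role/my-service \
    bound_service_account_names=my-service \
    bound_service_account_namespaces=default \
    policies=kubernetes-reader \
    ttl=1h

The application then posts its service account JWT to auth/kubernetes/login and receives a Vault token carrying the kubernetes-reader policy.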

Service account tokens are Kubernetes secrets.
This means they're still not as secure as we'd like, because they're stored unencrypted in etcd.
But, given that the Kubernetes secret is just a token which allows access to Vault, this reduces how dangerous it is to access Kubernetes secrets; you'd now need to talk to Vault as well, which means that you'd need an active presence inside our network.

We also decided that we could improve the security of Kubernetes secrets by encrypting them at rest using an experimental Kubernetes feature. We generated an AES key (a very strong key for encrypting data fast), stored inside Vault, which we use to encrypt and decrypt Kubernetes secrets on the fly; the encrypted version is stored in etcd, and the unencrypted version only ever exists in the service that uses it
We wanted to avoid the issue of having to create a second Kubernetes secret for every new Vault secret, reducing complexity for our engineers.

Human interactions with Vault
We have an internal authentication mechanism which allows staff members to prove their identity and perform various actions, like replacing a card for a customer.
We soon intend to do this via a custom Vault auth plugin, which is a little safer and easier to work with, but to save time during this project, we instead used a web service in front of Vault which intercepts its requests.

Computer interactions with Vault
We designed the default kubernetes-reader policy to allow applications to do a variety of safe actions. As manually configuring Vault requires a key ceremony, we wanted to cover as many use cases as possible for a service by default
Essentially, Vault allows you to interpolate the details of the authenticated user (eg their service account name) into the policy automatically, including into the allowed paths. This means that we can give every service a special path in Vault that they can read by default. In combination with the fact that all engineers can now create new secrets in Vault, we have already solved the problem of allowing engineers to create secrets for their services.

Libraries
We wrote some new library code to make it super easy for engineers to read secrets into their services. Our tooling encourages engineers to only have one key in their Vault secrets (which are actually each a key-value store), named data.

We'd like to get Vault to issue short lived credentials for our services which further restrict their access to Cassandra, our database, so we have more guarantees that an attacker couldn't write to the database.
It would also be great to issue AWS credentials out of Vault, and keep them updated. Right now we store these as static secrets in Vault, so they never change. In security we like to rotate things so that if they are ever stolen, they aren't useful for long.


https://monzo.com/blog/2019/10/11/how-our-security-team-handle-secrets?source=hashibits&utm_source=hc-newsletter&utm_medium=email&utm_campaign=november2019newsletter&utm_section=community&mkt_tok=eyJpIjoiTURCbVpqRTJOVEUzWVRabSIsInQiOiJSK1E3T0JcL2RzSUZ0R296alBXMjIzdHVUa0lnb1pTMXB1a1RNelVPME41a1wvKzVlbU16cUcyQWNIeWV5bWtcLzRcL0FQMDFcLzhuQWFCM005WmxcL2pFZFFzWkNiR0RpTkRZOVA1dDFmR2N6QUd1aGE1U0VHdTdMckd5OVBFQVZTbkdwNCJ9
  • use HashiCorp Consul for service discovery, service registration, and service mesh for Kubernetes and non-Kubernetes environments.

At KubeCon Barcelona, we announced our participation in Microsoft’s new Service Mesh Interface (SMI). Our integration enables Kubernetes users to manage HashiCorp Consul service mesh intentions directly through tools like kubectl, helm, or Terraform.
Consul supports both standalone Kubernetes and managed Kubernetes (Microsoft AKS, AWS EKS, Google Cloud GKE, and RedHat Openshift) in a number of different ways.
This makes it easier for organizations to start incorporating Kubernetes into their existing infrastructure and manage all their services with a single control plane.
Upon installation, Consul can be used as a central service registry, help secure and facilitate service-to-service communication, and enhance observability over network and service health. Tools like Helm or Shipyard can help you get started with Consul on Kubernetes.
Through integrations with popular network middleware solutions like F5 BIG-IP and HAProxy, service discovery and health checks can be done across multiple environments.
The way Consul interacts with network middleware is based on the idea of managing east-west communication (internal service to service) versus north-south (external users to internal services).
The need for static IPs may slow down service deployments and doesn't scale in a way we might need to fully migrate to microservices. To alleviate this challenge, Consul can dynamically provide service data and health status back to a load balancer, making it visible to the operators that are managing the network health.
Envoy Proxy and Ambassador API Gateway
Proxies are very beneficial to managing microservices at scale by reducing latency in service calls, enabling greater security using mTLS, and helping monitor system resources for the services they are attached to. Consul offers a built-in L4 communication proxy for testing and development, but for proxying L4 and L7 communication in production, Consul service mesh leverages Envoy.
https://www.hashicorp.com/blog/exploring-the-hashicorp-consul-ecosystem-for-kubernetes/?source=hashibits&utm_source=hc-newsletter&utm_medium=email&utm_campaign=november2019newsletter&utm_section=hashicorp&mkt_tok=eyJpIjoiTURCbVpqRTJOVEUzWVRabSIsInQiOiJSK1E3T0JcL2RzSUZ0R296alBXMjIzdHVUa0lnb1pTMXB1a1RNelVPME41a1wvKzVlbU16cUcyQWNIeWV5bWtcLzRcL0FQMDFcLzhuQWFCM005WmxcL2pFZFFzWkNiR0RpTkRZOVA1dDFmR2N6QUd1aGE1U0VHdTdMckd5OVBFQVZTbkdwNCJ9
  • Production Hardening 

best practices for a production hardened deployment of Vault.
The recommendations are based on the security model and focus on defense in depth.

Recommendations

End-to-End TLS.
If intermediate load balancers or reverse proxies are used to front Vault, they should not terminate TLS. This way traffic is always encrypted in transit to Vault and minimizes risks introduced by intermediate layers.

Single Tenancy.
Vault should be the only main process running on a machine.
This reduces the risk that another process running on the same machine is compromised and can interact with Vault.
Similarly, running on bare metal should be preferred to a VM, and a VM preferred to a container.
This reduces the surface area introduced by additional layers of abstraction and other tenants of the hardware.

Firewall traffic.
Vault listens on well-known ports; use a local firewall to restrict all incoming and outgoing traffic to Vault and essential system services like NTP.
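A minimal sketch using firewalld, assuming Vault's default ports 8200 (API) and 8201 (cluster); a deny-by-default zone and rules for essentials like NTP/DNS would still be needed:
$ sudo firewall-cmd --permanent --add-port=8200/tcp
$ sudo firewall-cmd --permanent --add-port=8201/tcp
$ sudo firewall-cmd --reload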

Disable SSH / Remote Desktop.
When running Vault as a single-tenant application, users should never access the machine directly. Instead, they should access Vault through its API over the network.

Disable Swap.
Vault encrypts data in transit and at rest; however, it must still have sensitive data in memory to function.
Minimize the risk of exposure by disabling swap to prevent the operating system from paging sensitive data to disk.
Vault attempts to "memory lock" pages into physical memory automatically, but disabling swap adds another layer of defense.
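For example, on a typical Linux host (exact steps vary by distribution):
$ sudo swapoff -a      # turn off all swap devices immediately
# then remove or comment out any swap entries in /etc/fstab so swap stays disabled after a reboot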

Don't Run as Root.
Vault is designed to run as an unprivileged user, and there is no reason to run Vault with root or Administrator privileges, which can expose the Vault process memory and allow access to Vault encryption keys

Turn Off Core Dumps.
A user or administrator that can force a core dump and has access to the resulting file can potentially access Vault encryption keys. Preventing core dumps is a platform-specific process
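A hedged Linux example; on other platforms the mechanism differs:
$ ulimit -c 0          # set the core file size limit to zero for the current shell
# for a systemd-managed Vault service, the equivalent unit setting is:
#   [Service]
#   LimitCORE=0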

Immutable Upgrades.

Avoid Root Tokens.
Vault provides a root token when it is first initialized.
We recommend treating Vault configuration as code, and using version control to manage policies.
Once initial setup is complete, the root token should be revoked to eliminate the risk of exposure.
Root tokens can be generated when needed, and should be revoked as soon as possible
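A sketch of that workflow with the Vault CLI (the token value is a placeholder):
$ vault token revoke <initial-root-token>    # revoke the root token created at initialization
$ vault operator generate-root -init         # later, start the root token generation workflow only when needed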

Enable Auditing.
Enabling auditing provides a history of all operations performed by Vault and provides a forensics trail in the case of misuse or compromise.
Audit logs securely hash any sensitive data, but access should still be restricted to prevent any unintended disclosures.
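For example, enabling the file audit device (the log path is illustrative):
$ vault audit enable file file_path=/var/log/vault_audit.log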

Upgrade Frequently.

Configure SELinux / AppArmor
Using additional mechanisms like SELinux and AppArmor can help provide additional layers of security when using Vault.

Restrict Storage Access
Vault encrypts all data at rest, regardless of which storage backend is used.
Although the data is encrypted, an attacker with arbitrary control can cause data corruption or loss by modifying or deleting keys.
Access to the storage backend should be restricted to only Vault to avoid unauthorized access or operations

Disable Shell Command History.
Configure the shell so that the vault command itself does not appear in history at all.
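One bash-specific sketch; the exact mechanism depends on your shell:
$ export HISTCONTROL=ignorespace   # commands typed with a leading space are not saved to history
$ export HISTIGNORE="*vault*"      # or drop anything containing "vault" from history entirely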

Tweak ulimits.
Consider reviewing ulimits for the maximum number of open files, connections, etc. before going into production
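For instance, checking and raising the open-file limit (values are illustrative):
$ ulimit -n            # show the current open-file limit
# for a systemd-managed Vault service, raise it in the unit file:
#   [Service]
#   LimitNOFILE=65536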

Docker Containers.
To leverage the "memory lock" feature inside the Vault container you will likely need to use the overlayfs2 or another supporting driver.

https://www.vaultproject.io/guides/operations/production
  • HashiCorp Vault secures, stores, and tightly controls access to tokens, passwords, certificates, API keys, and other secrets in modern computing. Vault handles leasing, key revocation, key rolling, and auditing. Through a unified API, users can access an encrypted Key/Value store and network encryption-as-a-service, or generate AWS IAM/STS credentials, SQL/NoSQL databases, X.509 certificates, SSH credentials, and more.

https://www.vaultproject.io/
  • Security Best Practices for Building Docker Images

Installation of external software components
Copying files
When adding files into the image, the COPY statement is preferred.
To ensure proper usage of the cache, use separate COPY statements (as opposed to combining them with package installation activities). This helps performance by invalidating only some parts of the cache.

Downloading files
Use cURL/wget instead of ADD
To limit the size of an image, the usage of cURL or wget is preferred: with ADD, archives are automatically extracted into the image, increasing its size.
Additionally, the download command can be combined directly with an integrity check, which is not possible when using ADD
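A hedged Dockerfile sketch (the URL, path and checksum are placeholders) that downloads, verifies, extracts, and cleans up in a single layer:
RUN curl -fsSL https://example.com/pkg.tar.gz -o /tmp/pkg.tar.gz \
 && echo "<expected-sha256>  /tmp/pkg.tar.gz" | sha256sum -c - \
 && tar -xzf /tmp/pkg.tar.gz -C /opt \
 && rm /tmp/pkg.tar.gz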

Disk, Directories and Mounts
The Docker build file allows defining storage areas for your application with the help of the VOLUME statement

Working directory
Instead of using a combination like “cd /data && ./runscript.sh”, the WORKDIR statement changes the current working directory.
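For example (paths assumed from the text above):
WORKDIR /data
CMD ["./runscript.sh"]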

Running Processes
Processes can be started with the CMD statement. For example, starting Lynis:
CMD ["lynis", "-c", "-Q"]

Environment settings
ENV PATH /usr/local/yourpackage/bin:$PATH

Active User
With the USER statement, the permissions can be dropped from root to a non-privileged user.

Auditing tool for Docker
lynis audit dockerfile <file>

https://linux-audit.com/security-best-practices-for-building-docker-images/

  • Open-AudIT is an application to tell you exactly what is on your network, how it is configured and when it changes. Open-AudIT will run on Windows and Linux systems. Essentially, Open-AudIT is a database of information, that can be queried via a web interface. Data about the network is inserted via a Bash Script (Linux) or VBScript (Windows). The entire application is written in php, bash and vbscript.
https://www.open-audit.org/


  • purely in-kernel solutions for some packet-processing intensive workloads still lag behind the bypass solution, namely Data Plane Development Kit (DPDK), by almost an order of magnitude.

the kernel community never sleeps (almost literally) and the holy grail of kernel-based networking performance has been found under the name of XDP: the eXpress Data Path.
This technology allows you to implement new networking features (and/or re-implement existing ones) via custom extended BPF (eBPF) programs attached to the kernel packet processing deep down the stack, killing the overhead of socket buffers (SKBs) management, reducing the per-packet memory management overhead, and allowing more-effective bulking.
XDP eBPF programs have access to helpers for packet manipulation and packet forwarding, offering almost unlimited opportunity to change and extend the kernel behavior without the need to add new in-kernel code—and, while at it, reaching a higher possible processing speed.
XDP allows you to attach an eBPF program to a lower-level hook inside the kernel
Such a hook is implemented by the network device driver, inside the ingress traffic processing function (usually, the NAPI poll() method), before an SKB is allocated for the current packet.
XDP is not a programming language, but it uses a programming language (eBPF), and every programming-related tutorial has to start with a “Hello World” program
https://developers.redhat.com/blog/2018/12/06/achieving-high-performance-low-latency-networking-with-xdp-part-1/

  • Docker Trusted Registry

You install it behind your firewall so that you can securely store and manage the Docker images you use in your applications.

Image management
DTR can be installed on-premises, or on a virtual private cloud. And with it, you can store your Docker images securely, behind your firewall
You can use DTR as part of your continuous integration, and continuous delivery processes to build, ship, and run your applications.
It even allows you to see what dockerfile lines were used to produce the image and, if security scanning is enabled, to see a list of all of the software installed in your images.

Security scanning
DTR has a built-in security scanner that can be used to discover what versions of software are used in your images. It scans each layer and aggregates the results to give you a complete picture of what you are shipping as a part of your stack. Most importantly, it correlates this information with a vulnerability database that is kept up to date through periodic updates. This gives you unprecedented insight into your exposure to known security threats.

Image signing
DTR ships with Notary built in so that you can use Docker Content Trust to sign and verify images.

https://docs.docker.com/v17.12/datacenter/dtr/2.4/guides/

  • 10 Docker Image Security Best Practices


1. Prefer minimal base images
By preferring minimal images that bundle only the necessary system tools and libraries required to run your project, you are also minimizing the attack surface for attackers and ensuring that you ship a secure OS.

2. Least privileged user
When a Dockerfile doesn’t specify a USER, Docker defaults to running the container as the root user. When that namespace is then mapped to the root user in the running container, it means that the container potentially has root access on the Docker host.
To minimize exposure, opt-in to create a dedicated user and a dedicated group in the Docker image for the application; use the USER directive in the Dockerfile to ensure the container runs the application with the least privileged access possible
A specific user might not exist in the image; create that user using the instructions in the Dockerfile
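A hedged Dockerfile sketch using Alpine-style commands (Debian-based images would use groupadd/useradd; the application name is a placeholder):
FROM alpine:3.12
RUN addgroup -S app && adduser -S -G app app   # create a dedicated group and user
USER app                                       # run the application as the unprivileged user
CMD ["./my-service"]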


3. Sign and verify images to mitigate MITM attacks

Verify docker images
Make it a best practice that you always verify images before pulling them in, regardless of policy
To experiment with verification, temporarily enable Docker Content Trust with the following command

export DOCKER_CONTENT_TRUST=1
Now attempt to pull an image that you know is not signed—the request is denied and the image is not pulled

Sign docker images
To sign images, use Docker Notary. Notary verifies the image signature for you, and blocks you from running an image if the signature of the image is invalid.
When Docker Content Trust is enabled, a Docker image build signs the image.
When the image is signed for the first time, Docker generates and saves a private key in ~/.docker/trust for your user. This private key is then used to sign any additional images as they are built

4. Find, fix and monitor for open source vulnerabilities

5. Don’t leak sensitive information to Docker images
when building an application inside a Docker image, you need secrets such as an SSH private key to pull code from a private repository, or you need tokens to install private packages. If you copy them into the Docker intermediate container they are cached on the layer to which they were added, even if you delete them later on. These tokens and keys must be kept outside of the Dockerfile

Using multi-stage builds
By leveraging Docker support for multi-stage builds, fetch and manage secrets in an intermediate image layer that is later disposed of so that no sensitive data reaches the image build.
Use code to add secrets to said intermediate layer

Using Docker secret commands
Use an alpha feature in Docker for managing secrets to mount sensitive files without caching them

Beware of recursive copy
a broad copy instruction such as COPY . . copies the entire build context folder, recursively, to the Docker image, which could end up copying sensitive files as well
If you have sensitive files in your folder, either remove them or use .dockerignore to ignore them
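A hypothetical .dockerignore keeping common sensitive files out of the build context:
.git
.env
id_rsa
*.pem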

6. Use fixed tags for immutability
The most common tag is latest, which represents the latest version of the image. Image tags are not immutable, and the author of the images can publish the same tag multiple times
Instead of pulling a tag, pull an image using the specific SHA256 reference of the Docker image, which guarantees you get the same image for every pull. Note, however, that using a SHA256 reference carries its own risk: if the image is updated, that hash might no longer exist.

7. Use COPY instead of ADD

Space and image layer considerations
using COPY allows separating the addition of an archive from remote locations and unpacking it as different layers, which optimizes the image cache.
If remote files are needed, combining all of them into one RUN command that downloads, extracts, and cleans up afterwards optimizes a single layer operation over several layers that would be required if ADD were used.

zip bombs and Zip Slip vulnerabilities
When local archives are used, ADD automatically extracts them to the destination directory. While this may be acceptable, it adds the risk of zip bombs and Zip Slip vulnerabilities that could then be triggered automatically.

When files do need to be fetched from remote URLs, they should be downloaded over a secure TLS connection and their origins validated as well.

8. Use metadata labels
The most common label is “maintainer”, which specifies the email address and the name of the person maintaining this image.
This metadata could contain: a commit hash, a link to the relevant build, quality status (did all tests pass?), source code, a reference to your SECURITY.TXT file location
When adding labels, it is good practice to adopt a SECURITY.TXT (RFC 5785) file that points to your responsible disclosure policy and to reference it in your Docker label schema.
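A hedged example of such labels (values, and label keys other than maintainer, are illustrative):
LABEL maintainer="jane@example.com"
LABEL org.label-schema.vcs-ref="<commit-hash>"
LABEL securitytxt="https://example.com/.well-known/security.txt"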

9. Use multi-stage build for small and secure images
This is a good reason why Docker has the multi-stage build capability. This feature allows you to use multiple temporary images in the build process, keeping only the latest image along with the information you copied into it.
    First image—a very big image size, bundled with many dependencies that are used in order to build your app and run tests.
    Second image—a very thin image in terms of size and number of libraries, with only a copy of the artifacts required to run the app in production.
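A hedged two-stage Dockerfile sketch of this pattern (a Go application is assumed purely for illustration):
FROM golang:1.13 AS builder          # big build image with the full toolchain
WORKDIR /src
COPY . .
RUN go build -o /bin/app .

FROM alpine:3.12                     # thin runtime image
COPY --from=builder /bin/app /bin/app
USER nobody
CMD ["/bin/app"]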

10. Use a linter
Adopt the use of a linter to avoid common mistakes and establish best practice guidelines that engineers can follow in an automated way.
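One popular option is hadolint, which can be run as a container against a Dockerfile:
$ docker run --rm -i hadolint/hadolint < Dockerfile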

https://snyk.io/blog/10-docker-image-security-best-practices/


  • Docker Bench for Security

The Docker Bench for Security is a script that checks for dozens of common best-practices around deploying Docker containers in production. The tests are all automated, and are inspired by the CIS Docker Community Edition Benchmark v1.1.0.
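A typical way to run it straight from the repository (see the project README for the containerized docker run variant):
$ git clone https://github.com/docker/docker-bench-security.git
$ cd docker-bench-security
$ sudo sh docker-bench-security.sh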
https://github.com/docker/docker-bench-security


  • Docker Security Best Practices


1. Image Authenticity
Use Private or Trusted Repositories
Docker Cloud and Docker Hub can scan images in private repositories to verify that they are free from known security vulnerabilities or exposures, and report the results of the scan for each image tag. 

Use Docker Content Trust
This is a new feature introduced in Docker Engine 1.8, which allows you to verify Docker image publishers.
the service protects against image forgery, replay attacks, and key compromises.

2. Excess Privileges

Drop Unnecessary Privileges and Capabilities
The best practice for users would be to remove all capabilities except those explicitly required for their processes. 
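A hedged docker run sketch (the image name is a placeholder; the capabilities actually required depend on the application):
$ docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE my-image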

3. System Security
make use of other Linux security options, such as AppArmor, SELinux, grsecurity and Seccomp.

AppArmor
a Linux kernel security module that allows the system administrator to restrict programs’ capabilities with per-program profiles.
Profiles can allow capabilities like network access, raw socket access, and the permission to read, write, or execute files on matching paths. 
AppArmor supplements the traditional Unix discretionary access control (DAC) model by providing mandatory access control (MAC).

SELinux
Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control security policies, including…(MAC)

grsecurity
A set of patches for the Linux kernel which emphasize security enhancements.
Grsecurity provides a collection of security features to the Linux kernel, including address space protection, enhanced auditing and process control

seccomp (secure computing mode)
a computer security facility in the Linux kernel.
seccomp allows a process to make a one-way transition into a “secure” state where it cannot make any system calls except exit(), sigreturn(), read() and write() to already-open file descriptors.
Should it attempt any other system calls, the kernel will terminate the process with SIGKILL. 

4. Limit Available Resource Consumption

    -m / --memory: Set a memory limit
    --memory-reservation: Set a soft memory limit
    --kernel-memory: Set a kernel memory limit
    --cpus: Limit the number of CPUs
    --device-read-bps: Limit the read rate from a device
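Putting those flags together in one hedged docker run example (image name and values are placeholders):
$ docker run -d -m 512m --memory-reservation=256m --kernel-memory=128m \
    --cpus=1.5 --device-read-bps=/dev/sda:10mb my-image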



https://blog.sqreen.com/docker-security/

  • Docker containers are a wrapper around Linux control groups (cgroups) and namespaces.

Cgroups are used in the Linux kernel for monitoring and restricting resources among a group of processes
Namespaces determine what a process can see.
For example, the PID namespace restricts which processes can be seen within a container.

Securing the host OS
One can also develop custom security modules using Linux Security Modules (LSMs).

SELinux
SELinux is a type of Mandatory Access Control (MAC) security module based on type enforcement.
Type enforcement revolves around defining a type and assigning privileges to those types

AppArmor
AppArmor is another MAC solution
It is based on file system paths rather than defining types. 
Users can specify a file path to a binary and the permissions they have

Seccomp
Seccomp (short for ‘Secure Computing’) is another security module included in many Linux distributions that allows users to restrict system calls.
Seccomp can be used to sandbox applications that handle untrusted user inputs to a subset of system calls.
The first step in using seccomp is to determine all the system calls an application makes when it runs. This can be a difficult and error-prone exercise that should be conducted when the application is written
Users can use tools like audit to profile all the system calls an application makes by exercising it in different ways.
The drawback with seccomp is that the profile has to be applied during the launch of the application. The granularity of restricting system calls is too narrow and requires extensive working knowledge of Linux to come up with good profiles.

Capabilities
Linux capabilities are groups of permissions that can be given to child processes.

Container Runtime Security Practices

Unix socket (/var/run/docker.sock)
By default, the Docker client communicates with the Docker daemon using the unix socket.
This socket can also be mounted by any other container unless proper permissions are in place.
Once mounted, it is very easy to spin up any container, create new images, or shut down existing containers.
Solution:
Set up appropriate SELinux/AppArmor profiles to limit containers mounting this socket
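For illustration, this is the kind of mount the warning is about; anything inside such a container can drive the Docker daemon (the image name is a placeholder):
$ docker run -v /var/run/docker.sock:/var/run/docker.sock my-image   # avoid unless strictly required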

Volume mounts
Docker allows mounting to sensitive host directories. Also, the contents of the host file system can be changed directly from the container. For application containers with direct Internet exposure, it is important to be extra careful when mounting sensitive host directories (/etc/, /usr/)
Solution:
Mount host-sensitive directories as read-only.
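A hedged example of a read-only mount (paths and image name are placeholders):
$ docker run -v /etc:/host-etc:ro my-image   # host /etc is visible but not writable from the container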

Privileged containers
Privileged containers can do almost anything a host can do
Solution:
Use capabilities to grant fine-grained privileges instead

SSH within container
Running ssh service within containers makes managing ssh keys/ access policies difficult
Solution:
Do not run ssh services inside a container.
Instead, run ssh on the host and use `docker exec` or `docker attach` to interact with the container.

Binding privileged ports
By default, Docker allows binding privileged ports (<1024) to a container.
Solution:
List all containers and their port mappings using the code below to ensure that the container's ports are not mapped to host ports below port 1024.
`docker ps --quiet | xargs docker inspect --format '{{ .Id }}: Ports={{ .NetworkSettings.Ports }}'`

Exposing ports
Ensure that there are no unnecessary ports exposed

Running without default AppArmor/ SELinux or seccomp
Docker runs containers with default AppArmor/SELinux and seccomp profiles. They can be disabled with unconfined options (for example, --security-opt seccomp=unconfined)
Solution:
Do not disable the default profiles that Docker supplies.

Sharing host namespaces
Containers can be started with --pid=host to share the host PID namespace or --net=host to share the host network namespace
These allow containers to see and kill PIDs running on the host or even connect to privileged ports
Solution:
Avoid sharing host namespaces with containers


Enabling TLS
If the Docker daemon is running on a TCP endpoint, it is advised to run with TLS enabled
Solution: Docker offers a helpful guide on enabling TLS with Docker.

Do not set mount propagation mode to shared
Mount propagation mode allows mounting volumes in shared, slave or private mode on a container
Mounting a volume in shared mode does not restrict any other container from mounting it and making changes to that volume.
Solution:
Run the following command to list the propagation mode for mounted volumes:
`docker ps --quiet --all | xargs docker inspect --format '{{ .Id }}: Propagation={{range $mnt := .Mounts}} {{json $mnt.Propagation}} {{end}}'`

Restrict a container from acquiring new privileges
A process can set the no_new_priv bit in the kernel
The no_new_priv bit ensures that the process or its children processes do not gain any additional privileges via setuid or sgid bits.
Solution:
List the security options for all the containers using the following command:
`docker ps --quiet --all | xargs docker inspect --format '{{ .Id }}: SecurityOpt={{.HostConfig.SecurityOpt }}'` 
One can start a container with no-new-privileges as below:
`docker run --security-opt=no-new-privileges`
https://www.stackrox.com/post/2017/08/hardening-docker-containers-and-hosts-against-vulnerabilities-a-security-toolkit/

  • Linux Security Modules (LSM) is a framework that allows the Linux kernel to support a variety of computer security models while avoiding favoritism toward any single security implementation.

LSM has been part of the kernel since Linux 2.6. AppArmor, SELinux, Smack, and TOMOYO Linux are the currently accepted modules in the official kernel.


TOMOYO Linux is a Mandatory Access Control (MAC) implementation for Linux that can be used to increase the security of a system, while also being useful purely as a system analysis tool.
TOMOYO Linux allows each process to declare behaviours and resources needed to achieve their purpose. When protection is enabled, TOMOYO Linux acts like an operation watchdog, restricting each process to only the behaviours and resources allowed by the administrator.
The main features of TOMOYO Linux include:
System analysis
Increased security through Mandatory Access Control
Tools to aid in policy generation
Simple syntax
Easy to use
Very few dependencies
Requires no modification of existing binaries
http://tomoyo.osdn.jp/


  • The most complete implementation of a Smack based system is Tizen.

The Automotive Grade Linux project uses Smack. This uses the Yocto Project build tools.
Smack (full name: Simplified Mandatory Access Control Kernel) is a Linux kernel security module that protects data and process interaction from malicious manipulation using a set of custom mandatory access control (MAC) rules, with simplicity as its main design goal
http://schaufler-ca.com/

  • Protect the Docker daemon socket

By default, Docker runs via a non-networked Unix socket.
It can also optionally communicate using an HTTP socket.
If you need Docker to be reachable via the network in a safe manner, you can enable TLS by specifying the tlsverify flag and pointing Docker’s tlscacert flag to a trusted CA certificate.
In the daemon mode, it only allows connections from clients authenticated by a certificate signed by that CA. In the client mode, it only connects to servers with a certificate signed by that CA
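A hedged sketch of what that looks like for the daemon and a client (file names and the port follow common convention; the host is a placeholder):
$ dockerd --tlsverify --tlscacert=ca.pem --tlscert=server-cert.pem --tlskey=server-key.pem -H=0.0.0.0:2376
$ docker --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem -H=<host>:2376 version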
https://docs.docker.com/v17.12/engine/security/https/

  • Cgroups and Namespaces

The first of these features is called cgroups. This feature was developed as a way to group processes and provide more control over the resources that were available to the group, such as CPU, memory, and I/O.
This feature also allows for better accounting of the usage, such as when teams need to report the usage for billing purposes
The cgroups feature of containers allows them to scale in a controllable way and have a predictable capacity.
It is also a good security feature because processes running in containers can not easily consume all the resources on a system - for example, a denial of service attack by starving other processes of required resources

The other feature is called namespaces which essentially allows a process to have its own dedicated set of resources, such as files, users, process ids, and hostnames
Namespaces limit what the process running inside a container can see and do
Container processes can only see processes running in the same namespace
In a lot of ways, this can make a container process seem like it is a virtual machine, but the process still executes system calls on the main kernel.
These processes see a filesystem which is a small subset of the real filesystem.
The user ids inside the container can be mapped from different ids outside the container
(you could make the user root have user id 0 inside the container but actually has user id 1099 outside the container - thus appearing to give administrative control when not actually doing so)

Privileged Containers
A good example is running a web server that needs to listen on a privileged port, such as 80.
Ports under 1024 are privileged and usually assigned to more sensitive network processes such as mail, secure shell access, HTTP, and network time synchronization
If you wanted to run an Apache server (which is often used as a secure entry point to an application) in a container and listen on port 80, you would need to give that container privileged access.
Your intent was to give the process the ability to open a privileged port, but now the process has the ability to do other things that require privileged access.
The limitations imposed by the cgroups controller have been lifted, and the process can do almost anything that is possible to do running outside the container.
To avoid this issue, it is possible to map a non-privileged port outside the container to a privileged port inside the container. For example, you map port 8080 on the host to port 80 inside the container. This will allow you to run processes that normally require privileged ports without actually giving them privileged access.
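For example (the image name is a placeholder):
$ docker run -d -p 8080:80 my-web-image   # host port 8080 forwards to privileged port 80 inside the container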

Seccomp Profiles
The default seccomp profile for Docker disables around 40 system calls to provide a baseline level of security.
The issue with these seccomp profiles is that they must be specified at the start of the container and are difficult to manage.

Capabilities
Capabilities are another way of specifying privileges that need to be available to a process running in a container. The advantage of capabilities is that groups of permissions are bundled together into meaningful groups which makes it easier to collect the privileges required for doing common tasks.
Using capabilities is much more secure than simply running a container as privileged, and a lot easier to manage than using seccomp profiles

SELinux and AppArmor
A lot of the security concerns for processes running in containers apply to processes on a host in general

SELinux is a Linux kernel security module that provides a mandatory access control (MAC) mechanism for providing stricter security enforcement. SELinux defines a set of users, roles, and domains that can be mapped to the actual system users and groups.

AppArmor is a similar MAC mechanism that aims to confine programs to a limited set of resources. AppArmor is more focused on binding access controls to programs rather than users
It also combines capabilities with defining access to resources by path.

https://rancher.com/blog/2018/2018-09-12-introduction-to-container-security-1/

Making sense of Linux namespaces
They are a combination of Linux kernel technologies. Specifically, cgroups, namespaces and LSMs.
To get a better understanding of how namespaces work I sat down last weekend and read a bunch of code and dug into some containers running on one of my Kubernetes workers

seven types of namespaces in the Linux kernel:

Cgroups control the cgroup information a process can see.
IPC controls the System V IPC and POSIX message queues a process can see.
Network controls the network resources a process can see.
Mount controls the set of file systems and mounts a process can see.
PID controls the process IDs a process can see.
User controls the user information (e.g., uids, gids, etc.) a process can see.
UTS controls the hostname and domain information a process can see.

Processes can be assigned to a namespace in two ways.
The first is via one or more CLONE_NEWXXXX flags passed to the clone system call; clone allows new namespaces to be created.
The second way is through the setns(2) system call; setns adds a process to an existing namespace.

Each process on a system has its process ID registered in /proc. Inside that pid directory is a ns directory which lists all of the namespaces (and their ids) associated with the process.
Let’s take a look at the ns entries for a docker shim process running on a Kubernetes worker:
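In general, the ns entries for any process can be inspected like this (the PID is a placeholder; namespace ids differ per system):
$ sudo ls -l /proc/<pid>/ns    # prints symlinks such as net -> 'net:[...]', pid -> 'pid:[...]', mnt -> 'mnt:[...]'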

When the kubelet is asked to create a new container it will actually spin up two. One container will be used to execute your application and the second container will be used to run a pause container. If you want to learn all about the differences between the two types of containers you can check out the amazing pod and pause container posts from Ian Lewis.

Now what makes this interesting is the way the namespaces are shared between containers in a pod. To better understand the relationships between containers in a namespace I fired up a pod with kubectl:
This helped me visualize the namespace associations and walk away with the following notes:
Each container has its own mnt, uts and pid namespace.
All of the processes on the system share the cgroup namespace (need to dig more into this).
The pause container and application container share the net and ipc namespace.
If you are using docker and your container needs access to a namespace outside of the default namespaces it was assigned (net & ipc) you can use the following options to add it to additional namespaces:

https://prefetch.net/blog/2018/02/22/making-sense-of-linux-namespaces/


  • To understand the current state of Kernel Audit, it's important to understand its relationship with namespaces, which are kernel-enforced user space views. Currently, there are seven namespaces. The peer namespaces are the Mount namespace, the UTS namespace (hostname and domain information), the IPC namespace (no one really knows what this one does), and the NET namespaces; the system comes up in the initial NET namespace and all of the physical devices appear in that namespace. The remaining three (PID, User, and Cgroups) are hierarchical namespaces, so permissions are inherited from one to another.


The User namespaces are the most contentious, because there are a number of traps inherent to their use making securing them vexing. “As a result, there are number of distributions that have not enabled user namespaces by default yet, because there's still some work to be done to iron out where these are going,” Briggs said.

There Can Be Only One
As to containers, there is no hard definition as to what they are and certainly the kernel has no concept of containers. As far as the kernel is concerned, it's a concept that's completely in user space at the moment.
https://www.linux.com/news/understanding-and-securing-linux-namespaces


  • The Curious Case of Pid Namespaces And How Containers Can Share Them


Namespaces provide isolation of shared resources
they give each application its own unique view of the system
Because of namespaces, each docker container appears to have its own filesystem and network

Pids in Linux
Processes in linux live in a tree-like structure
The pid is also passed to the parent when a process is created via the fork syscall.
The kernel generates a new pid for the child and returns the identifier to the calling process, but it is up to the parent to keep track of this pid manually.

The first process started by the kernel has pid 1
This process is referred to as the init process, or simply ‘init’.
The parent pid of init is pid 0, signifying that its parent is the kernel.
Pid 1 is the root of the user-space process tree
It is possible to reach pid 1 on a linux system from any process by recursively following each process’ parent.
If pid 1 dies, the kernel will panic and you have to reboot the machine.

A Quick Overview of Namespaces
Linux namespaces are created using the unshare syscall, passing a set of flags representing which namespaces to create
In most cases, unshare pops you right into the new namespace.
For example, as soon as a process creates a network namespace it immediately sees an empty view of the network with no devices.

when you unshare the pid namespace, the process doesn’t immediately enter the new namespace.
Instead, it is required to fork. The child process enters the pid namespace and becomes pid 1
a pid namespace creates a separate view of the process hierarchy.
In other words the forked process will actually have two pids: it has pid 1 inside the namespace, and has a different pid when viewed from outside the namespace.
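A quick way to see this behaviour with the unshare(1) utility (requires root):
$ sudo unshare --pid --fork --mount-proc bash
# ps aux        # inside the new shell, bash appears as pid 1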

Pid 1 in a Namespace
Inside a namespace, init (pid 1) has three unique features when compared to other processes:

1) It does not automatically get default signal handlers, so a signal sent to it is ignored unless it registers a signal handler for that signal. (This is why many dockerized processes fail to respond to ctrl-c and you are forced to kill them with something like `docker kill`).
2) If another process in the namespace dies before its children, its children will be reparented to pid 1. This allows init to collect the exit status from the process so that the kernel can remove it from the process table.
3) If it dies, every other process in the pid namespace will be forcibly terminated and the namespace will be cleaned up

The Docker “Mistake”
Docker (and runc) run the process specified as the containers entrypoint (or cmd) as pid 1 in a new pid namespace.
This can lead to some unexpected behavior for an application process because it usually isn’t designed to run as pid 1. If it doesn’t set up its own signal handlers, signaling the process will not work. If it forks a child that dies before any grandchildren exit, zombie processes can accumulate in the container, potentially filling up the process table.

It is possible to run a special init process in your container and have it fork-exec into the application process, and many containers do this to avoid these problems. One unfortunate side effect of this decision is that the container gains more complexity. Once the container has a real init system, people are apt to embed multiple processes, which sacrifices some of the benefit of dependency isolation

The Rkt “Solution”
It assumes that the process you are starting is not an init process, so it creates an init process for you (systemd) and then has systemd create the filesystem namespace for the container process and start it
Systemd becomes pid 1 in the namespace and the container process runs as pid 2.

A Simpler Alternative
It is possible to pass the --init flag when starting your container, which will cause docker to start a simple init process for you.
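For example (the image name is a placeholder); Docker injects a tiny init (docker-init, based on tini) as pid 1 that forwards signals and reaps zombies:
$ docker run --init --rm my-image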

Multiple Containers in a Pod
it is preferable to bundle these processes separately so that their dependencies can be isolated.
To achieve this, rkt and kubernetes introduced the idea of pods.
A pod is a set of related containers that share some namespaces. In the rkt implementation, every namespace but the filesystem namespace is shared.
Due to some of the aforementioned issues with pid namespaces, kubernetes doesn’t yet share pid namespaces between containers in the same pod.
that means processes in the same pod cannot signal each other
In addition, each container in the pod has the aforementioned init problem: every container process will run as pid 1

The rkt approach is superior for pods. You are not required to run an init process inside your containers, but it is easy to create multiple processes that can communicate and even signal each other

Adding a Container to a Pod
With the container runtime interface, kubernetes has introduced the concept of a pod sandbox. This allows the container runtime to allocate resources in advance of starting the containers. While especially useful for networking, the concept also enables adding containers to existing pods. If you are creating the pod sandbox first and then starting the containers one by one, why not allow an additional container to be added later? This would be especially useful for periodic tasks like database backups or log collection

Rkt has introduced experimental support for this very feature, allowing for the creation of a pod independently of any containers. Containers (or “apps” in the rkt terminology) can be added or removed from the pod at a later time.
Rkt accomplishes this by starting systemd with no running units.
It then communicates with the pod’s systemd to start new apps on demand
in this model the init process has additional privileges and introduces a new attack vector.

The systemd process in rkt's sandbox model:
has access to the host’s filesystem namespace so it can create the filesystem namespace for the containers it starts.
has to maintain full privileges because it doesn’t know in advance the set of privileges each new app will need.
is visible to all of the other processes running in the container.

Sandboxes and Pid Namespaces

1.Pid namespaces are not created along with the sandbox. Instead, each container gets its own pid namespace. This fits with how kubernetes works today, and can benefit from the simpler init strategy for handling a single process mentioned above. The main drawback of this approach is that the processes in the pod cannot signal each other.

3.Pid namespaces are created along with the sandbox. The sandbox includes a smart init process which can be used to start other processes. This is how the rkt app sandbox works. The drawback is that the init process is too privileged, introducing new security attack vectors.

4.Pid namespaces are created along with the sandbox. The sandbox includes a simple init process which only handles signals and reaps zombies. Each other process enters the pid namespace but still has an active parent outside of the namespace. Because the init process is not starting new containers, it does not need to maintain privileges or have access to the host filesystem. The drawback of this approach is that inside the namespace each process appears to have a parent of pid 0, so the normal tree-like structure of the process tree is broken

5.The pid namespace and init are exactly the same as option four. Each other process enters the pid namespace and daemonizes (the parent exits). The kernel will reparent the process to init, which fixes option four’s broken process tree. The drawback in this case is monitoring the new process from the outside after daemonization becomes difficult. The containerization system is forced to track the process via pid instead of simply waiting on the process.

https://hackernoon.com/the-curious-case-of-pid-namespaces-1ce86b6bc900
But consider what happens if a process terminates. Let's say that the bash (PID 5) process terminates. It turns into a so-called "defunct process", also known as a "zombie process".
It's because Unix is designed in such a way that parent processes must explicitly "wait" for child process termination, in order to collect its exit status
The action of calling waitpid() on a child process in order to eliminate its zombie, is called "reaping". Many applications reap their child processes correctly. In the above example with sshd, if bash terminates then the operating system will send a SIGCHLD signal to sshd to wake it up. Sshd notices this and reaps the child process

Suppose the parent process terminates, either intentionally (because the program logic has determined that it should exit), or caused by a user action (e.g. the user killed the process). What happens then to its children? They no longer have a parent process, so they become "orphaned" (this is the actual technical term).
And this is where the init process kicks in. The init process -- PID 1 -- has a special task. Its task is to "adopt" orphaned child processes (again, this is the actual technical term). This means that the init process becomes the parent of such processes, even though those processes were never created directly by the init process.

Why zombie processes are harmful
Why are zombie processes a bad thing, even though they're terminated processes? Surely the original application memory has already been freed, right? Is it anything more than just an entry that you see in ps?
You're right, the original application memory has been freed. But the fact that you still see it in ps means that it's still taking up some kernel resources

But doesn't running a full init system make the container heavyweight and like a VM?
You may be thinking about Upstart, Systemd, SysV init etc with all the implications that come with them.
A "full init system" as we may call it, is neither necessary nor desirable.
The init system that I'm talking about is a small, simple program whose only responsibility is to spawn your application, and to reap adopted child processes.
https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/




  • Containers as everyone knows them are normal processes that are executed using two features of the Linux kernel called namespaces and cgroups.

Namespaces allow you to provide a "view" to the process that hides everything outside of those namespaces, thus giving it its own environment to run in. This makes it so processes can't see or interfere with other processes:
Namespaces include:
Hostname
Process IDs
File System
Network interfaces
Inter-Process Communication (IPC)

While I said above that processes running in a namespace can't interfere with other processes that's not really true.
Linux has a feature called cgroups. Processes can be run in a cgroup much like a namespace but the cgroup limits the resources that the process can use

The thing I wanted to point out here was that cgroups and each namespace type are separate features. Some subset of the namespaces listed above could be used, or none at all. You could use only cgroups or some other combination of the two (well, you're still using namespaces and cgroups, just the root ones, but whatever). Namespaces and cgroups also work on groups of processes. You can have multiple processes running with a single namespace. That way they can see and interact with each other. Or you could run them in a single cgroup. That way the processes together will be limited to a specific amount of CPU and RAM.

You can also think of other types of abstractions you could use. Service meshes like istio can be bolted on as sidecars and provide service routing, telemetry, policy enforcement without ever having to change the main application.

You can also use multiple sidecars. There's nothing stopping you from using both the confd sidecar and istio sidecar at the same time. Applications can be combined in this way to build much more complex and reliable systems while keeping each individual app relatively simple.

https://www.ianlewis.org/en/what-are-kubernetes-pods-anyway


  • Kubernetes provides a clean abstraction called pods for just this use case. It hides the complexity of Docker flags and the need to babysit the containers, shared volumes, and the like. It also hides differences between container runtimes. For example, rkt supports pods natively so there is less work for Kubernetes to do but you as the user of Kubernetes don't have to worry about it.

In Kubernetes, the pause container serves as the "parent container" for all of the containers in your pod.
The pause container has two core responsibilities. First, it serves as the basis of Linux namespace sharing in the pod. And second, with PID (process ID) namespace sharing enabled, it serves as PID 1 for each pod and reaps zombie processes.

Sharing namespaces
In Linux, when you run a new process, the process inherits its namespaces from the parent process. The way that you run a process in a new namespace is by "unsharing" the namespace with the parent process thus creating a new namespace
Once the process is running, you can add other processes to the process' namespace to form a pod. New processes can be added to an existing namespace using the setns system call.
Containers in a pod share namespaces among them.

In both cases we specify the pause container as the container whose namespaces we want to join. This will effectively create our pod. If you access http://localhost:8080/ you should be able to see ghost running through an nginx proxy because the network namespace is shared among the pause, nginx, and ghost containers.
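A simplified, hedged sketch of such commands (the pause image reference is illustrative, and the nginx proxy configuration from the original walkthrough is omitted):
$ docker run -d --name pause -p 8080:80 gcr.io/google_containers/pause-amd64:3.0
$ docker run -d --name nginx --net=container:pause --ipc=container:pause --pid=container:pause nginx
$ docker run -d --name ghost --net=container:pause --ipc=container:pause --pid=container:pause ghost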

Reaping Zombies
In Linux, processes in a PID namespace form a tree with each process having a parent process. Only one process at the root of the tree doesn't really have a parent. This is the "init" process, which has PID 1.
Processes can start other processes using the fork and exec syscalls. When they do this, the new process' parent is the process that called the fork syscall. fork is used to start another copy of the running process and exec is used to replace the current process with a new one, keeping the same PID (in order to run a totally separate application you need to run the fork and exec syscalls. A process creates a new copy of itself as a child process with a new PID using fork and then when the child process runs it checks if it's the child process and runs exec to replace itself with the one you actually want to run. Most languages provide a way to do this via a single function). Each process has an entry in the OS process table. This records info about the process' state and exit code. When a child process has finished running, its process table entry remains until the parent process has retrieved its exit code using the wait syscall. This is called "reaping" zombie processes.

Longer lived zombie processes occur when parent processes don't call the wait syscall after the child process has finished. One situation where this occurs is when the parent process is poorly written and simply omits the wait call or when the parent process dies before the child and the new parent process does not call wait on it. When a process' parent dies before the child, the OS assigns the child process to the "init" process or PID 1. i.e. The init process "adopts" the child process and becomes its parent. This means that now when the child process exits the new parent (init) must call wait to get its exit code or its process table entry remains forever and it becomes a zombie

In containers, one process must be the init process for each PID namespace. With Docker, each container usually has its own PID namespace and the ENTRYPOINT process is the init process
on Kubernetes pods, a container can be made to run in another container's namespace. In this case, one container must assume the role of the init process, while others are added to the namespace as children of the init process.

It's useful to note that there has been a lot of back-and-forth on PID namespace sharing. Reaping zombies is only done by the pause container if you have PID namespace sharing enabled, and currently it is only available in Kubernetes 1.7+. It is enabled by default if running Kubernetes 1.7 with Docker 1.13.1+ unless disabled with a kubelet flag (--docker-disable-shared-pid=true). This was reverted in Kubernetes 1.8 and now it's disabled by default unless enabled by a kubelet flag (--docker-disable-shared-pid=false)

https://www.ianlewis.org/en/almighty-pause-container


  • init container - kubernetes

    A Pod (the smallest and simplest Kubernetes object; it represents a set of running containers on your cluster) can have multiple containers running apps within it, but it can also have one or more init containers, which are run before the app containers are started.
 
    Init containers are exactly like regular containers, except:
    Init containers always run to completion.
    Each init container must complete successfully before the next one starts.
    https://kubernetes.io/docs/concepts/workloads/pods/init-containers/

  • The underlying technology

Namespaces
Docker is written in Go and takes advantage of several features of the Linux kernel to deliver its functionality.
Docker uses a technology called namespaces to provide the isolated workspace called the container. When you run a container, Docker creates a set of namespaces for that container.
These namespaces provide a layer of isolation
Each aspect of a container runs in a separate namespace and its access is limited to that namespace.
Docker Engine uses namespaces such as the following on Linux:
The pid namespace: Process isolation (PID: Process ID).
The net namespace: Managing network interfaces (NET: Networking).
The ipc namespace: Managing access to IPC resources (IPC: InterProcess Communication).
The mnt namespace: Managing filesystem mount points (MNT: Mount).
The uts namespace: Isolating kernel and version identifiers. (UTS: Unix Timesharing System).

Control groups
Docker Engine on Linux also relies on another technology called control groups (cgroups). A cgroup limits an application to a specific set of resources. Control groups allow Docker Engine to share available hardware resources to containers and optionally enforce limits and constraints. For example, you can limit the memory available to a specific container.

Union file systems
Union file systems, or UnionFS, are file systems that operate by creating layers, making them very lightweight and fast. Docker Engine uses UnionFS to provide the building blocks for containers. Docker Engine can use multiple UnionFS variants, including AUFS, btrfs, vfs, and DeviceMapper.

Container format
Docker Engine combines the namespaces, control groups, and UnionFS into a wrapper called a container format. The default container format is libcontainer. In the future, Docker may support other container formats by integrating with technologies such as BSD Jails or Solaris Zones.
https://docs.docker.com/engine/docker-overview/#the-underlying-technology

  • Unionfs: A Stackable Unification File System
This project builds a stackable unification file system, which can appear to merge the contents of several directories (branches), while keeping their physical content separate.
Unionfs is useful for unified source tree management, merged contents of split CD-ROM, merged separate software package directories, data grids, and more.
http://unionfs.filesystems.org/

  • Kernel Korner - Unionfs: Bringing Filesystems Together
In this situation, unioning allows administrators to keep such files separate physically, but to merge them logically into a single view. A collection of merged directories is called a union, and each physical directory is called a branch
https://www.linuxjournal.com/article/7714


  • Harbor

Harbor is an open source cloud native registry that stores, signs, and scans container images for vulnerabilities.
It fills a gap for organizations and applications that cannot use a public or cloud-based registry, or want a consistent experience across clouds.
https://goharbor.io/