Add Enhancement for node file integrity monitoring by mrogers950 · Pull Request #79 · openshift/enhancements

mrogers950 · 2019-10-21T21:26:24Z

This proposal is for the file-integrity-operator.
@JAORMX @jhrozek

cgwalters · 2019-10-21T23:43:51Z

enhancements/security/file-integrity-operator.md

+
+## Summary
+
+This enhancement describes a new security feature for OpenShift. Many security-conscious customers want to be informed when files on a host's filesystem are modified in a way that is unexpected, as this may indicate an attack or compromise.  It proposes a "file-integrity-operator" that provides file integrity monitoring of select files on the host filesystems of the cluster nodes. It periodically runs a verification check on the watched files and provides logs of any changes.


Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.

Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.

Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.

In this case, the back door is basically killing or compromising the AIDE daemonset.

This is going to be a very minor speedup to any attacker that has studied the system beforehand.

Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

Really a lot of this boils down to:

providing a log of files that have been modified since the initial run of the DaemonSet pods.

Is just not even close to sufficient.

Finally, another thing is that "periodically scan the whole filesystem" is a known way to cause performance hits. It means that files that were unused are suddenly brought into the page cache, potentially evicting hot files. It causes I/O contention.

I understand we're trying to meet a compliance standard, and we also can't let the perfect be the enemy of the good. We have to start somewhere, and I (and our customers) certainly appreciate the efforts here.

But my bottom line is that this implementation overall will cause more problems (in false positives and also periodic perf hits) than it will solve.

To give a little bit more context on where this initiative started: this is not the ultimate file integrity silver bullet that will solve all of our issues. This initiative came from the need to comply with federal regulations (through the FedRAMP program, moderate baseline to be precise), which require a system that ensures file integrity to be in place. So, from a compliance point of view, either you're compliant and you can sell to folks that require this sort of thing (US public sector, finance, healthcare), or you're not and you can't sell. AIDE was chosen as a first approach since this is how we currently make RHEL compliant. Given that this is an operator, we can iterate further by creating another provider, called by the operator, that would give more security assurances. But to begin with, let's just enable customers to comply with regulations, and let's enable them to at least use OpenShift.

On the other hand, this is not being recommended as a default, it would be something you enable through OperatorHub.

Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.

Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.

Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.

In this case, the back door is basically killing or compromising the AIDE daemonset.

This is going to be a very minor speedup to any attacker that has studied the system beforehand.

While this might be true, the same could be said about basically any OS-level security component that is not backed by a HSM or similar hardware root of trust, and even then that is just moving the goal post - you have to then trust that your hardware vendor is not selling its keys to a nation state. Given a sophisticated enough attacker it all falls apart, so you can't really include this worst case scenario when developing a threat model.

Like Juan mentioned there is the possibility of extending this to different file integrity providers but we want to tackle the FedRAMP "moderate" baseline initially (we also would not want to make the HSM/TPM a barrier for entry to be compliant if the spec does not call for it). Even if at a practical level the AIDE solution only provides the ability to better post-mortem a compromise because you have a log of the files that changed, that is still valuable to the organization.

Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

We will need to come up with a good strategy for false positives. In the case of an OSTree update, this sounds like something the file integrity operator might be able to handle if it can detect that there was a cluster update and update the checksum database.

cgwalters · 2019-10-21T23:51:43Z

Trying to provide some positive direction: I think we could probably meet some of these compliance standards by consulting the source of truth for each file. For RHCOS for example, OSTree already keeps a SHA256 checksum of each object underneath /usr - for traditional RPM systems there's obviously rpm -V too.

For files in /etc - we can consult the MachineConfig state.

For things in /var - well, hard. There's container images there too of course which are obviously somewhat important. There isn't any current standard for per-file verification of container images; I think the containers team at one point made an effort to add BSD mtree files, but I don't think that ended up in the containers/image stack.

Nothing in this discussion so far attempts to defend against compromising the "source of truth", from the ostree checksum to the RPM DB to any mtree files written by the container images stack.

Longer term, I think we should move towards fs-verity for a lot of things.
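To make the "consult the source of truth" idea concrete, here is a minimal sketch, assuming the trusted checksums are already available as a simple path-to-SHA256 mapping. The `sha256_of` and `verify_against_manifest` names and the manifest shape are hypothetical and only for illustration; this is not how OSTree or `rpm -V` store or check their data, and it does nothing to protect the manifest itself.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or differ from the manifest.

    `manifest` is a hypothetical stand-in for a trusted source of truth:
    it maps a path relative to `root` to the expected SHA-256 hex digest.
    """
    mismatches = []
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches
```

The point of the design is that a legitimate update changes the manifest and the file together, so only files that diverge from the declared state are reported.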

JAORMX · 2019-10-22T10:07:24Z

Trying to provide some positive direction: I think we could probably meet some of these compliance standards by consulting the source of truth for each file. For RHCOS for example, OSTree already keeps a SHA256 checksum of each object underneath /usr - for traditional RPM systems there's obviously rpm -V too.

This might be viable for files under /usr; the main things to take into account are how to do that integrity check and whether the checksumming is FIPS compliant. We would also need to start coding the capability of providing reports from these checks... So, even if it's viable, it's way more work.

For files in /etc - we can consult the MachineConfig state.

etcd, which stores the MachineConfig content, would need to be integrity-checked constantly, and again, the algorithms would need to be FIPS compliant...

It's not only about checking integrity, but having the system meet certain requirements. This is why AIDE was proposed in the first place, to not reinvent the wheel and have to do this duplicate work. It's not ideal, but it does the job and it allows customers to use OpenShift in the first place.

For things in /var - well, hard. There's container images there too of course which are obviously somewhat important. There isn't any current standard for per-file verification of container images; I think the containers team at one point made an effort to add BSD mtree files, but I don't think that ended up in the containers/image stack.

Nothing in this discussion so far attempts to defend against compromising the "source of truth", from the ostree checksum to the RPM DB to any mtree files written by the container images stack.

Longer term, I think we should move towards fs-verity for a lot of things.

With further iterations we can start moving the underlying functionality of the operator to use different, more efficient and secure means of checking integrity. In the meantime, why not iterate and have something that customers can already use?

jhrozek · 2019-10-22T10:39:56Z

On 10/22/19 1:47 AM, Colin Walters wrote: In this case, the back door is basically killing or compromising the AIDE daemonset. This is going to be a very minor speedup to any attacker that has studied the system beforehand.

I would suspect this would ring a loud bell for anyone who's listening to audit logs from the cluster, and then the cluster administrator would take action, even as drastic as decommissioning the cluster.

Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

Isn't this similar to what would happen on a single node when the administrator runs dnf update? Anyway, would the administrator be able to correlate the ostree update with the hashes changing?

cgwalters · 2019-10-22T11:28:49Z

Anyway, would the administrator be able to correlate the ostree update with the hashes changing?

Yes - ostree is a content-addressed object store; a lot like git except with SHA256 and support for uid/gid/xattrs and empty directories. The oscontainer image is addressed by sha256, and it contains an ostree repo with an ostree commit, which in turn covers the subtrees and finally the files.
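A toy model of that addressing chain, as a sketch only: files are addressed by the SHA-256 of their content, a tree by the hash of its path-to-object listing, and a commit by the hash of its tree reference. The JSON encoding and the `object_id`, `tree_id`, and `commit_id` names are assumptions made for illustration and do not reflect ostree's actual object format.

```python
import hashlib
import json
from typing import Dict


def object_id(payload: bytes) -> str:
    """Address a blob by the SHA-256 of its content."""
    return hashlib.sha256(payload).hexdigest()


def tree_id(files: Dict[str, bytes]) -> str:
    """Address a tree by hashing the sorted mapping of path -> file object id."""
    listing = {path: object_id(content) for path, content in files.items()}
    return object_id(json.dumps(listing, sort_keys=True).encode())


def commit_id(tree: str, message: str) -> str:
    """Address a commit by hashing its tree reference and metadata."""
    body = {"tree": tree, "message": message}
    return object_id(json.dumps(body, sort_keys=True).encode())
```

Because every level is derived from the one below it, changing any file changes the tree id and therefore the commit id, which is what lets an administrator tie a set of changed file hashes to one specific update.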

ashcrow

In general this sounds good to me. I'd prefer to have the design copied in to the enhancement here so it's clear what folks are approving and discussing. The use of AIDE makes sense. I like @cgwalters's ideas of also being able to use ostree features to help with security over time.

How much does reporting matter? In other words, do people already have reporting tools that specifically require AIDE's output format?

mrogers950 · 2019-10-22T19:08:52Z

In general this sounds good to me. I'd prefer to have the design copied in to the enhancement here so it's clear what folks are approving and discussing. The use of AIDE makes sense. I like @cgwalters's ideas of also being able to use ostree features to help with security over time.

I'll add more of the design details.

How much does reporting matter? In other words, do people already have reporting tools that specifically require AIDE's output format?

It's possible that they do, considering that AIDE is something that OpenSCAP remediation deploys for RHEL and these same customers could already have infra in place to consume the AIDE logs. So I don't think we need to do much with the log format.

cgwalters · 2019-10-25T22:10:17Z

Yes, we all agree about the short term. I didn't see any response to my longer-term proposals, specifically around e.g. fs-verity. I started a WIP to enable using it in Fedora CoreOS (with OSTree) to start: coreos/coreos-assembler#876 and ostreedev/ostree#1959

lucab · 2019-10-28T13:48:52Z

enhancements/security/file-integrity-operator.md

+
+## Motivation
+
+In addition to the reasons stated in the Summary section, as part of the FedRAMP gap assessment of OpenShift/CoreOS, it has been identified that to fulfill several NIST SP800-53 security controls we need to constantly do integrity checks on configuration files (CM-3 & CM-6), as well as critical system paths and binaries (boot configuration, drivers, firmware, libraries) (SI-7). Besides verifying the files, we need to be able to report which files changed and in what manner, in order for the organization to better determine if the change has been authorized or not. In order to fulfill the controls the file integrity checks need to be done using a state-of-the-practice integrity checking mechanism (e.g., parity checks, cyclical redundancy checks, cryptographic hashes). If using cryptographic hashes for integrity checks, such algorithms need to be FIPS-approved.


s/CoreOS/RHCOS/

But it is unclear whether:

this applies to RHCOS only, because on other OSes you expect some other solutions to cover the same usecase (which? how?)

this applies to RHCOS only, because on other OSes we don't want cluster-orchestrated file integrity monitoring

this is not specific to RHCOS, but applies to any OS where the OpenShift cluster is running

Thanks, this is not specific to RHCOS but for any type of node.

Ack. Then I think you can simply drop RHCOS altogether from here, as the gap you are trying to address is not specific to it.

mrogers950 · 2019-10-28T22:10:56Z

Yes, we all agree about the short term. I didn't see any response to my longer-term proposals, specifically around e.g. fs-verity. I started a WIP to enable using it in Fedora CoreOS (with OSTree) to start: coreos/coreos-assembler#876 and ostreedev/ostree#1959

Cool, fs-verity/IMA seem like a natural fit for extending the operator in the future but I don't have any comments on them specifically. I've included in the design a field to specify a "provider" type to make room for us to eventually add more file integrity checking types.

cgwalters · 2019-10-29T15:11:14Z

On a related but different tangent from this:

In OpenShift 4 we view the host as just part of a cluster. And some parts of the cluster (SDN, MCO, etcd) are fully privileged pods. Compromising those pods isn't different from compromising the host in any useful way.

(Actually of course, compromising the etcd database is itself a huge "persistence vector", but let's leave that gigantic gaping hole aside for now)

Does this proposal exclude /var from AIDE? All of the container images are stored there, which includes privileged code.

ashcrow · 2019-10-29T15:51:31Z

enhancements/security/file-integrity-operator.md

+
+**Note:** *Section not required until targeted at a release.*
+
+TBD


I think adding an OpenShift CI test that:

Installs the operator

Ensures the operator rolls out

Modifies a file on the host

Verifies AIDE caught the change and logged it

would make sense here
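The modify-and-verify half of that flow could be prototyped locally before wiring it into CI. The sketch below is only a stand-in under stated assumptions: it uses a scratch directory in place of a node filesystem and a plain SHA-256 snapshot in place of the AIDE database, and the `snapshot`, `compare`, and `run_scenario` helpers are hypothetical names, not part of the operator or of AIDE.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify differences between two snapshots."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "changed": sorted(k for k in shared if baseline[k] != current[k]),
    }


def run_scenario() -> Dict[str, List[str]]:
    """Baseline a scratch directory, modify one file, and return the report."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "etc").mkdir()
        (root / "etc" / "hosts").write_text("127.0.0.1 localhost\n")
        (root / "etc" / "motd").write_text("welcome\n")
        baseline = snapshot(root)  # stands in for the initial database
        (root / "etc" / "hosts").write_text("10.0.0.1 example\n")  # the "modify a file" step
        return compare(baseline, snapshot(root))
```

A real CI test would replace the scratch directory with a file on a node and replace the comparison with a check of the operator's reported logs; only the sequence of steps carries over.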

ashcrow · 2019-10-29T16:06:14Z

enhancements/security/file-integrity-operator.md

+
+## Drawbacks
+
+TBD


Ideas:

AIDE runs periodically which means items are caught at intervals

False positives may be reported when making updates with the MCO/MCD

mrogers950 · 2019-10-30T16:03:04Z

Does this proposal exclude /var from AIDE? All of the container images are stored there, which includes privileged code.

I don't see a way for AIDE to handle /var properly, since the stuff under /var is dynamically created based on pod UIDs and such and AIDE is only suited for paths known ahead of time. So I think we need to exclude /var.

ashcrow · 2019-10-31T16:00:19Z

There are a few missing sections still. One is only required when targeted (so it's fine). Is there a plan to have the other sections updated before final review?

mrogers950 · 2019-11-04T21:05:25Z

@ashcrow I've filled in the missing sections, let me know if that is sufficient. Thanks!

ashcrow

Looks good. Will leave open for a few days to give others a chance to chime in if needed.

ashcrow · 2019-11-07T17:06:35Z

/lgtm

openshift-ci-robot · 2019-11-07T17:06:49Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, JAORMX, mrogers950

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [ashcrow]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2020-01-20T19:00:50Z

I share Colin's concerns. Seeing this afterwards, almost every element of this sounds like "something the MCD and the OS should already just own". Having to create this operator is a short-term workaround for a gap in our product. I expected to see a section in here on Roadmap that takes Colin's feedback into account and is basically "remove the need for this operator".

Can you open a follow up extending this proposal and gather more details about exactly why this is a special operator on top?

openshift-ci-robot added the size/M label Oct 21, 2019

openshift-ci-robot requested review from imcleod and stevekuznetsov October 21, 2019 21:26

cgwalters reviewed Oct 21, 2019

ashcrow reviewed Oct 22, 2019

cgwalters mentioned this pull request Oct 25, 2019

Add support for 'rootfs: verity' coreos/coreos-assembler#876

Merged

lucab reviewed Oct 28, 2019

mrogers950 force-pushed the file-integrity branch from 0e45307 to 3fd778e Compare October 28, 2019 21:51

ashcrow reviewed Oct 29, 2019

mrogers950 force-pushed the file-integrity branch from 3fd778e to 13dd2b7 Compare October 30, 2019 15:27

JAORMX approved these changes Oct 31, 2019

Add Enhancement for node file integrity monitoring

06e5b08

mrogers950 force-pushed the file-integrity branch from 13dd2b7 to 06e5b08 Compare November 4, 2019 21:04

openshift-ci-robot added the size/L label and removed the size/M label Nov 4, 2019

ashcrow approved these changes Nov 4, 2019

openshift-ci-robot added the approved label Nov 4, 2019

openshift-ci-robot assigned ashcrow Nov 7, 2019

openshift-ci-robot added the lgtm label Nov 7, 2019

openshift-merge-robot merged commit d24de5c into openshift:master Nov 7, 2019

mrogers950 mentioned this pull request Feb 7, 2020

Update file-integrity-operator enhancement #204

Merged

cgwalters mentioned this pull request Apr 1, 2020

ocp4: Remove fapolicyd checks from RHCOS moderate profile ComplianceAsCode/content#5351

Merged


