Skip to content

Add Enhancement for node file integrity monitoring#79

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
mrogers950:file-integrity
Nov 7, 2019
Merged

Add Enhancement for node file integrity monitoring#79
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
mrogers950:file-integrity

Conversation

@mrogers950
Copy link
Contributor

This proposal is for the file-integrity-operator.
@JAORMX @jhrozek

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 21, 2019

## Summary

This enhancement describes a new security feature for OpenShift. Many security-conscious customers want to be informed when files on a host's filesystem are modified in a way that is unexpected, as this may indicate an attack or compromise. It proposes a "file-integrity-operator" that provides file integrity monitoring of select files on the host filesystems of the cluster nodes. It periodically runs a verification check on the watched files and provides logs of any changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.

Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.

Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.

In this case, the back door is basically killing or compromising the AIDE daemonset.

This is going to be a very minor speedup to any attacker that has studied the system beforehand.

Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

Really a lot of this boils down to:

providing a log of files that have been modified since the initial run of the DaemonSet pods.

Is just not even close to sufficient.

Finally, another thing is that "periodically scan the whole filesystem" is a known way to cause performance hits. It means that files that were unused are suddenly brought into the page cache, potentially evicting hot files. It causes I/O contention.

I understand we're trying to meet a compliance standard, and we also can't let the perfect be the enemy of the good. We have to start somewhere, and I (and our customers) certainly appreciate the efforts here.

But my bottom line is that this implementation overall will cause more problems (in false positives and also periodic perf hits) than it will solve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To give a little bit more context on where this initiative started.... This is ont the ultimate file integrity silver bullet that will solve all of our issues. This initiative came from the need to comply with federal regulations (through the FedRAMP program, moderate baseline to be precise), which requires such a system to ensure file integrity to be in place. So, from a compliance point of view, either you're compliant and you can sell to folks that require this sort of thing (US public sector, finance, healthcare), or you're not and you can't sell. AIDE was chosen as a first approach since this is how we currently make RHEL be compliant. Given that this is an operator, we can further iterate by creating another provider that would be called by the operator, that would give more security assurances. But to begin with, lets just enable customers to be able to comply with regulations, and lets enable them to at least be able to use OpenShift.

On the other hand, this is not being recommended as a default, it would be something you enable through OperatorHub.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.

Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.

Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.

In this case, the back door is basically killing or compromising the AIDE daemonset.

This is going to be a very minor speedup to any attacker that has studied the system beforehand.

While this might be true, the same could be said about basically any OS-level security component that is not backed by a HSM or similar hardware root of trust, and even then that is just moving the goal post - you have to then trust that your hardware vendor is not selling its keys to a nation state. Given a sophisticated enough attacker it all falls apart, so you can't really include this worst case scenario when developing a threat model.

Like Juan mentioned there is the possibility of extending this to different file integrity providers but we want to tackle the FedRAMP "moderate" baseline initially (we also would not want to make the HSM/TPM a barrier for entry to be compliant if the spec does not call for it). Even if at a practical level the AIDE solution only provides the ability to better post-mortem a compromise because you have a log of the files that changed, that is still valuable to the organization.

Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.

We will need to come up with a good strategy for false positives. In the case of an OSTree update, this sounds like something the file integrity operator might be able to handle if it can detect that there was a cluster update and update the checksum database.

@cgwalters
Copy link
Member

cgwalters commented Oct 21, 2019

Trying to provide some positive direction: I think we could probably meet some of these compliance standards by consulting the source of truth for each file. For RHCOS for example, OSTree already keeps a SHA256 checksum of each object underneath /usr - for traditional RPM systems there's obviously rpm -V too.

For files in /etc - we can consult the MachineConfig state.

For things in /var - well, hard. There's container images there too of course which are obviously somewhat important. There isn't any current standard for per-file verification of container images; I think the containers team at one point made an effort to add BSD mtree files, but I don't think that ended up in the containers/image stack.

Nothing in this discussion so far attempts to defend against compromising the "source of truth", from the ostree checksum to the RPM DB to any mtree files written by the container images stack.

Longer term, I think we should move towards fs-verity for a lot of things.

@JAORMX
Copy link
Contributor

JAORMX commented Oct 22, 2019

Trying to provide some positive direction: I think we could probably meet some of these compliance standards by consulting the source of truth for each file. For RHCOS for example, OSTree already keeps a SHA256 checksum of each object underneath /usr - for traditional RPM systems there's obviously rpm -V too.

This might be viable for files under /usr ; The main thing to take into account is how to do that integrity check, and is the checksumming FIPS compliant. We would also need to start coding the capability of providing reports from these checks... So... Even if it's viable, it's way more work.

For files in /etc - we can consult the MachineConfig state.

etcd which stores the MachineConfig content would need to be integrity-checked constantly, and again, the algorithms FIPS compliant...

It's not only about checking integrity, but having the system meet certain requirements. This is why AIDE was proposed in the first place, to not reinvent the wheel and have to do this duplicate work. It's not ideal, but it does the job and it allows customers to use OpenShift in the first place.

For things in /var - well, hard. There's container images there too of course which are obviously somewhat important. There isn't any current standard for per-file verification of container images; I think the containers team at one point made an effort to add BSD mtree files, but I don't think that ended up in the containers/image stack.

Nothing in this discussion so far attempts to defend against compromising the "source of truth", from the ostree checksum to the RPM DB to any mtree files written by the container images stack.

Longer term, I think we should move towards fs-verity for a lot of things.

With further iterations we can start moving the underlying functionality of the operator to use different, more efficient and secure means of checking integrity. In the meantime, why not iterate and have something that customers can already use?

@jhrozek
Copy link
Contributor

jhrozek commented Oct 22, 2019 via email

@cgwalters
Copy link
Member

Anyway, would the administrator be able
to correlate the ostree update with the hashes changing?

Yes - ostree is a content-addressed object store; a lot like git except with SHA256 and support for uid/gid/xattrs and empty directories. The oscontainer image is addressed by sha256, and it contains an ostree repo with an ostree commit, which in turn covers the subtrees and finally the files.

Copy link
Member

@ashcrow ashcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this sounds good to me. I'd prefer to have the design copied in to the enhancement here so it's clear what folks are approving and discussing. The use of AIDE makes sense. I like @cgwalters's ideas of also being able to use ostree features to help with security over time.

How much does reporting matter? In other words, do people already have reporting tools that specifically AIDE's output format will be required?

@mrogers950
Copy link
Contributor Author

In general this sounds good to me. I'd prefer to have the design copied in to the enhancement here so it's clear what folks are approving and discussing. The use of AIDE makes sense. I like @cgwalters's ideas of also being able to use ostree features to help with security over time.

I'll add more of the design details.

How much does reporting matter? In other words, do people already have reporting tools that specifically AIDE's output format will be required?

It's possible that they do, considering that AIDE is something that OpenSCAP remediation deploys for RHEL and these same customers could already have infra in place to consume the AIDE logs. So I don't think we need to do much with the log format.

@cgwalters
Copy link
Member

Yes, we all agree about the short term. I didn't see any response to my longer-term proposals, specifically around e.g. fs-verity. I started a WIP to enable using it in Fedora CoreOS (with OSTree) to start: coreos/coreos-assembler#876 and ostreedev/ostree#1959


## Motivation

In addition to the reasons stated in the Summary section, as part of the FedRAMP gap assessment of OpenShift/CoreOS, it has been identified that to fulfill several NIST SP800-53 security controls we need to constantly do integrity checks on configuration files (CM-3 & CM-6), as well as critical system paths and binaries (boot configuration, drivers, firmware, libraries) (SI-7). Besides verifying the files, we need to be able to report which files changed and in what manner, in order for the organization to better determine if the change has been authorized or not. In order to fulfull the controls the file integrity checks need to be done using a state-of-the-practice integrity checking mechanism (e.g., parity checks, cyclical redundancy checks, cryptographic hashes). If using cryptographic hashes for integrity checks, such algorithms need to be FIPS-approved.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/CoreOS/RHCOS/

But it is unclear whether:

  • this applies to RHCOS only, because on other OSes you expect some other solutions to cover the same usecase (which? how?)
  • this applies to RHCOS only, because on other OSes we don't want cluster-orchestrated file integrity monitoring
  • this is not specific to RHCOS, but applies to any OS where the OpenShift cluster is running

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is not specific to RHCOS but for any type of node.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Then I think you can simply drop RHCOS at all from here, as the gap you are trying to address is not specific to it.

@mrogers950
Copy link
Contributor Author

Yes, we all agree about the short term. I didn't see any response to my longer-term proposals, specifically around e.g. fs-verity. I started a WIP to enable using it in Fedora CoreOS (with OSTree) to start: coreos/coreos-assembler#876 and ostreedev/ostree#1959

Cool, fs-verity/IMA seem like a natural fit for extending the operator in the future but I don't have any comments on them specifically. I've included in the design a field to specify a "provider" type to make room for us to eventually add more file integrity checking types.

@cgwalters
Copy link
Member

On a related but different tangent from this:

In OpenShift 4 we view the host as just part of a cluster. And some parts of the cluster (SDN, MCO, etcd) are fully privileged pods. Compomising those pods isn't different from compromising the host in any useful way.

(Actually of course, compromising the etcd database is itself a huge "persistence vector", but let's leave that gigantic gaping hole aside for now)

Does this proposal exclude /var from AIDE? All of the container images are stored there, which includes privileged code.


**Note:** *Section not required until targeted at a release.*

TBD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a OpenShift CI test that:

  • Installs the operator
  • Ensures the operator rolls out
  • Modifies a file on the host
  • Verifies AIDE caught the change logged it

would make sense here


## Drawbacks

TBD No newline at end of file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideas:

  • AIDE runs periodically which means items are caught at intervals
  • False positives may be reported when making updates with the MCO/MCD

@mrogers950
Copy link
Contributor Author

Does this proposal exclude /var from AIDE? All of the container images are stored there, which includes privileged code.

I don't see a way for AIDE to handle /var properly, since the stuff under /var is dynamically created based on pod UIDs and such and AIDE is only suited for paths known ahead of time. So I think we need to exclude /var.

@ashcrow
Copy link
Member

ashcrow commented Oct 31, 2019

There are a few missing sections still. One is only required when targeted (so it's fine). Is there a plan to have the other sections updated before final review?

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 4, 2019
@mrogers950
Copy link
Contributor Author

@ashcrow I've filled in the missing sections, let me know if that is sufficient. Thanks!

Copy link
Member

@ashcrow ashcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Will leave open for a few days to give others a chance to chime in if needed.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 4, 2019
@ashcrow
Copy link
Member

ashcrow commented Nov 7, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 7, 2019
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, JAORMX, mrogers950

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit d24de5c into openshift:master Nov 7, 2019
@smarterclayton
Copy link
Contributor

I share Colin's concerns. Seeing this afterwards, almost every element of this sounds like "something the MCD and the OS should already just own". Having to create this operator is a short-term workaround for a gap in our product. I expected to see a section in here on Roadmap that takes Colin's feedback into account and is basically "remove the need for this operator".

Can you open a follow up extending this proposal and gather more details about exactly why this is a special operator on top?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants