Add Enhancement for node file integrity monitoring#79
Add Enhancement for node file integrity monitoring#79openshift-merge-robot merged 1 commit intoopenshift:masterfrom
Conversation
|
|
||
| ## Summary | ||
|
|
||
| This enhancement describes a new security feature for OpenShift. Many security-conscious customers want to be informed when files on a host's filesystem are modified in a way that is unexpected, as this may indicate an attack or compromise. It proposes a "file-integrity-operator" that provides file integrity monitoring of select files on the host filesystems of the cluster nodes. It periodically runs a verification check on the watched files and provides logs of any changes. |
There was a problem hiding this comment.
Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.
Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.
Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.
In this case, the back door is basically killing or compromising the AIDE daemonset.
This is going to be a very minor speedup to any attacker that has studied the system beforehand.
Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised /usr/bin/bash" and "OSTree update changed /usr/bin/bash". Similarly for files in /etc that may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.
Really a lot of this boils down to:
providing a log of files that have been modified since the initial run of the DaemonSet pods.
Is just not even close to sufficient.
Finally, another thing is that "periodically scan the whole filesystem" is a known way to cause performance hits. It means that files that were unused are suddenly brought into the page cache, potentially evicting hot files. It causes I/O contention.
I understand we're trying to meet a compliance standard, and we also can't let the perfect be the enemy of the good. We have to start somewhere, and I (and our customers) certainly appreciate the efforts here.
But my bottom line is that this implementation overall will cause more problems (in false positives and also periodic perf hits) than it will solve.
There was a problem hiding this comment.
To give a little bit more context on where this initiative started.... This is ont the ultimate file integrity silver bullet that will solve all of our issues. This initiative came from the need to comply with federal regulations (through the FedRAMP program, moderate baseline to be precise), which requires such a system to ensure file integrity to be in place. So, from a compliance point of view, either you're compliant and you can sell to folks that require this sort of thing (US public sector, finance, healthcare), or you're not and you can't sell. AIDE was chosen as a first approach since this is how we currently make RHEL be compliant. Given that this is an operator, we can further iterate by creating another provider that would be called by the operator, that would give more security assurances. But to begin with, lets just enable customers to be able to comply with regulations, and lets enable them to at least be able to use OpenShift.
On the other hand, this is not being recommended as a default, it would be something you enable through OperatorHub.
There was a problem hiding this comment.
Summarizing some points I raised in internal discussion: I just don't see this implementation as worth it.
Security is full of tradeoffs - I can add three locks to my front door for example. Is it worth it? Probably not...it'd be a daily usability hit for a very marginal benefit.
Humans (including humans attempting to compromise computer systems) will follow the path of least resistance. If they see 3 locks on my front door, they'll check the back door, where I didn't put 3 locks.
In this case, the back door is basically killing or compromising the AIDE daemonset.
This is going to be a very minor speedup to any attacker that has studied the system beforehand.
While this might be true, the same could be said about basically any OS-level security component that is not backed by a HSM or similar hardware root of trust, and even then that is just moving the goal post - you have to then trust that your hardware vendor is not selling its keys to a nation state. Given a sophisticated enough attacker it all falls apart, so you can't really include this worst case scenario when developing a threat model.
Like Juan mentioned there is the possibility of extending this to different file integrity providers but we want to tackle the FedRAMP "moderate" baseline initially (we also would not want to make the HSM/TPM a barrier for entry to be compliant if the spec does not call for it). Even if at a practical level the AIDE solution only provides the ability to better post-mortem a compromise because you have a log of the files that changed, that is still valuable to the organization.
Further, it raises the risk of a lot of false positives. How for example would an AIDE system distinguish between "attacker compromised
/usr/bin/bash" and "OSTree update changed/usr/bin/bash". Similarly for files in/etcthat may be changed via MachineConfig, or certificates on the host that end up being rotated, etc.
We will need to come up with a good strategy for false positives. In the case of an OSTree update, this sounds like something the file integrity operator might be able to handle if it can detect that there was a cluster update and update the checksum database.
|
Trying to provide some positive direction: I think we could probably meet some of these compliance standards by consulting the source of truth for each file. For RHCOS for example, OSTree already keeps a SHA256 checksum of each object underneath For files in For things in Nothing in this discussion so far attempts to defend against compromising the "source of truth", from the ostree checksum to the RPM DB to any mtree files written by the container images stack. Longer term, I think we should move towards fs-verity for a lot of things. |
This might be viable for files under /usr ; The main thing to take into account is how to do that integrity check, and is the checksumming FIPS compliant. We would also need to start coding the capability of providing reports from these checks... So... Even if it's viable, it's way more work.
etcd which stores the MachineConfig content would need to be integrity-checked constantly, and again, the algorithms FIPS compliant... It's not only about checking integrity, but having the system meet certain requirements. This is why AIDE was proposed in the first place, to not reinvent the wheel and have to do this duplicate work. It's not ideal, but it does the job and it allows customers to use OpenShift in the first place.
With further iterations we can start moving the underlying functionality of the operator to use different, more efficient and secure means of checking integrity. In the meantime, why not iterate and have something that customers can already use? |
|
On 10/22/19 1:47 AM, Colin Walters wrote:
In this case, the back door is basically killing or compromising the
AIDE daemonset.
This is going to be a very minor speedup to any attacker that has
studied the system beforehand.
I would suspect this would ring a loud bell to anyone who's listening to
audit logs from the cluster and then the cluster administrator would
take action, even as drastic as decomission the cluster.
Further, it raises the risk of a lot of false positives. How for
example would an AIDE system distinguish between "attacker compromised
|/usr/bin/bash|" and "OSTree update changed |/usr/bin/bash|".
Similarly for files in |/etc| that may be changed via MachineConfig,
or certificates on the host that end up being rotated, etc.
Isn't this similar to what would happen on a single node when the
administrator runs dnf update? Anyway, would the administrator be able
to correlate the ostree update with the hashes changing?
|
Yes - ostree is a content-addressed object store; a lot like git except with SHA256 and support for uid/gid/xattrs and empty directories. The oscontainer image is addressed by sha256, and it contains an ostree repo with an ostree commit, which in turn covers the subtrees and finally the files. |
ashcrow
left a comment
There was a problem hiding this comment.
In general this sounds good to me. I'd prefer to have the design copied in to the enhancement here so it's clear what folks are approving and discussing. The use of AIDE makes sense. I like @cgwalters's ideas of also being able to use ostree features to help with security over time.
How much does reporting matter? In other words, do people already have reporting tools that specifically AIDE's output format will be required?
I'll add more of the design details.
It's possible that they do, considering that AIDE is something that OpenSCAP remediation deploys for RHEL and these same customers could already have infra in place to consume the AIDE logs. So I don't think we need to do much with the log format. |
|
Yes, we all agree about the short term. I didn't see any response to my longer-term proposals, specifically around e.g. fs-verity. I started a WIP to enable using it in Fedora CoreOS (with OSTree) to start: coreos/coreos-assembler#876 and ostreedev/ostree#1959 |
|
|
||
| ## Motivation | ||
|
|
||
| In addition to the reasons stated in the Summary section, as part of the FedRAMP gap assessment of OpenShift/CoreOS, it has been identified that to fulfill several NIST SP800-53 security controls we need to constantly do integrity checks on configuration files (CM-3 & CM-6), as well as critical system paths and binaries (boot configuration, drivers, firmware, libraries) (SI-7). Besides verifying the files, we need to be able to report which files changed and in what manner, in order for the organization to better determine if the change has been authorized or not. In order to fulfull the controls the file integrity checks need to be done using a state-of-the-practice integrity checking mechanism (e.g., parity checks, cyclical redundancy checks, cryptographic hashes). If using cryptographic hashes for integrity checks, such algorithms need to be FIPS-approved. |
There was a problem hiding this comment.
s/CoreOS/RHCOS/
But it is unclear whether:
- this applies to RHCOS only, because on other OSes you expect some other solutions to cover the same usecase (which? how?)
- this applies to RHCOS only, because on other OSes we don't want cluster-orchestrated file integrity monitoring
- this is not specific to RHCOS, but applies to any OS where the OpenShift cluster is running
There was a problem hiding this comment.
Thanks, this is not specific to RHCOS but for any type of node.
There was a problem hiding this comment.
Ack. Then I think you can simply drop RHCOS at all from here, as the gap you are trying to address is not specific to it.
0e45307 to
3fd778e
Compare
Cool, fs-verity/IMA seem like a natural fit for extending the operator in the future but I don't have any comments on them specifically. I've included in the design a field to specify a "provider" type to make room for us to eventually add more file integrity checking types. |
|
On a related but different tangent from this: In OpenShift 4 we view the host as just part of a cluster. And some parts of the cluster (SDN, MCO, etcd) are fully privileged pods. Compomising those pods isn't different from compromising the host in any useful way. (Actually of course, compromising the etcd database is itself a huge "persistence vector", but let's leave that gigantic gaping hole aside for now) Does this proposal exclude |
|
|
||
| **Note:** *Section not required until targeted at a release.* | ||
|
|
||
| TBD |
There was a problem hiding this comment.
I think adding a OpenShift CI test that:
- Installs the operator
- Ensures the operator rolls out
- Modifies a file on the host
- Verifies AIDE caught the change logged it
would make sense here
|
|
||
| ## Drawbacks | ||
|
|
||
| TBD No newline at end of file |
There was a problem hiding this comment.
Ideas:
- AIDE runs periodically which means items are caught at intervals
- False positives may be reported when making updates with the MCO/MCD
3fd778e to
13dd2b7
Compare
I don't see a way for AIDE to handle /var properly, since the stuff under /var is dynamically created based on pod UIDs and such and AIDE is only suited for paths known ahead of time. So I think we need to exclude /var. |
|
There are a few missing sections still. One is only required when targeted (so it's fine). Is there a plan to have the other sections updated before final review? |
13dd2b7 to
06e5b08
Compare
|
@ashcrow I've filled in the missing sections, let me know if that is sufficient. Thanks! |
ashcrow
left a comment
There was a problem hiding this comment.
Looks good. Will leave open for a few days to give others a chance to chime in if needed.
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ashcrow, JAORMX, mrogers950 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
I share Colin's concerns. Seeing this afterwards, almost every element of this sounds like "something the MCD and the OS should already just own". Having to create this operator is a short-term workaround for a gap in our product. I expected to see a section in here on Roadmap that takes Colin's feedback into account and is basically "remove the need for this operator". Can you open a follow up extending this proposal and gather more details about exactly why this is a special operator on top? |
This proposal is for the file-integrity-operator.
@JAORMX @jhrozek