Skip to content

Conversation

@benkilimnik
Copy link
Member

@benkilimnik benkilimnik commented Aug 28, 2023

Summary: Adds kernel version to HostInfo during PEM registration. This is used to fix #1582 by scoping tracepoint deployments to nodes running certain versions of the kernel.

Related issues: #1582 #1659

Type of change: /kind feature

Test Plan: Tested all existing targets and added tests to src/vizier/services/agent/shared/manager/registration_test.cc and src/vizier/services/agent/pem/pem_manager_test.cc

@benkilimnik benkilimnik changed the title [Kernel scoping 2/5] Add kernel version to HostInfo (#1659, #1582) [Kernel scoping 2/5] Add kernel version to HostInfo Aug 28, 2023
@benkilimnik benkilimnik requested review from a team August 28, 2023 19:02
@benkilimnik benkilimnik force-pushed the register-kernel-version branch from 43012f4 to 1adc3aa Compare September 5, 2023 17:43
Comment on lines 48 to 49
// TODO(@benkilimnik): Kernel version may not be needed in kelvin, only in PEMManager for
// script selection based on HostInfo. Could use dummy value or refactor Manager class.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that Kelvin may not need kernel version, at least for the purposes we have right now. Is it possible to remove from Kelvin (or is there some common base class where we add kernel version as an arg. making it required as an arg. here)?

Copy link
Member Author

@benkilimnik benkilimnik Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a common base class Manager in vizier/services/agent/shared/manager/manager.{h,cc} to which I've added the kernel version.

I think we could handle this in a number of ways:

  1. Set a dummy value for the kernel version in KelvinManager.

  2. Set a flag in Manager

protected:
    Manager(..., bool use_kernel_version = true);
Manager::Manager(..., bool use_kernel_version) {
    if (use_kernel_version) {
        // Find and set the kernel version
    }
    // other initialization
}
KelvinManager::KelvinManager(...) 
    : Manager(..., false) { // Pass false to indicate that we don't need to go find the kernel version
    // KelvinManager specific initialization
}
  1. Refactor Manager such that it does not require the kernel version in its constructor, but provides a separate method to set it. Derived classes like PEMManager set the kernel version after construction if needed.

  2. Make the kernel version optional in the base Manager class e.g. std::optional<px::system::KernelVersion>

  3. Keep kernel version in Kelvin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another proposal for addressing this. What if we mutated the info_ member's kernel_version field within the PEMManager's constructor body? This would allow us to remove any references of kernel version from the Manager base class and the KelvinManager.

This commit (1795988) shows an example of what I mean. If we like that direction, I think we should also add tests to ensure that a KelvinManager and PEMManager have an info_ member in the correct state after construction: former having kernel_version uninitialized and the latter having it set to a kernel version.

I see this as an improvement to option 3 because it ensures that a newly minted PEMManager is properly configured and reduces risk of using the API incorrectly (forgetting to call the kernel version setter).

Copy link
Member Author

@benkilimnik benkilimnik Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a more robust solution.

Regarding testing, are you imagining something like this, where we check that the kernel version is initialized after construction? (I think the member function info() is protected, so we'd need a getter or friend test function).

class PEMManagerTest : public ::testing::Test {
    protected:
        PEMManagerTest() {
            agent_info_ = agent::Info{};
            agent_info_.agent_id = sole::uuid4();
            agent_info_.hostname = "hostname";
            agent_info_.address = "address";
            agent_info_.pod_name = "pod_name";
            agent_info_.host_ip = "host_ip";
            agent_info_.kernel_version =
                system::ParseKernelVersionString("5.15.0-106-generic").ValueOrDie();
        }

    agent::Info agent_info_;
};

TEST_F(PEMManagerTest, ConstructorTest) {
    auto manager =
      PEMManager::Create(agent_info_.agent_id, agent_info_.pod_name, agent_info_.host_ip,
                         "nats_url", agent_info_.kernel_version)
          .ConsumeValueOrDie();  // Remove the semicolon before .ConsumeValueOrDie()
    EXPECT_EQ(manager->info()->agent_id, agent_info_.agent_id);
    EXPECT_EQ(manager->info()->pod_name, agent_info_.pod_name);
    EXPECT_EQ(manager->info()->host_ip, agent_info_.host_ip);
    EXPECT_EQ(manager->info()->kernel_version, agent_info_.kernel_version);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benkilimnik that's what I was thinking and I think making it a friend test function sounds good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the way PEMManager inherits from Manager I have actually found it rather difficult to write a test case for this. When I instantiate the PEMManager I get this error

F20230908 07:05:26.144451    12 statusor.h:148] Check failed: _s.ok() Bad Status: Internal : Failed to read file ../../services/certs/ca.crt (No such file or directory)

which I believe comes from SSL::DefaultGRPCClientCreds() in manager.cc:103

Manager::Manager(sole::uuid agent_id, std::string_view pod_name, std::string_view host_ip,
                 int grpc_server_port, services::shared::agent::AgentCapabilities capabilities,
                 services::shared::agent::AgentParameters parameters, std::string_view nats_url,
                 std::string_view mds_url)
    : grpc_channel_creds_(SSL::DefaultGRPCClientCreds()),
...

I think a test that mocks out parts of the Manager constructor will require some design changes. As I see it, we could:

  1. Move initialization logic out of the Manager constructor, so that we can avoid calling it in tests and only instantiate the object
  2. Instead of directly creating instances / reading files inside the constructor, we could pass them as parameters, so that we can pass mock versions in tests.
  3. Instead of inheriting from Manager, we could make PEMManager and Manager implement the same interface. Then, PEMManager can have a Manager member that it delegates to for the shared functionality. This way, we could mock out the entire Manager when testing PEMManager.

@ddelnano @etep eager to hear your thoughts on this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some discussion with @ddelnano, I ended up adding a private constructor to PEMManager that contains a flag to skip the Init() of the base Manager. This flag is set to false in the public constructor and true in friend tests which call it.
The cert error was fixed by setting FLAGS_disable_SSL = true;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be ok with option (5) keep the version in Kelvin if it nets out to "simpler." It appears we already have a "way" to avoid going here, but is the pain worth it? I am ok with either route btw.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite advocating for the current direction, I do think that having the kernel version for both is likely simpler.

Sorry for causing additional change @benkilimnik, but after reconsidering (5) is probably best. If the Manager class was an area where future test coverage would helpful that might tip the scale, but as we've seen it's very difficult to exercise its code.

Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
@benkilimnik benkilimnik force-pushed the register-kernel-version branch from 3acf04b to 385025e Compare September 7, 2023 06:28
Copy link
Member

@ddelnano ddelnano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments in addition to the existing discussion about how to handle the kernel version for KelvinManagers.

Comment on lines 48 to 49
// TODO(@benkilimnik): Kernel version may not be needed in kelvin, only in PEMManager for
// script selection based on HostInfo. Could use dummy value or refactor Manager class.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have another proposal for addressing this. What if we mutated the info_ member's kernel_version field within the PEMManager's constructor body? This would allow us to remove any references of kernel version from the Manager base class and the KelvinManager.

This commit (1795988) shows an example of what I mean. If we like that direction, I think we should also add tests to ensure that a KelvinManager and PEMManager have an info_ member in the correct state after construction: former having kernel_version uninitialized and the latter having it set to a kernel version.

I see this as an improvement to option 3 because it ensures that a newly minted PEMManager is properly configured and reduces risk of using the API incorrectly (forgetting to call the kernel version setter).

…in PEMManager constructor body

Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
… correctly

Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Copy link
Member

@ddelnano ddelnano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more minor comments. Assuming @etep doesn't have any issues with the direction of how we tested this, this lgtm.

Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
Copy link
Member

@ddelnano ddelnano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more minor change, but approving since we have consensus on the direction.

Signed-off-by: Benjamin Kilimnik <bkilimnik@pixielabs.ai>
@benkilimnik benkilimnik marked this pull request as ready for review September 11, 2023 18:28
@benkilimnik benkilimnik requested a review from a team as a code owner September 11, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

px/tcp_drops pxl script broken due to missing tcp_drop kprobe

4 participants