Skip to content

cephadm: Get rid of injected_argv#37495

Closed
jmolmo wants to merge 2 commits intoceph:masterfrom
jmolmo:rm_injected_argv
Closed

cephadm: Get rid of injected_argv#37495
jmolmo wants to merge 2 commits intoceph:masterfrom
jmolmo:rm_injected_argv

Conversation

@jmolmo
Copy link
Member

@jmolmo jmolmo commented Sep 30, 2020

Removed the injected_argv parameter and the injection of code in the cephadm
script we send to hosts.
Now the script is copied and after that we execute the cephadm command.
I would like to copy it only one time (when adding new hosts) but this will be
part of a future PR, together with other prs to:

  • Introduce cephadm version
  • Get rid of packaged/root mode
  • Use pex or eggs

Signed-off-by: Juan Miguel Olmo Martínez jolmomar@redhat.com

Note:
To test this is a bit tricky .. because the cephadm binary is by default obtained from "/usr/sbin" folder in the mgr container.
The new cephadm binary included in this pr is the one to test, so in order to do it properly:

1. Use the new script to bootstarp the cluster

2. Copy the new script to a location accessible from manager container:
for example:
/usr/share/ceph/mgr/cephadm/cephadm

3. Open a ceph shell:
 Change "cephadm_path": point to the new "cephadm tool" script
# ceph config get mgr cephadm_path
/usr/sbin/cephadm

# ceph config set mgr cephadm_path /usr/share/ceph/mgr/cephadm/cephadm

# Force read the new "cephadm" script from the orchestrator module
ceph mgr module disable cephadm
ceph mgr module enable cephadm

4. Add new hosts and OSDs..
```.
 

@jmolmo jmolmo added the cephadm label Sep 30, 2020
@jmolmo jmolmo requested a review from a team September 30, 2020 13:48
@jschmid1
Copy link
Contributor

jschmid1 commented Oct 1, 2020

Looks like there's a test failure:

  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/orchestrator/_interface.py", line 295, in _finalize
    next_result = self._on_complete(self._value)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 110, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1365, in add_host
    return self._add_host(spec)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1351, in _add_host
    error_ok=True, no_fsid=True)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1284, in _run_cephadm
    self._copy_cephadm(conn, cephadm_dest)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1227, in _copy_cephadm
    stdin=self._cephadm.encode('utf-8'))
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/tests/test_cephadm.py", line 802, in _check
    raise Exception("boom: connection is dead")
Exception: boom: connection is dead

@jschmid1
Copy link
Contributor

jschmid1 commented Oct 1, 2020

For my own clarity. This PR does the following:

Every time we execute a command from mgr/cephadm, we no longer execute it by directly feeding it to python's stdin but rather save it to /var/lib/ceph/fsid/cephadm/$mgr_name and execute this script with the provided args instead.

We also mount /var/lib/ceph/fsid/ it to any type->mgr container.

Do we do anything with the mounted cephadm_path inside the container?

Is that correct @jmolmo ?

@jmolmo
Copy link
Member Author

jmolmo commented Oct 1, 2020

Looks like there's a test failure:

  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/orchestrator/_interface.py", line 295, in _finalize
    next_result = self._on_complete(self._value)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 110, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1365, in add_host
    return self._add_host(spec)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1351, in _add_host
    error_ok=True, no_fsid=True)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1284, in _run_cephadm
    self._copy_cephadm(conn, cephadm_dest)
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/module.py", line 1227, in _copy_cephadm
    stdin=self._cephadm.encode('utf-8'))
  File "/home/jenkins-build/build/workspace/ceph-pull-requests/src/pybind/mgr/cephadm/tests/test_cephadm.py", line 802, in _check
    raise Exception("boom: connection is dead")
Exception: boom: connection is dead

By the context it seems that something happened to the destination host when trying to copy cephadm to it .... By the error message we can think that maybe it has been a network error....

Can you pass the link to the log file in order to see if i can get more information?

@jmolmo
Copy link
Member Author

jmolmo commented Oct 1, 2020

For my own clarity. This PR does the following:

Every time we execute a command from mgr/cephadm, we no longer execute it by directly feeding it to python's stdin but rather save it to /var/lib/ceph/fsid/cephadm/$mgr_name and execute this script with the provided args instead.

yes. You're right. The goal here is to avoid "inject" code in execution time. We are copying the script and after that execute the commands.
I do not like to copy the script each time we need to call cephadm tool, but I hope to solve that soon in a new PR ( It seems team do not like to have lot of things together in the same PR)

We also mount /var/lib/ceph/fsid/ it to any type->mgr container.

Do we do anything with the mounted cephadm_path inside the container?

No!. Good catch!! (this was part of my "source" PR where i had more things ... ) Removed!

Is that correct @jmolmo ?

Everything correct @jschmid1

@jschmid1
Copy link
Contributor

jschmid1 commented Oct 1, 2020

Can you pass the link to the log file in order to see if i can get more information?

Click on make check at the bottom of the pullrequest.

@jschmid1
Copy link
Contributor

jschmid1 commented Oct 1, 2020

For my own clarity. This PR does the following:
Every time we execute a command from mgr/cephadm, we no longer execute it by directly feeding it to python's stdin but rather save it to /var/lib/ceph/fsid/cephadm/$mgr_name and execute this script with the provided args instead.

yes. You're right. The goal here is to avoid "inject" code in execution time. We are copying the script and after that execute the commands.
I do not like to copy the script each time we need to call cephadm tool, but I hope to solve that soon in a new PR ( It seems team do not like to have lot of things together in the same PR)

Indeed it's easier to review and judge smaller portions of code.

Looking forward to future PRs

cephadm_dest = '/var/lib/ceph/%s/cephadm/%s' % (self._cluster_fsid, self.get_mgr_id())
self._copy_cephadm(conn, cephadm_dest)

self.log.info("Tying to execute : %s" % (['sudo', 'python', cephadm_dest] + final_args))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python, not 'python'

@jmolmo
Copy link
Member Author

jmolmo commented Oct 1, 2020

It seems there is a test failing because I reuse the remoto connection object ... as we can read in the test:

"A mocked connection class that only allows the use of the connection once. If you attempt to use it again via a _check, it'll explode (go boom!)."

Somebody knows why to reuse a remoto connection object is not a good idea?

@jmolmo
Copy link
Member Author

jmolmo commented Oct 2, 2020

It seems there is a test failing because I reuse the remoto connection object ... as we can read in the test:

"A mocked connection class that only allows the use of the connection once. If you attempt to use it again via a _check, it'll explode (go boom!)."

Somebody knows why to reuse a remoto connection object is not a good idea?

Yes. Nothing like read the code ... We are reusing connections all the time ... not needed to do it explicitly :-)

@jmolmo
Copy link
Member Author

jmolmo commented Oct 6, 2020

I have modified the unit test because the previous one tries to test the "refresh" of the connection indirectly using higher level calls (check_host).
This approach has several problems:

  • You need to mock more things than the strictly needed.
  • A change in the implementation in other functions can affect negatively the test, although what is tested (deal with stale connections) is not affected by the modifications ( this is what has happened with this pr).

The modification in unit test, now test directly only the function where "reuse connection or reconnect" is implemented.

@jmolmo jmolmo requested review from jschmid1 and toabctl October 6, 2020 11:09
@jmolmo jmolmo force-pushed the rm_injected_argv branch 6 times, most recently from c651b7e to 1fbb837 Compare October 7, 2020 15:30
@jmolmo
Copy link
Member Author

jmolmo commented Oct 8, 2020

jenkins test make check

Removed the injected_argv parameter and the injection of code in the cephadm
script we send to hosts.
Now the script is copied and after that we execute the cephadm command.
I would like to copy it only one time (when adding new hosts) but this will be
 part of a future PR, together with other prs to:
 - Introduce cephadm version
 - Get rid of packaged/root mode
 - Use pex or eggs

Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>
Added container image hash to cephadm binary location.

Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>
Copy link
Contributor

@sebastian-philipp sebastian-philipp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit and then I think we're good to go

Comment on lines +1136 to +1138
mgrd = self.cache.get_daemon("mgr.%s" % self.get_mgr_id())
cephadm_dest = '/var/lib/ceph/%s/cephadm/%s/%s/cephadm' % (
self._cluster_fsid, self.get_mgr_id(), mgrd.container_image_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks already good. There is just one problem: in a vstart cluster, there is no
containerized MGR. Therefore I think we can simplify this to:

Suggested change
mgrd = self.cache.get_daemon("mgr.%s" % self.get_mgr_id())
cephadm_dest = '/var/lib/ceph/%s/cephadm/%s/%s/cephadm' % (
self._cluster_fsid, self.get_mgr_id(), mgrd.container_image_id)
cephadm_hash = sha256(self._read_cephadm_script())
cephadm_dest = f'/var/lib/ceph/{self._cluster_fsid}/cephadm-{cephadm_hash}/cephadm'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping. Needs rebase: now you also need to fix

def _deploy_cephadm_binary(self, host: str) -> bool:
# Use tee (from coreutils) to create a copy of cephadm on the target machine
self.log.info(f"Deploying cephadm binary to {host}")
with self._remote_connection(host) as tpl:
conn, _connr = tpl
_out, _err, code = remoto.process.check(
conn,
['tee', '-', '/var/lib/ceph/{}/cephadm'.format(self._cluster_fsid)],
stdin=self._cephadm.encode('utf-8'))
return code == 0

to contain the hash.

Comment on lines +1136 to +1138
mgrd = self.cache.get_daemon("mgr.%s" % self.get_mgr_id())
cephadm_dest = '/var/lib/ceph/%s/cephadm/%s/%s/cephadm' % (
self._cluster_fsid, self.get_mgr_id(), mgrd.container_image_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping. Needs rebase: now you also need to fix

def _deploy_cephadm_binary(self, host: str) -> bool:
# Use tee (from coreutils) to create a copy of cephadm on the target machine
self.log.info(f"Deploying cephadm binary to {host}")
with self._remote_connection(host) as tpl:
conn, _connr = tpl
_out, _err, code = remoto.process.check(
conn,
['tee', '-', '/var/lib/ceph/{}/cephadm'.format(self._cluster_fsid)],
stdin=self._cephadm.encode('utf-8'))
return code == 0

to contain the hash.

@liewegas
Copy link
Member

Is the goal to eliminate injected_args, or to avoid the overhead of transferring the script each time? If it is the former, we just replace that with an environment variable. e.g.,

                    out, err, code = remoto.process.check(
                        conn,
                        [python, '-u'],
                        stdin=script.encode('utf-8'),
                        extend_env={'CEPHADM_ARGS': json.dumps(args)})

@sebastian-philipp
Copy link
Contributor

Is the goal to eliminate injected_args, or to avoid the overhead of transferring the script each time?

Both: injected_args prevents us from splitting up bin/cephadm into multiple source code files. And in addition, we already have cephadm deployed on the hosts:

def _deploy_cephadm_binary(self, host: str) -> bool:
# Use tee (from coreutils) to create a copy of cephadm on the target machine
self.log.info(f"Deploying cephadm binary to {host}")
with self._remote_connection(host) as tpl:
conn, _connr = tpl
_out, _err, code = remoto.process.check(
conn,
['tee', '-', '/var/lib/ceph/{}/cephadm'.format(self.mgr._cluster_fsid)],
stdin=self.mgr._cephadm.encode('utf-8'))
return code == 0

why not use it, if it is there already.

@sebastian-philipp
Copy link
Contributor

just created #39141 to fix the unsafe cephadm daemon path

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants