Skip to content

rgw: qpid-proton amqp1.0 bucket notification#1

Closed
wangxuw wants to merge 214 commits intomasterfrom
gsoc-amqp1
Closed

rgw: qpid-proton amqp1.0 bucket notification#1
wangxuw wants to merge 214 commits intomasterfrom
gsoc-amqp1

Conversation

@wangxuw
Copy link
Owner

@wangxuw wangxuw commented Jun 10, 2021

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

sebastian-philipp and others added 30 commits May 11, 2021 13:08
Fixes: https://tracker.ceph.com/issues/50691
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
crash on multipart upload to bucket with policy
Fixes: https://tracker.ceph.com/issues/50556

Signed-off-by: Or Friedmann <ofriedma@redhat.com>
extend the common logic used by the deploy, ceph-volume, and shell
commands for validating the `--config` arg during bootstrap

Signed-off-by: Michael Fritch <mfritch@suse.com>
use the standard error message from FileNotFound:
```
cephadm bootstrap --mon-ip 192.168.1.1 --config ~/foobar
ERROR: [Errno 2] No such file or directory: '/root/foobar'
```

Signed-off-by: Michael Fritch <mfritch@suse.com>
Signed-off-by: Sage Weil <sage@newdream.net>
hash(str) is non-deterministic, probably because it is using the internal
object ID or something and not the string content?

In any case, explicitly hash the string content and use that instead.

Also, sort the input pre-shuffle to ensure that variations in the original
host list ordering don't screw with the result.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Look in dict, not encoded JSON string

Signed-off-by: Sage Weil <sage@newdream.net>
('orch ps')

Signed-off-by: Sage Weil <sage@newdream.net>
This makes 'orch ls' match up daemosn to services (and probably cleans up
other bits and pieces) when the old daemon id -> service name calc code
can't do its thing.

Signed-off-by: Sage Weil <sage@newdream.net>
The rank_map is a bit of state to keep track of which ranks are
occupied by which generation and daemon_id.

Signed-off-by: Sage Weil <sage@newdream.net>
DaemonDescription
CephadmDaemonDeploySpec
DaemonPlacement
unit.meta
get_unique_name() (we include it in the daemon_id)

Signed-off-by: Sage Weil <sage@newdream.net>
If we are passed a rank_map, use it maintain one daemon per rank, where
the ranks are consecutive non-negative integers starting from 0.

A bit of refactoring in place() so that we only do the rank allocations
on slots we are going to use (no more than count).

Signed-off-by: Sage Weil <sage@newdream.net>
This is more informative than just the hostnames.

Signed-off-by: Sage Weil <sage@newdream.net>
- we need to assign all names and update the rank_map before we start
creating daemons.
- if we are using ranks, we should delete old daemons first, and
fence them from the cluster (where possible).

Signed-off-by: Sage Weil <sage@newdream.net>
Use ranked daemons for NFS. Ganesha does not like it if multiple
instances start up with the same rank, but we need stable ranks so that
a rank can "fail over" to a new instance of a new daemon on another host
(with the same rank) for NFS client reclaim to work.

Specify a nodeid of '{service_name}.{rank}' for ganesha.

Include a unique id in the daemon_id just because this avoids some issues
with the create/destroy ordering, and because the daemon_id doesn't matter
much anymore since we are using a stable rank.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Do the grace file manipulation from the mgr module.  For add, this isn't
especially important, but for remove it is very important.  Clean out
old ranks from the grace table before we record that the rank has been
purged from the rank_map.

Signed-off-by: Sage Weil <sage@newdream.net>
This avoids any hangs due to rados.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
- use consistent hashing
- statically map across ranks
- disable backend checks so that clients don't move

Signed-off-by: Sage Weil <sage@newdream.net>
Remove the grace object if we purge the service.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
Better to raise an error; eth0 will never be correct.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
tchaikov and others added 26 commits May 27, 2021 23:07
less error-prone, and it's simpler to manage the resource using RAII

Signed-off-by: Kefu Chai <kchai@redhat.com>
before this change, cot never destructs the created ObjectStore
instances.

after this change, they are destructed upon returning from main().

Signed-off-by: Kefu Chai <kchai@redhat.com>
just for the sake of correctness, as they don't need a full-blown
std::string, what they need is but a string like object. and they always
create a std::string instance as a member variable if they want to have
a copy of it.

Signed-off-by: Kefu Chai <kchai@redhat.com>
Following crash occured at Sepia [1]:

```
INFO  2021-05-26 20:16:32,872 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] ProtocolV2::start_accept(): targ
et_addr=172.21.15.119:55220/0
DEBUG 2021-05-26 20:16:32,872 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] TRIGGER ACCEPTING, was NONE
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] SEND(26) banner: len_payload=16,
 supported=1, required=0, banner="ceph v2
"
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] RECV(10) banner: "ceph v2
"
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] GOT banner: payload_len=16
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] RECV(16) banner features: supported=1 required=0
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] WRITE HelloFrame: my_type=osd, peer_addr=172.21.15.119:55220/0
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] GOT HelloFrame: my_type=client peer_addr=v2:172.21.15.119:6803/31733
INFO  2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> client.? -@55220] UPDATE: peer_type=client, policy(lossy=true server=true standby=false resetcheck=false)
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> client.? -@55220] GOT AuthRequestFrame: method=2, preferred_modes={1, 2}, payload_len=174
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-4622-gaa1dc559/rpm/el8/BUILD/ceph-17.0.0-4622-gaa1dc559/src/crimson/mon/MonClient.cc:399:10: runtime error: member access within null pointer of type 'struct Connection'
Segmentation fault on shard 0.
Backtrace:
 0# 0x000055E84CF44C1F in ceph-osd
 1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
 3# 0x00007F2BC88C0B20 in /lib64/libpthread.so.0
 4# crimson::mon::Connection::get_conn() in ceph-osd
 5# crimson::mon::Client::handle_auth_request(seastar::shared_ptr<crimson::net::Connection>, seastar::lw_shared_ptr<AuthConnectionMeta>, bool, unsigned int, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*) in ceph-osd
 6# crimson::net::ProtocolV2::_handle_auth_request(ceph::buffer::v15_2_0::list&, bool) in ceph-osd
 7# 0x000055E84DF67669 in ceph-osd
 8# 0x000055E84DF68775 in ceph-osd
 9# 0x000055E846F47F60 in ceph-osd
10# 0x000055E85296770F in ceph-osd
11# 0x000055E85296CC50 in ceph-osd
12# 0x000055E852B1ECBB in ceph-osd
13# 0x000055E85267C73A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
Fault at location: 0x98
```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136907

When the `handle_auth_request()` happens, there is no guarantee
`active_con` is being available. This is reflected in the classical
implementation:

```cpp
int MonClient::handle_auth_request(
  Connection *con,
  // ...
  ceph::buffer::list *reply)
{
  // ...
  bool isvalid = ah->verify_authorizer(
    cct,
    *rotating_secrets,
    payload,
    auth_meta->get_connection_secret_length(),
    reply,
    &con->peer_name,
    &con->peer_global_id,
    &con->peer_caps_info,
    &auth_meta->session_key,
    &auth_meta->connection_secret,
    ac);
```

The patch transplate the same logic to crimson.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
If the host IP/addr is known, use that.  The addr might even be a FQDN
instead of an IP address, in which case we want to look that up instead
of the bare hostname.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
- Use a centralized method get_mgr_ip()
- Look up the hostname via DNS.  This is a bit more reliable than
getfqdn() since it will work even when podman adds the container
name to /etc/hosts.

Signed-off-by: Sage Weil <sage@newdream.net>
Previously we allowed the host.addr to be a DNS name (short or fqdn).
This is problematic because of the inconsistent way that docker and podman
handle /etc/hosts, and undesirable because relying on external DNS is
an external source of failure for the cluster without any benefit in
return (simply updating DNS is not sufficient to make ceph behave).

So: update any non-IP to an IP as soon as we start up (presumably on
upgrade).  If we get a loopback address (127.0.0.1 or 127.0.1.1), then
wait and hope that the next instance of the manager has better luck.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
This reverts cfc1f91, which is no longer
neceesary because (1) we don't use socket.getfqdn(), and (2) we generally
do not rely on DNS or /etc/hosts at all anymore (with the exception of
the upgrade transition).

Signed-off-by: Sage Weil <sage@newdream.net>
…rvice-status-improvement-2021-05-26

doc/cephadm: enrich "service status"

Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
* refs/pull/41483/head:
	cephadm: stop passing --no-hosts to podman
	mgr/nfs: use host.addr for backend IP where possible
	mgr/cephadm: convert host addr if non-IP to IP
	mgr/dashboard,prometheus: new method of getting mgr IP
	doc/cephadm: remove any reference to the use of DNS or /etc/hosts
	mgr/cephadm: use known host addr
	mgr/cephadm: resolve IP at 'orch host add' time

Reviewed-by: Sebastian Wagner <swagner@suse.com>
doc: 15.2.13 Release Notes

Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Reviewed-by: Casey Bodley <cbodley@redhat.com>
Reviewed-by: Sebastian Wagner <sebastian.wagner@suse.com>
Reviewed-by: Ramana Raja <rraja@redhat.com>
Reviewed-by: Neha Ojha <nojha@redhat.com>
less repeating this way

Signed-off-by: Kefu Chai <kchai@redhat.com>
doc/mgr: use confval directive to define options

Reviewed-by: Neha Ojha <nojha@redhat.com>
crimson/monc: handle_auth_request() doesn't depend on active_con.

Reviewed-by: Kefu Chai <kchai@redhat.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
Signed-off-by: Yingxin Cheng <yingxin.cheng@intel.com>
…xtent

crimson/seastore: introduce and adopt LBAManager::get_mapping(t, offset)

Reviewed-by: Kefu Chai <kchai@redhat.com>
os/bluestore: pass string_view to ctor of Allocator

Reviewed-by: Igor Fedotov <ifedotov@suse.com>
os: let ObjectStore::create() return unique_ptr<>

Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…user-no-hosts

mgr/cephadm: Don't call _check_host without hosts

Reviewed-by: Juan Miguel Olmo <jolmomar@redhat.com>
Reviewed-by: Adam King <adking@redhat.com>
…blocking-io-during-index-resharding

rgw: add the description of blocking io during index resharding

Reviewed-by: Matt Benjamin mbenjamin@redhat.com
Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
wangxuw pushed a commit that referenced this pull request Jun 15, 2021
`OpSequencer` assumes that ID of a previous client request
is always lower than ID of current one. This is reflected
by the assertion in `OpSequencer::start_op()`. It triggered
the following failure [1] in Teuthology:

```
DEBUG 2021-05-07 08:01:41,227 [shard 0] osd - client_request(id=1, detail=osd_op(client.4171.0:1 2.2 2.7c339972 (undecoded) ondisk+retry+read+known_if_redirected e29) v8) same_interval_since: 31
ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3910-g1b18e076/rpm/el8/BUILD/ceph-
17.0.0-3910-g1b18e076/src/crimson/osd/osd_operation_sequencer.h:38: seastar::futurize_t<Result> crimson::osd::OpSequencer::start_op(HandleT&, uint64_t, uint64_t, FuncT&&) [with HandleT = crimson::PipelineHa
ndle; FuncT = crimson::interruptible::interruptor<InterruptCond>::wrap_function(Func&&) [with Func = crimson::osd::ClientRequest::start()::<lambda()> mutable::<lambda(Ref<crimson::osd::PG>)> mutable::<lambd
a()> mutable::<lambda()>; InterruptCond = crimson::osd::IOInterruptCondition]::<lambda()>; Result = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<>
>; seastar::futurize_t<Result> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, seastar::future<> >; uint64_t = long unsigned int]: Assertion `prev_op < this_op' fai
led.
Aborting on shard 0.
Backtrace:
Segmentation fault.
Backtrace:
 0# 0x00005592B028932F in ceph-osd
 1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
 2# FatalSignal::install_oneshot_signal_handler<6>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
 3# 0x00007F57B72E7B20 in /lib64/libpthread.so.0
 4# gsignal in /lib64/libc.so.6
 5# abort in /lib64/libc.so.6
 6# 0x00007F57B58E2B09 in /lib64/libc.so.6
 7# 0x00007F57B58F0DE6 in /lib64/libc.so.6
 8# 0x00005592ABB8484D in ceph-osd
 9# 0x00005592ABB8ACB3 in ceph-osd
10# seastar::continuation<seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<boost::intrusive_ptr<crimson::osd::PG> >::then_impl_nrvo<seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>, seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > >(seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<seastar::bool_class<seastar::stop_iteration_tag> >&&, seastar::noncopyable_function<seastar::future<seastar::bool_class<seastar::stop_iteration_tag> > (boost::intrusive_ptr<crimson::osd::PG>&&)>&, seastar::future_state<boost::intrusive_ptr<crimson::osd::PG> >&&)#1}, boost::intrusive_ptr<crimson::osd::PG> >::run_and_dispose() in ceph-osd
11# 0x00005592B357F88F in ceph-osd
12# 0x00005592B3584DD0 in ceph-osd
```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-07_07:41:02-rados-master-distro-basic-smithi/6104530

Crash analysis resulted in two observations:
1. during the request execution the acting set got
   changed, the request was interrupted and a try
   to re-execute it emerged;
2. the interrupted request was the very first client
   request the OSD has ever seen.

Code analysis showed a problem in how `ClientRequest`
establishes `prev_op_id`: although supposed to be performed
only once for a request, it can get executed twice but only
for the very first request `OpSequencer` saw.

```cpp
void ClientRequest::may_set_prev_op()
{
  // set prev_op_id if it's not set yet
  if (__builtin_expect(prev_op_id == 0, true)) {
    prev_op_id = sequencer.get_last_issued();
  }
}
```

Unfortunately, `0` isn't a distincted value that cannot
be returned by `get_last_issued()`:

```cpp
class OpSequencer {
  // ...

  uint64_t get_last_issued() const {
    return last_issued;
  }

  // ...

  // the id of last op which is issued
  uint64_t last_issued = 0;
```

As a result, `OpSequencer` returned on the second call
a new value (actually `this_op`) violating the assertion.
The commit fixes the problem by switching from a designated
value to `std::optional`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
wangxuw pushed a commit that referenced this pull request Jun 15, 2021
f7181ab has optimized the client
parallelism. To achieve that `PG::do_osd_ops()` were converted to
return basically future pair of futures. Unfortunately, the life-
time management of `OpsExecuter` was kept intact. In the result,
the object was valid only till fullfying the outer future while,
due to the `rollbacker` instances, it should be available till
`all_completed` becomes available.

This issue can explain the following problem has been observed
in a Teuthology job [1].

```
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - do_op_call: method returned ret=-17, outdata.length()=0 while num_read=1, num_write=0
DEBUG 2021-05-20 08:03:22,617 [shard 0] osd - rollback_obc_if_modified: object 19:e17d4708:test-rados-api-smithi095-38404-2::foo:head got erro
r generic:17, need_rollback=false
=================================================================
==33626==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0000b9320 at pc 0x560f486b8222 bp 0x7fffc467a1e0 sp 0x7fffc467a1d0
READ of size 4 at 0x60d0000b9320 thread T0
    #0 0x560f486b8221  (/usr/bin/ceph-osd+0x2c610221)
    #1 0x560f4880c6b1 in seastar::continuation<seastar::internal::promise_base_with_type<boost::intrusive_ptr<MOSDOpReply> >, seastar::noncopy
able_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_
wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > >
 > ()>, seastar::future<void>::then_impl_nrvo<seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::
IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson:
:errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>, crimson::interruptible::interruptible_future_detail<crimson::osd::IOInte
rruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::error
ated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > >(seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail
<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_
future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&&)::{lambda(seastar::internal::promise_base_with_type<boos
t::intrusive_ptr<MOSDOpReply> >&&, seastar::noncopyable_function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<(std::errc)11> > >::_future<crimson::errorated_future_marker<boost::intrusive_ptr<MOSDOpReply> > > > ()>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() (/usr/bin/ceph-osd+0x2c7646b1)
    #2 0x560f5352c3ae  (/usr/bin/ceph-osd+0x374843ae)
    ceph#3 0x560f535318ef  (/usr/bin/ceph-osd+0x374898ef)
    ceph#4 0x560f536e395a  (/usr/bin/ceph-osd+0x3763b95a)
    ceph#5 0x560f532413d9  (/usr/bin/ceph-osd+0x371993d9)
    ceph#6 0x560f476af95a in main (/usr/bin/ceph-osd+0x2b60795a)
    ceph#7 0x7f7aa0af97b2 in __libc_start_main (/lib64/libc.so.6+0x237b2)
    ceph#8 0x560f477d2e8d in _start (/usr/bin/ceph-osd+0x2b72ae8d)

```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-20_07:28:16-rados-master-distro-basic-smithi/6124735/

The commit deals with the problem by repacking the outer future.
An alternative could be in switching from `std::unique_ptr` to
`seastar::shared_ptr` for managing `OpsExecuter`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
wangxuw pushed a commit that referenced this pull request Jun 15, 2021
Following crash occured at Sepia [1]:

```
INFO  2021-05-26 20:16:32,872 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] ProtocolV2::start_accept(): targ
et_addr=172.21.15.119:55220/0
DEBUG 2021-05-26 20:16:32,872 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] TRIGGER ACCEPTING, was NONE
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] SEND(26) banner: len_payload=16,
 supported=1, required=0, banner="ceph v2
"
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] RECV(10) banner: "ceph v2
"
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] GOT banner: payload_len=16
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] RECV(16) banner features: supported=1 required=0
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] WRITE HelloFrame: my_type=osd, peer_addr=172.21.15.119:55220/0
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> unknown.? -@55220] GOT HelloFrame: my_type=client peer_addr=v2:172.21.15.119:6803/31733
INFO  2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> client.? -@55220] UPDATE: peer_type=client, policy(lossy=true server=true standby=false resetcheck=false)
DEBUG 2021-05-26 20:16:32,873 [shard 0] ms - [osd.0(client) v2:172.21.15.119:6803/31733 >> client.? -@55220] GOT AuthRequestFrame: method=2, preferred_modes={1, 2}, payload_len=174
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-4622-gaa1dc559/rpm/el8/BUILD/ceph-17.0.0-4622-gaa1dc559/src/crimson/mon/MonClient.cc:399:10: runtime error: member access within null pointer of type 'struct Connection'
Segmentation fault on shard 0.
Backtrace:
 0# 0x000055E84CF44C1F in ceph-osd
 1# FatalSignal::signaled(int, siginfo_t const*) in ceph-osd
 2# FatalSignal::install_oneshot_signal_handler<11>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) in ceph-osd
 3# 0x00007F2BC88C0B20 in /lib64/libpthread.so.0
 4# crimson::mon::Connection::get_conn() in ceph-osd
 5# crimson::mon::Client::handle_auth_request(seastar::shared_ptr<crimson::net::Connection>, seastar::lw_shared_ptr<AuthConnectionMeta>, bool, unsigned int, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*) in ceph-osd
 6# crimson::net::ProtocolV2::_handle_auth_request(ceph::buffer::v15_2_0::list&, bool) in ceph-osd
 7# 0x000055E84DF67669 in ceph-osd
 8# 0x000055E84DF68775 in ceph-osd
 9# 0x000055E846F47F60 in ceph-osd
10# 0x000055E85296770F in ceph-osd
11# 0x000055E85296CC50 in ceph-osd
12# 0x000055E852B1ECBB in ceph-osd
13# 0x000055E85267C73A in ceph-osd
14# main in ceph-osd
15# __libc_start_main in /lib64/libc.so.6
16# _start in ceph-osd
Fault at location: 0x98
```

[1]: http://pulpito.front.sepia.ceph.com/rzarzynski-2021-05-26_12:20:26-rados-master-distro-basic-smithi/6136907

When the `handle_auth_request()` happens, there is no guarantee
`active_con` is being available. This is reflected in the classical
implementation:

```cpp
int MonClient::handle_auth_request(
  Connection *con,
  // ...
  ceph::buffer::list *reply)
{
  // ...
  bool isvalid = ah->verify_authorizer(
    cct,
    *rotating_secrets,
    payload,
    auth_meta->get_connection_secret_length(),
    reply,
    &con->peer_name,
    &con->peer_global_id,
    &con->peer_caps_info,
    &auth_meta->session_key,
    &auth_meta->connection_secret,
    ac);
```

The patch transplate the same logic to crimson.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@wangxuw wangxuw closed this Jun 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.