Description
This issue was first reported in the SONiC community under the following link:
SONiC Issue #21931
Issue Description
The issue occurs with FRR 10.0.1: the set src routemap for zebra is lost during boot-up. The problem is observed only on specific hardware platforms and cannot be reproduced on KVM.
By default, mgmtd loads the initial configuration for zebra. Additionally, the bgp configuration is applied via the bgpcfgd helper function, which executes:
vtysh -f /tmp/tmpfile
Observations
Through debugging, we identified that the issue arises when multiple sessions are managed by mgmtd concurrently.
From the log output below, we see that session 18 was initiated at 04:23:26.249031 and destroyed at 04:24:13.050618. Meanwhile, sessions 19 and 20 were also created and destroyed during this period. However, during the cleanup process, the dnode of the candidate configuration was erroneously removed, even though it did not belong to sessions 19 or 20, leading to the loss of the set src routemap.
Excerpt from the log:
2025 Mar 29 04:23:25.745244 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 17 with session-id: 17
2025 Mar 29 04:23:25.767625 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 17
2025 Mar 29 04:23:26.249031 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 18 with session-id: 18
2025 Mar 29 04:23:43.927126 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 19 with session-id: 19
2025 Mar 29 04:23:44.117662 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 19
2025 Mar 29 04:24:01.425006 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 20 with session-id: 20
2025 Mar 29 04:24:01.758119 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 20
2025 Mar 29 04:24:13.050618 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 18
2025 Mar 29 04:24:16.990561 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 21 with session-id: 21
2025 Mar 29 04:24:17.020653 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 21
2025 Mar 29 04:24:32.146610 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 22 with session-id: 22
2025 Mar 29 04:24:32.154911 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 22
2025 Mar 29 04:24:47.251026 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 23 with session-id: 23
2025 Mar 29 04:24:47.259175 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 23
2025 Mar 29 04:25:02.360972 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [NBFTJ-ZQJDX] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (create) for client-id 24 with session-id: 24
2025 Mar 29 04:25:02.369800 strtk5-msn2700-02 DEBUG bgp#mgmtd[33]: [R58SJ-Q04FG] FE-CLIENT: mgmt_fe_client_handle_msg: Got SESSION_REPLY (destroy) for session-id 24
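To make the overlap in the excerpt easier to spot, the following is a minimal sketch that pairs the SESSION_REPLY create/destroy lines and reports which sessions were created and destroyed inside another session's lifetime. The function names (session_intervals, nested_sessions) and the regular expression are illustrative assumptions, not part of FRR or SONiC; the sketch assumes the log format shown above and an English locale for the month name.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the excerpt above:
#   "<YYYY> <Mon> <DD> <HH:MM:SS.ffffff> ... SESSION_REPLY (create|destroy) for ... session-id[:] <N>"
LINE_RE = re.compile(
    r"(?P<ts>\d{4} \w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"SESSION_REPLY \((?P<op>create|destroy)\) for .*?session-id:? (?P<sid>\d+)"
)


def session_intervals(lines):
    """Return {session_id: (created, destroyed)} for sessions seen from create to destroy."""
    created, intervals = {}, {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a SESSION_REPLY line
        ts = datetime.strptime(m["ts"], "%Y %b %d %H:%M:%S.%f")
        sid = int(m["sid"])
        if m["op"] == "create":
            created[sid] = ts
        elif sid in created:
            intervals[sid] = (created.pop(sid), ts)
    return intervals


def nested_sessions(intervals):
    """Map each session to the sessions that started and ended within its lifetime."""
    result = {}
    for outer, (o_start, o_end) in intervals.items():
        inner = sorted(
            sid
            for sid, (start, end) in intervals.items()
            if sid != outer and o_start < start and end < o_end
        )
        if inner:
            result[outer] = inner
    return result
```

Feeding the excerpt to nested_sessions(session_intervals(lines)) should flag session 18 as enclosing sessions 19 and 20, which matches the manual reading of the timestamps.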
Debugging with GDB
Using GDB, we monitored the dnode associated with the lost set src routemap. The call stack confirms that the routemap dnode was freed during the destruction of another session:
Breakpoint 1.1, mgmt_txn_process_set_cfg (thread=) at ../mgmtd/mgmt_txn.c:564
564 ../mgmtd/mgmt_txn.c: No such file or directory.
(gdb) print rm_s
rm_set_src_node rm_src_node
(gdb) print rm_src_node
$1 = (struct lyd_node *) 0x6080003181a0
(gdb) print rm_set_src_node
$2 = (struct lyd_node *) 0x6080003181a0
(gdb) watch rm_set_src_node->next
Hardware watchpoint 2: rm_set_src_node->next
(gdb) watch rm_set_src_node->prev
Hardware watchpoint 3: rm_set_src_node->prev
(gdb) watch rm_set_src_node->hash
Hardware watchpoint 4: rm_set_src_node->hash
(gdb) disable breakpoints 1
(gdb) c
Continuing.
Hardware watchpoint 4: rm_set_src_node->hash
Old value = 458022941
New value = 0
__asan::Allocator::Deallocate (alloc_type=__asan::FROM_MALLOC, stack=0x7ffdd9dc0e40, delete_alignment=0, delete_size=0, ptr=0x6080003181a0, this=0x7ff7ede80dc0 <__asan::instance>) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:698
698 ../../../../src/libsanitizer/asan/asan_allocator.cpp: No such file or directory.
(gdb) bt
#0 __asan::Allocator::Deallocate (alloc_type=__asan::FROM_MALLOC, stack=0x7ffdd9dc0e40, delete_alignment=0, delete_size=0, ptr=0x6080003181a0, this=0x7ff7ede80dc0 <__asan::instance>) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:698
#1 __asan::asan_free (ptr=ptr@entry=0x6080003181a0, stack=stack@entry=0x7ffdd9dc0e40, alloc_type=alloc_type@entry=__asan::FROM_MALLOC) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:955
#2 0x00007ff7eddfc67f in _interceptor_free (ptr=0x6080003181a0) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53
#3 0x00007ff7ed8205b9 in lyd_free_subtree (node=0x6070002aa900, top=) at ./src/tree_data_free.c:174
#4 0x00007ff7ed8205b9 in lyd_free_subtree (node=0x6070002aa7b0, top=) at ./src/tree_data_free.c:174
#5 0x00007ff7ed8205b9 in lyd_free_subtree (node=0x6070002a8e50, top=) at ./src/tree_data_free.c:174
#6 0x00007ff7ed8205b9 in lyd_free_subtree (node=0x6070002a88a0, top=) at ./src/tree_data_free.c:174
#7 0x00007ff7ed8205b9 in lyd_free_subtree (node=0x6070002a27f0, top=) at ./src/tree_data_free.c:174
#8 0x00007ff7ed82073e in lyd_free (node=, top=top@entry=1 '\001') at ./src/tree_data_free.c:237
#9 0x00007ff7ed8209ea in lyd_free_all (node=) at ./src/tree_data_free.c:250
#10 0x00007ff7edbe91b9 in yang_dnode_free (dnode=) at ../lib/yang.c:643
#11 0x00007ff7edb88dd0 in nb_config_replace (config_dst=0x6020000afa70, config_src=0x6020000a5070, preserve_source=) at ../lib/northbound.c:381
#12 0x000055fe26caa823 in mgmt_ds_replace_dst_with_src_ds (src=0x55fe26d22ea0 , dst=dst@entry=0x55fe26d22e80 ) at ../mgmtd/mgmt_ds.c:87
#13 0x000055fe26caae36 in mgmt_ds_copy_dss (src_ds_ctx=, dst_ds_ctx=0x55fe26d22e80 , updt_cmt_rec=updt_cmt_rec@entry=false) at ../mgmtd/mgmt_ds.c:256
#14 0x000055fe26cb1eb2 in mgmt_fe_session_cfg_txn_cleanup (session=0x6070002ac8f0) at ../mgmtd/mgmt_fe_adapter.c:121
#15 mgmt_fe_cleanup_session (sessionp=sessionp@entry=0x7ffdd9dc1808) at ../mgmtd/mgmt_fe_adapter.c:180
#16 0x000055fe26cb2a6b in mgmt_fe_adapter_handle_msg (fe_msg=0x604000197f90, adapter=0x61300004e8c0) at ../mgmtd/mgmt_fe_adapter.c:964
#17 mgmt_fe_adapter_process_msg (version=, data=, len=6, conn=) at ../mgmtd/mgmt_fe_adapter.c:1282
#18 0x00007ff7edb7be4a in msg_conn_send_msg (conn=conn@entry=0x615000067a00, version=version@entry=0 '\000', msg=msg@entry=0x7ffdd9dc1900, mlen=, packf=, short_circuit_ok=short_circuit_ok@entry=true) at ../lib/mgmt_msg.c:608
#19 0x00007ff7edb796c0 in mgmt_fe_client_send_msg (short_circuit_ok=true, fe_msg=0x7ffdd9dc1900, client=0x615000067a00) at ../lib/mgmt_fe_client.c:115
#20 mgmt_fe_send_session_req (client=client@entry=0x615000067a00, session=session@entry=0x60400017a450, create=create@entry=false) at ../lib/mgmt_fe_client.c:163
#21 0x00007ff7edb7a375 in mgmt_fe_destroy_client_session (client=0x615000067a00, client_id=) at ../lib/mgmt_fe_client.c:835
#22 0x00007ff7edbd926d in vty_close (vty=vty@entry=0x62b000093200) at ../lib/vty.c:2496
#23 0x00007ff7edbe059e in vtysh_read (thread=) at ../lib/vty.c:2337
#24 0x00007ff7edbcd4b1 in event_call (thread=thread@entry=0x7ffdd9dc1db0) at ../lib/event.c:2011
#25 0x00007ff7edb693e0 in frr_run (master=0x613000000040) at ../lib/libfrr.c:1212
#26 0x000055fe26ca9fcf in main (argc=3, argv=0x7ffdd9dc2018) at ../mgmtd/mgmt_main.c:279
Root Cause
The issue is caused by a race condition in mgmtd, where multiple active sessions interfere with each other's cleanup process. Specifically:
- When a session is destroyed, it mistakenly cleans up dnode structures that belong to other active sessions.
- This results in the deletion of critical routemap data, leading to the observed issue.
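The interference described above can be illustrated with a toy model. This is only a sketch under stated assumptions and is not FRR code: it assumes a single candidate datastore shared by all sessions, and it assumes that the mgmt_ds_copy_dss call in the backtrace replaces the candidate with a copy of the running datastore during session cleanup. The class ToyDatastores, its methods, and the configuration keys are hypothetical names chosen for illustration.

```python
import copy


class ToyDatastores:
    """Toy stand-in for a running datastore plus ONE candidate shared by all sessions."""

    def __init__(self, running):
        self.running = dict(running)
        self.candidate = copy.deepcopy(self.running)

    def stage(self, key, value):
        # An edit lands in the shared candidate; nothing is committed yet.
        self.candidate[key] = value

    def commit(self):
        # Commit makes the candidate contents the new running config.
        self.running = copy.deepcopy(self.candidate)

    def cleanup_session(self):
        # Assumed direction of the copy in the backtrace: candidate is replaced
        # by running, so every uncommitted edit is dropped, no matter which
        # session staged it.
        self.candidate = copy.deepcopy(self.running)


def replay():
    """Replay the suspected interleaving and return the final running config."""
    ds = ToyDatastores({"router bgp": "present"})
    ds.stage("route-map set src", "present")  # staged, not yet committed
    ds.cleanup_session()                      # an unrelated short session closes
    ds.commit()                               # the pending commit now carries no edit
    return ds.running
```

In this model, replay() returns a running config without the "route-map set src" entry, because the cleanup of the unrelated session reset the shared candidate before the commit. Whether mgmtd follows exactly this ordering on the affected platform is not confirmed by the data in this report.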
Attempts & Request for Assistance
We have explored several workarounds, but none have been successful. Could you suggest possible fixes or alternative debugging approaches to mitigate this issue?
Version
strtk5-msn2700-02# show ver
FRRouting 10.0.1 (strtk5-msn2700-02) on Linux(6.1.0-22-2-amd64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
'--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--sbindir=/usr/lib/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--disable-rpki' '--disable-scripting' '--enable-pim6d' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-multipath=514' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'
How to reproduce
On a SONiC system, run sudo config reload -y. The issue reproduces on only one specific platform.
Expected behavior
The set src routemap is added to zebra.
Actual behavior
The set src routemap is missing from zebra.
Additional context
No response
Checklist
- I have searched the open issues for this bug.
- I have not included sensitive information in this report.