When using SNMPv3, some devices may show polling issues #5797

@bernisys

Description

Describe the bug
Among the device types we monitor, a few have major problems during polling cycles. So far, mainly Juniper and Cisco ASA devices are not polled properly. The devices were all adapted with the change-device.php CLI script, but I doubt the script did anything odd. In fact, the credentials seem to be working fine: when we open the "device edit" page in the Cacti GUI, the SNMP information comes back from all devices without issue. So I assume something within spine causes this issue and reacts differently than the PHP-based parts used by the GUI.

I suspect something odd is happening on the SNMP side that causes spine to drop the information, or maybe a packet is not coming back properly. In the capture I made, however, I can see packets being sent and arriving properly (not 100% sure, because I can't decode the content due to the encryption). What is striking is that the packets always come in pairs with an identical encapsulated data portion. Is that normal behavior?

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...' eBay and buy an old Cisco ASA ...
    OK, just kidding; I am not sure this is really reproducible. The interesting part is that everything looks fine outside of spine.
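One way to check the credentials outside of both spine and the GUI is a plain net-snmp query against the reindex OID from the debug log. The user name and passphrases below are placeholders, and the timeout/retry flags mirror the 2500 ms / 3 retries spine reports; substitute the real values:

```shell
# Hypothetical SNMPv3 credentials; replace user, passphrases and algorithms
# with the ones configured on the device. Queries sysUpTime
# (.1.3.6.1.2.1.1.3.0), the OID spine uses for auto reindexing.
snmpget -v3 -l authPriv \
        -u cactiuser -a SHA -A 'authpassphrase' \
        -x AES -X 'privpassphrase' \
        -t 2.5 -r 3 \
        xxx.xxx.xxx.xxx:161 .1.3.6.1.2.1.1.3.0
```

If this returns promptly while spine still stalls, the problem is narrowed down to spine's SNMP session handling rather than the device or the credentials.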

Server (please complete the following information):

  • OS: RHEL
  • Version 7.9

Compiling (please complete the following information):

  • compiler: gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
  • autoconf: autoconf (GNU Autoconf) 2.69
  • glibc: 2.17-326.el7_9
  • source: release 1.2.17 / 23 / 25 (tested with all 3 versions)

Additional context
I did some spine runs in debug mode, and all of them show rather large delays between the "Updating Full System Information Table" and the "RECACHE: Processing 3 items in the auto reindex cache" lines; the length of the gap depends on the timeout value (tested with 1500 ms and 2500 ms). Here is a shortened version; maybe you already get some ideas, and if you guide me a bit I could add more debugging code in specific places, recompile and gather more details.

      ... skipping the init sequence, which looks good
1719990805.378535 Total[0.1096] Device[303] Checking for System Information Update
1719990805.378570 Total[0.1096] Device[303] Updating Full System Information Table
      (15s spent here with no output - timeout is 2500ms, means 6x timeout reached somewhere?)
1719990820.842316 Total[0.1172] DEBUG: Device[303] HT[1] RECACHE: Processing 3 items in the auto reindex cache for 'xx.xx.xx.xx'
1719990820.842428 Total[0.1181] Device[303] HT[1] DQ[1] Legacy Uptime Result: 906303816, Is Numeric: 1
1719990820.842466 Total[0.1188] Device[303] HT[1] DQ[1] Extended Uptime Result: 20023448200, Is Numeric: 1
1719990820.842572 Total[0.1188] Device[303] HT[1] DQ[32] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 20023438800 < output: 20023448200)
1719990820.842608 Total[0.1199] Device[303] HT[1] NOTE: There are '31' Polling Items for this Device
1719990820.842645 Total[0.1199] The executable is '/usr/bin/perl' in 'perl /opt/SP/cacti/cacti/scripts/ping.pl 'xxx.xxx.xxx.xxx''
1719990820.842748 Total[0.1202] DEBUG: Device[303] DEBUG: The NIFTY POPEN returned the following File Descriptor 46
1719990820.842784 Total[0.1342] Device[303] HT[1] DS[4517] TT[14.26] SCRIPT: perl /opt/SP/cacti/cacti/scripts/ping.pl 'xxx.xxx.xxx.xxx', output: 1.87
1719990820.842819 Total[0.1342] Device[303] INFO: SNMP Device 'xxx.xxx.xxx.xxx:161' has a timeout of 2500000 (2500), with 3 retries
1719990820.842854 Total[0.5794] WARNING: Device[303] polling sleeping while waiting for 1 Threads to End
1719990820.842890 Total[1.0795] WARNING: Device[303] polling sleeping while waiting for 1 Threads to End
      ... lots of those thread-waits
1719990820.843559 Total[10.1447] Device[303] HT[1] DS[14613] WARNING: SNMP timeout detected [2500 ms], ignoring host 'xxx.xxx.xxx.xxx'
1719990820.843595 Total[10.1447] Device[303] HT[1] DS[14613] TT[10010.47] SNMP: v3: xxx.xxx.xxx.xxx, dsname: asa_connections, oid: 1.3.6.1.4.1.9.9.147.1.2.2.2.1.5.40.6, value: U
1719990820.843631 Total[10.5820] WARNING: Device[303] polling sleeping while waiting for 1 Threads to End
      ... lots of those thread-waits again, then another delay of 5s
1719990820.843959 Total[15.0834] WARNING: Device[303] polling sleeping while waiting for 1 Threads to End
1719990825.845514 Total[15.5835] WARNING: Device[303] polling sleeping while waiting for 1 Threads to End
      ... more thread-waits
1719990825.846107 Total[20.0848] WARNING: Device[303] polling sleeping while waiting for 1 Threads to End
1719990825.846145 Total[20.1546] Device[303] HT[1] DS[14614] WARNING: SNMP timeout detected [2500 ms], ignoring host 'xxx.xxx.xxx.xxx'
1719990825.846178 Total[20.1546] Device[303] HT[1] DS[14614] TT[20020.33] SNMP: v3: xxx.xxx.xxx.xxx, dsname: asa_active_sessions, oid: 1.3.6.1.4.1.9.9.392.1.3.1.0, value: U
1719990825.846212 Total[20.1548] DEBUG: Setting up writes to remote database
1719990825.846246 Total[20.1549] Device[303] HT[1] Total Time: 20 Seconds
1719990825.846280 Total[20.1555] DEBUG: Freeing Local Pool ID 0
1719990825.846315 Total[20.1555] DEBUG: Freeing Remote Pool ID 0
1719990825.846351 Total[20.1555] DEBUG: Device[303] HT[1] DEBUG: HOST COMPLETE: About to Exit Device Polling Thread Function
1719990825.846388 Total[20.5850] The final count of Threads is 0
1719990825.846423 Total[20.5850] INFO: Thread complete for Device[303] and 0 to 0 sources
1719990825.846459 Total[20.5850] DEBUG: Closing Local Connection Pool ID 0
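For what it's worth, the long tail of the run looks like plain timeout arithmetic rather than anything exotic. Assuming spine tries each OID once plus `retries` times and polls failing OIDs one after another (an assumption on my side), a 2500 ms timeout with 3 retries costs 10 s per unanswered OID, which matches TT[10010.47] for DS[14613]; a second failing OID lands at 20 s, matching TT[20020.33] and the "Total Time: 20 Seconds" line:

```shell
# Back-of-the-envelope check: one unanswered OID costs (retries + 1)
# attempts of timeout_ms each; failing OIDs are assumed to be polled
# sequentially within the thread.
timeout_ms=2500
retries=3
attempts=$((retries + 1))
one_oid_s=$((attempts * timeout_ms / 1000))
two_oids_s=$((2 * attempts * timeout_ms / 1000))
echo "$one_oid_s"   # 10 -> matches TT[10010.47] ms for DS[14613]
echo "$two_oids_s"  # 20 -> matches TT[20020.33] ms and "Total Time: 20 Seconds"
```

The same arithmetic would also fit the initial 15 s gap as roughly six elapsed timeouts, as guessed in the log comments above.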

Metadata

Labels

bug: Undesired behaviour
confirmed: Bug is confirmed by dev team
resolved: A fixed issue
