Errors in sending notifications to configured platforms.
Details: Checks alertmanager logs for 'Error on notify' or 'Notify for alerts failed' messages.
Required privileges: cvp
Certificate upload fails due to stale entries in the database.
Bug ID: 491453 (public link)
Introduced in: 2020.1.0
Links: https://www.arista.com/en/support/software-bug-portal/bugdetail?bug_id=491453
Internal Links: https://sites.google.com/arista.com/cvp-tac/cvp-certs/cvp-certs-gtk
Details: Reads the /eventSubscriber/ids path in the 'cvp' dataset.
Required privileges: cvp
Removes all the ids under /eventSubscriber/ids path.
Required privileges: cvp
Certain scenarios can lead to ZtpMode being set to "true" for provisioned devices at various paths in the NetDb
Bug ID: 528983 (public link)
Introduced in: 2019.1.0
Fixed in: 2021.1.0
Links: https://www.arista.com/en/support/software-bug-portal/bugdetail?bug_id=528983
Internal Links: https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-onboarding-issues
Bug ID: 603699 (public link)
Introduced in: 2019.1.0
Fixed in: 2021.2.1
Links: https://www.arista.com/en/support/software-bug-portal/bugdetail?bug_id=603699
Internal Links: https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-onboarding-issues
Details: Reads the /provisioning/device/ids and /ztpService/status/device/ids paths in the 'cvp' dataset to find devices with ZtpMode set to true where the ParentContainerKey value is not equal to 'undefined_container'
Required privileges: cvp
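The selection rule above can be sketched as a simple filter. This is only an illustration: the dictionary layout and the `find_stale_ztp_devices` name are assumptions for the example, not the actual NetDb record format or the bugcheck's code.

```python
from typing import Dict, List

UNDEFINED_CONTAINER = "undefined_container"


def find_stale_ztp_devices(devices: Dict[str, dict]) -> List[str]:
    """Return ids of devices flagged as ZTP although they sit in a real container.

    `devices` maps a device id to its stored value. The dict shape is an
    assumption made for this sketch, not the real NetDb layout.
    """
    stale = []
    for device_id, value in devices.items():
        in_ztp = value.get("ZtpMode") is True
        provisioned = value.get("ParentContainerKey") != UNDEFINED_CONTAINER
        if in_ztp and provisioned:
            stale.append(device_id)
    return sorted(stale)


# Hypothetical sample data: only dev-a is both in ZTP mode and provisioned.
sample = {
    "dev-a": {"ZtpMode": True, "ParentContainerKey": "container_12"},
    "dev-b": {"ZtpMode": True, "ParentContainerKey": "undefined_container"},
    "dev-c": {"ZtpMode": False, "ParentContainerKey": "container_12"},
}
print(find_stale_ztp_devices(sample))
```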
Rewrites key values for affected devices in scanned paths to set ZtpMode to false
Required privileges: cvp
Invalid CVP backend certificates
Internal Links: https://sites.google.com/arista.com/cvp-tac/uncommon-issues/components-cannot-come-up-due-to-expired-aeris-certificate-bug591049
Details: Checks that backend certificates have not expired and will not expire within the next 30 days, do not have start dates in the future, have a valid certificate chain, and that the certificate file in the filesystem matches the certificate loaded into kubernetes.
When checking logs, we look for pods in the CrashLoopBackOff state and compare them to a list of components that are known to fail if certificates have expired: aaa, aeris-ccapi, audit, ccapi, cloudmanager, enroll, image, inventory, snapshot, ztp. If the crashed components match this list, we report an error message.
However, if CVP services haven't been restarted, pods might still be running. In this case we check for services logging Context Deadline Exceeded messages and compare them to the same list. If there is a match, we report a warning message.
Required privileges: cvp
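The date-related part of this check can be sketched as follows. It assumes the certificate's validity dates have already been parsed into timezone-aware datetimes. The function name and the returned labels are hypothetical, and the chain and kubernetes comparisons are not covered here.

```python
from datetime import datetime, timedelta, timezone

EXPIRY_WINDOW = timedelta(days=30)  # window named in the check description


def classify_validity(not_before, not_after, now=None):
    """Classify a certificate by its validity dates.

    Returns one of "not_yet_valid", "expired", "expiring_soon" or "ok".
    All datetimes are expected to be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    if not_before > now:
        return "not_yet_valid"
    if not_after <= now:
        return "expired"
    if not_after - now <= EXPIRY_WINDOW:
        return "expiring_soon"
    return "ok"
```

Passing `now` explicitly keeps the function usable against log bundles, where the reference time is the collection time rather than the current time.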
Renew backend and CA certificates.
- Stop CVP
- Start aeris
- Remove the backend certificates
- Reset and initialize the CA
- Initialize Aeris
- Restart Aeris
- Start all remaining CVP services
Required privileges: cvp
Wrong permissions or ownership on CVP certificate files.
Details: Checks if backend certificates are owned by the cvp user and if their permissions match the expected ones.
Required privileges: cvp
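A minimal sketch of an ownership and permission check on a single file. The expected uid and mode are passed in as parameters because the actual expected values are not listed here. The `check_cert_file` name is hypothetical, and the demo runs against a temporary file rather than a real certificate.

```python
import os
import stat
import tempfile


def check_cert_file(path, expected_uid, expected_mode):
    """Return a list of problems found for one file (empty list means OK)."""
    info = os.stat(path)
    problems = []
    if info.st_uid != expected_uid:
        problems.append(f"owner uid is {info.st_uid}, expected {expected_uid}")
    actual_mode = stat.S_IMODE(info.st_mode)  # keep only the permission bits
    if actual_mode != expected_mode:
        problems.append(f"mode is {oct(actual_mode)}, expected {oct(expected_mode)}")
    return problems


# Demo on a throwaway file: the mode is deliberately set to a value
# that differs from the (illustrative) expected mode of 0o600.
with tempfile.NamedTemporaryFile() as handle:
    os.chmod(handle.name, 0o644)
    demo_problems = check_cert_file(handle.name, os.getuid(), 0o600)
print(demo_problems)
```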
Correct file ownership and permissions.
- Change the certificate files ownership to the cvp user and group.
- Set the expected permissions on the files.
Required privileges: root
Clickhouse fails to start and clover cannot initialize schema due to readonly tables.
Details: Checks clickhouse logs for 'Table is in readonly mode' messages
Required privileges: cvp
Detach and re-attach the affected table.
Required privileges: cvp
Reset clickhouse. This will wipe telemetry data.
- stop cvp
- remove org data from clickhouse
- start zookeeper
- remove clickhouse path from zookeeper
- start cvp
Required privileges: cvp
JVM configuration exposes this CVP cluster to CVE-2021-44228
Introduced in: 2019.1.0
Fixed in: 2021.2.2
Links: https://www.arista.com/en/support/advisories-notices/security-advisories/13425-security-advisory-0070, https://nvd.nist.gov/vuln/detail/CVE-2021-44228, https://logging.apache.org/log4j/2.x/security.html
Details: Checks the CVP version and based on that, examines either /cvpi/conf/templates/elasticsearch.jvm.options (2019-2020.2.4) or /cvpi/elasticsearch/conf/jvm.options (2020.3.0+) to determine if log4j2.formatMsgNoLookups=true is set
Required privileges: cvp
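The version-dependent file selection and the flag lookup can be sketched like this. The paths and the version boundary come from the description above. The helper names are hypothetical, and treating lines starting with `#` as comments is an assumption about the options file format.

```python
import re

LEGACY_PATH = "/cvpi/conf/templates/elasticsearch.jvm.options"  # 2019 to 2020.2.4
CURRENT_PATH = "/cvpi/elasticsearch/conf/jvm.options"           # 2020.3.0 and later
FLAG = "log4j2.formatMsgNoLookups=true"


def parse_version(version):
    """Turn a version string such as '2020.2.4' into the tuple (2020, 2, 4)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def jvm_options_path(version):
    """Pick the JVM options file to inspect for the given CVP version."""
    if parse_version(version) >= (2020, 3, 0):
        return CURRENT_PATH
    return LEGACY_PATH


def is_mitigated(options_text):
    """Return True if a non-comment line of the options file sets the flag."""
    for line in options_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        if FLAG in stripped:
            return True
    return False
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2020.10.0" would sort before "2020.3.0".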
Writes log4j2.formatMsgNoLookups=true key to JVM options file and rebuilds affected component deployments in the cluster to mitigate CVE
Required privileges: cvp
Configured authentication servers could not be reached.
Details: Checks for Server unreachable messages in aaa logs.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Deadline Exceeded messages in services.
Details: Checks for Context Deadline Exceeded messages in services.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Acknowledging events may not work
Bug ID: 639278 (public link)
Internal Links: https://bug/639278
Details: Checks the turbine-version-events-active.log for not found interactions
Required privileges: cvp
Restart the turbine-version-events-active component.
Required privileges: cvp
Different key file contents
Details: Checks and compares the checksum of key file contents. Key files are: /etc/cvpi/env, /etc/cvpi/cvpi.key, /cvpi/tls/certs/aerisadmin.crt, /cvpi/tls/certs/ca.crt, /cvpi/tls/certs/saml.crt
- Store file checksum from all nodes
- Compare file checksums
Required privileges: cvp
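The store-then-compare steps above could look roughly like the sketch below. SHA-256 is an assumption, since the hash actually used is not stated, and the node names and function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_checksum(data: bytes) -> str:
    """Checksum of a file's contents (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def find_mismatches(checksums):
    """Group nodes by checksum for every file whose contents differ.

    `checksums` has the shape {node: {path: checksum}}. The result has the
    shape {path: {checksum: [nodes]}} and only lists paths that differ.
    """
    by_path = defaultdict(lambda: defaultdict(list))
    for node, files in checksums.items():
        for path, digest in files.items():
            by_path[path][digest].append(node)
    return {path: dict(groups) for path, groups in by_path.items() if len(groups) > 1}


# Hypothetical three-node cluster where one node has a different env file.
nodes = {
    "primary": {"/etc/cvpi/env": file_checksum(b"A=1\n")},
    "secondary": {"/etc/cvpi/env": file_checksum(b"A=1\n")},
    "tertiary": {"/etc/cvpi/env": file_checksum(b"A=2\n")},
}
mismatches = find_mismatches(nodes)
```

Grouping nodes by checksum, rather than only reporting "differs", shows at a glance which node is the odd one out.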
Images are configured in CVP but aren't being loaded by the image service.
Internal Links: https://sites.google.com/arista.com/cvp-tac/architecture-and-components/location-of-swi-and-swix-images, https://mail.google.com/chat/u/0/#chat/space/AAAAy7qQUss/ZPNnx_y65FQ
Details: Checks if required images weren't loaded.
- Check if there are images that need to be added.
- Check if the image service wasn't able to find the images.
- Check if the image service wasn't able to add those missing images
Required privileges: cvp
No patch is available. The TAC engineer handling the issue should consult the provided internal links for instructions on fixing it.
/data/apprpms directory does not exist.
Bug ID: 634395 (public link)
Introduced in: 2021.2.0
Details: Checks if /data/apprpms exists.
Required privileges: cvp
Create /data/apprpms.
Required privileges: cvp
Enabled CVP components that are not running.
Details: Checks if enabled components are not running.
At the moment this is only supported when running in live mode
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Loads CVPI resources.
Details: Loads CVPI resources and saves them to self['status']['extra']. This is intended to be extended by other bugchecks that need to check those values.
Required privileges: cvp
Loads CVPI components statuses.
Details: Loads CVPI components statuses and saves them to self['status']['extra']. This is intended to be extended by other bugchecks that need to check those values.
Required privileges: cvp
Docker containers failing to start with cgroup memory allocation error.
Bug ID: 550147 (public link)
Fixed in: 2021.3.0
Links: https://www.arista.com/en/support/software-bug-portal/bugdetail?bug_id=550147, docker/for-linux#841
Internal Links: https://sites.google.com/arista.com/cvp-tac/uncommon-issues/pods-fail-to-start-due-to-cgroup-memory-allocation-error
Details: Checks for related error messages and missing settings.
- Checks if cgroup.memory=nokmem is present on the GRUB_CMDLINE_LINUX parameter in /etc/default/grub
- Checks if cgroup.memory=nokmem is present on /proc/cmdline
- Checks if there are cgroup.*cannot allocate memory messages on kubernetes pods
Required privileges: cvp
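The three checks above can be sketched as text-based helpers that take file contents or pod messages as strings. The function names are hypothetical, and the parsing assumes GRUB_CMDLINE_LINUX is defined on a single line, as is usual in /etc/default/grub.

```python
import re
import shlex

SETTING = "cgroup.memory=nokmem"
POD_ERROR = re.compile(r"cgroup.*cannot allocate memory")


def grub_cmdline_args(grub_text):
    """Return the arguments listed in GRUB_CMDLINE_LINUX, or [] if it is not set."""
    match = re.search(r"^\s*GRUB_CMDLINE_LINUX=(.*)$", grub_text, re.MULTILINE)
    if not match:
        return []
    # shlex removes the surrounding quotes the same way a shell would.
    value = shlex.split(match.group(1))
    return value[0].split() if value else []


def grub_has_setting(grub_text):
    return SETTING in grub_cmdline_args(grub_text)


def cmdline_has_setting(cmdline_text):
    """Check the running kernel's command line, as read from /proc/cmdline."""
    return SETTING in cmdline_text.split()


def pod_has_cgroup_error(message):
    return POD_ERROR.search(message) is not None
```

Checking both files matters because a setting added to /etc/default/grub only shows up in /proc/cmdline after the grub configuration is regenerated and the node is rebooted.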
Apply kernel settings on grub's configuration
- Back up the current /etc/default/grub file
- Add cgroup.memory=nokmem if not present
- Regenerate grub configuration
Required privileges: root
Out of memory errors in elasticsearch due to insufficient heap space.
Internal Links: https://sites.google.com/arista.com/cvp-tac/howto/how-to-elasticsearch
Details: Checks if there are OutOfMemoryError: Java heap space in elasticsearch logs.
Required privileges: cvp
Increase Elasticsearch's memory limits.
- Stop elasticsearch
- Increase elasticsearch memory limits
- Start CVP
Required privileges: cvp
Corrupted procedures in HBase WAL files.
Internal Links: https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-hbase
Details: Checks if there are corrupted procedures on hbase logs.
Required privileges: cvp
Fix database inconsistencies using hbck.
- Stop all CVP components except for hadoop
- Move current WAL files to a backup location
- Start HBase master and regionserver
- Run hbck
- Rotate Hbase logs
- Start CVP
Required privileges: cvp
Offline hbase regions
Internal Links: https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-hbase
Details: Scan log files looking for offline regions.
- Determine current hbase master log file
- Look for lines containing the string 'Master startup cannot progress, in holding-pattern until region onlined.'
- Extract and store the region name from matching lines.
Required privileges: cvp
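A rough sketch of the scan-and-extract steps. The marker string is taken from the description above. The assumption that the region name is the token immediately before " is NOT online;" on the same line is mine and should be verified against real HBase master logs, and the sample line is constructed for illustration only.

```python
import re

MARKER = "Master startup cannot progress, in holding-pattern until region onlined."

# Assumption: the region name directly precedes " is NOT online;" on marker lines.
REGION_PATTERN = re.compile(r"(\S+) is NOT online;")


def offline_regions(lines):
    """Return region names from marker lines, de-duplicated, in first-seen order."""
    regions = []
    for line in lines:
        if MARKER not in line:
            continue
        match = REGION_PATTERN.search(line)
        if match and match.group(1) not in regions:
            regions.append(match.group(1))
    return regions


# Constructed sample line, not copied from a real log.
sample_line = (
    "2021-01-01 00:00:00,000 WARN [master] master.HMaster: "
    "hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=OPEN}; "
    "ServerCrashProcedures=true. " + MARKER
)
print(offline_regions([sample_line, "unrelated line", sample_line]))
```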
Assign offline regions
- Assign open offline regions
- Restart regionserver if regions are not open
Required privileges: root
HBase operations in STUCK state.
Internal Links: https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-hbase
Details: Checks if there are stuck operations on hbase logs.
Required privileges: cvp
HBase regions not deployed on any region server.
Internal Links: https://sites.google.com/arista.com/cvp-tac/uncommon-issues/region-not-deployed-on-any-region-server, https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-hbase
Details: Checks if there are unassigned regions on hbase.
Required privileges: cvp
Assigns unassigned regions using hbase shell.
Required privileges: cvp
Kubernetes pods in CrashLoopBackOff state.
Details: Checks if there are pods in CrashLoopBackOff state in kubernetes.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Kubernetes pods in Failed state.
Details: Checks if there are pods in Failed state in kubernetes.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Kubernetes pods in Pending state.
Details: Checks if there are pods in Pending state in kubernetes.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Missing CVP required kubernetes secrets.
Details: Checks if required secrets (currently only ambassador-tls-origin) are present on kubernetes. This is only supported when running in live mode.
Required privileges: cvp
Recreate ambassador certificate and secret.
- Reset ambassador
- Init ambassador
- Start all CVP components
Required privileges: cvp
High Kafka lag on the postDB topic.
Details: Checks if the lag in Kafka's postDB topic is above a threshold (500).
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Low amount of RAM available.
Details: Checks if available RAM is below a threshold (2.5 GB)
Required privileges: cvp
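A minimal sketch of such a threshold check on a live Linux node. It assumes that "available RAM" corresponds to the MemAvailable field of /proc/meminfo and that the threshold is 2.5 GiB. Both are assumptions, and the function names are hypothetical.

```python
# /proc/meminfo reports values in kB, so the threshold is expressed in kB too.
THRESHOLD_KB = int(2.5 * 1024 * 1024)  # 2.5 GiB; the exact unit is an assumption


def mem_available_kb(meminfo_text):
    """Return the MemAvailable value in kB, or None if the field is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def is_ram_low(meminfo_text, threshold_kb=THRESHOLD_KB):
    available = mem_available_kb(meminfo_text)
    return available is not None and available < threshold_kb
```

On a live system the input would be the contents of /proc/meminfo, for example `is_ram_low(open("/proc/meminfo").read())`.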
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Disk space is lower than threshold.
Details: Checks if disk space usage is above a threshold (70%)
Required privileges: cvp
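For illustration, a usage check against the 70% threshold could be written as below. The function names are hypothetical, and computing usage as used/total is an assumption: tools such as df account for reserved blocks and may report a slightly different percentage.

```python
import shutil

THRESHOLD = 0.70  # 70%, as stated in the check description


def usage_fraction(path):
    """Fraction of the filesystem containing `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_above_threshold(fraction, threshold=THRESHOLD):
    return fraction > threshold


print(is_above_threshold(usage_fraction("/")))
```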
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Disk throughput below threshold.
Details: Checks if the disk bandwidth is below a threshold (50 MB/s).
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Load average is higher than the number of CPUs in the node
Details: Checks if load average is high by reading /proc/loadavg (live) or cvpi_commands/top (logs).
It reads all 3 load average measurements (1, 5 and 15 minutes) and takes the highest into consideration, so a warning may still be displayed if there was a recent peak but things are back to normal at the moment the check is done.
Required privileges: cvp
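The live-mode comparison described above can be sketched as follows, using the standard /proc/loadavg layout, where the first three fields are the 1, 5 and 15 minute averages. The function names are hypothetical, and the parsing of cvpi_commands/top for log bundles is not covered.

```python
def max_load(loadavg_text):
    """Return the highest of the 1, 5 and 15 minute load averages."""
    fields = loadavg_text.split()
    return max(float(value) for value in fields[:3])


def is_overloaded(loadavg_text, cpu_count):
    """True if the highest load average exceeds the number of CPUs."""
    return max_load(loadavg_text) > cpu_count
```

On a live node this could be called as `is_overloaded(open("/proc/loadavg").read(), os.cpu_count())`. Because the maximum of the three values is used, a peak from up to 15 minutes ago still triggers the warning.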
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Server time is not synchronized.
Details: Checks the NTP status on the cvpi resources output.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Services killed due to system running out of memory.
Details: Checks if processes have been OOM killed on journalctl and kubelet_journalctl files.
Required privileges: cvp
No patch is available. This is an informational message and further debugging will be needed by the TAC team.
Invalid characters in usernames.
Bug ID: 662346 (public link)
Introduced in: 2020.3.0
Internal Links: https://sites.google.com/arista.com/cvp-tac/troubleshooting/troubleshooting-fastupgrade-failures/upgrading-to-2021-3
Details: Look for user validation errors due to special characters in the user upgrade logs
- Read user-upgrade.log
- Look for lines containing "Error in validating user: Allowed special characters in username"
- Extract the username from lines
Required privileges: cvp
Remove usernames with invalid characters from the aeris database.
- Stop aeris
- Start aeris
- Remove the username path contents
- Remove username from the user list
- Start CVP
Required privileges: cvp