Bug Report: DiscoverInstance (instance is nil) logs in VTOrc #13112

@GuptaManan100

Overview of the Issue

VTOrc sometimes emits spurious logs such as: DiscoverInstance(10.10.10.10:3307) instance is nil in 0.002s (Backend: 0.002s, Instance: 0.000s), error=tablet alias is nil.

I have looked at the code and know how this happens. Suppose you initially have a vttablet with hostname h1, port p1, and alias a1. The VTOrc backend then has one row in vitess_tablet for this tablet containing all three values (h1, p1, and a1), and one row in database_instance containing h1 and p1.

Now suppose this tablet gets evicted by Kubernetes and restarts on a different machine. The tablet's alias stays the same, but the host and port change, say to h2 and p2.

When VTOrc refreshes its information from the topo server, it sees the new record for the vttablet and tries to insert a row into vitess_tablet with the values h2, p2, and a1. Since there is a uniqueness constraint on alias, the insert replaces the existing row, so the first row is removed automatically. VTOrc also loads the MySQL information for this tablet and populates database_instance with the values h2 and p2. The alias is not stored in this table, so no uniqueness constraint is violated and both rows (h1, p1 and h2, p2) now exist in the table.
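The replace-versus-accumulate behavior described above can be reproduced with a small SQLite sketch. The schemas here are simplified assumptions that keep only the columns mentioned in this issue, the helper save_tablet is hypothetical, and the port numbers stand in for p1 and p2. None of this is VTOrc's actual code.

```python
import sqlite3

# Simplified, hypothetical schemas: only the columns discussed in this issue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vitess_tablet (
        alias    TEXT PRIMARY KEY,    -- uniqueness constraint on alias
        hostname TEXT NOT NULL,
        port     INTEGER NOT NULL
    );
    CREATE TABLE database_instance (
        hostname TEXT NOT NULL,
        port     INTEGER NOT NULL,
        PRIMARY KEY (hostname, port)  -- no alias column in this table
    );
""")


def save_tablet(alias, hostname, port):
    """Hypothetical helper: record a tablet and its MySQL instance row."""
    conn.execute(
        "REPLACE INTO vitess_tablet (alias, hostname, port) VALUES (?, ?, ?)",
        (alias, hostname, port),
    )
    conn.execute(
        "REPLACE INTO database_instance (hostname, port) VALUES (?, ?)",
        (hostname, port),
    )


save_tablet("a1", "h1", 3306)  # initial state: h1, p1, a1
save_tablet("a1", "h2", 3307)  # after the restart: h2, p2, same alias a1

tablets = conn.execute(
    "SELECT alias, hostname, port FROM vitess_tablet ORDER BY hostname"
).fetchall()
instances = conn.execute(
    "SELECT hostname, port FROM database_instance ORDER BY hostname"
).fetchall()

print(tablets)    # one row: the alias conflict replaced the old row
print(instances)  # two rows: nothing conflicted on (hostname, port)
```

In this sketch the second save replaces the vitess_tablet row because the alias collides, while database_instance simply gains a second row.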

Next, VTOrc runs the check that determines which tablets it needs to forget. This check looks only at tablet aliases, and since the alias of this tablet did not change, it concludes there is nothing to forget.

Overall, this sequence leaves a row in database_instance that should have been removed and that has no corresponding row in vitess_tablet. ReadOutdatedInstanceKeys picks up this record and tries to refresh its information, but the refresh fails with DiscoverInstance(10.10.10.10:3307) instance is nil in 0.002s (Backend: 0.002s, Instance: 0.000s), error=tablet alias is nil.
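To illustrate why the alias-only check misses the stale row, here is a minimal sketch using plain Python data structures as stand-ins for the two tables after the restart. The function names aliases_to_forget and orphaned_instances are hypothetical, as are the port numbers. This models the reasoning in this issue, not VTOrc's implementation.

```python
# Hypothetical in-memory stand-ins for the two backend tables after the restart.
vitess_tablet = {"a1": ("h2", 3307)}              # alias -> (hostname, port)
database_instance = {("h1", 3306), ("h2", 3307)}  # (hostname, port) rows


def aliases_to_forget(known_aliases, topo_aliases):
    """Alias-only check: forget tablets whose alias is no longer in the topo."""
    return set(known_aliases) - set(topo_aliases)


def orphaned_instances(instances, tablets):
    """Instance rows whose (hostname, port) matches no vitess_tablet row."""
    live = set(tablets.values())
    return sorted(key for key in instances if key not in live)


topo_aliases = {"a1"}  # the topo server still lists alias a1

to_forget = aliases_to_forget(vitess_tablet, topo_aliases)
orphans = orphaned_instances(database_instance, vitess_tablet)

print(to_forget)  # empty: the alias never changed, so nothing is forgotten
print(orphans)    # the stale (h1, p1) row that has no tablet alias
```

The alias comparison returns nothing to forget, yet a comparison keyed on hostname and port still finds the leftover row, which is the one that later fails with "tablet alias is nil".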

Reproduction Steps

See the overview above.

Binary Version

main

Operating System and Environment details

all

Log Fragments

No response
