I'm not sure whether this is a Kafka issue or a librdkafka issue. We were running some tests against our 3-node cluster. We brought down one of the nodes and continued to operate (librdkafka reported the transport failure to that node through its callback, as expected, but carried on). When we brought the third node back online, things continued to work while it was syncing and catching up. As soon as the third node was fully online, however, several of our librdkafka clients got an error saying that the topic wasn't found. Other clients did not get this error.
Can you think of a reason why this might happen? I can ignore the error a fixed number of times and retry, but I wanted to check whether this is expected behavior or should be considered a bug somewhere.
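
The "ignore the error a number of times and retry" workaround could be sketched roughly as follows. This is only an illustration of the counting logic, not librdkafka code: `TransientErrorBudget` and its method names are hypothetical, and the sketch assumes the client code can tell when it has received the topic-not-found error and when an operation has succeeded again.

```python
class TransientErrorBudget:
    """Hypothetical helper: tolerate a bounded run of consecutive errors.

    Not part of librdkafka or any Kafka client; it only tracks how many
    times in a row the "topic not found" error has been seen.
    """

    def __init__(self, max_consecutive: int) -> None:
        if max_consecutive < 0:
            raise ValueError("max_consecutive must be >= 0")
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def record_success(self) -> None:
        # Any successful operation resets the run of errors.
        self.consecutive = 0

    def record_error(self) -> bool:
        """Count one more error; return True if it should still be ignored."""
        self.consecutive += 1
        return self.consecutive <= self.max_consecutive
```

A client would call `record_error()` each time the error shows up and treat it as fatal only once that returns `False`, and call `record_success()` after any successful produce or consume so that isolated errors during a broker rejoin do not accumulate.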