Skip to content

Removed sock fd assert on IpSocket#3973

Merged
LeStarch merged 1 commit intonasa:develfrom
jsilveira1409:bug/ip-socket
Aug 6, 2025
Merged

Removed sock fd assert on IpSocket#3973
LeStarch merged 1 commit intonasa:develfrom
jsilveira1409:bug/ip-socket

Conversation

@jsilveira1409
Copy link
Contributor

Related Issue(s)
#3888

|Has Unit Tests (y/n)| |
|Documentation Included (y/n)| |


Change Description

Removed unnecessary assert for the socketDescription.fd when sending data out.

Rationale

When the FSW streams out data to a e.g. TCP client, and the client disconnects abruptly, the assert get's triggered, crashing the FSW instance, instead of letting it fail to send in the sendProtocol() function further below in the IpSocket::send() function.

Testing/Review Recommendations

Future Work

@LeStarch
Copy link
Collaborator

LeStarch commented Aug 4, 2025

I am not sure this is the fix we want. The intent of these assertions was to ensure that the file descriptor is not an invalid file descriptor in both send and recv.

Look at the receive code:

this->m_lock.lock();
SocketDescriptor descriptor = this->m_descriptor;
this->m_lock.unlock();
if (descriptor.fd == -1) {
return SOCK_DISCONNECTED;
}
status = this->getSocketHandler().recv(descriptor, data, size);
if (status == SOCK_DISCONNECTED) {
this->close();
}

Here, we clearly check to avoid a case where the assert would trip unnecessarily.

The send code is a bit different.

status = this->reopen();
// if reopen wasn't successful, pass the that up to the caller
if (status != SOCK_SUCCESS) {
return status;
}
// Refresh local copy after reopen
this->m_lock.lock();
descriptor = this->m_descriptor;
this->m_lock.unlock();

Here it attempts to reopen a failed descriptor, checks status, and then persists another copy. This leads to a race condition because something else (e.g. recv) could close the descriptor between a successful open, and when the new copy is persisted.

This code should instead:

  1. Call reopen
  2. Safely persist a local copy of the descriptor.
  3. Check for and return bad statuses
  4. Check for fd == -1 and return DISCONNECTED

@vincewoo does this make sense?

@LeStarch
Copy link
Collaborator

LeStarch commented Aug 4, 2025

We should also put back in the assert in the recv function.

@LeStarch
Copy link
Collaborator

LeStarch commented Aug 5, 2025

@jsilveira1409 do you want to attempt the above change, or shall I?

@jsilveira1409
Copy link
Contributor Author

@LeStarch hey thanks for the feedback! I'll give it a shot in the coming days and propose a solution! Thanks

@LeStarch
Copy link
Collaborator

LeStarch commented Aug 6, 2025

It turns out this is a whole lot more complicated that first assumed. The crash comes from the fact that in the TcpServer re-connections are potentially a long process (wait listening for some unknown client).

We need this fix for the release today, so I am going to go ahead and make it as it affects multiple files.

For the time being, I will merge this PR as it will partially address the problem, and will give you credit.

@LeStarch LeStarch merged commit ab16b2a into nasa:devel Aug 6, 2025
49 checks passed
@LeStarch
Copy link
Collaborator

LeStarch commented Aug 6, 2025

We looked at this yet deeper, and decided this was the correct fix for now. A better fix would be to push re-connection requests onto their own thread, and get rid of this weird inter-threading logic.

However, for now this code results in an error when threads are reconnecting and haven't completed. This is the stop-gap solution.

@jsilveira1409
Copy link
Contributor Author

I haven't had time to look more into it yet but will study it deeper when I do! Thanks for the merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants