Skip to content

Add platform agnostic socket error macros and error details helper#11565

Merged
mattklein123 merged 27 commits intoenvoyproxy:masterfrom
greenhouse-org:io-error-macros
Jun 28, 2020
Merged

Add platform agnostic socket error macros and error details helper#11565
mattklein123 merged 27 commits intoenvoyproxy:masterfrom
greenhouse-org:io-error-macros

Conversation

@sunjayBhatia
Copy link
Copy Markdown
Member

@sunjayBhatia sunjayBhatia commented Jun 11, 2020

Commit Message:
Add platform agnostic socket error macros and error details helper

  • WSAE* error codes are a disjoint set from POSIX error codes (which are
    still defined on Windows for non winsock2 errors) so we define macros
    that map to these error codes on Windows and the usual POSIX error
    codes on other platforms
  • do not use strerror to get error details on Windows, instead use
    FormatMessage and add utility function to replace strerror usage
  • replace usual error code macros with new SOCKET_ERROR_ macros in
    applicable locations where they come from socket operations
  • any differences in error code returned by socket operations continue
    to be handled with an #ifdef to make the differences between platforms
    clear (e.g. when SOCKET_ERROR_AGAIN is returned by Windows but
    SOCKET_ERROR_IN_PROGRESS is returned by other platforms)
  • Fix small UBSAN failure in integration test library

Additional Description:
Risk Level: Low (should be a 1-1 replacement for Linux etc.)
Testing: Adds a unit test for Windows and modifies some to use the new macros etc.
Docs Changes: N/A
Release Notes: N/A
Fixes #11298

- WSAE* error codes are a disjoint set from POSIX error codes (which are
  still defined on Windows for non winsock2 errors) so we define macros
  that map to these error codes on Windows and the usual POSIX error
  codes on other platforms
- do not use strerror to get error details on Windows, instead use
  FormatMessage and add utility function to replace sterror usage
- replace usual error code macros with new SOCKET_ERROR_ macros in
  applicable locations where they come from socket operations
- any differences in error code returned by socket operations continue
  to be handled with an #ifdef to make the differences between platforms
  clear (e.g. when SOCKET_ERROR_AGAIN is returned by Windows but
  SOCKET_ERROR_IN_PROGRESS is returned by other platforms)

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/.

🐱

Caused by: #11565 was opened by sunjayBhatia.

see: more, trace.

@sunjayBhatia
Copy link
Copy Markdown
Member Author

We experimented with and considered other options to address the general issue described below:

Problems:

  • there are specific winsock2 error codes, that do not correspond to the regular POSIX errno values
    • common case: EAGAIN would typically signal a program to repeat an operation, etc. but with winsock2 it's equivalent to WSAEWOULDBLOCK
  • EAGAIN/WSAEWOULDBLOCK happen very frequently so frequent allocations of error objects is not ideal
  • we need to be able to report the exact error code in logs (be sure not to replace a specific winsock2 error code with a translation of any kind)
  • we need to be able to report the exact error description string in logs (strerror may not cut it on Windows)

Options:

  • just add #ifdef case statements for WSA error codes when we translate to Envoy IoErrorCode
    • can use a macro in the error class to correctly generate the correct case statement
    • getIoSocketEagainInstance should also create an error with WSAEWOULDBLOCK instead of EAGAIN
    • easy, but messier? mixes up Windows and other platform specifics together in the IoSocketError class
    • ignores tests directly accessing OsSysCallsImpl, those tests and others who generate error objects in mocks etc. would still need to be updated with ifdefs
    • in places where we look at the errno coming from a sys call, use new macros for comparing constants (a la SOCKET_VALID() macro or SOCKET_ERROR macro)
  • #define new macros for the errors we care about, use those exclusively (the option we chose)
    • POSIX: #define ERROR_AGAIN EAGAIN
    • WINDOWS: #define ERROR_AGAIN WSAEWOULDBLOCK
    • should get rid of many #ifdefs required but could be brittle if developers are frequently changing code and forget to use the defined error codes
    • still may be some platform specifics where we need to check additional error codes on a platform (e.g. the error from a syscall is slightly different and doesn't map to the same ERROR_* for all platforms), however calling these differences out may be good as it ensures developers can directly see it when making changes
  • translate WSA errors to POSIX errno equivalents in OsSysCallsImpl results
    • translation table from WSA to posix errors
    • always pass around the POSIX error code
    • This will likely become confusing for users who see a POSIX error code or error string in logs and don't understand the mapping
    • Optimizations:
      • co-mingle required NT and WSA errors by expected frequency
      • static map of WSA error to POSIX error for constant time access in os sys calls
  • change IoError interface completely to no longer use getErrorCode() or use IoErrorCode enum to interrogate error specifics
    • instead interrogate error object with 'isError()' methods, e.g. isErrorAgain() for EAGAIN/WSAEWOULDBLOCK
    • can have a different implementation per platform as needed
    • can cleanly have multiple errors that resolve to the same "is error this or that" logic, such as EAGAIN/EWOULDBLOCK/EINPROGRESS (from the man 7 socket page: "-1 is returned with errno set to EAGAIN or EWOULDBLOCK, or EINPROGRESS")
    • requires a bigger refactor of the IoError interface
    • does not cover cases where we are calling syscalls directly and looking at errnos, this only helps the code that uses IoErrors

cc @wrowe @nigriMSFT @davinci26

} else {
ASSERT(result.rc_ == -1);
if (result.errno_ == EINPROGRESS) {
#ifdef WIN32
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of an error code that is different on Windows vs POSIX, personally I think it may be valuable to keep the #ifdef to ensure developers understand the differences between connect() on different platforms, but we could potentially abstract this error check in a different way, open to feedback about this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach as long as we don't have too many instances of the same syscall

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Copy link
Copy Markdown
Member

@davinci26 davinci26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@htuch htuch removed the api label Jun 12, 2020
@davinci26
Copy link
Copy Markdown
Member

Should we also start removing the fail on windows tag with this change to have a sense of where we are at with getting all the tests running?

@wrowe
Copy link
Copy Markdown
Contributor

wrowe commented Jun 13, 2020

When you resolve a test failure, remove the corresponding fails_on_windows tag.

You can simply change your bazel test command to ignore that flag to review newly passing tests, or only build tests with that flag present to discover newly passing tests.

In any case removing that flag without fixing the bug will break CI, so please don't do that.

@ggreenway
Copy link
Copy Markdown
Member

@junr03 Can you take a pass on this?

wrowe and others added 2 commits June 16, 2020 14:15
- pre-allocating error message buffer we may fail to fetch error details
for errors with long descriptions
- revert IoFileError.getErrorDetails() to use strerror, we will take
another PR to refactor Filesystem Impl on Windows to use Win32 APIs not
POSIX

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
sunjayBhatia and others added 2 commits June 16, 2020 17:12
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Copy link
Copy Markdown
Member

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few comments


const std::string errorDetails(int error_code) {
#ifndef WIN32
return strerror(error_code);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document why strerror cannot be used?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment describing below

* @param int error code
* @return const std::string error detail description
*/
const std::string errorDetails(int error_code);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we somehow enforce that new code does not use strerror? clang-tidy rule?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super experienced with clang-tidy and couldn't find a check that could do this, do you know of one? We could add to the format checks potentially instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're thinking of doing a format check instead, since we also have a couple places where we intentionally left strerror b/c there is another refactor required before we get rid of it (the filesystem impl on Windows uses the POSIX subsystem which is problematic for error handling purely Windows errors), we can use clang-format off/on to guard around those blocks that need it still

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#11655 for the filesystem impl issue

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/wait

adding checks for strerror in clang format

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a new format check, please take a look!

os_syscalls.recv(socket.ioHandle().fd(), buf_, Config::MAX_INSPECT_SIZE, MSG_PEEK);
ENVOY_LOG(trace, "http inspector: recv: {}", result.rc_);
if (result.rc_ == -1 && result.errno_ == EAGAIN) {
if (SOCKET_FAILURE(result.rc_) && result.errno_ == SOCKET_ERROR_AGAIN) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity, do we have any more places with the old-style comparison that should be converted to SOCKET_FAILURE?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few more and converted them, it was tough to think of a format or other automation to detect it as it is dependent on if the return code comes from a socket function, lmk if you think there is a good way to auto detect this

#define ENVOY_SHUT_RDWR SD_BOTH

// Mapping winsock2 errors to common error names
#define SOCKET_ERROR_AGAIN WSAEWOULDBLOCK
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should document this decision a bit more inline, i.e., expand on the thinking about defining envoy-neutral socket errors and maping them to platform specific definitions?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more content here, lmk if this is sufficient

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, this is great!

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
@wrowe
Copy link
Copy Markdown
Contributor

wrowe commented Jun 22, 2020

/retest envoy-presubmit

for DNS failure

@repokitteh-read-only
Copy link
Copy Markdown

🐴 hold your horses - no failures detected, yet.

🐱

Caused by: a #11565 (comment) was created by @wrowe.

see: more, trace.

Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
@sunjayBhatia
Copy link
Copy Markdown
Member Author

@junr03 should we similarly ignore the kqueue code in clang-tidy (like is done for Windows)? That seems to be the last error remaining in the existing checks

cc @junr03 @ggreenway if you have thoughts on this

No longer the last clang-tidy error (we will fix those others in a commit soon), but still relevant

wrowe and others added 4 commits June 23, 2020 18:16
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>

Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
@sunjayBhatia
Copy link
Copy Markdown
Member Author

sunjayBhatia commented Jun 25, 2020

not quite sure yet how to fix this ubsan error, though it appears to be unrelated to the changes in this PR

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior test/integration/integration.cc:470:45 in 

It appears ubsan is still evaluating this and checking the result of test_server_->counter(success) for nullptr even though the first part of the or already does so:

test_server_->counter(success)->value() < concurrency_) &&

The ubsan error in this failure could be fixed by grabbing the counter once instead of each time it is needed, would appreciate your thoughts @ggreenway @junr03 if we should fix this here or open another issue, should be fixed in the most recent commit

see https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/43031/logs/104 for a raw log that hopefully does not go away with future pushes (AZP build: https://dev.azure.com/cncf/envoy/_build/results?buildId=43031&view=logs&jobId=930b2ee4-0ea7-5b74-a645-6bc26962bc06)

wrowe and others added 5 commits June 25, 2020 12:33
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Co-authored-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: William A Rowe Jr <wrowe@vmware.com>
Co-authored-by: William A Rowe Jr <wrowe@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
Signed-off-by: Sunjay Bhatia <sunjayb@vmware.com>
@davinci26
Copy link
Copy Markdown
Member

@junr03 @ggreenway can we get this PR merged with master 🚀

This PR unblock various tests that are failing on windows and I have been working on top of it for some time now and it would be awesome if we have it in master.

Copy link
Copy Markdown
Member

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. @ggreenway do you mind giving a final look?

__pragma(pack(pop))

using ssize_t = ptrdiff_t;
typedef ptrdiff_t ssize_t;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I believe we bias to using using.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are aware of this and wrote the original solve. We've run across an edge case where using was insufficient, and these are global C declarations of posix to be emulated correctly on Windows (We did joke about making it a #define, obviously not seriously:)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we changed those to typedef instead of using to appease clang-tidy in this commit: 6ab3de8, if there is a clang-tidy trick we don't know of (besides just turning tidy off for that section) we could, change to, that would be great to know

#define ENVOY_SHUT_RDWR SD_BOTH

// Mapping winsock2 errors to common error names
#define SOCKET_ERROR_AGAIN WSAEWOULDBLOCK
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, this is great!

@ggreenway
Copy link
Copy Markdown
Member

lgtm. @ggreenway do you mind giving a final look?

I'm a bit backed up with some end-of-quarter stuff. I'm going to unassign myself. @envoyproxy/senior-maintainers PTAL.

@ggreenway ggreenway removed their assignment Jun 26, 2020
@mattklein123 mattklein123 self-assigned this Jun 27, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@mattklein123 mattklein123 merged commit 9e3dbff into envoyproxy:master Jun 28, 2020
@wrowe wrowe deleted the io-error-macros branch June 29, 2020 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WIP: Resolve WSAE* errors on Windows into expected Posix error states

8 participants