Skip to content

Fix telemetry rpc getting stuck if all nodes have bandwidth set to 0#3643

Merged
dsiganos merged 1 commit intonanocurrency:developfrom
pwojcikdev:fix-telemetry-rare-case
Feb 2, 2022
Merged

Fix telemetry rpc getting stuck if all nodes have bandwidth set to 0#3643
dsiganos merged 1 commit intonanocurrency:developfrom
pwojcikdev:fix-telemetry-rare-case

Conversation

@pwojcikdev
Copy link
Copy Markdown
Contributor

@pwojcikdev pwojcikdev commented Jan 3, 2022

I encountered this bug when I was doing some testing on private network, where all nodes have bandwidth set to 0. In that case the 'bandwidths' set is empty and std::next() has undefined behavior (gets stuck in infinite loop).

@thsfs
Copy link
Copy Markdown
Contributor

thsfs commented Jan 3, 2022

what are the steps to reproduce this issue?

@thsfs thsfs requested review from clemahieu and thsfs January 3, 2022 20:42
@pwojcikdev
Copy link
Copy Markdown
Contributor Author

  1. Create network of at least 10 nodes with config option 'bandwidth_limit = 0'
  2. Try to call 'telemetry' rpc action
  3. RPC never completes, IO thread stuck in infinite loop

@zhyatt zhyatt added this to the V24.0 milestone Jan 3, 2022
@zhyatt zhyatt added the bug label Jan 3, 2022
@thsfs thsfs self-assigned this Jan 3, 2022
@dsiganos
Copy link
Copy Markdown
Contributor

dsiganos commented Jan 4, 2022

I think this bug is caused by this if statement, which removes the zero bandwidth values.

if (telemetry_data.bandwidth_cap != 0)
{
	bandwidths.insert (telemetry_data.bandwidth_cap);
}

If all bandwidth_cap are set 0 then counts inside strip_outliers_and_sum when calculating bandwidth_sum will be empty.

Therefore, this should be reproducible by a unit test with 10 nodes having bandwidth set to 0.

@fikumikudev are you willing to write such a unit test?

@dsiganos
Copy link
Copy Markdown
Contributor

dsiganos commented Jan 4, 2022

Also, I do not think this bug is fully fixed by the proposed change. The function strip_outliers_and_sum expects to receive a list of at least 2 elements to operate correctly and this check only checks for empty list.

@pwojcikdev
Copy link
Copy Markdown
Contributor Author

pwojcikdev commented Jan 4, 2022

Yes, it looks like my fix is not complete. std::next will move the iterator past the end of a container, there are no checks there. I also think that this bug could potentially affect mainnet even, if someone set up a lot of nodes with bandwidth_limit = 0, so that num_either_side_to_remove is greater than the number of nodes with bandwidth limit set to non zero value, then any node that handles telemetry rpc would have all its io threads stuck, effectively taking down that node.

@pwojcikdev
Copy link
Copy Markdown
Contributor Author

There are two ways to fix this I can think of, first is to rewrite strip_outliers_and_sum to properly respect container bounds, the second is to not skip zero bandwidth values, but instead replace them with something predefined, like 1Gbps maybe?

@dsiganos
Copy link
Copy Markdown
Contributor

dsiganos commented Jan 6, 2022

I think improving strip_outliers_and_sum to handle any number of elements is the right way to do it.

@thsfs thsfs assigned dsiganos and unassigned thsfs Jan 19, 2022
@pwojcikdev pwojcikdev force-pushed the fix-telemetry-rare-case branch from bf67968 to af495ea Compare January 24, 2022 22:29
@pwojcikdev pwojcikdev force-pushed the fix-telemetry-rare-case branch from af495ea to 1aee398 Compare January 24, 2022 22:54
@pwojcikdev
Copy link
Copy Markdown
Contributor Author

@dsiganos I fixed the strip_outliers_and_sum logic and added a unit test for this bug case. Please check if it looks good to you.

@dsiganos
Copy link
Copy Markdown
Contributor

dsiganos commented Jan 24, 2022

Yeap, I'll look at it now. I pushed a PR to your election scheduler RPC PR branch.
pwojcikdev#1

thank you for your contributions, they are great!

@dsiganos dsiganos merged commit 4d07daf into nanocurrency:develop Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants