[xcvrd] Enhance xcvrd to handle new system level event/error#39
[xcvrd] Enhance xcvrd to handle new system level event/error#39qiluo-msft merged 14 commits intosonic-net:masterfrom keboliu:retry-on-fail
Conversation
sonic-xcvrd/scripts/xcvrd
Outdated
| # Receive event on single SFP | ||
| for key, value in port_dict.iteritems(): | ||
| if key == EVENT_ON_ALL_SFP: | ||
| logger.log_error("Receive event with unexpected content, key: {}, value: {}".format(key, value)) |
There was a problem hiding this comment.
Receive [](start = 42, length = 7)
Past tense is better. As original 'Got'. Many time below. #Closed
sonic-xcvrd/scripts/xcvrd
Outdated
| for key, value in port_dict.iteritems(): | ||
| if key == EVENT_ON_ALL_SFP: | ||
| if value == SYSTEM_NOT_READY: | ||
| # system not ready, wait and retry |
There was a problem hiding this comment.
[](start = 32, length = 1)
wrong indentation. #Closed
sonic-xcvrd/scripts/xcvrd
Outdated
| if value == SYSTEM_NOT_READY: | ||
| # system not ready, wait and retry | ||
| if self._retry_for_system_ready(): | ||
| continue |
There was a problem hiding this comment.
[](start = 33, length = 3)
inconsistent indentation #Closed
sonic-xcvrd/scripts/xcvrd
Outdated
| while retry < RETRY_TIMES_FOR_SYSTEM_READY: | ||
| #time.sleep(RETRY_PERIOD_FOR_SYSTEM_READY_SECS) | ||
| time_start = time.time() | ||
| rc, event_dict = platform_sfputil.get_transceiver_change_event(RETRY_PERIOD_FOR_SYSTEM_READY_MSECS) |
There was a problem hiding this comment.
platform_sfputil.get_transceiver_change_event [](start = 29, length = 45)
You can call platform_sfputil.get_transceiver_change_event function only once, and implement the logic as a normal state machine.
I don't mean your implementation is wrong, state machine may be much more readable and future proof. #Closed
There was a problem hiding this comment.
Hi Qi,
I admit that rewriting it in a state machine approach can make the code a little shorter by eliminating some duplicate code.
But I'm afraid it can also introduce negative effects: We have the logic to control retry timespan and times and introduce some local variables to control that. Rewriting it as a state machine will combine the retry control logic into the main procedure of xcvrd, which make the main function complicated. #Closed
There was a problem hiding this comment.
The benefit for state machine pattern is not shorter code, but a correct design and implementation.
For the design part, you need to consider any sequence of change events returned by vendor plugin code. For example, SYSTEM_NOT_READY followed by an inserted event. You need to draw the state machine on paper first to consider all the events at each state. You need to document the state machine diagram first to consider all the events at each state.
For the implementation part, you could design you state transition code in standalone function, and achieve 'control retry timespan and times' or 'local variables'.
In reply to: 310075723 [](ancestors = 310075723)
There was a problem hiding this comment.
Hi Qi,
I understand the advantage of state machine approach. I just feel that the logic of xcvrd daemon is relatively simple and it seems that using state machine is less necessary.
IMO, state machine is adopted in event handling systems which have a lot of state and event, like network protocols. In those systems using state machine makes the flow clear and relatively easy to understand and also benefits the formal verification of that kind of system.
But for systems with less states and events the state machine approach doesn't have overwhelming advantage, since the current logic is straightforward. I have considered the state machine approach and found that only 2 states and 4 events. In this sense it's less necessary to go in this way.
meanwhile, I've created a PR in which I've updated the handling logic in a state-machine-like way, I think it is better than the current implementation. but it's not a strictly-speaking state machine. the pr is sonic-platform-daemons PR#5. You can check it.
qiluo-msft
left a comment
There was a problem hiding this comment.
As comments
Also blocked on sonic-net/sonic-platform-common#50
qiluo-msft
left a comment
There was a problem hiding this comment.
Reviewed again. Pending on #39 (comment)
…e state to "SYSTEM_READY"
Retry on fail state machine approach
code updated with state machine approach
events definition
State transmit:
State event next state |
sonic-xcvrd/scripts/xcvrd
Outdated
| status, port_dict = platform_sfputil.get_transceiver_change_event(timeout) | ||
| logger.log_debug("Got event {} {} in state {}".format(status, port_dict, state)) | ||
| event = self._mapping_event_from_change_event(status, port_dict) | ||
| for key, value in port_dict.iteritems(): |
There was a problem hiding this comment.
for key, value in port_dict.iteritems(): [](start = 12, length = 40)
Why you put the for-loop outside the event/state branches? Can you move it inside a smaller branch? #Closed
There was a problem hiding this comment.
Yes. You are correct.
Only for NORMAL_EVENT is this for loop required.
The "break" statements which originally break the inner "for" loop are also removed.
sonic-xcvrd/scripts/xcvrd
Outdated
| SYSTEM_FAIL = 'system_fail' | ||
| NORMAL_EVENT = 'normal' | ||
| # states definition | ||
| STATE_INIT = 'INIT STATE' |
There was a problem hiding this comment.
'INIT STATE' [](start = 13, length = 12)
You can use integer as value, don't rely on python to optimize the string comparison. #Closed
There was a problem hiding this comment.
I think to use string is better since it makes logging information more human-readable.
Even though a dictionary can be adopted to map from integer value to string, the case that the state is not in the key set of the dictionary has to be handled. That's too complicated.
BTW, I'm not sure whether to use string address to compare the string is part of python's language specification or just optimization that depends on python interpreter's implementation?
There was a problem hiding this comment.
states have been updated to integer value.
The events, like SYSTEM_FAIL, are already used by the plugins. In this sence, we'd better not to change them. It seems that to use string has already been a common practice in plugins, like the normal plugin/out events which are represented by '1', '0' respectively.
sonic-xcvrd/scripts/xcvrd
Outdated
| # If get_transceiver_change_event() return error, will clean up the DB and then exit | ||
| # TODO: next step need to define more error types to handle accordingly. | ||
| logger.log_error("Failed to get transceiver change event. Exiting...") | ||
| next_state = None |
There was a problem hiding this comment.
next_state [](start = 12, length = 10)
You may initialize it with state, next_state will never be None. #Closed
sonic-xcvrd/scripts/xcvrd
Outdated
|
|
||
| if unexpected_state: | ||
| next_state = STATE_EXIT | ||
| logger.log_error("Facing unexpected state {}, exiting...".format(state)) |
There was a problem hiding this comment.
You can remove unexpected_state. This is redundant logic. #Closed
There was a problem hiding this comment.
This is only for log.
By setting unexpected_state to true we can know why the state machine reach STATE_EXIT.
There was a problem hiding this comment.
This is only for log.
By setting unexpected_state to true we can know why the state machine reach STATE_EXIT.
I decide to remove this. It is very unlikely to enter an unexpected state since all states are handled for each event.
sonic-xcvrd/scripts/xcvrd
Outdated
| else: | ||
| unexpected_state = True | ||
| next_state = STATE_EXIT | ||
| break |
There was a problem hiding this comment.
repeated code, you can extract function. #Closed
There was a problem hiding this comment.
now, only next_state is set when facing unexpected state. in this sense, no need to extract function.
| os.kill(os.getppid(), signal.SIGTERM) | ||
| break | ||
| elif next_state == STATE_NORMAL: | ||
| timeout = 0 |
There was a problem hiding this comment.
You can move this part into event handling branches. #Closed
There was a problem hiding this comment.
It's better to set it here since it can ensure that wherever the state reaches STATE_NORMAL the timeout can be properly set.
qiluo-msft
left a comment
There was a problem hiding this comment.
Please check my new comment, and also please resolve merge conflict for this PR.
sonic-xcvrd/scripts/xcvrd
Outdated
| logger.log_error("Failed to get transceiver change event. Exiting...") | ||
| logger.log_warning("Got unknown event {} on state {}.".format(event, state)) | ||
|
|
||
| if not next_state == state: |
There was a problem hiding this comment.
if not next_state == state: [](start = 12, length = 27)
For readability, use if next_state != state
[sonic-platform-common] [sonic_sfp] Interpret sff 'int' element =0 as valid value (sonic-net/sonic-platform-common#51) add more error code to get_transceiver_change_event ((sonic-net/sonic-platform-common#50) [sonic_platform_base] support new-platform-api-based daemons ((sonic-net/sonic-platform-common#48) sync change to sonic_platform_base/sonic_sfp and create symbol link ((sonic-net/sonic-platform-common#49) Add parser support for Tx_RxLos,TxFault, PowerControl, ResetStatus in sff8436.py ((sonic-net/sonic-platform-common#45) readd type_abbrv_name in sonic_sfp/sff8436.py ((sonic-net/sonic-platform-common#44) [psu_base] get_status_led() returns current state of the status LED ((sonic-net/sonic-platform-common#39) Fix abbrv name for OSFP ((sonic-net/sonic-platform-common#36) [sff8436] support "Control Bytes" and "Options" ((sonic-net/sonic-platform-common#38) sonic_sfp: avoid possible key error in get_physical_to_logical() ((sonic-net/sonic-platform-common#37) [sonic-platform-daemons] [xcvrd] Enhance xcvrd to handle new system level event/error (sonic-net/sonic-platform-daemons#39) [xcvrd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#38) [psud] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#37) [syseepromd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#36) Add missing import statemet (sonic-net/sonic-platform-daemons#32) sonic_xcvrd: Support for DOM Threshold values for EEPROM dump (sonic-net/sonic-platform-daemons#29)
[xcvrd] backport PR(#39) "Enhance xcvrd to handle new system level event/error" to 201811
…ic-net#3333) [sonic-platform-common] [sonic_sfp] Interpret sff 'int' element =0 as valid value (sonic-net/sonic-platform-common#51) add more error code to get_transceiver_change_event ((sonic-net/sonic-platform-common#50) [sonic_platform_base] support new-platform-api-based daemons ((sonic-net/sonic-platform-common#48) sync change to sonic_platform_base/sonic_sfp and create symbol link ((sonic-net/sonic-platform-common#49) Add parser support for Tx_RxLos,TxFault, PowerControl, ResetStatus in sff8436.py ((sonic-net/sonic-platform-common#45) readd type_abbrv_name in sonic_sfp/sff8436.py ((sonic-net/sonic-platform-common#44) [psu_base] get_status_led() returns current state of the status LED ((sonic-net/sonic-platform-common#39) Fix abbrv name for OSFP ((sonic-net/sonic-platform-common#36) [sff8436] support "Control Bytes" and "Options" ((sonic-net/sonic-platform-common#38) sonic_sfp: avoid possible key error in get_physical_to_logical() ((sonic-net/sonic-platform-common#37) [sonic-platform-daemons] [xcvrd] Enhance xcvrd to handle new system level event/error (sonic-net/sonic-platform-daemons#39) [xcvrd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#38) [psud] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#37) [syseepromd] Support both new platform API and old platform plugins (sonic-net/sonic-platform-daemons#36) Add missing import statemet (sonic-net/sonic-platform-daemons#32) sonic_xcvrd: Support for DOM Threshold values for EEPROM dump (sonic-net/sonic-platform-daemons#29)
This PR has a dependency on PRs:
sonic-net/sonic-platform-common#50
sonic-net/sonic-buildimage#3261
In the above PRs introduced the new system-level event, this PR to enhance xcvrd to be more robust by handling the more new system-level event:
wait for a certain period on system_not_ready event to overcome race condition when xcvrd started but system not ready.
start to routing work on system_become_ready event, this event indicates that the system is ready for SFP change event report.
exit on system_fail event