Warm reboot: restore the database docker with content saved #2216
lguohan merged 8 commits into sonic-net:master
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Co-Authored-By: qiluo-msft <qiluo-msft@users.noreply.github.com>
if [[ "$REBOOT_TYPE" == "warm" && -d /host/warmboot ]]; then
    WARM_DIR=/host/warmboot
    function redisLoadAndDelete()
    {
This function also needs to take the database ID as a parameter. #Resolved
    function redisLoadAndDelete()
    {
        FILENAME="$1"
        test -e $FILENAME && redis-load -s /var/run/redis/redis.sock -e EMPTY $FILENAME && rm $FILENAME
A few issues from testing:
- rm always fails in this function. You need to issue "sudo rm" to get it to work.
- "-s /var/run/redis/redis.sock" always causes the import to fail. Removing this option works better.
- The import fails randomly. I am still looking for a way to make it work reliably. This service is crucial, so it has to be reliable.
- I think you shouldn't use the '&&' notation. We want to remove these files regardless of whether the import succeeded, right? I don't think we should retry warm boot if any failure was encountered. #Resolved
Just a thought:
Maybe we should catch these db restore failures and in case of failure, clear the database and continue with a regular boot up? #Resolved
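The fallback suggested here could look roughly like the following. This is a minimal sketch of the reviewer's idea, not the behavior the PR adopted; `load_into_db`, `flush_all_dbs`, and `restore_or_fall_back` are hypothetical names, and the stub loader only checks that the file is non-empty instead of calling redis-load.

```shell
#!/bin/bash
# Hypothetical stand-in for the real import command (for example a redis-load
# invocation); replace the body with the actual loader.
load_into_db() {
    local db_id="$1" file="$2"
    [ -s "$file" ]   # pretend the import succeeds only for non-empty files
}

# Hypothetical stand-in for clearing every database after a failed restore.
flush_all_dbs() {
    echo "flushing all databases"
}

# Restore each "db_id:file" pair in order. On the first failure, flush
# everything and print "cold" so the caller continues with a regular boot.
# Prints "warm" only when every import succeeds.
restore_or_fall_back() {
    local pair db_id file
    for pair in "$@"; do
        db_id="${pair%%:*}"
        file="${pair#*:}"
        if ! load_into_db "$db_id" "$file"; then
            flush_all_dbs >&2
            echo "cold"
            return 0
        fi
    done
    echo "warm"
}
```

A caller could branch on the printed value to decide whether the remaining services start in warm or cold mode. The trade-off raised in this thread still applies: a silent fallback hides the restore failure, while failing the service makes it visible.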
- rm fixed
- redis-load fixed. If there are any more failure cases, let me know.
- I cannot agree to make it retry blindly. I made it exit immediately; if an error occurs in the normal case, we should fix it.
I made it exit immediately; if an error occurs in the normal case, we should fix it.
My concern is that if the database service fails in production, the device will be in a failed state while the ASIC is still forwarding. I am not sure this is better than coming up with a cold start and suffering a short IO disruption.
Co-Authored-By: qiluo-msft <qiluo-msft@users.noreply.github.com>
Co-Authored-By: qiluo-msft <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
84b2815 to f6c7a64
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
@qiluo-msft, can you provide a description for your commit? #Resolved
    echo $1 | python -c "import sys, json, os; mnts = [x for x in json.load(sys.stdin)[0]['Mounts'] if x['Destination'] == '/usr/share/sonic/hwsku']; print '' if len(mnts) == 0 else os.path.basename(mnts[0]['Source'])" 2>/dev/null
}

function getRebootType()

function postStartAction()
{
    REBOOT_TYPE=`getRebootType`
        $SUDO rm $FILENAME || exit 12
    }
    # Load applDB from /host/warm-reboot/appl_db.json
    redisLoadAndDelete $WARM_DIR/appl_db.json
where is the DB argument? #Resolved
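A version of the helper that takes the database ID as its first argument might be sketched as follows. This is an illustration under assumptions, not the merged code: `import_db_file` is a hypothetical stub standing in for the redis-load call, `load_and_delete` is a hypothetical name, return codes 10 and 11 are made up for the example, and only the code 12 for a failed rm appears in the diff. The sketch returns a status instead of calling `exit` so that it can be exercised outside the service script.

```shell
#!/bin/bash
# Hypothetical stand-in for the real import command; the actual script calls
# redis-load, whose exact options are not reproduced here.
import_db_file() {
    local db_id="$1" file="$2"
    echo "importing $file into database $db_id"
}

# Load one saved file into the given database, then delete the file.
# Each step has its own return code so the failing step can be identified:
#   10 = file missing, 11 = import failed, 12 = delete failed.
load_and_delete() {
    local db_id="$1" file="$2"
    [ -e "$file" ] || return 10
    import_db_file "$db_id" "$file" || return 11
    # SUDO may be set to "sudo" by the caller; it is empty by default here.
    ${SUDO:-} rm -- "$file" || return 12
}
```

A service script that wants the fail-fast behavior discussed in this review could call `load_and_delete 6 "$WARM_DIR/state_db.json" || exit $?`.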
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
    # Load stateDB from /host/warm-reboot/state_db.json
    redisLoadAndDelete 6 $WARM_DIR/state_db.json
    # Load asicDB from /host/warm-reboot/asic_db.json
    redisLoadAndDelete 1 $WARM_DIR/asic_db.json
Another thing came to mind: I think we should check that all files exist before proceeding with the restoration. If any file is missing, something is wrong. We should restore all or nothing. Do you agree?
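The all-or-nothing check proposed here could be sketched as a pre-flight test that runs before any import. The function name `all_files_present` is hypothetical, and this check is not part of the PR as merged.

```shell
#!/bin/bash
# Return 0 only if every listed file exists under the given directory.
# On the first missing file, print its path to stderr and return 1.
all_files_present() {
    local dir="$1" name
    shift
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            return 1
        fi
    done
    return 0
}

# Example call, using the file names that appear in the diff:
#   all_files_present "$WARM_DIR" appl_db.json state_db.json asic_db.json || exit 1
```

Running the check first means no database is touched unless the full set of saved files is available, which avoids a partially restored state.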
The current implementation treats this case as a service start failure. Later we can refine it with more robust recovery.
This submodule update brings in the following changes:
```
50d5be2 Make changes to support compiling on Bullseye with GCC 10 (sonic-net#2216)
0870cf5 [mirrororch]: Implement HW resources availability validation for SPAN/ERSPAN (sonic-net#2187)
f4ec565 [vlanmgrd] fix use-after-free memory issue (sonic-net#2211)
c2de7fc [QosOrch] The notifications cannot be drained in QosOrch in case the first one needs to retry (sonic-net#2206)
5575935 [neighsyncd] increase neighsyncd timeout (sonic-net#2209)
0f06910 [PBH] Implement Edit Flows (sonic-net#2169)
6241bbf Remove redundant and problematic code to skip "pool" field in buffer profile handling (sonic-net#2197)
a55343c [azp]: Set diff coverage threshhold to 80% (sonic-net#2188)
390cae1 [portsorch]: Prevent LAG member configuration when port has active ACL binding (sonic-net#2165)
c1d47e6 [VNET]Fixing nexthop group delete during route change (sonic-net#2198)
8941cc0 [BFD]Registering BFD state change callback during session creation (sonic-net#2202)
680c539 [vxlan] Remove tunnel map objects on VNET tunnel removal (sonic-net#2150)
20dde0c Fix for handling broadcom DNX ASIC to have ipv4 and ipv6 ACL rules in separate tables. (sonic-net#2178)
5b7c949 [FdbOrch] SAI_FDB_EVENT_MOVE generates update with empty update.entry.port_name (sonic-net#2200)
7350d49 [Vxlanmgr] vnet netdev cleanup during config reload fix (sonic-net#2191)
2bef62b Validate LAG has members before mirror session create (sonic-net#2130)
1e4d4ce [VS test] Increase VS test time, skip dpb flaky test (sonic-net#2195)
6eda965 [vstest]Migrating vs tests from using click commands to direct DB access (sonic-net#2179)
```
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…2216)
Types of changes done:
* Add missing includes in header files and .cpp files
* Don't use parentheses when doing list initialization in constructors
* Make sure variables are initialized before first use
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Restore the database docker with the content saved during the 'warm-reboot' command. If anything fails, the database service fails immediately.