-
Notifications
You must be signed in to change notification settings - Fork 368
Description
Symptom
We have a monitor for cf push; that means we push a simple application once every 5 minutes. This monitor detected a "downtime" while we were updating our largest internal foundation from CAPI 1.124.0 to 1.126.0.
Failure
The errors lasted for about 1.5 hours. When looking into the logs of the monitor, it could be seen that cf push did not fail, but did not succeed in time (it has a timeout of less than 5 minutes).
The "downtime" correlated with the update of api and cc-worker vms. When searching for logs produced for a cf push that timed out, a deserialization error could be found:
Mar 24, 2022 @ 18:29:29.407 (0.000888s) (lib/delayed_job_plugins/deserialization_retry.rb:12:in `block (2 levels) in <class:DeserializationRetry>') (query_length=21114) UPDATE "delayed_jobs" SET "guid" = '920c6540-0f2d-4f46-91dc-2769fefec587', "created_at" = '2022-03-24 18:29:29.250646+0000', "updated_at" = CURRENT_TIMESTAMP, "priority" = 0, "attempts" = 1, "handler" = '--- !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
handler: !ruby/object:VCAP::CloudController::Jobs::PollableJobWrapper
existing_guid:
handler: !ruby/object:VCAP::CloudController::Jobs::SpaceApplyManifestActionJob
[...]
', "last_error" = 'Job failed to load: undefined method `default_proc='' for nil:NilClass.'
Root Cause
Further analyses showed that the root cause for this error is the Ruby on Rails update from version 5.x to 6.x that included a change in the error model (rails/rails/pull/32313). The serialization of an empty ActiveModel::Errors object changed from:
--- !ruby/object:ActiveModel::Errors
messages: {}
details: {}
to:
--- !ruby/object:ActiveModel::Errors
errors: []
The nil:NilClass error is raised when a cc-worker running Rails 5 tries to deserialize a job created by an api node already updated to Rails 6 as the payload does not contain messages anymore.
sequenceDiagram
participant API 1.124.0
participant WORKER 1.124.0
participant CCDB
participant API 1.126.0
participant WORKER 1.126.0
rect rgba(0, 255, 0, .4)
API 1.124.0->>CCDB: serialize job
CCDB->>WORKER 1.124.0: deserialize job
note over WORKER 1.124.0: success
end
note over API 1.124.0,WORKER 1.126.0: update started
rect rgba(255, 0, 0, .4)
API 1.126.0->>CCDB: serialize job
CCDB->>WORKER 1.124.0: deserialize job
note over WORKER 1.124.0: failure
end
note over API 1.124.0,WORKER 1.126.0: update finished
rect rgba(0, 255, 0, .4)
API 1.126.0->>CCDB: serialize job
CCDB->>WORKER 1.126.0: deserialize job
note over WORKER 1.126.0: success
end
Mitigation
There already is a mitigation in place, i.e. the DeserializationRetry plugin that ensures that all jobs failing with Job failed to load are rescheduled.
Although this retry mechanism works as intended, it could not prevent the significant service degradation we experienced. Due to the fact that we have many api nodes (far more than 100), it simply takes some time until all of them are updated. And only after all api nodes have been updated, the cc-worker vms are in line.
So when the job was finally processed by an updated cc-worker, it failed as the app was already deleted. This is due to the fact that the cf push monitor deletes the app in the end and also cleans up failed apps from previous runs:
# Mar 24, 2022 @ 19:59:33.598 (0.001951s) (51086a20d7be919cc370b15e4225a7ea526e809f/gem_home/ruby/2.7.0/gems/talentbox-delayed_job_sequel-4.3.0/lib/delayed/backend/sequel.rb:111:in `save!') (query_length=19799) UPDATE "delayed_jobs" SET "guid" = '920c6540-0f2d-4f46-91dc-2769fefec587', "created_at" = '2022-03-24 18:29:29.250646+0000', "updated_at" = CURRENT_TIMESTAMP, "priority" = 1, "attempts" = 19, "handler" = '--- !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
handler: !ruby/object:VCAP::CloudController::Jobs::PollableJobWrapper
existing_guid:
handler: !ruby/object:VCAP::CloudController::Jobs::SpaceApplyManifestActionJob
[...]
', "last_error" = 'App with guid ''279c616d-2e8c-4cfa-9eca-bda9ec464740'' not found'
Fix
Before continuing to rollout the update to customer-facing foundations, we decided to patch the deserialization to be compatible between Rails versions 5 and 6.
Therefore we cherry-picked this commit onto CAPI 1.124.0 and 1.126.0.
Depending on the Rails version (Rails::VERSION::MAJOR) it monkey patches ActiveModel::Errors.init_with either with a version that can also deserialize errors serialized with Rails 6 or those serialized with Rails 5.
The error outlined above could be prevented by solely making the old CAPI release capable of deserializing Rails 6 errors; but as there is also the possibility that jobs created by an api node running Rails 5 are still in the database after the update (e.g. due to retries of failed jobs), we also patched the new CAPI release to be able to deserialize Rails 5 errors.
