Skip to content

[Post-Mortem] Deserialization errors caused by Rails update #2748

@philippthun

Description

@philippthun

Symptom

We have a monitor for cf push; that means we push a simple application once every 5 minutes. This monitor detected a "downtime" while we were updating our largest internal foundation from CAPI 1.124.0 to 1.126.0.

cf-push-monitor

Failure

The errors lasted for about 1.5 hours. When looking into the logs of the monitor, it could be seen that cf push did not fail, but did not succeed in time (it has a timeout of less than 5 minutes).

The "downtime" correlated with the update of api and cc-worker vms. When searching for logs produced for a cf push that timed out, a deserialization error could be found:

Mar 24, 2022 @ 18:29:29.407	(0.000888s) (lib/delayed_job_plugins/deserialization_retry.rb:12:in `block (2 levels) in <class:DeserializationRetry>') (query_length=21114) UPDATE "delayed_jobs" SET "guid" = '920c6540-0f2d-4f46-91dc-2769fefec587', "created_at" = '2022-03-24 18:29:29.250646+0000', "updated_at" = CURRENT_TIMESTAMP, "priority" = 0, "attempts" = 1, "handler" = '--- !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
  handler: !ruby/object:VCAP::CloudController::Jobs::PollableJobWrapper
    existing_guid:
    handler: !ruby/object:VCAP::CloudController::Jobs::SpaceApplyManifestActionJob
[...]
', "last_error" = 'Job failed to load: undefined method `default_proc='' for nil:NilClass.'

Root Cause

Further analyses showed that the root cause for this error is the Ruby on Rails update from version 5.x to 6.x that included a change in the error model (rails/rails/pull/32313). The serialization of an empty ActiveModel::Errors object changed from:

--- !ruby/object:ActiveModel::Errors
  messages: {}
  details: {}

to:

--- !ruby/object:ActiveModel::Errors
  errors: []

The nil:NilClass error is raised when a cc-worker running Rails 5 tries to deserialize a job created by an api node already updated to Rails 6 as the payload does not contain messages anymore.

sequenceDiagram
  participant API 1.124.0
  participant WORKER 1.124.0
  participant CCDB
  participant API 1.126.0
  participant WORKER 1.126.0

  rect rgba(0, 255, 0, .4)
    API 1.124.0->>CCDB: serialize job
    CCDB->>WORKER 1.124.0: deserialize job
    note over WORKER 1.124.0: success
  end

  note over API 1.124.0,WORKER 1.126.0: update started

  rect rgba(255, 0, 0, .4)
    API 1.126.0->>CCDB: serialize job
    CCDB->>WORKER 1.124.0: deserialize job
    note over WORKER 1.124.0: failure
  end

  note over API 1.124.0,WORKER 1.126.0: update finished

  rect rgba(0, 255, 0, .4)
    API 1.126.0->>CCDB: serialize job
    CCDB->>WORKER 1.126.0: deserialize job
    note over WORKER 1.126.0: success
  end
Loading

Mitigation

There already is a mitigation in place, i.e. the DeserializationRetry plugin that ensures that all jobs failing with Job failed to load are rescheduled.

Although this retry mechanism works as intended, it could not prevent the significant service degradation we experienced. Due to the fact that we have many api nodes (far more than 100), it simply takes some time until all of them are updated. And only after all api nodes have been updated, the cc-worker vms are in line.

So when the job was finally processed by an updated cc-worker, it failed as the app was already deleted. This is due to the fact that the cf push monitor deletes the app in the end and also cleans up failed apps from previous runs:

# Mar 24, 2022 @ 19:59:33.598	(0.001951s) (51086a20d7be919cc370b15e4225a7ea526e809f/gem_home/ruby/2.7.0/gems/talentbox-delayed_job_sequel-4.3.0/lib/delayed/backend/sequel.rb:111:in `save!') (query_length=19799) UPDATE "delayed_jobs" SET "guid" = '920c6540-0f2d-4f46-91dc-2769fefec587', "created_at" = '2022-03-24 18:29:29.250646+0000', "updated_at" = CURRENT_TIMESTAMP, "priority" = 1, "attempts" = 19, "handler" = '--- !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
  handler: !ruby/object:VCAP::CloudController::Jobs::PollableJobWrapper
    existing_guid:
    handler: !ruby/object:VCAP::CloudController::Jobs::SpaceApplyManifestActionJob
[...]
', "last_error" = 'App with guid ''279c616d-2e8c-4cfa-9eca-bda9ec464740'' not found'

Fix

Before continuing to rollout the update to customer-facing foundations, we decided to patch the deserialization to be compatible between Rails versions 5 and 6.

Therefore we cherry-picked this commit onto CAPI 1.124.0 and 1.126.0.

Depending on the Rails version (Rails::VERSION::MAJOR) it monkey patches ActiveModel::Errors.init_with either with a version that can also deserialize errors serialized with Rails 6 or those serialized with Rails 5.

The error outlined above could be prevented by solely making the old CAPI release capable of deserializing Rails 6 errors; but as there is also the possibility that jobs created by an api node running Rails 5 are still in the database after the update (e.g. due to retries of failed jobs), we also patched the new CAPI release to be able to deserialize Rails 5 errors.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions