[Post-Mortem] Deserialization errors caused by Rails update

## Symptom

We have a monitor for `cf push`; that means we push a simple application once every 5 minutes. This monitor detected a "downtime" while we were updating our largest internal foundation from CAPI 1.124.0 to 1.126.0.

<img width="1179" alt="cf-push-monitor" src="https://user-images.githubusercontent.com/618301/161284214-8e9a760f-137d-4c61-9d39-54cd1a71d04c.png">

## Failure

The errors lasted for about 1.5 hours. When looking into the logs of the monitor, it could be seen that `cf push` did not fail, but did not succeed in time (it has a timeout of less than 5 minutes).

The "downtime" correlated with the update of `api` and `cc-worker` vms. When searching for logs produced for a `cf push` that timed out, a deserialization error could be found:
```
Mar 24, 2022 @ 18:29:29.407	(0.000888s) (lib/delayed_job_plugins/deserialization_retry.rb:12:in `block (2 levels) in <class:DeserializationRetry>') (query_length=21114) UPDATE "delayed_jobs" SET "guid" = '920c6540-0f2d-4f46-91dc-2769fefec587', "created_at" = '2022-03-24 18:29:29.250646+0000', "updated_at" = CURRENT_TIMESTAMP, "priority" = 0, "attempts" = 1, "handler" = '--- !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
  handler: !ruby/object:VCAP::CloudController::Jobs::PollableJobWrapper
    existing_guid:
    handler: !ruby/object:VCAP::CloudController::Jobs::SpaceApplyManifestActionJob
[...]
', "last_error" = 'Job failed to load: undefined method `default_proc='' for nil:NilClass.'
```

## Root Cause

Further analyses showed that the root cause for this error is the Ruby on Rails update from version 5.x to 6.x that included a change in the error model (rails/rails/pull/32313). The serialization of an empty `ActiveModel::Errors` object changed from:
```
--- !ruby/object:ActiveModel::Errors
  messages: {}
  details: {}
```
to:
```
--- !ruby/object:ActiveModel::Errors
  errors: []
```

The `nil:NilClass` error is raised when a `cc-worker` running Rails 5 tries to deserialize a job created by an `api` node already updated to Rails 6 as the payload does not contain `messages` anymore.

```mermaid
sequenceDiagram
  participant API 1.124.0
  participant WORKER 1.124.0
  participant CCDB
  participant API 1.126.0
  participant WORKER 1.126.0

  rect rgba(0, 255, 0, .4)
    API 1.124.0->>CCDB: serialize job
    CCDB->>WORKER 1.124.0: deserialize job
    note over WORKER 1.124.0: success
  end

  note over API 1.124.0,WORKER 1.126.0: update started

  rect rgba(255, 0, 0, .4)
    API 1.126.0->>CCDB: serialize job
    CCDB->>WORKER 1.124.0: deserialize job
    note over WORKER 1.124.0: failure
  end

  note over API 1.124.0,WORKER 1.126.0: update finished

  rect rgba(0, 255, 0, .4)
    API 1.126.0->>CCDB: serialize job
    CCDB->>WORKER 1.126.0: deserialize job
    note over WORKER 1.126.0: success
  end
```

## Mitigation

There already is a mitigation in place, i.e. the [DeserializationRetry plugin](/cloudfoundry/cloud_controller_ng/blob/main/lib/delayed_job_plugins/deserialization_retry.rb) that ensures that all jobs failing with `Job failed to load` are rescheduled.

Although this retry mechanism works as intended, it could not prevent the significant service degradation we experienced. Due to the fact that we have many `api` nodes (far more than 100), it simply takes some time until all of them are updated. And only after all `api` nodes have been updated, the `cc-worker` vms are in line.

So when the job was finally processed by an updated `cc-worker`, it failed as the app was already deleted. This is due to the fact that the `cf push` monitor deletes the app in the end and also cleans up failed apps from previous runs:

```
# Mar 24, 2022 @ 19:59:33.598	(0.001951s) (51086a20d7be919cc370b15e4225a7ea526e809f/gem_home/ruby/2.7.0/gems/talentbox-delayed_job_sequel-4.3.0/lib/delayed/backend/sequel.rb:111:in `save!') (query_length=19799) UPDATE "delayed_jobs" SET "guid" = '920c6540-0f2d-4f46-91dc-2769fefec587', "created_at" = '2022-03-24 18:29:29.250646+0000', "updated_at" = CURRENT_TIMESTAMP, "priority" = 1, "attempts" = 19, "handler" = '--- !ruby/object:VCAP::CloudController::Jobs::LoggingContextJob
handler: !ruby/object:VCAP::CloudController::Jobs::TimeoutJob
  handler: !ruby/object:VCAP::CloudController::Jobs::PollableJobWrapper
    existing_guid:
    handler: !ruby/object:VCAP::CloudController::Jobs::SpaceApplyManifestActionJob
[...]
', "last_error" = 'App with guid ''279c616d-2e8c-4cfa-9eca-bda9ec464740'' not found'
```

## Fix

Before continuing to rollout the update to customer-facing foundations, we decided to patch the deserialization to be compatible between Rails versions 5 and 6.

Therefore we cherry-picked [this commit](/sap-contributions/cloud_controller_ng/commit/86d25b3a065dd999378290980822c5d94db0ecbc) onto CAPI 1.124.0 and 1.126.0.

Depending on the Rails version (`Rails::VERSION::MAJOR`) it monkey patches `ActiveModel::Errors.init_with` either with a version that can also deserialize errors serialized with Rails 6 or those serialized with Rails 5.

The error outlined above could be prevented by solely making the old CAPI release capable of deserializing Rails 6 errors; but as there is also the possibility that jobs created by an `api` node running Rails 5 are still in the database after the update (e.g. due to retries of failed jobs), we also patched the new CAPI release to be able to deserialize Rails 5 errors.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Post-Mortem] Deserialization errors caused by Rails update #2748

Symptom

Failure

Root Cause

Mitigation

Fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Post-Mortem] Deserialization errors caused by Rails update #2748

Description

Symptom

Failure

Root Cause

Mitigation

Fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions