Rake-backed data migrations ("shifts") for Rails apps, with dry run by default, progress output, and a consistent summary. Define shift classes in lib/data_shifts/*.rb; run them as rake data:shift:<task_name>.
```ruby
# Gemfile
gem "data_shifter"
```

```shell
bundle install
```

No extra setup in a Rails app: the railtie registers the generator and defines rake tasks by scanning `lib/data_shifts/*.rb`.
Generate a shift (optionally scoped to a model):
```shell
bin/rails generate data_shift backfill_foo
bin/rails generate data_shift backfill_users --model User
```

Add your logic to the generated file in `lib/data_shifts/`.
Run it:
```shell
rake data:shift:backfill_foo            # dry run (default)
COMMIT=1 rake data:shift:backfill_foo   # commit changes
```

For systemic migrations across many records, implement:

- `collection`: an `ActiveRecord::Relation` (uses `find_each`) or an `Array`/`Enumerable`
- `process_record(record)`: applies the change for one record

```ruby
module DataShifts
  class BackfillCanceledById < DataShifter::Shift
    description "Backfill canceled_by_id"

    def collection
      Bar.where(canceled_by_id: nil).where.not(canceled_at: nil)
    end

    def process_record(bar)
      bar.update!(canceled_by_id: bar.company.primary_contact_id)
    end
  end
end
```

For targeted changes to specific records (e.g. fixing a bug for particular IDs), use `task` blocks instead:

```ruby
module DataShifts
  class FixOrderDiscrepancies < DataShifter::Shift
    description "Fix order #1234 shipping and billing issues"

    task "Correct shipping address" do
      order.update!(shipping_address: "123 Main St")
    end

    task "Apply missing discount" do
      order.update!(discount_cents: 500)
    end

    private

    def order
      @order ||= Order.find(1234)
    end
  end
end
```

Task blocks run in the context of the shift instance, so they have access to private helper methods, `dry_run?`, `log`, `skip!`, `find_exactly!`, and any other instance methods you define. Use private methods to DRY up shared lookups across tasks.

Task blocks:
- Run in sequence within the same lifecycle (transaction, dry run protection, summary)
- Default to single transaction (all tasks commit or roll back together); use `transaction :per_record` for per-task transactions

Generate a task-based shift with:
```shell
bin/rails generate data_shift fix_order_1234 --task
```

Shifts run in dry run mode by default. DB changes are always rolled back in dry run mode, regardless of transaction setting.

- Dry run (default): `rake data:shift:backfill_foo`
- Commit: `COMMIT=1 rake data:shift:backfill_foo`
  - (`COMMIT=true` or `DRY_RUN=false` also commit)

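How the two variables interact is not spelled out here, so the following is only a minimal sketch of one way such flags could be interpreted. The helper name `commit_mode?` and the exact set of accepted spellings are assumptions made for illustration, not DataShifter's actual implementation.

```ruby
# Hypothetical helper: decide whether a run should commit, given an ENV-like hash.
# Assumption: "1"/"true" count as truthy and "0"/"false" as falsy (case-insensitive).
def commit_mode?(env)
  truthy = %w[1 true]
  falsy  = %w[0 false]

  commit  = env["COMMIT"].to_s.strip.downcase
  dry_run = env["DRY_RUN"].to_s.strip.downcase

  return true if truthy.include?(commit)  # COMMIT=1 or COMMIT=true
  return true if falsy.include?(dry_run)  # DRY_RUN=false

  false                                   # nothing set: stay in dry run
end

puts commit_mode?({})                        # false
puts commit_mode?({ "COMMIT" => "1" })       # true
puts commit_mode?({ "DRY_RUN" => "false" })  # true
```

The point of the sketch is the default: with no variables set, the run stays in dry run mode.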
In dry run mode, DataShifter automatically blocks or fakes these side effects so unguarded code is less likely to hit the network or send mail/jobs:
| Service | Behavior in dry run |
|---|---|
| HTTP | Blocked via WebMock (disable_net_connect!). Allow specific hosts with allow_external_requests [...] or DataShifter.config.allow_external_requests. |
| ActionMailer | perform_deliveries = false (restored after run). |
| ActiveJob | Queue adapter set to :test (restored after run). |
| Sidekiq | Sidekiq::Testing.fake! (restored with disable! after run). Only applied if Sidekiq::Testing is already loaded. |
Guarding other side effects: For anything we don't cover (e.g. another service, or allowed HTTP that mutates), use e.g. return if dry_run? in your shift. DB changes are always rolled back in dry run; only non-DB side effects need this.
To allow HTTP to specific hosts during dry run (e.g. a migration that must call an API to compute values), use the per-shift DSL or global config. NOTE: it is your responsibility to ensure you only make read-only requests in dry run mode.
```ruby
# Per shift
module DataShifts
  class BackfillFromApi < DataShifter::Shift
    allow_external_requests ["api.readonly.example.com", %r{\.internal\.company\z}]
    # ...
  end
end
```

```ruby
# Global (e.g. in config/initializers/data_shifter.rb)
DataShifter.configure do |config|
  config.allow_external_requests = ["api.readonly.example.com"]
end
```

Allowed hosts are combined (per-shift + global). The original WebMock, mail, and job settings are restored in an `ensure`, so later code and other specs are unaffected.

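As a rough illustration of what "combined" could mean, here is a minimal sketch that merges the two lists and checks a host against string and regular-expression entries. The `host_allowed?` helper and the `metrics.example.com` entry are hypothetical; the real matching is delegated to WebMock and may differ in detail.

```ruby
# Hypothetical sketch: combine per-shift and global allowlists, then test a host.
# Entries may be exact host strings or regular expressions.
def host_allowed?(host, per_shift:, global:)
  (per_shift + global).any? do |entry|
    entry.is_a?(Regexp) ? entry.match?(host) : entry == host
  end
end

per_shift = ["api.readonly.example.com", %r{\.internal\.company\z}]
global    = ["metrics.example.com"] # hypothetical global entry

puts host_allowed?("api.readonly.example.com", per_shift: per_shift, global: global) # true
puts host_allowed?("billing.internal.company", per_shift: per_shift, global: global) # true
puts host_allowed?("evil.example.org", per_shift: per_shift, global: global)         # false
```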
Set the transaction mode at the class level:
- `transaction :single` / `transaction true` (default): one DB transaction for the entire run; dry run rolls back at the end; a record error aborts the run.
- `transaction :per_record`: in commit mode, each record runs in its own transaction (errors are collected and the run continues); in dry run, the run is wrapped in a single rollback transaction.
- `transaction false` / `transaction :none`: no automatic transaction in commit mode. In dry run, the run is still wrapped in a single rollback transaction so DB changes are never committed. Use when you have external side effects or your own transaction strategy in commit mode.

```ruby
module DataShifts
  class BackfillLegacyId < DataShifter::Shift
    description "Per-record so one failure doesn't roll back all"
    transaction :per_record

    def collection = Item.where(legacy_id: nil)

    def process_record(item)
      item.update!(legacy_id: LegacyIdService.fetch(item))
    end
  end
end
```

```ruby
module DataShifts
  class SyncToExternal < DataShifter::Shift
    description "Side effects outside DB"
    transaction false

    def process_record(record)
      return if dry_run?

      record.update!(synced_at: Time.current)
      ExternalAPI.notify(record)
    end
  end
end
```

- Progress bar: enabled by default (requires `ruby-progressbar`), and only shown for collections with at least 5 records.
- Header: prints mode (DRY RUN vs LIVE), record count, transaction mode, and available status triggers.
- Live status (without aborting):
  - `STATUS_INTERVAL=60` prints a status block periodically (checked between records)
  - macOS/BSD: `Ctrl+T` (SIGINFO)
  - Any OS: `kill -USR1 <pid>` (SIGUSR1)

If your collection is an ActiveRecord::Relation, you can resume by filtering the primary key:
```shell
CONTINUE_FROM=123 COMMIT=1 rake data:shift:backfill_foo
```

Notes:

- Only supported for `ActiveRecord::Relation` collections (Array-based collections, like those from `find_exactly!`, cannot be resumed).
- The filter is `primary_key > CONTINUE_FROM`, so it's only useful with monotonically increasing primary keys iterated in ascending order (e.g. `find_each`'s default behavior).

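To make the strict comparison concrete, here is a minimal plain-Ruby sketch of the same filter applied to an in-memory list. The `resume_from` helper and the hash-based rows are stand-ins invented for illustration; a real shift would apply the condition to an `ActiveRecord::Relation`.

```ruby
# Hypothetical helper mirroring `primary_key > CONTINUE_FROM` on in-memory rows.
def resume_from(rows, continue_from)
  return rows if continue_from.nil? || continue_from.to_s.empty?

  cutoff = Integer(continue_from)
  rows.select { |row| row[:id] > cutoff }
end

rows = (121..125).map { |id| { id: id } }
remaining = resume_from(rows, "123")
puts remaining.map { |row| row[:id] }.inspect # [124, 125]
```

Because the comparison is strict, the record whose ID equals `CONTINUE_FROM` is not processed again, so the value to pass is the last ID that completed successfully.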
DataShifter defines one rake task per file in lib/data_shifts/*.rb.
- Task name: derived from the filename with any leading digits removed.
  - `20260201120000_backfill_foo.rb` → `data:shift:backfill_foo` (leading `<digits>_` prefix is stripped)
  - `backfill_foo.rb` → `data:shift:backfill_foo`
- Class name: task name camelized, inside the `DataShifts` module.
  - `backfill_foo` → `DataShifts::BackfillFoo`

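The naming convention can be sketched in a few lines of plain Ruby. Both helper names below are hypothetical, and the hand-rolled camelize is a simplification of ActiveSupport's `camelize` (it ignores custom inflections and acronyms), so treat this as an approximation of the rule rather than the gem's code.

```ruby
# Hypothetical helper: filename -> task name (leading "<digits>_" prefix removed).
def task_name_for(path)
  File.basename(path, ".rb").sub(/\A\d+_/, "")
end

# Plain-Ruby camelize so the sketch runs without ActiveSupport.
def class_name_for(task_name)
  "DataShifts::" + task_name.split("_").map(&:capitalize).join
end

name = task_name_for("lib/data_shifts/20260201120000_backfill_foo.rb")
puts "data:shift:#{name}"   # data:shift:backfill_foo
puts class_name_for(name)   # DataShifts::BackfillFoo
```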
Shift files are required only when the task runs (tasks are defined up front; classes load lazily).
The description "..." line is extracted from the file and used for rake -T output without loading the shift class.
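One plausible way to read a description without loading the class is a simple text scan. The sketch below assumes the description is a single quoted string literal on one line; the `extract_description` helper is hypothetical and the gem's actual parsing may handle more cases.

```ruby
# Hypothetical sketch: read a description from source text without evaluating it.
# Assumption: a single- or double-quoted string literal on the same line.
def extract_description(source)
  match = source.match(/^\s*description\s+(["'])(.*?)\1/)
  match && match[2]
end

source = <<~RUBY
  module DataShifts
    class BackfillFoo < DataShifter::Shift
      description "Backfill foo on legacy rows"
    end
  end
RUBY

puts extract_description(source) # Backfill foo on legacy rows
```

A text scan like this would miss descriptions built dynamically (for example, through interpolation or a constant), which is a good reason to keep the `description` line a plain string literal.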
Configure DataShifter globally in an initializer:
```ruby
# config/initializers/data_shifter.rb
DataShifter.configure do |config|
  # Hosts allowed for HTTP during dry run only (no effect in commit mode)
  config.allow_external_requests = ["api.readonly.example.com"]

  # Suppress repeated log messages during a shift run (default: true)
  config.suppress_repeated_logs = true

  # Max unique messages to track for deduplication (default: 1000)
  config.repeated_log_cap = 1000

  # Global default for progress bar visibility (default: true)
  config.progress_enabled = true

  # Default status print interval in seconds when ENV STATUS_INTERVAL is not set (default: nil)
  config.status_interval_seconds = nil
end
```

Per-shift overrides:

```ruby
class MyShift < DataShifter::Shift
  progress false               # Disable progress bar for this shift
  suppress_repeated_logs false # Disable log deduplication for this shift
end
```

- Start with a dry run: run the task once with no environment variables set, confirm logs and summary look right, then re-run with `COMMIT=1`.
- Make shifts idempotent: structure `process_record` so re-running is safe (for example, update only when the target column is `NULL`, or compute the same derived value deterministically).
- Guard side effects we don't auto-block: use `return if dry_run?` for any side effect not covered by Automatic side-effect guards (see above).

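The idempotency advice can be illustrated without a database. In this minimal sketch, a hash stands in for a model row and `backfill_legacy_id` is a hypothetical helper; a real shift would check the column and call `update!` inside `process_record`.

```ruby
# Hypothetical stand-in for a model row; only writes when the target is still nil.
def backfill_legacy_id(record, new_value)
  # A second run finds the column already set and does nothing.
  return :skipped unless record[:legacy_id].nil?

  record[:legacy_id] = new_value
  :updated
end

record = { id: 1, legacy_id: nil }
puts backfill_legacy_id(record, "L-1") # updated
puts backfill_legacy_id(record, "L-2") # skipped
puts record[:legacy_id]                # L-1
```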
- `transaction :single` (default):
  - Behavior: the first raised error aborts the run (all-or-nothing).
  - Use when: partial success is worse than failure, or you want a clean rollback on any unexpected error.
- `transaction :per_record`:
  - Behavior: in commit mode, records are committed one-by-one; errors are collected and the run continues; the overall run fails at the end if any record failed.
  - Use when: you want maximum progress and are OK investigating/fixing a subset of failures.
- `transaction false` / `:none`:
  - Behavior: in commit mode, no automatic transaction; in dry run, the run is still wrapped in a rollback transaction so DB changes are not committed.
  - Use when: you have intentional external side effects or your own transaction/locking strategy in commit mode.

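The difference in error handling between `:single` and `:per_record` can be modeled with a small runner. This is a sketch under stated assumptions: `run_records` is a hypothetical function, it opens no real transactions, and it only shows which records are processed and when the run stops.

```ruby
# Hypothetical runner contrasting the error handling of the two modes.
# It models control flow only; no database or transaction is involved.
def run_records(records, mode:)
  processed = []
  errors = []

  records.each do |record|
    begin
      yield record
      processed << record
    rescue StandardError => e
      raise if mode == :single       # first error aborts the whole run
      errors << [record, e.message]  # :per_record collects and continues
    end
  end

  # The run is only "ok" when no record failed.
  { processed: processed, errors: errors, ok: errors.empty? }
end

work = ->(n) { raise ArgumentError, "bad record #{n}" if n == 2 }

result = run_records([1, 2, 3], mode: :per_record, &work)
puts "processed=#{result[:processed].inspect} failed=#{result[:errors].size} ok=#{result[:ok]}"

begin
  run_records([1, 2, 3], mode: :single, &work)
rescue ArgumentError => e
  puts "aborted: #{e.message}"
end
```

In `:single` mode with a real transaction, the abort would also roll back record 1, which this in-memory model does not attempt to show.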
- Prefer returning an `ActiveRecord::Relation` from `collection` for large datasets (DataShifter iterates relations with `find_each`).
- Be aware `count` happens up front for relations to print the header and size the progress bar. On very large/expensive relations, that extra query may be non-trivial.
- Use status output for long runs: set `STATUS_INTERVAL` in environments where signals are awkward (for example, some process managers).

Use find_exactly!(Model, ids) to fetch a fixed list and raise if any are missing:
```ruby
def collection
  ids = ENV.fetch("BUYBACK_IDS").split(",").map(&:strip)
  find_exactly!(Buyback, ids)
end

def process_record(buyback)
  buyback.recompute!
end
```

To mark a record as skipped, call `skip!`. It terminates the current `process_record` immediately (no `return` needed), and the record is counted as "Skipped" in the summary.

```ruby
def process_record(record)
  skip!("already done") if record.foo.present?
  record.update!(foo: value) # not executed if skipped
end
```

Skip reasons are grouped: the summary shows the top 10 reasons by count (e.g. "already done" (42), "not eligible" (3)) instead of logging each skip inline. This keeps the progress bar clean.

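The grouping described above amounts to a tally followed by a sort. A minimal sketch, assuming skip reasons are collected as plain strings; `top_skip_reasons` is a hypothetical helper, not part of the gem's API.

```ruby
# Hypothetical summary helper: group skip reasons and keep the most frequent ones.
def top_skip_reasons(reasons, limit: 10)
  reasons.tally.sort_by { |_reason, count| -count }.first(limit)
end

reasons = ["already done"] * 42 + ["not eligible"] * 3
top_skip_reasons(reasons).each do |reason, count|
  puts %("#{reason}" (#{count}))
end
# "already done" (42)
# "not eligible" (3)
```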
```ruby
class SomeShift < DataShifter::Shift
  throttle 0.1   # sleep seconds between records
  progress false # disable progress bar rendering
end
```

| Command | Generates |
|---|---|
| `bin/rails generate data_shift backfill_foo` | `lib/data_shifts/<timestamp>_backfill_foo.rb` with a `DataShifts::BackfillFoo` class |
| `bin/rails generate data_shift backfill_users --model User` | Same, with `User.all` in `collection` and `process_record(user)` |
| `bin/rails generate data_shift backfill_users --spec` | Also generates `spec/lib/data_shifts/backfill_users_spec.rb` when RSpec is enabled |
| `bin/rails generate data_shift fix_order_1234 --task` | Generates a shift with a `task` block instead of `collection`/`process_record` |

The generator refuses to create a second shift if it would produce a duplicate rake task name.
This gem ships a small helper module for running shifts in tests. Require it and include DataShifter::SpecHelper in specs or in RSpec.configure for type: :data_shift.
Helpers:
- `run_data_shift(shift_class, dry_run: true, commit: false)`: Runs the shift; returns an `Axn::Result`. Use `commit: true` to run in commit mode.
- `silence_data_shift_output`: Suppresses STDOUT for the block (e.g. progress bar).
- `capture_data_shift_output`: Runs the block and returns `[result, output_string]` for asserting on printed output.

Use expect { ... }.not_to change(...) and expect { ... }.to change(...) to assert that data stays unchanged in dry run and changes when committed:
```ruby
require "data_shifter/spec_helper"

RSpec.describe DataShifts::BackfillFoo do
  include DataShifter::SpecHelper

  before { allow($stdout).to receive(:puts) }

  it "does not persist changes in dry run" do
    expect do
      result = run_data_shift(described_class, dry_run: true)
      expect(result).to be_ok
    end.not_to change(Foo, :count)
  end

  it "persists changes when committed" do
    expect do
      result = run_data_shift(described_class, commit: true)
      expect(result).to be_ok
    end.to change(Foo, :count).by(1)
    # Or for in-place updates: .to change { record.reload.bar }.from(nil).to("baz")
  end
end
```

- Ruby ≥ 3.2.1
- Rails (ActiveRecord, ActiveSupport, Railties) ≥ 7.0
- `axn` (Shift classes include `Axn`)
- `ruby-progressbar` (for progress bars)
- `webmock` (for dry-run HTTP blocking; optional allowlist via `allow_external_requests [...]` / `DataShifter.config.allow_external_requests`)