Reliable data migrations in Rails

From the Rails Guide to Active Record Migrations:

Migrations are a convenient way to alter your database schema over time in a consistent way.

As noted by Thoughtbot’s Data Migrations in Rails, the word “data” does not appear in this definition. Rails migrations are explicitly defined to manage database structural changes, that is, schema changes. Yet in common practice, Rails migrations typically refer to either changing the database schema, or moving data around in the database, occasionally both in the same migration. In other words, any change involving the database is regarded and discussed as a “Rails migration.”

So what’s the big deal?

In short, schema migrations are typically (not always) trivial to revert or roll back. In comparison, data migrations may involve data transformation as well as copying, and usually aren’t easily reversible. When the data are subject to regulatory constraint, the stakes get even higher: deleted data may need to be archived as well.

Let’s take as a given that it’s rarely a good idea to implement changes in the schema and move data in the same operation, and leave discussion of schema migrations to the existing documentation. For data migrations, we have the following:

  1. Data migrations take longer to run than schema migrations, especially if the data is large, or the data migration requires computation, or works with model associations.

  2. Testing data migrations by rolling them back can mean long run times, and they are difficult to test any other way.

  3. Rails schema migrations are designed for DDL responsibilities; using them for data migrations violates the Single Responsibility Principle.

  4. Mixing data changes into schema migrations induces maintenance issues around class renaming, attribute validation, and callback side effects.

A case can be made that the decision on how to migrate is context-dependent, that simple data migrations are fine to execute automatically on deployment. The crux is determining the appropriate context. For example, does the migration fire one or more ActiveRecord callbacks? Will changed or new data require indexing?

While context is important for any data migration, every person writing a data migration will have to determine, by context, whether the migration should be auto-deployed. The code reviewers then have to understand that context as well. This is extra decision work, which can induce stress and analysis paralysis over the correctness of the decision. Remediating an incorrect migration that was executed automatically could be very time-consuming.

How best to proceed? Let’s list some data migration requirements:

Data Migration Requirements

Ideally, a data migration would meet the following requirements:

  1. Runnable independently and on demand rather than automatically on a deployment.
  2. Idempotent: the data migration should be safe to rerun, easy to halt, and able to proceed in multiple stages.
  3. Easily testable to reduce risk, which is important in continuous deployment environments.
  4. Simple to create and understand.
  5. Supports reversibility if necessary. This is a difficult requirement, and may involve process changes to archive data in case it’s needed for reverting data changes.
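As a minimal sketch of requirement 2, the plain-Ruby class below backfills a missing field in batches and selects only records that still need work, so it can be stopped and rerun safely. The User struct, the BackfillDisplayName class, and the batch_size and max_batches parameters are all hypothetical names chosen for illustration; a real data migration would query ActiveRecord models rather than an in-memory array.

```ruby
# Hypothetical stand-in for an ActiveRecord model; a real migration would
# read from and write to the database instead of an in-memory array.
User = Struct.new(:id, :first_name, :last_name, :display_name)

module DataMigrations
  # Backfills display_name from first and last name (illustrative task).
  class BackfillDisplayName
    def initialize(users, batch_size: 2)
      @users = users
      @batch_size = batch_size
    end

    # Processes at most max_batches batches of pending records and returns
    # the number of records changed. Safe to stop and call again later.
    def run(max_batches: nil)
      changed = 0
      pending.each_slice(@batch_size).with_index do |batch, index|
        break if max_batches && index >= max_batches

        batch.each do |user|
          user.display_name = "#{user.first_name} #{user.last_name}".strip
          changed += 1
        end
      end
      changed
    end

    private

    # Only records that still need work are selected, which is what makes
    # repeated runs harmless.
    def pending
      @users.select { |user| user.display_name.nil? }
    end
  end
end
```

Calling run(max_batches: 1) processes a single batch and returns; calling run again later picks up only the remaining records, and a further call after that changes nothing.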

The proposed method removes all such decisions, restricting the decision space to focus on the actual data migration itself rather than the context within which the data migration is being performed.

The tool we’re going to use is a Rails Generator, which provides the following benefits:

  • A reusable, reliable pattern providing automatic structure for writing testable data migrations.
  • A standard structure that incorporates test code at setup lowers the barrier to entry, so junior engineers can do the coding.
  • The migration files are in a documented, standardized location, and are generated easily: rails generate data_migration Thing.
  • Data migration scripts are collected into their own module, which reduces the cognitive load induced by a common scripts directory.
  • A provisioned Rake task that can be run at the sh prompt, with no need to start a Rails console. Specifically, in a continuous delivery environment, the change can be merged and the migration run at any convenient time, for example after business hours when traffic is slower.
  • Idempotency is easy to ensure and test programmatically.
  • If reversibility is desired or required, it’s much easier to implement and test.
  • Helps meet auditability requirements for compliance.
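A provisioned Rake task might look like the following sketch. The DataMigrations::SomeMigration class, its run method, and the DRY_RUN environment variable are assumptions made for illustration, and the in-memory RECORDS array stands in for database rows. In a Rails application the task would normally depend on :environment so that the models are loaded first.

```ruby
require "rake"

module DataMigrations
  # Hypothetical migration; the only contract assumed here is a .run class
  # method that returns the number of records it selected for change.
  class SomeMigration
    # Stand-in for database rows.
    RECORDS = [
      { id: 1, status: "old" },
      { id: 2, status: "new" },
      { id: 3, status: "old" },
      { id: 4, status: "old" }
    ]

    def self.run(dry_run: false)
      pending = RECORDS.select { |record| record[:status] == "old" }
      pending.each { |record| record[:status] = "new" } unless dry_run
      pending.size
    end
  end
end

# Makes `namespace`, `desc`, and `task` available outside a Rakefile.
extend Rake::DSL

namespace :data_migrations do
  desc "Run SomeMigration (set DRY_RUN=1 to preview without writing)"
  # In a Rails app this would be `task some_migration: :environment`.
  task :some_migration do
    dry_run = ENV["DRY_RUN"] == "1"
    count = DataMigrations::SomeMigration.run(dry_run: dry_run)
    verb = dry_run ? "would change" : "changed"
    puts "SomeMigration: #{count} record(s) #{verb}"
  end
end
```

From the shell, this would be run as rake data_migrations:some_migration, typically through bundle exec. Because the migration selects only records still in the old state, running the task a second time reports zero changes.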

Using a Rails generator is not a silver bullet, and some engineers may find it overkill. In very small organizations, and when compliance is not a concern, it may often be more appropriate to just log into production and bang out the changes in the Rails console.

However, once an application or organization is large enough, console activity on production becomes increasingly risky. Fewer and fewer engineers have enough global context to make console-based data changes with impunity, and the impact of a mistake compounds with size.

Hence, fully separating data migrations from schema migrations reliably lowers the bar for executing data migrations in a safe, compliant manner. This matters most when safety and compliance are paramount and the workload is heavy enough that this sort of migration represents a high opportunity cost for senior engineering staff, who would otherwise be working on higher-impact projects.

More Benefits of Separating Data Migrations

Data migrations are often performed with “looks right testing,” that is, the code is developed by iteration until it “looks right.”

The cost for using this template is having to do the heavy lifting up front through the defined and implemented tests. The benefits are many:

  • Consistency results from having a specific location for data migrations; there is none of the ambiguity that might be found in a catchall directory such as (say) exe.
  • Edge cases are easier to keep track of once implemented as tests, saving time in the console.
  • Auditability is maintained or improved with a standardized process for data transformation.
  • For someone comfortable with testing, the total elapsed time is likely to be about the same. Writing the tests out may feel slower, but this is balanced by less time spent visually confirming results at the console.

Rails-Generated Output

The locations for the generated data migration files are somewhat arbitrary. However, the following locations follow Rails conventions and have been found to work well in practice:

lib/data_migrations/some_migration.rb
lib/tasks/some_migration.rake
spec/lib/data_migrations/some_migration_spec.rb

Adjust to your team’s location preferences; there is no community standard for this at the time of writing.
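To make the naming convention concrete, the sketch below derives the three paths from a migration name and renders a skeleton class from an ERB template, using only the Ruby standard library. The DataMigrationPaths module, its method names, and the template contents are hypothetical; an actual generator would more likely subclass Rails::Generators::NamedBase, which supplies file_name and class_name helpers.

```ruby
require "erb"

module DataMigrationPaths
  # Skeleton for the generated migration class (contents are illustrative).
  TEMPLATE = <<~'TEMPLATE'
    module DataMigrations
      class <%= class_name %>
        def self.run
          # data changes go here
        end
      end
    end
  TEMPLATE

  module_function

  # Converts "SomeMigration" to "some_migration". Acronyms are not handled.
  def underscore(name)
    name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  end

  # Returns the three file locations for the given migration name.
  def paths_for(name)
    file = underscore(name)
    {
      migration: "lib/data_migrations/#{file}.rb",
      rake_task: "lib/tasks/#{file}.rake",
      spec: "spec/lib/data_migrations/#{file}_spec.rb"
    }
  end

  # Renders the skeleton class for the given class name.
  def render(class_name)
    ERB.new(TEMPLATE).result_with_hash(class_name: class_name)
  end
end
```

Changing the directory strings in paths_for is all that is needed to match a different team layout.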