Managing Data Migrations and Other Imperative Changes

July 23, 2024

Skeema uses a declarative design for schema management, representing the desired state of database objects as a collection of CREATE statements. This model works extremely well for DDL, but it doesn’t directly capture imperative changes, such as row data migrations. In this post, we’ll cover the underlying concepts and show a few approaches for managing imperative changes when using Skeema.

Why DDL and DML should be separated

The solutions described later in this post are admittedly somewhat intricate. Users sometimes request support for a more natural built-in way to manage imperative changes and DML directly in Skeema – after all, traditional migration tools have always supported both DML and DDL, essentially treating them identically. However, there are major drawbacks with that approach.

DDL can’t be used in transactions in MySQL or MariaDB

Most migration tools will allow you to mix DDL and DML statements in the same migration, providing a generic abstraction which vaguely implies an atomic operation. However, this is a dangerous illusion, since MySQL and MariaDB do not provide transactional behavior for DDL.

If a migration consists of a mix of DDL and DML, there is no possible way to make the full operation atomic. Each DDL statement causes an implicit commit, and as soon as it completes, its effects are immediately visible to all other sessions.
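
To illustrate, consider a hypothetical migration which attempts to wrap DML and DDL in a single transaction. The table and column names below are placeholders; the point is that the ALTER TABLE silently commits everything before it:

START TRANSACTION;
INSERT INTO audit_log (note) VALUES ('backfill start');
ALTER TABLE users ADD COLUMN status varchar(20);  -- implicit commit occurs here
UPDATE users SET status = 'active';               -- a failure here cannot roll back the INSERT
COMMIT;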

By handling DDL and DML with separate abstractions and workflows, it is much clearer to developers that these are conceptually distinct operations, with differing semantics in the database server.

Lack of DML scalability

At smaller scale, row data migrations are simple, sometimes consisting of just a single UPDATE statement. However, with larger tables, this approach isn’t safe, as it results in a single long-lived transaction. This can be extremely disruptive to the database server: it may hold locks for an excessively long time; it causes replication lag; it blocks purge of old row versions; it can cause furious flushing of dirty buffer pool pages.
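
As an illustration of the safer chunked alternative, here is a minimal sketch assuming a hypothetical users table with an auto-increment id primary key:

-- instead of one giant UPDATE in a single long-lived transaction, backfill
-- in small primary-key ranges, so that each batch commits quickly
UPDATE users
SET status = 'active'
WHERE status IS NULL
AND id BETWEEN 1 AND 10000;
-- ...then repeat for id 10001 to 20000, and so on, pausing between
-- batches to limit replication lag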

For this reason, larger companies often build custom in-house data migration tools and frameworks. Large-scale data migrations typically require precise orchestration with application logic to redundantly write to old and new locations/formats, and a multi-step deployment process of changing write and read locations separately. Additionally, for companies using sharding, any data migration inherently requires complex tracking of DML statements across many physically distinct database servers. There is no “one size fits all” solution for this problem.

New environment spin-up

Schema management tools are often used to populate newly-created dev or test databases. With migration tools, this becomes increasingly slow as incremental migrations accumulate over time, so these tools sometimes offer the ability to “flatten” schema changes to their final state. However, doing so can inherently break any incremental DML statements from the migration history, as they may refer to tables or columns that no longer exist in the latest flattened schema, or whose column data types or constraints have since changed.

Handling this problem requires treating new environment seed data differently from other types of DML, and/or having a mechanism for marking certain DML migrations as one-time “patches” which are skipped when spinning up a new environment. Many migration tools lack this functionality.

External solution: combining multiple tools

Skeema can be used alongside any traditional imperative migration tool quite easily. With this approach, Skeema is responsible for managing DDL, as well as seed data. Meanwhile, the migration tool handles execution of incremental DML migrations.

When automating this approach, simply have your pipeline run skeema push prior to the migration tool. This way, DML statements in your data migrations can safely refer to newly-added tables, columns, and functions. Just be sure to abort early if skeema push returns a non-zero exit code.

In order to keep this solution organized, we recommend using separate directories for your Skeema CREATE statements vs non-Skeema DML migrations. These can be parallel directory trees of the same repo, or even parent/child directories where the DML migrations are a subdirectory of the declarative schema directory. Just make sure the directories for the DML migrations omit a .skeema configuration file entirely, and then skeema diff and skeema push won’t interact with them.
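
For example, a parallel-tree layout might look like the following sketch; all directory and file names here are arbitrary placeholders:

my-db-repo/
  schema/          (declarative CREATE statements; contains .skeema files)
    .skeema
    myapp/
      .skeema
      users.sql
  migrations/      (incremental DML for the migration tool; no .skeema file)
    0001_backfill_status.sql
    0002_cleanup_orphans.sql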

Another possible variant of this approach is to use Skeema only for versioning and deploying stored routines, while using a migration tool for table DDL as well as DML. On the Skeema side, this can be configured easily by setting ignore-table=. to block all table management.
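
As a sketch, the .skeema file in such a routines-only directory might contain something like the following; the schema name is a hypothetical placeholder:

# manage stored routines only; this regex matches (and ignores) every table name
schema=myapp
ignore-table=.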

Native solution: idempotent INSERTs

In some cases, DML is tracked in the schema management repo only for representing seed data: initial row data for populating new tables. In Skeema Premium Edition, this can be managed somewhat declaratively via INSERT statements which represent the full desired seed data set. The inserts option defaults to “seed”, which causes skeema push to execute the DML only when the target table is empty.

Complexity arises when you need to retroactively adjust a seed data set in the schema management repo (to apply to future dev/test environments), and also make a corresponding modification to that same row data on any existing environments.

If you are strictly adding or modifying seed data rows (but not deleting any of them), one solution is to make your seed data INSERT statements effectively idempotent, by using either INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE. This way, you can run skeema push --inserts=all ad hoc when needed, and all INSERT statements will be re-executed, inserting any missing rows or modifying any existing ones.
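
For example, a seed data statement for a hypothetical countries table could take this form:

INSERT INTO countries (id, code, name) VALUES
  (1, 'US', 'United States'),
  (2, 'CA', 'Canada'),
  (3, 'MX', 'Mexico')
ON DUPLICATE KEY UPDATE code = VALUES(code), name = VALUES(name);
-- note: the VALUES() function is deprecated in newer MySQL releases (though it
-- still works there, and in MariaDB); recent MySQL can use the row-alias form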

Native solution: one-time events

MySQL and MariaDB both support events, which are scheduled tasks on the database server. Events can be managed and diffed using Skeema Premium Edition. Although events are most commonly used for cron-like recurring SQL jobs, this server feature also supports non-recurring “one-time events” – a block of SQL that the database server will only execute once, at a specific scheduled time. Since you can wrap any collection of SQL statements in a CREATE EVENT statement, this feature provides a native mechanism for tracking arbitrary imperative changes inside an otherwise-declarative repo of CREATE statements.

To use this technique, the event definition should always include an ON COMPLETION PRESERVE clause. This tells the database server to store the event’s metadata directly in its internal data dictionary, providing persistence of migration metadata – without even requiring any sort of additional migration tracking table!

This approach can be used for tracking row data migrations – that is, INSERT, UPDATE, and DELETE statements that should only be executed a single time per server. It can also be used for imperative DDL patches, such as one-time RENAME TABLE or ALTER TABLE ... RENAME COLUMN.

To wrap an imperative change in an event, the definition will typically be of this form:

CREATE EVENT `event_name_here`
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO ... -- event body here, either a BEGIN...END block or a single SQL statement

When running skeema push, the event will only be created if it does not already exist, as with any other object type. If a given directory’s diff includes multiple types of objects, the event-related DDL will occur last, so it is safe for the event body to refer to any newly-created tables, columns, or routines.

For ease of development and debugging, you may want to wrap any complex logic in a separate stored procedure. Then, the body of the event can simply consist of a single CALL statement.
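
For example, a hypothetical backfill could be structured as follows: the procedure holds the migration logic, and the one-time event body is just a CALL. (If executing these statements manually in the mysql client, a DELIMITER change is needed around the compound procedure body.)

CREATE PROCEDURE backfill_user_status()
BEGIN
  -- hypothetical migration logic
  UPDATE users SET status = 'active' WHERE status IS NULL;
END;

CREATE EVENT `backfill_user_status_once`
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO CALL backfill_user_status();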

With a dynamic expression schedule of AT CURRENT_TIMESTAMP, the server will execute the event immediately after creation, assuming the event scheduler is enabled. Due to the ON COMPLETION PRESERVE clause, the database server will persist this event’s metadata in its internal data dictionary, ensuring that this named event is only executed a single time.

After execution, the server will automatically transition the event’s state to “disabled”. Once in this state, skeema push won’t re-enable the event, even if the CREATE EVENT definition in its .sql file still lacks the DISABLE attribute.

Patching existing environments only

The skeema push command can be used to bring an existing environment up-to-date, or to populate a newly-created test or dev database. Sometimes it is desirable to run migrations only on existing environments, and not when spinning up a newly-created database, since the latter is already handled by your declarative CREATE definitions. Some examples:

  • Renaming a table – You need to run a RENAME TABLE statement on all existing database servers; but after that, you want to adjust the schema repo’s CREATE TABLE statement to use the new name, perhaps by using skeema pull. The RENAME TABLE then shouldn’t be run on any future environment deployments, but you still want to keep a record of it in your repo’s Git history.
  • Renaming a column – Same as previous bullet, but with ALTER TABLE ... RENAME COLUMN.
  • Adjusting seed data or “static” data sets – As an alternative to the idempotent INSERTs approach, you may wish to manage the data adjustment of existing environments using an event, and then refresh your seed data (for use in future dev/test envs) by doing a data dump of an existing environment.

One solution to this problem is to schedule the event for a literal static timestamp in the near future, rather than using a CURRENT_TIMESTAMP expression. For example, if the current date and time is 2024-08-17 17:00, you could use a static timestamp in one hour, ON SCHEDULE AT '2024-08-17 18:00:00'. As long as this event is pushed out to all environments within an hour, it will successfully execute everywhere at the specified time. Meanwhile, any future environments will skip the event entirely, since the server will not execute events that have a static time in the past.
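
Continuing the table rename example from above, such an event might look like this sketch, with hypothetical object names:

CREATE EVENT `rename_widgets_to_products`
ON SCHEDULE AT '2024-08-17 18:00:00'
ON COMPLETION PRESERVE
DO RENAME TABLE widgets TO products;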

Migration dependencies and ordering

For row data migrations which should be executed even when populating a new environment, take special care with migrations that have ordering dependencies on other migrations. When skeema push detects that multiple events need to be created in the same push, the ordering of the various CREATE EVENT statements is effectively random. Therefore, if the events' schedules are all AT CURRENT_TIMESTAMP, then their execution order will be random as well.

You can influence the events’ execution ordering by adding staggered time offsets, such as AT CURRENT_TIMESTAMP + INTERVAL 10 SECOND. However, this does not guarantee that prior scheduled events have already completed, since multiple events can execute in parallel. To enforce strict ordering, you must also use locking functions in your event bodies. For example, each dependent event could begin by acquiring a named lock via GET_LOCK(), using the same lock name for all related events that must execute in a strict order.
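
Here is a minimal sketch of that pattern, with hypothetical event, table, and lock names. The DO statement form discards the lock functions’ return values; a production version should check whether GET_LOCK actually succeeded before proceeding:

CREATE EVENT `migration_step_two`
ON SCHEDULE AT CURRENT_TIMESTAMP + INTERVAL 10 SECOND
ON COMPLETION PRESERVE
DO BEGIN
  -- wait up to 10 minutes for earlier steps in the chain to release the lock
  DO GET_LOCK('my_migration_chain', 600);
  UPDATE users SET status = 'verified' WHERE status = 'active';
  DO RELEASE_LOCK('my_migration_chain');
END;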


Some features described in this blog post were newly added in Skeema v1.12. For the latest Skeema updates, follow us on GitHub, Twitter, or RSS.