July 23, 2024
Skeema uses a declarative design for schema management, representing the desired state of database objects as a collection of CREATE statements. This model works extremely well for DDL, but it doesn’t directly capture imperative changes, such as row data migrations. In this post, we’ll cover the underlying concepts and show a few approaches for managing imperative changes when using Skeema.
Why DDL and DML should be separated
The solutions described later in this post are admittedly somewhat intricate. Users sometimes request support for a more natural built-in way to manage imperative changes and DML directly in Skeema – after all, traditional migration tools have always supported both DML and DDL, essentially treating them identically. However, there are major drawbacks with that approach.
DDL can’t be used in transactions in MySQL or MariaDB
Most migration tools will allow you to mix DDL and DML statements in the same migration, providing a generic abstraction which vaguely implies an atomic operation. However, this is a dangerous illusion, since MySQL and MariaDB do not provide transactional behavior for DDL.
If a migration consists of a mix of DDL and DML, there is no possible way to make the full operation atomic. As soon as each individual DDL statement completes, its effects are immediately visible to all other sessions.
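To illustrate why, consider what happens when DDL appears inside an explicit transaction. A sketch, assuming hypothetical tables t1 and t2:

START TRANSACTION;
INSERT INTO t1 (id) VALUES (1);       -- part of the open transaction
ALTER TABLE t2 ADD COLUMN c INT;      -- implicitly commits the INSERT above
ROLLBACK;                             -- too late: there is nothing left to roll back

In MySQL and MariaDB, most DDL statements cause an implicit commit, so once the ALTER runs, the preceding INSERT can never be rolled back.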
By handling DDL and DML with separate abstractions and workflows, it is much clearer to developers that these are conceptually distinct operations, with differing semantics in the database server.
Lack of DML scalability
At smaller scale, row data migrations are simple, sometimes consisting of just a single UPDATE statement. However, with larger tables, this approach isn’t safe, as it results in a single long-lived transaction. This can be extremely disruptive to the database server: it may hold locks for an excessively long time; it causes replication lag; it blocks purge of old row versions; it can cause furious flushing of dirty buffer pool pages.
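At scale, the standard mitigation is to break the DML into small batches, each committing as its own short transaction. A minimal sketch, assuming a hypothetical users table with a status column being backfilled:

-- Run repeatedly (ideally with a short sleep between batches) until
-- the statement reports 0 rows affected. Each batch is a separate
-- small transaction, avoiding long-lived locks and replication lag.
UPDATE users
SET    status = 'active'
WHERE  status IS NULL
LIMIT  1000;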
For this reason, larger companies often build custom in-house data migration tools and frameworks. Large-scale data migrations typically require precise orchestration with application logic to redundantly write to old and new locations/formats, and a multi-step deployment process of changing write and read locations separately. Additionally, for companies using sharding, any data migration inherently requires complex tracking of DML statements across many physically distinct database servers. There is no “one size fits all” solution for this problem.
New environment spin-up
Schema management tools are often used to populate newly-created dev or test databases. With migration tools, this becomes increasingly slow as incremental migrations accumulate over time, so these tools sometimes offer the ability to “flatten” schema changes to their final state. However, doing so can inherently break any incremental DML statements from the migration history, as they may refer to tables/columns that no longer exist in the latest flattened schema, or whose data types or constraints have since changed.
Handling this problem requires treating new environment seed data differently from other types of DML, and/or having some mechanism for marking some DML migrations as one-time “patches” which are skipped when spinning up a new environment. Many migration tools lack this functionality.
External solution: combining multiple tools
Skeema can be used alongside any traditional imperative migration tool quite easily. With this approach, Skeema is responsible for managing DDL, as well as seed data. Meanwhile, the migration tool handles execution of incremental DML migrations.
When automating this approach, simply have your pipeline run skeema push prior to the migration tool. This way, DML statements in your data migrations can safely refer to newly-added tables, columns, and functions. Just be sure to abort early if skeema push returns a non-zero exit code.
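As a sketch, a deployment script might look like this, where migrate up is a placeholder for whatever command invokes your migration tool:

#!/bin/sh
set -e                     # abort early on any non-zero exit code
skeema push production     # declarative DDL and seed data first
migrate up                 # then incremental DML migrations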
In order to keep this solution organized, we recommend using separate directories for your Skeema CREATE statements vs non-Skeema DML migrations. These can be parallel directory trees of the same repo, or even parent/child directories where the DML migrations are a subdirectory of the declarative schema directory. Just make sure the directories for the DML migrations omit a .skeema configuration file entirely, and then skeema diff and skeema push won’t interact with them.
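For example, a hypothetical layout using parallel directory trees:

schema/                       # declarative: .skeema config plus CREATE statements
  .skeema
  users.sql
  orders.sql
migrations/                   # imperative DML for the migration tool; no .skeema file
  0001_backfill_status.sql
  0002_cleanup_orphans.sql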
Another possible variant of this approach is to use Skeema only for versioning and deploying stored routines, while using a migration tool for table DDL as well as DML. On the Skeema side, this can be configured easily by setting ignore-table=. to block all table management.
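As a minimal sketch, the relevant directory’s .skeema configuration would contain:

# .skeema
# ignore-table takes a regex; "." matches any table name, so all
# tables are skipped and only routines remain under Skeema's control.
ignore-table=.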
Native solution: idempotent INSERTs
In some cases, DML is tracked in the schema management repo only for representing seed data: initial row data for populating new tables. In Skeema Premium Edition, this can be managed somewhat declaratively via INSERT statements which represent the full desired seed data set. The inserts option defaults to “seed”, which causes skeema push to execute the DML only when the target table is empty.
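For example, a hypothetical seed data statement for a small lookup table:

-- With inserts=seed, skeema push only executes this INSERT when
-- order_statuses contains zero rows.
INSERT INTO order_statuses (id, label) VALUES
  (1, 'pending'),
  (2, 'shipped'),
  (3, 'delivered');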
Complexity arises when you need to retroactively adjust a seed data set in the schema management repo (to apply to future dev/test environments), and also make a corresponding modification to that same row data on any existing environments.
If you are strictly adding or modifying seed data rows (but not deleting any of them), one solution is to make your seed data INSERT statements effectively idempotent, by using either INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE. This way, you can run skeema push --inserts=all ad hoc when needed, and all INSERT statements will be re-executed, inserting any missing rows or modifying any existing ones.
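Continuing the hypothetical order_statuses example, an idempotent form might look like this:

-- Safe to re-run: existing rows have their label updated in place,
-- and any missing rows are inserted.
INSERT INTO order_statuses (id, label) VALUES
  (1, 'pending'),
  (2, 'shipped'),
  (3, 'delivered'),
  (4, 'returned')    -- a newly-added seed row
ON DUPLICATE KEY UPDATE label = VALUES(label);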
Native solution: one-time events
MySQL and MariaDB both support events, which are scheduled tasks on the database server. Events can be managed and diffed using Skeema Premium Edition. Although events are most commonly used for cron-like recurring SQL jobs, this server feature also supports non-recurring “one-time events” – a block of SQL that the database server will only execute once, at a specific scheduled time. Since you can wrap any collection of SQL statements in a CREATE EVENT statement, this feature provides a native mechanism for tracking arbitrary imperative changes inside an otherwise-declarative repo of CREATE statements.
To use this technique, the event definition should always include an ON COMPLETION PRESERVE clause. This tells the database server to store the event’s metadata directly in its internal data dictionary, providing persistence of migration metadata – without even requiring any sort of additional migration tracking table!
This approach can be used for tracking row data migrations – that is, INSERT, UPDATE, and DELETE statements that should only be executed a single time per server. It can also be used for imperative DDL patches, such as one-time RENAME TABLE or ALTER TABLE ... RENAME COLUMN.
To wrap an imperative change in an event, the definition will typically be of this form:
CREATE EVENT `event_name_here`
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO ... -- event body here, either a BEGIN...END block or a single SQL statement
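For instance, a sketch of a one-time row data migration, using hypothetical table and column names:

CREATE EVENT `backfill_users_status`
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO UPDATE users
   SET    status = 'active'
   WHERE  status IS NULL;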
When running skeema push, as usual the event will only be created if it does not exist yet. If a given directory’s diff includes multiple types of objects, the event-related DDL will occur last, so it is safe for the event body to refer to any newly-created tables, columns, or routines.
For ease of development and debugging, you may want to wrap any complex logic in a separate stored procedure. Then, the body of the event can simply consist of a single CALL statement.
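A sketch of that pattern, again with hypothetical names:

-- The migration logic lives in a procedure, which can be tested
-- by hand (via CALL) before the event ever fires.
CREATE PROCEDURE `migrate_user_status`()
BEGIN
  UPDATE users SET status = 'active' WHERE status IS NULL;
  DELETE FROM users WHERE status = 'deleted';
END;

CREATE EVENT `run_migrate_user_status`
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO CALL migrate_user_status();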
With a dynamic schedule expression of AT CURRENT_TIMESTAMP, the server will execute the event immediately after creation, assuming the event scheduler is enabled. Due to the ON COMPLETION PRESERVE clause, the database server will persist this event’s metadata in its internal data dictionary, ensuring that this named event is only executed a single time.
After execution, the server will automatically transition the event’s state to “disabled”. Once in this state, skeema push won’t re-enable the event, even if the CREATE EVENT definition in its .sql file still lacks the DISABLE attribute.
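To verify that a one-time event has fired on a particular server, one option is to query information_schema, for example:

-- After a one-time event with ON COMPLETION PRESERVE executes,
-- its STATUS transitions to 'DISABLED'.
SELECT event_name, status, last_executed
FROM   information_schema.events
WHERE  event_schema = DATABASE();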
Patching existing environments only
The skeema push command can be used to bring an existing environment up-to-date, or to populate a newly-created test or dev database. Sometimes it is desirable to run migrations only on existing environments, and not when spinning up a newly-created database, since the latter will already be handled in your declarative CREATE definitions separately. Some examples:
- Renaming a table – You need to run a RENAME TABLE statement on all existing database servers; but after that, you want to adjust the schema repo’s CREATE TABLE statement to use the new name, perhaps by using skeema pull. The RENAME TABLE then shouldn’t be run on any future environment deployments, but you still want to keep a record of it in your repo’s Git history.
- Renaming a column – Same as previous bullet, but with ALTER TABLE ... RENAME COLUMN.
- Adjusting seed data or “static” data sets – As an alternative to the idempotent INSERTs approach, you may wish to manage the data adjustment of existing environments using an event, and then refresh your seed data (for use in future dev/test envs) by doing a data dump of an existing environment.
One solution to this problem is to schedule the event for a literal static timestamp in the near future, rather than using a CURRENT_TIMESTAMP expression. For example, if the current date and time is 2024-08-17 17:00, you could use a static timestamp one hour later: ON SCHEDULE AT '2024-08-17 18:00:00'. As long as this event is pushed out to all environments within an hour, it will successfully execute everywhere at the specified time. Meanwhile, any future environments will skip the event entirely, since the server will not execute events that have a static time in the past.
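Putting this together, a sketch of a one-time table rename that only runs on existing environments, using hypothetical names and the example timestamp above:

CREATE EVENT `rename_customers_to_clients`
ON SCHEDULE AT '2024-08-17 18:00:00'
ON COMPLETION PRESERVE
DO RENAME TABLE customers TO clients;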
Migration dependencies and ordering
For row data migrations which should be executed even when populating a new environment, take special care with migrations that have ordering dependencies on other migrations. When skeema push detects that multiple events need to be created in the same push, the ordering of the various CREATE EVENT statements is effectively random. Therefore, if the events' schedules are all AT CURRENT_TIMESTAMP, then their execution order will be random as well.
You can influence the events' execution ordering by adding staggered time offsets, such as AT CURRENT_TIMESTAMP + INTERVAL 10 SECOND. However, this does not guarantee that any prior scheduled events have already completed, since multiple events can execute in parallel. To enforce strict ordering, you must also use locking functions in your event bodies. For example, you could design every dependent event to call GET_LOCK(...) as its first step, using the same lock name for all related events that must execute in strict order.
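A sketch of two dependent events using this pattern, with hypothetical names and an arbitrary lock name. Inside an event body, DO evaluates GET_LOCK without producing a result set:

CREATE EVENT `step1_backfill`
ON SCHEDULE AT CURRENT_TIMESTAMP
ON COMPLETION PRESERVE
DO BEGIN
  DO GET_LOCK('migration_chain', -1);  -- negative timeout waits indefinitely
  UPDATE users SET status = 'active' WHERE status IS NULL;
  DO RELEASE_LOCK('migration_chain');
END;

CREATE EVENT `step2_cleanup`
ON SCHEDULE AT CURRENT_TIMESTAMP + INTERVAL 10 SECOND
ON COMPLETION PRESERVE
DO BEGIN
  DO GET_LOCK('migration_chain', -1);
  DELETE FROM users WHERE status = 'deleted';
  DO RELEASE_LOCK('migration_chain');
END;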
Some features described in this blog post were newly added in Skeema v1.12. For the latest Skeema updates, follow us on GitHub, Twitter, or RSS.