Skeema Premium v1.10+ includes the ability to manage seed data in your schema repository. This feature allows you to track the initial data for populating a newly-created table. Seed data can be useful for ensuring that new tables immediately have some necessary data in production. It can also be used for tracking development data or “fixture” rows, for use when populating a new dev/test/CI database.
Seed data is represented by SQL INSERT
statements in your *.sql files. By default, skeema diff
and skeema push
only process these statements when the INSERT
is targeting an empty or newly-created table.
Skeema Premium also provides optional support for automatically generating your seed INSERT
statements, based on conditionally dumping data from a specific environment when running skeema pull
. These conditional data dumps are controlled by specially-named “seed views”, which allow you to track arbitrary SELECT
statements for use in seed data population. This provides a powerful mechanism for subsetting or anonymizing production data using the full expressiveness of declarative SQL.
Repo layout and INSERT
statements
INSERT
statements may be placed in any *.sql files within the same directory as the definition of their target tables. You may place them in the same file as their corresponding CREATE TABLE
, or in a separate file per INSERT
, or place all INSERT
s into a single file.
There are several approaches for creating and managing INSERT
statements in your *.sql files:
Manually: write
INSERT
statements from scratch, and adjust them manually when needed.Skeema seed views:
skeema pull
’s update-seed-inserts option provides an automated mechanism for using specially-named views to dump seed data from the appropriate environment; see below for usage.External dump tools: for example, a custom wrapper around the output of
mysqldump --no-create-info --where='...' --tables ...
could be used in an automation pipeline, in order to obtain the desiredINSERT
s and write them to a consistent location in your schema repo.
If maintaining INSERT
statements manually, you can split seed data for a single table into multiple separate INSERT
statements if you wish. If a single table has multiple INSERT
statements defined in the same .sql file, their execution ordering will match their order in the file.
Diff/push and INSERT
statements
The inserts enum option specifies how skeema diff
and skeema push
should handle INSERT
statements.
With the default of inserts=seed
, these commands process INSERT
statements only when the target table has no rows.
With inserts=none
, no INSERT
statements are processed. This is effectively how the Community Edition operates. The alias skip-inserts
may also be used equivalently.
With inserts=all
, all INSERT
statements are processed, regardless of table emptiness. It is important to understand that Skeema does not diff your table row data or support imperative patching of table data / generic data migrations and this option simply re-runs your full INSERT
s. When using inserts=all
, you must take caution to avoid duplicate key errors in your INSERT
statements, such as by manually ensuring your statements all use syntax INSERT IGNORE
or INSERT ... ON DUPLICATE KEY UPDATE
as necessary.
As with all other operations, skeema diff
only ever displays INSERT
s (as appropriate based on the value of the inserts option) but does not execute them. Meanwhile skeema push
executes them, but always after any DDL for the corresponding table. When skeema push
executes INSERT
s, each statement is run in its own separate transaction. If INSERT
s are present for multiple tables, their execution ordering is random. The session will use foreign_key_checks=0
, avoiding foreign key dependency chain issues, at the cost of not ensuring referential integrity of the seed data.
With inserts=seed
, if a single table has multiple INSERT
statements defined in the directory, the table-emptiness check is run only once prior to execution of any INSERT
statements.
The ddl-wrapper option does not affect INSERT
statements; skeema push
always executes INSERT
s directly, not through a wrapper script. Query timeouts (readTimeout
in connect-options) are not enforced for INSERT
s.
Neither skeema diff
nor skeema push
pre-verifies INSERT
statements for correctness prior to execution. INSERT
s are not run inside of a workspace.
Presence of INSERT
s does not affect the exit code behavior of skeema diff
.
Skeema Seed Views
Seed views provide a mechanism for auto-creating INSERT
statements, based on data dumps from a specific database environment. Each seed view encapsulates a SELECT
query for obtaining seed data for a specific table. Seed data INSERT
s are created or updated, based on the contents of corresponding seed views, whenever skeema pull
is run with the update-seed-inserts option enabled.
Among your *.sql files, Skeema considers a CREATE VIEW
statement to be a seed view definition if where the view name begins with “_seed_”, followed by a table name. As an example, consider the following view definition:
CREATE VIEW _seed_app_users AS SELECT * FROM app_users WHERE user_id < 100;
Skeema would consider this to be a seed view, defining which seed data to dump for the table `app_users`. When running skeema pull
, if this view definition is present in a .sql file and update-seed-inserts is enabled, Skeema will automatically run the query from `_seed_app_users`, convert that row data into an INSERT
statement, and write it to the end of app_users.sql – or, if there was already an INSERT
statement there, dynamically replace the statement’s contents to reflect the current data.
The update-seed-inserts option is not enabled by default. Appropriate usage of this option depends on your preferred workflow for INSERT
statements, and the source of the seed data. A few possibilities include:
Obtaining seed data using a subset of production data: enable update-seed-inserts in the [production] section of a .skeema file. Your seed views will typically involve WHERE clauses which define what subset of the row data to dump on pull. If desired, and your SELECT clauses can mask sensitive production data by using string manipulation functions, or the special-purpose data masking components in either MySQL Enterprise or Percona Server for a more powerful approach. Running
skeema pull production
will updateCREATE
as well asINSERT
statements; pulling from other environments will only updateCREATE
statements.Maintaining a canonical set of seed data in one database environment, such as considering the “staging” DB to be the source-of-truth for dumping seed data: enable update-seed-inserts in that environment section in a .skeema file. In this case, typically each seed view will be of the simple form
CREATE VIEW _seed_foo AS SELECT * FROM foo;
which dump the entirety of the table. Runningskeema pull staging
(for example) will updateCREATE
as well asINSERT
statements; meanwhile pulling from other environments will only updateCREATE
statements.Updating seed data ad hoc: enable update-seed-inserts on the command-line only when needed, for example
skeema pull production --update-seed-inserts
. If needed, you can combine this approach with --ignore-table to filter out tables whose seed data should not be updated.
When update-seed-inserts is enabled, its exact behavior is as follows: skeema pull
looks in the *.sql files for view definitions where the view name begins with “_seed_”. For each such view definition that is found, skeema pull
will take the following actions:
- Create the view on the target database (replacing it if it already exists) using the filesystem definition.
- Note that this is the only case where
skeema pull
treats the filesystem CREATE VIEW definition as its source-of-truth, rather than the live database’s definition. It is also the only case whereskeema pull
modifies the live database schema (to create or replace the seed view definition as needed). - This design decision is based on the notion of treating the filesystem view definition as a declarative mechanism for mapping a result-set to a table name. In a sense, the seed views are representing configuration metadata for the pull operation, rather than expressing a desired state for the schema.
- Favoring the filesystem definition also avoids problems with the DBMS normalizing view definitions. For example, the filesystem definition can use
SELECT *
and this will reflect all current columns dynamically, whereas view definitions on the database side become “frozen” at creation time by the DBMS.
- Note that this is the only case where
- Query all row data exposed by the view, e.g.
SELECT * FROM _seed_foo
- Convert the data into an
INSERT
statement- The table’s name will be whatever followed “_seed_” in the view’s name; for example view `_seed_foo` would emit
INSERT INTO `foo` ...
. - The column names in the
INSERT
statement will match whatever columns were found in the view’s definition.
- The table’s name will be whatever followed “_seed_” in the view’s name; for example view `_seed_foo` would emit
- Write the
INSERT
statement to the filesystem- If there is no
INSERT
statement for this table yet in the directory, the newINSERT
statement will be written to the same file as the table’s definition. In the previous example of view `_seed_foo` and table `foo`, by default this would write theINSERT
to foo.sql. You may manually move the statement to a different file in the same directory if you wish, and subsequent pulls will respect the new location. - If there was already a single
INSERT
statement for this table in the directory, its text will be replaced with that of the newINSERT
statement. - If there were multiple
INSERT
statements for this table in the directory, a warning will be logged and none of the statements will be modified.
- If there is no
Seed views may be defined anywhere within a directory’s *.sql files: you may place them all in a single file, or place them in the same file as each corresponding table definition, or place each of them in their own file. Seed view definitions are always manually maintained: commands such as skeema init
, skeema pull
, and skeema format
will not ever export or update CREATE VIEW
statements if the view name begins with “_seed_”.
Seed view edge cases
The seed data dumping functionality currently has several minor shortcomings:
- As a boolean, the operation of update-seed-inserts is currently “all or nothing”, operating on all or none of your seed views. Using --ignore-table on the command-line can help filter out some tables, but is insufficient if you have many seed views and only wish to update a small subset of your tables'
INSERT
statements. As a work-around, after pulling you can usegit restore
(or similar) to discard unwanted changes from your working directory before committing. - In order to create or update the seed view definition in the live database, Skeema’s database user needs the
CREATE VIEW
privilege, along with theDROP
privilege. There is not currently a mechanism to haveskeema pull
just run rawSELECT
queries directly instead of using views. - The seed view feature only emits standard
INSERT
statements. If you needINSERT IGNORE
or perhapsINSERT ... ON DUPLICATE KEY UPDATE
, you will need to manually edit the generated statements after each pull. - For tables with generated columns, you must manually list columns in your seed view’s
SELECT
and cannot useSELECT *
.
These situations may be improved in future Skeema releases, depending on customer interest. Please contact us with any feedback on the seed data and seed view functionality.