Docs: Features: Sharding

Ordinarily, each schema subdirectory maps to a single physical database schema in each environment. For example, when you run skeema push staging, your .skeema configuration files specify what “staging” database server host and schema name to push your *.sql definitions to, and normally this specifies a singular host and schema.

However, Skeema also supports the ability to map a directory to multiple locations within the same environment. This is called “sharding”: a technique for physically dividing a data set across multiple underlying database servers. Within a logical “cluster”, all shards have the exact same table definitions, but each shard contains a different subset of the data set. (In a sharded environment, either the application is responsible for knowing where to route application SELECT queries and DML, or a sharding-aware proxy/middleware solution handles this automatically. In either case, application queries are outside of the scope of Skeema, which only handles the DDL portion of shard management.)

Some companies use large shards, where each database server contains only a single shard. Other companies use smaller shards, where each database server contains many shards, each in a separate database schema. Skeema’s configuration is flexible enough to support both situations, and also optionally permits additional unsharded database schemas to be colocated on each database server.

Several techniques described on this page can also be used for connecting Skeema to dynamic service discovery, even in an unsharded environment. Skeema’s sharding options provide ways of mapping a directory to N shards dynamically, and it doesn’t matter if N=1 or N=1000.

Sharding is an advanced topic, with many potential architectural differences depending on a company’s specific needs. If your company makes extensive use of sharded MySQL and needs assistance with configuring Skeema to manage schema changes across shards, we recommend subscribing to Skeema Max to obtain private technical support.

Mapping a directory to multiple database servers

In a normal unsharded environment, the host option value in a .skeema file is set to a single database server hostname or IP address, configuring a 1:1 mapping between a directory and its corresponding server for the environment name (production, development, etc). In a sharded environment, a single directory is instead mapped to N database servers within the same environment.

There are three different ways to configure this in Skeema, listed here in order of increasing complexity.

Static comma-separated list

With a small number of shards, you can simply hard-code the list of hostnames in the .skeema file as a comma-separated list. For example, host=shard1.example.com,shard2.example.com,shard3.example.com.

Env variable mapping to comma-separated list

In Skeema v1.9+, several options can be set to OS environment variables, which are automatically expanded to their corresponding value. For example, in your .skeema file you could configure host=$SKEEMA_PROD_SHARDS, and then whichever program invokes skeema push could dynamically set the SKEEMA_PROD_SHARDS env variable to a comma-separated list like “shard1.example.com,shard2.example.com,shard3.example.com”.

If the env variable is not set, or is set to a blank/whitespace-only string, the directory will be skipped with a warning.

Environment variable substitution only works if the full option value is a single complete environment variable name. In other words, syntax such as host=${DBSUBDOMAIN}.myhost.com or host=$SHARD1,$SHARD2 is not supported.

Dynamic shell-out via host-wrapper

You can configure Skeema to obtain a list of hostnames dynamically from the output of any custom script/program by setting the host-wrapper option to a command-line to execute. With this configuration, the purpose of host changes: instead of specifying a literal hostname or address, host is used for specifying a lookup key, such as a cluster name which your custom host-wrapper script can use for querying a service discovery system.

With this configuration, host should be just a single value, never a comma-separated list. In a sharded environment, it is the wrapper script’s responsibility to map a single lookup key to multiple shard servers whenever appropriate. If you have no need for a lookup key (for example, if you only have one sharded cluster), you must still set host to a placeholder/dummy value in order to indicate that the directory maps to database servers.

If host-wrapper is configured, whenever Skeema needs to process a leaf directory defining the schema option and containing *.sql files, the configured wrapper command-line will be executed to determine the list of hostnames dynamically. The command line may contain special placeholder variables, which Skeema will replace with appropriate values. See options with variable interpolation for more information. The following variables are supported by host-wrapper:

VariableValue
{HOST}The value of the host option, to use as a lookup key
{ENVIRONMENT}Environment name from the first positional arg on Skeema’s command-line, or “production” if none specified
{DIRNAME}The base name (last path element) of the directory being processed
{DIRPATH}The full (absolute) path of the directory being processed
{SCHEMA}The value of the schema option for the directory being processed

The command’s STDOUT will be split on a consistent delimiter (newline, tab, comma, or space), and each token will be treated as an address. Each address can use any of the following formats:

  • hostname
  • hostname:port
  • ipv4
  • ipv4:port
  • [ipv6]
  • [ipv6]:port

If ports are omitted, the port option is used instead, which defaults to MySQL’s standard port 3306.

The external command should only return addresses of writable (master) database servers, never replicas.

The host-wrapper option is designed to be specified generically at a high level directory, such as a .skeema file at the repository root, or perhaps a global option file. This way, you may specify a single generic service discovery command-line usable across your infrastructure, rather than redundantly configuring a command-line for each database cluster.

Setting or overriding host-wrapper in a subdirectory does not inherently cause the wrapper to be invoked upon processing that subdirectory; host-level subdirectories must also still specify some value for the host option in order to be processed. If your host-wrapper command-line does not make use of the {HOST} variable as a lookup key (to disambiguate between multiple separate sharded clusters), then just use a static value such as host=1 in directories where the host-wrapper script should be invoked.

Mapping a directory to multiple schemas per server

Some companies using a sharding layout where each database server contains multiple shard schemas, each with the same schema but a different subset of the data. With this layout, each shard is typically kept relatively small, which improves performance of common operational activities such as logical backups and schema changes. When the combined size of the shard schemas on one DB server exceeds a size threshold, rebalancing is accomplished by moving one or more entire shards to a different database server, rather than splitting a shard’s row data.

Skeema provides several ways of configuring this sort of many-small-shards mapping, to indicate that a directory needs to be mapped to multiple schema names on each host. Instead of setting the schema option to a single static value in a given .skeema file, you may use any of these alternative formats to specify multiple database names:

  • Multiple static schema names in a comma-separated list, for example schema=shard1,shard2,shard3
  • An environment variable beginning with $, such as schema=$MY_DB_NAME (Skeema v1.9+); the environment variable’s value can be either a single value or a comma-separated list
  • A single asterisk character *, that is schema=*, to mean “all non-system database schemas on the database server”
  • A forward-slash-wrapped regular expression, such as schema=/^foo/ (Skeema v1.4+), to mean “all database schemas on the server that match this regular expression”
  • A backtick-wrapped command line to execute; the command’s STDOUT will be split on a consistent delimiter (newline, tab, comma, or space) and each token will be treated as a schema name

Setting schema=* is a special value meaning “all non-system schemas on the database server”. This is the easiest choice for a multi-tenant sharded environment, where all non-system schemas have the exact same set of tables. The ignored system schemas include information_schema, performance_schema, mysql, sys, and test. Additional schema names may be excluded from matching schema=* by using the ignore-schema option in the same .skeema file.

In some sharded environments, it is easier to express a dynamic set of schema names to include, rather than exclude. Setting the schema value to a forward-slash-wrapped regular expression accomplishes this. For example, schema=/^foo/ will map this directory to all schema names beginning with prefix “foo”. This approach is useful when some schemas (with a common naming convention) represent shards with the same set of tables, while other special unsharded schemas are also present.

Some sharded environments may need even more flexibility – for example, when the sharding scheme does not follow a consistent naming pattern. In this case, set schema to a backtick-wrapped external command shellout. This permits the directory to be mapped to one or more schema names dynamically, based on the output of any arbitrary script or binary, such as a service discovery client. The command line may contain special variables, which Skeema will dynamically replace with appropriate values. See options with variable interpolation for more information. The following variables are supported within a backtick-wrapped schema command-line:

VariableValue
{HOST}Hostname (or IP) for the database server being processed
{PORT}Port number for the database server being processed
{USER}MySQL username defined by the user option either via command-line or option file
{PASSWORD}MySQL password defined by the password option either via command-line or option file
{PASSWORDX}Behaves like {PASSWORD} when the command-line is executed, but only displays X’s whenever the command-line is displayed on STDOUT
{ENVIRONMENT}Environment name from the first positional arg on Skeema’s command-line, or “production” if none specified
{DIRNAME}The base name (last path element) of the directory being processed. Can be useful as a key in a service discovery lookup.
{DIRPATH}The full (absolute) path of the directory being processed

Regardless of which form of the schema option is used, the ignore-schema option is applied last as a regex “filter” against it, potentially removing some of the listed schema names based on the configuration.

Detecting schema drift among shards

In a sharded environment, each schema change must be made on each shard. It is possible that a primary (master/writer) database server can have a hardware failure at an inopportune time during a schema change rollout, causing a shard to miss out on a change. This becomes more likely over time if you have hundreds or thousands of shards, and perform schema changes frequently.

Skeema’s declarative design makes drift detection trivially easy: simply run skeema diff, which will only return an exit code of 0 if all shards are in sync with the definitions in your schema repo. This can be automated as part of your post-failover cleanup process, or perhaps run regularly as a scheduled task / cron job.

If shard drift is present, it is sometimes desirable to first identify which database servers have drifted, instead of outputting the full SQL DDL to bring each shard back in sync. Use skeema diff --brief for this purpose.

To bring the drifting shard(s) back up-to-date, you can just run skeema push again as needed.

Shard concurrency

By default, skeema diff and skeema push only operate on one database server at a time. To operate on multiple servers simultaneously, use concurrent-instances to specify the maximum number of database servers to run on concurrently.

Within each individual database server, only one DDL operation will be run at a time by skeema push, regardless of concurrent-instances.

Canary shards

Ordinarily, for individual directories that map to multiple shards, skeema diff and skeema push will operate on the full list of shards. It is sometimes useful to run a schema change operation just on the first shard, as a “canary”, before deciding whether to continue with the full fleet of shards.

The first-only option can be enabled to make skeema diff or skeema push only operate on the first host and schema mapped by the directory. Here, “first” means following whatever order the hostnames or schema names were emitted by the configured host / host-wrapper and schema options.

Alternatively, for more complex control, a similar effect can be achieved by using environment names. For example, you could create an environment called “production-canary” with host configured to map to a subset of the database servers in the “production” environment.