In my previous post I introduced materialized view concepts. This post begins with an introduction to change data capture technology and describes some of the ways in which it can be leveraged for your benefit. This is followed by a description of FlexCDC, the change data capture tool included with Flexviews. It continues with an overview of how to install and run FlexCDC, and concludes with a demonstration of the utility.
As a reminder, the first post covered the following topics:
- What is a materialized view(MV)?
- It explained that an MV can pre-compute joins and may aggregate and summarize data.
- Using the aggregated data can significantly improve query response times compared to accessing the non-aggregated data.
- Keeping MVs up-to-date (refreshing) is usually expensive.
- A change data capture tool can be used to implement an efficient way of refreshing them.
What is Change Data Capture (CDC)?
As the name implies, CDC software captures the changes made to database rows and makes those changes available in a convenient form which can be accessed by other programs. CDC applications exist for many commercial databases but until recently this type of software was not available for MySQL.
Change Data Capture can be used to:
- Monitor a database table, or tables for changes.
- Improve ETL performance by identifying the data which has changed.
- Maintain materialized views with Flexviews (the primary purpose of FlexCDC).
- Feed search engines like Sphinx or Solr only the rows that change.
- Feed third party replication systems.
- Provide data to “external triggers” such as Gearman jobs.
CDC tools usually operate in one of the following ways:
- Timestamps (usually more than one) to identify rows that have changed
- Triggers to capture changes synchronously
- Database log reading to capture the changes asynchronously
The first method has serious drawbacks, such that it can’t identify deleted rows and MySQL timestamps may not be flexible enough.
The trigger method has a lot of problems. Triggers add a significant overhead. When the structure of a table is changed, the triggers must be changed. The work in the trigger is immediate and affects every transaction. Finally, MySQL has limited trigger support, some of which is the cause of the aforementioned problems. The biggest problem, at least from standpoint of how Flexviews works, is that triggers can not, under normal conditions, detect the commit order of transactions. This above all makes triggers an unacceptable CDC method.
This leaves the third method, log based capture as the best option because it imposes less overhead than triggers and change data capture may be done remotely and asynchronously.
Binary log based CDC
The CDC tool included with Flexviews is called FlexCDC. It seemed like an appropriate name. The Binary Log is the MySQL log which records changes to tables in the database. FlexCDC reads the binary log to determine what rows have changed. Because of this, FlexCDC requires that you use row-based binary logs (RBR). If you don’t have MySQL 5.1 or aren’t using RBR, then it is possible to set up a dedicated MySQL slave which has log_slave_updates=1 and binlog_format=row to process SBR changes from a MySQL master. I’ll talk more about that in another blog post.
FlexCDC does not implement a full binary log parser. Instead, it invokes the ‘mysqlbinlog’ utility and it processes the predictable output of this program. mysqlbinlog will always be able to read the binary logs of the version of mysql it ships with (and previous versions) so there is no worry about binary log format changes. FlexCDC is written in PHP so it is portable.
Setting up FlexCDC
FlexCDC has some basic requirements:
- MySQL 5.1+
- row based logging (binlog_format=1)
- unique server_id in the my.cnf
- log_slave_updates=1, if this is a MySQL slave
You can get FlexCDC directly out of the Flexviews SVN. I suggest that you just grab all of Flexviews:
$ svn checkout http://flexviews.googlecode.com/svn/trunk/ flexviews
Next you have to customize the example ini file. FlexCDC is located in the flexviews/consumer/ subdirectory.
Create the settings file:
Change to the flexviews/consumer directory and copy the consumer.ini.example file to consumer.ini and edit it, making appropriate changes. The file is well commented. The example settings file should work for most MySQL installations which allow connections as root with no password from localhost. It is possible to read from and/or write to remote servers, but this example uses the local machine which is the usual configuration for Flexviews since it requires local access to the tables and the changelogs in order to maintain materialized views. Most database servers have some spare CPU for binary log parsing.
Run the setup script:
This will create the metadata tables and capture the initial binary log position.
php ./setup_flexcdc.php --ini consumer.ini $ php setup_flexcdc.php setup starting setup completed
If the setup detects any problems (such as binary logging not being enabled) it will exit with an error. It will exit with a message “setup completed” otherwise.
The binary log stores it progress in a metadata table:
$ mysql -e 'select * from flexviews.binlog_consumer_status\G' -uroot *************************** 1. row *************************** server_id: 999 master_log_file: binary_log.000001 master_log_size: 214652 exec_master_log_pos: 214652
If you select from that table you won’t see anything changing, even if you are writing into your database. This isn’t anything to worry about, since the background process isn’t running yet.
Starting the background process:
FlexCDC includes a consumer_safe.sh script that will start up a copy of FlexCDC and restart it if it exits with error. You can shut down FlexCDC by sending it a HUP signal. The script will drop a .pid file so you know what process to HUP.
$ ./consumer_safe.sh --ini=consumer.ini &  959
$ ps PID TTY TIME CMD 959 pts/1 00:00:00 consumer_safe.sh 960 pts/1 00:00:00 php 967 pts/1 00:00:00 ps 6248 pts/1 00:00:00 bash
$ cat flexcdc.pid 960
Adding a changelog to a table
FlexCDC copies the contents of database rows which change into special tables called changelogs. Each changelog is located in the flexviews database and is named $SCHEMA_$TABLE where $SCHEMA is the schema in which the source table is located and $TABLE the name of the source table. If that is confusing it should be clear in a moment.
Lets create a table, insert some rows, add a change log, delete rows and then insert some more of them:
mysql> create table > test.demo ( > c1 int auto_increment primary key, > c2 int > ) > engine=innodb; Query OK, 0 rows affected (0.00 sec) mysql> insert into test.demo values (1,1); Query OK, 1 row affected (0.00 sec)
Even though FlexCDC is running in the background, it didn’t capture any changes from that insert. We need to add the table to the list of tables to changelog. There is a utility included with FlexCDC called ‘add_table.php’. This script automates the process of adding a table to the list of tables to changelog. It does this by adding an entry to the `flexviews`.`mvlogs` metadata table, and it creates the changelog table itself.
$ php add_table.php --schema=test --table=demo success
Note that you can enable auto_changelog=true in the config file to automatically record changes for any table, starting from the first time a change is seen for that table. This is generally only useful if you have a small number of tables, and you want to track changes on all of them.
You may have also noted that I did not include –ini=consumer.ini. This is because this is the default config filename to search for. I included it in the earlier examples for illustration purposes.
Examine the changes
Now that the changelog has been added, any changes to `test`.`demo` will automatically be captured.
Insert data in one transaction (two rows):
mysql> insert into test.demo values (NULL,2),(NULL,3); Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0
And delete data in a second transaction:
mysql> delete from test.demo where c1=1; Query OK, 1 row affected (0.00 sec)
The changelog is flexviews.test_demo. This because the source table is `test`.`demo`.
mysql> select * from flexviews.test_demo\G *************************** 1. row *************************** dml_type: 1 uow_id: 10 fv$server_id: 999 c1: 2 c2: 2 *************************** 2. row *************************** dml_type: 1 uow_id: 10 fv$server_id: 999 c1: 3 c2: 3 *************************** 3. row *************************** dml_type: -1 uow_id: 11 fv$server_id: 999 c1: 1 c2: 1 3 rows in set (0.00 sec)
As you can see, there are three rows in the changelog, each representing one of the changes we made.
You will notice that the source table only has two columns, but the changelog contains five. All change logs contain three additional metadata columns: dml_type, uow_id and fv$server_id. These columns represent the type of change (insert is 1, delete -1), the transaction order and the source of the changes, respectively.
Finally, note that the two insertions happened inside of the same transaction, and that the insertions happened before the deletion. Though they are none shown here, updates would be represented by a deletion followed by an insertion.