Combating “data drift”

In my first post in this series, I described materialized views (MVs). An MV is essentially a cached result set at one point in time. The contents of the MV will become incorrect (out of sync) when the underlying data changes. This loss of synchronization is sometimes called drift. This is conceptually similar to a replication slave that is behind. Until it catches up, the view of the data on the slave is “behind” the changes on the master. An important difference is that each MV could have drifted by a different length of time.

A view which has drifted out of sync must be refreshed. Since an MV drifts over time from the “base tables” (those tables on which the view was built) there must be a process to bring them up-to-date. Flexviews includes two different refresh methods. Each method is named after the way in which the contents of the view are updated. The two methods are complete refresh and incremental refresh.   The view type/refresh method are interchangeable since a refresh method is selected at the time of view creation and may not be changed. If I talk about a ‘complete MV’, I mean a ‘MV for which the complete refresh method was selected at creation’.  Also, when I say ‘view’ I mean MV.

I will fully describe each method below, but the important thing to know right now is that there are two significant differences between the methods.

  • A view which uses the complete refresh method must be rebuilt from scratch each time the view is refreshed. This is similar to storing the results of a query in Memcache. When the cache “expires”, the contents must be fully recalculated..
  • An incrementally refreshable view (that is, one which uses the incremental method) can be updated much more efficiently, by examining a history of the rows which have changed since the view was either created, or last refeshed.

The incremental method is somewhat conceptually similar to using mysql binary logs for point-in-time recovery after a restoring a backup. Since the backup is a essentially a ‘snapshot’ of the data at the time of the backup, upon restoration the backup may be ‘out of date’ since changes to the database may have happened since the backup was taken. Replaying the binary logs brings the database up-to-date by replaying the changes.

An MV is also like a snapshot. The incremental refresh method uses the changelogs collected by FlexCDC to update the snapshot to reflect the changes that have happened in the database, instead of replaying binary logs directly.

Selecting a refresh method

The refresh method you select is determined by the following two major factors:

  1. The cost (in terms of execution time) of the query on which the view is based.
  2. The SQL features used in the query on which the view is based.

The cost of the query

The complete method completely rebuilds an MV from scratch at each refresh. This means that the minimum amount of drift for this type of MV is the amount of time it takes to execute the query on which the view is based.  An incrementally refreshable view, on the other hand, can usually be refreshed very quickly, depending on the amount of data which has changed since the last refresh. This is because the incremental refresh method must examine only those rows related to the ones which changed since the last time the view was refreshed.

The incremental refresh method can be many orders of magnitude faster than the original query execution time. It may be possible, therefore, to refresh an MV based on a query that takes 45 minutes to execute with a greater frequency than 45 minutes, even as frequently as every minute, perhaps.

The refresh method you select effectively controls the minimum amount of time it takes to refresh the view, and therefore the minimum amount of drift it will encounter. Keep in mind that you can build complete refresh views on top of incrementally refreshable view, which can give you a lot of flexibility to get the results you want, but more quickly. More on this later.

SQL features used in the query

Incremental refresh doesn’t support all of the SQL features available to SELECT statements.

Refresh typeAggregationOuter joinAll SQL functionsBuilt using SQLRequires FlexCDC
COMPLETEYYYYN
INCREMENTALY*NN**N***Y

*All aggregate functions supported, except GROUP_CONCAT and AVG(distinct).
**non-deterministic functions like RAND(),NOW(), etc, are not supported.
***There is a script to convert SQL to the Flexviews API (see convert.php below)

Refresh methods

The incremental refresh method

The incremental refresh method uses table changelogs which are created by FlexCDC. The refresh algorithm computes changes in multiple transactions. After each transaction it must wait for FlexCDC to process changes from that transaction. In practice, this means that it takes a minimum of a few seconds to incrementally refresh a view, even when the number of changes is very small.

The Flexviews SQL_API

Unlike views which use the complete refresh method, incrementally refreshable views are not built directly from SQL. Instead, Flexviews includes a MySQL stored procedure API, called the SQL_API which is used to define the view. These stored procedures maintain the Flexviews data dictionary. The data dictionary is used by the incremental refresh algorithm to build and maintain the view. You should always use the stored procedures to modify the dictionary. Do not modify it directly.

convert.php
You are probably unfamiliar with the SQL_API. In order to make working with the Flexviews easier, it includes a script called ‘convert.php’. This script makes it easy to create incrementally refreshable views from SQL statements. It reads one or more “CREATE TABLE db.schema … AS SELECT” and/or “INSERT INTO db.schema … AS SELECT” statements from standard input, and outputs the SQL_API statements representing the original SQL statements. Each MV name will be taken from the table name specified in each statement. You should try each statement for correctness before you attempt to convert it. You can easily check that it parses by running the SELECT portion, adding LIMIT 0 to the SELECT part of the query.

convert.php
This script takes as one argument the default database. Notice in the following example, the schema is not specified.

The MV will be placed into whatever schema is specified as the first argument on the command line:

The script (convert.php) supports basic queries which use select/group by/join/where. You may not use sub-queries or any non-deterministic functions like NOW() or RAND(). HAVING clauses, ORDER BY clauses, etc, are not supported. You can work around many of these limitations. I will discuss one of the workarounds below, in the ‘complete refresh’ section. I’ll discuss others in future blog posts.

Enable the view to use it

The SQL_API call flexviews.enable() is used to actually build the contents of the view, making it available for querying:

Data dictionary

Here is a quick example of the list of tables used by the above view, as stored in the data dictionary:

Using the dictionary

The following SQL is dynamically generated by the flexviews.get_sql() function call. This function reads the meta-data stored in the Flexviews data dictionary. It returns the SQL which represents the data stored in the view. This is much more convenient than trying to read from the dictionary directly to determine the contents of the view.

The complete refresh method

This method is actually quite simple in its operation. It will completely rebuild a view based on the SQL which defines it, each time the view is refreshed.

This is not much different from building a table with CREATE TABLE .. AS SELECT. In fact, this is actually part of what the COMPLETE refresh method does during its operation. It also takes care of atomically* replacing the new contents of the view with the old contents (using RENAME TABLE) which means that the view remains available for querying, even during refresh.

A complete refresh type view can use non-deterministic functions like NOW(). In fact any valid SELECT statement can be materialized. The definition of the MV is stored internally, and represents the actual SQL that defines the view. A simple API call is made to associate the SQL definition with the view, but the definition itself is a SQL statement, similar to a regular view. If you choose to use only the complete refresh method, then FlexCDC is not required to use Flexviews.

In the following example, the total sales amounts are calculated (for all time) from a monthly summary. The monthly summary is actually another MV. This saves a significant amount of time. If this MV did not access the other MV, then the SQL necessary to compute this list would take over 40 minutes to execute, because that is how long it takes to create the monthly summary from scratch.

Here is an example of a complete refresh MV which computes the total sales for all customers. It computes this from the dashboard_top_customers MV, ordering them in descending order:

When the view is ‘enabled’, the contents are created:

It only took about six seconds to calculate the total sales for all customers from the dashboard_customer_sales table. If you were paying attention to the incremental example, the dashboard_customer_sales table just happens to be the view we created above.

I picked this example, because it is important to understand that you can build complete refresh views on top of incremental ones. This will allow you to create complete refresh views that refresh much more quickly.

If this had to actually access the orders, order_lines and customers tables directly (the tables on which dashboard_customer_sales is built), then the query would take significantly longer (40+ minutes):

As an added benefit, if you build more than one complete refresh view from an incrementally refreshable one, then you can keep the complete views transactionally consistent with each other, as long as you refresh all of the complete views after the incremental one.

We can use the new MV (dashboard_top_customers) to list the top 10 customers nearly instantly. Note that every MV gets a special column mview$pk which is an auto_increment BIGINT surrogate key for the table. For complete refresh views which are ordered with ORDER BY, such as this one, this creates a ranking function automatically. In cases where this is not useful, simply ignore this column. It is used to prevent wide innodb primary keys on the MV.

You can also use this MV to display the total sales amount for any particular customer.

Notice how you can use this summary to calculate the number of lines in the order_lines table, three orders of magnitude more quickly than COUNT(*).

Notice the difference in response time.

Refresh method performance comparison

For demonstration purposes, I did the following:

  1. Created one view of each type. The view is the same as the one specified above.
  2. Both views took about 40 minutes to build
  3. Deleted 100 order lines.
  4. Refreshed both views.
  5. Compared the refresh performance and query results.

The view type does not usually* affect the time it takes to build the MVs the first time. Both of these views build in about the same about of time:

*If you use MIN/MAX/COUNT_DISTINCT, a secondary view will be built in the flexviews schema, transparently, to manage the distinct values for each group in the view. This will increases the time it takes to build, and the refresh will be more expensive. This is an optimization which is required to efficiently refresh MV which use those aggregate functions.

Now I delete some line items from orders:

Since the base table has changed, the views now exhibit drift. There should be 484 less lines reflected in the count in our MVs but they are out-of-date and must be refreshed:

Refreshing the MVs

The view which uses the complete refresh method takes a very long time to refresh:

But the incrementally refreshable version does not take long to refresh:

The second one is 365x faster because it examines only the rows that changed. This exceptional ability to look at only what changed, even for MVs with aggregation and joins, is the value proposition for Flexviews.

The flexviews.refresh() stored procedure

The first parameter to flexviews.refresh() is the materialized view id. Each view has an identifier which can be obtained with flexviews.get_id(‘schema’,’table’).

The second parameter reflects the refresh type. This parameter can take the options: ‘BOTH’,’COMPUTE’,’APPLY’, or ‘COMPLETE’. I specified ‘BOTH’ for the second parameter in both examples. ‘BOTH’ means compute the changes to the view, and also apply them. This is a combination of the ‘COMPUTE’ and ‘APPLY’ options. When you specify ‘COMPUTE’ then Flexviews will compute the row changes for the view, but not apply them. If their are changes which have not been applied, then you can apply those changes by passing the ‘APPLY’ option.

In both examples, I pass NULL as the last parameter. Materialized views which based on the complete refresh method will always take NULL for this parameter. For incrementally refreshable views, the last parameter is a transaction id number, which is usually obtained with flexviews.get_uow_id_from_datetime() function. If you pass a NULL value, then it refreshes the view up to the latest changes which have been collected. You can use this to refresh multiple incremental refresh views to the same transactional point in time.

Also note that you can ‘COMPUTE’ or ‘APPLY’ changes to a particular transaction id. Logically, you can not apply changes past the transaction id to which you have computed them.

And then confirm they contain the same results

Conclusion

Flexviews supports two refresh methods, the complete method and the incremental method. I think you will agree that the incremental method has significant advantages over the complete method. In this example the former method was over 350x faster than the latter.

I also showed how you can combine both types of views together. The complete method examples show how to create a complete refresh view which reads from an incrementally refreshable one. This allows the use of SQL features not available with the incremental method, like ORDER BY or use of NOW(), it but still provides improved performance during refresh by accessing summarized data.

I hope this helps you understand how Flexviews can help you ensure fast response times in your application by making access to summary data efficient.

28 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Jonathan

That’s a really good post about Flexviews.
I also didn’t know about the convert.php part.
With that, it couldn’t be more simple to use.

Frank Xu

Thanks for your detailed explanation, and i’m looking forward to new improved version of flexviews,
thanks for your great work.

Frank Xu

Hi,
I’m using the great flexviews, and I have a question about the mview_signal table.
When i refresh an mview, I notice that the uow_execute procedure will insert a new row into mview_signal, then the newly inserted row will be captured by Flexcdc and insert into flexviews_mview_signal. Is this right?

The question is when no data change happened in my own database, the table mview_signal table grows every time i call flexviews.refresh, how could I fix this?
thank you.

Jonathan

That’s a really good post about Flexviews.
I also didn’t know about the convert.php part.
With that, it couldn’t be more simple to use.

aivan

Hi I tried the refresh procedure but I got this error.

“Unknown column ‘ERROR: XYZ UNSUPPORTED REFRESH METHOD’ in ‘field list'”

I need help regarding this error.

aivan

Hi what is the cause of this error :

Unknown column ‘ERROR: ERROR: SIGNAL ID NOT FOUND (FlexCDC consumer is likely behind)’ in ‘field list’

Atrevido

“ERROR: ERROR: SIGNAL ID NOT FOUND”
This error keeps coming up when I try to refresh the incremental views. I realized that this is because flexviews_mview_signal table is empty, but no matter what I do, nothing is inserted into this table. What do I do??
Your suggestion would greatly help.

Thanks in advance

Atrevido

Hi aivan
There seems to be some problem with the auto_changelog setting in the consumer.ini file. The SIGNAL ID NOT FOUND error keeps coming when we try to refresh the incremental refresh. Try with the auto_changelog setting as ‘false’. It worked for me.

Cheers

Kairen.Huang

Dear Justin,
How to refresh the MV that created by using “call flexviews.create(‘test’,’mv1′,’INCREMENTAL’);” automatically? thanks.

Kairen.Huang

When I set “auto_changelog=true” in the consumer.ini, flexviews can’t work and throws “Restarting FlexCDC!”(can’t stop the Terminal until close the window)

Kairen.Huang

Thanks Justin so much. The success code as :

CREATE EVENT if not exists e_test
on schedule every 15 second
on completion preserve
do call flexviews.refresh(14,’BOTH’,NULL);

set global event_scheduler=1;

Kairen.Huang

Dear Justin,

How can I use “[source]” and “[dest]” in “consumer.ini” file to refresh the MV data in another database(different host)?
Thank you.

guojhq

Hi what is the cause of this error :

Unknown column ‘ERROR: ERROR: SIGNAL ID NOT FOUND (FlexCDC consumer is likely behind)’ in ‘field list’

Atrevido

The “ERROR: ERROR: SIGNAL ID NOT FOUND (FlexCDC consumer is likely behind)” is usually confronted when the flexviews_mview_signal table gets stale and is not logging the changes on the original table ‘mview_signal’ (Both in the flexviews database). This is probably because either your consumer has stopped running or it has stopped logging the table. Check in the mvlogs table if the flexviews_mview_signal has an entry and is active.

Everett

Very interesting.. thank you! 🙂

Dhairya

Hello

I m just getting confused with the table and materialized view concept. what is the difference if we are to process about 40 million records and crunching the records using group by queries to make it to about 4 million records and saving it to another table. and if we do the same process as described above and use materialized view instead of table to save those 4 million records ..
Please let me know where in the process does it make the difference between the two.

Vadim

Hello,

I have one problem – everything is wokring – just the refreshing doesn’t succeed – error mesage:
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘t.uow_id >37 AND t.uow_id <=41' at line 3

I'm using the incremental type – FlexCDC is working – i see new netries in the table mvlog_….
I try refrishing with the following SQL:
CALL flexviews.refresh(flexviews.get_id('views','transaction_view'),'BOTH',NULL);

Justin Swanhart

Vadim,

Are you using the latest version of Flexviews from http://github.com/greenlion/swanhart-tools?

Vadim Sacharow

Yes – i have downloaded the latest version on sunday.
Will it be enough to copy all entries from the mvlog_ table to my view and delete all entries from the mvlog_ table?
I can do it with simple sql.

The logic with the transaction ids is not needed for me – i want all updates since last refreshing

Justin Swanhart

Vadim,

If you have only single table views, then you can merge the changes from the log to the table. If you have views with more than one table, that won’t work properly.

Also, as a debug measure, can you try to refresh your view and when it fails do:
select @v_sql;

That should print out the SQL that failed and I can debug the problem.

Vadim

Here is the SQL – thee is no WHERE … may be it is the problem

INSERT INTO vadim_views.transaction_view_delta
SELECT (t.dml_type * 1) as dml_type,t.uow_id as uow_id,t.fv$gsn as fv$gsn ,(t.id) as id, (site.name) as siteName, (c.name) as categoryName, (a.shopId) as shopId, (a.title) as title, (a.startPrice) as startPrice, (a.listingType) as listingType, (t.price) as price, (t.shopDate) as shopDate, (b.name) as buyerName, (s.name) as sellerName, (a.url) as url, (a.id) as auctionId, (s.id) as sellerId, (b.id) as buyerId, (crc32(concat(t.id,site.name,c.name,a.shopId,a.title,a.startPrice,a.listingType,t.price,t.shopDate,b.name,s.name,a.url,a.id,s.id,b.id))) as mview$hash
FROM flexviews.mvlog_66e71d03820fed5312b77f14ca205f83 as t
JOIN vadim.shop_category as c ON a.shopCategoryId = c.id
JOIN vadim.shop_seller as s ON a.shopSellerId = s.id
JOIN vadim.shop_site as site ON a.shopSiteId = site.id
JOIN vadim.auction as a ON t.auctionId = a.id
JOIN vadim.shop_buyer as b ON t.shopBuyerId = b.id
t.uow_id >37 AND t.uow_id <=42

Justin Swanhart

Interesting. I will check the code to see how the missing WHERE could happen. I’ll let you know when I have a fix.

Justin Swanhart

I pushed a bug fix for your issue into git. You will need to pull a new version and run
cd location of Flexviews (./swanhart-tools/flexviews by default)
mysql> use flexviews;
mysql> \.procs\delta.sql

Try to refresh your view.

Please let me know how it works.

sudheer

it is not showing consistence result when compare with MV and direct queries.

Ex: I have created flexviews with the following query and checked for one account count it is mismatching.

select count total_count ,account_id from tickets group by account_id ;

for checking I have used the following command in mysql.

select count(1) from helpkit.helpdesk_tickets where account_id = 281917; call flexviews.refresh(6,’BOTH’,null); select * from helpkit.account_count_3 order by total_count desc limit 1;

I got the following output.

mysql> select count(1) from helpkit.helpdesk_tickets where account_id = 281917; call flexviews.refresh(6,’BOTH’,null); select * from helpkit.account_count_3 order by total_count desc limit 1;
+———-+
| count(1) |
+———-+
| 907765 |
+———-+
1 row in set (0.25 sec)

Query OK, 2 rows affected (0.15 sec)

+———-+————-+————+
| mview$pk | total_count | account_id |
+———-+————-+————+
| 13217 | 907753 | 281917 |
+———-+————-+————+
1 row in set (0.00 sec)

original count is 907765 but MV is showing 907753