August 22, 2014

MySQL Storage Engines – PBXT

I promised to write short articles about all the storage engines whose sessions I attended during the MySQL Users Conference, but I have not gotten very far yet – too busy. So today it is time for PBXT.

I was very interested in the session about PBXT because this storage engine does not target the same general-purpose transactional storage engine market that many others are targeting. It also has a number of unusual design decisions which position it even further apart.

Paul spent a while comparing PBXT to MyISAM with multi-versioning and transactions, and this is a valid comparison, for better and for worse. In its current state (as of the conference) PBXT does not offer durable transactions (meaning you can lose committed transactions if the power goes down); furthermore, in case of a crash the database may become corrupted, just as with MyISAM, and you might need to repair tables. This is of course scheduled to be fixed before the stable release, but the performance effect of that fix is still unknown – how a transactional system implements logging and dirty buffer flushing has a serious impact on performance.

The other gotcha you should be aware of at this time is “per-database transactions” – if you modify data in a number of databases, you may see partial changes. The same applies to multi-versioning – if you use the consistent-read isolation mode, consistent reads are per database, not global. We have yet to see what happens with this in the future – for transactions there is an obvious fix: use two-phase commit for transactions spanning multiple databases. This might not even add too much overhead, as XA is already used for synchronizing with the binary log.
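To make the two-phase idea concrete, here is roughly what such a commit looks like at the SQL level, using MySQL’s XA statements. This is only a sketch – the database and table names are made up, and PBXT does not coordinate cross-database commits this way today:

  XA START 'xfer1';
  UPDATE db1.accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE db2.ledger   SET balance = balance + 100 WHERE id = 1;
  XA END 'xfer1';
  XA PREPARE 'xfer1';  -- phase one: make the changes durable in both databases
  XA COMMIT 'xfer1';   -- phase two: make them visible atomically

The key property is that after a successful prepare, either both databases commit or both roll back – which is exactly what per-database transactions currently cannot guarantee.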

The focus of PBXT, and one of its strengths, seems to be its handling of BLOBs, which are never fragmented and are handled efficiently. The PBXT team even leads the project to add scalable BLOB streaming to MySQL, which would be great for many applications if implemented well.

It is too bad the MySQL Users Conference only allowed 45 minutes for storage engine presentations. This is a complex topic, and newly presented storage engines in particular deserved more than 45 minutes. I left the presentation with a lot of questions about details such as index structures, buffer management, locking implementation and so on.

As we have already seen in our benchmarks, PBXT both performs and scales well on many read workloads. We did not check writes, because that is where a lot of changes are expected to happen anyway. We also only checked a CPU-bound workload – with disk IO the situation can be rather different.

Looking at the PBXT architecture, there is a small row pointer file which is expected to be accessed a lot, on each row read and row write. The file takes about 8 bytes per row as I understand it, so it should be quite small and cache well in the database cache. The write policy, however, becomes important at this stage: unless you can delay and group IOs together, you may end up with a lot of extra writes.
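A quick back-of-the-envelope check (the ~8 bytes per row figure is my understanding from the talk, not a documented number):

  SELECT 10000000 * 8 / 1024 / 1024 AS row_pointer_mb;  -- about 76MB for 10M rows

So even for a fairly large table the row pointer file stays comfortably cacheable.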

Besides the row pointer, each row has a fixed-length data part and dynamic-length data parts, stored in separate files. The dynamic-length part is stored in a file called the “Data Log File”, which is quite confusing to my taste. So altogether we have a row stored in three pieces in different locations, which may require a lot of IO for large data sizes. But you should of course take into account that this is only the worst-case scenario – for many queries you will not need to touch the dynamic-length part, and the row pointer file should be cached in most cases.
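To illustrate with a made-up table (how exactly columns split between the fixed and dynamic parts is my assumption based on the talk):

  CREATE TABLE orders (
    id     INT NOT NULL PRIMARY KEY,
    status CHAR(1) NOT NULL,  -- fixed-length part
    notes  TEXT               -- dynamic-length part, in the "Data Log File"
  ) ENGINE=PBXT;

  SELECT status FROM orders WHERE id = 42;  -- row pointer + fixed part only
  SELECT notes  FROM orders WHERE id = 42;  -- must also read the data log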

As PBXT does not use pages for data storage, which tend to lead to “holes”, it should have a rather compact footprint in the optimal case. On the other hand, because it leaves old versions in the data files, it needs a special compaction operation to keep the data files compact and efficient. As I heard it, someone joked at the conference: “Now, so many years after PostgreSQL, MySQL has finally got a storage engine which needs VACUUM.”

So overall, PBXT is a very interesting project to watch and to try out (we’re using PBXT with MySQL 5.1 in one of our projects), and if it is developed with the same pace and dedication, it will become one of the major storage engines for MySQL. We should surely get back to it soon, run more benchmarks and try out the newest version.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. To be fair, InnoDB has always had a VACUUM. It is the purge thread that removes deleted rows when they are no longer visible to other transactions. But as is typical of InnoDB, you don’t have to worry about it. It just works. Although for multi-disk servers we do need to tune some of the constants governing the amount of IO done by all of the background tasks.

  2. Srini says:

    Based on your experience using PBXT in your project, can you clarify the following? We are evaluating PBXT for one of our projects. To my understanding, the PBXT engine writes to only one file, and there is no transaction log file in PBXT which maintains the history of transactions.

    In this scenario, is there any option to specify the number of versions maintained by the PBXT storage engine? If so, could you please let me know whether the following is possible using the PBXT storage engine:

    a. Fetching a row using its version number, or fetching the previous version? Fetching the previous version should fetch the previous version from all the related tables if the row spans multiple tables. This would enable a consistent rollback.

    b. Fetching/selecting based on a timestamp. For example, let us say I am showing some sales records in a client view and also showing the total sales as S1 at time X. There may be new sales records added or some modified after time X, but whatever queries I make must always query the data that existed before time X.

    So, I would like to query the database with an explicit timestamp value specified in the query. This would let me keep a cache which does not have to be invalidated for every update that happens after time X, unless an explicit refresh operation is done.

    Srini

  3. peter says:

    Mark,

    InnoDB is different. InnoDB has a purge which removes row records that are no longer needed. It does not do anything with the rest of the data.

    In PBXT it is more like OPTIMIZE, which scans the original data set and writes it to a new location, eliminating holes.

    This is actually very interesting – a lot of people wish MyISAM, or any storage engine for that matter, had an online OPTIMIZE TABLE. Now PBXT kind of has it (for the dynamic-length portion). I am, however, concerned about the performance impact and the controls available for the frequency of the process and its resource consumption.

    The fact that the thread is normally low priority would not help much, because it is going to be IO bound.

  4. peter says:

    Srini,

    I’m not sure what you mean by writing to only one file. PBXT has a transaction log file and even more – there is a per-table “Data Log File” and a per-database “action log file”; the last one is what you’re looking for.

    Regarding access to old versions – it looks like it is not available, and the main reason is that there is no such functionality in SQL or the MySQL storage engine interface, so it would require quite a lot of hacks.

    But I agree this would be quite handy. In fact, most multi-versioning systems should be able to implement such functionality. All they need to do is map between timestamps and version numbers and make sure they only purge data versions which are older than a certain age.

    Most systems, however, are not designed to store potentially many thousands of row versions – quite frequently versions are implemented as linked lists, so it could become rather slow. Not to mention other problems, such as examining all row versions in the index tree for index lookups, etc.
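    In the meantime, something similar can be built at the application level. A sketch (the schema is entirely hypothetical): store each row version explicitly together with the time it became current, and query “as of” a timestamp:

      CREATE TABLE sales_history (
        sale_id    INT NOT NULL,
        version    INT NOT NULL,
        amount     DECIMAL(10,2) NOT NULL,
        valid_from DATETIME NOT NULL,
        PRIMARY KEY (sale_id, version)
      );

      -- latest version of each row as of time X
      SELECT s.sale_id, s.amount
      FROM sales_history s
      JOIN (
        SELECT sale_id, MAX(version) AS version
        FROM sales_history
        WHERE valid_from <= '2006-07-01 12:00:00'
        GROUP BY sale_id
      ) latest USING (sale_id, version);

    You pay for this with extra storage and write amplification, of course – which is exactly the trade-off an engine-level purge is designed to manage for you.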

