My posts on the Ontime air-traffic dataset and the Star Schema Benchmark drew a lot of interest
(links:
- https://www.percona.com/blog/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
- https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/
- https://www.percona.com/blog/2009/10/26/air-traffic-queries-in-luciddb/
- https://www.percona.com/blog/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/
).
However, those benchmarks by themselves did not cover all the cases I wanted, so I was thinking about a better scenario. The biggest problem is getting a real, big-enough dataset. My thanks go to Bradley C. Kuszmaul, who pointed me to the Wikipedia statistics on access to Wikipedia pages, and to Domas, who made the stats accessible. Link to the archives: http://dammit.lt/wikistats/archive/ (see also Domas's original announcement).
Although the table does not carry a lot of varied information, I think it is good enough to represent cases you can face in a web application (log processing, page visits, clickstream, etc.).
I made some effort to normalize the data into a classic star schema and prepared queries that can be run on the proposed dataset (John Sichi, lead of LucidDB, helped me draft some of the queries).
You can see details on our Percona Wikistat benchmark Wiki.
I have the following goals for the proposed benchmark:
- Compare engines on OLAP queries used for planning, predicting growth, analyzing access patterns to wiki pages, and drawing trends.
- Compare engines on statistical queries for end users that can be executed in real time, e.g. how many times a given page was accessed yesterday vs. today.
- Understand the specific features and characteristics of each engine.
- Compare throughput on simple queries (queries and scenario still to be drafted).
- Check the ability to load data and serve queries at the same time, i.e. availability during data load (queries and scenario still to be drafted).
So the proposed schema has four tables: pagestat (the fact table) and pages, datesinfo, and projects (dimension tables).
The dimension tables are static and do not change; we can vary the data size by changing the number of months loaded into the fact table (so this is the scale factor).
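A minimal DDL sketch of the schema (only the columns that appear in the example queries are taken from the post; the remaining column names and types are placeholders, and the exact definitions are on the Benchmark Wiki):

```sql
-- Dimension tables (static)
CREATE TABLE pages (
    id    INT NOT NULL PRIMARY KEY,
    title VARCHAR(255)              -- placeholder column
);

CREATE TABLE datesinfo (
    id       INT NOT NULL PRIMARY KEY,  -- one row per hour
    calyear  SMALLINT,
    calmonth TINYINT
);

CREATE TABLE projects (
    id      INT NOT NULL PRIMARY KEY,
    project VARCHAR(32)
);

-- Fact table; scale factor = number of months loaded
CREATE TABLE pagestat (
    page_id    INT NOT NULL,
    date_id    INT NOT NULL,
    project_id INT NOT NULL,
    page_count BIGINT
);
```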
EER diagram (made with MySQL Workbench)
In the current dataset, which you can download from an Amazon snapshot (name: “percona-wikistat”, ID: snap-a5f9bacc), we have:
- Table pages: 724,550,811 rows, data size 40,476MB
- Table datesinfo: 9,624 rows; one entry represents one hour
- Table projects: 2,025 rows
- Table pagestat:
  - data for 2009-06: 3,453,013,109 rows / size 68,352MB
  - data for 2009-07: 3,442,375,618 rows / size 68,152MB
So with two months of stats we have about 172GB of data, with about 7 billion rows in the fact table.
Example query (again, the full list is on the Benchmark Wiki):
SELECT project, sum(page_count) sm
FROM pagestat
JOIN datesinfo di ON (di.id = date_id)
JOIN projects p ON (p.id = project_id)
WHERE di.calmonth = 7 AND di.calyear = 2009
GROUP BY project
ORDER BY sm DESC
LIMIT 20;
I am going to load the data and run the queries against the available engines:
- MySQL MyISAM / InnoDB (to have reference results)
- InfoBright
- InfiniDB
- MonetDB
- LucidDB
- Greenplum
and I will report my results (so stay tuned to MySQLPerformanceBlog ;))
I would also like to test the ParAccel, Vertica, and Kickfire systems, but I do not have access to them.
I welcome your feedback on the benchmark, and what else you would like to see here.
Cool!
I can’t wait 🙂
Hi Vadim —
It should be a very interesting test. Thanks for including GreenPlum here.
In our company we are currently testing star schema performance for MySQL MyISAM, MySQL TokuDB, InfiniDB, GreenPlum and Vertica. It will be interesting to compare our results with yours.
Regards,
Alexander
I’m very interested in the continuous query abilities of these engines and will look forward to the results as they come in. Thanks for doing the work, Vadim.
David
If you are benching Greenplum, which is a proprietary fork of Postgres, why not also bench Pg for comparison’s sake?
Rob,
GreenPlum supports column-oriented tables and can utilize all CPUs to execute a single query.
Postgres can’t do that, AFAIK.
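For example, in Greenplum a table can be declared column-oriented at creation time (a sketch using Greenplum’s append-only storage options; the column list here is abbreviated):

```sql
-- Column-oriented, append-only fact table in Greenplum;
-- DISTRIBUTED BY spreads rows across segments (one or more per CPU core)
CREATE TABLE pagestat (
    page_id    INT,
    date_id    INT,
    project_id INT,
    page_count BIGINT
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (page_id);
```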
Could you consider adding Firebird 2.5 RC in SuperClassic mode to the comparison?
There aren’t that many benchmarks of it and this could be interesting to see.
One other thing: have you considered using a db design used by a content management system (Drupal, TYPO3) or an e-commerce solution (Opencart, Magento, …)?
Adding vanilla Postgres would still be interesting just to see how it compares to Greenplum.
Thanks 🙂
Slapo,
I do not think a CMS database design is suitable for OLAP benchmarks.
Nor do I think the OLTP-oriented Firebird and PostgreSQL will do well on the proposed benchmark.
I expect the queries will take 5–1000 minutes to execute, and without a specialized engine
it will just be painful. I am going to run MySQL MyISAM / InnoDB only to have a baseline.
Vadim,
I suggested testing Postgres because I imagine it would add very little difficulty over testing Greenplum. I do not think you are incorrect that performance would probably be lacking; I just thought it would be another interesting baseline.
Vadim,
I suggested what I wrote before because someone might actually want to do some analytics on a CMS or an e-shop: visits, visitors, countries, extensive product history, etc.
I see your point of it being perhaps too lengthy to bother with it, though.
Rob,
There are two factors:
1) I am not an expert in PostgreSQL.
2) Results that directly compare PostgreSQL vs. MySQL usually cause holy wars, and whoever presents them is declared unqualified a) in PostgreSQL, b) in MySQL, or c) in both.
So I do not want to go that way.
However, if there is an expert who wants to install PostgreSQL, tune it, and run the proposed scenarios,
I would be happy to provide access to the box.
Hi Vadim,
Sounds great! What kind of box will you be using, and how much memory and how many disks will it have? E.g. Infobright compresses data quite aggressively, so your 172GB will probably fit in 64GB or even 32GB of memory, and Greenplum likes lots of disks (at least one data segment per CPU core, so on a dual quad-core machine you get better results with 8 partitions on separate drives). You should also try to simulate concurrent access by firing multiple concurrent query streams. Some databases are fast when running one query at a time but get crippled when they have to perform multiple tasks simultaneously.
About the ‘holy wars’ you mention: look at http://bit.ly/caLeRF, where I ran TPC-H using Greenplum and got discussions between PostgreSQL and GP advocates…
Anyway, good luck with the benchmark, and if you need help (or access to my machine ;-), just let me know.
best, Jos
Vadim,
I dropped a note to the PG advocacy list in hopes of getting an expert to help with tuning.
I *wish* I was (and hope to be someday) that expert.
I’d like to see the performance of the following query:
SELECT project_id, sum(page_count) sm
FROM pagestat
GROUP BY project_id;
This query is important to me because it looks at ALL the data in the fact table without any joins. This sort of query gives an overview of all the data, which is typical in my field of data mining. I suspect that row-based engines will perform better on this one.
David,
I added your query, but I’d like to have at least a date restriction,
as we may have different amounts of data loaded into the fact table, depending on the scale factor.
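For example, a date-restricted form of your query might look like this (a sketch of what I have in mind; the final version will be on the Benchmark Wiki):

```sql
-- Same aggregation, but pinned to one month so results are
-- comparable across different scale factors (months loaded)
SELECT project_id, sum(page_count) sm
FROM pagestat
JOIN datesinfo di ON (di.id = date_id)
WHERE di.calyear = 2009 AND di.calmonth = 7
GROUP BY project_id;
```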
Vadim,
Thanks for your answer. I see your point, though the problem is that queries without any restrictions are essential for answering the following typical business questions (I do not know whether they can be answered with the data in your example database):
– How many visits have we had in total so far?
– How many pages have been created?
– What is the distribution of visits per page?
– How has page creation developed over time?
All these sorts of questions run over all the data. I believe that looking at fractions of the data is more the IT perspective on data.
David,
Sure, that makes sense.
It’s just that in the current environment it may be a runaway query,
which will take days to execute on some engines…