I’ve seen my posts on Ontime Air traffic and Star Schema Benchmark got a lot of interest
However benchmarks by itself did not cover all cases I would want, so I was thinking about better scenario. The biggest problem is to get real big enough dataset, and I thank to Bradley C. Kuszmaul, he pointed me on Wikipedia statistics on access to Wikipedia pages, and thank to Domas, who made stats accessible. Link to the archives: http://dammit.lt/wikistats/archive/ or the original Domas’s announcement .
Although the table does not have very much different information, I think it is good enough to represent cases you can face in Web application ( log processing, page visits, clickstream, etc).
I made some efforts to normalize data to have model in classic star schema and prepared queries that could be run on proposed dataset (John Sichi, lead of LucidDB helped me to draft some queries).
You can see details on our Percona Wikistat benchmark Wiki.
I have next goals with proposed benchmark:
- Compare engines in OLAP queries for planning, predicting growth, analyzing access patterns to wiki pages, draw trends.
- Compare engines in statistical queries for end users, which can be executed in real-time. I.e. How many times that or another page was accessed yesterday vs today.
- Understand specific features and characteristic of each engine.
- Compare throughput on simple queries (queries and scenario to be drafted yet)
- Check ability to load data and serve queries at the same time ( availability during data load ) (queries and scenario to be drafted yet)
So in proposed schema I have four tables:
(fact table), and
pages, datesinfo, projects
Dimensions tables are supposed to be static and not changed, and we can change datasize
by varying amount of months loaded into fact table (so this is scale factor).
( made with MySQL Workbench )
In current dataset, which you can download from Amazon snapshot (name: â€œpercona-wikistatâ€, ID:snap-a5f9bacc) we have:
- Table: 724.550.811 rows. data size: 40476MShell1pages
- Table: 9624 rows, one entry represents 1 hourShell1datesinfo
- Table2025 rowsShell1projects:
Data for 2009-06: # 3.453.013.109 rows / size 68352M
Data for 2009-07: # 3.442.375.618 rows / size 68152M
So with two months of stats we have about 172GB of data with about 7 billion rows in fact table.
Example of query ( again, full list on Benchmark Wiki)
SELECT project, sum(page_count) sm
JOIN datesinfo di ON ( di.id=date_id )
JOIN projects p ON (p.id=project_id )
WHERE di.calmonth=7 and di.calyear=2009
GROUP BY project
ORDER BY sm DESC
I am going to load data and run queries against available engines:
- MySQL MyISAM / InnoDB (to have reference results)
and I will report my results ( so stay with MySQLPerformanceBlog )
I’d like also to test also Paraccel, Vertica and KickFire systems, but I do not have access to.
I welcome your feedback on the benchmark, and what else you would like to see here.