My posts on the Ontime air-traffic dataset and the Star Schema Benchmark drew a lot of interest
(links:
- https://www.percona.com/blog/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
- https://www.percona.com/blog/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/
- https://www.percona.com/blog/2009/10/26/air-traffic-queries-in-luciddb/
- https://www.percona.com/blog/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/
).
However, those benchmarks by themselves did not cover all the cases I wanted, so I was thinking about a better scenario. The biggest problem is getting a real, big-enough dataset. My thanks go to Bradley C. Kuszmaul, who pointed me to the Wikipedia statistics on access to Wikipedia pages, and to Domas, who made the stats accessible. Link to the archives: http://dammit.lt/wikistats/archive/ (see also Domas's original announcement).
Although the table does not carry a lot of varied information, I think it is good enough to represent cases you can face in a web application (log processing, page visits, clickstream, etc.).
I made some effort to normalize the data into a classic star schema and prepared queries that can be run on the proposed dataset (John Sichi, lead of LucidDB, helped me draft some of the queries).
You can see details on our Percona Wikistat benchmark Wiki.
I have the following goals for the proposed benchmark:
- Compare engines on OLAP queries used for planning, predicting growth, analyzing access patterns to wiki pages, and drawing trends.
- Compare engines on statistical queries for end users that can be executed in real time, e.g. how many times a given page was accessed yesterday vs. today.
- Understand the specific features and characteristics of each engine.
- Compare throughput on simple queries (queries and scenario still to be drafted).
- Check the ability to load data and serve queries at the same time, i.e. availability during data load (queries and scenario still to be drafted).
So the proposed schema has four tables: pagestat (the fact table) and pages, datesinfo, and projects (dimension tables).
The dimension tables are static and do not change; we can vary the data size by changing the number of months loaded into the fact table (so this is the scale factor).
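A minimal DDL sketch of the schema (only the columns that appear in the example queries are taken from the post; the remaining column names and types are placeholders, and the exact definitions are on the Benchmark Wiki):

```sql
-- Dimension tables (static)
CREATE TABLE pages (
    id    INT NOT NULL PRIMARY KEY,
    title VARCHAR(255)              -- placeholder column
);

CREATE TABLE datesinfo (
    id       INT NOT NULL PRIMARY KEY,  -- one row per hour
    calyear  SMALLINT,
    calmonth TINYINT
);

CREATE TABLE projects (
    id      INT NOT NULL PRIMARY KEY,
    project VARCHAR(32)
);

-- Fact table; scale factor = number of months loaded
CREATE TABLE pagestat (
    page_id    INT NOT NULL,
    date_id    INT NOT NULL,
    project_id INT NOT NULL,
    page_count BIGINT
);
```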
EER diagram (made with MySQL Workbench)
In the current dataset, which you can download from an Amazon snapshot (name: “percona-wikistat”, ID: snap-a5f9bacc), we have:
- Table pages: 724,550,811 rows, data size 40,476MB
- Table datesinfo: 9,624 rows; one entry represents one hour
- Table projects: 2,025 rows
- Table pagestat:
  - data for 2009-06: 3,453,013,109 rows / size 68,352MB
  - data for 2009-07: 3,442,375,618 rows / size 68,152MB
So with two months of stats we have about 172GB of data, with about 7 billion rows in the fact table.
Example query (again, the full list is on the Benchmark Wiki):
SELECT project, sum(page_count) sm
FROM pagestat
JOIN datesinfo di ON (di.id = date_id)
JOIN projects p ON (p.id = project_id)
WHERE di.calmonth = 7 AND di.calyear = 2009
GROUP BY project
ORDER BY sm DESC
LIMIT 20;
I am going to load the data and run the queries against the available engines:
- MySQL MyISAM / InnoDB (to have reference results)
- InfoBright
- InfiniDB
- MonetDB
- LucidDB
- Greenplum
and I will report my results (so stay tuned to MySQLPerformanceBlog ;))
I would also like to test the ParAccel, Vertica, and Kickfire systems, but I do not have access to them.
I welcome your feedback on the benchmark, and what else you would like to see here.
Cool!
I can’t wait 🙂
Hi Vadim —
It should be a very interesting test. Thanks for including GreenPlum here.
In our company we are currently testing star schema performance for MySQL MyISAM, MySQL TokuDB, InfiniDB, GreenPlum and Vertica. It will be interesting to compare our results with yours.
Regards,
Alexander
I’m very interested in the continuous query abilities of these engines and will look forward to the results as they come in. Thanks for doing the work, Vadim.
David
If you are benching Greenplum, which is a proprietary fork of Postgres, why not also bench Pg for comparison’s sake?
Rob,
GreenPlum supports column-oriented tables and can utilize all CPUs to execute a single query.
Postgres can’t do that, AFAIK.
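For example, in Greenplum a table can be declared column-oriented at creation time (a sketch using Greenplum’s append-only storage options; the column list here is abbreviated):

```sql
-- Column-oriented, append-only fact table in Greenplum;
-- DISTRIBUTED BY spreads rows across segments (one or more per CPU core)
CREATE TABLE pagestat (
    page_id    INT,
    date_id    INT,
    project_id INT,
    page_count BIGINT
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (page_id);
```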
Could you consider adding Firebird 2.5 RC in SuperClassic mode to the comparison?
There aren’t that many benchmarks of it and this could be interesting to see.
One other thing: have you considered using a db design used by a content management system (Drupal, TYPO3) or an e-commerce solution (Opencart, Magento, …)?
Adding vanilla Postgres would still be interesting just to see how it compares to Greenplum.
Thanks 🙂
Slapo,
I do not think a CMS database design is suitable for OLAP benchmarks.
Nor do I think the OLTP-oriented Firebird and PostgreSQL will do well on the proposed benchmark.
I expect the queries will take 5–1000 minutes to execute, and without a specialized engine
it will just be painful. I am going to run MySQL MyISAM / InnoDB only to have a baseline.
Vadim,
I suggested testing Postgres because I imagine it would add very little difficulty over testing Greenplum. I do not think you are incorrect that performance would probably be lacking; I just thought it would be another interesting baseline.
Vadim,
I suggested what I wrote before because someone might actually want to do some analytics on a CMS or an e-shop: visits, visitors, countries, extensive product history, etc.
I see your point of it being perhaps too lengthy to bother with it, though.
Rob,
There are two factors:
1) I am not an expert in PostgreSQL.
2) Results that directly compare PostgreSQL vs. MySQL usually cause holy wars, and whoever presents them is declared unqualified a) in PostgreSQL, b) in MySQL, or c) in both.
So I do not want to go that way.
However, if there is an expert who wants to install PostgreSQL, tune it, and run the proposed scenarios,
I would be happy to provide access to the box.
Hi Vadim,
Sounds great! What kind of box will you be using, and how much memory and how many disks will it have? E.g. Infobright compresses data quite aggressively, so your 172GB will probably fit in 64GB or even 32GB of memory, and Greenplum likes lots of disks (at least one data segment per CPU core, so on a dual quad-core machine you get better results with 8 partitions on separate drives). You should also try to simulate concurrent access by firing multiple concurrent query streams. Some databases are fast when running one query at a time but get crippled when they have to perform multiple tasks simultaneously.
About the ‘holy wars’ you mention: look at http://bit.ly/caLeRF, where I ran TPC-H using Greenplum and got discussions between PostgreSQL and GP advocates…
Anyway, good luck with the benchmark, and if you need help (or access to my machine ;-), just let me know.
best, Jos
Vadim,
I dropped a note to the PG advocacy list in hopes of getting an expert to help with tuning.
I *wish* I was (and hope to be someday) that expert.
I’d like to see the performance of the following query:
SELECT project_id, sum(page_count) sm
FROM pagestat
GROUP BY project_id;
This query is important to me because it looks at ALL the data in the fact table without any joins. This sort of query gives an overview of all the data, which is typical in my field of data mining. I suspect that row-based engines will perform better on this one.
David,
I added your query, but I’d like to have at least a date restriction,
as we may have different amounts of data loaded into the fact table, depending on the scale factor.
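For example, a date-restricted form of your query might look like this (a sketch of what I have in mind; the final version will be on the Benchmark Wiki):

```sql
-- Same aggregation, but pinned to one month so results are
-- comparable across different scale factors (months loaded)
SELECT project_id, sum(page_count) sm
FROM pagestat
JOIN datesinfo di ON (di.id = date_id)
WHERE di.calyear = 2009 AND di.calmonth = 7
GROUP BY project_id;
```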
Vadim,
Thanks for your answer. I see your point, though the problem is that queries without any restrictions are essential for answering the following typical business questions (I do not know whether they can be answered with the data in your example database):
– How many visits have we had in total so far?
– How many pages have been created?
– What is the distribution of visits per page?
– How has page creation developed over time?
All these sorts of questions run over all the data. I believe that looking at fractions of the data is more the IT perspective on data.
David,
Sure, that makes sense.
It’s just that in the current environment it may be a runaway query,
which will take days to execute on some engines…