My posts on the Ontime Air Traffic and Star Schema Benchmark datasets got a lot of interest.
However, those benchmarks by themselves did not cover all the cases I wanted, so I was thinking about a better scenario. The biggest problem is getting a real, big-enough dataset, so I thank Bradley C. Kuszmaul, who pointed me to the statistics on access to Wikipedia pages, and Domas, who made the stats accessible. Link to the archives: http://dammit.lt/wikistats/archive/ (see also Domas's original announcement).

Although the table does not contain much varied information, I think it is good enough to represent cases you can face in a web application (log processing, page visits, clickstream, etc.).

I made some effort to normalize the data into a classic star schema model and prepared queries that could be run on the proposed dataset (John Sichi, lead of LucidDB, helped me draft some queries).
You can see the details on our Percona Wikistat benchmark Wiki.

I have the following goals for the proposed benchmark:

  • Compare engines on OLAP queries for planning, predicting growth, analyzing access patterns to wiki pages, and drawing trends.
  • Compare engines on statistical queries for end users, which can be executed in real time, e.g. how many times a given page was accessed yesterday vs. today (see the sketch after this list).
  • Understand the specific features and characteristics of each engine.
  • Compare throughput on simple queries (queries and scenario still to be drafted).
  • Check the ability to load data and serve queries at the same time (availability during data load; queries and scenario still to be drafted).
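As an illustration of the second goal, here is a minimal sketch of such a real-time statistical query. Only the fact table name pagestat and the page_count column come from the example query quoted in the comments below; the page_id column, the dates dimension, and its columns are assumptions for illustration, not the actual benchmark schema.

-- Hedged sketch: compare yesterday's and today's hits for one page.
-- Only pagestat and page_count come from the post/comments; the rest is assumed.
SELECT d.calendar_date, SUM(f.page_count) AS hits
FROM pagestat f
JOIN dates d ON f.date_id = d.date_id
WHERE f.page_id = 12345                          -- hypothetical page of interest
  AND d.calendar_date IN (CURRENT_DATE, CURRENT_DATE - INTERVAL 1 DAY)
GROUP BY d.calendar_date;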

So in the proposed schema I have four tables: one fact table and three dimension tables.

Dimension tables are static and do not change, and we can vary the data size
by changing the number of months loaded into the fact table (this is the scale factor).

EER diagram (made with MySQL Workbench)

In the current dataset, which you can download from an Amazon snapshot (name: “percona-wikistat”, ID: snap-a5f9bacc), we have:

  • Table (dimension): 724.550.811 rows, data size: 40476M
  • Table (dimension): 9624 rows, one entry represents 1 hour
  • Table (dimension): 2025 rows
  • Table (fact):
    Data for 2009-06: 3.453.013.109 rows / size 68352M
    Data for 2009-07: 3.442.375.618 rows / size 68152M
So with two months of stats we have about 172GB of data and about 7 billion rows in the fact table.
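To make the layout more concrete, here is a rough sketch of what the fact table could look like. It is only an illustration: pagestat, project_id, and page_count appear in the example query in the comments below, while all other names and types are my assumptions, not the actual benchmark DDL.

-- Hedged sketch of the fact table; only pagestat, project_id, and
-- page_count are taken from the post/comments, the rest is assumed.
CREATE TABLE pagestat (
  date_id    INT    NOT NULL,   -- assumed key into the hourly dates dimension
  project_id INT    NOT NULL,   -- assumed key into the projects dimension
  page_id    BIGINT NOT NULL,   -- assumed key into the pages dimension
  page_count INT    NOT NULL    -- number of requests in that hour
);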

Example of a query (again, the full list is on the Benchmark Wiki).
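Since the query text itself lives on the Wiki, here is a minimal sketch of the kind of OLAP query I mean: an aggregate over the fact table joined to a dimension. Again, only pagestat, project_id, and page_count are taken from the comments below; the projects dimension and its project_name column are assumptions.

-- Hedged sketch: top projects by total hits (dimension names are assumed).
SELECT p.project_name, SUM(f.page_count) AS total_hits
FROM pagestat f
JOIN projects p ON f.project_id = p.project_id
GROUP BY p.project_name
ORDER BY total_hits DESC
LIMIT 10;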

I am going to load the data and run the queries against the available engines:

  • MySQL MyISAM / InnoDB (to have reference results)
  • InfoBright
  • InfiniDB
  • MonetDB
  • LucidDB
  • Greenplum

and I will report my results (so stay with MySQLPerformanceBlog ;))

I’d also like to test Paraccel, Vertica, and KickFire systems, but I do not have access to them.

I welcome your feedback on the benchmark, and what else you would like to see here.

16 Comments
Roland Bouman

Cool!

I can’t wait 🙂

Alexander

Hi Vadim —

It should be a very interesting test. Thanks for including GreenPlum here.

In our company we are currently testing star schema performance for MySQL MyISAM, MySQL TokuDB, InfiniDB, GreenPlum and Vertica. It will be interesting to compare our results with yours.

Regards,
Alexander

David Prime

I’m very interested in the continuous query abilities of these engines and will look forward to the results as they come in. Thanks for doing the work, Vadim.

David

Rob Wultsch

If you are benching Greenplum, which is a proprietary fork of Postgres, why not also bench Pg for comparison’s sake?

Slapo

Could you consider adding Firebird 2.5 RC in SuperClassic mode to the comparison?
There aren’t that many benchmarks of it and this could be interesting to see.

One other thing: have you considered using a DB design used by a content management system (Drupal, TYPO3) or an e-commerce solution (Opencart, Magento, …)?

Adding vanilla Postgres would still be interesting just to see how it compares to Greenplum.

Thanks 🙂

Rob Wultsch

Vadim,
I suggested testing Postgres because I imagine it would add very little difficulty over testing Greenplum. I do not think you are incorrect that performance would probably be lacking; I just thought it would be another interesting baseline.

Slapo

Vadim,

I suggested what I wrote before because someone might actually want to do some analytics on a CMS or an e-shop: visits, visitors, countries, extensive product history, etc.
I see your point that it may be too lengthy to bother with, though.

Vadim

Rob,

There are two factors:
1) I am not an expert in PostgreSQL.
2) Results that directly compare PostgreSQL vs. MySQL usually cause holy wars, and whoever presents them is claimed to be unqualified a) in PostgreSQL, b) in MySQL, or c) in both.
So I do not want to go that way.

However, if there is an expert who wants to install PostgreSQL, tune it, and run the proposed scenarios,
I would be happy to provide access to the box.

Jos van Dongen

Hi Vadim,

Sounds great! What kind of box will you be using, and how much memory / how many disks will it have? E.g. Infobright compresses data quite aggressively, so your 172GB will probably fit in 64 or even 32GB of memory, and Greenplum likes lots of disks (at least 1 data segment per CPU core, so with a dual quad-core machine you get better results with 8 partitions on separate drives). What you should also try to do is simulate concurrent access by firing multiple concurrent query streams. Some databases are fast when running one query at a time but get crippled when they have to perform multiple tasks simultaneously.

About the ‘holy wars’ you mention: look at http://bit.ly/caLeRF where I ran TPCH using Greenplum and got discussions between PostgreSQL and GP advocates…

anyway, good luck with the benchmark and if you need help (or access to my machine ;-), just let me know.

best, Jos

Rob Wultsch

Vadim,
I dropped a note to the PG advocacy list in hopes of getting an expert to help with tuning.

I *wish* I was (and hope to be someday) that expert.

David

I’d like to see the performance of the following query:
SELECT project_id, sum(page_count) sm
FROM pagestat
GROUP BY project_id;

This query is important to me because it looks at ALL the data in the fact table without any joins. This sort of query gives an overview of all the data, which is typical in my field of data mining. I suspect that row-based engines will perform better for this one.

David

Vadim,

Thanks for your answer. I see your point, though the problem is that queries without any restrictions are essential for answering the following typical business questions (I do not know whether they can be answered with the data in your example database):
– How many visits have we had so far in total?
– How many pages have been created?
– What is the distribution of visits per page?
– How has page creation developed over time?

All these sorts of questions run on all the data. I believe that looking at fractions of the data is more of an IT perspective on data.