Sample Datasets for Benchmarking and Testing

Sometimes you just need some data to test and stress things. But randomly generated data is awful — it doesn’t have realistic distributions, and it isn’t easy to understand whether your results are meaningful and correct. Real or quasi-real data is best. Whether you’re looking for a couple of megabytes or many terabytes, the following sources of data might help you benchmark and test under more realistic conditions.

Datasets for Benchmarking

The venerable sakila test database: small, fake database of movies.
The employees test database: small, fake database of employees.
The Wikipedia page-view statistics database: large, real website traffic data.
The IMDB database: moderately large, real database of movies.
The FlightStats database: flight on-time arrival data, easy to import into MySQL.
The Bureau of Transportation Statistics: airline on-time data, downloadable in customizable ways.
The airline on-time performance and causes of delays data from data.gov: ditto.
The statistical review of world energy from British Petroleum: real data about our energy usage.
The Amazon AWS Public Data Sets: a large variety of data such as the mapping of the Human Genome and the US Census data.
The Weather Underground weather data: customize and download as CSV files.

Post your favorites in the comments!

21 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

pcrews

13 years ago

It should be noted that imdb has some restrictive licensing that would prevent anyone from making a data set publicly available. While the data is quite good and interesting, it would be of limited utility since it isn’t in database-ready form (IIRC).

I’m also working on a dataset based on this information:
http://www.eia.doe.gov/cneaf/electricity/epa/epa_sprdshts_monthly.html

Will be making it available on Launchpad (ala the employee dataset) once it is ready.

There are also some interesting datasets here:
http://www.grouplens.org/

Morgan Tocker

13 years ago

@Baron Schwartz – there’s also the world sample database. Useful for testing, but not for benchmarking: http://dev.mysql.com/doc/index-other.html

@pcrews – See http://imdbpy.sourceforge.net/. There is nothing stopping you from having scripted recreation of the data.

“easy to understand whether your results are meaningful and correct”.

I think this part needs to be written in bold, with an underline. We use the IMDB data for examples in our Percona training courses.

Baron Schwartz

Author

13 years ago

I bolded a couple of words 🙂

pcrews

13 years ago

@Morgan Tocker

Perhaps I’m missing something, but how do you get around this:
http://www.imdb.com/licensing/noncommercial

using the data in a training course doesn’t seem like personal or non-commercial use.

“IMDb grants you a limited license to access and make personal use of this site and not to download (other than page caching) or modify it, or any portion of it, except with express written consent of IMDb. This site or any portion of this site may not be reproduced, duplicated, copied, sold, resold, visited, or otherwise exploited for any commercial purpose without express written consent of IMDb. This license does not include any resale or commercial use of this site or its contents or any derivative use of this site or its contents. ”

I’m not trying to be a pain, but rather to educate myself on these issues. When I was at MySQL/Sun/Oracle, we were expressly forbidden from trying to create any test datasets based on the data by their fleet of lawyers

Baron Schwartz

Author

13 years ago

Hmm. If we were to ask a lawyer right now, I bet they’d scare us and tell us whatever we do, don’t admit wrongdoing. I think the best thing to do is say that we should probably find or generate a dataset whose licensing clearly permits using it for examples in our courses, and thank you for your feedback, Patrick.

pcrews

13 years ago

No worries and sorry if I was a wet-blanket. Mainly, I am very interested in datasets and making them available to everyone.

The imdb data is very juicy and I want to use it, was mainly curious to see if I could do something with it as well : )

Roland Bouman

13 years ago

freebase. http://download.freebase.com/datadumps/latest/

For relational work, the csv dumps are probably the quickest bet, but it does contain multivalued attributes. The quadruples dump is a normalized format. I haven’t gotten round to building a normalized format from that one.

pcrews

13 years ago

@Roland

I had built a dataset that merged Grouplens and Freebase movie information, but that work has become dusty.

However, the Freebase data is *great*. Just sifting through the movies-related information produced a fair number of tables with respectable populations : )

Gerry

13 years ago

My favorite is the Amarok player one, it’s small, has real data based on your music collection and listening habits, and the dataset is easy to understand and manipulate.

http://amarok.kde.org

My $.02
G

John

13 years ago

Where are the freely available literature citation datasets? DBLP seems to only cover compsci mostly. I need a REALLY big citation dataset over many disciplines and publishing houses (ACM, IEEE, Springer, etc.). Anyone?

Ronald Bradford

13 years ago

I have in the past compiled a list of public data sources. Many of the comments contain great links

More information at http://ronaldbradford.com/blog/seeking-public-data-for-benchmarks-2009-08-28/

Tim Riemenschneider

13 years ago

For a large dataset, you could import the openstreetmap-data:

http://wiki.openstreetmap.org/wiki/User:BigPeteB/Setting_up_a_local_database_copy

J. Andrew Rogers

13 years ago

A big problem with these data sets are that they are small, trivial cases, which limits the amount and kind of testing you can do. Large data sets exist but they are often implausibly large to move around over the Internet. You can use the listed data sets to easily test basic correctness but you can’t use them to test scaling behaviors.

Synthetic data sets are not interesting but neither are they random or unrealistic if built by a competent designer. The great thing about synthetic data set generators beyond producing data of unbounded size is that you can configure arbitrary distributions and properties of the data that test a broad range of characteristics not possible with real-world data sets. For example, if I want to simulate several types of skew and bias that move logically in space and time over the properties of the data set, it is pretty simple to do that. A good example of this is location and sensing data, which has fairly complex skew patterns in reality that will break most spatial indexing systems at scale — it is hard to get a data set that demonstrates this, but it is fairly easy to generate a synthetic data set that generates the same bulk behavior.

Purely synthetic data set generators can very accurately model real-world data patterns, it just requires the ability to generate complex skew behaviors and interactions in the data that do not rise above the noise floor in trivial samples. What would be useful, possibly more useful than sample data sets, is building a collection of synthetic workload generators with parameter sets that exactly match the distribution, dynamics, and skew of real-world data sets. The results are not interesting per se but they allow you to characterize runtime behaviors under all sorts of assumptions at arbitrary scales. For complex data sets like real-time spatial and graphs, synthetic is really the only way to get an accurate measure of a system.

Real-world data sets would be preferable in theory but in practice you cannot test a lot of things that matter using them. I would not suggest using badly designed randomly generated data sets but a good synthetic data set generator can be an excellent tool.

Baron Schwartz

Author

13 years ago

Here’s another one to add to the mix: the EFF’s dump of SSL certificate data (4gb) https://www.eff.org/observatory

Theresia

13 years ago

Thanks for this post. I am currently searching for a dataset of blogs or forums. I need to use them to test different spam detection techniques (as part of my studies )but I did not manage to find any available data . Ideally it should be realistic data that contains both spam comments and realistic comments.

Do you have an idea of where I can find such datasets please :s as I’ve been searching for a long time now and I did not find anything useful.

Thanks

Ronald Speelman

11 years ago

Hi Baron,

Thanks for this list, some of them are really useful especially the airline data, that is very difficult to generate.
For other stuff like e-mail addresses etc. ,I used to use a tool like generatedata.com. Nice and free, but very limited, so about a year ago I decided to write my own generator that can be used directly in your application code.

I have published an updated version (the code is provided in a zip file) on my blog because this version is very versatile and can be used to generate very good and real looking test data. This might be useful for many MySQL developers. This is the url to the article: http://moinne.com/blog/ronald/mysql/howto-generate-meaningful-test-data-using-a-mysql-function

chad ambrosius

11 years ago

it looks like the BP energy use is now available as an excel file at http://www.bp.com/statisticalreview

http://www.data.gov also appears to have a lot of potentially good data sets (census, earthquakes, etc.) so you don’t have to search individual places like the census bureau or national earthquake information center.

Martin Robaey

9 years ago

And would someone have a clue to find public datasets produced in a NoSQL database?

Ashoke

9 years ago

Hi Baron,

Do you know of any data sets that follow normal distribution?

Thanks!

Md Monjur Ul Hasan

7 years ago

I have created a sqlite version of the employee database but cannot push into my fork. I would like to contribute that. How can I do that. The .db file is 73 MB in size.

farid

6 years ago

I hv 24 tables in the schema, and needs to use DML and select operations on these tables for 24 tables to check the db performance testing. Can someone give the idea how can I generate the load on db??

If you needs any details then let me know.

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Sample Datasets for Benchmarking and Testing

Datasets for Benchmarking

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Sets and Sorted Sets

Securing Your MySQL Database: Essential Best Practices

Hello World… Hello Valkey! Let’s Get Started!

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Sample Datasets for Benchmarking and Testing

Datasets for Benchmarking

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Sets and Sorted Sets

Securing Your MySQL Database: Essential Best Practices

Hello World… Hello Valkey! Let’s Get Started!

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation