InfiniDB MySQL server with Hadoop cluster for data analytics

In my previous post about Hadoop and Impala I benchmarked performance of analytical queries in Impala.

This time I’ve tried InfiniDB for Hadoop (open-source version) on the modern hardware with an 8-node Hadoop cluster. One of the main advantages (at least for me) of InifiniDB for Hadoop is that it stores the data inside the Hadoop cluster but uses the MySQL server to execute queries. This allows for an easy “migration” of existing analytical tools. The results are quite interesting and promising.

Quick How-To

The InfiniDB documentation is not very clear on step-by-step instructions so I’ve created this quick guide:

Install Hadoop cluster (minimum install will work). I’ve used Cloudera Manager (CDH5) to compare the speed of InfiniDB to Cloudera Impala. Install the tools in the “Pre-requirements” sections of InfiniDB for Hadoop Manual
Install the InfiniDB for Hadoop binaries on 1 Hadoop node (you can choose any node). This will install InfiniDB and its version of MySQL (based on MySQL 5.1).
After installation it will tell you the variables to set and run the postConfigure script. Example:
Shell
export JAVA_HOME=/usr/java/jdk1.6.0_31 export LD_LIBRARY_PATH=/usr/java/jdk1.6.0_31/jre/lib/amd64/server . /root/setenv-hdfs-20 /usr/local/Calpont/bin/postConfigure
1
2
3
4
export JAVA_HOME=/usr/java/jdk1.6.0_31
export LD_LIBRARY_PATH=/usr/java/jdk1.6.0_31/jre/lib/amd64/server
. /root/setenv-hdfs-20
/usr/local/Calpont/bin/postConfigure
The postConfigure script will ask the questions. Couple imfortant notes:

Make sure to use HDFS as a “type of Data Storage”.
The performance module 1 (pm1) should point to the host (hostname and IP) you are running the postConfigure script on. Other pm(s) should point to other Hadoop nodes

When installation is finished you will be able to login into MySQL server, it uses script called ibdmysql which will call mysql cli with the correct socket and port. Check that the infiniDB is enabled by running “show engines”, InfiniDB should be in the list.

The next step will be importing data.

Data import

First we will need to create a MySQL table with “engine=InfiniDB”:

CREATE TABLE `ontime` (
  `YearD` int(11) NOT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `MonthD` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
...
) ENGINE=InfiniDB DEFAULT CHARSET=latin1

CREATE TABLE `ontime` (

`YearD` int(11) NOT NULL,

`Quarter` tinyint(4) DEFAULT NULL,

`MonthD` tinyint(4) DEFAULT NULL,

`DayofMonth` tinyint(4) DEFAULT NULL,

`DayOfWeek` tinyint(4) DEFAULT NULL,

`FlightDate` date DEFAULT NULL,

...

) ENGINE=InfiniDB DEFAULT CHARSET=latin1

Second, I’ve used the cpimport to load the data. It turned out it is much more efficient and easier to load 1 big file rather than 20×12 smaller files (original “ontime” data is 1 file per month), so I’ve exported the “Ontime” data from MySQL table and created 1 big file “ontime.psv”.

I used the following command to export data into InfiniDB:

[root@n0 ontime]# /usr/local/Calpont/bin/cpimport -s '|' ontime ontime ontime.psv
2014-05-20 15:12:58 (18787) INFO : Running distributed import (mode 1) on all PMs...
2014-05-20 15:25:28 (18787) INFO : For table ontime.ontime: 155083620 rows processed and 155083620 rows inserted.
2014-05-20 15:25:28 (18787) INFO : Bulk load completed, total run time : 751.561 seconds

[root@n0 ontime]# /usr/local/Calpont/bin/cpimport -s '|' ontime ontime ontime.psv

2014-05-20 15:12:58 (18787) INFO : Running distributed import (mode 1) on all PMs...

2014-05-20 15:25:28 (18787) INFO : For table ontime.ontime: 155083620 rows processed and 155083620 rows inserted.

2014-05-20 15:25:28 (18787) INFO : Bulk load completed, total run time : 751.561 seconds

The data is stored in Hadoop:

[root@n0 ontime]# hdfs dfs -du -h /usr/local/Calpont
1.4 G /usr/local/Calpont/data1
1.4 G /usr/local/Calpont/data2
1.4 G /usr/local/Calpont/data3
1.4 G /usr/local/Calpont/data4
1.4 G /usr/local/Calpont/data5
1.4 G /usr/local/Calpont/data6
1.4 G /usr/local/Calpont/data7
1.4 G /usr/local/Calpont/data8

[root@n0 ontime]# hdfs dfs -du -h /usr/local/Calpont

1.4 G /usr/local/Calpont/data1

1.4 G /usr/local/Calpont/data2

1.4 G /usr/local/Calpont/data3

1.4 G /usr/local/Calpont/data4

1.4 G /usr/local/Calpont/data5

1.4 G /usr/local/Calpont/data6

1.4 G /usr/local/Calpont/data7

1.4 G /usr/local/Calpont/data8

The total size of the data is 8×1.4G = 11.2G (compressed). To compare the size of the same dataset in Impala Parquet format is 3.6G. Original size was ~60G.

[root@n0 ontime]# hdfs dfs -du -h /user/hive/warehouse
3.6 G /user/hive/warehouse/ontime_parquet_snappy

1 2	[root@n0 ontime]# hdfs dfs -du -h /user/hive/warehouse 3.6 G /user/hive/warehouse/ontime_parquet_snappy

Now we can run the 2 queries I’ve tested before:

1. Simple group-by

mysql> select yeard, count(*) from ontime group by yeard order by yeard;
+-------+----------+
| yeard | count(*) |
+-------+----------+
|  1988 |  5202096 |
|  1989 |  5041200 |
|  1990 |  5270893 |
|  1991 |  5076925 |
|  1992 |  5092157 |
|  1993 |  5070501 |
|  1994 |  5180048 |
|  1995 |  5327435 |
|  1996 |  5351983 |
|  1997 |  5411843 |
|  1998 |  5384721 |
|  1999 |  5527884 |
|  2000 |  5683047 |
|  2001 |  5967780 |
|  2002 |  5271359 |
|  2003 |  6488540 |
|  2004 |  7129270 |
|  2005 |  7140596 |
|  2006 |  7141922 |
|  2007 |  7455458 |
|  2008 |  7009726 |
|  2009 |  6450285 |
|  2010 |  6450117 |
|  2011 |  6085281 |
|  2012 |  6096762 |
|  2013 |  6369482 |
|  2014 |  1406309 |
+-------+----------+
27 rows in set (0.22 sec)

mysql> select yeard, count(*) from ontime group by yeard order by yeard;

+-------+----------+

| yeard | count(*) |

+-------+----------+

| 1988 | 5202096 |

| 1989 | 5041200 |

| 1990 | 5270893 |

| 1991 | 5076925 |

| 1992 | 5092157 |

| 1993 | 5070501 |

| 1994 | 5180048 |

| 1995 | 5327435 |

| 1996 | 5351983 |

| 1997 | 5411843 |

| 1998 | 5384721 |

| 1999 | 5527884 |

| 2000 | 5683047 |

| 2001 | 5967780 |

| 2002 | 5271359 |

| 2003 | 6488540 |

| 2004 | 7129270 |

| 2005 | 7140596 |

| 2006 | 7141922 |

| 2007 | 7455458 |

| 2008 | 7009726 |

| 2009 | 6450285 |

| 2010 | 6450117 |

| 2011 | 6085281 |

| 2012 | 6096762 |

| 2013 | 6369482 |

| 2014 | 1406309 |

+-------+----------+

27 rows in set (0.22 sec)

2. The complex query from my original post:

mysql> select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC, cnt desc LIMIT  1000;
+------------+------------+---------+----------+-----------------+------+
| min(yeard) | max(yeard) | Carrier | cnt      | flights_delayed | rate |
+------------+------------+---------+----------+-----------------+------+
|       2003 |       2009 | EV      |  1454777 |          237698 | 0.16 |
|       2003 |       2009 | FL      |  1082489 |          158748 | 0.15 |
|       2006 |       2009 | YV      |   740608 |          110389 | 0.15 |
|       2006 |       2009 | XE      |  1016010 |          152431 | 0.15 |
|       2003 |       2009 | B6      |   683874 |          103677 | 0.15 |
|       2001 |       2009 | MQ      |  3238137 |          448037 | 0.14 |
|       2003 |       2005 | DH      |   501056 |           69833 | 0.14 |
|       2004 |       2009 | OH      |  1195868 |          160071 | 0.13 |
|       2003 |       2006 | RU      |  1007248 |          126733 | 0.13 |
|       1988 |       2009 | UA      |  9593284 |         1197053 | 0.12 |
|       2003 |       2006 | TZ      |   136735 |           16496 | 0.12 |
|       1988 |       2001 | TW      |  2656286 |          280283 | 0.11 |
|       1988 |       2009 | AA      | 10568437 |         1183786 | 0.11 |
|       1988 |       2009 | CO      |  6023831 |          673354 | 0.11 |
|       1988 |       2009 | DL      | 11866515 |         1156048 | 0.10 |
|       2003 |       2009 | OO      |  2654259 |          257069 | 0.10 |
|       1988 |       2009 | AS      |  1506003 |          146920 | 0.10 |
|       2007 |       2009 | 9E      |   577244 |           59440 | 0.10 |
|       1988 |       2009 | US      | 10276862 |          990995 | 0.10 |
|       1988 |       2009 | NW      |  7601727 |          725460 | 0.10 |
|       1988 |       2005 | HP      |  2607603 |          235675 | 0.09 |
|       1988 |       2009 | WN      | 12722174 |         1107840 | 0.09 |
|       2005 |       2009 | F9      |   307569 |           28679 | 0.09 |
|       1988 |       1991 | PA      |   203401 |           19263 | 0.09 |
+------------+------------+---------+----------+-----------------+------+
24 rows in set (0.86 sec)

mysql> select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC, cnt desc LIMIT 1000;

+------------+------------+---------+----------+-----------------+------+

+------------+------------+---------+----------+-----------------+------+

| 2003 | 2009 | EV | 1454777 | 237698 | 0.16 |

| 2003 | 2009 | FL | 1082489 | 158748 | 0.15 |

| 2006 | 2009 | YV | 740608 | 110389 | 0.15 |

| 2006 | 2009 | XE | 1016010 | 152431 | 0.15 |

| 2003 | 2009 | B6 | 683874 | 103677 | 0.15 |

| 2001 | 2009 | MQ | 3238137 | 448037 | 0.14 |

| 2003 | 2005 | DH | 501056 | 69833 | 0.14 |

| 2004 | 2009 | OH | 1195868 | 160071 | 0.13 |

| 2003 | 2006 | RU | 1007248 | 126733 | 0.13 |

| 1988 | 2009 | UA | 9593284 | 1197053 | 0.12 |

| 2003 | 2006 | TZ | 136735 | 16496 | 0.12 |

| 1988 | 2001 | TW | 2656286 | 280283 | 0.11 |

| 1988 | 2009 | AA | 10568437 | 1183786 | 0.11 |

| 1988 | 2009 | CO | 6023831 | 673354 | 0.11 |

| 1988 | 2009 | DL | 11866515 | 1156048 | 0.10 |

| 2003 | 2009 | OO | 2654259 | 257069 | 0.10 |

| 1988 | 2009 | AS | 1506003 | 146920 | 0.10 |

| 2007 | 2009 | 9E | 577244 | 59440 | 0.10 |

| 1988 | 2009 | US | 10276862 | 990995 | 0.10 |

| 1988 | 2009 | NW | 7601727 | 725460 | 0.10 |

| 1988 | 2005 | HP | 2607603 | 235675 | 0.09 |

| 1988 | 2009 | WN | 12722174 | 1107840 | 0.09 |

| 2005 | 2009 | F9 | 307569 | 28679 | 0.09 |

| 1988 | 1991 | PA | 203401 | 19263 | 0.09 |

+------------+------------+---------+----------+-----------------+------+

24 rows in set (0.86 sec)

The same query in impala (on the same hardware) runs for 7.18 seconds:

  [n8.local:21000] > select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_parquet_snappy WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC LIMIT  1000;
  
Query: select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_parquet_snappy WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC LIMIT  1000
  +------------+------------+---------+----------+-----------------+------+
  | min(yeard) | max(yeard) | carrier | cnt      | flights_delayed | rate |
  +------------+------------+---------+----------+-----------------+------+
  | 2003       | 2009       | EV      | 1454777  | 237698          | 0.16 |
  | 2003       | 2009       | FL      | 1082489  | 158748          | 0.15 |
  | 2006       | 2009       | XE      | 1016010  | 152431          | 0.15 |
  | 2006       | 2009       | YV      | 740608   | 110389          | 0.15 |
  | 2003       | 2009       | B6      | 683874   | 103677          | 0.15 |
  | 2001       | 2009       | MQ      | 3238137  | 448037          | 0.14 |
  | 2003       | 2005       | DH      | 501056   | 69833           | 0.14 |
  | 2004       | 2009       | OH      | 1195868  | 160071          | 0.13 |
  | 2003       | 2006       | RU      | 1007248  | 126733          | 0.13 |
  | 1988       | 2009       | UA      | 9593284  | 1197053         | 0.12 |
  | 2003       | 2006       | TZ      | 136735   | 16496           | 0.12 |
  | 1988       | 2001       | TW      | 2656286  | 280283          | 0.11 |
  | 1988       | 2009       | CO      | 6023831  | 673354          | 0.11 |
  | 1988       | 2009       | AA      | 10568437 | 1183786         | 0.11 |
  | 1988       | 2009       | US      | 10276862 | 990995          | 0.10 |
  | 2007       | 2009       | 9E      | 577244   | 59440           | 0.10 |
  | 1988       | 2009       | DL      | 11866515 | 1156048         | 0.10 |
  | 2003       | 2009       | OO      | 2654259  | 257069          | 0.10 |
  | 1988       | 2009       | NW      | 7601727  | 725460          | 0.10 |
  | 1988       | 2009       | AS      | 1506003  | 146920          | 0.10 |
  | 1988       | 1991       | PA      | 203401   | 19263           | 0.09 |
  | 1988       | 2009       | WN      | 12722174 | 1107840         | 0.09 |
  | 1988       | 2005       | HP      | 2607603  | 235675          | 0.09 |
  | 2005       | 2009       | F9      | 307569   | 28679           | 0.09 |
  +------------+------------+---------+----------+-----------------+------+
  Returned 24 row(s) in 7.18s

[n8.local:21000] > select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_parquet_snappy WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC LIMIT 1000;

Query: select min(yeard), max(yeard), Carrier, count(*) as cnt, sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed, round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate FROM ontime_parquet_snappy WHERE DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI') and DestState not in ('AK', 'HI', 'PR', 'VI') and flightdate < '2010-01-01' GROUP by carrier HAVING cnt > 100000 and max(yeard) > 1990 ORDER by rate DESC LIMIT 1000

+------------+------------+---------+----------+-----------------+------+

+------------+------------+---------+----------+-----------------+------+

| 2003 | 2009 | EV | 1454777 | 237698 | 0.16 |

| 2003 | 2009 | FL | 1082489 | 158748 | 0.15 |

| 2006 | 2009 | XE | 1016010 | 152431 | 0.15 |

| 2006 | 2009 | YV | 740608 | 110389 | 0.15 |

| 2003 | 2009 | B6 | 683874 | 103677 | 0.15 |

| 2001 | 2009 | MQ | 3238137 | 448037 | 0.14 |

| 2003 | 2005 | DH | 501056 | 69833 | 0.14 |

| 2004 | 2009 | OH | 1195868 | 160071 | 0.13 |

| 2003 | 2006 | RU | 1007248 | 126733 | 0.13 |

| 1988 | 2009 | UA | 9593284 | 1197053 | 0.12 |

| 2003 | 2006 | TZ | 136735 | 16496 | 0.12 |

| 1988 | 2001 | TW | 2656286 | 280283 | 0.11 |

| 1988 | 2009 | CO | 6023831 | 673354 | 0.11 |

| 1988 | 2009 | AA | 10568437 | 1183786 | 0.11 |

| 1988 | 2009 | US | 10276862 | 990995 | 0.10 |

| 2007 | 2009 | 9E | 577244 | 59440 | 0.10 |

| 1988 | 2009 | DL | 11866515 | 1156048 | 0.10 |

| 2003 | 2009 | OO | 2654259 | 257069 | 0.10 |

| 1988 | 2009 | NW | 7601727 | 725460 | 0.10 |

| 1988 | 2009 | AS | 1506003 | 146920 | 0.10 |

| 1988 | 1991 | PA | 203401 | 19263 | 0.09 |

| 1988 | 2009 | WN | 12722174 | 1107840 | 0.09 |

| 1988 | 2005 | HP | 2607603 | 235675 | 0.09 |

| 2005 | 2009 | F9 | 307569 | 28679 | 0.09 |

+------------+------------+---------+----------+-----------------+------+

Returned 24 row(s) in 7.18s

Conclusion and charts

To summaries I’ve created the following charts:

Simple query:

As we can see InfiniDB looks pretty good here. It also uses MySQL protocol, so existing application which uses MySQL will be able to work here without any additional “connectors”.

One note regarding my query example: the “complex” query is designed in a way that will make it hard to use any particular set of index; this query will have to scan the >70% of the table to generate the resultset. That is why it is so slow in MySQL compared to columnar databases. Another “issue” is that the table is very wide and most of the columns are declared as varchar (table is not normalized), which makes it large in MySQL. All this will make it ideal for columnar storage and compression. Other cases may not show that huge of a difference.

So far I was testing with small data (60G), I will plan to run big data benchmark next.

4 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Dipti Joshi

9 years ago

Thank you Alex for sharing your experience with InfiniDB! Let us know as you test on larger data set.

Ebot Tabi

9 years ago

Hey nice piece, thanks for sharing your experience with Infinidb.

Vadim Tkachenko

Admin

9 years ago

Alexander,

Now it would be interesting to compare with Spark http://spark.apache.org/

Krishan

7 years ago

Dear Alex, Do you still consider InfiniDB-Hadoop to be a superior option for SQL-on-Hadoop for BI analytics?

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Using InfiniDB MySQL server with Hadoop cluster for data analytics

Related

Related Blog Articles

RECOMMENDED ARTICLES

Securing Your MySQL Database: Essential Best Practices

MySQL 8.4 First Peek

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Using InfiniDB MySQL server with Hadoop cluster for data analytics

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Securing Your MySQL Database: Essential Best Practices

MySQL 8.4 First Peek

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation