July 31, 2014

Using Apache Hadoop and Impala together with MySQL for data analysis

Apache Hadoop is commonly used for data analysis. It is scalable and fast for data loads. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from MySQL to Hadoop, load the data into Cloudera Impala (columnar format) and run a reporting […]

Quickly finding unused indexes (and estimating their size)

I had a customer recently who needed to reduce their database size on disk quickly without a lot of messy schema redesign and application recoding. They didn’t want to drop any actual data, and their index usage was fairly high, so we decided to look for unused indexes that could be removed. Collecting data: It’s […]

How (not) to find unused indexes

I’ve seen a few people link to an INFORMATION_SCHEMA query that finds indexes with low cardinality, in an effort to work out which indexes should be removed. This method is flawed, and here’s the first reason why:

The cardinality of the status index is woeful, but provided that the application […]

How rows_sent can be more than rows_examined?

When looking at queries that are candidates for optimization, I often recommend that people look at the rows_sent and rows_examined values available in the slow query log (as well as some other places). If rows_examined is far larger than rows_sent, say 100 times larger, then the query is a great candidate for optimization. Optimization could […]

MySQL 5.6 vs MySQL 5.5 and the Star Schema Benchmark

So far most of the benchmarks posted about MySQL 5.6 use the sysbench OLTP workload. I wanted to test a set of queries which, unlike sysbench, utilize joins. I also wanted an easily reproducible set of data which is richer than the simple sysbench table. The Star Schema Benchmark (SSB) seems ideal for this. […]

Fun with the MySQL pager command

Last time I wrote about a few tips that can make you more efficient when using the command line on Unix. Today I want to focus more on pager. The most common usage of pager is to set it to a Unix pager such as less. It can be very useful to view the result […]

A case for MariaDB’s Hash Joins

MariaDB 5.3/5.5 introduced a new join type, “Hash Joins”, which is an implementation of the classic block-based hash join algorithm. In this post we will see what the Hash Join is, how it works, and for what types of queries it would be the right choice. I will show the results of executing benchmarks […]

Shard-Query EC2 images available

Infobright and InnoDB AMI images are now available. There are now demonstration AMI images for Shard-Query. Each image comes pre-loaded with the data used in the previous Shard-Query blog post. The data in each image is split into 20 “shards”. This blog post will refer to an EC2 instance as a node from here […]

Advanced index analysis with mk-index-usage

The new release of Maatkit has a useful feature in mk-index-usage to help you determine how indexes are used in more flexible ways. The default report just prints out ALTER statements for removing unused indexes, which is nice, but it’s often helpful to ask more sophisticated questions about index usage. I’ll use this blog’s queries […]

Getting around optimizer limitations with an IN() list

There was a discussion on LinkedIn one month ago that caught my eye: Database search by “within x number of miles” radius? Anyone out there created a zipcode database and created a “search within x numer of miles” function ? Thankful for any tips you can throw my way.. A few people commented that […]