April 18, 2014

Using any general purpose computer as a special purpose SIMD computer

Often, from a computing perspective, one must run a function on a large amount of input. When the same function must be run on many pieces of input, this is a very expensive process unless the work can be done in parallel.

Shard-Query introduces set based processing, which on the surface appears to be similar to other technologies on the market today. However, the scaling features of Shard-Query are just a side effect of the fact that it operates on sets in parallel. Any set can be operated on to any arbitrary degree of parallelism up to, and including, the cardinality of the set.
Given that:

  1. It is often possible to transform one type of expression into a different, but compatible, type for computational purposes, as long as the conversion is bidirectional.
  2. A range operation over a set of integers or dates can be transformed into one or more discrete sub-ranges.
  3. Any operation on an entire set is equivalent to running that operation on each item individually.
  4. After expanding a set operation into N discrete dimensions, it can always be collapsed back into a one dimensional set.
  5. Arrays are sets.


Treating a general purpose computer as a SIMD computer is possible because, in a set, you can perform operations on all of the items independently. The SIMD processor simply needs to wait for all of the parallel operations on its input to complete. The parallelism is embarrassing, and the maximum degree of parallelism is easily enforced with a queue.
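The idea can be sketched in a few lines of Python (this is a conceptual sketch, not Shard-Query code; the function name `simd_apply` is made up): a worker pool plays the role of the SIMD processor, the pool's size bounds the degree of parallelism, and the caller blocks until every item's result has arrived.

```python
from concurrent.futures import ThreadPoolExecutor

def simd_apply(fn, items, degree=4):
    """Apply fn to every item of a set independently, with at most
    `degree` operations in flight, then wait for all to complete."""
    with ThreadPoolExecutor(max_workers=degree) as pool:
        # map() dispatches the items to the workers and blocks until all
        # work is done, like a SIMD processor waiting on its parallel units.
        return list(pool.map(fn, items))

print(simd_apply(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```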

Today I am going to show you how to take almost any function and treat any size cluster of Turing-complete computers as a special purpose SIMD computer with respect to that function. The SQL interface to Shard-Query waits for all of the workers to complete, but you can register a callback function to handle the output of each input asynchronously, if you like.

Right now I believe this only works on finite sets. I’ve decided to show how to count the number of unique words in a document, and how many times each of those words appears. Set based processing, of course, works on sets, and a document is a set of words.

Before you read further, I want to tell you why I’ve decided to use the words of the Constitution of the United States of America as an example. It is my favourite document in the world. It speaks of honesty, and integrity, truth and openness. I believe in all of these things. I believe, that with cheap computation, our world can become an amazing place. Please use this technology constructively and for peaceful purposes. Love one another and let’s solve all the complex problems in the world together in peace and harmony.

The following is a somewhat naive example, since grammar will not be taken into account. I start by splitting our document into a list of words:
cat /tmp/constitution.txt | sed -e 's/ /\n/g' > /tmp/words

I am going to perform the following operations on every word in the constitution:
1) compute the md5 of every item
2) compute the md5 on the reverse of every item
3) count the total number of words
4) count the frequency of words
5) order by the frequency of words, then by the md5 of the word, then by the md5 of the reverse of the word.
6) determine the number of unique words. This is not projected, but you can infer it from the number of items in the output set.
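In Python terms, the six operations amount to the following (a sketch of the computation itself, not the Shard-Query SQL; `word_stats` is a made-up name):

```python
import hashlib
from collections import Counter

def md5(s):
    return hashlib.md5(s.encode()).hexdigest()

def word_stats(words):
    freq = Counter(words)                 # operations 3 and 4: counts
    total = sum(freq.values())            # total number of words
    rows = [(w, md5(w), md5(w[::-1]), n)  # operations 1 and 2: md5s
            for w, n in freq.items()]
    # operation 5: order by frequency, then md5(word), then md5(reversed word)
    rows.sort(key=lambda r: (r[3], r[1], r[2]))
    return total, rows                    # len(rows) = unique words (op 6)

total, rows = word_stats("we the people of the united states".split())
print(total, len(rows))  # 7 2
```

(Here `7` is the total word count; the unique-word count is `len(rows)`, which is inferred rather than projected, just as in the SQL version.)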

The US Constitution is not very large. I inflated the document size significantly to over 3 million “words” by duplicating the entire set multiple times.
mysql> load data infile '/tmp/words' into table words (chars);
Query OK, 6033 rows affected (0.01 sec)
Records: 6033 Deleted: 0 Skipped: 0 Warnings: 0

I blow up the size of the words table to create words2. This is the data upon which we will operate.
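The inflation step itself is not shown, but the arithmetic is easy to verify: doubling the 6,033-row set nine times (presumably via repeated INSERT … SELECT statements) yields exactly the 3,088,896 items mentioned later. A Python sketch of the arithmetic, with placeholder contents:

```python
# Doubling a 6,033-item set nine times: 6033 * 2**9 = 3,088,896 "words".
words = ["placeholder"] * 6033
for _ in range(9):
    words = words + words  # each pass duplicates the entire set

print(len(words))  # 3088896
```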

Here is the serial version as run by the native database interface (MySQL):

1427 rows in set (5.00 sec)
1427 rows in set (5.03 sec)
1427 rows in set (5.00 sec)

Since the data fits in memory, speed is nearly constant, and the single threaded operation burns one CPU.

To help demonstrate how Shard-Query makes parallel set operations work, I’ll operate in only one dimension for the first example, just like the MySQL client. This will be a linear operation, because Shard-Query has no idea how to add parallelism in this case: it is data set agnostic, operating only on sets, not relations. If it were smarter, it would ask the data dictionary about partitioning.

The set of three numbers are wall clock time (as calculated by microtime()), SQL execution time, and parse time, respectively.

Actually, performance is a little worse. This is not unexpected: since Shard-Query must add a small amount of overhead, a single threaded operation may be slower than the same operation run directly on the native database.

That doesn’t matter, because Shard-Query is a smart database proxy that can add parallelism. In this mode it will add six additional degrees of parallelism to the query:

Why six degrees of parallelism? Because that is how many physical cores are connected to my bus, and because I chose to create six hash “buckets” in the table. This allows MySQL to set up a sequential scan over the items in each bucket, particularly since we are examining all of the items. We operate on all the buckets, and then use intelligent expression substitution to put the results back together when necessary. When sorting or grouping is used, a final pass over the combined result may be necessary, and this may add a small amount of serialization at the end.
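The bucketing idea can be sketched in Python (hypothetical, not Shard-Query's implementation; `bucket_of` and `parallel_count` are made-up names): assign each word to one of six buckets with a stable hash, count each bucket independently in parallel, then merge the partial counts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BUCKETS = 6

def bucket_of(word):
    # a stable hash; on the MySQL side this would be the hash "bucket" column
    return sum(word.encode()) % BUCKETS

def parallel_count(words):
    buckets = [[] for _ in range(BUCKETS)]
    for w in words:
        buckets[bucket_of(w)].append(w)
    with ThreadPoolExecutor(max_workers=BUCKETS) as pool:
        partials = list(pool.map(Counter, buckets))  # six parallel scans
    merged = Counter()
    for p in partials:  # the final "put the results back together" pass
        merged.update(p)
    return merged

counts = parallel_count("to be or not to be".split())
print(counts["to"], counts["be"])  # 2 2
```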

How does this work?

Here is the most important part of the explain plan in the mode without parallelism. Notice that there is only one query. If your database system can not provide native parallelism, then performance will be poor.

The other important optimization combines the results from multiple queries together. For this single threaded query it serves no purpose, but it will be much more important in a moment.

Now, consider the query with BETWEEN 1 AND 6 added to the WHERE clause. This creates boundary conditions for our query. Any range of integers can be broken up into as many sub-ranges as it contains items, and thus it is possible to convert the BETWEEN expression into a partition elimination expression.
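As a sketch of the rewrite (Python, with a hypothetical helper name): a BETWEEN over integers expands into one discrete predicate per value, and each predicate can be sent to a different worker.

```python
def expand_between(column, lo, hi):
    """Rewrite `column BETWEEN lo AND hi` into one equality predicate
    per integer in the range; each predicate targets a single bucket
    and can be executed by a separate worker (partition elimination)."""
    return ["%s = %d" % (column, v) for v in range(lo, hi + 1)]

# six single-bucket predicates, one per degree of parallelism
print(expand_between("bucket", 1, 6))
```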

Here is the output from the parallel version:

This powers the UPSERT: the results from the six branches are combined with ON DUPLICATE KEY UPDATE.

All six branches compute fully in parallel.

Finally, I would like to note that this set has a cardinality of 3,088,896. This is the maximum theoretical degree of parallelism that this data set can achieve with my method. Current network technologies likely cannot support such a degree.

How to factor numbers

create table dataset (
id bigint auto_increment primary key,
prime_number bigint
);

Spread this over as many machines as necessary using the method above, assuming there are any number of candidate numbers which you want to factor, and you want to use 1024-way parallelism.
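A sketch of the factoring fan-out in Python (hypothetical function names, not Shard-Query code): split the candidate divisors of each number into up to 1024 contiguous sub-ranges, let each worker trial-divide one sub-range independently, and combine with a MIN, exactly as a set operation.

```python
import math
from concurrent.futures import ThreadPoolExecutor

DEGREE = 1024  # desired degree of parallelism

def sub_ranges(lo, hi, pieces):
    """Split [lo, hi] into at most `pieces` contiguous sub-ranges."""
    step = max(1, (hi - lo + 1) // pieces)
    out, a = [], lo
    while a <= hi:
        out.append((a, min(a + step - 1, hi)))
        a += step
    return out

def smallest_divisor_in(args):
    """One worker's share: trial-divide n over one sub-range of divisors."""
    n, lo, hi = args
    for d in range(lo, hi + 1):
        if n % d == 0:
            return d
    return 0  # no divisor found in this sub-range

def factor(n):
    work = [(n, a, b) for a, b in sub_ranges(2, math.isqrt(n), DEGREE)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        hits = [d for d in pool.map(smallest_divisor_in, work) if d]
    return min(hits) if hits else 0  # 0 means n is prime

print(factor(91))  # 7
```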

This allows you to use data flow semantics on any cluster with respect to almost any generic computational function.

About Justin Swanhart

Comments

  1. Isioma Nnodum says:

    I am so excited to try this out.

  2. Michael Moses Ali says:

    I am convinced by and understand what Justin Swanhart says, and I believe I can also do this in the coming season.

  3. Justin Swanhart says:

    Here is most of the math for the rewrites:
    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7788

    You can rewrite non-distributable aggregate functions using GROUP BY rewriting:
    select a expr1, count(distinct b) expr2, count(*) expr3, max(a) expr4, min(b) expr5
    from some_table
    group by a;

    This query is sent to each processor:
    select a expr1, b expr2, sum(1) expr3, max(a) expr4, min(b) expr5
    from some_table
    group by a, b;

    The results are merged together with ON DUPLICATE KEY UPDATE in the insert:
    INSERT INTO #tmp values (…),(…)
    ON DUPLICATE KEY UPDATE
    expr3 = expr3 + VALUES(expr3),
    expr4 = IF(VALUES(expr4) > expr4, VALUES(expr4), expr4),
    expr5 = IF(VALUES(expr5) < expr5, VALUES(expr5), expr5);

    Then finally, after all the SIMD processes finish:
    select expr1, count(distinct expr2) expr2, sum(expr3) expr3, max(expr4), min(expr5)
    from #tmp
    group by expr1;

    A HAVING clause would be applied as a WHERE clause to the final SQL.

  4. Justin Swanhart says:

    Add a where clause and LIMIT 1

  5. Justin Swanhart says:

    Also, the factorization example can use a callback, and you can remove the MIN. Then, as soon as one computation returns a non-zero result, you can stop.
