November 10, 2009

Tokyo Tyrant – The Extras Part I : Is it Durable?

Posted by matt

You know how a DVD comes with extras in addition to the main movie: extra commentary, bloopers, extra scenes, and so on? Well, welcome to the Tyrant extras. My previous blog posts were meant to set up a case for looking at NOSQL tools, not to serve as a decision-making tool. Each solution has pros and cons that will impact how well the technology works for you. Based on some of the comments and questions on the other posts, I thought I would put together a little more detail on some of the deficiencies and strengths of Tokyo Tyrant.

#1.  How durable is Tokyo Tyrant?

Well, I went ahead and built a quick script that just inserted data into a TC table (an id and a timestamp) and did a kill -9 on the server in the middle of it.

Insert:
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355
put error: recv error
159800,1256131127.17364

Here we failed at a time of 1256131127.17355, before the next record was inserted.

After bringing the server up from a crash:

159795,1256131127.1732
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355

All the records are still there. So we are good, right? Looking in the code, Tokyo Cabinet actually uses memory-mapped files. I personally have not used mmapped files, so feel free to correct me if you know better than I do. With mmap, a kill -9 seems to preserve the changes in memory, presumably because the modified pages of a shared file mapping belong to the operating system's page cache and not to the killed process, while powering down the server does not:

163,1257780699.10123
164,1257780699.35172
165,1257780699.60209
166,1257780699.85246

Insert yanking of the power cord here, which gives us post-crash data of:

142,1257780693.84303
143,1257780694.09345

So we basically lost about 5.8 seconds of data: everything after record 143, through record 166.
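The size of that gap can be worked out directly from the two listings. The following is a minimal sketch, assuming ids increase by one per insert; the `Rec`, `Loss`, and `crash_loss` names are hypothetical helpers for illustration and not part of the original test script.

```c
#include <assert.h>

/* One line of the test output: record id and insert timestamp (epoch seconds). */
typedef struct { long id; double ts; } Rec;

/* What a crash cost: number of missing records and the time span they cover. */
typedef struct { long records; double seconds; } Loss;

/* Hypothetical helper: compare the last record the client wrote with the
   last record found after recovery. Assumes ids increase by one per insert. */
Loss crash_loss(Rec last_written, Rec last_recovered){
  assert(last_written.id >= last_recovered.id);
  Loss loss;
  loss.records = last_written.id - last_recovered.id;
  loss.seconds = last_written.ts - last_recovered.ts;
  return loss;
}
```

Fed the last written record (166, 1257780699.85246) and the last recovered record (143, 1257780694.09345), this gives 23 missing records spanning roughly 5.76 seconds.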

Looking at the Tyrant and Cabinet documentation, you will see mention of a SYNC command, which they say does the following:

“The function `tcrdbsync’ is used in order to synchronize updated contents of a remote database object with the file and the device.”

Let’s dig a little deeper into the code and see what’s going on:

/* Synchronize updated contents of a hash database object with the file and the device. */
bool tchdbsync(TCHDB *hdb){
  assert(hdb);
  if(!HDBLOCKMETHOD(hdb, true)) return false;
  if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER) || hdb->tran){
    tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
    HDBUNLOCKMETHOD(hdb);
    return false;
  }
  if(hdb->async && !tchdbflushdrp(hdb)){
    HDBUNLOCKMETHOD(hdb);
    return false;
  }
  bool rv = tchdbmemsync(hdb, true);
  HDBUNLOCKMETHOD(hdb);
  return rv;
}

It first checks whether the file descriptor for the database is less than 0, whether you are not operating as a writer, or whether a transaction is open, in which case it errors. Then it checks whether you are running in async I/O mode. If you are running async, it flushes the records from the delayed record pool. If you are running async and you do not flush your records, then you are at the mercy of Tokyo Cabinet, or your application, to call one of the numerous operations that flush the delayed record pool (i.e. all regular synchronous operations like tchdbput will flush it). I did not test with async; in fact, to the best of my knowledge, it does not look like Tyrant supports async even though Cabinet does. That means the meat of the sync command coming from Tyrant is tchdbmemsync.

/* Synchronize updating contents on memory of a hash database object. */
bool tchdbmemsync(TCHDB *hdb, bool phys){
  assert(hdb);
  if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER)){
    tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
    return false;
  }
  bool err = false;
  char hbuf[HDBHEADSIZ];
  tchdbdumpmeta(hdb, hbuf);
  memcpy(hdb->map, hbuf, HDBOPAQUEOFF);
  if(phys){
    size_t xmsiz = (hdb->xmsiz > hdb->msiz) ? hdb->xmsiz : hdb->msiz;
    if(msync(hdb->map, xmsiz, MS_SYNC) == -1){
      tchdbsetecode(hdb, TCEMMAP, __FILE__, __LINE__, __func__);
      err = true;
    }
    if(fsync(hdb->fd) == -1){
      tchdbsetecode(hdb, TCESYNC, __FILE__, __LINE__, __func__);
      err = true;
    }
  }
  return !err;
}

Here you see the call to msync.  What does msync do?  The man page says:

“The msync() function writes all modified data to permanent storage locations, if any, in those whole pages containing any part of the address space of the process starting at address addr and continuing for len bytes.”

Basically, in the Tokyo Tyrant context msync will flush all the changes made to the memory-mapped file to disk. This msync is crucial, as you cannot guarantee data ever makes it to disk if it is not called (more below).
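To see the msync/fsync pairing outside of Tokyo Cabinet, here is a minimal sketch using only POSIX calls. The `mapped_write_sync` function is a hypothetical helper written for this illustration; it is not part of Tokyo Cabinet and only mirrors the two calls that tchdbmemsync makes when `phys` is true.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not part of Tokyo Cabinet): store len bytes of data
   at the start of path through a shared mapping, then force them to stable
   storage with the same msync + fsync pair that tchdbmemsync uses. */
bool mapped_write_sync(const char *path, const void *data, size_t len){
  assert(path && data && len > 0);
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
  if(fd == -1){ perror("open"); return false; }
  /* The file must be at least as large as the region we map. */
  if(ftruncate(fd, (off_t)len) == -1){ perror("ftruncate"); close(fd); return false; }
  void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if(map == MAP_FAILED){ perror("mmap"); close(fd); return false; }
  /* After this copy the change is only in memory; the kernel may write it
     back later, but nothing guarantees when. */
  memcpy(map, data, len);
  bool ok = true;
  /* MS_SYNC blocks until the modified pages have been written back. */
  if(msync(map, len, MS_SYNC) == -1){ perror("msync"); ok = false; }
  /* fsync also flushes file metadata such as the new size. */
  if(fsync(fd) == -1){ perror("fsync"); ok = false; }
  if(munmap(map, len) == -1) ok = false;
  if(close(fd) == -1) ok = false;
  return ok;
}
```

If the process is killed right after the memcpy, the data would normally still reach the file because the pages belong to the kernel. If the power goes out before the msync, it may not.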

The tchdbmemsync function is the only place I saw calling msync. What calls  tchdbmemsync?

tchdbmemsync is called via:
tchdboptimize
tchdbsync
tchdbtranbegin
tchdbtrancommit
tchdbtranabort
tchdbcloseimpl
tchdbcopyimpl

The commands that will indirectly call an msync are: running the optimize command, calling a sync directly, closing the database, copying the database file, or starting, committing, or aborting a transaction. Note that a transaction in TC is actually a global transaction and locks all write operations (it is used for maintenance). What is missing here is a scheduled call to msync. I looked and traced back the calls from Tyrant into Cabinet and could not find anything that calls it automatically.

The documentation on msync actually says that without calling msync there is no guarantee of the data making it to disk. This implies that it may eventually get written without a direct msync call (for example, when old pages are purged from memory). Testing this theory, I crashed my server several times and found that data written out to disk without calling msync was very flaky indeed. Post crash, I had anywhere from 5 to 60 seconds of missing data.

This means that for durability you really need to call the sync command directly. In my previous post someone pointed out a flaw in this approach, saying that they had seen calling a sync after writes ruin performance. Looking at the code, you can see why calling a sync after each write can severely degrade performance. Before I explain, let’s look at the performance hit:

[Chart: Sync after every call]

Saying there is a performance hit here is an understatement. The reason, however, is really how msync works and how it is used in Tokyo Cabinet. In a sense it is implemented as a global sync, not a record sync, i.e. all changes to the underlying database are flushed at once. So instead of syncing just the record you changed, all of the changed records in the DB are flushed and synced. Performing this operation requires a lock, which blocks other SYNC calls. So if you have 32 threads, you could have 1 sync running and 31 others blocked. This means calling a sync after every write is going to severely degrade performance.
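A back-of-the-envelope model shows why the global lock hurts so much. This is a sketch under stated assumptions, not a measurement: it assumes every sync holds the lock for a fixed time, and the 10 ms figure used in the note below is invented purely for illustration. Both function names are hypothetical.

```c
#include <assert.h>

/* Simplified model: every sync holds one global lock for sync_ms
   milliseconds, and all clients ask for a sync at the same moment,
   so the syncs run one after another. */

/* How long the last client in line waits before its own sync starts. */
long last_in_line_wait_ms(int threads, long sync_ms){
  assert(threads >= 1 && sync_ms >= 0);
  return (long)(threads - 1) * sync_ms;
}

/* Upper bound on synced writes per second across ALL clients. Adding
   threads does not raise it, because only one sync runs at a time. */
long max_synced_writes_per_sec(long sync_ms){
  assert(sync_ms > 0);
  return 1000 / sync_ms;
}
```

With 32 threads and an assumed 10 ms per sync, the last client waits 310 ms, and the whole server tops out at 100 synced writes per second no matter how many threads you add.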

So what can we do to make Cabinet more durable? Well, the best option in my opinion is to steal a trick from InnoDB:

We can easily write a script that calls a background sync every second (i.e. like innodb_flush_log_at_trx_commit = 0/2). I have tested this and I see almost zero impact on my gaming benchmark between when this is running and when it is not.

[Chart: Once-a-second sync]

You can write this script and run it from cron, or ttserver actually provides a way to call functions periodically:

-ext path : specify the script language extension file.
-extpc name period : specify the function name and the calling period of a periodic command.
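If you would rather drive the sync from your own client process, the once-a-second policy boils down to a small timer check. The following is a minimal sketch: `SyncTimer`, `sync_timer_poll`, and `count_sync` are hypothetical names, and the callback is left abstract. With Tokyo Tyrant the callback would presumably wrap a call to tcrdbsync on an open connection, which is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: returns true if the sync succeeded. */
typedef bool (*sync_fn)(void *ctx);

typedef struct {
  double period;     /* seconds between syncs, e.g. 1.0 */
  double last_sync;  /* time of the last successful sync */
  sync_fn fn;        /* what to call when a sync is due */
  void *ctx;         /* opaque argument passed to fn */
} SyncTimer;

/* Call from a background loop with the current time in seconds.
   Runs the sync only when a full period has elapsed since the last one.
   Returns true if a sync was attempted and succeeded. */
bool sync_timer_poll(SyncTimer *t, double now){
  assert(t && t->fn && t->period > 0.0);
  if(now - t->last_sync < t->period) return false;
  if(!t->fn(t->ctx)) return false;  /* leave last_sync alone so we retry */
  t->last_sync = now;
  return true;
}

/* Stand-in sync that only counts calls, for trying the timer without a server. */
static bool count_sync(void *ctx){
  ++*(int *)ctx;
  return true;
}
```

Raising or lowering `period` is the knob for trading recovery window against write throughput.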

Now, while I did not see a drop in my benchmark, heavy write operations will see a drop in performance. For instance, with 8 threads simply updating/inserting data, I saw this:

[Chart: Heavy insert, sync once a second]

Ouch, a 2X hit. But you can configure the frequency of the sync up or down as needed to get the right recovery vs. performance trade-off.


4 Comments »

  1. A minor comment, the “sync”, “nosync” colors used for the charts change from one to the other; this is confusing.

    Comment :: November 10, 2009 @ 1:19 pm

  2. Hi!

    Awesome post and analysis of TC/TT. Mikio wrote his thoughts about this matter on his blog. Thought you’d find it interesting.

    http://1978th.net/tech-en/promenade.cgi?id=6

    Cheers,
    Toru

    Comment :: November 10, 2009 @ 11:24 pm

  3. Nicolas

    Very good post! I stopped testing TT after getting timeouts when calling sync. However, I was calling it every 5 minutes. I will now try the 1-second sync to see if response time keeps < 10 ms

    Comment :: November 14, 2009 @ 2:42 am

  4. Mark

    FYI: TokyoTyrant / LUA doesn’t provide a mechanism to call sync() or any of the other methods required except optimize but that sounds like something that shouldn’t be called each second on a live database…
    Source: http://1978th.net/tokyotyrant/spex.html#luaext

    perl does, but using -ext requires a LUA extension.
    I’m referencing the latest docs. Did I miss something?

    Thanks.

    Comment :: February 21, 2010 @ 1:43 pm

 
