Continuing my look at Tokyo Tyrant/Cabinet and addressing some of the concerns I have seen people bring up, this is post #2.
#2. As your data grows does Tokyo Cabinet slow down?
Yes, your performance can degrade. One obvious performance decrease with a larger dataset is that you increase the likelihood your data no longer fits into memory, trading cheap memory operations for more expensive disk-based operations. No matter how fast an application is, performance drops off substantially when you read off disk instead of memory. One of the more difficult things to test with Tyrant is disk-bound performance. The FS cache can make Tyrant seem like it will scream even with small amounts of memory; once your data set grows larger than that, people start to claim they hit the performance "wall".
In order to help test this I went ahead and mounted the FS holding my data files with the sync option, which effectively disables the FS cache. This should show the real performance of the hash engine. Here performance dips substantially, as expected:
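For reference, the remount looks something like the following. This is an illustrative sketch, not my exact commands; the device and mount point here are placeholders for your own setup:

```shell
# Remount the filesystem holding the Tokyo Cabinet data files with
# synchronous I/O, so writes bypass the FS cache's write-back behavior.
# /dev/sdb1 and /data are placeholders -- substitute your own device
# and mount point. Requires root.
sudo mount -o remount,sync /dev/sdb1 /data

# Confirm the sync option took effect.
grep ' /data ' /proc/mounts
```

Remounting back with `-o remount,async` restores the default behavior once the test is done.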
Look at the IO rate:
NoSync: 31 MB/s
Sync: 3.2 MB/s
As one would expect, the IO goes crazy when the drive is mounted with the sync option, hitting 99% IO wait. The interesting thing here is that we are actually bottlenecking on writes, not reads. Without the FS cache to buffer the writes, when we need to evict data from memory we have to rely on the internal Tyrant cache, and when that is exhausted we have to really write to disk, not to the FS cache. Now Tyrant starts to take on the same characteristics as your classic DB: the bigger the buffer pool, the faster the performance.
Even here, the performance drop-off once you exhaust memory is relative. The focus should be the drop-off versus other solutions with the same configuration, not the drop-off versus a completely cached version. In this case, ask yourself: given similar datasets and similar memory requirements, what is the performance? Take the above sync test: when I use 256M of memory and run my test with writes going directly to disk, I hit 964 TPS; in previous MySQL tests the same setup (256M BP) netted ~160 TPS. That is roughly a 6x improvement, all things being equal. Of course this is a far drop-off from the 13K I was getting when everything was effectively in the file system cache or in memory, but 6x is still a very solid improvement.
Next up is looking at Tyrant’s and Cabinet’s write bottleneck.
This is strange, because AFAIK most Linux filesystems flush dirty pages
every 5 seconds by default (commit mount param) unless the sync syscall
or mount option is used. So with "NoSync: 31 MB/s" we should expect 5*31 MB plus a small
overhead to be enough to keep all writes in dirty buffers for a whole
pdflush cycle, and then 512MB should be enough to handle the load exactly
as 1024MB would, no? Or… what /proc/sys/vm/dirty_ratio value did you
use for this test?
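For anyone wanting to check the write-back knobs referenced above, they can be read straight out of /proc on a stock Linux box (the default values noted in the comments are my understanding of typical kernel defaults, not something from the original post):

```shell
# Fraction of total memory that may be dirty before a writing process
# is forced to do synchronous write-out itself.
cat /proc/sys/vm/dirty_ratio

# How old (in centisecs) dirty data may get before pdflush writes it
# out; commonly defaults to 3000 (30 seconds).
cat /proc/sys/vm/dirty_expire_centisecs

# How often (in centisecs) the pdflush write-back threads wake up;
# commonly defaults to 500 (5 seconds), matching the comment above.
cat /proc/sys/vm/dirty_writeback_centisecs
```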
Though "I use 256M of memory and run my test with writes going directly to
disk" confuses me a lot; are you saying that the second test ran sync too?
(If so, I don't understand why memory size would impact the results.) Or are you
measuring read performance in this second test? Or mixed r+w, with the writes being sync?
But I guess the question ("As your data grows does Tokyo Cabinet
slow down?") was probably more about read performance than write
performance, wasn't it?
Maybe just testing reads (on different keys) after a cache purge
(like sync && echo 3 > /proc/sys/vm/drop_caches) would show the worst
case for read performance (similar to when only a very small proportion of the data
fits in kernel buffers and everything is retrieved from disk)?
@herodiade,
I think you're correct; the only difference is how the kernel handles mmapped files. I have looked for details but have not found an answer, so I need to put together a test to research this further. I think what's happening here is that all the writes that exceed the allocated Cabinet memory are hitting the FS cache, where the kernel takes care of them in the background. This means very little slowdown in the write process when Cabinet flushes dirty pages to make room for new pages. Mounting the FS as sync means we are writing directly to disk, so each new page read into memory requires a page to be written to disk, which slows things down to a crawl. Adding more memory means there is a greater chance the record I am changing is already in memory.
These tests are based on the benchmark I put together for my first tests. Each transaction does two reads and one write.
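As a rough illustration of that transaction mix (my own sketch of the shape, not the actual benchmark code), each transaction issues two reads and one write against the store. A plain dict stands in for the Tyrant client here:

```python
import random

def run_transactions(store, n_txns, n_keys):
    """Sketch of the benchmark mix: two reads plus one write per
    transaction, against a key space of n_keys random keys. The
    `store` is any dict-like object; a real run would use a Tokyo
    Tyrant client instead of a dict."""
    for _ in range(n_txns):
        k1 = "key%d" % random.randrange(n_keys)
        k2 = "key%d" % random.randrange(n_keys)
        store.get(k1)                              # read 1
        store.get(k2)                              # read 2
        kw = "key%d" % random.randrange(n_keys)
        store[kw] = "value"                        # write

store = {}
run_transactions(store, 1000, 100)
print(len(store))
```

Measuring wall-clock time around `run_transactions` and dividing `n_txns` by it gives the TPS figure being compared above.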
Hello, there is a problem that has troubled me for a long time.
Why is the size of an InnoDB page 16k,
when the page size of the system is 4k?
If the InnoDB page were also 4k, perhaps system IO could be reduced?
As we know, InnoDB writes at least a whole page whenever it writes.
But I think there must be advantages to the 16k setting.
Can you please tell me why, or how to find the answer?
Many thanks.