This week I worked with a customer during a maintenance window on a task that involved a lot of data copying between MySQL boxes. We had prepared well: we had measured how fast we could copy data between servers of this kind connected to the same network, and we had done this sort of work before. Using a simple tar+netcat based copy you can get 80-90MB/sec on 1GigE, assuming the RAID is powerful enough. This holds for large InnoDB tables with a not overly fragmented tablespace; otherwise it is easy to become IO bound rather than network bound.
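
For reference, here is a minimal sketch of the kind of tar+netcat pipeline I mean. The port, host name, and datadir path are placeholders, and you would normally run this only with MySQL shut down or the tables otherwise quiesced:

    # On the target server: listen and unpack the incoming stream into the
    # MySQL datadir (flag syntax differs slightly between netcat variants).
    nc -l -p 7777 | tar -x -C /var/lib/mysql

    # On the source server: stream the datadir across the wire
    # ("target-host" is a placeholder).
    tar -c -C /var/lib/mysql . | nc target-host 7777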

As I have mentioned before, you can do even better by adding fast compression such as LZO or QuickLZ to the pipe, but there was no need for that in this case.
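
As a sketch of what that looks like: lzop is one LZO-based tool that fits into such a pipeline (pigz would be a parallel, gzip-based alternative); paths and host names are again placeholders:

    # Source side: compress the stream with lzop (LZO) before sending.
    tar -c -C /var/lib/mysql . | lzop | nc target-host 7777

    # Target side: decompress before unpacking.
    nc -l -p 7777 | lzop -d | tar -x -C /var/lib/mysql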

So the estimates were great, but once we started the real copy we saw a speed of about 20MB/sec instead of the projected 80MB/sec. IO and CPU usage on both the source and target servers were low, so it had to be the network, even though there was no other traffic between these two servers.
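
A rough sketch of how one can narrow this kind of thing down, assuming pv, sysstat, and iperf are available (host names are placeholders):

    # Watch how fast data is actually flowing through the pipe:
    tar -c -C /var/lib/mysql . | pv | nc target-host 7777

    # Keep an eye on disk and CPU while the copy runs:
    iostat -x 5
    vmstat 5

    # Measure raw network throughput between the boxes
    # (start "iperf -s" on the target first):
    iperf -c target-host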

The mystery was easily resolved by looking at the network topology: some of the database servers were connected to Switch A and others to Switch B, with only a single 1Gbit link between the two switches.

During the maintenance window, multiple tasks involving different servers were running at once, and this inter-switch link became the bottleneck.
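
To put rough, purely illustrative numbers on it: a 1Gbit link carries on the order of 110-120MB/sec of payload, so if, say, four copy streams cross the inter-switch link at the same time, each one gets only about 25-30MB/sec, which is in the same ballpark as the 20MB/sec we were seeing.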

What does this tell us? Even if you are the DBA, you had better understand the network topology so you know what kind of performance, availability, and failure scenarios to expect. If the network is too complicated to understand fully, it is at least worth knowing the numbers.
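
"Knowing the numbers" can be as simple as measuring point-to-point bandwidth between the hosts you care about. A quick-and-dirty sketch, assuming iperf is installed and "iperf -s" is already running on each host (the host names are placeholders):

    for host in db1 db2 db3; do
        echo "== $host =="
        iperf -c "$host" -t 10 | tail -n 1
    done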

This does not apply only to the network but to any shared resource. For example, what if there is a catastrophic event and you need to restore all 50 servers from backup… in parallel? Will your backup system be able to restore them all efficiently at the same time? Will there be enough network bandwidth to pipe them through? These and similar questions are what you should be asking yourself.
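
A back-of-the-envelope calculation helps here. Every number below is hypothetical; the shape of the math is what matters:

    servers=50
    backup_size_gb=200        # assumed per-server backup size
    link_gbit=10              # assumed bandwidth of the backup network uplink
    link_mb_sec=$(( link_gbit * 110 ))               # ~110MB/sec of payload per Gbit
    total_mb=$(( servers * backup_size_gb * 1024 ))
    echo "Best-case restore time: $(( total_mb / link_mb_sec / 3600 )) hours"

With these numbers the wire alone needs a couple of hours; whether the backup system can actually feed 50 restore streams that fast is a separate question.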

Comments
Bill Karwin

Another network-level performance concern is faulty cables. When I worked in tech support for InterBase, we helped customers whose performance problems were solved when they replaced the CAT5 cables between their app server and database server.

Cables contain tiny copper filaments, and these get metal fatigue if they are bent too frequently. If too many filaments are broken this way, the cable can become faulty. TCP is designed to retransmit packets that aren’t delivered reliably, but that cuts down on throughput.

We helped a customer who had up to 90% packet loss (which means each packet needs to be re-sent up to 10 times) because of his old, damaged cables.
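
On Linux, one quick way to see whether this is happening (a rough check; the interface name is a placeholder) is to look at TCP retransmission and interface error counters:

    netstat -s | grep -i retrans
    ip -s link show eth0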

Bill Karwin

Just to correct myself: those anecdotes apply to stranded coaxial network cables, not CAT5.

Pat

Another thing worth pointing out is to check the latency of your network connections as well as the throughput. A high-throughput, high-latency link can often give you worse performance (despite the high throughput) than a narrower pipe without latency problems.

Especially if your app makes a lot of very fast queries, you may find the network round trip takes longer than the actual query execution.
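
To illustrate with made-up numbers: if a primary-key lookup executes in 0.2ms but the round trip between the app and the database is 1ms, then 10,000 such queries issued one after another spend about 10 seconds on the network and only 2 seconds executing. Measuring the round trip is easy (the host name is a placeholder):

    # Network round trip to the database host:
    ping -c 10 db-host

    # Rough end-to-end time for a trivial query, including connection setup:
    time mysql -h db-host -e "SELECT 1"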