Recently this bug was brought to my attention, and it is a nightmare bug for any consultant.

Working with production systems, we assume reads are reads, and that if we’re just reading we can’t break anything. OK, maybe we can crash the server with some SELECT query that runs into a bug, but we won’t cause data loss.

This case teaches us that things can be different – reads can in fact cause certain internal writes (updates), which add risk, as this bug exposes.

This is why transparency is important – to understand how safe something is, it is not enough to know what it does logically; you also need to know what really happens inside, and therefore what can go wrong.

19 Comments
Pedro Melo

Sorry Peter,

that is not the funniest bug ever…

This is: http://code.google.com/p/blackgold/issues/detail?id=3

Best regards,

Jonas

Hi,

We inside the cluster team actually consider it one of the worst bugs ever…
Not exactly our proudest moment 🙁

But I agree, it’s so bad that it’s actually funny…

/Jonas

Morgan Tocker

At least it looks like it doesn’t represent the typical use case of what most people will be doing – but it is very serious indeed. I’ve always found this bug the funniest:

http://bugs.mysql.com/bug.php?id=2

Shlomi Noach

Reminds me of the ‘noatime’ option on Unix file systems.
When I first learned that reading any file or file property commits a write, I was utterly surprised.

Baron Schwartz

Perhaps we should investigate the on-disk format of NDB so we can start providing data recovery services for it, too.

Matthew Montgomery

Data recovery services for NDB disk data are not relevant.
For this bug it is simply not an option: NDB just DROPped the tables from the cluster, so you’d have to restore from backup.
If disk data files do get corrupted on an individual node, you simply restart that node with the --initial option and those files are restored from the peer in the node group.

Matthew Montgomery

Correction… “simply restart that node with the --initial option [after deleting the on-disk data and log files]” (--initial will not remove on-disk data files).

Baron Schwartz

Matthew, there is probably no system in existence (that uses on-disk storage) for which data recovery from on-disk files is not relevant. The point of our data recovery tools and services is to recover data that has been dropped, deleted, corrupted, etc. when there is no backup. If it’s been dropped from the cluster, are the 1s and 0s on disk anywhere? If yes, then that’s exactly the type of scenario I’m thinking of. NDB can’t find the data anymore, but maybe something else can. And the customer might call us up, and we might write tools on the spot to do the recovery; that’s how our other tools got started 😉

What if there’s a bug in NDB such that the disk data files get corrupt on every node and there is no peer in the node group with a good copy? If it hasn’t happened yet, it may someday, who knows.

Disclaimer: I have not investigated the on-disk format of NDB at all.

o.u.

Wow, Shlomi Noach @ 4, re: the atime .. a write on every read, even from cache – I’m kind of shocked.

Baron Schwartz

It’s not quite that bad. It’s only once per second. (There’s only a write if the atime has actually changed, which is only true once a second.)
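
For anyone who wants to see this on their own machine, here is a minimal Python sketch that compares a file’s access time before and after a single read. The helper name `atime_around_read` is made up for illustration, and the result depends entirely on how the filesystem is mounted: with noatime the value should not move, and with options such as relatime it may only be updated occasionally, so treat the output as an observation rather than a guarantee.

```python
import os
import tempfile
import time


def atime_around_read(path):
    """Return (atime_before, atime_after) in nanoseconds around one read of path.

    Hypothetical helper for illustration only.
    """
    before = os.stat(path).st_atime_ns
    with open(path, "rb") as f:
        f.read()
    after = os.stat(path).st_atime_ns
    return before, after


if __name__ == "__main__":
    # Create a small scratch file so no real data is touched.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"some data")
        path = tmp.name
    try:
        # Wait so the clock moves past the second in which the file was created.
        time.sleep(1.1)
        before, after = atime_around_read(path)
        # True means the read caused a metadata update; False means the
        # filesystem (or its mount options) skipped the atime update.
        print("atime changed by read:", after != before)
    finally:
        os.unlink(path)
```

Running it on filesystems mounted with and without noatime is a quick way to check which behaviour a given server actually has.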

johan

Ha, it was a funny bug indeed. Unfortunately, I found it on a customer site 🙁
-j

Log Buffer

“Peter Zaitsev shares the funniest bug ever.”

Log Buffer #134

o.u.

Thank you Baron – ok, not as crazy as it sounded then, though still a shock.

Shlomi Noach

Hi Baron,

Though I haven’t benchmarked the difference between ‘atime’ and ‘noatime’ myself, please see what Linus Torvalds has to say about it:
http://lkml.org/lkml/2007/8/4/98

He claims more than 10% savings, though he wasn’t testing MySQL.
Do you have any benchmarks comparing with and without “noatime”?

Regards

Baron Schwartz

No, I don’t, but anecdotally I can say that it matters if you have a lot of files. For example, suppose you have 100k tables, which is pretty common in certain types of apps. That’s at least 300k files if you’re using indexed MyISAM tables (also common for the same scenarios). Now suppose that you’re accessing them all randomly; you can do the math on how many times you’ll be doing an atime/diratime write. In many cases you won’t access a file more than once a second, so each access suffers the hit.

My anecdotal evidence is that I haven’t seen significant performance changes from adding noatime,nodiratime to the mount options on “normal” servers with a few hundred tables. You can change the mount options at runtime so it’s pretty easy to see.

Shlomi Noach

@Peter,

Thanks for the information. It does sound a lot more reasonable in light of your explanation.