Funniest bug ever

Recently this bug was brought to my attention, and it is a nightmare bug for any consultant.

Working with production systems, we assume reads are reads, and that if we're just reading we can't break anything. OK, maybe we can crash the server with some SELECT query that runs into a bug, but we can't cause data loss.

This case teaches us that things can be different: reads can in fact cause certain writes (updates) internally, and those writes add risk, as this bug exposes.

This is why transparency is important: to understand how safe something is, it is not enough to know what it does logically. You also need to know what really happens inside, and so what can go wrong.

19 Comments

Pedro Melo

15 years ago

Sorry Peter,

that is not the funniest bug ever…

This is: http://code.google.com/p/blackgold/issues/detail?id=3

Best regards,

Jonas

15 years ago

Hi,

We in the Cluster team actually consider it one of the worst bugs ever…
Not exactly our proudest moment 🙁

But I agree, it’s so bad that it’s actually funny…

/Jonas

Morgan Tocker

15 years ago

At least it looks like it doesn't represent the typical use case of what most people will be doing, but it is very serious indeed. I've always found this bug the funniest:

http://bugs.mysql.com/bug.php?id=2

Shlomi Noach

15 years ago

Reminds me of the ‘noatime’ option on unix file systems.
When I first learned that by reading any file or file property I commit a write, I was utterly surprised.
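
A minimal sketch for checking this on your own machine, assuming a POSIX-like system and Python 3. The function name `atime_before_and_after_read` is made up for illustration. Whether the read actually changes the access time depends on the filesystem and its mount options (for example `noatime` or `relatime`), so no particular result is promised here.

```python
import os
import tempfile
import time


def atime_before_and_after_read(path, pause=1.1):
    """Return the file's access time before and after reading it.

    The pause lets the clock move past the last recorded atime, so that
    a change, if the filesystem records one, is visible.
    """
    before = os.stat(path).st_atime
    time.sleep(pause)
    with open(path, "rb") as f:
        f.read()
    after = os.stat(path).st_atime
    return before, after


if __name__ == "__main__":
    # Create a small scratch file so no real data is touched.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
        path = tmp.name
    try:
        before, after = atime_before_and_after_read(path)
        print("atime changed by the read:", after > before)
    finally:
        os.unlink(path)
```

If the script reports no change, the filesystem is most likely mounted with an option that suppresses or limits atime updates.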

Baron Schwartz

15 years ago

Perhaps we should investigate the on-disk format of NDB so we can start providing data recovery services for it, too.

Matthew Montgomery

15 years ago

@Baron Schwartz, data recovery services for NDB disk data are not relevant.
For this bug it is simply not an option: NDB just dropped the tables from the cluster, so you'd have to restore from backup.
If disk data files do get corrupted on a particular individual node, you simply restart that node with the --initial option and those files are restored from the peer in the node group.

Matthew Montgomery

15 years ago

Correction… “simply restart that node with --initial option [after deleting the on-disk data and log files]” (--initial will not remove on-disk data files).

Baron Schwartz

15 years ago

Matthew, there is probably no system in existence (that uses on-disk storage) for which data recovery from on-disk files is not relevant. The point of our data recovery tools and services is to recover data that has been dropped, deleted, corrupted, etc. when there is no backup. If it's been dropped from the cluster, are the 1s and 0s on disk anywhere? If yes, then that's exactly the type of scenario I'm thinking of. NDB can't find the data anymore, but maybe something else can. And the customer might call us up, and we might write tools on the spot to do the recovery. That's how our other tools got started 😉

What if there's a bug in NDB such that the disk data files get corrupted on every node and there is no peer in the node group with a good copy? If it hasn't happened yet, it may someday, who knows.

Disclaimer: I have not investigated the on-disk format of NDB at all.

Peter Zaitsev

Author

15 years ago

Jonas, right. I meant “funny” in the sense of something that is really kind of tragic.

Still, it is very nice to see you got a fix out relatively quickly, and honestly, publishing such a bug also does you credit.

Peter Zaitsev

Author

15 years ago

Baron, Matthew,

Right. If everyone had a good backup (with point-in-time recovery), we would not need any recovery tools. In practice, however, backups are sometimes found to be broken and you have to recover the data. Our experience shows no one is immune: a number of companies you'd think should have a backup have contacted us for help (with InnoDB).

With InnoDB it is easy: thanks to the page format, it is possible to locate data even if the filesystem was totally ruined (as in a RAID meltdown).

o.u.

15 years ago

Wow, Shlomi Noach @ 4, re: the atime .. a write on every read, even from cache – I’m kind of shocked.

Baron Schwartz

15 years ago

It’s not quite that bad. It’s only once per second. (There’s only a write if the atime has actually changed, which is only true once a second.)

johan

15 years ago

Ha, it was a funny bug indeed. Unfortunately, I found it on a customer site 🙁
-j

Log Buffer

15 years ago

“Peter Zaitsev shares the funniest bug ever.”

— Log Buffer #134

o.u.

15 years ago

Thank you Baron – ok, not as crazy as it sounded then, though still a shock.

Shlomi Noach

15 years ago

Hi Baron,

Though I haven't benchmarked the difference between ‘atime’ and ‘noatime’ myself, please see what Linus Torvalds has to say about it:
http://lkml.org/lkml/2007/8/4/98

He claims more than 10% savings, though he wasn't testing MySQL.
Do you have any benchmarks comparing with and without “noatime”?

Regards

Peter Zaitsev

Author

15 years ago

Shlomi,

Linus mentions a “mail spool”, which tends to have a lot of tiny files, and this is where the overhead is significant. Unless you're dealing with tens of thousands of tables in MySQL, you're in a different situation. This also means any benchmarks you would like to do should be workload specific; in your particular case it is possible you will see a significant gain.

Baron Schwartz

15 years ago

No, I don't, but anecdotally I can say that it matters if you have a lot of files. For example, suppose you have 100k tables, which is pretty common in certain types of apps. That's at least 300k files if you're using indexed MyISAM tables (also common for the same scenarios). Now suppose that you're accessing them all randomly; you can do the math on how many times you'll be doing an atime/diratime write. In many cases you won't access a file more than once a second, so each access suffers the hit.

My anecdotal evidence is that I haven’t seen significant performance changes from adding noatime,nodiratime to the mount options on “normal” servers with a few hundred tables. You can change the mount options at runtime so it’s pretty easy to see.
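
A rough sketch of the file-count arithmetic above, in Python. It assumes three files per MyISAM table (.frm, .MYD, .MYI), uniformly random access, and at most one atime write per file per second. The function name `estimate_atime_writes` and the access rates are made up for illustration, and directory atime updates are ignored.

```python
import random


def estimate_atime_writes(num_files, accesses_per_second, seconds, seed=0):
    """Estimate atime writes when each file is updated at most once a second.

    Simulates uniformly random file accesses and counts the distinct
    files touched within each one-second window.
    """
    rng = random.Random(seed)
    writes = 0
    for _ in range(seconds):
        touched = {rng.randrange(num_files) for _ in range(accesses_per_second)}
        writes += len(touched)
    return writes


if __name__ == "__main__":
    tables = 100_000
    files = tables * 3  # .frm, .MYD and .MYI for each MyISAM table

    # Many files, scattered access: nearly every access pays for a write.
    many = estimate_atime_writes(files, accesses_per_second=1000, seconds=10)
    print("300k files:", many, "atime writes for", 1000 * 10, "accesses")

    # Few files, same access rate: writes are capped by the file count.
    few = estimate_atime_writes(300, accesses_per_second=1000, seconds=10)
    print("300 files: ", few, "atime writes for", 1000 * 10, "accesses")
```

With many files, the number of writes stays close to the number of accesses. With a few hundred files, it cannot exceed the file count per second, which is consistent with seeing little difference on "normal" servers.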

Shlomi Noach

15 years ago

@Peter, @Baron Schwartz,

Thanks for the information. It does sound a lot more reasonable in light of your explanation.
