The combination of max_allowed_packet variable and replication in MySQL is a common source of headaches. In a nutshell, max_allowed_packet is the maximum size of a MySQL network protocol packet that the server can create or read. It has a default value of 1MB (<= 5.6.5) or 4MB (>= 5.6.6) and a maximum size of 1GB. This adds some constraints in our replication environment:

  • The master server shouldn’t write events to the binary log larger than max_allowed_packet
  • All the slaves in the replication chain should have the same max_allowed_packet as the master server

Sometimes, even following those two basic rules we can have problems.

For example, there are situations (also called bugs) where the master writes more data than the max_allowed_packet limit causing the slaves to stop working. In order to fix this Oracle created a new variable called slave_max_allowed_packet. This new configuration variable available from 5.1.64, 5.5.26 and 5.6.6 overrides the max_allowed_packet value for slave threads. Therefore, regardless of the max_allowed_packet value the slaves’ threads will have 1GB limit, the default value of slave_max_allowed_packet. Nice trick that works as expected.

Sometimes even with that workaround we can get the max_allowed_packet error in the slave servers. That means that there is a packet larger than 1GB, something that shouldn’t happen in a normal situation. Why? Usually it is caused by a binary log corruption. Let’s see the following example:

Slave stops working with the following message:

The important part is “got fatal error 1236 from master”. The master cannot read the event it wrote to the binary log seconds ago. To check the problem we can:

  • Use mysqlbinlog to read the binary log from the position it failed with –start-position.

This is an example taken from our Percona Forums:

Check the size of the event, 1953066613 bytes. Or the “Unknown event” messages. Something is clearly wrong there. Another usual thing to check is the server id that sometimes doesn’t correspond with the real value. In this example the person who posted the binary log event confirmed that the server id was wrong.

  • Check master’s error log.

Again, the event is bigger than expected. There is no way the master and slave can read/write it, so the solution is to skip that event in the slave and rotate the logs on the master. Then, use pt-table-checksum to check data consistency.

MySQL 5.6 includes replication checksums to avoid problems with log corruptions. You can read more about it in Stephan’s blog post.

Conclusion

Errors on slave servers about max_allowed_packet can be caused by very different reasons. Although binary log corruption is not a common one, it is something worth checking when you have run out of ideas.

2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
sbester

For one genuine case, see the description of Bug #17842137 in the 5.6.16 change history…. Another commonly seen cause is incorrect positions for CHANGE MASTER, placing the slave to read in the middle of some event.

Karol Murawski

Another option:

if error: Got fatal error 1236 from master when reading data from binary log: ‘log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event ‘mysql-bin.000478’ at 251035246, the last event read from ‘/DATABASE/mysql/mysql-bin.000478’ at 251035246, the last byte read from ‘/DATABASE/mysql/mysql-bin.000478′ at 251035265.’

mysqlbinlog [path_to_master_file] | grep ‘[file pos]’

[file pos] – should be 251035265 but we take 251035 – we must find event before corruption

mysqlbinlog /DATABASE/mysql/mysql-bin.000478 | grep -i ‘251035’

…………
0
# at 225103579
#150831 10:44:06 server id 16777300 end_log_pos 251035064 CRC32 0x14484166 Query thread_id=23754573 exec_time=0 error_code=0
# at 251035064
#150831 10:44:06 server id 16777300 end_log_pos 251035145 CRC32 0x630bd054 Query thread_id=23754573 exec_time=0 error_code=0
# at 251035145
# at 251035177
#150831 10:44:06 server id 16777300 end_log_pos 251035177 CRC32 0x23816d86 Intvar
#150831 10:44:06 server id 16777300 end_log_pos 251035856 CRC32 0x49720f8e Query thread_id=23754573 exec_time=0 error_code=0
# at 251035856
#150831 10:44:06 server id 16777300 end_log_pos 251035938 CRC32 0x6d5fc69c Query thread_id=23754573 exec_time=0 error_code=0
# at 251035938
………

and we see that we can’t find event on position 251035246 so we take first event before corrupted postition : 251035177

than:
STOP SLAVE;
RESET SLAVE; CHANGE MASTER TO MASTER_HOST=’10.X.X.X’,MASTER_USER=’XX’,MASTER_PASSWORD=’XX’, MASTER_PORT=3306, MASTER_LOG_FILE=’mysql-bin.000478′,MASTER_LOG_POS=251035177;
START SLAVE;