Most technologies achieving high-availability for MySQL need a load-balancer to spread the client connections to a valid database host, even the Tungsten special connector can be seen as a sophisticated load-balancer. People often use hardware load balancer or software solution like haproxy. In both cases, in order to avoid having a single point of failure, multiple load balancers must be used. Load balancers have two drawbacks: they increase network latency and/or they add a validation check load on the database servers. The increased network latency is obvious in the case of standalone load balancers where you must first connect to the load balancer which then completes the request by connecting to one of the database servers. Some workloads like reporting/adhoc queries are not affected by a small increase of latency but other workloads like oltp processing and real-time logging are. Each load balancers must also check regularly if the database servers are in a sane state, so adding more load balancers increases the idle chatting over the network. In order to reduce these impacts, a very different type of load balancer is needed, let me introduce the Iptables ClusterIP target.

Normally, as stated by the RFC 1812 Requirements for IP Version 4 Routers an IP address must be unique on a network and each host must respond only for IPs it own. In order to achieve a load balancing behavior, the Iptables ClusterIP target doesn’t strictly respect the RFC. The principle is simple, each computer in the cluster share an IP address and MAC address with the other members but it answers requests only for a given subset, based on the modulo of a network value which is sourceIP-sourcePort by default. The behavior is controlled by an iptables rule and by the content of the kernel file /proc/net/ipt_CLUSTERIP/VIP_ADDRESS. The kernel /proc file just informs the kernel to which portion of the traffic it should answer. I don’t want to go too deep in the details here since all those things are handled by the Pacemaker resource agent, IPaddr2.

The IPaddr2 Pacemaker resource agent is commonly used for VIP but what is less know is its behavior when defined as part of a clone set. When part of clone set, the resource agent defines a VIP which uses the Iptables ClusterIP target, the iptables rules and the handling of the proc file are all done automatically. That seems very nice in theory but until recently, I never succeeded in having a suitable distribution behavior. When starting the clone set on, let’s say, three nodes, it distributes correctly, one instance on each but if 2 nodes fail and then recover, the clone instances all go to the 3rd node and stay there even after the first two nodes recover. That bugged me for quite a while but I finally modified the resource agent and found a way to have it work correctly. It also now set correctly the MAC address if none is provided to the MAC multicast address domain which starts by “01:00:5E”. The new agent, IPaddr3, is available here. Now, let’s show what we can achieve with it.

We’ll start from the setup described in my previous post and we’ll modify it. First, download and install the IPaddr3 agent.

Repeat these steps on all 3 nodes. Then, we’ll modify the pacemaker configuration like this (I’ll explain below):

First, the VIP primitive is modified to use the new agent, IPaddr3, and we set resource-stickiness=”0″. Next, we define the cl_cluster_vip clone set using: clone-max=”3″ to have three instances, clone-node-max=”3″ to allow up to three instances on the same node and globally-unique=”true” to tell Pacemaker it has to allocate an instance on a node even if there’s already one. Finally, there’re three location rules needed to get the behavior we want, one using the p_cluster_vip_clone_count attribute and the other two around the writable_monit attribute. Enabling all that gives:

and the network configuration is:

In order to test the access, you need to query the VIP from a fourth node:

So, all good… Let’s now desync the pacemaker-1 and pacemaker-2.

Now, if pacemaker-1 and pacemaker-2 are back in sync, we have the desired distribution:

All the clone instances redistributed on all nodes as we wanted.

As a conclusion, Pacemaker with a clone set of IPaddr3 is a very interesting kind of load balancer, especially if you already have pacemaker deployed. It introduces almost no latency, it doesn’t need any other hardware, doesn’t increase the database validation load and is as highly-available as your database is. The only drawback I can see is in a case where the inbound traffic is very important. In that case, all nodes are receiving all the traffic and are equally saturated. With databases and web type traffics, the inbound traffic is usually small. This solution also doesn’t redistribute the connections based on the server load like a load balancer can do but that would be fairly easy to implement with something like a server_load attribute and an agent similar to mysql_monitor but that will check the server load instead of the database status. In such a case, I suggest using much more than 1 VIP clone instance per node to have a better granularity in load distribution. Finally, the ClusterIP target, although still fully supported, has been deprecated in favor of the Cluster-match target. It is basically the same principle and I plan to adapt the IPaddr3 agent to Cluster-match in a near future.

4 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Gerry

What sort of load balancer do you use?

Peter Zaitsev

Interesting,

This reminds me first ENIC architecture for their MySQL clustering product – everyone gets the traffic but only one responds.

rad

Thanks for your post.

Although I’m not using pacemaker, but I managed to configure clusterip on two servers correctly without any issue. Now the problem is that I could not get response from both servers. Only one server is always responding. I tried several IPs. The configuration is OK because when I use “sourceip-sourceport” instead of “sourcip” I can see that some part of requests are responded by one server and the rest by the other. How is the hash calculated based on source IP? Does the load really shared equally among the two servers? Since the servers are web servers, I don’t like to use source port.

Thanks in advance.