Using Heartbeat on ClearOS
This guide reviews the Heartbeat stack of services and covers the basics of implementation. Heartbeat is a set of services designed to provide failover capability for services under Linux and ClearOS. This open source software allows a backup server to take over services if the primary server fails.
Here are some terms we will use when discussing high availability:
- Primary: the default server providing services
- Secondary: the server providing redundancy if the primary fails
- Master: the server or service containing the authoritative data
- Slave: the server or service containing a replica of the master's data
- Split brain: a condition for data and services in a cluster where multiple servers believe that they are authoritatively in control. This condition can lead to corrupted data and requires intervention and recovery.
- STONITH: [Shoot The Other Node In The Head] a concept in server redundancy where a node is capable of physically preventing another node from providing data or services by interrupting its power or connectivity.
- heartbeat: a communication or signalling method between peers in a cluster that informs the other members of the cluster of the well-being of each node.
You will need to decide how the heartbeat will communicate. Common methods are a serial cable or a dedicated NIC. You should NEVER use the same NIC that carries the resources you are clustering.
To ease cluster administration we recommend that you unify the SSH key infrastructure so that you can make remote calls and copy files between servers over their shared network.
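One way to unify SSH keys is a dedicated, passphrase-less key pair reserved for cluster administration. The sketch below generates such a pair; the key name and the peer hostname in the comment are illustrative, and the key is created in a temporary directory so you can inspect it before installing it under /root/.ssh:

```shell
set -e
# Create a dedicated, passphrase-less key pair for cluster administration.
# (Key name and peer hostname are examples, not fixed by heartbeat.)
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$KEYDIR/id_rsa_cluster"
ls "$KEYDIR"
# On a real cluster, push the public key to the peer so that scripted
# remote calls and file copies work unattended:
#   ssh-copy-id -i "$KEYDIR/id_rsa_cluster.pub" root@firewall2.domain.lan
```

With the public key installed on both nodes, commands such as scp and remote ssh calls between the servers no longer prompt for a password.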
Install the heartbeat package from the repositories:
yum --enablerepo=clearos-epel,clearos-core install heartbeat
There are basically three files that you need to configure for basic heartbeat functionality:
- ha.cf Main configuration file
- haresources Resource configuration file
- authkeys Authentication information
These files need to be created in /etc/ha.d/.
The ha.cf file determines how the cluster communicates. An example file is here:
## /etc/ha.d/ha.cf
## This configuration is to be the same on both machines
keepalive 2
deadtime 10
warntime 5
initdead 15
serial /dev/ttyS0
baud 19200
auto_failback on
node firewall1.domain.lan
node firewall2.domain.lan
In this cluster the servers communicate using a serial rollover cable on COM1. You can use Ethernet as well.
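If you use a dedicated NIC instead of a serial link, replace the serial and baud lines in ha.cf with a broadcast or unicast directive. The interface name and peer address below are examples for a dedicated heartbeat NIC, not values required by heartbeat:

```
# Broadcast heartbeat on a dedicated interface:
bcast eth1
# Or point-to-point unicast to a known peer address:
ucast eth1 10.0.0.2
```

A unicast link to a fixed peer address keeps heartbeat traffic off the rest of the segment, which suits a two-node cluster with a crossover cable.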
It is VITAL that the node names are correct. Heartbeat matches each node directive against the output of uname -n. To determine the node name of the server you are on, run the following:
uname -n
For the most part, this file should be identical on ALL nodes in the cluster. The exception is that resources (like communication devices) may differ between servers.
For specific information, refer to the manual.
The haresources file defines what will be clustered. It also contains the name of the node that will be the primary node. Following the node name, services, IP addresses, and other resources are listed in the order in which they should start. When a node stops, services are stopped in the reverse of the listed order.
This file should be identical on all nodes in the cluster.
Here is an example of an haresources file:
firewall1.domain.lan 192.168.5.3 bypassd smbd nmbd dnsmasq
For details on this file, refer to the manual.
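Once heartbeat is running on both nodes, you can rehearse a failover by hand rather than pulling a cable. Heartbeat ships helper scripts for this; the install path varies by distribution, and /usr/share/heartbeat/ shown here is typical on RHEL-family systems such as ClearOS:

```
/usr/share/heartbeat/hb_standby    # release resources to the secondary
/usr/share/heartbeat/hb_takeover   # pull resources back to this node
```

Running hb_standby on the primary should stop the listed resources in reverse order and bring them up on the secondary; watch /var/log/messages on both nodes to confirm the handover.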
The authkeys file is simple and will look something like this:
auth 1
1 sha1 0123456789abcdef0123456789abcdef
You can generate this file by running the following from command line:
cat <<-'!AUTH' >/etc/ha.d/authkeys
# Automatically generated authkeys file
auth 1
!AUTH
echo "1 sha1" `dd if=/dev/urandom count=4 2>/dev/null | md5sum | cut -c1-32` >> /etc/ha.d/authkeys
echo " " && echo "New authkeys file generated." && echo "Ensure the following is distributed to all nodes in the cluster (/etc/ha.d/authkeys):" && echo && cat /etc/ha.d/authkeys && echo
chmod 600 /etc/ha.d/authkeys
This file must exist on all nodes in the cluster. Either copy the file to /etc/ha.d/ on those servers or create the file and copy the content verbatim.
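Heartbeat refuses to start if authkeys is readable by anyone other than its owner, so after copying the file it is worth confirming the permissions survived. The sketch below writes a sample authkeys to a temporary path (a stand-in for /etc/ha.d/authkeys) and checks the mode:

```shell
set -e
# Write a sample authkeys file to a temporary stand-in path and verify it
# carries the owner-only (600) permissions heartbeat requires.
AUTHKEYS=$(mktemp)
printf 'auth 1\n1 sha1 0123456789abcdef0123456789abcdef\n' > "$AUTHKEYS"
chmod 600 "$AUTHKEYS"
stat -c '%a' "$AUTHKEYS"
```

Run the same stat check on /etc/ha.d/authkeys on every node after distributing the file; each should report 600.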