HA Active / Passive NFS Cluster

Get a pair of CentOS 5.5 machines serving up highly available NFS.

I use NFS volumes for all of my production VMware servers, so losing the NFS server to a hardware failure would be a very bad thing.

We accomplish this using drbd and heartbeat.

This guide assumes you have two identically configured systems, each with an empty, unformatted partition ready to go. It also assumes a dedicated crossover connection between network cards on both nodes.

This setup was completed on CentOS 5.5, using the centosplus yum repository for drbd and the Fedora EPEL repository for heartbeat.

Installing

I use the EPEL, centosplus, and CentOS extras yum repositories. After setting them up, do:

yum install drbd83 kmod-drbd83 heartbeat
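If the repositories aren't set up yet, enabling them looks roughly like this (a sketch; the EPEL release rpm is the same one referenced later in this guide, and your repo file layout may differ):

vim /etc/yum.repos.d/CentOS-Base.repo    # set enabled=1 under the [centosplus] section
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm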

If you want to use the XFS file system (wonderful for big files on big filesystems; VMware, anyone?), also install:

yum install xfsprogs kmod-xfs

DRBD Configuration

You need a blank, unformatted block device. It can be a partition (/dev/sdb1, for example) or a whole disk (/dev/sdb); be careful not to use any in-use file systems. (It is possible to turn an existing filesystem with existing data into a drbd device, but that's another blog post.)
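If you want to double-check that the partition really is unused before handing it to drbd, a quick sanity check (using the /dev/sdb1 example) might look like:

cat /proc/partitions          # the device should be listed
mount | grep sdb1             # should return nothing
swapon -s | grep sdb1         # should return nothing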

We need to set up the distributed block devices to mirror the main data file systems.

Edit /etc/drbd.conf:

global {
  usage-count yes;
}
common {
  protocol C;
}

resource drbd0 {

  device    /dev/drbd0;
  disk      <your_blank_drbd_partition eg: /dev/sdb1>;
  meta-disk internal;

  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
    split-brain "/usr/lib/drbd/notify-split-brain.sh <your_name@email_server>";
  }

  startup {
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
    no-disk-flushes;
    no-md-flushes;
  }

  net {
    cram-hmac-alg "sha1";
    shared-secret "HaDxWpLXRIB6dxa54CnV";
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
  }

  syncer {
    rate 100M;
    al-extents 257;
    csums-alg sha1;
  }

  on drbd-lvm-test1 {
    address   <ip_address_of_node1>:7789;
  }

  on drbd-lvm-test2 {
    address   <ip_address_of_node2>:7789;
  }
}

Note: the host names in the two "on" sections (drbd-lvm-test1 and drbd-lvm-test2 here) must match your nodes' actual hostnames as reported by uname -n. Then issue these commands on BOTH nodes:

drbdadm create-md drbd0

service drbd start

You can see that the device was created successfully by issuing:

cat /proc/drbd

Issue the following command on the PRIMARY node (one node only):

drbdadm -- --overwrite-data-of-peer primary drbd0

Wait for the sync to complete; periodically run cat /proc/drbd until it looks like this:

version: 8.3.8 (api:88/proto:86-94)
GIT-hash: d78846e52224fd00562f7c225bcc25b2d422321d build by mockbuild@builder10.centos.org, 2010-06-04 08:04:09
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:223608780 nr:0 dw:44 dr:223610936 al:1 bm:13649 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
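You can also check replication status through the drbd init script:

service drbd status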

LVM Setup

Create an LVM stack on top of the drbd device. Edit lvm.conf to force LVM to ignore the underlying block device(s) being used by drbd; this prevents LVM from activating on the wrong device when heartbeat starts it up.

On BOTH Nodes:

vim /etc/lvm/lvm.conf

comment out this line:

#filter = [ "a/.*/" ]

Add this line so LVM accepts the drbd device and ignores the block device that drbd sits on top of (replace the rejected device with your drbd backing partition):

filter = [ "a|drbd.*|", "r|<your_drbd_backing_partition eg: /dev/sdb1>|" ]

On the drbd PRIMARY node Only:

pvcreate /dev/drbd0

vgcreate <volume_group_name> /dev/drbd0

lvcreate -l 100%FREE -n <logical_volume_name> <volume_group_name>
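To confirm that LVM is sitting on the drbd device (and not on the raw partition underneath it), a quick check:

pvs    # should list /dev/drbd0 as the physical volume
vgs
lvs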

XFS Setup

Create an XFS file system on the logical volume created above; I suggest the tuning parameters below:

mkfs.xfs -f -d su=256k,sw=<number_of_data_disks_in_the_raid> -l size=64m /dev/<volume_group_name>/<logical_volume_name>

 - the sw parameter is the number of data disks in the array. For example: if there are 24 drives, 2 are used as hot spares, and 2 are used for RAID6 parity, then sw=20 (see the worked example below).
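For example, with the 24-drive layout described above (20 data disks) and hypothetical volume group and logical volume names vg_data and lv_data, the command would be:

mkfs.xfs -f -d su=256k,sw=20 -l size=64m /dev/vg_data/lv_data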

make sure it all works:

mkdir /data && mount /dev/<volume_group_name>/<logical_volume_name> /data

Setting up heartbeat is next, so make sure the filesystem is unmounted:

umount /data

NFS Setup

The NFS metadata has to go onto the shared block device; otherwise all your NFS clients will suffer "Stale NFS file handle" errors and will need to be rebooted when the cluster fails over. Not good. So this procedure must be done on both nodes, one after the other:

On Node1:

Change where the rpc_pipefs file system gets mounted:

mkdir /var/lib/rpc_pipefs

vim /etc/modprobe.d/modprobe.conf.dist

Locate the module commands for sunrpc and change the mount path statement from /var/lib/nfs/rpc_pipefs to /var/lib/rpc_pipefs (see the sketch below).
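For reference, after the change the sunrpc entries look roughly like the lines below (a sketch; the exact options in your modprobe.conf.dist may differ, so edit the existing lines rather than pasting these in):

install sunrpc /sbin/modprobe --first-time --ignore-install sunrpc && { /bin/mount -t rpc_pipefs sunrpc /var/lib/rpc_pipefs > /dev/null 2>&1 || :; }
remove sunrpc { /bin/umount /var/lib/rpc_pipefs > /dev/null 2>&1 || :; } ; /sbin/modprobe -r --first-time --ignore-remove sunrpc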

vim /etc/sysconfig/nfs

Add this line to the bottom:

RPCIDMAPDARGS="-p /var/lib/rpc_pipefs"

reboot the node.

Make it the primary drbd node:

drbdadm primary drbd0

scan for the volume group on the drbd0 block device:

vgscan

Make the drbd volume group active:

vgchange -a y

mount the xfs file system:

mount /dev/<volume_group_name>/<logical_volume_name> /data

Move /var/lib/nfs to the shared filesystem:

mv /var/lib/nfs /data/

ln -s /data/nfs /var/lib/nfs

Put the nfs exports config file in the shared file system as well:

mv /etc/exports /data/nfs/

ln -s /data/nfs/exports /etc/exports

create a dir under /data for export:

mkdir /data/supercriticalstuff

export it (use rw instead of ro if clients need write access to this export):

echo "/data/supercriticalstuff *(ro,async,no_root_squash)" >> /data/nfs/exports

Edit /etc/init.d/nfs and change killproc nfs -2 to killproc nfs -9, to make sure nfs really dies when stopped.
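One way to make the change and verify it (a sketch; check first what the killproc line in your script's stop section actually says, and adjust the pattern to match):

grep -n killproc /etc/init.d/nfs                               # find the line that stops the nfs daemon
sed -i 's/killproc nfs -2/killproc nfs -9/' /etc/init.d/nfs    # adjust the pattern if your script says nfsd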

back it up so you can fix it after rpm updates:

cp /etc/init.d/nfs ~/nfs_modded_init_script

Start NFS and make sure it all works:

service nfs start
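A quick way to verify the export is live (assuming the export created above):

exportfs -v
showmount -e localhost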

Now on to Node 2. First, shut down NFS on Node 1:

service nfs stop

unmount the file system:

umount /data

deactivate the volume group:

vgchange -a n <volume_group_name>

give up the drbd resource:

drbdadm secondary drbd0

On Node 2:

Change where the rpc_pipefs file system gets mounted:

mkdir /var/lib/rpc_pipefs

vim /etc/modprobe.d/modprobe.conf.dist

Locate the module commands for sunrpc and change the mount path statement from /var/lib/nfs/rpc_pipefs to /var/lib/rpc_pipefs, exactly as on Node 1.

vim /etc/sysconfig/nfs

Add this line to the bottom:

RPCIDMAPDARGS="-p /var/lib/rpc_pipefs"

reboot the node.

Make it the primary drbd node:

drbdadm primary drbd0

scan for the volume group on the drbd0 block device:

vgscan

Make the drbd volume group active:

vgchange -a y

mount the xfs file system:

mount /dev/<volume_group_name>/<logical_volume_name> /data

Get rid of the local copies of /var/lib/nfs and /etc/exports:

rm -rf /var/lib/nfs

rm -f /etc/exports

Make the appropriate symlinks:

ln -s /data/nfs /var/lib/nfs
ln -s /data/nfs/exports /etc/exports

Edit /etc/init.d/nfs and change killproc nfs -2 to killproc nfs -9 (as on Node 1), to make sure nfs really dies when stopped.

back it up so you can fix it after rpm updates:

cp /etc/init.d/nfs ~/nfs_modded_init_script

Start NFS and make sure it all works:

service nfs start

Shut down NFS:

service nfs stop

unmount the file system:

umount /data

deactivate the volume group:

vgchange -a n <volume_group_name>

give up the drbd resource:

drbdadm secondary drbd0

Setting up Heartbeat

On Both Nodes:

Edit /etc/hosts and make sure both cluster nodes are listed on both hosts using their crossover IP addresses.
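A minimal example, assuming 10.0.0.1 and 10.0.0.2 are the crossover addresses and node1/node2 are the hostnames used in ha.cf (adjust to your own names and addresses):

10.0.0.1   node1
10.0.0.2   node2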

rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm

yum install heartbeat heartbeat-stonith heartbeat-pils

NOTE: make sure yum only installs the x86_64 version of the rpm; you may have to specify the exact version, like so:

yum install heartbeat-2.1.4-11.el5.x86_64 (the version number may not be exactly as shown here; check the output of yum for the right rpm)

edit /etc/ha.d/ha.cf :

logfacility     local0
keepalive 5
deadtime 20
warntime 10
udpport 695
bcast   bond0  # ethernet interface 1
bcast   bond1  # ethernet interface 2
bcast   bond2  # ethernet interface, or serial interface
auto_failback off
node    node1 node2
respawn hacluster /usr/lib64/heartbeat/ipfail

edit /etc/ha.d/haresources:

node1 \
IPaddr2::<virtual_ha_ip_address>/24/bond0 \
IPaddr2::<virtual_ha_ip_address>/24/bond1 \
drbddisk::drbd0 \
LVM::<volume_group_name> \
Filesystem::/dev/<volume_group_name>/<logical_volume_name>::/<mountpoint>::xfs::rw,nobarrier,noatime,nodiratime,logbufs=8 \
nfslock \
nfs
Note: use IP addresses that are NOT currently assigned to any network adapter. This IP will move from host to host as the cluster fails over.

edit /etc/ha.d/authkeys:

auth 2
2 sha1 <random_gibberish_20_characters_long>

chmod 600 /etc/ha.d/authkeys
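One quick way to generate a random 20-character secret for the key above (a sketch using openssl; any source of random gibberish works):

openssl rand -hex 10    # 10 random bytes printed as 20 hex characters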

make sure the HA services are disabled at boot:

chkconfig nfs off

chkconfig nfslock off

chkconfig heartbeat on

On the primary node:

service heartbeat start && tail -f /var/log/messages

make sure the file system is mounted:

mount

make sure the HA IP address is up:

ip addr

On the secondary node:

service heartbeat start

/var/log/messages should show: Status update: Node node1 now has status active

/var/log/messages on the primary node should show the secondary node joining the cluster

Testing

Run service heartbeat stop on the primary node and make sure the services fail over properly. Also test a hard failure by powering the primary node off:

halt -p

Fail the services back and forth a few times while the NFS export is mounted from another system, to make sure everything fails over as it should.
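A simple client-side check while failing back and forth (a sketch; assumes the virtual IP and the read-only export created earlier, run from a separate machine):

mount -t nfs <virtual_ha_ip_address>:/data/supercriticalstuff /mnt
while true; do ls /mnt > /dev/null && date; sleep 1; done    # reads should pause briefly during failover, then resume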

enjoy an active/passive NFS server.

Rob | Thursday 10 March 2011 - 1:39 pm | Tech Stuff

two comments

Jim Nickel

Hiya,

One other thing:

The exports line you have is:
echo “/data/supercriticalstuff *(ro,async,no_root_squash)” >> /data/nfs/exports

Shouldn’t it be rw? Like:
echo “/data/supercriticalstuff *(rw,async,no_root_squash)” >> /data/nfs/exports

Jim

Jim Nickel, - 14-04-’12 09:27
Jim Nickel

Hiya,

Thanks for these instructions. A few problems to correct:

1) You don’t indicate it, but in the drbd.conf file you need to change drbd-lvm-test1 and drbd-lvm-test2 to be whatever hostname you have for your machines

2) you can also check the status of drbd replication by going: service drbd status

3) In LVM setup, shouldn’t this line: filter = [ “a|drbd.*|”, “r|/dev/sda3|” ]
be filter = [ “a|drbd.*|”, “r|/dev/sdb1|” ] if we are using /dev/sdb1 as the mount for drbd0?

4) Under XFS Setup, you don’t provide an example of what the mkfs.xfs line should look like. I used a sw=2 in my config, but I am not sure if that is right

5) Under XFS Setup, you have this line:
mkdir /data && mount /dev/<volume_group_name> /<logical_volume_name>
I think it should be:
mkdir /data && mount /dev/<volume_group_name>/<logical_volume_name> /data

Thanks!

Jim

Jim Nickel, - 14-04-’12 09:36