EzDevInfo.com

pacemaker interview questions

Top pacemaker frequently asked interview questions

Backend high availability solutions in nginx

Looking for possibilities / alternatives for backend HA in nginx. At the moment we are using lua-nginx, which does not support the HttpUpstream module, which would be my first choice. I know a bit about Pacemaker but have never used it, so I am not sure whether it would be a good combination with nginx. Any hints or experience?
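For comparison, the failover handling that the stock upstream module provides looks roughly like this (a minimal sketch; server names, ports and timeouts are hypothetical):

upstream backend {
    # mark a peer failed after 3 errors within 30s, retry it after 30s
    server app1.example.com:8080 max_fails=3 fail_timeout=30s;
    server app2.example.com:8080 max_fails=3 fail_timeout=30s;
    # only used when all primary peers are down
    server app3.example.com:8080 backup;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
        # retry the next peer on connection errors and timeouts
        proxy_next_upstream error timeout;
    }
}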


Source: (StackOverflow)

Any way to find out the master node of a Linux-HA cluster with the "crm" command?

I have a Linux-HA based cluster (master node / slave node) with some resources defined in Pacemaker. My question is: is there any way, using the "crm" command, to find out the master node of this Linux-HA cluster? I mean in the time window before all resource agents have loaded, or while resources are still loading?

After the resources are loaded, I think we can use crm_mon or "crm status" and grep for the resource on the master node to identify it, but I cannot figure out a way to do this before or during resource loading.
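To flesh out the after-load case mentioned above (a sketch; ms_db is a hypothetical master/slave resource name):

# show which node currently runs the promoted (master) instance
crm_resource --resource ms_db --locate

# one-shot status dump, grep for the Masters line
crm_mon -1 | grep -i master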

thanks, Emre


Source: (StackOverflow)


How to configure pacemaker from Python software

I am working on adding high availability to our Python software via Pacemaker.
I successfully built a proof of concept, and now I wonder what the best way is to configure Pacemaker from the software.
In the POC I mainly used the pcs CLI, but I've looked around and saw that I can also configure the cib.xml via cibadmin.
My question is: which would be better to use?
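For a sense of the trade-off, the same kind of change can be made at either level (a sketch; the resource name and address are hypothetical):

# high-level: pcs validates the request and pushes the CIB change for you
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24

# low-level: read and write the CIB XML directly
cibadmin --query > cib.xml              # dump the current CIB
cibadmin --replace --xml-file cib.xml   # push back an edited copy

Calling the pcs (or crm) CLI from the Python code keeps the tool's validation; cibadmin requires producing correct CIB XML yourself.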


Source: (StackOverflow)

Ubuntu: issue with starting corosync at boot

I have configured corosync + pacemaker and it is working well.

But I have an issue with starting corosync at boot:

It starts, the node goes online and everything is OK. Then after a while the interface goes offline and online again, and corosync has stopped:

May 23 15:59:48 [922] node-01 corosync notice  [MAIN  ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service.
May 23 15:59:48 [922] node-01 corosync info    [MAIN  ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] The network interface [10.10.2.93] is now up.
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
May 23 15:59:48 [922] node-01 corosync info    [QB    ] server name: cmap
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync configuration service [1]
May 23 15:59:48 [922] node-01 corosync info    [QB    ] server name: cfg
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 23 15:59:48 [922] node-01 corosync info    [QB    ] server name: cpg
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
May 23 15:59:48 [922] node-01 corosync warning [WD    ] No Watchdog, try modprobe <a watchdog>
May 23 15:59:48 [922] node-01 corosync info    [WD    ] no resources configured.
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Using quorum provider corosync_votequorum
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 23 15:59:48 [922] node-01 corosync info    [QB    ] server name: votequorum
May 23 15:59:48 [922] node-01 corosync notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 23 15:59:48 [922] node-01 corosync info    [QB    ] server name: quorum
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] adding new UDPU member {10.10.2.88}
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] adding new UDPU member {10.10.2.93}
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] A new membership (10.10.2.93:648) was formed. Members joined: 2
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Members[1]: 2
May 23 15:59:48 [922] node-01 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 23 15:59:48 [922] node-01 corosync notice  [TOTEM ] A new membership (10.10.2.88:652) was formed. Members joined: 1
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] This node is within the primary component and will provide service.
May 23 15:59:48 [922] node-01 corosync notice  [QUORUM] Members[2]: 1 2
May 23 15:59:48 [922] node-01 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
May 23 16:14:43 [960] node-01 corosync debug   [TOTEM ] waiting_trans_ack changed to 0
May 23 16:14:56 [960] node-01 corosync debug   [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug   [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug   [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug   [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug   [TOTEM ] The token was lost in the OPERATIONAL state.
May 23 16:00:01 [922] node-01 corosync notice  [TOTEM ] A processor failed, forming new configuration.
May 23 16:00:01 [922] node-01 corosync notice  [TOTEM ] The network interface is down.
May 23 16:00:01 [922] node-01 corosync notice  [TOTEM ] adding new UDPU member {10.10.2.88}
May 23 16:00:01 [922] node-01 corosync notice  [TOTEM ] adding new UDPU member {10.10.2.93}
May 23 16:00:03 [922] node-01 corosync notice  [TOTEM ] The network interface [10.10.2.93] is now up.
May 23 16:00:03 [922] node-01 corosync notice  [TOTEM ] adding new UDPU member {10.10.2.88}
May 23 16:00:03 [922] node-01 corosync notice  [TOTEM ] adding new UDPU member {10.10.2.93}
May 23 16:00:04 [922] node-01 corosync warning [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

If I start the service manually, everything works fine.

corosync.conf:

totem {
  version: 2
  cluster_name: c-01
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.2.0
    broadcast: yes
    mcastport: 5405
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
}

nodelist {
  node {
    ring0_addr: 10.10.2.88
    nodeid: 1
  }
  node {
    ring0_addr: 10.10.2.93
    nodeid: 2
  }
}

logging {
  to_logfile: yes
  logfile: /var/log/corosync/corosync.log
  timestamp: on
  debug: on
}
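The pattern in the log (corosync comes up at boot, then the interface is reported down a few seconds later and Totem cannot re-form the cluster) often means corosync starts while the boot-time network configuration is still in progress, so the bind address briefly disappears underneath it. On a systemd-based install, one way to make corosync wait for the network is an ordering override (a sketch; if this Ubuntu release boots corosync via sysvinit/upstart instead, the equivalent is to add the network dependency to the init-script ordering):

# /etc/systemd/system/corosync.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target

# then: systemctl daemon-reload && systemctl enable corosync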

Source: (StackOverflow)

Can pacemaker monitor a port?

We are installing (for the first time) pacemaker + corosync on 2 nodes (active/passive). We need to move the virtual IP from one node to the other when a certain port stops responding. We have checked all the available resource agents and it seems there is no resource for this.

Is this correct? If so, what are my alternatives?
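One common pattern is to let the monitor operation of the service that owns the port drive the failover, and colocate the VIP with that service (a sketch; the resource names, agent choice and IP are hypothetical):

pcs resource create web ocf:heartbeat:nginx op monitor interval=15s
pcs resource create vip ocf:heartbeat:IPaddr2 ip=10.0.0.100 cidr_netmask=24 op monitor interval=10s
# keep the IP on whichever node the service is healthy on
pcs constraint colocation add vip with web INFINITY
pcs constraint order web then vip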


Source: (StackOverflow)

Need a way to sync corosync.conf across the cluster

I'm facing a problem with the corosync.conf file. I have two nodes: node1 and node2. corosync.conf on node1 is different from corosync.conf on node2. I need a way to sync corosync.conf between the two nodes via a bash script. Example: if I am on node2 and call this script, it will make corosync.conf on node1 the same as corosync.conf on node2. I am not allowed to use commands like ssh, rsync, etc.
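If pcsd is running on both nodes, pcs can push the local corosync.conf to the peers over pcsd instead of ssh/rsync (a sketch under that assumption):

#!/bin/bash
# push the local corosync.conf to all nodes listed in it (goes through pcsd, not ssh)
pcs cluster sync
# ask corosync on the nodes to pick up the new configuration
pcs cluster reload corosync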


Source: (StackOverflow)

Unique idea for outgoing data for a 1:1 redundancy pacemaker cluster

I'm using Pacemaker 1.1.14-8.el6 and pcs 0.9.148,

and I'm looking for a solution to the following case:

I have an active/passive setup with 2 machines and I want to make sure outgoing data will always appear to come from the same IP,

so that the server receiving messages from my machine won't notice a failover.

I've managed to do the opposite using the ocf:heartbeat:IPaddr resource, so clients that send me messages always send them to the same address; now I'm trying to figure out whether pacemaker has an option for the outgoing direction.
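One agent worth looking at is ocf:heartbeat:IPsrcaddr, which sets the preferred source address for outgoing packets; colocated with the virtual IP it follows the failover (a sketch; the address and resource names are hypothetical):

pcs resource create vip ocf:heartbeat:IPaddr2 ip=10.0.0.100 cidr_netmask=24
pcs resource create src_addr ocf:heartbeat:IPsrcaddr ipaddress=10.0.0.100 cidr_netmask=24
# the source-address rule only makes sense on the node that currently holds the VIP
pcs constraint colocation add src_addr with vip INFINITY
pcs constraint order vip then src_addr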


Source: (StackOverflow)

Which one is the official command-line package for pacemaker: crmsh or pcs?

I am working on a Linux-HA cluster with pacemaker-1.1.10-1.el6_4.4. As you know, in this pacemaker version the cluster command-line functionality is not packaged with the pacemaker package itself. I found 2 packages, crmsh and pcs. My question is: which one is the official command-line interface? Which one is recommended? And what is the relation between them?
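They are two independent shells over the same CIB, so the same change can be expressed in either (a sketch; the resource name and address are hypothetical):

# crmsh
crm configure primitive vip ocf:heartbeat:IPaddr2 params ip=192.168.1.100 cidr_netmask=24

# pcs
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24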

thanks,
Emre


Source: (StackOverflow)

Is there any method/API to identify the master/slave node of a Linux-HA cluster?

I am setting up a Linux-HA(corosync+pacemaker) cluster which includes 2 nodes, and I defined several resources:

primitive virtual-ip   
primitive main-service   
primitive db  
clone db-clone  

My question is: can we identify which node will be the master node before pacemaker starts, or while services are loading? I mean, which node will the virtual-ip resource run on? Is there any crm command-line API or other method?
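Once pacemaker is running, these usually answer the where-is / where-would-it-go question (a sketch; virtual-ip is the resource name from the question):

# which node the resource is currently running on
crm_resource --resource virtual-ip --locate

# show the policy engine's allocation scores, i.e. where resources would be placed
crm_simulate -sL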

thanks,
Emre


Source: (StackOverflow)

Zookeeper can work with Pacemaker?

As mentioned in the title, do you think ZooKeeper can work with Pacemaker instead of Corosync? Has anyone looked into this?

Thanks for your help!


Source: (StackOverflow)

Why is heartbeat failing over from master to slave?

OS: CentOS 6.6 x64
Heartbeat Package: heartbeat-3.0.4-2.el6.x86_64

Steps:
1. Power on master and slave.
2. master: service heartbeat start
3. Wait till the master has all resources and the cluster IP address.
4. master: service network restart
5. master: service heartbeat stop
6. master: service heartbeat start
7. Wait till the master has all resources and the cluster IP address.
8. slave: service heartbeat start
Results: The slave also becomes the new master. I'm trying to understand why restarting the network causes a problem.


If I do this:
Steps:
1. Power on master and slave.
2. master: service heartbeat start
3. slave: service heartbeat start
Results: Everything works fine.

I understand heartbeat is old and obsolete, but until the customer transitions away from heartbeat, I still need to support it.
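For reference, the ha.cf settings that usually govern take-over and fail-back behaviour look roughly like this (a sketch; the values, interface and address are hypothetical, not from the original post):

# /etc/ha.d/ha.cf (excerpt)
keepalive 2        # heartbeat interval in seconds
deadtime 30        # declare the peer dead after 30s of silence
initdead 120       # allow extra time right after boot
auto_failback off  # do not pull resources back to the old master automatically
ucast eth0 10.0.0.2
node master slave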
Thanks!


Source: (StackOverflow)

corosync immediately shuts down after start

I am running a pacemaker cluster with corosync on two nodes. I had to restart node2, and after the reboot, running

service corosync start

corosync starts but then shuts itself down immediately.

After the log entry "Completed service synchronization, ready to provide service." there is an entry "Node was shut down by a signal" and the shutdown starts.

This is the complete log output:

notice  [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to 
provide service.
info    [MAIN  ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow
warning [MAIN  ] member section is deprecated.
warning [MAIN  ] Please migrate config file to nodelist.
notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
notice  [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
notice  [TOTEM ] The network interface [192.168.1.102] is now up.
notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
info    [QB    ] server name: cmap
notice  [SERV  ] Service engine loaded: corosync configuration service [1]
info    [QB    ] server name: cfg
notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
info    [QB    ] server name: cpg
notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
notice  [QUORUM] Using quorum provider corosync_votequorum
notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
info    [QB    ] server name: votequorum
notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
info    [QB    ] server name: quorum
notice  [TOTEM ] The network interface [x.x.x.3] is now up.
notice  [TOTEM ] adding new UDPU member {x.x.x.3}
notice  [TOTEM ] adding new UDPU member {x.x.x.2}
warning [TOTEM ] Incrementing problem counter for seqid 1 iface x.x.x.3 to [1 of 10]
notice  [TOTEM ] A new membership (192.168.1.102:7420) was formed. Members joined: -1062731418
notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice  [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice  [QUORUM] Members[1]: -1062731418
notice  [MAIN  ] Completed service synchronization, ready to provide service.
notice  [MAIN  ] Node was shut down by a signal
notice  [SERV  ] Unloading all Corosync service engines.
info    [QB    ] withdrawing server sockets
notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
info    [QB    ] withdrawing server sockets
notice  [SERV  ] Service engine unloaded: corosync configuration map access
info    [QB    ] withdrawing server sockets
notice  [SERV  ] Service engine unloaded: corosync configuration service
info    [QB    ] withdrawing server sockets
notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
info    [QB    ] withdrawing server sockets
notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
notice  [SERV  ] Service engine unloaded: corosync profile loading service
notice  [MAIN  ] Corosync Cluster Engine exiting normally
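Separately from the shutdown itself, the log warns that the member section is deprecated; migrating those entries to a top-level nodelist section looks roughly like this (a sketch; the addresses are hypothetical placeholders for the two ring addresses):

nodelist {
    node {
        ring0_addr: 192.168.1.101
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.1.102
        nodeid: 2
    }
}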

Source: (StackOverflow)

celery beat HA - using pacemaker?

As far as I know, celery beat is a scheduler that is considered a SPOF: if the service crashes, nothing will be scheduled or run.

My case is this: I need an HA setup with two schedulers, master/slave, where the master makes some calls periodically (let's say every 30 minutes) while the slave stays idle.

When the master crashes, the slave needs to become the master, pick up the leftover work from the dead master, and carry on the periodic tasks (leader election).

The requirements here are:

  1. The task is scheduled every 30 minutes (this can be achieved by celery beat).
  2. The task is not atomic; it's not just one call every 30 minutes that either fails or succeeds. Let's say that every 30 minutes the task makes 50 different calls. If the master finished 25 and crashed, the slave is expected to come up and finish the remaining 25 instead of going through all 50 calls again.
  3. When the dead master comes back up after the failure, it needs to realize there is already a master running. It must not come up as master; it just needs to stay idle until the running master crashes again.

Is pacemaker the right tool to achieve this, combined with celery?
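For the single-active-scheduler part (requirements 1 and 3), Pacemaker can keep exactly one celery beat instance running cluster-wide and restart it on the surviving node after a failure; requirement 2 (resuming half-finished work) still has to be handled inside the task code, for example by checkpointing progress in a shared store. A minimal sketch, assuming a systemd unit named celerybeat (hypothetical):

pcs resource create celery-beat systemd:celerybeat op monitor interval=30s
# a single (non-cloned) resource runs on exactly one node; prefer node1 but fail over freely
pcs constraint location celery-beat prefers node1=100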


Source: (StackOverflow)

Installing corosync and pacemaker on Amazon EC2 instances

I'm trying to set up an HA cluster for 2 Amazon instances. The OS of my instances is CentOS 7.

Hostnames:
master1.example.com
master2.example.com

IP internal:
10.0.0.x1
10.0.0.x2

IP public:
52.19.x.x
52.18.x.x

I'm following this tutorial: http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs

[root@master1 centos]# pcs status nodes
Pacemaker Nodes:
 Online: master1.example.com 
 Standby: 
 Offline: master2.example.com 

while my master 2 is showing the following

[root@master2 centos]# pcs status nodes
Pacemaker Nodes:
 Online: master2.example.com 
 Standby: 
 Offline: master1.example.com 

  • But they should both be online. What am I doing wrong?

  • Which IP do I have to choose as the virtual IP, given that the IPs are not in the same subnet?
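On the node-visibility question: EC2 does not support multicast between instances, so corosync has to use unicast (udpu) transport with the private addresses listed explicitly; the relevant corosync.conf pieces look roughly like this (a sketch; the addresses are hypothetical placeholders for the instances' private IPs):

totem {
    version: 2
    cluster_name: ha-ec2
    transport: udpu
}
nodelist {
    node {
        ring0_addr: 10.0.0.11
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.12
        nodeid: 2
    }
}

As for the virtual IP: an arbitrary address outside both subnets will not be routed to the instances by itself; the usual approach on EC2 is to have the cluster reassign an Elastic IP or a secondary private IP through the AWS API on failover, rather than relying on a plain IPaddr2 address.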


Source: (StackOverflow)

Unable to start Corosync Cluster Engine

I'm trying to create an HA OpenStack cluster for the controller nodes by following the OpenStack HA guide.
So I have three nodes in cluster:
controller-0
controller-1
controller-2

Set up a password for the hacluster user on each host:

[root@controller-0 ~]# yum install pacemaker pcs corosync libqb fence-agents-all resource-agents -y

Authenticated to all the nodes that should make up the cluster, using that password:

[root@controller-0 ~]# pcs cluster auth controller-0 controller-1 controller-2 -u hacluster -p password --force  
controller-2: Authorized
controller-1: Authorized
controller-0: Authorized

After that I created the cluster:

[root@controller-1 ~]# pcs cluster setup --force --name ha-controller controller-0 controller-1 controller-2
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
controller-0: Succeeded
controller-1: Succeeded
controller-2: Succeeded
Synchronizing pcsd certificates on nodes controller-0, controller-1 controller-2...
controller-2: Success
controller-1: Success
controller-0: Success
Restaring pcsd on the nodes in order to reload the certificates...
controller-2: Success
controller-1: Success
controller-0: Success

Started the cluster:

[root@controller-0 ~]# pcs cluster start --all
controller-0:
controller-2:
controller-1:

But when I start corosync, I get:

[root@controller-0 ~]# systemctl start corosync
Job for corosync.service failed because the control process exited with error code. 
See "systemctl status corosync.service" and "journalctl -xe" for details.

In the message log:

controller-0 systemd: Starting Corosync Cluster Engine...
controller-0 corosync[23538]: [MAIN  ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
controller-0 corosync[23538]: [MAIN  ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
controller-0 corosync[23539]: [TOTEM ] Initializing transport (UDP/IP Unicast).
controller-0 corosync[23539]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
controller-0 corosync: Starting Corosync Cluster Engine (corosync): [FAILED]
controller-0 systemd: corosync.service: control process exited, code=exited status=1
controller-0 systemd: Failed to start Corosync Cluster Engine.
controller-0 systemd: Unit corosync.service entered failed state.
controller-0 systemd: corosync.service failed.

My corosync config file:

[root@controller-0 ~]# cat /etc/corosync/corosync.conf    
totem {   
    version: 2    
    secauth: off    
    cluster_name: ha-controller    
    transport: udpu    
}    
nodelist {    
    node {    
        ring0_addr: controller-0    
        nodeid: 1     
    }
    node {
        ring0_addr: controller-1
        nodeid: 2
    }
    node {
        ring0_addr: controller-2
        nodeid: 3
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 3
    wait_for_all: 1
    last_man_standing: 1
    last_man_standing_window: 10000
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

Also, all hostnames are resolvable.

OS is CentOS Linux release 7.2.1511 (Core)

[root@controller-0 ~]# uname -a
Linux controller-0 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Installed versions:

pacemaker.x86_64                1.1.13-10.el7_2.2   @updates
pacemaker-cli.x86_64            1.1.13-10.el7_2.2   @updates
pacemaker-cluster-libs.x86_64   1.1.13-10.el7_2.2   @updates
pacemaker-libs.x86_64           1.1.13-10.el7_2.2   @updates
corosync.x86_64                 2.3.4-7.el7_2.1     @updates
corosynclib.x86_64              2.3.4-7.el7_2.1     @updates
libqb.x86_64                    0.17.1-2.el7.1      @updates
fence-agents-all.x86_64         4.0.11-27.el7_2.7   @updates
resource-agents.x86_64          3.9.5-54.el7_2.9    @updates
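A few ways to get more detail on why the unit fails to start (a sketch; not from the original post):

# full journal for the corosync unit since boot
journalctl -u corosync -b --no-pager

# run corosync in the foreground to see the error directly
corosync -f

# check that nothing else is already bound to the totem port
ss -uanp | grep 5405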

Source: (StackOverflow)