Pacemaker interview questions
Top frequently asked Pacemaker interview questions
Looking for possibilities/alternatives for backend HA in nginx. At the moment we are using lua-nginx, which does not support the HttpUpstream module, which would be my first choice. I know a bit about Pacemaker but have never used it, so I'm not sure whether it would be a good combination with nginx. Any hints or experience?
Source: (StackOverflow)
I have a Linux-HA based cluster (master node/slave node) and have some resources defined in Pacemaker. My question is: is there any way to use the crm command to find out the master node of this Linux-HA cluster? I mean in the time slot before all resource agents are loaded, or during resource loading?
After the resources are loaded, I think we can use crm_mon or "crm status" and grep for the resource on the master node to identify it, but I cannot figure out a way to find it before or during resource loading.
thanks,
Emre
Source: (StackOverflow)
I am working on adding high availability to our Python software via Pacemaker.
I have successfully done a POC, and now I wonder what the best way is to configure Pacemaker from the software.
In the POC I mainly used the pcs CLI, but I've looked around and saw that I can also configure the cib.xml via cibadmin.
My question is: which would be better to use?
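To illustrate the two styles being compared (resource names here are placeholders, and this is only a sketch):
pcs resource create my_app ocf:heartbeat:Dummy     # high-level, validated, per-resource commands
cibadmin --query > cib.xml                         # dump the whole CIB as XML
cibadmin --replace --xml-file cib.xml              # push an edited CIB back
In general, pcs (or crmsh) wraps and validates what ends up in the CIB, while cibadmin gives raw access to the XML.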
Source: (StackOverflow)
I have configured corosync + pacemaker and it is working well.
But I have an issue with starting corosync on boot:
It starts, the node goes online and everything is OK. Then after a while the interface goes offline, comes back online, and corosync has stopped:
May 23 15:59:48 [922] node-01 corosync notice [MAIN ] Corosync Cluster Engine ('2.3.3'): started and ready to provide service.
May 23 15:59:48 [922] node-01 corosync info [MAIN ] Corosync built-in features: dbus testagents rdma watchdog augeas pie relro bindnow
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] Initializing transport (UDP/IP Unicast).
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] The network interface [10.10.2.93] is now up.
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync configuration map access [0]
May 23 15:59:48 [922] node-01 corosync info [QB ] server name: cmap
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync configuration service [1]
May 23 15:59:48 [922] node-01 corosync info [QB ] server name: cfg
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
May 23 15:59:48 [922] node-01 corosync info [QB ] server name: cpg
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync profile loading service [4]
May 23 15:59:48 [922] node-01 corosync warning [WD ] No Watchdog, try modprobe <a watchdog>
May 23 15:59:48 [922] node-01 corosync info [WD ] no resources configured.
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync watchdog service [7]
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Using quorum provider corosync_votequorum
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
May 23 15:59:48 [922] node-01 corosync info [QB ] server name: votequorum
May 23 15:59:48 [922] node-01 corosync notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
May 23 15:59:48 [922] node-01 corosync info [QB ] server name: quorum
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] adding new UDPU member {10.10.2.88}
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] adding new UDPU member {10.10.2.93}
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] A new membership (10.10.2.93:648) was formed. Members joined: 2
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Members[1]: 2
May 23 15:59:48 [922] node-01 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 23 15:59:48 [922] node-01 corosync notice [TOTEM ] A new membership (10.10.2.88:652) was formed. Members joined: 1
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] This node is within the primary component and will provide service.
May 23 15:59:48 [922] node-01 corosync notice [QUORUM] Members[2]: 1 2
May 23 15:59:48 [922] node-01 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
May 23 16:14:43 [960] node-01 corosync debug [TOTEM ] waiting_trans_ack changed to 0
May 23 16:14:56 [960] node-01 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug [TOTEM ] sendmsg(ucast) failed (non-critical): Invalid argument (22)
May 23 16:14:56 [960] node-01 corosync debug [TOTEM ] The token was lost in the OPERATIONAL state.
May 23 16:00:01 [922] node-01 corosync notice [TOTEM ] A processor failed, forming new configuration.
May 23 16:00:01 [922] node-01 corosync notice [TOTEM ] The network interface is down.
May 23 16:00:01 [922] node-01 corosync notice [TOTEM ] adding new UDPU member {10.10.2.88}
May 23 16:00:01 [922] node-01 corosync notice [TOTEM ] adding new UDPU member {10.10.2.93}
May 23 16:00:03 [922] node-01 corosync notice [TOTEM ] The network interface [10.10.2.93] is now up.
May 23 16:00:03 [922] node-01 corosync notice [TOTEM ] adding new UDPU member {10.10.2.88}
May 23 16:00:03 [922] node-01 corosync notice [TOTEM ] adding new UDPU member {10.10.2.93}
May 23 16:00:04 [922] node-01 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
If I start the service manually, everything works fine.
corosync.conf:
totem {
    version: 2
    cluster_name: c-01
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.2.0
        broadcast: yes
        mcastport: 5405
    }
}
quorum {
    provider: corosync_votequorum
    two_node: 1
}
nodelist {
    node {
        ring0_addr: 10.10.2.88
        nodeid: 1
    }
    node {
        ring0_addr: 10.10.2.93
        nodeid: 2
    }
}
logging {
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    timestamp: on
    debug: on
}
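Given the firewall hint in the final log message above, a few standard checks can help narrow this down (generic diagnostics, assuming the totem port from the config above):
corosync-cfgtool -s              # show ring status and the address corosync bound to
ss -ulpn | grep 5405             # confirm corosync is listening on the totem port
iptables -L -n | grep 5405       # verify the firewall allows the unicast traffic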
Source: (StackOverflow)
We are installing (for the first time) pacemaker + corosync on 2 nodes (active/passive). We need to move the virtual IP from one node to the other when a certain port is not responding. We have checked all the available resource agents and it seems there is no resource agent for this.
Is this correct? If so, what are my alternatives?
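One commonly suggested pattern (only a sketch, with placeholder names) is to manage the virtual IP with ocf:heartbeat:IPaddr2 and colocate it with a resource whose monitor action checks the service behind that port, so the IP follows the service:
pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 op monitor interval=10s
pcs constraint colocation add vip with my_service INFINITY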
Source: (StackOverflow)
I am facing a problem with the corosync.conf file. I have two nodes: node1 and node2. The corosync.conf on node1 is different from the corosync.conf on node2. I need a way to sync corosync.conf between the two nodes via a bash script.
Example: if I am on node2 and call this script, it will change corosync.conf on node1 to match corosync.conf on node2. Commands like ssh, rsync, etc. are not allowed.
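If pcs and pcsd are in use, one option worth checking (a sketch; pcsd handles the transport, so whether this satisfies the no-ssh/rsync constraint depends on how strictly it is meant) is to push the local corosync.conf to the other nodes with:
pcs cluster sync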
Source: (StackOverflow)
I'm using Pacemaker 1.1.14-8.el6 and pcs 0.9.148,
and I'm looking for a solution to the following case:
I have an active/passive setup with 2 machines, and I want to make sure outgoing data will always appear to come from the same IP,
so that the server receiving messages from my machine won't notice a failover.
I've managed to do the opposite using the ocf:heartbeat:IPaddr resource, so that clients sending me messages always send them to the same address; now I'm trying to figure out whether Pacemaker has an option for the outgoing direction.
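For the outgoing direction, the resource-agents package ships ocf:heartbeat:IPsrcaddr, which sets the preferred source address for outgoing packets; a minimal sketch (the address and the vip resource name are placeholders, and the source-address resource is normally grouped or colocated with the VIP):
pcs resource create src_ip ocf:heartbeat:IPsrcaddr ipaddress=192.0.2.10
pcs constraint colocation add src_ip with vip INFINITY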
Source: (StackOverflow)
I am working on a Linux-HA cluster with pacemaker-1.1.10-1.el6_4.4. As you know, in this Pacemaker version the cluster command-line functionality is not packaged with the pacemaker package. I found two packages: crmsh and pcs. My question is: which one is the official command-line interface? Which one is recommended? And what is the relation between them?
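For context, the two tools cover the same ground with different syntax; for example (just an illustration of style, not a recommendation):
crm status                 # crmsh
crm configure show
pcs status                 # pcs
pcs config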
thanks,
Emre
Source: (StackOverflow)
I am setting up a Linux-HA (corosync + pacemaker) cluster which includes 2 nodes, and I have defined several resources:
primitive virtual-ip
primitive main-service
primitive db
clone db-clone
My question is: can we identify which node will be the master node before Pacemaker starts, or while the services are loading? I mean, which node will the virtual-ip resource be running on? Is there any crm command-line API or other method?
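Once the cluster is running (the case the question contrasts with), the placement can be queried with standard tools; a sketch, using the virtual-ip resource name from the list above:
crm_resource --resource virtual-ip --locate   # report which node currently hosts the resource
crm_simulate -sL                              # show the allocation scores the policy engine uses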
thanks,
Emre
Source: (StackOverflow)
As mentioned in the title, do you think Zookeeper can work with Pacemaker instead of Corosync? Has anyone examined this?
Thanks for your help!
Source: (StackOverflow)
OS: CentOS 6.6 x64
Heartbeat Package: heartbeat-3.0.4-2.el6.x86_64
Steps:
1. Power on master and slave.
2. master: service heartbeat start
3. wait till master has all resources and cluster ip address.
4. master: service network restart
5. master: service heartbeat stop
6. master: service heartbeat start
7. wait till master has all resources and cluster ip address.
8. slave: service heartbeat start
Results: The slave also becomes the new master. I'm trying to understand why restarting the network causes this problem.
If I do this:
Steps:
1. Power on master and slave.
2. master: service heartbeat start
3. slave: service heartbeat start
Results: Everything works fine.
I understand heartbeat is old and obsolete, but until the customer transitions away from heartbeat, I still need to support it.
Thanks!
Source: (StackOverflow)
I am running a pacemaker cluster with corosync on two nodes.
I had to restart node2, and after the reboot, when I do
service corosync start
corosync starts but shuts itself down immediately.
After the log entry "Completed service synchronization, ready to provide service." there is an entry "Node was shut down by a signal" and the shutdown starts.
This is the complete log output:
notice [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to
provide service.
info [MAIN ] Corosync built-in features: debug testagents augeas systemd pie relro bindnow
warning [MAIN ] member section is deprecated.
warning [MAIN ] Please migrate config file to nodelist.
notice [TOTEM ] Initializing transport (UDP/IP Unicast).
notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
notice [TOTEM ] Initializing transport (UDP/IP Unicast).
notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
notice [TOTEM ] The network interface [192.168.1.102] is now up.
notice [SERV ] Service engine loaded: corosync configuration map access [0]
info [QB ] server name: cmap
notice [SERV ] Service engine loaded: corosync configuration service [1]
info [QB ] server name: cfg
notice [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
info [QB ] server name: cpg
notice [SERV ] Service engine loaded: corosync profile loading service [4]
notice [QUORUM] Using quorum provider corosync_votequorum
notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
info [QB ] server name: votequorum
notice [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
info [QB ] server name: quorum
notice [TOTEM ] The network interface [x.x.x.3] is now up.
notice [TOTEM ] adding new UDPU member {x.x.x.3}
notice [TOTEM ] adding new UDPU member {x.x.x.2}
warning [TOTEM ] Incrementing problem counter for seqid 1 iface x.x.x.3 to [1 of 10]
notice [TOTEM ] A new membership (192.168.1.102:7420) was formed. Members joined: -1062731418
notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
notice [QUORUM] Members[1]: -1062731418
notice [MAIN ] Completed service synchronization, ready to provide service.
notice [MAIN ] Node was shut down by a signal
notice [SERV ] Unloading all Corosync service engines.
info [QB ] withdrawing server sockets
notice [SERV ] Service engine unloaded: corosync vote quorum service v1.0
info [QB ] withdrawing server sockets
notice [SERV ] Service engine unloaded: corosync configuration map access
info [QB ] withdrawing server sockets
notice [SERV ] Service engine unloaded: corosync configuration service
info [QB ] withdrawing server sockets
notice [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
info [QB ] withdrawing server sockets
notice [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
notice [SERV ] Service engine unloaded: corosync profile loading service
notice [MAIN ] Corosync Cluster Engine exiting normally
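For reference, the two warnings near the top of the log say the member section is deprecated and should be migrated to a nodelist. A migrated section would look roughly like the following, using the two member addresses that appear in the log (node IDs are placeholders; this is only a sketch and does not by itself explain the shutdown, since "Node was shut down by a signal" typically indicates corosync received a termination signal from some other process, e.g. the init system):
nodelist {
    node {
        ring0_addr: x.x.x.3
        nodeid: 1
    }
    node {
        ring0_addr: x.x.x.2
        nodeid: 2
    }
}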
Source: (StackOverflow)
As far as I know about Celery, celery beat is a scheduler that is considered a SPOF: if the service crashes, nothing will be scheduled or run.
My case is that I need an HA setup with two schedulers, master/slave: the master makes some calls periodically (let's say every 30 minutes) while the slave can stay idle.
When the master crashes, the slave needs to become the master, pick up the leftover work from the dead master, and carry on the periodic tasks (leader election).
The requirements here are:
- the task is scheduled every 30 minutes (this can be achieved by celery beat)
- the task is not atomic; it's not just a single call every 30 minutes that either fails or succeeds. Let's say that every 30 minutes the task makes 50 different calls. If the master finished 25 and crashed, the slave is expected to come up and finish the remaining 25, instead of going through all 50 calls again.
- when the dead master is rebooted after the failure, it needs to realize there is already a master running. It doesn't need to come up as master; it just needs to stay idle until the running master crashes again.
Is Pacemaker the right tool to achieve this, combined with Celery?
Source: (StackOverflow)
I'm trying to set up an HA cluster with 2 Amazon instances. The OS of my instances is CentOS 7.
Hostnames:
master1.example.com
master2.example.com
Internal IPs:
10.0.0.x1
10.0.0.x2
Public IPs:
52.19.x.x
52.18.x.x
I'm following this tutorial:
http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs
[root@master1 centos]# pcs status nodes
Pacemaker Nodes:
Online: master1.example.com
Standby:
Offline: master2.example.com
while my master2 shows the following:
[root@master2 centos]# pcs status nodes
Pacemaker Nodes:
Online: master2.example.com
Standby:
Offline: master1.example.com
But they should both be online.
What am I doing wrong?
Which IP do I have to choose as the virtual IP? Because the IPs are not in the same subnet.
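To see why each node only sees itself, it can help to check the corosync membership directly on both nodes (standard diagnostics, no assumptions beyond the default tooling):
pcs status corosync                   # corosync's view of the cluster members
corosync-cmapctl | grep members       # runtime membership entries
Note also that EC2 generally does not support multicast, so corosync there typically needs unicast transport (transport: udpu).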
Source: (StackOverflow)
I'm trying to create an HA OpenStack cluster for the controller nodes by following the OpenStack HA guide.
So I have three nodes in cluster:
controller-0
controller-1
controller-2
Set up a password for the hacluster user on each host.
[root@controller-0 ~]# yum install pacemaker pcs corosync libqb fence-agents-all resource-agents -y
Authenticated on all nodes that will make up the cluster, using that password:
[root@controller-0 ~]# pcs cluster auth controller-0 controller-1 controller-2 -u hacluster -p password --force
controller-2: Authorized
controller-1: Authorized
controller-0: Authorized
After that created cluster:
[root@controller-1 ~]# pcs cluster setup --force --name ha-controller controller-0 controller-1 controller-2
Redirecting to /bin/systemctl stop pacemaker.service
Redirecting to /bin/systemctl stop corosync.service
Killing any remaining services...
Removing all cluster configuration files...
controller-0: Succeeded
controller-1: Succeeded
controller-2: Succeeded
Synchronizing pcsd certificates on nodes controller-0, controller-1, controller-2...
controller-2: Success
controller-1: Success
controller-0: Success
Restarting pcsd on the nodes in order to reload the certificates...
controller-2: Success
controller-1: Success
controller-0: Success
Started cluster:
[root@controller-0 ~]# pcs cluster start --all
controller-0:
controller-2:
controller-1:
But when I start corosync, I get:
[root@controller-0 ~]# systemctl start corosync
Job for corosync.service failed because the control process exited with error code.
See "systemctl status corosync.service" and "journalctl -xe" for details.
In the message log:
controller-0 systemd: Starting Corosync Cluster Engine...
controller-0 corosync[23538]: [MAIN ] Corosync Cluster Engine ('2.3.4'): started and ready to provide service.
controller-0 corosync[23538]: [MAIN ] Corosync built-in features: dbus systemd xmlconf snmp pie relro bindnow
controller-0 corosync[23539]: [TOTEM ] Initializing transport (UDP/IP Unicast).
controller-0 corosync[23539]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: none hash: none
controller-0 corosync: Starting Corosync Cluster Engine (corosync): [FAILED]
controller-0 systemd: corosync.service: control process exited, code=exited status=1
controller-0 systemd: Failed to start Corosync Cluster Engine.
controller-0 systemd: Unit corosync.service entered failed state.
controller-0 systemd: corosync.service failed.
My corosync config file:
[root@controller-0 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: ha-controller
    transport: udpu
}
nodelist {
    node {
        ring0_addr: controller-0
        nodeid: 1
    }
    node {
        ring0_addr: controller-1
        nodeid: 2
    }
    node {
        ring0_addr: controller-2
        nodeid: 3
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 3
    wait_for_all: 1
    last_man_standing: 1
    last_man_standing_window: 10000
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
Also, all names are resolvable.
The OS is CentOS Linux release 7.2.1511 (Core).
[root@controller-0 ~]# uname -a
Linux controller-0 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Installed versions:
pacemaker.x86_64 1.1.13-10.el7_2.2 @updates
pacemaker-cli.x86_64 1.1.13-10.el7_2.2 @updates
pacemaker-cluster-libs.x86_64 1.1.13-10.el7_2.2 @updates
pacemaker-libs.x86_64 1.1.13-10.el7_2.2 @updates
corosync.x86_64 2.3.4-7.el7_2.1 @updates
corosynclib.x86_64 2.3.4-7.el7_2.1 @updates
libqb.x86_64 0.17.1-2.el7.1 @updates
fence-agents-all.x86_64 4.0.11-27.el7_2.7 @updates
resource-agents.x86_64 3.9.5-54.el7_2.9 @updates
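Since pcs cluster start --all had already been run, one basic thing to check (generic diagnostics, not a definitive diagnosis) is whether corosync/pacemaker are already up before starting corosync again by hand, and what the logs say:
pcs status
systemctl status corosync pacemaker
journalctl -xe -u corosync
tail -n 50 /var/log/cluster/corosync.log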
Source: (StackOverflow)