Thursday, September 6, 2012

Oracle VM: VM Server status not recognized by VM Manager

Hello, it has been a while since my last post here. In the meantime I had a few weeks of vacation, but when I came back there was an issue waiting for me: Oracle VM, production environment, guest VMs working correctly, but VM Manager completely unable to discover the server status (servers reported as stopped, with an error) even though the servers were running fine.

VM Manager was totally unusable and the server discovery jobs were stuck, so I had to abort them manually from VM Manager. After aborting these discovery jobs, VM Manager changed the server status from "stopped with error" to "starting". If I performed a manual rediscovery of the servers I got this error:

https://?uname?:?pwd?@10.0.0.8:8899/api/2 discover_hardware, Status: org.apache.xmlrpc.XmlRpcException: agent.utils.filelock.LockError:Lock file /var/run/ovs-agent/discover_hardware.lock failed: timeout occured.

I logged in to the server via SSH and under /var/run/ovs-agent I found:

discover_hardware.lock
monitor.lock
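
A quick way to tell whether these locks had been sitting there for a while or were being constantly recreated is to look at their timestamps; a small diagnostic aside, not something the fix depends on:

stat /var/run/ovs-agent/discover_hardware.lock
stat /var/run/ovs-agent/monitor.lock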

So it became pretty clear this was an issue with ovs-agent, the agent that handles the communication between the VM Server and VM Manager.
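
Before going any further, it can also be worth confirming that the agent is at least listening on its XML-RPC port (8899, the port shown in the discovery URL above). A minimal check from the VM Server itself, assuming the default port:

# the ovs-agent XML-RPC listener should show up here
netstat -tlnp | grep 8899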

Please note that if you can afford to stop or move your VMs, a server reboot could solve all of this without messing with PIDs and services.

If, like me, you don't have that luxury, you need to deal with PIDs and services.

The ovs-agent service is run by a script located at /etc/init.d/ovs-agent and is registered as a service that can be invoked with:

service ovs-agent [stop | start | restart | status]
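
Since Oracle VM Server is Oracle Linux based, the usual SysV tooling applies; for example, to confirm the agent is registered to start at boot (a side check, not part of the fix):

chkconfig --list ovs-agent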

So I checked ovs-agent status:

[root@orclvmsrv1 ~]# service ovs-agent status
OVSHA (PID: 6763) is running...
OVSNotificationServer (PID: 6725) is running...
OVSStat (PID: 6767) is running...
OVSAgentServer (PID: 6771) is running...
OVSLogServer (PID: 6709) is running...
OVSMonitor (PID: 6759) is running...
OVSRemaster (PID: 6747) is running...

and all services were reported as running fine, even though they clearly were not.

My next step was to open an SR with Oracle, which involved sending a lot of logs, from both the VM Server and VM Manager, to the support engineers.

One of the requested logs was the VM Server "lsof" output, which is the list of currently open files, and a support engineer made me focus on this:

[root@orclvmsrv1 ~]# lsof | grep discover_hardware.lock
python 31309 root 6u REG 104,2 0 31704 /var/run/ovs-agent/discover_hardware.lock

which is basically the PID of the (Python) process currently holding the "discover_hardware" lock.
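
If you want to go straight to the culprit, pointing lsof (or fuser) at the lock file itself should return the same PID without grepping the full output; a minimal sketch:

lsof /var/run/ovs-agent/discover_hardware.lock
# or, equivalently:
fuser -v /var/run/ovs-agent/discover_hardware.lock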

This gave me the hint to kill the processes involved with the ovs-agent services.

So I got all PIDs involved:

[root@orclvmsrv1 ~]# service ovs-agent status
OVSHA (PID: 6763) is running...
OVSNotificationServer (PID: 6725) is running...
OVSStat (PID: 6767) is running...
OVSAgentServer (PID: 6771) is running...
OVSLogServer (PID: 6709) is running...
OVSMonitor (PID: 6759) is running...
OVSRemaster (PID: 6747) is running...

and then killed them:

[root@orclvmsrv1 ~]# kill 6763
[root@orclvmsrv1 ~]# kill 6725
[root@orclvmsrv1 ~]# kill 6767
[root@orclvmsrv1 ~]# kill 6771
[root@orclvmsrv1 ~]# kill 6709
[root@orclvmsrv1 ~]# kill 6759
[root@orclvmsrv1 ~]# kill 6747

and then killed the process which was holding the lock file seen above:

[root@orclvmsrv1 ~]# kill 31309
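
As a side note, if you have more than one server to clean up, the service PIDs can be pulled out of the status output instead of typing each kill by hand; a rough one-liner, assuming the "(PID: nnnn)" format shown above:

# extract every "(PID: nnnn)" from the status output and kill those processes
service ovs-agent status | sed -n 's/.*(PID: \([0-9]*\)).*/\1/p' | xargs -r kill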

Checked status again:

[root@orclvmsrv1 ~]# service ovs-agent status
Process OVSHA with PID 6763 doesn't exist.

Deleted the .pid, .sock and .lock files under /var/run/ovs-agent:

[root@orclvmsrv1 ~]# cd /var/run/ovs-agent
[root@orclvmsrv1 ovs-agent]# rm *.pid
[root@orclvmsrv1 ovs-agent]# rm *.sock
[root@orclvmsrv1 ovs-agent]# rm *.lock
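
The same cleanup can be done from any directory, with a quick check first that no leftover process is still holding anything in there (if lsof still returns something, there is more killing to do):

# should print nothing once all the agent processes are gone
lsof +D /var/run/ovs-agent
rm -f /var/run/ovs-agent/*.pid /var/run/ovs-agent/*.sock /var/run/ovs-agent/*.lock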

Then I started ovs-agent again and everything went back to working as usual:

[root@orclvmsrv1 ~]# service ovs-agent start
Starting ovs-agent services:                     [  OK  ]
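
To double-check, you can look at the status once more and at the freshly recreated runtime files under /var/run/ovs-agent (a verification step of my own, not something Oracle support prescribed):

service ovs-agent status
ls -la /var/run/ovs-agent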

So, I'm still not able to understand why this happened or what caused this issue with ovs-agent, but this is the method I followed to solve the problem.

One last note I would like to add: dealing with ovs-agent does NOT affect the guest VMs running on the servers; they will keep running fine even if ovs-agent is stopped, restarted, or whatever else.

Suggestion: if you have other available servers in the pool, live migrate all guest VMs from the affected server(s) to the working ones and reboot the affected servers.

That's all!!