Infuriating, because it WAS working and has now decided it doesn’t want to!
We’ve had a simple Windows 2000 cluster running on two IBM Blades for over a year now. Last Friday it decided it didn’t want to speak to its neighbour over the heartbeat anymore.
Event Type: Warning
Event Source: ClusSvc
Event Category: (16)
Event ID: 1123
Date: 22/01/2010
Time: 17:37:12
User: N/A
Computer: #####01
Description:
The node lost communication with cluster node '#####02' on network 'Private'.
Event Type: Warning
Event Source: ClusSvc
Event Category: (16)
Event ID: 1135
Date: 22/01/2010
Time: 17:37:28
User: N/A
Computer: #####01
Description:
Cluster node #####02 was removed from the active cluster membership. The Clustering Service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active cluster nodes.
Event Type: Error
Event Source: ClusSvc
Event Category: (16)
Event ID: 1108
Date: 22/01/2010
Time: 17:40:39
User: N/A
Computer: #####01
Description:
The join of node #####02 to the cluster timed out and was aborted.
Nothing was changed!
We tried removing and re-adding the second node, but it failed, saying it couldn’t communicate with the first. We checked comms and it’s all OK, no failures.
We changed the IP of the second node’s heartbeat, uninstalled Cluster services and re-installed, which oddly worked.
Then literally four hours later the node dropped out again, and the same ‘fix’ as before didn’t work.
Any ideas at all on what this could be? A failing NIC perhaps? There are no errors to that effect, and the Cisco switch in the Blade Chassis doesn’t report any comms errors, so it’ll be hard to diagnose for sure!
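In case the timing helps anyone, here is a rough sketch (Python, purely illustrative) that works out the gaps between the three ClusSvc events quoted in this post. The `events` list only restates the IDs and timestamps from the log entries; the function name `gaps_in_seconds` is made up for this example, and I'm assuming the dates are dd/mm/yyyy on a 24-hour clock.

```python
from datetime import datetime

# Timestamps copied from the three ClusSvc events in this post
# (assumed format: dd/mm/yyyy, 24-hour clock).
events = [
    (1123, "22/01/2010 17:37:12"),  # lost communication on network 'Private'
    (1135, "22/01/2010 17:37:28"),  # node removed from active cluster membership
    (1108, "22/01/2010 17:40:39"),  # join timed out and was aborted
]

def gaps_in_seconds(events):
    """Return (earlier_id, later_id, seconds) for each consecutive pair of events."""
    parsed = [(eid, datetime.strptime(ts, "%d/%m/%Y %H:%M:%S")) for eid, ts in events]
    return [
        (a_id, b_id, int((b_time - a_time).total_seconds()))
        for (a_id, a_time), (b_id, b_time) in zip(parsed, parsed[1:])
    ]

for earlier, later, seconds in gaps_in_seconds(events):
    print(f"Event {earlier} -> {later}: {seconds} s")
```

So the node was dropped from membership 16 seconds after the heartbeat warning, and the rejoin attempt was aborted a little over three minutes after that.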