This may fall outside the realm of what you support, as we're using
AppCenter on Windows 2003 guest servers under VMware ESX.
If you don't mind reading further, here's the problem in a nutshell:
We are now seeing that an NLB cluster member cannot ping the main NIC of
any member of a _different_ NLB cluster that sits on the other ESX host
(the host that houses the first server's own cluster partner).
Example: You have six servers, 1, 2, 3, 4, 5 and 6. They are clustered
into three clusters like so:
1-2
3-4
5-6
All the odds are on ESX01, all the evens are on ESX02.
1 cannot ping 4 or 6.
3 cannot ping 2 or 6.
5 cannot ping 2 or 4.
And vice versa.
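
To make the pattern explicit, here is a rough Python sketch that encodes the layout above and lists every failing pair. The rule in ping_fails() is only my reading of the symptoms (different ESX host AND different cluster), not anything NLB or ESX documents, and the names SERVERS and ping_fails are made up for this illustration.

```python
from itertools import permutations

# Layout from the example: server number -> (cluster, ESX host).
SERVERS = {
    1: ("1-2", "ESX01"),
    2: ("1-2", "ESX02"),
    3: ("3-4", "ESX01"),
    4: ("3-4", "ESX02"),
    5: ("5-6", "ESX01"),
    6: ("5-6", "ESX02"),
}


def ping_fails(src, dst):
    """Assumed rule, inferred from the observed failures only:
    a ping fails when the two servers are on different ESX hosts
    and belong to different clusters."""
    src_cluster, src_host = SERVERS[src]
    dst_cluster, dst_host = SERVERS[dst]
    return src_host != dst_host and src_cluster != dst_cluster


# Every ordered (source, destination) pair that matches the rule.
failing = sorted(
    (src, dst) for src, dst in permutations(SERVERS, 2) if ping_fails(src, dst)
)

for src, dst in failing:
    print(f"{src} cannot ping {dst}")
```

The output covers the three lines listed above plus the "vice versa" direction, twelve ordered pairs in total. Pairs on the same ESX host (for example 1 and 3) never show up.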
I did a network capture: when 1 tries to ping 4 or 6, there is no ARP
reply.
I would expect that if 1 pinged 2, or 3 pinged 4, but why is there no
reply from the other cluster members on the opposing ESX?
If I disable NLB on 5 and 6 and reboot them, they are able to ping 1
through 4 with no problem. Re-enable NLB and reboot, and they go back to
their misbehavior.
---------------------------------
Running a test vbs instantiation script on any of the clusters against any
other cluster, calling a local proxy pointed at said cluster, I can see
that all requests against the cluster name only go to the member on the
same ESX that you're already on. If you take the cluster member that's on
the same ESX as you and set it offline, you get 100% failure in
instantiation requests, even though the NLB IP always responds to ping.
If you haven't ever looked into the ESX environment, it essentially
handles all network communication via a 32 port virtual hub. So I'm
wondering if we're essentially running 3 servers on each hub (odds on one
hub, evens on the other), with both hubs plugged into a switch. We make
three clusters across the hubs (1-2, 3-4, 5-6), and hub-to-hub traffic
over the main NLB NIC is being denied for some reason, even though the
virtual MAC that 3-4 have is different from the one 1-2 have. So if I ping
4 from 1, why don't I get an ARP reply?

