We recently patched two of our AC2K server clusters with W2K3 SP2. The
administrator who patched them found that afterward the clusters were
broken: the child nodes could not talk to the controllers, and vice versa.
After he applied the SP and saw that the clusters were broken, he rolled it
back and applied only the relevant patches. Alas, the damage was done. The
clusters have not been the same since.
I have not been able to get the controllers to talk to the child nodes
successfully since. On one cluster, we reinstalled AC2K with some success,
but it's still throwing sync errors. On the other cluster, the child node
was hopelessly trashed. He attempted a restore, only to find that the
restore points on the server had not been backed up successfully, so we
wound up rebuilding that server. That server is also throwing sync errors.
Both clusters had one server each that appeared to patch successfully.
So now we have two broken AC2K clusters. Neither cluster can fully
synchronize, and, to boot, the application monitors on both clusters show
no activity for any selected counter.
It appears to be a security issue, but I can't see where. We applied
hotfix QFE 891330 to each server, but that did not help. Certificates were
also a problem, but I managed to work through that issue.
There is not enough information on the web to work through this problem. I
will probably open a call on this, but in my past experience, calling MS
about AC2K has yielded only the stock answers: reinstall or rebuild. If I
have to rebuild, I will not be installing AC2K; I'll revert to the pre-AC2K
days of Robocopy and a lot of manual management. At least that's reliable.
Microsoft's answer, to use SMS and MOM, is no better for us than doing it
manually, since we mainly deploy fixes and updates to sites, which do not
warrant creating packages, and there is no good replacement for the AC2K
interface.
Sorry for the venting, but I've been bludgeoned by AC2K too many times in
the past 4 years to put up with it much longer. It's as fragile as an egg,
and we're constantly fearful of patching. I could copy the errors we're
getting, but this post would go on for several hundred lines.
If anyone out there has experienced a catastrophic failure similar to this
after applying W2K3 SP2 to their clusters and managed to resolve it without
a rebuild, I'd surely like to hear from you.