Q: How does the fabric know that an instance has failed, and what actions does it take to recover that instance?
A: There is a series of heartbeat probes between the fabric and the instance: Fabric <-> Host Agent <-> Guest Agent (WaAppAgent.exe) <-> Host Bootstrapper (WaHostBootstrapper.exe) <-> Host Process (typically WaIISHost.exe or WaWorkerHost.exe).
- If the Fabric <-> Host Agent probe fails then the fabric will attempt to restart the host. Heuristics in the fabric determine what to do with that host if a restart fails to resolve the problem, taking progressively more aggressive actions until the fabric may ultimately determine that the server itself is bad, at which point it will create a new host on a new server and start all of the affected guest VMs on that new host.
- If the Host Agent <-> Guest Agent probe fails then the host will attempt to restart the guest OS, again with a set of heuristics for taking additional actions, including attempting to start that guest VM on a new server. If the Host <-> Guest probe succeeds then the fabric takes no further action on that instance, and any further recovery is handled by the guest agent within the VM.
- The only recovery action that the
guest agent will take is to restart the host stack (WaHostBootstrapper
and all of its children) if one of the child processes crashes. If
the probe times out then the guest agent assumes the host process is busy
working and lets it continue running indefinitely. The guest
agent will not restart the VM as part of a recovery process.
See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx
for more information about the processes and probes on the Guest OS.
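That guest-agent rule is simple enough to sketch. The following is a conceptual model only, with stubbed helper methods and an arbitrary polling interval; the real agent (WaAppAgent.exe) is internal and undocumented:

// Conceptual sketch of the guest agent's recovery rule described above.
// All names and the 15-second interval are illustrative, not the real agent.
using System;
using System.Threading;

class GuestAgentRecoverySketch
{
    static bool ChildProcessCrashed() { return false; }   // stub: watch WaHostBootstrapper's children
    static bool ProbeReturnedInTime() { return true; }    // stub: heartbeat probe to the host process

    static void Main()
    {
        while (true)
        {
            if (ChildProcessCrashed())
            {
                // A crashed child process is the one case that triggers
                // recovery: restart WaHostBootstrapper and all of its children.
                Console.WriteLine("Restarting host stack...");
            }
            else if (!ProbeReturnedInTime())
            {
                // A probe timeout is treated as "busy working": the host
                // process is left running indefinitely, and the VM is
                // never restarted as part of recovery.
            }
            Thread.Sleep(TimeSpan.FromSeconds(15));
        }
    }
}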
Q: How does the load balancer know when an instance is unhealthy?
A: There are two different mechanisms the load balancer can use to determine instance health and to decide whether or not to include that instance in the round-robin rotation and send new traffic to it.
- The default mechanism is that the load balancer sends probes to the Guest Agent to request the instance's health. If the Guest Agent returns anything besides 'Ready' then the load balancer will mark that instance as unhealthy and remove it from the rotation. Looking back at the heartbeats from the guest agent to the host process, this means that if any of the processes running in the Guest OS has crashed or hung then the guest agent will not return Ready and the instance will be removed from the LB rotation. You can also make the guest agent report a non-Ready status from your own code, as sketched after this list.
- The other mechanism is for you to define a custom LoadBalancerProbe in your service definition. A LoadBalancerProbe gives you much more control over how the load balancer determines instance health and allows you to more accurately reflect the status of your service, in particular the health of w3wp.exe and of any external dependencies your service has. Make sure your probe path is not a simple HTML page, but actually includes logic to determine your service health (e.g. try to connect to your SQL database); a probe definition and handler sketch follow this list.
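For the default mechanism, you can influence the status the guest agent reports by handling the RoleEnvironment.StatusCheck event. A minimal sketch, where IsServiceHealthy is a placeholder for your own logic:

// Report Busy to the guest agent so the load balancer takes this
// instance out of rotation until a later status check reports Ready.
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        RoleEnvironment.StatusCheck += (sender, e) =>
        {
            if (!IsServiceHealthy())
            {
                e.SetBusy();   // anything besides Ready drops the instance from rotation
            }
        };
        return base.OnStart();
    }

    private static bool IsServiceHealthy()
    {
        return true;   // placeholder: your real health logic goes here
    }
}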
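For the custom mechanism, the probe is declared in ServiceDefinition.csdef and referenced from the endpoint. A sketch with illustrative names, port, and path:

<ServiceDefinition name="MyService" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition">
  <LoadBalancerProbes>
    <!-- The LB calls this path every intervalInSeconds; anything other
         than a 200 within timeoutInSeconds drops the instance from rotation. -->
    <LoadBalancerProbe name="HealthProbe" protocol="http" path="/HealthCheck.ashx"
                       intervalInSeconds="15" timeoutInSeconds="31" />
  </LoadBalancerProbes>
  <WebRole name="MyWebRole">
    <Endpoints>
      <InputEndpoint name="HttpIn" protocol="http" port="80" loadBalancerProbe="HealthProbe" />
    </Endpoints>
  </WebRole>
</ServiceDefinition>

And a matching probe target that exercises a real dependency instead of serving static content (the connection string is a placeholder):

// HealthCheck.ashx.cs -- sketch of a probe handler with real health logic.
using System;
using System.Data.SqlClient;
using System.Web;

public class HealthCheck : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        try
        {
            // Exercise a critical dependency, e.g. the SQL database.
            using (var conn = new SqlConnection("<your connection string>"))
            {
                conn.Open();
            }
            context.Response.StatusCode = 200;   // healthy: stay in rotation
        }
        catch (Exception)
        {
            context.Response.StatusCode = 503;   // unhealthy: drop from rotation
        }
    }

    public bool IsReusable
    {
        get { return true; }
    }
}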
Q: What does the load balancer do when an instance is detected as unhealthy?
A: The load balancer will route new incoming TCP connections
to instances which are in rotation. The instances that are in rotation
are either:
- Returning a 'Ready' state from the guest agent, for roles which do not have a LoadBalancerProbe.
- Returning an HTTP 200 (for HTTP probes) or a TCP ACK (for TCP probes) in response to the custom LoadBalancerProbe.
If an instance drops out of rotation, the load balancer
will not terminate any existing TCP connections. So if the client and
server maintain the TCP connection then traffic on that connection will still
be sent to the instance which has dropped out of rotation, but no new TCP
connections will be sent to that instance. If the TCP connection is broken by the server (e.g. the VM restarts or the process holding the TCP connection crashes) then the client should retry the connection, at which time the load balancer will see it as a new TCP connection and route it to an instance which is in rotation.
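From the client's perspective this means connection failures should be retried rather than surfaced immediately, since each retry opens a new TCP connection that the load balancer routes to a healthy instance. A minimal sketch (the retry policy is illustrative):

// Retry a GET when the connection breaks; each retry is a new TCP
// connection, which the load balancer routes to an instance in rotation.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class RetryClient
{
    static async Task<string> GetWithRetryAsync(string url, int maxAttempts = 3)
    {
        using (var client = new HttpClient())
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return await client.GetStringAsync(url);
                }
                catch (HttpRequestException) when (attempt < maxAttempts)
                {
                    // Brief backoff before opening a new connection.
                    await Task.Delay(TimeSpan.FromSeconds(attempt));
                }
            }
        }
    }
}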
Note that for single-instance deployments the load balancer considers that instance to always be in rotation, so it will send traffic to that instance regardless of the instance's status.
Q: How can you determine if a role instance was recycled or moved to a new server?
A: There is no direct way to know if an instance was recycled. Fabric-initiated restarts (e.g. for OS updates) will raise the Stopping/OnStop events, but for unexpected shutdowns you will not receive these events. There are some strategies to detect these restarts:
- The most common way to achieve this is to write a log entry in the RoleEntryPoint.OnStart method (a sketch follows this list). If you unexpectedly see an instance of this log then you know a role instance was recycled and you can look at various pieces of evidence to determine why.
- If an instance is moved to a new VM/server then the Changing/Changed events will be raised on all other roles and instances with a change of type RoleEnvironmentTopologyChange (also sketched after this list). Note that this will only happen if you have an InternalEndpoint defined. Also note that an InternalEndpoint is implicitly defined for you if you have enabled RDP.
- See http://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx
for information about determining when an instance is restarted due to OS
updates.
- The guest agent logs (see the Role Architecture blog post for the log file location) will contain evidence of all restarts, both planned and unplanned, but they are internal, undocumented logs and interpreting them is not trivial. If you are following the first strategy above and know the timestamp for when your role restarted, you can focus on a specific timeframe in the agent logs.
- The host bootstrapper logs (see the Role Architecture blog post for the log file location) will tell you if a startup task or host process failed and caused the guest agent to recycle the instance.
- The state of the drives on the
guest OS can provide information about what happened. See http://blogs.msdn.com/b/kwill/archive/2012/10/05/windows-azure-disk-partition-preservation.aspx.
- If the above doesn't help, the
support team can help investigate through a support incident.
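For the first strategy, a minimal sketch of an OnStart marker log (the log location is illustrative; in practice write to a LocalResource or your diagnostics store):

// Write a marker from OnStart so an unexpected entry reveals a recycle.
using System;
using System.IO;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        File.AppendAllText(
            Path.Combine(Path.GetTempPath(), "onstart.log"),   // illustrative path
            string.Format("OnStart at {0:o} on {1}{2}",
                DateTime.UtcNow,
                RoleEnvironment.CurrentRoleInstance.Id,
                Environment.NewLine));
        return base.OnStart();
    }
}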
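For the second strategy, detecting the topology change from the surviving instances looks roughly like this:

// Watch for RoleEnvironmentTopologyChange on the other instances.
// Requires an InternalEndpoint (implicit if RDP is enabled).
using System;
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class TopologyWatcher
{
    public static void Start()
    {
        RoleEnvironment.Changed += (sender, e) =>
        {
            if (e.Changes.OfType<RoleEnvironmentTopologyChange>().Any())
            {
                // An instance was added, removed, or moved to a new
                // VM/server; log the time for later correlation.
                Console.WriteLine("Topology change at {0:o}", DateTime.UtcNow);
            }
        };
    }
}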