Monday, October 16th, 2017

iLab North America RFO: Monday, October 9 Incident

Agilent Reason For Outage (RFO) - iLab Service Disruption on October 9, 2017

Start Date: Monday, October 9, 2017, 12:28 PM UTC

RFO Posted On: Monday, October 16, 2017, 3:12 PM UTC

Event Type: Unplanned Event

Subject: Network interruption

  • Summary

    • On Monday, October 9, at approximately 8:30 AM EDT, a key component (the gateway server) of the iLab SaaS networking infrastructure became unavailable, which blocked all network traffic to the SaaS environment. The incident was resolved at approximately 9:20 AM EDT, when all connectivity was restored.
  • What Happened?

    • As part of a regular process, our patching system applied operating-system-level upgrades to this gateway server and restarted it. After the restart, the firewall that protects all network traffic did not start automatically, so no network traffic could flow into the iLab SaaS environment. This was the first time the gateway server had been restarted in the production environment since the firewall was implemented.
    • The OS upgrades should not have occurred during business hours, and even so, the firewall should have started automatically after the restart.
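As an illustration of the failure mode described above, a post-restart check along the following lines could confirm that a firewall service is both configured to start at boot and actually running. This is only a sketch: it assumes a systemd-based host, and the unit name `example-firewall.service` and the helper names are hypothetical, not details of the iLab environment.

```python
import subprocess
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (output is discarded)."""
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def firewall_ready(unit: str, run: Callable[[Sequence[str]], int] = _run) -> bool:
    """Return True only if `unit` is set to start at boot AND is running now.

    Assumes systemd: `systemctl is-enabled` and `systemctl is-active`
    both exit with status 0 on success.
    """
    enabled = run(["systemctl", "is-enabled", "--quiet", unit]) == 0
    active = run(["systemctl", "is-active", "--quiet", unit]) == 0
    return enabled and active


if __name__ == "__main__":
    # Hypothetical unit name, used for illustration only.
    unit_name = "example-firewall.service"
    print(f"{unit_name} ready: {firewall_ready(unit_name)}")
```

Checking both conditions matters here: a service can be running today yet not enabled at boot, which is the situation that only surfaces on the first restart.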
  • What Are We Doing About This? - Agilent’s Site Reliability Engineering team, Networking team and Infrastructure management team have taken the following steps:

    • Reinforcing the policy that future updates are applied only during scheduled maintenance windows (~2:00 AM EDT on Sunday mornings) and never during business hours – COMPLETE, added extra monitoring and protections on October 10
    • Because this was the first time the server had been restarted with the firewall in place, the configuration to start the firewall on server restart had not been implemented or tested. Task: confirm that the firewall service starts automatically on the rare occasions when the gateway server needs to be restarted – COMPLETE, implemented firewall start as part of server start-up on October 10
    • Re-evaluating communication methods about incidents (e.g., including the iLab User Group distribution list) - IN PROGRESS
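
The maintenance-window policy in the first item above could be enforced by a guard that a patching job consults before applying updates. The sketch below is hypothetical: the RFO states only that windows begin around 2:00 AM EDT on Sunday mornings, so the two-hour window length, the `America/New_York` time zone name, and the function name are assumptions made for illustration.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")  # assumed zone for "EDT"
WINDOW_START = time(2, 0)               # ~2:00 AM, per the policy
WINDOW_END = time(4, 0)                 # assumed two-hour window
SUNDAY = 6                              # datetime.weekday(): Monday is 0


def in_maintenance_window(now_utc: datetime) -> bool:
    """Return True if a timezone-aware instant falls inside the Sunday window."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    local = now_utc.astimezone(EASTERN)
    return local.weekday() == SUNDAY and WINDOW_START <= local.time() < WINDOW_END


if __name__ == "__main__":
    # The incident began on Monday, October 9, 2017 at 12:28 PM UTC.
    incident_start = datetime(2017, 10, 9, 12, 28, tzinfo=timezone.utc)
    print(in_maintenance_window(incident_start))
```

A patching job that refuses to run when this guard returns False would not have restarted the gateway server at the time of the incident.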

We sincerely apologize for this incident; it did not meet our expectations for service delivery. As always, please visit this status site for the latest status and/or join the iLab User Group mailing list. Please e-mail to join that mailing list.