Downtime Guidelines


As described in the Service Level Agreement, sites and ROCs are supposed to comply with minimum availability and reliability metrics.

When a given site fails a monitoring test, it creates a period of unavailability and unreliability that counts against the site and the ROC. When a
scheduled downtime is declared, the site and ROC reliabilities are not affected. Please refer to the SLA document for further details and the actual numbers.

The points that should be clear are:
  1. whenever a site is consistently failing tests, it should be on downtime;
  2. whenever a site is on downtime, it should maximize the scheduled downtime / unscheduled downtime ratio.

The remainder of this page documents some guidelines that will help achieve point 2.

A downtime is considered scheduled when it is declared at least 24 hours before it begins. Site admins are therefore encouraged to declare downtimes for system updates, hardware upgrades and similar events in advance.
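
The 24-hour rule can be expressed as a simple check. The following is a minimal Python sketch for illustration only; the names `is_scheduled` and `MIN_NOTICE` are assumptions made here and do not belong to any downtime declaration tool.

```python
from datetime import datetime, timedelta

# Minimum advance notice for a downtime to count as scheduled (the 24-hour rule).
MIN_NOTICE = timedelta(hours=24)


def is_scheduled(declared_at: datetime, starts_at: datetime) -> bool:
    """Return True if the downtime was declared at least 24 hours before it starts."""
    return starts_at - declared_at >= MIN_NOTICE


declared = datetime(2024, 1, 10, 9, 0)
# Declared 48 hours ahead, e.g. a planned hardware upgrade.
print(is_scheduled(declared, datetime(2024, 1, 12, 9, 0)))
# Declared at the moment it starts, e.g. after a hardware failure.
print(is_scheduled(declared, datetime(2024, 1, 10, 9, 0)))
```

The first call corresponds to a downtime announced in advance, the second to one that starts immediately and is therefore unscheduled.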

When an unforeseen event, such as a hardware failure, takes place, there are still ways to minimize the reliability impact.
As soon as a problem is identified, site admins should declare two downtimes:
  • a 24-hour downtime (which will be unscheduled) starting immediately;
  • another 24-hour downtime (which will be scheduled) starting as soon as the first one ends.

If the problem is not solved by the next day, the site admin should create another downtime starting right after the last one ends. This step should be repeated until the problem is solved.

Remarks:
  • Scheduled downtimes may be created with a longer duration if it is known in advance that the problem will take longer than 24 hours to solve, or if the downtime would otherwise end on a non-working day.
  • One may create the first (unscheduled) downtime with a slightly longer duration (say 24 hours and 15 minutes), to make sure the second one is declared more than 24 hours before it begins.
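
The procedure for unforeseen events can be sketched as a small planning helper. This is a hedged Python illustration, not an actual declaration tool: the function names `plan_initial_downtimes` and `extend`, the tuple layout and the 15-minute default margin (taken from the remark above) are assumptions made for this example.

```python
from datetime import datetime, timedelta

MIN_NOTICE = timedelta(hours=24)   # notice required for a scheduled downtime
MARGIN = timedelta(minutes=15)     # safety margin suggested in the remarks


def plan_initial_downtimes(detected_at, margin=MARGIN):
    """Return the two (start, end, kind) windows to declare when a problem is found."""
    first_end = detected_at + MIN_NOTICE + margin
    second_end = first_end + timedelta(hours=24)
    return [
        (detected_at, first_end, "unscheduled"),  # starts immediately
        (first_end, second_end, "scheduled"),     # declared more than 24 h ahead
    ]


def extend(windows, declared_at, duration=timedelta(hours=24)):
    """Append a follow-up window starting right after the last one ends."""
    last_end = windows[-1][1]
    # The follow-up only counts as scheduled if declared early enough.
    kind = "scheduled" if last_end - declared_at >= MIN_NOTICE else "unscheduled"
    return windows + [(last_end, last_end + duration, kind)]


detected = datetime(2024, 1, 10, 9, 0)
windows = plan_initial_downtimes(detected)
# Problem still open the next morning: declare one more day.
windows = extend(windows, declared_at=datetime(2024, 1, 11, 9, 0))
for start, end, kind in windows:
    print(start, end, kind)
```

In this sketch only the first window is unscheduled, provided each follow-up is declared at least 24 hours before the previous window ends.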

By following these guidelines, one can ensure that, no matter how long a problem takes to solve, the associated reliability impact will not exceed 24 hours (plus any small margin added to the first downtime).