Incidents/2017-06-13 ORES

From Wikitech

Summary

ORES had an intermittent outage from 1600 to 1940 UTC on June 13, 2017. The issue was traced to scb1001.eqiad.wmnet.

Timeline

See https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1497366649350&to=1497383640251&panelId=2&fullscreen

  • 1600 UTC: Error rate for ORES begins to rise. This goes unnoticed because icinga sends no alerts.
  • 1700 UTC: Deployment for task T167223 begins
  • 1715 UTC: During canary check, error rate is noted and task T167819 is created with "Unbreak now"
  • 1740 UTC: The problem is determined to be independent of the deploy. The decision is made to continue with the deploy.
  • 1816 UTC: Ops is pulled in (mutante responds). Rollback of deploy is considered but rejected.
  • 1828 UTC: Problem is narrowed down to scb1001 specifically. Logs show no errors despite intermittent 500s.
  • 1923 UTC: mutante notes that PDF rendering is using a lot of CPU and kills the process.
  • 1940 UTC: Recovery confirmed.

Conclusions

  • icinga did not alert us to the issue.
  • For an unknown reason, the errors were not being written to app.log.
  • There appears to have been a resource conflict with PDF rendering.
  • Memory was very tight on SCB for the duration of the outage.

Actionables

  • task T167830 -- "Extend icinga check to catch 500 errors like those of the 20170613 incident"
  • task T146664 -- "Limit resources used by ORES"; also move ORES to dedicated hardware (see task T157222)
  • Limit resources used by the pdfrender service: task T167834
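
As a rough illustration of the first actionable, a check that reacts to 500 errors could classify a sample of HTTP status codes by the share of 5xx responses. This is only a sketch, not the check tracked in T167830: the function name check_error_rate and the warn/crit thresholds are hypothetical, and how the status codes are sampled from ORES is left out. The return codes follow the standard Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Iterable, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_error_rate(status_codes: Iterable[int],
                     warn: float = 0.01,
                     crit: float = 0.05) -> Tuple[int, str]:
    """Classify a sample of HTTP status codes by its 5xx error rate.

    The thresholds are illustrative assumptions, not values from the
    incident report.
    """
    codes = list(status_codes)
    if not codes:
        # Without any sampled responses the state cannot be determined.
        return UNKNOWN, "UNKNOWN: no responses sampled"

    errors = sum(1 for code in codes if 500 <= code <= 599)
    rate = errors / len(codes)
    detail = f"{errors}/{len(codes)} responses were 5xx ({rate:.1%})"

    if rate >= crit:
        return CRITICAL, f"CRITICAL: {detail}"
    if rate >= warn:
        return WARNING, f"WARNING: {detail}"
    return OK, f"OK: {detail}"
```

A wrapper script would print the message and exit with the returned code. Because the 500s in this incident were intermittent, a check that probes a single request could miss them; judging a sample of responses against a rate threshold is one way to catch that pattern.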