We sincerely apologize for the disruption caused by two service outages over the weekend.
Incident 1
Time: Saturday, June 6, from 07:07 to 07:50 UTC
Impact: Our API and Dashboard were unavailable. Job submission and job processing were affected across the platform.
What happened: Core services stopped responding correctly due to elevated database load caused by long-running job-listing queries.
Mitigation: We restarted affected services and terminated the long-running database queries to reduce load and restore service. We then continued monitoring while investigating the underlying cause.
Incident 2
Time: Sunday, June 7, from 00:13 to 03:40 UTC
Impact: As in Incident 1, our API and Dashboard were unavailable, and job submission and job processing were affected across the platform.
What happened: Database load increased again from a similar query path, causing core services to stop responding normally.
Mitigation and resolution: We terminated the long-running queries again and implemented an additional fix to reduce load from this query path. Services recovered, and we have not observed the issue recur since.
Follow-up actions
We are applying additional improvements to reduce the likelihood and impact of similar incidents:
- Improve database query performance
- Add stronger monitoring and alerting for long-running queries and elevated database load
- Improve incident response procedures to reduce time to mitigation
- Review platform resilience so that core services remain stable under database pressure
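To illustrate the kind of check behind the long-running-query alerting item above, here is a minimal sketch. It is not our production monitoring: the `ActiveQuery` record shape, the `find_long_running` helper, the 30-second threshold, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ActiveQuery:
    # Hypothetical record shape; a real check would populate this from
    # the database's own view of currently running queries.
    query_id: int
    started_at: datetime
    statement: str


def find_long_running(queries, now, threshold=timedelta(seconds=30)):
    """Return queries running longer than `threshold`, oldest first."""
    overdue = [q for q in queries if now - q.started_at > threshold]
    return sorted(overdue, key=lambda q: q.started_at)


# Illustrative data only; timestamps and statements are made up.
now = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample = [
    ActiveQuery(1, now - timedelta(seconds=5), "SELECT 1"),
    ActiveQuery(2, now - timedelta(seconds=90), "SELECT ... FROM jobs"),
    ActiveQuery(3, now - timedelta(seconds=45), "SELECT ... FROM jobs"),
]

flagged = find_long_running(sample, now)
for q in flagged:
    age = int((now - q.started_at).total_seconds())
    print(f"query {q.query_id} has been running for {age}s")
```

A check like this would run on a schedule and raise an alert when the flagged list is non-empty, so that long-running queries are caught before database load affects core services.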
Current status
The platform is operating normally. No action is required from users.
We sincerely apologize again for the inconvenience this caused.