On November 8th, 2024, from 4:06 PT to 4:41 PT, a large portion of ChatGPT traffic failed with 503 response codes. For most of the incident, from 4:06 PT to 4:32 PT, this affected requests from the web client, OS-specific application clients, and iOS devices. The Assistants API also experienced errors.
The root cause was a change to the load-balancing configuration of a downstream service that ChatGPT depends on. This change activated a latent bug in the logic that applies the config, which rapidly increased the server workers’ memory usage and caused all of them to crash.
Once we identified the root cause, we mitigated the outage by rolling back the configuration change. By 4:32 PT the errors had almost completely subsided, and service was fully recovered by 4:41 PT.
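For readers who want a concrete picture of this failure mode, the following is a minimal sketch of a staged config rollout with a memory guard and automatic rollback. It is illustrative only and does not describe our actual configuration delivery system: the `Worker` class, the `memory_probe` hook, the `staged_rollout` function, and the `backends` config key are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Worker:
    """Hypothetical stand-in for a server worker that holds a config."""
    name: str
    config: Dict[str, int]
    # Hypothetical hook: reports memory use (MiB) under a given config.
    memory_probe: Callable[[Dict[str, int]], float]

    def memory_mib(self) -> float:
        return self.memory_probe(self.config)


def staged_rollout(
    workers: List[Worker],
    new_config: Dict[str, int],
    memory_limit_mib: float,
    canary_count: int = 1,
) -> bool:
    """Apply new_config to a few canary workers first.

    Returns True if the config reached every worker, or False if the
    canaries exceeded the memory limit and were rolled back.
    """
    # Remember each worker's current config so a rollback is possible.
    previous = {w.name: dict(w.config) for w in workers}
    canaries, rest = workers[:canary_count], workers[canary_count:]

    for w in canaries:
        w.config = dict(new_config)

    # Guard: if any canary uses too much memory, restore the old config
    # on the canaries and leave the remaining workers untouched.
    if any(w.memory_mib() > memory_limit_mib for w in canaries):
        for w in canaries:
            w.config = previous[w.name]
        return False

    for w in rest:
        w.config = dict(new_config)
    return True


# Usage: a toy probe in which memory grows with the number of backends.
def toy_probe(config: Dict[str, int]) -> float:
    return 100.0 * config["backends"]


fleet = [Worker(f"w{i}", {"backends": 2}, toy_probe) for i in range(4)]
rolled_out = staged_rollout(fleet, {"backends": 50}, memory_limit_mib=1024.0)
```

In this toy run the canary exceeds the limit, so `rolled_out` is `False` and every worker keeps its original config. The point of the sketch is that a bad config is exercised on a small subset before it can reach all workers at once.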
To keep the system safe in the short term, we have already done the following:
In the coming weeks, we will significantly refactor our configuration delivery systems to prevent this class of outage from happening again:
We know that extended API outages affect our customers’ products and business, and outages of this magnitude are particularly damaging. While we came up short here, we are committed to preventing such incidents in the future and improving our service reliability.