Aug 18, 2017
2017 Production Outage Updates
Our primary production environment is currently hosted on Amazon Web Services (AWS). As we’ve taken on more system load, we have had to scale up these services. On August 16th and 17th, we started hitting Amazon’s IOPS burst quota, which in turn caused our core matching engine to pause. When our matching engine paused, we were forced to bring down the rest of our system to diagnose the problem. These disruptions lasted from approximately 12:18 PM EDT to 9:30 PM EDT on August 16th and from 12:40 PM EDT to 3:30 PM EDT on August 17th.
The matching engine component of the Gemini platform was the only component affected by this infrastructure failure. When we brought down the entire system for maintenance, we temporarily suspended deposit and withdrawal processing. At no time were customers’ funds or accounts at risk, and there was no security impact on our online and offline digital asset storage systems. To remedy this issue, we’ve increased our Amazon IOPS burst quota by over 100x and scaled up the machines that we have running the core matching engine.
This is not the first scaling challenge we’ve encountered, and it won’t be the last — but usually we’re able to resolve these sorts of issues behind the scenes. We already have a system in place to monitor for degraded exchange performance and alert our Site Reliability team so they can remediate the problem before it affects our customers. We’ve upgraded to a larger and different instance type that supports a higher IOPS burst quota and allows us to monitor it. And we’re continuing to improve our performance and infrastructure monitoring so we can anticipate potential problems more quickly in the future.
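As a rough illustration of why a burst quota can run out under sustained load, the sketch below models the quota as a credit bucket that refills at a baseline rate and drains at the actual I/O rate. This is a simplified, hypothetical model: the function names, the one-hour warning threshold, and all numbers are assumptions for illustration, not Gemini's or Amazon's actual figures or tooling.

```python
import math


def seconds_until_burst_exhausted(credits, baseline_iops, demand_iops):
    """Estimate how long a burst credit bucket lasts under steady load.

    Hypothetical model: the bucket gains `baseline_iops` credits per second
    and spends `demand_iops` credits per second. Returns math.inf when the
    demand does not exceed the baseline, because the bucket never drains.
    """
    if credits < 0 or baseline_iops < 0 or demand_iops < 0:
        raise ValueError("credits and rates must be non-negative")
    net_drain = demand_iops - baseline_iops
    if net_drain <= 0:
        return math.inf
    return credits / net_drain


def should_alert(credits, baseline_iops, demand_iops, warn_seconds=3600):
    """Return True when the bucket would empty within `warn_seconds`."""
    remaining = seconds_until_burst_exhausted(credits, baseline_iops, demand_iops)
    return remaining <= warn_seconds


# Illustrative numbers only (assumed, not taken from the incident).
remaining = seconds_until_burst_exhausted(
    credits=1_000_000, baseline_iops=300, demand_iops=800
)
print(f"burst credits last about {remaining / 60:.0f} minutes")
```

A check like `should_alert` could run on each monitoring sample, so that a shrinking credit balance raises a warning well before throughput drops to the baseline rate.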
We realize that our communication during the system outage was not consistent with the quality experience you have come to expect from Gemini, and we will be improving our communication plan going forward, taking into account the feedback we’ve received.
Some customer trading activity during this period was affected, and we will be reaching out to those customers within the coming days with more detailed information. If you have any questions about this outage, please don’t hesitate to contact Gemini customer support with specific details.
Onward and Upward!
As many are aware, the price of Bitcoin reached new highs on November 28th and into November 29th. On the morning of November 29th, our Web interface experienced an unprecedented increase in traffic, which degraded its performance and availability and, to a lesser extent, that of our API servers. At no time was this increase in traffic an attack on any of our systems, and our customers’ funds remain secure. It was solely the result of an unprecedented surge of requests on our platform.
Throughout the event, which lasted from 10:11 AM EST through 7:11 PM EST, we made repeated attempts to tune the system to handle the extremely high traffic on the Web server. Most of these attempts were met with new, higher spikes in activity. The API was unaffected for most of the time that our Web interface was experiencing the excessive traffic, and our FIX and matching engines remained fully operational throughout.
The environment stabilized at 7:11 PM EST and, after monitoring the mitigation for a period of time, we announced on our status page at 8:45 PM EST that both our Web and API interfaces were fully functional.
As many of you may know, in August we migrated our primary trading platform and network PoP (Point of Presence) to our own hardware in the Equinix NY5 data center in Secaucus, New Jersey. The downtime on November 29th was due to an application tuning issue, and our servers in the data center had plenty of room to scale. Still, in preparation for events such as the one above, we had forecast the need for additional capacity and were already installing new hardware when this occurred. We plan to bring this new capacity online in the near future to continue scaling our infrastructure and better serve our customers and community. The security and availability of our Web, API, and FIX interfaces are our first concern, and we strive to have them available for all our customers at all times.
Onward and Upward!