PRODUCT

DEC 13, 2021

Service Outage Update and Planned Upgrades

On Friday, December 10, we experienced a service disruption that resulted in a ten-hour exchange outage. Throughout the downtime, all of our customers' funds remained secure. While maintaining our trading and custody platform, we have been upgrading our backend exchange platform in parallel to improve system capacity and scalability. This involved migrating 79 trading pairs to a new platform on a coordinated schedule while continuing to operate our exchange 24 hours a day, seven days a week.

This post provides additional information about the recent service disruption and how we are improving the reliability and performance of our exchange through a series of upgrades.

Incident Summary

On Friday, before the planned final migration of BTC/USD and ETH/USD to the new platform, we experienced a failure of the messaging infrastructure that our exchange platform depends on.

This messaging system is responsible for fast, high-capacity, reliable delivery of messages within our distributed exchange platform. It is multi-node and generally fault tolerant, and it normally provides strong reliability guarantees to the applications that depend on it. In this incident, however, all three nodes that make up the messaging platform failed at the same time with the same exception. The messaging system restarted automatically, but many internal message consumers and producers required manual intervention. After restarting the impacted systems, we determined that the messaging system errors had caused the state of some downstream systems to diverge because of how those systems interacted with it.

Because the incident occurred while Gemini was transitioning order flow to an upgraded version of the exchange matching engine, we had to reconcile state across two trading systems before restarting production services. Once state reconciliation was complete and all services were stable, we restored markets by first enabling ActiveTrader and API connectivity in limit-only mode, then re-enabling Mobile and Retail Web.
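To make the idea of state reconciliation concrete, here is a minimal Python sketch of the kind of comparison involved. It is an illustration only, not our production tooling: the `OrderState` fields, the `find_divergence` helper, and the idea of comparing an authoritative snapshot against a downstream view are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class OrderState:
    """Hypothetical, simplified view of one open order."""
    order_id: str
    remaining_qty: int
    status: str


Snapshot = Dict[str, OrderState]


def find_divergence(
    authoritative: Snapshot, downstream: Snapshot
) -> Dict[str, Tuple[Optional[OrderState], Optional[OrderState]]]:
    """Return every order whose state differs between the two snapshots.

    An order missing from one side is reported with None on that side.
    """
    diverged = {}
    for order_id in authoritative.keys() | downstream.keys():
        expected = authoritative.get(order_id)
        observed = downstream.get(order_id)
        if expected != observed:
            diverged[order_id] = (expected, observed)
    return diverged


if __name__ == "__main__":
    engine = {
        "o1": OrderState("o1", 5, "open"),
        "o2": OrderState("o2", 0, "filled"),
    }
    ledger = {
        "o1": OrderState("o1", 5, "open"),
        "o2": OrderState("o2", 2, "open"),  # missed a fill message
    }
    for order_id, (expected, observed) in find_divergence(engine, ledger).items():
        print(order_id, "expected", expected, "observed", observed)
```

In a sketch like this, services would be restarted only once the set of diverged orders is empty for every pair of systems being compared.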

Incident Follow-Up

Going forward, in keeping with our chaos engineering approach, we will reproduce this failure mode in our test environments and improve our trading platform so that it degrades gracefully and recovers from this kind of subsystem interruption. Testing this failure mode more frequently will give us confidence that our systems can withstand it and similar failures.
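As a rough illustration of what reproducing this failure mode in a test could look like, the Python sketch below injects a full outage into an in-memory stand-in for a messaging cluster and checks that a consumer resumes from its last processed offset without gaps or duplicates. Every name here (`InMemoryBroker`, `ResumingConsumer`, `BrokerDown`) is hypothetical; it does not represent our messaging system or any real client library.

```python
from typing import Callable, List, Optional


class BrokerDown(Exception):
    """Raised when every node of the simulated cluster is unavailable."""


class InMemoryBroker:
    """Hypothetical stand-in for a messaging cluster, used only for testing."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = list(messages)
        self.available = True  # flip to False to inject a full outage

    def fetch(self, offset: int) -> Optional[str]:
        if not self.available:
            raise BrokerDown("all nodes unavailable")
        return self.messages[offset] if offset < len(self.messages) else None


class ResumingConsumer:
    """Consumer that retries after an outage and resumes from its own offset."""

    def __init__(self, broker: InMemoryBroker, max_retries: int = 3) -> None:
        self.broker = broker
        self.max_retries = max_retries
        self.offset = 0
        self.processed: List[str] = []

    def poll_once(self, on_retry: Callable[[int], None] = lambda attempt: None) -> bool:
        """Process one message. Return False when there is nothing left to read."""
        for attempt in range(self.max_retries + 1):
            try:
                message = self.broker.fetch(self.offset)
            except BrokerDown:
                on_retry(attempt)  # test hook, e.g. simulate the cluster restarting
                continue
            if message is None:
                return False
            self.processed.append(message)
            self.offset += 1  # advance only after the message is processed
            return True
        raise BrokerDown("retries exhausted")


if __name__ == "__main__":
    broker = InMemoryBroker(["a", "b", "c"])
    consumer = ResumingConsumer(broker)
    consumer.poll_once()
    broker.available = False  # inject the outage mid-stream

    def restart_cluster(attempt: int) -> None:
        broker.available = True  # simulate the automatic restart

    while consumer.poll_once(on_retry=restart_cluster):
        pass
    print(consumer.processed)
```

The design point the sketch tries to capture is that the consumer recovers on its own after the cluster restarts, rather than requiring manual intervention.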

Improvements to Performance and Scalability

We are very excited about the additional planned upgrades to our core exchange trading platform, which we preview below and will describe in more detail in a future post. The new system introduces multiple architectural improvements, including messaging improvements such as isolating the exchange’s high-throughput, low-latency messaging domain from the general-purpose store-and-forward domain.

Figure: Scalable Exchange Architecture

By isolating message traffic for latency-sensitive trading operations, such as order placement and order cancellation, in this part of the system, we have observed improvements in the exchange’s performance, which we detail below.
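The routing idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual architecture: the `DomainRouter` class and the two in-memory queues are hypothetical, and the message kinds other than order placement and cancellation are invented for the example.

```python
from collections import deque
from enum import Enum
from typing import Any, Deque, Tuple


class MessageKind(Enum):
    ORDER_PLACE = "order_place"
    ORDER_CANCEL = "order_cancel"
    # The two kinds below are invented examples of general-purpose traffic.
    ACCOUNT_STATEMENT = "account_statement"
    NOTIFICATION = "notification"


# Only latency-sensitive trading operations use the low-latency domain.
LATENCY_SENSITIVE = frozenset({MessageKind.ORDER_PLACE, MessageKind.ORDER_CANCEL})


class DomainRouter:
    """Keeps trading traffic and general-purpose traffic on separate channels."""

    def __init__(self) -> None:
        self.low_latency: Deque[Tuple[MessageKind, Any]] = deque()
        self.store_and_forward: Deque[Tuple[MessageKind, Any]] = deque()

    def publish(self, kind: MessageKind, payload: Any) -> None:
        if kind in LATENCY_SENSITIVE:
            self.low_latency.append((kind, payload))
        else:
            self.store_and_forward.append((kind, payload))


if __name__ == "__main__":
    router = DomainRouter()
    router.publish(MessageKind.ORDER_PLACE, {"symbol": "BTCUSD", "qty": 1})
    router.publish(MessageKind.NOTIFICATION, {"text": "statement ready"})
    router.publish(MessageKind.ORDER_CANCEL, {"order_id": "o1"})
    print(len(router.low_latency), len(router.store_and_forward))
```

Because the two domains share no queue, a backlog of general-purpose messages cannot delay an order placement or cancellation in this sketch.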

In the process of developing the upgraded exchange system, we have developed tooling to measure multiple performance characteristics of the exchange. This includes an application that can measure round-trip time as observed by an internal order gateway, to give a general sense of latency distributions of a single order pipeline.
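For readers curious about what a round-trip measurement of this kind involves, here is a minimal Python sketch. It is not the tooling described above: `measure_round_trips`, `summarize`, and the `send_and_wait` callable (standing in for "submit an order and wait for its acknowledgement") are assumptions made for illustration.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_round_trips(send_and_wait: Callable[[], None], n: int) -> List[float]:
    """Time n calls of send_and_wait and return the samples in microseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter_ns()
        send_and_wait()  # hypothetical: place an order and block until acknowledged
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return samples


def summarize(samples_us: List[float]) -> Dict[str, float]:
    """Summarize a latency distribution with median, tail, and worst case."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_us, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": max(samples_us)}


if __name__ == "__main__":
    # Stand-in workload; a real measurement would call the order gateway instead.
    samples = measure_round_trips(lambda: sum(range(1_000)), n=1_000)
    print(summarize(samples))
```

Reporting percentiles rather than an average matters here because tail latency (p99 and the maximum) is often what traders notice, and an average hides it.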

Below, we show the latency profiles for order placement in the new backend exchange platform. These profiles represent a simple order placement and fill flow. We plan to share more in-depth latency analysis in the future. The observed results represent an order-of-magnitude improvement in latency and throughput, in addition to the new capability to scale the exchange platform horizontally.

Chart: latency profiles for order placement in the new backend exchange platform

Building and maintaining a crypto platform in a market that never closes is not without its challenges, but we are confident that our new platform will help mitigate future disruptions and provide our customers the best exchange performance on the market.

Gemini is always recruiting engineers who want to help solve these challenges by operating a low-latency architecture that builds the financial systems of the future. If you are interested in joining us, see our open engineering roles on our Careers page.

Onward and Upward,

Team Gemini
