Removing Barriers to Scalability: How We Optimized the Use of Elasticsearch at Intercom

Elasticsearch is an indispensable part of Intercom.

It underpins core Intercom features such as the inbox, inbox views, the API, articles, the user list, reports, Resolution Bot, and our systems' internal logging. Our Elasticsearch clusters contain over 350TB of customer data, store over 300 billion documents, and process over 60,000 requests per second at peak times.

As the use of Intercom's Elasticsearch grows, we need to ensure that our systems evolve to support our continued growth. With the recent launch of our next-gen inbox, Elasticsearch's reliability is more critical than ever.

We decided to address an issue with our Elasticsearch configuration that posed an availability risk and threatened future downtime: uneven distribution of traffic/work across nodes in our Elasticsearch clusters.

First Signs of Inefficiency: Load Imbalance

Elasticsearch allows you to scale horizontally by increasing the number of nodes that store data (data nodes). We started noticing a load imbalance between these data nodes: some of them were under more pressure (or "hotter") than others due to higher disk or CPU utilization.

(Fig. 1) Imbalance in CPU utilization: two hot nodes with approximately 20% higher CPU utilization than the average.
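To make that kind of imbalance concrete, here is a minimal sketch (ours, not Intercom's tooling) that uses the _cat/nodes API to flag data nodes whose CPU sits well above the cluster average. The endpoint URL and the 15-point "hot" threshold are illustrative assumptions.

```python
# Minimal sketch: flag data nodes whose CPU sits well above the cluster average.
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint; add auth/TLS for a real cluster

resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"format": "json", "h": "name,node.role,cpu,disk.used_percent"},
)
resp.raise_for_status()
nodes = [n for n in resp.json() if "d" in n["node.role"]]  # keep nodes with the data role

avg_cpu = sum(float(n["cpu"]) for n in nodes) / len(nodes)
for n in nodes:
    hot = float(n["cpu"]) - avg_cpu >= 15  # arbitrary threshold: 15 points above average
    print(f"{n['name']:<30} cpu={n['cpu']:>3}% disk={n['disk.used_percent']}%"
          f"{'  <-- hot' if hot else ''}")
```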

Elasticsearch's built-in shard placement logic makes decisions based on a calculation that roughly estimates the available disk space on each node and the number of shards of an index per node. Resource usage per shard does not factor into this calculation. As a result, some nodes can receive more resource-intensive shards and become "hot". Since each search request is processed by multiple data nodes, a hot node that is pushed beyond its limits during a traffic spike can degrade performance for the entire cluster.
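A quick way to see this in practice is to aggregate _cat/shards output per node: two nodes can hold the same number of shards while one holds far more data on disk. The sketch below is our own illustration, with a placeholder endpoint.

```python
# Minimal sketch: shard count vs. total on-disk shard size per node, from _cat/shards.
# Equal shard counts can still mean very unequal disk load.
import collections
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

resp = requests.get(
    f"{ES_URL}/_cat/shards",
    params={"format": "json", "bytes": "gb", "h": "node,index,shard,prirep,store"},
)
resp.raise_for_status()

per_node = collections.defaultdict(lambda: {"shards": 0, "gb": 0.0})
for shard in resp.json():
    if not shard.get("node") or shard.get("store") is None:
        continue  # skip unassigned shards, which have no node or store size
    per_node[shard["node"]]["shards"] += 1
    per_node[shard["node"]]["gb"] += float(shard["store"])

for node, stats in sorted(per_node.items(), key=lambda kv: -kv[1]["gb"]):
    print(f"{node:<30} shards={stats['shards']:>4} store={stats['gb']:>8.1f} GB")
```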

A common cause of hot nodes is shard placement logic that assigns large shards (based on disk usage) to nodes, making balanced allocation less likely. Typically, a node ends up with one more large shard than the others, making it hotter in disk usage. The presence of large shards also hinders our ability to scale the cluster incrementally, as adding a data node does not guarantee that every hot node's load will be reduced (Fig. 2).

(Fig. 2) Adding a data node did not reduce the load on Host A. Adding another node would reduce the load on Host A, but the cluster would still have an uneven load distribution.
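A toy calculation, with made-up shard sizes, shows why: whole shards move between nodes, so whichever node ends up holding a very large shard can never drop below that shard's size, no matter how many nodes are added.

```python
# Toy numbers only: why adding nodes may not cool a node that holds a very large shard.
shard_sizes_gb = [500, 120, 110, 100, 90, 80]  # hypothetical shards currently on "Host A"
total_gb = sum(shard_sizes_gb)                 # 1,000 GB on Host A today

for extra_nodes in (1, 2):
    ideal_share = total_gb / (1 + extra_nodes)  # perfectly even split across old + new nodes
    floor = max(shard_sizes_gb)                 # best case is bounded by the biggest shard
    print(f"+{extra_nodes} node(s): ideal share ~{ideal_share:.0f} GB, "
          f"but whichever node keeps the 500 GB shard stays at >= {floor} GB")
```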

In contrast, having smaller shards helps reduce the load on all data nodes as the cluster scales, including the "hot" ones (Fig. 3).
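One common way to keep shards small is to create new indices with more primary shards, so each shard holds less data. The index name and shard counts below are hypothetical and should be sized from the expected data volume (e.g. tens of GB per shard); existing indices would need a reindex or the split API, since the primary shard count cannot be changed in place.

```python
# Minimal sketch, not Intercom's migration code: create a new index with more,
# smaller primary shards.
import requests

ES_URL = "http://localhost:9200"  # placeholder endpoint

resp = requests.put(
    f"{ES_URL}/conversations-v2",  # hypothetical new index
    json={
        "settings": {
            "index": {
                "number_of_shards": 16,   # more primaries -> smaller shards
                "number_of_replicas": 1,
            }
        }
    },
)
resp.raise_for_status()
print(resp.json())  # e.g. {"acknowledged": true, "shards_acknowledged": true, ...}
```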
