Troubleshooting and Fixing the "Data too large" Error in ElasticSearch

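Preface

While testing in the demo environment, we found that queries against ES occasionally failed with an error. The cause had to be tracked down and the cluster tuned to match the actual workload.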

Troubleshooting

1. Locating the problem

The ES error log shows that a request pushed memory usage over the limit and tripped the circuit breaker.

[2021-03-16T21:05:10,338][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [java-d-service-es-200-56-client-1] failed to execute on node [hsF4JzeAQ6mflJRGnJIKzQ]
org.elasticsearch.transport.RemoteTransportException: [data-es-group-online-200-67-2][10.110.200.67:9301][cluster:monitor/nodes/info[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [33093117638/30.8gb], which is larger than the limit of [31621696716/29.4gb], real usage: [33093114144/30.8gb], new bytes reserved: [3494/3.4kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=3494/3.4kb, accounting=104564949/99.7mb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.3.2.jar:7.3.2]
at ......
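
The two numbers that matter in this message are the projected usage ("would be") and the parent limit. As a minimal sketch, assuming the message keeps the shape shown in the log above, they can be pulled out with a regular expression. The class name BreakerMessageParser and its parse method are hypothetical helpers for illustration, not part of Elasticsearch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BreakerMessageParser {
    // Captures the byte counts in "would be [N/...]" and "limit of [N/...]".
    private static final Pattern USAGE = Pattern.compile(
        "would be \\[(\\d+)/[^\\]]*\\], which is larger than the limit of \\[(\\d+)/");

    /** Returns {wouldBeBytes, limitBytes}, or null if the message has a different shape. */
    static long[] parse(String message) {
        Matcher m = USAGE.matcher(message);
        if (!m.find()) {
            return null;
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        // Message fragment copied from the log above.
        String message = "[parent] Data too large, data for [<transport_request>] would be "
            + "[33093117638/30.8gb], which is larger than the limit of [31621696716/29.4gb]";
        long[] parsed = parse(message);
        System.out.println("would be = " + parsed[0] + " bytes");
        System.out.println("limit    = " + parsed[1] + " bytes");
        System.out.println("over by  = " + (parsed[0] - parsed[1]) + " bytes");
    }
}
```

Here the request itself is tiny (3.4kb of new bytes reserved); the heap was already above the limit before the request arrived.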

After pulling the ES source, the error can be traced to org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService. The relevant code is:

    public void checkParentLimit(long newBytesReserved, String label) throws CircuitBreakingException {
        final MemoryUsage memoryUsed = memoryUsed(newBytesReserved);
        long parentLimit = this.parentSettings.getLimit();
        if (memoryUsed.totalUsage > parentLimit) {
            this.parentTripCount.incrementAndGet();
            final StringBuilder message = new StringBuilder("[parent] Data too large, data for [" + label + "]" +
                " would be [" + memoryUsed.totalUsage + "/" + new ByteSizeValue(memoryUsed.totalUsage) + "]" +
                ", which is larger than the limit of [" +
                parentLimit + "/" + new ByteSizeValue(parentLimit) + "]");
            if (this.trackRealMemoryUsage) {
                final long realUsage = memoryUsed.baseUsage;
                message.append(", real usage: [");
                message.append(realUsage);
                message.append("/");
                message.append(new ByteSizeValue(realUsage));
                message.append("], new bytes reserved: [");
                message.append(newBytesReserved);
                message.append("/");
                message.append(new ByteSizeValue(newBytesReserved));
                message.append("]");
            } else {
                message.append(", usages [");
                message.append(String.join(", ",
                    this.breakers.entrySet().stream().map(e -> {
                        final CircuitBreaker breaker = e.getValue();
                        final long breakerUsed = (long)(breaker.getUsed() * breaker.getOverhead());
                        return e.getKey() + "=" + breakerUsed + "/" + new ByteSizeValue(breakerUsed);
                    })
                    .collect(Collectors.toList())));
                message.append("]");
            }
            // derive durability of a tripped parent breaker depending on whether the majority of memory tracked by
            // child circuit breakers is categorized as transient or permanent.
            CircuitBreaker.Durability durability = memoryUsed.transientChildUsage >= memoryUsed.permanentChildUsage ?
                CircuitBreaker.Durability.TRANSIENT : CircuitBreaker.Durability.PERMANENT;
            throw new CircuitBreakingException(message.toString(), memoryUsed.totalUsage, parentLimit, durability);
        }
    }

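The code shows that the breaker trips only when memoryUsed.totalUsage > parentLimit. parentLimit comes from the setting indices.breaker.total.limit, whose default is either 95% or 70% of the JVM heap, depending on the setting indices.breaker.total.use_real_memory (default true), as the following code shows: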

    public static final Setting<Boolean> USE_REAL_MEMORY_USAGE_SETTING =
        Setting.boolSetting("indices.breaker.total.use_real_memory", true, Property.NodeScope);

    public static final Setting<ByteSizeValue> TOTAL_CIRCUIT_BREAKER_LIMIT_SETTING =
        Setting.memorySizeSetting("indices.breaker.total.limit", settings -> {
            if (USE_REAL_MEMORY_USAGE_SETTING.get(settings)) {
                return "95%";
            } else {
                return "70%";
            }
        }, Property.Dynamic, Property.NodeScope);
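
The percentages can be checked against the log. The following is a minimal sketch, assuming the percentage is applied to the maximum JVM heap and truncated to whole bytes, and assuming a 31 GiB heap (the post does not state the heap size; this value is inferred because it reproduces the logged limit). ParentLimitCheck and parentLimit are hypothetical names used only for this illustration.

```java
class ParentLimitCheck {
    /** Resolves a percentage of the heap to a byte limit, truncating to whole bytes. */
    static long parentLimit(long heapBytes, double percent) {
        return (long) ((percent / 100) * heapBytes);
    }

    public static void main(String[] args) {
        // Assumption: 31 GiB heap; not stated in the post.
        long heapBytes = 31L * 1024 * 1024 * 1024;

        // use_real_memory = true  -> default limit of 95%
        System.out.println("95% limit = " + parentLimit(heapBytes, 95));
        // use_real_memory = false -> default limit of 70%
        System.out.println("70% limit = " + parentLimit(heapBytes, 70));
    }
}
```

Under that assumption the 95% figure equals the 31621696716 bytes (29.4gb) reported in the exception, which suggests the node was running with the default use_real_memory: true.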

Next, look at memoryUsed.totalUsage. It is computed by a method of the same class:

    private MemoryUsage memoryUsed(long newBytesReserved) {
        long transientUsage = 0;
        long permanentUsage = 0;

        for (CircuitBreaker breaker : this.breakers.values()) {
            long breakerUsed = (long)(breaker.getUsed() * breaker.getOverhead());
            if (breaker.getDurability() == CircuitBreaker.Durability.TRANSIENT) {
                transientUsage += breakerUsed;
            } else if (breaker.getDurability() == CircuitBreaker.Durability.PERMANENT) {
                permanentUsage += breakerUsed;
            }
        }
        if (this.trackRealMemoryUsage) {
            final long current = currentMemoryUsage();
            return new MemoryUsage(current, current + newBytesReserved, transientUsage, permanentUsage);
        } else {
            long parentEstimated = transientUsage + permanentUsage;
            return new MemoryUsage(parentEstimated, parentEstimated, transientUsage, permanentUsage);
        }
    }

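The value of trackRealMemoryUsage (taken from the setting indices.breaker.total.use_real_memory) decides whether the trip check uses the actual memory usage or only the memory tracked by the child circuit breakers. The official documentation explains: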

Static setting determining whether the parent breaker should take real memory usage into account (true) or only consider the amount that is reserved by child circuit breakers (false). Defaults to true
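
To make the difference concrete, the sketch below feeds the figures from the logged exception into both branches of memoryUsed. It is a simplified illustration, not the Elasticsearch implementation: BreakerModeComparison and its methods are hypothetical, and the child usages are taken as printed in the log.

```java
class BreakerModeComparison {
    /** use_real_memory = true: measured heap usage plus the bytes about to be reserved. */
    static long realMemoryTotal(long currentHeapUsage, long newBytesReserved) {
        return currentHeapUsage + newBytesReserved;
    }

    /** use_real_memory = false: only the sum of what the child breakers track. */
    static long childOnlyTotal(long... childUsages) {
        long sum = 0;
        for (long usage : childUsages) {
            sum += usage;
        }
        return sum;
    }

    public static void main(String[] args) {
        long limit = 31621696716L; // parent limit from the log (29.4gb)

        // real usage and new bytes reserved, as logged
        long real = realMemoryTotal(33093114144L, 3494L);
        // request, fielddata, in_flight_requests, accounting, as logged
        long children = childOnlyTotal(0L, 0L, 3494L, 104564949L);

        System.out.println("real-memory total = " + real + ", trips: " + (real > limit));
        System.out.println("child-only total  = " + children + ", trips: " + (children > limit));
    }
}
```

With real memory tracking the total exceeds the limit, while the child breakers together account for only about 100 MB. With use_real_memory set to false the default limit drops to 70% of the heap, which is still far above that child total. Heap that has not been garbage-collected yet therefore counts against the limit only in real-memory mode.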

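2. Solution

The problem can be resolved by changing the ES node configuration. Add the following setting to the ES configuration file elasticsearch.yml and restart the node: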

indices.breaker.total.use_real_memory: false

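If that does not resolve it, try increasing the JVM heap of ES by editing jvm.options. Note that a heap above roughly 32 GB can no longer use compressed object pointers, so weigh a heap this large against adding nodes instead: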

# Adjust the heap size to your environment; 40g is allocated here
-Xms40g
-Xmx40g
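
To see whether either change took effect after the restart, the parent breaker's limit, current estimate, and trip count can be read from the node stats API (GET _nodes/stats/breaker). The following is a rough sketch using the JDK 11+ HTTP client. The address http://localhost:9200 is a placeholder, the cluster is assumed to accept unauthenticated HTTP, and BreakerStatsProbe is a hypothetical helper name.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class BreakerStatsProbe {
    /** Builds the node-stats URL restricted to the breaker section. */
    static URI breakerStatsUri(String baseUrl) {
        String trimmed = baseUrl.endsWith("/")
            ? baseUrl.substring(0, baseUrl.length() - 1)
            : baseUrl;
        return URI.create(trimmed + "/_nodes/stats/breaker");
    }

    /** Fetches the raw JSON response; needs a reachable node. */
    static String fetch(String baseUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(breakerStatsUri(baseUrl)).GET().build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Placeholder address; pass a node of your own cluster as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:9200";
        try {
            System.out.println(fetch(baseUrl));
        } catch (Exception e) {
            System.out.println("Could not reach " + baseUrl + ": " + e);
        }
    }
}
```

In the response, the "parent" entry under "breakers" for each node carries limit_size_in_bytes, estimated_size_in_bytes, and tripped, which can be compared before and after the change.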

https://aunero.github.io/2022/07/es-data-too-large.html
Author: AuthurNero
Published: July 27, 2022