Hi team :)
I would like to discuss this issue in more detail. Please let me know a convenient time for you to follow up.
Thank you for your time.
Title
NoHttpResponseException from stale keep-alive connections in TDBatchConsumer — pool configuration not exposed
Body
Summary
TDBatchConsumer (via TDHttpRequestClient) initializes a static Apache HttpClient connection pool without evictIdleConnections or any active stale-connection management. When the receiving endpoint's load balancer closes idle keep-alive connections (e.g., AWS ALB's default 60s idle timeout), the pool reuses dead connections, producing NoHttpResponseException on the next request. The pool field is private static final and not exposed through any public configuration API, leaving users without a clean fix.
Environment
- SDK: cn.thinkingdata:thinkingdatasdk:3.0.2
- Java: 21
- Framework: Spring Boot 3.5
- Receiver: TD-hosted endpoint behind AWS ALB (idle timeout = 60s, measured via openssl s_client)
① Background — Observed behavior
In production we intermittently see this stack from the SDK's auto-flush TimerThread:
org.apache.http.NoHttpResponseException: <receiver-host>:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
...
at cn.thinkingdata.analytics.request.TDBatchRequest.sendRequest(TDBatchRequest.java:46)
at cn.thinkingdata.analytics.TDBatchConsumer.httpSending(TDBatchConsumer.java:330)
at cn.thinkingdata.analytics.TDBatchConsumer.flushOnce(TDBatchConsumer.java:303)
at cn.thinkingdata.analytics.TDBatchConsumer$1.run(TDBatchConsumer.java:138)
at java.base/java.util.TimerThread.mainLoop(Timer.java:566)
at java.base/java.util.TimerThread.run(Timer.java:516)
The error appears as application log noise (triggering alerts) but the actual data delivery is mostly preserved due to the SDK's internal retry and #uuid idempotency.
② Investigation — Root cause
Pool initialization lacks active stale management
TDHttpRequestClient.java:
public class TDHttpRequestClient {
    private static final PoolingHttpClientConnectionManager cm;
    private static final RequestConfig globalConfig;

    static {
        cm = new PoolingHttpClientConnectionManager();
        cm.setDefaultMaxPerRoute(80);
        // ❗ No setValidateAfterInactivity (relies on default 2000ms)
        globalConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
                .setConnectTimeout(30000)
                .setSocketTimeout(30000)
                .build();
    }

    public static CloseableHttpClient getHttpClient() {
        return HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultRequestConfig(globalConfig)
                .setConnectionManagerShared(true)
                // ❗ No evictIdleConnections(...)
                .build();
    }
}
Apache HttpClient's default validateAfterInactivity = 2000ms only triggers an isStale() check when a connection is leased after sitting idle for longer than that threshold, and the check itself is a best-effort 1-byte read with a very short timeout: if the server's FIN has not yet been delivered to userspace, it falsely reports the connection as alive. With no IdleConnectionEvictor (or any other background sweep) running, dead connections persist in the pool until the next request fails on them.
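For context, a minimal sketch (not SDK code; the pool, class name, and numbers below are illustrative only) of the two knobs stock Apache HttpClient 4.5.x offers here, only the first of which the SDK relies on today:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.IdleConnectionEvictor;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class StaleHandlingSketch {
    public static void main(String[] args) throws InterruptedException {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();

        // Knob 1 (passive, what the SDK relies on today): connections idle longer than
        // this are probed with a best-effort isStale() read at lease time; the probe can
        // miss a FIN that has not reached userspace yet.
        pool.setValidateAfterInactivity(2000);

        // Knob 2 (active, what the SDK lacks): a background thread that sweeps the pool
        // every 5s and closes anything idle for more than 30s, independent of lease timing.
        IdleConnectionEvictor evictor =
                new IdleConnectionEvictor(pool, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS);
        evictor.start();

        // ... build CloseableHttpClient instances on `pool` and run traffic ...

        evictor.shutdown();
        evictor.awaitTermination(1, TimeUnit.SECONDS);
        pool.shutdown();
    }
}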
LB closes idle connections at 60s
Measured directly:
$ time openssl s_client -connect <receiver-host>:443 ... </dev/null -ign_eof
...
closed
real 1m0.49s
This matches AWS ALB's default 60s idle timeout. The LB closes the idle connection with a graceful FIN; the client kernel receives it, but nothing in the pool checks the connection between flushes, so the closure is only discovered when the next request is attempted on the dead connection.
Configuration not exposed
TDBatchConsumer.Config only exposes: setBatchSize, setInterval, setCompress, setTimeout, setAutoFlush, setMaxCacheSize, setThrowException.
There is no public API to configure the underlying PoolingHttpClientConnectionManager, so users cannot fix this without reflection or replacing the entire consumer.
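Until a public knob exists, the only user-side stopgap we found is the same reflection trick the reproducer below uses, combined with a periodic sweep of the shared pool. A rough sketch (class name is ours; fragile by design, since it depends on the private static cm field shown above):

import java.lang.reflect.Field;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class TdPoolSweeper {
    public static ScheduledExecutorService install() throws Exception {
        // Reach the SDK's private shared pool (may break on any SDK update)
        Field cmField = Class.forName("cn.thinkingdata.analytics.request.TDHttpRequestClient")
                .getDeclaredField("cm");
        cmField.setAccessible(true);
        PoolingHttpClientConnectionManager pool =
                (PoolingHttpClientConnectionManager) cmField.get(null);

        // Daemon thread sweeps every 5s: drop expired connections and anything idle >30s,
        // so no pooled connection outlives the LB's 60s idle timeout.
        ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "td-pool-sweeper");
            t.setDaemon(true);
            return t;
        });
        sweeper.scheduleAtFixedRate(() -> {
            pool.closeExpiredConnections();
            pool.closeIdleConnections(30, TimeUnit.SECONDS);
        }, 5, 5, TimeUnit.SECONDS);
        return sweeper;
    }
}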
③ Verification — Deterministic reproducer
TDBatchConsumer.Config config = new TDBatchConsumer.Config();
config.setBatchSize(1);
config.setAutoFlush(false);   // manual flush for timing control
config.setCompress("gzip");
TDBatchConsumer consumer = new TDBatchConsumer(serverUrl, appId, config);
TDAnalytics ta = new TDAnalytics(consumer, true);
TDAnalytics.enableLog(true);  // expose internal SDK logs

// Disable validateAfterInactivity via reflection to make race deterministic
Field cmField = Class.forName("cn.thinkingdata.analytics.request.TDHttpRequestClient")
        .getDeclaredField("cm");
cmField.setAccessible(true);
PoolingHttpClientConnectionManager cm = (PoolingHttpClientConnectionManager) cmField.get(null);
cm.setValidateAfterInactivity(-1);

ta.track("user", null, "stale_repro",
        new HashMap<>(Map.of("phase", "A_before_idle")));
ta.flush();

Thread.sleep(130_000);        // > 60s LB idle timeout

ta.track("user", null, "stale_repro",
        new HashMap<>(Map.of("phase", "B_after_idle")));
ta.flush();
Result (with SDK log enabled)
[ThinkingData] flush data=[...A...]
[ThinkingData] Response={"code":0}
... 130s sleep ...
[ThinkingData] flush data=[...B...]
[ThinkingData] :443 failed to respond ← stale race triggered
[ThinkingData] Response={"code":0} ← retry on iter 1 succeeded
Findings
- NoHttpResponseException reliably reproduces when validateAfterInactivity is bypassed
- TDBatchRequest.sendRequest 3-attempt retry absorbs the failure in single-stale scenarios (iter 0 leases the dead connection → IOException, iter 1 creates a new connection → success)
- Data is preserved: the receiver-side dashboard shows every tracked event ingested correctly, including the B event whose first send attempt failed; #uuid ensures idempotency across retries
- In production with maxPerRoute=80 and bursty traffic, multiple stale connections can be leased successively, exhausting all 3 retries and propagating the exception to the timer thread (= the stack we see in production logs)
④ Proposal — Suggested fixes
In priority order:
- Enable active eviction in the default pool
Because getHttpClient() builds every client with setConnectionManagerShared(true), HttpClientBuilder.evictIdleConnections(...) would not help here: it is documented to have no effect when the connection manager is shared. The eviction therefore belongs on the shared pool itself; one line in TDHttpRequestClient's static initializer (plus the corresponding imports) is enough:

static {
    cm = new PoolingHttpClientConnectionManager();
    cm.setDefaultMaxPerRoute(80);
    new IdleConnectionEvictor(cm, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS).start(); // ← add this
    globalConfig = RequestConfig.custom()
            .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
            .setConnectTimeout(30000)
            .setSocketTimeout(30000)
            .build();
}

This sweeps the pool every 5s and proactively closes connections idle for more than 30s, well before AWS ALB's 60s idle timeout, which closes the stale-reuse window in practice. No public API change, fully backward compatible; the evictor's default thread factory creates a daemon thread, so it should not block JVM shutdown.
- Expose pool configuration via TDBatchConsumer.Config
Add optional setters with current defaults preserved:
config.setIdleConnectionEvictSeconds(30); // default: 30
config.setValidateAfterInactivityMs(2000); // default: 2000
config.setMaxConnections(100); // default: 100
config.setMaxConnectionsPerRoute(80); // default: 80
Different LB products have different idle timeouts (AWS NLB defaults to 350s, nginx default 75s, etc.), so user-side tuning is valuable.
- Accept an injectable HttpClientConnectionManager
public static class Config {
    public void setConnectionManager(HttpClientConnectionManager cm) { ... }
}
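Usage would then look roughly like this; setConnectionManager is the proposed setter from above (it does not exist in 3.0.2), the rest is stock Apache HttpClient 4.5.x, and serverUrl/appId are as in the reproducer:

// Caller owns and tunes the pool for its own LB environment
PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
pool.setMaxTotal(100);
pool.setDefaultMaxPerRoute(80);
pool.setValidateAfterInactivity(1000);  // stricter passive check for a 60s ALB
new IdleConnectionEvictor(pool, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS).start();

TDBatchConsumer.Config config = new TDBatchConsumer.Config();
config.setConnectionManager(pool);      // ← proposed API, not in 3.0.2
TDBatchConsumer consumer = new TDBatchConsumer(serverUrl, appId, config);

This keeps the SDK's default behavior unchanged while giving teams behind non-default load balancers full control over pooling and eviction.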