Hi team :)
I would like to discuss this issue in more detail. Please let me know a convenient time for you to follow up.
Thank you for your time.
Title
NoHttpResponseException from stale keep-alive connections in TDBatchConsumer — pool configuration not exposed
Body
Summary
TDBatchConsumer (via TDHttpRequestClient) initializes a static Apache HttpClient connection pool without evictIdleConnections or any active stale-connection management. When the receiving endpoint's load balancer closes idle keep-alive connections (e.g., AWS ALB's default 60s idle timeout), the pool reuses dead connections, producing NoHttpResponseException on the next request. The pool field is private static final and not exposed through any public configuration API, leaving users without a clean fix.
Environment
- SDK: cn.thinkingdata:thinkingdatasdk:3.0.2
- Java: 21
- Framework: Spring Boot 3.5
- Receiver: TD-hosted endpoint behind AWS ALB (idle timeout = 60s, measured via openssl s_client)
① Background — Observed behavior
In production we intermittently see this stack from the SDK's auto-flush TimerThread:
org.apache.http.NoHttpResponseException: <receiver-host>:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
...
at cn.thinkingdata.analytics.request.TDBatchRequest.sendRequest(TDBatchRequest.java:46)
at cn.thinkingdata.analytics.TDBatchConsumer.httpSending(TDBatchConsumer.java:330)
at cn.thinkingdata.analytics.TDBatchConsumer.flushOnce(TDBatchConsumer.java:303)
at cn.thinkingdata.analytics.TDBatchConsumer$1.run(TDBatchConsumer.java:138)
at java.base/java.util.TimerThread.mainLoop(Timer.java:566)
at java.base/java.util.TimerThread.run(Timer.java:516)
The error appears as application log noise (triggering alerts) but the actual data delivery is mostly preserved due to the SDK's internal retry and #uuid idempotency.
② Investigation — Root cause
Pool initialization lacks active stale management
TDHttpRequestClient.java:
public class TDHttpRequestClient {
    private static final PoolingHttpClientConnectionManager cm;
    private static final RequestConfig globalConfig;

    static {
        cm = new PoolingHttpClientConnectionManager();
        cm.setDefaultMaxPerRoute(80);
        // ❗ No setValidateAfterInactivity (relies on default 2000ms)
        globalConfig = RequestConfig.custom()
                .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
                .setConnectTimeout(30000)
                .setSocketTimeout(30000)
                .build();
    }

    public static CloseableHttpClient getHttpClient() {
        return HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultRequestConfig(globalConfig)
                .setConnectionManagerShared(true)
                // ❗ No evictIdleConnections(...)
                .build();
    }
}
Apache HttpClient's default validateAfterInactivity = 2000ms only triggers an isStale() check when a connection is leased after sitting idle for longer than that threshold, and the check itself is a best-effort 1-byte read with a very short timeout: if the server's FIN has not yet been delivered to userspace, it falsely reports the connection as alive. With no IdleConnectionEvictor (or any other background sweep) running, dead connections persist in the pool until the next request fails on them.
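For context, a minimal sketch (not SDK code; the pool, class name, and numbers below are illustrative only) of the two knobs stock Apache HttpClient 4.5.x offers here, only the first of which the SDK relies on today:

import java.util.concurrent.TimeUnit;

import org.apache.http.impl.client.IdleConnectionEvictor;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class StaleHandlingSketch {
    public static void main(String[] args) throws InterruptedException {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();

        // Knob 1 (passive, what the SDK relies on today): connections idle longer than
        // this are probed with a best-effort isStale() read at lease time; the probe can
        // miss a FIN that has not reached userspace yet.
        pool.setValidateAfterInactivity(2000);

        // Knob 2 (active, what the SDK lacks): a background thread that sweeps the pool
        // every 5s and closes anything idle for more than 30s, independent of lease timing.
        IdleConnectionEvictor evictor =
                new IdleConnectionEvictor(pool, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS);
        evictor.start();

        // ... build CloseableHttpClient instances on `pool` and run traffic ...

        evictor.shutdown();
        evictor.awaitTermination(1, TimeUnit.SECONDS);
        pool.shutdown();
    }
}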
LB closes idle connections at 60s
Measured directly:
$ time openssl s_client -connect <receiver-host>:443 ... </dev/null -ign_eof
...
closed
real 1m0.49s
This matches AWS ALB's default 60s idle timeout. The LB closes the idle connection with a graceful FIN; the client kernel receives it, but nothing in the pool checks the connection between flushes, so the closure is only discovered when the next request is attempted on the dead connection.
Configuration not exposed
TDBatchConsumer.Config only exposes: setBatchSize, setInterval, setCompress, setTimeout, setAutoFlush, setMaxCacheSize, setThrowException.
There is no public API to configure the underlying PoolingHttpClientConnectionManager, so users cannot fix this without reflection or replacing the entire consumer.
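Until a public knob exists, the only user-side stopgap we found is the same reflection trick the reproducer below uses, combined with a periodic sweep of the shared pool. A rough sketch (class name is ours; fragile by design, since it depends on the private static cm field shown above):

import java.lang.reflect.Field;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class TdPoolSweeper {
    public static ScheduledExecutorService install() throws Exception {
        // Reach the SDK's private shared pool (may break on any SDK update)
        Field cmField = Class.forName("cn.thinkingdata.analytics.request.TDHttpRequestClient")
                .getDeclaredField("cm");
        cmField.setAccessible(true);
        PoolingHttpClientConnectionManager pool =
                (PoolingHttpClientConnectionManager) cmField.get(null);

        // Daemon thread sweeps every 5s: drop expired connections and anything idle >30s,
        // so no pooled connection outlives the LB's 60s idle timeout.
        ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "td-pool-sweeper");
            t.setDaemon(true);
            return t;
        });
        sweeper.scheduleAtFixedRate(() -> {
            pool.closeExpiredConnections();
            pool.closeIdleConnections(30, TimeUnit.SECONDS);
        }, 5, 5, TimeUnit.SECONDS);
        return sweeper;
    }
}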
③ Verification — Deterministic reproducer
TDBatchConsumer.Config config = new TDBatchConsumer.Config();
config.setBatchSize(1);
config.setAutoFlush(false);   // manual flush for timing control
config.setCompress("gzip");
TDBatchConsumer consumer = new TDBatchConsumer(serverUrl, appId, config);
TDAnalytics ta = new TDAnalytics(consumer, true);
TDAnalytics.enableLog(true);  // expose internal SDK logs

// Disable validateAfterInactivity via reflection to make race deterministic
Field cmField = Class.forName("cn.thinkingdata.analytics.request.TDHttpRequestClient")
        .getDeclaredField("cm");
cmField.setAccessible(true);
PoolingHttpClientConnectionManager cm = (PoolingHttpClientConnectionManager) cmField.get(null);
cm.setValidateAfterInactivity(-1);

ta.track("user", null, "stale_repro",
        new HashMap<>(Map.of("phase", "A_before_idle")));
ta.flush();

Thread.sleep(130_000);        // > 60s LB idle timeout

ta.track("user", null, "stale_repro",
        new HashMap<>(Map.of("phase", "B_after_idle")));
ta.flush();
Result (with SDK log enabled)
[ThinkingData] flush data=[...A...]
[ThinkingData] Response={"code":0}
... 130s sleep ...
[ThinkingData] flush data=[...B...]
[ThinkingData] :443 failed to respond ← stale race triggered
[ThinkingData] Response={"code":0} ← retry on iter 1 succeeded
Findings
- NoHttpResponseException reliably reproduces when validateAfterInactivity is bypassed
- TDBatchRequest.sendRequest 3-attempt retry absorbs the failure in single-stale scenarios (iter 0 leases the dead connection → IOException, iter 1 creates a new connection → success)
- Data is preserved: the receiver-side dashboard shows every tracked event ingested correctly, including the B event whose first send attempt failed; #uuid ensures idempotency across retries
- In production with maxPerRoute=80 and bursty traffic, multiple stale connections can be leased successively, exhausting all 3 retries and propagating the exception to the timer thread (= the stack we see in production logs)
④ Proposal — Suggested fixes
In priority order:
- Enable active eviction in the default pool
Because getHttpClient() builds every client with setConnectionManagerShared(true), HttpClientBuilder.evictIdleConnections(...) would not help here: it is documented to have no effect when the connection manager is shared. The eviction therefore belongs on the shared pool itself; one line in TDHttpRequestClient's static initializer (plus the corresponding imports) is enough:

static {
    cm = new PoolingHttpClientConnectionManager();
    cm.setDefaultMaxPerRoute(80);
    new IdleConnectionEvictor(cm, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS).start(); // ← add this
    globalConfig = RequestConfig.custom()
            .setCookieSpec(CookieSpecs.IGNORE_COOKIES)
            .setConnectTimeout(30000)
            .setSocketTimeout(30000)
            .build();
}

This sweeps the pool every 5s and proactively closes connections idle for more than 30s, well before AWS ALB's 60s idle timeout, which closes the stale-reuse window in practice. No public API change, fully backward compatible; the evictor's default thread factory creates a daemon thread, so it should not block JVM shutdown.
- Expose pool configuration via TDBatchConsumer.Config
Add optional setters with current defaults preserved:
config.setIdleConnectionEvictSeconds(30); // default: 30
config.setValidateAfterInactivityMs(2000); // default: 2000
config.setMaxConnections(100); // default: 100
config.setMaxConnectionsPerRoute(80); // default: 80
Different LB products have different idle timeouts (AWS NLB defaults to 350s, nginx default 75s, etc.), so user-side tuning is valuable.
- Accept an injectable HttpClientConnectionManager
public static class Config {
    public void setConnectionManager(HttpClientConnectionManager cm) { ... }
}
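Usage would then look roughly like this; setConnectionManager is the proposed setter from above (it does not exist in 3.0.2), the rest is stock Apache HttpClient 4.5.x, and serverUrl/appId are as in the reproducer:

// Caller owns and tunes the pool for its own LB environment
PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
pool.setMaxTotal(100);
pool.setDefaultMaxPerRoute(80);
pool.setValidateAfterInactivity(1000);  // stricter passive check for a 60s ALB
new IdleConnectionEvictor(pool, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS).start();

TDBatchConsumer.Config config = new TDBatchConsumer.Config();
config.setConnectionManager(pool);      // ← proposed API, not in 3.0.2
TDBatchConsumer consumer = new TDBatchConsumer(serverUrl, appId, config);

This keeps the SDK's default behavior unchanged while giving teams behind non-default load balancers full control over pooling and eviction.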