NoHttpResponseException from stale keep-alive connections in TDBatchConsumer — pool configuration not exposed #6

@capDoYeonLee

Hi team :)
I would like to discuss this issue in more detail. Please let me know a convenient time for you to follow up.
Thank you for your time.


Summary

TDBatchConsumer (via TDHttpRequestClient) initializes a static Apache HttpClient connection pool without evictIdleConnections or any active stale-connection management. When the receiving endpoint's load balancer closes idle keep-alive connections (e.g., AWS ALB's default 60s idle timeout), the pool reuses dead connections, producing NoHttpResponseException on the next request. The pool field is private static final and not exposed through any public configuration API, leaving users without a clean fix.

Environment

  • SDK: cn.thinkingdata:thinkingdatasdk:3.0.2
  • Java: 21
  • Framework: Spring Boot 3.5
  • Receiver: TD-hosted endpoint behind AWS ALB (idle timeout = 60s, measured via openssl s_client)

① Background — Observed behavior

In production we intermittently see this stack from the SDK's auto-flush TimerThread:

  org.apache.http.NoHttpResponseException: <receiver-host>:443 failed to respond                                                                                                                                                         
      at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141)
      ...                                                                                                                                                                                                                                
      at cn.thinkingdata.analytics.request.TDBatchRequest.sendRequest(TDBatchRequest.java:46)
      at cn.thinkingdata.analytics.TDBatchConsumer.httpSending(TDBatchConsumer.java:330)                                                                                                                                                 
      at cn.thinkingdata.analytics.TDBatchConsumer.flushOnce(TDBatchConsumer.java:303)                                                                                                                                                   
      at cn.thinkingdata.analytics.TDBatchConsumer$1.run(TDBatchConsumer.java:138)                                                                                                                                                       
      at java.base/java.util.TimerThread.mainLoop(Timer.java:566)                                                                                                                                                                        
      at java.base/java.util.TimerThread.run(Timer.java:516)                                                                                                                                                                             

The error appears as application log noise (triggering alerts) but the actual data delivery is mostly preserved due to the SDK's internal retry and #uuid idempotency.

② Investigation — Root cause

Pool initialization lacks active stale management

TDHttpRequestClient.java:

  public class TDHttpRequestClient {
      private static final PoolingHttpClientConnectionManager cm;                                                                                                                                                                        
      private static final RequestConfig globalConfig;
      static {                                                                                                                                                                                                                           
          cm = new PoolingHttpClientConnectionManager();                                                                                                                                                                                 
          cm.setDefaultMaxPerRoute(80);                                                                                                                                                                                                  
          // ❗ No setValidateAfterInactivity (relies on default 2000ms)                                                                                                                                                                  
          globalConfig = RequestConfig.custom()                                                                                                                                                                                          
                  .setCookieSpec(CookieSpecs.IGNORE_COOKIES)                                                                                                                                                                             
                  .setConnectTimeout(30000)                                                                                                                                                                                              
                  .setSocketTimeout(30000).build();                                                                                                                                                                                      
      }                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                         
      public static CloseableHttpClient getHttpClient() {                                                                                                                                                                                
          return HttpClients.custom()                                                                                                                                                                                                    
                  .setConnectionManager(cm)                                                                                                                                                                                              
                  .setDefaultRequestConfig(globalConfig)                                                                                                                                                                                 
                  .setConnectionManagerShared(true)                                                                                                                                                                                      
                  // ❗ No evictIdleConnections(...)                                                                                                                                                                                      
                  .build();                                                                                                                                                                                                              
      }                                                                                                                                                                                                                                  
  }                                                                                                                                                                                                                                      

With Apache HttpClient's default validateAfterInactivity of 2000ms, a connection that has been idle for more than 2s gets an isStale() check on lease. That check is a best-effort 1-byte read with a very short timeout, so it can still report "alive" when the server's FIN has not yet reached the client socket at the moment of the check. With no IdleConnectionEvictor running, dead connections stay in the pool until the next failed use.
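
The lease-time check described above can be illustrated with plain JDK sockets. This is a minimal sketch, not HttpClient's code: `StaleSocketDemo` and `looksStale` are hypothetical names, and a loopback server stands in for the load balancer. It shows that the probe only reports a dead connection once the peer's FIN has actually arrived, so a probe that runs slightly too early lets a doomed connection through.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class StaleSocketDemo {

    /** Best-effort probe: a 1-byte read with a 1 ms timeout (hypothetical helper). */
    static boolean looksStale(Socket socket) {
        try {
            int previousTimeout = socket.getSoTimeout();
            try {
                socket.setSoTimeout(1);
                // -1 means the peer's FIN has been received: the connection is dead.
                // NOTE: a real client would have to buffer any byte read here.
                return socket.getInputStream().read() < 0;
            } catch (SocketTimeoutException nothingToRead) {
                // No data and no FIN seen yet: the connection is assumed alive.
                return false;
            } finally {
                socket.setSoTimeout(previousTimeout);
            }
        } catch (IOException brokenSocket) {
            return true;
        }
    }

    /** Returns {staleWhileServerHoldsConnection, staleAfterServerCloses}. */
    static boolean[] run() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket serverSide = server.accept()) {
            boolean staleBefore = looksStale(client);
            serverSide.close();   // plays the role of the LB idle timeout
            Thread.sleep(200);    // give the FIN time to arrive on loopback
            boolean staleAfter = looksStale(client);
            return new boolean[] {staleBefore, staleAfter};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("stale before server close: " + result[0]);
        System.out.println("stale after server close:  " + result[1]);
    }
}
```

The probe cannot see a close that has not happened yet or a FIN still in flight, which is the gap that proactive eviction is meant to cover.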

LB closes idle connections at 60s

  Measured directly:                          
  $ time openssl s_client -connect <receiver-host>:443 ... </dev/null -ign_eof                                                                                                                                                           
  ...                                                                         
  closed                                                                                                                                                                                                                                 
  real  1m0.49s                           

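This matches AWS ALB's default 60s idle timeout. The LB closes the connection gracefully with a FIN, which the client kernel receives, but nothing in the pool inspects idle connections between flushes, so the closure goes unnoticed until the connection is leased again.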

Configuration not exposed

TDBatchConsumer.Config only exposes: setBatchSize, setInterval, setCompress, setTimeout, setAutoFlush, setMaxCacheSize, setThrowException.

There is no public API to configure the underlying PoolingHttpClientConnectionManager, so users cannot fix this without reflection or replacing the entire consumer.

③ Verification — Deterministic reproducer

  TDBatchConsumer.Config config = new TDBatchConsumer.Config();
  config.setBatchSize(1);                                                                                                                                                                                                                
  config.setAutoFlush(false);    // manual flush for timing control
  config.setCompress("gzip");                                                                                                                                                                                                            
  TDBatchConsumer consumer = new TDBatchConsumer(serverUrl, appId, config);                                                                                                                                                              
  TDAnalytics ta = new TDAnalytics(consumer, true);
  TDAnalytics.enableLog(true);   // expose internal SDK logs                                                                                                                                                                             
                                                                                                                                                                                                                                         
  // Disable validateAfterInactivity via reflection to make race deterministic                                                                                                                                                           
  Field cmField = Class.forName("cn.thinkingdata.analytics.request.TDHttpRequestClient")                                                                                                                                                 
          .getDeclaredField("cm");                                                                                                                                                                                                       
  cmField.setAccessible(true);                                                                                                                                                                                                           
  PoolingHttpClientConnectionManager cm = (PoolingHttpClientConnectionManager) cmField.get(null);                                                                                                                                        
  cm.setValidateAfterInactivity(-1);                                                                                                                                                                                                     
                                                                                                                                                                                                                                         
  ta.track("user", null, "stale_repro",                                                                                                                                                                                                  
      new HashMap<>(Map.of("phase", "A_before_idle")));                                                                                                                                                                                  
  ta.flush();                                                                                                                                                                                                                            
                                                                                                                                                                                                                                         
  Thread.sleep(130_000);   // > 60s LB idle timeout                                                                                                                                                                                      
                  
  ta.track("user", null, "stale_repro",
      new HashMap<>(Map.of("phase", "B_after_idle")));
  ta.flush();                                 

Result (with SDK log enabled)

[ThinkingData] flush data=[...A...]
[ThinkingData] Response={"code":0}

... 130s sleep ...

[ThinkingData] flush data=[...B...]
[ThinkingData] <receiver-host>:443 failed to respond ← stale race triggered
[ThinkingData] Response={"code":0} ← retry on iter 1 succeeded

Findings

  • NoHttpResponseException reliably reproduces when validateAfterInactivity is bypassed
  • TDBatchRequest.sendRequest 3-attempt retry absorbs the failure in single-stale scenarios (iter 0 leases the dead connection → IOException, iter 1 creates a new connection → success)
  • Data is preserved: the receiver-side dashboard shows all 3 events (A, B, C) ingested correctly; #uuid ensures idempotency
  • In production with maxPerRoute=80 and bursty traffic, multiple stale connections can be leased successively, exhausting all 3 retries and propagating the exception to the timer thread (= the stack we see in production logs)
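
The retry behavior in these findings can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the SDK's code: `RetryExhaustionModel` is a hypothetical name, every pooled connection is assumed stale, the lease-time check is assumed to miss each one, and a failed connection is assumed to be discarded rather than returned to the pool.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RetryExhaustionModel {

    /**
     * Returns the 1-based attempt that succeeds, or -1 if every attempt
     * leased a stale connection.
     */
    static int firstSuccessfulAttempt(int staleConnectionsInPool, int maxAttempts) {
        Deque<Boolean> pool = new ArrayDeque<>();
        for (int i = 0; i < staleConnectionsInPool; i++) {
            pool.push(Boolean.TRUE); // TRUE marks a stale pooled connection
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Boolean stale = pool.poll(); // null: pool is empty, a fresh connection is opened
            if (stale == null || !stale) {
                return attempt;
            }
            // A stale connection fails with an IOException and is dropped from the pool.
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int stale = 0; stale <= 4; stale++) {
            System.out.println(stale + " stale -> attempt " + firstSuccessfulAttempt(stale, 3));
        }
    }
}
```

Under these assumptions a single stale connection costs one retry, while three or more stale connections leased in a row exhaust a 3-attempt budget, matching the production stack.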

④ Proposal — Suggested fixes

In priority order:

  1. Enable active eviction in the default pool

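Small change in TDHttpRequestClient. Note that HttpClientBuilder.evictIdleConnections(...) is documented as having no effect when the client uses a shared connection manager, and getHttpClient() calls setConnectionManagerShared(true). Eviction therefore has to run against the shared pool itself, for example by starting one IdleConnectionEvictor (org.apache.http.impl.client) in the static initializer:

  static {
      cm = new PoolingHttpClientConnectionManager();
      cm.setDefaultMaxPerRoute(80);
      // ← add this: every 5s, close connections that have been idle for more than 30s
      new IdleConnectionEvictor(cm, 5, TimeUnit.SECONDS, 30, TimeUnit.SECONDS).start();
      // ... globalConfig setup unchanged ...
  }

This proactively closes connections that have been idle for roughly 30s, well before AWS ALB's 60s idle timeout, so the pool should no longer hold connections the LB has already dropped for idleness. No public API change, fully backward compatible.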

  2. Expose pool configuration via TDBatchConsumer.Config

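Add optional setters. The values shown are suggested defaults; validateAfterInactivity and maxPerRoute match the current behavior:

  config.setIdleConnectionEvictSeconds(30);          // suggested default: 30 (no eviction today)
  config.setValidateAfterInactivityMs(2000);          // current default: 2000
  config.setMaxConnections(100);                      // suggested default: 100 (HttpClient's own maxTotal default is 20)
  config.setMaxConnectionsPerRoute(80);               // current default: 80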

Different LB products have different idle timeouts (AWS NLB defaults to 350s for TCP flows, nginx's keepalive_timeout defaults to 75s, etc.), so user-side tuning is valuable.

  3. Accept an injectable HttpClientConnectionManager

  public static class Config {                                                                                                                                                                                                           
      public void setConnectionManager(HttpClientConnectionManager cm) { ... }                                                                                                                                                           
  }
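
Whichever option is chosen, the eviction threshold has to be picked against the server-side idle timeout. The sketch below works through that arithmetic, assuming a periodic sweeper that wakes at a fixed interval and closes connections idle longer than a maximum. `EvictionWindow`, `worstCaseIdle`, and `safeFor` are hypothetical names, and the timeouts are the ones quoted in this issue.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;

final class EvictionWindow {

    /** Longest time a connection can sit idle before a periodic sweep closes it. */
    static Duration worstCaseIdle(Duration maxIdle, Duration sweepInterval) {
        // A connection just under maxIdle at one sweep survives until the next sweep.
        return maxIdle.plus(sweepInterval);
    }

    /** True when the worst case still ends before the server side drops the connection. */
    static boolean safeFor(Duration serverIdleTimeout, Duration maxIdle, Duration sweepInterval) {
        return worstCaseIdle(maxIdle, sweepInterval).compareTo(serverIdleTimeout) < 0;
    }

    public static void main(String[] args) {
        Map<String, Duration> serverTimeouts = new LinkedHashMap<>();
        serverTimeouts.put("AWS ALB", Duration.ofSeconds(60));
        serverTimeouts.put("nginx", Duration.ofSeconds(75));
        serverTimeouts.put("AWS NLB", Duration.ofSeconds(350));

        Duration maxIdle = Duration.ofSeconds(30);
        for (Duration sweep : new Duration[] {Duration.ofSeconds(5), Duration.ofSeconds(30)}) {
            serverTimeouts.forEach((name, timeout) ->
                System.out.printf("%s: maxIdle=%ds sweep=%ds safe=%b%n",
                    name, maxIdle.getSeconds(), sweep.getSeconds(),
                    safeFor(timeout, maxIdle, sweep)));
        }
    }
}
```

With a 30s maximum and a 30s sweep, the worst case reaches 60s and leaves no margin against a 60s ALB timeout. A shorter sweep interval, or a lower maximum, restores the margin.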
