[codex] add adaptive inference rate limiting by hynky1999 · Pull Request #139 · macrodata-labs/refiner

Hynek Kydlíček (hynky1999) · 2026-05-20T13:58:37Z

Summary

Makes mdr.inference.generate(...) rate-limit aware by default, with a single rate_limit parameter for choosing adaptive or fixed request concurrency.

Changes

Adds mdr.inference.AdaptiveRateLimit, used by default when rate_limit is omitted or None.
Adds mdr.inference.StaticRateLimit for the old fixed-concurrency behavior.
Removes max_concurrent_requests from generate(...); concurrency caps now live on the rate-limit object.
Preserves Retry-After from OpenAI-compatible endpoints through a typed GenerationRateLimitError.
Adds focused tests, updates docs, and updates the LeRobot inference example.
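
The Retry-After item above could be handled along these lines. This is a minimal sketch, not the PR's code: parse_retry_after is a hypothetical helper, and the GenerationRateLimitError below is a local stand-in that borrows only the class name and the retry_after_seconds attribute visible in the diff. It assumes the header carries either delta-seconds or an HTTP-date, and it is deliberately lenient about fractional seconds.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


class GenerationRateLimitError(Exception):
    """Local stand-in; only the retry_after_seconds attribute comes from the PR."""

    def __init__(self, message: str, retry_after_seconds: float | None = None) -> None:
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


def parse_retry_after(value: str | None, now: datetime | None = None) -> float | None:
    """Convert a Retry-After header value to a non-negative number of seconds.

    Returns None when the header is missing or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()
    if not value:
        return None
    # Form 1: delta-seconds, e.g. "7".
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a naive date as UTC (an assumption of this sketch).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A client wrapper could then raise GenerationRateLimitError("rate limited", retry_after_seconds=parse_retry_after(header_value)) when it sees HTTP 429, so callers receive the server's hint as a typed attribute rather than a string.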

Validation

uv run ruff check --force-exclude --fix src/refiner/inference tests/test_inference.py tests/test_inference_rate_limit.py examples/lerobot/sarm_annotation.py
uv run ruff format --force-exclude src/refiner/inference tests/test_inference.py tests/test_inference_rate_limit.py examples/lerobot/sarm_annotation.py
uv run ty check
uv run pytest

Full suite: 627 passed.

gemini-code-assist

Code Review

This pull request implements an adaptive request limiting mechanism for inference providers, allowing the system to dynamically adjust concurrency based on HTTP 429 responses and successful request windows. The changes include the AdaptiveRateLimit configuration, the AdaptiveRateLimiter logic, and integration into the generation pipeline with enhanced metrics and documentation. Reviewer feedback identifies opportunities to improve documentation self-containment, eliminate redundant variable updates in the generation loop, and simplify logic in the rate limiter by leveraging class invariants.
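
For readers unfamiliar with this kind of limiter, the adjust-on-429, grow-after-a-clean-window idea can be sketched as follows. This is an illustration only, not the PR's AdaptiveRateLimiter: the growth condition mirrors the _should_grow check quoted later in this review, but the class names, the default values, the min_concurrency field, the +1 growth step, and the halving on a rate limit are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptiveConfig:
    # Field names follow the PR diff; the default values are made up.
    max_concurrency: int = 64
    success_window_requests: int = 20
    success_window_seconds: float = 10.0
    min_concurrency: int = 1  # hypothetical field, not shown in the PR


class ConcurrencyController:
    """Pure bookkeeping with no I/O; the caller supplies a monotonic timestamp."""

    def __init__(self, config: AdaptiveConfig, initial_limit: int, now: float) -> None:
        self._config = config
        self.limit = max(config.min_concurrency, min(initial_limit, config.max_concurrency))
        self._successes = 0
        self._last_adjusted_at = now

    def _should_grow(self, now: float) -> bool:
        return (
            self._successes >= self._config.success_window_requests
            and now - self._last_adjusted_at >= self._config.success_window_seconds
            and self.limit < self._config.max_concurrency
        )

    def record_success(self, now: float) -> None:
        self._successes += 1
        if self._should_grow(now):
            # Additive increase: one extra slot per clean window (assumption).
            self.limit = min(self._config.max_concurrency, self.limit + 1)
            self._successes = 0
            self._last_adjusted_at = now

    def record_rate_limit(self, now: float) -> None:
        # Multiplicative decrease: halve the limit on a 429 (assumption).
        self.limit = max(self._config.min_concurrency, self.limit // 2)
        self._successes = 0
        self._last_adjusted_at = now
```

Passing the timestamp in explicitly keeps the logic deterministic under test; a real limiter would typically read time.monotonic() itself and guard the state with a lock or condition.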

gemini-code-assist · 2026-05-20T14:03:34Z

+
+pipeline = mdr.read_jsonl("input.jsonl").map_async(
+    mdr.inference.generate(
+        fn=summarize,


The summarize function is used here but is not defined in this code example. To make the example self-contained and easier for users to understand and copy, please define the summarize function within the python code block, similar to how it's done in the preceding example.

gemini-code-assist · 2026-05-20T14:03:34Z

        running_requests += 1
        try:
-            response = await client.generate(request_payload)
+            if permit is not None:
+                async with permit:
+                    response = await client.generate(request_payload)
+            else:
+                response = await client.generate(request_payload)
+        except GenerationRateLimitError as err:
+            if adaptive_limiter is not None:
+                await adaptive_limiter.record_rate_limit(err.retry_after_seconds)
+            row.log_throughput("rate_limited_requests", 1, unit="requests")
+            row.log_throughput("failed_requests", 1, unit="requests")
+            raise
        except Exception:
            row.log_throughput("failed_requests", 1, unit="requests")
            raise
        finally:
            running_requests -= 1
-            request_semaphore.release()
+            if request_semaphore is not None:
+                request_semaphore.release()


The running_requests variable is updated on every request, even when an adaptive_limiter is in use. In the adaptive case, the running_requests metric already uses adaptive_limiter.running as the source of truth, making the updates to the local running_requests variable redundant and potentially confusing. For better clarity and maintainability, these updates should only happen when using the fixed-concurrency request_semaphore.

Suggested change

        if adaptive_limiter is None:
            running_requests += 1
        try:
            if permit is not None:
                async with permit:
                    response = await client.generate(request_payload)
            else:
                response = await client.generate(request_payload)
        except GenerationRateLimitError as err:
            if adaptive_limiter is not None:
                await adaptive_limiter.record_rate_limit(err.retry_after_seconds)
            row.log_throughput("rate_limited_requests", 1, unit="requests")
            row.log_throughput("failed_requests", 1, unit="requests")
            raise
        except Exception:
            row.log_throughput("failed_requests", 1, unit="requests")
            raise
        finally:
            if adaptive_limiter is None:
                running_requests -= 1
            if request_semaphore is not None:
                request_semaphore.release()

gemini-code-assist · 2026-05-20T14:03:35Z

+            now = time.monotonic()
+            if self._should_grow(now):
+                next_limit = min(
+                    self._config.max_concurrency or self._limit,


Since __init__ ensures self._config.max_concurrency is not None, the or self._limit part is redundant and can be confusing. Simplifying this expression will improve readability by relying on the class invariant.

Suggested change

-                    self._config.max_concurrency or self._limit,
+                    self._config.max_concurrency,

gemini-code-assist · 2026-05-20T14:03:35Z

+        return (
+            self._successes >= self._config.success_window_requests
+            and now - self._last_adjusted_at >= self._config.success_window_seconds
+            and self._limit < (self._config.max_concurrency or self._limit)


Since __init__ ensures self._config.max_concurrency is not None, the or self._limit part is redundant and can be confusing. Simplifying this expression will improve readability by relying on the class invariant.

Suggested change

-            and self._limit < (self._config.max_concurrency or self._limit)
+            and self._limit < self._config.max_concurrency
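
One way to make the invariant the reviewer relies on explicit is to resolve the optional cap once in __init__ and store it in an attribute typed as a plain int. The sketch below is hypothetical: Config, Limiter, can_grow, and the DEFAULT_MAX_CONCURRENCY fallback are illustrative names and values, not taken from the PR. A narrowed attribute may also help a type checker accept the comparison without an `or` fallback.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # None means "the caller did not choose a cap" in this sketch.
    max_concurrency: int | None = None


class Limiter:
    DEFAULT_MAX_CONCURRENCY = 256  # hypothetical fallback, not from the PR

    def __init__(self, config: Config, initial_limit: int = 1) -> None:
        resolved = config.max_concurrency
        if resolved is None:
            resolved = self.DEFAULT_MAX_CONCURRENCY
        if resolved < 1:
            raise ValueError("max_concurrency must be at least 1")
        # Invariant from here on: self._max_concurrency is always an int >= 1.
        self._max_concurrency: int = resolved
        self._limit = max(1, min(initial_limit, resolved))

    def can_grow(self) -> bool:
        # No "or self._limit" fallback is needed; the attribute is never None.
        return self._limit < self._max_concurrency
```

With the `or self._limit` form, a missing cap would make the comparison `self._limit < self._limit`, which is always false, so resolving the value up front also avoids silently disabling growth.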

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 439b2e67ff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

chatgpt-codex-connector · 2026-05-20T14:18:47Z

+                            self._condition.wait(),
+                            timeout=cooldown_remaining,
+                        )
+                    except TimeoutError:


Catch asyncio timeout class in cooldown wait

Use asyncio.TimeoutError (or both exception classes) here instead of built-in TimeoutError. On Python 3.10, asyncio.wait_for(...) raises asyncio.TimeoutError, so this except does not match; when cooldown expires, acquire() raises out of the limiter and fails queued inference requests instead of resuming them. Since pyproject.toml declares requires-python = ">=3.10", this is a real runtime regression for supported environments.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e3998aca0

chatgpt-codex-connector · 2026-05-20T14:28:23Z

+                await adaptive_limiter.record_rate_limit(err.retry_after_seconds)
+            row.log_throughput("rate_limited_requests", 1, unit="requests")
+            row.log_throughput("failed_requests", 1, unit="requests")
+            raise


Retry rate-limited calls instead of re-raising immediately

When rate_limit is enabled, this branch updates the adaptive limiter and then immediately re-raises the GenerationRateLimitError. In the current execution path, uncaught async step exceptions propagate out of map_async and fail the shard/worker, so a single HTTP 429 still aborts the job before the reduced concurrency can stabilize subsequent requests. This makes the new adaptive limiter ineffective for normal generate(...) usage unless every caller adds custom exception handling.
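
The retry behavior requested here could be sketched as a small wrapper around the request call. Everything below is hypothetical apart from the GenerationRateLimitError name and its retry_after_seconds attribute: the function name, the attempt cap, and the exponential fallback delay are assumptions, and the error class is a local stand-in rather than the PR's implementation.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class GenerationRateLimitError(Exception):
    """Local stand-in; only the retry_after_seconds attribute comes from the PR."""

    def __init__(self, message: str, retry_after_seconds: float | None = None) -> None:
        super().__init__(message)
        self.retry_after_seconds = retry_after_seconds


async def call_with_rate_limit_retries(
    send: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry send() on rate-limit errors, preferring the server's Retry-After hint."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await send()
        except GenerationRateLimitError as err:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            delay = err.retry_after_seconds
            if delay is None:
                # No hint from the server: fall back to exponential backoff.
                delay = base_delay * 2 ** (attempt - 1)
            await sleep(min(delay, max_delay))
    raise AssertionError("unreachable")
```

In the PR's code path, the adaptive limiter's record_rate_limit(...) call would presumably still run before each sleep, so the reduced concurrency and the retry work together rather than a single HTTP 429 failing the shard.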

gemini-code-assist Bot reviewed May 20, 2026

Hynek Kydlíček (hynky1999) marked this pull request as ready for review May 20, 2026 14:14

chatgpt-codex-connector Bot reviewed May 20, 2026

Hynek Kydlíček (hynky1999) force-pushed the codex/adaptive-inference-rate-limit branch from 439b2e6 to 0e3998a Compare May 20, 2026 14:23

Hynek Kydlíček (hynky1999) marked this pull request as draft May 20, 2026 14:24

chatgpt-codex-connector Bot reviewed May 20, 2026

Hynek Kydlíček (hynky1999) force-pushed the codex/adaptive-inference-rate-limit branch from 0e3998a to 7416fc0 Compare May 20, 2026 14:53

add adaptive inference rate limiting

e0aeab5

Hynek Kydlíček (hynky1999) force-pushed the codex/adaptive-inference-rate-limit branch from 7416fc0 to e0aeab5 Compare May 21, 2026 09:20

Hynek Kydlíček (hynky1999) added 3 commits May 21, 2026 11:21

Format inference runtime

fb98f7d

Treat provider 503 responses as rate limits

64e9121

Allow OpenAI endpoint API key env override

1f15dd7

Hynek Kydlíček (hynky1999) mentioned this pull request Jun 2, 2026

Fix VLLM inference HTTP connection limits #186

Merged

Hynek Kydlíček (hynky1999) wants to merge 4 commits into main from codex/adaptive-inference-rate-limit
