Skip to content

[codex] add adaptive inference rate limiting#139

Draft
Hynek Kydlíček (hynky1999) wants to merge 4 commits into
mainfrom
codex/adaptive-inference-rate-limit
Draft

[codex] add adaptive inference rate limiting#139
Hynek Kydlíček (hynky1999) wants to merge 4 commits into
mainfrom
codex/adaptive-inference-rate-limit

Conversation

@hynky1999

@hynky1999 Hynek Kydlíček (hynky1999) commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

Makes mdr.inference.generate(...) rate-limit aware by default, with a single rate_limit parameter for choosing adaptive or fixed request concurrency.

Changes

  • Adds mdr.inference.AdaptiveRateLimit, used by default when rate_limit is omitted or None.
  • Adds mdr.inference.StaticRateLimit for the old fixed-concurrency behavior.
  • Removes max_concurrent_requests from generate(...); concurrency caps now live on the rate-limit object.
  • Preserves Retry-After from OpenAI-compatible endpoints through a typed GenerationRateLimitError.
  • Adds focused tests, updates docs, and updates the LeRobot inference example.

Validation

  • uv run ruff check --force-exclude --fix src/refiner/inference tests/test_inference.py tests/test_inference_rate_limit.py examples/lerobot/sarm_annotation.py
  • uv run ruff format --force-exclude src/refiner/inference tests/test_inference.py tests/test_inference_rate_limit.py examples/lerobot/sarm_annotation.py
  • uv run ty check
  • uv run pytest

Full suite: 627 passed.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an adaptive request limiting mechanism for inference providers, allowing the system to dynamically adjust concurrency based on HTTP 429 responses and successful request windows. The changes include the AdaptiveRateLimit configuration, the AdaptiveRateLimiter logic, and integration into the generation pipeline with enhanced metrics and documentation. Reviewer feedback identifies opportunities to improve documentation self-containment, eliminate redundant variable updates in the generation loop, and simplify logic in the rate limiter by leveraging class invariants.

Comment thread docs/inference.md

pipeline = mdr.read_jsonl("input.jsonl").map_async(
mdr.inference.generate(
fn=summarize,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The summarize function is used here but is not defined in this code example. To make the example self-contained and easier for users to understand and copy, please define the summarize function within the python code block, similar to how it's done in the preceding example.

Comment thread src/refiner/inference/generate.py Outdated
Comment on lines +140 to +159
running_requests += 1
try:
response = await client.generate(request_payload)
if permit is not None:
async with permit:
response = await client.generate(request_payload)
else:
response = await client.generate(request_payload)
except GenerationRateLimitError as err:
if adaptive_limiter is not None:
await adaptive_limiter.record_rate_limit(err.retry_after_seconds)
row.log_throughput("rate_limited_requests", 1, unit="requests")
row.log_throughput("failed_requests", 1, unit="requests")
raise
except Exception:
row.log_throughput("failed_requests", 1, unit="requests")
raise
finally:
running_requests -= 1
request_semaphore.release()
if request_semaphore is not None:
request_semaphore.release()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The running_requests variable is updated on every request, even when an adaptive_limiter is in use. In the adaptive case, the running_requests metric already uses adaptive_limiter.running as the source of truth, making the updates to the local running_requests variable redundant and potentially confusing. For better clarity and maintainability, these updates should only happen when using the fixed-concurrency request_semaphore.

        if adaptive_limiter is None:
            running_requests += 1
        try:
            if permit is not None:
                async with permit:
                    response = await client.generate(request_payload)
            else:
                response = await client.generate(request_payload)
        except GenerationRateLimitError as err:
            if adaptive_limiter is not None:
                await adaptive_limiter.record_rate_limit(err.retry_after_seconds)
            row.log_throughput("rate_limited_requests", 1, unit="requests")
            row.log_throughput("failed_requests", 1, unit="requests")
            raise
        except Exception:
            row.log_throughput("failed_requests", 1, unit="requests")
            raise
        finally:
            if adaptive_limiter is None:
                running_requests -= 1
            if request_semaphore is not None:
                request_semaphore.release()

Comment thread src/refiner/inference/rate_limit.py Outdated
now = time.monotonic()
if self._should_grow(now):
next_limit = min(
self._config.max_concurrency or self._limit,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since __init__ ensures self._config.max_concurrency is not None, the or self._limit part is redundant and can be confusing. Simplifying this expression will improve readability by relying on the class invariant.

Suggested change
self._config.max_concurrency or self._limit,
self._config.max_concurrency,

Comment thread src/refiner/inference/rate_limit.py Outdated
return (
self._successes >= self._config.success_window_requests
and now - self._last_adjusted_at >= self._config.success_window_seconds
and self._limit < (self._config.max_concurrency or self._limit)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since __init__ ensures self._config.max_concurrency is not None, the or self._limit part is redundant and can be confusing. Simplifying this expression will improve readability by relying on the class invariant.

Suggested change
and self._limit < (self._config.max_concurrency or self._limit)
and self._limit < self._config.max_concurrency

@hynky1999 Hynek Kydlíček (hynky1999) marked this pull request as ready for review May 20, 2026 14:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 439b2e67ff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

self._condition.wait(),
timeout=cooldown_remaining,
)
except TimeoutError:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Catch asyncio timeout class in cooldown wait

Use asyncio.TimeoutError (or both exception classes) here instead of built-in TimeoutError. On Python 3.10, asyncio.wait_for(...) raises asyncio.TimeoutError, so this except does not match; when cooldown expires, acquire() raises out of the limiter and fails queued inference requests instead of resuming them. Since pyproject.toml declares requires-python = ">=3.10", this is a real runtime regression for supported environments.

Useful? React with 👍 / 👎.

@hynky1999 Hynek Kydlíček (hynky1999) force-pushed the codex/adaptive-inference-rate-limit branch from 439b2e6 to 0e3998a Compare May 20, 2026 14:23
@hynky1999 Hynek Kydlíček (hynky1999) marked this pull request as draft May 20, 2026 14:24

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e3998aca0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread src/refiner/inference/generate.py Outdated
await adaptive_limiter.record_rate_limit(err.retry_after_seconds)
row.log_throughput("rate_limited_requests", 1, unit="requests")
row.log_throughput("failed_requests", 1, unit="requests")
raise

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Retry rate-limited calls instead of re-raising immediately

When rate_limit is enabled, this branch updates the adaptive limiter and then immediately re-raises the GenerationRateLimitError. In the current execution path, uncaught async step exceptions propagate out of map_async and fail the shard/worker, so a single HTTP 429 still aborts the job before the reduced concurrency can stabilize subsequent requests. This makes the new adaptive limiter ineffective for normal generate(...) usage unless every caller adds custom exception handling.

Useful? React with 👍 / 👎.

@hynky1999 Hynek Kydlíček (hynky1999) force-pushed the codex/adaptive-inference-rate-limit branch from 0e3998a to 7416fc0 Compare May 20, 2026 14:53
@hynky1999 Hynek Kydlíček (hynky1999) force-pushed the codex/adaptive-inference-rate-limit branch from 7416fc0 to e0aeab5 Compare May 21, 2026 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant