Skip to content

Add OAuth2 login to clickhouse-client (--login / --login=device)#1606

Merged
zvonand merged 7 commits intoantalya-26.1from
feature/client-IdP
Apr 24, 2026
Merged

Add OAuth2 login to clickhouse-client (--login / --login=device)#1606
zvonand merged 7 commits intoantalya-26.1from
feature/client-IdP

Conversation

@BorisTyshkevich
Copy link
Copy Markdown
Collaborator

@BorisTyshkevich BorisTyshkevich commented Apr 1, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add OAuth2 login flow to clickhouse-client.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

Workflow [PR], commit [5fc4e6b]

Comment thread src/Access/TokenProcessorsOpaque.cpp Outdated
@@ -114,7 +114,7 @@ bool GoogleTokenProcessor::resolveAndValidate(TokenCredentials & credentials) co

auto token_info = getObjectFromURI(Poco::URI("https://www.googleapis.com/oauth2/v3/tokeninfo"), token);
if (token_info.contains("exp"))
credentials.setExpiresAt(std::chrono::system_clock::from_time_t((getValueByKey<time_t>(token_info, "exp").value())));
credentials.setExpiresAt(std::chrono::system_clock::from_time_t(static_cast<time_t>(getValueByKey<double>(token_info, "exp").value())));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exp means the number of seconds since Unix epoch. Usually it is integer, and it is expected to be integer, isn\t it?
At least, Azure and Keycloak stick to integer values.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// picojson only has one numeric type: double
typedef double number_type;

So value.is() returns false, value.is<time_t>() returns false, but value.is() returns true — for any JSON number, including 1719500000.

The original code getValueByKey<time_t>(token_info, "exp") would always fail with "Value for key 'exp' has incorrect type" because picojson doesn't have time_t as a native type. The double fix is actually the
correct fix for picojson's type system.

So I retract my earlier comment. Using double is not "accidentally working" — it's the only way to extract a number from picojson. The static_cast<time_t> then truncates to integer, which is fine since the
value was integer to begin with (JSON 1719500000 → picojson double(1719500000.0) → time_t(1719500000)).

The commit message and the fix are correct. If anything, a brief comment explaining picojson's type model would help future readers:

// picojson stores all JSON numbers as double; cast to time_t for epoch seconds.

But that's a style nit, not a bug.

@BorisTyshkevich
Copy link
Copy Markdown
Collaborator Author

App config for github.demo.altinity.cloud (not a secret!)

$ cat ~/.clickhouse-client/oauth_client-tv.json
{"installed":{"client_id":"925653064731-d1onu9jr6rdbpmeh41ast4t48cppbvvs.apps.googleusercontent.com","project_id":"altinity-mcp-oauth-test","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"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"}}

 $ cat ~/.clickhouse-client/oauth_client.json
{"installed":{"client_id":"925653064731-fnr56jga567r94470drdnjvvf6lbkf0r.apps.googleusercontent.com","project_id":"altinity-mcp-oauth-test","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","client_secret":"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","redirect_uris":["http://localhost"]}}

@svb-alt svb-alt added the antalya label Apr 1, 2026
@zvonand
Copy link
Copy Markdown
Collaborator

zvonand commented Apr 13, 2026

@BorisTyshkevich This PR was made into a stub branch -- I will squash its commits and rebase atop 26.1. I will also make my amendments.
No action needed from your side for now.

@zvonand zvonand force-pushed the feature/client-IdP branch from c78caad to bdae71b Compare April 14, 2026 11:18
Adds two new flags to `clickhouse-client`:
- `--login`: Authorization Code + PKCE flow (RFC 7636), opens browser
- `--login-device`: Device Authorization flow (RFC 8628), prints URL+code

Both flows obtain an ID token from any OpenID Connect provider and then
proceed exactly as if `--jwt <token>` had been passed. No changes to the
connection layer.

The user places a Google-format credentials file at
`~/.clickhouse-client/oauth_client.json` (override with
`--oauth-credentials PATH`). Refresh tokens are cached in
`~/.clickhouse-client/oauth_cache.json` (mode 0600, written atomically)
so that subsequent runs proceed silently without re-authenticating.

`OAuthLogin.cpp` is moved to `src/Client/` (compiled into
`clickhouse_client`) so it is unit-testable via `unit_tests_dbms`.
Tests cover `loadOAuthCredentials` for valid/invalid JSON, missing
required fields, unknown top-level keys, and file-not-found error paths.

Backward compatibility: when `--login` is passed without a credentials
file and without `--login-device`, the existing cloud-specific `login()`
path is used as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix Google JWT exp claim type in TokenProcessorsOpaque

Google ID tokens return the `exp` claim as a JSON number (floating-point
`double`), not an integer. Calling `getValueByKey<time_t>()` instantiated
`picojson::value::is<long>()` / `picojson::value::get<long>()`, which are
not provided by the `picojson` library and caused a linker exception on
arm64 (where `time_t` is `long`).

Fix: cast through `double` first.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix review blockers in OAuth2 login implementation

Blocker 1 — cloud auto-login hijack:
Remove the filesystem::exists(oauth_client.json) auto-detection from the
`--login` processing path. The credentials-file flow is now only entered
when `--login-device` or `--oauth-credentials` is explicitly passed.
`--login` alone (including the auto-added case for *.clickhouse.cloud)
always falls through to the existing cloud login() path. This prevents
breaking cloud auth for users who happen to have a Google credentials
file at the default path.

Restore `--login` as `po::bool_switch()` (was changed to a presence flag)
so that existing scripts using `--login=true` continue to work.

XSS in callback HTML:
The OAuth2 auth-code callback handler reflected the error= query-string
parameter directly into the HTML response body. Add a minimal htmlEscape
helper and use it, preventing a potential XSS via a crafted redirect URI.

Remove unused SUPPORT_IS_DISABLED declaration:
That error code is used in Client.cpp (where it belongs), not in
OAuthLogin.cpp. Remove the dead declaration from the anonymous
ErrorCodes block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Robustness fixes, issuer discovery, state CSRF, and docs

postForm — RFC 6749 error response handling:
Removed the early HTTP-status check that threw before JSON parsing.
RFC 6749 §5.2 returns errors (authorization_pending, slow_down, etc.)
as HTTP 400 with a JSON body. postForm now always attempts JSON parsing
and throws only if the body is non-JSON, which fixes the device-flow
polling bug where authorization_pending caused an exception instead of
being handled by the caller's retry loop.

discoverDeviceEndpoint — sub-path issuer support:
The generic heuristic now strips only the last path segment of token_uri
instead of the entire path, preserving realm prefixes like Keycloak's
/realms/<realm>. Accepts an explicit issuer_hint parameter populated
from the new "issuer" field in the credentials JSON, bypassing the
heuristic entirely for providers that need it.

OAuthCredentials — optional issuer field:
Add std::string issuer to the struct and load it from the credentials
JSON. Providers that use non-root issuers (Keycloak, Okta custom
domains) can now specify issuer directly instead of relying on
heuristic path-stripping.

Auth code flow — CSRF state parameter (RFC 6749 §10.12):
Generate a 16-byte random hex state, include it in the authorization
URL, echo it back via the callback handler, and verify it matches
before proceeding to code exchange.

Logging for silently swallowed exceptions:
readCachedRefreshToken prints a notice when the cache file is
unparseable. tryRefreshToken now prints the rejection reason (expired,
revoked, or network error) before falling through to interactive flow.

Tests: add issuer field, PKCE building-block, base64url property tests.
Docs: fix --login prefix, add --login-device and --oauth-credentials
entries with credentials file format and cache location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge --login-device into --login=device

Replace the separate --login-device flag with a mode parameter on --login:

  --login            auth-code + PKCE flow (browser, default)
  --login=device     Device Authorization flow (prints URL + code)

Uses po::value<std::string>()->implicit_value("browser") so that bare
--login keeps its existing meaning. An invalid mode value throws
BAD_ARGUMENTS immediately. The --login=... prefix is now recognised in
readArguments() as an auth credential indicator.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix --login=device crash and --login=browser routing

Two bugs from testing:

1. --login device crash (Cannot convert to boolean: device):
   argsToConfig() runs after processOptions() and overwrites config["login"]
   with the raw string value from the command line. The subsequent
   config().getBool("login", false) in main() then threw because "device"
   is not a valid boolean.
   Fix: the cloud-login signal is now stored under the separate key
   "cloud_oauth_pending" which argsToConfig never touches. All three
   references to config["login"] (set-true, set-false, get) are updated.

2. --login=browser (or --login browser) fell through to the cloud path:
   Any explicitly specified mode should use the credentials-file OIDC flow,
   since the user is clearly opting into it. Only bare --login (implicit
   empty value) falls back to the ClickHouse Cloud auto-login path.
   Fix: implicit_value changed from "browser" to ""; use_credentials_file
   now triggers on !login_mode.empty() instead of mode == "device" only.

Routing table after this change:
  --login                → ClickHouse Cloud path (backward compat)
  --login=browser        → OIDC auth-code + PKCE (credentials file)
  --login=device         → OIDC device flow (credentials file)
  --login=<other>        → BAD_ARGUMENTS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix device flow crash on Google verification_url field name

Google returns verification_url (legacy) instead of verification_uri (RFC 8628).
getValue on a missing key threw 'Can not convert empty value'.

Fallback order: verification_uri_complete -> verification_uri -> verification_url.
Also guard device_code/user_code with explicit has() checks before access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Document --login/--oauth-credentials and add OAuth option validation tests

- Fix `--login[=<mode>]` doc entry: bare `--login` = Cloud auto-login (no
  default mode implied); `--login=browser` / `--login=device` trigger the
  OIDC credentials-file flow
- Update `--oauth-credentials` description and link to new section
- Add `### OAuth credentials file` section: JSON format example, required
  and optional fields table, default path, cache location
- Add Tests 9–11 to `03749_cloud_endpoint_auth_precedence.sh`:
  - `--login=device --oauth-credentials /nonexistent` → file-not-found error
  - `--login=invalid` → `BAD_ARGUMENTS` ("must be 'browser' or 'device'")
  - `--jwt tok --login=browser` → `BAD_ARGUMENTS` ("cannot both be specified")
- Update `.reference` with expected output for new tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix security and reliability issues in OAuth2 login

Must-fix:
- Session token refresh: add OAuthJWTProvider extending JWTProvider so
  Connection::sendQuery can call getJWT() transparently; eliminates the
  1-hour session limit when id_token is obtained only once at startup
- Bind callback server to 127.0.0.1 explicitly (not 0.0.0.0) to prevent
  network-adjacent attackers from racing to deliver a forged callback
- Add offline_access scope to both auth-code and device flows so that
  standard OIDC providers (non-Google) issue refresh tokens

Should-fix:
- Guard postForm against non-object JSON: separate JSON parse error from
  Poco extract<Object::Ptr> BadCastException with a clearer message
- Warn on plain http:// token/auth endpoints in loadOAuthCredentials

Also fix include order: OAuthLogin.h now includes <config.h> first so
the USE_JWT_CPP && USE_SSL guard evaluates correctly regardless of the
order headers are included by callers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix Google OAuth scope: use access_type=offline instead of offline_access

Google rejects offline_access as an invalid scope (Error 400: invalid_scope).
It uses the access_type=offline query parameter in the authorization URL
instead. Standard OIDC providers still require offline_access in the scope.

Add isGoogleProvider helper (detected via token_uri host) and apply the
correct mechanism for each: Google gets access_type=offline appended to
the authorization URL with openid email profile scope; all other providers
get offline_access in the scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Split OAuthJWTProvider into its own file; fix silent cache catch

- Move OAuthJWTProvider class and createOAuthJWTProvider factory into
  OAuthJWTProvider.cpp. The class delegates entirely to obtainIDToken
  (the public API) so it has no anonymous-namespace dependencies and
  can live independently.  OAuthLogin.cpp loses ~60 lines and its
  JWTProvider.h include.
- Fix silent catch(...) in writeCachedRefreshToken: log a warning when
  the existing cache file cannot be parsed (mirrors the read-side
  warning in readCachedRefreshToken).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add OAuth2 integration tests and fix remaining review items

**Layer 2 — Keycloak integration tests**

New test suite `tests/integration/test_keycloak_auth/` exercises
Keycloak 26.0 as a real OIDC provider:

- `test_jwt_dynamic_jwks` — token validated via explicit JWKS URI
- `test_openid_discovery` — token validated via OIDC discovery document
- `test_username_claim` — `preferred_username` claim maps to ClickHouse user
- `test_token_refresh` — `refresh_token` grant produces a valid `id_token`
- `test_wrong_issuer_rejected` — tampered `iss` claim is rejected
- `test_expired_token_rejected` — expired `exp` claim is rejected
- `test_device_flow_initiation` — device auth endpoint returns correct fields
- `test_device_flow_round_trip` — full RFC 8628 round-trip: initiate →
  simulate browser approval via HTML form sequence → poll token endpoint →
  authenticate ClickHouse `SELECT 1` with the resulting `id_token`

Infrastructure additions:
- `tests/integration/compose/docker_compose_keycloak.yml` — Keycloak 26.0
  service with `--import-realm` and a health-check on the OIDC discovery URL
- `tests/integration/helpers/cluster.py` — `with_keycloak` flag following
  the existing `with_ldap` pattern: `setup_keycloak_cmd`,
  `wait_keycloak_to_start`, `get_keycloak_url`, docker-compose wiring, and
  `depends_on` entry in instance compose generation
- `tests/integration/test_keycloak_auth/keycloak/realm-export.json` —
  pre-configured realm with client `clickhouse` (direct grants + device
  auth enabled), user `alice`, group `analysts`, and a `groups` claim mapper
- `tests/integration/test_keycloak_auth/configs/validators.xml` —
  `jwt_dynamic_jwks` and `openid` token processors pointing at Keycloak
- `tests/integration/test_keycloak_auth/configs/users.xml` — default user
  with `access_management` enabled

**Fix: `--oauth-credentials` without `--login` silently ignored**

Add an explicit guard before the `if (options.count("login"))` block in
`programs/client/Client.cpp` that throws `BAD_ARGUMENTS` when
`--oauth-credentials` is supplied without `--login=browser` or
`--login=device`.

**Fix: dead CMake and duplicate jwt-cpp linkage in `src/CMakeLists.txt`**

Remove the dead `add_object_library(clickhouse_client_jwt Client/jwt)` block
(the `src/Client/jwt/` directory does not exist) and the duplicate
`target_link_libraries(clickhouse_common_io PUBLIC ch_contrib::jwt-cpp)` at
the Redis block — the identical linkage is already present earlier in the file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix blockers in Keycloak integration tests

- `query_with_token`: replace `node.query(..., headers=...)` (unsupported
  parameter) with `node.http_request("", method="POST", data=query,
  headers=...)`, which uses the HTTP interface and accepts custom headers.

- `get_keycloak_url`: return `http://localhost:{port}` instead of
  `http://keycloak:{port}`.  The `keycloak` hostname is only resolvable
  inside the Docker network; pytest runs on the host and connects via the
  mapped port.  `wait_keycloak_to_start` already used `localhost` correctly.

- `configs/users.xml`: add user `alice` with `<jwt>` authentication so
  that `test_username_claim` (and any test that maps `preferred_username`
  to a ClickHouse session user) can find the user in the user store.

- Remove unused `import copy` from `test_wrong_issuer_rejected` and
  `test_expired_token_rejected`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix OAuth HTTPS sessions to respect configured SSL context

postForm and discoverDeviceEndpoint in OAuthLogin.cpp constructed
Poco::Net::HTTPSClientSession with the bare two-argument constructor,
which picks up Poco's built-in default SSL context (system CA bundle,
full verification) and ignores any openSSL.client.* configuration the
user has set, including custom CA certificates (caConfig) and
verificationMode.

In a corporate environment with a TLS inspection proxy, the proxy CA
is trusted only through the client configured CA bundle. With the old
code, every OAuth HTTPS request — OIDC discovery, token endpoint,
postForm — would fail with a certificate verification error, while the
upstream Cloud login path (which uses SSLManager::instance().defaultClientContext())
worked fine.

Fix: pass Poco::Net::SSLManager::instance().defaultClientContext() to
both HTTPSClientSession constructors, matching the pattern already used
in JWTProvider::createHTTPSession. This makes OAuth traffic respect
openSSL.client.caConfig, verificationMode, and every other SSL setting
the user has configured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zvonand zvonand changed the base branch from antalya-26.2 to antalya-26.1 April 14, 2026 11:20
@zvonand zvonand force-pushed the feature/client-IdP branch from bdae71b to 67f019f Compare April 14, 2026 11:21
@zvonand zvonand added antalya-26.3 port-antalya PRs to be ported to all new Antalya releases labels Apr 19, 2026
zvonand
zvonand previously approved these changes Apr 23, 2026
@Selfeer
Copy link
Copy Markdown
Collaborator

Selfeer commented Apr 23, 2026

Audit: PR #1606 — Add OAuth2 login to clickhouse-client (--login / --login=device)

AI audit note: This review was generated by AI (Composer/Claude Opus 4.7) following .cursor/skills/audit-review/SKILL.md and .cursor/rules/audit-review.mdc.

  • Branch: feature/client-IdPantalya-26.1
  • URL: Add OAuth2 login to clickhouse-client (--login / --login=device) #1606
  • Scope reviewed: new credentials-file OIDC login path for clickhouse-client (--login=browser|device, --oauth-credentials), refresh-token cache on disk, OAuthJWTProvider wiring into Connection, argument handling/auto-add interaction with cloud endpoints, and minor change to GoogleTokenProcessor::resolveAndValidate (exp claim parsing).

Confirmed defects

High

H1 — Local HTTP auth-code callback binds in a shared port namespace and any local process can observe code/state; combined with attacker-controllable interval/expires_in in device flow the client can be made to tight-poll or sleep indefinitely

This is two separate items consolidated under one root cause class ("trust of untrusted peer values without validation"). Split below as H1a (device flow) is the clearer correctness/DoS hazard; the callback port item is reported separately as M2 because the state CSRF check mitigates exfiltration.

H1a — runOAuthDeviceFlow trusts server-supplied interval / expires_in without validation → infinite-loop / tight-poll / sleep-forever
  • Impact: a hostile or misconfigured token endpoint can force clickhouse-client into: (a) a busy loop hammering the token endpoint (when interval <= 0), (b) a single multi-hour sleep_for that ignores user Ctrl-C and blocks the whole client process (when interval is huge), or (c) a polling loop that never terminates because expires_in overflows the steady_clock arithmetic.
  • Anchor: src/Client/OAuthFlowRunner.cpprunOAuthDeviceFlow.
  • Trigger: device endpoint response returns "interval": -1 (or 0, or "interval": 99999999) and/or "expires_in": 9999999999.
  • Why defect: the code feeds these values directly into std::this_thread::sleep_for(std::chrono::seconds(interval)) and into steady_clock::now() + std::chrono::seconds(expires_in). Neither path is Ctrl-C-interruptible, neither bounds the value, and on slow_down the handler does interval += 5 on top of the server-supplied value.
  • Evidence:
    int interval = device_resp->has("interval") ? device_resp->getValue<int>("interval") : 5;
    int expires_in = device_resp->has("expires_in") ? device_resp->getValue<int>("expires_in") : 300;

    std::cerr << "\nTo authenticate, visit:\n  " << verification_uri << "\nAnd enter code: " << user_code << "\n\n";

    const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(expires_in);
    while (std::chrono::steady_clock::now() < deadline)
    {
        std::this_thread::sleep_for(std::chrono::seconds(interval));
  • Fix direction: clamp interval to [1, 60] and expires_in to [60, 1800]; treat out-of-range as protocol error.
  • Regression test direction: mock device endpoint returning interval=-1, interval=999999, expires_in=-1; assert AUTHENTICATION_FAILED and bounded wall-time.
  • Blast radius: any non-Google provider configured via --oauth-credentials, including an attacker whose only foothold is DNS poisoning against the device endpoint.

Medium

M1 — Refresh-token cache write is not crash-safe or concurrency-safe and silently loses/corrupts tokens

  • Impact: two concurrent clickhouse-client invocations using the same client_id (common for a single human user opening two shells) can produce a corrupted cache file or silently drop each other's freshly-issued refresh tokens, forcing the user into an interactive re-auth loop. On disk full / read-only home, the write is swallowed with no diagnostic.
  • Anchor: src/Client/OAuthLogin.cppwriteCachedRefreshToken.
  • Trigger:
    1. Two clients do read → modify → rename against a fixed path + ".tmp" concurrently. The later rename wins, discarding the earlier writer's new entries.
    2. Two std::ofstream out(tmp_path, std::ios::trunc) opens interleave → corrupt JSON in .tmp, and because fs::rename(tmp_path, cache_path, ec) uses the error-code overload and ec is never inspected, a failed rename leaves cache_path stale and .tmp leaking with no log message.
    3. if (!out.is_open()) return; silently no-ops when ~/.clickhouse-client is unwritable.
  • Why defect: OAuth refresh tokens are long-lived and privileged; a silent write loss forces re-auth and the fixed tmp name makes the race easy to trigger (two clickhouse-client sessions is a realistic workload).
  • Evidence:
    const std::string tmp_path = path + ".tmp";
    {
        std::ofstream out(tmp_path, std::ios::trunc);
        if (!out.is_open())
            return;
        Poco::JSON::Stringifier::stringify(obj, out);
    }

    std::error_code ec;
    fs::permissions(tmp_path, fs::perms::owner_read | fs::perms::owner_write, fs::perm_options::replace, ec);
    fs::rename(tmp_path, cache_path, ec);
  • Fix direction: create the tmp file with mkstemp (process-unique suffix) and O_EXCL, take an advisory flock() on cache_path across the read-modify-write, log a single std::cerr line on rename/permission failure instead of dropping silently.
  • Regression test direction: fork two processes that both call writeCachedRefreshToken with distinct client_ids and assert both entries are present after both return.

M2 — Auth-code callback server accepts any path/method and CSRF state is the only mitigation; no request-origin/path validation

  • Impact: any local process on the user's machine can fire GET / (or any path) at the ephemeral callback port and drive the auth-code flow as long as it can guess the 128-bit state. The handler does not check request method, path, Host header, or that the request actually originates from the loopback adapter (bind(127.0.0.1, 0) does mitigate remote origin, so this is bounded to local attackers).
  • Anchor: src/Client/OAuthFlowRunner.cppAuthCodeHandler::handleRequest.
  • Trigger: local attacker running as same UID can read /proc/<client_pid>/cmdline (or observe the browser URL) to grab state and then POST/GET to the callback port.
  • Why defect: defense-in-depth against local attackers on shared workstations is weakened; the handler writes into state.done and unblocks the main thread even if the request does not hit /callback. With state check intact this is not a full auth bypass, but it is a concrete deviation from the contract ("only accept the redirect we sent to the IdP").
  • Evidence:
    void handleRequest(Poco::Net::HTTPServerRequest & request, Poco::Net::HTTPServerResponse & response) override
    {
        Poco::URI uri("http://localhost" + request.getURI());
        const auto params = uri.getQueryParameters();
        ...
        response.setStatus(Poco::Net::HTTPResponse::HTTP_OK);
        response.setContentType("text/html");
        ...
        std::lock_guard<std::mutex> lock(state.mtx);
        state.code = code;
        state.error = error;
        state.received_state = received_state;
        state.done = true;
        state.cv.notify_one();
    }
  • Fix direction: reject requests where request.getMethod() != "GET" or uri.getPath() != "/callback" with HTTP 400; do not notify the condition variable for such requests.
  • Regression test direction: unit test the handler with GET /unexpected and with POST /callback; both must leave state.done == false.

M3 — --jwt / --login precedence check is only triggered when both appear on argv; config-file-provided jwt is not guarded

  • Impact: clickhouse-client with --login=browser and jwt set in the user config file silently ignores the config jwt and runs the interactive OAuth flow, but if a future code path or operator expects --login to refuse mixed auth, this silent precedence change is surprising. Conversely the if (options.count("jwt") && options.count("login")) check at processOptions time throws BAD_ARGUMENTS only when both come from the CLI; the pre-existing if (config().has("jwt")) guard that used to live inside the --login branch was removed:
-        if (config().has("jwt"))
-            throw Exception(ErrorCodes::BAD_ARGUMENTS, "JWT and login flags can't be specified together");
-        config().setBool("login", true);
+    if (options.count("jwt") && options.count("login"))
+        throw Exception(ErrorCodes::BAD_ARGUMENTS, "--jwt and --login cannot both be specified");
  • Anchor: programs/client/Client.cppClient::processOptions and the pre-existing if (config().getBool("cloud_oauth_pending", false) && !config().has("jwt")) in main(). Note that in the new credentials-file branch there is no !config().has("jwt") short-circuit — jwt_provider->getJWT() is called and its output overwrites whatever the user put in --config.
  • Trigger: user has <jwt>...</jwt> in ~/.clickhouse-client/config.xml and passes --login=browser on the CLI.
  • Why defect: the old code blocked this combination explicitly; the new code silently prefers the interactive flow, which is a security-relevant behavior change (a stale/leaked config-file JWT vs an interactive human login) and is undocumented.
  • Fix direction: restore if (config().has("jwt")) check inside the --login branch, or document precedence explicitly.
  • Regression test direction: shell test with config containing <jwt>dummy</jwt> + --login=browser, assert BAD_ARGUMENTS.

Low

L1 — tryRefreshToken does not evict a rejected refresh token from the on-disk cache

  • Impact: once a refresh token is revoked by the IdP, every subsequent clickhouse-client invocation repeats the wasted network round-trip to the token endpoint before falling back to the interactive flow. Minor latency/UX, plus a log line written to std::cerr on every startup.
  • Anchor: src/Client/OAuthLogin.cpptryRefreshToken.
  • Trigger: revoke the cached refresh token at the IdP, then run clickhouse-client --login=device repeatedly.
  • Evidence:
        if (resp->has("error"))
        {
            std::cerr << "Note: cached refresh token was rejected ("
                      << resp->getValue<std::string>("error")
                      << "); re-authenticating.\n";
            return "";
        }

(no call to writeCachedRefreshToken(client_id, "") or cache-entry removal)

  • Fix direction: clear/remove the entry via a dedicated helper when error is invalid_grant.
  • Regression test direction: mock token endpoint that returns invalid_grant; assert the cache entry for client_id is removed after one call.

L2 — openBrowser ignores the exit status of xdg-open/open, leaving the user staring at a silent terminal when no browser is installed

  • Impact: on a headless host without a configured xdg-open the client hangs for 120s at the browser-callback wait with no actionable message. Mitigated by the pre-print of the URL, so user can still copy/paste.
  • Anchor: src/Client/OAuthFlowRunner.cppopenBrowser.
  • Trigger: run --login=browser on a machine with no graphical session and no xdg-open in $PATH.
  • Evidence:
    pid_t pid;
    if (posix_spawnp(&pid, cmd, nullptr, nullptr, const_cast<char * const *>(argv), nullptr) == 0)
        waitpid(pid, nullptr, 0);
  • Fix direction: inspect waitpid status and, if non-zero, print "Unable to launch a browser; please copy the URL above manually.".
  • Regression test direction: not easily automatable; documentation-only is acceptable.

L3 — verification_uri selection silently accepts verification_url (Google) but the code prints it with \nTo authenticate, visit: even when the field is missing both _complete and plain — in that case the ternary throws an exception that is propagated as-is, but the exception message does not include the offending response body

  • Impact: diagnostic; operator can't tell which field the provider omitted.
  • Anchor: src/Client/OAuthFlowRunner.cpp at the chained-ternary starting line 384.
  • Fix direction: include the stringified device_resp in the exception message (as the adjacent device_code/user_code branch already does).

L4 — GoogleTokenProcessor::resolveAndValidate silently narrows exp from double to time_t

  • Impact: the change getValueByKey<double>(token_info, "exp").value() cast to time_t preserves precision for realistic Unix timestamps but loses the signaling-NaN / negative-value error path that getValueByKey<time_t> used to provide (invalid JSON numeric would now produce whatever static_cast<time_t>(NaN) yields — implementation-defined).
  • Anchor: src/Access/TokenProcessorsOpaque.cpp.
  • Evidence:
    if (token_info.contains("exp"))
        credentials.setExpiresAt(std::chrono::system_clock::from_time_t(static_cast<time_t>(getValueByKey<double>(token_info, "exp").value())));
  • Fix direction: validate std::isfinite(exp) and exp > 0 before casting.
  • Regression test direction: feed a Google tokeninfo mock with "exp": "not-a-number" and "exp": -1; assert rejection.

Coverage summary

  • Scope reviewed:
    • programs/client/Client.cpp — argument parsing, --login[=mode], --oauth-credentials, cloud auto-add interaction, cloud_oauth_pending config-key rename, login() / main() flow.
    • src/Client/OAuthLogin.{h,cpp} — credentials loader, refresh-token disk cache (read/write/rename), obtainIDToken dispatch.
    • src/Client/OAuthFlowRunner.{h,cpp} — PKCE, CSRF state, local callback HTTP server, auth-code flow, device flow, postOAuthForm, urlEncodeOAuth.
    • src/Client/OAuthJWTProvider.cppJWTProvider subclass, expiry-buffer refresh.
    • src/Client/OAuthProviderPolicy.{h,cpp} — policy dispatch, OIDC discovery document fetch.
    • src/Access/TokenProcessorsOpaque.cppexp claim parsing change.
    • Shell test 03749_cloud_endpoint_auth_precedence.sh and integration test test_keycloak_auth/.
  • Categories failed (defects confirmed):
    • Input validation on server-supplied device-flow timing (H1a).
    • Filesystem concurrency / crash-safety for refresh-token cache (M1).
    • Local-attacker surface on auth-code callback server (M2).
    • CLI/config precedence regression vs prior --login branch (M3).
    • Diagnostics / state-consistency on rejected refresh tokens and missing-browser path (L1, L2).
    • Error-contract consistency on provider-field absence (L3) and numeric-claim narrowing (L4).
  • Categories passed:
    • PKCE verifier/challenge generation (RFC 7636 §4 compliant; unit tested).
    • CSRF state verification in auth-code flow (128 bits, compared in constant-time-ish path via !=).
    • OAuth callback server lifetime vs AuthCodeState: HTTPServer is destructed before state (reverse-scope order) and Poco's TCPServer::~TCPServer() joins worker threads, so no use-after-free on in-flight handlers.
    • --user / --jwt / --login mutual exclusion on the CLI path (partially — see M3 for the config-file gap).
    • Provider policy dispatch (Google vs generic) and discovery-document error propagation.
    • OAuthJWTProvider::getJWT expiry check honors a 30-second buffer consistent with Connection::sendQuery's own check.
    • Cloud-endpoint auto-add gating on --user/--password/--jwt/--ssh-key-file/--login and on connection-string userinfo.
    • urlEncodeOAuth delegates to Poco::URI::encode — no double-encoding or injection in the token request body.
    • HTTPS path in postOAuthForm uses SSLManager::defaultClientContext() (honors --accept-invalid-certificate when set elsewhere in the client).
  • Assumptions / limits: static reasoning only; Poco TCPServer::~TCPServer() join semantics assumed per upstream documentation; no runtime fault-injection performed; non-Google providers other than Keycloak not exercised; Windows openBrowser path (absent) not analyzed because the conditional compilation excludes it.

@Selfeer
Copy link
Copy Markdown
Collaborator

Selfeer commented Apr 23, 2026

Usability Audit: OAuth2 login for clickhouse-client (PR #1606)

Scope: Altinity/ClickHouse PR #1606 (Add OAuth2 login to clickhouse-client (--login / --login=device)), head ce914b1b07bcd9f298b24ab94d451530dc4904c3
Persona: Operator / DBA
Environment: Linux/macOS CLI clients, ClickHouse Cloud and custom OIDC providers over HTTPS

Primary flows (ranked by expected usage)

  1. Run clickhouse-client --login against *.clickhouse.cloud and complete Cloud auto-login with no extra OAuth flags.
  2. Run clickhouse-client --login=browser --oauth-credentials <path> against a custom OIDC provider and complete Authorization Code + PKCE in the browser.
  3. Run clickhouse-client --login=device --oauth-credentials <path> and complete Device Authorization on a second device.
  4. Keep a long-lived client session active while id_token refreshes through cached refresh_token.

Issues

Flow 2: Browser OIDC login

2.1 Callback host/address mismatch can cause false timeout

  • Symptom: User completes login in browser but client still fails with OAuth2 login timed out waiting for browser callback (src/Client/OAuthFlowRunner.cpp:275, src/Client/OAuthFlowRunner.cpp:278, src/Client/OAuthFlowRunner.cpp:315).
  • Trigger: Local callback server binds only 127.0.0.1, while OAuth redirect URI is http://localhost:<port>/callback; on stacks where localhost resolves to ::1 first, callback never reaches the bound socket.
  • Signal quality: Confusing.
  • Severity: High.
  • Recovery: Self-service (retry with --login=device or host/network workaround), but root cause is non-obvious.

2.2 Public-client OIDC setups are blocked by mandatory client_secret

  • Symptom: Credential loading fails for valid OIDC public-client configs that do not carry client_secret (src/Client/OAuthLogin.cpp:217, docs/en/interfaces/cli.md in #1606 @ ce914b1b07bcd9f298b24ab94d451530dc4904c3).
  • Trigger: loadOAuthCredentials() hard-requires client_secret for all modes, including PKCE and device-flow paths that can be configured as public clients.
  • Signal quality: Clear.
  • Severity: Medium.
  • Recovery: Self-service only by reconfiguring IdP client as confidential (not always allowed by org policy).

Flow 3: Device-flow login

3.1 Polling has no progress signal, looks stuck during normal wait

  • Symptom: After printing URL/code once, client is silent until success or timeout, so users cannot distinguish "waiting for approval" from "hung" (src/Client/OAuthFlowRunner.cpp:398, src/Client/OAuthFlowRunner.cpp:401, src/Client/OAuthFlowRunner.cpp:435).
  • Trigger: Repeated authorization_pending / slow_down responses during normal RFC 8628 polling do not emit status updates.
  • Signal quality: Silent.
  • Severity: Medium.
  • Recovery: Self-service (wait/retry), but requires guesswork about whether flow is alive.

Flow 4: Long-lived sessions with refresh token reuse

4.1 Refresh-token cache write failures are silent and cause repeated re-auth prompts

  • Symptom: Session works now, but next runs unexpectedly re-enter interactive login because cache was not persisted; no warning is emitted (src/Client/OAuthLogin.cpp:121, src/Client/OAuthLogin.cpp:123, src/Client/OAuthLogin.cpp:127, src/Client/OAuthLogin.cpp:129).
  • Trigger: Cache file create/open/rename/permission operations fail (~/.clickhouse-client/oauth_cache.json path or filesystem restrictions).
  • Signal quality: Silent.
  • Severity: Medium.
  • Recovery: Runbook (inspect file permissions/fs errors), often needs operator debugging.

Cross-flow issues (apply to all primary flows)

X.1 Non-object JSON responses can leak low-level parser errors

  • Symptom: User can receive Poco type-cast exceptions instead of OAuth endpoint context when server returns valid JSON that is not an object (src/Client/OAuthFlowRunner.cpp:245).
  • Trigger: OAuth endpoints or intermediaries return JSON arrays/scalars for failures.
  • Signal quality: Confusing.
  • Severity: Low.
  • Recovery: Engineer required for robust error normalization.

Readiness verdict

It is normal for individual flows to be ready in isolation while the overall production verdict is "ready with caveats (guardrails required)". In that case, the Operator guidance section below defines the required guardrails.

Operator guidance (until fixes land)

  • Prefer --login=device when browser callback reliability is uncertain on dual-stack hosts (Update README.md #2.1).
  • Standardize IdP client provisioning to include client_secret before rollout; treat public-client JSON as unsupported for now (Update README.md #2.2).
  • During device flow, document expected wait windows and instruct users not to abort before expires_in unless explicit error appears (Update README.md #3.1).
  • Add a local runbook check for ~/.clickhouse-client/oauth_cache.json ownership/mode and writable parent directory when repeated login prompts occur (Update README.md #4.1).
  • If OAuth errors look non-actionable, capture full stderr plus endpoint responses and escalate as parser-shape handling issue (X.1).

Recommended fixes

  • Align callback binding and redirect host (bind localhost dual-stack or emit redirect URI with explicit loopback address matching the bind target) (Update README.md #2.1).
  • Make client_secret optional when provider/client type allows public PKCE/device flow, with explicit validation rules by grant type (Update README.md #2.2).
  • Emit periodic polling status (pending, slow_down backoff, remaining time) in device flow to reduce "stuck" ambiguity (Update README.md #3.1).
  • Log write/rename/permission failures when persisting refresh-token cache and surface remediation hints (Update README.md #4.1).
  • Wrap non-object JSON parse paths in stable OAuth error messages with endpoint + status context (X.1).

@Selfeer
Copy link
Copy Markdown
Collaborator

Selfeer commented Apr 24, 2026

@zvonand here is another round of audit review, after your fixes. If those are not relevant and don't need fixing - I can proceed with verification.

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1606 (OAuth2 login for clickhouse-client):

Confirmed defects

Medium: Loopback /start endpoint leaks full auth URL (including CSRF state) to any local process

Impact: a local attacker process can read the in-flight OAuth state and race a forged callback, causing account-confusion login hijack on shared hosts.
Anchor: src/Client/OAuthFlowRunner.cpp (AuthCodeHandler::handleRequest, runOAuthAuthCodeFlow).
Trigger: during --login=browser, attacker discovers the temporary loopback port, calls /start, then sends /callback?code=...&state=... before the legitimate redirect.
Why defect: callback acceptance relies on received_state == csrf_state, but /start redirects to state.auth_url, which exposes that same csrf_state to any local caller.

if (path == "/start")
{
    std::string target;
    {
        std::lock_guard<std::mutex> lock(state.mtx);
        target = state.auth_url;
    }
    response.redirect(target, Poco::Net::HTTPResponse::HTTP_FOUND);
}
std::string auth_url
    = creds.auth_uri
    + "?response_type=code"
      // ...
    + "&state=" + csrf_state;

Affected transition: browser launch -> local redirect handling -> callback validation -> token exchange.
Fault-injection mapping: injected condition = unsolicited local /start; observed effect = attacker obtains valid state for callback race.
Smallest reproduction: start clickhouse-client --login=browser; local process calls /start; attacker completes auth with leaked state; attacker callback arrives first.
Fix direction (short): do not expose auth URL/state on unauthenticated /start; use one-time unguessable launch token or remove server-side redirect indirection.
Regression test direction (short): loopback-handler test that unsolicited /start does not return a state-bearing redirect and cannot enable attacker callback completion.
Affected subsystem and blast radius: client OAuth auth-code flow; impacts local multi-process environments running clickhouse-client.

Medium: OIDC discovery response in provider policy is unbounded (memory-exhaustion DoS path)

Impact: a malicious or misconfigured discovery endpoint can force unbounded memory growth in clickhouse-client by returning an arbitrarily large response body.
Anchor: src/Client/OAuthProviderPolicy.cpp (fetchDeviceEndpointFromIssuer).
Trigger: --login=device with discovery enabled (no device_authorization_uri) against an issuer that serves a huge .well-known/openid-configuration payload.
Why defect: discovery uses unbounded copyToString and parses afterward; no size guard exists in this path.

auto & stream = session.receiveResponse(response);
Poco::StreamCopier::copyToString(stream, body);
// ...
auto result = parser.parse(body);

Affected transition: device flow init -> issuer discovery fetch -> parse discovery document.
Fault-injection mapping: injected condition = oversized discovery body; observed effect = heap growth / potential OOM termination.
Smallest reproduction: configure issuer to attacker-controlled endpoint; serve very large discovery response; run --login=device; observe memory growth while buffering body.
Fix direction (short): apply the same bounded-copy pattern used in postOAuthForm (hard max bytes + explicit AUTHENTICATION_FAILED on overflow).
Regression test direction (short): mock discovery endpoint returning payload > max size and assert bounded failure without large RSS growth.
Affected subsystem and blast radius: OAuth provider policy/discovery; affects device-flow startup for generic and Google discovery paths.

Coverage summary

Scope reviewed: programs/client/Client.cpp, src/Client/OAuthFlowRunner.{h,cpp}, src/Client/OAuthLogin.{h,cpp}, src/Client/OAuthJWTProvider.cpp, src/Client/OAuthProviderPolicy.{h,cpp}, src/Access/TokenProcessorsOpaque.cpp, and PR tests/docs relevant to OAuth flow behavior.

Categories failed: loopback callback hardening against local-process adversary; discovery endpoint resource-bounds enforcement.

Categories passed: cloud/login routing and jwt/login exclusion, device-flow interval/expiry validation and clamping, refresh-token cache concurrency/crash-safety and invalid_grant eviction, callback method/path validation and first-writer-wins state handling, token exp finite/range validation.

Assumptions/limits: static analysis only (no runtime exploit harness); local-attacker model assumes same-host unprivileged process can reach loopback listener during active login.

@zvonand
Copy link
Copy Markdown
Collaborator

zvonand commented Apr 24, 2026

these findings will be addressed in a follow up -- they are not critical for this release

@zvonand zvonand merged commit 67683cd into antalya-26.1 Apr 24, 2026
219 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.1 antalya-26.1.11.20001 port-antalya PRs to be ported to all new Antalya releases verified Approved for release verified-with-issues Verified by QA and issues found.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants