Reissued certificates occasionally have authority_id=None #5447

@maperu

Symptom

Some certificates created via the reissue path end up persisted with authority_id=None despite the input schema requiring authority. Once a cert is in this state, all subsequent reissue attempts fail because the rotation pipeline can't determine which CA to talk to.

Affected certs were reissued in distinct date clusters rather than continuously, which suggests transient conditions during particular celery cycles. Predecessors in the rotation chain have the correct authority_id set; the broken cert is generated at a specific reissue, and every later reissue in that chain inherits the broken state.
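To pin down where a chain first went bad, the lineage can be walked backwards from a broken cert. The sketch below is only an illustration on plain dicts: the record layout and the `replaces` key are hypothetical stand-ins for however a deployment exports the predecessor link, and are not Lemur's actual API.

```python
from typing import Dict, Optional


def first_broken_generation(certs: Dict[int, dict], cert_id: int) -> Optional[int]:
    """Return the oldest cert in the unbroken run of authority-less certs
    that ends at cert_id, or None if cert_id itself has an authority.

    `certs` maps cert id -> {"authority_id": int | None, "replaces": int | None}.
    Both keys are hypothetical names used for this sketch.
    """
    oldest_broken = None
    seen = set()  # guards against accidental cycles in exported data
    current = cert_id
    while current is not None and current not in seen:
        seen.add(current)
        record = certs[current]
        if record["authority_id"] is not None:
            break  # reached a healthy predecessor; stop walking
        oldest_broken = current
        current = record.get("replaces")
    return oldest_broken


# Toy data for illustration only (not taken from a real deployment).
chain = {
    1: {"authority_id": 7, "replaces": None},
    2: {"authority_id": 7, "replaces": 1},
    3: {"authority_id": None, "replaces": 2},
    4: {"authority_id": None, "replaces": 3},
}
print(first_broken_generation(chain, 4))  # -> 3
```

The creation timestamp of the cert this returns is the one worth correlating with celery logs, since later generations only inherit the state.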

Why I think the bug is real

In commit e874f83c ("various fixups to disable autorotate task", 2025-05-22), the codebase added defensive None-handling for cert.authority_id in logging:

log_data["authority_name"] = "unknown"
authority = authorities_get_by_id(cert.authority_id)
if authority:
    log_data["authority_name"] = authority.name

That patch makes the logger not crash when a cert has authority_id=None, but doesn't address why a cert can reach that state in the first place. The change pretty strongly implies the maintainers have seen these certs in their own deployment.

Expected vs actual

CertificateInputSchema.authority is fields.Nested(AssociatedAuthoritySchema, required=True) (lemur/certificates/schemas.py:105), and validate_authority at lemur/certificates/schemas.py:154-163 raises ValidationError if authority is missing or inactive. So a cert created through service.create() with validate_schema(certificate_input_schema, ...) should never lack an authority.

The reissue_certificate path goes through get_certificate_primitives() (lemur/certificates/service.py:942), which round-trips CertificateOutputSchema → CertificateInputSchema and asserts not ser.errors. So the reissue path should also enforce authority.

Both paths first run the constructor, which sets self.authority_id = kwargs.get("authority_id"). That value is None, since the schema produces authority as an Authority object, not authority_id as an integer key. Both paths then set cert.authority = kwargs["authority"] on line 554. The relationship assignment should populate the FK on flush.
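The expected behaviour can be checked in isolation. The following is a minimal sketch using stand-in models (not Lemur's real ones) against in-memory SQLite, assuming SQLAlchemy 1.4 or newer. It mirrors the pattern described above: the constructor leaves authority_id as None, the relationship is assigned afterwards, and the FK is only filled in at flush. It also shows that if the assigned object is itself None, the flush succeeds and leaves the FK as None.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Authority(Base):  # stand-in model, not Lemur's Authority
    __tablename__ = "authorities"
    id = Column(Integer, primary_key=True)
    name = Column(String)


class Certificate(Base):  # stand-in model, not Lemur's Certificate
    __tablename__ = "certificates"
    id = Column(Integer, primary_key=True)
    authority_id = Column(Integer, ForeignKey("authorities.id"))
    authority = relationship("Authority")

    def __init__(self, **kwargs):
        # Mirrors the constructor described in the issue: None unless an
        # integer key is passed explicitly.
        self.authority_id = kwargs.get("authority_id")


def fk_before_and_after_flush(authority):
    """Return (authority_id before flush, authority_id after flush)."""
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        if authority is not None:
            session.add(authority)
            session.flush()  # assigns authority.id
        cert = Certificate()
        cert.authority = authority  # relationship assignment after __init__
        before = cert.authority_id
        session.add(cert)
        session.flush()  # FK is synchronised from the relationship here
        return before, cert.authority_id


print(fk_before_and_after_flush(Authority(name="test-ca")))
print(fk_before_and_after_flush(None))
```

In the normal case the FK is None before the flush and populated after it. The second call is the interesting one: nothing in the ORM layer objects to a None authority, so any path that reaches the assignment with a None value would persist silently.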

Despite this, the bug fires occasionally.

Hypotheses (none confirmed)

  1. A SQLAlchemy session/flush issue: if kwargs["authority"] is fetched in one session and database.commit() happens in another (e.g. celery worker race), the relationship may not flush the FK column.
  2. An older code path or migration that bypassed the schema and is still in the lineage of some certs.
  3. A specific race between cert.authority = ... and the commit that occasionally drops the relationship.

Reproduction

I haven't been able to reproduce this in a controlled setup — the symptom appears in production every ~6 months, often across multiple certs reissued in the same celery cycle.

Asks

  1. Has the team seen this internally? The e874f83c log-defensive change implies yes.
  2. Any insight into why the reissue path can produce certs with authority_id=None?
  3. Would a more defensive setter on Certificate.authority (e.g. asserting authority_id != None at end of __init__ or after cert.authority = X) be welcome as a guard?
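As a starting point for the guard in ask 3, here is a rough sketch on a plain stand-in object. `require_authority` and `MissingAuthorityError` are hypothetical names, not existing Lemur code. One caveat, assuming the constructor behaves as described above: authority_id is still None until flush, so a check on authority_id alone at the end of __init__ would fire for healthy certs too. The sketch therefore accepts either the relationship or the FK.

```python
from types import SimpleNamespace


class MissingAuthorityError(ValueError):
    """Raised when a cert has neither an authority object nor an authority_id."""


def require_authority(cert):
    """Return cert unchanged, or raise if it has no authority at all.

    Checks both attributes because the FK column is only populated from the
    relationship at flush time.
    """
    has_relationship = getattr(cert, "authority", None) is not None
    has_fk = getattr(cert, "authority_id", None) is not None
    if not (has_relationship or has_fk):
        name = getattr(cert, "name", "<unnamed>")
        raise MissingAuthorityError(f"certificate {name!r} has no authority")
    return cert


# Stand-in objects for illustration; real certs would be ORM instances.
healthy = SimpleNamespace(name="ok.example.com", authority=object(), authority_id=None)
broken = SimpleNamespace(name="bad.example.com", authority=None, authority_id=None)

require_authority(healthy)
try:
    require_authority(broken)
except MissingAuthorityError as exc:
    print(exc)
```

Called right after cert.authority = X and again just before commit, a check like this would turn the silent None into a loud failure with a stack trace from the affected celery cycle.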

Happy to send a PR if there's a direction you'd prefer.
