Symptom
Some certificates created via the reissue path end up persisted with authority_id=None despite the input schema requiring authority. Once a cert is in this state, all subsequent reissue attempts fail because the rotation pipeline can't determine which CA to talk to.
Affected certs were reissued in distinct date clusters rather than continuously, which suggests a transient condition during particular celery cycles. Predecessors in the rotation chain have the correct authority_id set; the broken cert is generated at a specific reissue, and every later reissue in that chain inherits the broken state.
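To make the pattern concrete, here is a rough sketch of the kind of query that separates the first broken generation from certs that merely inherited the state. It runs against an in-memory SQLite database with a deliberately simplified, hypothetical schema: the `replaces_id` column, the table layout, and the sample rows are assumptions for illustration, not Lemur's actual schema or my production data.

```python
import sqlite3

# Hypothetical, simplified schema; real table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE certificates (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    authority_id INTEGER,   -- NULL is the broken state
    replaces_id INTEGER,    -- predecessor in the rotation chain (assumed column)
    date_created TEXT NOT NULL
);
-- Made-up sample rows that mimic the symptom.
INSERT INTO certificates VALUES
    (1, 'a-gen1', 7,    NULL, '2024-01-10'),
    (2, 'a-gen2', NULL, 1,    '2024-07-12'),
    (3, 'a-gen3', NULL, 2,    '2025-01-15'),
    (4, 'b-gen1', 7,    NULL, '2024-01-11'),
    (5, 'b-gen2', NULL, 4,    '2024-07-12'),
    (6, 'c-gen1', 9,    NULL, '2024-03-01');
""")


def first_broken_generation(conn):
    """Certs that lost their authority while the predecessor still had one.

    Returns rows of (id, name, date_created, predecessor_authority_id).
    """
    return conn.execute("""
        SELECT c.id, c.name, c.date_created, p.authority_id
        FROM certificates AS c
        JOIN certificates AS p ON p.id = c.replaces_id
        WHERE c.authority_id IS NULL AND p.authority_id IS NOT NULL
        ORDER BY c.date_created, c.id
    """).fetchall()


def broken_by_date(conn):
    """Count certs without an authority per creation date, to expose clusters."""
    return conn.execute("""
        SELECT date_created, COUNT(*)
        FROM certificates
        WHERE authority_id IS NULL
        GROUP BY date_created
        ORDER BY date_created
    """).fetchall()


if __name__ == "__main__":
    for row in first_broken_generation(conn):
        print("first broken:", row)
    for row in broken_by_date(conn):
        print("cluster:", row)
```

The first query also recovers the predecessor's authority_id, which is what a manual repair of the broken cert would need.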
Why I think the bug is real
In commit e874f83c ("various fixups to disable autorotate task", 2025-05-22), the codebase added defensive None-handling for cert.authority_id in logging:
    log_data["authority_name"] = "unknown"
    authority = authorities_get_by_id(cert.authority_id)
    if authority:
        log_data["authority_name"] = authority.name
That patch makes the logger not crash when a cert has authority_id=None, but doesn't address why a cert can reach that state in the first place. The change pretty strongly implies the maintainers have seen these certs in their own deployment.
Expected vs actual
CertificateInputSchema.authority is fields.Nested(AssociatedAuthoritySchema, required=True) (lemur/certificates/schemas.py:105), and validate_authority at lemur/certificates/schemas.py:154-163 raises ValidationError if authority is missing or inactive. So a cert created through service.create() with validate_schema(certificate_input_schema, ...) should never lack an authority.
The reissue_certificate path goes through get_certificate_primitives() (lemur/certificates/service.py:942), which round-trips CertificateOutputSchema → CertificateInputSchema and asserts not ser.errors. So the reissue path should also enforce authority.
Both paths set cert.authority = kwargs["authority"] on line 554, after the constructor has run. The constructor itself sets self.authority_id = kwargs.get("authority_id"), which is None here because the schema produces authority as an Authority object, not authority_id as an integer key. The relationship assignment should then populate the FK on flush.
Despite this, the bug fires occasionally.
Hypotheses (none confirmed)
- A SQLAlchemy session/flush issue: if kwargs["authority"] is fetched in one session and database.commit() happens in another (e.g. celery worker race), the relationship may not flush the FK column.
- An older code path or migration that bypassed the schema and is still in the lineage of some certs.
- A specific race between cert.authority = ... and the commit that occasionally drops the relationship.
Reproduction
I haven't been able to reproduce this in a controlled setup — the symptom appears in production every ~6 months, often across multiple certs reissued in the same celery cycle.
Asks
- Has the team seen this internally? The e874f83c log-defensive change implies yes.
- Any insight into why the reissue path can produce certs with authority_id=None?
- Would a more defensive setter on Certificate.authority (e.g. asserting authority_id is not None at the end of __init__ or after cert.authority = X) be welcome as a guard?
Happy to send a PR if there's a direction you'd prefer.
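For concreteness, a guard along the lines asked about above might look roughly like this. It is only a sketch: `Authority` and `Certificate` here are plain dataclass stand-ins rather than Lemur's SQLAlchemy models, and `ensure_authority` and `MissingAuthorityError` are hypothetical names that do not exist in the codebase.

```python
from dataclasses import dataclass
from typing import Optional


class MissingAuthorityError(ValueError):
    """Raised when a certificate would be persisted without an authority."""


@dataclass
class Authority:
    # Stand-in for the real model, only the fields the guard needs.
    id: int
    name: str


@dataclass
class Certificate:
    # Stand-in for the real model, only the fields the guard needs.
    name: str
    authority: Optional[Authority] = None
    authority_id: Optional[int] = None


def ensure_authority(cert: Certificate) -> Certificate:
    """Fail loudly if neither the relationship nor the FK column is set.

    Before a flush the FK can legitimately still be None while the
    relationship is populated, so both attributes are checked.
    """
    if cert.authority is None and cert.authority_id is None:
        raise MissingAuthorityError(
            f"certificate {cert.name!r} has no authority; refusing to persist"
        )
    return cert


if __name__ == "__main__":
    ensure_authority(Certificate("ok", authority=Authority(7, "example-ca")))
    try:
        ensure_authority(Certificate("broken"))
    except MissingAuthorityError as exc:
        print("guard fired:", exc)
```

In the real code the same check could presumably be called right before the commit in the create path, or attached through a SQLAlchemy `before_flush` session event, so that the failure surfaces at the reissue that breaks the chain instead of at the next rotation.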