Add OOM protection for Export operations #5511

Merged
apurvabhaleMS merged 15 commits into main from personal/abhale/export-oom-protect
May 7, 2026

@apurvabhaleMS
Contributor

Description

Export jobs previously failed immediately on OutOfMemoryException with no retry. This change adds OOM recovery to ExportProcessingJob, following the pattern established by Reindex.

Two recovery strategies based on backend

SQL path (jobs with surrogate ID ranges — All and Patient parallel exports):

  • On OOM, reduces batch size by 10x and splits the failed surrogate ID range into smaller sub-ranges via GetSurrogateIdRanges
  • Sub-ranges are queued internally and processed sequentially within the same job
  • Each sub-range tracks its own OOM reduction count (max 3 reductions before failing)
  • Resource type for range splitting matches orchestrator behavior: Patient for Patient exports, the assigned single type for All exports
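
The SQL-path control flow described above can be sketched roughly as follows. This is a Python illustration of the logic only, not the C# in ExportProcessingJob. The names `WorkItem`, `process_with_range_splitting`, `export_range`, and `get_ranges` are hypothetical, `MemoryError` stands in for OutOfMemoryException, and `get_ranges` stands in for GetSurrogateIdRanges. The description does not say whether sub-ranges inherit the parent's reduction count or whether the batch size has a floor; the sketch assumes inheritance and a floor of 1.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_OOM_REDUCTIONS = 3       # from the description: max 3 reductions per range
BATCH_REDUCTION_FACTOR = 10  # from the description: batch size reduced by 10x


@dataclass
class WorkItem:
    """Hypothetical unit of work: one surrogate ID range plus its OOM history."""
    start_id: int
    end_id: int
    batch_size: int
    oom_reductions: int = 0


def process_with_range_splitting(
    start_id: int,
    end_id: int,
    batch_size: int,
    export_range: Callable[[int, int, int], None],
    get_ranges: Callable[[int, int, int], List[Tuple[int, int]]],
) -> None:
    """Process a surrogate ID range, splitting it into sub-ranges on OOM.

    export_range(start, end, batch_size) stands in for the export work;
    get_ranges(start, end, batch_size) stands in for GetSurrogateIdRanges.
    """
    queue = deque([WorkItem(start_id, end_id, batch_size)])
    while queue:
        item = queue.popleft()
        try:
            export_range(item.start_id, item.end_id, item.batch_size)
        except MemoryError:
            if item.oom_reductions >= MAX_OOM_REDUCTIONS:
                raise  # reductions exhausted: let the caller fail the job
            smaller = max(1, item.batch_size // BATCH_REDUCTION_FACTOR)
            sub_ranges = get_ranges(item.start_id, item.end_id, smaller)
            # Assumption: sub-ranges carry the parent's count plus one and run
            # before any ranges still waiting in the queue, in their own order.
            queue.extendleft(
                WorkItem(s, e, smaller, item.oom_reductions + 1)
                for s, e in reversed(sub_ranges)
            )
```

Because the sub-ranges are queued inside the same loop, all recovery work stays within one job rather than being re-enqueued, which matches the "processed sequentially within the same job" bullet.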

Cosmos path (jobs without surrogate IDs — Group exports, non-parallel, filtered exports):

  • On OOM, reduces batch size by 10x and retries the full job execution
  • Up to 3 retries before failing
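
A minimal Python sketch of the Cosmos-path retry loop, under the same caveats: it illustrates the described behavior and is not the C# in ExecuteWithBatchReductionAsync. `execute_with_batch_reduction` and `run_export` are hypothetical names, `MemoryError` stands in for OutOfMemoryException, and the floor of 1 on the batch size is an assumption.

```python
from typing import Callable

MAX_OOM_RETRIES = 3          # from the description: up to 3 retries
BATCH_REDUCTION_FACTOR = 10  # from the description: batch size reduced by 10x


def execute_with_batch_reduction(
    batch_size: int,
    run_export: Callable[[int], None],
) -> int:
    """Run the full export, shrinking the batch size after each OOM.

    run_export(batch_size) stands in for one full job execution.
    Returns the batch size that finally succeeded.
    """
    retries = 0
    while True:
        try:
            run_export(batch_size)
            return batch_size
        except MemoryError:
            if retries >= MAX_OOM_RETRIES:
                raise  # retries exhausted: let the caller fail the job
            retries += 1
            batch_size = max(1, batch_size // BATCH_REDUCTION_FACTOR)
```

With no surrogate ID range to split, the only lever is the batch size, so each retry reruns the whole job at one tenth of the previous size.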

Key changes

  • ExportProcessingJob: New ExecuteWithSurrogateIdRangeSplittingAsync (SQL) and ExecuteWithBatchReductionAsync (Cosmos) methods. Injected ISearchService for calling GetSurrogateIdRanges.
  • ExportJobTask: Removed old catch (OutOfMemoryException) that immediately failed the job. OOM now re-throws to ExportProcessingJob for recovery.
  • ExportJobRecord: MaximumNumberOfResourcesPerQuery, StartSurrogateId, EndSurrogateId changed to internal set for batch size reduction and range updates.

Failure behavior preserved

When retries are exhausted, the job fails with the same 413 RequestEntityTooLarge status and ExportOutOfMemoryException message as before, ensuring no change to user-facing error responses.
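
As a rough illustration of the failure mapping, the Python sketch below turns an exhausted OOM recovery into a 413 result. `JobOutcome` and `run_with_failure_mapping` are hypothetical names, `MemoryError` stands in for OutOfMemoryException, and the message constant is a placeholder because the actual ExportOutOfMemoryException text is not quoted in this PR.

```python
from dataclasses import dataclass
from http import HTTPStatus
from typing import Callable, Optional

# Placeholder: the real ExportOutOfMemoryException message is not quoted here.
OOM_FAILURE_MESSAGE = "<ExportOutOfMemoryException message>"


@dataclass
class JobOutcome:
    """Hypothetical result type for this sketch."""
    succeeded: bool
    status: Optional[HTTPStatus] = None
    message: Optional[str] = None


def run_with_failure_mapping(
    execute_with_recovery: Callable[[], None],
) -> JobOutcome:
    """Run a recovery-wrapped export and map an exhausted OOM to a 413 failure."""
    try:
        execute_with_recovery()
        return JobOutcome(succeeded=True)
    except MemoryError:
        # Recovery gave up: report 413 RequestEntityTooLarge, as before the change.
        return JobOutcome(False, HTTPStatus(413), OOM_FAILURE_MESSAGE)
```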

Related Issues

Addresses AB189470, AB187913

Testing

  • Unit tests added/updated in ExportProcessingJobTests and ExportJobTaskTests
  • SQL path: verifies GetSurrogateIdRanges is called with correct resource type and range, sub-ranges are processed with reduced batch size
  • Cosmos path: verifies batch size reduction and retry, max reduction failure with 413
  • OOM bubble-up from ExportJobTask verified for all three export types (All, Patient, Group)

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint that it is merged (i.e. add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code) or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version backward compatible or Schema Version backward incompatible or Schema Version unchanged if this adds or updates Sql script which is/is not backward compatible with the code.
  • When changing or adding behavior, if your code modifies the system design or changes design assumptions, please create and include an ADR.
  • CI is green before merge
  • Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

@apurvabhaleMS requested a review from a team as a code owner on April 20, 2026 at 18:12
@apurvabhaleMS added the following labels on Apr 20, 2026:
  • Enhancement: Enhancement on existing functionality.
  • Area-BulkExport: Area related to bulk export.
  • Azure API for FHIR: Label denotes that the issue or PR is relevant to the Azure API for FHIR
  • Azure Healthcare APIs: Label denotes that the issue or PR is relevant to the FHIR service in the Azure Healthcare APIs
  • No-PaaS-breaking-change
  • No-ADR: ADR not needed
@apurvabhaleMS added this to the FY26\Q4\2Wk\2Wk21 milestone on Apr 20, 2026
Comment thread src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportProcessingJob.cs Dismissed

codecov-commenter commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.38%. Comparing base (6faa795) to head (0da235b).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5511      +/-   ##
==========================================
- Coverage   77.63%   77.38%   -0.26%     
==========================================
  Files         985      985              
  Lines       36189    36337     +148     
  Branches     5498     5513      +15     
==========================================
+ Hits        28097    28120      +23     
- Misses       6738     6856     +118     
- Partials     1354     1361       +7     

see 13 files with indirect coverage changes

Contributor

Copilot AI left a comment

Pull request overview

This PR adds OutOfMemoryException (OOM) recovery to export processing jobs, aligning export behavior with the existing reindex OOM mitigation approach to avoid immediate job failure and enable controlled retries.

Changes:

  • Added OOM recovery to ExportProcessingJob with two strategies: surrogate-id range splitting (SQL) and batch-size reduction retries (Cosmos/non-SQL).
  • Updated ExportJobTask to let OOM exceptions bubble up to the processing job for centralized recovery handling.
  • Relaxed setters on select ExportJobRecord fields to allow internal batch/range adjustments during recovery; added/updated unit tests for both recovery paths.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • src/Microsoft.Health.Fhir.Core/Features/Operations/Export/Models/ExportJobRecord.cs: Allows internal updates to batch size and surrogate id range fields needed for OOM recovery.
  • src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportProcessingJob.cs: Implements OOM recovery via range splitting (SQL) or batch reduction (Cosmos/non-SQL).
  • src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportJobTask.cs: Removes “fail immediately on OOM” behavior so recovery can be handled by ExportProcessingJob.
  • src/Microsoft.Health.Fhir.Core.UnitTests/Features/Operations/Export/ExportProcessingJobTests.cs: Adds/updates tests validating OOM retries, range splitting calls, and failure after max reductions.
  • src/Microsoft.Health.Fhir.Core.UnitTests/Features/Operations/Export/ExportJobTaskTests.cs: Adds tests ensuring OOM exceptions bubble up from search paths.

Comment thread src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportProcessingJob.cs Outdated
Comment thread src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportProcessingJob.cs Outdated
Comment thread src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportProcessingJob.cs Dismissed
Comment thread src/Microsoft.Health.Fhir.Core/Features/Operations/Export/ExportProcessingJob.cs Dismissed
LTA-Thinking previously approved these changes May 7, 2026
@apurvabhaleMS merged commit 76f8a6d into main on May 7, 2026
49 checks passed
@apurvabhaleMS deleted the personal/abhale/export-oom-protect branch on May 7, 2026 at 23:54

5 participants