Skip to content

Branch HistFromNtupleProducerTask per dataset, not per dataset×variable#260

Open
kandrosov wants to merge 13 commits into
cms-flaf:mainfrom
kandrosov:issue-257-histfromntuple-per-dataset
Open

Branch HistFromNtupleProducerTask per dataset, not per dataset×variable#260
kandrosov wants to merge 13 commits into
cms-flaf:mainfrom
kandrosov:issue-257-histfromntuple-per-dataset

Conversation

@kandrosov

@kandrosov kandrosov commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Fix for #257: Improvement to HistFromNtupleProducerTask nJobs Management

Root cause

HistFromNtupleProducerTask.create_branch_map() created one branch per
(dataset, variable) pair — nDatasets × nVariables branches total. Each branch
localised all HistTuple files for its dataset (potentially 100+ files) and then
ran HistProducerFromNTuple.py for a single variable. The next branch for the
same dataset would localise exactly the same files again for the next variable.
With 50 variables and 50+ datasets this multiplied network I/O and queue slots
by the number of variables for no gain.

Solution

Branch HistFromNtupleProducerTask per dataset only (nDatasets branches).
Each branch localises the dataset's HistTuple files once and loops over all
variables inside run(), producing one output file per variable. The output of
each branch is now a dict keyed by variable name rather than a single file.

HistMergerTask is adapted to match:

  • create_branch_map() reads the new per-dataset HFN branch map and emits one
    Merger branch per variable (same as before), recording all HFN branch indices.
  • requires() requests all HFN dataset branches (each branch covers all vars).
  • run() extracts inp[var_name] from each dict input instead of using
    inp.abspath directly.

HistPlotTask is unchanged — it still iterates over HistMergerTask's branch
map whose tuple structure (var_name, br_indices, datasets) is preserved.

Files changed

  • FLAF/Analysis/tasks.py
    • HistFromNtupleProducerTask.create_branch_map: {n: (dataset, prod_br_list)} instead of {n: (var, prod_br_list, dataset)}
    • HistFromNtupleProducerTask.workflow_requires: updated branch_map unpacking
    • HistFromNtupleProducerTask.requires: updated branch_data unpacking; fixed list-vs-generator bug in old code
    • HistFromNtupleProducerTask.output: returns {var_name: remote_target} dict
    • HistFromNtupleProducerTask.run: loops over variables; localises inputs once
    • HistMergerTask.create_branch_map: rewritten to read per-dataset HFN branches
    • HistMergerTask.workflow_requires: updated branch_map unpacking
    • HistMergerTask.requires: updated to request HFN branches by dataset index
    • HistMergerTask.run: extracts inp[var_name] from dict input

Testing

Run the full CI pipeline (HistPlotTask --version CI --period Run3_2022 --workflow local --branches 0 --test 1000) for each analysis. The number of HistFromNtupleProducerTask branches should drop from nDatasets × nVariables to nDatasets.

@kandrosov kandrosov force-pushed the issue-257-histfromntuple-per-dataset branch from 92b173b to 7edf88d Compare June 1, 2026 18:15
@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

@kandrosov kandrosov requested a review from aebid June 1, 2026 18:16
@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14918124 started

@kandrosov kandrosov requested a review from Copilot June 1, 2026 18:16
@kandrosov kandrosov changed the title fix #257: branch HistFromNtupleProducerTask per dataset, not per dataset×variable Branch HistFromNtupleProducerTask per dataset, not per dataset×variable Jun 1, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses FLAF issue #257 by reducing the LAW branching granularity of HistFromNtupleProducerTask from (dataset × variable) to just dataset, so each dataset’s HistTuple inputs are localized once and then reused to produce outputs for all variables. HistMergerTask is updated accordingly to consume the new per-dataset dict outputs while keeping the merger branching per variable unchanged.

Changes:

  • Reworked HistFromNtupleProducerTask branching to be per-dataset and updated output() to return a {var_name: target} mapping.
  • Updated HistFromNtupleProducerTask.run() to localize inputs once per dataset and loop over variables to generate per-variable outputs.
  • Adapted HistMergerTask to request per-dataset HFN branches and extract inp[var_name] from dict inputs.
Comments suppressed due to low confidence (4)

Analysis/tasks.py:446

  • branches is created from a set, so converting it directly to a tuple can yield non-deterministic ordering across runs. Since branches is a Luigi/law parameter, non-deterministic ordering can lead to unstable task IDs and unnecessary reruns. Consider sorting before converting to a tuple.
        branch_set = set()
        for br_idx, (dataset_name, prod_br_list) in self.branch_map.items():
            branch_set.update(prod_br_list)
        branches = tuple(branch_set)
        req_dict = {
            "HistTupleProducerTask": HistTupleProducerTask.req(
                self, branches=branches, customisations=self.customisations
            )
        }

Analysis/tasks.py:601

  • new_branchset is a set, but it is converted to a list when passed to branches. Since parameter ordering can affect task IDs, prefer a deterministically ordered tuple (and keep it consistent with other workflow_requires implementations in this file).
        return {
            "HistFromNtupleProducerTask": HistFromNtupleProducerTask.req(
                self, branches=list(new_branchset)
            )
        }

Analysis/tasks.py:446

  • branches is created from a set, so converting it directly to a tuple can yield non-deterministic ordering across runs. Since branches is a Luigi/law parameter, non-deterministic ordering can lead to unstable task IDs and unnecessary reruns. Consider sorting before converting to a tuple.
        branch_set = set()
        for br_idx, (dataset_name, prod_br_list) in self.branch_map.items():
            branch_set.update(prod_br_list)
        branches = tuple(branch_set)
        req_dict = {
            "HistTupleProducerTask": HistTupleProducerTask.req(
                self, branches=branches, customisations=self.customisations
            )
        }

Analysis/tasks.py:601

  • new_branchset is a set, but it is converted to a list when passed to branches. Since parameter ordering can affect task IDs, prefer a deterministically ordered tuple (and keep it consistent with other workflow_requires implementations in this file).
        return {
            "HistFromNtupleProducerTask": HistFromNtupleProducerTask.req(
                self, branches=list(new_branchset)
            )
        }

Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/tasks.py Outdated
@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14918165 started

@kandrosov kandrosov linked an issue Jun 1, 2026 that may be closed by this pull request
@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14918124 failed

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14918165 failed

@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14918419 started

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14918419 passed

@aebid

aebid commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

The task looks better but still has a problem. There is no continue to variables that already have a successful output. If the job fails halfway through, or if the variable list is updated, the current structure would repeat the already finished plots.

I ran the task for 3 plots, then killed it to start again. Now the DeepHME_mass, error, and M300 DNN already exist.
Screenshot 2026-06-02 at 2 10 24 PM

But then when calling the task again, it starts from variable 1 "DeepHME_mass" again.
Screenshot 2026-06-02 at 2 11 24 PM

@aebid

aebid commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Another issue with current iteration is the next step HistMergerTask.

If you want to only run one variable out of the list in HistMerger, then current law structure would require all variables to be finished, which is a huge problem.

Example shows trying to make HistMerger branch 0 (DeepHME_mass). Since law just sees dependent tasks, it thinks the entire HistFromNtupleProducer is required -- even though it only needs the first output. If this is a consequence of this PR then it is unacceptable and this idea is dead.

Screenshot 2026-06-02 at 10 30 45 PM

@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

  • gitlab_ref=ci-dynamic-era-jobs

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14985279 started

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14985279 failed

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

Analysis/tasks.py:726

  • branches=list(new_branchset) uses a set, so branch order is nondeterministic. Since branches contributes to the task id, this can lead to unstable task signatures / duplicate submissions across runs.
        reqs["HistFromNtupleProducerTask"] = HistFromNtupleProducerTask.req(
            self, branches=list(new_branchset)
        )

Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/tasks.py Outdated
Comment thread Analysis/HistProducerFromNTuple.py Outdated
@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14990928 started

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#14990928 passed

…ple-per-dataset

# Conflicts:
#	Analysis/tasks.py
…nching (cms-flaf#257) + queue-based download pipeline; default tasks-per-job (cms-flaf#264)
@kandrosov kandrosov linked an issue Jun 16, 2026 that may be closed by this pull request
…er-file branches); stop .req from propagating it (exclude_params_req)
…ches + deterministic per-variable bucketing (adding/removing variables no longer renumbers branches)
…laceholder output so the branch is complete and never submitted to HTCondor
…_simultaneous_downloads, default 4) and total in-flight size (max_total_download_size_gb, default 10) from global config
@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#15050082 started

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#15050082 failed

@kandrosov

Copy link
Copy Markdown
Contributor Author

@cms-flaf-bot please test

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#15063269 started

@cms-flaf-bot

Copy link
Copy Markdown
Collaborator

pipeline#15063269 passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default --tasks-per-job settings Improvement to HistFromNtupleProducerTask nJobs Management

4 participants