I looked at the tasks we have in the WDL, and some of them don't seem to make much sense.
For example, get_phenolist localizes all of the summary statistic files, yet its only output seems to be a JSON with the file path (and even that is specific to the Cromwell execution) and the endpoint name, which is extracted from the filename.
So that task could be replaced by one that just takes in the file paths as strings, so that nothing is localized, and extracts the endpoint name from each; the assoc files would then be some predetermined path + the filename.
This would make sense especially in userresults, where this task is run on every import.
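A minimal sketch of what such a replacement could do, working on path strings only. The naming convention (endpoint name = filename without a trailing `.gz`), the `phenocode` / `assoc_files` keys, the bucket and the `/cromwell_root/sumstats` directory are all assumptions for illustration, not taken from the current task:

```python
import json
from pathlib import PurePosixPath


def endpoint_name(path: str) -> str:
    """Derive the endpoint name from the filename (assumed convention: <ENDPOINT>.gz)."""
    filename = PurePosixPath(path).name
    return filename[:-3] if filename.endswith(".gz") else filename


def build_phenolist(paths, assoc_dir):
    """Build phenolist entries from path strings; no file is opened or copied."""
    return [
        {
            "phenocode": endpoint_name(p),
            # Predetermined directory + original filename (hypothetical layout).
            "assoc_files": [f"{assoc_dir.rstrip('/')}/{PurePosixPath(p).name}"],
        }
        for p in paths
    ]


if __name__ == "__main__":
    example_paths = [
        "gs://example-bucket/sumstats/T2D.gz",
        "gs://example-bucket/sumstats/I9_CHD.gz",
    ]
    print(json.dumps(build_phenolist(example_paths, "/cromwell_root/sumstats"), indent=2))
```

Since the inputs are plain strings, a task like this would only need the list of paths, not the files themselves.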
Second, the subsequent task fix_json, which uses that endpoint list, localizes the qqplot and manhattan files to count the number of peaks (from the manhattan file) and to get the GC lambda (from the qqplot file). Most of the time goes to localization; the calculations themselves are trivial in comparison. If we instead calculated these values in the manhattan and qqplot task (or tasks), where the data already exists, we could really speed this task up.
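To illustrate how cheap the calculation itself is, here is a sketch of a GC lambda computed directly from p-values, as could be done inside the qqplot task. It uses the common median-based definition (median chi-square statistic divided by the median of a 1-df chi-square distribution); whether the current qqplot code uses exactly this definition is an assumption, and `gc_lambda` is a hypothetical helper name:

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of the chi-square distribution with 1 degree of freedom (about 0.4549).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def gc_lambda(pvalues):
    """Median-based genomic control lambda from two-sided p-values."""
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2); chi-square = z ** 2.
    chisq = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in pvalues if 0 < p <= 1]
    return median(chisq) / _CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Evenly spread p-values, as expected under the null, give a lambda close to 1.
    null_like = [(i + 0.5) / 1000 for i in range(1000)]
    print(round(gc_lambda(null_like), 3))
```

The resulting number could be written next to the task's existing output, so that fix_json would only need to read a small value instead of localizing the full file.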
Third, the long matrix creation currently takes 13 hours in userresults for 1770 endpoints, which is quite long. However, I do not see a quick way of making it less resource-intensive. We could do some tuning, such as making the disk much smaller (it is currently 8 TB, with less than 1 TB in use) and switching it to SSD for faster IO. We can also parallelize the indexing with the --threads flag, if we use a recent enough tabix.
Here's the output of a recent userresults import:
++ grep -c '^processor' /proc/cpuinfo
+ n_cpu=6
++ date
+ echo Wed May 6 11:19:54 AM UTC 2026 decompress
+ cat /mnt/disks/cromwell_root/green-production-userresults-cromwell-storage/import_pheweb/6b91e6af-2303-415d-96d9-744e0ae5e4a8/call-matrix_longformat/write_lines_94bc5e7609a73a4b92b8519c0c88eb13.tmp
+ xargs -P 6 '-I{}' gzip -d --force '{}'
+ cat /mnt/disks/cromwell_root/green-production-userresults-cromwell-storage/import_pheweb/6b91e6af-2303-415d-96d9-744e0ae5e4a8/call-matrix_longformat/write_lines_94bc5e7609a73a4b92b8519c0c88eb13.tmp
+ sed 's_\.gz$__'
+ tr '\n' '\0'
++ date
+ echo Wed May 6 11:37:35 AM UTC 2026 merge
+ bgzip -@6
+ cat /dev/fd/63 /dev/fd/62
++ sort -m -T . --parallel=6 --compress-program=gzip --files0-from=merge_these --batch-size=16 -k2,2V -k3,3g -k4,5 -k1,1
++ echo '#pheno #chrom pos ref alt pval mlogp beta sebeta af_alt af_alt_cases af_alt_controls'
++ tr ' ' '\t'
real 656m4.722s
user 218m52.712s
sys 32m30.497s
++ date
+ echo Wed May 6 10:33:40 PM UTC 2026 tabix
+ tabix -s 2 -b 3 -e 3 long.tsv.gz
++ date
+ echo Wed May 6 11:28:08 PM UTC 2026 copy_files
+ find ./
++ date
+ echo Wed May 6 11:28:10 PM UTC 2026 end
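Decompress takes about 18 minutes (11:19:54 to 11:37:35), which we probably can't speed up.
Merge takes 11 hours (656 minutes wall-clock). The timed pipeline reports only about 251 minutes of CPU time (user + sys) on 6 cores, which would be consistent with it mostly waiting on IO, so we might be able to speed it up with faster IO by using an SSD.
Tabix indexing takes about 1 hour (roughly 55 minutes); this can be sped up with the parallelization.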
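For reference, the per-stage durations can be derived from the `date` echoes in the trace. A small sketch, with the timestamps copied by hand from the log above (the format string assumes an English locale):

```python
from datetime import datetime

# (stage, start time) pairs copied from the `echo <date> <stage>` lines of the trace.
STAGES = [
    ("decompress", "Wed May 6 11:19:54 AM UTC 2026"),
    ("merge", "Wed May 6 11:37:35 AM UTC 2026"),
    ("tabix", "Wed May 6 10:33:40 PM UTC 2026"),
    ("copy_files", "Wed May 6 11:28:08 PM UTC 2026"),
    ("end", "Wed May 6 11:28:10 PM UTC 2026"),
]

DATE_FORMAT = "%a %b %d %I:%M:%S %p UTC %Y"


def stage_durations(stages):
    """Duration of each stage = start of the next stage minus its own start."""
    parsed = [(name, datetime.strptime(ts, DATE_FORMAT)) for name, ts in stages]
    return {
        name: next_start - start
        for (name, start), (_, next_start) in zip(parsed, parsed[1:])
    }


if __name__ == "__main__":
    for stage, duration in stage_durations(STAGES).items():
        print(f"{stage}: {duration}")
```

The merge stage dominates, which is why the disk type is the first thing worth trying.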
We should also take a look at what the per-phenotype tasks are doing. I didn't look into them, as I was looking at this from a userresults import perspective, where most of the endpoints are already imported and in the call cache, but they are relevant for release imports.