Build 2.2 Bug Fixes - Large PR #441

jjacobson95 · 2025-08-05T17:07:34Z

Build v2.2 is still in progress and not ready to merge, but I wanted to show which bug fixes are already underway. It will take some time to test everything, but it's coming along well.

Summary of all changes made on this branch:

Drug file generation was overhauled.
This was a major update across all datasets. Every dataset needed its own code to adjust the SMI values (which not all had, #430) and to create a merged file of all previous drug files (which not all did, #428, #429). Additionally, one dataset (#427) did not use the pubchem_retrieval.py script at all, because it predates the script and was never updated, so I removed the old code and replaced it. The overhaul required a large change to pubchem_retrieval.py, which now accepts a new argument, prev_drug_filepaths, followed by a compatibility update to every other dataset's drug generation script. All datasets now use all previous drug files!
This also resolves half of #421, though not the drug descriptors portion.
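
To illustrate the idea behind prev_drug_filepaths, here is a minimal sketch of continuing SMI numbering across previously generated drug files. It is not the code in pubchem_retrieval.py: the function names, the tab-separated layout, and the improve_drug_id column name are all assumptions made for the example.

```python
import csv
import re

# Matches identifiers of the form SMI_<number>.
SMI_PATTERN = re.compile(r"^SMI_(\d+)$")


def max_smi_index(prev_drug_filepaths, id_column="improve_drug_id"):
    """Return the largest N found in any SMI_N identifier, or 0 if there are none.

    The column name and tab delimiter are assumptions for this sketch.
    """
    highest = 0
    for path in prev_drug_filepaths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                match = SMI_PATTERN.match(row.get(id_column) or "")
                if match:
                    highest = max(highest, int(match.group(1)))
    return highest


def assign_smi_ids(new_drug_names, prev_drug_filepaths):
    """Give each new drug the next unused SMI_N identifier (hypothetical helper)."""
    start = max_smi_index(prev_drug_filepaths) + 1
    return {
        name: f"SMI_{start + offset}"
        for offset, name in enumerate(new_drug_names)
    }
```

With an empty list of previous files the first drug receives SMI_1, which matches the intent of the "starts at SMI_1 instead of SMI_2" commit. The sketch does not reuse identifiers for drugs that already appear in earlier files; a real implementation would need to.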

Large-Scale Renaming
All references to the build directory are being changed to coderbuild.
All recent datasets were renamed to drop pdo/pdx, along with other small updates across all build files.

SarcoPDO Fixes
There were a couple of issues with this dataset, as it had not properly passed validation in the last build. This fixes the mutations file (#431) and the experiments file (#436).

LiverPDO Fix
The last-second addition of integer casting led to an issue with the experiments file (#432). This has been fixed.

Mapping Scripts Update
The mapping scripts are now updated to include all desired datasets (#435).

Broad_Sanger Update
Broad/Sanger has persistently been the dataset most likely to fail and stop the build process. This is due to the file download method (#438), where a connection break or mis-download stops everything. I'm implementing a more robust download method. Debugging it is taking a while (concurrently with everything else), but in the long run this will save us days or weeks of build time.
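
As a rough illustration of what "more robust" can mean here, the sketch below retries a download, writes to a temporary .part file, and only moves it into place once it is complete and non-empty. This is an assumed approach, not the code on this branch; the function name, parameters, and defaults are hypothetical.

```python
import http.client
import os
import shutil
import time
import urllib.request


def download_with_retries(url, dest, attempts=3, delay_seconds=5.0,
                          opener=urllib.request.urlopen):
    """Download url to dest, retrying on connection errors (hypothetical helper).

    Data is written to dest + ".part" first, so a broken connection never
    leaves a truncated file at the final path.
    """
    partial = dest + ".part"
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response, open(partial, "wb") as out:
                shutil.copyfileobj(response, out)
            if os.path.getsize(partial) == 0:
                raise OSError("downloaded file is empty")
            os.replace(partial, dest)  # atomic move into place
            return dest
        except (OSError, http.client.HTTPException) as error:
            # URLError and HTTPError are subclasses of OSError.
            last_error = error
            if os.path.exists(partial):
                os.remove(partial)
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"failed to download {url} after {attempts} attempts"
    ) from last_error
```

The opener parameter exists only so the function can be exercised without a network connection.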
Also fixing a polars import issue that is stopping a couple of scripts from working (#442).

Build Process
I've cleaned up a ton of the print statements across the dataset build scripts so that debugging can be faster. Previously I had to filter through 100k+ lines of logs to find the issues. The only issue this relates to is #437 (which produces 56k lines of warnings in the log). Some print statements can't be removed easily, such as those from the GDC_tool, but this is still much better.

I'm also implementing a retry function in build_all.py. For example, if the hcmi build_omics.sh script fails due to a memory spike, it is retried 3 times before the whole build fails. While not a direct fix, this also serves as back-up protection against broken connections during downloads that cause inconsistent failures (#434). It also required a change to how pubchem_retrieval handles ignore_chems (#446).
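
The retry behavior can be sketched roughly as follows. This is not the function in build_all.py: the name run_with_retries and its parameters are hypothetical, and reading "3 and 10 min" as three retries with a ten-minute wait between attempts is an assumption.

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, wait_seconds=600, runner=subprocess.run):
    """Run cmd, retrying up to `retries` times after the first failure.

    Returns the attempt number that succeeded, or raises RuntimeError once
    every attempt has failed. The defaults are assumptions for this sketch.
    """
    total_attempts = retries + 1
    for attempt in range(1, total_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}/{total_attempts} failed "
              f"with exit code {result.returncode}")
        if attempt < total_attempts:
            time.sleep(wait_seconds)  # give memory or the network time to recover
    raise RuntimeError(f"step failed after {total_attempts} attempts: {cmd}")
```

A build step would then be wrapped as, for example, run_with_retries(["bash", "build_omics.sh"]), so a single transient failure no longer ends the whole build.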

Docker Process
Optimized the Dockerfiles across all datasets to better leverage Docker layer caching. This significantly speeds up build time and, more notably, debugging time, especially for containers with R. Without the optimized cache order, broad_sanger_exp takes 1 hour to build and broad_sanger_omics takes 20 minutes. Now build files can be modified while R and everything else that needs compiling stays cached. The best order is essentially largest to smallest: compile R and Python, add the R requirements file, install R packages, add the Python requirements file, install Python packages, then add all build files.

Extras
Removed a couple of unused files including Dockerfile.crcPDO (#447).

jjacobson95 added 23 commits July 30, 2025 16:46

Fix beataml Drug issue

c5acf71

liverpdo fixes

c4df787

added novartis to build_all.py. update for liverpdo drugs

a96c111

another liverpdo drug update

17377b2

testing pubchem update

1981a3b

working on pubchem

ebc79b5

working on pubchem2

bc5b859

updated pubchem call in build/bladderpdo/02_createBladderPDODrugsFile.py

799c636

Large drug generation overhaul

7470735

reduced drugs in broad_sanger for debugging

7cf7dd1

bug fix

a40eca2

changed to random 10 instead fo first test for debugging

7f57630

Speed up Docker build (and debug process) through optimizing dockerfiles for caching

44ab62c

Make sure helper script is actually added to the dockerfile

2fafd15

bug fix in join

e3670b0

bug fix on join

54a9254

Sorted after joining

aee1a1d

ensure that first drug in first file starts at SMI_1 instead of SMI_2

7f39128

Turning off test steps. Made a change to HCMI that should speed up I hope

88083fe

SarcPDO issues fixed for mutations and experiments

533f66b

fixes liverpdo experiments

797f37c

Updated mapping scripts with all datasets and removed cptac by default. Removed tons of print statements so debugging the full build would be easier

09fb9e5

Added robust methods to download files for broad_sanger omics

4d74714

jjacobson95 marked this pull request as draft August 5, 2025 17:08

jjacobson95 added 6 commits August 5, 2025 14:28

Dockerfile optimization. Attempting to fix broad_sanger. Hide warnings in drug descriptor file.

509b170

tiny changes. 05b_separate_datasets.py working now

c349fc6

pinning polars-lts-cpu to the original version as polars pin. This is used in the drug generation as well, so we need to keep it this ver

8018f8b

Added 3 x retry to build_all.py for each step that fails. Attempting to stream hcmi data instead of hold in storage

c7171f2

Remove incorrectly-cased Dockerfile.crcPDO and add Dockerfile.crcpdo

d628dd3

Merge remote-tracking branch 'origin/main' into build_2.2_bug_fixes

e3f4df3

jjacobson95 added 10 commits August 6, 2025 22:42

HCMI data streaming finally seems like it mightttt be working

b347513

Handle 503 Gateway errors better. Pubchem ignore_chems updated to only 404s. Build_all retries set at 3 and 10 min

4630da8

Renamed build to coderbuild. Hopefully I got all references, there were quite a few

9881108

Renamed All PDO/PDX Datasets. Modified all files that reference the datasets. This was hundreds of references so its possible I missed something or capitalization is off somewhere

5cf3dac

Adding missed references to build/coderbuild

e98d88d

Adding more missed name changes

3e0aea2

Adding just a couple more references

afdb5f7

Patch fix for a weird bug

f2535f8

apparently pl.scan_csv can't handle gzipped files. fixed hcmi stream

57b9fe0

Removed previous mapping files because vast dataset changes and renaming

70b50ff
