Build 2.2 Bug Fixes - Large PR #441


Draft: wants to merge 39 commits into base: main


@jjacobson95 (Collaborator) commented Aug 5, 2025

Build v2.2 is still in progress and not ready to merge, but I wanted to show which bug fixes are already underway. It will take some time to test everything, but it's coming along well.

Summary of all changes made on this branch:


Drug file generation was overhauled.
This was a major update across all datasets. Every dataset needed its own code to adjust the SMI values (which not all did, #430) and to create a merged file of all of the previous drug files (which not all did, #428, #429). Additionally, one dataset (#427) did not use the pubchem_retrieval.py script at all: it predates the script and was never updated, so I removed all the old code and replaced it. The overhaul required a large change to pubchem_retrieval.py, which now accepts a new argument, prev_drug_filepaths, followed by a compatibility update to every other dataset's drug generation script. All datasets now use all previous drug files!
This also solves half of #421, though not the drug descriptors portion.
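To illustrate the idea of reusing all previous drug files, here is a minimal sketch, not the actual pubchem_retrieval.py code. It assumes prev_drug_filepaths is a comma-separated string of tab-separated files and that each file has a chem_name column; the function names load_previous_drugs and names_to_query, the column name, and the file format are all assumptions made for this example.

```python
import csv
from pathlib import Path


def load_previous_drugs(prev_drug_filepaths, key="chem_name"):
    """Merge rows from earlier drug files, keeping the first row seen per name.

    Assumption: `prev_drug_filepaths` is a comma-separated string of TSV paths
    and every file has a `key` column. Missing files are skipped.
    """
    merged = {}
    for raw in prev_drug_filepaths.split(","):
        name = raw.strip()
        if not name or not Path(name).is_file():
            continue
        with Path(name).open(newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # Case-insensitive match so "Aspirin" and "aspirin" collapse.
                merged.setdefault(row[key].lower(), row)
    return merged


def names_to_query(requested, known):
    """Return only the names that no previous drug file already covers."""
    return [n for n in requested if n.lower() not in known]
```

With a lookup like this, a dataset's drug script would only need to send names_to_query(...) to PubChem and could reuse the existing rows for everything else.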

Large-Scale Renaming
All references to the build directory are being changed to coderbuild.
All recent datasets were renamed to drop pdo/pdx, along with other small updates across all build files.

SarcoPDO Fixes
There were a couple of issues with this dataset, as it had not properly passed validation in the last build. This fixes the mutations file (#431) and the experiments file (#436).

LiverPDO Fix
The last-second addition of integer casting caused an issue with the experiments file (#432). This has been fixed.

Mapping Scripts Update
The mapping scripts are now updated to include all desired datasets (#435).

Broad_Sanger Update
Broad/Sanger has persistently been the dataset most likely to fail and stop the build process. This is due to the file download method (#438), where a broken connection or bad download stops everything. I'm implementing a more robust download method. Debugging it is taking some time (concurrently with everything else), but in the long run it will save us days or weeks of build time.
I'm also fixing a polars import issue that stops a couple of scripts from working (#442).
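As a rough illustration of what a more robust download could look like, here is a minimal sketch using only the Python standard library. It is not the method implemented on this branch: the function name download_with_retries, its parameters, and the ".part" temporary-file convention are assumptions for this example. The idea is to write to a temporary file, rename it only after a complete transfer, and retry on connection errors.

```python
import http.client
import os
import shutil
import time
import urllib.request
from pathlib import Path


def download_with_retries(url, dest, attempts=3, backoff=2.0,
                          opener=urllib.request.urlopen):
    """Download `url` to `dest`, retrying when the connection breaks.

    Data is streamed to `<dest>.part` and renamed only after the transfer
    finishes, so a failed attempt never leaves a truncated file at `dest`.
    """
    dest = Path(dest)
    partial = dest.with_name(dest.name + ".part")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response, partial.open("wb") as out:
                shutil.copyfileobj(response, out)
            os.replace(partial, dest)
            return dest
        except (OSError, http.client.HTTPException) as err:
            last_error = err
            partial.unlink(missing_ok=True)  # drop the incomplete file
            if attempt < attempts:
                time.sleep(backoff * attempt)  # wait longer after each failure
    raise RuntimeError(
        f"download failed after {attempts} attempts: {url}"
    ) from last_error
```

The opener parameter exists only so the function can be exercised without a network connection; in normal use the default urllib.request.urlopen would be left in place.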

Build Process
I've cleaned up a ton of the print statements across dataset build scripts so that debugging can be faster. Previously I had to filter through 100k+ lines of logs to find the issues. The only issue this relates to is #437 (which produces 56k lines of warnings in the log). Some print statements can't be removed easily, such as those from the GDC_tool, but this is still much better.

I'm also implementing a retry function in build_all.py. For example, if the hcmi build_omics.sh script fails due to a memory spike, it will be retried 3 times before the whole build fails. While not a direct fix, this also serves as backup protection against broken connections during downloads, which cause inconsistent failures (#434). It also required a change to the ignore_chems handling in the pubchem_retrieval logic (#446).
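The retry behavior can be sketched roughly as follows. This is an illustration rather than the code in build_all.py: the function name run_with_retries, its parameters, and the choice to treat 3 as the total number of attempts are assumptions for this example.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay=5.0):
    """Run `cmd` until it exits with status 0, up to `max_attempts` times.

    Returns the number of the attempt that succeeded. Raises RuntimeError
    once every attempt has failed, so the caller can stop the build.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt}/{max_attempts} failed "
            f"(exit {result.returncode}): {cmd}",
            file=sys.stderr,
        )
        if attempt < max_attempts:
            time.sleep(delay)  # give memory or network a moment to recover
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")
```

A call might look like run_with_retries(["bash", "build_omics.sh"]), where the exact command line is hypothetical. Retrying only helps with transient failures such as memory spikes or dropped connections; a deterministic bug will simply fail on every attempt.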

Docker Process
Optimized all Dockerfiles across all datasets to better leverage Docker caching. This significantly speeds up build time and, more notably, debugging time, especially for Docker containers with R. Without an optimized cache order, Broad_sanger_exp takes 1hr to build and broad_sanger_omics takes 20 minutes. Now build files can be modified while R and everything else that needs compiling stays cached. The best order is essentially largest to smallest: R and Python compiling, add the R requirements file, install R packages, add the Python requirements file, install Python packages, add all build files.

Extras
Removed a couple of unused files including Dockerfile.crcPDO (#447).

@jjacobson95 jjacobson95 marked this pull request as draft August 5, 2025 17:08