Build 2.2 Bug Fixes - Large PR #441
Draft: jjacobson95 wants to merge 39 commits into main from build_2.2_bug_fixes
…t. Removed tons of print statements so debugging the full build would be easier
…s in drug descriptor file.
… used in the drug generation as well, so we need to keep it this ver
…to stream hcmi data instead of hold in storage
…y 404s. Build_all retries set at 3 and 10 min
…atasets. This was hundreds of references, so it's possible I missed something or capitalization is off somewhere
Build v2.2 is still in progress and not ready to merge, but I wanted to show which bug fixes are already underway. It will take some time to test everything, but it's coming along well.
Summary of all changes made on this branch:
Drug file generation was overhauled.
This was a major update across all datasets, because every dataset needed its own code to adjust the SMI values (which not all did, #430) and to create a merged file of all of the previous drug files (which not all did, #428, #429). Additionally, one dataset (#427) did not use the pubchem_retrieval.py script at all, because it predates the script and was never updated, so I removed all the old code and replaced it. This overhaul required a large change to pubchem_retrieval.py, which now accepts a new argument, prev_drug_filepaths, followed by a compatibility update to every other dataset's drug generation script. All datasets now use all previous drug files! This also solves half of #421, though not the drug descriptors portion.
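As a rough illustration of what passing a list of previous drug files might involve, here is a minimal sketch that merges several tab-separated drug files and drops duplicate rows. It is not the code in pubchem_retrieval.py: the function name merge_previous_drug_files, the tab-separated layout, and the improve_drug_id and chem_name columns are all assumptions made for the example.

```python
import csv


def merge_previous_drug_files(prev_drug_filepaths,
                              key_columns=("improve_drug_id", "chem_name")):
    """Merge rows from several drug files, keeping the first copy of each key.

    Hypothetical helper: the column names and the TSV layout are assumptions,
    not taken from the real pubchem_retrieval.py.
    """
    merged = []
    seen = set()
    for path in prev_drug_filepaths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # Rows are identified by the values in the key columns.
                key = tuple(row.get(column, "") for column in key_columns)
                if key not in seen:
                    seen.add(key)
                    merged.append(row)
    return merged
```

A dataset's drug script could then call this once with every earlier dataset's drug file, so that identifiers already assigned are visible before new ones are created.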
Large-Scale Renaming
All references to the build directory are being changed to coderbuild. All recent datasets were renamed, dropping pdo/pdx, along with other small updates across all build files.
SarcoPDO Fixes
There were a couple of issues with this dataset, as it had not properly passed validation in the last build. This fixes the mutations file (#431) and the experiments file (#436).
LiverPDO Fix
The last-second addition of integer casting caused an issue with the experiments file (#432). This has been fixed.
Mapping Scripts Update
The mapping scripts are now updated to include all desired datasets (#435).
Broad_Sanger Update
Broad/Sanger has persistently been the dataset most likely to fail and stop the build process. This is due to the file download method (#438), where a connection break or bad download stops everything. I'm implementing a more robust method to download files. Debugging it is taking a bit of time (concurrently with everything else), but in the long run this will save us days or weeks of build time.
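One common way to make a download survive a broken connection is sketched below: stream to a temporary file, rename it only when the transfer completes, and retry on failure. This is a generic pattern built on the standard library, not the method used in this PR; the function name download_with_retries, the ".part" suffix, and the default retry settings are assumptions.

```python
import http.client
import os
import shutil
import time
import urllib.request


def download_with_retries(url, dest, retries=3, delay_seconds=5.0,
                          opener=urllib.request.urlopen):
    """Download url to dest, retrying when the connection fails.

    Hypothetical helper. Data is streamed to a ".part" file and renamed to
    dest only after the transfer finishes, so an interrupted download never
    leaves a truncated file under the final name.
    """
    part = dest + ".part"
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response, open(part, "wb") as out:
                shutil.copyfileobj(response, out)
            os.replace(part, dest)
            return dest
        except (OSError, http.client.HTTPException) as err:
            # URLError, timeouts and connection resets are all OSError.
            last_error = err
            if os.path.exists(part):
                os.remove(part)
            if attempt < retries:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"failed to download {url} after {retries} attempts"
    ) from last_error
```

The opener parameter exists only so the function can be exercised without network access; a checksum or size check before the rename would be a natural addition if the source publishes one.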
Also fixing a polars import issue that is stopping a couple of scripts from working (#442).
Build Process
I've cleaned up a ton of the print statements across the dataset build scripts so that debugging can be faster. Previously I had to filter through 100k+ lines of logs to find the issues. The only issue this relates to is #437 (which produces 56k lines of warnings in the log). Some print statements can't be removed easily, such as those from the GDC_tool, but this is still much better.
I'm also implementing a retry function in build_all.py. For example, if the hcmi build_omics.sh script fails due to a memory spike, it will retry it 3 times before the whole build fails. While not a direct fix, it will also function as back-up protection against broken connections during downloads that cause inconsistent failures (#434). This also required a change to the pubchem_retrieval logic with the ignore_chems (#446).
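The retry behavior described above could look roughly like the sketch below. It is not the code in build_all.py: the function name run_with_retries is hypothetical, the 10-minute wait is taken from the commit notes, and treating "3" as the total number of attempts (rather than retries after the first failure) is an assumption.

```python
import subprocess
import time


def run_with_retries(command, max_attempts=3, delay_seconds=600.0):
    """Run a build command, retrying when it exits with a non-zero code.

    Hypothetical helper. Returns the number of attempts that were needed and
    raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}/{max_attempts} failed "
              f"with exit code {result.returncode}")
        if attempt < max_attempts:
            # Wait before retrying so transient problems can clear.
            time.sleep(delay_seconds)
    raise RuntimeError(f"{command!r} failed after {max_attempts} attempts")
```

A call such as run_with_retries(["bash", "build_omics.sh"]) would then fail the whole build only after the last attempt. Because the step is rerun from scratch each time, this helps with transient failures such as memory spikes or dropped connections, not with deterministic bugs.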
Docker Process
Optimized the Dockerfiles across all datasets to better leverage Docker layer caching. This significantly speeds up build time and, more notably, debugging time, especially for Docker containers with R. Without an optimized cache order, broad_sanger_exp takes 1 hr to build and broad_sanger_omics takes 20 minutes. Now build files can be modified while R and everything else that needs compiling stays cached. The best order is essentially largest to smallest: compile R and Python, add the R requirements file, install R packages, add the Python requirements file, install Python packages, then add all build files.
Extras
Removed a couple of unused files including Dockerfile.crcPDO (#447).