Merge pull request #34 from lemora/fix/structure-chapter-docs

josemduarte · web-flow · commit 72b7808b6940 · 2021-11-15T09:06:16.000-08:00
Fixed documentation for parts of the structure chapter
diff --git a/structure/alignment.md b/structure/alignment.md
@@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure.
 
 A **structural alignment** of other biological polymers can also be made in BioJava.
 For example, nucleic acids can be structurally aligned to find common structural motifs, 
-independent of sequence simililarity. This is specially important for RNAs, because their
+independent of sequence similarity. This is especially important for RNAs, because their
 3D structure arrangement is important for their function.
 
 For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).
 
-## Alignment Algorithms supported by BioJava
+## Alignment Algorithms Supported by BioJava
 
 BioJava comes with a number of algorithms for aligning structures. The following
 five options are displayed by default in the graphical user interface (GUI),
@@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms.
 Since BioJava version 4.1.0, multiple structures can be compared at the same time in 
 a **multiple structure alignment**, that can later be visualized in Jmol. 
 The algorithm is described in detail below. As an overview, it uses any pairwise alignment 
-algorithm and a **reference** structure to per perform an alignment of all the structures. 
+algorithm and a **reference** structure to perform an alignment of all the structures.
 Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
-all the strucutures, identifying conserved **structural motifs**.
+all the structures, identifying conserved **structural motifs**.
 
 ## Alignment User Interface
 
@@ -91,7 +91,7 @@ This code shows the following user interface:
 ![Multiple Alignment GUI](img/multiple_gui.png)
 
 The input format is a free text field, where the structure identifiers are 
-indidcated, space separated. A **structure identifier** is a String that 
+indicated, space separated. A **structure identifier** is a String that
 uniquely identifies a structure. It is basically composed of the pdbID, the
 chain letters and the ranges of residues of each chain. For the formal description
 visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
@@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by
 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821).
 It works by identifying segments of the two structures with similar local
 structure, and then combining those to try to align the most residues possible
-while keeping the overall RMSD of the superposition low.
+while keeping the overall root-mean-square deviation (RMSD) of the superposition low.
 
 CE is a rigid-body alignment algorithm, which means that the structures being
 compared are kept fixed during superposition. In some cases it may be desirable
 to break large proteins up into domains prior to aligning them (by manually
-inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by
+inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by
 decomposing the protein automatically using the [Protein Domain
 Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html)
 algorithm).
@@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly
 permuted proteins to be compared.  For more information on circular
 permutations, see the
 [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or
-[Molecule of the Month]
-(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar)
-articles [![pubmed]
-(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
+[Molecule of the Month](https://pdb101.rcsb.org/motm/124)
+articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
 
 
 For proteins without a circular permutation, CE-CP results look very similar to
@@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a
 rigid-body superposition and only considers alignments with matching sequence
 order.
 
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
 
 ### FATCAT - flexible
 
@@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with
 FATCAT-flexible than with one of the rigid alignment algorithms. The downside of
 this is that it can lead to additional false positives in unrelated structures.
 
-![(Left) Rigid and (Right) flexible alignments of
-calmodulin](img/1cfd_1cll_fatcat.png)
+![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png)
 
-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
 
 ### Smith-Waterman
 
@@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a
 small number of badly aligned residues. However, this method is faster than
 the structure-based methods.
 
-BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
+BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
 
 ### Other methods
 
@@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations.
 The algorithm performs similarly to other multiple structure alignment algorithms for most protein families. 
 The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment.
 
-BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
-
-## PDB-wide Database Searches
-
-The Alignment GUI also provides functionality for PDB-wide structural searches.
-This systematically compares a structure against a non-redundant set of all
-other structures in the PDB at either a chain or a domain level. Representatives
-are selected using the RCSB's clustering of proteins with 40% sequence identity,
-as described
-[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp).
-Domains are selected using either SCOP (when available) or the
-ProteinDomainParser algorithm.
-
-![Database Search GUI](img/database_search.png)
-
-To perform a database search, select the 'Database Search' tab, then choose a
-query structure based on PDB ID, SCOP domain id, or from a custom file. The
-output directory will be used to store results. These consist of individual
-alignments in compressed XML format, as well as a tab-delimited file of
-similarity scores and statistics. The statistics are displayed in an interactive
-results table, which allows the alignments to be sorted. The 'Align' column
-allows individual alignments to be visualized with the alignment GUI.
-
-![Database Search Results](img/database_search_results.png)
-
-Be aware that this process can be very time consuming. Before
-starting a manual search, it is worth considering whether a pre-computed result
-may be available online, for instance for
-[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp)
-or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or
-specific domains, a few optimizations can reduce the time for a database search.
-Downloading PDB files is a considerable bottleneck. This can be solved by
-downloading all PDB files from the [FTP
-server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting
-the `PDB_DIR` environmental variable. This operation sped up the search from
-about 30 hours to less than 4 hours.
+BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
 
 
 ## Creating Alignments Programmatically
@@ -363,8 +321,7 @@ MultipleAlignmentJmolDisplay.display(result);
 
 Many of the alignment algorithms are available in the form of command line
 tools. These can be accessed through the main methods of the StructureAlignment
-classes. Tar bundles are also available with scripts for running
-[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp).
+classes.
 
 Example:
 ```bash
@@ -378,7 +335,7 @@ file in various formats.
 
 ## Alignment Data Model
 
-For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md)
+For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md)
 
 ## Acknowledgements
 
diff --git a/structure/caching.md b/structure/caching.md
@@ -53,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`.
 	AtomCache cache = new AtomCache();
 
 	cache.setPath("/tmp/");
-			
+
 	FileParsingParameters params = cache.getFileParsingParams();
-	
-	params.setLoadChemCompInfo(true);
 
 	StructureIO.setAtomCache(cache);
 
diff --git a/structure/mmcif.md b/structure/mmcif.md
@@ -13,7 +13,7 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and
 ## The Basics
 
 BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then has its own data model that reads PDB and mmCIF files 
-into a biological and chemically  meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)). 
+into a biological and chemically  meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)). 
 If you don't want to use that data model, you can still use the CIFTools-java parser, please refer to its documentation. 
 Let's start first with the most basic way of loading a protein structure.
 
diff --git a/structure/seqres.md b/structure/seqres.md
@@ -5,20 +5,19 @@ How molecular sequences are linked to experimentally observed atoms.
 
 ## Sequences and Atoms
 
-In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
+In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
 
-Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
+Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions observed in an experiment and available in the PDB map to UniProt.
 
-![Screenshot of Protein Feature View at RCSB]
-(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
+![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
 
 As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor.
 
 The blue-boxes are regions for which atoms records are available. For the grey regions there is sequence information available in the PDB, but no coordinates.
 
 ## Seqres and Atom Records
 
-The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
+The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in UniProt, since it can contain cloning artefacts and modifications that were necessary in order to crystallize a structure.
 
 The **Atom** records provide coordinates where it was possible to observe them.
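
The relationship between the two record types can be illustrated with a minimal sketch in plain Java. It does not use BioJava; the class `SeqresAtomSketch`, the method `maskUnobserved`, and the example sequence are all hypothetical and only assume that the Seqres sequence is known in full while coordinates exist for a subset of its positions.

```java
import java.util.BitSet;

/** Hypothetical helper for illustration only; not part of BioJava. */
final class SeqresAtomSketch {

    /**
     * Returns the Seqres sequence with every residue that has no
     * Atom coordinates replaced by '-'.
     *
     * @param seqres   one-letter sequence from the Seqres records
     * @param observed bit i is set if residue i has coordinates
     */
    static String maskUnobserved(String seqres, BitSet observed) {
        StringBuilder sb = new StringBuilder(seqres.length());
        for (int i = 0; i < seqres.length(); i++) {
            sb.append(observed.get(i) ? seqres.charAt(i) : '-');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sequence: an expression tag followed by a short fragment.
        String seqres = "MGSSHHHHHHAEKLV";

        // Assume only the last five residues were observed.
        BitSet observed = new BitSet(seqres.length());
        observed.set(10, seqres.length());

        System.out.println(seqres);
        System.out.println(maskUnobserved(seqres, observed));
    }
}
```

The first printed line corresponds to the grey-plus-blue regions in the figure (sequence information), the second to the blue regions only (coordinates available).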