Commit 72b7808

Merge pull request #34 from lemora/fix/structure-chapter-docs
Fixed documentation for parts of the structure chapter
2 parents fd1d94d + ac793c4 commit 72b7808

4 files changed: +22 -68 lines changed

structure/alignment.md

Lines changed: 16 additions & 59 deletions
@@ -20,12 +20,12 @@ acid sequences converge on a common tertiary structure.

 A **structural alignment** of other biological polymers can also be made in BioJava.
 For example, nucleic acids can be structurally aligned to find common structural motifs,
-independent of sequence simililarity. This is specially important for RNAs, because their
+independent of sequence similarity. This is specially important for RNAs, because their
 3D structure arrangement is important for their function.

 For more info see the Wikipedia article on [structure alignment](http://en.wikipedia.org/wiki/Structural_alignment).

-## Alignment Algorithms supported by BioJava
+## Alignment Algorithms Supported by BioJava

 BioJava comes with a number of algorithms for aligning structures. The following
 five options are displayed by default in the graphical user interface (GUI),
@@ -45,9 +45,9 @@ in 3D. See below for descriptions of the algorithms.
 Since BioJava version 4.1.0, multiple structures can be compared at the same time in
 a **multiple structure alignment**, that can later be visualized in Jmol.
 The algorithm is described in detail below. As an overview, it uses any pairwise alignment
-algorithm and a **reference** structure to per perform an alignment of all the structures.
+algorithm and a **reference** structure to perform an alignment of all the structures.
 Then, it runs a **Monte Carlo** optimization to determine the residue equivalencies among
-all the strucutures, identifying conserved **structural motifs**.
+all the structures, identifying conserved **structural motifs**.

 ## Alignment User Interface

@@ -91,7 +91,7 @@ This code shows the following user interface:
 ![Multiple Alignment GUI](img/multiple_gui.png)

 The input format is a free text field, where the structure identifiers are
-indidcated, space separated. A **structure identifier** is a String that
+indicated, space separated. A **structure identifier** is a String that
 uniquely identifies a structure. It is basically composed of the pdbID, the
 chain letters and the ranges of residues of each chain. For the formal description
 visit [StructureIdentifier](http://www.biojava.org/docs/api/org/biojava/nbio/structure/StructureIdentifier.html).
@@ -125,12 +125,12 @@ The Combinatorial Extension (CE) algorithm was originally developed by
 1998](http://peds.oxfordjournals.org/content/11/9/739.short) [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/9796821).
 It works by identifying segments of the two structures with similar local
 structure, and then combining those to try to align the most residues possible
-while keeping the overall RMSD of the superposition low.
+while keeping the overall root-mean-square deviation (RMSD) of the superposition low.

 CE is a rigid-body alignment algorithm, which means that the structures being
 compared are kept fixed during superposition. In some cases it may be desirable
 to break large proteins up into domains prior to aligning them (by manually
-inputing a subrange, using the [SCOP or CATH databases](externaldb.md), or by
+inputting a subrange, using the [SCOP or CATH databases](externaldb.md), or by
 decomposing the protein automatically using the [Protein Domain
 Parser](http://www.biojava.org/docs/api/org/biojava/nbio/structure/domain/LocalProteinDomainParser.html)
 algorithm).
@@ -146,10 +146,8 @@ to the C-terminal part of the other, and vice versa. CE-CP allows circularly
 permuted proteins to be compared. For more information on circular
 permutations, see the
 [Wikipedia](http://en.wikipedia.org/wiki/Circular_permutation_in_proteins) or
-[Molecule of the Month]
-(http://www.pdb.org/pdb/101/motm.do?momID=124&evtc=Suggest&evta=Moleculeof%20the%20Month&evtl=TopBar)
-articles [![pubmed]
-(http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).
+[Molecule of the Month](https://pdb101.rcsb.org/motm/124)
+articles [![pubmed](http://img.shields.io/badge/in-pubmed-blue.svg?style=flat)](http://www.ncbi.nlm.nih.gov/pubmed/22496628).


 For proteins without a circular permutation, CE-CP results look very similar to
@@ -173,8 +171,7 @@ It performs similarly to CE for most structures. The 'rigid' flavor uses a
 rigid-body superposition and only considers alignments with matching sequence
 order.

-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatRigid](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatRigid.html)

 ### FATCAT - flexible

@@ -186,11 +183,9 @@ calmodulin with and without calcium bound can be much better aligned with
 FATCAT-flexible than with one of the rigid alignment algorithms. The downside of
 this is that it can lead to additional false positives in unrelated structures.

-![(Left) Rigid and (Right) flexible alignments of
-calmodulin](img/1cfd_1cll_fatcat.png)
+![(Left) Rigid and (Right) flexible alignments of calmodulin](img/1cfd_1cll_fatcat.png)

-BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)
+BioJava class: [org.biojava.nbio.structure.align.fatcat.FatCatFlexible](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/fatcat/FatCatFlexible.html)

 ### Smith-Waterman

@@ -204,8 +199,7 @@ locating gaps can lead to high RMSD in the resulting superposition due to a
 small number of badly aligned residues. However, this method is faster than
 the structure-based methods.

-BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain]
-(http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)
+BioJava Class: [org.biojava.nbio.structure.align.ce.CeCPMain](http://www.biojava.org/docs/api/org/biojava/nbio/structure/align/ce/CeCPMain.html)

 ### Other methods

@@ -250,43 +244,7 @@ by the pairwise alignment algorithm limitations.
 The algorithm performs similarly to other multiple structure alignment algorithms for most protein families.
 The parameters both for the pairwise aligner and the MC optimization can have an impact on the final result. There is not a unique set of parameters, because they usually depend on the specific use case. Thus, trying some parameter combinations, keeping in mind the effect they produce in the score function, is a good practice when doing any structure alignment.

-BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain]
-(www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)
-
-## PDB-wide Database Searches
-
-The Alignment GUI also provides functionality for PDB-wide structural searches.
-This systematically compares a structure against a non-redundant set of all
-other structures in the PDB at either a chain or a domain level. Representatives
-are selected using the RCSB's clustering of proteins with 40% sequence identity,
-as described
-[here](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp).
-Domains are selected using either SCOP (when available) or the
-ProteinDomainParser algorithm.
-
-![Database Search GUI](img/database_search.png)
-
-To perform a database search, select the 'Database Search' tab, then choose a
-query structure based on PDB ID, SCOP domain id, or from a custom file. The
-output directory will be used to store results. These consist of individual
-alignments in compressed XML format, as well as a tab-delimited file of
-similarity scores and statistics. The statistics are displayed in an interactive
-results table, which allows the alignments to be sorted. The 'Align' column
-allows individual alignments to be visualized with the alignment GUI.
-
-![Database Search Results](img/database_search_results.png)
-
-Be aware that this process can be very time consuming. Before
-starting a manual search, it is worth considering whether a pre-computed result
-may be available online, for instance for
-[FATCAT-rigid](http://www.rcsb.org/pdb/static.do?p=general_information/cluster/structureAll.jsp)
-or [DALI](http://ekhidna.biocenter.helsinki.fi/dali/start). For custom files or
-specific domains, a few optimizations can reduce the time for a database search.
-Downloading PDB files is a considerable bottleneck. This can be solved by
-downloading all PDB files from the [FTP
-server](ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) and setting
-the `PDB_DIR` environmental variable. This operation sped up the search from
-about 30 hours to less than 4 hours.
+BioJava class: [org.biojava.nbio.structure.align.multiple.mc.MultipleMcMain](https://www.biojava.org/docs/api/org/biojava/nbio/structure/align/multiple/mc/MultipleMcMain.html)


 ## Creating Alignments Programmatically
@@ -363,8 +321,7 @@ MultipleAlignmentJmolDisplay.display(result);

 Many of the alignment algorithms are available in the form of command line
 tools. These can be accessed through the main methods of the StructureAlignment
-classes. Tar bundles are also available with scripts for running
-[CE and FATCAT](http://source.rcsb.org/jfatcatserver/download.jsp).
+classes.

 Example:
 ```bash
@@ -378,7 +335,7 @@ file in various formats.

 ## Alignment Data Model

-For details about the structure alignment data models in biojava, see [Structure Alignment Data Model](alignment-data-model.md)
+For details about the structure alignment data models in BioJava, see [Structure Alignment Data Model](alignment-data-model.md)

 ## Acknowledgements

structure/caching.md

Lines changed: 1 addition & 3 deletions
@@ -53,10 +53,8 @@ This example turns on the use of chemical components when loading a `Structure`.
 AtomCache cache = new AtomCache();

 cache.setPath("/tmp/");
-
+
 FileParsingParameters params = cache.getFileParsingParams();
-
-params.setLoadChemCompInfo(true);

 StructureIO.setAtomCache(cache);

structure/mmcif.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ The mmCIF file format has been around for some time (see [Westbrook 2000][] and
 ## The Basics

 BioJava uses the [CIFTools-java](https://github.com/rcsb/ciftools-java) library to parse mmCIF. BioJava then has its own data model that reads PDB and mmCIF files
-into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](mmcif.md)).
+into a biological and chemically meaningful data model (BioJava supports the [Chemical Components Dictionary](chemcomp.md)).
 If you don't want to use that data model, you can still use the CIFTools-java parser, please refer to its documentation.
 Let's start first with the most basic way of loading a protein structure.

structure/seqres.md

Lines changed: 4 additions & 5 deletions
@@ -5,20 +5,19 @@ How molecular sequences are linked to experimentally observed atoms.

 ## Sequences and Atoms

-In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB oftein contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).
+In many experiments not all atoms that are part of the molecule under study can be observed. As such the ATOM records in PDB often contain missing atoms or only the part of a molecule that could be experimentally determined. In case of multi-domain proteins the PDB often contains only one of the domains (and in some cases even shorter fragments).

-Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of how the regions that have been observed in an experiment and are available in the PDB map to UniProt.
+Let's take a look at an example. The [Protein Feature View](https://github.com/andreasprlic/proteinfeatureview) provides a graphical summary of the regions that have been observed in an experiment and are available in the PDB map to UniProt.

-![Screenshot of Protein Feature View at RCSB]
-(https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")
+![Screenshot of Protein Feature View at RCSB](https://raw.github.com/andreasprlic/proteinfeatureview/master/images/P06213.png "Insulin receptor - P06213 (INSR_HUMAN)")

 As you can see, there are three PDB entries (PDB IDs [3LOH](http://www.rcsb.org/pdb/explore.do?structureId=3LOH), [2HR7](http://www.rcsb.org/pdb/explore.do?structureId=2RH7), [3BU3](http://www.rcsb.org/pdb/explore.do?structureId=3BU3)) that cover different regions of the UniProt sequence for the insulin receptor.

 The blue-boxes are regions for which atoms records are available. For the grey regions there is sequence information available in the PDB, but no coordinates.

 ## Seqres and Atom Records

-The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequences as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.
+The sequence that has been used in the experiment is stored in the **Seqres** records in the PDB. It is often not the same sequence as can be found in Uniprot, since it can contain cloning-artefacts and modifications that were necessary in order to crystallize a structure.

 The **Atom** records provide coordinates where it was possible to observe them.
