Ba 5070
Paul Emsley* and Kevin Cowtan
York Structural Biology Laboratory, University of York, Heslington, York YO10 5YW, England
Correspondence e-mail: emsley@ysbl.york.ac.uk
Received 26 February 2004
Accepted 4 August 2004

CCP4mg is a project that aims to provide a general-purpose tool for structural biologists, providing tools for X-ray structure solution, structure comparison and analysis, and publication-quality graphics. The map-fitting tools are available as a stand-alone package, distributed as `Coot'.
1. Introduction
Molecular graphics still plays an important role in the determination of protein structures using X-ray crystallographic data, despite on-going efforts to automate model building. Functions such as side-chain placement, loop, ligand and fragment fitting, structure comparison, analysis and validation are routinely performed using molecular graphics. Lower resolution ($d_{\min}$ worse than 2.5 Å) data in particular need interactive fitting.

The introduction of FRODO (Jones, 1978) and then O (Jones et al., 1991) to the field of protein crystallography was in each case revolutionary, each in their time breaking new ground in demonstrating what was possible with the current hardware. These tools allowed protein crystallographers to enjoy what is widely held to be the most thrilling part of their work: giving birth (as it were) to a new protein structure. The CCP4 program suite (Collaborative Computational Project, Number 4, 1994) is an integrated collection of software for macromolecular crystallography, with a scope ranging from data processing to structure refinement and validation. Until recently, molecular graphics had not been part of the suite. With the recent computational and graphical performance of relatively cheap hardware, the time had arrived for CCP4 to provide graphical functionality for knowledge-based (semi-)automatic building using powerful modern languages in a flexible extendible package. CCP4mg (Potterton et al., 2004) is an initiative by CCP4 to provide libraries and a molecular graphics application that is a popular system for representation, modelling, structure determination, analysis and validation. The aim is to provide a system that is easy to use and a platform for developers who wish to integrate macromolecular computation with a molecular-graphics interface. There are several modules to such graphical functionality; the protein model-building/map-fitting tools described here are only a part. These tools are available as a stand-alone software package, Coot.

A map-fitting program has to provide certain functionality, which is not required by a molecular-display program. These functions include symmetry coordinates, electron-density map contouring and the ability to move the coordinates in various ways, such as model idealization or according to side-chain rotamer probabilities.

© 2004 International Union of Crystallography. Printed in Denmark – all rights reserved
Acta Cryst. (2004). D60, 2126–2132. Emsley & Cowtan, Model-building tools for molecular graphics
research papers
should be known. These ideal values can come in various forms. The interface in Coot reads the mmCIF dictionaries of REFMAC, which define ideal values and estimated standard deviations for bond lengths, angles, torsions, planes and chiral centres. Coot uses the Polak–Ribière variant of the BFGS (Broyden–Fletcher–Goldfarb–Shanno) conjugate-gradient multi-variable function minimizer to optimize the coordinates. The analytical gradient derivations are described in Appendix A.

2.6.1. Fitting to the map. As described above, the map gradients are provided by a Clipper function. These map gradients (at the positions of the atom centres) are simply multiplied by a (user-changeable) scaling factor and added to the geometric terms to define the target function (this is called `Refinement' in Coot).

2.7. Finding ligands
A map can be masked by a set of coordinates (typically those of the currently determined atoms of the protein model). This approach leaves a map that has positive density at places where there are no atoms to represent that density (similar, in fact, to an $F_o - F_c$ map). This masked map is searched for `clusters' of density above a particular level. The clustering of the grid points of the asymmetric unit into potential ligand sites is performed conveniently using a recursive neighbour search of the map. The clusters are sorted according to size and electron-density value. Eigenvalues and eigenvectors are calculated for each cluster of grid points. Similarly, the eigenvalues and eigenvectors of the search ligands (there can of course be just one search ligand) are computed (the parameters being the positions of the atom centres). The eigenvalues of the ligands are compared with the eigenvalues of each of the electron-density clusters and if they are sufficiently similar the ligand is placed into the cluster by matching the centre of the test ligand and the centre of the cluster. The ligand is oriented in each of the four different orientations that provide coinciding eigenvectors and then rigid-body refined and scored. The score is simply the sum of the electron density at the atom centres. The score at each site for each different ligand is compared and the best fit (highest score with sufficient fraction of atoms in positive density after the rigid-body refinement) is chosen. This last check ensures that oversized ligands are not fitted into small clusters.

2.7.1. Flexible ligands. Instead of having a series of different ligand compounds, the search ligands can be generated from a single ligand that has rotatable bonds. The ligand dictionary provides a description of the geometry of the ligand including torsions. These torsions are randomly sampled for a number of trials (by default 1000) to provide coordinates that can be checked against the potential ligand sites as described above. An enhancement would be to allow the determination of the number of trials to depend on the number of torsions.

2.7.2. Finding water molecules. The electron density is clustered as described for ligands. For clusters that have a volume below a certain upper limit (4.2 Å³, which stops water molecules being placed in multi-atom ligand sites) a starting position is determined from the mean position of the grid coordinates of the cluster. This position is then optimized by refining the position to the local maximum as determined by cubic interpolation of the map. A map sphericity test is then applied; the variance of the cubic interpolated electron density at points 0.3, 0.6 and 0.9 Å from the local maximum in positive and negative offsets along the x, y and z axes are determined. The variances are summed and must be lower than a user-changeable cutoff (default 0.07 e² Å⁻⁶). The successful positions are then compared with the coordinates of the protein's O and N atoms. If the distance is between user-changeable criteria (default 2.4–3.4 Å) then the position is accepted as a solvent O atom and (optionally) added to the protein model.

2.8. Add terminal residue
Given the selection of a terminal residue (which also could merely be the start of a gap of unplaced residues), two residue-type independent randomly selected φ/ψ pairs are made from Clipper's Ramachandran distribution of φ/ψ pairs. These angles are used to generate positions of C, Cα, O and N main-chain atoms for the neighbouring two residues using the peptide geometry. This set of atoms then undergoes rigid-body refinement to optimize the fit to the map. The score of the fit and the positions of the atoms are recorded. This procedure is then repeated a number of times (by default 100). The main-chain atoms of the neighbouring residue's best-fit atoms are then offered as a position of the neighbouring residue (the atoms of the next neighbouring residue are discarded).

2.9. Skeletonization and Cα building
Coot uses a Clipper map to generate and store the skeleton. This approach is convenient because, like electron-density maps, the skeleton can be displayed `on the fly' anywhere in the crystal (i.e. it is not limited to a precalculated region). The Clipper skeletonization algorithm is similar to that employed in DM from CCP4 (Cowtan, 1994). A skeleton `bond' (bone) is drawn between neighbouring map grid points if both points are marked as skeleton points.

The skeleton can be further trimmed by recursive tip removal (a tip being a grid point with one or zero neighbours). This process removes side chains and, potentially, parts of the termini, but provides an easy means of identifying the fold and non-crystallographic symmetry.

Like some validation (Kleywegt, 1997) and other attempts at automated model building (Morris et al., 2002; Oldfield & Hubbard, 1994; Oldfield, 2001), a likelihood distribution for the pseudo-torsion angle Cα(n)–Cα(n + 1)–Cα(n + 2)–Cα(n + 3) versus the angle Cα(n + 1)–Cα(n + 2)–Cα(n + 3) has been generated from high-resolution structures in the PDB (Berman et al., 2000) (Fig. 1). Once at least three Cα atoms have been placed, this is used as prior knowledge in the placement of the next Cα position in the following manner. Skeleton points between 2.4 and 3.4 Å from the current Cα position (which has an associated nearby skeleton point) are selected. These skeleton points are tested for direct connectivity to the current skeleton point.
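The shell selection and connectivity test just described can be sketched in plain Python. This is only an illustration under stated assumptions, not Coot's implementation: the 26-neighbour definition of a skeleton bond, the orthogonal grid with a single `spacing`, the restriction of the walk to points inside the outer radius, and every function and parameter name are hypothetical.

```python
from collections import deque
from itertools import product
import math

# 26-connected neighbourhood on an orthogonal grid (an assumption of this sketch).
NEIGHBOUR_OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def candidate_points(skeleton, start, ca_pos, spacing=1.0, r_min=2.4, r_max=3.4):
    """Return {grid_point: is_connected} for skeleton points in the distance shell.

    skeleton : set of (i, j, k) grid indices marked as skeleton points
    start    : skeleton point associated with the current C-alpha
    ca_pos   : Cartesian position of the current C-alpha, in angstrom
    """
    def dist(p):
        return math.dist([c * spacing for c in p], ca_pos)

    # Breadth-first walk along skeleton bonds, never leaving the outer radius.
    reached = {start}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for d in NEIGHBOUR_OFFSETS:
            q = (p[0] + d[0], p[1] + d[1], p[2] + d[2])
            if q in skeleton and q not in reached and dist(q) <= r_max:
                reached.add(q)
                queue.append(q)

    # Keep only the points in the shell, flagged by whether the walk reached them.
    return {p: (p in reached) for p in skeleton if r_min <= dist(p) <= r_max}
```

The boolean flag attached to each candidate is what a skeleton-based score would then be derived from.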
Directly connected skeleton points are assigned a score of 100; unconnected ones are assigned a score of 0.1. For each selected skeleton point, a test point is then generated 3.8 Å from the current Cα position in the direction of the skeleton point. A Cα pseudo-torsion angle and angle pair are generated from the position of the test point, the current Cα position and the two previously assigned Cα positions. This pseudo-torsion angle and angle pair are used to generate a score by looking up the value in the internal representation of Fig. 1 using cubic interpolation. This value is combined with the skeleton-based score for this particular test point. This procedure is then repeated in a `look-ahead' manner, assuming that the test point is a member of the set of four Cα positions generating the Cα pseudo-torsion/angle pair. The most likely solution for the look-ahead is combined with the score for the current test point. The test points are then sorted by combined score and interactively offered as potential positions for the next Cα atom, the positions with the best score being offered first. Occasionally (usually as a result of a positional error in the current Cα position), 3.8 Å is the wrong distance to the next correct Cα position; thus the user is allowed to change the length to something other than 3.8 Å.

The depth of the look-ahead in the current implementation is at level 1 but could trivially be extended (in tests, a level 2 look-ahead was better but took too long to be considered pleasantly interactive).

This algorithm has room for improvement: for example, by considering the value of the density at the test point and along the Cα pseudo-bond, one-third and two-thirds of the way along the bond (corresponding to positions that are close to the peptide C and N atoms; Oldfield, 2003).

Cα coordinates are converted to main-chain coordinates in a manner similar to that previously described (Jones & Thirup, 1986; Esnouf, 1997).

Figure 1
Cα pseudo-torsion angle versus opening angle for proteins in the PDB used in the likelihood assignment of potential Cα positions.

APPENDIX A
Regularization and refinement derivatives
The function that we are trying to minimize is S, where

$$S = S_{\mathrm{bond}} + S_{\mathrm{angle}} + S_{\mathrm{torsion}} + S_{\mathrm{plane}}.$$

A1. Bonds

$$S_{\mathrm{bond}} = \sum_{i=1}^{N_{\mathrm{bonds}}} (b_i - b_{0i})^2,$$

where $b_{0i}$ is the ideal length (from the dictionary) of the $i$th bond, $\mathbf{b}_i$ is the bond vector and $b_i$ is its length.

$$\frac{\partial S_i}{\partial x_m} = \frac{\partial S_i}{\partial b_i}\frac{\partial b_i}{\partial x_m} = 2(b_i - b_{0i})\frac{\partial b_i}{\partial x_m},$$

$$b_i = \left[(x_m - x_k)^2 + (y_m - y_k)^2 + (z_m - z_k)^2\right]^{1/2}.$$

Therefore

$$\frac{\partial b_i}{\partial x_m} = \frac{1}{2}\,\frac{1}{b_i}\,2(x_m - x_k) = \frac{x_m - x_k}{b_i}$$

and

$$\frac{\partial S_i}{\partial x_m} = 2(b_i - b_{0i})\,\frac{x_m - x_k}{b_i}.$$

A2. Angles
We are trying to minimize $S_{\mathrm{angle}}$, where (for simplicity, the weights have been omitted)

$$S_{\mathrm{angle}} = \sum_{i=1}^{N_{\mathrm{angles}}} (\theta_i - \theta_{0i})^2.$$

Angle θ is contributed to by atoms k, l and m:

$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{ab},$$

where $\mathbf{a}$ is the bond of atoms k and l $[(x_k - x_l), (y_k - y_l), (z_k - z_l)]$ and $\mathbf{b}$ is the bond of atoms l and m $[(x_m - x_l), (y_m - y_l), (z_m - z_l)]$. Note that the vectors point away from the middle atom l.

Therefore,

$$\theta = \arccos P, \tag{1}$$

where

$$P = \frac{\mathbf{a}\cdot\mathbf{b}}{ab}.$$
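The bond derivative of A1 lends itself to a numerical cross-check. The sketch below, in plain Python, evaluates $S_{\mathrm{bond}}$ with its analytical gradient and compares that gradient with central finite differences. The function names and the `(k, m, ideal_length)` bond tuples are hypothetical. The gradient on atom k is taken as the negative of that on atom m, which is not written out in A1 but follows from $b_i$ depending only on the difference of the two positions.

```python
import math


def bond_term_and_gradient(coords, bonds):
    """coords: list of [x, y, z]; bonds: list of (k, m, ideal_length).

    Returns S_bond and its gradient for every coordinate, using
    dS_i/dx_m = 2 (b_i - b_0i) (x_m - x_k) / b_i for atom m and the
    opposite sign for atom k.
    """
    total = 0.0
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for k, m, b0 in bonds:
        diff = [coords[m][d] - coords[k][d] for d in range(3)]
        b = math.sqrt(sum(c * c for c in diff))
        total += (b - b0) ** 2
        for d in range(3):
            g = 2.0 * (b - b0) * diff[d] / b
            grad[m][d] += g
            grad[k][d] -= g
    return total, grad


def numerical_gradient(coords, bonds, h=1e-6):
    """Central finite differences of S_bond, for comparison only."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for a in range(len(coords)):
        for d in range(3):
            plus = [list(c) for c in coords]
            minus = [list(c) for c in coords]
            plus[a][d] += h
            minus[a][d] -= h
            grad[a][d] = (bond_term_and_gradient(plus, bonds)[0]
                          - bond_term_and_gradient(minus, bonds)[0]) / (2 * h)
    return grad
```

Agreement between the two gradients to within the finite-difference error is a cheap way to catch sign or index slips before the angle and torsion terms are added.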
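Applying the chain rule to $\theta = \arccos P$,

$$\frac{\partial\theta}{\partial x_k} = \frac{\partial\theta}{\partial P}\,\frac{\partial P}{\partial x_k}. \tag{2}$$

Given that we are only interested in θ in the range 0 → π,

$$\frac{\partial\theta}{\partial P} = -\frac{1}{\sin\theta}. \tag{3}$$

Again using the chain rule,

$$\frac{\partial P}{\partial x_k} = Q\,\frac{\partial R}{\partial x_k} + R\,\frac{\partial Q}{\partial x_k}, \tag{4}$$

where

$$Q = \mathbf{a}\cdot\mathbf{b}, \tag{5}$$

$$R = \frac{1}{ab}. \tag{6}$$

A3. Angles: the middle atom
A middle atom is somewhat more tricky than an end atom because the derivatives of $ab$ and $\mathbf{a}\cdot\mathbf{b}$ are not so trivial. Let us change the indexing so that we are actually talking about the middle atom, l.

Differentiating (6) gives

$$\frac{\partial R}{\partial x_l} = -\frac{1}{a^2 b}\,\frac{\partial a}{\partial x_l} - \frac{1}{a b^2}\,\frac{\partial b}{\partial x_l}. \tag{7}$$

$\partial a/\partial x_l$ here is exactly the same as for bonds,

$$\frac{\partial a}{\partial x_l} = \frac{x_l - x_k}{a}.$$

Similarly,

$$\frac{\partial b}{\partial x_l} = \frac{x_l - x_m}{b}.$$

Therefore, substituting these equations into (7) gives

$$\frac{\partial R}{\partial x_l} = -\frac{x_l - x_k}{a^3 b} - \frac{x_l - x_m}{a b^3}.$$

Turning to Q, recall (5); therefore

$$Q = (x_k - x_l)(x_m - x_l) + (y_k - y_l)(y_m - y_l) + (z_k - z_l)(z_m - z_l)$$

and hence

$$\frac{\partial Q}{\partial x_l} = -(x_k - x_l) - (x_m - x_l).$$

Substituting all the above into (4) gives

$$\frac{\partial P}{\partial x_l} = \mathbf{a}\cdot\mathbf{b}\left[-\frac{x_l - x_k}{a^3 b} - \frac{x_l - x_m}{a b^3}\right] + \frac{-(x_k - x_l) - (x_m - x_l)}{ab}.$$

Combining this expression and (3) into (2) we obtain

$$\frac{\partial\theta}{\partial x_l} = -\frac{1}{\sin\theta}\,\frac{\partial P}{\partial x_l}.$$

A4. Angles: an end atom (atoms k or m)
This case is simpler because there are no cross-terms in $\partial R/\partial x_k$ and $\partial Q/\partial x_k$.

$$\frac{\partial R}{\partial x_k} = \frac{x_l - x_k}{a^3 b}$$

and

$$\frac{\partial Q}{\partial x_k} = x_m - x_l,$$

and so

$$\frac{\partial\theta}{\partial x_k} = -\frac{1}{\sin\theta}\left[\frac{x_l - x_k}{a^2}\cos\theta + \frac{x_m - x_l}{ab}\right]. \tag{8}$$

The torsion angle θ is the angle between $\mathbf{a}\times\mathbf{b}$ and $\mathbf{b}\times\mathbf{c}$ (Fig. 2) and this can be written as

$$\theta = \arctan\left[\frac{\mathbf{a}\cdot(\hat{\mathbf{b}}\times\mathbf{c})}{-\mathbf{a}\cdot\mathbf{c} + (\mathbf{a}\cdot\hat{\mathbf{b}})(\hat{\mathbf{b}}\cdot\mathbf{c})}\right], \tag{9}$$

where $\hat{\mathbf{b}}$ is a unit vector in the direction of $\mathbf{b}$, $\hat{\mathbf{b}} = \mathbf{b}/b$.

This definition of the torsion angle is used rather than the more common definition, which uses three cross-products, because our version and its derivatives are faster to calculate.

Let us split the expression up into tractable portions; the evaluation of θ in the program will combine these expressions, starting at the end (the most simple).

From the primitives,

$$a_x = P_{2x} - P_{1x}, \qquad b_x = P_{3x} - P_{2x}, \qquad c_x = P_{4x} - P_{3x},$$

$$a_y = P_{2y} - P_{1y}, \qquad b_y = P_{3y} - P_{2y}, \qquad c_y = P_{4y} - P_{3y},$$

$$a_z = P_{2z} - P_{1z}, \qquad b_z = P_{3z} - P_{2z}, \qquad c_z = P_{4z} - P_{3z},$$

$$\theta = \arctan D,$$

where

$$D = \frac{\mathbf{a}\cdot(\mathbf{b}\times\mathbf{c})/b}{-\mathbf{a}\cdot\mathbf{c} + (\mathbf{a}\cdot\mathbf{b})(\mathbf{b}\cdot\mathbf{c})/b^2}.$$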
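As a sanity check on the single-cross-product form, the following plain-Python sketch evaluates the numerator and denominator of D and compares the result with a torsion built from the two plane normals $\mathbf{a}\times\mathbf{b}$ and $\mathbf{b}\times\mathbf{c}$. The function names are hypothetical. `atan2` is used in place of a bare arctangent so that the quadrant is retained; that is a choice made for this sketch, not something the text specifies.

```python
import math


def sub(p, q):
    return [p[i] - q[i] for i in range(3)]


def dot(u, v):
    return sum(u[i] * v[i] for i in range(3))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def torsion_single_cross(p1, p2, p3, p4):
    """Torsion from tan(theta) = D, which needs only the product b x c."""
    a, b, c = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    b_len = math.sqrt(dot(b, b))
    numerator = dot(a, cross(b, c)) / b_len
    denominator = -dot(a, c) + dot(a, b) * dot(b, c) / dot(b, b)
    return math.atan2(numerator, denominator)


def torsion_three_cross(p1, p2, p3, p4):
    """Torsion from the normals a x b and b x c and their cross product."""
    a, b, c = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(a, b), cross(b, c)
    b_len = math.sqrt(dot(b, b))
    b_hat = [x / b_len for x in b]
    return math.atan2(dot(b_hat, cross(n1, n2)), dot(n1, n2))
```

Both functions return the angle in radians in the range (−π, π]; for non-degenerate geometry they should agree to rounding error.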
Figure 2
Nomenclature used for torsion angles.
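The torsion derivatives are written in terms of intermediate quantities E, F, G, H, J, K, L and M, whose definitions do not survive in this text. The assignments below are an inference, not a quotation: they are the ones under which the tabulated derivatives and the expression for D agree term by term.

```latex
\begin{aligned}
H &= -\mathbf{a}\cdot\mathbf{c}, & J &= \mathbf{a}\cdot\mathbf{b}, &
K &= \mathbf{b}\cdot\mathbf{c}, & L &= 1/b^{2},\\
M &= \mathbf{a}\cdot(\mathbf{b}\times\mathbf{c}), & E &= M/b, &
G &= H + JKL, & F &= 1/G,\\
D &= E/G = EF.
\end{aligned}
```

With these assignments the numerator of D is E and its denominator is G, so the quotient rule gives a derivative of the form $F\,\partial E - (E/G^2)\,\partial G$.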
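Differentiating $\theta = \arctan D$ by the chain rule gives

$$\frac{\partial\theta}{\partial x_{P_1}} = \frac{\partial\theta}{\partial D}\,\frac{\partial D}{\partial x_{P_1}}, \tag{10}$$

$$\frac{\partial H}{\partial x_{P_1}} = c_x, \qquad \frac{\partial H}{\partial x_{P_2}} = -c_x, \qquad \frac{\partial H}{\partial x_{P_3}} = a_x, \qquad \frac{\partial H}{\partial x_{P_4}} = -a_x,$$

$$\frac{\partial K}{\partial x_{P_1}} = 0, \qquad \frac{\partial K}{\partial x_{P_2}} = -c_x, \qquad \frac{\partial K}{\partial x_{P_3}} = c_x - b_x, \qquad \frac{\partial K}{\partial x_{P_4}} = b_x,$$

$$\frac{\partial J}{\partial x_{P_1}} = -b_x, \qquad \frac{\partial J}{\partial x_{P_2}} = b_x - a_x, \qquad \frac{\partial J}{\partial x_{P_3}} = a_x, \qquad \frac{\partial J}{\partial x_{P_4}} = 0.$$

The $\partial b/\partial x$ terms are just like the bond derivatives,

$$\frac{\partial L}{\partial x_{P_1}} = \frac{\partial L}{\partial b}\,\frac{\partial b}{\partial x_{P_1}},$$

i.e.

$$\frac{\partial L}{\partial x_{P_3}} = -\frac{2}{b^3}\,\frac{x_{P_3} - x_{P_2}}{b} = -\frac{2(x_{P_3} - x_{P_2})}{b^4}.$$

$$\frac{\partial M}{\partial x_{P_1}} = -(b_y c_z - b_z c_y),$$

$$\frac{\partial M}{\partial x_{P_2}} = (b_y c_z - b_z c_y) + (a_y c_z - a_z c_y),$$

$$\frac{\partial M}{\partial x_{P_3}} = (a_z c_y - a_y c_z) + (b_y a_z - b_z a_y),$$

$$\frac{\partial M}{\partial x_{P_4}} = a_y b_z - a_z b_y,$$

$$\frac{\partial M}{\partial y_{P_1}} = -(b_z c_x - b_x c_z),$$

$$\frac{\partial M}{\partial y_{P_2}} = (b_z c_x - b_x c_z) + (a_z c_x - a_x c_z),$$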
$$\frac{\partial M}{\partial y_{P_3}} = -(a_z c_x - a_x c_z) + (b_z a_x - b_x a_z),$$

$$\frac{\partial M}{\partial y_{P_4}} = -(b_z a_x - b_x a_z),$$

$$\frac{\partial M}{\partial z_{P_1}} = -(b_x c_y - b_y c_x),$$

$$\frac{\partial M}{\partial z_{P_2}} = (b_x c_y - b_y c_x) + (a_x c_y - a_y c_x),$$

$$\frac{\partial M}{\partial z_{P_3}} = -(a_x c_y - a_y c_x) + (a_y b_x - a_x b_y),$$

$$\frac{\partial M}{\partial z_{P_4}} = -(a_y b_x - a_x b_y).$$

A7. Combining terms
Combining, we obtain the following expression for the derivative of the torsion angle θ in terms of the primitive derivatives,

$$\frac{\partial\theta}{\partial x_{P_1}} = \frac{1}{1 + \tan^2\theta}\,\frac{\partial D}{\partial x_{P_1}},$$

where

$$\frac{\partial D}{\partial x_{P_1}} = F\,\frac{\partial E}{\partial x_{P_1}} - \frac{E}{G^2}\left[\frac{\partial H}{\partial x_{P_1}} + \left(JL\,\frac{\partial K}{\partial x_{P_1}} + KL\,\frac{\partial J}{\partial x_{P_1}} + JK\,\frac{\partial L}{\partial x_{P_1}}\right)\right].$$

A8. Planes

$$S_{\mathrm{plane}} = \sum_{i=1}^{N_{\mathrm{planes}}}\ \sum_{j=1}^{N_{\mathrm{atoms},i}} e_{ij}^2,$$

where $e_{ij}$ is the distance of the $i$th plane restraint's $j$th atom from the $i$th plane restraint's least-squares plane.

Recall the equation of a plane: $ax + by + cz + d = 0$. Firstly, the centres of the sets of atoms, $x_{\mathrm{cen}}$, $y_{\mathrm{cen}}$, $z_{\mathrm{cen}}$, are determined. The plane is moved so that it crosses the origin and therefore $d = 0$ (it is moved back later). The problem then involves three equations, three unknowns and an eigenvalue problem, with the smallest eigenvalue corresponding to the best-fit plane.

The least-squares planes of the plane restraints are recalculated at every iteration.

The authors thank Garib Murshudov, Eleanor J. Dodson, Jack Quine and the many Coot testers. KC is supported by The Royal Society (grant No. 003R05674). PE is funded by BBSRC grant No. 87/B17320.

References
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–766.
Cowtan, K. (1994). Jnt CCP4/ESF–EACBM Newsl. Protein Crystallogr. 31, 34–38.
Cowtan, K. (2002). Jnt CCP4/ESF–EACBM Newsl. Protein Crystallogr. 40.
Cowtan, K. (2003). Crystallogr. Rev. 9, 73–80.
Dunbrack, R. L. Jr & Cohen, F. E. (1997). Protein Sci. 6, 1661–1681.
Esnouf, R. M. (1997). Acta Cryst. D53, 665–672.
Jones, T. A. (1978). J. Appl. Cryst. 11, 268–272.
Jones, T. A., Cowan, S., Zou, J.-Y. & Kjeldgaard, M. (1991). Acta Cryst. A47, 110–119.
Jones, T. A. & Thirup, S. (1986). EMBO J. 5, 819–822.
Kleywegt, G. J. (1997). J. Mol. Biol. 273, 371–376.
Krissinel, E. B., Winn, M. D., Ballard, C. C., Ashton, A. W., Patel, P., Potterton, E. A., McNicholas, S. J., Cowtan, K. D. & Emsley, P. (2004). Acta Cryst. D60, 2250–2255.
McRee, D. E. (1999). J. Struct. Biol. 125, 156–165.
Morris, R. J., Perrakis, A. & Lamzin, V. S. (2002). Acta Cryst. D58, 968–975.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
Oldfield, T. J. (2001). Acta Cryst. D57, 82–94.
Oldfield, T. J. (2003). Acta Cryst. D59, 483–491.
Oldfield, T. J. & Hubbard, R. E. (1994). Proteins Struct. Funct. Genet. 18, 324–337.
Potterton, L., McNicholas, S., Krissinel, E., Gruber, J., Cowtan, K., Emsley, P., Murshudov, G. N., Cohen, S., Perrakis, A. & Noble, M. (2004). Acta Cryst. D60, 2288–2294.