If it's your day job to push proteins in silico then you will one day have to calculate the RMSD of a protein. For example, you've just simulated the protein GinormousA for a whole micro-second, but you don't even know if GinormousA is stable. Sure, you could load up a protein viewer and eyeball the stability of the protein throughout the simulation. But eye-balling is just not going to cut it with your boss -- you need to slap a number on it. This is where RMSD comes in. RMSD is a quantitative measurement of difference between two atomic structures.
A molecule's conformation is basically a set of 3-dimensional coordinates. This means that a conformation is a set of
vectors,
, where each
has three components. As the molecular conformation changes, therefore, the result is a new set of vectors
. We can capture the the differences between conformations using these two sets of vectors.
To make life easier, we first shift the center of mass of
and
to the origin of the coordinate system (we can always shift them back afterwards). Once "centered," the differences between the vector sets can be captured by the total squared distance between them:
The root mean-square deviation (RMSD) is then:
But hola, you say...this mean-square measure doesn't measure similarity very well! After all, any rotation of the set of
(which doesn't change the internal arrangement of
), would distort the RMSD! What we really need, then, is to find the best rotation of
with respect to
before taking the RMSD. To rotate the vectors
, we apply a rotation matrix
to get
. We can then express
as a function of the rotation
:
Thus, to find the RMSD, we must first find the rotation
that minimizes
-- the optimal superposition of the two structures.
to get this rotation. The algorithm described in those papers have since become part of the staple diet of protein analysis. As the original proof for the Kabsch equations (using 6 different Lagrange multipliers) is rather tedious, we'll describe a much simpler proof using standard linear algebra.
First, we expand
(which is now a function of
):
The first part,
, is invariant with respect to a rotation
. The variation resides in the last term,
. The goal, therefore, is to find the matrix
that gives the largest possible value of
, from which we obtain:
and
into two
matrices
and
. Then we can write
as a matrix trace:
where
is a
matrix and
is an
matrix. Juggling the matrices inside the trace, we get:
Now, we still don't know what
is, but we can study the matrix
, which is the
correlation matrix between
and
. Furthermore, by invoking the powerful Singular Value Decomposition theorem, we know that we can always write R as:
and obtain the
matrices
and
(which are orthogonal), and
(which is positive diagonal, with elements
). We can then substitute these matrices into the expression for
:
Because
is diagonal, we can get a simple expression for
in terms of the
:
. The secret sauce is in
. If we look at the definition of
, we can see that it is a product of orthogonal matrices, namely, the matrices
,
and
. Thus,
must also be orthogonal.
Given this, the largest possible value for any element of
is 1, or
. And since
is a product of terms linear in the diagonal elements of
, the maximum value of
occurs when the
:
Plugging this into the equation for RMSD, we get
From this analysis, we can also determine the rotation that yields the optimal superposition of the vectors. When
, then
, the identity matrix. This immediately gives us an equation for the optimal rotation
in terms of the left and right orthogonal vectors of the matrix R:
, but we still don't know how to find them! Never fear. Kabsch, who first derived the solution, showed that to solve for
, we should solve the matrix
, which is a nice symmetric real matrix, and is consequently much easier to solve. Expanding on the singular value decomposition from the previous section
we can see that the singular value term of
is
. In other words, the singular values of
(
) are related to the singular values of
(
) by the relationship
. And also, the right singular vectors
are the same for both
and
.
Furthermore, since
is a symmetric real matrix, its eigenvalues are identical to its singular values
. Thus, we can use any standard method (e.g. the Jacobi transformation) to find the eigenvalues of
, and by substituting
into the RMSD formula:
from the SVD theorem. You see,
is guaranteed to be orthogonal, but for our RMSD calculation, we don't just want orthogonal matrices, we want rotation matrices. And while orthogonal matrices can have determinants of +1 or -1, "proper" rotation matrices have determinants of +1 (unless they're also reflected, in which case they'll have determinants of -1, and are termed "improper" rotations).When does this happen? To answer this question, take a look at your hands. Assuming that you're a primate, you have two of them, both alike, but reflected across the plane of symmetry that runs down the center of your body. Now, try to superimpose your hands. Go ahead. Try. If you're observant, you'll notice that you simply can't superimpose your hands. Even if you cut them off of your arms, the best you can hope to do is to align them so that their outlines are coincident -- but the palm of one hand will still be touching the other. You can never perfectly superimpose your hands, because your hands are chiral.
The same thing happens in molecular systems -- molecules are frequently chiral, and when two molecules are related by chirality, it sometimes happens that their optimal superposition involves a rotation, followed by a reflection. Thus, the "improper" rotation. Unfortunately, a reflection in
will result in an incorrect RMSD value using the stated algorithm, so we need to detect this situation before it becomes a problem.
To restrict the solutions of
to proper rotations, we check the determinants of
and
:
If
then there is a reflection in
, and we must eliminate the reflection. The easiest way to do this is to reflect the smallest principal axis in
, by reversing the sign of the third (smallest) eignevalue, which yields the final Kabsch formula for RMSD in terms of
and
:
, rather than using an iterative approach (such as Jacobi). If only RMSD is desired (i.e. the calculation of a rotation matrix is unecessary), the get_rmsd() method is more efficient.In addition to these methods, a get_trivial_rmsd() function is provided to calculate the RMSD value between two structures without attempting to superimpose their atoms.
1.5.1