A summary of SCA calculations

Olivier Rivoire1, Stanislas Leibler1, and Rama Ranganathan2
1Laboratory of Living Matter, and Center for Studies in Physics and Biology, The Rockefeller University, 1230 York Avenue, New York, New York 10065, USA.
2The Green Center for Systems Biology, and Department of Pharmacology, University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA.

(Dated: August 18, 2008)

Download the PDF version of this document

 

I. INTRODUCTION

This document provides a short summary of the principles and implementations of the SCA method. A more thorough description of the assumptions, justifications and open questions related to the method will be given elsewhere in formal publication.

 

II. PRELIMINARIES - MULTIPLE SEQUENCE ALIGNMENT AND FREQUENCIES

A. Frequencies

A multiple sequence alignment of M sequences of length L is represented by a binary array xi,s(a), where xi,s(a) = 1 if sequence s has amino acid a at position i, and 0 otherwise (s = 1,,M is for sequences, i = 1,,L is for positions and a = 1,,20 is for amino acids). The frequency fi(a) of an amino acid a at position i is computed as the number of sequences in the alignment having amino acid a at position i, divided by the total number of sequences, including those with a gap at i; it can also be written

  (a)    (a)
fi  = ⟨xi,s⟩s,
(1)

where xi,s(a) is averaged over all sequences s.

 

B. Binary approximation

In Halabi et al., (manuscript submitted), we also make use of a so-called ”binary approximation” of the full alignment in which we consider only the most frequent amino acid ai at position i. The alignment is then represented by a binary array xi,s where xi,s = 1 if sequence s contains the most frequent amino acid at position i, and 0 otherwise (i.e., xi,s = xi,s(ai)). This reduction is useful since, as calculated below (Sec. III ), the positional conservation in the full alignment is well-approximated by the positional conservation in the binary alignment. Other approaches to binary approximation are possible and are currently under investigation for improved approximation of positional conservation and correlation.

 

C. Background frequencies

As described in section III , positional conservation is measured by the divergence of the observed frequency fi(a) of amino acid a at position i from the background probability q(a) of amino acid a. This background probability is computed from the mean frequency of amino acid a in all proteins in the non-redundant database. Specifically,

q =  (0.073,0.025,0.050,0.061,0.042,0.072,0.023,0.053,0.064,0.089,
     0.023,0.043,0.052,0.040,0.052,0.073,0.056,0.063,0.013,0.033),

where amino acids are ordered according to the alphabetic order of their standard one-letter abbreviation.

 

D. Gaps

Some calculations also require introducing a background probability for gaps. If γ represents the fraction of gaps in the alignment, a background probability distribution can be taken as ¯q (0) = γ for gaps, and ¯q (a) = (1 - γ)q(a) for the 20 amino acids. A practical strategy is to truncate alignments to sequence positions with a frequency of gaps fi(0) no greater than 0.2; alternatively, one can truncate the alignment to the positions present in an atomic structure of a representative member of the protein family. This prevents trivial over-representation of gaps in a sequence alignment and ensures calculations are made only at largely non-gapped sequence positions. Other approaches to defining a background expectation for gaps are possible that can also be consistent with the general approach outlined here.

 

III. POSITION-SPECIFIC CONSERVATION - FIRST ORDER STATISTICS

A. Relative entropy

The conservation of amino acid a at position i, considered independently of other positions, is measured by the statistical quantity Di(a), the so-called relative entropy (1) of fi(a) given q(a). Its definition is derived from the probability PM[fi(a)] of observing fi(a) in an alignment of M sequences given a background probability q(a):

                                    (a)            (a)
PM [f(ia)] =----(a)--M-!----(a)-(q(a))Mfi (1- q(a))M (1-fi ).
          (M fi  )!(M (1 - fi )!
(2)
When M is large (the relevant limit for SCA, see below), the Stirling formula leads to the approximation
    (a)   - MD (a)
PM [fi ] ≃ e   i ,  with
(3)
 (a)   (a)   fi(a)-       (a)   1--f(ia)
Di  = fi ln q(a) + (1- fi )ln 1- q(a).
(4)

The value of Di(a) indicates how unlikely the observed frequency of amino acid a at position i would be if a occurred randomly with probability q(a) - a definition of position-specific conservation.

 

B. Equivalence with previous definitions

Di(a) is equivalent to measures of positional conservation introduced in previous reports of the SCA method. In essence, Di(a) is the asymptotic limit for large M for ΔGistat,a (MATLAB SCA Toolbox v1.0, as reported in Refs. (25)), and ΔEistat,a (SCA Toolbox v1.5, as reported in Ref. (6)):

                     1
ΔGstiat,a= ΔEstiat,a= - --lnPM [f(ai)] ≃ D (ai)
                     M
(5)

The pre-factor -1-
M scales the positional conservation parameter for alignments of different size, and represents the statistical unit of conservation symbolically indicated by kT* or γ* in previous works.

 

C. Appropriate alignment sizes

A more precise relation between the probability PM[fi(a)] and the relative entropy Di(a) is

   1       a)    (a)  ln M     ( 1 )
- M--lnPM [fi ] = Di + 2M-- +O   M-- .
(6)

The values of Di(a) are typically of order 1-3 (the scale is given by ln20 3), os the corrective term lnM∕(2M) can be neglected when M is of order of 100 sequences or greater (M = 100 corresponds to lnM∕(2M) 0.02). This gives a lower bound on the size of alignments appropriate for SCA studies; provided one operates above this limit, the previous measurements of conservation are quantitatively close to Di(a).

 

D. Overall positional conservation

An overall positional conservation Di taking into account the frequencies of all 20 amino acids can also be defined, but requires introducing a background probability for gaps (see Sec. II.D ). Denoting fi(0) = 1 - a=120fi(a) the fraction of gaps at position i, we can then write the probability of jointly observing the frequencies (fi(1),,fi(20)) of each of the 20 possible amino acids at position i as

PM [f (1),⋅⋅⋅ ,f(20)] =-------M-!--------(q¯(0))Mfi(0)⋅⋅⋅(¯q(20))Mf(i20)≃ e-MDi
    i       i     (M f(i0))!⋅⋅⋅(M f(i20))!
(7)

where Di = a=020fi(a) lnfi(a(a))
¯q defines the overall conservation at position i.

 

E. Relation between overall and amino acid-specific positional conservations

If considering gaps, we also define D¯i(a) = fi(a) ln (a)
fqi¯(a) + (1 - fi(a))ln   (a)
11--f ¯qi(a), the equivalent of Di(a) (the positional conservation for amino acid a) corresponding to a background probability distribution that includes gaps. As a rule, D¯i(a) Di and in practice , ¯Di(a) is found to be maximal for a = ai, the most frequent amino acid at position i.

 

F. Equivalence of various definitions in the binary approximation limit

Note that Di(a), D¯i(a), and Di are non-linear functions of fi(a) that rise more and more steeply as fi(a) approaches one. A consequence is that for all but the least conserved positions, the overall conservation Di is well approximated by the conservation of the most prevalent amino acid (Di(ai) or ¯Di(ai)), a result that justifies the use of the binary approximation in Halabi et al. (submitted). The same argument indicates that the definition of overall positional conservation introduced in previous reports of the SCA method, namely

                   ┌│ -20-(---------)--
ΔGstat= ΔEstat=  1-│∘ ∑   lnP  [f(a)]2,
   i       i     M   a=1    M  i
(8)

behaves essentially as Di(ai).

 

IV. CORRELATED CONSERVATION - SECOND ORDER STATISTICS

The basic principle for defining a SCA correlation matrix is to weight correlations between pairs of positions by a function of the positional conservations. Different implementations of the SCA method correspond to different definitions of the weights, but are all based on this same principle (described below). The original implementation of SCA defined conserved correlations through a specific type of perturbation analysis on the sequence alignment (MATLAB SCA Toolbox 1.5, Sec. IV.E ). The current implementation is based on a bootstrap (or, more precisely, jackknife) procedure, which amounts to using weights that are gradients of positional conservations (Sec. IV.D ). With regard to distributions, SCA Toolbox 2.x computes the SCA correlation matrix using the bootstrap, and SCA Toolbox 3.0 using weights. In practice, versions 2.x and 3.0 report nearly identical values.

 

A. Unweighted covariance matrix

In general, a covariance matrix reporting pair-wise correlations between amino acids at positions in a multiple sequence alignment can be defined as

C (aijb)= ⟨x(i,a)sx(jb),s⟩s - ⟨x(ai,)s⟩s⟨x(jb,s)⟩s = f(iajb)- f(ai)f(jb).
(9)
where fij(ab) = xi,s(a)xj,s(b)s represents the joint frequency of having a at position i and b at position j. The corresponding expression in the binary approximation is
Cij = ⟨xi,sxj,s⟩s - ⟨xi,s⟩s⟨xj,s⟩s = f(aijiaj) - fi(ai)f(jaj).
(10)

 

B. Weighted covariance matrix

As a general principle, SCA matrices can be obtained by weighting these covariance matrices by a functional φ of the positional conservations Di(a) (or, more generally, a function of the frequencies fi(a) and q(a) with properties similar to that of Di(a))

 (ab)    (  (a))  (  (b))  (ab)
˜Cij  = φ D i  φ  D j  Cij ,
(11)
or, in the binary approximation
 ˜     (  (ai))  (  (aj))
Cij = φ D i   φ D j   |Cij|.
(12)
Although not essential, the absolute value taken in the last formula eliminates negative correlations that originate from alternative choices of amino acids at a position. In uses of SCA for characterizing positional correlations, the sign of amino acid-specific correlations is not considered. Eqs. (11)-(12) represent the most general description of the SCA matrix - a weighted correlation matrix that measures the significance of amino acid correlations by the conservation of the residues involved. A possible choice of weights, implemented in SCA Toolbox 3.0 and discussed in section IV.D below, is
  (   )      (a)    [  (a)     (a) ]
φ  D(a) = ∂D-i- = ln fi--(1---q--) .
    i      ∂f(ai)      (1- fi(a))q(a)
(13)

These weights rise even more steeply than Di(a) as the frequencies of amino acids fi(a) approach one, a property that reduces correlations arising from weakly conserved amino acids (since the gradient of Di(a) approaches zero as fi(a) q(a)), and emphasizes conserved correlations. This property addresses a central issue in assessing functional correlations in sequence alignments - the need to minimize the contribution of purely historical correlations between positions that derive from many small clades of sequences with close phylogenetic relationships.

 

C. Reduced SCA matrix and binary approximation

In previous implementations of the SCA method, a reduced matrix ¯
Cij was defined from ˜
Cij(ab) by

     ┌││ (----(----)-)-
¯Cij = │∘ ( ∑  ˜C(ab)2)
         a,b   ij
(14)

Within the range of validity of the binary approximation, this matrix corresponds to ˜Cij and therefore yields equivalent results. Other approaches to reduction of the four-dimensional tensor ˜
Cij(ab) to the positional correlation matrix  ˜
Cij are possible but, as long as the binary approximation is justified, these are expected to give similar results.

 

D. Weights derived from the bootstrap procedure

If we introduce Di,s(a), the positional conservation of amino acid a at position i for an alignment obtained by leaving out sequence s, the covariance matrix associated with this bootstrap procedure is:

ˆC(iajb) = ⟨D (i,a)sD (bj,)s⟩s - ⟨D (ai,)s⟩s⟨D (j,b)s⟩s.
(15)
This approach is implemented in the MATLAB distribution SCA v2.0. The Ĉij(ab) matrix can also be analytically derived by rewriting it as a weighted correlation matrix (7). Let Mi(a) be the number of sequences with amino acid a at position i, and M be the total number of sequences. When sequences s is left out, the frequency fi(a) = Mi(a)∕M becomes
      M  (a)- x(a)   (      )       x(a)    (    )
fi(a,s)= --i-----is, =  1 + 1-- f(ia) - -i,s-+O   -12  ,
         M - 1          M         M        M
(16)
where xi,s(a) = 1 if sequence s has amino acid a at position i, and 0 otherwise. In the limit of large number of sequences M, expanding Di(a), viewed as a function of fi,s(a), to first order in 1∕M leads to
  (a)   ˆ(a)  x(ia,s)∂D(ia)
D i,s ≈ Di  -  M  ∂f(a),
                   i
(17)
where ˆDi(a) is the relative entropy Di(a) with fi(a) replaced by (1+ 1∕M )fi(a). It thus follows that, to first order in 1∕M,
  (ab)  -1- ∂D-(ai)∂D-(jb)(  (a) (b)     (a)   (b) )
Cˆij  ≈ M  2∂f(a)∂f (b)  ⟨xi,s xj,s⟩s - ⟨xi,s⟩s⟨xj,s⟩s .
             i    j
(18)
Or, per Eq. (9)
        1 ∂D (a) ∂D(b)
ˆC(iajb) ≈ --2---i(a) --j(b)-C(ijab)
       M   ∂fi  ∂fj
(19)
The bootstrap procedure thus corresponds to weighting the raw covariance matrix Cij(ab) by gradients of positional conservation (compare Eq. (19) with Eqs. (11)-(13). A scaling factor of 1∕M2 relates the bootstrap and weighting approaches to the SCA correlation matrix:
 (ab)    1   (ab)
ˆCij  ≈ M-2 ˜Cij .
(20)

 

E. Weights derived from the original perturbation procedure

The implementation of the SCA method introduced originally in Lockless and Ranganathan was based on a perturbation to the amino acid distribution at one test site i to measure the difference in position-specific conservation of each amino acid at a second site j. In general, the perturbation consisted of restricting the test site to a highly prevalent amino acid ai, a manipulation that extracts a sub-alignment with size equal to fi(ai)M. For test sites in which sub-alignments retained sufficient size and diversity to be globally representative of the full alignment (i.e., fi(ai)M > 100 sequences), a difference conservation value was calculated:

    stat,b,ai       stat,b,ai     1 [ (    [ (b)])    (   [ (b)|ai])]
ΔΔG j,i    = Δ ΔE j,i    = - M-- ln  PM  fj   - ln PM  fj|i      ,
(21)
where fj|i(b)|ai is the frequency of amino acid b in the sub-alignment obtained by retaining only the sequences having a well represented amino acid ai at position i. ΔΔGj,istat,b,ai represents the change in the conservation of amino acid b at position j due to the perturbation introduced at position i, a measure of their correlation. The first term on the right hand side, -1-
M ln(    [ (b)])
 PM   fj, corresponds to Dj(b). Given the assumption that perturbations lead to sub-alignments that are representative of the full alignment (a condition satisfied typically by only the most frequent amino acid at a subset of positions), fj|i(b)|ai fj(b) for most amino acids b at positions j. We may therefore expand the second term, -1-
M ln(    [     ])
 PM   f(b)|ai
       j|i, by writing
        f(aib)        f(aib)- f(ai)f(b)        C (aib)
f(jb|i)|ai= -ij--= fj(b)+ -ij-----i---j- = f(bj)+ --ij--
        fi(ai)             f(iai)              f(aii)
with Cij(aib) defined as in Eq. (9), so that
       (   [     ])          (aib)   (b)
- -1ln  PM  f(j|b)i|ai  ≈ D (jb)+ C-ij-- ∂Dj--.
  M                         f(iai) ∂f(jb)
This leads to
                     (b)
ΔΔGstat,b,ai ≈ ---1- ∂Dj--C(aib),
    j,i        f(iai) ∂f(jb)  ij
(22)
which shows that the perturbation procedure also corresponds to a weighting procedure for correlations. Finally, within the range of validity of the binary approximation,
          ┌│ -----------------
    stat   │∘ 2∑0 (     stat,b,ai)2   ||    stat,aj,ai||    1 ∂D (jaj)
ΔΔG j,i  =       ΔΔG  j,i      ≈ |Δ ΔG j,i     | ≈-(ai)---(aj)|Cij|.
            b=1                                fi   ∂fj
(23)

 

V. DISTRIBUTIONS OF SCA

(1) SCA v1.5: The original SCA method as specified in Lockless and Ranganathan (2) with one modification that was used in all subsequent papers: the division of binomial probabilities by the mean probability of amino acids in the alignment is removed. This version is longer in active use.

(2) SCA v2.5: The bootstrap-based approach for SCA. Position-specific conservation calculated as in Eq. (4 ) and correlations calculated as in Eq. (11 ). Matrix reduction per Eq. (14 ).

(3) SCA v3.0: The analytical calculation of correlations weighted by gradients of relative entropy. Position-specific conservation calculated as in Eq. (4 ) and correlations calculated as in Eq. (11 )-(12 ). An update to this version is expected shortly that will also includes codes for new statistical methods for identifying groups of correlated amino acid positions (the ”sectors” in Halabi et al., submitted) and for assessing the statistical independence of sectors. For non-binarized alignments, matrix reduction is per Eq. (14 ). Current version. Distributions are MATLAB Toolboxes that include various accessory codes for data formatting, display, and analysis through hierarchical clustering. A tutorial with a sample alignment that illustrates the analytic process is also provided.

 

REFERENCES

 

[1] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience, New-York, 1991.

[2] S W Lockless and R Ranganathan. Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286(5438):295–9, Oct 1999.

[3] Gürol M Süel, Steve W Lockless, Mark A Wall, and Rama Ranganathan. Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol, 10(1):59–69, Jan 2003.

[4] Mark E Hatley, Steve W Lockless, Scott K Gibson, Alfred G Gilman, and Rama Ranganathan. Allosteric determinants in guanine nucleotide-binding proteins. Proc Natl Acad Sci USA, 100(24):14445–50, Nov 2003.

[5] Andrew I Shulman, Christopher Larson, David J Mangelsdorf, and Rama Ranganathan. Structural determinants of allosteric ligand activation in rxr heterodimers. Cell, 116(3):417–29, Feb 2004.

[6] Michael Socolich, Steve W Lockless, William P Russ, Heather Lee, Kevin H Gardner, and Rama Ranganathan. Evolutionary information for specifying a protein fold. Nature, 437(7058):512–8, Sep 2005.

[7] B. Efron and R. J. Tishirani. An introduction to the bootstrap. Chapman and Hall, 1994.