Conformational Flexibility Profile (CFP) server HELP

Conformational Flexibility Profile (CFP) server - Help on parameter settings


Background

The CFP Tool uses the Generalized Local Propensity of amino acids, GLP (Kuznetsov and Rackovsky, 2003, 2004), to compute a raw numeric propensity profile for an amino acid sequence. This raw profile is then smoothed using a sliding window of length W. The smoothed profile is used to partition the sequence into seed segments with high conformational flexibility. Seed segments are extended on both sides and reported in the final table. The significance of each final segment is assessed using a scan statistic based on the observed density of flexible residues (click here for the program flowchart).
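The smoothing and seed-partitioning steps described above can be illustrated with a minimal Python sketch. It is not the server's code: the function names, the equal-weight averaging, and the choice to leave positions without a full window unscored (None) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def smooth_profile(raw: List[float], window: int) -> List[Optional[float]]:
    """Equal-weight sliding-window average of a raw profile.

    Assumption: positions too close to either end to fit a full window
    are left unscored (None) rather than padded.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = (window - 1) // 2
    smoothed: List[Optional[float]] = [None] * len(raw)
    for i in range(half, len(raw) - half):
        smoothed[i] = sum(raw[i - half : i + half + 1]) / window
    return smoothed


def seed_segments(
    smoothed: List[Optional[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of contiguous runs whose
    smoothed score is above the threshold. Unscored positions break a run."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, score in enumerate(smoothed):
        above = score is not None and score > threshold
        if above and start is None:
            start = i
        elif not above and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(smoothed) - 1))
    return segments
```

In this sketch a window of 1 returns the raw profile unchanged, and indices are 0-based, so 1 must be added to obtain sequence positions.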

GLP is a normalized index that measures the degree of context-dependent local backbone flexibility of the 20 amino acid types. For each amino acid X at the center of a tri-peptide iXj, the index measures the width of the distribution of (phi,psi) dihedral angles associated with X:

GLP(iXj) = Et(iXj)/Er(iXj), where

Et(iXj) is the Shannon entropy of the tri-peptide specific distribution of backbone conformations computed using n(iXj) occurrences of the tri-peptide iXj observed in the non-redundant dataset of X-ray structures (the PDB SELECT-25 dataset).
Er(iXj) is the average entropy of a distribution of n(iXj) tri-peptides randomly sampled without replacement from the same non-redundant dataset.

The central residue, X, is represented in the full 20-letter alphabet, while the flanking residues i and j are collapsed into 3 groups based on side chain properties: (1) Gly; (2) Pro; (3) the 18 other amino acids. A value of GLP(iXj) greater than one indicates that the entropy observed for the tri-peptide iXj is higher than the average entropy of the random distribution, which implies that the tri-peptide has higher than average conformational flexibility.
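The ratio GLP(iXj) = Et(iXj)/Er(iXj) can be sketched in Python as follows. This is only an illustration under stated assumptions: backbone conformations are taken to be already discretized into labelled states (the text above does not say how (phi,psi) space is binned), and the number of random draws used to average Er is a hypothetical parameter.

```python
import math
import random
from collections import Counter
from typing import Sequence


def shannon_entropy(states: Sequence[str]) -> float:
    """Shannon entropy (bits) of the empirical distribution of discrete states."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def glp(
    tripeptide_states: Sequence[str],
    all_states: Sequence[str],
    n_samples: int = 1000,  # assumption: number of random draws used for Er
    seed: int = 0,
) -> float:
    """Et/Er for one tri-peptide type.

    tripeptide_states: conformational state of the central residue in each
        of the n observed occurrences of the tri-peptide.
    all_states: conformational states of all tri-peptides in the dataset.
    """
    rng = random.Random(seed)
    n = len(tripeptide_states)
    e_t = shannon_entropy(tripeptide_states)
    # Er: average entropy of n tri-peptides sampled without replacement.
    e_r = sum(
        shannon_entropy(rng.sample(list(all_states), n)) for _ in range(n_samples)
    ) / n_samples
    if e_r == 0:
        raise ValueError("random-sample entropy is zero; ratio is undefined")
    return e_t / e_r
```

Because GLP is a ratio of two entropies, the base of the logarithm does not affect the result.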

Input page parameters

Please note that all parameters have been optimized in such a way that the extension of a flexible seed segment terminates when the GLP of the extension window becomes similar to a typical GLP of a helix fragment (GLP in lower 95%). This is based on the assumption that helices are the most rigid parts of the protein.

  • Smoothing window size
    This is the length of the sliding window used to smooth the raw profile. To compute the smoothed score for sequence position i with a window of size W, (W-1)/2 neighboring residues on each side of residue i are used, so the window spans positions i-(W-1)/2 through i+(W-1)/2. The score for residue i is the weighted average GLP computed for the sequence segment of length W centered at position i. If W is set to 1, the raw profile is used. High values of W tend to reveal long segments with high flexibility and mask short ones. Lower values of W tend to reveal short segments with high flexibility. You may want to try different values of W and compare the results.

  • GLP threshold for seed segments
    A threshold used to identify segments with high flexibility. Contiguous sequence positions that have smoothed GLP above this threshold are merged into a seed flexible segment.

  • Extension threshold
    Each seed flexible segment is extended on both sides until its average GLP drops below this threshold.

  • Extension window threshold
    The N- and C-terminal ends of a flexible seed segment are extended one position at a time as long as the extension window that begins at the new position has an average GLP above this threshold.

  • Extension window size
    The length of the extension window used to extend flexible seed segments.

  • Hat-shaped local smoother
    The raw GLP profile is smoothed in such a way that the closer the position in the smoothing window is to the central position, i, the higher its contribution to the smoothed GLP score assigned to position i.

  • Equal weights local smoother
    Each position in the smoothing window has the same weight. The smoothed GLP score assigned to the central position i is just the unweighted average score computed over all positions in the window.

  • Minimum seed segment
    Seed flexible segments with length below this threshold are not extended.

  • Maximum separation between merged segments
    Flexible segments separated by this number of positions or fewer are merged into one.

  • SWISS-PROT background frequencies
    Amino acid frequencies of the non-redundant SwissProt database are used to estimate the statistical significance.

  • PDB background frequencies
    Amino acid frequencies of the non-redundant Protein Data Bank are used to estimate the statistical significance.

  • Flexible residues
    A set of residues with high backbone flexibility used to estimate statistical significance of segments with high flexibility using the scan statistic.

  • X axis size, Y axis size
    Sizes of the X and Y axes of the propensity plot, in pixels.

  • Create a plot
    Display the smoothed GLP profile in the web-browser.

  • Create a text file
    Save the raw and smoothed propensity profiles along with the amino acid sequence in a text file.

    Format of the output file:


    Column 1 - sequence position
    Column 2 - protein sequence
    Column 3 - the raw propensity profile
    Column 4 - marking of positions in the raw profile. Positions for which the profile was calculated are marked by 1. Excluded positions are marked by 0. A position can be excluded because it contains a non-standard amino acid character or because the number of tri-peptides is below 'min.number of trimers' (see above).
    Column 5 - the smoothed propensity profile.
    Column 6 - marking of positions in the smoothed profile. Positions for which the profile was calculated are marked by 1. Excluded positions are marked by 0. More positions can be marked by 0 than in column 4, since if one window position is excluded, the entire window that contains this position is excluded also.
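As an illustration of how the six columns might be read back, here is a small Python sketch. It assumes the file is whitespace-separated with one line per sequence position and no header line; the actual delimiter is not stated above, and the names ProfileRow and parse_profile are hypothetical.

```python
from typing import List, NamedTuple


class ProfileRow(NamedTuple):
    position: int      # column 1: sequence position
    residue: str       # column 2: amino acid at this position
    raw: float         # column 3: raw propensity
    raw_ok: bool       # column 4: True if marked 1 (raw value was calculated)
    smoothed: float    # column 5: smoothed propensity
    smoothed_ok: bool  # column 6: True if marked 1 (smoothed value was calculated)


def parse_profile(text: str) -> List[ProfileRow]:
    """Parse the text output, assuming whitespace-separated columns."""
    rows: List[ProfileRow] = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank lines and anything that is not a 6-column row
        pos, res, raw, raw_flag, smooth, smooth_flag = fields
        rows.append(
            ProfileRow(
                int(pos), res, float(raw), raw_flag == "1",
                float(smooth), smooth_flag == "1",
            )
        )
    return rows
```

Filtering on smoothed_ok before plotting or averaging avoids treating excluded positions as real scores.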

    Format of the output pages

    The CFP output consists of two pages. All arguments used to run the program are listed at the top of each page. The first page shows the smoothed GLP plot (click here for an example of the first output page). The second page shows detailed information on every flexible segment found in the input sequence (click here for an example of the second output page).

    On the second page, the entire input sequence is shown, with high flexibility segments marked by dotted arrows (<..>) below the sequence on the 'Flexible' line. The second line shows low complexity segments (marked by x's) detected by the PSEG program (Wootton and Federhen, 1996) (PSEG is run with the default arguments).

    Below the input sequence a table of high flexibility segments is shown. This table has the following columns:

    Columns 'Start' and 'End' - start and end positions of the segment within the sequence.
    Column 'Length' - the segment length.
    Column 'Average propensity' - the average raw GLP of the segment.
    Column 'P-value' - the significance of the observed density of flexible positions in a given segment.
    Column 'Sequence' - the sequence of the segment.


    At the end of the output page, estimates of the overall abundance (or lack) of flexible positions in the input sequence are printed. If the sequence has more flexible positions than expected by chance, the corresponding line is marked with (+). A lack of flexible positions is marked with (-). Exact significance is computed using exact binomial probabilities; approximate significance is computed using the normal approximation to the binomial. If the normal approximation is not valid, this is denoted by '*'. Very low P-values indicate that the input sequence has a very biased amino acid composition and an unusually high (or low) number of flexible positions.
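The two significance estimates for an excess of flexible positions can be sketched in Python as below. This is an illustration, not the server's implementation. It assumes that p is the summed background frequency of the chosen flexible residues, that no continuity correction is applied, and that the normal approximation is judged valid by the common rule of thumb np >= 5 and n(1-p) >= 5. The criterion the server actually uses is not stated above.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1)
    )


def normal_upper_tail(k: int, n: int, p: float) -> float:
    """Normal approximation to P(X >= k), without continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    z = (k - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def normal_approx_valid(n: int, p: float, min_count: float = 5.0) -> bool:
    """Rule-of-thumb validity check (an assumption, see the text above)."""
    return n * p >= min_count and n * (1 - p) >= min_count
```

Here n is the sequence length and k the observed number of flexible positions. A lack of flexible positions would be tested with the corresponding lower tail, P(X <= k).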

    References

    1. I.Kuznetsov and S.Rackovsky, 2003, On the properties and sequence context of structurally ambivalent fragments in proteins. Protein Science, 12:2420-2433.

    2. I.Kuznetsov and S.Rackovsky, 2004, Comparative computational analysis of prion proteins reveals two fragments with unusual structural properties and a pattern of increase in hydrophobicity associated with disease-promoting mutations. Protein Science, 13(12):3230-3244.

    3. J.C.Wootton and S.Federhen, 1996, Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266:554-571.

    The development of this web-server was supported by grant 1R03LM009034 from the National Library of Medicine of the National Institutes of Health.

    If you have any questions, you may address them to Igor Kuznetsov at ikuznetsov (at) uamail (dot) albany (dot) edu

    The CFP Tool (C) 2009 Igor B. Kuznetsov
    Web design: Byron Gerlach