PCP-ML

Protein Characterization Package for Machine Learning

Functionality

Characterizers and Encoders

AtchelyFactors

Characterizes five major aspects of an amino acid with real-number values. The values were obtained from a statistical analysis of amino acid properties and describe polarity (Factor 1), secondary structure (Factor 2), molecular size (Factor 3), amino acid composition (Factor 4) and charge (Factor 5). These values were reported in Atchley et al., PNAS, doi: 10.1073/pnas.0408677102

InterfaceContactPotentials

Characterizes the contact potential between two residues. These contact potentials come from a statistical analysis of contacts in protein interfaces. They were reported in Glaser et al., Proteins, Vol. 43(2), pp. 89-102.

BetaContactPotentials

Characterizes the contact potential for two residues in two beta sheets. These values come from a study of contact potentials of residues in cross-strand pairings in beta sheets. They were reported in Zhu et al., Protein Science, PMCID: PMC2144259

SSComposition

Determine the percentage of each secondary structure (SS) type in a string representing the SS.

SAComposition

Determine the percentage of each solvent accessibility (SA) state in a string representing the SA.

AAComposition

Determine the percentage of each amino acid in a protein sequence.
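The three composition characterizers above share the same idea: count each symbol in a string and divide by the string length. A minimal sketch follows; the function name `composition` and its `alphabet` parameter are hypothetical and not the package's API, and the 0-100 scale is an assumption.

```python
from collections import Counter


def composition(sequence, alphabet):
    """Return the percentage (0-100) of each symbol of `alphabet` in `sequence`.

    Hypothetical helper for illustration only. Symbols outside `alphabet`
    still count toward the sequence length.
    """
    if not sequence:
        return {symbol: 0.0 for symbol in alphabet}
    counts = Counter(sequence)
    return {symbol: 100.0 * counts[symbol] / len(sequence) for symbol in alphabet}


# Secondary structure string with 4 helix, 2 strand and 2 coil positions.
print(composition("HHHHEECC", "HEC"))  # {'H': 50.0, 'E': 25.0, 'C': 25.0}
```

The same helper works for a solvent accessibility string (alphabet `"be"`) or a protein sequence (the 20 one-letter amino acid codes).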

Hydrophobicity

Characterizes the hydrophobicity of a residue. These values come from a study on hydrophobicity and helical propensity. They were reported in Monera et al., Journal of Peptide Science, Vol. 1(5), pp. 319-329. The values are scaled by a factor of 100.

CalculateEntropy

Calculates the Shannon entropy for a vector of probabilities.
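For reference, the Shannon entropy of a probability vector p is H = -sum(p_i * log(p_i)). The sketch below assumes a base-2 logarithm (entropy in bits); the base used by the package is not stated here, and the function name is hypothetical.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in bits (base 2 is an assumption of this sketch)."""
    # Zero-probability entries are skipped: p * log(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# A uniform distribution over four outcomes carries two bits of entropy.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
```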

CalculateR

Calculates the Pearson correlation coefficient for two vectors.
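As an illustration of the calculation, the Pearson coefficient is the covariance of the two vectors divided by the product of their standard deviations. This is a plain-Python sketch with a hypothetical function name, not the package's implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each element from its vector's mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator


print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly correlated vectors
```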

CalculateCosine

Calculates the cosine of the angle between two vectors.
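The cosine is the dot product of the two vectors divided by the product of their lengths. A minimal sketch, with a hypothetical function name:

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / norm


print(cosine([1, 0], [0, 1]))  # orthogonal vectors give 0.0
```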

ScaledOrderMean

Calculates the nth ordered mean.

HotEncodeAA

Generate a one-hot encoding for an amino acid. The encoding is a 20-bit vector in which each amino acid is represented by one bit. Only the bit representing the amino acid is set to one; all others are zero.
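A sketch of such an encoding is shown below. The bit order (one-letter codes in alphabetical order) and the function name are assumptions made for illustration; the order used by the package may differ.

```python
# Assumed bit order: the 20 standard one-letter codes, alphabetically.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_aa(residue):
    """Return a 20-element list with a single 1 at the residue's position."""
    if len(residue) != 1 or residue not in AMINO_ACIDS:
        raise ValueError("unknown amino acid: %r" % residue)
    return [1 if aa == residue else 0 for aa in AMINO_ACIDS]


print(one_hot_aa("C"))
# [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```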

HotEncodeSS

Generate a hot encoding for secondary structure type. The following encoding is used: 100 - H, 010 - E, 001 - C.

HotEncodeSA

Generate a hot encoding for solvent accessibility. The following encoding is used: 01 - b, 10 - e.
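The secondary structure and solvent accessibility encodings can be written as lookup tables. The sketch below uses the bit patterns listed above; the names `SS_CODES`, `SA_CODES` and `encode_states` are hypothetical.

```python
# Bit patterns taken from the encodings listed in this README.
SS_CODES = {"H": [1, 0, 0], "E": [0, 1, 0], "C": [0, 0, 1]}
SA_CODES = {"b": [0, 1], "e": [1, 0]}


def encode_states(states, table):
    """Concatenate the hot encoding of every state in a string."""
    encoded = []
    for state in states:
        encoded.extend(table[state])  # raises KeyError on an unknown state
    return encoded


print(encode_states("HEC", SS_CODES))  # [1, 0, 0, 0, 1, 0, 0, 0, 1]
print(encode_states("be", SA_CODES))   # [0, 1, 1, 0]
```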

Parsers

ParseFastaSequences

Parse out sequence headers and contents from a FASTA sequence file.
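A FASTA file marks each record with a header line starting with `>`, followed by one or more lines of sequence. The following is a minimal sketch of that parsing step, not the package's parser; the function name and the list-of-tuples return type are assumptions.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [">seq1 demo", "MKV", "LAA", "", ">seq2", "GG"]
print(parse_fasta(example))  # [('seq1 demo', 'MKVLAA'), ('seq2', 'GG')]
```

Any iterable of lines works as input, including an open file handle.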

ParseSSPro

Parse out the secondary structure and solvent accessibility from a SSPro prediction file.

ParsePSIPred

Parse out the secondary structure (H, E, or C) from a PSIPred HFORMAT V3.3 prediction file.
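In the PSIPRED horizontal format, the predicted states appear on lines that begin with `Pred:`, split across blocks. A sketch of collecting them into one string is given below, assuming that layout; the function name is hypothetical and the sample lines are made up for illustration.

```python
def parse_psipred_horiz(lines):
    """Concatenate the states found on every 'Pred:' line."""
    states = []
    for line in lines:
        if line.startswith("Pred:"):
            # Everything after the first colon is the predicted SS string.
            states.append(line.split(":", 1)[1].strip())
    return "".join(states)


sample = [
    "# PSIPRED HFORMAT (PSIPRED V3.3)",
    "",
    "Conf: 9887",
    "Pred: CHHE",
    "  AA: MKVL",
    "",
    "Conf: 89",
    "Pred: EC",
    "  AA: AG",
]
print(parse_psipred_horiz(sample))  # CHHEEC
```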

ParseAsciiPSSM

Extract position counts and information score from an ASCII PSSM.

ParseAnchoredMSA

Parse an anchored multiple sequence alignment (MSA) file and determine the relative frequency of each amino acid or gap at each position in the MSA. Each line of the file is one sequence in the alignment. The frequencies are initialized to zero.
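The frequency calculation can be sketched as follows. This assumes one aligned sequence per line, all of equal length, with `-` as the gap symbol; the alphabet, the function name, and the use of fractions between 0 and 1 rather than percentages are assumptions of the sketch.

```python
# Assumed alphabet: 20 standard amino acids plus the gap character.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def msa_frequencies(lines):
    """Return one {symbol: relative frequency} dict per alignment column."""
    sequences = [line.strip() for line in lines if line.strip()]
    if not sequences:
        return []
    length = len(sequences[0])
    # Every frequency starts at zero, as described above.
    freqs = [dict.fromkeys(ALPHABET, 0.0) for _ in range(length)]
    for seq in sequences:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for pos, symbol in enumerate(seq):
            if symbol in freqs[pos]:  # symbols outside ALPHABET are ignored
                freqs[pos][symbol] += 1.0
    # Turn counts into relative frequencies.
    for column in freqs:
        for symbol in column:
            column[symbol] /= len(sequences)
    return freqs


columns = msa_frequencies(["AC-", "AG-", "TC-", "AC-"])
print(columns[0]["A"], columns[1]["G"], columns[2]["-"])  # 0.75 0.25 1.0
```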

ParseDSSPOutput

Parse out the secondary structure (H, E, or C) from a DSSP file. Unknown SS codes are set to coil (C).
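The fallback rule for unknown codes can be sketched on a string of per-residue codes that has already been extracted from the DSSP file (the column parsing itself is not shown). The sketch treats every code other than H, E or C as unknown, which is an assumption; the package may group some DSSP codes differently. The function name is hypothetical.

```python
def to_three_state(codes):
    """Keep H, E and C; replace every other code (including blanks) with C."""
    return "".join(code if code in ("H", "E", "C") else "C" for code in codes)


print(to_three_state("HHGE TS"))  # HHCECCC
```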

Feature Writers and Utilities

PrintFeatures

Print contents of feature vector to standard output.

WriteFeatures

Writes contents of feature vector along with target values.

ReadFile

Read the contents of a text file and save each line.