Characterizes five major aspects of an amino acid with real-valued factors. The values were obtained from a statistical analysis of amino acid properties and describe polarity (Factor 1), secondary structure (Factor 2), molecular size (Factor 3), amino acid composition (Factor 4), and charge (Factor 5). These values were reported in Atchley et al., PNAS, doi: 10.1073/pnas.0408677102.
Characterizes the contact potential between two residues. These contact potentials come from a statistical analysis of contacts in protein interfaces. They were reported in Glaser et al., Proteins, Vol. 43(2), pp. 89-102.
Characterizes the contact potential for two residues in two beta sheets. These values come from a study of contact potentials of residues in cross-strand pairings in beta sheets. They were reported in Zhu et al., Protein Science, PMCID: PMC2144259.
Determine the percentage of each secondary structure (SS) type in a string representing the SS.
Determine the percentage of each solvent accessibility (SA) state in a string representing the SA.
Determine the percentage of each amino acid in a protein sequence.
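The three composition features above reduce to the same counting step. A minimal sketch, assuming the input is a plain string of one-letter codes; the function name `composition` and the `AMINO_ACIDS` ordering are hypothetical and not taken from the library:

```python
from collections import Counter

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence, alphabet):
    """Return the percentage of each symbol of `alphabet` found in `sequence`."""
    total = len(sequence)
    if total == 0:
        # An empty string has no composition; report zero for every symbol.
        return {symbol: 0.0 for symbol in alphabet}
    counts = Counter(sequence)
    return {symbol: 100.0 * counts[symbol] / total for symbol in alphabet}


# Secondary structure string: 4 H, 2 E, and 4 C out of 10 positions.
print(composition("HHHHEECCCC", "HEC"))
# Protein sequence: percentages over the 20 standard amino acids.
print(composition("MKVLAAGT", AMINO_ACIDS))
```

Symbols outside the chosen alphabet still count toward the total length here, so the percentages can sum to less than 100; whether the library does the same is not stated.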
Characterizes the hydrophobicity of a residue. These values come from a study on hydrophobicity and helical propensity. They were reported in Monera et al., Journal of Peptide Science, Vol. 1(5), pp. 319-329. The values are scaled by a factor of 100.
Calculates the Shannon entropy for a vector of probabilities.
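As an illustration, the Shannon entropy of a probability vector p is H = -sum(p_i * log(p_i)). The sketch below assumes base-2 logarithms (entropy in bits) and skips zero probabilities; the logarithm base used by the library is not stated, and the name `shannon_entropy` is hypothetical:

```python
import math


def shannon_entropy(probabilities, base=2.0):
    """Return -sum(p * log(p)) over the non-zero entries of `probabilities`."""
    # Terms with p == 0 contribute nothing (the limit of p * log(p) is 0).
    return -sum(p * math.log(p, base) for p in probabilities if p > 0.0)


# A uniform distribution over four outcomes carries two bits of entropy.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
```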
Calculates the Pearson correlation coefficient for two vectors.
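For reference, the Pearson coefficient is the covariance of the two vectors divided by the product of their standard deviations. A minimal sketch, assuming equal-length numeric sequences; the name `pearson` and the error handling are illustrative choices, not the library's:

```python
import math


def pearson(x, y):
    """Return the Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))
```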
Calculates the cosine between two vectors.
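The cosine of the angle between two vectors is their dot product divided by the product of their Euclidean norms. A short sketch under the assumption that both vectors are non-zero and of equal length; the name `cosine` is hypothetical:

```python
import math


def cosine(x, y):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Orthogonal vectors have a cosine of zero.
print(cosine([1, 0], [0, 1]))
```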
Calculates the nth ordered mean.
Generate a one-hot encoding for an amino acid. This is a 20-bit one-hot encoding in which each amino acid is represented by one bit of the encoding vector. Only the bit representing the amino acid is set to one; all others are zero.
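A minimal sketch of such an encoding. The bit order is not given in the description, so the alphabetical one-letter ordering in `AMINO_ACIDS` is an assumption, as is the function name `one_hot_amino_acid`:

```python
# Assumed bit order: the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_amino_acid(residue):
    """Return a 20-element list with a single 1 at the residue's position."""
    if len(residue) != 1 or residue.upper() not in AMINO_ACIDS:
        raise ValueError("not a standard amino acid code: %r" % (residue,))
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue.upper())] = 1
    return vector


# Under the assumed ordering, alanine sets the first bit.
print(one_hot_amino_acid("A"))
```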
Generate a one-hot encoding for secondary structure type. The following encoding is used: 100 - H, 010 - E, 001 - C.
Generate a one-hot encoding for solvent accessibility. The following encoding is used: 01 - b, 10 - e.
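The secondary structure and solvent accessibility encodings listed above can be written as lookup tables. In this sketch the bit patterns come from the descriptions; the table and function names are hypothetical:

```python
# Bit patterns as given in the descriptions above.
SS_ENCODING = {"H": [1, 0, 0], "E": [0, 1, 0], "C": [0, 0, 1]}
SA_ENCODING = {"b": [0, 1], "e": [1, 0]}


def encode_states(states, table):
    """Concatenate the encoding of every state in `states` into one flat list."""
    vector = []
    for state in states:
        vector.extend(table[state])  # raises KeyError for an unknown state
    return vector


print(encode_states("HEC", SS_ENCODING))  # [1, 0, 0, 0, 1, 0, 0, 0, 1]
print(encode_states("be", SA_ENCODING))   # [0, 1, 1, 0]
```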
Parse out sequence headers and contents from a FASTA sequence file.
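A minimal sketch of FASTA parsing, assuming the usual layout in which a header line starts with `>` and the sequence may wrap over several following lines. The function name `parse_fasta` and the list-of-tuples return type are illustrative and may differ from the library:

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence line, possibly one of several
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = [">seq1 demo", "MKVL", "AAGT", "", ">seq2", "GGH"]
print(parse_fasta(example))
```

Any iterable of lines works as input, including an open file handle.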
Parse out the secondary structure and solvent accessibility from an SSPro prediction file.
Parse out the secondary structure (H, E, or C) from a PSIPred HFORMAT V3.3 prediction file.
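As a rough sketch, the horizontal format can be read by collecting the lines that carry the prediction. This assumes the file holds repeated blocks of `Conf:`, `Pred:`, and `AA:` lines and that the predicted states sit on the `Pred:` lines; the function name `parse_psipred_horiz` is hypothetical:

```python
def parse_psipred_horiz(lines):
    """Join the predicted states found on every 'Pred:' line."""
    parts = []
    for line in lines:
        if line.startswith("Pred:"):
            # Keep only the text after the label, without padding.
            parts.append(line.split(":", 1)[1].strip())
    return "".join(parts)


# Hand-written example in the assumed layout (not real PSIPRED output).
example = [
    "# PSIPRED HFORMAT (PSIPRED V3.3)",
    "",
    "Conf: 98877",
    "Pred: CHHHE",
    "  AA: MKVLA",
    "",
    "Conf: 899",
    "Pred: ECC",
    "  AA: GTS",
]
print(parse_psipred_horiz(example))
```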
Extract position counts and information score from an ASCII PSSM.
Parse an anchored multiple sequence alignment (MSA) file and calculate the relative frequency of each amino acid or gap at each position in the MSA. The frequencies are initialized to zero. Each line of the file is one sequence of the alignment.
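The per-position frequency calculation might look like the sketch below. It assumes one aligned sequence per line, all of equal length, with `-` as the gap symbol; the symbol set and the name `msa_frequencies` are assumptions, and symbols outside the set are simply not counted:

```python
# Assumed symbol set: the 20 standard amino acids plus '-' for a gap.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-"


def msa_frequencies(lines):
    """Return one {symbol: relative frequency} dict per alignment column."""
    sequences = [line.strip().upper() for line in lines if line.strip()]
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    # Every frequency starts at zero, as in the description above.
    frequencies = [{symbol: 0.0 for symbol in SYMBOLS} for _ in range(length)]
    for seq in sequences:
        for position, symbol in enumerate(seq):
            if symbol in frequencies[position]:
                frequencies[position][symbol] += 1.0
    for column in frequencies:
        for symbol in column:
            column[symbol] /= len(sequences)
    return frequencies


columns = msa_frequencies(["AC-", "AG-", "AC-", "TCA"])
print(columns[0]["A"], columns[2]["-"])
```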
Parse out the secondary structure (H, E, or C) from a DSSP file. Unknown SS codes are set to coil (C).
Print the contents of a feature vector to standard output.
Write the contents of a feature vector along with target values.
Read the contents of a text file and save each line.