Thursday, June 13, 2013

Frequency Plot for Protein Sequences using R

A frequency plot is a graphical data analysis technique for summarizing the distributional information of a variable. The response variable is divided into equal-sized intervals (or bins), and the number of occurrences of the response variable is counted for each bin. In this tutorial, the response variable is the protein sequence: the number of occurrences of each amino acid in the sequence is counted, and the counts are sorted in ascending order.

The frequency plot then consists of:
Vertical Axis = Amino acids
Horizontal Axis = Frequencies of the amino acids
There are 4 types of frequency plots:
  1. Frequency plot (absolute counts);
  2. Relative frequency plot (convert counts to proportions);
  3. Cumulative frequency plot;
  4. Cumulative relative frequency plot.
The frequency plot and the histogram carry the same information. The difference is that the frequency plot connects the frequency values with lines, whereas the histogram draws bars at the frequency values.
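
The four kinds of frequency listed above can be sketched with base R alone. This is a minimal illustration, not part of the original protocol: the sequence `toy_seq` is made up, and the variable names are chosen only for this example.

```r
# Toy protein fragment (made up for illustration, not a real database entry).
toy_seq <- "MKVLAAGIVALLLAAG"

# Split the string into single residues and count each one.
residues <- strsplit(toy_seq, "")[[1]]
counts   <- table(residues)

# 1. Absolute counts as a plain named vector, sorted in ascending order.
abs_freq <- sort(setNames(as.vector(counts), names(counts)))

# 2. Relative frequencies: counts converted to proportions.
rel_freq <- abs_freq / sum(abs_freq)

# 3. Cumulative frequencies: running total of the sorted counts.
cum_freq <- cumsum(abs_freq)

# 4. Cumulative relative frequencies: running total of the proportions.
cum_rel_freq <- cumsum(rel_freq)

print(rbind(abs_freq, rel_freq, cum_freq, cum_rel_freq))
```

The last entry of `cum_freq` equals the sequence length, and the last entry of `cum_rel_freq` equals 1, which is a quick sanity check on any of these tables.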

Frequency plot using R

In this tutorial, the programming language R and the packages SeqinR (distributed on CRAN) and Biostrings (distributed through Bioconductor) are used to generate a frequency plot from a protein sequence. SeqinR is used to read and manipulate sequences, and Biostrings is used to convert a sequence to an array. To generate a frequency plot, we need a protein sequence in FASTA format (a .fasta or .fas file) as input. The simple protocol for generating the frequency plot is given below:

Step 1: Download and install R software according to your system platform.

Step 2: Download and install the SeqinR package from CRAN and the Biostrings package from Bioconductor. The brief explanations for Step (1) & (2) can be downloaded from

Step 3: Create an R script as given below using a plain-text (ASCII) editor (e.g. Notepad) and save it with the .R file extension.

Source Code:
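library(seqinr)  # provides read.fasta(), c2s(), s2c() and aaa()
# Read the FASTA file and take the first sequence in it
seqfile <- read.fasta(file = "E:/Q9CD83.fasta")
fastaseq <- seqfile[[1]]
# Join the residues into one string, convert to upper case, split again
seqstring <- c2s(fastaseq)
seqstring <- toupper(seqstring)
seqchar <- s2c(seqstring)
# Count each amino acid and sort the counts in ascending order
tab <- table(seqchar)
taborder <- tab[order(tab)]
# Replace the one-letter codes with three-letter codes for the axis labels
names(taborder) <- aaa(names(taborder))
dotchart(taborder, pch=19, main="Frequency of Amino Acids", xlab="Frequency", ylab="Amino Acid")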
Note: In this method, the leprae protein sequence file (Q9CD83.fasta) must be present in the root of the E: drive, because that is the path passed to read.fasta().
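
Because the script is tied to a fixed path, it can help to try the same counting steps on a small file first. The following is a rough sketch under stated assumptions: the helper name `count_residues` is hypothetical, the demo record is made up, and base R (`readLines`, `strsplit`) stands in for `read.fasta()` purely so the example runs without any extra package. It reads only the first record in the file.

```r
# Hypothetical helper: count residues in the first record of a FASTA file.
# Base R stands in for seqinr::read.fasta() here, purely for illustration.
count_residues <- function(fasta_path) {
  lines <- readLines(fasta_path)
  header_idx <- grep("^>", lines)
  stopifnot(length(header_idx) >= 1)   # a FASTA file needs a header line
  # Keep only the lines that belong to the first record.
  last <- if (length(header_idx) > 1) header_idx[2] - 1 else length(lines)
  seq_lines <- lines[(header_idx[1] + 1):last]
  residues <- strsplit(toupper(paste(seq_lines, collapse = "")), "")[[1]]
  tab <- table(residues)
  tab[order(tab)]                      # ascending order of counts
}

# Demo on a made-up record written to a temporary file.
demo_file <- tempfile(fileext = ".fasta")
writeLines(c(">demo made-up sequence", "MKVLAAGIVA", "LLLAAG"), demo_file)

demo_counts <- count_residues(demo_file)
dotchart(as.numeric(demo_counts), labels = names(demo_counts), pch = 19,
         main = "Frequency of Amino Acids", xlab = "Frequency")
```

Passing the path as an argument means the same function can be pointed at E:/Q9CD83.fasta or at any other FASTA file without editing the body of the script.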

Step 4: Run the R script. The frequency plot will be generated as a graphical image.
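
A script like this is typically run with source() from an R session or with Rscript from a command prompt. If the plot should end up in a file rather than in a window, one option is to wrap the plotting call in a graphics device. The sketch below assumes made-up counts in place of the real `taborder` object and a hypothetical output file name; `pdf()` is used because it is available in every R installation, and `png()` can be used the same way where it is supported.

```r
# Made-up counts standing in for the 'taborder' object from Step 3.
demo_counts <- c(Met = 1, Lys = 1, Val = 2, Gly = 2, Leu = 4, Ala = 5)

# Open a PDF device, draw the plot, then close the device to write the file.
out_file <- file.path(tempdir(), "aa_frequency.pdf")  # hypothetical file name
pdf(out_file, width = 6, height = 5)
dotchart(demo_counts, pch = 19, main = "Frequency of Amino Acids",
         xlab = "Frequency", ylab = "Amino Acid")
invisible(dev.off())

# Confirm that the file was written.
print(file.exists(out_file))
```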

Frequency Plot in R

1 comment :

  1. Can you kindly guide me on the following:
    1) Handling more than one sequence, e.g. 10.
    2) Adding significance/error bars to compare amino acid differences in the frequency plot among proteins.
    3) Adding significance/error bars to compare amino acid differences in the frequency plot among proteins from two different organisms.