Saturday, June 1, 2013

DotPlot for Protein Sequences using R

A dotplot is a visual representation of the similarity between two protein or nucleotide sequences. Dotplots were introduced by Gibbs and McIntyre in 1970. They are two-dimensional matrices with the two sequences being compared laid out along the vertical (y) and horizontal (x) axes. An individual cell in the matrix is shaded black if the residues at that row and column are identical, so matching sequence segments appear as diagonal runs across the matrix. The more similar the two sequences are, the closer the plot comes to a single unbroken line along the main diagonal.

This pattern is affected by certain sequence features such as frame shifts, direct repeats, and inverted repeats. Frame shifts, in this context, include insertions, deletions, and mutations. The presence of one or more of these features produces several lines instead of one, in a configuration that depends on which features are present. Low-complexity regions give a very different picture. These are stretches of sequence made up of only a few distinct amino acids, which makes the region highly redundant. They typically show up around the diagonal, and may or may not appear as a square in the middle of the dotplot.
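To make the matrix idea concrete, here is a minimal sketch in base R that builds the dot matrix by hand and draws it. The two peptide strings are made up for illustration (they are not real proteins), and the variable names seq_x, seq_y and dot_matrix are my own, not part of SeqinR.

```r
# Two short, made-up peptide strings (illustrative only), split into residues
seq_x <- strsplit("MKTAYIAKQR", "")[[1]]
seq_y <- strsplit("MKTAYLAKQR", "")[[1]]

# dot_matrix[i, j] is TRUE when residue i of seq_x equals residue j of seq_y
dot_matrix <- outer(seq_x, seq_y, FUN = "==")

# Draw matching cells in black and everything else in white
image(x = seq_along(seq_x), y = seq_along(seq_y), z = dot_matrix * 1,
      col = c("white", "black"), xlab = "seq_x", ylab = "seq_y")
```

The two strings differ at position 6 only, so the main diagonal has a single gap there. The few isolated black cells away from the diagonal come from residues such as A and K that occur more than once.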

Dotplot using R

In this tutorial, the programming language R and the SeqinR package (available from CRAN) are used to generate a dotplot from a pair of protein sequences. I have given two methods: one online and one offline. The offline method needs the two protein sequences as input files in FASTA format (.fasta or .fas); the online method fetches them from the database by accession number. The simple protocol for generating a dotplot is given below:

Step 1: Download and install the version of R for your operating system.

Step 2: Download the SeqinR package from CRAN and install it. Brief explanations of Steps 1 and 2 can be downloaded from

Step 3: Create an R script as given below using a plain-text (ASCII) editor (e.g. Notepad) and save it with the .R file extension.

Method 1:
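library(seqinr)                                # load the SeqinR package
choosebank("swissprot")                        # open the SwissProt bank before querying
leprae <- query("leprae", "AC=Q9CD83")         # recent SeqinR versions return the result, so assign it
lepraeseq <- getSequence(leprae$req[[1]])      # residues as a vector of single characters
ulcerans <- query("ulcerans", "AC=A0PQ23")
ulceransseq <- getSequence(ulcerans$req[[1]])
closebank()                                    # close the connection to the bank
dotPlot(lepraeseq, ulceransseq)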
Note: This method downloads the two protein sequences, leprae (Q9CD83) and ulcerans (A0PQ23), from the SwissProt database, so it needs an internet connection.

Method 2:
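library(seqinr)                                                   # load the SeqinR package
leprae <- read.fasta(file = "E:/Q9CD83.fasta", seqtype = "AA")    # seqtype = "AA" reads the file as protein
ulcerans <- read.fasta(file = "E:/A0PQ23.fasta", seqtype = "AA")
lepraeseq <- leprae[[1]]                                          # first (and only) sequence in each file
ulceransseq <- ulcerans[[1]]
dotPlot(lepraeseq, ulceransseq)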
Note: In this method, the two protein sequence files, leprae (Q9CD83.fasta) and ulcerans (A0PQ23.fasta), must be present in the root of the E: drive. Change the paths if your files are stored elsewhere.

Step 4: Run the R script. A graphical dotplot image will be generated.
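If you run the script from the R console, the dotplot opens in a graphics window. If you run it non-interactively with Rscript, R writes plots to Rplots.pdf by default. To get an image file instead, you can wrap the dotPlot call in a graphics device. The sketch below assumes SeqinR is installed, uses two made-up peptide strings in place of the real sequences, and writes to a hypothetical file name, dotplot.png, in the working directory.

```r
library(seqinr)  # assumes SeqinR is already installed (Step 2)

# Two short, made-up peptide strings standing in for the real sequences;
# s2c() splits a string into a vector of single characters, as dotPlot expects
seq1 <- s2c("MKTAYIAKQRQISFVKSHFSRQ")
seq2 <- s2c("MKTAYLAKQRQISFVKAHFSRQ")

# Open a PNG device, draw the dotplot into it, then close the device
png("dotplot.png", width = 600, height = 600)  # hypothetical output file name
dotPlot(seq1, seq2, xlab = "seq1", ylab = "seq2")
dev.off()
```

To use this with either method above, replace seq1 and seq2 with lepraeseq and ulceransseq. A script saved as, say, dotplot.R (again a hypothetical name) can be run with source("dotplot.R") from the R console or with Rscript dotplot.R from the command line.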


