Overview
What Will MSAReveal Do For You?
- A number of demonstration and test sets of sequences are provided ready to run. Click on the
button Show Demos & Tests.
Given one or more protein amino acid sequences, aligned or not:
- Sequences are displayed, optionally colored by type of amino acid.
Currently there is one color scheme. Others can be added on request.
-
To color a different subset
of amino acids, and count their numbers in each sequence,
use Find. For example, to color only
Ser, enter "S" in the Find slot.
Or, to color all Asn or Gln, enter [NQ] in the Find slot.
(Entering NQ without the brackets will find only the sequence NQ.)
To make the colored hits easier to see, uncheck all the other
color options (or click the "None" button).
Counts of your query in each sequence are tabulated in Part I of Statistics,
and you can sort that table by numbers of hits.
For more, see
Finding Sequence Fragments.
- Touching any amino acid pops up its 3-letter abbreviation, and its sequence number.
- A consensus is displayed at the bottom of the sequence listing.
Touching any column in the consensus shows the frequencies of amino acids in that column
(see image at right). The amino acid with the highest frequency is boldface, and its
percentage is given.
- Highlighting mutations: When all sequences are >50% identical, a special option
appears with a checkbox to "Highlight Differences". When checked, only the columns
with non-identical residues will be colored.
For an example, click Consensus wrap test + highlight differences
under Show Demos & Tests.
- Short sequences entered in the Find slot are highlighted in red, and their
sequence positions linked. They are found regardless of intervening gaps.
Ambiguous amino acids are supported.
-
Statistics are reported, including lengths excluding gaps, total amino acids in the alignment,
percentage identity vs. the reference (first) sequence, percentage aromatics, numbers of Cys,
charged, and other amino acids, net charge at neutral pH,
and counts and percentages of gaps in each sequence.
The statistics table can be sorted on any column.
- Slides: For showing sequence alignments in presentations, before
taking a snapshot
to put in a slide, you may use the Options checkboxes to hide row numbers, the consensus, gene
names, and UniProt IDs in order to make a less cluttered and more compact display.
Figure: Default sequence listing. How to take a snapshot.
Figure: Compact, uncluttered table after unchecking (under Options) row numbers, gene names, UniProt IDs, and consensus.
Advanced Features:
- The starting number for each sequence can be
specified in the header.
- An alignment description, when included in the header of the first sequence, will be
displayed above the sequences.
- A comment for each sequence, when optionally included in its header, will be
displayed when the taxon is touched.
- The presence of ambiguous amino acids (BJOUXZ) or illegal characters is reported,
with buttons to find every instance.
A number of other error conditions are reported.
- When a 3D crystallographic (or other empirical) model is available for a sequence, the
PDB code can be
included in its header.
In the Statistics table, the code will be linked to display the 3D model in
FirstGlance in Jmol.
-
Some web browsers become very sluggish when displaying large numbers of
tool-tipped amino acids. When the number of amino acids is
too large for the browser in use,
a recommendation appears automatically to use Firefox. Firefox performs efficiently
even with one million
amino acids.
How To Use MSAReveal:
-
Collect amino acid sequences, e.g. from UniProt.Org.
Instructions are provided.
- Align sequences.
Instructions are provided
using free, straightforward, powerful Jalview. MSAReveal does not align sequences.
- Save the alignment in a file in FASTA format.
- Display the alignment, copy, and paste into MSAReveal.
- Press the button Process Sequences.
Voila!
No problem! MSAReveal shows you the 3-letter abbreviation in a tooltip
whenever you touch a one-letter code in the color scheme options, or in the
sequence alignment listing. When you touch a one-letter code column header in the
statistics table, the full name of the amino acid is shown.
And here is a
handy reference chart.
We recommend downloading FASTA sequences from
UniProt.Org:
- At UniProt.Org, use the search slot at the top to describe a sequence.
Examples: "yeast gal4", "sulfurreducens pila", "human pla2g6".
- In the list of hits, click on the Entry code (in the left column of the table)
for the sequence you want. (We recommend viewing the entire entry to confirm this is
what you want.)
- Click on the blue Sequence button at the left side of the page.
For a single sequence:
- Click on the blue FASTA button.
- Open your browser's File menu, and click Save Page As.
- You may wish to rename the file to add the name of the protein or taxon. Keeping the
file type ".fasta" is a good idea.
For a group of sequences:
- Click on the blue button Add to basket.
- When you have added all the desired sequences to your basket,
scroll to the top of the page and click on the blue Basket button.
- In the box that opens, click on Download.
- Select Uncompressed and click Go.
- Select Save File and click OK.
You can now open your saved FASTA file (a plain text editor would be ideal, see below),
select all, copy, and paste into MSAReveal.
NOTE that your sequences are not yet aligned. See
How To Align Sequences.
FASTA files are plain text. You can edit them with a plain text editor, for example
to separate or gather sequences. A plain text editor is one which does not
"mark up" the text with formatting codes. On Windows, use Notepad. On a Mac, use the free
program
TextWrangler. If you use WordPad, Word, TextEdit, or
other "word processor" programs, it is often tricky to force the program to save as plain text.
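Because FASTA is plain text, gathering or separating sequences can also be scripted. The following is a minimal Python sketch and is not part of MSAReveal; the function name parse_fasta is hypothetical. It assumes each record starts with a single header line beginning with ">" and that the sequence may be split over several lines.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
        # Sequence lines that appear before any header are ignored here.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The resulting pairs can be filtered or reordered and written back out, one header line followed by its sequence, before pasting into MSAReveal.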
We recommend the free program
Jalview
because it is straightforward, and preserves the full UniProt headers (including genus and
species).
Jalview requires that free
Java be installed on your computer. Alignments done in UniProt
suffer from FASTA headers that have only the UniProt Accession Number,
without the taxon (genus and species).
Instructions for Jalview:
- You will need files containing FASTA sequences that have been saved on your computer.
See
How To Download FASTA Sequences.
- Run Jalview.
- Drag a file containing one or more FASTA sequences and drop into Jalview.
A window should appear that displays the sequence(s) at the top.
- Drag additional files into the SAME window if you wish to add more sequences.
- At the top of the window containing your sequences, click on Web Service
and then click on Alignment.
- Choose an alignment algorithm (such as MAFFT, MUSCLE, or TCOFFEE) and click on
with defaults.
- A second window opens and the alignment is performed. If you have many or long sequences,
this might take a while.
- A third window titled "So and so alignment" opens when the alignment is completed.
- Open the File menu at the top left of the third window, and "Save As".
You may want to double-click on Desktop to save it there temporarily.
Use FASTA format, and name the file appropriately.
-
Your saved alignment is now ready to open (a plain text editor would be good), select all,
copy and paste into MSAReveal.
Options:
Options (preferences) are remembered automatically between sessions, unless you have disabled
"cookies" in your browser.
Sequences:
- Sequences are numbered starting with 1 by default. However, you can specify a starting
sequence number by adding "start=N" to the header of a sequence.
N can be positive, negative, or zero. For example if you want the
first residue of the mature protein to be number 1, and it is preceded by a 7-amino acid signal
sequence, you can add "start=-6" to the header. Now the signal sequence will be numbered -6 to 0, and 1 will be the first residue of the mature protein. This is illustrated in the demo
"3: Pilins Pa unaligned".
- MSAReveal can handle sequences of length >30,000, alignments with >400 sequences,
and alignments with a total of more than one million amino acids.
Tests have included
six sequences of titin with a total of 178,130 amino acids in the alignment. Human titin has
34,350 amino acids.
On a late 2014 MacBook Pro, processing the titin alignment took about 18 sec.
Pasting it into the box six times gives an alignment with 1,068,780 amino acids total. Processing
that took less than 90 seconds, and the results appeared to be correct.
Tests have also included
an alignment with 401 sequences of length 310 (total 110,000 amino acids).
Demos: Click the button Show Demos & Tests
and then click on the link Larger Examples.
- Various error conditions are detected and reported.
- A number of sample sequence alignments (and some unaligned sets) are provided. Press
the button "Show Demos & Tests" above the sequence input box.
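The "start=N" numbering described above can be sketched in Python as follows. This is an illustration, not MSAReveal's own code: the function name residue_numbers is hypothetical, and it assumes that dashes (gaps) do not receive a number and that "start=N" may appear anywhere in the header.

```python
import re

def residue_numbers(header, sequence):
    """Return one number per column: an int for a residue, None for a gap."""
    match = re.search(r"\bstart=(-?\d+)", header)
    number = int(match.group(1)) if match else 1  # default numbering starts at 1
    numbers = []
    for ch in sequence:
        if ch == "-":
            numbers.append(None)  # assumption: gaps are not numbered
        else:
            numbers.append(number)
            number += 1
    return numbers
```

With a 7-amino acid signal sequence and "start=-6" in the header, the first seven residues are numbered -6 to 0 and the eighth residue is number 1.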
Finding Sequence Fragments:
- After a set of sequences has been processed, at the top of the output is a slot labeled
Find.
If you enter a sequence fragment, any matches in the sequence listing
will be highlighted in red. The remainder of the sequences will be shown in gray lower case,
making it easier to spot the red matches.
The search algorithm has the following features:
-
The query sequence will be found regardless of gaps. For example, if you specify
CDEFG, you will find not only CDEFG but also CD-EF------G. This works only if the query
contains no dashes (gaps). (Technical note: the query CDEFG is run as the regex /C-*D-*E-*F-*G/g.)
Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "SG".
-
After you enter the query, a list of matches in each row (sequence) will be displayed,
with the sequence number of the first amino acid in each match. Each sequence number is
hyperlinked so that clicking it will jump to the corresponding match. Only the first 20 matches
are listed in each row, but all matches are highlighted in the sequence listing.
-
To CLEAR a search, delete all characters in the Find slot and press Enter.
-
If you specify a sequence containing one or more dashes (gaps), only an exact match will be
found. For example, if you specify CD--EFG, only CD--EFG will be found, not CDEFG, not
CDE-FG, not even CD-EFG nor CD---EFG. If you include dashes in your query, you may not
include square brackets [...], but you may include question marks.
Try it: use Demo "3: Pilins Pa TCOFFEE" and compare hits for "GK" vs. "G-K".
-
Queries can be any length, including length 1. Thus the query "M" will highlight all
methionines in all sequences.
Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "W".
-
The query sequence can be specified in upper or lower case, or a mixture. It will be converted
to upper case for matching purposes. Thus, the query CdEfG will match CDEFG, C---DEFG, etc.
-
Regular expression character classes are supported in a limited way. For example, [IL] will match
a single amino acid, either Ile or Leu. [IL] is equivalent to [LI].
-
Thus, the query MKA[AQ][KQ] will match MKAAK, MKAAQ, MKAQK, and MKAQQ. [FYW]
matches any single aromatic (except His).
-
[ILMV][ILMV][ILMV][ILMV] will match any 4 consecutive residues of
Ile, Leu, Met or Val in any sequence order, regardless of gaps.
Thus it would match IIII, IVMV, LMML, IL--LI, V--ML-----I, etc. etc.
-
In a query, dashes may not be combined with square brackets.
(The regex curly bracket syntax, e.g. [ILMV]{4}, is not supported, nor are any regex special
characters except for [ and ].)
Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "[QT]DGS".
-
A question mark matches any amino acid. Thus A?E matches ACE, AME, A--Y-E, etc. etc.
(Technical note: question marks are converted to regex [A-Z].)
Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "?GT".
Note that OVERLAPPING hits are NOT found. The above trial finds the S--GT in S--GTGT, but not
the trailing TGT. Compare a search for "GT".
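The Find rules above can be sketched in Python. This is a hedged illustration using Python's re module rather than the browser regex engine that MSAReveal uses; the function names query_to_regex and find_hits are hypothetical. It assumes that in a query containing dashes, a question mark still stands for any single letter.

```python
import re

def query_to_regex(query):
    """Translate a Find query into a compiled regular expression."""
    q = query.strip().upper()  # matching is case-insensitive
    if "-" in q:
        # Queries with dashes match exactly; brackets are not allowed.
        if "[" in q or "]" in q:
            raise ValueError("dashes may not be combined with square brackets")
        return re.compile("".join("[A-Z]" if ch == "?" else re.escape(ch) for ch in q))
    # Otherwise split into units: a letter, a [class], or a question mark.
    tokens = re.findall(r"\[[A-Z]+\]|[A-Z]|\?", q)
    if "".join(tokens) != q:
        raise ValueError("unsupported characters in query")
    units = ["[A-Z]" if t == "?" else t for t in tokens]
    return re.compile("-*".join(units))  # allow gaps between units

def find_hits(aligned, query, start=1):
    """Return (sequence number of first residue, matched text) for each hit."""
    hits = []
    for m in query_to_regex(query).finditer(aligned.upper()):
        residues_before = sum(1 for ch in aligned[:m.start()] if ch != "-")
        hits.append((start + residues_before, m.group()))
    return hits
```

Like the search described above, finditer reports only non-overlapping matches, so find_hits("S--GTGT", "?GT") returns the single hit S--GT, while find_hits("S--GTGT", "GT") returns two hits.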
Headers:
- In FASTA format, the "header" is the first line of the sequence record. It must begin
with the character ">", and must be a single line. See examples by clicking the
"Show Demos & Tests" button.
-
UniProt headers work best but other header formats can be used.
Supported FASTA header formats:
-
UniProt.
- Multiple sequence alignments generated by the
ConSurf Server for jobs using the Swiss-Prot, Clean UniProt, Uniref90, and Uniprot databases.
FASTA header formats NOT YET supported*:
-
GenBank
headers are not yet supported because
changes to the FASTA header format are scheduled for
October, 2016.
- Multiple sequence alignments generated by the
ConSurf Server for jobs using the NR database (which have headers similar
to GenBank).
-
* Unsupported headers will be shown as is. Taxa and accession codes will not be extracted
and tabulated.
- If you would like support for a header format that is not mentioned here,
please contact
- Header formats can be mixed in the same group of sequences.
- Genus and species will be tabulated when given in the header following "OS="
(UniProt format) or "Tax=" (ConSurf format), or following a dash (ConSurf "Clean Uniprot" format).
-
The gene name is tabulated when given in the header following "GN=" (UniProt format).
- UniProt 6 or 10 character
Accession Codes
are detected (regardless of the surrounding characters)
and tabulated with links to UniProt.
UniParc
Identifiers (beginning "UPI") are also used.
If none of these are found, UniProt
Entry Names
are looked for.
Optional additions to headers by users:
-
If a 4-character
PDB Entry Code
is added to a header following "PDB=", it will be tabulated in the Statistics
table and linked to display the 3D model in
FirstGlance in Jmol.
Demo: 9: Pilins.
- You may add "start=N" to the header to specify where the numbering of the sequence
should start.
Examples.
-
When a description of an alignment is
added to a header, it will be displayed
above the sequences table.
-
When a description of an individual sequence is
added to its header, it will be displayed
when the Taxon of that sequence is touched with the mouse.
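As an illustration of the header fields listed above, here is a minimal Python sketch for UniProt-style headers. It is not MSAReveal's own parser: the function name parse_header is hypothetical, the accession pattern is the one published in the UniProt help pages, and taking the first two words after "OS=" as genus and species is an assumption. ConSurf formats and UniParc identifiers are not handled.

```python
import re

# UniProt accession pattern (6 or 10 characters), per the UniProt help pages.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def parse_header(header):
    """Extract accession, taxon, gene name and PDB code from a FASTA header."""
    def first(pattern):
        m = re.search(pattern, header)
        return m.group(1) if m else None

    # Organism text runs up to the next KEY= field, a ">>" marker, or line end.
    organism = first(r"OS=(.+?)(?=\s+\w+=|\s*>>|$)")
    accession = ACCESSION.search(header)
    return {
        "accession": accession.group() if accession else None,
        # Assumption: genus and species are the first two words after OS=.
        "taxon": " ".join(organism.split()[:2]) if organism else None,
        "gene": first(r"GN=(\S+)"),
        "pdb": first(r"PDB=([0-9A-Za-z]{4})\b"),
    }
```

For example, a header such as ">sp|P04386|GAL4_YEAST Regulatory protein GAL4 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=GAL4 PE=1 SV=2 PDB=1ABC" (where 1ABC is a placeholder PDB code used only for illustration) yields the accession P04386, the taxon Saccharomyces cerevisiae, the gene GAL4, and the PDB code 1ABC.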
Output:
- Sequences can be displayed in a single horizontally-scrolling table, or broken into
multiple tables ("wrapped") of specified length (default 80 amino acids each).
- Touching any amino acid reports its sequence number in a tooltip,
counting the first amino acid as
number one.
- The statistics table can be sorted by any column. Row numbers remain intact and can be
used to cross-reference between the sequences table and list of full headers. The table can
be "unsorted" by sorting on the row number column.
- A single color scheme for amino acids is provided in this version. Others can be added
on request.
- The state of checkboxes (colors applied or not, output wrapped or not) and other
preferences are remembered between sessions and runs (using browser "cookies").
Consensus:
A consensus is shown below the sequence alignment. Touching any position (column) in the
consensus reports the frequencies of amino acids and dashes in that column in a tooltip.
Here is the key to the characters in the consensus line (when the sequences are aligned*):
-
A
Black upper case letters: 100% identical.
-
A
Gray upper case letters: all but one identical (when 4-9 sequences),
or >=90% identical (when 10 or more sequences).
-
a
Gray lower case letters: >50% (when 3 or more sequences).
Cannot occur with 4 sequences (where 3 identical generates a gray uppercase letter).
A gray lowercase letter supersedes
a gray colon.
-
:
Gray colon: "similar", >=90% in a single similarity group when there are 10 or more sequences.
100% in a single
similarity group when there are 2-9 sequences. Superseded by a gray lower case letter.
This is more simplistic than the
PAM 250 matrix method used by Clustal.
Similarity Groups:
- ILMV AC (hydrophobic, not aromatic)
- FYW (aromatic)
- NQ ST Y (polar, not charged)
- DEKR H (charged)
- GP (P is helix-breaking; turns frequently include one or both)
- Note that Y is included in both aromatic and polar, not charged.
* It is assumed that when the sequences are aligned, all sequences (including letters + dashes)
are the same length (see
Errors Detected, item 6).
If the sequences are not aligned, dashes should not be present.
When the lengths of sequences (including dashes if present) differ, the absence of an amino
acid (or dash) in a column (because that sequence does not extend into that column) counts
as non-identity. Thus:
- Black upper case letters cannot occur in columns into which not all sequences extend.
- Gray upper case and lower case letters, and colons, are possible in columns into which
not all sequences extend. These are illustrated in some of the "tiny" Demos & Tests.
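The consensus key above can be expressed as a short Python sketch. This is an interpretation of the stated rules, not MSAReveal's own code: the function name consensus_symbol is hypothetical, each line of the Similarity Groups list is assumed to be one group, and a space is assumed to mark a sequence that does not extend into the column.

```python
from collections import Counter

# Assumption: each line of the Similarity Groups list is a single group.
SIMILARITY_GROUPS = ["ILMVAC", "FYW", "NQSTY", "DEKRH", "GP"]

def consensus_symbol(column):
    """Return (symbol, shade) for one alignment column.

    `column` is a string with one character per sequence: a letter,
    "-" for a gap, or " " where a shorter sequence does not reach it.
    """
    n = len(column)
    letters = [ch for ch in column.upper() if ch.isalpha()]
    if not letters:
        return (" ", "")
    residue, count = Counter(letters).most_common(1)[0]
    if count == n:
        return (residue, "black")           # 100% identical
    if (4 <= n <= 9 and count == n - 1) or (n >= 10 and 10 * count >= 9 * n):
        return (residue, "gray")            # all but one, or >=90%
    if n >= 3 and 2 * count > n:
        return (residue.lower(), "gray")    # >50%
    for group in SIMILARITY_GROUPS:
        in_group = sum(1 for ch in letters if ch in group)
        if (2 <= n <= 9 and in_group == n) or (n >= 10 and 10 * in_group >= 9 * n):
            return (":", "gray")            # similar
    return (" ", "")
```

Because the lowercase test runs before the similarity test, a gray lowercase letter takes precedence over a gray colon, and a column into which one sequence does not extend can never reach the black 100% case.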
Statistics:
- The length of each sequence (exclusive of gaps/dashes) is given.
- The length of the sequences in the alignment, including gaps/dashes,
is given in the Consensus line below the aligned sequences.
- The total number of amino acids in all sequences combined is given in the last row
of the Statistics table.
-
The number of identical residues, and percentage of identical residues, relative to the
first ("Reference") sequence. For the percentage, the denominator is the length of the
sequence, regardless of whether the reference sequence is shorter.
- Counts and percentages of various residues and groups of residues. More amino acids
or groups can be added on request.
- Net charge near neutral pH.
- Number of gaps (groups of one or more consecutive dashes), dashes ("gapped" positions),
and dashes as percentage of the length (denominator includes dashes).
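Some of these statistics can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names gap_stats and identity_vs_reference are hypothetical, and it assumes that identity is counted column by column against the reference and that the percentage denominator is the sequence length excluding dashes.

```python
import re

def gap_stats(aligned):
    """Length without dashes, number of gaps, dashes, and percent dashes."""
    dashes = aligned.count("-")
    return {
        "length": len(aligned) - dashes,
        "gaps": len(re.findall(r"-+", aligned)),  # runs of consecutive dashes
        "dashes": dashes,
        # The denominator includes dashes, as stated above.
        "percent_dashes": 100.0 * dashes / len(aligned) if aligned else 0.0,
    }

def identity_vs_reference(reference, aligned):
    """Count residues identical to the reference in the same column."""
    identical = sum(
        1 for ref, res in zip(reference, aligned) if res != "-" and res == ref
    )
    length = len(aligned) - aligned.count("-")  # assumption: dashes excluded
    percent = 100.0 * identical / length if length else 0.0
    return identical, percent
```

For the row CD-EF------G this gives a length of 5, 2 gaps, and 7 dashes out of 12 columns.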
Browser-Specific Behavior:
-
MSAReveal's output displays large numbers of characters (amino
acids), each with a sequence-number tooltip and optionally a distinct color. Firefox is
the only popular browser that can do this with large jobs,
while remaining responsive to scrolling and tooltip display.
As the size of the job increases, other browsers become sluggish, and eventually unusably
sluggish, sometimes nearly frozen. In contrast, Firefox can handle 1,000,000 amino acids
while having excellent responsiveness. Based on performances during testing,
the following thresholds have been
set. When Firefox is not the browser initially employed, and
the number of amino acids in the job exceeds the "Recommend" threshold, changing to Firefox
is recommended but the user may choose to proceed despite sluggish behavior.
When the number of amino acids exceeds the "Require" threshold, the user must* use
Firefox to process the job.
Job Size (Total Amino Acids)
| Browser | Recommend Firefox | Require Firefox |
| Chrome | 10,000 | 20,000 |
| Edge | 35,000 | 70,000 |
| Internet Explorer 11 | 60,000 | 100,000 |
| Safari | 5,000 | 10,000 |
* If you add "?go=1" to the end of the URL, the job will be processed in the current
browser 5 seconds after the
message requiring Firefox appears. This enables testing regardless of the threshold.
If you wish to recommend changes in the above thresholds, please contact
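The threshold logic can be summarized in a small Python sketch. It is not MSAReveal's own code: the function name firefox_advice is hypothetical, "exceeds" is read as strictly greater than, and browsers without a listed threshold are assumed to be allowed.

```python
# (recommend, require) thresholds in total amino acids, from the table above.
THRESHOLDS = {
    "Chrome": (10_000, 20_000),
    "Edge": (35_000, 70_000),
    "Internet Explorer 11": (60_000, 100_000),
    "Safari": (5_000, 10_000),
}

def firefox_advice(browser, total_amino_acids):
    """Return "ok", "recommend", or "require" for a job of the given size."""
    if browser not in THRESHOLDS:
        # Firefox, or a browser with no listed threshold (assumption).
        return "ok"
    recommend, require = THRESHOLDS[browser]
    if total_amino_acids > require:
        return "require"
    if total_amino_acids > recommend:
        return "recommend"
    return "ok"
```

For instance, a 15,000-residue job in Chrome falls between the two thresholds, so switching to Firefox would be recommended but not required.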
The following conditions are detected and reported.
Each of these can be demonstrated with one of the
Demo tests provided.
- No header. Demo: Header Missing.
- Illegal characters not representing amino acids.
When present, a button appears offering to list all instances with links to jump to each one.
Demo: Illegal Characters.
- Legal but ambiguous amino acid characters BJOUXZ.
When present, a button appears offering to list all instances with links to jump to each one.
Demo: 1: With Gaps, Ambiguous AA.
- Nucleic acid sequence instead of protein sequence. Demo: DNA/RNA.
- A single sequence containing gaps (dashes), hence not an alignment. Demo: 1: With Gaps, Ambiguous AA.
- Sequences contain gaps, hence apparently an alignment, but have different lengths.
Demo: Mismatched Lengths.
- Header containing more than one distinct 6- or 10-character UniProt Accession Number. Demo: Multiple accession numbers.
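Two of these checks are simple enough to sketch in Python. The code below is illustrative and not MSAReveal's implementation: the function names scan_sequence and mismatched_lengths are hypothetical, and positions are reported as 1-based columns that include dashes.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
AMBIGUOUS = set("BJOUXZ")               # legal but ambiguous codes

def scan_sequence(sequence):
    """Return (ambiguous, illegal) lists of (column, character) pairs."""
    ambiguous, illegal = [], []
    for column, ch in enumerate(sequence.upper(), start=1):
        if ch in STANDARD or ch == "-":
            continue
        if ch in AMBIGUOUS:
            ambiguous.append((column, ch))
        else:
            illegal.append((column, ch))  # anything else is not an amino acid
    return ambiguous, illegal

def mismatched_lengths(sequences):
    """True when gaps suggest an alignment but the row lengths differ."""
    has_gaps = any("-" in s for s in sequences)
    return has_gaps and len({len(s) for s in sequences}) > 1
```

The lists returned by scan_sequence correspond to what the "list all instances" buttons would need: each entry identifies one character and where it occurs.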
When a sequence has an empirical 3D structure in the
Protein Data Bank,
you may add "PDB=xxxx" to the header, where xxxx is the
PDB accession code.
Such PDB codes will appear in a "3D" column in the Statistics table, linked to display
the corresponding structures in
FirstGlance in Jmol.
The addition must be before
>> or >>>.
Example: Demo "9: Pilins".
Group Descriptions:
If you add, for example, ">>> Aligned by MAFFT" to the end of a header,
this will be displayed
above the table of sequences, with a
light green background. Such a group description would normally be added to only
one header in a group of sequences. If several headers contain
">>>",
the descriptions will be concatenated. Example: Gal4 Demo.
Sequence Descriptions:
If you add, for example, ">> Mutant Y57W" to the end of a header, when you touch the
Taxon in this row with the mouse, this sequence description will be shown above
the table of sequences, with a
pink background. Example: Gal4 Demo.
">>>" and ">>" can be in either order, but both must be at the end
of the header.
Hyperlinks in descriptions: If you wish to include a hyperlink in a description,
replace the space following "<a" with a vertical bar "|",
so your anchor tag becomes
"<a|href=...>linked text</a>". This avoids having the line broken
on the space
within the hyperlink, which causes the link to display incorrectly in the Full Headers
section. Demo: "1: Gs pilA". The only place you will see the vertical bar
is in the box where the Demo is pasted. It is replaced with a space (after wrapping)
in the Results.
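The ">>" and ">>>" conventions, together with the vertical-bar substitution, can be sketched in Python as follows. This is a hedged illustration rather than MSAReveal's own code; the function name split_descriptions is hypothetical, and it assumes the markers are not immediately preceded by another ">" character.

```python
import re

def split_descriptions(header):
    """Return (base header, sequence description, group description)."""
    markers = list(re.finditer(r">>>?", header))
    base = header[: markers[0].start()].rstrip() if markers else header.rstrip()
    sequence_description = group_description = None
    for i, marker in enumerate(markers):
        # Each description runs up to the next marker or the end of the header.
        end = markers[i + 1].start() if i + 1 < len(markers) else len(header)
        text = header[marker.end():end].strip()
        text = text.replace("<a|", "<a ")  # restore the space in anchor tags
        if marker.group() == ">>>":
            group_description = text
        else:
            sequence_description = text
    return base, sequence_description, group_description
```

The two markers may appear in either order, and a header with neither marker is returned unchanged with both descriptions set to None.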