MSAReveal.Org Help

Getting Started:

Overview

Features:

Errors Detected and Reported

Optional Advanced Features:

Credits
Versions

Overview

What Will MSAReveal Do For You?

A number of demonstration and test sets of sequences are provided ready to run. Click on the button Show Demos & Tests.

Given one or more protein amino acid sequences, aligned or not:

Sequences are displayed, optionally colored by type of amino acid.
Currently there is one color scheme. Others can be added on request.

To color a different subset of amino acids, and count their numbers in each sequence, use Find. For example, to color only Ser, enter "S" in the Find slot.

Or, to color all Asn or Gln, enter [NQ] in the Find slot. (Entering NQ without the brackets will find only the sequence NQ.) To make the colored hits easier to see, uncheck all the other color options (or click the "None" button). Counts of your query in each sequence are tabulated in Part I of Statistics, and you can sort that table by numbers of hits. For more, see Finding Sequence Fragments.

Touching any amino acid pops up its 3-letter abbreviation, and its sequence number.

A consensus is displayed at the bottom of the sequence listing. Touching any column in the consensus shows the frequencies of amino acids in that column. The amino acid with the highest frequency is boldface, and its percentage is given.

Highlighting mutations: When all sequences are >50% identical, a special option appears with a checkbox to "Highlight Differences". When checked, only the columns with non-identical residues will be colored. For an example, click Consensus wrap test + highlight differences under Show Demos & Tests.

Short sequences entered in the Find slot are highlighted in red, and their sequence positions linked. They are found regardless of intervening gaps. Ambiguous amino acids are supported.

Statistics are reported, including lengths excluding gaps, total amino acids in the alignment, percentage identity vs. the reference (first) sequence, percentage aromatics, numbers of Cys, charged, and other amino acids, net charge at neutral pH, and counts and percentages of gaps in each sequence. The statistics table can be sorted on any column.

Slides: For showing sequence alignments in presentations, before taking a snapshot to put in a slide, you may use the Options checkboxes to hide row numbers, the consensus, gene names, and UniProt IDs in order to make a less cluttered and more compact display.

	Default sequence listing. How to take a snapshot.
	Compact, uncluttered table after unchecking (under Options) row numbers, gene names, UniProt IDs, and consensus.

Advanced Features:

The starting number for each sequence can be specified in the header.
An alignment description, when included in the header of the first sequence, will be displayed above the sequences.
A comment for each sequence, when optionally included in its header, will be displayed when the taxon is touched.
The presence of ambiguous amino acids (BJOUXZ) or illegal characters is reported, with buttons to find every instance. A number of other error conditions are reported.
When a 3D crystallographic (or other empirical) model is available for a sequence, the PDB code can be included in its header. In the Statistics table, the code will be linked to display the 3D model in FirstGlance in Jmol.
Some web browsers become very sluggish when displaying large numbers of tool-tipped amino acids. When the number of amino acids is too large for the browser in use, a recommendation appears automatically to use Firefox. Firefox performs efficiently even with one million amino acids.

How To Use MSAReveal:

Collect amino acid sequences, e.g. from UniProt.Org. Instructions are provided.
Align sequences. Instructions are provided using free, straightforward, powerful Jalview. MSAReveal does not align sequences.
Save the alignment in a file in FASTA format.
Display the alignment, copy, and paste into MSAReveal.
Press the button Process Sequences. Voila!

Still Learning 1-Letter Amino Acid Codes?

No problem! MSAReveal shows you the 3-letter abbreviation in a tooltip whenever you touch a one-letter code in the color scheme options, or in the sequence alignment listing. When you touch a one-letter code column header in the statistics table, the full name of the amino acid is shown.

And here is a handy reference chart.

How To Download FASTA Sequences

We recommend downloading FASTA sequences from UniProt.Org:

At UniProt.Org, use the search slot at the top to describe a sequence. Examples: "yeast gal4", "sulfurreducens pila", "human pla2g6".
In the list of hits, click on the Entry code (in the left column of the table) for the sequence you want. (We recommend viewing the entire entry to confirm this is what you want.)
Click on the blue Sequence button at the left side of the page.

Click on the blue FASTA button.
Open your browser's File menu, and click Save Page As.
You may wish to rename the file to add the name of the protein or taxon. Keeping the file type ".fasta" is a good idea.

Click on the blue button Add to basket.
When you have added all the desired sequences to your basket, scroll to the top of the page and click on the blue Basket button.
In the box that opens, click on Download.

Select Uncompressed and click Go.
Select Save File and click OK.

You can now open your saved FASTA file (a plain text editor would be ideal, see below), select all, copy, and paste into MSAReveal.

NOTE that your sequences are not yet aligned. See How To Align Sequences.

FASTA files are plain text. You can edit them with a plain text editor, for example to separate or gather sequences. A plain text editor is one which does not "mark up" the text with formatting codes. In Windows, use Notepad. In Mac, use the free program TextWrangler. If you use WordPad, Word, TextEdit, or other "word processor" programs, it is often tricky to force the program to save as plain text.
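Because FASTA is plain text, separating or gathering sequences can also be done with a few lines of code. The following is a minimal sketch, not part of MSAReveal: the function name parse_fasta is hypothetical, and it assumes each record is a single ">" header line followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs.

    The header is the single line starting with ">"; every following line,
    up to the next header, is joined into one sequence string.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Lines that appear before the first header are skipped in this sketch; MSAReveal itself reports a missing header as an error.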

How To Align Sequences

We recommend the free program Jalview because it is straightforward, and preserves the full UniProt headers (including genus and species). Jalview requires that free Java be installed on your computer. Alignments done in UniProt suffer from FASTA headers that have only the UniProt Accession Number, without the taxon (genus and species). Instructions for Jalview:

You will need files containing FASTA sequences that have been saved on your computer. See How To Download FASTA Sequences.
Run Jalview.
Drag a file containing one or more FASTA sequences and drop into Jalview. A window should appear that displays the sequence(s) at the top.
Drag additional files into the SAME window if you wish to add more sequences.
At the top of the window containing your sequences, click on Web Service and then click on Alignment.
Choose an alignment algorithm (such as MAFFT, MUSCLE, or TCOFFEE) and click on with defaults.
A second window opens and the alignment is performed. If you have many or long sequences, this might take a while.

A third window titled "So and so alignment" opens when the alignment is completed.
Open the File menu at the top left of the third window, and "Save As". You may want to double-click on Desktop to save it there temporarily. Use FASTA format, and name the file appropriately.
Your saved alignment is now ready to open (a plain text editor would be good), select all, copy and paste into MSAReveal.

Specifications:

Options:

Options (preferences) are remembered automatically between sessions, unless you have disabled "cookies" in your browser.

Sequences:

Sequences are numbered starting with 1 by default. However, you can specify a starting sequence number by adding "start=N" to the header of a sequence. N can be positive, negative, or zero. For example if you want the first residue of the mature protein to be number 1, and it is preceded by a 7-amino acid signal sequence, you can add "start=-6" to the header. Now the signal sequence will be numbered -6 to 0, and 1 will be the first residue of the mature protein. This is illustrated in the demo "3: Pilins Pa unaligned".
MSAReveal can handle sequences of length >30,000, alignments with >400 sequences, and alignments with a total of more than one million amino acids. Tests have included six sequences of titin with a total of 178,130 amino acids in the alignment. Human titin has 34,350 amino acids. On a late 2014 MacBook Pro, processing the titin alignment took about 18 sec. Pasting it into the box six times gives an alignment with 1,068,780 amino acids total. Processing that took less than 90 seconds, and the results appeared to be correct. Tests have also included an alignment with 401 sequences of length 310 (total 110,000 amino acids). Demos: Click the button Show Demos & Tests and then click on the link Larger Examples.
Various error conditions are detected and reported.
A number of sample sequence alignments (and some unaligned sets) are provided. Press the button "Show Demos & Tests" above the sequence input box.
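The "start=N" numbering described above can be sketched as follows. This is an illustration only, not MSAReveal's code: the function names are hypothetical, and it assumes that dashes (gaps) do not consume a sequence number.

```python
import re


def starting_number(header):
    """Return N from "start=N" in the header, or 1 when it is absent."""
    m = re.search(r"start=(-?\d+)", header)
    return int(m.group(1)) if m else 1


def residue_numbers(header, sequence):
    """Pair each character with its sequence number (None for dashes)."""
    n = starting_number(header)
    numbered = []
    for ch in sequence:
        if ch == "-":
            numbered.append((ch, None))  # assumption: gaps are not numbered
        else:
            numbered.append((ch, n))
            n += 1  # zero is a valid number, so -1 is followed by 0, then 1
    return numbered
```

With "start=-6" and a 7-residue signal sequence, the signal residues receive -6 through 0 and the first mature residue receives 1, as in the example above.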

Finding Sequence Fragments:

After a set of sequences has been processed, at the top of the output is a slot labeled Find.

If you enter a sequence fragment, any matches in the sequence listing will be highlighted in red. The remainder of the sequences will be shown in gray lower case, making it easier to spot the red matches.

The search algorithm has the following features:
- The query sequence will be found regardless of gaps. For example, if you specify CDEFG, you will find not only CDEFG but also CD-EF------G. This works only if the query contains no dashes (gaps). (Technical note: the query CDEFG is run as the regex /C-*D-*E-*F-*G/g.)
  Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "SG".
- After you enter the query, a list of matches in each row (sequence) will be displayed, with the sequence number of the first amino acid in each match. Each sequence number is hyperlinked so that clicking it will jump to the corresponding match. Only the first 20 matches are listed in each row, but all matches are highlighted in the sequence listing.
- To CLEAR a search, delete all characters in the Find slot and press Enter.
- If you specify a sequence containing one or more dashes (gaps), only an exact match will be found. For example, if you specify CD--EFG, only CD--EFG will be found, not CDEFG, not CDE-FG, not even CD-EFG nor CD---EFG. If you include dashes in your query, you may not include square brackets [...], but you may include question marks.
  Try it: use Demo "3: Pilins Pa TCOFFEE" and compare hits for "GK" vs. "G-K".
- Queries can be any length, including length 1. Thus the query "M" will highlight all methionines in all sequences.
  Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "W".
- The query sequence can be specified in upper or lower case, or a mixture. It will be converted to upper case for matching purposes. Thus, the query CdEfG will match CDEFG, C---DEFG, etc.
- Regular expression character classes are supported in a limited way. For example, [IL] will match a single amino acid, either Ile or Leu. [IL] is equivalent to [LI].
  - Thus, the query MKA[AQ][KQ] will match MKAAK, MKAAQ, MKAQK, and MKAQQ. [FYW] matches any single aromatic (except His).
  - [ILMV][ILMV][ILMV][ILMV] will match any 4 consecutive residues of Ile, Leu, Met or Val in any sequence order, regardless of gaps. Thus it would match IIII, IVMV, LMML, IL--LI, V--ML-----I, etc. etc.
  - In a query, dashes may not be combined with square brackets. (The regex curly bracket syntax, e.g. [ILMV]{4}, is not supported, nor are any regex special characters except for [ and ].)
  Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "[QT]DGS".
- A question mark matches any amino acid. Thus A?E matches ACE, AME, A--Y-E, etc. etc. (Technical note: question marks are converted to regex [A-Z].)
  Try it: use Demo "3: Pilins Pa TCOFFEE" and search for "?GT".
  Note that OVERLAPPING hits are NOT found. The above trial finds the S--GT in S--GTGT, but not the trailing TGT. Compare a search for "GT".
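
The search rules above can be sketched in a few lines. This is an illustration under stated assumptions, not MSAReveal's own code: the function names query_to_regex and find_hits are hypothetical, and the reported sequence number is assumed to count only amino acids (not dashes) from the given start.

```python
import re


def query_to_regex(query):
    """Translate a Find query into a compiled regular expression."""
    q = query.upper()  # queries are matched in upper case
    if "-" in q:
        # Queries containing dashes match exactly; brackets are not allowed.
        if not re.fullmatch(r"[A-Z?-]+", q):
            raise ValueError("with dashes, only letters and ? are allowed")
        return re.compile(
            "".join("[A-Z]" if c == "?" else re.escape(c) for c in q)
        )
    # Tokens are single letters, ?, or bracketed classes such as [IL].
    tokens = re.findall(r"\[[A-Z]+\]|[A-Z?]", q)
    if not tokens or "".join(tokens) != q:
        raise ValueError("unsupported characters in query")
    parts = ["[A-Z]" if t == "?" else t for t in tokens]
    # Allow any number of dashes between consecutive residues.
    return re.compile("-*".join(parts))


def find_hits(query, sequence, start=1):
    """Return (number of first residue, matched text) for each hit."""
    hits = []
    # finditer returns non-overlapping matches, scanning left to right.
    for m in query_to_regex(query).finditer(sequence.upper()):
        residues_before = m.start() - sequence.count("-", 0, m.start())
        hits.append((start + residues_before, m.group()))
    return hits
```

For example, find_hits("?GT", "S--GTGT") returns a single hit, S--GT, because the trailing TGT overlaps the first match, while find_hits("GT", "S--GTGT") returns two hits.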

Headers:

In FASTA format, the "header" is the first line of the sequence record. It must begin with the character ">", and must be a single line. See examples by clicking the "Demos & Tests" buttons.
UniProt headers work best but other header formats can be used.
- UniProt.
- Multiple sequence alignments generated by the ConSurf Server for jobs using the Swiss-Prot, Clean UniProt, Uniref90, and Uniprot databases.
- GenBank headers are not yet supported because changes to the FASTA header format are scheduled for October, 2016.
- Multiple sequence alignments generated by the ConSurf Server for jobs using the NR database (which have headers similar to GenBank).
- * Unsupported headers will be shown as is. Taxa and accession codes will not be extracted and tabulated.
- If you would like support for a header format that is not mentioned here, please contact
Header formats can be mixed in the same group of sequences.
Genus and species will be tabulated when given in the header following "OS=" (UniProt format) or "Tax=" (ConSurf format), or following a dash (ConSurf "Clean Uniprot" format).
The gene name is tabulated when given in the header following "GN=" (UniProt format).
UniProt 6 or 10 character Accession Codes are detected (regardless of the surrounding characters) and tabulated with links to UniProt. UniParc Identifiers (beginning "UPI") are also used. If none of these are found, UniProt Entry Names are looked for.
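
As an illustration of how these fields might be pulled out of a UniProt-style header, here is a minimal sketch. It is not MSAReveal's code: the function names are hypothetical, the accession pattern is the one published in UniProt's help pages, and taking the first two words after "OS=" as genus and species is an assumption.

```python
import re

# Accession number pattern as published in the UniProt help pages
# (6 or 10 characters). No word boundaries are used, so accessions are
# found regardless of the surrounding characters.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def field(header, key):
    """Return the text following "KEY=" up to the next "XX=" field, or None."""
    pattern = rf"\b{re.escape(key)}=(.+?)(?=\s+[A-Za-z]+=|\s*>>|$)"
    m = re.search(pattern, header)
    return m.group(1).strip() if m else None


def genus_species(header):
    """Assumption: genus and species are the first two words after OS=."""
    organism = field(header, "OS")
    return " ".join(organism.split()[:2]) if organism else None


def accessions(header):
    """Return the distinct accession numbers found, in order of appearance."""
    found = []
    for acc in ACCESSION.findall(header):
        if acc not in found:
            found.append(acc)
    return found
```

A header yielding more than one distinct accession number from accessions() corresponds to the "multiple accession numbers" condition that MSAReveal reports.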

Optional additions to headers by users:

If a 4-character PDB Entry Code is added to a header following "PDB=", it will be tabulated in the Statistics table and linked to display the 3D model in FirstGlance in Jmol. Demo: 9: Pilins.
You may add "start=N" to the header to specify where the numbering of the sequence should start. Examples.
When a description of an alignment is added to a header, it will be displayed above the sequences table.
When a description of an individual sequence is added to its header, it will be displayed when the Taxon of that sequence is touched with the mouse.

Output:

Sequences can be displayed in a single horizontally-scrolling table, or broken into multiple tables ("wrapped") of specified length (default 80 amino acids each).
Touching any amino acid reports its sequence number in a tooltip, counting the first amino acid as number one.
The statistics table can be sorted by any column. Row numbers remain intact and can be used to cross-reference between the sequences table and list of full headers. The table can be "unsorted" by sorting on the row number column.
A single color scheme for amino acids is provided in this version. Others can be added on request.
The state of checkboxes (colors applied or not, output wrapped or not) and other preferences are remembered between sessions and runs (using browser "cookies").

Consensus:

A consensus is shown below the sequence alignment. Touching any position (column) in the consensus reports the frequencies of amino acids and dashes in that column in a tooltip.

A Black upper case letters: 100% identical.
A Gray upper case letters: all but one identical (when 4-9 sequences), or >=90% identical (when 10 or more sequences).
a Gray lower case letters: >50% (when 3 or more sequences). Cannot occur with 4 sequences (where 3 identical generates a gray uppercase letter). A gray lowercase letter supersedes a gray colon.
: Gray colon: "similar", >=90% in a single similarity group when there are 10 or more sequences. 100% in a single similarity group when there are 2-9 sequences. Superseded by a gray lower case letter. This is more simplistic than the PAM 250 matrix method used by Clustal.

Similarity groups:
ILMV AC (hydrophobic, not aromatic)
FYW (aromatic)
NQ ST Y (polar, not charged)
DEKR H (charged)
GP (P is helix-breaking; turns frequently include one or both)
Note that Y is included in both aromatic and polar, not charged.

When sequences have different lengths (see Errors Detected, item 6), the absence of an amino acid (or dash) in a column counts as non-identity. Consequently:

Black upper case letters cannot occur in columns into which not all sequences extend.
Gray upper case and lower case letters, and colons, are possible in columns into which not all sequences extend. These are illustrated in some of the "tiny" Demos & Tests.
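
The consensus rules above can be summarized in a short sketch. It is an illustration only, not MSAReveal's code. Several points are assumptions: the function name consensus_symbol is hypothetical, each line of the similarity list is treated as one group (so "ILMV AC" becomes ILMVAC), dashes never become the consensus letter, and a column is passed in as a string padded with dashes where a sequence does not extend.

```python
from collections import Counter

# Assumption: each line of the similarity list is a single group.
SIMILARITY_GROUPS = ["ILMVAC", "FYW", "NQSTY", "DEKRH", "GP"]


def consensus_symbol(column):
    """Return (symbol, shade) for one alignment column.

    column holds one character per sequence (an amino acid or "-").
    """
    n = len(column)
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return (" ", None)
    top, count = Counter(residues).most_common(1)[0]
    if count == n:
        return (top, "black")  # 100% identical
    if (4 <= n <= 9 and count == n - 1) or (n >= 10 and count * 10 >= 9 * n):
        return (top, "gray")  # all but one identical, or >= 90%
    if n >= 3 and count * 2 > n:
        return (top.lower(), "gray")  # > 50%; supersedes the colon
    for group in SIMILARITY_GROUPS:
        in_group = sum(1 for c in residues if c in group)
        if (2 <= n <= 9 and in_group == n) or (n >= 10 and in_group * 10 >= 9 * n):
            return (":", "gray")  # "similar"
    return (" ", None)
```

Because the denominator is always the number of sequences, a missing residue counts as non-identity, so a black upper case letter cannot result for a column containing a dash.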

Statistics:

The length of each sequence (exclusive of gaps/dashes) is given.
The length of the sequences in the alignment, including gaps/dashes, is given in the Consensus line below the aligned sequences.
The total number of amino acids in all sequences combined is given in the last row of the Statistics table.
The number of identical residues, and percentage of identical residues, relative to the first ("Reference") sequence. For the percentage, the denominator is the length of the sequence, regardless of whether the reference sequence is shorter.
Counts and percentages of various residues and groups of residues. More amino acids or groups can be added on request.
Net charge near neutral pH.
Number of gaps (groups of one or more consecutive dashes), dashes ("gapped" positions), and dashes as percentage of the length (denominator includes dashes).
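
A few of these statistics can be sketched as follows. This is an illustration, not MSAReveal's code: the function names are hypothetical, and the identity percentage assumes the denominator is the gap-free length of the sequence being compared.

```python
import re


def gap_stats(aligned):
    """Length without dashes, number of gaps, dashes, and dash percentage."""
    dashes = aligned.count("-")
    gaps = len(re.findall(r"-+", aligned))  # a gap is a run of one or more dashes
    total = len(aligned)  # the percentage denominator includes dashes
    return {
        "length": total - dashes,
        "gaps": gaps,
        "dashes": dashes,
        "dash_pct": 100.0 * dashes / total if total else 0.0,
    }


def identity_to_reference(reference, aligned):
    """Count residues identical to the reference, column by column."""
    identical = sum(
        1 for r, s in zip(reference, aligned) if s != "-" and r == s
    )
    length = len(aligned) - aligned.count("-")  # assumed denominator
    pct = 100.0 * identical / length if length else 0.0
    return identical, pct
```

For the aligned sequence "MK--AQ-K", gap_stats reports a length of 5, 2 gaps, 3 dashes, and 37.5% dashes.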

Browser-Specific Behavior:

MSAReveal's output displays large numbers of characters (amino acids), each with a sequence-number tooltip and optionally a distinct color. Firefox is the only popular browser that can do this with large jobs, while remaining responsive to scrolling and tooltip display. As the size of the job increases, other browsers become sluggish, and eventually unusably sluggish, sometimes nearly frozen. In contrast, Firefox can handle 1,000,000 amino acids while having excellent responsiveness. Based on performances during testing, the following thresholds have been set. When Firefox is not the browser initially employed, and the number of amino acids in the job exceeds the "Recommend" threshold, changing to Firefox is recommended but the user may choose to proceed despite sluggish behavior. When the number of amino acids exceeds the "Require" threshold, the user must* use Firefox to process the job.

	Job Size (Total Amino Acids)
Browser	Recommend Firefox	Require Firefox
Chrome	10,000	20,000
Edge	35,000	70,000
Internet Explorer 11	60,000	100,000
Safari	5,000	10,000

* If you add "?go=1" to the end of the URL, the job will be processed in the current browser 5 seconds after the message requiring Firefox appears. This enables testing regardless of the threshold.
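
The threshold logic in the table above amounts to a simple lookup, sketched here. This is an illustration, not MSAReveal's code: the function name firefox_advice is hypothetical, and treating browsers that are not in the table as unrestricted is an assumption.

```python
# (recommend, require) thresholds in total amino acids, from the table above.
THRESHOLDS = {
    "Chrome": (10_000, 20_000),
    "Edge": (35_000, 70_000),
    "Internet Explorer 11": (60_000, 100_000),
    "Safari": (5_000, 10_000),
}


def firefox_advice(browser, total_amino_acids, go_override=False):
    """Return "proceed", "recommend", or "require" for a job of this size."""
    if browser not in THRESHOLDS:
        return "proceed"  # Firefox, or (assumption) any browser not listed
    recommend, require = THRESHOLDS[browser]
    if total_amino_acids > require:
        # Adding "?go=1" to the URL lets the job run despite the threshold.
        return "proceed" if go_override else "require"
    if total_amino_acids > recommend:
        return "recommend"
    return "proceed"
```

For example, a 15,000-residue job in Chrome yields "recommend", while the same job in Safari yields "require".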

If you wish to recommend changes in the above thresholds, please contact

Errors Detected and Reported

The following conditions are detected and reported. Each of these can be demonstrated with one of the Demo tests provided.

No header. Demo: Header Missing.
Illegal characters not representing amino acids. When present, a button appears offering to list all instances with links to jump to each one. Demo: Illegal Characters.
Legal but ambiguous amino acid characters BJOUXZ. When present, a button appears offering to list all instances with links to jump to each one. Demo: 1: With Gaps, Ambiguous AA.
Nucleic acid sequence instead of protein sequence. Demo: DNA/RNA.
A single sequence containing gaps (dashes), hence not an alignment. Demo: 1: With Gaps, Ambiguous AA.
Sequences contain gaps, hence apparently an alignment, but have different lengths. Demo: Mismatched Lengths.
A header containing more than one distinct 6- or 10-character UniProt Accession Number. Demo: Multiple accession numbers.
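
Several of the sequence checks listed above can be sketched as follows. This is an illustration, not MSAReveal's code. The function names are hypothetical, and the nucleic acid test (only the letters A, C, G, T, U, N present) is an assumed heuristic that would misjudge a peptide made only of Ala, Cys, Gly, and Thr. Since the 20 standard one-letter codes plus BJOUXZ cover the whole alphabet, any letter is treated as legal here.

```python
import re

AMBIGUOUS = "BJOUXZ"


def illegal_characters(sequence):
    """1-based positions of characters that are neither letters nor dashes."""
    return [(m.start() + 1, m.group())
            for m in re.finditer(r"[^A-Za-z-]", sequence)]


def ambiguous_residues(sequence):
    """1-based positions of legal but ambiguous amino acid codes."""
    return [(i, c) for i, c in enumerate(sequence.upper(), 1) if c in AMBIGUOUS]


def looks_like_nucleic_acid(sequence):
    """Heuristic (assumption): every non-gap letter is one of ACGTUN."""
    letters = sequence.upper().replace("-", "")
    return bool(letters) and set(letters) <= set("ACGTUN")


def alignment_problems(sequences):
    """Report gap-related problems for a list of sequence strings."""
    problems = []
    has_gaps = any("-" in s for s in sequences)
    if has_gaps and len(sequences) == 1:
        problems.append("single sequence with gaps")
    if has_gaps and len({len(s) for s in sequences}) > 1:
        problems.append("mismatched lengths")
    return problems
```

Sequences of different lengths without any dashes are not flagged, since such a set is presumably unaligned rather than a faulty alignment.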

3D Structures (PDB Codes)

When a sequence has an empirical 3D structure in the Protein Data Bank, you may add "PDB=xxxx" to the header, where xxxx is the PDB accession code. Such PDB codes will appear in a "3D" column in the Statistics table, linked to display the corresponding structures in FirstGlance in Jmol. The addition must be before >> or >>>. Example: Demo "9: Pilins".

>>> & >>: Descriptions

Group Descriptions: If you add, for example, ">>> Aligned by MAFFT" to the end of a header, this will be displayed above the table of sequences, with a light green background. Such a group description would normally be added to only one header in a group of sequences. If several headers contain ">>>", the descriptions will be concatenated. Example: Gal4 Demo.

Sequence Descriptions: If you add, for example, ">> Mutant Y57W" to the end of a header, when you touch the Taxon in this row with the mouse, this sequence description will be shown above the table of sequences, with a pink background. Example: Gal4 Demo.

">>>" and ">>" can be in either order, but both must be at the end of the header.

Hyperlinks in descriptions: If you wish to include a hyperlink in a description, replace the space following "<a" with a vertical bar "|", so your anchor tag becomes "<a|href=...>linked text</a>". This avoids having the line broken on the space within the hyperlink, which causes the link to display incorrectly in the Full Headers section. Demo: "1: Gs pilA". The only place you will see the vertical bar is in the box where the Demo is pasted. It is replaced with a space (after wrapping) in the Results.
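
The optional header additions described in this section ("PDB=xxxx", ">>>" and ">>") could be separated from the rest of a header along these lines. This is a minimal sketch, not MSAReveal's code: the function name split_header is hypothetical, and the PDB code 1abc in the usage example is a placeholder.

```python
import re


def split_header(header):
    """Return (main, group_description, sequence_description, pdb_code)."""
    # Skip the leading ">" that begins every FASTA header.
    body = header[1:] if header.startswith(">") else header
    # Split on ">>>" or ">>", trying the longer marker first.
    parts = re.split(r"(>>>|>>)", body)
    main = parts[0].strip()
    group_desc = None
    seq_desc = None
    # After splitting, markers sit at odd indexes, their text at even ones.
    for marker, text in zip(parts[1::2], parts[2::2]):
        if marker == ">>>":
            group_desc = text.strip()
        else:
            seq_desc = text.strip()
    # The PDB code must come before the descriptions, so search only main.
    m = re.search(r"PDB=([A-Za-z0-9]{4})", main)
    return main, group_desc, seq_desc, (m.group(1) if m else None)
```

For example, split_header(">sp|P04386|GAL4_YEAST PDB=1abc >> Mutant Y57W >>> Aligned by MAFFT") yields the group description "Aligned by MAFFT", the sequence description "Mutant Y57W", and the code "1abc". The two descriptions may appear in either order.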