DNA-surveillance - Species Identification with DNA


How do I use Witness for the Whales?
Search Strategy
Submitting a Sequence
IUPAC Nucleotide Codes
Advanced search and bootstrapping
- Bootstrapping
- Emailed response
Maximum Likelihood Analysis
The Results
Issues of Interpretation
Poorly Resolved Species Groups
Phylogenetic Robustness
Missing Species

How do I use Witness for the Whales?

If you have a tissue sample from a cetacean, you can obtain expert and reliable identification of the species in two steps:

Use standard molecular laboratory techniques to obtain nucleotide sequence from the mtDNA control region (5'end) OR mtDNA cytochrome b (5'end).
Submit the sequence to this site and select the appropriate reference sequence dataset for comparison. An advanced cluster search option lets you perform a bootstrap analysis, while the maximum likelihood option applies a more rigorous statistical analysis when placing your query sequence on the tree. Both the advanced cluster and maximum likelihood options can send you the results by email.

To see how it works, try some of our example sequences.

Be aware that there are issues of interpretation which you must bear in mind when using this site.

Search Strategy

You will have the greatest success if you use a hierarchical or iterative approach to identifying the source of your sequence. There are several reference sets to choose from, and each is available for both the mtDNA control region and cytochrome b.

Using the simple search strategy, start with the first reference set (All Cetaceans) to determine the suborder or family which is most closely related to your sequence. You may wish to view a summary of the phylogeny of whales, dolphins and porpoises which forms the basis of our reference datasets.
Then choose one of the more specific and more detailed reference sets to fine-tune your analysis.
Then, repeat the search using the advanced mode, and use bootstrap resampling to evaluate the robustness of your identification.
If bootstrap support for the species grouping of your test sequence is low, it may be worth doing a full alignment (versus a profile alignment) with the appropriate reference dataset.

If your sequence is from a humpback whale, select the humpback whale population reference set to determine the source population.

Submitting a Sequence

To submit a sequence for analysis:

click on the Simple search link
paste your sequence into the Data Entry window
select the reference dataset and the genomic locus
click on the Submit button

Your sequence must be either in FASTA format or a plain-text nucleotide sequence. Upper or lower case may be used. For example:

>mysample
ACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAGCTGAAGGAATC
GTAGAAATTAAACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAG
CTGAAGGAATCTGTAGAAATTAA

ACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAGCTGAAGGAATC
GTAGAAATTAAACCATAATAGTACAGCTGAAGGAATCTGTAGAAATTAAACCATAATAGTACAG
CTGAAGGAATCTGTAGAAATTAA

Only one sequence may be submitted at a time.

If your sequence contains illegal characters, that is, characters not included in the IUPAC nucleotide codes listed below, it will be rejected with an error message. If your sequence contains any of the ambiguity codes, they will be used both in aligning the sequence and in calculating evolutionary distances.
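The submission rules above can be checked locally before pasting a sequence into the Data Entry window. The following is a minimal sketch only, not the code this site runs: the function name `read_query` is hypothetical, and treating gap characters such as `-` as illegal is an assumption.

```python
# Symbols from the IUPAC Nucleotide Codes table on this page.
IUPAC_CODES = set("GATCURYMKSWHBVDN")


def read_query(text):
    """Return (name, sequence) from FASTA or bare nucleotide text.

    name is None when no FASTA header line is present.
    Raises ValueError for a second sequence or for illegal characters.
    """
    name = None
    parts = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header after a header or after sequence data means a second record.
            if name is not None or parts:
                raise ValueError("only one sequence may be submitted at a time")
            name = line[1:].strip()
        else:
            parts.append(line.upper())  # upper and lower case are both accepted
    sequence = "".join(parts)
    illegal = sorted(set(sequence) - IUPAC_CODES)
    if illegal:
        raise ValueError("illegal characters: " + "".join(illegal))
    return name, sequence
```

Calling `read_query(">mysample\nACCATAATAG")` would return the name `mysample` and the sequence in upper case; a sequence containing a character such as `Z` raises `ValueError`.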

Your sequence will be analysed automatically. Please wait about 15 seconds and then click the Retrieve Results button to view your results. It will take longer for results to become available if full alignment and/or bootstrap resampling are requested.

IUPAC Nucleotide Codes

Ambiguous	Symbol	Meaning	Origin of designation
	G	G	Guanine
	A	A	Adenine
	T	T	Thymine
	C	C	Cytosine
	U	U	Uracil
X	R	G or A	puRine
X	Y	T or C	pYrimidine
X	M	A or C	aMino
X	K	G or T	Keto
X	S	G or C	Strong interaction (3 H bonds)
X	W	A or T	Weak interaction (2 H bonds)
X	H	A or C or T	not-G, H follows G in the alphabet
X	B	G or T or C	not-A, B follows A
X	V	G or C or A	not-T (not-U), V follows U
X	D	G or A or T	not-C, D follows C
X	N	G or A or T or C	aNy
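For readers who process sequences in their own scripts, the table can be written as a lookup. This is an illustrative sketch; the helper names `bases` and `compatible` are hypothetical, and treating U as equivalent to T is an assumption made for comparing DNA symbols. It does not describe how this site scores ambiguity codes when calculating distances.

```python
# Each symbol mapped to the bases it can stand for, as in the table above.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT", "N": "GATC",
}


def bases(symbol):
    """Set of DNA bases a symbol can represent (assumption: U is read as T)."""
    return set(IUPAC[symbol.upper()].replace("U", "T"))


def compatible(a, b):
    """True if two symbols share at least one possible base."""
    return bool(bases(a) & bases(b))
```

For example, `compatible("R", "A")` is true because R covers G or A, whereas `compatible("R", "Y")` is false because purines and pyrimidines do not overlap.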

Advanced search and bootstrapping

The Advanced search window adds further functions to the search process:

Bootstrapping

To perform a bootstrap analysis:

click on the Advanced search link
paste your sequence into the Data Entry window
select the reference dataset and genomic locus
select the number of bootstrap replicates you require
optionally enter an email address to which the results will be sent
click on the Submit button

A bootstrap analysis will take longer than a simple search. The length of time will depend on the number of pseudoreplicates you have chosen and on the load on our server. Your screen will be refreshed about every 10 seconds, or you can choose to have the results sent to you by email.
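A bootstrap pseudoreplicate is conventionally built by resampling the columns of the alignment with replacement, which is why run time grows with the number of replicates. The sketch below shows that general technique; it is not this site's implementation, and the function name `bootstrap_alignment` and the dictionary layout of the alignment are assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudoreplicate of an alignment.

    alignment maps sequence names to aligned sequences of equal length.
    Columns are drawn with replacement, and the same draw is applied to
    every sequence so that each column stays intact.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }
```

A tree would then be estimated from each pseudoreplicate in the same way as from the original alignment.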

Emailed response

You can choose to have the results sent to you by email. If you enter an optional email address, you can close your browser once the search has been submitted.

Maximum Likelihood Analysis

The reference alignment, and the associated phylogenetic tree, are considered to be prior knowledge about the relationships among the reference organisms. Potentially the query sequence can be joined to that tree on any branch. We seek the connection point that has the highest statistical likelihood, thereby giving the maximum likelihood estimate of the relationship between the query and reference sequences. The maximum likelihood connection point is represented in the output by a dashed branch. For a particular connection point the determined likelihood score is the maximum likelihood estimate under the associated topology (that is, all the branch lengths are re-optimised for each connection point).

The Shimodaira-Hasegawa (SH) test is used to place a confidence limit on the connection point with the highest expected likelihood. The expected likelihood of a connection point is the expectation of its likelihood (treated as a random variable) under the true process of evolution. The SH test calculates this confidence limit by simulating replicate datasets under an approximation of the least favourable configuration (LFC), in which all connection points have equal expected likelihoods, and comparing the observed differences in likelihood with the distribution of differences expected under the LFC.

The implementation used here simulates 1000 non-parametric bootstrap replicates and uses the RELL approximation (Shimodaira and Hasegawa 1999). Branches that represent connection points within the confidence limit are coloured red. A critical value of α = 0.05 is used (95% confidence limit).
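The procedure can be sketched in a few lines. This is a simplified illustration of an SH test with RELL resampling, not the code used by this site. It assumes that per-site log-likelihoods for each candidate connection point are already available from a maximum likelihood program; the function name `sh_test` and its parameters are hypothetical.

```python
import random


def sh_test(site_lnl, n_boot=1000, seed=0):
    """Return one SH p-value per connection point.

    site_lnl[i][k] is the log-likelihood of site k under connection point i.
    Points with a p-value of at least alpha lie inside the confidence limit.
    """
    rng = random.Random(seed)
    n_hyp = len(site_lnl)
    n_sites = len(site_lnl[0])

    # Observed statistic: how far each point falls below the best one.
    totals = [sum(row) for row in site_lnl]
    observed = [max(totals) - t for t in totals]

    # RELL: resample sites with replacement and re-sum the existing
    # per-site log-likelihoods instead of re-optimising each replicate.
    boot = []
    for _ in range(n_boot):
        sites = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot.append([sum(row[k] for k in sites) for row in site_lnl])

    # Centring each point on its bootstrap mean approximates the LFC,
    # where all points have equal expected likelihood.
    means = [sum(rep[i] for rep in boot) / n_boot for i in range(n_hyp)]
    counts = [0] * n_hyp
    for rep in boot:
        centred = [rep[i] - means[i] for i in range(n_hyp)]
        top = max(centred)
        for i in range(n_hyp):
            if top - centred[i] >= observed[i]:
                counts[i] += 1
    return [c / n_boot for c in counts]
```

With α = 0.05, the connection points whose returned p-value is at least 0.05 would correspond to the branches inside the 95% confidence limit. The maximum likelihood connection point always receives a p-value of 1.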

The Results

The results will be displayed first as a phylogenetic tree in which the differences between sequences are proportional to the lengths of the horizontal branches separating the tips. The names of the reference species are colour-coded to help you identify close relatives. To save a copy of the tree as a PNG-format file, right-click (PC) or control-click (Mac) on the image and choose Download Image to Disk, or similar, from the pop-up menu.

If you have performed a bootstrap analysis, the resulting phylogenetic tree will display numbers at some of the nodes. These numbers are the percentage of bootstrap pseudoreplicates that contain the clade formed by the subtree starting at that node. This measure of bootstrap support is displayed only when at least 50% of the pseudoreplicates contain the clade. The phylogenetic tree displayed is the estimated tree, and not the consensus of the bootstrap pseudoreplicate trees.
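The support value at a node is a simple count, as the following sketch shows. It assumes each tree has already been reduced to the set of clades it contains, with each clade stored as a `frozenset` of tip names; the function name `bootstrap_support` and this data layout are hypothetical and not taken from this site.

```python
def bootstrap_support(estimated_clades, replicate_clades, threshold=50.0):
    """Percentage of pseudoreplicate trees containing each estimated clade.

    estimated_clades: clades (frozensets of tip names) of the estimated tree.
    replicate_clades: one set of clades per bootstrap pseudoreplicate tree.
    Only clades reaching the threshold are returned, mirroring the
    50% display rule described above.
    """
    support = {}
    for clade in estimated_clades:
        hits = sum(1 for rep in replicate_clades if clade in rep)
        pct = 100.0 * hits / len(replicate_clades)
        if pct >= threshold:
            support[clade] = pct
    return support
```

Because support is always reported for clades of the estimated tree, a clade that is common among the pseudoreplicates but absent from the estimated tree never appears in the output.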

If you scroll further down past the tree, you will also find a table showing the evolutionary distances between the user-submitted sequence and each of the sequences in the reference set. Sites having IUPAC ambiguity codes are included in the calculation of evolutionary distances. To save the contents of the table to disk, select all of the table, copy it, open a text file document on your computer (eg Notepad or SimpleText) and then paste it in.

If you scroll down further again, there is a text version of the phylogenetic tree in Newick format. To save this to disk, select the contents of the text box in which it is displayed, copy it, open a text file document on your computer (eg Notepad or SimpleText) and then paste it in.
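Once saved, the Newick text can be read by most phylogenetics software or inspected with a short script. The sketch below lists the tip names in a Newick string; it is a simplified illustration that assumes unquoted labels, and the function name `newick_tips` and the example tree are hypothetical.

```python
import re


def newick_tips(newick):
    """Return tip labels in the order they appear in a Newick string.

    A tip label directly follows "(" or ","; labels that follow ")" are
    internal node labels (for example bootstrap values) and are skipped.
    Assumption: labels are unquoted and contain none of ( ) , : ;
    """
    labels = re.findall(r"[(,]\s*([^\s(),:;][^(),:;]*)", newick)
    return [label.strip() for label in labels]


tips = newick_tips("((humpback:0.1,query:0.02)87:0.05,fin:0.3);")
```

Here `tips` holds the three tip names, while the internal label `87` is ignored.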

You can fine-tune your analysis by clicking on the Submit a sequence link to return to the Data Entry page, where you can choose a different reference set.

Issues of Interpretation

Is it a cetacean?

Witness for the Whales is an online service for the identification of cetaceans by phylogenetic analysis. Its scope is limited to the cetaceans, and any submitted sequence will be treated as if it were derived from a cetacean. A simple system has been implemented to flag sequences which might give unreliable results. Nevertheless, it remains the responsibility of the user to decide whether a phylogenetic analysis is appropriate in their individual case. The user should also seek other evidence to corroborate that any DNA sequence which they submit is actually cetacean in origin, perhaps by searching GenBank.

Poorly Resolved Species Groups

Deriving a phylogenetic tree from DNA sequence data that reflects the taxonomy of cetaceans depends on the alignment of the sequences and on the ability of the locus in question to differentiate among the species. This rests on the assumption that species recognised on traditional morphological grounds will also possess diagnostic genetic characters distinguishing them from all other species. In some groups (e.g., subfamily Delphininae), this does not seem to be the case, probably because of the recent and rapid diversification of these species. Because of this problem, a warning note will appear on screen for all user-submitted sequences identified as members of the subfamily Delphininae. When in doubt, consult our reference phylogenetic tree.

Due to the rapid rate of mutation of the mtDNA control region and the frequency of insertion/deletion mutations (indels), it can be difficult to align sets of sequences which represent a large proportion of the genetic diversity observed among cetaceans at this locus (i.e., the "All Cetaceans" and "Odontocetes" datasets). Establishing positional homology among all nucleotide sites in alignments of sequences in these datasets is problematic, and multiple alignments are often equally plausible. Consequently, test sequences that are compared to the mtDNA control region "All Cetaceans" and "Odontocetes" datasets may be somewhat "mis-aligned" and, as a result, may be slightly misplaced on the phylogenetic tree. Nevertheless, all test sequences will be placed close to the appropriate group. This problem is resolved as the user searches further down through the hierarchical series of datasets.

When in doubt about the species identification suggested by the phylogenetic analysis, resubmit your sequence using a reference dataset giving finer resolution (e.g., at the family or subfamily level), use the advanced search mode with the full alignment method and/or bootstrap resampling of the data, or do all of these. Other sources of information, including other loci, may be needed to provide corroborating results.

Phylogenetic Robustness

The loci (mtDNA control region and cytochrome b) and method of phylogenetic analysis (evolutionary distances + neighbour-joining tree) used here are geared specifically to addressing questions of species or population identity. They may not be as well suited to the robust reconstruction of higher-level relationships among more distantly related cetacean species. As such, many of the higher-level relationships suggested by Witness for the Whales are unstable and should not be considered an accurate reflection of the evolutionary relationships among these taxa (e.g., for the family Ziphiidae, in which reconstructions suggest that the genus Mesoplodon is not monophyletic; see Dalebout (2002) for further discussion regarding higher-level relationships in this speciose family).

Missing Species

The tables summarising the species represented in each reference database (mtDNA control region and cytochrome b) should be consulted prior to making any conclusions regarding the identity of any sample.
