
DNA Surveillance

Species identification with DNA


How to Use

To run the species identification engine, you will need a query sequence or a set of sequences (see below) that overlaps and can be aligned to one of the three mitochondrial segments in the database. Once the sequence is provided, DNA Surveillance aligns it and outputs a distance-based tree and a list of genetic distances between the query and the least divergent reference sequences.

Databases

The 18 reference alignments in DNA Surveillance - Carnivora are organized hierarchically. There are six categories, and each category has one dataset for each of the three segments (ATP6, COI, and cytb). The six categories are the full dataset, Pinnipedia, Terrestrial, and the three geographic sub-groups of Terrestrial:

  • FULL DATASET: contains all the sequences available for each mtDNA segment.

  • PINNIPEDIA: contains only the sequences of species belonging to the marine families Odobenidae, Otariidae and Phocidae.

  • TERRESTRIAL: contains all the species except those classified as Odobenidae, Otariidae, and Phocidae (i.e. Full dataset minus Pinnipedia). This dataset is further divided into three geographic sub-groups:
    - Africa
    - Americas
    - Eurasia
These databases help reduce computational run-time because each contains only a subset of the sequences in the full dataset. They are particularly useful if you are certain that your sample does not come from a particular taxon and/or geographic location. Geographic distributions for the "Terrestrial" species follow the maps of the 2011 IUCN Red List of Threatened Species. Species found in more than one of the three regions were kept in every alignment for a region where they occur.

Run-time

Distance-based analyses are relatively fast and usually take less than 5 minutes to finish for one query sequence. The run-time increases with alignment length, the total number of sequences in the reference alignment, the length of the query sequence, and the number of query sequences analyzed simultaneously.

The search engine will work on sequences both shorter and longer than the standard length specified for each reference alignment (126, 187, and 110 bp for ATP6, COI, and cytb, respectively). However, query sequences that are much longer than the alignment will slow down the analysis because they take longer to align. For instance, it is possible to use the complete cytochrome b gene (1140 bp) as a query instead of the standard 110 bp segment flanked by bases 292 and 403 of the gene, but this analysis takes more time to finish than if the 110 bp segment is isolated prior to running the algorithm. In addition, because some alignments have more sequences than others, the run-time varies among datasets.
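Isolating the standard segment before submitting a query can be sketched in a few lines of Python. This is only an illustration: the function name `extract_segment` is hypothetical, and it assumes that your sequence starts at base 1 of the gene and that "flanked by" means the two flanking bases themselves are excluded (which gives exactly 110 bp for flanks 292 and 403).

```python
def extract_segment(gene: str, left_flank: int, right_flank: int) -> str:
    """Return the bases strictly between two 1-based flanking positions.

    Assumption: position 1 of `gene` is base 1 of the gene, and the
    flanking bases themselves are not part of the segment.
    """
    if not 1 <= left_flank < right_flank <= len(gene):
        raise ValueError("flanking positions must lie within the sequence")
    # 1-based positions left_flank+1 .. right_flank-1 correspond to the
    # 0-based Python slice [left_flank : right_flank - 1].
    return gene[left_flank:right_flank - 1]
```

For a complete 1140 bp cytochrome b sequence, `extract_segment(gene, 292, 403)` returns the 110 bases at positions 293 to 402. If your sequence does not start at the first base of the gene, the coordinates need to be shifted accordingly.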

Cluster (Simple) Analysis

To try a sequence:

1) Select the sequence, including the FASTA header line [e.g., >ATP6(Unknown 1)] and copy it to the clipboard.
2) Go to the Cluster (Simple) link.
3) Paste the sequence in the Data Entry window.
4) Click on the appropriate button to select the database of interest.
5) Click the Submit button.
6) Wait a few seconds before you click the Retrieve Results button to see the results of the phylogenetic analysis.
______________________________________________________________________________________________________

Example Data

>ATP6(Unknown 1)
TTCATTACCCCAACAATAATAGGACTGCCTATTGTTATATTAATCATTATATTCCCAAGTATTCTATTTCCATCGCCCAACCGAC
TGATTAATAACCGCCTAATCTCACTGCAACAATGACTAGTA


>COI(Unknown 2)
ACCTGCTATATCTCAGTATCAAACACCTTTATTTGTTTGATCTGTCTTAATTACTGCTGTTTTACTACTCTTATCACTACCAGTT
TTAGCAGCTGGTATTACTATGTTATTAACTGATCGAAATTTAAATACCACTTTCTTTGATCCTGCTGGAGGAGGGGACCCTATTT
TATATCAACACTTATTC


>cytb(Unknown 3)
TAGGACGAGGCCTATACTACGGATCCTATATATTTCCTGAGACATGAAATATCGGCATCATCCTATTATTTACAGTGATAGCAAC
TGCATTCATAGGTTACGTTTTACCA
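
The example records above wrap across several lines. If you want to check a query of your own before pasting it, a minimal Python sketch such as the following joins the wrapped lines and lets you compare each length with the standard segment lengths given under Run-time. The function names `parse_fasta` and `segment_of` are hypothetical, and the sketch assumes headers follow the pattern of the examples, with the segment name before the opening parenthesis.

```python
# Standard segment lengths in bp, as stated in the Run-time section.
STANDARD_LENGTHS = {"ATP6": 126, "COI": 187, "cytb": 110}


def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without '>') to its joined, upper-case sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any FASTA header")
        else:
            records[header] += line.upper()  # join wrapped sequence lines
    return records


def segment_of(header: str) -> str:
    """Guess the segment name from a header such as 'ATP6(Unknown 1)'."""
    return header.split("(", 1)[0].strip()
```

A query whose length differs from `STANDARD_LENGTHS[segment_of(header)]` will still run, as noted under Run-time; the comparison only tells you whether trimming might speed up the analysis.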

______________________________________________________________________________________________________

Cluster (Advanced) Analysis

On the Cluster (Advanced) link, the analysis can be run with bootstrapping. Bootstrap support is displayed only when at least 50% of the pseudoreplicates contain the clade. The phylogenetic tree displayed is the estimated tree (the same topology as in the Cluster (Simple) analysis), not the consensus of the bootstrap trees. Bootstrap analyses take longer than a simple search, and the time required increases with the number of pseudoreplicates. The screen will be refreshed approximately every 10 seconds. Alternatively, you can have the results sent to your email when the analysis is finished.
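
The display rule for bootstrap support can be illustrated with a short Python sketch. It is not the code used by DNA Surveillance; it assumes, for illustration only, that each tree is represented as a collection of clades and each clade as a `frozenset` of sequence names. The function name `bootstrap_support` is hypothetical.

```python
def bootstrap_support(estimated_clades, replicate_clades, threshold=50.0):
    """Return {clade: support in percent} for clades of the estimated tree
    that occur in at least `threshold` percent of the pseudoreplicates.

    estimated_clades: iterable of frozensets (clades of the estimated tree)
    replicate_clades: list with one set of frozensets per pseudoreplicate
    """
    n = len(replicate_clades)
    if n == 0:
        raise ValueError("at least one pseudoreplicate is required")
    shown = {}
    for clade in estimated_clades:
        # Count the pseudoreplicate trees that contain this exact clade.
        hits = sum(1 for replicate in replicate_clades if clade in replicate)
        support = 100.0 * hits / n
        if support >= threshold:
            shown[clade] = support
    return shown
```

Only clades of the estimated tree are scored, which mirrors the point above that the displayed topology is the estimated tree and not a consensus of the bootstrap trees.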

Results

DNA Surveillance outputs a distance tree and a table of pairwise genetic distances calculated between the query and several reference sequences. The tree is color-coded by the family to which each species belongs (see The Science). The query sequence name remains in black typeface, which makes it easier to find on the tree. All reference sequences on the tree are named with the Linnean binomial followed by the GenBank accession number. When a particular sequence is not from GenBank, an identification number from another database (e.g. BOLD) is used. Sequences identified as "CAR###" in the ATP6 and COI datasets can be downloaded only from BOLD. The distance table is ordered from the least to the most divergent sequence.
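
To make the ordering of the distance table concrete, the following Python sketch computes a simple distance between an aligned query and each reference and sorts the result from least to most divergent. The distance model used by DNA Surveillance is not stated in this section, so the uncorrected p-distance here is an assumption chosen for simplicity, and the names `p_distance` and `distance_table` are hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_table(query: str, references: dict[str, str]) -> list[tuple[str, float]]:
    """Return (reference name, distance) rows, least divergent first."""
    rows = [(name, p_distance(query, seq)) for name, seq in references.items()]
    return sorted(rows, key=lambda row: row[1])
```

The first row of the returned list corresponds to the top of the distance table, that is, the reference sequence closest to the query.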

Potential problems


Additional Analyses

The best way to resolve inconsistencies in identifications is to increase the number of segments used. Using all three segments will deliver more accurate identifications than using only one. However, this can be prohibitive in some cases. Another good approach is to rework the reference alignment to include only the species that are known to occur in the area where the sample was collected. For instance, if a fecal sample collected in the Cockscomb Basin (Belize) groups in the Panthera genus, it is very likely that this sample is from a jaguar because leopards, tigers, lions, and snow leopards do not occur naturally in the Neotropics. The geographic scope of the analysis could be reduced even further to the country, biome, or local level, as long as the species inventory for the area is good enough that exclusions can be made with confidence.
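
Restricting a reference set to locally occurring species can be sketched in Python as follows. This is a hedged illustration only: it assumes that you have the reference sequences as a mapping from label to sequence, and that each label starts with the genus and species separated by a space or underscore. The exact label format, the function names `binomial` and `restrict_to_area`, and any accession numbers you see in test data are assumptions, not part of DNA Surveillance.

```python
def binomial(label: str) -> str:
    """Return 'Genus species' from a label that starts with the binomial."""
    parts = label.replace("_", " ").split()
    if len(parts) < 2:
        raise ValueError(f"label does not start with a binomial: {label!r}")
    return f"{parts[0]} {parts[1]}"


def restrict_to_area(references: dict[str, str], local_species: set[str]) -> dict[str, str]:
    """Keep only the references whose species is in the local inventory."""
    wanted = {name.lower() for name in local_species}
    return {
        label: seq
        for label, seq in references.items()
        if binomial(label).lower() in wanted
    }
```

The quality of the result depends entirely on the species inventory passed as `local_species`: a species missing from the inventory is silently excluded from the reduced alignment, so the list should be compiled conservatively.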

There are alternative methods that attempt to use all the information in the sequences to assign unknown samples to species. These methods can be particularly useful in cases where closely related species remain paraphyletic with these short segments. They include character-based analysis (Rach et al. 2008, Chaves et al. 2012), Bayesian analysis (Munch et al. 2008), and the use of logic formulas (Bertolazzi et al. 2009). We refer the user to the resources cited above for detailed information on these methods.