DNA-surveillance - Species Identification with DNA


How to Use

To run the species identification engine, you will need a query sequence or a set of sequences (see below) that overlap and can be aligned to one of the three mitochondrial segments in the database. Once the sequence is provided, DNA Surveillance aligns it to the reference alignment and outputs a distance-based tree and a list of genetic distances between the query and the least divergent reference sequences.

Databases

The 18 reference alignments in DNA Surveillance - Carnivora are organized hierarchically. There are six categories, and each category has one dataset for each of the three segments (ATP6, COI, and cytb). The categories are as follows:

FULL DATASET: contains all the sequences available for each mtDNA segment.
PINNIPEDIA: contains only the sequences of species belonging to the marine families Odobenidae, Otariidae and Phocidae.
TERRESTRIAL: contains all the species except those classified as Odobenidae, Otariidae, and Phocidae (i.e. Full dataset minus Pinnipedia). This dataset is further divided into three geographic sub-groups:
- Africa
- Americas
- Eurasia

These different databases help reduce computational run-time because each contains only a subset of the sequences in the full dataset. They are particularly useful if you are certain that your sample does not come from a particular taxon and/or geographic location. Geographic distributions for the "Terrestrial" species follow the maps of the 2011 IUCN Red List of Threatened Species. Species found in more than one of the three regions were kept in every alignment for a region where they occur.

Run-time

Distance-based analyses are relatively fast and usually take less than 5 minutes to finish for one query sequence. The run-time increases with alignment length, the total number of sequences in the reference alignment, the length of the query sequence, and the number of query sequences analyzed simultaneously.

The search engine will work on sequences that are shorter or longer than the standard length specified for each reference alignment (126, 187, and 110 bp for ATP6, COI, and cytb, respectively). However, query sequences that are much longer than the alignment will slow down the analysis because they take longer to align. For instance, it is possible to use the complete cytochrome b gene (1140 bp) as a query instead of the standard 110 bp segment flanked by bases 292 and 403 of the gene, but this analysis takes longer to finish than if the 110 bp segment is isolated prior to running the algorithm. In addition, because some alignments have more sequences than others, the run-time varies among datasets.
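As a minimal sketch of isolating the standard segment before submission, the Python snippet below trims a complete cytb gene to the bases lying between positions 292 and 403. It assumes the query starts at the first base of the gene and contains no gaps or insertions. The function name `extract_segment` is hypothetical and is not part of DNA Surveillance.

```python
def extract_segment(gene_seq: str, left_flank: int = 292, right_flank: int = 403) -> str:
    """Return the bases strictly between two 1-based flanking positions.

    Assumption: gene_seq starts at base 1 of the gene and is ungapped.
    """
    if len(gene_seq) < right_flank:
        raise ValueError("sequence is too short to contain both flanking bases")
    # 1-based positions 293..402 correspond to the 0-based slice [292:402].
    return gene_seq[left_flank:right_flank - 1]


if __name__ == "__main__":
    # Toy 1140 bp sequence; the target segment is marked with 'C'.
    toy_gene = "A" * 292 + "C" * 110 + "G" * 738
    segment = extract_segment(toy_gene)
    print(len(segment))  # 110
```

If the query does not begin at the first base of the gene, the coordinates no longer apply and the segment has to be located by alignment instead.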

Cluster (Simple) Analysis

To try a sequence:

1) Select the sequence, including the FASTA header line [e.g., >ATP6(Unknown 1)] and copy it to the clipboard.
2) Go to the Cluster (Simple) link.
3) Paste the sequence in the Data Entry window.
4) Click on the appropriate button to select the database of interest.
5) Click the Submit button.
6) Wait a few seconds before you click the Retrieve Results button to see the results of the phylogenetic analysis.

The FASTA header is not needed if only one query sequence is used. DNA Surveillance will automatically name
the query as USER SEQUENCE in this case. The identification engine also accepts multiple sequences at the
same time, in which case one FASTA header line is required to identify each sequence.
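The naming rule described above can be sketched in Python as follows. This is an illustration of how pasted input could be split into named queries, not the code DNA Surveillance actually runs. The function name `parse_queries` is hypothetical.

```python
from typing import List, Tuple


def parse_queries(text: str, default_name: str = "USER SEQUENCE") -> List[Tuple[str, str]]:
    """Split pasted FASTA text into (name, sequence) pairs.

    A sequence with no header line gets the default name.
    """
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the record collected so far, if any.
            if name is not None or chunks:
                records.append((name or default_name, "".join(chunks)))
            name = line[1:].strip()
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None or chunks:
        records.append((name or default_name, "".join(chunks)))
    return records


if __name__ == "__main__":
    print(parse_queries("ttcattac\ncccaacaa"))
    print(parse_queries(">ATP6(Unknown 1)\nTTCATTAC\n>cytb(Unknown 3)\nTAGGACGA"))
```

Sequence lines that wrap over several lines, as in the example data, are joined into a single sequence per header.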

______________________________________________________________________________________________________

Example Data

>ATP6(Unknown 1)
TTCATTACCCCAACAATAATAGGACTGCCTATTGTTATATTAATCATTATATTCCCAAGTATTCTATTTCCATCGCCCAACCGAC
TGATTAATAACCGCCTAATCTCACTGCAACAATGACTAGTA

>COI(Unknown 2)
ACCTGCTATATCTCAGTATCAAACACCTTTATTTGTTTGATCTGTCTTAATTACTGCTGTTTTACTACTCTTATCACTACCAGTT
TTAGCAGCTGGTATTACTATGTTATTAACTGATCGAAATTTAAATACCACTTTCTTTGATCCTGCTGGAGGAGGGGACCCTATTT
TATATCAACACTTATTC

>cytb(Unknown 3)
TAGGACGAGGCCTATACTACGGATCCTATATATTTCCTGAGACATGAAATATCGGCATCATCCTATTATTTACAGTGATAGCAAC
TGCATTCATAGGTTACGTTTTACCA
______________________________________________________________________________________________________

Cluster (Advanced) Analysis

On the Cluster (Advanced) link, the analysis can be run with bootstrapping. Bootstrap support is displayed only for clades that appear in at least 50% of the pseudoreplicates. The phylogenetic tree displayed is the estimated tree (the same topology as in the Cluster (Simple) analysis), not the consensus of the bootstrap trees. Bootstrap analyses take longer than a simple search, and the run-time increases with the number of pseudoreplicates. The screen will be refreshed approximately every 10 seconds. Alternatively, you can have the results sent to your email when the analysis is finished.
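The display rule for bootstrap support can be illustrated with a short Python sketch. It assumes that each tree is summarized as a collection of clades, with each clade represented as a frozenset of taxon names. The function name `bootstrap_labels` and the toy taxa are hypothetical, and the site's own implementation may differ.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

Clade = FrozenSet[str]


def bootstrap_labels(
    estimated_clades: Iterable[Clade],
    replicate_clades: List[Set[Clade]],
    threshold: float = 0.5,
) -> Dict[Clade, int]:
    """Return percent support for clades of the estimated tree.

    Only clades found in at least `threshold` of the pseudoreplicate
    trees receive a label; the topology itself is never changed.
    """
    n = len(replicate_clades)
    if n == 0:
        raise ValueError("at least one pseudoreplicate is required")
    labels: Dict[Clade, int] = {}
    for clade in estimated_clades:
        support = sum(clade in rep for rep in replicate_clades) / n
        if support >= threshold:
            labels[clade] = round(100 * support)
    return labels


if __name__ == "__main__":
    ab = frozenset({"A", "B"})
    cd = frozenset({"C", "D"})
    replicates = [{ab}, {ab, cd}, {ab}, set()]
    # Clade A+B occurs in 3 of 4 replicates, clade C+D in 1 of 4.
    print(bootstrap_labels([ab, cd], replicates))
```

Because the labels are attached to the estimated tree, a clade with low support still appears in the tree. It simply carries no support value.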

Results

DNA Surveillance outputs a distance tree and a table of pairwise genetic distances calculated between the query and several reference sequences. The tree output is color-coded by the family to which each species belongs (see The Science). The query sequence name remains in black typeface, which makes it easier to find on the tree. All reference sequences on the tree are named with the Linnean binomial nomenclature followed by the GenBank accession number. When a particular sequence is not from GenBank, an identification number from another database (e.g. BOLD) is used. Sequences identified as "CAR###" in the ATP6 and COI datasets can be downloaded only from BOLD. The distance table is ordered from the least to the most divergent sequence.
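To make the ordering of the distance table concrete, the Python sketch below ranks reference sequences by their distance to a query. This page does not state which distance model the site uses, so the uncorrected p-distance (the proportion of differing sites) is used here purely as an assumption. The function names, reference labels, and sequences are all made up for illustration.

```python
from typing import Dict, List, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def distance_table(query: str, references: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (reference name, distance) rows, least divergent first."""
    rows = [(name, p_distance(query, seq)) for name, seq in references.items()]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    refs = {"ref_c": "TCGAACGA", "ref_a": "ACGTACGT", "ref_b": "ACGTACGA"}
    for name, dist in distance_table("ACGTACGT", refs):
        print(f"{name}\t{dist:.3f}")
```

A query identical to a reference haplotype appears at the top of the table with a distance of zero.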

Potential problems

Missing species: Some species are not yet represented in any of the databases (for a list see here). Sequences of these missing species will tend to show high genetic distances from other sequences. If more than one sequence of the same species is available, these "unknowns" will tend to group close to one another in the tree. If you have reference sequences that are not in the DNA Surveillance - Carnivora database, you can run the analysis with these sequences along with the "unknowns". These sequences will most likely form their own group in the tree and will serve to identify other sequences of the same species. If you want to add your sequences to DNA Surveillance - Carnivora, please contact us.
Paraphyletic groups: Due to the short length of the sequences and recent evolutionary history, some closely related species can form paraphyletic groups, which may affect the identification of unknown samples when it is based solely on the assumption of monophyletic group affinity. If the unknown sequence in question is identical to a haplotype of one of the species, then identification is straightforward. However, if the new sequence does not match any of the haplotypes, then alternative methods (e.g. character-based analysis, see Additional Analyses) may help resolve the issue.
Sequence misidentification: Some reference sequences may have been attributed to the wrong species. One reason this might occur is that taxonomy changes over time. Another is that some researchers tend to lump different species into one taxon, so that sequences of multiple species are attributed to only one. This can be particularly problematic for the cytb database, which is assembled exclusively from GenBank sequences. The cytb sequences have been accumulated over a long period of time by several research groups, and often these sequences are not associated with voucher specimens. In these cases, we strongly recommend that the user refer back to the original publication to check the taxonomic assignment and geographic origin of the sample.
Not a carnivoran: Your query sequence may belong to a species that is not a carnivoran. This might happen due to PCR contamination with either endogenous or exogenous DNA, or because the sample was not correctly identified as being from a carnivoran. These sequences show a similar pattern of genetic distance and placement in the tree as the sequences of missing species. A GenBank search using BLAST can help establish beforehand the affinity of an unknown sample at a higher taxonomic level (e.g. Order).

Additional Analyses

The best way to resolve inconsistencies in identifications is to increase the number of segments used. Using all three segments will deliver more accurate identifications than using only one. However, this can be prohibitive in some cases. Another good way to proceed is to rework the reference alignment so that it includes only the species known to occur in the area where the sample was collected. For instance, if a fecal sample collected in the Cockscomb Basin (Belize) groups within the genus Panthera, it is very likely that this sample is from a jaguar, because leopards, tigers, lions, and snow leopards do not occur naturally in the Neotropics. The geographic scope of the analysis could be reduced even further to the country, biome, or local level, as long as the species inventory for the area is good enough that exclusions can be made with confidence.
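Restricting a reference set to the species of one area could look like the Python sketch below. It assumes that each reference label begins with the binomial name followed by an identifier separated by spaces. The accession strings (`ACC0001` and so on) and the sequences are placeholders rather than real records, and both function names are hypothetical.

```python
from typing import Dict, Set


def species_of(label: str) -> str:
    """Extract 'Genus species' from a label such as 'Genus species ACCESSION'."""
    parts = label.split()
    if len(parts) < 2:
        raise ValueError(f"label does not start with a binomial: {label!r}")
    return f"{parts[0]} {parts[1]}"


def restrict_to_region(references: Dict[str, str], regional_species: Set[str]) -> Dict[str, str]:
    """Keep only the references whose species is on the regional inventory."""
    return {
        label: seq
        for label, seq in references.items()
        if species_of(label) in regional_species
    }


if __name__ == "__main__":
    # Placeholder labels and toy sequences, for illustration only.
    refs = {
        "Panthera onca ACC0001": "ACGTACGT",
        "Panthera pardus ACC0002": "ACGTACGA",
        "Panthera leo ACC0003": "ACGTTCGA",
    }
    belize_inventory = {"Panthera onca"}
    print(restrict_to_region(refs, belize_inventory))
```

The result is only as reliable as the species inventory passed in, which is why the exclusions should be made only for well-surveyed areas.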

There are alternative methods that attempt to use all the information in the sequences to assign unknown samples to species. These methods can be particularly useful in cases where closely related species remain paraphyletic with these short segments. The alternative methods include character-based analysis (Rach et al. 2008, Chaves et al. 2012), Bayesian analysis (Munch et al. 2008), and the use of logic formulas (Bertolazzi et al. 2009). We refer the user to the resources cited above for detailed information on these methods.
