Microbiome Utilities Portal of the Broad Institute

Microbiome Analysis Utilities

ChimeraSlayer

ChimeraSlayer (download) is a chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (~500 bp).

ChimeraSlayer flags chimeric 16S rRNA sequences through the following series of steps: (A) the ends of a query sequence are searched against an included database of reference chimera-free 16S sequences to identify potential parents of a chimera; (B) candidate parents of a chimera are selected as those that form a branched best scoring alignment to the NAST-formatted query sequence; (C) the NAST alignment of the query sequence is improved in a 'chimera-aware' profile-based NAST realignment to the selected reference parent sequences; and (D) an evolutionary framework is used to flag query sequences found to exhibit greater sequence homology to an in silico chimera formed between any two of the selected reference parent sequences.
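The idea behind step (D) can be illustrated with a minimal Python sketch. It is not ChimeraSlayer's actual scoring, which relies on divergence ratios and bootstrap support; it only shows how an in silico chimera can be built from two aligned parents and compared with the query. The function names and the toy sequences are hypothetical, and the sequences are assumed to be aligned to equal length.

```python
GAPS = {"-", "."}  # assumption: gap characters used in the alignment


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns, skipping columns that are gaps in both."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x in GAPS and y in GAPS:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def make_chimera(left_parent: str, right_parent: str, breakpoint: int) -> str:
    """Join the left part of one parent to the right part of the other."""
    return left_parent[:breakpoint] + right_parent[breakpoint:]


def best_chimera(query: str, parent_a: str, parent_b: str):
    """Try every breakpoint in both orientations; return (pct_id, orientation, breakpoint)."""
    best = (-1.0, "", 0)
    for bp in range(1, len(query)):
        for label, left, right in (
            ("left_A,right_B", parent_a, parent_b),
            ("left_B,right_A", parent_b, parent_a),
        ):
            pid = percent_identity(query, make_chimera(left, right, bp))
            if pid > best[0]:
                best = (pid, label, bp)
    return best


# Toy data (hypothetical): the query is B on the left and A on the right.
parent_a = "AAAAAAAAAA"
parent_b = "CCCCCCCCCC"
query = "CCCCAAAAAA"
result = best_chimera(query, parent_a, parent_b)
```

In this toy case the query matches the (left_B, right_A) chimera better than it matches either parent alone, which is the signal that step (D) looks for.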

To run ChimeraSlayer, you need NAST-formatted sequences generated by the included NAST-iEr utility. Given NAST-formatted sequences, run ChimeraSlayer like so:

% microbiomeutil/ChimeraSlayer/ChimeraSlayer.pl --query_NAST ${sequences}.NAST

The output files include the following:

${sequences}.NAST.CPS                      :results from the chimera parent selection step
${sequences}.NAST.CPS_RENAST               :NAST alignments from a 'chimera-aware' realignment of the query
${sequences}.NAST.CPS.CPC                  :results from the chimera 'phylo-checker' step  ** the Chimera Slayer final verdict **
${sequences}.NAST.CPS.CPC.wTaxons          :the taxonomy of the reference (step)parents of the chimera

The .CPC output file is tab-delimited with the following fields:

0      ChimeraSlayer
1      chimera_AJ007403            # the accession of the chimera query
2      S000387216                  # reference parent A
3      S000001688                  # reference parent B
4      0.9422                      # divergence ratio of query to chimera (left_A, right_B)
5      90.00                       # percent identity between query and chimera(left_A, right_B)
6      0                           # confidence in query as a chimera related to (left_A, right_B)
7      1.0419                      # divergence ratio of query to chimera (right_A, left_B)
8      99.52                       # percent identity between query and chimera(right_A, left_B)
9      100                         # confidence in query as a chimera related to (right_A, left_B)
10     YES                         # ** verdict as a chimera or not **
11     NAST:4032-4033              # estimated approximate chimera breakpoint in NAST coordinates
12     ECO:767-768                 # estimated approximate chimera breakpoint according to the E. coli unaligned reference seq coordinates
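The field layout above can be read with a short Python sketch such as the following. The `CpcRecord` type and `parse_cpc` function are hypothetical names introduced here. The sketch assumes that every result line starts with the literal `ChimeraSlayer` and that the two breakpoint fields may be absent on some lines; neither assumption is confirmed by the tool's documentation.

```python
import io
from typing import Iterator, NamedTuple, Optional, TextIO


class CpcRecord(NamedTuple):
    query: str
    parent_a: str
    parent_b: str
    div_ratio_la_rb: float   # field 4
    pct_id_la_rb: float      # field 5
    conf_la_rb: float        # field 6
    div_ratio_ra_lb: float   # field 7
    pct_id_ra_lb: float      # field 8
    conf_ra_lb: float        # field 9
    is_chimera: bool         # field 10 == "YES"
    nast_breakpoint: Optional[str]  # field 11, if present
    eco_breakpoint: Optional[str]   # field 12, if present


def parse_cpc(handle: TextIO) -> Iterator[CpcRecord]:
    """Yield one record per tab-delimited .CPC line; skip lines that do not fit."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[0] != "ChimeraSlayer":
            continue  # assumption: anything else is not a result line
        yield CpcRecord(
            fields[1], fields[2], fields[3],
            float(fields[4]), float(fields[5]), float(fields[6]),
            float(fields[7]), float(fields[8]), float(fields[9]),
            fields[10] == "YES",
            fields[11] if len(fields) > 11 else None,
            fields[12] if len(fields) > 12 else None,
        )


# The example record from the table above, joined with tabs.
sample = "\t".join([
    "ChimeraSlayer", "chimera_AJ007403", "S000387216", "S000001688",
    "0.9422", "90.00", "0", "1.0419", "99.52", "100",
    "YES", "NAST:4032-4033", "ECO:767-768",
]) + "\n"
records = list(parse_cpc(io.StringIO(sample)))
flagged = [r.query for r in records if r.is_chimera]
```

For a real file, pass an open file handle for `${sequences}.NAST.CPS.CPC` instead of the `io.StringIO` sample.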

For those query sequences flagged as chimeras, the .wTaxons file includes the following extra columns:

13      Rhodococcus                                                                # genus name of Parent A
14      Rhodococcus koreensis (T); DNP505; AF124342 Rhodococcus koreensis          # descriptive info for Parent A
15      Streptomyces                                                               # genus name of Parent B
16      Streptomyces somaliensis (T); DSM 40738; AJ007403 Streptomyces somaliensis # descriptive info for Parent B
17      INTRA-ORDER                                                                # type of chimera based on selected parents

Note: Do not blindly discard every sequence flagged as a chimera. Some may be naturally formed chimeras rather than PCR artifacts, so flagged sequences may warrant further investigation.

If you use the --printCSalignments option, a diagram of the query matching the parents on both sides of the breakpoint is included in the output. For example:

Per_id parents: 89.52

          Per_id(Q,A): 94.00
--------------------------------------------------- A: S000387216
88.65                                99.06
~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~ Q: chimera_AJ007403
DivR: 0.942 BS: 0.00     |
Per_id(QLA,QRB): 90.00   |
                         |
   (L-AB: 88.65)         |      (R-AB: 90.34)
   WinL:0-704            |      WinR:705-1449
                         |
Per_id(QLB,QRA): 99.52   |
DivR: 1.042 BS: 100.00   |
~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~ Q: chimera_AJ007403
100.00                                91.28
---------------------------------------------------- B: S000001688
           Per_id(Q,B): 95.52

DeltaL: -11.35                   DeltaR: 7.79

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GGAGGCTCGTACCGCTGTCTTGTTAAGGACTGGTTTTTTACTGTCTATACAGACTCTTCA  A: S000387216
AAGACGCTTGGGTTTCACTCCTGCGCTTCGGCCGGGCCCGGCACTCGCCACAGTCTCGAG  Q: chimera_AJ007403
AAGACGCTTGGGTTTCACTCCTGCGCTTCGGCCGGGCCCGGCACTCGCCACAGTCTCGAG  B: S000001688

!!!!!!!!!!!!!!!!!!!!
TACTACTGGATATCCTGATA  A: S000387216
CGTCGTCTTGATGTTCACAT  Q: chimera_AJ007403
CGTCGTCTTGATGTTCACAT  B: S000001688

** Breakpoint **

                           !!!!!!!
TGCGTTCGGATCGATTGTTGCCGTACGCTGTGTCGATTAAAGGTAATCATAAGGGCTTTC  A: S000387216
TGCGTTCGGATCGATTGTTGCCGTACGCCTGTGTCATTAAAGGTAATCATAAGGGCTTTC  Q: chimera_AJ007403
GTAACGATCGCTTCCAACCCATCCGGTGCTGTGTCGCCGGGCACGGCTTGGGAATTAACT  B: S000001688
!!!!!!!!!!!!!!!!!!!!!!!!!!!!       !!!!!!!!!!!!!!!!!!!!!!!!!

GACTTACGACTC  A: S000387216
GACTTACGACTC  Q: chimera_AJ007403
ATTCCCAAGTCT  B: S000001688
!!!!!!!!!!!!

The above indicates the percent identities between the alignment segments corresponding to query and either parent. Since chimeras can occur two ways: (left parent A & right parent B) or (left parent B & right parent A), a fork diagram is shown with the statistics for each potential chimera as it relates to the query sequence. The bootstrap (BS) values indicate the confidence level for the corresponding chimera type. The informative SNP positions from the complete alignments are shown for both sides of the breakpoint.

WigeoN

WigeoN (download) examines the sequence conservation between a query and a trusted reference sequence, both in NAST alignment format. Given the sequence identity between the query and the reference sequence, a certain amount of variation is expected across the alignment. If the observed variation is greater than the 95% quantile of the distribution of variation observed between non-anomalous sequences, the query is flagged as an anomaly.

WigeoN is a flexible command-line based reimplementation of the Pintail algorithm (Appl Environ Microbiol. 2005 Dec;71(12):7724-36).

WigeoN is useful for flagging chimeras and anomalies only in near full-length 16S rRNA sequences. WigeoN lacks sensitivity with sequences less than 1000 bp.

To run WigeoN, you need NAST-formatted sequences generated by the included NAST-iEr utility. Given NAST-formatted sequences, run WigeoN like so:

% microbiomeutil/WigeoN/run_WigeoN.pl --query_NAST ${sequences}.NAST > ${sequences}.WigeoN

The output is tab-delimited like so:

0       chimera_AJ007403       # query sequence
1       S000387216             # best matching reference sequence
2       div:
3       5.45                   # percent sequence divergence between the query and the reference sequence
4       stDev:
5       4.01                   # standard deviation from expected reference sequence divergence across alignment windows
6       Quant95:Yes            # stDev is in the top 5% of stDev values observed among reference sequences at that same mean divergence
7       Quant99:YES            # top 1%  *** This value is recommended for flagging aberrant sequences ***
8       Quant99.9:No           # top 0.1%
9       Quant99.99:No          # top 0.01%
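A WigeoN output line in the layout above can be parsed with a small Python sketch like this one. The function name and the returned dictionary keys are hypothetical. Because the example shows both `Yes` and `YES`, the quantile flags are compared case-insensitively, which is an assumption about the tool's output.

```python
def parse_wigeon_line(line: str) -> dict:
    """Parse one tab-delimited WigeoN output line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-delimited fields")
    quantiles = {}
    for item in fields[6:10]:
        # Each flag looks like "Quant99:YES"; split on the first colon only.
        name, _, value = item.partition(":")
        quantiles[name] = value.strip().lower() == "yes"
    return {
        "query": fields[0],
        "reference": fields[1],
        "divergence": float(fields[3]),  # field 2 is the literal label "div:"
        "st_dev": float(fields[5]),      # field 4 is the literal label "stDev:"
        "quantiles": quantiles,
    }


# The example record from the table above, joined with tabs.
sample_line = "\t".join([
    "chimera_AJ007403", "S000387216", "div:", "5.45", "stDev:", "4.01",
    "Quant95:Yes", "Quant99:YES", "Quant99.9:No", "Quant99.99:No",
])
record = parse_wigeon_line(sample_line)
# Quant99 is the flag recommended above for marking aberrant sequences.
is_aberrant = record["quantiles"]["Quant99"]
```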

NAST-iEr

The NAST-iEr alignment utility (download) aligns a single raw nucleotide sequence against one or more NAST formatted sequences.

The alignment algorithm involves global dynamic programming profile alignment to fixed (NAST-formatted) multiply aligned template sequences without any end-gap penalty.

Run it like so, using a set of fasta-formatted sequences.

% microbiomeutil/NAST-iEr/run_NAST-iEr.pl --query_FASTA ${sequences}.fasta  > ${sequences}.NAST
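Since both ChimeraSlayer and WigeoN consume the NAST-iEr output, the three commands shown on this page can be chained. The following Python sketch is one way to do that; `build_commands` and `run_all` are hypothetical helpers, and the sketch assumes the scripts are executable and that `microbiomeutil` is reachable as a relative path. Only the script paths and options documented on this page are used.

```python
import subprocess


def build_commands(prefix: str, root: str = "microbiomeutil"):
    """Return (argv, stdout_path) pairs; stdout_path is None when output is not redirected."""
    nast = f"{prefix}.NAST"
    return [
        ([f"{root}/NAST-iEr/run_NAST-iEr.pl", "--query_FASTA", f"{prefix}.fasta"], nast),
        ([f"{root}/ChimeraSlayer/ChimeraSlayer.pl", "--query_NAST", nast], None),
        ([f"{root}/WigeoN/run_WigeoN.pl", "--query_NAST", nast], f"{prefix}.WigeoN"),
    ]


def run_all(prefix: str, root: str = "microbiomeutil") -> None:
    """Run alignment, then ChimeraSlayer and WigeoN; stop at the first failure."""
    for argv, stdout_path in build_commands(prefix, root):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            # Mirror the shell redirection ("> file") used in the commands above.
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


commands = build_commands("seqs")
```

Calling `run_all("seqs")` would expect `seqs.fasta` in the working directory and write `seqs.NAST` and `seqs.WigeoN` alongside the ChimeraSlayer output files.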

AmosCmp16Spipeline

AmosCmp16Spipeline (download) uses the AMOScmp software to assemble multiple, potentially overlapping 16S rRNA sequencing reads based on read mappings to a reference 16S rRNA gene.

Given the following inputs:

- a fasta file containing sequencing reads
- a file containing the corresponding qual values
- a file enumerating the accessions corresponding to reads of the same clone (individual assembly tasks)
- a reference database of 16S rRNA sequences

The single reference sequence that best matches all the reads is chosen. Lucy is used to trim low-quality termini from the sequence reads. An additional homology-trimming operation is performed to exclude regions of the sequence that lack homology to the reference. The resulting trimmed reads and quality values are used to generate a sequence assembly using the AMOScmp software. A scaffold sequence is generated, where Ns are used to fill in gaps according to estimated gap sizes based on reference sequence anchoring, and quality values are reported according to the scaffold sequence. A README file containing instructions is provided, along with sample data.

TreeChopper

TreeChopper (download) clusters tree leaf nodes according to phylogenetic distance.

A graph is constructed from the tree like so: all leaves are visited, and from each leaf, all neighboring leaves within a specified distance threshold are added to a graph with an edge placed between them. After building this graph, each edge connecting pairs of nodes is examined and a Jaccard similarity coefficient is computed (see http://www.biomedcentral.com/1741-7007/3/7 for details). Those edges that loosely connect nodes as defined by this similarity coefficient are removed. The nodes connected by the remaining edges are clustered by transitive closure (single linkage clustering) and reported as OTUs.

The minimum phylogenetic distance between clustered nodes and the minimum similarity coefficient between nodes in the graph are tunable parameters. A README file containing instructions is provided, along with sample data.
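The graph construction, edge filtering, and single linkage clustering described above can be sketched in Python as follows. This is not TreeChopper's code. It takes precomputed pairwise leaf distances instead of a tree, and it assumes the Jaccard coefficient of an edge is computed on the closed neighborhoods of its two nodes (each node counted among its own neighbors). The function and parameter names are hypothetical.

```python
def chop(distances, distance_threshold, min_jaccard):
    """Cluster leaves given a {(leaf_a, leaf_b): distance} mapping."""
    leaves = set()
    for a, b in distances:
        leaves.update((a, b))

    # Closed neighborhoods: every leaf is a neighbor of itself.
    neighbors = {leaf: {leaf} for leaf in leaves}
    for (a, b), dist in distances.items():
        if dist <= distance_threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Keep only edges whose endpoints share enough of their neighborhoods.
    kept = {leaf: set() for leaf in leaves}
    for a in leaves:
        for b in neighbors[a]:
            if a < b:  # visit each undirected edge once, skip self-loops
                shared = len(neighbors[a] & neighbors[b])
                total = len(neighbors[a] | neighbors[b])
                if shared / total >= min_jaccard:
                    kept[a].add(b)
                    kept[b].add(a)

    # Transitive closure: connected components of the remaining graph.
    clusters, seen = [], set()
    for leaf in sorted(leaves):
        if leaf in seen:
            continue
        stack, component = [leaf], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(kept[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Toy distances (hypothetical): C is close to D, but the two share few neighbors.
toy = {
    ("A", "B"): 0.01, ("A", "C"): 0.02, ("B", "C"): 0.01,
    ("C", "D"): 0.02, ("D", "E"): 0.01, ("A", "E"): 0.30,
}
strict = chop(toy, distance_threshold=0.03, min_jaccard=0.5)
loose = chop(toy, distance_threshold=0.03, min_jaccard=0.3)
```

With the stricter coefficient, the weakly supported C-D edge is removed and two clusters result; with the looser one, the edge survives and all five leaves merge into one cluster.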

Miscellaneous Remarks

The bacterial 16S rRNA is the primary target of the ChimeraSlayer, WigeoN, and NAST-iEr utilities.
Patrick Schloss' Mothur software includes a chimera.slayer method based on our ChimeraSlayer implementation. As of Mothur version 1.19.0, it has been heavily tested and shows very similar accuracy, while running many times faster.
Robert Edgar has a new chimera detection tool called UCHIME (manuscript in prep.). We are actively collaborating with Robert on its development and expect UCHIME to be faster and more accurate than ChimeraSlayer. It also has the advantage of being able to leverage abundance data without requiring a reference database.

Reference

ChimeraSlayer and companion tools are described in the following:

Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Haas BJ, Gevers D, Earl A, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E, Methe B, DeSantis TZ, Petrosino JF, Knight R, Birren BW. Genome Res. 2011 Mar;21(3):494-504. Epub 2011 Jan 6.