
The Human Microbiome Project (HMP) is an exciting Roadmap initiative funded by the National Institutes of Health (NIH). The goal of the project is to understand how the microbial communities inhabiting our bodies contribute to normal human health, development, and disease (http://nihroadmap.nih.gov/hmp).

The Broad Institute (http://www.broadinstitute.org) was launched in 2004 with the visionary philanthropic investment of Eli and Edythe Broad, who joined with leaders at Harvard and its affiliated hospitals, MIT, and the Whitehead Institute to pioneer a "new model" of collaborative science. The Broad Institute is organized as a transparent infrastructure that allows biology- and technology-focused scientists to work together to identify and overcome the most critical obstacles to realizing the full promise of genomic medicine.

The Broad Institute aggressively advances sequence-based technologies and the bioinformatics necessary to characterize the vast complexity of the human microbiome. In keeping with our mission, we make the microbiome analysis utilities developed by the Broad Institute available to the community in order to promote further innovation and collaborative research efforts. We appreciate your feedback.

The utilities developed by the Broad Institute and provided here apply to a range of challenges posed by the microbiome initiative, including:

ChimeraSlayer, WigeoN, NAST-iEr, and the database of reference 16S sequences are provided as a single co-dependent download. Sample data and usage instructions are included.

Note
Recent Developments: Robert Edgar has developed a faster and more accurate chimera detection tool called UCHIME. Also, for ChimeraSlayer aficionados, Patrick Schloss now includes a faster implementation of ChimeraSlayer in his Mothur package.

Microbiome Analysis Utilities

ChimeraSlayer

ChimeraSlayer (download) is a chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (~500 bp).

ChimeraSlayer flags chimeric 16S rRNA sequences in the following series of steps: (A) the ends of a query sequence are searched against an included database of reference chimera-free 16S sequences to identify potential parents of a chimera; (B) candidate parents of a chimera are selected as those that form a branched best scoring alignment to the NAST-formatted query sequence; (C) the NAST alignment of the query sequence is improved in a 'chimera-aware' profile-based NAST realignment to the selected reference parent sequences; and (D) an evolutionary framework is used to flag query sequences found to exhibit greater sequence homology to an in silico chimera formed between any two of the selected reference parent sequences.
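
The idea behind step (D) can be illustrated with a minimal Python sketch. It is not the ChimeraSlayer implementation: the function names, the toy sequences, and the choice to skip gap columns when computing percent identity are all assumptions made for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Skipping gap columns ('-' or '.') is an assumption of this sketch.
    """
    compared = matches = 0
    for x, y in zip(a, b):
        if x in "-." or y in "-.":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def in_silico_chimera(left_parent: str, right_parent: str, split_at: int) -> str:
    """Join the left part of one aligned parent to the right part of the other."""
    return left_parent[:split_at] + right_parent[split_at:]


# Hypothetical aligned sequences of equal length, as in a NAST alignment.
parent_a = "AAAAAAAAAACCCCCCCCCC"
parent_b = "GGGGGGGGGGTTTTTTTTTT"
query = "AAAAAAAAAATTTTTTTTTT"  # left half resembles A, right half resembles B

chimera_ab = in_silico_chimera(parent_a, parent_b, 10)
print(percent_identity(query, parent_a))    # 50.0
print(percent_identity(query, parent_b))    # 50.0
print(percent_identity(query, chimera_ab))  # 100.0
```

In this toy case the query matches the in silico chimera better than it matches either parent alone, which is the pattern step (D) looks for.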

To run Chimera Slayer, you need NAST-formatted sequences generated by the included NAST-iEr utility. Given NAST-formatted sequences, run ChimeraSlayer like so:

% microbiomeutil/ChimeraSlayer/ChimeraSlayer.pl --query_NAST ${sequences}.NAST

The output files include the following:

${sequences}.NAST.CPS                      :results from the chimera parent selection step
${sequences}.NAST.CPS_RENAST               :NAST alignments from a 'chimera-aware' realignment of the query
${sequences}.NAST.CPS.CPC                  :results from the chimera 'phylo-checker' step  ** the Chimera Slayer final verdict **
${sequences}.NAST.CPS.CPC.wTaxons          :the taxonomy of the reference (step)parents of the chimera

The .CPC output file is tab-delimited with the following fields:

0      ChimeraSlayer
1      chimera_AJ007403            # the accession of the chimera query
2      S000387216                  # reference parent A
3      S000001688                  # reference parent B
4      0.9422                      # divergence ratio of query to chimera (left_A, right_B)
5      90.00                       # percent identity between query and chimera(left_A, right_B)
6      0                           # confidence in query as a chimera related to (left_A, right_B)
7      1.0419                      # divergence ratio of query to chimera (right_A, left_B)
8      99.52                       # percent identity between query and chimera(right_A, left_B)
9      100                         # confidence in query as a chimera related to (right_A, left_B)
10     YES                         # ** verdict as a chimera or not **
11     NAST:4032-4033              # estimated approximate chimera breakpoint in NAST coordinates
12     ECO:767-768                 # estimated approximate chimera breakpoint according to the E. coli unaligned reference seq coordinates
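
For downstream filtering, the field layout above can be read with a few lines of Python. This is a sketch under stated assumptions: the column names in `CPC_FIELDS` are hypothetical labels chosen here (the file itself has no header), and rows with fewer than 11 fields are simply skipped.

```python
import csv
import io

# Hypothetical labels for fields 0-12 of a .CPC row; the file has no header.
CPC_FIELDS = [
    "tool", "query", "parent_a", "parent_b",
    "divr_la_rb", "per_id_la_rb", "bs_la_rb",
    "divr_ra_lb", "per_id_ra_lb", "bs_ra_lb",
    "verdict", "nast_breakpoint", "eco_breakpoint",
]


def read_cpc(handle):
    """Yield one dict per tab-delimited .CPC row that reaches the verdict field."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 11:
            continue  # assumption: ignore blank or incomplete rows
        yield dict(zip(CPC_FIELDS, row))


# The example row from the field listing above.
example = "\t".join([
    "ChimeraSlayer", "chimera_AJ007403", "S000387216", "S000001688",
    "0.9422", "90.00", "0", "1.0419", "99.52", "100",
    "YES", "NAST:4032-4033", "ECO:767-768",
]) + "\n"

records = list(read_cpc(io.StringIO(example)))
flagged = [r["query"] for r in records if r["verdict"] == "YES"]
print(flagged)  # ['chimera_AJ007403']
```

To read a real result file, pass an open file handle (for example `open(path, newline="")`) in place of the `io.StringIO` object.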

For those query sequences flagged as chimeras, the .wTaxons file includes the following extra columns:

13      Rhodococcus                                                                # genus name of Parent A
14      Rhodococcus koreensis (T); DNP505; AF124342 Rhodococcus koreensis          # descriptive info for Parent A
15      Streptomyces                                                               # genus name of Parent B
16      Streptomyces somaliensis (T); DSM 40738; AJ007403 Streptomyces somaliensis # descriptive info for Parent B
17      INTRA-ORDER                                                                # type of chimera based on selected parents
Note
It is not recommended to blindly discard all sequences flagged as chimeras. Some may represent naturally formed chimeras that do not represent PCR artifacts. Sequences flagged may warrant further investigation.

If you use the --printCSalignments option, a diagram of the query matching the parents on both sides of the breakpoint is included in the output. For example:

Per_id parents: 89.52
          Per_id(Q,A): 94.00
--------------------------------------------------- A: S000387216
88.65                                99.06
~~~~~~~~~~~~~~~~~~~~~~~~\ /~~~~~~~~~~~~~~~~~~~~~~~~ Q: chimera_AJ007403
DivR: 0.942 BS: 0.00     |
Per_id(QLA,QRB): 90.00   |
                         |
   (L-AB: 88.65)         |      (R-AB: 90.34)
   WinL:0-704            |      WinR:705-1449
                         |
Per_id(QLB,QRA): 99.52   |
DivR: 1.042 BS: 100.00   |
~~~~~~~~~~~~~~~~~~~~~~~~/ \~~~~~~~~~~~~~~~~~~~~~~~~~ Q: chimera_AJ007403
100.00                                91.28
---------------------------------------------------- B: S000001688
           Per_id(Q,B): 95.52
DeltaL: -11.35                   DeltaR: 7.79
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
GGAGGCTCGTACCGCTGTCTTGTTAAGGACTGGTTTTTTACTGTCTATACAGACTCTTCA  A: S000387216
AAGACGCTTGGGTTTCACTCCTGCGCTTCGGCCGGGCCCGGCACTCGCCACAGTCTCGAG  Q: chimera_AJ007403
AAGACGCTTGGGTTTCACTCCTGCGCTTCGGCCGGGCCCGGCACTCGCCACAGTCTCGAG  B: S000001688
!!!!!!!!!!!!!!!!!!!!
TACTACTGGATATCCTGATA  A: S000387216
CGTCGTCTTGATGTTCACAT  Q: chimera_AJ007403
CGTCGTCTTGATGTTCACAT  B: S000001688
** Breakpoint **
                           !!!!!!!
TGCGTTCGGATCGATTGTTGCCGTACGCTGTGTCGATTAAAGGTAATCATAAGGGCTTTC  A: S000387216
TGCGTTCGGATCGATTGTTGCCGTACGCCTGTGTCATTAAAGGTAATCATAAGGGCTTTC  Q: chimera_AJ007403
GTAACGATCGCTTCCAACCCATCCGGTGCTGTGTCGCCGGGCACGGCTTGGGAATTAACT  B: S000001688
!!!!!!!!!!!!!!!!!!!!!!!!!!!!       !!!!!!!!!!!!!!!!!!!!!!!!!
GACTTACGACTC  A: S000387216
GACTTACGACTC  Q: chimera_AJ007403
ATTCCCAAGTCT  B: S000001688
!!!!!!!!!!!!

The above indicates the percent identities between the alignment segments corresponding to query and either parent. Since chimeras can occur two ways: (left parent A & right parent B) or (left parent B & right parent A), a fork diagram is shown with the statistics for each potential chimera as it relates to the query sequence. The bootstrap (BS) values indicate the confidence level for the corresponding chimera type. The informative SNP positions from the complete alignments are shown for both sides of the breakpoint.
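
The DivR values in the diagram are consistent with dividing the percent identity between the query and an in silico chimera by the higher of the two query-to-parent identities. That relationship is inferred from the sample numbers only, not taken from the ChimeraSlayer source, so treat the following Python arithmetic as a sanity check rather than a definition. The function name is hypothetical.

```python
def divergence_ratio(per_id_chimera, per_id_q_a, per_id_q_b):
    """Query-to-chimera identity relative to the closest parent (assumed formula)."""
    return per_id_chimera / max(per_id_q_a, per_id_q_b)


# Values copied from the example diagram: Per_id(Q,A)=94.00, Per_id(Q,B)=95.52.
print(round(divergence_ratio(90.00, 94.00, 95.52), 4))  # 0.9422 (left_A, right_B)
print(round(divergence_ratio(99.52, 94.00, 95.52), 4))  # 1.0419 (left_B, right_A)
```

A ratio above 1 means the query is more similar to the in silico chimera than to either parent on its own, which matches the orientation that received the YES verdict in the example.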

WigeoN

WigeoN (download) examines the sequence conservation between a query and a trusted reference sequence, both in NAST alignment format. Given the sequence identity between the query and the reference, a certain amount of variation is expected across the alignment. If the observed variation exceeds the 95% quantile of the variation observed between non-anomalous sequences, the query is flagged as an anomaly.
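
The quantile test can be sketched in Python as follows. This is only an illustration of the comparison, not WigeoN's code: the nearest-rank quantile, the function names, and the toy reference values are assumptions, and the reference distribution is assumed to come from non-anomalous sequences at a comparable divergence.

```python
import math


def quantile(values, percent):
    """Nearest-rank quantile of a non-empty list, with 0 < percent <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def is_anomalous(observed, reference_values, percent=95):
    """Flag the query if its variation exceeds the chosen reference quantile."""
    return observed > quantile(reference_values, percent)


# Hypothetical variation values observed among non-anomalous sequences.
reference_values = list(range(1, 101))

print(quantile(reference_values, 95))      # 95
print(is_anomalous(40, reference_values))  # False
print(is_anomalous(96, reference_values))  # True
```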

WigeoN is a flexible command-line based reimplementation of the Pintail algorithm (Appl Environ Microbiol. 2005 Dec;71(12):7724-36).

WigeoN is useful for flagging chimeras and anomalies only in near full-length 16S rRNA sequences. WigeoN lacks sensitivity with sequences less than 1000 bp.

To run WigeoN, you need NAST-formatted sequences generated by the included NAST-iEr utility. Given NAST-formatted sequences, run WigeoN like so:

% microbiomeutil/WigeoN/run_WigeoN.pl --query_NAST ${sequences}.NAST > ${sequences}.WigeoN

The output is tab-delimited like so:

0       chimera_AJ007403       # query sequence
1       S000387216             # best matching reference sequence
2       div:
3       5.45                   # percent sequence divergence between the query and the reference sequence
4       stDev:
5       4.01                   # standard deviation from expected reference sequence divergence across alignment windows
6       Quant95:Yes            # stDev is in the top 5% of stDev values observed among reference sequences at that same mean divergence
7       Quant99:YES            # top 1%  *** This value is recommended for flagging aberrant sequences ***
8       Quant99.9:No           # top 0.1%
9       Quant99.99:No          # top 0.01%
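
A WigeoN output line with this layout can be parsed with a short Python function. This is a sketch: the dictionary keys are hypothetical names chosen here, and the Yes/No values are compared case-insensitively because the example mixes "Yes" and "YES".

```python
def parse_wigeon_line(line):
    """Split one tab-delimited WigeoN line into a dict (key names are assumptions)."""
    fields = line.rstrip("\n").split("\t")
    flags = {}
    for field in fields[6:]:
        name, _, value = field.partition(":")
        flags[name] = value.strip().upper() == "YES"
    return {
        "query": fields[0],
        "reference": fields[1],
        "divergence": float(fields[3]),  # field 2 is the literal label "div:"
        "stdev": float(fields[5]),       # field 4 is the literal label "stDev:"
        "flags": flags,
    }


line = (
    "chimera_AJ007403\tS000387216\tdiv:\t5.45\tstDev:\t4.01\t"
    "Quant95:Yes\tQuant99:YES\tQuant99.9:No\tQuant99.99:No"
)
record = parse_wigeon_line(line)
print(record["flags"]["Quant99"])    # True
print(record["flags"]["Quant99.9"])  # False
```

Filtering on `record["flags"]["Quant99"]` follows the recommendation in the field listing to use the top 1% threshold when flagging aberrant sequences.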

NAST-iEr

The NAST-iEr alignment utility (download) aligns a single raw nucleotide sequence against one or more NAST formatted sequences.

The alignment algorithm involves global dynamic programming profile alignment to fixed (NAST-formatted) multiply aligned template sequences without any end-gap penalty.

Run it like so, using a set of fasta-formatted sequences.

% microbiomeutil/NAST-iEr/run_NAST-iEr.pl --query_FASTA ${sequences}.fasta  > ${sequences}.NAST
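
The effect of dropping the end-gap penalty can be seen in a much simpler setting than NAST-iEr's profile alignment. The Python sketch below scores a pairwise alignment of two plain sequences with free leading and trailing gaps. The scoring values and the function name are assumptions, and it does not align against a multiply aligned template as the utility does.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best dynamic-programming alignment score with no penalty for end gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    # The first row and column stay at zero, so leading gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trailing gaps are free: the alignment may end anywhere in the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


print(semiglobal_score("ACGT", "TTACGTTT"))  # 4
print(semiglobal_score("ACGT", "ACGT"))      # 4
```

With end gaps charged, the first call would score lower because the four unmatched flanking bases of the longer sequence would each incur a gap penalty. Leaving them free lets a short read sit anywhere along a longer template.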

AmosCmp16Spipeline

AmosCmp16Spipeline (download) uses the AMOScmp software to assemble multiple, potentially overlapping 16S rRNA sequencing reads based on read mappings to a reference 16S rRNA gene.

The pipeline takes the following inputs:

- a FASTA file containing the sequencing reads
- a file containing the corresponding qual values
- a file enumerating the accessions corresponding to reads of the same clone (the individual assembly tasks)
- a reference database of 16S rRNA sequences

The single reference sequence that best matches all the reads is chosen. Lucy is used to trim low-quality termini from the sequence reads. An additional homology-trimming operation is performed to exclude regions of the sequence that lack homology to the reference. The resulting trimmed reads and quality values are used to generate a sequence assembly using the AMOScmp software. A scaffold sequence is generated, where Ns are used to fill in gaps according to estimated gap sizes based on reference sequence anchoring, and quality values are reported according to the scaffold sequence. A README file containing instructions is provided, along with sample data.
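
The scaffolding step can be pictured with a small Python sketch. It covers only the gap-filling idea and is not the pipeline's code: the input tuple layout, the use of 1-based inclusive reference coordinates, a quality value of 0 for each N, and the treatment of overlapping contigs as having no gap are all assumptions. Lucy trimming and the AMOScmp assembly itself are not modeled.

```python
def build_scaffold(contigs):
    """Join reference-anchored contigs, filling gaps between them with Ns.

    contigs: list of (ref_start, ref_end, sequence, quals) tuples, where the
    coordinates are 1-based inclusive positions on the reference (assumed).
    """
    seq_parts, quals_out = [], []
    prev_end = None
    for ref_start, ref_end, seq, quals in sorted(contigs, key=lambda c: c[0]):
        if prev_end is not None:
            gap = max(0, ref_start - prev_end - 1)  # estimated from the anchors
            seq_parts.append("N" * gap)
            quals_out.extend([0] * gap)  # assumption: Ns get quality 0
        seq_parts.append(seq)
        quals_out.extend(quals)
        prev_end = ref_end
    return "".join(seq_parts), quals_out


# Two hypothetical contigs anchored at reference positions 1-4 and 8-10.
contigs = [(8, 10, "TTG", [20, 20, 20]), (1, 4, "ACGT", [30, 30, 30, 30])]
scaffold, scaffold_quals = build_scaffold(contigs)
print(scaffold)        # ACGTNNNTTG
print(scaffold_quals)  # [30, 30, 30, 30, 0, 0, 0, 20, 20, 20]
```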

TreeChopper

TreeChopper (download) clusters tree leaf nodes according to phylogenetic distance.

A graph is constructed from the tree like so: all leaves are visited, and from each leaf, all neighboring leaves within a specified distance threshold are added to a graph with an edge placed between them. After building this graph, each edge connecting pairs of nodes is examined and a Jaccard similarity coefficient is computed (see http://www.biomedcentral.com/1741-7007/3/7 for details). Those edges that loosely connect nodes as defined by this similarity coefficient are removed. The nodes connected by the remaining edges are clustered by transitive closure (single linkage clustering) and reported as OTUs.

The minimum phylogenetic distance between clustered nodes and the minimum similarity coefficient between nodes in the graph are tunable parameters. A README file containing instructions is provided, along with sample data.
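
The clustering procedure can be sketched in Python as follows. This is not TreeChopper's code, and it makes several assumptions: it starts from precomputed pairwise leaf distances rather than a tree, each leaf counts as a member of its own neighborhood when the Jaccard coefficient is computed, and the function and parameter names are hypothetical.

```python
from itertools import combinations


def cluster_leaves(distances, max_dist, min_jaccard):
    """Cluster leaves given {(leaf_a, leaf_b): distance} for unordered pairs."""
    leaves = sorted({leaf for pair in distances for leaf in pair})
    # Neighborhood of a leaf: itself plus every leaf within max_dist of it.
    neighbors = {leaf: {leaf} for leaf in leaves}
    for (a, b), dist in distances.items():
        if dist <= max_dist:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Keep an edge only if the two neighborhoods overlap strongly enough.
    kept = {leaf: set() for leaf in leaves}
    for a, b in combinations(leaves, 2):
        if b in neighbors[a]:
            shared = len(neighbors[a] & neighbors[b])
            total = len(neighbors[a] | neighbors[b])
            if shared / total >= min_jaccard:
                kept[a].add(b)
                kept[b].add(a)

    # Single linkage: each connected component of the remaining graph is an OTU.
    clusters, seen = [], set()
    for leaf in leaves:
        if leaf in seen:
            continue
        stack, component = [leaf], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(kept[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical distances: a, b, c are mutually close; d is close only to c.
distances = {
    ("a", "b"): 0.01, ("a", "c"): 0.01, ("b", "c"): 0.01,
    ("c", "d"): 0.01, ("a", "d"): 0.5, ("b", "d"): 0.5,
}
print(cluster_leaves(distances, max_dist=0.03, min_jaccard=0.6))
# [['a', 'b', 'c'], ['d']]
```

Here the c-d edge has a coefficient of 0.5 (two shared neighbors out of four), so it is removed at a threshold of 0.6 and d ends up in its own OTU. Lowering the threshold to 0.5 keeps the edge and merges all four leaves.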

Miscellaneous Remarks

Reference

ChimeraSlayer and companion tools are described in the following:

Questions, comments, etc?

Contact Brian Haas (bhaas at broadinstitute dot org)

This project has been funded with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health.