Frequently asked questions (FAQs) about GreenGenes.

 

Q: How does greengenes 'know' that the sequences it distributes are truly 16S rRNA?

 

Q: What is the "divergence ratio" and how do I know if the sequence is a chimera?

 

Q: How quickly are new database sequences available for download?

 

Q: Will any given sequence be analyzed exactly the same way each time it is run through greengenes?

 

Q: What is the "Revered Set" of sequences?

 

Q: What is the best way for me to obtain full length (or nearly full length) sequences for the archaea and bacteria from greengenes?

 

Q: I've noticed that some of my sequences are truncated while running through greengenes.  In what cases does this happen, and is there any way to prevent it from happening?

 

Q: How do I update my current 16S database using greengenes?

 

 

 


Q: How does greengenes 'know' that the sequences it distributes are truly 16S rRNA?

 

A: Before a sequence is deposited in greengenes as 16S rRNA, a megablast search is run against a template FASTA file. Any sequence that more closely matches a mitochondrial or 18S sequence is reported as "MITO..." or "18S...." and is removed. This is a rapid pipeline test; a more thorough way to confirm that a sequence is 16S is to model its secondary structure.
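The screening step described above can be illustrated with a minimal sketch. It is not the greengenes pipeline itself: it assumes the megablast results were saved in the standard 12-column BLAST tabular layout (query id in column 1, subject id in column 2, bit score in column 12), and that non-16S templates carry ids beginning with "MITO" or "18S". The function names and the prefix list are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Assumed id prefixes for non-16S templates (taken from the labels quoted above).
NON_16S_PREFIXES = ("MITO", "18S")


def best_hits(tabular_lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Return the highest-scoring subject id and bit score for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def passes_16s_screen(tabular_lines: Iterable[str]) -> Dict[str, bool]:
    """Map each query id to True if its best hit is not a MITO/18S template."""
    return {
        query: not subject.startswith(NON_16S_PREFIXES)
        for query, (subject, _score) in best_hits(tabular_lines).items()
    }
```

A query with no hits at all does not appear in the result, so a caller would need to decide separately how to treat such sequences.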

 

Q: What is the "divergence ratio" and how do I know if the sequence is a chimera?

A: Identification of chimeric sequences is expected to be more reliable when parent sequences are distinctly different from each other, while the divergence between the chimeric fragments and their parent sequences is low. We quantify this by a divergence ratio


              0.5 ( sid(i, k | w1) + sid(j, k | w2) )
d-ratio = ----------------------------------------------------
                     sid (i, j | w1 u w2)


where the numerator is the sequence identity (sid) between the fragments of chimera k and their putative parent sequences i and j, averaged over windows w1 and w2 to the left and right of the break point; the denominator is the sequence identity between the two parent sequences over both windows.  The window size is set to 300 bases.
The divergence ratio will be close to 1 when there is no significant difference between the parent sequences and the putative chimera, and such a prediction is generally unreliable. In our experience, divergence ratios larger than 1.1 are a good indication of real chimeric sequences.
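As a worked illustration of the formula, the following sketch computes the divergence ratio for three sequences taken from the same alignment. It makes several assumptions that are not stated above: sequence identity is the fraction of identical alignment columns, gap characters are compared like any other character, and the break point is given as an alignment column index. The function names are hypothetical.

```python
def sid(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length aligned strings."""
    if len(a) != len(b) or not a:
        raise ValueError("windows must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def divergence_ratio(parent_i: str, parent_j: str, chimera_k: str,
                     breakpoint: int, window: int = 300) -> float:
    """Compute the d-ratio for putative parents i, j and putative chimera k."""
    if not (len(parent_i) == len(parent_j) == len(chimera_k)):
        raise ValueError("all three sequences must come from the same alignment")
    w1 = slice(max(0, breakpoint - window), breakpoint)  # window left of the break
    w2 = slice(breakpoint, breakpoint + window)          # window right of the break
    numerator = 0.5 * (sid(parent_i[w1], chimera_k[w1])
                       + sid(parent_j[w2], chimera_k[w2]))
    denominator = sid(parent_i[w1] + parent_i[w2], parent_j[w1] + parent_j[w2])
    if denominator == 0:
        raise ValueError("parents share no identical positions in the windows")
    return numerator / denominator


# Toy example with a 4-base window: k is the left half of i joined to the right half of j.
ratio = divergence_ratio("AAAATTTT", "CCAAGGTT", "AAAAGGTT", breakpoint=4, window=4)
print(ratio)  # 2.0
```

In the toy example each chimera fragment is identical to its parent (numerator 1.0), while the parents agree at only 4 of 8 positions (denominator 0.5), giving a ratio of 2.0, well above the 1.1 threshold.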

 

Q: How quickly are new database sequences available for download?

 

A: Sequences are not available for export until they have gone through the prokMSA namer and chimera check.  Once these steps have completed, the sequences will be available in the BLAST and Simrank databases within a few hours.

 

 

Q: Will any given sequence be analyzed exactly the same way each time it is run through greengenes?

A: If the parameters you choose are the same and the database of 16S templates is the same, then yes, the result will be the same.  We update the database approximately weekly, so a sequence you submit may yield a slightly different result when run in different weeks.

 

Q: What is the "Revered Set" of sequences?

 

A: The “Revered Set” is a smaller sequence set for users who just want long (>1350nt), non-chimeric, non-redundant sequences.

 

Q: What is the best way for me to obtain full length (or nearly full length) sequences for the archaea and bacteria from greengenes?

 

A: The best way to get the sequences is using the Export function.

Step 1: Browse the tree (http://greengenes.lbl.gov/cgi-bin/nph-browse.cgi).  On the left side of the page, make sure "Hugenholtz" is chosen under "My Taxonomy".  If it is not, select it and press "Activate".   In the table presented, click the check-box for each Domain of interest (or, alternatively, click on the Domain names themselves to reveal the phyla, etc).   Be sure to click the button "Make changes to My Interest List" to record the nodes you have checked/unchecked.

Step 2: Export the sequences (http://greengenes.lbl.gov/cgi-bin/nph-export_records.cgi).  Make your selections under "Filters" and "Options".  Enter the email address where you want your sequences sent. Then click "Export Now".

Use the "prokMSA format" to receive maximum annotation.  You may want to try it on a small scale at first by just choosing one family.  That way you can evaluate what type of information is in each record.

 

Q: I've noticed that some of my sequences are truncated while running through greengenes.  In what cases does this happen, and is there any way to prevent it from happening?

 

A: This truncation occurs when your sequence is unable to align well to any ONE template sequence. This is one way to prevent possible chimeras from being included in a multiple sequence alignment, but the downside is that NAST (the aligner) is unable to align the full sequence in some cases.   Future versions of NAST will allow this "safety" to be turned off. If you find that a sequence was truncated, you may want to test that sequence with Bellerophon or Pintail to see if it may be a chimera. Another possibility is that you have discovered a totally new 16S sequence (at phylum level?), in which case you should align it by hand or with a highly accurate program like ClustalW using 50 or more nearest neighbors.

 

Q: How do I update my current 16S database using greengenes?

 

A: We recommend starting with the latest greengenes.arb database. Then align your personal sequences into the greengenes alignment format and import them into ARB using the greengenes import filter (greengenes.ift). To update your local database periodically, go to the export function, upload a list of the accession numbers or prokMSA ids that you have in your database, and choose "consider sequences NOT found in uploaded list". This will return any new sequences to you. Then import these into ARB using the greengenes.ift import filter. Remember that prokMSA ids are used as the unique identifiers in the database, so when updating greengenes.arb from the website with public records, make sure to select the “use old names” option in ARB when you import the sequences. When importing your own (non-public) sequences aligned with the NAST aligner, use the “create new names” option in ARB. Never use the “generate new names” option in the species menu in ARB, as this will erase the prokMSA ids from the name field and create difficulties if you want to overwrite existing records.
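One way to prepare the id list for upload is sketched below. It is only an illustration and rests on assumptions not confirmed above: that your local records are available as a FASTA file whose header lines begin with the accession number or prokMSA id, and that the export page accepts a plain-text list with one id per line. The function names are hypothetical.

```python
from typing import Iterable, List


def ids_from_fasta(lines: Iterable[str]) -> List[str]:
    """Collect the first token of each FASTA header, dropping repeated ids."""
    seen = set()
    ids: List[str] = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens and tokens[0] not in seen:
                seen.add(tokens[0])
                ids.append(tokens[0])
    return ids


def write_id_list(ids: Iterable[str], path: str) -> None:
    """Write one id per line (assumed upload format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for identifier in ids:
            handle.write(identifier + "\n")
```

For example, calling `ids_from_fasta(open("my_local.fasta"))` and passing the result to `write_id_list` would produce a text file to upload; the file name here is a placeholder.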
