Frequently asked questions (FAQs) about GreenGenes.

 

Q: How does greengenes 'know' that the sequences it distributes are truly 16S rRNA?

 

Q: What is the "divergence ratio" and how do I know if the sequence is a chimera?

 

Q: How quickly are new database sequences available for download?

 

Q: Will any given sequence be analyzed exactly the same way each time it is run through greengenes?

 

Q: What is the "Revered Set" of sequences?

 

Q: What is the best way for me to obtain full length (or nearly full length) sequences for the archaea and bacteria from greengenes?

 

Q: I've noticed that some of my sequences are truncated while running through greengenes.  In what cases does this happen, and is there any way to prevent it from happening?

 

Q: How do I update my current 16S rRNA database using greengenes?

 

Q: How can I download the current prokMSA database directly to my UNIX machine without using a web browser?

 

Q: I would like to map my 16S rRNA sequences to taxonomic codes (phylocodes). What's the best way to do this?

 

Q: How do I avoid duplicate sequences in my ARB database caused by ARB generating names during import?

Q: What is the numeric ARB name used in greengenes.arb?

Q: Arb cannot import pre-aligned sequences output from the NAST aligner. The sequences import successfully, in fasta format, but ARB is not importing the alignments correctly - they appear as one large block in the alignment viewer. How can I fix this?

 

 

 

Answers

 


Q: How does greengenes 'know' that the sequences it distributes are truly 16S rRNA?

 

A: Before a sequence is deposited in greengenes as 16S rRNA, a megablast is performed against a template FASTA file. Any sequence that more closely matches a mitochondrial sequence or 18S rRNA is reported as "MITO..." or "18S...." and is removed. This is a rapid pipeline test. A more thorough test to make sure a sequence is 16S rRNA is to model the secondary structure. Back to FAQs.

 

Q: What is the "divergence ratio" and how do I know if the sequence is a chimera?

A: Chimera identification is expected to be more reliable when the parent sequences are distinctly different from each other while the divergence between the chimeric fragments and their parent sequences is low. We quantify this with a divergence ratio:


              0.5 ( sid(i, k | w1) + sid(j, k | w2) )
d-ratio = ----------------------------------------------------
                     sid (i, j | w1 u w2)


where the numerator is the sequence identity (sid) between the fragments of chimera k and their putative parent sequences i and j, averaged over windows w1 and w2 to the left and right of the break point; the denominator is the sequence identity between the two parent sequences over both windows combined (w1 u w2).  The window size is set to 300 bases.
The divergence ratio will be close to 1 when there is no significant difference between the parent sequences and the putative chimera, and such a prediction will generally be unreliable. In our experience, divergence ratios larger than 1.1 (one point one) are a good indication of real chimeric sequences. Back to FAQs.
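
The ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the greengenes implementation: sid is taken here to be the plain fraction of identical columns between two aligned strings (gap characters are compared like any other character), and the function names and the breakpoint argument are hypothetical.

```python
def sid(a: str, b: str) -> float:
    """Fraction of identical columns between two aligned, equal-length strings.

    Assumption: a simple column-by-column comparison; the exact identity
    measure used by greengenes is not described in this FAQ.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("window is empty")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def divergence_ratio(parent_i: str, parent_j: str, chimera_k: str,
                     breakpoint: int, window: int = 300) -> float:
    """d-ratio for a putative chimera k with parents i (left) and j (right)."""
    w1 = slice(max(0, breakpoint - window), breakpoint)   # left of the break point
    w2 = slice(breakpoint, breakpoint + window)           # right of the break point

    # Numerator: chimera-to-parent identity, averaged over the two windows.
    numerator = 0.5 * (sid(parent_i[w1], chimera_k[w1]) +
                       sid(parent_j[w2], chimera_k[w2]))

    # Denominator: parent-to-parent identity over both windows combined.
    denominator = sid(parent_i[w1] + parent_i[w2], parent_j[w1] + parent_j[w2])
    if denominator == 0:
        raise ValueError("parents share no identical columns in the windows")
    return numerator / denominator
```

With three identical sequences the function returns 1.0, the uninformative case described above. The value rises above 1 as the parents diverge from each other while each chimeric fragment stays close to its own parent.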

 

Q: How quickly are new database sequences available for download?

 

A: Sequences are not available for export until they have gone through the prokMSA namer and chimera check.  Once this has completed, they will be available in the BLAST and SimRank database within a few hours. Back to FAQs.

 

 

Q: Will any given sequence be analyzed exactly the same way each time it is run through greengenes?

A: If the parameters you choose and the database of 16S rRNA templates are the same, then yes, the result will be the same.  We update the database approximately weekly, so a sequence you submit may yield a slightly different result when run in different weeks. Back to FAQs.

 

Q: What is the "Revered Set" of sequences?

 

A: The “Revered Set” is a smaller sequence set for users who just want long (>1350nt), non-chimeric, non-redundant sequences. Back to FAQs.

 

Q: What is the best way for me to obtain full length (or nearly full length) sequences for the archaea and bacteria from greengenes?

 

A: The best way to get the sequences is using the Export function.

Step 1: Browse tree (http://greengenes.lbl.gov/cgi-bin/nph-browse.cgi).  On the left side of the page make sure "Hugenholtz" is chosen under "My Taxonomy".  If it is not, select it and press "Activate".   In the table presented, click on the check-box for each Domain of interest (or, alternatively, click on the Domain names themselves to reveal the phyla, etc).   Be sure to click the button "Make changes to My Interest List" to record the nodes you have checked/unchecked.

Step 2: Export the sequences (http://greengenes.lbl.gov/cgi-bin/nph-export_records.cgi)  Make your selections under "Filters" and "Options".  Enter the email address where you want your sequences sent. Then click "Export Now".

Use the "prokMSA format" to receive maximum annotation.  You may want to try it on a small scale at first by just choosing one family.  That way you can evaluate what type of information is in each record. Back to FAQs.

 

Q: I've noticed that some of my sequences are truncated while running through greengenes.  In what cases does this happen, and is there any way to prevent it from happening?

 

A: This truncation occurs when your sequence is unable to align well to any ONE template sequence. This is one way to prevent possible chimeras from being included in a multiple sequence alignment, but the downside is that NAST (the aligner) is unable to align the full sequence in some cases.   Future versions of NAST will allow this "safety" to be turned off. If you find that a sequence was truncated, you may want to test it with Bellerophon or Pintail to see whether it is a chimera. Another possibility is that you have discovered a totally new 16S rRNA sequence (at phylum level?), in which case you should align it by hand or with a highly accurate program like ClustalW using 50 or more nearest neighbors. Back to FAQs.

 

Q: How do I update my current 16S rRNA database using greengenes?

 

A: We recommend starting with the latest greengenes.arb database. Then align your personal sequences into the greengenes alignment format and import them into ARB using the greengenes import filter (greengenes.ift). To update your local database periodically, go to the export function, upload a list of the accession numbers or prokMSA ids that you have in your database, and choose "consider sequences NOT found in uploaded list". This will return any new sequences to you. Then import these into ARB using the greengenes.ift import filter. Remember that prokMSA ids are used as the unique identifiers in the database, so when updating greengenes.arb from the website with public records, make sure to select “use old names” in ARB when you import the sequences. When importing your own (non-public) sequences aligned with the NAST aligner, use the “create new names” option in ARB. Never use the “generate new names” option in the species menu in ARB, as this will erase the prokMSA ids from the name field and create difficulties if you want to overwrite existing records. Back to FAQs.

 

Q: How can I download the current prokMSA database directly to my Unix machine without using a web browser?

 

A: To obtain a file directly on your Unix machine you can use 'wget', which is available on most Unix platforms.

For example, the command

wget http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/current_prokMSA_aligned.FASTA.gz

will get the desired document (current version of the prokMSA database aligned in FASTA format) from the web and copy it onto your machine without needing to open a browser. Back to FAQs.
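
Once the file is on your machine, it can be read without unpacking it on disk. The following is a minimal sketch using only the Python standard library; the function name read_fasta_gz is hypothetical, and the code assumes nothing about the header layout beyond the leading ">" of the FASTA format.

```python
import gzip
from typing import Iterator, List, Optional, Tuple


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    Alignment gap characters are kept exactly as they appear in the file.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    with gzip.open(path, "rt") as handle:
        for raw in handle:
            line = raw.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None and line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Calling read_fasta_gz("current_prokMSA_aligned.FASTA.gz") and looping over the result gives one record at a time, so the whole database does not need to fit in memory.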

Q: I would like to map my 16S rRNA sequences to taxonomic codes (phylocodes). What's the best way to do this?

A: First align your sequences using NAST with the option "do not remove alignment gap characters (returned sequences will be 7,682 characters)". You'll receive two separate emails. One will contain the aligned sequences; the other will be a report that lists a neighbor (using the prokMSA_id) for each of your sequences, which may assist in your query. A possibly better way to attach taxonomic codes to your sequences is a two-step process. First, classify each of your aligned sequences in the 7,682-character format with:

http://greengenes.lbl.gov/cgi-bin/nph-classify.cgi

This will return the taxonomic placement of each of your sequences according to multiple published taxonomies in a string format (e.g. "Bacteria; Proteobacteria; Alphaproteobacteria; Consistiales; Caedibacteraceae; otu_532"). From that report, you can decide which taxonomy system you wish to adopt (NCBI, Hugenholtz, Ludwig, RDP, etc.). Then, if you wish to assign a code to each of these strings, you can find the string-to-code relationships at:

http://greengenes.lbl.gov/Download/Taxonomic_Outlines/


For instance, if you like Phil Hugenholtz's system, be sure to download "Hugenholtz_SeqDescByOTU_tax_outline.txt".
There is one note of caution about using numeric phylocodes from greengenes. To be safe, you should classify and grab your codes on the same day. The codes are updated as more sequences are annotated by the curators. Back to FAQs.
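
The string-to-code step can be sketched as a simple lookup. This is an illustration only: the layout of the taxonomic outline files is not described in this FAQ, so the sketch assumes you have already loaded the string-to-code relationships into a Python dictionary (the codes argument, hypothetical) keyed by the taxonomy string.

```python
from typing import Dict, List, Optional


def split_taxonomy(tax_string: str) -> List[str]:
    """Split a string such as "Bacteria; Proteobacteria; ..." into its levels."""
    return [level.strip() for level in tax_string.split(";") if level.strip()]


def code_for(tax_string: str, codes: Dict[str, str]) -> Optional[str]:
    """Look up the code for a taxonomy string, or return None if it is absent.

    Assumption: codes maps normalized taxonomy strings (levels joined by "; ")
    to codes; building it from an outline file is left to the reader.
    """
    key = "; ".join(split_taxonomy(tax_string))
    return codes.get(key)
```

Normalizing the string before the lookup guards against stray spaces around the semicolons. A result of None is a useful warning that the classification and the outline file may come from different database updates.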

Q: How do I avoid duplicate sequences in my ARB database caused by ARB generating names during import?

A: When you import sequences obtained from greengenes (i.e. sequences already in GenBank) into your local ARB database, you should always use the greengenes import format and select "use old names". This way the prokMSA_id is the ARB name and you won't get the overwrite problem you describe. For your own sequences that aren't in greengenes, you can get ARB to create new names. When the study is published and the sequences finally reach greengenes via NCBI, you can delete the unpublished set to avoid duplication. Back to FAQs.

 

Q: What is the numeric ARB name used in greengenes.arb?

 

A: This is the prokMSA_id, the unique identifier used for records in greengenes. We chose to use this identifier to bypass potential problems when updating ARB with greengenes records (e.g. inserting the same record twice into ARB because the ARB name differed slightly, which can happen if names are generated by ARB). Back to FAQs.

 

Q: Arb cannot import pre-aligned sequences output from the NAST aligner. The sequences import successfully, in fasta format, but ARB is not importing the alignments correctly - they appear as one large block in the alignment viewer. How can I fix this?

A: You may be using the wrong import filter in ARB. You'll need to choose "fasta with gaps" to retain the alignment. Back to FAQs.
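
Before importing, it can help to confirm that the file itself still carries the alignment. The sketch below assumes the sequences were aligned by NAST with the gap-retaining option, so that every sequence should be 7,682 characters long; the function name and the (name, sequence) input format are hypothetical.

```python
from typing import Iterable, List, Tuple

NAST_COLUMNS = 7682  # alignment width quoted in this FAQ for gap-retaining NAST output


def wrong_width(records: Iterable[Tuple[str, str]],
                expected: int = NAST_COLUMNS) -> List[Tuple[str, int]]:
    """Return (name, length) for every sequence whose length differs from expected."""
    return [(name, len(seq)) for name, seq in records if len(seq) != expected]
```

An empty result means every sequence has the expected number of columns, in which case the import filter is the more likely culprit. Any entries returned point to records whose gaps were stripped or truncated before the import.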

 
