Frequently asked questions (FAQs) about GreenGenes.
Q:
How does greengenes 'know' that
the sequences it distributes are truly 16S rRNA?
Q:
What is the "divergence ratio" and how do I know
if the sequence is a chimera?
Q:
How quickly are new database sequences available for download?
Q:
Will any given sequence be analyzed exactly the same way
each time it is run through greengenes?
Q: What is the "Revered Set" of sequences?
Q:
How do I update my current 16S database using greengenes?
Q:
How does greengenes 'know' that the sequences it
distributes are truly 16S rRNA?
A:
Before a sequence is deposited in greengenes as
16S rRNA, a megablast is performed against a template
fasta file. Any sequence that more closely matches
a mitochondrial sequence or 18S is reported as "
Q:
What is the "divergence ratio" and how do I know if the sequence
is a chimera?
A: Identification of chimeric sequences is expected to be more reliable when
parent sequences are distinctly different from each other, while the divergence
between the chimeric fragments and their parent sequences is low. We quantify
this by a divergence ratio
$$ \text{d-ratio} = \frac{0.5 \, \bigl( \mathrm{sid}(i, k \mid w_1) + \mathrm{sid}(j, k \mid w_2) \bigr)}{\mathrm{sid}(i, j \mid w_1 \cup w_2)} $$
where the numerator is the sequence identity (sid)
between the fragments of chimera k and their putative
parent sequences i and j, averaged over windows
w1 and w2 left and right of the break point; the denominator is the sequence
identity of both parent sequences over both windows. The window size
is set to 300 bases.
The divergence ratio will be close to 1 when there is no significant difference
between parent sequences and the putative chimera, and such a prediction will
be generally unreliable. Divergence ratios larger than 1.1 (one point one)
are, in our experience, a good indication of real chimeric sequences.
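The divergence ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the greengenes implementation: the three sequences are assumed to be already aligned to equal length, the break point and windows are counted in alignment columns, columns that are gaps in both sequences are skipped, and the function names (`sid`, `divergence_ratio`) are hypothetical.

```python
def sid(a: str, b: str, window: range) -> float:
    """Fraction of identical positions between two aligned sequences in a window.

    Assumption: columns where both sequences have a gap ('-') are ignored.
    """
    compared = 0
    matches = 0
    for pos in window:
        x, y = a[pos], b[pos]
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def divergence_ratio(parent_i: str, parent_j: str, chimera_k: str,
                     breakpoint: int, window_size: int = 300) -> float:
    """Compute 0.5 * (sid(i,k|w1) + sid(j,k|w2)) / sid(i,j|w1 u w2)."""
    left_start = max(0, breakpoint - window_size)
    right_end = min(len(chimera_k), breakpoint + window_size)
    w1 = range(left_start, breakpoint)    # window left of the break point
    w2 = range(breakpoint, right_end)     # window right of the break point
    both = range(left_start, right_end)   # union of the two windows

    numerator = 0.5 * (sid(parent_i, chimera_k, w1) + sid(parent_j, chimera_k, w2))
    denominator = sid(parent_i, parent_j, both)
    if denominator == 0:
        raise ValueError("parents share no identical positions in the windows")
    return numerator / denominator


# Toy example: the parents agree at half of the columns, and the chimera is
# the left half of parent i joined to the right half of parent j.
ratio = divergence_ratio("AAAAAAAA", "AACCAACC", "AAAAAACC", 4, window_size=4)
print(ratio)
```

In the toy example each fragment matches its parent perfectly while the parents are only 50% identical, so the ratio is well above the 1.1 threshold mentioned above. When both parents and the query are identical, the ratio is exactly 1.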
Q:
How quickly are new database sequences available for download?
A: Sequences are not available for export until they have gone through the prokMSA namer and chimera check. Once this has completed, they will be available in the BLAST and Simrank databases within a few hours.
Q:
Will any given sequence be analyzed exactly the same way each time it is run
through greengenes?
A: If the parameters you choose are the same and the database of 16S templates is the same, then yes, the result will be the same. We update the database approximately weekly, so a sequence you submit may yield a slightly different result when run in different weeks.
Q:
What is the "Revered Set" of sequences?
A: The “Revered Set” is a smaller sequence set for users who just want long (>1350nt), non-chimeric, non-redundant sequences.
Q:
What is the best way for me to obtain full length (or nearly full length)
sequences for the archaea and bacteria from greengenes?
A:
The best way to get the sequences is using the Export function.
Step 1: Browse tree (http://greengenes.lbl.gov/cgi-bin/nph-browse.cgi).
On left side of page make sure "Hugenholtz" is chosen under "My
Taxonomy". If it is not, select it and press "Activate".
In the table presented, click the check-box for each Domain of interest
(or, alternatively, click on the Domain names themselves to reveal the phyla,
etc). Be sure to click the button "Make changes to My Interest
List" to record the nodes you have checked/unchecked.
Step 2: Export the sequences (http://greengenes.lbl.gov/cgi-bin/nph-export_records.cgi).
Make your selections under "Filters" and "Options".
Enter the email address where you want your sequences sent. Then click "Export
Now".
Use the "prokMSA format" to receive maximum
annotation. You may want to try it on a small scale at first by just
choosing one family. That way you can evaluate what type of information
is in each record.
Q:
I've noticed that some of my sequences are truncated while running through
greengenes. In what cases does this happen,
and is there any way to prevent it from happening?
A: This truncation occurs when your sequence is unable to align well to any ONE template sequence. This is one way to prevent possible chimeras from being included in a multiple sequence alignment, but the downside is that NAST (the aligner) is unable to align the full sequence in some cases. Future versions of NAST will allow this "safety" to be turned off. If you find that a sequence was truncated, you may want to test that sequence with Bellerophon or Pintail to see if it may be a chimera. Another possibility is that you have discovered a totally new 16S sequence (at phylum level?), in which case you should align it by hand or with a highly accurate program like ClustalW using 50 or more nearest neighbors.
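To find which sequences were truncated, one option is to compare the number of bases you submitted with the number of bases left in the aligned output. The sketch below assumes that both files are FASTA, that the record names match between them, and that '-' and '.' are the only gap characters; the function names are hypothetical and are not part of greengenes or NAST.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}; the name is the first header word."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def count_bases(seq: str) -> int:
    """Count characters that are not gap symbols (assumed to be '-' and '.')."""
    return sum(1 for ch in seq if ch not in "-.")


def find_truncated(submitted: dict, aligned: dict, tolerance: int = 0) -> dict:
    """Return {name: bases_lost} for records that lost more than `tolerance` bases."""
    flagged = {}
    for name, seq in submitted.items():
        if name not in aligned:
            continue  # not returned by the aligner at all; handle separately
        lost = count_bases(seq) - count_bases(aligned[name])
        if lost > tolerance:
            flagged[name] = lost
    return flagged


submitted = parse_fasta(">s1\nACGTACGT\n>s2\nACGT\n")
aligned = parse_fasta(">s1\n--ACGT--..\n>s2\nA-C-G-T\n")
print(find_truncated(submitted, aligned))
```

Records flagged this way are the ones worth checking with Bellerophon or Pintail, as suggested above.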
Q:
How do I update my current 16S database using greengenes?
A:
We recommend starting with the latest greengenes.arb database. Then align your personal sequences
into the greengenes alignment format and import
into ARB using the greengenes import filter (greengenes.ift). To update your local database periodically,
go to the export function, upload a list of the accession numbers or prokMSA ids that you have in your database, and choose "consider
sequences NOT found in uploaded list". This will return any new sequences
to you. Then import these into ARB using the greengenes.ift import filter. Remember that prokMSA ids are
used as the unique identifiers in the database, so when updating greengenes.arb from the website with public records, make
sure to select the “use old names” option in ARB when you import the sequences. When
importing your own (non-public) sequences aligned with the NAST aligner, use
the “create new names” option in ARB. Never use the
“generate new names” option in the species menu in ARB as this will erase
the prokMSA ids from the name field and create difficulties if
you want to overwrite existing records.
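
The list of ids to upload can be built from a FASTA export of your local database. The sketch below assumes that each FASTA header begins with the accession number or prokMSA id and that the upload accepts one id per line; both are assumptions, so check them against your own files and the export page. The helper names are hypothetical.

```python
def collect_ids(fasta_text: str) -> list:
    """Return the unique ids (first word of each header) in order of appearance."""
    ids = []
    seen = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts and parts[0] not in seen:
                seen.add(parts[0])
                ids.append(parts[0])
    return ids


def new_records(local_ids: list, exported_ids: list) -> list:
    """Return the exported ids that are not yet in the local database."""
    local = set(local_ids)
    return [rec_id for rec_id in exported_ids if rec_id not in local]


local_fasta = ">100 first record\nACGT\n>200\nGGCC\n>100 duplicate header\nAAAA\n"
local_ids = collect_ids(local_fasta)

# One id per line, ready to save to a file and upload (format is an assumption).
upload_text = "\n".join(local_ids) + "\n"
print(upload_text)

# Optional sanity check on what the export returned.
print(new_records(local_ids, ["200", "300"]))
```

The second function is only a local double check that the returned records really are new before you import them into ARB.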