The data in this directory was derived from the Greengenes database, and assembled into QIIME-compatible file in the Rob Knight Lab at the University of Colorado at Boulder.

Questions? Contact the Qiime Forum at http://groups.google.com/group/qiime-forum.

Find information on Greengenes at http://greengenes.lbl.gov/ and information on QIIME at http://qiime.sourceforge.net.

====
Data in this directory:
 
 notes.txt : this file
 otu_map: OTU maps (generated by pick_otus.py)

 rep_set: Representative sequences for our 97% OTUs, in NAST-aligned and unaligned formats. If you're running assign_taxonomy.py with BLAST, you can use gg_97_otus_6oct2010.fasta as your reference database (-r). You can also use this file to pick otus against with BLAST or uclust_ref. The sequence identifiers in this file are the prokMSA ids in greengenes.
 
 taxonomies: id-to-taxonomy maps for seven different taxonomies that are used to annotate the greengenes records. Any of these files can be used as your id-to-taxonomy map (-t) in assign_taxonomy.py -- we prefer the new Greengenes taxonomy (otu_id_to_greengenes.txt)
 
 trees: the conservative greengenes tree (sel4cni.inf.masked.ft.rooted) and a pruned version which contains only the tips in our 97% OTUs (gg_97_otus_6oct2010.tre). If you pick otus using BLAST or uclust_ref with no new clusters, you can use gg_97_otus_6oct2010.tre as the tree for phylogenetic diversity calculations.


====

Specific commands used to pick OTUs.

masked sequences were used last time to pick otus. we are redoing with unmasked sequences.

1) take all seq ids from sel4cni

$ grep "^>" sel4cni.inf.masked | sed -e 's/^>//' > sel4cni.ids

2) unalign and dereplicate raw greengenes sequences

[wwwuser@growler gg_otus]$ python
Python 2.6.4 (r264:75706, Jan 28 2010, 15:52:45) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unmasked = []
>>> for l in open('inflated_sub_fields.txt'):
...   fields = l.strip().split('\t')
...   unmasked.append((fields[0],fields[-1].replace("-",'').replace(".",'')))
... 
>>> derepped = dict([(seq,id_) for id_, seq in unmasked])
>>> len(derepped)
452013
>>> len(unmasked)
508195
>>> output = []
>>> for seq,id_ in derepped.items():
...   output.append(">%s\n%s" % (id_, seq))
... 
>>> f = open('all_gg.unmasked.unaligned.dereplicated','w')
>>> f.write('\n'.join(output))
>>> f.close()

3) pull out sel4cni unmasked sequences

[wwwuser@growler gg_otus]$ python
Python 2.6.4 (r264:75706, Jan 28 2010, 15:52:45) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> sel4cni_ids = [l.strip() for l in open('sel4cni.ids')]
>>> seqs = {}
>>> from cogent.parse.fasta import MinimalFastaParser
>>> seqs = dict(MinimalFastaParser(open('all_gg.unmasked.unaligned.dereplicated')))
>>> len(seqs)
452013
>>> output = []
>>> sel4cni_ids = set(sel4cni_ids)
>>> for id_, seq in seqs.items():
...   if id_ in sel4cni_ids:
...     output.append('>%s\n%s' % (id_, seq))
... 
>>> len(output)
356666
>>> len(sel4cni_ids)
408135
>>> foo = {}
>>> for id_, seq in seqs.items():
...   if id_ in sel4cni_ids:
...     foo[id_] = seq
... 
>>> len(foo)
356666
>>> derep = dict([(seq, id_) for id_,seq in foo.items()])
>>> len(derep)
356666
>>> f = open('sel4cni.unmasked.unaligned.dereplicated','w')
>>> f.write('\n'.join(output))
>>> f.close()

4) run through presort

echo "python $HOME/gg_otus/QiimeUtils/gg_otus/presort_gg_fasta.py -i $HOME/gg_otus/inflated_sub_fields.txt -f $HOME/gg_otus/sel4cni.unmasked.unaligned.dereplicated -o $HOME/gg_otus/sel4cni.unmasked.unaligned.dereplicated.sorted.fasta" | qsub -k oe -N ggpresort -q amdq

5) run through uclust

echo "python /home/wwwuser/software/Qiime/scripts/pick_otus.py -i /home/wwwuser/gg_otus/sel4cni.unmasked.unaligned.dereplicated.sorted.fasta -s 0.97 -o /home/wwwuser/gg_otus/uclust_otus_97_sel4cni_unmasked/ --uclust_otu_id_prefix gg_otu_6oct2010 -DB" | qsub -k oe -N ggOTU97 -q amdq

6) pick rep set

python /home/wwwuser/software/Qiime/scripts/pick_rep_set.py -m first -i otu_map/otus.txt -o rep_set/rep_set.fasta -f ../sel4cni.unmasked.unaligned.dereplicated.sorted.fasta

7) rename seqs

<missing notes>

8) extract aligned sequences

filter_fasta.py -f /home/caporaso/data/gg_may2010_raw/inflated_sub_fields_aligned.fasta -o gg_97_otus_6oct2010_aligned.fasta -a gg_97_otus_6oct2010.fasta 

9) replace all Us with Ts

caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ ls
total 361M
-rw-rw-r-- 1 caporaso caporaso 305M Oct 12 16:48
gg_97_otus_6oct2010_aligned.fasta
-rw-rw-r-- 1 caporaso caporaso  57M Oct 11 16:41 gg_97_otus_6oct2010.fasta
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ fg
ipython


[gg_otus_6oct2010/rep_set]|8> g = open('gg_97_otus_6oct2010_T.fasta','w')

[gg_otus_6oct2010/rep_set]|9> for seq_id, seq in
MinimalFastaParser(open('./gg_97_otus_6oct2010.fasta')):
                          ..:     g.write('>%s\n%s\n' % (seq_id,
seq.replace('U','T')))
                          ..:     
                          ..:     

[gg_otus_6oct2010/rep_set]|10> g.close()
[gg_otus_6oct2010/rep_set]|11> g =
open('gg_97_otus_6oct2010_aligned_T.fasta','w')
[gg_otus_6oct2010/rep_set]|12> for seq_id, seq in
MinimalFastaParser(open('./gg_97_otus_6oct2010_')):
    g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T')))
./gg_97_otus_6oct2010_T.fasta          ./gg_97_otus_6oct2010_aligned_T.fasta
./gg_97_otus_6oct2010_aligned.fasta    

[gg_otus_6oct2010/rep_set]|12> for seq_id, seq in
MinimalFastaParser(open('./gg_97_otus_6oct2010_a')):
    g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T')))
./gg_97_otus_6oct2010_aligned.fasta    ./gg_97_otus_6oct2010_aligned_T.fasta

[gg_otus_6oct2010/rep_set]|12> for seq_id, seq in
MinimalFastaParser(open('./gg_97_otus_6oct2010_aligned.fasta')):
    g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T')))
                          ...:     
                          ...:     

[gg_otus_6oct2010/rep_set]|14> g.close()

[gg_otus_6oct2010/rep_set]|15> 
[1]+  Stopped                 ipython
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ 
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ 
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ ls
total 722M
-rw-rw-r-- 1 caporaso caporaso 305M Oct 12 16:48
gg_97_otus_6oct2010_aligned.fasta
-rw-rw-r-- 1 caporaso caporaso 305M Oct 13 14:37
gg_97_otus_6oct2010_aligned_T.fasta
-rw-rw-r-- 1 caporaso caporaso  57M Oct 11 16:41 gg_97_otus_6oct2010.fasta
-rw-rw-r-- 1 caporaso caporaso  57M Oct 13 14:36 gg_97_otus_6oct2010_T.fasta
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ count_seqs *

41513  : gg_97_otus_6oct2010_aligned.fasta
41513  : gg_97_otus_6oct2010_aligned_T.fasta
41513  : gg_97_otus_6oct2010.fasta
41513  : gg_97_otus_6oct2010_T.fasta
166052  : Total
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ mv
gg_97_otus_6oct2010_T.fasta gg_97_otus_6oct2010.fasta
mv: overwrite `gg_97_otus_6oct2010.fasta'? y
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ mv
gg_97_otus_6oct2010_aligned_T.fasta gg_97_otus_6oct2010_aligned.fasta
mv: overwrite `gg_97_otus_6oct2010_aligned.fasta'? y
caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ ls
total 361M
-rw-rw-r-- 1 caporaso caporaso 305M Oct 13 14:37
gg_97_otus_6oct2010_aligned.fasta
-rw-rw-r-- 1 caporaso caporaso  57M Oct 13 14:36 gg_97_otus_6oct2010.fasta

# justin k 24 oct 2010
10) make subset of tree, retaining only tips in the 97% otu rep set:

from cogent.core.tree import PhyloNode, TreeError
from cogent.parse.fasta import MinimalFastaParser
from cogent.parse.tree import DndParser
tree = DndParser(open('sel4cni.inf.masked.ft.rooted', 'U'),PhyloNode)


seqsf = open('rep_set/gg_97_otus_6oct2010.fasta','U')
seqs = MinimalFastaParser(seqsf)
labs = [rec[0] for rec in seqs]
len(labs)
len(seqs)
labs[0]
t2 = tree.getSubTree(labs)

len(t2.tips())
t2.writeToFile('gg_97_otus_6oct2010.tre')