The data in this directory was derived from the Greengenes database, and assembled into QIIME-compatible file in the Rob Knight Lab at the University of Colorado at Boulder. Questions? Contact the Qiime Forum at http://groups.google.com/group/qiime-forum. Find information on Greengenes at http://greengenes.lbl.gov/ and information on QIIME at http://qiime.sourceforge.net. ==== Data in this directory: notes.txt : this file otu_map: OTU maps (generated by pick_otus.py) rep_set: Representative sequences for our 97% OTUs, in NAST-aligned and unaligned formats. If you're running assign_taxonomy.py with BLAST, you can use gg_97_otus_6oct2010.fasta as your reference database (-r). You can also use this file to pick otus against with BLAST or uclust_ref. The sequence identifiers in this file are the prokMSA ids in greengenes. taxonomies: id-to-taxonomy maps for seven different taxonomies that are used to annotate the greengenes records. Any of these files can be used as your id-to-taxonomy map (-t) in assign_taxonomy.py -- we prefer the new Greengenes taxonomy (otu_id_to_greengenes.txt) trees: the conservative greengenes tree (sel4cni.inf.masked.ft.rooted) and a pruned version which contains only the tips in our 97% OTUs (gg_97_otus_6oct2010.tre). If you pick otus using BLAST or uclust_ref with no new clusters, you can use gg_97_otus_6oct2010.tre as the tree for phylogenetic diversity calculations. ==== Specific commands used to pick OTUs. masked sequences were used last time to pick otus. we are redoing with unmasked sequences. 1) take all seq ids from sel4cni $ grep "^>" sel4cni.inf.masked | sed -e 's/^>//' > sel4cni.ids 2) unalign and dereplicate raw greengenes sequences [wwwuser@growler gg_otus]$ python Python 2.6.4 (r264:75706, Jan 28 2010, 15:52:45) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> unmasked = [] >>> for l in open('inflated_sub_fields.txt'): ... fields = l.strip().split('\t') ... unmasked.append((fields[0],fields[-1].replace("-",'').replace(".",''))) ... >>> derepped = dict([(seq,id_) for id_, seq in unmasked]) >>> len(derepped) 452013 >>> len(unmasked) 508195 >>> output = [] >>> for seq,id_ in derepped.items(): ... output.append(">%s\n%s" % (id_, seq)) ... >>> f = open('all_gg.unmasked.unaligned.dereplicated','w') >>> f.write('\n'.join(output)) >>> f.close() 3) pull out sel4cni unmasked sequences [wwwuser@growler gg_otus]$ python Python 2.6.4 (r264:75706, Jan 28 2010, 15:52:45) [GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> sel4cni_ids = [l.strip() for l in open('sel4cni.ids')] >>> seqs = {} >>> from cogent.parse.fasta import MinimalFastaParser >>> seqs = dict(MinimalFastaParser(open('all_gg.unmasked.unaligned.dereplicated'))) >>> len(seqs) 452013 >>> output = [] >>> sel4cni_ids = set(sel4cni_ids) >>> for id_, seq in seqs.items(): ... if id_ in sel4cni_ids: ... output.append('>%s\n%s' % (id_, seq)) ... >>> len(output) 356666 >>> len(sel4cni_ids) 408135 >>> foo = {} >>> for id_, seq in seqs.items(): ... if id_ in sel4cni_ids: ... foo[id_] = seq ... >>> len(foo) 356666 >>> derep = dict([(seq, id_) for id_,seq in foo.items()]) >>> len(derep) 356666 >>> f = open('sel4cni.unmasked.unaligned.dereplicated','w') >>> f.write('\n'.join(output)) >>> f.close() 4) run through presort echo "python $HOME/gg_otus/QiimeUtils/gg_otus/presort_gg_fasta.py -i $HOME/gg_otus/inflated_sub_fields.txt -f $HOME/gg_otus/sel4cni.unmasked.unaligned.dereplicated -o $HOME/gg_otus/sel4cni.unmasked.unaligned.dereplicated.sorted.fasta" | qsub -k oe -N ggpresort -q amdq 5) run through uclust echo "python /home/wwwuser/software/Qiime/scripts/pick_otus.py -i /home/wwwuser/gg_otus/sel4cni.unmasked.unaligned.dereplicated.sorted.fasta -s 0.97 -o /home/wwwuser/gg_otus/uclust_otus_97_sel4cni_unmasked/ --uclust_otu_id_prefix gg_otu_6oct2010 -DB" | qsub -k oe -N ggOTU97 -q amdq 6) pick rep set python /home/wwwuser/software/Qiime/scripts/pick_rep_set.py -m first -i otu_map/otus.txt -o rep_set/rep_set.fasta -f ../sel4cni.unmasked.unaligned.dereplicated.sorted.fasta 7) rename seqs 8) extract aligned sequences filter_fasta.py -f /home/caporaso/data/gg_may2010_raw/inflated_sub_fields_aligned.fasta -o gg_97_otus_6oct2010_aligned.fasta -a gg_97_otus_6oct2010.fasta 9) replace all Us with Ts caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ ls total 361M -rw-rw-r-- 1 caporaso caporaso 305M Oct 12 16:48 gg_97_otus_6oct2010_aligned.fasta -rw-rw-r-- 1 caporaso caporaso 57M Oct 11 16:41 gg_97_otus_6oct2010.fasta caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ fg ipython [gg_otus_6oct2010/rep_set]|8> g = open('gg_97_otus_6oct2010_T.fasta','w') [gg_otus_6oct2010/rep_set]|9> for seq_id, seq in MinimalFastaParser(open('./gg_97_otus_6oct2010.fasta')): ..: g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T'))) ..: ..: [gg_otus_6oct2010/rep_set]|10> g.close() [gg_otus_6oct2010/rep_set]|11> g = open('gg_97_otus_6oct2010_aligned_T.fasta','w') [gg_otus_6oct2010/rep_set]|12> for seq_id, seq in MinimalFastaParser(open('./gg_97_otus_6oct2010_')): g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T'))) ./gg_97_otus_6oct2010_T.fasta ./gg_97_otus_6oct2010_aligned_T.fasta ./gg_97_otus_6oct2010_aligned.fasta [gg_otus_6oct2010/rep_set]|12> for seq_id, seq in MinimalFastaParser(open('./gg_97_otus_6oct2010_a')): g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T'))) ./gg_97_otus_6oct2010_aligned.fasta ./gg_97_otus_6oct2010_aligned_T.fasta [gg_otus_6oct2010/rep_set]|12> for seq_id, seq in MinimalFastaParser(open('./gg_97_otus_6oct2010_aligned.fasta')): g.write('>%s\n%s\n' % (seq_id, seq.replace('U','T'))) ...: ...: [gg_otus_6oct2010/rep_set]|14> g.close() [gg_otus_6oct2010/rep_set]|15> [1]+ Stopped ipython caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ ls total 722M -rw-rw-r-- 1 caporaso caporaso 305M Oct 12 16:48 gg_97_otus_6oct2010_aligned.fasta -rw-rw-r-- 1 caporaso caporaso 305M Oct 13 14:37 gg_97_otus_6oct2010_aligned_T.fasta -rw-rw-r-- 1 caporaso caporaso 57M Oct 11 16:41 gg_97_otus_6oct2010.fasta -rw-rw-r-- 1 caporaso caporaso 57M Oct 13 14:36 gg_97_otus_6oct2010_T.fasta caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ count_seqs * 41513 : gg_97_otus_6oct2010_aligned.fasta 41513 : gg_97_otus_6oct2010_aligned_T.fasta 41513 : gg_97_otus_6oct2010.fasta 41513 : gg_97_otus_6oct2010_T.fasta 166052 : Total caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ mv gg_97_otus_6oct2010_T.fasta gg_97_otus_6oct2010.fasta mv: overwrite `gg_97_otus_6oct2010.fasta'? y caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ mv gg_97_otus_6oct2010_aligned_T.fasta gg_97_otus_6oct2010_aligned.fasta mv: overwrite `gg_97_otus_6oct2010_aligned.fasta'? y caporaso@growler:~/data/gg_otus_6oct2010/rep_set$ ls total 361M -rw-rw-r-- 1 caporaso caporaso 305M Oct 13 14:37 gg_97_otus_6oct2010_aligned.fasta -rw-rw-r-- 1 caporaso caporaso 57M Oct 13 14:36 gg_97_otus_6oct2010.fasta # justin k 24 oct 2010 10) make subset of tree, retaining only tips in the 97% otu rep set: from cogent.core.tree import PhyloNode, TreeError from cogent.parse.fasta import MinimalFastaParser from cogent.parse.tree import DndParser tree = DndParser(open('sel4cni.inf.masked.ft.rooted', 'U'),PhyloNode) seqsf = open('rep_set/gg_97_otus_6oct2010.fasta','U') seqs = MinimalFastaParser(seqsf) labs = [rec[0] for rec in seqs] len(labs) len(seqs) labs[0] t2 = tree.getSubTree(labs) len(t2.tips()) t2.writeToFile('gg_97_otus_6oct2010.tre')