README.txt for:
HMP_strains_16S_aligned.fasta
HMP_strains_missing_16S.txt

#######

The goal of the HMP_strains_16S_aligned.fasta file is to collect at least one near full-length (>1250 bases) 16S rRNA gene from each bacterial (or, eventually, archaeal) strain in the HMP Strains project whose "Project Status" is beyond just "Targeted". Where a 16S gene is not yet available because the released contigs contain no 16S genes, a near neighbor is used as a placeholder. These placeholders can be quickly identified by the '!!!' included in their headers.

The intended use of this file is to "decorate" 16S trees with leaves already sequenced, or in the process of being sequenced, by the HMP. This may allow future strains to be selected based on a systematic scoring method, for instance. For this reason, placeholders are used rather than omitting the strain, so that users of this file are made aware of upcoming sequences. Another use could be as a small-scale database for comparing 16S NGS reads.

Methods:
The HMP strains list is read from a txt file maintained by Konstantinos Liolios at JGI.
The HMP public contigs are obtained from NCBI:
ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria/
The 16S genes are found and extracted with NAST.

The fasta header format of the output file is:
>[prokMSA_id a.k.a. greengenes_id] [prokMSAname/strain_name] [Genbank Accession of contig] [span of contig aligned] [placeholder info if necessary]

########

The goal of the HMP_strains_missing_16S.txt file is to alert users to HMP genome projects in which every contig is missing a full-length 16S gene. A strain is documented in this file only if a GENBANK ACCESSION number is provided for it in ~/kliolios/HMP/hmp.xls AND all of its contigs failed a full-length (>1250 bases) 16S search.

The tab-delimited txt file contains these columns:
organism_name: the strain's name
ncbi_acc: NCBI accession of the contig (one row per contig)
timestamp: when the search was performed at greengenes
message: why the search was unsuccessful
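
As an illustration of how the '!!!' placeholder marker and the >1250-base threshold might be used when reading HMP_strains_16S_aligned.fasta, here is a minimal Python sketch. It is not part of the HMP pipeline. The function names, the sample records, and the assumption that alignment gaps are written as '-' or '.' are all hypothetical; the sketch checks only for '!!!' anywhere in the header and does not try to split the header into its fields, since the field delimiter is not specified here.

```python
PLACEHOLDER_MARK = "!!!"   # marker described in this README
MIN_BASES = 1250           # near full-length threshold described in this README
GAP_CHARS = "-."           # assumption: gap characters used in the alignment


def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ungapped_length(seq):
    """Count sequence characters that are not alignment gaps."""
    return sum(1 for c in seq if c not in GAP_CHARS)


def summarize(handle):
    """Return one dict per record: header, placeholder flag, ungapped base count."""
    return [
        {
            "header": header,
            "placeholder": PLACEHOLDER_MARK in header,
            "bases": ungapped_length(seq),
        }
        for header, seq in read_fasta(handle)
    ]


# Hypothetical records, made up for illustration only (not real HMP data).
SAMPLE = """\
>0001 Example strain A ACC00001 1..1500
ACGT--ACGT..
AC
>0002 Example strain B ACC00002 1..1400 !!! placeholder
AC-GT
"""

if __name__ == "__main__":
    for row in summarize(SAMPLE.splitlines()):
        kind = "placeholder" if row["placeholder"] else "from HMP contigs"
        print(row["header"], "|", kind, "|", row["bases"], "bases")
```

To run this on the real file, pass an open file handle instead of the sample lines, for example `summarize(open("HMP_strains_16S_aligned.fasta"))`, and compare each `bases` value against `MIN_BASES` as needed.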