Supplementary Datasets for: Vidulin, Smuc and Supek, Bioinformatics (under review) 2016.
These data sets contain the predicted gene functions for 5,133,543 genes from 2,071 microbial genomes. They were obtained in a single massive analysis that draws on five established genomic methodologies: phylogenetic profiles, conserved gene neighborhoods, remote homology patterns, protein biophysical properties and codon adaptation profiles.
In the prediction tables, columns are GO terms, rows are COGs/NOGs (from eggNOG v4), and each cell contains the precision score, which is equivalent to 1-FDR. Scores with precision < 0.10 (equivalently, FDR > 0.90) are written as 0.
In particular, the data sets encompass the following files:
Predictions - individual methods (5-8 MB per file).
predictions_Phyletic-profiles.txt.gz
predictions_Empirical-kernel-map.txt.gz
predictions_Conserved-gene-neighborhoods.txt.gz
predictions_Biophysical-protein-sequence-properties.txt.gz
predictions_Translation-efficiency-profiles.txt.gz
Predictions - integration schemes (approx. 8 MB per file).
predictions_integrated_Best-precision.txt.gz
predictions_integrated_Weighted-voting.txt.gz
predictions_integrated_One-vote.txt.gz
ID and mapping files.
Gene-identifiers-to-OGs-mapping.txt.gz (50 MB)
GO-terms-assigned-to-OGs_using-50-percent-rule.txt.gz (0.3 MB)
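The layout of the prediction tables described above (rows are COGs/NOGs, columns are GO terms, cells hold precision = 1-FDR, with 0 meaning no prediction) can be read along the lines of the following Python sketch. It is an illustration only, not part of the original analysis. It assumes the files are tab-delimited, that the first row lists the GO terms after one leading label cell, and that the first column holds the COG/NOG identifier; none of this is stated above and should be checked against the actual files. The function name read_predictions, the min_precision parameter and the sample rows are made up for the example.

```python
import csv
import gzip
import io


def read_predictions(handle, min_precision=0.5):
    """Yield (og_id, go_term, precision, fdr) for cells at or above min_precision.

    Assumed layout (not confirmed by the data set description):
    tab-delimited text, one header row whose first cell labels the OG column,
    one row per COG/NOG, one column per GO term.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    go_terms = header[1:]  # skip the assumed label cell above the OG column
    for row in reader:
        og_id = row[0]
        for go_term, cell in zip(go_terms, row[1:]):
            precision = float(cell)
            # A cell of 0 stands for precision < 0.10, i.e. no usable prediction.
            if precision > 0 and precision >= min_precision:
                yield og_id, go_term, precision, 1.0 - precision


# Tiny in-memory table with made-up values, in the assumed layout.
sample = (
    "OG\tGO:0006412\tGO:0008152\n"
    "COG0001\t0.75\t0\n"
    "NOG00002\t0.25\t0.5\n"
)
hits = list(read_predictions(io.StringIO(sample), min_precision=0.5))
for og_id, go_term, precision, fdr in hits:
    print(og_id, go_term, precision, fdr)

# For one of the real files, the same function could be applied to a gzip
# text stream, for example:
#   with gzip.open("predictions_integrated_Best-precision.txt.gz", "rt") as fh:
#       for og_id, go_term, precision, fdr in read_predictions(fh, 0.9):
#           ...
```

Streaming the rows instead of loading the whole table keeps memory use low, and the precision cutoff can be raised to retain only high-confidence predictions (for example, 0.9 corresponds to FDR 0.10).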
More details of the methodology will become available upon acceptance of the manuscript.