In prokaryotic genomes, functionally coupled genes can be organized in conserved clusters of neighboring genes, called operons, which enables their coordinated regulation. It is therefore possible to predict the function of uncharacterised genes by analysing the functional annotations within their neighborhoods. Here, we present an algorithm that gives insight into the genomic neighborhoods of a query protein family by calculating the statistical significance of the overrepresentation of functional domains in those neighborhoods. The analysis is performed on a set of 88 542 microbial genomes from the IMG database, version 2.0, dated 2021. You can select how much of the database to use for the analysis: the default is the whole database, but you can restrict the analysis to reference genomes only.
An example of using this tool, with each step described, is available HERE
The number of Pfam domains in our database can be seen here: Statistics
The analysis can be carried out in three ways.
Input - option 1:
Query is a gene list (IMG-JGI identifiers, one per line, e.g. 2264882819). It is assumed that input genes are homologs, but this assumption is not checked. Neighborhoods of all the genes from the list are analysed.
Gene list as query
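The gene list format described above (one identifier per line) can be illustrated with a minimal Python sketch. It is not the tool's own code. The function name parse_gene_list is hypothetical, and the check that identifiers are purely numeric is an assumption based only on the example identifier 2264882819.

```python
def parse_gene_list(text):
    """Split a pasted gene list into unique identifiers and rejected tokens.

    Assumption: identifiers are purely numeric, like the example 2264882819.
    Homology between the genes is NOT checked, mirroring the text above.
    """
    ids = []
    rejected = []
    seen = set()
    for raw_line in text.splitlines():
        token = raw_line.strip()
        if not token:
            continue  # ignore blank lines
        if not (token.isascii() and token.isdigit()):
            rejected.append(token)  # not a numeric identifier
            continue
        if token not in seen:  # drop duplicates, keep first-seen order
            seen.add(token)
            ids.append(token)
    return ids, rejected


if __name__ == "__main__":
    pasted = "2264882819\n\n 2264882820 \nnot_an_id\n2264882819\n"
    gene_ids, bad_tokens = parse_gene_list(pasted)
    print("accepted:", gene_ids)
    print("rejected:", bad_tokens)
```

Returning the rejected tokens separately, rather than raising an error, lets a caller report problem lines while still analysing the valid identifiers.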
Input - option 2:
Query is a Pfam domain (Pfam ID, e.g. PF02696).
Here, the results can be averaged on the taxonomic levels of genus or family, or not averaged ("all database" option).
Pfam domain as query (genus level)
Pfam domain as query (family level)
Pfam domain as query (all database)
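The three averaging modes can be pictured with a short Python sketch. The text does not specify exactly which quantity is averaged or how, so everything here is an assumption made for illustration: the record layout (a dict with "genus", "family" and "density" keys), the function name average_by_taxon, and the choice to average within each taxon first and then across taxa.

```python
from collections import defaultdict
from statistics import mean


def average_by_taxon(records, level=None):
    """Average a per-genome value, optionally grouping by taxonomic level.

    records: list of dicts with hypothetical keys "genus", "family", "density".
    level:   "genus", "family", or None for the "all database" mode.
    """
    if level is None:
        # No grouping: every genome contributes equally.
        return mean(r["density"] for r in records)
    if level not in ("genus", "family"):
        raise ValueError("level must be 'genus', 'family' or None")
    groups = defaultdict(list)
    for r in records:
        groups[r[level]].append(r["density"])
    # Average within each taxon first, then across taxa, so that a heavily
    # sequenced genus or family does not dominate the result.
    return mean(mean(values) for values in groups.values())


if __name__ == "__main__":
    toy_records = [
        {"genus": "A", "family": "F", "density": 0.2},
        {"genus": "A", "family": "F", "density": 0.4},
        {"genus": "B", "family": "F", "density": 0.9},
    ]
    for lvl in (None, "genus", "family"):
        print(lvl, average_by_taxon(toy_records, lvl))
```

In this toy data, genus A has two genomes and genus B has one, so the genus-level average weights B more heavily than the ungrouped average does.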
Input - option 3:
Query is a protein sequence. Homologous proteins are searched for with the DIAMOND software, using a selected cutoff value. The hits from this search are collected and their neighborhoods are analysed.
For each Pfam domain encoded in at least one analysed neighborhood, the algorithm calculates the local density of this domain (the number of its occurrences within the neighborhoods divided by the number of genes within the neighborhoods) and its global density (the number of all occurrences of this Pfam domain in the genome divided by the number of genes in the whole genome). A statistical test determines whether the local density is significantly higher than the global density. A significantly higher local density means that the domain is encoded in the query genomic neighborhoods more often than would be expected by chance.
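The comparison of local and global density can be sketched in Python as follows. The text does not name the statistical test, so this sketch assumes a one-sided binomial test, which may differ from the test actually used. The function names binom_sf and domain_enrichment and the example counts are hypothetical.

```python
import math


def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def domain_enrichment(local_count, local_genes, global_count, global_genes):
    """Compare local and global density of one Pfam domain.

    local_count / local_genes:   occurrences and genes within the neighborhoods
    global_count / global_genes: occurrences and genes in the whole genome
    Assumed test: one-sided binomial, with the global density as the null rate.
    """
    local_density = local_count / local_genes
    global_density = global_count / global_genes
    p_value = binom_sf(local_count, local_genes, global_density)
    return local_density, global_density, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30 occurrences among 400 neighborhood genes,
    # against 50 occurrences among 4000 genes genome-wide.
    local, global_, p = domain_enrichment(30, 400, 50, 4000)
    print(f"local={local:.4f} global={global_:.4f} p={p:.3g}")
```

A small p-value here indicates that the domain occurs in the neighborhoods more often than the genome-wide rate would predict. Because many domains are tested at once, a multiple-testing correction would normally be applied to such p-values.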