Functional Annotation

The funannotate2 annotate command adds functional annotation to gene models. It uses various databases for annotation, including Pfam, dbCAN, MEROPS, UniProtKB/Swiss-Prot, and BUSCOlite.

Basic Usage

funannotate2 annotate -i /path/to/funannotate2_predict_output -o /path/to/output_dir

Required Arguments

-i, --input-dir: Path to funannotate2 predict output directory
-f, --fasta: Genome in FASTA format (required if not using –input-dir)
-t, --tbl: Genome annotation in TBL format (required if not using –input-dir and not using –gff3)
-g, --gff3: Genome annotation in GFF3 format (required if not using –input-dir and not using –tbl)
-o, --out: Output folder name (required if not using –input-dir)

Optional Arguments

-a, --annotations: Annotations files, 3 column TSV [transcript-id, feature, data]
-s, --species: Species name, use quotes for binomial, e.g. “Aspergillus fumigatus”
-st, --strain: Strain/isolate name
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Volume to write tmp files (default: /tmp)
--curated-names: Path to custom file with gene-specific annotations (tab-delimited: gene_id annotation_type annotation_value)

Annotation Process

The annotate command performs the following steps:

Loads gene models from the input directory or specified files
Extracts protein sequences from the gene models
Searches the proteins against various databases:
- `Pfam-A <http://pfam.xfam.org/>`_: Identifies protein domains using pyhmmer
- `dbCAN <http://bcb.unl.edu/dbCAN2/>`_: Identifies carbohydrate-active enzymes (CAZymes) using pyhmmer
- `UniProtKB/Swiss-Prot <https://www.uniprot.org/>`_: Identifies protein function using diamond
- `MEROPS <https://www.ebi.ac.uk/merops/>`_: Identifies proteases using diamond
- `BUSCOlite <https://github.com/nextgenusfs/buscolite>`_: Identifies conserved orthologs
Cleans gene names and product descriptions using a curated database
Merges all annotations into the gene models
Outputs the annotated gene models in GFF3, TBL, and GenBank formats
Extracts protein and transcript sequences from the annotated gene models
Generates summary statistics for the annotation

Using Custom Curated Gene Names and Products

Funannotate2 includes a database of curated gene names and products that are used to ensure consistent and accurate annotation. However, you may want to use your own curated gene names and products for specific organisms or projects.

Creating a Custom Annotations File

Create a tab-delimited text file with gene IDs in the first column, annotation types in the second column, and annotation values in the third column:

# Custom annotations for specific genes/transcripts
gene123     name    ACT1
gene123     product Actin
gene456     name    CDC42
gene456     product Cell division control protein 42
gene789     go_term GO:0005524

A template file is available at funannotate2/data/custom_annotations.template.txt.

Using the Custom Annotations File

Use the --curated-names option to specify your custom file:

funannotate2 annotate -i /path/to/funannotate2_predict_output -o /path/to/output_dir --curated-names /path/to/custom_annotations.txt

How Custom Annotations are Used

Funannotate2 will load your custom annotations file
For each gene ID in your custom file, the specified annotations will be applied
For single-value annotations like name and product, custom values replace existing ones
For multi-value annotations like go_term and ec_number, custom values are added to existing ones
Custom annotations for name and product bypass the cleaning rules
This allows you to have precise control over important annotations while preserving other data

Supported Annotation Types

name: Gene name (e.g., ACT1, CDC42)
product: Product description (e.g., Actin, Cell division control protein 42)
note: Additional information about the gene
go_term: Gene Ontology term (e.g., GO:0005524)
ec_number: Enzyme Commission number (e.g., 3.6.4.13)
db_xref: Database cross-reference (e.g., UniProtKB:P60010)

Benefits of Using Custom Annotations

Precise control over annotations for specific genes
Override automated annotation for important genes
Add specialized annotations not available from automated sources
Ensure consistent annotation across multiple projects
Maintain control over the annotation of important genes

Output Files

The annotate command generates the following output files in the specified output directory:

<species>.gff3: Gene models in GFF3 format
<species>.tbl: Gene models in NCBI TBL format
<species>.gbk: Gene models in GenBank format
<species>.proteins.fa: Protein sequences in FASTA format
<species>.transcripts.fa: Transcript sequences in FASTA format
<species>.fasta: Genome sequence in FASTA format
<species>.summary.json: Summary statistics in JSON format
Gene2Products.need-curating.txt: Problematic gene names/products that need manual curation
Gene2Products.new-valid.txt: New valid gene names/products that could be added to the curated database

The annotate_misc directory contains intermediate files and detailed results from each annotation source:

pfam.results.json: Raw results from Pfam-A search
dbcan.results.json: Raw results from dbCAN search
uniprot-swissprot.results.json: Raw results from UniProtKB/Swiss-Prot search
merops.results.json: Raw results from MEROPS search
busco.results.json: Raw results from BUSCOlite search
annotations.pfam.tsv: Parsed annotations from Pfam-A search
annotations.dbcan.tsv: Parsed annotations from dbCAN search
annotations.uniprot-swissprot.tsv: Parsed annotations from UniProtKB/Swiss-Prot search
annotations.merops.tsv: Parsed annotations from MEROPS search
annotations.busco.tsv: Parsed annotations from BUSCOlite search
proteome.fasta: Protein sequences extracted from gene models