Functional Annotation
The funannotate2 annotate command adds functional annotation to gene models. It uses various databases for annotation, including Pfam, dbCAN, MEROPS, UniProtKB/Swiss-Prot, and BUSCOlite.
Basic Usage
funannotate2 annotate -i /path/to/funannotate2_predict_output -o /path/to/output_dir
Required Arguments
-i, --input-dir: Path to funannotate2 predict output directory-f, --fasta: Genome in FASTA format (required if not using –input-dir)-t, --tbl: Genome annotation in TBL format (required if not using –input-dir and not using –gff3)-g, --gff3: Genome annotation in GFF3 format (required if not using –input-dir and not using –tbl)-o, --out: Output folder name (required if not using –input-dir)
Optional Arguments
-a, --annotations: Annotations files, 3 column TSV [transcript-id, feature, data]-s, --species: Species name, use quotes for binomial, e.g. “Aspergillus fumigatus”-st, --strain: Strain/isolate name--cpus: Number of CPUs to use (default: 2)--tmpdir: Volume to write tmp files (default: /tmp)--curated-names: Path to custom file with gene-specific annotations (tab-delimited: gene_id annotation_type annotation_value)
Annotation Process
The annotate command performs the following steps:
Loads gene models from the input directory or specified files
Extracts protein sequences from the gene models
Searches the proteins against various databases:
`Pfam-A <http://pfam.xfam.org/>`_: Identifies protein domains using pyhmmer
`dbCAN <http://bcb.unl.edu/dbCAN2/>`_: Identifies carbohydrate-active enzymes (CAZymes) using pyhmmer
`UniProtKB/Swiss-Prot <https://www.uniprot.org/>`_: Identifies protein function using diamond
`MEROPS <https://www.ebi.ac.uk/merops/>`_: Identifies proteases using diamond
`BUSCOlite <https://github.com/nextgenusfs/buscolite>`_: Identifies conserved orthologs
Cleans gene names and product descriptions using a curated database
Merges all annotations into the gene models
Outputs the annotated gene models in GFF3, TBL, and GenBank formats
Extracts protein and transcript sequences from the annotated gene models
Generates summary statistics for the annotation
Using Custom Curated Gene Names and Products
Funannotate2 includes a database of curated gene names and products that are used to ensure consistent and accurate annotation. However, you may want to use your own curated gene names and products for specific organisms or projects.
Creating a Custom Annotations File
Create a tab-delimited text file with gene IDs in the first column, annotation types in the second column, and annotation values in the third column:
# Custom annotations for specific genes/transcripts
gene123 name ACT1
gene123 product Actin
gene456 name CDC42
gene456 product Cell division control protein 42
gene789 go_term GO:0005524
A template file is available at funannotate2/data/custom_annotations.template.txt.
Using the Custom Annotations File
Use the --curated-names option to specify your custom file:
funannotate2 annotate -i /path/to/funannotate2_predict_output -o /path/to/output_dir --curated-names /path/to/custom_annotations.txt
How Custom Annotations are Used
Funannotate2 will load your custom annotations file
For each gene ID in your custom file, the specified annotations will be applied
For single-value annotations like
nameandproduct, custom values replace existing onesFor multi-value annotations like
go_termandec_number, custom values are added to existing onesCustom annotations for
nameandproductbypass the cleaning rulesThis allows you to have precise control over important annotations while preserving other data
Supported Annotation Types
name: Gene name (e.g., ACT1, CDC42)product: Product description (e.g., Actin, Cell division control protein 42)note: Additional information about the genego_term: Gene Ontology term (e.g., GO:0005524)ec_number: Enzyme Commission number (e.g., 3.6.4.13)db_xref: Database cross-reference (e.g., UniProtKB:P60010)
Benefits of Using Custom Annotations
Precise control over annotations for specific genes
Override automated annotation for important genes
Add specialized annotations not available from automated sources
Ensure consistent annotation across multiple projects
Maintain control over the annotation of important genes
Output Files
The annotate command generates the following output files in the specified output directory:
<species>.gff3: Gene models in GFF3 format
<species>.tbl: Gene models in NCBI TBL format
<species>.gbk: Gene models in GenBank format
<species>.proteins.fa: Protein sequences in FASTA format
<species>.transcripts.fa: Transcript sequences in FASTA format
<species>.fasta: Genome sequence in FASTA format
<species>.summary.json: Summary statistics in JSON format
Gene2Products.need-curating.txt: Problematic gene names/products that need manual curation
Gene2Products.new-valid.txt: New valid gene names/products that could be added to the curated database
The annotate_misc directory contains intermediate files and detailed results from each annotation source:
pfam.results.json: Raw results from Pfam-A search
dbcan.results.json: Raw results from dbCAN search
uniprot-swissprot.results.json: Raw results from UniProtKB/Swiss-Prot search
merops.results.json: Raw results from MEROPS search
busco.results.json: Raw results from BUSCOlite search
annotations.pfam.tsv: Parsed annotations from Pfam-A search
annotations.dbcan.tsv: Parsed annotations from dbCAN search
annotations.uniprot-swissprot.tsv: Parsed annotations from UniProtKB/Swiss-Prot search
annotations.merops.tsv: Parsed annotations from MEROPS search
annotations.busco.tsv: Parsed annotations from BUSCOlite search
proteome.fasta: Protein sequences extracted from gene models