Usage
Funannotate2 provides a command-line interface for annotating eukaryotic genomes. The workflow typically consists of the following steps:
Clean the genome assembly
Train ab initio gene prediction algorithms (optional)
Predict genes
Functionally annotate the predicted genes
Command-Line Interface
Funannotate2 provides several commands for different stages of the annotation process:
funannotate2 <command> <options>
Available commands:
clean: Find and remove duplicated contigs, sort by size, rename headers
train: Train ab initio gene prediction algorithms
predict: Predict primary gene models in a eukaryotic genome
annotate: Add functional annotation to gene models
install: Install and manage databases
species: View and manage trained species models
Each command has its own set of options. Use funannotate2 <command> --help to see the available options for each command.
Genome Cleaning
The first step in the annotation process is to clean and prepare the genome assembly. The clean command sorts contigs by size, identifies and removes duplicated contigs, and optionally renames contig headers.
funannotate2 clean -f genome.fasta -o cleaned_genome.fasta
Required arguments:
-f, --fasta: Input genome FASTA file
-o, --out: Output cleaned genome FASTA file
Optional arguments:
-p, --pident: Percent identity threshold for identifying duplicated contigs (default: 95)
-c, --cov: Coverage threshold for identifying duplicated contigs (default: 95)
-m, --minlen: Minimum contig length to keep (default: 500)
-r, --rename: Rename contigs with this basename (e.g., “scaffold_”)
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Directory for temporary files
--exhaustive: Compute every contig, else stop at N50 (default: False)
--logfile: Write logs to file
--debug: Debug the output (default: False)
How it works:
The genome is loaded and contigs are sorted by length
Contigs smaller than the minimum length are filtered out
Starting with the smallest contigs, each contig is aligned against all larger contigs using minimap2
If a contig is found to be duplicated elsewhere in the genome (based on percent identity and coverage thresholds), it is marked for removal
By default, the process stops at the N50 contig size to save time, as larger contigs are less likely to be duplicated
If the --exhaustive option is used, all contigs are checked for duplication
If the --rename option is used, contigs are renamed with the provided basename followed by a number (e.g., “scaffold_1”, “scaffold_2”, etc.)
The cleaned genome is written to the output file
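The size filter, N50 cutoff, and renaming steps above can be sketched in a few lines of Python. This is a minimal illustration, not funannotate2's own code: the function names (n50, plan_clean, rename_contigs) are hypothetical, no minimap2 alignment is performed, and the rule that only contigs shorter than the N50 are checked is an assumption about where the default cutoff falls.

```python
def n50(lengths):
    """Return the N50: the contig length at which the running total,
    summed from the largest contig down, reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def plan_clean(contigs, minlen=500, exhaustive=False):
    """Filter by minimum length, sort by size, and pick contigs to check.

    contigs: dict mapping header -> sequence.
    Returns (ordered, candidates): `ordered` is a list of (header, sequence)
    sorted largest first; `candidates` lists the headers that would be
    checked for duplication, smallest first.
    """
    kept = [(h, s) for h, s in contigs.items() if len(s) >= minlen]
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    cutoff = n50([len(s) for _, s in kept])
    # Assumption: without --exhaustive, only contigs shorter than the N50 are checked.
    candidates = [h for h, s in reversed(kept) if exhaustive or len(s) < cutoff]
    return kept, candidates


def rename_contigs(ordered, basename):
    """Rename contigs as basename + running number, largest contig first."""
    return [(f"{basename}{i}", s) for i, (_, s) in enumerate(ordered, start=1)]


# Toy assembly; headers and sequences are made up for illustration.
demo = {"a": "A" * 5000, "b": "C" * 3000, "c": "G" * 1200, "d": "T" * 300}
ordered, candidates = plan_clean(demo, minlen=500)
renamed = rename_contigs(ordered, "scaffold_")
print(candidates)
print([name for name, _ in renamed])
```

In the toy assembly, contig "d" falls below the minimum length and is dropped, and the largest contig alone reaches half the assembly size, so only the two smaller contigs are queued for the duplication check.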
Training Ab Initio Gene Predictors
Before predicting genes, you can train ab initio gene prediction algorithms to improve accuracy. This step is optional but recommended for best results, especially for non-model organisms. You can re-use training data by passing a pretrained species slug or a params.json file.
funannotate2 train -f cleaned_genome.fasta -s "Aspergillus nidulans" -o anid_f2
Required arguments:
-f, --fasta: Input genome FASTA file
-s, --species: Species name (e.g., “Aspergillus fumigatus”)
-o, --out: Output folder name
Optional arguments:
-t, --training-set: Training set to use in GFF3 format
--strain: Strain/isolate name
--cpus: Number of CPUs to use (default: 2)
--optimize-augustus: Run Augustus mediated optimized training (not recommended) (default: False)
--header-len: Max length for fasta headers (default: 100)
How it works:
If no training set is provided, BUSCOlite is used to identify conserved single-copy orthologs in the genome
The identified orthologs are used to create a training set of gene models
The training set is filtered and split into test and train sets
Augustus, SNAP, and GlimmerHMM are trained using the training set
The trained parameters are saved in a JSON file that can be used with the predict command
The trained parameters are also saved in the funannotate2 database for future use
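The test/train split mentioned above can be illustrated with a short Python sketch. The split fraction, the fixed seed, and the function name split_models are assumptions made for this example; the text above does not state how funannotate2 itself divides the training set.

```python
import random


def split_models(model_ids, test_fraction=0.2, seed=1):
    """Shuffle gene model IDs reproducibly and split them into train and test lists.

    test_fraction and seed are illustrative defaults, not funannotate2 settings.
    """
    ids = sorted(model_ids)          # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction)) if len(ids) > 1 else 0
    return ids[n_test:], ids[:n_test]


# Hypothetical gene model IDs.
models = [f"gene_{i}" for i in range(1, 11)]
train, test = split_models(models)
print(len(train), "training models,", len(test), "test models")
```

Holding out a test set lets the trained predictors be evaluated on gene models they did not see during training.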
Gene Prediction
After cleaning the genome and optionally training ab initio gene predictors, you can predict genes:
funannotate2 predict -i anid_f2
Required arguments:
-i, --input-dir: funannotate2 output directory
-f, --fasta: Input genome FASTA file (softmasked repeats)
-o, --out: Output folder name
-p, --params, --pretrained: Params.json or pretrained species slug. Use funannotate2 species to see pretrained species
-s, --species: Species name (e.g., “Aspergillus fumigatus”)
Optional arguments:
-st, --strain: Strain/isolate name (e.g., “Af293”)
-e, --external: External gene models/annotation in GFF3 format
-w, --weights: Gene predictors and weights
-ps, --proteins: Proteins to use for evidence
-ts, --transcripts: Transcripts to use for evidence
-c, --cpus: Number of CPUs to use (default: 2)
-mi, --max-intron: Maximum intron length (default: 3000)
-hl, --header-len: Max length for fasta headers (default: 100)
-l, --locus-tag: Locus tag for genes, perhaps assigned by NCBI, e.g. VC83 (default: FUN2_)
-n, --numbering: Specify start of gene numbering (default: 1)
--tmpdir: Volume to write tmp files (default: /tmp)
How it works:
The genome is analyzed for assembly statistics and softmasked regions
If the genome is not softmasked, pytantan is used to quickly softmask repeats
If protein evidence is provided, it is aligned to the genome using miniprot
If transcript evidence is provided, it is aligned to the genome using gapmm2
Evidence alignments are converted to hints for Augustus
Ab initio gene predictors (Augustus, GeneMark [optional], SNAP, GlimmerHMM) are run on the genome
tRNAscan-SE is run to identify tRNA genes
The GFFtk consensus module is used to generate consensus gene models from all evidence and ab initio predictions
The consensus gene models are filtered and annotated
The final gene models are output in GFF3, TBL, and GenBank formats
Protein and transcript sequences are extracted from the gene models
Summary statistics are generated for the annotation
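Soft-masking, as referenced in the first two steps, conventionally means that repeat regions are written in lowercase. A rough way to see whether a genome is already soft-masked is to measure the fraction of lowercase bases. The helper below is a hypothetical sketch of that check and is not taken from funannotate2.

```python
def softmasked_fraction(sequences):
    """Return the fraction of bases written in lowercase across all sequences."""
    total = sum(len(seq) for seq in sequences)
    masked = sum(1 for seq in sequences for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy sequences: 8 of the 20 bases are lowercase.
demo_sequences = ["ACGTacgtACGT", "ttttGGGG"]
fraction = softmasked_fraction(demo_sequences)
print(f"{fraction:.2%} of bases are soft-masked")
```

A fraction of zero suggests the assembly has not been masked, which is the situation in which the predict command falls back to pytantan.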
Functional Annotation
After predicting genes, you can functionally annotate them:
funannotate2 annotate -i anid_f2
Required arguments:
-i, --input-dir: funannotate2 output directory
-f, --fasta: Genome in FASTA format (required if not using --input-dir)
-t, --tbl: Genome annotation in TBL format (required if not using --input-dir and not using --gff3)
-g, --gff3: Genome annotation in GFF3 format (required if not using --input-dir and not using --tbl)
-o, --out: Output folder name (required if not using --input-dir)
Optional arguments:
-a, --annotations: Annotations files, 3 column TSV [transcript-id, feature, data]
-s, --species: Species name (e.g., “Aspergillus fumigatus”)
-st, --strain: Strain/isolate name
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Volume to write tmp files (default: /tmp)
--curated-names: Path to custom file with gene-specific annotations (tab-delimited: gene_id annotation_type annotation_value)
How it works:
The gene models are loaded from the input directory or specified files
Protein sequences are extracted from the gene models
The proteins are searched against various databases for functional annotation:
- Pfam-A database using pyhmmer for protein domains
- dbCAN database using pyhmmer for carbohydrate-active enzymes (CAZymes)
- UniProtKB/Swiss-Prot database using diamond for protein function
- MEROPS database using diamond for proteases
- BUSCOlite for conserved orthologs
Gene names and product descriptions are cleaned using a curated database
Custom annotations can be provided to override automatic cleaning
The annotations are merged into the gene models
The annotated gene models are output in GFF3, TBL, and GenBank formats
Protein and transcript sequences are extracted from the annotated gene models
Summary statistics are generated for the annotation
For more details on using custom curated gene names and products, see the Functional Annotation page.
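The -a/--annotations option expects a 3-column TSV of [transcript-id, feature, data]. The sketch below shows one way to read and sanity-check such a file before passing it to the annotate command. The transcript IDs and feature names in the example data are hypothetical, and skipping blank lines and lines starting with "#" is an assumption of this sketch rather than documented behavior.

```python
import csv
import io
from collections import defaultdict


def read_annotations(handle):
    """Group 3-column TSV rows [transcript-id, feature, data] by transcript ID."""
    grouped = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Assumption: blank lines and "#" comment lines are ignored.
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        transcript_id, feature, data = row
        grouped[transcript_id].append((feature, data))
    return dict(grouped)


# Hypothetical example content; real IDs depend on the locus tag in use.
example = io.StringIO(
    "FUN2_000001-T1\tproduct\thypothetical kinase\n"
    "FUN2_000001-T1\tnote\tmanually reviewed\n"
    "FUN2_000002-T1\tname\tabcA\n"
)
annotations = read_annotations(example)
for transcript_id, entries in annotations.items():
    print(transcript_id, entries)
```

Validating the column count up front catches rows that were split on spaces instead of tabs, a common cause of malformed annotation files.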
Example Workflow
Here’s an example workflow for annotating a fungal genome:
# Clean the genome
funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta -m 1000 -r scaffold_
# Train ab initio gene predictors (optional)
funannotate2 train -f cleaned_genome.fasta -s "Aspergillus fumigatus" -o f2_output --strain "Af293" --cpus 16
# Predict genes using trained parameters
funannotate2 predict -i f2_output -ps uniprot_fungi.fasta -ts rnaseq_transcripts.fasta --cpus 16
# Or predict genes using pretrained species
funannotate2 predict -i f2_output -p aspergillus_fumigatus -ps uniprot_fungi.fasta -ts rnaseq_transcripts.fasta --cpus 16
# Functionally annotate genes
funannotate2 annotate -i f2_output --cpus 16
# Add custom gene/product annotations (optional)
funannotate2 annotate -i f2_output --cpus 16 --curated-names custom_annotations.txt
For more detailed examples and explanations, see the Tutorial page.
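For scripted runs, the same sequence of commands can be assembled in Python and executed with subprocess. The sketch below only uses flags shown in the example above; the function names build_workflow and run_workflow and the dry-run default are choices made for this illustration, and it assumes funannotate2 is on the PATH when dry_run is turned off.

```python
import shlex
import subprocess


def build_workflow(raw_fasta, species, outdir, cpus=16):
    """Return the clean, train, predict, and annotate commands as argument lists."""
    cleaned = "cleaned_genome.fasta"
    cpu_args = ["--cpus", str(cpus)]
    return [
        ["funannotate2", "clean", "-f", raw_fasta, "-o", cleaned],
        ["funannotate2", "train", "-f", cleaned, "-s", species, "-o", outdir] + cpu_args,
        ["funannotate2", "predict", "-i", outdir] + cpu_args,
        ["funannotate2", "annotate", "-i", outdir] + cpu_args,
    ]


def run_workflow(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the workflow at the first failing step.
            subprocess.run(cmd, check=True)


commands = build_workflow("raw_genome.fasta", "Aspergillus fumigatus", "f2_output")
run_workflow(commands)  # dry run: prints the commands without executing them
```

Passing arguments as lists avoids shell quoting problems with species names that contain spaces.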
Database Installation
Funannotate2 requires several databases for gene prediction and functional annotation. You can install these databases using the install command:
funannotate2 install -d all
Required arguments:
-d, --db: Databases to install [all,merops,uniprot,dbCAN,pfam,go,mibig,interpro,gene2product,mito]
Optional arguments:
-s, --show: Show currently installed databases (default: False)
-w, --wget: Use wget for downloading (default: False)
-f, --force: Force re-download/re-install of all databases (default: False)
-u, --update: Update databases if change detected (default: False)
How it works:
The command checks for the $FUNANNOTATE2_DB environment variable, which should point to the directory where databases will be installed
If the specified databases are already installed, the command will skip them unless --force or --update is used
The databases are downloaded from their respective sources and processed for use with funannotate2
A record of installed databases is kept in the funannotate-db-info.json file in the database directory
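Before running an install or an annotation job, it can be useful to confirm that $FUNANNOTATE2_DB is set and to peek at the record file. The sketch below does this with the standard library only. The layout of funannotate-db-info.json is not described above, so treating it as a JSON object keyed by database name is an assumption, and the demo file contents are made up.

```python
import json
import os
import tempfile
from pathlib import Path


def installed_databases(env=None):
    """List the top-level keys of funannotate-db-info.json in $FUNANNOTATE2_DB."""
    env = os.environ if env is None else env
    db_dir = env.get("FUNANNOTATE2_DB")
    if not db_dir:
        raise RuntimeError("FUNANNOTATE2_DB is not set")
    info = Path(db_dir) / "funannotate-db-info.json"
    if not info.is_file():
        return []  # nothing recorded yet
    with info.open() as handle:
        record = json.load(handle)
    # Assumption: the record is a JSON object keyed by database name.
    return sorted(record) if isinstance(record, dict) else []


# Demo against a temporary directory with a made-up record file.
with tempfile.TemporaryDirectory() as tmp:
    demo_record = {"pfam": {}, "merops": {}}
    (Path(tmp) / "funannotate-db-info.json").write_text(json.dumps(demo_record))
    found = installed_databases({"FUNANNOTATE2_DB": tmp})
print(found)
```

For the authoritative view of what is installed, the --show option of the install command remains the supported route.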
Managing Trained Species
Funannotate2 maintains a database of trained species parameters for gene prediction. You can view and manage these species using the species command:
funannotate2 species
Optional arguments:
-l, --load: Load a new species with a *.params.json file
-d, --delete: Delete a species from database
-f, --format: Format to show existing species in (default: table)
How it works:
Without arguments, the command lists all trained species in the database
With --load, the command adds a new species to the database from a params.json file (typically generated by the train command)
With --delete, the command removes a species from the database
The --format option controls how the species list is displayed (table, json, or yaml)