Usage

Funannotate2 provides a command-line interface for annotating eukaryotic genomes. The workflow typically consists of the following steps:

Clean the genome assembly
Train ab initio gene prediction algorithms (optional)
Predict genes
Functionally annotate the predicted genes

Command-Line Interface

Funannotate2 provides several commands for different stages of the annotation process:

funannotate2 <command> <options>

Available commands:

clean: Find and remove duplicated contigs, sort by size, rename headers
train: Train ab initio gene prediction algorithms
predict: Predict primary gene models in a eukaryotic genome
annotate: Add functional annotation to gene models
install: Install and manage databases
species: View and manage trained species models

Each command has its own set of options. Use funannotate2 <command> --help to see the available options for each command.

Genome Cleaning

The first step in the annotation process is to clean and prepare the genome assembly. The clean command sorts contigs by size, identifies and removes duplicated contigs, and optionally renames contig headers.

funannotate2 clean -f genome.fasta -o cleaned_genome.fasta

Required arguments:

-f, --fasta: Input genome FASTA file
-o, --out: Output cleaned genome FASTA file

Optional arguments:

-p, --pident: Percent identity threshold for identifying duplicated contigs (default: 95)
-c, --cov: Coverage threshold for identifying duplicated contigs (default: 95)
-m, --minlen: Minimum contig length to keep (default: 500)
-r, --rename: Rename contigs with this basename (e.g., “scaffold_”)
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Directory for temporary files
--exhaustive: Check every contig for duplication instead of stopping at the N50 contig (default: False)
--logfile: Write logs to file
--debug: Print debugging output (default: False)

How it works:

The genome is loaded and contigs are sorted by length
Contigs smaller than the minimum length are filtered out
Starting with the smallest contigs, each contig is aligned against all larger contigs using minimap2
If a contig is found to be duplicated elsewhere in the genome (based on percent identity and coverage thresholds), it is marked for removal
By default, the process stops at the N50 contig size to save time, as larger contigs are less likely to be duplicated
If the --exhaustive option is used, all contigs are checked for duplication
If the --rename option is used, contigs are renamed with the provided basename followed by a number (e.g., “scaffold_1”, “scaffold_2”, etc.)
The cleaned genome is written to the output file
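
The length-based parts of this procedure can be sketched in Python. This is a minimal illustration, not funannotate2's implementation: the function names and the header-to-sequence dictionary are hypothetical, the minimap2 alignment step is omitted, and treating contigs strictly shorter than the N50 as the default candidates is an assumption about where the cutoff falls.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def prepare_contigs(contigs, minlen=500, basename=None):
    """Drop short contigs, sort longest first, and optionally rename.

    contigs: dict mapping header -> sequence (hypothetical input shape).
    Returns a list of (header, sequence) tuples.
    """
    kept = [(h, s) for h, s in contigs.items() if len(s) >= minlen]
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    if basename:
        # Number contigs from 1 in size order, e.g. scaffold_1, scaffold_2
        kept = [(f"{basename}{i}", s) for i, (_, s) in enumerate(kept, start=1)]
    return kept


def duplication_candidates(contigs):
    """Return contigs shorter than the N50, smallest first.

    Assumption: this mirrors the default (non-exhaustive) behaviour only
    loosely; the real tool aligns each candidate against larger contigs.
    """
    cutoff = n50([len(s) for _, s in contigs])
    small = [(h, s) for h, s in contigs if len(s) < cutoff]
    return sorted(small, key=lambda item: len(item[1]))
```

With --exhaustive, the candidate list would simply be every contig rather than only those below the cutoff.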

Training Ab Initio Gene Predictors

Before predicting genes, you can train ab initio gene prediction algorithms to improve accuracy. This step is optional but recommended for best results, especially for non-model organisms. You can re-use training data by passing a pretrained species slug or a params.json file to the predict command with -p.

funannotate2 train -f cleaned_genome.fasta -s "Aspergillus nidulans" -o anid_f2

Required arguments:

-f, --fasta: Input genome FASTA file
-s, --species: Species name (e.g., “Aspergillus fumigatus”)
-o, --out: Output folder name

Optional arguments:

-t, --training-set: Training set to use in GFF3 format
--strain: Strain/isolate name
--cpus: Number of CPUs to use (default: 2)
--optimize-augustus: Run Augustus-mediated optimized training (not recommended) (default: False)
--header-len: Max length for fasta headers (default: 100)

How it works:

If no training set is provided, BUSCOlite is used to identify conserved single-copy orthologs in the genome
The identified orthologs are used to create a training set of gene models
The training set is filtered and split into test and train sets
Augustus, SNAP, and GlimmerHMM are trained using the training set
The trained parameters are saved in a JSON file that can be used with the predict command
The trained parameters are also saved in the funannotate2 database for future use
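
The test/train split mentioned above can be illustrated with a short Python sketch. This is a generic reproducible hold-out split, not funannotate2's code: the function name, the n_test and seed parameters, and the rule of never holding out more than half the models are all assumptions made for the example.

```python
import random


def split_training_set(gene_ids, n_test=100, seed=1):
    """Shuffle gene model IDs and split them into (train, test) lists.

    n_test and seed are hypothetical parameters chosen for illustration.
    """
    ids = sorted(gene_ids)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_test = min(n_test, len(ids) // 2)  # never hold out more than half
    return ids[n_test:], ids[:n_test]
```

Holding out a test set lets each trained predictor be evaluated on gene models it has not seen during training.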

Gene Prediction

After cleaning the genome and optionally training ab initio gene predictors, you can predict genes:

funannotate2 predict -i anid_f2

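Required arguments (the examples in this document pass only -i when reusing an existing funannotate2 output directory):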

-i, --input-dir: funannotate2 output directory
-f, --fasta: Input genome FASTA file (softmasked repeats)
-o, --out: Output folder name
-p, --params, --pretrained: Params.json or pretrained species slug. Use funannotate2 species to see pretrained species
-s, --species: Species name (e.g., “Aspergillus fumigatus”)

Optional arguments:

-st, --strain: Strain/isolate name (e.g., “Af293”)
-e, --external: External gene models/annotation in GFF3 format
-w, --weights: Gene predictors and weights
-ps, --proteins: Proteins to use for evidence
-ts, --transcripts: Transcripts to use for evidence
-c, --cpus: Number of CPUs to use (default: 2)
-mi, --max-intron: Maximum intron length (default: 3000)
-hl, --header-len: Max length for fasta headers (default: 100)
-l, --locus-tag: Locus tag for genes, perhaps assigned by NCBI, e.g. VC83 (default: FUN2_)
-n, --numbering: Specify start of gene numbering (default: 1)
--tmpdir: Volume to write tmp files (default: /tmp)

How it works:

The genome is analyzed for assembly statistics and softmasked regions
If the genome is not softmasked, pytantan is used to quickly softmask repeats
If protein evidence is provided, it is aligned to the genome using miniprot
If transcript evidence is provided, it is aligned to the genome using gapmm2
Evidence alignments are converted to hints for augustus
Ab initio gene predictors (Augustus, GeneMark [optional], SNAP, GlimmerHMM) are run on the genome
tRNAscan-SE is run to identify tRNA genes
The GFFtk consensus module is used to generate consensus gene models from all evidence and ab initio predictions
The consensus gene models are filtered and annotated
The final gene models are output in GFF3, TBL, and GenBank formats
Protein and transcript sequences are extracted from the gene models
Summary statistics are generated for the annotation
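
To make the softmasking check in the first two steps concrete, here is a small Python sketch that measures how much of a genome is lowercase. It is only an illustration: the function name and input shape are hypothetical, and how funannotate2 itself decides that a genome "is not softmasked" is not described here.

```python
def softmasked_fraction(sequences):
    """Return the fraction of bases that are lowercase (softmasked).

    sequences: iterable of sequence strings (hypothetical input shape).
    """
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0
```

A result of 0.0 would suggest the assembly has not been softmasked at all.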

Functional Annotation

After predicting genes, you can functionally annotate them:

funannotate2 annotate -i anid_f2

Required arguments:

-i, --input-dir: funannotate2 output directory
-f, --fasta: Genome in FASTA format (required if not using --input-dir)
-t, --tbl: Genome annotation in TBL format (required if not using --input-dir and not using --gff3)
-g, --gff3: Genome annotation in GFF3 format (required if not using --input-dir and not using --tbl)
-o, --out: Output folder name (required if not using --input-dir)

Optional arguments:

-a, --annotations: Annotation files, 3-column TSV [transcript-id, feature, data]
-s, --species: Species name (e.g., “Aspergillus fumigatus”)
-st, --strain: Strain/isolate name
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Volume to write tmp files (default: /tmp)
--curated-names: Path to custom file with gene-specific annotations (tab-delimited: gene_id annotation_type annotation_value)

How it works:

The gene models are loaded from the input directory or specified files
Protein sequences are extracted from the gene models
The proteins are searched against various databases for functional annotation:
- Pfam-A database using pyhmmer for protein domains
- dbCAN database using pyhmmer for carbohydrate-active enzymes (CAZymes)
- UniProtKB/Swiss-Prot database using diamond for protein function
- MEROPS database using diamond for proteases
- BUSCOlite for conserved orthologs
Gene names and product descriptions are cleaned using a curated database
Custom annotations can be provided to override automatic cleaning
The annotations are merged into the gene models
The annotated gene models are output in GFF3, TBL, and GenBank formats
Protein and transcript sequences are extracted from the annotated gene models
Summary statistics are generated for the annotation

For more details on using custom curated gene names and products, see the Functional Annotation page.
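
As a rough illustration of the tab-delimited layout expected by --curated-names (gene_id, annotation_type, annotation_value), the following Python sketch writes such a file. The gene IDs and the "name" and "product" type labels are hypothetical placeholders, not values confirmed by this page; check the Functional Annotation page for the accepted annotation types.

```python
import csv
import io


def write_curated_names(rows, handle):
    """Write (gene_id, annotation_type, annotation_value) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene_id, annotation_type, annotation_value in rows:
        writer.writerow([gene_id, annotation_type, annotation_value])


# Hypothetical rows: the IDs and the "name"/"product" labels are placeholders.
rows = [
    ("FUN2_000001", "name", "ACT1"),
    ("FUN2_000001", "product", "Actin"),
]
buffer = io.StringIO()
write_curated_names(rows, buffer)
print(buffer.getvalue(), end="")
```

Replacing the in-memory buffer with an opened file (for example open("custom_annotations.txt", "w", newline="")) would produce a file that can be passed on the command line.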

Example Workflow

Here’s an example workflow for annotating a fungal genome:

# Clean the genome
funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta -m 1000 -r scaffold_

# Train ab initio gene predictors (optional)
funannotate2 train -f cleaned_genome.fasta -s "Aspergillus fumigatus" -o f2_output --strain "Af293" --cpus 16

# Predict genes using trained parameters
funannotate2 predict -i f2_output -ps uniprot_fungi.fasta -ts rnaseq_transcripts.fasta --cpus 16

# Or predict genes using pretrained species
funannotate2 predict -i f2_output -p aspergillus_fumigatus -ps uniprot_fungi.fasta -ts rnaseq_transcripts.fasta --cpus 16

# Functionally annotate genes
funannotate2 annotate -i f2_output --cpus 16

# Add custom gene/product annotations (optional)
funannotate2 annotate -i f2_output --cpus 16 --curated-names custom_annotations.txt

For more detailed examples and explanations, see the Tutorial page.
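
If you prefer to drive the workflow from a script, the commands above can be assembled as argument lists. The sketch below is one possible wrapper, not part of funannotate2: the function name and its parameters are hypothetical, it only uses flags shown in this document, and it prints the commands rather than running them.

```python
import shlex


def workflow_commands(raw_fasta, species, out_dir, cpus=2, proteins=None, transcripts=None):
    """Return the clean/train/predict/annotate commands as argument lists."""
    cleaned = "cleaned_genome.fasta"
    predict = ["funannotate2", "predict", "-i", out_dir, "--cpus", str(cpus)]
    if proteins:
        predict += ["-ps", proteins]
    if transcripts:
        predict += ["-ts", transcripts]
    return [
        ["funannotate2", "clean", "-f", raw_fasta, "-o", cleaned],
        ["funannotate2", "train", "-f", cleaned, "-s", species, "-o", out_dir, "--cpus", str(cpus)],
        predict,
        ["funannotate2", "annotate", "-i", out_dir, "--cpus", str(cpus)],
    ]


for command in workflow_commands("raw_genome.fasta", "Aspergillus fumigatus", "f2_output", cpus=16):
    # To execute instead of print, pass each list to subprocess.run(command, check=True)
    print(shlex.join(command))
```

Keeping each command as a list avoids shell-quoting problems with values such as species names that contain spaces.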

Database Installation

Funannotate2 requires several databases for gene prediction and functional annotation. You can install these databases using the install command:

funannotate2 install -d all

Required arguments:

-d, --db: Databases to install [all,merops,uniprot,dbCAN,pfam,go,mibig,interpro,gene2product,mito]

Optional arguments:

-s, --show: Show currently installed databases (default: False)
-w, --wget: Use wget for downloading (default: False)
-f, --force: Force re-download/re-install of all databases (default: False)
-u, --update: Update databases if change detected (default: False)

How it works:

The command checks for the $FUNANNOTATE2_DB environment variable, which should point to the directory where databases will be installed
If the specified databases are already installed, the command will skip them unless --force or --update is used
The databases are downloaded from their respective sources and processed for use with funannotate2
A record of installed databases is kept in the funannotate-db-info.json file in the database directory
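
The environment variable and record file described above can be inspected from Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, and because the internal layout of funannotate-db-info.json is not described here, the code simply returns whatever JSON the file contains.

```python
import json
import os
from pathlib import Path


def installed_databases(env=None):
    """Return the parsed funannotate-db-info.json, or {} if the file is absent.

    Raises KeyError if FUNANNOTATE2_DB is not set in the environment.
    """
    env = os.environ if env is None else env
    db_dir = Path(env["FUNANNOTATE2_DB"])
    info_file = db_dir / "funannotate-db-info.json"
    if not info_file.is_file():
        return {}
    with info_file.open() as handle:
        return json.load(handle)
```

For routine use, funannotate2 install --show reports the same information without any scripting.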

Managing Trained Species

Funannotate2 maintains a database of trained species parameters for gene prediction. You can view and manage these species using the species command:

funannotate2 species

Optional arguments:

-l, --load: Load a new species with a *.params.json file
-d, --delete: Delete a species from database
-f, --format: Format to show existing species in (default: table)

How it works:

Without arguments, the command lists all trained species in the database
With --load, the command adds a new species to the database from a params.json file (typically generated by the train command)
With --delete, the command removes a species from the database
The --format option controls how the species list is displayed (table, json, or yaml)