Training Ab Initio Gene Predictors

The funannotate2 train command trains the ab initio gene predictors Augustus, SNAP, and GlimmerHMM on your genome and produces species-specific parameters for each. This step is optional but recommended for best results, especially for non-model organisms.

Basic Usage

funannotate2 train -f genome.fasta -s "Aspergillus fumigatus" -o train_results

Required Arguments

-f, --fasta: Input genome FASTA file
-s, --species: Species name (e.g., "Aspergillus fumigatus")
-o, --out: Output folder name

Optional Arguments

-t, --training-set: Training set to use in GFF3 format
--strain: Strain/isolate name
--cpus: Number of CPUs to use (default: 2)
--optimize-augustus: Run Augustus-mediated optimized training; not recommended (default: False)
--header-len: Maximum length for FASTA headers (default: 100)
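
For scripted pipelines, the arguments above can be assembled programmatically. The following is a minimal sketch and not part of funannotate2: build_train_command is a hypothetical helper, and it uses only flags listed in this section.

```python
import shlex


def build_train_command(fasta, species, out, training_set=None,
                        strain=None, cpus=None):
    """Assemble a `funannotate2 train` argument list (hypothetical helper)."""
    # Required arguments, as documented above.
    cmd = ["funannotate2", "train", "-f", fasta, "-s", species, "-o", out]
    # Optional arguments are appended only when a value is supplied.
    if training_set is not None:
        cmd += ["-t", training_set]
    if strain is not None:
        cmd += ["--strain", strain]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    return cmd


cmd = build_train_command("genome.fasta", "Aspergillus fumigatus",
                          "train_results", cpus=8)
# shlex.join quotes the species name so the printed line is safe to paste.
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems with species names that contain spaces.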

Training Process

The train command performs the following steps:

Prepare the genome:
- The genome is loaded and quality checks are performed
- Headers are checked to ensure they are not too long
- Contigs are checked for non-IUPAC characters
Generate a training set (if not provided):
- BUSCOlite is used to identify conserved single-copy orthologs in the genome
- The identified orthologs are used to create a training set of gene models
- The training set is filtered to remove problematic gene models
Split the training set:
- The training set is split into separate train and test sets
- The test set is used only to evaluate the performance of the trained models
Train Augustus:
- Augustus is trained using the training set
- The training process generates species-specific parameters for Augustus
- The trained parameters are evaluated using the test set
Train SNAP:
- SNAP is trained using the training set
- The training process generates species-specific parameters for SNAP
- The trained parameters are evaluated using the test set
Train GlimmerHMM:
- GlimmerHMM is trained using the training set
- The training process generates species-specific parameters for GlimmerHMM
- The trained parameters are evaluated using the test set
Save the trained parameters:
- The trained parameters for all tools are saved in a JSON file
- The JSON file can be used with the predict command
- The trained parameters can also be added to the funannotate2 database for future use (see the species command under Using Trained Parameters)
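
The genome checks in the first step can be approximated before running the command. The sketch below is an illustration under stated assumptions, not the funannotate2 implementation: check_fasta is a hypothetical function, the contig name is assumed to be the header text up to the first whitespace, and the default limit of 100 mirrors the documented --header-len default.

```python
# IUPAC nucleotide codes (uppercase); anything else is reported.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def check_fasta(lines, max_header_len=100):
    """Return (long_headers, bad_contigs) for an iterable of FASTA lines."""
    long_headers = []
    bad_contigs = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the contig name ends at the first whitespace.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if len(name) > max_header_len:
                long_headers.append(name)
        else:
            # Collect characters outside the IUPAC nucleotide alphabet.
            unexpected = set(line.upper()) - IUPAC_NUCLEOTIDES
            if unexpected:
                bad_contigs.setdefault(name, set()).update(unexpected)
    return long_headers, bad_contigs


example = [">contig_1", "ACGTNNRY", ">contig_2", "ACGT*XACGT"]
long_headers, bad_contigs = check_fasta(example)
print(long_headers, bad_contigs)
```

To check a real file, pass an open file handle, for example check_fasta(open("genome.fasta")). An empty list and an empty dictionary mean neither problem was found.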

Output Files

The train command generates the following output files in the specified output directory:

params.json: JSON file containing the trained parameters for all tools
training-models.train.gff3: The gene models used to train the predictors
training-models.test.gff3: The gene models set aside for evaluation
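
To see how many models ended up in each set, the feature types in the two GFF3 files can be tallied. This is a minimal sketch: count_feature_types is a hypothetical helper, and it assumes standard nine-column GFF3 in which each model is recorded as a gene feature, which this page does not confirm.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 feature types (column 3) from an iterable of lines."""
    counts = Counter()
    for line in gff3_lines:
        # Stop at an embedded FASTA section, if the file has one.
        if line.startswith("##FASTA"):
            break
        # Skip directives, comments, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##gff-version 3\n",
    "contig_1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "contig_1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=g1-T1;Parent=g1\n",
    "contig_1\tsrc\tCDS\t100\t900\t.\t+\t0\tID=g1-T1.cds;Parent=g1-T1\n",
]
counts = count_feature_types(example)
print(counts["gene"])
```

For real output, pass an open file handle for each file in the output directory and compare the gene counts of the train and test sets.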

The train_misc directory contains intermediate files and detailed results from the training process:

augustus/: Directory containing Augustus training files
snap/: Directory containing SNAP training files
glimmerhmm/: Directory containing GlimmerHMM training files
busco/: Directory containing BUSCOlite results (if used to generate the training set)
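
The internal layout of params.json is not described on this page, so inspecting a file from your own run is the safest way to learn what it contains. The sketch below makes no assumption about the schema: summarize_params is a hypothetical helper that only lists the top-level keys and the type of each value.

```python
import json
import sys


def summarize_params(path):
    """Map each top-level key of a JSON file to the type name of its value."""
    with open(path) as handle:
        data = json.load(handle)
    # Guard against a file whose root is not a JSON object.
    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python summarize_params.py train_results/params.json
    for key, kind in summarize_params(sys.argv[1]).items():
        print(f"{key}: {kind}")
```

Because the function only reads the file, it is safe to run against the params.json that will later be passed to the predict command.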

Using Trained Parameters

The trained parameters can be used with the predict command in two ways:

Using the params.json file:

funannotate2 predict -f genome.fasta -o predict_results -p train_results/params.json -s "Aspergillus fumigatus"

Using the species name (if the parameters have been saved in the funannotate2 database):

funannotate2 predict -f genome.fasta -o predict_results -p aspergillus_fumigatus -s "Aspergillus fumigatus"

To save the trained parameters in the funannotate2 database, use the species command:

funannotate2 species -l train_results/params.json

This will make the trained parameters available for future use with the species name as the identifier.