Training Ab Initio Gene Predictors
The funannotate2 train command trains ab initio gene prediction algorithms to improve gene prediction accuracy. This step is optional but recommended for best results, especially for non-model organisms.
Basic Usage
funannotate2 train -f genome.fasta -s "Aspergillus fumigatus" -o train_results
Required Arguments
-f, --fasta: Input genome FASTA file-s, --species: Species name (e.g., “Aspergillus fumigatus”)-o, --out: Output folder name
Optional Arguments
-t, --training-set: Training set to use in GFF3 format--strain: Strain/isolate name--cpus: Number of CPUs to use (default: 2)--optimize-augustus: Run Augustus mediated optimized training (not recommended) (default: False)--header-len: Max length for fasta headers (default: 100)
Training Process
The train command performs the following steps:
Prepare the genome:
The genome is loaded and quality checks are performed
Headers are checked to ensure they are not too long
Contigs are checked for non-IUPAC characters
Generate a training set (if not provided):
BUSCOlite is used to identify conserved single-copy orthologs in the genome
The identified orthologs are used to create a training set of gene models
The training set is filtered to remove problematic gene models
Split the training set:
The training set is split into test and train sets
The test set is used to evaluate the performance of the trained models
Train Augustus:
Augustus is trained using the training set
The training process generates species-specific parameters for Augustus
The trained parameters are evaluated using the test set
Train SNAP:
SNAP is trained using the training set
The training process generates species-specific parameters for SNAP
The trained parameters are evaluated using the test set
Train GlimmerHMM:
GlimmerHMM is trained using the training set
The training process generates species-specific parameters for GlimmerHMM
The trained parameters are evaluated using the test set
Save the trained parameters:
The trained parameters for all tools are saved in a JSON file
The JSON file can be used with the
predictcommandThe trained parameters are also saved in the funannotate2 database for future use
Output Files
The train command generates the following output files in the specified output directory:
params.json: JSON file containing the trained parameters for all tools
training-models.train.gff3: The training set used for training
training-models.test.gff3: The test set used for evaluation
The train_misc directory contains intermediate files and detailed results from the training process:
augustus/: Directory containing Augustus training files
snap/: Directory containing SNAP training files
glimmerhmm/: Directory containing GlimmerHMM training files
busco/: Directory containing BUSCOlite results (if used to generate the training set)
Using Trained Parameters
The trained parameters can be used with the predict command in two ways:
Using the params.json file:
funannotate2 predict -f genome.fasta -o predict_results -p train_results/params.json -s "Aspergillus fumigatus"
Using the species name (if the parameters have been saved in the funannotate2 database):
funannotate2 predict -f genome.fasta -o predict_results -p aspergillus_fumigatus -s "Aspergillus fumigatus"
To save the trained parameters in the funannotate2 database, use the species command:
funannotate2 species -l train_results/params.json
This will make the trained parameters available for future use with the species name as the identifier.