Training Ab Initio Gene Predictors
==================================

The ``funannotate2 train`` command trains ab initio gene prediction algorithms to improve gene prediction accuracy. This step is optional but recommended for best results, especially for non-model organisms.

Basic Usage
-----------

.. code-block:: bash

    funannotate2 train -f genome.fasta -s "Aspergillus fumigatus" -o train_results

Required Arguments
------------------

* ``-f, --fasta``: Input genome FASTA file
* ``-s, --species``: Species name (e.g., "Aspergillus fumigatus")
* ``-o, --out``: Output folder name

Optional Arguments
------------------

* ``-t, --training-set``: Training set to use in GFF3 format
* ``--strain``: Strain/isolate name
* ``--cpus``: Number of CPUs to use (default: 2)
* ``--optimize-augustus``: Run Augustus mediated optimized training (not recommended) (default: False)
* ``--header-len``: Max length for fasta headers (default: 100)

Training Process
----------------

The ``train`` command performs the following steps:

1. **Prepare the genome**:

   * The genome is loaded and quality checks are performed
   * Headers are checked to ensure they are not too long
   * Contigs are checked for non-IUPAC characters

2. **Generate a training set** (if not provided):

   * BUSCOlite is used to identify conserved single-copy orthologs in the genome
   * The identified orthologs are used to create a training set of gene models
   * The training set is filtered to remove problematic gene models

3. **Split the training set**:

   * The training set is split into test and train sets
   * The test set is used to evaluate the performance of the trained models

4. **Train Augustus**:

   * Augustus is trained using the training set
   * The training process generates species-specific parameters for Augustus
   * The trained parameters are evaluated using the test set

5. **Train SNAP**:

   * SNAP is trained using the training set
   * The training process generates species-specific parameters for SNAP
   * The trained parameters are evaluated using the test set

6. **Train GlimmerHMM**:

   * GlimmerHMM is trained using the training set
   * The training process generates species-specific parameters for GlimmerHMM
   * The trained parameters are evaluated using the test set

7. **Save the trained parameters**:

   * The trained parameters for all tools are saved in a JSON file
   * The JSON file can be used with the ``predict`` command
   * The trained parameters are also saved in the funannotate2 database for future use

Output Files
------------

The ``train`` command generates the following output files in the specified output directory:

* **params.json**: JSON file containing the trained parameters for all tools
* **training-models.train.gff3**: The training set used for training
* **training-models.test.gff3**: The test set used for evaluation

The ``train_misc`` directory contains intermediate files and detailed results from the training process:

* **augustus/**: Directory containing Augustus training files
* **snap/**: Directory containing SNAP training files
* **glimmerhmm/**: Directory containing GlimmerHMM training files
* **busco/**: Directory containing BUSCOlite results (if used to generate the training set)

Using Trained Parameters
------------------------

The trained parameters can be used with the ``predict`` command in two ways:

1. **Using the params.json file**:

   .. code-block:: bash

       funannotate2 predict -f genome.fasta -o predict_results -p train_results/params.json -s "Aspergillus fumigatus"

2. **Using the species name** (if the parameters have been saved in the funannotate2 database):

   .. code-block:: bash

       funannotate2 predict -f genome.fasta -o predict_results -p aspergillus_fumigatus -s "Aspergillus fumigatus"

To save the trained parameters in the funannotate2 database, use the ``species`` command:

.. code-block:: bash

    funannotate2 species -l train_results/params.json

This will make the trained parameters available for future use with the species name as the identifier.
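
The commands on this page can be chained into a single script. The following is a minimal sketch, not an official workflow: it uses only the flags documented above, while the ``run`` helper, the ``run_pipeline`` function, the ``DRY_RUN`` switch, and the environment variable names are hypothetical conveniences introduced for illustration. By default the script only prints the commands so they can be reviewed before anything is executed.

.. code-block:: bash

    #!/usr/bin/env bash
    # Sketch: train, register, and predict using only the flags shown on this page.
    set -euo pipefail

    # Inputs; the defaults mirror the examples above and can be overridden
    # from the environment (variable names are assumptions of this sketch).
    GENOME="${GENOME:-genome.fasta}"
    SPECIES="${SPECIES:-Aspergillus fumigatus}"
    TRAIN_DIR="${TRAIN_DIR:-train_results}"
    PREDICT_DIR="${PREDICT_DIR:-predict_results}"
    CPUS="${CPUS:-2}"
    DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 prints commands, 0 also runs them

    run() {
        # Print the command in shell-quoted form; execute it only when DRY_RUN=0.
        printf '%q ' "$@"
        printf '\n'
        if [ "$DRY_RUN" = "0" ]; then
            "$@"
        fi
    }

    run_pipeline() {
        # 1. Train the ab initio predictors; params.json is written to TRAIN_DIR.
        run funannotate2 train -f "$GENOME" -s "$SPECIES" -o "$TRAIN_DIR" --cpus "$CPUS"
        # 2. Load the trained parameters into the funannotate2 database for later reuse.
        run funannotate2 species -l "$TRAIN_DIR/params.json"
        # 3. Predict genes, pointing -p at the freshly written params.json.
        run funannotate2 predict -f "$GENOME" -o "$PREDICT_DIR" -p "$TRAIN_DIR/params.json" -s "$SPECIES"
    }

    run_pipeline

Saved under a name of your choosing, the script can be run as is to preview the three commands, and run with ``DRY_RUN=0`` in the environment to execute them. Because ``set -e`` is active, a failing step stops the script before the next one starts.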