Gene Prediction
===============

The ``funannotate2 predict`` command predicts gene models in a eukaryotic genome. It combines ab initio gene predictors with evidence-based approaches to generate accurate gene models.

Basic Usage
-----------

.. code-block:: bash

    funannotate2 predict -f genome.fasta -o predict_results -p pretrained_species -s "Aspergillus fumigatus"

Required Arguments
------------------

* ``-i, --input-dir``: funannotate2 output directory
* ``-f, --fasta``: Input genome FASTA file (softmasked repeats)
* ``-o, --out``: Output folder name
* ``-p, --params, --pretrained``: Params.json or pretrained species slug. Use ``funannotate2 species`` to see pretrained species
* ``-s, --species``: Species name (e.g., "Aspergillus fumigatus")

Optional Arguments
------------------

* ``-st, --strain``: Strain/isolate name (e.g., "Af293")
* ``-e, --external``: External gene models/annotation in GFF3 format
* ``-w, --weights``: Gene predictors and weights
* ``-ps, --proteins``: Proteins to use for evidence
* ``-ts, --transcripts``: Transcripts to use for evidence
* ``-c, --cpus``: Number of CPUs to use (default: 2)
* ``-mi, --max-intron``: Maximum intron length (default: 3000)
* ``-hl, --header-len``: Maximum length for FASTA headers (default: 100)
* ``-l, --locus-tag``: Locus tag for genes, perhaps assigned by NCBI, e.g. VC83 (default: FUN2_)
* ``-n, --numbering``: Start of gene numbering (default: 1)
* ``--tmpdir``: Volume to write tmp files (default: /tmp)

Prediction Process
------------------

The ``predict`` command performs the following steps:

1. **Prepare the genome**:

   * The genome is analyzed for assembly statistics and softmasked regions
   * If the genome is not softmasked, pytantan is used to quickly softmask repeats
   * The genome is split into contigs for parallel processing

2. **Align evidence** (if provided):

   * **Protein evidence**: Proteins are aligned to the genome using miniprot
   * **Transcript evidence**: Transcripts are aligned to the genome using gapmm2
   * Evidence alignments are converted to hints for augustus

3. **Run ab initio gene predictors**:

   * **Augustus**: Uses species-specific parameters and evidence hints
   * **GeneMark**: Uses self-training or species-specific parameters
   * **SNAP**: Uses species-specific parameters
   * **GlimmerHMM**: Uses species-specific parameters
   * **tRNAscan-SE**: Identifies tRNA genes

   You can also run other ab initio predictors and pass their results with the :code:`--other_gff` option. For example, helixerlite is a recommended add-on that uses machine learning for gene prediction, but it is not included in the default installation due to dependency issues.

4. **Generate consensus gene models**:

   * The GFFtk consensus module is used to integrate all evidence and ab initio predictions
   * Gene models are weighted based on the reliability of each source
   * Overlapping gene models are resolved based on evidence and prediction quality
   * Gene models are filtered based on various criteria (e.g., minimum protein length)

5. **Annotate gene models**:

   * Gene models are assigned unique IDs based on the locus tag and numbering
   * Gene models are sorted by genomic location
   * Gene models are output in GFF3, TBL, and GenBank formats
   * Protein and transcript sequences are extracted from the gene models
   * Summary statistics are generated for the annotation

Evidence-Based Prediction
-------------------------

For best results, provide protein and/or transcript evidence. By default, the UniProt/SwissProt database is used for protein evidence.

.. code-block:: bash

    funannotate2 predict -f genome.fasta -o predict_results -p pretrained_species -s "Aspergillus fumigatus" \
        -ps uniprot_fungi.fasta -ts rnaseq_transcripts.fasta

Protein evidence should be in FASTA format and can include:

* Proteins from closely related species
* Curated protein databases (e.g., UniProt)
* Proteins from previous annotations

Transcript evidence should be in FASTA format and can include:

* Assembled transcripts from RNA-Seq data
* EST sequences
* cDNA sequences

Using External Gene Models
--------------------------

You can provide external gene models in GFF3 format:

.. code-block:: bash

    funannotate2 predict -f genome.fasta -o f2_output -p pretrained_species -s "Aspergillus fumigatus" -e external_models.gff3

External gene models can be from:

* Previous annotations
* Other gene prediction tools
* Manual annotations

Output Files
------------

The ``predict`` command generates the following output files in the specified output directory:

* **.gff3**: Gene models in GFF3 format
* **.tbl**: Gene models in NCBI TBL format
* **.gbk**: Gene models in GenBank format
* **.proteins.fa**: Protein sequences in FASTA format
* **.transcripts.fa**: Transcript sequences in FASTA format
* **.fasta**: Genome sequence in FASTA format
* **.summary.json**: Summary statistics in JSON format

The ``predict_misc`` directory contains intermediate files and detailed results from each prediction source:

* **augustus/**: Augustus prediction results
* **genemark/**: GeneMark prediction results
* **snap/**: SNAP prediction results
* **glimmerhmm/**: GlimmerHMM prediction results
* **trnascan/**: tRNAscan-SE results
* **proteins/**: Protein evidence alignments
* **transcripts/**: Transcript evidence alignments
* **hints/**: Evidence hints for ab initio predictors
* **softmasked-regions.bed**: Softmasked regions in BED format
* **assembly-gaps.bed**: Assembly gaps in BED format
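
As a quick sanity check on the output files listed above, the following is a minimal sketch that counts features in a GFF3 file and records in a FASTA file using only standard POSIX tools. The function names ``count_gff3_features`` and ``count_fasta_records`` are hypothetical helpers written for this illustration and are not part of funannotate2. The sketch assumes only the generic GFF3 layout (tab-separated, feature type in column 3) and the FASTA convention that each record starts with ``>``.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting predict output; not part of funannotate2.

# Count features of a given type (GFF3 column 3), skipping comment lines.
# Usage: count_gff3_features FILE [TYPE]   (TYPE defaults to "gene")
count_gff3_features() {
    awk -F '\t' -v type="${2:-gene}" \
        '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}

# Count sequence records in a FASTA file (lines starting with ">").
# Usage: count_fasta_records FILE
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, with placeholder paths standing in for the real file names in your output directory, ``count_gff3_features path/to/output.gff3`` prints the number of gene features, ``count_gff3_features path/to/output.gff3 mRNA`` prints the number of mRNA features, and ``count_fasta_records path/to/output.proteins.fa`` prints the number of protein sequences.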