Gene Prediction

The funannotate2 predict command predicts gene models in a eukaryotic genome. It uses a combination of ab initio gene predictors and evidence-based approaches to generate accurate gene models.

Basic Usage

funannotate2 predict -f genome.fasta -o predict_results -p pretrained_species -s "Aspergillus fumigatus"

Required Arguments

-i, --input-dir: funannotate2 output directory
-f, --fasta: Input genome FASTA file (softmasked repeats)
-o, --out: Output folder name
-p, --params, --pretrained: Params.json file or pretrained species slug. Run funannotate2 species to list the pretrained species.
-s, --species: Species name (e.g., “Aspergillus fumigatus”)

Optional Arguments

-st, --strain: Strain/isolate name (e.g., “Af293”)
-e, --external: External gene models/annotation in GFF3 format
-w, --weights: Gene predictors and weights
-ps, --proteins: Proteins to use for evidence
-ts, --transcripts: Transcripts to use for evidence
-c, --cpus: Number of CPUs to use (default: 2)
-mi, --max-intron: Maximum intron length (default: 3000)
-hl, --header-len: Max length for fasta headers (default: 100)
-l, --locus-tag: Locus tag prefix for gene IDs, such as one assigned by NCBI, e.g. VC83 (default: FUN2_)
-n, --numbering: Specify start of gene numbering (default: 1)
--tmpdir: Volume to write tmp files (default: /tmp)
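
As a sketch of how several of the optional arguments above might be combined, the following wraps the call in a small dry-run helper so the command can be inspected before it is executed. The helper names run and predict_with_options are hypothetical, and the values (Af293, VC83, 8 CPUs, a 2000 bp maximum intron) are illustrative placeholders rather than recommendations.

```shell
# Dry-run helper (hypothetical): prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# All values below are illustrative placeholders; only flags listed
# in the argument tables above are used.
predict_with_options() {
  run funannotate2 predict \
    -f genome.fasta -o predict_results -p pretrained_species \
    -s "Aspergillus fumigatus" -st Af293 \
    -l VC83 -c 8 -mi 2000
}

predict_with_options
```

The printed line drops the quotes around the species name, so use it for inspection only; set RUN=1 to execute the command with the original quoting intact.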

Prediction Process

The predict command performs the following steps:

Prepare the genome:
- The genome is analyzed for assembly statistics and softmasked regions
- If the genome is not softmasked, pytantan is used to quickly softmask repeats
- The genome is split into contigs for parallel processing
Align evidence (if provided):
- Protein evidence: Proteins are aligned to the genome using miniprot
- Transcript evidence: Transcripts are aligned to the genome using gapmm2
- Evidence alignments are converted to hints for Augustus
Run ab initio gene predictors:
- `Augustus <https://bioinf.uni-greifswald.de/augustus/>`_: Uses species-specific parameters and evidence hints
- `GeneMark <http://exon.gatech.edu/GeneMark/>`_: Uses self-training or species-specific parameters
- `SNAP <https://github.com/KorfLab/SNAP>`_: Uses species-specific parameters
- `GlimmerHMM <https://ccb.jhu.edu/software/glimmerhmm/>`_: Uses species-specific parameters
- `tRNAscan-SE <http://lowelab.ucsc.edu/tRNAscan-SE/>`_: Identifies tRNA genes
Note that you can run other ab initio predictors separately and pass their GFF3 output to predict with the -e/--external option. For example, helixerlite is a recommended add-on that uses machine learning for gene prediction, but it is not included in the default installation due to dependency issues.
Generate consensus gene models:
- The GFFtk consensus module is used to integrate all evidence and ab initio predictions
- Gene models are weighted based on the reliability of each source
- Overlapping gene models are resolved based on evidence and prediction quality
- Gene models are filtered based on various criteria (e.g., minimum protein length)
Annotate gene models:
- Gene models are assigned unique IDs based on the locus tag and numbering
- Gene models are sorted by genomic location
- Gene models are output in GFF3, TBL, and GenBank formats
- Protein and transcript sequences are extracted from the gene models
- Summary statistics are generated for the annotation
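
The first step above checks whether the genome is softmasked. To get a rough idea of this before running predict, the fraction of lowercase bases in the FASTA file can be computed with awk. This is a minimal sketch, not the check that funannotate2 itself performs; the function name softmasked_percent is hypothetical.

```shell
# Report the percentage of lowercase (softmasked) bases in a FASTA file.
# Usage: softmasked_percent genome.fasta
softmasked_percent() {
  LC_ALL=C awk '
    /^>/ { next }                      # skip header lines
    {
      total += length($0)              # count all bases before editing the line
      masked += gsub(/[a-z]/, "")      # gsub returns how many lowercase bases it removed
    }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * masked / total
    }
  ' "$1"
}
```

A result of 0.00 suggests the assembly is not softmasked, in which case predict falls back to pytantan as described above.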

Evidence-Based Prediction

For best results, provide protein and/or transcript evidence. By default, the UniProt/SwissProt database is used for protein evidence.

funannotate2 predict -f genome.fasta -o predict_results -p pretrained_species -s "Aspergillus fumigatus" \
    -ps uniprot_fungi.fasta -ts rnaseq_transcripts.fasta

Protein evidence should be in FASTA format and can include:

Proteins from closely related species
Curated protein databases (e.g., UniProt)
Proteins from previous annotations

Transcript evidence should be in FASTA format and can include:

Assembled transcripts from RNA-Seq data
EST sequences
cDNA sequences
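
Because protein and transcript evidence are passed through different options (-ps and -ts), it can help to confirm that each file contains what you expect before starting a long run. The sketch below counts the records in a FASTA file and guesses whether it holds nucleotide or protein sequences. The function name fasta_kind and the 90% threshold are assumptions made for illustration, and the guess is only a heuristic.

```shell
# Count FASTA records and guess the sequence type.
# Usage: fasta_kind rnaseq_transcripts.fasta
fasta_kind() {
  LC_ALL=C awk '
    /^>/ { records++; next }
    {
      line = toupper($0)
      total += length(line)
      nuc += gsub(/[ACGTUN]/, "", line)   # residues that are valid nucleotide codes
    }
    END {
      # Assumed cutoff: more than 90% nucleotide codes means a nucleotide file.
      kind = (total > 0 && nuc / total > 0.9) ? "nucleotide" : "protein"
      printf "%d records, looks like %s\n", records, kind
    }
  ' "$1"
}
```

A file intended for -ps that is reported as nucleotide, or a file intended for -ts that is reported as protein, is worth a second look.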

Using External Gene Models

You can provide external gene models in GFF3 format:

funannotate2 predict -f genome.fasta -o f2_output -p pretrained_species -s "Aspergillus fumigatus" -e external_models.gff3

External gene models can be from:

Previous annotations
Other gene prediction tools
Manual annotations
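
External gene models are only useful if their sequence IDs (GFF3 column 1) refer to sequences in the genome FASTA. The following sketch lists any GFF3 sequence IDs that have no matching FASTA header. It assumes the FASTA ID is the first word after ">", and the function name gff3_missing_seqids is hypothetical.

```shell
# Print GFF3 sequence IDs that are absent from the genome FASTA.
# Usage: gff3_missing_seqids external_models.gff3 genome.fasta
gff3_missing_seqids() {
  gff="$1"
  fasta="$2"
  awk -F'\t' '
    FILENAME == ARGV[1] {                # first file: the genome FASTA
      if (/^>/) {
        sub(/^>/, "")
        split($0, parts, /[ \t]/)        # keep only the first word of the header
        seen[parts[1]] = 1
      }
      next
    }
    /^#/ || NF < 9 { next }              # skip directives, comments, non-feature lines
    !($1 in seen) && !($1 in reported) { reported[$1] = 1; print $1 }
  ' "$fasta" "$gff"
}
```

Empty output means every sequence ID in the GFF3 file was found in the genome.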

Output Files

The predict command generates the following output files in the specified output directory:

<species>.gff3: Gene models in GFF3 format
<species>.tbl: Gene models in NCBI TBL format
<species>.gbk: Gene models in GenBank format
<species>.proteins.fa: Protein sequences in FASTA format
<species>.transcripts.fa: Transcript sequences in FASTA format
<species>.fasta: Genome sequence in FASTA format
<species>.summary.json: Summary statistics in JSON format
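
For a quick look at the GFF3 output, the feature types in column 3 can be tallied with awk. This is a minimal sketch that relies only on the standard nine-column GFF3 layout; the function name gff3_feature_counts is hypothetical, and the file name passed to it should be the actual GFF3 file in your output directory.

```shell
# Tally feature types (column 3) in a GFF3 file.
# Usage: gff3_feature_counts path/to/output.gff3
gff3_feature_counts() {
  awk -F'\t' '
    /^#/ || NF < 9 { next }              # skip directives, comments, non-feature lines
    { count[$3]++ }
    END { for (type in count) printf "%s\t%d\n", type, count[type] }
  ' "$1" | LC_ALL=C sort
}
```

The number of gene features gives a simple figure to compare against the summary statistics file.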

The predict_misc directory contains intermediate files and detailed results from each prediction source:

augustus/: Augustus prediction results
genemark/: GeneMark prediction results
snap/: SNAP prediction results
glimmerhmm/: GlimmerHMM prediction results
trnascan/: tRNAscan-SE results
proteins/: Protein evidence alignments
transcripts/: Transcript evidence alignments
hints/: Evidence hints for ab initio predictors
softmasked-regions.bed: Softmasked regions in BED format
assembly-gaps.bed: Assembly gaps in BED format
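
The two BED files can be summarized by adding up interval lengths. BED coordinates are 0-based and half-open, so each interval covers end minus start bases. The sketch below assumes the intervals do not overlap; the function name bed_total_bp is hypothetical.

```shell
# Sum the lengths of all intervals in a BED file.
# Usage: bed_total_bp predict_results/predict_misc/softmasked-regions.bed
bed_total_bp() {
  awk '
    /^#/ || /^track/ || /^browser/ || NF < 3 { next }   # skip comments and header lines
    { total += $3 - $2 }                                # half-open interval length
    END { printf "%d\n", total }
  ' "$1"
}
```

Dividing the softmasked total by the assembly size gives the masked fraction of the genome.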