Tutorial

This tutorial will guide you through the process of annotating a fungal genome using Funannotate2. We will use a publicly available genome assembly for Aspergillus nidulans.

Prerequisites

Before starting this tutorial, make sure you have:

  1. Installed Funannotate2 and its dependencies and databases (see Installation)

Step 1: Fetch genome assembly

Let's fetch the genome of the model organism Aspergillus nidulans FGSCA4 from NCBI:

$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/425/GCF_000011425.1_ASM1142v1/GCF_000011425.1_ASM1142v1_genomic.fna.gz
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/425/GCF_000011425.1_ASM1142v1/GCF_000011425.1_ASM1142v1_genomic.gff.gz
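
Before moving on, it can be worth confirming that the download is intact. The sketch below is not part of Funannotate2; it uses only standard gzip and grep, and the helper name count_fasta_records is made up for this example. It tests the gzip archive and then counts the FASTA header lines.

```shell
# Hypothetical helper (not a Funannotate2 command): verify a gzipped FASTA
# file and print the number of records (lines starting with ">").
count_fasta_records() {
    gzip -t "$1" || return 1        # stop if the archive is truncated or corrupt
    gzip -dc "$1" | grep -c '^>'    # count header lines
}

# Only runs if the file from the wget step is in the current directory.
genome=GCF_000011425.1_ASM1142v1_genomic.fna.gz
if [ -f "$genome" ]; then
    count_fasta_records "$genome"
fi
```

The number printed is the count of sequences (chromosomes, contigs, or scaffolds) in the assembly.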

Next, we can clean up the NCBI GFF3 file:

$ gfftk sanitize -f GCF_000011425.1_ASM1142v1_genomic.fna.gz -g GCF_000011425.1_ASM1142v1_genomic.gff.gz -o FGSCA4.gff3

Then we can look at the annotation statistics for this heavily curated model organism. The output shows 10518 genes, 10455 of which are protein-coding (mRNA) features.

$ gfftk stats -f GCF_000011425.1_ASM1142v1_genomic.fna.gz -i FGSCA4.gff3
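
As a rough, independent cross-check of those numbers, the feature types in the sanitized GFF3 can be tallied with awk. This is only a sketch: count_gff3_features is a made-up helper name, it simply counts the values in column 3, and its totals may not match gfftk stats exactly if gfftk defines or filters gene models differently.

```shell
# Hypothetical helper: tally GFF3 feature types (column 3), skipping
# comment/directive lines and any line without the 9 GFF3 columns.
count_gff3_features() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Only runs if the sanitized annotation from the previous step exists.
if [ -f FGSCA4.gff3 ]; then
    count_gff3_features FGSCA4.gff3
fi
```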

Step 2: Train Ab Initio Prediction Tools

The next step is to train the ab initio prediction tools. To do that, we'll use funannotate2 train, with anid_f2 as the output directory.

$ funannotate2 train -f GCF_000011425.1_ASM1142v1_genomic.fna.gz \
   -s "Aspergillus nidulans" --strain FGSCA4 --cpus 8 -o anid_f2

Step 3: Predict Genes

The next step is to predict genes using the training sets we just generated. Here we will just use the default for evidence mapping, which is to align the curated SwissProt/UniProt database. If you have high-quality curated transcript data, you can pass it with the --transcripts option.

$ funannotate2 predict -i anid_f2 --cpus 8 --strain FGSCA4

In this case, the tool predicted 10196 genes (10016 of which are protein-coding genes). This is very close to the public gold-standard annotation in terms of the number of gene models. At the end of predict, the tool also ran BUSCO to measure the completeness of the assembly, which is 98.94% complete.
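
To put that comparison in numbers, the predicted counts can be expressed as a percentage of the reference counts reported in Step 1. The snippet below is plain awk arithmetic on the figures quoted in this tutorial; percent_of is a made-up helper name. Note that matching counts say nothing about whether the individual gene models agree.

```shell
# Hypothetical helper: print a as a percentage of b, to two decimals.
percent_of() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'
}

percent_of 10196 10518   # all genes: predicted vs. reference, prints 96.94
percent_of 10016 10455   # protein-coding genes, prints 95.80
```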

Step 4: Functionally Annotate Genes

The next step is to functionally annotate the predicted gene models. The core annotate module in funannotate2 performs a few functional annotation steps, but it does not try to support everything natively, as that quickly becomes daunting. Instead, users can provide their own functional annotation to the script via the -a,--annotations command line argument.

$ funannotate2 annotate -i anid_f2/ --cpus 8

The annotate step was able to add functional annotation to 8594 of the 10196 gene models from PFAM, CAZymes, MEROPS, SwissProt/UniProt, and BUSCO.

Output Files

The annotation process produces various output files in the output directory (anid_f2). Here are some of the key files:

  1. Ab initio training parameters:

    • train_results/Aspergillus_nidulans_FGSCA4.params.json: Training parameters for the ab initio gene prediction tools; these can be reused for future predictions or permanently installed with funannotate2 species

  2. Gene Prediction:

    • predict_results/Aspergillus_nidulans_FGSCA4.fasta: Genome assembly in FASTA format

    • predict_results/Aspergillus_nidulans_FGSCA4.gff3: Predicted genes in GFF3 format

    • predict_results/Aspergillus_nidulans_FGSCA4.gbk: Predicted genes in GenBank flat-file format

    • predict_results/Aspergillus_nidulans_FGSCA4.tbl: Predicted genes in GenBank TBL format

  3. Functional Annotation:

    • annotate_results/Aspergillus_nidulans_FGSCA4.fasta: Genome assembly in FASTA format

    • annotate_results/Aspergillus_nidulans_FGSCA4.gff3: Predicted genes in GFF3 format

    • annotate_results/Aspergillus_nidulans_FGSCA4.gbk: Predicted genes in GenBank flat-file format

    • annotate_results/Aspergillus_nidulans_FGSCA4.tbl: Predicted genes in GenBank TBL format

    • annotate_results/Aspergillus_nidulans_FGSCA4.proteins.fa: Predicted proteins in FASTA format

    • annotate_results/Aspergillus_nidulans_FGSCA4.transcripts.fa: Predicted transcripts in FASTA format

    • annotate_results/Aspergillus_nidulans_FGSCA4.summary.json: Summary statistics in JSON format
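
After a full run, a quick way to confirm that the files listed above were written is to loop over them and test that each one exists and is non-empty. This is a minimal sketch using only the paths named in this section; check_outputs is a made-up helper name, not a Funannotate2 command.

```shell
# Hypothetical helper: report which of the key output files exist and are
# non-empty. Usage: check_outputs OUTPUT_DIR FILE_PREFIX
# Exit status is 0 only if every file is present.
check_outputs() (
    outdir="$1"
    prefix="$2"
    missing=0
    for rel in \
        "train_results/$prefix.params.json" \
        "predict_results/$prefix.fasta" \
        "predict_results/$prefix.gff3" \
        "predict_results/$prefix.gbk" \
        "predict_results/$prefix.tbl" \
        "annotate_results/$prefix.fasta" \
        "annotate_results/$prefix.gff3" \
        "annotate_results/$prefix.gbk" \
        "annotate_results/$prefix.tbl" \
        "annotate_results/$prefix.proteins.fa" \
        "annotate_results/$prefix.transcripts.fa" \
        "annotate_results/$prefix.summary.json"
    do
        if [ -s "$outdir/$rel" ]; then
            echo "ok       $rel"
        else
            echo "MISSING  $rel"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)

# Only runs if the tutorial output directory exists.
if [ -d anid_f2 ]; then
    check_outputs anid_f2 Aspergillus_nidulans_FGSCA4 || echo "some outputs are missing"
fi
```

The function body runs in a subshell, so its variables do not leak into the calling shell.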

For more help, see the Frequently Asked Questions or open an issue on the GitHub repository.