Genome Cleaning

The funannotate2 clean command prepares a genome assembly for annotation by removing short and duplicated contigs, sorting contigs by size, and optionally renaming contig headers.

Basic Usage

funannotate2 clean -f genome.fasta -o cleaned_genome.fasta

Required Arguments

  • -f, --fasta: Input genome FASTA file

  • -o, --out: Output cleaned genome FASTA file

Optional Arguments

  • -p, --pident: Percent identity threshold for identifying duplicated contigs (default: 95)

  • -c, --cov: Coverage threshold for identifying duplicated contigs (default: 95)

  • -m, --minlen: Minimum contig length to keep (default: 500)

  • -r, --rename: Rename contigs with this basename (e.g., “scaffold_”)

  • --cpus: Number of CPUs to use (default: 2)

  • --tmpdir: Directory for temporary files

  • --exhaustive: Check every contig for duplication instead of stopping at the N50 contig size (default: False)

  • --logfile: Write logs to file

  • --debug: Enable debug output (default: False)
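
As a rough illustration of how the -p/--pident and -c/--cov thresholds could act together, the sketch below decides whether a single alignment marks a contig as duplicated. The function name is_duplicate, the formulas for identity and coverage, and the use of >= are assumptions made for illustration; they are not taken from the funannotate2 source. The inputs correspond to values that minimap2 reports in PAF output (query length, number of matching bases, alignment block length).

```python
def is_duplicate(matches, aln_len, query_len, pident=95.0, cov=95.0):
    """Return True if one alignment suggests the query contig is duplicated.

    Assumed definitions (illustrative, not confirmed by the tool):
      identity = matching bases / alignment block length * 100
      coverage = alignment block length / query contig length * 100
    """
    if aln_len <= 0 or query_len <= 0:
        return False  # nothing usable to evaluate
    identity = 100.0 * matches / aln_len
    coverage = 100.0 * aln_len / query_len
    # Assumption: both thresholds must be met for the contig to count as duplicated.
    return identity >= pident and coverage >= cov


if __name__ == "__main__":
    # 990 matching bases over a 1000 bp alignment of a 1000 bp contig
    print(is_duplicate(990, 1000, 1000))
    # Same alignment, but it covers only half of a 2000 bp contig
    print(is_duplicate(990, 1000, 2000))
```

Raising either threshold makes the check stricter, so fewer contigs are flagged as duplicates.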

Cleaning Process

The clean command performs the following steps:

  1. Load and sort the genome:

    • The genome is loaded from the input FASTA file

    • Contigs are sorted by length, from shortest to longest, which is the order used for the duplication check

    • Basic statistics are calculated, including N50

  2. Filter contigs by length:

    • Contigs shorter than the minimum length are removed

    • This eliminates small contigs, which often contain only partial genes or repetitive elements

  3. Check for duplicated contigs:

    • Starting with the smallest contigs, each contig is aligned against all larger contigs using minimap2

    • If a contig is found to be duplicated elsewhere in the genome (based on percent identity and coverage thresholds), it is marked for removal

    • By default, the process stops at the N50 contig size to save time, as larger contigs are less likely to be duplicated

    • If the --exhaustive option is used, all contigs are checked for duplication

  4. Rename contigs (if requested):

    • If the --rename option is used, contigs are renamed with the provided basename followed by a number (e.g., “scaffold_1”, “scaffold_2”, etc.)

    • Contigs are numbered from largest to smallest

  5. Write the cleaned genome:

    • The cleaned genome is written to the output file

    • Duplicated contigs are excluded

    • Contigs are written in order from largest to smallest
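
The steps above can be sketched in a few lines of Python. This is a simplified outline under stated assumptions, not the funannotate2 implementation: the names n50 and plan_clean are hypothetical, the minimap2 alignment step is replaced by a caller-supplied set of duplicated contig names, N50 is computed on the unfiltered input, and "shorter than the minimum length" is read as a strict comparison.

```python
def n50(lengths):
    """Smallest length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def plan_clean(contigs, minlen=500, exhaustive=False, rename=None, duplicated=()):
    """Outline of the cleaning steps for a dict of contig name -> sequence.

    duplicated: names flagged by the alignment step (not performed here).
    Returns (candidates, output): the contigs that would be checked for
    duplication, smallest first, and the (name, sequence) records to write,
    largest first.
    """
    # Step 1: statistics on the input genome (assumption: N50 of the unfiltered input)
    cutoff = n50([len(seq) for seq in contigs.values()])
    # Step 2: drop contigs shorter than minlen
    kept = {name: seq for name, seq in contigs.items() if len(seq) >= minlen}
    # Step 3: check smallest contigs first; stop below N50 unless exhaustive
    by_size = sorted(kept, key=lambda name: len(kept[name]))
    candidates = [n for n in by_size if exhaustive or len(kept[n]) < cutoff]
    # Steps 4 and 5: exclude duplicates, number and write from largest to smallest
    survivors = [n for n in reversed(by_size) if n not in duplicated]
    output = []
    for i, name in enumerate(survivors, start=1):
        new_name = f"{rename}{i}" if rename else name
        output.append((new_name, kept[name]))
    return candidates, output


if __name__ == "__main__":
    genome = {"a": "A" * 1000, "b": "C" * 600, "c": "G" * 300, "d": "T" * 100}
    to_check, records = plan_clean(genome, rename="scaffold_", duplicated={"b"})
    print(to_check)
    print([(name, len(seq)) for name, seq in records])
```

In a real run, the duplicated set would come from aligning each candidate against all larger contigs.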

Why Clean Your Genome?

Cleaning your genome assembly before annotation offers several benefits:

  1. Removes redundant sequences:

    • Duplicated contigs can lead to redundant gene annotations

    • Removing duplicates helps avoid annotating the same gene more than once

  2. Improves annotation quality:

    • Small, fragmented contigs often contain partial genes or repetitive elements

    • Removing these contigs can improve the overall quality of gene predictions

  3. Standardizes contig names:

    • Consistent, simple contig names make downstream analysis easier

    • Some tools have limitations on header length or format

  4. Reduces computational requirements:

    • Fewer contigs mean faster processing in subsequent steps

    • Removing small contigs can significantly reduce the number of contigs without losing much sequence

Example Usage

Basic cleaning with default parameters:

funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta

Cleaning with custom parameters:

funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta -m 1000 -p 98 -c 98 -r scaffold_

Exhaustive cleaning (check all contigs for duplication):

funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta --exhaustive
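
When the command is called from a pipeline script, it can help to assemble the argument list in code. The sketch below is one way to do that from Python using only the flags documented above; build_clean_command and run_clean are hypothetical helper names, not part of funannotate2.

```python
import shutil
import subprocess


def build_clean_command(fasta, out, minlen=None, pident=None, cov=None,
                        rename=None, exhaustive=False, cpus=None):
    """Build the argument list for 'funannotate2 clean' from documented flags."""
    cmd = ["funannotate2", "clean", "-f", str(fasta), "-o", str(out)]
    if minlen is not None:
        cmd += ["-m", str(minlen)]
    if pident is not None:
        cmd += ["-p", str(pident)]
    if cov is not None:
        cmd += ["-c", str(cov)]
    if rename is not None:
        cmd += ["-r", str(rename)]
    if exhaustive:
        cmd.append("--exhaustive")
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    return cmd


def run_clean(**kwargs):
    """Run the command, failing early if the executable is not on PATH."""
    cmd = build_clean_command(**kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("funannotate2 was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it
    print(" ".join(build_clean_command("raw_genome.fasta", "cleaned_genome.fasta",
                                       minlen=1000, pident=98, cov=98,
                                       rename="scaffold_")))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.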

Output

The clean command produces a single output file: the cleaned genome in FASTA format. It also reports statistics about the cleaning process, including:

  • Number of contigs in the input genome

  • Number of contigs that pass the minimum length filter

  • N50 of the input genome

  • Number of duplicated contigs found

  • Number of contigs written to the output file