Genome Cleaning

The funannotate2 clean command prepares a genome assembly for annotation by dropping contigs below a minimum length, removing duplicated contigs, sorting contigs by size, and optionally renaming contig headers.

Basic Usage

funannotate2 clean -f genome.fasta -o cleaned_genome.fasta

Required Arguments

-f, --fasta: Input genome FASTA file
-o, --out: Output cleaned genome FASTA file

Optional Arguments

-p, --pident: Percent identity threshold for identifying duplicated contigs (default: 95)
-c, --cov: Percent coverage threshold for identifying duplicated contigs (default: 95)
-m, --minlen: Minimum contig length to keep; shorter contigs are removed (default: 500)
-r, --rename: Rename contigs with this basename (e.g., “scaffold_”)
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Directory for temporary files
--exhaustive: Check every contig for duplication instead of stopping at the N50 contig size (default: False)
--logfile: Write logs to file
--debug: Enable debug output (default: False)
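
For scripted pipelines, the flags above can be assembled programmatically. The following is a minimal sketch and not part of funannotate2 itself: the helper name build_clean_command is hypothetical, it uses only the flags and defaults listed above, and it assumes --exhaustive is a simple on/off switch.

```python
import shlex


def build_clean_command(fasta, out, pident=95, cov=95, minlen=500,
                        rename=None, cpus=2, exhaustive=False):
    """Assemble a `funannotate2 clean` command line as a list of arguments.

    Hypothetical helper for illustration; only documented flags are used.
    """
    cmd = [
        "funannotate2", "clean",
        "-f", str(fasta),
        "-o", str(out),
        "-p", str(pident),
        "-c", str(cov),
        "-m", str(minlen),
        "--cpus", str(cpus),
    ]
    if rename:
        # Basename used for renamed contigs, e.g. "scaffold_"
        cmd += ["-r", rename]
    if exhaustive:
        # Assumption: --exhaustive takes no value
        cmd.append("--exhaustive")
    return cmd


cmd = build_clean_command("raw_genome.fasta", "cleaned_genome.fasta",
                          minlen=1000, pident=98, cov=98, rename="scaffold_")
print(shlex.join(cmd))
```

To run the command, pass the list to subprocess.run(cmd, check=True) on a machine where funannotate2 is installed.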

Cleaning Process

The clean command performs the following steps:

Load and sort the genome:
- The genome is loaded from the input FASTA file
- Contigs are sorted by length, from shortest to longest
- Basic statistics are calculated, including N50
Filter contigs by length:
- Contigs shorter than the minimum length are removed
- This helps eliminate small, potentially problematic contigs
Check for duplicated contigs:
- Starting with the smallest contigs, each contig is aligned against all larger contigs using minimap2
- If a contig is found to be duplicated elsewhere in the genome (based on percent identity and coverage thresholds), it is marked for removal
- By default, the process stops at the N50 contig size to save time, as larger contigs are less likely to be duplicated
- If the --exhaustive option is used, all contigs are checked for duplication
Rename contigs (if requested):
- If the --rename option is used, contigs are renamed with the provided basename followed by a number (e.g., “scaffold_1”, “scaffold_2”, etc.)
- Contigs are numbered from largest to smallest
Write the cleaned genome:
- The cleaned genome is written to the output file
- Duplicated contigs are excluded
- Contigs are written in order from largest to smallest
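
The steps above can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the funannotate2 implementation: all function names are hypothetical, the minimap2 alignment itself is not run, identity is assumed to be matches divided by alignment block length, coverage is assumed to be the aligned query span divided by query length, N50 is computed on the unfiltered input, and by default only contigs strictly shorter than the N50 are checked.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L hold at least
    half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def is_duplicate(matches, block_len, q_start, q_end, q_len,
                 pident=95.0, cov=95.0):
    """Apply the identity and coverage thresholds to one alignment.

    Arguments mirror PAF-style alignment fields (assumed formulas).
    """
    identity = 100.0 * matches / block_len
    coverage = 100.0 * (q_end - q_start) / q_len
    return identity >= pident and coverage >= cov


def plan_clean(contigs, minlen=500, exhaustive=False):
    """contigs: dict of name -> sequence.

    Returns (kept, to_check): the length-filtered contigs sorted from
    shortest to longest, and the names that would be checked for duplication.
    """
    threshold = n50([len(seq) for seq in contigs.values()])
    kept = sorted(
        ((name, seq) for name, seq in contigs.items() if len(seq) >= minlen),
        key=lambda item: len(item[1]),
    )
    to_check = [name for name, seq in kept
                if exhaustive or len(seq) < threshold]
    return kept, to_check


def rename_largest_first(kept, basename):
    """Number contigs from largest to smallest, starting at 1."""
    ordered = sorted(kept, key=lambda item: len(item[1]), reverse=True)
    return [(f"{basename}{i}", seq) for i, (_, seq) in enumerate(ordered, 1)]


# Toy assembly: five contigs of 4000, 2500, 1500, 800 and 300 bases.
contigs = {"a": "A" * 4000, "b": "C" * 2500, "c": "G" * 1500,
           "d": "T" * 800, "e": "A" * 300}
kept, to_check = plan_clean(contigs)
print([name for name, _ in kept])   # "e" is dropped by the length filter
print(to_check)                     # contigs below the N50 of 2500
print([name for name, _ in rename_largest_first(kept, "scaffold_")])
```

In this toy example only the two contigs shorter than the N50 are candidates for the duplication check; passing exhaustive=True would make every retained contig a candidate.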

Why Clean Your Genome?

Cleaning your genome assembly before annotation offers several benefits:

Removes redundant sequences:
- Duplicated contigs can lead to redundant gene annotations
- Removing duplicates helps avoid annotating the same gene more than once
Improves annotation quality:
- Small, fragmented contigs often contain partial genes or repetitive elements
- Removing these contigs can improve the overall quality of gene predictions
Standardizes contig names:
- Consistent, simple contig names make downstream analysis easier
- Some tools have limitations on header length or format
Reduces computational requirements:
- Fewer contigs means faster processing in subsequent steps
- Removing small contigs can significantly reduce the number of contigs without losing much sequence

Example Usage

Basic cleaning with default parameters:

funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta

Cleaning with custom parameters:

funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta -m 1000 -p 98 -c 98 -r scaffold_

Exhaustive cleaning (check all contigs for duplication):

funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta --exhaustive

Output

The clean command writes the cleaned genome in FASTA format to the output file; if --logfile is given, log messages are also written to that file. The command also reports statistics about the cleaning process, including:

Number of contigs in the input genome
Number of contigs at least as long as the minimum length
N50 of the input genome
Number of duplicated contigs found
Number of contigs written to the output file
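
To cross-check the reported numbers, the output FASTA can be inspected independently. The sketch below is a hypothetical helper, not a funannotate2 feature: it uses a plain-text FASTA parser, counts contigs, counts those at or above a minimum length, and confirms that contigs appear from largest to smallest.

```python
def read_fasta_lengths(path):
    """Return a list of (header, length) tuples in file order."""
    records = []
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, length))
                parts = line[1:].split()
                # Keep only the first word of the header as the contig name
                name, length = (parts[0] if parts else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        records.append((name, length))
    return records


def summarize(path, minlen=500):
    """Summarize a FASTA file for comparison with the clean statistics."""
    lengths = [length for _, length in read_fasta_lengths(path)]
    return {
        "contigs": len(lengths),
        "contigs_at_least_minlen": sum(1 for n in lengths if n >= minlen),
        "largest_first": lengths == sorted(lengths, reverse=True),
    }
```

For example, summarize("cleaned_genome.fasta") on a cleaned assembly would be expected to report largest_first as True and the same count for contigs and contigs_at_least_minlen, given the behavior described above.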