Genome Cleaning
The funannotate2 clean command prepares a genome assembly for annotation by removing duplicated contigs, sorting contigs by size, and optionally renaming contig headers.
Basic Usage
funannotate2 clean -f genome.fasta -o cleaned_genome.fasta
Required Arguments
-f, --fasta: Input genome FASTA file
-o, --out: Output cleaned genome FASTA file
Optional Arguments
-p, --pident: Percent identity threshold for identifying duplicated contigs (default: 95)
-c, --cov: Coverage threshold for identifying duplicated contigs (default: 95)
-m, --minlen: Minimum contig length to keep (default: 500)
-r, --rename: Rename contigs with this basename (e.g., “scaffold_”)
--cpus: Number of CPUs to use (default: 2)
--tmpdir: Directory for temporary files
--exhaustive: Compute every contig, else stop at N50 (default: False)
--logfile: Write logs to file
--debug: Debug the output (default: False)
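To make the --pident and --cov thresholds more concrete, the sketch below applies them to a single alignment record in PAF format, which is minimap2's default output. It is only an illustration under stated assumptions: the function name passes_thresholds is hypothetical, identity is taken as matching bases divided by alignment block length, coverage is taken as the aligned fraction of the query contig, and both comparisons are inclusive. funannotate2 may compute these values differently.

```python
def passes_thresholds(paf_line: str, pident: float = 95.0, cov: float = 95.0) -> bool:
    """Return True if one PAF alignment record meets both thresholds.

    Hypothetical helper for illustration; not part of funannotate2.
    """
    fields = paf_line.rstrip("\n").split("\t")
    query_len = int(fields[1])    # PAF column 2: query sequence length
    query_start = int(fields[2])  # PAF column 3: query start (0-based)
    query_end = int(fields[3])    # PAF column 4: query end
    matches = int(fields[9])      # PAF column 10: number of matching bases
    block_len = int(fields[10])   # PAF column 11: alignment block length

    # Assumed definitions of identity and coverage, both as percentages.
    identity = 100.0 * matches / block_len
    coverage = 100.0 * (query_end - query_start) / query_len
    return identity >= pident and coverage >= cov
```

Under these assumptions, a 1,000 bp contig that aligns over 980 bp with 970 matches in a 985-column alignment passes the defaults, but fails if --pident is raised to 99.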
Cleaning Process
The clean command performs the following steps:
Load and sort the genome:
The genome is loaded from the input FASTA file
Contigs are sorted by length, from shortest to longest
Basic statistics are calculated, including N50
Filter contigs by length:
Contigs shorter than the minimum length are removed
This helps eliminate small, potentially problematic contigs
Check for duplicated contigs:
Starting with the smallest contigs, each contig is aligned against all larger contigs using minimap2
If a contig is found to be duplicated elsewhere in the genome (based on percent identity and coverage thresholds), it is marked for removal
By default, the process stops at the N50 contig size to save time, as larger contigs are less likely to be duplicated
If the --exhaustive option is used, all contigs are checked for duplication
Rename contigs (if requested):
If the --rename option is used, contigs are renamed with the provided basename followed by a number (e.g., “scaffold_1”, “scaffold_2”, etc.)
Contigs are numbered from largest to smallest
Write the cleaned genome:
The cleaned genome is written to the output file
Duplicated contigs are excluded
Contigs are written in order from largest to smallest
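The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not funannotate2's implementation: the names n50 and clean_contigs are hypothetical, the is_duplicate callback stands in for the minimap2 alignment step, and the exact boundary behaviour (keeping contigs of exactly the minimum length, checking contigs up to and including the N50 length) is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple


def n50(lengths: List[int]) -> int:
    """Largest length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def clean_contigs(
    contigs: Dict[str, str],
    is_duplicate: Callable[[str, List[str]], bool],
    minlen: int = 500,
    exhaustive: bool = False,
    rename: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Hypothetical sketch of the clean steps; returns (name, sequence) pairs."""
    # Step 1: statistics on the input genome.
    cutoff = n50([len(seq) for seq in contigs.values()])

    # Step 2: drop short contigs, then sort shortest to longest.
    kept = sorted(
        ((name, seq) for name, seq in contigs.items() if len(seq) >= minlen),
        key=lambda item: len(item[1]),
    )

    # Step 3: compare each contig against the contigs after it in sorted order.
    duplicated = set()
    for i, (name, seq) in enumerate(kept):
        if not exhaustive and len(seq) > cutoff:
            break  # assumed stopping rule: skip contigs longer than N50
        larger = [other for _, other in kept[i + 1:]]
        if larger and is_duplicate(seq, larger):
            duplicated.add(name)

    # Steps 4-5: order largest to smallest, then optionally rename.
    survivors = sorted(
        (item for item in kept if item[0] not in duplicated),
        key=lambda item: len(item[1]),
        reverse=True,
    )
    if rename:
        survivors = [(f"{rename}{i}", seq) for i, (_, seq) in enumerate(survivors, start=1)]
    return survivors
```

For a quick experiment, a toy callback such as `lambda seq, larger: any(seq in other for other in larger)` treats exact containment as duplication; a real check would use alignment identity and coverage instead.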
Why Clean Your Genome?
Cleaning your genome assembly before annotation offers several benefits:
Removes redundant sequences:
Duplicated contigs can lead to redundant gene annotations
Removing duplicates helps avoid annotating the same gene more than once
Improves annotation quality:
Small, fragmented contigs often contain partial genes or repetitive elements
Removing these contigs can improve the overall quality of gene predictions
Standardizes contig names:
Consistent, simple contig names make downstream analysis easier
Some tools have limitations on header length or format
Reduces computational requirements:
Fewer contigs means faster processing in subsequent steps
Removing small contigs can significantly reduce the number of contigs without losing much sequence
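The last point can be checked on your own assembly before choosing a --minlen value. The sketch below is a hypothetical helper (minlen_impact is not part of funannotate2) that reports what share of contigs and what share of bases a given threshold would remove, assuming contigs shorter than the threshold are the ones dropped. The contig lengths in the usage note are made up for illustration.

```python
from typing import List, Tuple


def minlen_impact(lengths: List[int], minlen: int = 500) -> Tuple[float, float]:
    """Return (percent of contigs removed, percent of bases removed).

    Hypothetical helper; expects a non-empty list of contig lengths.
    """
    dropped = [n for n in lengths if n < minlen]
    pct_contigs = 100.0 * len(dropped) / len(lengths)
    pct_bases = 100.0 * sum(dropped) / sum(lengths)
    return pct_contigs, pct_bases
```

With three large contigs of 2,000,000, 1,500,000, and 800,000 bp plus 200 fragments of 300 bp each, the default threshold of 500 removes about 98.5% of the contigs but only about 1.4% of the sequence.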
Example Usage
Basic cleaning with default parameters:
funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta
Cleaning with custom parameters:
funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta -m 1000 -p 98 -c 98 -r scaffold_
Exhaustive cleaning (check all contigs for duplication):
funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta --exhaustive
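When several assemblies need cleaning, the same commands can be driven from a script. The sketch below builds the argument list using only the options documented above. The helper names build_clean_command and run_clean are hypothetical, and the sketch assumes the funannotate2 executable is on your PATH.

```python
import subprocess
from typing import List, Optional


def build_clean_command(
    fasta: str,
    out: str,
    minlen: Optional[int] = None,
    pident: Optional[float] = None,
    cov: Optional[float] = None,
    rename: Optional[str] = None,
    exhaustive: bool = False,
) -> List[str]:
    """Assemble a funannotate2 clean command; unset options keep the tool's defaults."""
    cmd = ["funannotate2", "clean", "-f", fasta, "-o", out]
    if minlen is not None:
        cmd += ["-m", str(minlen)]
    if pident is not None:
        cmd += ["-p", str(pident)]
    if cov is not None:
        cmd += ["-c", str(cov)]
    if rename is not None:
        cmd += ["-r", rename]
    if exhaustive:
        cmd.append("--exhaustive")
    return cmd


def run_clean(fasta: str, out: str, **options) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_clean_command(fasta, out, **options), check=True)
```

For example, `run_clean("raw_genome.fasta", "cleaned_genome.fasta", minlen=1000, pident=98, cov=98, rename="scaffold_")` corresponds to the custom-parameter command shown above.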
Output
The clean command produces a single output file: the cleaned genome in FASTA format. The command also outputs statistics about the cleaning process, including:
Number of contigs in the input genome
Number of contigs larger than the minimum length
N50 of the input genome
Number of duplicated contigs found
Number of contigs written to the output file