Genome Cleaning ============== The ``funannotate2 clean`` command prepares a genome assembly for annotation by removing duplicated contigs, sorting contigs by size, and optionally renaming contig headers. Basic Usage ---------- .. code-block:: bash funannotate2 clean -f genome.fasta -o cleaned_genome.fasta Required Arguments ---------------- * ``-f, --fasta``: Input genome FASTA file * ``-o, --out``: Output cleaned genome FASTA file Optional Arguments ---------------- * ``-p, --pident``: Percent identity threshold for identifying duplicated contigs (default: 95) * ``-c, --cov``: Coverage threshold for identifying duplicated contigs (default: 95) * ``-m, --minlen``: Minimum contig length to keep (default: 500) * ``-r, --rename``: Rename contigs with this basename (e.g., "scaffold_") * ``--cpus``: Number of CPUs to use (default: 2) * ``--tmpdir``: Directory for temporary files * ``--exhaustive``: Compute every contig, else stop at N50 (default: False) * ``--logfile``: Write logs to file * ``--debug``: Debug the output (default: False) Cleaning Process -------------- The ``clean`` command performs the following steps: 1. **Load and sort the genome**: * The genome is loaded from the input FASTA file * Contigs are sorted by length, from shortest to longest * Basic statistics are calculated, including N50 2. **Filter contigs by length**: * Contigs shorter than the minimum length are removed * This helps eliminate small, potentially problematic contigs 3. **Check for duplicated contigs**: * Starting with the smallest contigs, each contig is aligned against all larger contigs using `minimap2 `_ * If a contig is found to be duplicated elsewhere in the genome (based on percent identity and coverage thresholds), it is marked for removal * By default, the process stops at the N50 contig size to save time, as larger contigs are less likely to be duplicated * If the ``--exhaustive`` option is used, all contigs are checked for duplication 4. **Rename contigs** (if requested): * If the ``--rename`` option is used, contigs are renamed with the provided basename followed by a number (e.g., "scaffold_1", "scaffold_2", etc.) * Contigs are numbered from largest to smallest 5. **Write the cleaned genome**: * The cleaned genome is written to the output file * Duplicated contigs are excluded * Contigs are written in order from largest to smallest Why Clean Your Genome? -------------------- Cleaning your genome assembly before annotation offers several benefits: 1. **Removes redundant sequences**: * Duplicated contigs can lead to redundant gene annotations * Removing duplicates ensures each gene is annotated only once 2. **Improves annotation quality**: * Small, fragmented contigs often contain partial genes or repetitive elements * Removing these contigs can improve the overall quality of gene predictions 3. **Standardizes contig names**: * Consistent, simple contig names make downstream analysis easier * Some tools have limitations on header length or format 4. **Reduces computational requirements**: * Fewer contigs means faster processing in subsequent steps * Removing small contigs can significantly reduce the number of contigs without losing much sequence Example Usage ----------- Basic cleaning with default parameters: .. code-block:: bash funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta Cleaning with custom parameters: .. code-block:: bash funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta -m 1000 -p 98 -c 98 -r scaffold_ Exhaustive cleaning (check all contigs for duplication): .. code-block:: bash funannotate2 clean -f raw_genome.fasta -o cleaned_genome.fasta --exhaustive Output ----- The ``clean`` command produces a single output file: the cleaned genome in FASTA format. The command also outputs statistics about the cleaning process, including: * Number of contigs in the input genome * Number of contigs larger than the minimum length * N50 of the input genome * Number of duplicated contigs found * Number of contigs written to the output file