Skip to content

nucmer gives no alignment with certain records in subject file #238

@peterjc

Description

@peterjc

This is a reproducible bug where certain sequences in the reference/subject file result in no alignments. It is one of three similar examples found in a set of fungal genome assembliess as per pyani-plus/pyani-plus#417 and affects both NUCmer (NUCleotide MUMmer) version 3.1 and version 4.0.1 as well. Tested on both macOS and Linux, with mummer installed from BioConda.

Setup

Download smaller query genome:

❯ curl -o "GCA_009806305.1.fna.gz" "https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_009806305.1?download=true&gzip=true"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.6M    0 11.6M    0     0  1481k      0 --:--:--  0:00:08 --:--:-- 1554k

Download larger genome where last two records are problematic:

❯ curl -o "GCA_011745365.1.fna.gz" "https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_011745365.1?download=true&gzip=true"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25.9M    0 25.9M    0     0  1203k      0 --:--:--  0:00:22 --:--:-- 1165k

Checksums:

❯ md5sum GCA_*.fna.gz
6448363d6650a4a6a1db37000b1dde80  GCA_009806305.1.fna.gz
aa6bc2e22196bbdf26daaae2185fde88  GCA_011745365.1.fna.gz

Decompress and make subset of subject file without last two records:

❯ cat GCA_009806305.1.fna.gz | gunzip > query.fasta
❯ cat GCA_011745365.1.fna.gz | gunzip > subject59.fasta
❯ head -n 1461172 subject59.fasta > subject57.fasta
❯ grep -c "^>" query.fasta subject59.fasta subject57.fasta
query.fasta:35
subject59.fasta:59
subject57.fasta:57

Notice record 57 is complete:

❯ tail -n 2 subject57.fasta
TAAGATATTTATACCAAACATAATTAATTATTTTCTAAAATAAAGCTAGAATATTTATTA
AATTAAAGAATATAATTAAATATTTATAAT

Checksums:

❯ md5sum query.fasta subject59.fasta subject57.fasta
b028729332c5b2071eb3bba1be58e8ba  query.fasta
74227e57109fede619e4f4ae473bf926  subject59.fasta
5b53388c322b5ebadc028ebc989af913  subject57.fasta

Running mummer4

With just 57 references, under 30s:

❯ nucmer --version
4.0.1
❯ time nucmer -p query_vs_subject57 --mum subject57.fasta query.fasta
(no output)
❯ cat query_vs_subject57.delta
/private/tmp/mum_bug/subject57.fasta /private/tmp/mum_bug/query.fasta
NUCMER
>ENA|WHUS01000032|WHUS01000032.1 ENA|RAGY01000030|RAGY01000030.1 2728 43253
2 2728 24861 27587 416 416 0
-111
9
1132
19
-4
-3
3
-267
536
14
-48
4
104
-150
6
-124
3
-143
-31
-20
0

With the full 59 references, similar run time:

❯ nucmer -p query_vs_subject59 --mum subject59.fasta query.fasta
(no output)
❯ cat query_vs_subject59.delta
/private/tmp/mum_bug/subject59.fasta /private/tmp/mum_bug/query.fasta
NUCMER

Expected output - similar or the same as the above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions