High Memory Footprint Due to (mostly?) Unused Attribute #84

Description

@Yashrajsinh-Jadeja

I've noticed that the _per_letter_annotations attribute in the SeqLike class does not appear to be actively used by most users of the package, yet it contributes significantly to the memory footprint of every object of this class. This can hurt performance, especially in environments where resource efficiency (particularly memory) is critical.

For example, if we take a 201 character long nucleotide string,

import random

seed = 33  # Set a seed for reproducibility
nt_length = 201  # Sequence length
random.seed(seed)
letters = ["A", "T", "G", "C"]  # DNA nucleotide characters
nt_seq = "".join(random.choice(letters) for _ in range(nt_length))  # Build a random string

And create a nucleotide SeqLike object of the nucleotide string nt_seq,

from seqlike import SeqLike
seq_obj = SeqLike(sequence=nt_seq,seq_type="NT") #Creating a seqlike object

Looking at the memory footprint of seq_obj using pympler:

from pympler import asizeof
print("Size of SeqLike Object:", asizeof.asizeof(seq_obj), "bytes") #Looking at size of the object
Size of SeqLike Object: 19328 bytes

Further digging into the memory footprint of the object by unpacking attributes,

from pympler import asizeof

def get_attribute_sizes(obj, path='', visited=None, sizes=None):
    if visited is None:
        visited = set()
    if sizes is None:
        sizes = {}

    obj_id = id(obj)
    if obj_id in visited:
        return sizes
    visited.add(obj_id)

    # Calculate the size and store it if not zero
    obj_size = asizeof.asizeof(obj)
    if obj_size > 0:
        sizes[path if path else 'self'] = obj_size

    # Handle different types of collections and objects
    if hasattr(obj, '__dict__'):
        for attr, value in obj.__dict__.items():
            full_path = f"{path}.{attr}" if path else attr
            get_attribute_sizes(value, full_path, visited, sizes)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            full_path = f"{path}.{key}" if path else str(key)
            get_attribute_sizes(value, full_path, visited, sizes)
    elif isinstance(obj, (list, set, tuple)):
        for index, item in enumerate(obj):
            full_path = f"{path}[{index}]" if path else f"[{index}]"
            get_attribute_sizes(item, full_path, visited, sizes)
    return sizes

attribute_sizes = get_attribute_sizes(seq_obj)

Plotting the top 20 attributes:

[image: plot of the top 20 attributes]

We see that _per_letter_annotations makes up a sizeable chunk of the _nt_record attribute, 13472 bytes to be precise.

Further dissecting the _per_letter_annotations attribute, we can see that it is a dictionary with a single key, seqnums, whose value is a list of strings. These are presumably 1-based position indices that run up to the length of the sequence.

print(seq_obj._nt_record._per_letter_annotations.keys()) #See keys of the dictionary
print(seq_obj._nt_record._per_letter_annotations["seqnums"]) #Focus on values of the dictionary
print(type(seq_obj._nt_record._per_letter_annotations["seqnums"][0])) #See data type of the element in the list
dict_keys(['seqnums'])
['1', '2', '3', '4', '5', '6', '7', '8', '9' .... '201']
<class 'str'>
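To get a feel for why a list of per-position strings is costly, here is a rough, stdlib-only estimate using sys.getsizeof. It is only a sketch: it is not how pympler measures things, it ignores the enclosing dictionary, and the exact figures depend on the Python version and platform. The comparison with range is just an illustration of a constant-size alternative, not something SeqLike does.

```python
import sys

n = 201  # sequence length used in the example above

# Same shape as the "seqnums" annotation: one string per position.
seqnums = [str(i) for i in range(1, n + 1)]

# Shallow size of the list (its pointer array) plus every string object.
list_bytes = sys.getsizeof(seqnums)
str_bytes = sum(sys.getsizeof(s) for s in seqnums)
total_bytes = list_bytes + str_bytes

# A range object describes the same numbering in constant space.
range_bytes = sys.getsizeof(range(1, n + 1))

print(f"list of str: {total_bytes} bytes, range: {range_bytes} bytes")
```

Most of the cost comes from the per-string object overhead, which is paid once for every position in the sequence.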

By setting seq_obj._nt_record._per_letter_annotations to None, we can see a considerable reduction in the memory occupied by the object:

from pympler import asizeof
seq_obj._nt_record._per_letter_annotations = None
print("Size of SeqLike Object:", asizeof.asizeof(seq_obj), "bytes")
Size of SeqLike Object: 5856 bytes

Comments

  1. This reduces the memory footprint of this one object by ~70% (from 19328 to 5856 bytes).

  2. I have observed similar behavior for the _aa_record._per_letter_annotations as well. So the same still applies for objects created as an AA record instead of an NT record.

  3. The memory bloat can add up significantly over time and can be a critical limiting factor (memory-wise) especially for large machine-learning/computational biology data processing and analysis applications.
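To put the third point in numbers, here is a back-of-the-envelope extrapolation from the two measurements above. The object count is a hypothetical workload, and the estimate assumes every object holds a 201-character sequence and that savings add up linearly, which real workloads may not match.

```python
size_with = 19328     # bytes, measured above with the annotations present
size_without = 5856   # bytes, measured above after setting them to None

saved_per_object = size_with - size_without
fraction_saved = saved_per_object / size_with

n_objects = 1_000_000  # hypothetical number of SeqLike objects held in memory
total_saved_gb = saved_per_object * n_objects / 1e9

print(f"{saved_per_object} bytes saved per object ({fraction_saved:.1%})")
# prints "13472 bytes saved per object (69.7%)"
print(f"~{total_saved_gb:.1f} GB saved across {n_objects:,} objects")
# prints "~13.5 GB saved across 1,000,000 objects"
```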

Expected Behavior

Objects of the SeqLike class should not allocate memory for attributes that are not used, thereby reducing the overall memory footprint of the application.

Current Behavior

Currently, every instance of the SeqLike class includes the _per_letter_annotations attribute, which increases the memory usage unnecessarily.

Possible Solution

One possible temporary workaround is to set the _per_letter_annotations attribute to None after its last necessary use, or to remove the attribute entirely if it is confirmed to be redundant.
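A minimal sketch of that workaround as a helper. The function name drop_letter_annotations is hypothetical and not part of seqlike; it relies on the private attributes _nt_record, _aa_record and _per_letter_annotations shown earlier, which may change between versions. The demonstration uses a stand-in object so that it runs without seqlike installed.

```python
from types import SimpleNamespace


def drop_letter_annotations(obj, record_names=("_nt_record", "_aa_record")):
    """Set _per_letter_annotations to None on each record that obj holds.

    Hypothetical helper. Returns the number of records that were changed.
    """
    changed = 0
    for name in record_names:
        record = getattr(obj, name, None)
        if record is None:
            continue  # e.g. an NT-only object has no AA record
        if getattr(record, "_per_letter_annotations", None) is not None:
            record._per_letter_annotations = None
            changed += 1
    return changed


# Stand-in for a SeqLike object, used only for demonstration.
fake_seq = SimpleNamespace(
    _nt_record=SimpleNamespace(_per_letter_annotations={"seqnums": ["1", "2", "3"]}),
    _aa_record=None,
)
n_changed = drop_letter_annotations(fake_seq)
print(n_changed)  # prints 1
```

Anything that reads the annotations afterwards may break, so this is only safe once they are no longer needed.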

Alternative solutions may include looking at (line 1082 in particular)

seqlike/seqlike/SeqLike.py

Lines 1070 to 1083 in dde761c

@dispatch(SeqRecord)
def record_from(sequence, **kwargs) -> SeqRecord:
    """Construct SeqRecord from SeqRecord sequences.
    :param sequence: A SeqRecord object.
    :param **kwargs: Passed through to SeqRecord constructor.
    :returns: A SeqRecord object.
    """
    s: SeqRecord = deepcopy(sequence)
    for k, v in kwargs.items():
        setattr(s, k, v)
    s = add_seqnums_to_letter_annotations(s)
    return s

and modifying the function so that per-letter annotations are added only when a condition is met, rather than by default.
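One way the conditional behavior could look, sketched with stand-ins so it runs on its own. The Record class and add_seqnums function below are simplified placeholders for SeqRecord and add_seqnums_to_letter_annotations, and the with_seqnums flag is a hypothetical parameter name; none of this is seqlike's actual API.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Record:
    """Simplified stand-in for SeqRecord, for this sketch only."""
    seq: str
    letter_annotations: dict = field(default_factory=dict)


def add_seqnums(record):
    """Stand-in for add_seqnums_to_letter_annotations: 1-based string indices."""
    record.letter_annotations["seqnums"] = [str(i + 1) for i in range(len(record.seq))]
    return record


def record_from(sequence, with_seqnums=False, **kwargs):
    """Copy a record, adding seqnums only when explicitly requested."""
    s = deepcopy(sequence)
    for k, v in kwargs.items():
        setattr(s, k, v)
    if with_seqnums:  # hypothetical opt-in flag
        s = add_seqnums(s)
    return s


plain = record_from(Record("ATGC"))
numbered = record_from(Record("ATGC"), with_seqnums=True)
print(plain.letter_annotations)     # prints {}
print(numbered.letter_annotations)  # prints {'seqnums': ['1', '2', '3', '4']}
```

Whether the default should be opt-in or opt-out depends on how much existing code relies on seqnums being present.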
