Description
I've noticed that the _per_letter_annotations attribute in the SeqLike class is not actively used by many users of the package. However, it contributes significantly to the memory footprint of every object of this class. This can affect performance, especially in environments where resource efficiency (particularly memory) is critical.
For example, take a 201-character nucleotide string:
import random
seed = 33 #Setting seed for reproducibility
nt_length = 201 #Sequence length
random.seed(seed)
letters = ["A","T","G","C"] #Picking ATGC DNA nucleotide characters
nt_seq = ''.join(random.choice(letters) for _ in range(nt_length)) #Creating a random string
Then create a nucleotide SeqLike object from the string nt_seq:
from seqlike import SeqLike
seq_obj = SeqLike(sequence=nt_seq,seq_type="NT") #Creating a seqlike object
Looking at the memory footprint of seq_obj using pympler:
from pympler import asizeof
print("Size of SeqLike Object:", asizeof.asizeof(seq_obj), "bytes") #Looking at size of the object
Size of SeqLike Object: 19328 bytes
Further digging into the memory footprint of the object by unpacking attributes,
from pympler import asizeof

def get_attribute_sizes(obj, path='', visited=None, sizes=None):
    # Recursively record the deep size of obj and of everything reachable from it
    if visited is None:
        visited = set()
    if sizes is None:
        sizes = {}
    # Skip objects that were already measured (also guards against cycles)
    obj_id = id(obj)
    if obj_id in visited:
        return sizes
    visited.add(obj_id)
    # Calculate the size and store it if not zero
    obj_size = asizeof.asizeof(obj)
    if obj_size > 0:
        sizes[path if path else 'self'] = obj_size
    # Handle different types of collections and objects
    if hasattr(obj, '__dict__'):
        for attr, value in obj.__dict__.items():
            full_path = f"{path}.{attr}" if path else attr
            get_attribute_sizes(value, full_path, visited, sizes)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            full_path = f"{path}.{key}" if path else str(key)
            get_attribute_sizes(value, full_path, visited, sizes)
    elif isinstance(obj, (list, set, tuple)):
        for index, item in enumerate(obj):
            full_path = f"{path}[{index}]" if path else f"[{index}]"
            get_attribute_sizes(item, full_path, visited, sizes)
    return sizes

attribute_sizes = get_attribute_sizes(seq_obj)
Plotting top 20 attributes
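A text-only way to get the same ranking is to sort the size dictionary by value. This is a minimal sketch, assuming the input is a dict of path to size in bytes like the one returned by get_attribute_sizes; the helper name top_attributes is hypothetical, and the values in the usage lines are toy numbers, not measurements from SeqLike.

```python
def top_attributes(sizes, n=20):
    """Return the n largest (path, size_in_bytes) pairs, largest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy values for illustration only; not measurements from SeqLike.
toy_sizes = {"self": 300, "record": 200, "record.annotations": 120, "id": 40}
for path, size in top_attributes(toy_sizes, n=3):
    print(f"{path:<24}{size:>8} bytes")
```

With the real data, top_attributes(attribute_sizes) would give the 20 entries to plot.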

We see that _per_letter_annotations makes up a sizeable chunk of the _nt_record attribute, 13472 bytes to be precise.
Dissecting the _per_letter_annotations attribute further, we can see that it is a dictionary with a single key-value pair: the key is seqnums and the value is a list of strings that are presumably position indices running up to the length of the sequence.
print(seq_obj._nt_record._per_letter_annotations.keys()) #See keys of the dictionary
print(seq_obj._nt_record._per_letter_annotations["seqnums"]) #Focus on values of the dictionary
print(type(seq_obj._nt_record._per_letter_annotations["seqnums"][0])) #See data type of the element in the list
dict_keys(['seqnums'])
['1', '2', '3', '4', '5', '6', '7', '8', '9' .... '201']
<class 'str'>
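To get a feel for why a list like this is expensive, the sketch below adds up the size of a list of the strings '1' to 'N' using only the standard library, and compares it with a range object covering the same positions. It is a rough illustration under stated assumptions: the helper names are hypothetical, sys.getsizeof counts differently from pympler so the numbers will not match the figures in this issue exactly, and the range comparison is only a yardstick, not a claim about how SeqLike should store positions.

```python
import sys


def seqnums_list_bytes(n):
    """Approximate bytes used by a list of the strings '1'..'n' (CPython)."""
    seqnums = [str(i) for i in range(1, n + 1)]
    # Shallow size of the list plus the size of each string element.
    return sys.getsizeof(seqnums) + sum(sys.getsizeof(s) for s in seqnums)


def seqnums_range_bytes(n):
    """Bytes used by a range object covering the same positions."""
    return sys.getsizeof(range(1, n + 1))


n = 201  # same length as the example sequence
print("list of str:", seqnums_list_bytes(n), "bytes")
print("range:      ", seqnums_range_bytes(n), "bytes")
```

Each position costs a full Python string object plus a pointer in the list, so the total grows linearly with sequence length.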
By setting seq_obj._nt_record._per_letter_annotations to None, we can see a considerable reduction in the memory occupied by the object:
from pympler import asizeof
seq_obj._nt_record._per_letter_annotations = None
print("Size of SeqLike Object:", asizeof.asizeof(seq_obj), "bytes")
Size of SeqLike Object: 5856 bytes

Comments
- This is a reduction in memory footprint of this one object by ~70%.
- I have observed similar behavior for the _aa_record._per_letter_annotations as well. So the same still applies for objects created as an AA record instead of an NT record.
- The memory bloat can add up significantly over time and can be a critical limiting factor (memory-wise) especially for large machine-learning/computational biology data processing and analysis applications.
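The ~70% figure follows directly from the two pympler measurements reported above. The short calculation below reproduces it and then scales the per-object saving to one million objects; that scale-up is a hypothetical illustration that assumes every object holds a sequence of the same 201-letter length.

```python
size_before = 19328  # bytes, SeqLike object as created
size_after = 5856    # bytes, after setting _per_letter_annotations to None

saved = size_before - size_after
reduction = saved / size_before
print(f"Saved {saved} bytes per object ({reduction:.1%})")
# Saved 13472 bytes per object (69.7%)

# Hypothetical scale-up: one million objects of the same 201-letter length.
n_objects = 1_000_000
print(f"About {saved * n_objects / 1e9:.1f} GB saved across {n_objects:,} objects")
# About 13.5 GB saved across 1,000,000 objects
```

The saved amount, 13472 bytes, matches the size reported for _per_letter_annotations in the attribute breakdown.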
Expected Behavior
Objects of the SeqLike class should not allocate memory for attributes that are not used, thereby reducing the overall memory footprint of the application.
Current Behavior
Currently, every instance of the SeqLike class includes the _per_letter_annotations attribute, which increases the memory usage unnecessarily.
Possible Solution
One potential temporary solution is to set the _per_letter_annotations attribute to None after its last necessary use, or to remove the attribute entirely if it is confirmed to be redundant.
Alternative solutions may include looking at the function below (line 1082 in particular)
@dispatch(SeqRecord)
def record_from(sequence, **kwargs) -> SeqRecord:
    """Construct SeqRecord from SeqRecord sequences.

    :param sequence: A SeqRecord object.
    :param **kwargs: Passed through to SeqRecord constructor.
    :returns: A SeqRecord object.

    """
    s: SeqRecord = deepcopy(sequence)
    for k, v in kwargs.items():
        setattr(s, k, v)
    s = add_seqnums_to_letter_annotations(s)
    return s
and modifying the function so that per-letter annotations are added only when a condition is met, rather than by default.
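One way the conditional behavior could look is an opt-in flag. The sketch below is self-contained and deliberately does not use SeqLike or Biopython: ToyRecord, add_seqnums, record_from_sketch and the add_seqnums_flag parameter are all hypothetical stand-ins chosen for illustration, and whether other parts of SeqLike depend on seqnums always being present would need to be checked before adopting anything like this.

```python
from copy import deepcopy


class ToyRecord:
    """Hypothetical stand-in for a SeqRecord; not the Biopython class."""

    def __init__(self, seq):
        self.seq = seq
        self.letter_annotations = {}


def add_seqnums(record):
    """Toy version of the numbering step: store '1'..'N' as strings."""
    record.letter_annotations["seqnums"] = [str(i + 1) for i in range(len(record.seq))]
    return record


def record_from_sketch(sequence, add_seqnums_flag=False, **kwargs):
    """Copy a record and add seqnums only when explicitly requested."""
    s = deepcopy(sequence)
    for k, v in kwargs.items():
        setattr(s, k, v)
    if add_seqnums_flag:  # hypothetical opt-in flag, not part of SeqLike
        s = add_seqnums(s)
    return s


lean = record_from_sketch(ToyRecord("ATGC"))
full = record_from_sketch(ToyRecord("ATGC"), add_seqnums_flag=True)
print(lean.letter_annotations)  # no seqnums allocated by default
print(full.letter_annotations)  # seqnums present only on request
```

Defaulting the flag to False gives the memory saving to everyone but changes existing behavior; defaulting it to True would keep current behavior and let memory-sensitive callers opt out.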
The record_from function quoted in this issue is from seqlike/seqlike/SeqLike.py, lines 1070 to 1083 in dde761c.