Expected behaviour
Within a single request, a connection to MySQL shouldn't be closed while still in use.
Current behaviour
We recently hit an issue where MCPServer unexpectedly threw an exception during processing, reporting MySQLdb.OperationalError: (2006, '') (which is "mysql server has gone away"). Usually this happens when MySQL gets restarted or the underlying connection is closed for some other reason, but in this case there was no system-level reason for this to have happened.
After some digging, I eventually worked out that the culprit was auto_close_old_connections() and a pretty large dose of bad luck. The root cause is auto_close_old_connections() being called more than once during the same request, when the request happens to start just before the underlying database connection is due to be expired by Django's CONN_MAX_AGE setting (which defaults to 1 hour).
In my case, I think the sequence of events was:
- A request starts, which calls
auto_close_old_connections() up front. No problem there, because the connection is a mere 59 minutes and 59 seconds old.
- In
lib/server/jobs/client.py, we find ourselves in submit_tasks in a loop that is iterating over the self.package.files result set.
- Within the body of that loop, we call
self.task_backend.submit_task, which ultimately calls the submit method of server/tasks/backends/gearman_backend.py. In turn, that method calls Task.bulk_log, which is decorated with @auto_close_old_connections()
- Now our underlying database connection is over an hour old, and
auto_close_old_connections() closes it.
- But eventually we return back to our loop that is still iterating over
self.package.files. It finishes successfully and attempts to close the underlying result set, but finds that its database connection has gone away.
The stack trace we see when that happens:
Traceback (most recent call last):
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 342, in _invoke_callbacks
callback(self)
File "/usr/lib/archivematica/MCPServer/server/queues.py", line 220, in _job_completed_callback
next_job = future.result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/lib/archivematica/MCPServer/server/jobs/client.py", line 122, in run
self.submit_tasks()
File "/usr/lib/archivematica/MCPServer/server/jobs/client.py", line 225, in submit_tasks
for file_replacements in self.package.files(
File "/usr/lib/archivematica/MCPServer/server/packages.py", line 663, in files
for file_obj in queryset.iterator():
File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.10/site-packages/django/db/models/query.py", line 516, in _iterator
yield from iterable
File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.10/site-packages/django/db/models/query.py", line 121, in __iter__
for row in compiler.results_iter(results):
File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1496, in apply_converters
for row in map(list, rows):
File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 2096, in cursor_iter
cursor.close()
File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.10/site-packages/MySQLdb/cursors.py", line 103, in close
self._discard()
File "/usr/share/archivematica/virtualenvs/archivematica/lib/python3.10/site-packages/MySQLdb/cursors.py", line 95, in _discard
while con.next_result() == 0: # -1 means no more data.
MySQLdb.OperationalError: (2006, '')
There are a couple of other issues that might not be exactly the same bug, but have the same kind of shape:
#1758 #1750
Steps to reproduce
It's difficult to reproduce in the wild because you do have to get very unlucky :) In my development setup, I was able to replicate the issue by:
- Changing my CONN_MAX_AGE setting to
1 to get it turning over connections after 1 second
- Adding a
sleep statement to the body of lib/server/jobs/client.py's submit_tasks to slow down the loop enough to make hitting the bug likely.
- Submitting a transfer as normal
Your environment (version of Archivematica, operating system, other relevant details)
I'm running 1:1.17.0-1~22.04 under Ubuntu Jammy (Ubuntu 22.04.5 LTS)
For Artefactual use:
Before you close this issue, you must check off the following:
Expected behaviour
Within a single request, a connection to MySQL shouldn't be closed while still in use.
Current behaviour
We recently hit an issue where MCPServer unexpectedly threw an exception during processing, reporting
MySQLdb.OperationalError: (2006, '')(which is "mysql server has gone away"). Usually this happens when MySQL gets restarted or the underlying connection is closed for some other reason, but in this case there was no system-level reason for this to have happened.After some digging, I eventually worked out that the culprit was
auto_close_old_connections()and a pretty large dose of bad luck. The root cause isauto_close_old_connections()being called more than once during the same request, when the request happens to start just before the underlying database connection is due to be expired by Django'sCONN_MAX_AGEsetting (which defaults to 1 hour).In my case, I think the sequence of events was:
auto_close_old_connections()up front. No problem there, because the connection is a mere 59 minutes and 59 seconds old.lib/server/jobs/client.py, we find ourselves insubmit_tasksin a loop that is iterating over theself.package.filesresult set.self.task_backend.submit_task, which ultimately calls thesubmitmethod ofserver/tasks/backends/gearman_backend.py. In turn, that method callsTask.bulk_log, which is decorated with@auto_close_old_connections()auto_close_old_connections()closes it.self.package.files. It finishes successfully and attempts to close the underlying result set, but finds that its database connection has gone away.The stack trace we see when that happens:
There are a couple of other issues that might not be exactly the same bug, but have the same kind of shape:
#1758 #1750
Steps to reproduce
It's difficult to reproduce in the wild because you do have to get very unlucky :) In my development setup, I was able to replicate the issue by:
1to get it turning over connections after 1 secondsleepstatement to the body oflib/server/jobs/client.py'ssubmit_tasksto slow down the loop enough to make hitting the bug likely.Your environment (version of Archivematica, operating system, other relevant details)
I'm running
1:1.17.0-1~22.04under Ubuntu Jammy (Ubuntu 22.04.5 LTS)For Artefactual use:
Before you close this issue, you must check off the following: