HuttleyLab/scitrack

Supported Python versions: 3.10+

About scitrack

One of the critical challenges in scientific analysis is tracking all the elements involved. These include the arguments provided to a specific application (including default values), the input data files referenced by those arguments, and the output data generated by the application. A minimal set of system-specific information should also be tracked.

scitrack is a simple package aimed at researchers writing scripts, or more substantial scientific software, to support the tracking of scientific computation. The package provides elementary functionality to support logging. The primary capabilities concern generating checksums on input and output files and facilitating logging of the computational environment.

Installing

$ pip install scitrack

CachingLogger

The main object provided by scitrack is CachingLogger, a wrapper around the Python standard library logging module. On invocation, CachingLogger captures basic information regarding the system and the command line call that was made to invoke the application.

In addition, the class provides convenience methods for logging both the path and the md5 hexdigest checksum [1] of input/output files. A method is also provided for producing checksums of text data, which is useful when the data come from a stream or a database, for instance.
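
To illustrate what such checksums are, here is a minimal sketch that computes md5 hex digests with the standard library's hashlib. The helper names file_md5 and text_md5 are hypothetical and not part of scitrack, whose own implementation may differ.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=65536):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as infile:
        # read fixed-size chunks so large files need not fit in memory
        for chunk in iter(lambda: infile.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def text_md5(text, encoding="utf-8"):
    """Return the md5 hex digest of a string after encoding it to bytes."""
    return hashlib.md5(text.encode(encoding)).hexdigest()
```

Two files (or strings) with identical bytes give the same digest, so a recorded digest can later be compared with a freshly computed one.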

All logging calls are cached until a path for a logfile is provided. The logger can also, optionally, create directories.

Simple instantiation of the logger

Create the logger. Setting create_dir=True means that the directory path is also created when the logfile is created.

from scitrack import CachingLogger
LOGGER = CachingLogger(create_dir=True)
LOGGER.log_file_path = "somedir/some_path.log"

The last assignment triggers creation of somedir/some_path.log.

Warning

Once set, a logger's .log_file_path cannot be changed.

Capturing a program's arguments and options

scitrack will write the contents of sys.argv to the log file, prefixed by command_string. However, this only captures arguments specified on the command line. Tracking the value of optional arguments not specified, which may have default values, is critical to tracking the full command set. Doing this is now easy with the simple statement LOGGER.log_args(). The logger can also record the versions of named dependencies.

Here's one approach to incorporating scitrack into a command line application built using the click command line interface library. Below we create a simple click app and capture the required and optional argument values.

Note

LOGGER.log_args() should be called as the first statement of the function body, or immediately after "true" default values have been configured.

import click

from scitrack import CachingLogger

LOGGER = CachingLogger()


@click.command()
@click.option("-i", "--infile", type=click.Path(exists=True))
@click.option("-t", "--test", is_flag=True, help="Run test.")
def main(infile, test):
    # capture the local variables, at this point just provided arguments
    LOGGER.log_args()
    LOGGER.log_versions("numpy")
    LOGGER.input_file(infile)
    LOGGER.log_file_path = "some_path.log"


if __name__ == "__main__":
    main()

The CachingLogger.log_message() method takes a message and a label. All other logging methods wrap log_message(), providing a specific label. For instance, the method input_file() writes two lines to the log:

  • input_file_path, the absolute path to the input file
  • input_file_path md5sum, the hex digest of the file

output_file() behaves analogously. An additional method, text_data(), is useful for other data input/output sources (e.g. records from a database). For its checksums to be meaningful for arbitrary data types, the conversion to text must be done systematically so that it gives the same result across platforms.
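
One possible way to get a platform-independent text form is sketched below. It assumes the records are JSON-compatible; the helper canonical_text and the example records are hypothetical and not part of scitrack.

```python
import hashlib
import json


def canonical_text(record):
    """Serialise a JSON-compatible record deterministically.

    Sorting keys and fixing the separators removes variation due to
    dict ordering and whitespace; ensure_ascii avoids encoding surprises.
    """
    return json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


# hypothetical database records with the same content but different key order
record_a = {"id": 7, "name": "sample", "score": 0.5}
record_b = {"score": 0.5, "name": "sample", "id": 7}

text_a = canonical_text(record_a)
text_b = canonical_text(record_b)
digest_a = hashlib.md5(text_a.encode("utf-8")).hexdigest()
digest_b = hashlib.md5(text_b.encode("utf-8")).hexdigest()
```

Both records serialise to the same text, so their digests agree. Text produced this way could then be handed to text_data().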

The log_args() method captures all local variables within a scope.
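
To see why the position of the call matters, here is a minimal sketch of capturing a caller's local variables with the standard library's inspect module. This is an illustration only, not scitrack's implementation; capture_caller_locals and run are hypothetical names.

```python
import inspect


def capture_caller_locals():
    """Return a copy of the local variables of the calling function."""
    caller = inspect.currentframe().f_back
    try:
        return dict(caller.f_locals)
    finally:
        del caller  # avoid keeping a reference cycle through the frame


def run(infile, test=False):
    # called first, so only the arguments exist as locals at this point
    params = capture_caller_locals()
    scratch = len(infile)  # defined afterwards, so it is not captured
    return params, scratch


params, _ = run("data.fasta")
print(params)
```

Any variable assigned before the capture would be included, which is why calling early keeps the record limited to the arguments.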

The log_versions() method captures the version of the caller's own package, the versions of its currently installed declared dependencies (across core and every extras group), and the versions of any additional named packages, e.g. LOGGER.log_versions(['numpy', 'sklearn']). A name supplied via packages that is neither installed nor importable raises PackageNotFoundError.
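
The version lookup for named packages can be approximated with the standard library's importlib.metadata, as in the sketch below. It only covers installed distributions (not the importable-module case), and package_versions is a hypothetical helper, not a scitrack function.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(packages):
    """Map each distribution name to its installed version string."""
    # version() raises PackageNotFoundError for a name that is not installed
    return {name: version(name) for name in packages}


# a deliberately bogus name, to show the failure mode
try:
    package_versions(["a-package-that-is-not-installed"])
    missing = False
except PackageNotFoundError:
    missing = True
```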

The log_licenses() method mirrors log_versions() but emits the declared license of each package under the license label. The license is resolved from package metadata, preferring the PEP 639 License-Expression field over the legacy License field; if neither is declared the value is recorded as UNKNOWN. A name supplied via packages that is not installed raises PackageNotFoundError.
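
The resolution order described above can be sketched as follows. This is an assumption-laden illustration using importlib.metadata, not scitrack's code; pick_license and declared_license are hypothetical names.

```python
from importlib.metadata import metadata


def pick_license(fields):
    """Prefer License-Expression over the legacy License field."""
    for key in ("License-Expression", "License"):
        value = (fields.get(key) or "").strip()
        if value:
            return value
    return "UNKNOWN"


def declared_license(name):
    """Return the declared license of an installed distribution."""
    # metadata() raises PackageNotFoundError if the name is not installed
    return pick_license(metadata(name))


# plain dicts stand in for package metadata in this demonstration
both = pick_license({"License-Expression": "BSD-3-Clause", "License": "BSD"})
legacy = pick_license({"License": "MIT"})
neither = pick_license({})
```

Here both is "BSD-3-Clause", legacy is "MIT" and neither is "UNKNOWN".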

Some sample output

2020-05-25 13:32:07	Eratosthenes:98447	INFO	system_details : system=Darwin Kernel Version 19.4.0: Wed Mar  4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64
2020-05-25 13:32:07	Eratosthenes:98447	INFO	python : 3.8.2
2020-05-25 13:32:07	Eratosthenes:98447	INFO	user : gavin
2020-05-25 13:32:07	Eratosthenes:98447	INFO	command_string : ./demo.py -i /Users/gavin/repos/SciTrack/tests/sample-lf.fasta
2020-05-25 13:32:07	Eratosthenes:98447	INFO	params : {'infile': '/Users/gavin/repos/SciTrack/tests/sample-lf.fasta', 'test': False}
2020-05-25 13:32:07	Eratosthenes:98447	INFO	version : __main__==None
2020-05-25 13:32:07	Eratosthenes:98447	INFO	version : numpy==1.18.4
2020-05-25 13:32:07	Eratosthenes:98447	INFO	input_file_path : /Users/gavin/repos/SciTrack/tests/sample-lf.fasta
2020-05-25 13:32:07	Eratosthenes:98447	INFO	input_file_path md5sum : 96eb2c2632bae19eb65ea9224aaafdad

Summarising a log file

log_summary() parses a written log file and returns its entries grouped by label, in the order they appear:

from scitrack import log_summary

summary = log_summary("some_path.log")
print(summary["input_file_path"])
# ['/path/to/input1.fasta', '/path/to/input2.fasta']
print(summary["input_file_path md5sum"])
# ['96eb2c2632bae19eb65ea9224aaafdad', ...]

By default only labels emitted by scitrack itself are captured: system_details, python, user, command_string, params, version, license, input_file_path, output_file_path, the corresponding md5sum lines, and misc. Lines under any other label are skipped silently. Two keyword arguments relax this:

  • labels=[...] — opt in to additional, application-specific labels that your code emits via LOGGER.log_message(msg, label="...").
  • all_labels=True — capture every label encountered in the file.
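
The grouping behaviour can be pictured with the following sketch. It assumes the tab-separated layout shown in the sample output above (timestamp, host:pid, level, then "label : value"). It is not scitrack's parser, and group_by_label is a hypothetical name.

```python
from collections import defaultdict


def group_by_label(lines, labels=None):
    """Group 'label : value' log entries by label, preserving file order.

    If labels is None every label is kept, otherwise only those listed.
    """
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) != 4:
            continue  # not a line in the assumed layout
        label, sep, value = fields[3].partition(" : ")
        if not sep:
            continue  # no 'label : value' separator
        if labels is not None and label not in labels:
            continue  # skipped silently
        grouped[label].append(value)
    return dict(grouped)


sample = [
    "2020-05-25 13:32:07\tEratosthenes:98447\tINFO\tpython : 3.8.2",
    "2020-05-25 13:32:07\tEratosthenes:98447\tINFO\tuser : gavin",
    # hypothetical application-specific label
    "2020-05-25 13:32:07\tEratosthenes:98447\tINFO\taccuracy : 0.93",
]

selected = group_by_label(sample, labels={"python", "user"})
everything = group_by_label(sample)
```

In this sketch selected omits the accuracy entry, while everything keeps all three labels.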

Project-specific summaries

Because log_summary() returns a plain dict[str, list[str]], clients can layer their own reporting on top without re-parsing the file. For example, a pipeline that emits custom dataset_id and accuracy entries can produce a project-tailored report:

from scitrack import log_summary

summary = log_summary(
    "path/to/run.log",
    labels=["dataset_id", "accuracy"],
)

print(f"Run by {summary['user'][0]} on Python {summary['python'][0]}")
print(f"Command: {summary['command_string'][0]}")
print(f"Inputs:  {len(summary.get('input_file_path', []))}")
print(f"Outputs: {len(summary.get('output_file_path', []))}")
for dataset, accuracy in zip(summary["dataset_id"], summary["accuracy"]):
    print(f"  {dataset}: accuracy={accuracy}")

This makes it straightforward to build summaries of application logs, such as provenance reports for users.

Other useful functions

get_file_hexdigest() and get_text_hexdigest() compute the md5 hex digest of a file or of text, respectively. They can be used to check at a later date that results still match the state recorded in the log file; for example, output_file() records the path and md5sum of an output file, which can be compared against a freshly computed digest.

get_package_licenses(packages) returns a {name: license} mapping for a list of installed packages, raising PackageNotFoundError eagerly if any name is not installed.

Reporting issues

Use the project issue tracker.

For Developers

The project is managed with uv and uses the hatchling build backend. Having cloned the repository onto your machine, install uv (see the uv installation guide), then create a development environment with the dev dependency group:

$ cd path/to/cloned/repo
$ uv sync --group dev

Run the test suite on the host Python:

$ uv run pytest tests/

The test, coverage, type-check, and format sessions are driven by nox with the uv backend. For example:

$ uv run nox -db uv -s test-3.11
$ uv run nox -db uv -s testcov-3.14 -- --cov-report=term-missing
$ uv run nox -db uv -s type_check-3.11
$ uv run nox -db uv -s fmt

Build the sdist and wheel with:

$ uv build

Footnotes

  1. The hexdigest serves as an effectively unique signature of a file's contents.
