Merged
17 commits
183 changes: 183 additions & 0 deletions README.md
@@ -0,0 +1,183 @@
[![CI](https://github.com/HuttleyLab/scitrack/actions/workflows/testing_develop.yml/badge.svg)](https://github.com/HuttleyLab/scitrack/actions/workflows/testing_develop.yml)
[![coverall](https://coveralls.io/repos/github/GavinHuttley/scitrack/badge.svg?branch=develop)](https://coveralls.io/github/GavinHuttley/scitrack?branch=develop)
[![Using Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-310/)

# About `scitrack`

One of the critical challenges in scientific analysis is to track all the elements involved. This includes the arguments provided to a specific application (including default values), input data files referenced by those arguments and output data generated by the application. In addition to this, tracking a minimal set of system specific information.


suggestion (typo): Consider fixing the sentence fragment and adding a hyphen in "system-specific".

For example, you could rephrase it to: "In addition to this, it is important to track a minimal set of system-specific information."

Suggested change
One of the critical challenges in scientific analysis is to track all the elements involved. This includes the arguments provided to a specific application (including default values), input data files referenced by those arguments and output data generated by the application. In addition to this, tracking a minimal set of system specific information.
One of the critical challenges in scientific analysis is to track all the elements involved. This includes the arguments provided to a specific application (including default values), input data files referenced by those arguments and output data generated by the application. In addition to this, it is important to track a minimal set of system-specific information.


`scitrack` is a simple package aimed at researchers writing scripts, or more substantial scientific software, to support the tracking of scientific computation. The package provides elementary functionality to support logging. The primary capabilities concern generating checksums on input and output files and facilitating logging of the computational environment.

## Installing

```
$ pip install scitrack
```

## `CachingLogger`

The main object provided by `scitrack` is `CachingLogger`. This object is basically a wrapper around the Python standard library `logging` module. On invocation, `CachingLogger` captures basic information regarding the system and the command line call that was made to invoke the application.

In addition, the class provides convenience methods for logging both the path and the md5 hexdigest checksum [^1] of input/output files. A method is also provided for producing checksums of text data. The latter is useful for the case when data are from a stream or a database, for instance.

All logging calls are cached until a path for a logfile is provided. The logger can also, optionally, create directories.
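The caching behaviour described above can be illustrated with the standard library alone. The sketch below is not `scitrack`'s implementation; `DeferredFileLogger` and its methods are hypothetical names. It buffers records with `logging.handlers.MemoryHandler` and replays them once a path is assigned.

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path


class DeferredFileLogger:
    """Hypothetical helper: holds log records in memory until a path is set."""

    def __init__(self, name="deferred_demo"):
        self._logger = logging.getLogger(name)
        self._logger.setLevel(logging.INFO)
        self._logger.propagate = False
        # flushLevel sits above CRITICAL so no record triggers an early flush;
        # without a target, MemoryHandler simply keeps its buffer
        self._cache = logging.handlers.MemoryHandler(
            capacity=100_000, flushLevel=logging.CRITICAL + 1
        )
        self._logger.addHandler(self._cache)
        self._file_handler = None

    def info(self, msg):
        self._logger.info(msg)

    def set_path(self, path, create_dir=False):
        if self._file_handler is not None:
            raise AttributeError("log file path is already set")
        path = Path(path)
        if create_dir:
            path.parent.mkdir(parents=True, exist_ok=True)
        self._file_handler = logging.FileHandler(path, mode="w")
        # replay the cached records into the file, then log directly to it
        self._cache.setTarget(self._file_handler)
        self._cache.flush()
        self._logger.removeHandler(self._cache)
        self._logger.addHandler(self._file_handler)

    def close(self):
        if self._file_handler is not None:
            self._logger.removeHandler(self._file_handler)
            self._file_handler.close()


with tempfile.TemporaryDirectory() as tmp:
    log = DeferredFileLogger()
    log.info("cached: no log file yet")
    log_path = Path(tmp) / "somedir" / "run.log"
    log.set_path(log_path, create_dir=True)
    log.info("written straight to the file")
    log.close()
    logged_lines = log_path.read_text().splitlines()
```

Both messages end up in the file in the order they were logged, even though the first was emitted before any file existed.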

## Simple instantiation of the logger

Create the logger as follows. Setting `create_dir=True` means that any missing directories in the log file path are created when the log file itself is created.

```python
from scitrack import CachingLogger
LOGGER = CachingLogger(create_dir=True)
LOGGER.log_file_path = "somedir/some_path.log"
```

The last assignment triggers creation of `somedir/some_path.log`.

> **Warning**
>
> Once set, a loggers `.log_file_path` cannot be changed.


suggestion (typo): Use the possessive form "logger's" instead of "loggers".

Change this sentence to: Once set, a logger's ".log_file_path" cannot be changed.

Suggested change
> Once set, a loggers `.log_file_path` cannot be changed.
> Once set, a logger's ".log_file_path" cannot be changed.


## Capturing a programs arguments and options


suggestion (typo): Add a possessive apostrophe in "program's".

Consider “Capturing a program's arguments and options.” for the heading.

Suggested change
## Capturing a programs arguments and options
## Capturing a program's arguments and options


`scitrack` will write the contents of `sys.argv` to the log file, prefixed by `command_string`. However, this only captures arguments specified on the command line. Tracking the value of optional arguments not specified, which may have default values, is critical to tracking the full command set. Doing this is now easy with the simple statement `LOGGER.log_args()`. The logger can also record the versions of named dependencies.

Here's one approach to incorporating `scitrack` into a command line application built using the `click` [command line interface library](http://click.pocoo.org/). Below we create a simple `click` app and capture the required and optional argument values.

> **Note**
>
> `LOGGER.log_args()` should be called immediately after the function definition, or after "true" default values have been configured.

```python
import click

from scitrack import CachingLogger

LOGGER = CachingLogger()


@click.command()
@click.option("-i", "--infile", type=click.Path(exists=True))
@click.option("-t", "--test", is_flag=True, help="Run test.")
def main(infile, test):
    # capture the local variables, at this point just provided arguments
    LOGGER.log_args()
    LOGGER.log_versions("numpy")
    LOGGER.input_file(infile)
    LOGGER.log_file_path = "some_path.log"


if __name__ == "__main__":
    main()
```

The `CachingLogger.log_message()` method takes a message and a label. All other logging methods wrap `log_message()`, providing a specific label. For instance, the method `input_file()` writes out two lines in the log.

- `input_file_path`, the absolute path to the input file
- `input_file_path md5sum`, the hex digest of the file

`output_file()` behaves analogously. An additional method `text_data()` is useful for other data input/output sources (e.g. records from a database). For this to have value for arbitrary data types, the conversion to text must be handled systematically so that it is robust across platforms.
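One possible way to make text checksums stable across platforms is to normalise the text before hashing. The following is a sketch only, not part of `scitrack`; `stable_text_hexdigest` is a hypothetical helper, and the normalisation steps (line endings, Unicode NFC, explicit UTF-8) are assumptions about what "robust" needs to cover.

```python
import hashlib
import unicodedata


def stable_text_hexdigest(text: str) -> str:
    """Return an md5 hex digest of text after removing platform differences."""
    # unify Windows (\r\n) and classic Mac (\r) line endings
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # unify equivalent Unicode forms, e.g. composed vs decomposed accents
    normalised = unicodedata.normalize("NFC", normalised)
    # encode explicitly rather than relying on the platform default encoding
    data = normalised.encode("utf-8")
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


# the same record exported on Windows and on Linux gives the same digest
windows_digest = stable_text_hexdigest("id,value\r\n1,0.5\r\n")
linux_digest = stable_text_hexdigest("id,value\n1,0.5\n")
```

The digest produced this way could then be logged as the text data checksum, provided every producer of the text applies the same normalisation.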

The `log_args()` method captures all local variables within a scope.
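To show what "all local variables within a scope" means at the point of the call, here is a standard library sketch of the general technique. `capture_caller_locals` and `analyse` are hypothetical names and this is not necessarily how `scitrack` does it.

```python
import inspect


def capture_caller_locals() -> dict:
    """Return a copy of the local variables of the calling function."""
    frame = inspect.currentframe()
    try:
        # f_back is the frame of whoever called this function
        return dict(frame.f_back.f_locals)
    finally:
        del frame  # avoid keeping a reference cycle alive


def analyse(infile, threshold=0.05, verbose=False):
    # called first, so only the arguments (including defaults) are captured
    params = capture_caller_locals()
    scratch = [1, 2, 3]  # defined after the capture, so not recorded
    return params


recorded = analyse("data.fasta")
print(recorded)
# {'infile': 'data.fasta', 'threshold': 0.05, 'verbose': False}
```

This is why the call belongs at the top of the function: anything assigned before it is recorded along with the arguments.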

The `log_versions()` method captures the version of the caller's own package, the versions of its currently installed declared dependencies (across `core` and every extras group), and the versions of any additional named packages, e.g. `LOGGER.log_versions(['numpy', 'sklearn'])`. A name supplied via `packages` that is neither installed nor importable raises `PackageNotFoundError`.
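Looking up versions of named packages can be sketched with `importlib.metadata`. This is an illustration, not `scitrack`'s code. `collect_versions` and `format_versions` are hypothetical helpers, and the sketch covers only installed distributions, not the importable-module fallback mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages) -> dict:
    """Map each named package to its installed version string.

    Raises PackageNotFoundError if a name has no installed distribution.
    """
    if isinstance(packages, str):
        packages = [packages]
    return {name: version(name) for name in packages}


def format_versions(versions: dict) -> list:
    """Render entries in the name==version style used in the sample log."""
    return [f"{name}=={ver}" for name, ver in versions.items()]


try:
    collect_versions(["surely-not-an-installed-package-xyz"])
except PackageNotFoundError as err:
    missing = str(err)
```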

The `log_licenses()` method mirrors `log_versions()` but emits the declared license of each package under the `license` label. The license is resolved from package metadata, preferring the PEP 639 `License-Expression` field over the legacy `License` field; if neither is declared the value is recorded as `UNKNOWN`. A name supplied via `packages` that is not installed raises `PackageNotFoundError`.
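The field preference described above can be sketched as follows. `resolve_license` and `package_license` are hypothetical names and this is not `scitrack`'s own code. The first function works on any mapping of metadata fields, so it can be tried without installing anything.

```python
from importlib.metadata import PackageNotFoundError, metadata


def resolve_license(meta) -> str:
    """Pick a license string, preferring License-Expression over License."""
    for field in ("License-Expression", "License"):
        value = meta.get(field)
        if value and value.strip():
            return value.strip()
    return "UNKNOWN"


def package_license(name: str) -> str:
    """Resolve the license of an installed package from its metadata.

    Raises PackageNotFoundError if the package is not installed.
    """
    return resolve_license(metadata(name))


modern = resolve_license({"License-Expression": "BSD-3-Clause", "License": "BSD"})
legacy = resolve_license({"License": "MIT"})
undeclared = resolve_license({})
```

Here `modern` is `"BSD-3-Clause"`, `legacy` is `"MIT"` and `undeclared` is `"UNKNOWN"`.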

### Some sample output

```
2020-05-25 13:32:07 Eratosthenes:98447 INFO system_details : system=Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64
2020-05-25 13:32:07 Eratosthenes:98447 INFO python : 3.8.2
2020-05-25 13:32:07 Eratosthenes:98447 INFO user : gavin
2020-05-25 13:32:07 Eratosthenes:98447 INFO command_string : ./demo.py -i /Users/gavin/repos/SciTrack/tests/sample-lf.fasta
2020-05-25 13:32:07 Eratosthenes:98447 INFO params : {'infile': '/Users/gavin/repos/SciTrack/tests/sample-lf.fasta', 'test': False}
2020-05-25 13:32:07 Eratosthenes:98447 INFO version : __main__==None
2020-05-25 13:32:07 Eratosthenes:98447 INFO version : numpy==1.18.4
2020-05-25 13:32:07 Eratosthenes:98447 INFO input_file_path : /Users/gavin/repos/SciTrack/tests/sample-lf.fasta
2020-05-25 13:32:07 Eratosthenes:98447 INFO input_file_path md5sum : 96eb2c2632bae19eb65ea9224aaafdad
```

## Summarising a log file

`log_summary()` parses a written log file and returns its entries grouped by label, in the order they appear:

```python
from scitrack import log_summary

summary = log_summary("some_path.log")
print(summary["input_file_path"])
# ['/path/to/input1.fasta', '/path/to/input2.fasta']
print(summary["input_file_path md5sum"])
# ['96eb2c2632bae19eb65ea9224aaafdad', ...]
```

By default only labels emitted by `scitrack` itself are captured: `system_details`, `python`, `user`, `command_string`, `params`, `version`, `license`, `input_file_path`, `output_file_path`, the corresponding ` md5sum` lines, and `misc`. Lines under any other label are skipped silently. Two keyword arguments relax this:

- `labels=[...]` — opt in to additional, application-specific labels that your code emits via `LOGGER.log_message(msg, label="...")`.
- `all_labels=True` — capture every label encountered in the file.
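The filtering rules can be illustrated with a small stand-alone parser. This is a sketch, not the `log_summary()` implementation. `summarise_lines` is a hypothetical name, and the line layout (`<date> <time> <host>:<pid> <LEVEL> <label> : <message>`) is an assumption based on the sample output shown earlier.

```python
import re
from collections import defaultdict

# assumed layout: "<date> <time> <host>:<pid> <LEVEL> <label> : <message>"
LINE = re.compile(r"^\S+ \S+\s+\S+:\d+\s+[A-Z]+\s+(?P<label>.+?) : (?P<msg>.*)$")

DEFAULT_LABELS = {
    "system_details", "python", "user", "command_string", "params",
    "version", "license", "misc",
    "input_file_path", "input_file_path md5sum",
    "output_file_path", "output_file_path md5sum",
}


def summarise_lines(lines, labels=(), all_labels=False) -> dict:
    """Group log messages by label, in the order they appear."""
    wanted = DEFAULT_LABELS | set(labels)
    summary = defaultdict(list)
    for line in lines:
        match = LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # not a recognisable log line
        label = match["label"].strip()
        if all_labels or label in wanted:
            summary[label].append(match["msg"])
    return dict(summary)


sample = [
    "2020-05-25 13:32:07 Eratosthenes:98447 INFO user : gavin",
    "2020-05-25 13:32:07 Eratosthenes:98447 INFO accuracy : 0.93",
    "2020-05-25 13:32:07 Eratosthenes:98447 INFO input_file_path md5sum : 96eb2c2632bae19eb65ea9224aaafdad",
]
default_view = summarise_lines(sample)
with_custom = summarise_lines(sample, labels=["accuracy"])
everything = summarise_lines(sample, all_labels=True)
```

With the defaults the `accuracy` line is skipped. Passing `labels=["accuracy"]` or `all_labels=True` includes it.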

### Project-specific summaries

Because `log_summary()` returns a plain `dict[str, list[str]]`, clients can layer their own reporting on top without re-parsing the file. For example, a pipeline that emits custom `dataset_id` and `accuracy` entries can produce a project-tailored report:

```python
from scitrack import log_summary

summary = log_summary(
    "path/to/run.log",
    labels=["dataset_id", "accuracy"],
)

print(f"Run by {summary['user'][0]} on Python {summary['python'][0]}")
print(f"Command: {summary['command_string'][0]}")
print(f"Inputs: {len(summary.get('input_file_path', []))}")
print(f"Outputs: {len(summary.get('output_file_path', []))}")
for dataset, accuracy in zip(summary["dataset_id"], summary["accuracy"]):
    print(f" {dataset}: accuracy={accuracy}")
```

This makes it straightforward to summarise application logs, for example to produce provenance reports for users.

## Other useful functions

Two other useful functions are `get_file_hexdigest()` and `get_text_hexdigest()` compute md5sum for files or text. Those can be used to validate the state recorded in the log-file matches results at a later date, e.g. `output_file()` records the path and md5sum of an output file.


suggestion (typo): Clarify the sentence by adding a missing "that"/"which" before "compute md5sum".

The wording "...and get_text_hexdigest() compute md5sum" is ungrammatical; consider restructuring the sentence, e.g. "Two other useful functions are get_file_hexdigest() and get_text_hexdigest(), which compute md5sums for files or text."

Suggested change
Two other useful functions are `get_file_hexdigest()` and `get_text_hexdigest()` compute md5sum for files or text. Those can be used to validate the state recorded in the log-file matches results at a later date, e.g. `output_file()` records the path and md5sum of an output file.
Two other useful functions are `get_file_hexdigest()` and `get_text_hexdigest()`, which compute md5sums for files or text. Those can be used to validate the state recorded in the log-file matches results at a later date, e.g. `output_file()` records the path and md5sum of an output file.


`get_package_licenses(packages)` returns a `{name: license}` mapping for a list of installed packages, raising `PackageNotFoundError` eagerly if any name is not installed.
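The validation use case mentioned for the hexdigest functions, checking that a file still matches the md5sum recorded in a log, might look like the sketch below. It uses only `hashlib`. `file_md5` and `matches_recorded` are hypothetical helpers, not `scitrack` functions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(path, recorded_hexdigest: str) -> bool:
    """True if the file's current checksum equals the one from the log."""
    return file_md5(path) == recorded_hexdigest.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    result = Path(tmp) / "result.txt"
    result.write_bytes(b"alignment length: 1042\n")
    recorded = file_md5(result)  # stands in for the value stored in the log

    unchanged = matches_recorded(result, recorded)
    result.write_bytes(b"alignment length: 1043\n")  # file modified later
    still_valid = matches_recorded(result, recorded)
```

After the file is modified, `still_valid` is `False`, flagging that the output no longer corresponds to the logged run.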

## Reporting issues

Use the project [issue tracker](https://github.com/HuttleyLab/scitrack/issues).

## For Developers

The project is managed with [uv](https://docs.astral.sh/uv/) and uses the `hatchling` build backend. Having cloned the repository onto your machine, install `uv` (see the [uv installation guide](https://docs.astral.sh/uv/getting-started/installation/)), then create a development environment with the `dev` dependency group:

```
$ cd path/to/cloned/repo
$ uv sync --group dev
```

Run the test suite on the host Python:

```
$ uv run pytest tests/
```

The test, coverage, type-check, and format sessions are driven by [nox](https://nox.thea.codes/) with the uv backend. For example:

```
$ uv run nox -db uv -s test-3.11
$ uv run nox -db uv -s testcov-3.14 -- --cov-report=term-missing
$ uv run nox -db uv -s type_check-3.11
$ uv run nox -db uv -s fmt
```

Build the sdist and wheel with:

```
$ uv build
```

[^1]: The hexdigest serves as a unique signature of a files contents.


suggestion (typo): Use the possessive form "file's contents" in the footnote.

Change this to "a file's contents" for correct grammar.

165 changes: 0 additions & 165 deletions README.rst

This file was deleted.

12 changes: 10 additions & 2 deletions pyproject.toml
@@ -6,7 +6,7 @@ build-backend = "hatchling.build"
name = "scitrack"
dynamic = ["version"]
description = "Basic logging capabilities to track scientific computations."
-readme = "README.rst"
+readme = "README.md"
license = "BSD-3-Clause"
license-files = ["LICENSE"]
authors = [{ name = "Gavin Huttley", email = "Gavin.Huttley@anu.edu.au" }]
@@ -42,7 +42,7 @@ path = "src/scitrack/__init__.py"
packages = ["src/scitrack"]

[tool.hatch.build.targets.sdist]
-include = ["src/scitrack", "pyproject.toml", "README.rst", "LICENSE"]
+include = ["src/scitrack", "pyproject.toml", "README.md", "LICENSE"]
exclude = ["**/*.xml", "**/__pycache__"]

[tool.mypy]
@@ -53,4 +53,12 @@ warn_unused_ignores = true
warn_redundant_casts = true
warn_unreachable = true

+[tool.coverage.run]
+source_pkgs = ["scitrack"]
+
+[tool.coverage.report]
+# restrict the report to scitrack; lazily imported stdlib modules
+# (zipfile, importlib.metadata) otherwise leak into the output
+include = ["*/scitrack/*"]
+
# Ruff configuration lives in ./ruff.toml
2 changes: 1 addition & 1 deletion ruff.toml
@@ -36,7 +36,7 @@ target-version = "py310"
# McCabe complexity (`C901`) by default.
# G004 is about lazy formatting of logging args, not important here
select = ["ALL"]
-ignore = ["EXE002", "FA100", "E501", "D", "G004"]
+ignore = ["EXE002", "FA100", "E501", "D", "G004", "COM812"]

# Allow fix for all enabled rules (when `--fix`) is provided.
fixable = ["ALL"]