How to edit PDFs while relying on binary message bodies

## The problem the component solves

Handling email message bodies, both HTML and plain text, seems to be much easier when we keep them as binary objects, since that avoids having to manage encoding issues. The libraries we're using for EML and MBOX can provide either binary or string message bodies, while the PST library provides binary objects (we're currently [detecting encoding](https://github.com/UAlbanyArchives/mailbag/blob/develop/mailbag/formats/pst.py#L63-L74) to generate strings) and the MSG library seems to provide only string objects. So, currently we have `HTML_Bytes`, `Text_Bytes`, `HTML_Body`, and `Text_Body` fields in the model (the body fields being strings). The parsers fill in whatever they can, and the derivatives try the bytes fields first and fall back to the string bodies. This seems to work reasonably well in testing.
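To make the fallback concrete, here is a minimal sketch. The four field names come from the model described above, but the `EmailBodies` class, the `html_for_derivative` helper, and the `fallback_encoding` parameter are hypothetical names for illustration and are not the actual mailbag model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmailBodies:
    # Hypothetical stand-in for the body fields of the real model.
    HTML_Bytes: Optional[bytes] = None
    Text_Bytes: Optional[bytes] = None
    HTML_Body: Optional[str] = None
    Text_Body: Optional[str] = None


def html_for_derivative(bodies, fallback_encoding="utf-8"):
    """Return HTML as bytes, preferring the untouched binary body."""
    if bodies.HTML_Bytes is not None:
        # Best case: the parser gave us bytes, so no encoding decisions are needed.
        return bodies.HTML_Bytes
    if bodies.HTML_Body is not None:
        # Fallback: only a string is available, so we have to choose an encoding.
        return bodies.HTML_Body.encode(fallback_encoding)
    return None
```

With this shape, a derivative only has to make an encoding decision for parsers that provide strings alone, such as the MSG case.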

However, our first [PDF derivative implementation](https://github.com/UAlbanyArchives/mailbag/blob/develop/mailbag/derivatives/pdf.py) involves editing the message bodies as strings. This is for two reasons. First, it would be nice to have a table of common headers (To, From, Subject, etc.) at the top of each PDF, roughly like what you get when you print an email from Outlook. This is [Must-Have requirement #8](https://archives.albany.edu/mailbag/requirements/#must-have). We have a [minimum implementation of this](https://github.com/UAlbanyArchives/mailbag/blob/develop/mailbag/derivatives/pdf.py#L34-L56). It currently literally looks for `<body` in the string and inserts the table after the next `>`, which seemed better than trying to parse the HTML string with Beautiful Soup or whatever, but I'm unsure if this is the best method.
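One possible way to keep the same `<body` trick without decoding the message is to do the search and insert on the bytes directly. The sketch below is not the current implementation. It assumes the body uses an ASCII-compatible encoding (UTF-8, Latin-1, Windows-1252 and similar), so it would not work for UTF-16. The function names and the `headers` dict are hypothetical. Header values are written as pure ASCII, with non-ASCII characters turned into numeric character references, so the inserted table should render the same whatever the body encoding is.

```python
import html
import re

# Matches the opening <body ...> tag in raw bytes, case-insensitively.
BODY_TAG = re.compile(rb"<body\b[^>]*>", re.IGNORECASE)


def build_header_table(headers):
    """Build an ASCII-only HTML table from a dict of header names to values."""
    rows = "".join(
        "<tr><th>{}</th><td>{}</td></tr>".format(html.escape(name), html.escape(value))
        for name, value in headers.items()
    )
    table = "<table>" + rows + "</table>"
    # Non-ASCII characters become references like &#233;, so the result is plain ASCII.
    return table.encode("ascii", errors="xmlcharrefreplace")


def insert_header_table(html_bytes, headers):
    """Insert the header table right after the opening <body> tag, without decoding."""
    table = build_header_table(headers)
    match = BODY_TAG.search(html_bytes)
    if match is None:
        # No <body> tag found: put the table in front of the fragment.
        return table + html_bytes
    return html_bytes[: match.end()] + table + html_bytes[match.end():]
```

Like the string version, this still breaks if a `<body>` attribute value contains a literal `>`.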

The other use case is adding custom CSS to the HTML email bodies. This is for when the PDF doesn't look nice and you want to adjust the margins or similar. This is [Could-Have #36](https://archives.albany.edu/mailbag/requirements/#could-have). We're currently using `wkhtmltopdf`, which seems to make generally nice PDFs, but I could see this being useful.
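It may be possible to apply custom CSS without touching the body at all, because `wkhtmltopdf` has a `--user-style-sheet` option that loads a stylesheet with every page. A minimal sketch, assuming we call the `wkhtmltopdf` binary through `subprocess` (the function names and paths are hypothetical, and this has not been tested against real mailbag output):

```python
import subprocess


def build_wkhtmltopdf_command(html_path, pdf_path, css_path=None):
    """Build the wkhtmltopdf argument list, optionally with a user stylesheet."""
    command = ["wkhtmltopdf"]
    if css_path is not None:
        # The stylesheet is applied by wkhtmltopdf itself, so the HTML file stays unedited.
        command += ["--user-style-sheet", css_path]
    command += [html_path, pdf_path]
    return command


def html_file_to_pdf(html_path, pdf_path, css_path=None):
    """Run wkhtmltopdf; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_wkhtmltopdf_command(html_path, pdf_path, css_path), check=True)
```

Here the HTML bytes would be written to `html_path` unchanged, and the CSS would live in its own file.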

Both of these cases seem to require editing message bodies, which I believe would require us to manage encoding. My best plan right now is, unfortunately, to drop the custom CSS requirement, generate a separate PDF cover page with the headers, and merge that with the PDF generated by `wkhtmltopdf`.
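For the cover page idea, one option that might avoid a separate PDF merge step is the `cover` object in `wkhtmltopdf`, which puts one HTML page in front of the others in the same output file. The sketch below assumes that approach. The file names and function names are hypothetical. It writes the headers to their own UTF-8 HTML file, writes the body bytes to disk unedited, and returns the command to run.

```python
import html
from pathlib import Path


def write_cover_page(headers, cover_path):
    """Write a standalone UTF-8 HTML page containing only the header table."""
    rows = "".join(
        "<tr><th>{}</th><td>{}</td></tr>".format(html.escape(name), html.escape(value))
        for name, value in headers.items()
    )
    document = (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
        "<body><table>{}</table></body></html>".format(rows)
    )
    Path(cover_path).write_text(document, encoding="utf-8")


def stage_pdf_inputs(html_bytes, headers, work_dir):
    """Write the cover and body files, then return the wkhtmltopdf command."""
    work_dir = Path(work_dir)
    cover_path = work_dir / "cover.html"
    body_path = work_dir / "body.html"
    pdf_path = work_dir / "message.pdf"
    write_cover_page(headers, cover_path)
    # The message body is written exactly as received, so no decoding happens here.
    body_path.write_bytes(html_bytes)
    return ["wkhtmltopdf", "cover", str(cover_path), str(body_path), str(pdf_path)]
```

The returned list can be passed to `subprocess.run(..., check=True)`. The trade-off is that the headers land on their own page rather than at the top of the message.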

## Relevant part of mailbag spec?

N/A

## Type of component

- [ ] Core
- [x] Input
- [ ] Attachments
- [x] Derivatives conversion
- [ ] Reporting/Exporting
- [ ] GUI
- [ ] Distribution

## Expected contribution

- [ ] Pull Request
- [x] Comment with proposed solution

## Major challenges or things to keep in mind

Email parts _should_ list encoding correctly, but my real world experience says **beware**. The MBOX/EML libraries provide encoding with `part.get_charsets()`, but I had issues passing that to `open(filename, "w", encoding=)` when I tried it quickly. I'm not sure how the MSG or PST libraries provide encoding.
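One likely cause of the `open(..., encoding=)` trouble is that `get_charsets()` returns a list with one entry per part, and entries can be `None`, so it cannot be passed as an encoding name. A minimal sketch for EML/MBOX parts using the standard library `email` package follows. The helper names are hypothetical, and the UTF-8 default with `errors="replace"` is an assumption rather than a recommendation. It reads the charset of a single part with `get_content_charset()` and checks that Python knows the codec before using it.

```python
import codecs
import email


def pick_charset(part, default="utf-8"):
    """Return a codec name Python can use, falling back when the label is bad."""
    declared = part.get_content_charset()  # a single lowercased name, or None
    if declared:
        try:
            return codecs.lookup(declared).name
        except LookupError:
            # The message declares a charset that Python does not recognise.
            pass
    return default


def decode_part(part):
    """Decode one non-multipart part to str, replacing undecodable bytes."""
    payload = part.get_payload(decode=True)  # bytes, or None for multipart containers
    if payload is None:
        return ""
    return payload.decode(pick_charset(part), errors="replace")
```

This only covers a missing or unknown label. A part that declares the wrong charset would still decode incorrectly.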

