How to edit PDFs while relying on binary message bodies #116

@gwiedeman

The problem the component solves

Handling email message bodies, both HTML and plain text, seems to be much easier when we keep them as binary objects, since that avoids having to manage encoding issues. The libraries we're using for EML and MBOX can provide either binary or string message bodies, while the PST library provides binary objects (we currently detect the encoding to generate strings) and the MSG library seems to provide only string objects. So the model currently has HTML_Bytes, Text_Bytes, HTML_Body, and Text_Body fields (the Body fields being strings). The parsers fill in whatever they can, and for derivatives we try the bytes fields first and fall back to the string bodies. This seems to work reasonably well in testing.
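A minimal sketch of that bytes-first fallback, for illustration only. The four field names come from the model described above; the `Email` dataclass, the `derivative_bytes` helper, and the choice of UTF-8 when encoding a string body are assumptions, not the actual mailbag code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    # Field names follow the issue text; this class is only a stand-in
    # for the real model.
    HTML_Bytes: Optional[bytes] = None
    Text_Bytes: Optional[bytes] = None
    HTML_Body: Optional[str] = None
    Text_Body: Optional[str] = None


def derivative_bytes(message: Email, kind: str) -> Optional[bytes]:
    """Return the body for kind "HTML" or "Text" as bytes, or None.

    The original bytes are preferred so nothing is decoded or re-encoded.
    """
    raw = getattr(message, f"{kind}_Bytes")
    if raw:
        return raw
    text = getattr(message, f"{kind}_Body")
    if text:
        # Assumption: UTF-8 is acceptable for bodies that only exist as
        # strings. An HTML body that declares another charset in a <meta>
        # tag would then be mislabeled.
        return text.encode("utf-8")
    return None
```

With this shape, a derivative writer can open its output file in "wb" mode and never needs to pass an encoding.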

However, our first PDF derivative implementation involves editing the message bodies as strings, for two reasons. First, it would be nice to have a table of common headers (To, From, Subject, etc.) at the top of each PDF, roughly like what you get when you print an email from Outlook. This is Must-Have requirement #8, and we have a minimal implementation of it. It literally looks for <body in the string and inserts the table after the next >, which seemed better than trying to parse the HTML string with Beautiful Soup or similar, but I'm unsure whether this is the best method.
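One possible variation, sketched here rather than taken from the current implementation, is to do the same search on the bytes so the body is never decoded. The helper name `insert_after_body_tag` is hypothetical. The sketch assumes the body uses an ASCII-compatible encoding (UTF-8, ISO-8859-1, Windows-1252, and so on), so it would not find the tag in a UTF-16 body, and like the string version it would be confused by a > inside a body attribute value.

```python
import re

# Matches the opening <body ...> tag in raw bytes, case-insensitively.
BODY_TAG = re.compile(rb"<body\b[^>]*>", re.IGNORECASE)


def insert_after_body_tag(html_bytes: bytes, fragment: str) -> bytes:
    """Splice an HTML fragment in right after the opening <body> tag."""
    # Non-ASCII characters become numeric character references, so the
    # fragment is pure ASCII and reads the same in any ASCII-compatible
    # encoding the body happens to use.
    fragment_bytes = fragment.encode("ascii", errors="xmlcharrefreplace")
    match = BODY_TAG.search(html_bytes)
    if match is None:
        # No <body> tag found: put the fragment at the very start instead.
        return fragment_bytes + html_bytes
    end = match.end()
    return html_bytes[:end] + fragment_bytes + html_bytes[end:]
```

Because the inserted fragment is ASCII only, the original bytes of the message body pass through untouched and no charset has to be known.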

The other use case is adding custom CSS to the HTML email bodies, for when the PDF doesn't look nice and you want to adjust the margins or similar. This is Could-Have #36. We're currently using wkhtmltopdf, which generally seems to make nice PDFs, but I could see this being useful.
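A hedged sketch of one way to apply custom CSS without touching the body: wkhtmltopdf documents a `--user-style-sheet` page option that loads an extra stylesheet with every page. The helper name `wkhtmltopdf_command` and the file paths are hypothetical, and whether this option gives enough control for Could-Have #36 is untested here.

```python
from typing import List, Optional


def wkhtmltopdf_command(
    html_path: str, pdf_path: str, css_path: Optional[str] = None
) -> List[str]:
    """Build the wkhtmltopdf argument list for one HTML body."""
    command = ["wkhtmltopdf"]
    if css_path:
        # The stylesheet is passed as a separate file, so the message
        # body itself is never edited.
        command += ["--user-style-sheet", css_path]
    command += [html_path, pdf_path]
    return command
```

The resulting list could be passed to `subprocess.run(command, check=True)`, assuming wkhtmltopdf is on the PATH.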

Both of these cases seem to require editing message bodies, which I believe would require us to manage encoding. My best plan right now is, unfortunately, to drop the custom CSS requirement, generate a separate PDF cover page with the headers, and merge that with the wkhtmltopdf-generated PDF.
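A rough sketch of the cover page half of that plan, using only the standard library. The function name `build_cover_page` and the table layout are assumptions. The output is ASCII-only HTML, so it can be written in binary mode and handed to wkhtmltopdf on its own. The merge step is not shown, since it would need a PDF library or tool that is not settled here.

```python
from html import escape
from typing import Mapping


def build_cover_page(headers: Mapping[str, str]) -> bytes:
    """Render message headers as a standalone ASCII-only HTML page."""
    rows = "".join(
        f"<tr><th>{escape(name)}</th><td>{escape(value)}</td></tr>"
        for name, value in headers.items()
    )
    page = (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head>'
        f"<body><table>{rows}</table></body></html>"
    )
    # Non-ASCII header text becomes numeric character references, so the
    # bytes are valid in ASCII and therefore also in the declared UTF-8.
    return page.encode("ascii", errors="xmlcharrefreplace")
```

wkhtmltopdf also documents a `cover` object that takes its own input file, which might make a separate merge step unnecessary, though that is untested here.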

Relevant part of mailbag spec?

N/A

Type of component

  • Core
  • Input
  • Attachments
  • Derivatives conversion
  • Reporting/Exporting
  • GUI
  • Distribution

Expected contribution

  • Pull Request
  • Comment with proposed solution

Major challenges or things to keep in mind

Email parts should declare their encoding correctly, but my real-world experience says beware. The MBOX/EML libraries provide the encoding through part.get_charsets(), but I had issues passing that to open(filename, "w", encoding=) when I tried it quickly. I'm not sure how the MSG or PST libraries provide encoding.
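One likely cause, offered as a guess: in the standard library's email package, `get_charsets()` returns a list with one entry per part (entries can be None), while `open()` expects a single codec name. The sketch below uses `get_content_charset()` for a single part and checks the declared name against Python's codec registry before trusting it. The helper names and the UTF-8 default are assumptions.

```python
import codecs
from email import message_from_bytes
from email.message import Message


def declared_codec(part: Message, default: str = "utf-8") -> str:
    """Return a codec name Python recognizes for this part."""
    charset = part.get_content_charset()  # one string or None, not a list
    if charset is None:
        return default
    try:
        return codecs.lookup(charset).name
    except LookupError:
        # The part declares a charset Python does not know about.
        return default


def decode_part(part: Message) -> str:
    """Decode a non-multipart part to str, replacing undecodable bytes."""
    raw = part.get_payload(decode=True)  # bytes, or None for multipart
    if raw is None:
        return ""
    return raw.decode(declared_codec(part), errors="replace")


# Usage with a small hand-built message.
part = message_from_bytes(
    b"Content-Type: text/plain; charset=iso-8859-1\r\n\r\ncaf\xe9\r\n"
)
text = decode_part(part)
```

Even with this check, a part can declare a valid charset that does not match its actual bytes, which is why `errors="replace"` is used instead of letting the decode raise.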

Metadata

Labels

  • Core: This is part of the main process for creating mailbags
  • Derivatives: Writing data to derivative formats, like PDFs or WARCs
  • Input: Parsing input data, such as MBOX, IMAP, PST, EML, etc.

Projects

Status

Done, merged to develop
