The problem the component solves
Handling email message bodies, both HTML and plain text, seems much easier when we keep them as binary objects, since that avoids having to manage encoding issues. The libraries we're using for EML and MBOX can provide either binary or string message bodies, while the PST library provides binary objects (we're currently detecting the encoding to generate strings) and the MSG library seems to provide only string objects. So the model currently has HTML_Bytes, Text_Bytes, HTML_Body, and Text_Body fields (the body fields being strings). The parsers fill in whatever they can, and derivatives try the bytes fields first and fall back to the string bodies. This seems to work reasonably well in testing.
However, our first PDF derivative implementation involves editing the message bodies as strings, for two reasons. First, it would be nice to have a table of common headers (To, From, Subject, etc.) at the top of each PDF, roughly like what you get when you print an email from Outlook. This is Must-Have requirement #8. We have a minimal implementation of this: it literally looks for <body in the string and inserts the table after the next >. That seemed better than trying to parse the HTML string with Beautiful Soup or whatever, but I'm unsure if this is the best method.
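For discussion, here is a minimal sketch of the "find <body, insert after the next >" approach described above. It is not the current implementation: the function name insert_header_table and the headers dict are hypothetical, and it assumes the body is already a decoded string. Like the string search, it will be fooled by a > inside a body attribute value.

```python
import html
import re

# Matches an opening <body ...> tag, case-insensitively, up to the next ">".
BODY_TAG = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def insert_header_table(html_body: str, headers: dict) -> str:
    """Return html_body with a small header table placed right after <body ...>.

    Hypothetical helper: `headers` maps header names (To, From, ...) to values.
    """
    rows = "".join(
        f"<tr><th>{html.escape(name)}</th><td>{html.escape(value)}</td></tr>"
        for name, value in headers.items()
    )
    table = f"<table>{rows}</table>"

    match = BODY_TAG.search(html_body)
    if match is None:
        # No <body> tag found (HTML fragment): prepend the table instead.
        return table + html_body
    return html_body[: match.end()] + table + html_body[match.end():]
```

Header values are escaped with html.escape so that a subject line containing < or & cannot break the markup.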
The other use case is adding custom CSS to the HTML email bodies, for when the PDF doesn't look nice and you want to adjust the margins or whatever. This is Could-Have #36. We're currently using wkhtmltopdf, which seems to make generally nice PDFs, but I could see this being useful.
Both of these cases seem to require editing message bodies, which I believe would require us to manage encoding. My best plan right now is, unfortunately, to drop the custom CSS requirement, generate a separate PDF cover page with the headers, and merge that with the wkhtmltopdf-generated PDF.
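One possibility that might avoid touching the message bodies at all: wkhtmltopdf documents a cover object and a --user-style-sheet page option. The sketch below only builds the command line and assumes the installed wkhtmltopdf version supports both; it has not been tested against our setup. The function name and file paths are hypothetical. The cover HTML would be a file we write ourselves (so we control its encoding), while the body file stays as the original bytes.

```python
def build_wkhtmltopdf_command(body_html, output_pdf, cover_html=None,
                              stylesheet=None, executable="wkhtmltopdf"):
    """Build an argument list for wkhtmltopdf (hypothetical helper).

    cover_html: optional path to a header-table page we generate ourselves.
    stylesheet: optional path to a custom CSS file applied to the body page.
    """
    command = [executable]
    if cover_html:
        # "cover" puts this document in front of the body as its own page.
        command += ["cover", cover_html]
    command += ["page", body_html]
    if stylesheet:
        # Page option, so it follows the page it applies to.
        command += ["--user-style-sheet", stylesheet]
    command.append(output_pdf)
    return command


# The list can then be passed to subprocess.run(command, check=True).
command = build_wkhtmltopdf_command(
    "message.html", "message.pdf", cover_html="headers.html", stylesheet="custom.css"
)
```

If this works, the headers end up on a separate first page rather than in a table above the body, which is the same trade-off as the cover page plan.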
Relevant part of mailbag spec?
N/A
Type of component
Expected contribution
Major challenges or things to keep in mind
Email parts should declare their encoding correctly, but my real-world experience says beware. The MBOX/EML libraries provide the encoding with part.get_charsets(), but I had issues passing that to open(filename, "w", encoding=) when I tried it quickly. I'm not sure how the MSG or PST libraries provide encoding.
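A possible way to handle unreliable charsets, sketched with the standard library email package only. Note that get_charsets() returns a list (one entry per subpart, possibly containing None), which may be part of why passing it to open() failed; get_content_charset() returns a single value for one part. The decode_body helper and its fallback order (declared charset, then UTF-8, then Latin-1) are assumptions for illustration, not something we have decided on.

```python
import codecs
from email import message_from_bytes


def decode_body(payload: bytes, declared_charset=None) -> str:
    """Decode body bytes, distrusting the declared charset (hypothetical helper)."""
    for charset in (declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            codecs.lookup(charset)  # raises LookupError for unknown charset names
            return payload.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    # Latin-1 maps every byte to a character, so this last resort cannot fail.
    return payload.decode("latin-1")


raw = (
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)
part = message_from_bytes(raw)
body = decode_body(part.get_payload(decode=True), part.get_content_charset())

# Once decoded, the string can be written with an encoding we choose ourselves:
# open(filename, "w", encoding="utf-8")
```

Writing the output as UTF-8 regardless of the source charset would mean any <meta charset> declaration inside an HTML body has to be updated or overridden, which is one more thing to keep in mind.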