Add a method to access the raw version of attachments
Our system sometimes receives emails whose attachments have the wrong MIME type. For instance, a PNG attachment might arrive with its Content-Type set to text/plain.
mail-parser helpfully decodes text/plain into valid UTF-8, which means that non-UTF-8 byte sequences get replaced by “�” (U+FFFD), destroying the content. After that, there is no way to recover the original contents of the file if we later decide to reinterpret the attachment as non-text.
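To illustrate why the replacement is irreversible, here is a minimal sketch using only the standard library. It assumes the decoding behaves like `String::from_utf8_lossy`; it is not mail-parser's actual decoder, and `decode_lossy` is a hypothetical helper name.

```rust
use std::borrow::Cow;

/// The fixed 8-byte signature that starts every PNG file.
const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

/// Hypothetical helper: decode bytes as UTF-8 lossily, so that invalid
/// sequences become U+FFFD. This stands in for the library's text decoding.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    let decoded = decode_lossy(&PNG_SIGNATURE);

    // 0x89 cannot start a UTF-8 sequence, so it is replaced by U+FFFD.
    assert!(decoded.starts_with('\u{FFFD}'));

    // Re-encoding the decoded text does not give back the original bytes:
    // the single byte 0x89 has become the three bytes EF BF BD.
    assert_ne!(decoded.as_bytes(), &PNG_SIGNATURE[..]);

    println!("{:02X?} -> {:02X?}", PNG_SIGNATURE, decoded.as_bytes());
}
```

Every invalid byte maps to the same replacement character, so the decoded text carries no information about which byte was originally there.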
I have patched the library to add a raw_body next to the body for attachments, which always returns the bytes regardless of PartType. We are using this as an internal fork for now, but it would be a lot nicer if we could get the feature upstreamed (in this form or some other way) and move back to upstream stalwartlabs/mail-parser.
Can you provide an example of what such an email looks like? We might be able to correct this independently of exposing raw access.
Not the maintainer, but a fix more in keeping with the existing design might be to wrap the parts and expose methods for decoding to &str or for raw access:
    use std::borrow::Cow;

    struct Text<'b> {
        charset: (), // placeholder for the declared charset
        raw: &'b [u8],
    }

    struct Html<'b> {
        charset: (), // placeholder for the declared charset
        raw: &'b [u8],
    }

    impl Text<'_> {
        // undecoded bytes exactly as they appear in the message
        fn raw(&self) -> &[u8] {
            self.raw
        }

        fn decode_to_utf8(&self) -> Cow<'_, str> {
            todo!()
        }
    }

    impl Html<'_> {
        fn raw_html(&self) -> &[u8] {
            self.raw
        }

        // note that charset in <meta> tags can become out-of-sync with real utf8 encoding
        // if consumer is a browser, likely want fn raw_html
        fn decode_to_utf8(&self) -> Cow<'_, str> {
            todo!()
        }
    }
This would be a less intrusive change for the tests. Instead of duplicating both raw and decoded versions, there is one fewer clone in the good case, and it would be a clean fix for #109, where we might want raw HTML access.
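The "one fewer clone in the good case" point can be sketched with the standard library alone. The `TextPart` type below is hypothetical and not part of mail-parser, and it assumes the part is UTF-8 to keep the example short; real charset handling is left out.

```rust
use std::borrow::Cow;

/// Hypothetical wrapper (not mail-parser's API) that keeps the undecoded bytes.
struct TextPart<'b> {
    raw: &'b [u8],
}

impl<'b> TextPart<'b> {
    /// Undecoded bytes, always available regardless of the declared type.
    fn raw(&self) -> &'b [u8] {
        self.raw
    }

    /// Decode on demand. Assumes UTF-8 for simplicity.
    /// Valid input is borrowed (no copy); invalid input allocates a new String.
    fn decode_to_utf8(&self) -> Cow<'b, str> {
        String::from_utf8_lossy(self.raw)
    }
}

fn main() {
    // Good case: valid UTF-8 is borrowed from the raw buffer, no clone.
    let good = TextPart { raw: b"hello" };
    assert!(matches!(good.decode_to_utf8(), Cow::Borrowed("hello")));

    // Bad case: a mislabelled binary part decodes lossily into an owned String,
    // but the original bytes stay reachable through raw().
    let bad = TextPart { raw: &[0x89u8, b'P', b'N', b'G'] };
    assert!(matches!(bad.decode_to_utf8(), Cow::Owned(_)));
    assert_eq!(bad.raw(), &[0x89u8, b'P', b'N', b'G'][..]);
}
```

Because decoding happens lazily from the borrowed bytes, a consumer that later decides the part is really a PNG can still call `raw()` and get the original content.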
Attached a crafted example file where a PNG's header says text/plain: png_as_text_plain.eml