Converting HTML and MIME to plain text mail

Ben Scott dragonhawk at gmail.com
Tue Oct 7 16:57:42 EDT 2008


On Tue, Oct 7, 2008 at 3:10 PM, Roger H. Goun <roger at bcah.com> wrote:
>> * Decode BASE64 or quoted-printable to 7-bit clean plain text
>
> This should be decode to 8-bit clean plain text.

  Nope.  Not if you're talking strict RFC-821/822 compliance.  The
specs say ASCII.  ASCII is properly a 7-bit character code.  The RFC
reinforces this, going so far as to give acceptable character code
values as 0 through 127.  RFC-822, Section 3.3.

	http://tools.ietf.org/html/rfc822#section-3.3

  After all, if someone really wants that genuine 1982 email
experience, I would hate for them to be disappointed.  ;-)
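
  To make the 7-bit point concrete, here is a minimal Python sketch.
It assumes "7-bit clean" just means no byte above 127.  The function
name is made up for illustration; the sample strings are arbitrary.
It shows that undoing BASE64 or quoted-printable does not by itself
give you 7-bit text.

```python
import base64
import quopri


def is_seven_bit_clean(data: bytes) -> bool:
    """Return True if no byte has its high bit set (value above 127)."""
    return all(b <= 127 for b in data)


# Decoding the transfer encoding alone does not guarantee 7-bit output.
plain = base64.b64decode(b"SGVsbG8=")      # the bytes of "Hello"
accented = quopri.decodestring(b"caf=E9")  # "caf" plus the single byte 0xE9

print(is_seven_bit_clean(plain))     # every byte is below 128
print(is_seven_bit_clean(accented))  # 0xE9 has the high bit set
```

  So a further "replace non-ASCII" step is still needed after decoding.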

> These could be merged into:
> * Replace non-ASCII characters with an ASCII text representation

  As you touched upon, I was thinking certain Unicode characters are
in common use *and* have a convenient ASCII equivalent, and those
could be handled with a lookup table.  However, Unicode characters not
in that table, and just plain unrecognizable non-ASCII data, would
have to be handled in a generic fashion.

  For example, here are some easy translations:

* Unicode Non-Break Space ==> ASCII space
* Unicode Em Dash ==> ASCII sequence --
* Unicode Left Double Quote ==> ASCII Double Quote (non-directional)
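
  A lookup table along these lines might be sketched in Python as
follows.  The table holds only the three translations listed above;
the names ASCII_EQUIVALENTS and apply_lookup are made up for
illustration, and a real table would presumably be much longer.

```python
# Hypothetical table: common Unicode characters with an easy ASCII stand-in.
ASCII_EQUIVALENTS = {
    "\u00a0": " ",   # NO-BREAK SPACE -> ASCII space
    "\u2014": "--",  # EM DASH -> two hyphens
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK -> non-directional quote
}


def apply_lookup(text: str) -> str:
    """Replace each character found in the table; leave the rest untouched."""
    return "".join(ASCII_EQUIVALENTS.get(ch, ch) for ch in text)


print(apply_lookup("a\u00a0b\u2014c \u201cd"))
```

  Characters not in the table pass through unchanged here, so they
still need some generic handling afterwards.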

  But something like Unicode U+2654, "White Chess King", doesn't have
an easy ASCII representation, and it doesn't see common use, so it
likely would not be worth worrying about.  But a proper solution
should indicate that something was there, and was removed.

  One simple approach is to replace any such characters with an ASCII
representation of the numeric character value.  For example, take
U+25A1, Unicode "White Square".  That could be translated as follows:

     X□X     ==>     X[0x25][0xA1]X

  Not pretty, but easy, and allows manual translation back to the
proper code point, if needed.
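
  Here is one way that generic fallback could look in Python.  This
is only a sketch: it assumes each bracketed value is one byte of the
code point number, most significant byte first, which is my reading
of the example above.  The function name is hypothetical.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with [0xNN] groups, one per byte
    of its code point value, most significant byte first."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # already ASCII, keep as-is
            continue
        # Smallest number of bytes that holds the code point value.
        length = (code.bit_length() + 7) // 8
        for byte in code.to_bytes(length, "big"):
            out.append(f"[0x{byte:02X}]")
    return "".join(out)


print(escape_non_ascii("X\u25a1X"))  # U+25A1 is the White Square
```

  One catch with this format: when several escaped characters sit next
to each other, it is not obvious where one code point ends and the
next begins, so translating back by hand can take some guesswork.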

> You probably want the mail to remain a valid MIME message, just in
> case the user ever upgrades her MUA.

  Leaving the MIME headers when the explicit goal is to remove all MIME
functionality seems like a waste to me.  They can never be used for
anything useful.  I would think it better to make it clear that this
isn't MIME.  But hey, I'm not volunteering to write the code.  ;-)
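
  For what it's worth, dropping the MIME headers could be sketched
with Python's standard email module roughly like this.  It assumes a
single-part message; the function name and the list of headers to
drop are my own choices for illustration, not a complete solution.

```python
import email

# Headers that mark a message as MIME (an illustrative list).
MIME_HEADERS = ("MIME-Version", "Content-Type", "Content-Transfer-Encoding")


def strip_mime_headers(raw: str) -> str:
    """Decode the body of a single-part message and drop its MIME headers."""
    msg = email.message_from_string(raw)
    if msg.is_multipart():
        raise ValueError("this sketch only handles single-part messages")
    body = msg.get_payload(decode=True)  # bytes, transfer encoding undone
    charset = msg.get_content_charset() or "ascii"
    text = body.decode(charset, errors="replace")
    for name in MIME_HEADERS:
        del msg[name]  # deleting a header that is absent is not an error
    msg.set_payload(text)
    return msg.as_string()


raw = (
    "Subject: hi\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "a=3Db\n"
)
print(strip_mime_headers(raw))
```

  The decoded text would still need the non-ASCII replacement step
before it could be called plain 7-bit mail.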

-- Ben


More information about the gnhlug-discuss mailing list