mail archives (was: Another ACPI anecdote, plus footnotes)
Kevin D. Clark
clark_k at pannaway.com
Tue Jan 25 09:09:01 EST 2005
Kevin D. Clark writes:
> Bill McGonigle <bill at bfccomputing.com> writes:
>
>> This appears to be the main stumbling block. Mailman has an e-mail
>> address obfuscation feature but it's very sad ('user at domain'). If
>> I were writing a harvester I think I'd throw in the 1-line regex to
>> fix these so it's not really a worthwhile countermeasure. I'm looking
>> at what it would take to have Mailman just remove the e-mail address
>> (display only the 'real name') or only show the user-part if there's
>> no 'real name' provided so the offer is DOA unless I get that fixed.
>
> I wrote a program that did most of this and posted this to the list a
> while ago. Unfortunately, I don't have my personal archive with me
> right now. Anybody? Otherwise I'll try to retreive this tonight.
Apologies if you get this more than once. I sent this out last night
but I haven't seen it get delivered yet.
This program is a filter that strips out email addresses. I hope that
this program can be useful in addressing our current issue.
To: gnhlug-discuss
Subject: Re: I HATE SPAM (was Re: Mouse swapping on a laptop)
From: kclark at .... (Kevin D. Clark)
Date: 04 Aug 2003 08:52:30 -0400
I vote that we:
1: Not require people to strip email addresses from the headers and
body of posts. This is too much work. Humans shouldn't have to
do this.
2: Keep the archive going, keep it world-accessible
3: Obfuscate email addresses in the web-archives. Many, many lists
that I am on do this already -- it makes sense.
WRT (3) above, I propose that we either configure our web-archiver to
do this (if it already does this), or else we just do this ourselves.
To this end, I have written (***) a Perl filter that does this
automagically. This filter could be used to filter our existing
archives, or it could be used from a procmail script inside the
web-archiving box in order to obfuscate all future postings.
*** This is basically Jeff Friedl's well-known email regexp, combined
with around 3 lines of my own code at the bottom.
Regards,
--kevin
--
"Some people, when confronted with a problem, think ``I know,
I'll use regular expressions.'' Now they have two problems."
-- Jamie Zawinski
#!/usr/bin/perl
# Program to build a regex to match an internet email address,
# from Chapter 7 of _Mastering Regular Expressions_ (Friedl / O'Reilly)
# (http://www.ora.com/catalog/regexp/)
#
# Optimized version.
#
# Copyright 1997 O'Reilly & Associates, Inc.
#
# Some things for avoiding backslashitis later on.
$esc = '\\\\'; $Period = '\.';
$space = '\040'; $tab = '\t';
$OpenBR = '\['; $CloseBR = '\]';
$OpenParen = '\('; $CloseParen = '\)';
$NonASCII = '\x80-\xff'; $ctrl = '\000-\037';
$CRlist = '\n\015'; # note: this should really be only \015.
# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist\"]/; # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character
##############################################################################
# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
$ctext = qq< [^$esc$NonASCII$CRlist()] >;
# $Cnested matches one non-nested comment.
# It is unrolled, with normal of $ctext, special of $quoted_pair.
$Cnested = qq<
$OpenParen # (
$ctext* # normal*
(?: $quoted_pair $ctext* )* # (special normal*)*
$CloseParen # )
>;
# $comment allows one level of nested parentheses
# It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
$comment = qq<
$OpenParen # (
$ctext* # normal*
(?: # (
(?: $quoted_pair | $Cnested ) # special
$ctext* # normal*
)* # )*
$CloseParen # )
>;
##############################################################################
# $X is optional whitespace/comments.
$X = qq<
[$space$tab]* # Nab whitespace.
(?: $comment [$space$tab]* )* # If comment found, allow more spaces.
>;
# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
$atom_char+ # some number of atom characters...
(?!$atom_char) # ..not followed by something that could be part of an atom
>;
# Item 11: doublequoted string, unrolled.
$quoted_str = qq<
\" # "
$qtext * # normal
(?: $quoted_pair $qtext * )* # ( special normal* )*
\" # "
>;
# Item 7: word is an atom or quoted string
$word = qq<
(?:
$atom # Atom
| # or
$quoted_str # Quoted string
)
>;
# Item 12: domain-ref is just an atom
$domain_ref = $atom;
# Item 13: domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq<
$OpenBR # [
(?: $dtext | $quoted_pair )* # stuff
$CloseBR # ]
>;
# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq<
(?:
$domain_ref
|
$domain_lit
)
$X # optional trailing comments
>;
# Item 6: domain is a list of subdomains separated by dots.
$domain = qq<
$sub_domain
(?:
$Period $X $sub_domain
)*
>;
# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
$route = qq<
\@ $X $domain
(?: , $X \@ $X $domain )* # additional domains
:
$X # optional trailing comments
>;
# Item 6: local-part is a bunch of $word separated by periods
$local_part = qq<
$word $X
(?:
$Period $X $word $X # additional words
)*
>;
# Item 2: addr-spec is local at domain
$addr_spec = qq<
$local_part \@ $X $domain
>;
# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[
< $X # <
(?: $route )? # optional route
$addr_spec # address spec
> # >
];
# Item 3: phrase........
$phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab
# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char =
qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
# We've worked it so that $word, $comment, and $quoted_str to not consume trailing $X
# because we take care of it manually.
$phrase = qq<
$word # leading word
$phrase_char * # "normal" atoms and/or spaces
(?:
(?: $comment | $quoted_str ) # "special" comment or quoted string
$phrase_char * # more "normal"
)*
>;
## Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq<
$X # optional leading comment
(?:
$addr_spec # address
| # or
$phrase $route_addr # name and address
)
>;
###########################################################################
# Here's a little snippet to test it.
# Addresses given on the commandline are described.
#
while (<>) {
s{($X) # optional leading comment
(?:
$addr_spec # address
| # or
($phrase) $route_addr # name and address
)}{$1 $2 <EMAIL-HIDDEN>}gxo;
print;
}
More information about the gnhlug-discuss
mailing list