I HATE SPAM (was Re: Mouse swapping on a laptop)
Kevin D. Clark
kclark at CetaceanNetworks.com
Mon Aug 4 08:52:30 EDT 2003
I vote that we:
1: Not require people to strip email addresses from the headers and
body of posts. This is too much work. Humans shouldn't have to
do this.
2: Keep the archive going, keep it world-accessible
3: Obfuscate email addresses in the web-archives. Many, many lists
that I am on do this already -- it makes sense.
WRT (3) above, I propose that we either configure our web-archiver to
do this (if it already does this), or else we just do this ourselves.
To this end, I have written (***) a Perl filter that does this
automagically. This filter could be used to filter our existing
archives, or it could be used from a procmail script inside the
web-archiving box in order to obfuscate all future postings.
*** This is basically Jeff Friedl's well-known email regexp, combined
with around 3 lines of my own code at the bottom.
Regards,
--kevin
--
"Some people, when confronted with a problem, think ``I know,
I'll use regular expressions.'' Now they have two problems."
-- Jamie Zawinski
-------------- next part --------------
#!/usr/bin/perl
# Program to build a regex to match an internet email address,
# from Chapter 7 of _Mastering Regular Expressions_ (Friedl / O'Reilly)
# (http://www.ora.com/catalog/regexp/)
#
# Optimized version.
#
# Copyright 1997 O'Reilly & Associates, Inc.
#
# Some things for avoiding backslashitis later on.
$esc = '\\\\'; $Period = '\.';
$space = '\040'; $tab = '\t';
$OpenBR = '\['; $CloseBR = '\]';
$OpenParen = '\('; $CloseParen = '\)';
$NonASCII = '\x80-\xff'; $ctrl = '\000-\037';
$CRlist = '\n\015'; # note: this should really be only \015.
# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist\"]/; # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character
##############################################################################
# Items 22 and 23, comment.
# Impossible to do properly with a regex, I make do by allowing at most one level of nesting.
$ctext = qq< [^$esc$NonASCII$CRlist()] >;
# $Cnested matches one non-nested comment.
# It is unrolled, with normal of $ctext, special of $quoted_pair.
$Cnested = qq<
$OpenParen # (
$ctext* # normal*
(?: $quoted_pair $ctext* )* # (special normal*)*
$CloseParen # )
>;
# $comment allows one level of nested parentheses
# It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
$comment = qq<
$OpenParen # (
$ctext* # normal*
(?: # (
(?: $quoted_pair | $Cnested ) # special
$ctext* # normal*
)* # )*
$CloseParen # )
>;
##############################################################################
# $X is optional whitespace/comments.
$X = qq<
[$space$tab]* # Nab whitespace.
(?: $comment [$space$tab]* )* # If comment found, allow more spaces.
>;
# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
$atom_char+ # some number of atom characters...
(?!$atom_char) # ..not followed by something that could be part of an atom
>;
# Item 11: doublequoted string, unrolled.
$quoted_str = qq<
\" # "
$qtext * # normal
(?: $quoted_pair $qtext * )* # ( special normal* )*
\" # "
>;
# Item 7: word is an atom or quoted string
$word = qq<
(?:
$atom # Atom
| # or
$quoted_str # Quoted string
)
>;
# Item 12: domain-ref is just an atom
$domain_ref = $atom;
# Item 13: domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq<
$OpenBR # [
(?: $dtext | $quoted_pair )* # stuff
$CloseBR # ]
>;
# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq<
(?:
$domain_ref
|
$domain_lit
)
$X # optional trailing comments
>;
# Item 6: domain is a list of subdomains separated by dots.
$domain = qq<
$sub_domain
(?:
$Period $X $sub_domain
)*
>;
# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
$route = qq<
\@ $X $domain
(?: , $X \@ $X $domain )* # additional domains
:
$X # optional trailing comments
>;
# Item 6: local-part is a bunch of $word separated by periods
$local_part = qq<
$word $X
(?:
$Period $X $word $X # additional words
)*
>;
# Item 2: addr-spec is local at domain
$addr_spec = qq<
$local_part \@ $X $domain
>;
# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[
< $X # <
(?: $route )? # optional route
$addr_spec # address spec
> # >
];
# Item 3: phrase........
$phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab
# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char =
qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
# We've worked it so that $word, $comment, and $quoted_str to not consume trailing $X
# because we take care of it manually.
$phrase = qq<
$word # leading word
$phrase_char * # "normal" atoms and/or spaces
(?:
(?: $comment | $quoted_str ) # "special" comment or quoted string
$phrase_char * # more "normal"
)*
>;
## Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq<
$X # optional leading comment
(?:
$addr_spec # address
| # or
$phrase $route_addr # name and address
)
>;
###########################################################################
# Here's a little snippet to test it.
# Addresses given on the commandline are described.
#
while (<>) {
s{($X) # optional leading comment
(?:
$addr_spec # address
| # or
($phrase) $route_addr # name and address
)}{$1 $2 <EMAIL-HIDDEN>}gxo;
print;
}
More information about the gnhlug-discuss
mailing list