extract string -- TIMTOWTDI

Wed Jan 11 00:41:01 EST 2006

> On 1/10/06, Paul Lussier <p.lussier at comcast.net> wrote:
> > > perl -ne 'split ","; $_ = $_[2]; s/(^")|("$)//g; print if m/univ/;' <
> > > abc.txt > def.txt
> > Egads!

That's a literal start at a Perl "bring the grep, sed and cut-or-awk
into one process", but it's not maximally Perl-ish.  It is also
inefficient: it does the side-effecting "" removal before discarding
non-univs.  It is a literal translation of a `cut | sed | grep` pipe,
not of the `cut|grep|sed` pipe shown earlier.  Of course, if the
requirements allowed it, a `grep|cut|sed` pipe would be the best
shell implementation -- but that relies upon knowing all 'univ'
are in the desired column, which we haven't been granted.

If it weren't for the desire to drop the quotes, Perl couldn't beat
Cut's golf-score (key stroke count) on this one anyway, but we can
try to optimize expressivity while saving two of three process forks.

The Perl Motto is TIMTOWTDI: There Is More Than One Way To Do It.
(We've already seen that this is often true for BASH too.)  This is
usually a good thing, as often some are better for some requirements
than for others.

For generality in real code, I'd like to explicitly ignore the header
line on this sort of CSV file:

$    perl -F, -lane 'next if $.==1 or $F[-1] !~/univ/; print $F[-1]=~m/"(.*)"/;' schools.txt
univ of Vermont
univ of Penn
univ of south Florida
$

As with one prior posting, the '-naF,' args cause Perl to auto-split on
',' into @F on each line.  I normally use '-F, -lane' on
one-liners, since it's memorable.

The '$.==1 or' is not strictly required since the top line of the
sample file had "school" not "university" for the column head, but
if it had "university or school" or "school/univ" on line 1, it would
be required.

$F[-1] is Perl's equivalent to AWK's $NF, referring to the last
column by position from the end instead of by number.  (Counting by
number is notoriously error-prone with 0-based field counting.)
$F[-2] means last-but-one, etc., and you can slice with them as
@F[-6..-2] .
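
A quick way to convince yourself of the negative-index behaviour, using a
made-up eight-field record (the letters are just placeholders, not data
from this thread):

```shell
# Made-up record; only the field positions matter here.
line='a,b,c,d,e,f,g,h'

last=$(printf '%s\n' "$line"   | perl -F, -lane 'print $F[-1]')
penult=$(printf '%s\n' "$line" | perl -F, -lane 'print $F[-2]')
# A slice counted from the end: sixth-from-last through second-from-last.
slice=$(printf '%s\n' "$line"  | perl -F, -lane 'print join ",", @F[-6..-2]')
# awk spells "last field" as $NF.
awk_last=$(printf '%s\n' "$line" | awk -F, '{ print $NF }')

printf '%s %s %s %s\n' "$last" "$penult" "$slice" "$awk_last"
```

None of these needs to know how many fields the line has, which is the
point of counting from the end.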

Rather than remove the "" with s/"//g, I've captured what's between
them and printed that.

We can make more use of -F ... we'll split on all the punctuation.

$     perl -F'/^"|","|"$/' -lane 'next if $.==1 or $F[-1] !~/univ/i; print $F[-1]' schools.txt
univ of Vermont
univ of Penn
univ of south Florida
$

Of course, in some CSV files the ""'s are optional.  In which case we
can do

$    perl -F'/^"|"?\s*,"?|"$/' -lane 'next if $.==1 or $F[-1] !~/univ/i; print $F[-1]' schools.txt
univ of Vermont
univ of Penn
univ of south Florida
$

Alternatively, to print **any** quoted phrase containing univ,
whether in the last column or not, splitting on the commas ...

$    perl -F, -lane 'for (@F){s/"//g; print if /univ/i}' schools.txt
univ of Vermont
univ of Penn
univ of south Florida
$

or, ignoring the commas, just use the quotes to capture what's between
them, but only if there's a univ in between.  I started sneaking in a
/i flag to be case insensitive above, and I'll continue here ...

$    perl -lane 'print for m{ " ( [^"]*? univ [^"]* ) " }xig' schools.txt
univ of Vermont
univ of Penn
univ of south Florida
$

[The non-greedy ? isn't required, but it should help efficiency.]

or

$ perl -lane 'print for grep {/univ/i} m{"([^"]*)"}g' schools.txt
univ of Vermont
univ of Penn
univ of south Florida

There's also a CPAN module or two for processing CSV files that
handle the commas and quotes in CSV files ...
  http://search.cpan.org/search?query=Text%3A%3ACSV&mode=all
Your Linux distro should have Text::CSV_XS as an apt/yum/rpm/...
module option, or grab it from CPAN and build.  (It has an XS =>
.c module, so it is ripping fast, but has to be make'd.)
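
A minimal sketch of the module route, assuming Text::CSV_XS is installed
(the wrapper below just reports if it isn't).  The sample file is made up,
with a comma inside one quoted field to show the case where splitting on
',' goes wrong; treating the last column as the school is likewise my
assumption for the demo.

```shell
# Hypothetical sample; note the embedded comma in the second data row.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"id","state","school"
"1","VT","univ of Vermont"
"2","MA","Boston College, Chestnut Hill"
"3","PA","univ of Penn"
EOF

if perl -MText::CSV_XS -e 1 2>/dev/null; then
    found=$(perl -MText::CSV_XS -e '
        my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
        open my $fh, "<", $ARGV[0] or die "open: $!";
        $csv->getline($fh);                       # skip the header row
        while (my $row = $csv->getline($fh)) {    # $row is an array ref of fields
            print "$row->[-1]\n" if $row->[-1] =~ /univ/i;
        }
        close $fh;
    ' "$sample")
else
    found="Text::CSV_XS not installed"
fi

printf '%s\n' "$found"
rm -f "$sample"
```

The parser does the unquoting, so there is nothing to strip afterwards,
and embedded commas stay inside their field.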

None of this is seriously obfuscatory golfing, but if someone wanted to
say "darn the cost of forking new processes off bash, 'awk/cut|grep|sed'
is easier to read", well, I won't dispute that it's easier for them
to read, and they should do it that way -- unless they need to tune
for performance.

-- 
/"\     Bill Ricker  N1VUX  wdr at world.std.com
\ /     http://world.std.com/~wdr/           
 X      Member of the ASCII Ribbon Campaign Against HTML Mail
/ \