searching/grepping for words "near" each other

VirginSnow at vfemail.net VirginSnow at vfemail.net
Thu Apr 30 23:24:55 EDT 2009


> From: kevin_d_clark at comcast.net (Kevin D. Clark)
> Date: 30 Apr 2009 12:02:19 -0400

> > I want to search a text file for a few (alphabetic) words which must
> > be "near" each other, but not necessarily on the same line.  "Near"
> > could be defined however you like... within a certain number of words
> > from each other, a certain number of characters from each other, or
> > some similar constraint.

> perl -0777 -lne 'print "$ARGV:$&" if (/weapons of.{1,200}mass destruction/s)' file1 file2 file3 ... fileN

That will work, but only if the search terms appear in the document in
the same order as they appear in the query.  (This appears to be the case
with the hipdig solution as well.  Correct me if I'm wrong, of
course.)  The search terms I'm looking for could appear in the target
document in any order.  Perhaps I could have made that clearer.  Okay,
I *could* have made that clearer.

SEARCH:  weapons mass distraction

TEXT:    UFOs are a distraction for people
         who enjoy buying energy for
         low-power laser weapons.  Taco
         shells are ETs' preferred food.
         Because of their low mass, they
         can be carried into orbit with
         minimal distraction.

MATCHES: distraction (for...laser) weapons (Taco...low) mass
         weapons (Taco...low) mass (they...minimal) distraction
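
To make the example above concrete, here is a rough sketch (in Python
rather than Perl, and not taken from any solution posted in this
thread) of one way to get those two matches: slide a window over the
words of the text and report each smallest window that contains every
search term, in any order.  The function name near_matches and the
max_words limit are my own inventions for illustration, and "near" is
assumed to mean "within max_words words".

```python
import re
from collections import Counter

TEXT = """UFOs are a distraction for people who enjoy buying energy for
low-power laser weapons.  Taco shells are ETs' preferred food.  Because
of their low mass, they can be carried into orbit with minimal
distraction."""

def near_matches(text, terms, max_words=25):
    """Return the smallest word windows that contain every term, in any order.

    max_words is a hypothetical definition of "near": the whole window,
    including the terms themselves, may be at most that many words long.
    """
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    wanted = {t.lower() for t in terms}
    counts = Counter()   # occurrences of each term inside words[left:right+1]
    left = 0
    found = []
    for right, word in enumerate(lowered):
        if word not in wanted:
            continue
        counts[word] += 1
        if any(counts[t] == 0 for t in wanted):
            continue     # some term has not been seen yet
        # Drop words on the left that the window does not need.
        while lowered[left] not in wanted or counts[lowered[left]] > 1:
            if lowered[left] in wanted:
                counts[lowered[left]] -= 1
            left += 1
        # Report only windows in which the newest term occurs exactly once,
        # so that no smaller window is hiding inside this one.
        if counts[word] == 1 and right - left + 1 <= max_words:
            found.append(" ".join(words[left:right + 1]))
    return found

for match in near_matches(TEXT, ["weapons", "mass", "distraction"]):
    print(match)
```

This makes a single pass over the words, so the cost grows with the
length of the text rather than with the number of possible term
orderings.  Measuring distance in characters instead of words would
need character offsets kept alongside each word.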

You could do this kind of matching with Perl regexps, but they'd have
to be nested, with one level of nesting for each term... which would
quickly become both ugly and inefficient.
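
One possible reading of the "any order" regexp approach, sketched in
Python rather than Perl: generate one alternative per ordering of the
terms.  This is only an illustration of why the pattern grows so
quickly (n terms give n! alternatives), not necessarily the nesting
described above.  The helper name any_order_pattern and the max_gap
parameter are made up for the example, and like the one-liner quoted
earlier it does not check word boundaries.

```python
import itertools
import re

def any_order_pattern(terms, max_gap=200):
    """Build a regexp matching all terms in any order, each pair of
    neighbours separated by 1..max_gap characters (newlines included)."""
    gap = ".{1,%d}" % max_gap
    alternatives = [
        gap.join(re.escape(t) for t in ordering)
        for ordering in itertools.permutations(terms)
    ]
    # One non-capturing group per ordering, joined with "|".
    return re.compile("|".join("(?:%s)" % a for a in alternatives),
                      re.S | re.I)

pattern = any_order_pattern(["weapons", "mass", "distraction"])
print(len(pattern.pattern))   # pattern length grows factorially with terms
```

With three terms that is already six alternatives, and a fourth term
brings it to twenty-four.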

I was thinking along the lines of sorting the search terms, along with
blocks of the text, so that the terms would be rearranged into a
canonical order... but it's not clear how to choose blocks of text for
an efficient search.

I could use a Perl regexp like

  (.{0,200}(weapons|mass|distraction).{0,200}){3}

but that doesn't require that each alternative appear at least once.
Things like:

  distraction FOO mass BAR distraction

would match that regexp, but would return false positive results for
my search.
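
A small sketch of that false positive, in Python rather than Perl
(the regexp itself is unchanged).  The second function shows one
possible workaround: keep the loose regexp as a cheap first pass, then
check that every term really occurs in the matched span.  The names
loose_match, strict_match and contains_all_terms are hypothetical, and
the check is only a rough filter, since it looks at the first match
only and that match can span far more than 200 characters.

```python
import re

TERMS = ("weapons", "mass", "distraction")

# The regexp from above: three term hits, but not three *different* terms.
LOOSE = re.compile(r"(.{0,200}(weapons|mass|distraction).{0,200}){3}", re.S)

def loose_match(text):
    """True if the loose regexp matches anywhere in text."""
    return LOOSE.search(text) is not None

def contains_all_terms(candidate, terms=TERMS):
    """True if every term occurs as a whole word in candidate."""
    words = set(re.findall(r"[a-z]+", candidate.lower()))
    return all(t in words for t in terms)

def strict_match(text):
    """Loose regexp first, then verify that no term is missing."""
    m = LOOSE.search(text)
    return m is not None and contains_all_terms(m.group(0))

sample = "distraction FOO mass BAR distraction"
print(loose_match(sample))    # the regexp alone accepts this text
print(strict_match(sample))   # the extra check rejects it: no "weapons"
```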

