searching/grepping for words "near" each other
VirginSnow at vfemail.net
Thu Apr 30 23:24:55 EDT 2009
> From: kevin_d_clark at comcast.net (Kevin D. Clark)
> Date: 30 Apr 2009 12:02:19 -0400
> > I want to search a text file for a few (alphabetic) words which must
> > be "near" each other, but not necessarily on the same line. "Near"
> > could be defined however you like... within a certain number of words
> > from each other, a certain number of characters from each other, or
> > some similar constraint.
> perl -0777 -lne 'print "$ARGV:$&" if (/weapons of.{1,200}mass destruction/s)' file1 file2 file3 ... fileN
That will work, but only if the search terms appear in the document in
the same order as they appear in the query. (This appears to be the case
with the hipdig solution as well. Correct me if I'm wrong, of
course.) The search terms I'm looking for could appear in the target
document in any order. Perhaps I could have made that clearer. Okay,
I *could* have made that clearer.
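To make the ordering limitation concrete, here is a minimal sketch in Python rather than Perl (the regexp syntax used here is the same in both). It adapts the quoted one-liner's pattern to the three terms from this thread; the two sample strings are made up for illustration.

```python
import re

# Ordered pattern in the style of the quoted one-liner, adapted to the
# terms from this thread. re.DOTALL plays the role of Perl's /s, so "."
# also matches newlines.
ORDERED = re.compile(r"weapons.{1,200}mass.{1,200}distraction", re.DOTALL)

# Hypothetical sample strings, not taken from any real document.
in_order = "laser weapons. Taco shells have low mass, with minimal distraction."
reordered = "a distraction for people who buy laser weapons of low mass."

print(bool(ORDERED.search(in_order)))   # terms appear in query order
print(bool(ORDERED.search(reordered)))  # same terms, different order
```

The second string contains all three terms close together, yet the ordered pattern rejects it because "distraction" comes first.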
SEARCH:  weapons mass distraction
TEXT:    UFOs are a distraction for people
         who enjoy buying energy for
         low-power laser weapons. Taco
         shells are ETs' preferred food.
         Because of their low mass, they
         can be carried into orbit
         with minimal distraction.
MATCHES: distraction (for...laser) weapons (Taco...low) mass
         weapons (Taco...low) mass (they...minimal) distraction
You could do this kind of matching with Perl regexps, but they'd have
to be nested, with one level of nesting for each term... which would
quickly become both ugly and inefficient.
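One way to see how fast a pure-regexp approach grows is to flatten it into one alternation branch per ordering of the terms. This is only a sketch of that idea, not necessarily the nested form meant above; `any_order_pattern` and its `gap` parameter are hypothetical names, and Python is used in place of Perl.

```python
import itertools
import re

def any_order_pattern(terms, gap=200):
    """Build a regexp with one alternation branch per ordering of the terms."""
    joiner = ".{0,%d}" % gap  # at most `gap` characters between adjacent terms
    branches = (
        joiner.join(re.escape(t) for t in ordering)
        for ordering in itertools.permutations(terms)
    )
    return "|".join("(?:%s)" % b for b in branches)

pattern = any_order_pattern(["weapons", "mass", "distraction"])
print(pattern.count("|") + 1)  # number of branches, one per ordering
```

Three terms give 3! = 6 branches, four give 24, and so on, so the pattern gets large well before the term list does.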
I was thinking along the lines of sorting the search terms, along with
blocks of the text, so that the terms would be rearranged into a
canonical order... but it's not clear how to choose blocks of text for
an efficient search.
I could use a Perl regexp like
(.{0,200}(weapons|mass|distraction).{0,200}){3}
but that doesn't require that each alternative appear at least once.
Things like:
distraction FOO mass BAR distraction
would match that regexp, but would return false positive results for
my search.
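The false positive is easy to reproduce, and one possible way around it is to drop the single regexp and scan the term occurrences with a window instead. The sketch below is in Python rather than Perl; `near_matches` and `max_span` are hypothetical names, and "near" is assumed to mean "within 200 characters, first term start to last term end".

```python
import re

LOOSE = re.compile(r"(.{0,200}(weapons|mass|distraction).{0,200}){3}", re.DOTALL)

def near_matches(text, terms, max_span=200):
    """Return (start, end) spans of at most max_span characters that
    contain every term at least once, in any order."""
    word_re = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, terms)),
                         re.IGNORECASE)
    hits = [(m.start(), m.end(), m.group(0).lower())
            for m in word_re.finditer(text)]
    wanted = {t.lower() for t in terms}
    spans = []
    for i, (start, _, _) in enumerate(hits):
        seen = set()
        for _, end, word in hits[i:]:
            if end - start > max_span:
                break  # window too wide; try the next starting hit
            seen.add(word)
            if seen == wanted:
                spans.append((start, end))
                break
    return spans

TERMS = ["weapons", "mass", "distraction"]
sample = "distraction FOO mass BAR distraction"
print(bool(LOOSE.search(sample)))   # the loose regexp accepts the sample
print(near_matches(sample, TERMS))  # the window scan finds no span
```

The scan records each occurrence of any term, then for every occurrence walks forward until all terms have been seen or the window exceeds `max_span`. Repeating a term never substitutes for a missing one, which is the property the `{3}` regexp lacks.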