extract string

Thomas Charron twaffle at gmail.com
Wed Jan 11 10:21:01 EST 2006


On 1/11/06, Zhao Peng <greenmt at gmail.com> wrote:

> Hi All,
> First I really cannot be more grateful for the answers to my question
> from all of you, I appreciate your help and time. I'm especially touched
> by the outpouring of response on this list, which I have never
> experienced before anywhere else.


  I hope my little comment didn't seem mean.  I was more poking fun at the
fact that if someone posted a similar post, and called themselves a Systems
Administrator on a Windows network, comments similar to mine would have come
forth.  ;-)


> Secondly I'm sorry for the big stir-up as to "homework problems" which
> flooded the list, since I'm the origin of it.


  Nah, it wasn't a flood.  Trust me, once you see a flood, you'll know it.
Usually, it's because someone says something political in nature.



> Kenny, regarding missing column issue, let me try to explain it again.
> Below is quoted from my original post:
> ============================================
> Also, if one column is missing, and "," is used to indicate that missing
> column, like the following (2nd column of 3rd line is missing):
> "name","age","school"
> "jerry" ,"21","univ of Vermont"
> "jesse",,,"Dartmouth college"
> "jack","18","univ of Penn"
> "john","20","univ of south Florida"
> ===========================================
> You said that "there is an extra column in the 3rd line". I disagree
> with you from my perspective. As you can see, there are 3 commas in
> between "jesse" and "Dartmouth college". For these 3 commas, again, if
> we think of the 2nd one as merely an indication that the value for age
> column is missing, then the 3rd line will be read as ["jesse",
> MISSING, "Dartmouth college"], not ["jesse",empty,empty, "Dartmouth
> college"] as you suggested.


  This is unusual, as typically, a comma delimited set of values would
simply have nothing between the commas, or a set of quotes with no data.

  Typically the line would look like this:

"jesse",,"Dartmouth college"

  Or

 "jesse","","Dartmouth college"

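  To make the column count concrete, here is a minimal Perl sketch that
counts the fields on each form of the line.  The count_fields helper is
made up for illustration, and the plain split on commas assumes no field
contains a comma of its own.

```perl
use strict;
use warnings;

# Hypothetical helper: count the fields on a line by splitting on commas.
# The -1 limit tells split to keep trailing empty fields as well.
# Assumes no field contains a comma inside its quotes.
sub count_fields {
    my ($line) = @_;
    my @fields = split /,/, $line, -1;
    return scalar @fields;
}

# Two commas between the values: three fields, the middle one empty.
print count_fields('"jesse",,"Dartmouth college"'), "\n";    # prints 3

# Three commas between the values: four fields, two of them empty.
print count_fields('"jesse",,,"Dartmouth college"'), "\n";   # prints 4
```

  So a parser that simply counts commas will see four columns on the
three-comma line, whatever the middle comma was meant to signal.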


> Paul, as to your "simplest by what measurement" question. I was thinking
> of both "easiest to remember" and "easiest to understand" when I was
> posting my question. Now I desire for "most efficient" approach. I know
> that will be my homework.


  If this is something that you will be doing repeatedly for different file
types, I'd highly suggest getting familiar with regular expressions.  You've
seen a small snippet in Kenny's example 'sed s/\"//g'.  The 's/\"//g' says
to globally replace all quotes with nothing: s means substitute, /1/2/ says
'replace everything matching 1 with 2' (in this case, a quote with nothing),
and g means globally, i.e. do it for every match on the line rather than
just the first.  Regular expressions are a powerful way to parse text files
based on a given pattern, to get at the data you want.
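
  As a rough illustration of what the g flag changes, here is the same
substitution in Perl, run with and without it on one of the sample lines.
The variable names are only for the example.

```perl
use strict;
use warnings;

my $line = '"jerry","21","univ of Vermont"';

# Without g, only the first quote on the line is removed.
(my $first_only = $line) =~ s/"//;

# With g, every quote on the line is removed.
(my $all = $line) =~ s/"//g;

print "$first_only\n";   # jerry","21","univ of Vermont"
print "$all\n";          # jerry,21,univ of Vermont
```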



> Part of my primary job responsibilities is to convert raw data into SAS
> data sets. My "extract string" question comes from processing a raw data
> file in .txt format, which doesn't have any documentation, except the
> variable list. By looking at the raw data, I know that each variable is
> separated by a comma. For one particular variable(column) called
> "school", the length of some of its value is quite long(like: Univ of
> Wisconsin at Madison, Health Sci Ctr), but I don't know the definite
> length. I need to know it, because if the length I specify is not
> enough, only partial values will be read. Many of its values contain
> "univ", so I just thought if I could extract all strings containing
> "univ" from that variable(column), I will have a better chance to figure
> out the length of "school". That's why I had this question.


  Haven't even run it, but something Perl-like:

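use strict;
use warnings;

my $maxlen = 0;
while (my $line = <>) {
  chomp $line;
  # Capture three comma-separated fields; skip lines that do not match.
  next unless $line =~ /^(.*),(.*),(.*)$/;
  # Store the length of the third field, not the field itself.
  if (length($3) > $maxlen) {
    $maxlen = length($3);
  }
}
print "Longest string in third column is $maxlen characters\n";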

  This would read from STDIN until it couldn't read any more.  For each
line, it would split based on the commas (if the third column contains
commas, this won't work, because the greedy $1 and $2 would gobble some of
the data, FYI), and check the length of the third field against the max
length.  If it's longer, record it.  At the end, print it out.

  This Regular expression isn't great, but it's the 20 second typing
version.
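
  For the case where the third column does contain commas, a slightly more
careful sketch might look like the following.  It is only a sketch under
assumptions: the split_quoted_line and longest_in_column helpers are made up
here, quotes are assumed never to be escaped or doubled inside a field,
lines are assumed to be already chomped, and the "jill" row is invented
sample data.  It also strips the surrounding quotes, so the length counts
only the value itself.

```perl
use strict;
use warnings;

# Hypothetical helper: split one chomped line into fields, treating a
# double-quoted field as a unit even if it contains commas.
# Assumes quotes are never escaped or doubled inside a field.
sub split_quoted_line {
    my ($line) = @_;
    my @fields;
    # Each field is either "quoted text" or a run of non-comma characters,
    # followed by a comma or the end of the line.
    while ($line =~ /\G\s*(?:"([^"]*)"|([^,]*?))\s*(,|\z)/g) {
        push @fields, defined $1 ? $1 : $2;
        last if $3 eq '';    # no trailing comma: end of line reached
    }
    return @fields;
}

# Hypothetical helper: length of the longest value in a 0-based column.
sub longest_in_column {
    my ($column, @lines) = @_;
    my $maxlen = 0;
    for my $line (@lines) {
        my @fields = split_quoted_line($line);
        next unless defined $fields[$column];   # short line, skip it
        my $len = length $fields[$column];
        $maxlen = $len if $len > $maxlen;
    }
    return $maxlen;
}

# Made-up sample rows; the last one has a comma inside the quoted value.
my @sample = (
    '"jerry" ,"21","univ of Vermont"',
    '"jesse",,"Dartmouth college"',
    '"jill","22","Univ of Wisconsin at Madison, Health Sci Ctr"',
);
print "Longest value in third column: ", longest_in_column(2, @sample), "\n";
```

  For real-world files a dedicated CSV parsing module is probably the safer
choice, since it would also cover escaped quotes and multi-line fields.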

  Thomas