extract string

klussier at comcast.net klussier at comcast.net
Wed Jan 11 08:43:01 EST 2006


 -------------- Original message ----------------------
From: Zhao Peng <greenmt at gmail.com>
> Hi All,
> 

> Kenny, your "grep univ abc.txt | cut -f3 -d, | sed s/\"//g >> dev.txt" 
> works. I mis-read /\ as a similar sign on the top of "6" key on the 
> keyboard(so when I typed that sign, I felt strange that it is much 
> smaller than /\, but didn't realize that they just are not the same 
> thing), instead of forward slash and back slash. I felt really 
> embarrassed with my stupid mistake. //blush

It happens. Believe me, I have done much dumber things in my time :-)

> Kenny, regarding missing column issue, let me try to explain it again. 
> Below is quoted from my original post:

[SNIP]

> You said that "there is an extra column in the 3rd line". I disagree 
> with you from my perspective. As you can see, there are 3 commas in 
> between "jesse" and "Dartmouth college". For these 3 commas, again, if 
> we think the 2nd one as an merely indication that the value for age 
> column is missing, then the 3rd line will be read as ["jesse", 
> MISSING, "Dartmouth college"], not ["jesse",empty,empty, "Dartmouth 
> college"] as you suggested.

This poses an interesting problem. The "," is being used for two purposes: as a delimiter *AND* as a placeholder. Unfortunately, cut and the like will see it as a delimiter and only a delimiter. It's what they do. I think that you may need to use the awk line that I sent, or one of the perl one-liners, to get just the last column. Otherwise, you will end up with empty fields.
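
To see the difference, here is a rough sketch. The sample line is made up (the real data was snipped above), and the awk command is only an illustration of taking the last field, not necessarily the exact line sent earlier in the thread.

```shell
# Hypothetical sample line: three commas between the name and the school.
line='"jesse",,,"Dartmouth college"'

# cut treats every comma as a delimiter, so field 3 comes back empty.
printf '%s\n' "$line" | cut -d, -f3

# In awk, NF is the number of fields on the line, so $NF is the last field
# no matter how many empty fields come before it.
printf '%s\n' "$line" | awk -F, '{ print $NF }'
```

The first command prints an empty line; the second prints the school name with its quotes still attached.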


> For one particular variable(column) called 
> "school", the length of some of its value is quite long(like: Univ of 
> Wisconsin at Madison, Health Sci Ctr), but I don't know the definite 
> length. I need to know it, because if the length I specify is not 
> enough, only partial values will be read. Many of its values contain 
> "univ", so I just thought if I could extract all strings containing 
> "univ" from that variable(column), I will have a better chance to figure 
> out the length of "school". That's why I had this question.

This is going to be another problem. Every "," is going to be seen as a delimiter. If the school name has a "," in it, as there is between Madison and Health above, then taking just the last field will not work either. I think that the easiest thing to do in this case is to change the delimiter to something that is unlikely to be found in any of the columns, like a ":".
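
If the file cannot be re-exported with a different delimiter, something like the following might do the conversion after the fact. It is only a sketch: the function name commas_to_colons and the sample line are made up, and it assumes every field that contains a comma is wrapped in double quotes, that no field contains a double quote of its own, and that ":" does not already appear in the data.

```shell
# Hypothetical sample line with a comma inside the quoted school name.
line='"jesse",,"Univ of Wisconsin at Madison, Health Sci Ctr"'

# Split each line on double quotes. The odd-numbered pieces are the text
# OUTSIDE the quotes, so only the commas in those pieces are delimiters.
# Turn them into ":" and leave the commas inside the quotes alone.
commas_to_colons() {
  awk -F'"' -v OFS='"' '{ for (i = 1; i <= NF; i += 2) gsub(/,/, ":", $i); print }'
}

# With ":" as the delimiter, cut can pick out the whole school name.
printf '%s\n' "$line" | commas_to_colons | cut -d: -f3
```

From there, the same grep and sed steps as before should work on the ":"-delimited output.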

C-Ya,
Kenny


