extract string

Zhao Peng greenmt at gmail.com
Wed Jan 11 01:35:01 EST 2006


Hi All,

First, I really cannot be more grateful to all of you for the answers to 
my question; I appreciate your help and time. I'm especially touched by 
the outpouring of responses on this list, which I have never experienced 
anywhere else.

Secondly, I'm sorry for the big stir-up over "homework problems" that 
flooded the list, since I'm the origin of it.

Kenny, your "grep univ abc.txt | cut -f3 -d, | sed s/\"//g >> dev.txt" 
works. I misread /\ as the similar-looking sign on top of the "6" key on 
the keyboard (so when I typed that sign, it felt strange that it is much 
smaller than /\, but I didn't realize that they are just not the same 
thing), instead of a forward slash followed by a backslash. I feel really 
embarrassed about my stupid mistake. //blush
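
For anyone else who trips over the same characters, here is a minimal 
sketch of that pipeline run on the sample lines from the original post. 
The temporary directory and the "result" variable are assumptions added 
for illustration, and ">" is used instead of ">>" so that the sketch can 
be re-run without appending duplicates.

```shell
# Work in a scratch directory so no real files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample lines copied from the original post.
cat > abc.txt <<'EOF'
"name","age","school"
"jerry" ,"21","univ of Vermont"
 "jesse",,,"Dartmouth college"
"jack","18","univ of Penn"
"john","20","univ of south Florida"
EOF

# grep keeps only lines containing "univ".
# cut -f3 -d, takes the 3rd comma-separated field.
# sed s/\"//g deletes every double quote: the \" is a backslash followed
# by a double quote (not the caret on the "6" key), and the shell passes
# s/"//g on to sed.
grep univ abc.txt | cut -f3 -d, | sed s/\"//g > dev.txt

result=$(cat dev.txt)
printf '%s\n' "$result"
```

The header line and the "jesse" line do not contain "univ", so grep 
drops them before cut ever sees them.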

Kenny, regarding the missing-column issue, let me try to explain it again. 
Below is quoted from my original post:

============================================
Also, if one column is missing, and "," is used to indicate that missing 
column, like the following (2nd column of 3rd line is missing):
"name","age","school"
"jerry" ,"21","univ of Vermont"
 "jesse",,,"Dartmouth college"
"jack","18","univ of Penn"
"john","20","univ of south Florida"
===========================================

You said that "there is an extra column in the 3rd line". From my 
perspective, I disagree. As you can see, there are 3 commas between 
"jesse" and "Dartmouth college". If we treat the 2nd of these 3 commas 
as merely an indication that the value for the age column is missing, 
then the 3rd line will be read as ["jesse", MISSING, "Dartmouth 
college"], not ["jesse", empty, empty, "Dartmouth college"] as you 
suggested.
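
The two readings can be made concrete with a small sketch. It assumes 
that a plain comma splitter (awk -F, or cut -d,) is what does the 
splitting, and that rewriting ",,," to ",," is an acceptable way to 
express the "middle comma is only a marker" reading; both the variable 
names and that rewrite are illustrative assumptions, not something from 
the earlier replies.

```shell
line='"jesse",,,"Dartmouth college"'

# A plain comma splitter treats every comma as a separator, so the three
# commas produce two empty fields between the name and the school.
fields_raw=$(printf '%s\n' "$line" | awk -F, '{ print NF }')

# Under the "marker" reading, the middle comma of ",,," is not a
# separator. Rewriting ",,," to ",," leaves one empty (missing) field.
fixed=$(printf '%s\n' "$line" | sed 's/,,,/,,/g')
fields_fixed=$(printf '%s\n' "$fixed" | awk -F, '{ print NF }')

# With the rewrite applied, the school is back in field 3.
school=$(printf '%s\n' "$fixed" | cut -f3 -d, | sed 's/"//g')

echo "raw fields: $fields_raw, rewritten fields: $fields_fixed, school: $school"
```

So the tools count four fields in the line as written, and the 
three-field reading only holds after the marker comma is removed.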

Paul, as to your "simplest by what measurement" question: I was thinking 
of both "easiest to remember" and "easiest to understand" when I posted 
my question. Now I would also like the "most efficient" approach. I know 
that will be my homework.

BTW,
A bit about me: I'm a junior SAS programmer at Dartmouth Medical School. 
(FYI: the core strength of SAS lies in statistical analysis, I think, so 
you could say it's statistical software; see www.sas.com.) We run SAS on 
a RedHat server, but I knew basically nothing about Linux before I 
started working in this position (July 2005). Fortunately, SAS 
programming doesn't require much Linux knowledge. However, as you can 
imagine, I need to know at least some basic Linux commands since I work 
on a Linux platform.

Part of my primary job responsibilities is to convert raw data into SAS 
data sets. My "extract string" question comes from processing a raw data 
file in .txt format, which has no documentation except the variable 
list. By looking at the raw data, I know that the variables are 
separated by commas. For one particular variable (column) called 
"school", some of the values are quite long (like: Univ of Wisconsin at 
Madison, Health Sci Ctr), but I don't know the maximum length. I need to 
know it, because if the length I specify is not enough, only partial 
values will be read. Many of its values contain "univ", so I thought that 
if I could extract all strings containing "univ" from that variable 
(column), I would have a better chance of figuring out the length of 
"school". That's why I had this question.
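
A more direct way to get at that length is to measure the column itself 
instead of eyeballing the extracted strings. The following is only a 
sketch under stated assumptions: "school" is the last column, the first 
two columns never contain a comma, and the header is on line 1. The 
"jane" row and the scratch directory are made up for illustration; only 
the long school value comes from the real data. Because a value such as 
"Univ of Wisconsin at Madison, Health Sci Ctr" contains a comma, the 
sketch strips the first two fields instead of using cut -f3, which would 
stop at that embedded comma.

```shell
# Work in a scratch directory so no real files are touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Illustrative data; the "jane" row is hypothetical.
cat > abc.txt <<'EOF'
"name","age","school"
"jerry","21","univ of Vermont"
"jane","22","Univ of Wisconsin at Madison, Health Sci Ctr"
"jack","18","univ of Penn"
EOF

# tail skips the header line.
# The first sed expression removes everything up to and including the
# second comma, leaving the whole school value even if it has commas.
# The second sed expression removes the double quotes.
# awk then remembers the longest line it has seen.
maxlen=$(tail -n +2 abc.txt \
  | sed -e 's/^[^,]*,[^,]*,//' -e 's/"//g' \
  | awk '{ if (length($0) > max) max = length($0) } END { print max }')

echo "longest school value: $maxlen characters"
```

Lines with a missing age written as ",,," would need to be handled 
before this step, since the leftover comma would be counted as part of 
the school value.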

Thank you all again!

Zhao



More information about the gnhlug-discuss mailing list