help with grabbing data out of file

blastman · 6 Oct 2011 at 14:33

hey guys,

I have an XML file which has loads of data in it. I need to read through the file and grab all the strings that fall between two known nodes.

Example structure:

Code:

<Message Date="22/02/2011" Time="14:31:15" DateTime="2011-02-22T14:31:15.817Z" SessionID="1">
	<From>
		<User FriendlyName="[email protected]"/>
	</From>
	<To>
		<User FriendlyName="[email protected] (E-mail Address Not Verified)"/>
	</To>
	<Text Style="font-family:Segoe UI; color:#000000; ">
		message text
	</Text>
</Message>

<Message Date="22/02/2011" Time="14:47:02" DateTime="2011-02-22T14:47:02.019Z" SessionID="2">
	<From>
		<User FriendlyName="[email protected]"/>
	</From>
	<To>
		<User FriendlyName="[email protected] (E-mail Address Not Verified)"/>
	</To>
	<Text Style="font-family:Segoe UI; color:#000000; ">
		message text.....
	</Text>
</Message>

if i can get everything between '<Message Date' and '</Message>' into a variable so i end up with a load of variables, I'd be really happy

any ideas?

I have a feeling awk is what i need to use but i have no idea how!

thanks

SMN · 6 Oct 2011 at 15:28

Perl?

http://stackoverflow.com/questions/...act-lines-between-two-line-delimiters-in-perl

blastman · 6 Oct 2011 at 15:42

i saw that but i have NO perl experience, so was unable to get any of those solutions to work. mainly because i don't know how to use the suggested code on the back of standard in on my file.

any other suggestions?

JHeaton · 6 Oct 2011 at 17:27

blastman said:
i saw that but i have NO perl experience, so was unable to get any of those solutions to work. mainly because i don't know how to use the suggested code on the back of standard in on my file.

any other suggestions?

Do you have any experience with any languages? Some of them have some decent XML parsing libraries/modules. You probably could use awk for it, but I wouldn't have the first clue as to how you'd manage it.

JonJ678 · 7 Oct 2011 at 00:14

Which strings do you want?

edit: Paused to think for a bit, and then rewrote the following without sed.

This prints everything between <Message Date and </Message>, exclusive.

Code:

awk ' BEGIN {INMESSAGE=0}  ;  
/<\/Message>/ {INMESSAGE=0} ; 
(INMESSAGE==1)  {print} ; 
/<Message Date/ {INMESSAGE=1}'  ocuk_struct.txt

Taking a bit of a guess at what you mean by string, this will print the email addresses. The -F bit is -F ' " ' without spaces.

Code:

awk -F'"' 'BEGIN {INMESSAGE=0}  ;  
/<\/Message>/ {INMESSAGE=0} ; 
(INMESSAGE==1)  && /FriendlyName/ {print $2} ; 
/<Message Date/ {INMESSAGE=1}'  ocuk_struct.txt

From your example text, the latter prints
[email protected]
[email protected] (E-mail Address Not Verified)
[email protected]
[email protected] (E-mail Address Not Verified)

Either of the above will ignore anything outside the two strings you want to match on, and print the output to the screen. Finally, if you match on <Text and </Text>, you'll get the message text. The following assumes anything between those tags is a valid message; if they can occur outside of a <Message ... </Message> pair, the code gets a little longer but no more elaborate.

Code:

awk ' BEGIN {INMESSAGE=0}  ;  
/<\/Text>/ {INMESSAGE=0} ; 
(INMESSAGE==1)  {print} ; 
/<Text/ {INMESSAGE=1}'  ocuk_struct.txt

Outputs
message text
message text.....
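
For the case where <Text> can also turn up outside a <Message ... </Message> pair, a minimal sketch of the slightly longer version is below. It tracks two flags instead of one. The function name message_text and the sample input are made up for illustration, not taken from the real log files.

```shell
# Hypothetical helper: print text lines only when <Text> sits inside a <Message> block.
message_text() {
  awk '
    /<Message Date/ { in_message = 1 }                # entering a message
    /<\/Message>/   { in_message = 0; in_text = 0 }   # leaving it resets both flags
    /<\/Text>/      { in_text = 0 }                   # closing tag is not printed
    in_message && in_text { print }                   # body lines only
    /<Text/         { if (in_message) in_text = 1 }   # set last, so the opening tag is skipped
  ' "$@"
}

# Made-up sample: the first <Text> block is outside any message and is ignored.
message_text <<'EOF'
<Text>
stray text outside any message
</Text>
<Message Date="22/02/2011" Time="14:31:15">
<Text Style="font-family:Segoe UI; color:#000000; ">
message text
</Text>
</Message>
EOF
```

As with the shorter versions, the order of the rules matters: the flags are cleared before the print rule and set after it, so neither tag line is printed.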

Cheers

edit: awk does cryptic pretty well too, the following prints the text you say you're after.

Code:

awk '/<Message/,/<\/Message>/'  ocuk_struct.txt

Or you could just butcher everything you don't like the look of with grep I suppose. It's a bit hard to confidently provide a solution when I'm not really clear what you're trying to achieve.
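
If the aim really is one variable per message, one possible approach (a sketch, not the only way) is to have awk join each block onto a single line and then read those lines in a shell loop. The name messages_one_per_line and the sample data are assumptions for illustration.

```shell
# Hypothetical helper: join each <Message>...</Message> block onto one line.
messages_one_per_line() {
  awk '
    /<Message Date/ { block = ""; in_message = 1 }    # start a fresh block
    in_message {
      line = $0
      gsub(/^[ \t]+|[ \t]+$/, "", line)               # trim indentation
      block = (block == "" ? line : block " " line)   # append with a single space
    }
    /<\/Message>/ { if (in_message) print block; in_message = 0 }
  ' "$@"
}

# Made-up sample input.
sample='<Message Date="22/02/2011" Time="14:31:15">
	<Text>
		first message
	</Text>
</Message>
<Message Date="22/02/2011" Time="14:47:02">
	<Text>
		second message
	</Text>
</Message>'

# Each pass through the loop has one whole message in $message.
printf '%s\n' "$sample" | messages_one_per_line | while IFS= read -r message; do
  printf 'got: %s\n' "$message"
done
```

Note that the loop on the right of a pipe usually runs in a subshell, so anything assigned inside it is gone once the loop ends; do the per-message work inside the loop body.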

blastman · 7 Oct 2011 at 18:40

WOW, what a response!

I should have explained a little better, so let me try again...

We (at work) are thinking of getting into bed with a new Lync server. However, until we weigh up all the options and decide if it's worth the cost, I've been asked to try and keep tabs on MSN conversations in the office, as all staff use MSN to talk to each other and customers.

Since the newer version of MSN, I'm unable to sniff the non-standard port that MSN uses and grab the data that way. (New versions use port 80, so that's not going to work.)

I've written a logon script that copies each user's XML history data and dumps it on an FTP server. I've then got a Linux webserver that picks this data up via a cron job. Now I want to go through each XML file for each user, get the data out of every node in the XML, and import that data into MySQL.

I'll have a bash with the awk code you've supplied, but in the meantime, if you can think of a better solution, I'd be very grateful.

Thanks again, and I'll update once I've had a chance to try and implement your suggestions.

Blastman

JonJ678 · 9 Oct 2011 at 00:17

MySQL allows you to import delimited text files, so you can probably make your life easier by getting awk to output a file in a format MySQL will happily read.

By default awk reads the first line of a text file, then tries every one of the rules you've typed in, in sequence. It then moves to the next line and repeats. This can be altered however, in that it doesn't have to split records at the new line character. I'm fairly sure you can split on any string you like, so the following will probably work.

The first command (should) get rid of any tabs present in the input, exchanging them for spaces. The first awk instance is the same as before, except it also prints SPLIT_IT_HERE on a separate line at the start of each block of text. Finally, the second awk instance splits the input into records based on SPLIT_IT_HERE, and outputs the text separated by tabs. These should be the only tabs present in the file, so MySQL should have no trouble importing the result. I'm gambling on no one writing SPLIT_IT_HERE anywhere in the log; using a nonprinting character instead probably makes more sense. "034." perhaps.

Code:

sed 's_\t_    _g' YOURXMLFILE.xml | \
awk ' BEGIN {INMESSAGE=0}  ;  
/<\/Message>/ {INMESSAGE=0} ; 
(INMESSAGE==1)  {print} ; 
/<Message Date/ {INMESSAGE=1}
/<Message Date/ {print "SPLIT_IT_HERE"}'  |\
awk 'BEGIN {RS="SPLIT_IT_HERE" ; ORS="\t"} ; {print}'
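
To show the record-splitting idea on its own, here is a small sketch that uses a single nonprinting byte as the separator instead of SPLIT_IT_HERE. It assumes the ASCII file-separator character (octal 034) never appears in the data; the function name split_records and the input are invented for the demo.

```shell
# Hypothetical helper: treat octal 034 as the record separator instead of newline.
split_records() {
  awk 'BEGIN { RS = "\034" }
       { gsub(/\n/, " | "); print NR ": " $0 }'   # show embedded newlines as " | "
}

# Two records, each spanning two lines, each terminated by the 034 byte.
printf 'line one\nline two\034line three\nline four\034' | split_records
```

A single-character RS like this works in any awk. A multi-character RS such as SPLIT_IT_HERE is an extension supported by gawk and mawk, and is not guaranteed elsewhere.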

Best of luck

edit: It's entirely possible that the following achieves exactly the same result.

Code:

sed 's_\t_    _g' YOURXMLFILE.xml | \
awk ' BEGIN {INMESSAGE=0}  ;  
/<\/Message>/ {INMESSAGE=0} ; 
(INMESSAGE==1)  {print} ; 
/<Message Date/ {INMESSAGE=1}
/<Message Date/ {print "\t"}'
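
Going one step further, a rough sketch of producing one tab-separated row per message (timestamp, sender, recipient, text) is below. It assumes the layout in the example at the top of the thread: one tag per line, and the user on its own line inside <From> or <To>. The function name messages_to_tsv and the example.com addresses are made up.

```shell
# Hypothetical helper: one row per message: DateTime, sender, recipient, text.
messages_to_tsv() {
  awk '
    # Return the value of the named attribute on this line, or "" if absent.
    function attr(line, name) {
      if (match(line, name "=\"[^\"]*\""))
        return substr(line, RSTART + length(name) + 2, RLENGTH - length(name) - 3)
      return ""
    }
    /<Message Date/ { stamp = attr($0, "DateTime"); sender = recipient = text = ""; section = "" }
    /<From>/        { section = "from" }
    /<To>/          { section = "to" }
    /<\/Text>/      { section = "" }
    section == "text" {
      line = $0
      gsub(/\t/, " ", line)                         # tabs would break the columns
      gsub(/^ +| +$/, "", line)                     # trim leading and trailing spaces
      text = (text == "" ? line : text " " line)
    }
    /<User /        { if (section == "from") sender = attr($0, "FriendlyName")
                      if (section == "to")   recipient = attr($0, "FriendlyName") }
    /<Text/         { section = "text" }
    /<\/Message>/   { printf "%s\t%s\t%s\t%s\n", stamp, sender, recipient, text }
  ' "$@"
}

# Made-up sample in the same shape as the example at the top of the thread.
messages_to_tsv <<'EOF'
<Message Date="22/02/2011" Time="14:31:15" DateTime="2011-02-22T14:31:15.817Z" SessionID="1">
	<From>
		<User FriendlyName="alice@example.com"/>
	</From>
	<To>
		<User FriendlyName="bob@example.com"/>
	</To>
	<Text Style="font-family:Segoe UI; color:#000000; ">
		message text
	</Text>
</Message>
EOF
```

Tab-separated fields with one row per line is the default layout MySQL's LOAD DATA statement expects, so output like this should need little or no extra options on import. A real XML parser would cope better with escaped characters or tags sharing a line.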

Illusion · 19 Oct 2011 at 08:41

By "keep tabs on MSN conversations", do you mean just seeing how much it is being used, to weigh up if it's worth getting a dedicated solution? Or monitoring the conversations themselves, or something else?

If you want to log conversations you could just set up your own Jabber server (ejabberd) and have it do most of the work for you.