perl regular expressions

daven1986 · 15 Jan 2008 at 22:47

hi guys,
just starting to use perl a bit and i need a bit of help with regular expressions.

i have these lines i need to parse:

Code:

<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311</font>
</td>

and the regular expression i am using is:

Code:

$lines[$index] =~ />(.+)<br>\S\S\S\s\((.+)\) \/ .* \((.+)\) \/ (.+)</

this works fine on the second bit "Multimedia Systems", but fails on the first bit as there is more than one entry (separated by <br><br>) within the text.

the output i require is:
distributed systems
(3-7)
219
distributed systems
(2-10)
311
multimedia systems
(2-10)
311

can someone please give me a few pointers.

thanks

daven

daven1986 · 16 Jan 2008 at 13:15

hi, thanks for the help. I actually just rewrote the reg exp in 2 parts and it now works....to a point.

Code:

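if($lines[$index] =~ />(.+)<\//){
    # everything between the first ">" and the last "</" on the line
    $new = $1;

    # several courses on one line, separated by <br><br>
    if($new =~ /((.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+))<br><br>\1*/){

    }
    # a single course
    elsif($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \((.+)\) \/ (.+)/){

    }
}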

my problem now is that the first if statement uses recursion in the reg exp (the \1* at the end). usually you can access captured groups using $1, $2, $3 etc. however, when I use the recursion I cannot access the $1 of the second repetition. Is there any way to do this?

At the moment I get the information for distributed systems (the first bit), for example, but then cannot access the second distributed systems text.

Instead of using recursion, should i just use a while loop, i.e. process the data while it matches? obviously it will stop matching when there is no more data, so this will work like the recursion for any number of courses after a <br><br>.
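A note on terminology first: \1 is a backreference, so it only matches a literal repeat of whatever group 1 captured; it does not re-run the pattern. The while-loop idea is the usual way round that. Here is a minimal sketch, assuming the surrounding font tag has already been stripped as in the code above. The tightened pattern and the @entries array are illustrative choices, not something taken from the thread.

```perl
use strict;
use warnings;

# Sample line from the first post, without the surrounding <font> tag.
my $new = 'Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219'
        . '<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311';

my @entries;    # hypothetical container: one [name, weeks, room] triple per course

# With /g in scalar context each iteration resumes where the previous match
# ended, so $1, $2 and $3 are refilled for every course in turn.
while ($new =~ m{([^<>]+)<br>\S\S\S\s\(([^)]+)\) / [^/]* / ([^<]*)}g) {
    push @entries, [$1, $2, $3];
}

# Print name, weeks and room on separate lines, with a blank line between courses.
print join("\n", @$_), "\n\n" for @entries;
```

The negated classes ([^<>], [^)], [^/]) keep each group inside its own course, so no group can run across a <br><br> into the next one.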

the output i require would be:

Code:

distributed systems
(3-7)
219

distributed systems
(2-10)
311

multimedia systems
(2-10)
311

sorry for the poor explanation

thanks

daven

daven1986 · 17 Jan 2008 at 10:13

thanks for the advice. I have done pretty much what you suggested and now have:

Code:

while($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ (.*) \/ (.+)<br><br>/ || $new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+)/){

    # more courses follow: cut off everything up to the first <br><br>
    if($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+)<br><br>/){
        $start = index($lines[$index], "<br><br>");
        $new = substr($lines[$index], ($start + 8));
    }
    # last course on the line: nothing left to parse afterwards
    elsif($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ (.*) \/ (.+)/){
        $new = "";
    }
}

which will break down the line, parsing the complex bit then cutting it out and parsing the next bit until the line has been completely parsed. the only problem is that it breaks slightly on this line:

Code:

>MEng Tests<br>Wks (11-11) /  / 344<br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / <br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (3-3) /  / </font>

it prints this:

Code:

MEng Tests
11-11
344<br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / 
TIME OFFSET 4
Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / <br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room
3-3

i am not sure why it is failing to match part of this line. but i am going to do a bit more debugging later to see if i can see where it fails. if you have any ideas it'd be much appreciated. the only thing i can see different is that there is nothing after the second "/" but this shouldn't make any difference.
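One likely cause, sketched under assumptions: the last group, (.+), is greedy and needs at least one character. It therefore runs on to the last <br><br> in the line, and an entry with nothing after the second "/" cannot match on its own. The snippet below uses a shortened copy of the problem line and two simplified patterns ($greedy and $lazy are names made up for this example) that differ only in the last group.

```perl
use strict;
use warnings;

# Shortened copy of the problem line (course names abbreviated for readability).
my $new = 'MEng Tests<br>Wks (11-11) /  / 344<br><br>'
        . 'Tutor slots<br>TUT (10-10) /  / <br><br>'
        . 'Tutor slots<br>TUT (3-3) /  / ';

# Same shape as the patterns in the post; only the last group differs.
my $greedy = qr{(.+?)<br>\S\S\S\s\((.+?)\) / [^/<]* / (.+)<br><br>};
my $lazy   = qr{(.+?)<br>\S\S\S\s\((.+?)\) / [^/<]* / (.*?)<br><br>};

# (.+) is greedy: it swallows everything up to the LAST <br><br>.
my $greedy_room = $new =~ $greedy ? $3 : undef;

# (.*?) is lazy and may be empty: it stops at the FIRST <br><br>.
my $lazy_room = $new =~ $lazy ? $3 : undef;

# An entry whose room is empty cannot satisfy (.+) on its own.
my $second  = 'Tutor slots<br>TUT (10-10) /  / <br><br>';
my $plus_ok = $second =~ $greedy ? 1 : 0;
my $star_ok = $second =~ $lazy   ? 1 : 0;

print "greedy room: $greedy_room\n";
print "lazy room:   $lazy_room\n";
print "empty room matches with (.+): $plus_ok, with (.*?): $star_ok\n";
```

With the greedy group the "room" comes back as 344 followed by the whole second entry, which is the same shape as the odd third line in the output above. The lazy version returns just 344.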

thanks

daven

daven1986 · 17 Jan 2008 at 11:42

AndrewP said:
Dan mentioned it, but assuming a double separator (<br><br>) between different items, splitting the line and then matching on those fragments is probably easier, e.g. split('<br><br>', $line).

Quick, untested regex:

Code:

([A-Za-z ]{1,}) [A-Za-z ]{3} \(([0-9\-]{3,5})\).*\/ ([0-9]{3})?/)

I have sort of done that with substr: i found the index of <br><br> and then truncated the string up to it.
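For comparison, the split-then-match suggestion could look roughly like the sketch below. The hash keys (name, type, weeks, room) and the @courses array are hypothetical names chosen for the example, and the pattern is one possible choice rather than the one posted in the thread.

```perl
use strict;
use warnings;

# Sample line from the first post, without the surrounding <font> tag.
my $line = 'Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219'
         . '<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311';

my @courses;
for my $fragment (split /<br><br>/, $line) {
    # Each fragment holds exactly one course, so one anchored match is enough.
    # The final (.*) also accepts an empty room.
    if ($fragment =~ m{^(.+?)<br>(\S+)\s\(([^)]+)\) / .* / (.*)\z}) {
        push @courses, { name => $1, type => $2, weeks => $3, room => $4 };
    }
}

# One summary line per course, using a hash slice to pick the fields in order.
printf "%s %s (%s) room %s\n", @{$_}{qw(name type weeks room)} for @courses;
```

Because every fragment contains a single course, greedy groups can no longer spill over into the next entry, which removes most of the need for index and substr bookkeeping.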

daven1986 · 17 Jan 2008 at 20:19

hi,

thanks for the help. i have now managed to do it. i did as suggested but in a slightly more roundabout way!!

i have started to like perl!

daven1986 · 17 Jan 2008 at 20:43

sure, it is really messy though!

Code:

while($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+?)<br><br>/ || $new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+)/){

    if($new =~ /(.+)<br>\S\S\S\s(.+) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+?)<br><br>/){
        # search $new rather than $lines[$index], so each pass cuts off the next course
        $start = index($new, "<br><br>");
        $new = substr($new, ($start + 8));
    }
    elsif($new =~ /(.+?)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+)/){
        # whatever is left: split on the double separator and match each piece
        @splitExp = split(/<br><br>/, $new);
        $splitExp = @splitExp;    # number of pieces
        for($j = 0; $j < $splitExp; $j++){
            # (.*) rather than (.*?): a lazy group at the very end of a pattern always matches nothing
            $splitExp[$j] =~ /(.+?)<br>\S\S\S\s(.+) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.*)/;
        }
        $new = "";
    }
}

i then passed the data i needed to a function that computed the date and time of my courses and put it into icalendar format.

daven

daven1986 · 20 Jan 2008 at 18:57

cool thanks. only thing i must ask is if you get more than one match before the end of the line, how do you recover this text?

because surely $1, $2, $3 will refer to the first match, if there are more matches do you use $4, $5, $6?
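A sketch of one common answer to this (not necessarily what was suggested in the thread): $4, $5 and $6 do not come into it, because the pattern still has only three groups. In list context a /g match instead returns the captures of every match as one flat list. The pattern and variable names below are assumptions made for the example.

```perl
use strict;
use warnings;

# Sample line from the first post, without the surrounding <font> tag.
my $line = 'Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219'
         . '<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311';

# List context + /g: the result is (name1, weeks1, room1, name2, weeks2, room2, ...).
my @fields = $line =~ m{([^<>]+)<br>\S\S\S\s\(([^)]+)\) / [^/]* / ([^<]*)}g;

# Walk the flat list three fields at a time, one course per step.
for (my $i = 0; $i < @fields; $i += 3) {
    my ($name, $weeks, $room) = @fields[$i .. $i + 2];
    print "$name\n($weeks)\n$room\n\n";
}
```

The alternative is to loop with while and /g in scalar context, in which case $1, $2 and $3 simply hold the groups of the current match on each pass.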

thanks

daven

daven1986 · 20 Jan 2008 at 21:35

oh i see, excellent.

thanks, this was my first experience of perl and reg exps! but i quite like them now! i see how i could have done it a bit better but it worked for me, (although it does break on other time tables!! but that isn't my problem any more!!)

thanks

daven