Perl regular expressions

hi guys,
just starting to use perl a bit and i need a bit of help with regular expressions.

i have these lines i need to parse:

Code:
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311</font>
</td>

and the regular expression i am using is:
Code:
$lines[$index] =~ />(.+)<br>\S\S\S\s\((.+)\) \/ .* \((.+)\) \/ (.+)</

this works fine on the second bit "Multimedia Systems", but fails on the first bit as there are more than one <br> within the text.

the output i require is:
distributed systems
(3-7)
219
distributed systems
(2-10)
311
multimedia systems
(2-10)
311

can someone please give me a few pointers.

thanks

daven
 
I'm no expert on regular expressions but I have used them a few times in PHP, so I'll have a go. The expression looks like it stops matching on the first line when it gets to the comma (",vlt"). This can be seen below: the green is what seems to be matching and the red is not.

Code:
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1"[COLOR="Lime"]>Distributed Systems<br>LAB (3-7) / slk (3-7)[/COLOR][COLOR="Red"],vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311<[/COLOR]/font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1"[COLOR="Lime"]>Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311<[/COLOR]/font>
</td>

You haven't mentioned the formatting of all/any other data to parse so I'll try not to give you any useless tips that just make things worse :p
 
hi, thanks for the help. I actually just rewrote the reg exp in 2 parts and it now works....to a point.

Code:
if($lines[$index] =~ />(.+)<\//){
			$new = $1;
			
			if($new =~ /((.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+))<br><br>\1*/){
				
			}
			elsif($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \((.+)\) \/ (.+)/){
				
			}
		}

my problem now is that in the first if statement it uses recursion on the reg exp. usually you can access groups of text by using $1, $2, $3 etc. however when I use the recursion I cannot access the $1 of the second recursion. Is there any way to do this?

At the moment I get the information for distributed systems (the first bit), for example, but then cannot access the second distributed systems text.

Instead of using recursion should i just use a while loop - i.e. while it matches process the data. obviously it will stop matching when there is no more data, this will work like the recursion to any number of courses after a <br><br>.
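
A note on the \1 in the pattern above: in Perl a backreference matches the same text that group 1 already captured; it does not re-run the group's pattern, so it cannot pick up a second course with different details. A minimal sketch with made-up strings (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Made-up strings, only to show what \1 does.
my $repeated  = 'ab-ab-';
my $different = 'ab-cd-';
my $re = qr/^(\w+)-\1-/;

# \1 must match the exact text group 1 captured ("ab"),
# not "anything that \w+ could match".
my $first  = $repeated  =~ $re ? 'match' : 'no match';
my $second = $different =~ $re ? 'match' : 'no match';
print "$repeated: $first\n$different: $second\n";
```

So a loop that matches one course at a time, as suggested in the paragraph above, is a more natural fit than a backreference.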

the output i require would be:
Code:
distributed systems
(3-7)
219

distributed systems
(2-10)
311

multimedia systems
(2-10)
311

sorry for the poor explanation

thanks

daven
 
I think what you have done to use a different expression depending on the complexity of the line is the best way to go. You say the parsing of the 'simple' line works fine so I would think the only things needed are to get the more complex expression working by breaking that down also.

This could be done after the check to see if the line is complex by splitting it at the point "<br><br>" and then breaking each piece down again using an expression very much like the original simple one, but taking into account the comma bits. The final piece of the puzzle would be to sort out the looping over every line.

If my mumblings are useless then there is always this link, which may be a tad better: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm#StringSelections

Good luck! :)
 
thanks for the advice. I have done pretty much what you suggested and now have:
Code:
while($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ (.*) \/ (.+)<br><br>/ || $new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+)/){

				if($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+)<br><br>/){
					$start = index($lines[$index], "<br><br>");
					$new = substr($lines[$index], ($start + 8));
				}
				elsif($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ (.*) \/ (.+)/){
					$new = "";
				}

which will break down the line, parsing the complex bit then cutting it out and parsing the next bit until the line has been completely parsed. the only problem is that it breaks slightly on this line:
Code:
>MEng Tests<br>Wks (11-11) /  / 344<br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / <br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (3-3) /  / </font>

it prints this:
Code:
MEng Tests
11-11
344<br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / 
TIME OFFSET 4
Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / <br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room
3-3

i am not sure why it is failing to match part of this line. but i am going to do a bit more debugging later to see if i can see where it fails. if you have any ideas it'd be much appreciated. the only thing i can see different is that there is nothing after the second "/" but this shouldn't make any difference.
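
One likely cause is the greedy (.+) in the last group: it runs on to the last <br><br> it can reach, and it also needs at least one character, so an empty room field cannot match. A minimal sketch on a shortened stand-in for the failing line (the course names are abbreviated and the pattern is simplified, so treat it as an illustration rather than a drop-in fix):

```perl
use strict;
use warnings;

# Shortened stand-in for the failing line (course names abbreviated).
my $new = 'MEng Tests<br>Wks (11-11) /  / 344<br><br>'
        . 'Tutor slot<br>TUT (10-10) /  / <br><br>'
        . 'Tutor slot<br>TUT (3-3) /  / ';

# Greedy last group: (.+) runs on to the LAST <br><br> in the string.
my @greedy = $new =~ m{^(.+?)<br>\S\S\S\s\((.+?)\) / .*? / (.+)<br><br>};

# Non-greedy last group: (.*?) stops at the FIRST <br><br> and may be empty.
my @lazy = $new =~ m{^(.+?)<br>\S\S\S\s\((.+?)\) / .*? / (.*?)<br><br>};

print "greedy room: $greedy[2]\n";
print "lazy room:   $lazy[2]\n";
```

The greedy version swallows the whole second entry into the room field, much like the output shown above; the non-greedy version stops at 344.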

thanks

daven
 
Dan mentioned it, but assuming a double <br> separates the different items, splitting the line and then matching on those fragments is probably easier, e.g. split('<br><br>', $line).

Quick, untested regex:
Code:
/([A-Za-z ]{1,})<br>[A-Za-z ]{3} \(([0-9\-]{3,5})\).*\/ ([0-9]{3})?/
 
I have sort of done that with substr. so i found the index of <br><br> then truncated the string up to it.
 
Looks like it's getting there, here's an example of what I was thinking could be done. It's in pseudocode since I haven't used perl...yet.

Code:
while there is data to parse {
	if this is a complex line i.e. contains <br><br> 
		split the line at the point '<br><br>'
		the split complex line should now be in the same format so a similar if not the same expression can be used
	else this is a simple line i.e. no '<br><br>'
		use the normal working expression here
	end if
end while

Here's my take on the regular expression, inspired by Andrew's. It should work for both the complex and simple branches of the pseudocode, but it is also untested.

Code:
/^(\D+)<br>[A-Za-z ]{3} \((\d+-\d+)\) \/ .*? \/ (\d{3})?/i

Once converted into Perl I hope this works wonders.
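
For what it's worth, the pseudocode and regex above might translate into Perl roughly like this. It is only a sketch: parse_cell is a made-up helper name, and it assumes the text between the font tags has already been extracted into a string. Because split returns a single fragment when there is no <br><br>, the simple and complex cases can share one code path.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [course, weeks, room] array refs
# for the text of one table cell.
sub parse_cell {
    my ($cell) = @_;
    my @entries;
    for my $fragment (split /<br><br>/, $cell) {
        if ($fragment =~ m{^(\D+)<br>[A-Za-z ]{3} \((\d+-\d+)\) / .*? / (\d{3})?}i) {
            # The room group is optional, so $3 is undef when the room is missing.
            push @entries, [$1, $2, defined $3 ? $3 : ''];
        }
    }
    return @entries;
}

my $cell = 'Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>'
         . 'Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311';

for my $entry (parse_cell($cell)) {
    print join("\n", @$entry), "\n\n";
}
```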
 
sure, it is really messy though!

Code:
while($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+?)<br><br>/ || $new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+)/){
							
				if($new =~ /(.+)<br>\S\S\S\s(.+) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+?)<br><br>/){
					$start = index($lines[$index], "<br><br>");
					$new = substr($lines[$index], ($start + 8));
					}
				elsif($new =~ /(.+?)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+)/){
					@splitExp = split(/<br><br>/, $new);
					$splitExp = @splitExp;
					for($j = 0; $j < $splitExp; $j++){
						$splitExp[$j] =~ /(.+?)<br>\S\S\S\s(.+) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.*?)/;
					}
					$new = "";
					
				}
			}

i then passed the data i needed to a function that computed the date and time of my courses and put it into icalendar format.

daven
 
Seeing as you can do it as a 'one-liner', I thought I'd pipe up.

Assuming $text contains the text you want to parse:

Code:
while ($text =~ />([^>]+)<br>[^\(]+(\([\d-]+\))[^<]+\/\s*(\d+)</mg)
{
    print "$1\n$2\n$3\n";
}

The important thing here is the g regex modifier: in the while condition, each evaluation picks up where the previous match ended, so the regex is applied repeatedly until no further match is found in the string.

The regex itself could be much more complex, but this one will do the matching you need. As it stands, it breaks down like this:

Code:
>            : anchor the start of the regex at a > char
([^>]+)      : match and remember one or more characters that aren't >
<br>[^\(]+   : match a <br>, followed by one or more characters that aren't (
(\([\d-]+\)) : match and remember one or more digits or - chars in brackets
[^<]+\/\s*   : match one or more chars that aren't <, followed by a / and 0 or more spaces
(\d+)        : match and remember one or more digits
<            : anchor the end of the regex on a < char

HTH
B
 
cool thanks. only thing i must ask is if you get more than one match before the end of the line, how do you recover this text?

because surely $1, $2, $3 will refer to the first match, if there are more matches do you use $4, $5, $6?

thanks

daven
 
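When used in the condition of a loop, the /g modifier will resume matching from the end of the last match (at least, that's my understanding of it). $1, $2 and $3 are refilled on every pass through the loop, so there is no $4, $5 or $6 to worry about.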

So, the simple loop I posted should do all you need (assuming you already have all the text in a variable). For instance:

Code:
#!/usr/bin/perl -w
use strict;

my $text = <<END;
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311</font>
</td>
END

while ($text =~ />([^>]+)<br>[^\(]+(\([\d-]+\))[^<]+\/\s*(\d+)</mg)
{
    print "$1\n$2\n$3\n";
}

will output

Code:
Distributed Systems
(3-7)
219
Distributed Systems
(2-10)
311
Multimedia Systems
(2-10)
311
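
As a variation on the same regex, here is a small sketch of how the captures behave. pos() is the built-in that reports where the last /g match ended; the shortened $text and the $re variable are just for this example. In list context, the same /g match returns every capture from every match as one flat list.

```perl
use strict;
use warnings;

my $text = '>Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>'
         . 'Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>';

my $re = qr{>([^>]+)<br>[^\(]+(\([\d-]+\))[^<]+/\s*(\d+)<};

# Scalar context: each pass refills $1..$3 and advances pos($text).
while ($text =~ /$re/g) {
    print "match ends at offset ", pos($text), ": $1 $2 $3\n";
}

# List context: all captures from all matches, three per course.
my @all = $text =~ /$re/g;
print scalar(@all), " captures in total\n";
```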

The perlre docs really are very good, even if the writing style is a little ... dry: http://www.perl.com/doc/manual/html/pod/perlre.html

B
 
oh i see, excellent.

thanks, this was my first experience of perl and reg exps! but i quite like them now! i see how i could have done it a bit better but it worked for me, (although it does break on other time tables!! but that isn't my problem any more!!)

thanks

daven
 