perl regular expressions

daven1986 · 15 Jan 2008 at 22:47

hi guys,
just starting to use perl a bit and i need a bit of help with regular expressions.

i have these lines i need to parse:

Code:

<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311</font>
</td>

and the regular expression i am using is:

Code:

$lines[$index] =~ />(.+)<br>\S\S\S\s\((.+)\) \/ .* \((.+)\) \/ (.+)</

this works fine on the second bit "Multimedia Systems", but fails on the first bit as there is more than one entry within the text.

the output i require is:
distributed systems
(3-7)
219
distributed systems
(2-10)
311
multimedia systems
(2-10)
311

can someone please give me a few pointers.

thanks

daven

dan7827 · 16 Jan 2008 at 00:04

I'm no expert on regular expressions but I have used them a few times in PHP so I'll have a go. The expression looks like it is not matching on the first line when it gets to the comma (",vlt"). This can be seen below: the green is what seems to be matching and the red is not.

Code:

<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1"[COLOR="Lime"]>Distributed Systems<br>LAB (3-7) / slk (3-7)[/COLOR][COLOR="Red"],vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311<[/COLOR]/font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1"[COLOR="Lime"]>Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311<[/COLOR]/font>
</td>

You haven't mentioned the formatting of all/any other data to parse so I'll try not to give you any useless tips that just make things worse.

daven1986 · 16 Jan 2008 at 13:15

hi, thanks for the help. I actually just rewrote the reg exp in 2 parts and it now works....to a point.

Code:

if($lines[$index] =~ />(.+)<\//){
			$new = $1;
			
			if($new =~ /((.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+))<br><br>\1*/){
				
			}
			elsif($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \((.+)\) \/ (.+)/){
				
			}
		}

my problem now is that in the first if statement it uses recursion on the reg exp. usually you can access groups of text by using $1, $2, $3 etc. however when I use the recursion I cannot access the $1 of the second recursion. Is there any way to do this?

At the moment I get the information for distributed systems (the first bit), for example, but then cannot access the second distributed systems text.

Instead of using recursion, should i just use a while loop, i.e. while it matches, process the data? obviously it will stop matching when there is no more data, so this will work like the recursion for any number of courses after a <br><br>.

the output i require would be:

Code:

distributed systems
(3-7)
219

distributed systems
(2-10)
311

multimedia systems
(2-10)
311

sorry for the poor explanation

thanks

daven

dan7827 · 16 Jan 2008 at 23:58

I think what you have done to use a different expression depending on the complexity of the line is the best way to go. You say the parsing of the 'simple' line works fine so I would think the only things needed are to get the more complex expression working by breaking that down also.

This could be done after the check to see if the line is complex by splitting it at the point "<br><br>" and then splitting that down again using an expression very much like the original simple one but taking into account the comma bits. The final piece of the puzzle would be to sort out the looping over every line.

If my mumblings are useless then there is always this link that may be a tad bit better: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm#StringSelections

Good luck!

daven1986 · 17 Jan 2008 at 10:13

thanks for the advice. I have done pretty much what you suggested and now have:

Code:

while($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ (.*) \/ (.+)<br><br>/ || $new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+)/){

	if($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+)<br><br>/){
		$start = index($lines[$index], "<br><br>");
		$new = substr($lines[$index], ($start + 8));
	}
	elsif($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ (.*) \/ (.+)/){
		$new = "";
	}
}

which will break down the line, parsing the complex bit then cutting it out and parsing the next bit until the line has been completely parsed. the only problem is that it breaks slightly on this line:

Code:

>MEng Tests<br>Wks (11-11) /  / 344<br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / <br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (3-3) /  / </font>

it prints this:

Code:

MEng Tests
11-11
344<br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / 
TIME OFFSET 4
Alternative Personal Tutor slots with DOC tutor in tutor's room<br>TUT (10-10) /  / <br><br>Alternative Personal Tutor slots with DOC tutor in tutor's room
3-3

i am not sure why it is failing to match part of this line. but i am going to do a bit more debugging later to see if i can see where it fails. if you have any ideas it'd be much appreciated. the only thing i can see different is that there is nothing after the second "/" but this shouldn't make any difference.

thanks

daven
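
A possible explanation for the output above, offered as a sketch rather than a confirmed diagnosis: the final (.+) is greedy, so when a line holds several <br><br> separators it runs on to the last one instead of stopping at the first. It also needs at least one character, which an empty room field does not have. The snippet below compares (.+) with (.*?) using a cut-down pattern; $line is a shortened copy of the failing line and the "Tutor slot" text is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the failing line (assumption: only the shape matters here).
my $line = 'MEng Tests<br>Wks (11-11) /  / 344<br><br>'
         . 'Tutor slot<br>TUT (10-10) /  / <br><br>'
         . 'Tutor slot<br>TUT (3-3) /  / ';

# Greedy: the final (.+) keeps going until the LAST <br><br> in the line.
my ($greedy_room) = $line =~ m{\) / [^/]* / (.+)<br><br>};

# Non-greedy: (.*?) stops at the FIRST <br><br>, and is allowed to be empty.
my ($lazy_room) = $line =~ m{\) / [^/]* / (.*?)<br><br>};

print "greedy: $greedy_room\n";
print "lazy:   $lazy_room\n";
```

With the non-greedy version the room comes back as 344 on its own; the greedy one swallows the second entry as well.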

AndrewP · 17 Jan 2008 at 11:22

Dan mentioned it, but assuming a double separator (<br><br>) between different items, splitting the line and then matching on those fragments is probably easier, e.g. split('<br><br>',$line).

Quick, untested regex:

Code:

/([A-Za-z ]{1,})<br>[A-Za-z ]{3} \(([0-9\-]{3,5})\).*\/ ([0-9]{3})?/

daven1986 · 17 Jan 2008 at 11:42

AndrewP said:
Dan mentioned it, but assuming a double separator (<br><br>) between different items, splitting the line and then matching on those fragments is probably easier, e.g. split('<br><br>',$line).

Quick, untested regex:

Code:

/([A-Za-z ]{1,})<br>[A-Za-z ]{3} \(([0-9\-]{3,5})\).*\/ ([0-9]{3})?/

I have sort of done that with substr. so i found the index of <br><br> then truncated the string up to it.

dan7827 · 17 Jan 2008 at 20:02

Looks like it's getting there, here's an example of what I was thinking could be done. It's in pseudocode since I haven't used perl...yet.

Code:

while there is data to parse {
	if this is a complex line i.e. contains <br><br> 
		split the line at the point '<br><br>'
		the split complex line should now be in the same format so a similar if not the same expression can be used
	else this is a simple line i.e. no '<br><br>'
		use the normal working expression here
	end if
end while

Here's my take on the regular expression after Andrew providing some inspiration with his and it should work in the complex and simple parts in the pseudo code, however it is also untested.

Code:

/^(\D+)<br>[A-Za-z ]{3} \((\d+-\d+)\) \/ .*? \/ (\d{3})?/i

Once converted in to perl I hope this works wonders.
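
For reference, a minimal Perl sketch of the pseudocode above. It assumes the text between the font tags has already been pulled out into a string; parse_cell is a made-up helper name, and the pattern is the untested one from this post, so names containing digits would not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the text found between <font ...> and </font>
# and returns one [name, weeks, room] triple per timetable entry.
sub parse_cell {
    my ($cell) = @_;
    my @entries;

    # Entries inside one cell are separated by a double <br>.
    # A simple line has no separator, so split returns it as one fragment.
    for my $fragment (split /<br><br>/, $cell) {
        # name <br> three-letter type (weeks) / staff / optional room
        if ($fragment =~ m{^(\D+)<br>[A-Za-z ]{3} \((\d+-\d+)\) / .*? / (\d{3})?}) {
            # The room group is optional, so fall back to an empty string.
            push @entries, [ $1, $2, defined $3 ? $3 : '' ];
        }
    }
    return @entries;
}

my $cell = 'Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219'
         . '<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311';

for my $entry (parse_cell($cell)) {
    print join("\n", @$entry), "\n\n";
}
```

Because split hands back a single fragment when there is no <br><br>, the simple and complex cases go through the same loop and no separate branch is needed.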

daven1986 · 17 Jan 2008 at 20:19

hi,

thanks for the help. i have now managed to do it. i did as suggested but in a slightly more round about way!!

i have started to like perl!

dan7827 · 17 Jan 2008 at 20:34

Would you mind sharing the code that got the outcome you were after? Just so I can see if I was anywhere close

daven1986 · 17 Jan 2008 at 20:43

sure, it is really messy though!

Code:

while($new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+?)<br><br>/ || $new =~ /(.+)<br>\S\S\S\s\((.+)\) \/ .* \/ (.+)/){

	if($new =~ /(.+)<br>\S\S\S\s(.+) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+?)<br><br>/){
		$start = index($lines[$index], "<br><br>");
		$new = substr($lines[$index], ($start + 8));
	}
	elsif($new =~ /(.+?)<br>\S\S\S\s\((.+)\) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.+)/){
		@splitExp = split(/<br><br>/, $new);
		$splitExp = @splitExp;
		for($j = 0; $j < $splitExp; $j++){
			$splitExp[$j] =~ /(.+?)<br>\S\S\S\s(.+) \/ ([a-zA-Z0-9\s\(\)-\.,]*) \/ (.*?)/;
		}
		$new = "";
	}
}

i then passed the data i needed to a function that computed the date and time of my courses and put it into icalendar format.

daven

dan7827 · 18 Jan 2008 at 18:02

Cheers for that, it's nice to see the working result.

daven1986 said:
sure, it is really messy though!

Only one thing that has to be said to that...If it ain't broken don't fix it

buzman · 20 Jan 2008 at 15:40

Seeing as you can do it as a 'one-liner', I thought I'd pipe up.

Assuming $text contains the text you want to parse:

Code:

while ($text =~ />([^>]+)<br>[^\(]+(\([\d-]+\))[^<]+\/\s*(\d+)</mg)
{
    print "$1\n$2\n$3\n";
}

The important thing here is the g regex modifier: in a while condition, each match picks up where the previous one ended, so the loop body runs once per match until the regex stops matching anywhere in the string.

The regex itself could be much more complex, but this one will do the matching you need. As it stands, it breaks down like this:

Code:

>            : anchor the start of the regex at a > char
([^>]+)      : match and remember one or more characters that aren't >
<br>[^\(]+   : match a <br>, followed by one or more characters that aren't (
(\([\d-]+\)) : match and remember one or more digits or - chars in brackets
[^<]+\/\s*   : match one or more chars that aren't <, followed by a / and 0 or more spaces
(\d+)        : match and remember one or more digits
<            : anchor the end of the regex on a < char
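
The same breakdown can live inside the regex itself with the /x modifier, which ignores whitespace in the pattern and allows # comments. A minimal sketch, assuming a one-line sample in $text and using m{...} delimiters so the / needs no escaping; @rows is just a name picked for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input, shortened from the first post.
my $text = '<font size="1">Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311</font>';

my @rows;
while ($text =~ m{
        >               # a literal > to start from
        ([^>]+)         # capture 1: one or more characters that are not >
        <br>[^\(]+      # a <br>, then one or more characters that are not (
        (\([\d-]+\))    # capture 2: digits and - characters, brackets included
        [^<]+/\s*       # characters that are not <, then a slash and optional spaces
        (\d+)           # capture 3: one or more digits
        <               # a literal < to finish on
    }xg)
{
    push @rows, [ $1, $2, $3 ];
}

print join("\n", @$_), "\n" for @rows;
```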

HTH
B

daven1986 · 20 Jan 2008 at 18:57

cool thanks. only thing i must ask is if you get more than one match before the end of the line, how do you recover this text?

because surely $1, $2, $3 will refer to the first match, if there are more matches do you use $4, $5, $6?

thanks

daven

buzman · 20 Jan 2008 at 21:06

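When used in the condition of a loop (scalar context), the /g modifier will resume matching from the end of the last match, so $1, $2 and $3 are refilled on every pass and there is no $4, $5, $6 to worry about (at least, that's my understanding of it).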

So, the simple loop I posted should do all you need (assuming you already have all the text in a variable). For instance:

Code:

#!/usr/bin/perl -w
use strict;

my $text = <<END;
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>
</td>
<td bordercolor="#000000">
<font color="#000000" face="Arial" size="1">Multimedia Systems<br>TUT (2-10) / ih (2-10) / 311</font>
</td>
END

while ($text =~ />([^>]+)<br>[^\(]+(\([\d-]+\))[^<]+\/\s*(\d+)</mg)
{
    print "$1\n$2\n$3\n";
}

will output

Code:

Distributed Systems
(3-7)
219
Distributed Systems
(2-10)
311
Multimedia Systems
(2-10)
311
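
If you would rather grab everything in one go, a /g match in list context returns the captures from every match as one flat list. A small sketch, with @fields and @rows as made-up names and a shortened sample cell:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = '<font>Distributed Systems<br>LAB (3-7) / slk (3-7),vlt (3-7) / 219<br><br>'
         . 'Distributed Systems<br>TUT (2-10) / slk (2-10),vlt (2-10) / 311</font>';

# In list context, /g returns the captures of every match as one flat list.
my @fields = $text =~ />([^>]+)<br>[^\(]+(\([\d-]+\))[^<]+\/\s*(\d+)</g;

# Regroup the flat list three fields at a time: name, weeks, room.
my @rows;
while (my @row = splice(@fields, 0, 3)) {
    push @rows, [@row];
}

print join("\n", @$_), "\n" for @rows;
```

The while loop is still the simpler choice when each match is handled as soon as it is found; the list form is handy when all the matches are wanted up front.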

The perlre docs really are very good, even if the writing style is a little ... dry: http://www.perl.com/doc/manual/html/pod/perlre.html

B

daven1986 · 20 Jan 2008 at 21:35

oh i see, excellent.

thanks, this was my first experience of perl and reg exps! but i quite like them now! i see how i could have done it a bit better but it worked for me, (although it does break on other time tables!! but that isn't my problem any more!!)

thanks

daven