I'm trying to learn regex

Take the following string:

(?<=<title.*>)([\s\S]*)(?=</title>)

This is designed to find the title tags and anything in between on a webpage, and it does. However, the title tags themselves aren't returned and I can't figure out why. I'd guess it's got to do with the (?<= ) and (?= ), but I can't find anywhere what these mean or why the difference between them (the less-than symbol in the first) is significant.

Also, what's the difference between [\s\S]* and .* ?

Muchos gracias mon amigos!
 
I think your regex is returning the title tags, however the brackets indicate groupings, so I'm guessing your code is only using the second group returned - which would be the HTML within the title.

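Secondly, AFAIK [\s\S]* and .* only differ when the text contains line breaks: by default . doesn't match a newline, whereas [\s\S] (whitespace or non-whitespace) matches any character at all. I may be wrong on the details for your particular engine, though.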
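
For what it's worth, here's a quick check of the [\s\S]* versus .* question. It's a rough sketch in Python's re module, which is an assumption on my part since the thread doesn't say which engine is in use, and the sample page text is made up. The two only seem to differ once the title spans more than one line:

```python
import re

# Made-up sample: a title element whose content spans several lines.
page = "<title>\nMulti-line\ntitle\n</title>"

# By default "." does not match a newline, so this finds nothing.
dot = re.search(r"<title>(.*)</title>", page)

# [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
any_char = re.search(r"<title>([\s\S]*)</title>", page)

# The DOTALL flag makes "." behave the same way as [\s\S].
dotall = re.search(r"<title>(.*)</title>", page, re.DOTALL)

print(dot)                      # None
print(repr(any_char.group(1)))  # '\nMulti-line\ntitle\n'
print(dotall.group(1) == any_char.group(1))  # True
```

Other engines have their own name for the DOTALL switch (often "single-line" or the s flag), so check the docs for whichever one you're on.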
 
It looks like your code is capturing the whole thing in 3 places via your lookahead/lookbehind layout (depending on the engine), and as Spunkey says you're overcomplicating your logic.

(<title[^>]*>[^<]*</title>) should do the trick (assuming you have no HTML within your title tags), otherwise something like (<title[^>]*>.*</title>) would be fine.
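
To show where those two patterns part ways, here's a small sketch in Python (my choice of engine, not necessarily yours). The two sample strings and the variable names are invented for illustration:

```python
import re

# Hypothetical inputs: one plain title, one with markup inside it.
plain = '<head><title id="t">Home</title></head>'
nested = "<title>Home <b>page</b></title>"

no_markup = re.compile(r"(<title[^>]*>[^<]*</title>)")
any_content = re.compile(r"(<title[^>]*>.*</title>)")

# [^<]* stops at the first "<", so it only works when the title is plain text.
print(no_markup.search(plain).group(1))     # <title id="t">Home</title>
print(no_markup.search(nested))             # None

# .* accepts anything on the same line, including nested tags.
print(any_content.search(nested).group(1))  # <title>Home <b>page</b></title>
```

Note that the .* version is greedy, so with two title elements on one line it would run through to the last closing tag.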
 
(?<= ) and (?= ) are called lookbehind and lookahead assertions respectively. They're zero-width assertions, which means that they're simply telling the parser that those patterns must be present before/after the pattern that they're qualifying, but that they shouldn't be captured as part of the match itself.
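
A minimal sketch of the zero-width behaviour, using Python's re module as an assumed engine. Python only accepts fixed-width lookbehinds, so the sketch uses (?<=<title>) rather than the original (?<=<title.*>); the sample HTML is made up:

```python
import re

html = "<html><head><title>My Page</title></head></html>"

# The lookbehind and lookahead must be present around the match,
# but they consume nothing, so the tags are not part of the result.
inner = re.search(r"(?<=<title>).*?(?=</title>)", html)
print(inner.group(0))  # My Page

# The same pattern without the assertions: now the tags are consumed
# and therefore returned as part of the match.
whole = re.search(r"<title>.*?</title>", html)
print(whole.group(0))  # <title>My Page</title>
```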

If you want to capture the tags as well, just leave out the lookbehind/lookahead syntax; this should do the trick:

Code:
<title.*?>.*?</title>

In this case, the ? modifies the * quantifier, making it "lazy", which means that the parser will only match the shortest possible part of the input text that still satisfies the rest of the pattern. This means that instead of matching something like <title>something</title></title>, it'll terminate the match earlier, giving <title>something</title>.
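
Here is the greedy versus lazy difference on that same input, sketched in Python (an assumed engine; the variable names are just for illustration):

```python
import re

text = "<title>something</title></title>"

# Greedy quantifiers grab as much as possible, then back off only as far
# as needed, so the match runs to the last closing tag.
greedy = re.search(r"<title.*>.*</title>", text).group(0)

# Lazy quantifiers grow one character at a time and stop at the first
# point where the rest of the pattern can match.
lazy = re.search(r"<title.*?>.*?</title>", text).group(0)

print(greedy)  # <title>something</title></title>
print(lazy)    # <title>something</title>
```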

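Edit: To be nit-picky, this pattern is better, since it requires a space after the "title" tag name whenever anything else appears inside the opening tag (so it won't mistake a tag like <titles> for a title tag):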

Code:
<title( .*?)?>.*?</title>
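
A quick comparison of the two patterns, sketched in Python (assumed engine). The <titles> element in the sample is invented purely to trip up the looser pattern:

```python
import re

loose = re.compile(r"<title.*?>.*?</title>")
strict = re.compile(r"<title( .*?)?>.*?</title>")

# Hypothetical input: a look-alike tag followed by a real title tag.
sample = '<titles>oops</titles><title lang="en">Real</title>'

# The loose pattern accepts "<titles>" as an opening tag and then runs on
# until it finds a genuine "</title>".
print(loose.search(sample).group(0))
# <titles>oops</titles><title lang="en">Real</title>

# The strict pattern needs either ">" or a space right after "<title".
print(strict.search(sample).group(0))
# <title lang="en">Real</title>
```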

Here's a useful resource for trying out regexes:

http://regexpal.com/
 
I personally prefer your second pattern and just love laziness ;-)

There are engine differences, obviously, so make sure you're using a Perl-compatible implementation (I've had some fun with Oracle's REGEXP a few times).
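
As one concrete example of an engine difference, Python's re module (used here only as an illustration) refuses to compile the original pattern because its lookbehind isn't fixed-width. The compiles helper below is a made-up name for this sketch:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if this engine accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The pattern from the first post: the lookbehind contains .* so its
# width is variable, which Python's re module rejects.
variable_width = r"(?<=<title.*>)([\s\S]*)(?=</title>)"

# Same idea with a fixed-width lookbehind.
fixed_width = r"(?<=<title>)([\s\S]*)(?=</title>)"

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```

So a pattern that works in one engine may need reshaping before it runs in another.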

Another good resource is http://www.regular-expressions.info

If you intend to read your code again after writing some regular expressions, try to keep them fairly simple. It can be amusing to push your limits, but it's never pretty to read the result a couple of years on...
 