I'm trying to learn regex

Paramount · 14 Nov 2010 at 14:14

Take the following string:

(?<=<title.*>)([\s\S]*)(?=</title>)

This is designed to find the title tags and anything in between on a webpage, and it does. However, the title tags themselves aren't returned and I can't figure out why. I'd guess it's got to do with the (?<= ) and (?= ), but I can't find anywhere what these mean or why the difference between them (the less-than symbol in the first) is significant.

Also, what's the difference between [\s\S]* and .* ?

Many thanks, my friends!

GravyMonster · 14 Nov 2010 at 20:05

I think your regex is returning the title tags; however, the brackets indicate groupings, so I'm guessing your code is only using the second group returned, which would be the HTML within the title.

Secondly, AFAIK there is no difference between [\s\S]* and .*, but I may be wrong on that.
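
For what it's worth, there is a difference in most engines: by default . matches any character except a newline, while [\s\S] (whitespace or non-whitespace) matches any character at all. A minimal Python sketch, assuming the standard re module and a made-up sample string whose title spans two lines:

```python
import re

# Hypothetical sample page; the title is split over two lines on purpose.
html = "<head><title>My\nPage</title></head>"

# By default "." does not match a newline, so this pattern finds nothing.
dot_match = re.search(r"<title>(.*)</title>", html)

# "[\s\S]" matches whitespace or non-whitespace, i.e. any character.
class_match = re.search(r"<title>([\s\S]*)</title>", html)

# With the re.DOTALL flag, "." matches newlines too and behaves like [\s\S].
dotall_match = re.search(r"<title>(.*)</title>", html, re.DOTALL)

print(dot_match)                      # None
print(repr(class_match.group(1)))     # 'My\nPage'
print(repr(dotall_match.group(1)))    # 'My\nPage'
```

So on a single-line title the two behave the same, which may be why the difference is easy to miss.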

Teq · 14 Nov 2010 at 21:05

It looks like your code is capturing the whole thing in 3 places via your lookahead/lookbehind layout (depending on the engine) and, as Spunkey says, you're overcomplicating your logic.

(<title[^>]*>[^<]*</title>) should do the trick (assuming you have no html within your title tags), otherwise something like (<title[^>]*>.*</title>) would be fine.
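
A rough Python sketch of how those two patterns differ once there is markup inside the title. The sample strings are made up for illustration, and this assumes Python's standard re module:

```python
import re

# Pattern for titles without nested HTML: the content may not contain "<".
no_nested = re.compile(r"(<title[^>]*>[^<]*</title>)")
# More permissive pattern: the content may be anything on the same line.
any_content = re.compile(r"(<title[^>]*>.*</title>)")

plain = "<title>My page</title>"
nested = "<title>My <b>bold</b> page</title>"

print(bool(no_nested.search(plain)))     # True
print(bool(no_nested.search(nested)))    # False, "[^<]*" stops at "<b>"
print(bool(any_content.search(plain)))   # True
print(bool(any_content.search(nested)))  # True
```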

Inquisitor · 14 Nov 2010 at 21:34

(?<= ) and (?= ) are called lookbehind and lookahead assertions respectively. They're zero-width assertions, which means that they're simply telling the parser that those patterns must be present before/after the pattern that they're qualifying, but that they shouldn't be captured as part of the match itself.
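
A small sketch of the zero-width behaviour, assuming Python's re module. Note that Python only accepts fixed-width lookbehinds, so the sketch uses (?<=<title>) rather than the original (?<=<title.*>), which that engine rejects; the sample string is made up:

```python
import re

html = "<html><head><title>Learning regex</title></head></html>"

# With lookbehind/lookahead the tags must be present around the match,
# but they are not consumed, so they are not part of the result.
inner_only = re.search(r"(?<=<title>)[\s\S]*?(?=</title>)", html)

# Without the lookarounds the tags are consumed and appear in the match.
with_tags = re.search(r"<title>[\s\S]*?</title>", html)

print(inner_only.group(0))  # Learning regex
print(with_tags.group(0))   # <title>Learning regex</title>
```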

If you want to capture the tags as well, just leave out the lookbehind/lookahead syntax; this should do the trick:

Code:

<title.*?>.*?</title>

In this case, the ? modifies the * quantifier, making it "lazy", which means that the parser will only match the shortest possible part of the input text that still satisfies the rest of the pattern. This means that instead of matching something like <title>something</title></title>, it'll terminate the match earlier, giving <title>something</title>.
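
To make the greedy/lazy difference concrete, here is a minimal Python sketch using the example input from the paragraph above; the greedy variant is just the same pattern with the ? modifiers removed:

```python
import re

text = "<title>something</title></title>"

# Greedy: ".*" grabs as much as it can, so the match runs to the last </title>.
greedy = re.search(r"<title.*>.*</title>", text)

# Lazy: ".*?" grabs as little as it can, so the match stops at the first </title>.
lazy = re.search(r"<title.*?>.*?</title>", text)

print(greedy.group(0))  # <title>something</title></title>
print(lazy.group(0))    # <title>something</title>
```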

Edit: To be nit-picky, this pattern is better (since it requires a space after the opening "title" tag name whenever anything follows it, so a tag such as <titlebar> is not matched):

Code:

<title( .*?)?>.*?</title>
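
A quick Python sketch comparing the two patterns. The sample strings, including the <titlebar> tag, are invented purely for illustration:

```python
import re

loose = re.compile(r"<title.*?>.*?</title>")
strict = re.compile(r"<title( .*?)?>.*?</title>")

samples = [
    "<title>Plain</title>",
    '<title lang="en">With attribute</title>',
    "<titlebar>Not a title tag</title>",
]

# Each row: (matched by the loose pattern, matched by the strict pattern).
results = [(bool(loose.search(s)), bool(strict.search(s))) for s in samples]

for sample, (is_loose, is_strict) in zip(samples, results):
    print(sample, is_loose, is_strict)
```

Only the last sample should differ: the loose pattern accepts <titlebar> as an opening tag, while the strict one does not.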

Here's a useful resource for trying out regexes:

http://regexpal.com/

Teq · 15 Nov 2010 at 22:27

I personally prefer your second pattern and just love laziness ;-)

There are engine differences, obviously, so make sure that you are using a Perl-style (perlre) implementation. (I had some fun with Oracle's REGEXP a few times.)

Another good resource is http://www.regular-expressions.info

If you intend to read your code after writing some regular expressions, make sure you keep them fairly simple. It can be amusing to push your limits, but it is never pretty to read them again a couple of years on...

Izi · 15 Nov 2010 at 22:34

http://gskinner.com/RegExr/

Use this to test out and build your regex.

regex is so user friendly

Xelene · 16 Nov 2010 at 03:24

http://www.weitz.de/regex-coach/

Cenuij · 19 Nov 2010 at 01:01

http://stackoverflow.com/questions/...ept-xhtml-self-contained-tags/1732454#1732454

Inquisitor · 19 Nov 2010 at 01:53

Cenuij said:
http://stackoverflow.com/questions/...ept-xhtml-self-contained-tags/1732454#1732454

If the OP's happy with a heuristic approach that isn't 100% guaranteed to produce accurate results for all possible inputs, then there's nothing wrong with using a regex for a situation like this.
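
As a rough illustration of what "heuristic" means here, a Python sketch with one input where the regex gives a plausible answer and one where it doesn't. The pattern is the one from earlier in the thread with a second capture group added around the content (my addition), and both sample strings are made up:

```python
import re

# Group 1 is the optional attribute text, group 2 is the title content.
pattern = re.compile(r"<title( .*?)?>(.*?)</title>")

ordinary = "<head><title>Real title</title></head>"
# A commented-out title comes first; a regex has no idea what a comment is.
tricky = "<head><!-- <title>Old title</title> --><title>Real title</title></head>"

ordinary_title = pattern.search(ordinary).group(2)
tricky_title = pattern.search(tricky).group(2)

print(ordinary_title)  # Real title
print(tricky_title)    # Old title
```

Whether that kind of edge case matters depends entirely on the pages being processed.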