Parsing HTML to get out text

Chrisss · 1 Jan 2007 at 13:05

Hello. As part of a project I'm writing a program (in Java) that fetches RSS news feeds. Once the appropriate feed has been chosen and I've connected to it, the program reads in the HTML of a story. I then need to rip out the story text from that HTML, without pulling in other text such as the side menu bar.

So say the user selected this story...

http://news.bbc.co.uk/sport1/hi/football/teams/c/chelsea/6220849.stm

I would like to be able to pull out just the story text, which begins with
"Boss Jose Mourinho has made a scathing " and ends with "with Reading's Stephen Hunt in mid-October."

Problem is, I have never used HTML before, so don't really know how to rip out just the story.

Looking through the HTML, there seems to be a <p> or a <b> tag before each new sentence. Would looking for these tags and taking whatever comes directly after each one, until another tag is found, be a good way of getting the text? Or is this method likely to fail on other web pages?

Are there any other ways of identifying where the main text is?
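For reference, the tag-scanning idea could be sketched roughly as below. This is only a minimal sketch under some assumptions: paragraphs are wrapped in matching <p> and </p> tags, and anything inside them counts as story text. The class and method names (ParagraphExtractor, extractParagraphs) are made up for illustration. Pages that leave <p> unclosed, or that also put menu text inside <p> tags, would defeat it, so a proper HTML parser is likely to be more robust than regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text found inside <p>...</p> pairs.
class ParagraphExtractor {
    // Matches <p ...> up to the nearest </p>, ignoring case and spanning line breaks.
    // The \b stops the pattern from matching other tags that start with "p", such as <pre>.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag inside a paragraph, e.g. <b> or <a href="...">.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Drop nested tags, then collapse runs of whitespace into single spaces.
            String text = TAG.matcher(m.group(1)).replaceAll("")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up HTML, not taken from any real page.
        String html = "<div><a href=\"/\">Home</a></div>"
                + "<p><b>First</b> sentence.</p><p>Second sentence.</p>";
        for (String paragraph : extractParagraphs(html)) {
            System.out.println(paragraph);
        }
    }
}
```

Text outside <p> tags, such as the link in the <div> of the sample, is skipped, which is the effect wanted for a side menu. That only holds if the page does not mark up its menus with <p> as well.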

Cheers for any help/pointers.

Chrisss · 1 Jan 2007 at 14:03

OK, thanks for that advice.

How would I get the XML version of a news feed though?

Would the site hosting the article, such as BBC, host it in XML format, or is it a case of me reading it differently to how I'm currently doing it?
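For context, an RSS feed is itself an XML document, so reading one could look something like the sketch below, which uses the DOM parser that ships with the JDK. It assumes the feed follows the usual RSS 2.0 layout, where each story is an <item> element with <title>, <link> and <description> children. The names RssItemReader, readItems and childText are made up for illustration, and the sample feed in the test data is invented.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the stories in an RSS 2.0 feed.
class RssItemReader {
    // Returns one {title, link, description} array per <item> in the feed.
    static List<String[]> readItems(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        NodeList items = doc.getElementsByTagName("item");
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            result.add(new String[] {
                childText(item, "title"),
                childText(item, "link"),
                childText(item, "description")
            });
        }
        return result;
    }

    // Convenience wrapper for feed text that is already held in memory.
    static List<String[]> readItems(String xml) {
        try {
            return readItems(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed XML", e);
        }
    }

    // Text of the first descendant element with the given tag, or "" if there is none.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

In a real program the stream would come from the feed's address, for example new java.net.URL(feedUrl).openStream(), where feedUrl is whatever feed URL the site publishes. The <link> of each item then points at the HTML page for that story, and the <description> usually holds only a short summary rather than the full text.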

Cheers
