Hello, as part of a project I'm writing a program (in Java) that fetches RSS news feeds. Once the appropriate feed has been chosen and I've connected to it, the program reads in the HTML. I then need to rip out the story text from the HTML, without pulling in other text such as the side menu bar.
So say the user selected this story...
http://news.bbc.co.uk/sport1/hi/football/teams/c/chelsea/6220849.stm
I would like to be able to pull out just the story text, which begins with
'Boss Jose Mourinho has made a scathing ' and ends with 'with Reading's Stephen Hunt in mid-October.'
Problem is, I have never used HTML before, so I don't really know how to rip out just the story.
Looking through the HTML, there seems to be a <p> or a <b> before each new sentence. Would looking for these tags, and taking whatever comes directly after each one until another tag is found, be a good way of getting the text? Or is this method likely to fail on other web pages?
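To make the idea concrete, here is a rough sketch of what I have in mind, using only java.util.regex. The class name ParagraphExtractor, the method extractParagraphs and the sample HTML in main are all made up for illustration (it is not the real BBC markup). It assumes every paragraph has both an opening <p> and a closing </p>, which I suspect is not true of every page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text found inside <p>...</p> blocks.
public class ParagraphExtractor {

    // Matches <p> (with optional attributes) up to the nearest </p>.
    // The \b stops it from matching other tags such as <pre>.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag, e.g. <b> or </b> inside a paragraph.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Drop inline tags, then collapse runs of whitespace.
            String text = TAG.matcher(m.group(1)).replaceAll("")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample page: a menu link followed by two story paragraphs.
        String html = "<div class=\"menu\"><a href=\"/\">Home</a></div>"
                + "<p><b>First sentence of the story.</b> More text.</p>"
                + "<p>Second paragraph.</p>";
        for (String p : extractParagraphs(html)) {
            System.out.println(p);
        }
    }
}
```

On the sample this skips the menu link because it is not inside a <p>, but it would still pick up any <p> that happens to sit in a sidebar or footer, and it does not decode entities such as &amp;.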
Are there any other ways of identifying where the main text is?
Cheers for any help/pointers.