Parsing HTML to get out text

Chrisss · 1 Jan 2007 at 13:05

Hello. As part of a project I'm writing a program (in Java) that fetches RSS news feeds. Once the appropriate feed has been chosen and I've connected to it, the program reads in the HTML of the story page, and I then need to rip out the story text from that HTML without pulling in other text such as the side menu bar.

So say the user selected this story...

http://news.bbc.co.uk/sport1/hi/football/teams/c/chelsea/6220849.stm

I would like to be able to pull out just the story text, which begins with
'Boss Jose Mourinho has made a scathing' and ends with 'Reading's Stephen Hunt in mid-October.'

Problem is, I have never used HTML before, so don't really know how to rip out just the story.

Looking through the HTML, there seems to be a <p> or a <b> tag before each new sentence. Would looking for these tags and taking whatever comes directly after them, until another tag is found, be a good way of getting the text, or is this method likely to fail on other web pages?

Are there any other ways of identifying where the main text is?

Cheers for any help/pointers.

JIMA · 1 Jan 2007 at 13:55

Hi,

Is there any way you can get the newsfeed in XML format rather than HTML? The HTML probably doesn't tell you enough about the document in terms of its structure. It also mixes up presentation elements with the text, making it harder to parse.

The XML document will present you with just the data, with no display formatting. Given an XML document, you can get specific parts of it using XPath, or transform the XML into other things (e.g. HTML) using XSLT. For example, the title and body of a story may well be contained within separate elements:

<story>
<title>My Story</title>
<body>Here is the body of the story</body>
</story>

Using XPath it would be easy to pick out the individual parts of the story.

There's loads of information on the W3C site, including tutorials which are pretty good. Doing all of this in Java isn't too difficult with the classes supplied in the javax.xml and org.w3c packages.
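
To give a rough idea, here is a minimal sketch of picking the title and body out of the example story document above with XPath. The class name StoryXPath and the helper method evaluate are just names I've made up for illustration, and the XML is the made-up example, not a real BBC feed.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class; the names are illustrative only.
public class StoryXPath {

    // Evaluates an XPath expression against an XML string and returns
    // the text of the first matching node (empty string if none match).
    static String evaluate(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The example document from this post, written on one line.
        String xml = "<story>"
                + "<title>My Story</title>"
                + "<body>Here is the body of the story</body>"
                + "</story>";

        System.out.println(evaluate(xml, "/story/title"));
        System.out.println(evaluate(xml, "/story/body"));
    }
}
```

A real feed will use different element names, so the XPath expressions would need adjusting to match whatever structure the feed actually has.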

Hope that helps a bit.

Jim

Chrisss · 1 Jan 2007 at 14:03

Ok thanks for that advice.

How would I get the XML version of a news feed though?

Would the site hosting the article, such as BBC, host it in XML format, or is it a case of me reading it differently to how I'm currently doing it?

Cheers

JIMA · 1 Jan 2007 at 16:41

Have a look at freenewsfeed.com. This seems to provide a free RSS/XML feed. If you follow the link on the webpage for www.freenewsfeed.com/rss and then view the source of the page shown, you'll see the XML layout of the feed, which you can parse or search for the bits you need.
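
As a rough sketch of the parsing side, something like the following would list the story titles in an RSS document using the DOM classes. I'm assuming the usual RSS layout of item elements that each contain a title; the sample feed in main is invented for illustration, and RssItems and itemTitles are made-up names. In a real program you would read the XML from the feed's URL instead of a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; the names are illustrative only.
public class RssItems {

    // Parses RSS XML held in a string and returns the title of each <item>.
    static List<String> itemTitles(String rssXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Look for <title> inside this item only, so the
            // channel-level title is not picked up.
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample feed, standing in for a real one.
        String sample = "<rss version=\"2.0\"><channel>"
                + "<title>Sample feed</title>"
                + "<item><title>First story</title></item>"
                + "<item><title>Second story</title></item>"
                + "</channel></rss>";

        for (String title : itemTitles(sample)) {
            System.out.println(title);
        }
    }
}
```

The same loop could pull out the link or description of each item by changing the tag name passed to getElementsByTagName.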

Another resource you might find interesting is this:

http://www.javaworld.com/javaworld/jw-05-2000/jw-0526-jiniology.html

It shows something similar to what you want to do, along with explanations of how to do it.

Hope that's of use.

Jim