xpath - getting php to display the text content of a remote table cell

craptakular · 2 Feb 2011 at 21:15

Hi,

I'm trying to create a script that will pull in data from a remote html page, more accurately a cell within a table.

I have used the BBC's top Premier League goal scorers table for this example; I am trying to echo the text "Nolan".

However, I get this error with the script below:

Catchable fatal error: Object of class DOMNodeList could not be converted to string in C:\wamp\www\scrape.php on line 26

Now from googling, it seems that a 'DOMNodeList' cannot be displayed as a string.

I'm not sure how to proceed. Any ideas?

If I can fix this script, I'm wondering: in its current form, does it execute every time the page loads? If it does, how would I get it to execute only once every hour?
So, if I had 10 people on the page it would execute 10 times?

PHP:

    <?php

      $my_url = 'http://news.bbc.co.uk/sport1/hi/football/eng_prem/top_scorers/default.stm';

      $html = file_get_contents($my_url);

      $dom = new DOMDocument();

      @$dom->loadHTML($html);

      $xpath = new DOMXPath($dom);

      $my_xpath_query = "/html/body[@id='body']/div[1]/div[@id='blq-container']/div[@id='blq-container-inner']/div[@id='blq-main']/div/table/tbody/tr/td[2]/table[1]/tbody/tr/td[1]/table[1]/tbody/tr[6]/td[1]";

      $text = $xpath->query($my_xpath_query);

      echo "$text";

      ?>

Pho · 2 Feb 2011 at 22:12

Firstly, your xpath query isn't returning any results, and secondly, what it does return is a DOMNodeList (a list of nodes) rather than plain text, which is why you get that error.

Here's a much shorter xpath query which should get to your required cell (I've only used xpath once so I'm sure there's better ways to do this):

PHP:

$my_xpath_query = "//div[@id='blq-main']//table/tr[6]/td[1]";

// is kind of a wildcard: it matches at any depth, so the query skips everything above and jumps straight to the div with an id of blq-main. It then looks for any table it can find beneath that div (again, because of the //, the table doesn't have to be a direct child) and then does what you did to get to the right cell. If the table you wanted had an id set you could jump straight to it with //table[@id='tableresultsid'], but annoyingly the BBC doesn't set that for you.
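As a rough illustration of the two forms described above, here is a minimal sketch run against a small made-up page rather than the live BBC markup. The sample HTML, the `scorers` id and the filler player names are all assumptions for the example.

```php
<?php
// Made-up sample markup standing in for the real page (an assumption).
$html = <<<HTML
<html><body>
  <div id="wrapper">
    <div id="blq-main">
      <div>
        <table id="scorers">
          <tr><td>Player</td><td>Team</td></tr>
          <tr><td>Player A</td><td>Team A</td></tr>
          <tr><td>Nolan</td><td>Team B</td></tr>
        </table>
      </div>
    </div>
  </div>
</body></html>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides parser warnings about sloppy markup
$xpath = new DOMXPath($dom);

// "//" searches at any depth, so there is no need to spell out every parent.
$viaDiv = $xpath->query("//div[@id='blq-main']//table/tr[3]/td[1]");

// If the table carries an id, the query can start from the table itself.
$viaId = $xpath->query("//table[@id='scorers']/tr[3]/td[1]");

echo $viaDiv->item(0)->nodeValue, "\n";
echo $viaId->item(0)->nodeValue, "\n";
```

Both queries should land on the same cell here; on a real page the row and column indexes would of course differ.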

Once you have that you can loop over all your results with something like this, handy if you wanted to for example return all the rows in the table and list all the players:

PHP:

 foreach ($text as $node)
 {
	echo $node->nodeValue."<br/>";
 }

However, you only want one, so you could do this, which takes the first item from the results (in your example the only one, as we've only returned one cell from the xpath) and displays it:

PHP:

 if (is_array($text))
	echo $text[0]->nodeValue;
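One caveat with the snippet above: DOMXPath::query() returns a DOMNodeList object, not a PHP array, so the is_array() check is never true and nothing gets echoed. A minimal sketch of reading the first node instead, using a one-row sample table as a stand-in for the real page:

```php
<?php
$dom = new DOMDocument();
// Tiny sample table (an assumption) in place of the fetched page.
@$dom->loadHTML('<table><tr><td>Nolan</td><td>Team B</td></tr></table>');
$xpath = new DOMXPath($dom);

$text = $xpath->query('//table/tr[1]/td[1]');

// A DOMNodeList is an object, so is_array() reports false for it.
$isArray = is_array($text);

// query() returns false for a malformed expression, so check that first,
// then use length and item() to reach the first node safely.
$first = ($text !== false && $text->length > 0)
    ? $text->item(0)->nodeValue
    : null;

echo $first;
```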

To cache your results you could dump them into the database the first time a user requests the page, and then every time a user accesses the page check whether it's time to update or not. Alternatively, you could set up a cron job or similar on your server to call a PHP script which polls the BBC site automatically every x minutes with the above and dumps the result into a database or file on your website, which you can include in the pages your visitors see.

You certainly don't want to poll the BBC for each request, so it's good you thought about it.

craptakular · 3 Feb 2011 at 07:17

Thank you so much!

PHP:

 foreach ($text as $node) 
 { 
    echo $node->nodeValue."<br/>"; 
 }

Your above example worked, but I couldn't get the bottom one to work.

Essentially in the future I may need to collect anything up to 30 table cells, all on different pages but if I created something like this:

PHP:

<?php

      $my_url = 'http://news.bbc.co.uk/sport1/hi/football/eng_prem/top_scorers/default.stm';

      $html = file_get_contents($my_url);

      $dom = new DOMDocument();

      @$dom->loadHTML($html);

      $xpath = new DOMXPath($dom);

      $my_xpath_query = "//div[@id='blq-main']//table/tr[6]/td[1]"; 

$text = $xpath->query($my_xpath_query);

 
 foreach ($text as $node)
 {
    echo $node->nodeValue."<br/>";
 } 


//scrape number 2

      $my_url1 = 'http://news.bbc.co.uk/sport1/hi/football/eng_prem/top_scorers/default.stm';

      $html1 = file_get_contents($my_url1);

      $dom1 = new DOMDocument();

      @$dom1->loadHTML($html1);

      $xpath1 = new DOMXPath($dom1);

      $my_xpath_query1 = "//div[@id='blq-main']//table/tr[6]/td[1]";

      $text1 = $xpath1->query($my_xpath_query1);

 
 foreach ($text1 as $node1)
 {
    echo $node1->nodeValue."<br/>";
 } 



      ?>

Is this the most efficient way?

I'll research how to dump this into a database, that's my next objective hehe.

Pho · 3 Feb 2011 at 21:12

You'd probably be better off with something like this. Here's a very quick class I mocked up, with some examples at the bottom:

PHP:

<?php
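	class Scrape
	{
		private $url;
		private $xpathQuery;
		private $xpathResults;

		// __construct is used because PHP 8 no longer treats a method named after the class as the constructor
		public function __construct($url, $query)
		{
			$this->setURL($url);
			$this->setXpathQuery($query);
		}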

		function getURL()
		{
			return $this->url;
		}

		function setURL($url)
		{
			$this->url = $url;
		}

		function getXpathQuery()
		{
			return $this->xpathQuery;
		}

		function setXpathQuery($query)
		{
			$this->xpathQuery= $query;
		}
		
		function getXpathResults()
		{
			return $this->xpathResults;
		}
		
		function setXpathResults($result)
		{
			$this->xpathResults = $result;
		}
		
		function execute()
		{				
			$html = file_get_contents($this->getURL());

			$dom = new DOMDocument();
			@$dom->loadHTML($html);
			$xpath = new DOMXPath($dom);				
			$results = $xpath->query($this->getXpathQuery());
			$this->setXpathResults($results);
		}
	}

	
	// List of top score players
	$scrape1 = new Scrape('http://news.bbc.co.uk/sport1/hi/football/eng_prem/top_scorers/default.stm' ,'//table[@class="fulltable"]/tr[@class="r1" or @class="r2"]/td[1]');
	$scrape1->execute();
	
	echo "<h1>List of top score players</h1>";
	foreach ( $scrape1->getXpathResults() as $row )
	{
		echo '<p>'.htmlentities($row->nodeValue).'</p>';
	}

	// List of cities proper by population
	$scrape2 = new Scrape('http://en.wikipedia.org/wiki/List_of_cities_proper_by_population' ,'//table[@class="sortable wikitable"]//td[2]');
	$scrape2->execute();
	
	echo "<h1>List of cities proper by population</h1>";
	foreach ( $scrape2->getXpathResults() as $row )
	{
		echo '<p>'.htmlentities($row->nodeValue).'</p>';
	}

As you can hopefully see, the class means you don't have to repeat all your xpath code. You could put your caching system in there, so that when you call the execute function it gets the data from a database or the live site based on some decision, and the code which uses it never needs to know whether it's a cached copy or not. It's pretty much up to you.

For a basic caching system you really just want to check the database to see if a) the value you're looking for already exists and b) that the value's last update time (which you would store as another column, e.g. a timestamp field) is less than say 10 minutes old. If it is, just use that value, if it isn't, get your script to query the live data and save the result to the database (and remember to reset the last-update timestamp).
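The check described above can be sketched roughly as follows. To keep the example self-contained it stores the value and its last-update timestamp in a JSON file instead of a database table, which is a swap on my part; the `cachedValue` function, the file name and the stand-in fetch callback are all hypothetical.

```php
<?php
/**
 * Return the cached value if it is younger than $maxAgeSeconds; otherwise
 * call $fetch, store the fresh value with a new timestamp, and return it.
 */
function cachedValue(string $cacheFile, int $maxAgeSeconds, callable $fetch): string
{
    if (is_file($cacheFile)) {
        $entry = json_decode(file_get_contents($cacheFile), true);
        $fresh = is_array($entry)
            && isset($entry['value'], $entry['updated'])
            && (time() - $entry['updated']) < $maxAgeSeconds;
        if ($fresh) {
            return $entry['value']; // a) value exists and b) it is recent enough
        }
    }

    // Missing or stale: query the live source and reset the timestamp.
    $value = (string) $fetch();
    $payload = json_encode(['value' => $value, 'updated' => time()]);
    file_put_contents($cacheFile, $payload, LOCK_EX);

    return $value;
}

// Usage: the closure stands in for the real scrape and counts its calls.
$cacheFile = sys_get_temp_dir() . '/top_scorer_cache.json';
@unlink($cacheFile); // start from an empty cache for the demo

$calls = 0;
$fetch = function () use (&$calls) {
    $calls++;
    return 'Nolan';
};

$first = cachedValue($cacheFile, 600, $fetch);  // cache miss, calls $fetch
$second = cachedValue($cacheFile, 600, $fetch); // served from the cache file
```

With a database the shape would be much the same: one column for the value, one for the timestamp, and the same age comparison before deciding whether to scrape again.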

Enjoy.

slylittlefox · 3 Feb 2011 at 21:54

Nice little API

craptakular · 4 Feb 2011 at 09:55

Thanks very much, I'm really new to PHP, it's also my first computer language.

So I assume all I have to learn is how to create a database and get the script you posted to connect to it and put the data in the right field based off an id that I would setup when creating the database?

Essentially I'm trying to grab the same cell, which is on 30 different pages, so it's the same kind of info just different figures for a different type of product. I would then need to display 6 of these values on my homepage and also display each one on a separate page on my wordpress site.

What I think I need (and I'm more than likely wrong or not understanding) is a script that purely writes to the database (it checks every hour based on the timestamp column and only updates if the value for X has changed), then mini scripts that each display one table cell from the mysql database.

Pho · 4 Feb 2011 at 15:16

Oxy said:
Thanks very much, I'm really new to PHP, it's also my first computer language.

Oh ok, hopefully I didn't scare you off then.

So I assume all I have to learn is how to create a database and get the script you posted to connect to it and put the data in the right field based off an id that I would setup when creating the database?

Yep pretty much. As you're new you might be better off following a separate tutorial and once you have that working modify it to suit your needs/site. I've not read it but this tutorial looks to be pretty nice.

Essentially I'm trying to grab the same cell, which is on 30 different pages, so it's the same kind of info just different figures for a different type of product. I would then need to display 6 of these values on my homepage and also display each one on a separate page on my wordpress site.

What I think I need (and I'm more than likely wrong or not understanding) is a script that purely writes to the database (it checks every hour based on the timestamp column and only updates if the value for X has changed), then mini scripts that each display one table cell from the mysql database.

That sounds logical to me. Depending on your hosting provider, you might be able to set up a cron job in your control panel, which basically tells the server to execute a PHP script at certain intervals. That would cover your database writing part and let you separate it from the reading part. It should be a lot less messy this way, as your database reading bit (inside your WordPress site) only has to read the latest version of the data and wouldn't even need to do the timestamp checks (unless of course you wanted to display the last updated time or something).

craptakular · 4 Feb 2011 at 20:37

cron I know from my linux use, so at least that's one thing I can do lol.

Will follow that guide and I have bought some textbooks to help me get the basics.

I'll see if I can work out how to do some "my first" mysql databases and data entry tomorrow.

craptakular · 6 Feb 2011 at 12:49

I have made good progress in learning how to create a database, add a table and call a selected table, the tigaz guide is great!

However, I have a major headache with xpath. I can pull data in from sources with no namespaces without any issue, but the main source of info I will need when I build my proper script uses namespaces. I think this is why it doesn't work atm: the php script doesn't realise there are namespaces because I haven't defined them!

If I have a namespace of:

Code:

http://www.w3.org/1999/xhtml      x

This is what Xpath Checker firefox plugin outputs:

Code:

http://www.w3.org/1999/xhtml      x


id('tdtestbox')/x:table/x:tbody/x:tr[2]/x:td[2]

As I am not defining the name space in php script I am convinced this is why I return no data. Any ideas?

Pho · 6 Feb 2011 at 14:54

Nice.

Hmm I'm not too sure on the namespace stuff.. can you post a site and part of it you're trying to pull the data from and I'll have a look?

craptakular · 6 Feb 2011 at 14:59

This page is exactly like the one I am having issue with:

http://www.skysports.com/

id('ss-content')/x:div[1]/x:div[1]/x:div[3]/x:ul[1]/x:li[2]/x:h4/x:a

It has all the funny x: stuff in...

If the xpath is "normal" (i.e. no x: stuff) like the bbc example then it works perfectly. I've been at this since 9 am and my head is going to explode.

Thank you for still helping me!

Pho · 6 Feb 2011 at 15:22

I think that's just your extension causing you problems; there might be an option in it to remove the namespace stuff.

If you use Firebug you can do it as well:

xpath: /html/body/div/div[7]/div/div/div[3]/ul/li[2]/p

Which I didn't realise until a few minutes ago either.

craptakular · 6 Feb 2011 at 15:39

I still can't get it to work, did you get it working?

craptakular · 6 Feb 2011 at 16:49

http://php.net/manual/en/domxpath.registernamespace.php

In the 3rd comment down on that link, the guy says that namespaces need to be prefixed and the document loaded as XHTML, which we have done, but then we cannot use the loadHTML function. It simply will never work that way; maybe this is the reason?

Pho · 6 Feb 2011 at 16:50

Hmm no that didn't seem to work for me either. Taking your xpath:
id('ss-content')/x:div[1]/x:div[1]/x:div[3]/x:ul[1]/x:li[2]/x:h4/x:a

remove all the x: parts:
id('ss-content')/div[1]/div[1]/div[3]/ul[1]/li[2]/h4/a

and convert the id('ss-content') to the PHP div syntax:
//div[@id="ss-content"]/div[1]/div[1]/div[3]/ul[1]/li[2]/h4/a

and it seems to work.
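The two steps above can be sketched as a small helper. This is only an illustration: the `toPlainXpath` name is hypothetical, and it uses `//*[@id=...]` rather than `//div[@id=...]` so it doesn't have to guess which element carries the id. As far as I understand it, the reason stripping works is that loadHTML builds elements without a namespace, so the x: prefix has nothing to match.

```php
<?php
/**
 * Turn an XPath Checker style expression such as
 *   id('foo')/x:div[1]/x:a
 * into a plain one that does not rely on a namespace prefix.
 */
function toPlainXpath(string $expr, string $prefix = 'x'): string
{
    // id('foo') at the start becomes //*[@id='foo'], keeping the quote style.
    $expr = preg_replace('/^id\(([\'"])(.*?)\1\)/', '//*[@id=$1$2$1]', $expr);

    // Drop "x:" where it prefixes a name, but leave axes like "x::" alone.
    $pattern = '/(?<![\w:-])' . preg_quote($prefix, '/') . ':(?!:)/';

    return preg_replace($pattern, '', $expr);
}

$converted = toPlainXpath(
    "id('ss-content')/x:div[1]/x:div[1]/x:div[3]/x:ul[1]/x:li[2]/x:h4/x:a"
);

echo $converted, "\n";
```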

craptakular · 6 Feb 2011 at 16:58

this is actually driving me insane now lol... will try it, if I can't can I email your trust?

craptakular · 6 Feb 2011 at 17:07

How would you convert this?

id('tdboxDetails')/x:table/x:tbody/x:tr[2]/x:td[2]

If you get that working I'll send you over some beer money/ enough for a game

Pho · 6 Feb 2011 at 17:33

Oxy said:
this is actually driving me insane now lol... will try it, if I can't can I email your trust?

Sure, email me if you want.

Oxy said:
How would you convert this?

id('tdboxDetails')/x:table/x:tbody/x:tr[2]/x:td[2]

If you get that working I'll send you over some beer money/ enough for a game

Hmm, untested but try this:
//div[@id="tdboxDetails"]/table/tbody/tr[2]/td[2]
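Two things might stop a query like that from matching, and both are guesses since the real page isn't shown: the element with that id may not be a div, and browser tools tend to show a tbody that isn't in the HTML source, while loadHTML (as far as I know) doesn't add one. A minimal sketch on made-up markup comparing the variants:

```php
<?php
// Made-up markup (an assumption): the id sits on a td and the source
// contains no <tbody>, as is common in hand-written tables.
$html = '<div><table><tr><td id="tdboxDetails"><table>'
      . '<tr><td>Label</td><td>Value</td></tr>'
      . '<tr><td>Price</td><td>42</td></tr>'
      . '</table></td></tr></table></div>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// As copied from a browser tool: insists on a tbody step.
$withTbody = $xpath->query('//*[@id="tdboxDetails"]/table/tbody/tr[2]/td[2]');

// Same path with the tbody step removed.
$withoutTbody = $xpath->query('//*[@id="tdboxDetails"]/table/tr[2]/td[2]');

// "//tr" accepts the row whether or not a tbody sits in between.
$eitherWay = $xpath->query('//*[@id="tdboxDetails"]/table//tr[2]/td[2]');

echo 'with tbody: ', $withTbody->length, " match(es)\n";
echo 'without tbody: ', $withoutTbody->length, " match(es)\n";
echo 'either way: ', $eitherWay->length, " match(es)\n";
```

Using `*` instead of `div` matches whatever element holds the id, and the `//tr` form avoids depending on whether a tbody is present.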

craptakular · 6 Feb 2011 at 17:37

I can't get it to work, going to fire off an email to you now!

EDIT: Sent one to your hotmail account.

Pho · 6 Feb 2011 at 17:59

Hmm I haven't received anything. Can you send it to my gmail instead? (akiller@)