Scraping info from websites (Python?)

Hi all,

Hoping you might be able to help me out. I'm working on a project where I could really do with counting the number of instances of an image across a website, e.g. on a web store, how many product pages have an 'Online Only' image or a 'clearance' image appearing on them.

I've worked with people in the past who have done similar stuff to this for me using Python (I believe it looks at class and style tags in the web source code?) but now I really want to try and learn how to do it myself.

Is Python the best way to try and do the above or are there better ways to do it? Keen to learn but also don't have much free time so keen to minimize the time I need to spend figuring out how to do what I need to :)

Many thanks

Edit: For context I work with SQL and SAS day in day out so am quite comfortable learning raw code etc :)
 
I do this in VBA/.NET quite often. Assuming you don't want to start writing your own APIs from scratch, you need a hook into the webpage HTML which converts it into objects/methods exposed in your language of preference.

I don't know Python so you will have to figure out that bit yourself.

It all depends on how well the webpage was made. If the HTML tags/objects are properly named, you can generally refer to them directly. If the HTML is crap, like it is where I work, and nothing is tagged, you will have to do something like this, which loops through every HTML element in the page looking for certain keywords:

Code:
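Private Function ReturnVal(ByRef parDoc As MSHTML.HTMLDocument, ByVal parArg As String) As String

' Finds the first element whose text equals parArg and returns the text
' of the element that follows it. result stays "{<>}" if no value is found.

Dim result As String
Dim element As MSHTML.HTMLHtmlElement
Dim eventsWereEnabled As Boolean

' Suspend event handling while scanning, remembering the previous setting
eventsWereEnabled = Application.EnableEvents
Application.EnableEvents = False

result = "{<>}"

For Each element In parDoc.DocumentElement.all
    Debug.Print Trim(element.innerText)
    If Trim(element.innerText) = parArg Then
        If Not element.NextSibling Is Nothing Then
            If Trim(element.NextSibling.innerText) = "Hist" Then
                ' Skip over a "Hist" element and take the one after it
                If Not element.NextSibling.NextSibling Is Nothing Then
                    result = element.NextSibling.NextSibling.innerText
                End If
            Else
                If Left(element.NextSibling.innerText, 9) = "Show/Hide" Then
                    result = vbNullString
                Else
                    result = element.NextSibling.innerText
                End If
            End If
        Else
            result = vbNullString
        End If

        ' Only the first matching element is considered
        Exit For
    End If

Next element

' Put event handling back the way it was
Application.EnableEvents = eventsWereEnabled

ReturnVal = result

End Function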
 
I was going to say google "webcrawler <xxx>" where xxx is your preferred language to start from. A webcrawler is a classic learning exercise for most languages these days.

When I tried it, I came across this site that looks like it might save you some grief.
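For what it's worth, the core of a crawler is small. Here is a rough sketch in Python using only the standard library: pull the links out of each page, stay on the same host, and keep a queue of pages still to visit. The `fetch` argument is a stand-in I've made up (any function that takes a URL and returns the HTML as a string, or None if it couldn't be loaded), and the names `LinkCollector` and `crawl` are made up for the example too.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one host; returns {url: html}.

    `fetch` is a hypothetical callable: URL in, HTML string out
    (or None if the page could not be loaded).
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment
            link, _ = urldefrag(urljoin(url, href))
            # Stay on the starting host and never queue a page twice
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

For a real site, `fetch` would wrap whatever you use to download a page. It's worth adding a delay between requests and checking the site's robots.txt before pointing it at someone's web store.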
 
This is just what I've been working on for the past week at work coincidentally. It's scraping details off a hotel website (rooms available, prices, etc) for displaying in another website. The person then books on our website and the "bot" books for them, auto inserting the affiliate code into the booking process, scraping off the form, navigating their booking system etc.

I use Selenium (Python bindings) with the Chrome driver when testing so I can watch it navigate Chrome visually and pull info for debugging purposes. When it's out in production, it's using a headless browser (PhantomJS) with Selenium again.

With Selenium you can navigate links, click on buttons, enter things into forms and submit them, and wait for AJAX calls to complete so you can get content that arrives after the initial page load.

A small snippet which is selecting some rooms from a hotel and getting their details.

Code:
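# `room` is the div element for one room, found earlier; `self.driver` is the WebDriver.
# find_element_by_* is the older Selenium API; Selenium 4 uses find_element(By..., ...) instead.
room_id = room.get_attribute('id')
image = room.find_element_by_tag_name('img').get_attribute('src')
rate = room.find_element_by_css_selector('div.price > span').get_attribute('innerText')
room_type = room.find_element_by_css_selector('div.description > a > span.name').get_attribute('innerText')
# The details panel is looked up on the whole page via the driver, not inside `room`
suite_details = self.driver.find_element_by_class_name('display_room_' + room_id)
description = suite_details.find_element_by_id('room_des').get_attribute('innerText')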

In the above example I have found a div for the room I am interested in (called room), I then find the img tag within that room div and get the src value. I'm also getting some details, prices etc for said room.

It's very easy with Selenium. It's a tool typically used in testing, and you can do anything a person can, but "scriptomatically".

Selenium can be used in other languages too, e.g. Java, C#. I would rather at work be doing .NET and C# instead of Django and Python, but that's a topic for another discussion :P.

There is also CasperJS, which uses PhantomJS. This allows you to scrape and navigate pages with javascript.
 
I've done this with Python before for my dissertation: I scraped a bunch of car prices off Parkers.

I used mechanize for browsing, fetching the HTML and logging in, and then BeautifulSoup for navigating the HTML and finding what I needed.
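Applied to the original question (how many product pages show an 'Online Only' or 'clearance' image), here is a minimal sketch of the parse-and-count step. It uses the standard library's `html.parser` rather than BeautifulSoup so it runs with no installs, but the idea is the same. The assumption that the badge can be recognised by a keyword in the `<img>` tag's `src`, `alt` or `class` is mine, as are the names `BadgeFinder` and `count_pages_with_badge`, so check the real page source to see what the site actually uses.

```python
from html.parser import HTMLParser


class BadgeFinder(HTMLParser):
    """Sets `found` if any <img> mentions the keyword in src, alt or class."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        for name in ("src", "alt", "class"):
            value = attrs.get(name) or ""
            if self.keyword in value.lower():
                self.found = True


def count_pages_with_badge(pages, keyword):
    """`pages` maps URL -> HTML string; returns how many pages contain the badge."""
    count = 0
    for html in pages.values():
        finder = BadgeFinder(keyword)
        finder.feed(html)
        if finder.found:
            count += 1
    return count


# Made-up pages standing in for downloaded product pages
pages = {
    "/product/1": '<img src="/img/online-only.png" alt="Online Only">',
    "/product/2": '<img src="/img/kettle.jpg" alt="Kettle">',
    "/product/3": '<img class="badge clearance" src="/img/b2.png">',
}
print(count_pages_with_badge(pages, "online"))     # 1
print(count_pages_with_badge(pages, "clearance"))  # 1
```

Each page is counted once however many matching images it has, which is what you want for "how many product pages have the badge". This only sees images that are in the HTML as downloaded, so a badge added later by JavaScript would be missed.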
 
I've hand coded web crawlers in Python before, can't remember much now but the main package you'll see being mentioned is something called BeautifulSoup.

As mentioned above, Selenium can also be used.
 
I went for Selenium as I'm scraping a page that's filled up after the load has finished through ajax calls. I did look at BeautifulSoup originally but had the most success with Selenium in that area.
 