Scraping info from websites (Python?)

GhostWKD · 9 Mar 2016 at 12:35

Hi all,

Hoping you might be able to help me out. I'm working on a project where I could really do with counting the number of instances of an image across a website, e.g. on a web store, how many product pages have an 'Online Only' image or a 'clearance' image on them.

I've worked with people in the past who have done similar stuff to this for me using Python (I believe it looks at class and style tags in the web source code?) but now I really want to try and learn how to do it myself.

Is Python the best way to do the above, or are there better ways? I'm keen to learn but don't have much free time, so I'd like to minimize the time I spend figuring out how to do it.

Many thanks

Edit: For context I work with SQL and SAS day in day out so am quite comfortable learning raw code etc

dazzerd · 9 Mar 2016 at 13:00

I do this in VBA/.NET quite often. Assuming you don't want to start writing your own APIs from scratch, you need a hook into the webpage HTML that converts it into objects and methods exposed in your language of preference.

I don't know Python so you will have to figure out that bit yourself.

It all depends on how well the webpage was made. If the HTML tags/objects are properly named, you can generally refer to them directly. If the HTML is crap, like it is where I work, and nothing is tagged, you will have to do something like this, which loops through every HTML element in the page looking for certain key words:

Code:

Private Function ReturnVal(ByRef parDoc As MSHTML.HTMLDocument, ByVal parArg As String) As String

Dim result As String
Dim element As MSHTML.HTMLHtmlElement

Application.EnableEvents = False

result = "{<>}"

For Each element In parDoc.DocumentElement.all
    Debug.Print Trim(element.innerText)
    If Trim(element.innerText) = parArg Then
        If Not element.NextSibling Is Nothing Then
            If Trim(element.NextSibling.innerText) = "Hist" Then
                If Not element.NextSibling.NextSibling Is Nothing Then
                    result = element.NextSibling.NextSibling.innerText
                End If
       
            Else
                If Left(element.NextSibling.innerText, 9) = "Show/Hide" Then
                    result = vbNullString
                Else
                    result = element.NextSibling.innerText
                End If
            End If
        Else
            result = vbNullString
        End If
                    
        Exit For
    End If

Next element

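' Re-enable events, which were switched off at the top of the function
Application.EnableEvents = True

ReturnVal = result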

End Function
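
For anyone following along in Python, here is a rough sketch of the same idea: walk every element, find the one whose text equals a label, and return the text of its next sibling. It uses only the standard library's html.parser. The names (Node, TreeBuilder, value_after_label) and the sample HTML are made up for illustration, and the "Hist" and "Show/Hide" special cases from the VBA above are left out.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not become the "current" node
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []    # child elements in document order
        self.text_parts = []  # all text found anywhere inside this element

    @property
    def text(self):
        return "".join(self.text_parts).strip()

    def next_sibling(self):
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = siblings.index(self)
        return siblings[i + 1] if i + 1 < len(siblings) else None


class TreeBuilder(HTMLParser):
    """Builds a very small element tree; good enough for simple pages."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb up to the matching open tag; ignore stray closing tags
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Roughly like innerText: text counts for the element and its ancestors
        node = self.current
        while node is not None:
            node.text_parts.append(data)
            node = node.parent


def iter_nodes(node):
    for child in node.children:
        yield child
        yield from iter_nodes(child)


def value_after_label(html, label):
    """Text of the element following the first one whose text equals label.

    Returns "" if the label has no next sibling, None if the label is absent.
    """
    builder = TreeBuilder()
    builder.feed(html)
    for node in iter_nodes(builder.root):
        if node.text == label:
            sibling = node.next_sibling()
            return sibling.text if sibling is not None else ""
    return None


SAMPLE = "<table><tr><td>Price</td><td>9.99</td></tr><tr><td>Stock</td></tr></table>"
print(value_after_label(SAMPLE, "Price"))
```

On real, messy pages a proper parser library will cope much better than this hand-rolled tree; the point is only to show the loop-and-match approach.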

Russinating · 9 Mar 2016 at 14:05

I've done it before in PHP, too.

peterwalkley · 9 Mar 2016 at 15:18

I was going to say google "webcrawler <xxx>" where xxx is your preferred language to start from. A webcrawler is a classic learning exercise for most languages these days.
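
To give a flavour of what that exercise looks like, here is a minimal crawler sketch in Python using only the standard library. It is an illustration under assumptions, not a production crawler: the function names, the shop.example URLs and the in-memory FAKE_SITE are all invented, and the fetch argument stands in for whatever you use to download a page (for example a small wrapper around urllib.request.urlopen).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of pages on the same host; returns {url: html}.

    fetch is any callable taking a URL and returning HTML text, or None
    if the page could not be retrieved.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A pretend three-page site so the sketch runs without touching the network
FAKE_SITE = {
    "http://shop.example/": '<a href="/p/1">One</a> <a href="/p/2">Two</a>',
    "http://shop.example/p/1": '<a href="/">Home</a>',
    "http://shop.example/p/2": '<a href="http://other.example/">Elsewhere</a>',
}

pages = crawl("http://shop.example/", FAKE_SITE.get)
print(sorted(pages))
```

Passing the fetch function in keeps the crawl logic separate from the downloading, so you can test it offline and add politeness delays or error handling later.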

When I tried it, I came across this site that looks like it might save you some grief.

Scotty123 · 9 Mar 2016 at 15:19

This is just what I've been working on for the past week at work coincidentally. It's scraping details off a hotel website (rooms available, prices, etc) for displaying in another website. The person then books on our website and the "bot" books for them, auto inserting the affiliate code into the booking process, scraping off the form, navigating their booking system etc.

I use Selenium (Python bindings) with the Chrome driver when testing so I can watch it navigate Chrome visually and pull info for debugging purposes. In production it runs a headless browser (PhantomJS), again driven by Selenium.

With Selenium you can navigate links, click on buttons, fill in and submit forms, and wait for AJAX calls to complete so you can get content that only arrives after the initial page load.

Here's a small snippet that selects some rooms from a hotel and gets their details.

Code:

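# `room` is the div element already located for the room of interest.
# Selenium 3-era calls; Selenium 4 replaced find_element_by_* with find_element(By..., ...).
room_id = room.get_attribute('id')
image = room.find_element_by_tag_name('img').get_attribute('src')
rate = room.find_element_by_css_selector('div.price > span').get_attribute('innerText')
room_type = room.find_element_by_css_selector('div.description > a > span.name').get_attribute('innerText')
# The detailed description lives in a separate element keyed by the room id
suite_details = self.driver.find_element_by_class_name('display_room_' + room_id)
description = suite_details.find_element_by_id('room_des').get_attribute('innerText')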

In the above example I have found a div for the room I am interested in (called room), I then find the img tag within that room div and get the src value. I'm also getting some details, prices etc for said room.

It's very easy with Selenium, a tool typically used in testing: you can do anything a person can, but "scriptomatically".

Selenium can be used in other languages too, e.g. Java, C#. I would rather at work be doing .NET and C# instead of Django and Python, but that's a topic for another discussion.

There is also CasperJS, which uses PhantomJS. This allows you to scrape and navigate pages with JavaScript.

Phantom Shadow · 11 Mar 2016 at 10:36

I've done this with Python before for my dissertation, scraping a bunch of car prices off Parkers.

I used mechanize for the browsing (getting the HTML, logging in, etc.) and then BeautifulSoup for navigating the HTML and finding what I needed.
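
Applied to the original question (counting product pages that show a particular badge image), the "find what you need in the HTML" step might look roughly like the sketch below. To keep it dependency-free it uses the standard library's html.parser rather than mechanize or BeautifulSoup, and it assumes the pages have already been downloaded into a dict. The file names, alt text and class names in the sample pages are guesses; check the real page source to see what actually identifies the badge.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the attributes of every <img> tag on a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Attributes without a value come through as None; store "" instead
            self.images.append({k: (v or "") for k, v in attrs})


def page_has_badge(html, keyword):
    """True if any <img> mentions keyword in its src, alt or class."""
    collector = ImageCollector()
    collector.feed(html)
    keyword = keyword.lower()
    return any(
        keyword in img.get(field, "").lower()
        for img in collector.images
        for field in ("src", "alt", "class")
    )


def count_pages_with_badge(pages, keyword):
    """pages maps URL -> HTML text; returns how many pages show the badge."""
    return sum(page_has_badge(html, keyword) for html in pages.values())


# Hypothetical product pages standing in for the downloaded HTML
SAMPLE_PAGES = {
    "/p/1": '<img src="/img/online-only.png" alt="Online Only">',
    "/p/2": '<img src="/img/shoe.jpg" alt="Shoe"><img class="badge clearance" src="/img/b2.png">',
    "/p/3": "<p>No badges here, just the words online-only in text.</p>",
}

print(count_pages_with_badge(SAMPLE_PAGES, "online-only"))
print(count_pages_with_badge(SAMPLE_PAGES, "clearance"))
```

Note that only img tags are inspected, so the third page does not count even though the phrase appears in its text.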

AHarvey · 11 Mar 2016 at 17:58

I've hand coded web crawlers in Python before, can't remember much now but the main package you'll see being mentioned is something called BeautifulSoup.

As mentioned above, Selenium can also be used.

Scotty123 · 11 Mar 2016 at 18:57

I went for Selenium as I'm scraping a page that's populated through AJAX calls after the initial load has finished. I did look at BeautifulSoup originally but had the most success with Selenium in that area.
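
The waiting part boils down to polling: keep checking for the content until it shows up or you give up. Selenium ships its own explicit-wait helper (WebDriverWait), so the plain-Python sketch below is only an illustration of the idea, not how Selenium implements it. The wait_until function and the FakePage class are hypothetical; with a real browser, the condition would be a function that looks up the element and returns it once present.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Pretends to be a page whose prices arrive after a few checks."""

    def __init__(self, ready_after):
        self.checks = 0
        self.ready_after = ready_after

    def find_prices(self):
        self.checks += 1
        return ["80.00", "120.00"] if self.checks >= self.ready_after else []


page = FakePage(ready_after=3)
prices = wait_until(page.find_prices, timeout=2.0, poll_interval=0.01)
print(prices, "after", page.checks, "checks")
```

A plain HTML parser only ever sees the page as first delivered, which is why content filled in later by AJAX needs a real browser plus a wait like this.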