Help with diagnosing 403 Forbidden error from wget command

Associate | Joined: 2 Sep 2021 | Posts: 2 | Location: Oxfordshire
Hi there,

When I run the following wget command, I get a 403 Forbidden error, and I can't work out why.

wget --random-wait --wait=1 --no-directories \
     --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" \
     --no-parent --span-hosts --accept jpeg,jpg,bmp,gif,png \
     --secure-protocol=auto --referer=https://pixabay.com/images/search/ \
     --recursive --level=2 -e robots=off \
     --load-cookies cookies.txt --input-file=pixabay_background_urls.txt

It returns:

--2021-09-01 18:12:06-- https://pixabay.com/photos/search/wallpaper/?cat=backgrounds&pagi=2
Connecting to pixabay.com (pixabay.com)|104.18.20.183|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-09-01 18:12:06 ERROR 403: Forbidden.


Notes:

- The input file has the URL 'https://pixabay.com/photos/search/wallpaper/?cat=backgrounds&pagi=2', then page 3, page 4, etc., separated by new lines.

- I used the long form of the flags just so I could remember what they were.

- I used a cookie file generated from the website, called 'cookies.txt', and made sure it was up to date.

- I used the referer 'https://pixabay.com/images/search/' that I found by looking at the headers in Google DevTools.

- I'm able to visit these URLs normally without any visible captcha requirements.

- I noticed one of the cookies, __cf_bm, had Secure = TRUE, so it needs to be sent over HTTPS. I'm not sure whether I'm doing that or not (see the debug sketch below).
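
To check whether the cookie and referer are actually going out over HTTPS, wget's --debug option prints the exact request headers it sends. A minimal sketch against a single URL from my input file (assuming this wget build was compiled with debug support):

# Debug output goes to stderr; look for the Referer: and Cookie: lines
# in the "---request begin---" section.
wget --debug \
     --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" \
     --referer=https://pixabay.com/images/search/ \
     --load-cookies cookies.txt \
     "https://pixabay.com/photos/search/wallpaper/?cat=backgrounds&pagi=2" 2>&1 | head -60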

It might not actually be possible; perhaps Cloudflare is the deciding factor. But I'd like to know whether it's something that can be worked around, and whether it's feasible to download a large number of files from this website.

Any solutions, insights, or other ways of downloading large numbers of image files would be very much appreciated. I know Pixabay has an API, which I might use as a last resort, but I think it's very rate-limited.
 
Soldato | Joined: 3 Jun 2005 | Posts: 3,066 | Location: The South
Have you tried cURL to see if you're able to replicate the issue (to rule out wget)? Something like the sketch below should reproduce your exact request.

And is Pixabay behind Cloudflare (or similar)?
I've run into that a few times, where wget/cURL fail when a website (even my own) is behind Cloudflare, so it may be that.
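
This should send the same user agent, referer, and cookie file as your wget command; -v prints the headers sent and the response status so you can see exactly what comes back:

# -A = user agent, -e = referer, -b = read Netscape-format cookie file
curl -v \
     -A "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36" \
     -e "https://pixabay.com/images/search/" \
     -b cookies.txt \
     -o page.html \
     "https://pixabay.com/photos/search/wallpaper/?cat=backgrounds&pagi=2"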
 

Pho | Soldato | Joined: 18 Oct 2002 | Posts: 9,324 | Location: Derbyshire
Yeah, it looks like they're using Cloudflare. We had a similar problem the other day trying to talk to an API behind Cloudflare from a Python script (using the requests library). In our case, cloudflare-scrape seemed to beat it, though.

As it happened, cURL was working fine where requests failed.

If you open Chrome > Dev Tools > Network, then navigate to the page, you can right-click the request -> Copy > Copy as cURL to get you started.
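
The copied command comes out looking roughly like this (the header values here are placeholders; yours will be whatever your browser actually sent, including the full cookie string):

curl 'https://pixabay.com/photos/search/wallpaper/?cat=backgrounds&pagi=2' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
  -H 'referer: https://pixabay.com/images/search/' \
  -H 'cookie: __cf_bm=PASTE_YOUR_COOKIE_VALUE_HERE' \
  --compressed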
 
Soldato | Joined: 28 Oct 2006 | Posts: 12,456 | Location: Sufferlandria
You'd probably be better off using the API. Even if you do get it working your way, you'll probably find the website has the same rate-limiting as the API anyway.
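
For what it's worth, the API is just a GET against https://pixabay.com/api/ that returns JSON with direct image URLs. A rough sketch from memory of the docs (YOUR_API_KEY is a placeholder, and I believe the default limit is around 100 requests per minute, with up to 200 results per page):

# Returns JSON; the webformatURL / largeImageURL fields are direct image links
curl "https://pixabay.com/api/?key=YOUR_API_KEY&q=wallpaper&category=backgrounds&image_type=photo&page=2&per_page=200"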
 