Site scraping considerations/questions

This is probably a common problem; I just haven’t looked for resources from people who have done it yet.

I’m going to write a Ruby web service that will scrape a number of sites and keep a massive database of the collected information up to date.

Why? Neither of the sites in question actually has an API, and their search mechanisms are lackluster.

There are two concerns I have, which are semi-related.

  1. Keeping the data up to date, or synced:
    a) Handling DOM changes - this I think I can achieve with Capybara/unit testing in general (a rough sketch of what I mean follows this list).

  2. I want to be a “good web citizen” and not hammer the site(s) with requests if I don’t have to. This also raises the possibility of IP bans if they have something like fail2ban or Cloudflare in place. It feels like I might be re-creating the Google indexing wheel. This ties into keeping the data up to date as well: honestly, the data probably isn’t updated often enough for it to matter, but I still need to gather a strong baseline.
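For 1a, here is roughly the kind of smoke test I have in mind - a minimal sketch using Minitest and plain Nokogiri rather than Capybara, where the URL and selectors are placeholders for the real sites:

# Hypothetical markup smoke test: fails fast when the target site's DOM drifts.
# LISTING_URL and the CSS selectors below are placeholders, not a real site.
require "minitest/autorun"
require "nokogiri"
require "open-uri"

LISTING_URL = "http://www.example.com/listings"

class MarkupSmokeTest < Minitest::Test
  def setup
    @doc = Nokogiri::HTML(URI.open(LISTING_URL))
  end

  def test_expected_selectors_still_match
    # If any of these come back empty, the markup changed and the scraper's
    # selectors need updating before the next sync run.
    refute_empty @doc.css("ul.results li"), "result list selector broke"
    refute_empty @doc.css("ul.results li a.title"), "title link selector broke"
  end
end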

Thanks, and happy hacking.


I use Nokogiri on a couple of projects - though mainly to check things like website ownership or links back to us.

I think the first thing you’d want to do is check the site’s policy on web scraping (robots.txt) and also look at how big the sites are (and whether they are on meaty servers or not) so you can determine how many concurrent requests to make to their server.
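If you want to automate that robots.txt check, the webrobots gem (the parser Mechanize itself depends on) will do it for you - a quick sketch, with the URL and user-agent string made up:

# Sketch: honour robots.txt before fetching. URL and UA are examples only.
require "webrobots"

robots = WebRobots.new("MyScraperBot/1.0") # your bot's User-Agent

url = "http://www.example.com/some/listing"
if robots.allowed?(url)
  # safe to fetch
else
  # skip it - the site has asked crawlers to stay away
end

# Crawl-delay, where a site specifies one, is exposed too - useful for pacing:
delay = robots.crawl_delay("http://www.example.com/")
sleep(delay) if delay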

With regards to keeping data up to date, Nokogiri will load the entire document anyway, so I would probably just put the latest version in the DB and move the older version to a backup. You could then compare the two and determine an average age of update (and then change your frequency of scraping accordingly).
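Something like this, say - a rough sketch where Page is a made-up ActiveRecord model with url, html, previous_html, digest and content_changed_at columns:

# Keep the newest copy plus a backup, and record when content actually
# changed so the scrape frequency can be tuned from real update ages.
require "digest"

def store_scrape(url, html)
  page = Page.find_or_initialize_by(url: url)
  new_digest = Digest::SHA256.hexdigest(html)

  if page.digest != new_digest
    page.previous_html = page.html     # move the older version to backup
    page.html = html
    page.digest = new_digest
    page.content_changed_at = Time.now # feeds the average-age-of-update calc
  end

  page.save!
end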

Re handling DOM changes, this is one of the pains of not using an API; you’ll probably just have to get your app to let you know when it encounters a change/break.

The other thing to factor in is copyright - most sites frown upon their content being reproduced on another publicly available service, and of course you are probably already aware that Google penalises you for duplicate content. I should also mention the risk of getting sued (newspapers in particular are known to take action over content being reproduced).

I’m sure @danielpclark and a few others have used Nokogiri so hopefully they might be able to chime in too :slight_smile:


I was actually planning on using Nokogiri - I have a fair amount of experience with it - but your input in general is great. Sincere thanks. :smile:


Use Mechanize. It has Nokogiri deeply coupled in and gives you more powerful and efficient methods for scraping. Every returned object carries a Mechanize instance, and Mechanize will follow redirects (meta refreshes included) via an option parameter, which you can’t do with Nokogiri alone.

Mechanize.new.tap { |i| i.follow_meta_refresh = true }.get("http://www.example.com")
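Once you have an agent, pages behave like Nokogiri documents with browser-style navigation on top. A small sketch (the selector and link text here are placeholders):

require "mechanize"

agent = Mechanize.new
agent.follow_meta_refresh = true
agent.user_agent = "MyScraperBot/1.0" # identify yourself politely

page = agent.get("http://www.example.com/")

# Mechanize pages are backed by Nokogiri, so the usual selectors work:
titles = page.search("h2.title").map(&:text)

# ...and you can also navigate like a browser:
link = page.link_with(text: "Next")
next_page = link.click if link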

If you want to see a library I’ve written with Mechanize, check out my YouTube utility, which lets you query video data with a simple Ruby method call. That library has tests to verify the results.

When picking what data to scrape, the Firefox inspector (Ctrl-Shift-I) will let you highlight what you want and get the exact CSS selector for your selection. You can be very specific, like so:

hd: !!result.css("div:nth-child(1)").css("div:nth-child(2)").css("div:nth-child(4)").css("ul:nth-child(1)").text["HD"]

Or you can cover a more general area and use a regex to match the exact pattern of the inner text you want:

user_name: query_result.css('li').css('a').select{|i| i.text =~ / by [a-z0-9]{1,}/i}.map {|i| i.text.match(/by ([a-z0-9]{1,})/i)[1]}.first

As far as rate limiting your requests goes, you may need to create a service that queues them with a maximum number per time window, based on the site’s restrictions. I would think this queue would normally be empty, so a request would be handled immediately; but when it isn’t, requests will wait their turn, which helps protect your app/service from being banned.
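A minimal sketch of that idea, using the thread-safe Queue from Ruby’s standard library; the delay constant is a made-up placeholder you’d tune per site:

# One worker drains a FIFO of URLs, pausing between requests so the
# target server never sees more than one request per MIN_INTERVAL.
require "net/http"

class RateLimitedFetcher
  MIN_INTERVAL = 5 # seconds between requests; tune per site (placeholder)

  def initialize
    @queue = Queue.new # thread-safe FIFO from the stdlib
    @worker = Thread.new { drain }
  end

  def enqueue(url, &on_done)
    @queue << [url, on_done]
  end

  private

  def drain
    loop do
      url, on_done = @queue.pop # blocks while the queue is empty
      response = Net::HTTP.get_response(URI(url))
      on_done.call(response) if on_done
      sleep MIN_INTERVAL # be a good web citizen
    end
  end
end

# Usage: an empty queue means the request runs immediately; otherwise it waits.
fetcher = RateLimitedFetcher.new
fetcher.enqueue("http://www.example.com/page/1") { |res| puts res.code }
sleep 6 # keep the script alive for the worker (a real service runs forever)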


Thanks again, @danielpclark, for the reminder about Mechanize. Thankfully I also have experience using that gem, and it is awesome!

Your rate-limiting advice is what I was really looking for, and it is very apt. I’m giving it a shot soon!


Please post your findings!
