Ruby on Rails - How do I pull the correct image URL from this Wikipedia table?
I built a scraper to pull information out of a Wikipedia table and upload it to my database. All was well until I realized I was pulling the wrong URL for the images: I wanted the actual image URL, "http://upload.wikimedia.org/wikipedia/commons/thumb/3/38/baconbutty.jpg", and not the "/wiki/file:baconbutty.jpg" it was apt to give me. Here is my code so far:
    def initialize
      @url = "http://en.wikipedia.org/wiki/list_of_sandwiches"
      @nodes = Nokogiri::HTML(open(@url))
    end

    def summary
      sammich_data = @nodes
      sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')
      sammich_data.search('sup').remove
      sammich_hashes = sammiches.map {|x|
        if content = x.css('td')[0]
          name = content.text
        end
        if content = x.css('td a.image').map {|link| link['href']}
          image = content[0]
        end
        if content = x.css('td')[2]
          origin = content.text
        end
        if content = x.css('td')[3]
          description = content.text
        end

My issue is with this line:

    if content = x.css('td a.image').map {|link| link['href']}
      image = content[0]
    end

If I go with 'td a.image img' instead, it gives me a null entry.

Any suggestions?
Here's how I'd do it (if I were to scrape Wikipedia, which I wouldn't, because they have an API for this stuff):
    require 'nokogiri'
    require 'open-uri'
    require 'pp'

    # URI.open is needed on Ruby 3.0+, where a bare open() no longer fetches URLs.
    doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/list_of_sandwiches"))

    sammich_hashes = doc.css('table.wikitable tr').map { |tr|
      # Each row has four cells, in this order (the header row uses th instead of td).
      name, image, origin, description = tr.css('td,th')
      name, origin, description = [name, origin, description].map{ |n| n && n.text ? n.text : nil }
      # Read the src of the img inside the cell; rows without an image get nil.
      image = image.at('img')['src'] rescue nil
      { name: name, origin: origin, description: description, image: image }
    }

    pp sammich_hashes

Which outputs:
    [
      {:name=>"name", :origin=>"origin", :description=>"description", :image=>nil},
      {
        :name=>"bacon",
        :origin=>"united kingdom",
        :description=>"often served ketchup or brown sauce",
        :image=>"//upload.wikimedia.org/wikipedia/commons/thumb/3/38/baconbutty.jpg/120px-baconbutty.jpg"
      },
      ... [lots removed] ...
      {
        :name=>"zapiekanka",
        :origin=>"poland",
        :description=>"a halved baguette or other bread topped mushrooms , cheese, ham or other meats, , vegetables",
        :image=>"//upload.wikimedia.org/wikipedia/commons/thumb/1/12/zapiekanka_3..jpg/120px-zapiekanka_3..jpg"
      }
    ]

If an image isn't available, the field is set to nil in the returned hashes.
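
Note that the src values in that output are protocol-relative (they start with "//") and point at 120px thumbnails, so they may need a little post-processing before they go into a database. Here is a minimal sketch of that step using only the standard library. The helper names absolute_image_url and full_size_image_url are made up for this example, and the second one assumes that Wikimedia thumbnail URLs always look like ".../thumb/<path to file>/<size>px-<file name>", which is inferred from the sample output rather than from any documented guarantee.

```ruby
require 'uri'

# The page the rows were scraped from; relative URLs are resolved against it.
PAGE_URL = "http://en.wikipedia.org/wiki/list_of_sandwiches"

# Hypothetical helper: turn a protocol-relative or relative src into an
# absolute URL. Returns nil when the row had no image.
def absolute_image_url(src, page_url = PAGE_URL)
  return nil if src.nil?
  URI.join(page_url, src).to_s
end

# Hypothetical helper: strip the "/thumb" segment and the trailing
# "<size>px-<name>" segment to get the full-size file URL.
# ASSUMPTION: the thumbnail URL layout described in the text above.
# URLs without a "/thumb/" segment are returned unchanged.
def full_size_image_url(url)
  return nil if url.nil?
  url.sub(%r{/thumb/(.+)/[^/]+\z}, '/\1')
end

src   = "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/baconbutty.jpg/120px-baconbutty.jpg"
thumb = absolute_image_url(src)
full  = full_size_image_url(thumb)

puts thumb
puts full
```

In the scraper above, the same helpers could be applied to the image value just before the hash is built. Since the thumbnail layout is only an assumption, asking the API for the file URL would be the more robust route.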