ruby on rails - How do I pull the correct image URL from this Wikipedia table?


I built a scraper to pull information out of a Wikipedia table and upload it to my database. It worked until I realized I was pulling the wrong URL for the images: I wanted the actual image URL "http://upload.wikimedia.org/wikipedia/commons/thumb/3/38/baconbutty.jpg", not the "/wiki/file:baconbutty.jpg" it is apt to give me. Here is my code so far:

    def initialize
      @url = "http://en.wikipedia.org/wiki/list_of_sandwiches"
      @nodes = Nokogiri::HTML(open(@url))
    end

    def summary
      sammich_data = @nodes
      sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')
      sammich_data.search('sup').remove

      sammich_hashes = sammiches.map {|x|
        if content = x.css('td')[0]
          name = content.text
        end
        if content = x.css('td a.image').map {|link| link ['href']}
          image = content[0]
        end
        if content = x.css('td')[2]
          origin = content.text
        end
        if content = x.css('td')[3]
          description = content.text
        end

My issue is with this line:

    if content = x.css('td a.image').map {|link| link ['href']}
      image = content[0]
    end

If I go to td a.image img instead, it just gives me a null entry.

Any suggestions?

Here's how I'd do it (if I were to scrape Wikipedia, which I wouldn't, because they have an API for this stuff):

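    require 'nokogiri'
    require 'open-uri'
    require 'pp'

    # URI.open is open-uri's entry point; plain `open` no longer accepts URLs as of Ruby 3.0.
    doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/list_of_sandwiches"))

    sammich_hashes = doc.css('table.wikitable tr').map { |tr|
      # Each row holds four cells in order: name, image, origin, description.
      name, image, origin, description = tr.css('td,th')
      name, origin, description = [name, origin, description].map{ |n| n && n.text ? n.text : nil }
      # Rows without an <img> (such as the header row) end up with a nil image.
      image = image.at('img')['src'] rescue nil

      {
        name: name,
        origin: origin,
        description: description,
        image: image
      }
    }

    pp sammich_hashes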

Which outputs:

    [
      {:name=>"name", :origin=>"origin", :description=>"description", :image=>nil},
      {
        :name=>"bacon",
        :origin=>"united kingdom",
        :description=>"often served ketchup or brown sauce",
        :image=>"//upload.wikimedia.org/wikipedia/commons/thumb/3/38/baconbutty.jpg/120px-baconbutty.jpg"
      },
      ... [lots removed] ...
      {
        :name=>"zapiekanka",
        :origin=>"poland",
        :description=>"a halved baguette or other bread topped mushrooms , cheese, ham or other meats, , vegetables",
        :image=>"//upload.wikimedia.org/wikipedia/commons/thumb/1/12/zapiekanka_3..jpg/120px-zapiekanka_3..jpg"
      }
    ]
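The image values in this output are protocol-relative thumbnail URLs (they start with "//" and end in a "120px-" segment), while the question asked for a full URL. A minimal sketch of one way to post-process them follows. The helper names `absolute_image_url` and `original_image_url` are hypothetical, and the rule for recovering the original file (drop "/thumb" and the final path segment) is an assumption about how Wikimedia lays out thumbnail paths, so it is worth checking against a few real rows before relying on it.

```ruby
# Prefix a protocol-relative URL ("//host/path") with a scheme.
# Returns nil for a nil src, so rows without an image stay nil.
def absolute_image_url(src, scheme: 'https')
  return nil if src.nil?
  src.start_with?('//') ? "#{scheme}:#{src}" : src
end

# Assumption: thumbnail paths look like
#   .../commons/thumb/<a>/<ab>/<file>/<width>px-<file>
# so removing "/thumb" and the last path segment should give the original file.
def original_image_url(src)
  url = absolute_image_url(src)
  return nil if url.nil?
  return url unless url.include?('/thumb/')
  url.sub('/thumb/', '/').sub(%r{/[^/]+\z}, '')
end

thumb = '//upload.wikimedia.org/wikipedia/commons/thumb/3/38/baconbutty.jpg/120px-baconbutty.jpg'
puts absolute_image_url(thumb)
puts original_image_url(thumb)
```

Either helper could be applied to the image field inside the map block, for example `image: absolute_image_url(image)`.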

If an image isn't available, the field is set to nil in the returned hashes.
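Since the answer points out that Wikipedia has an API for this, here is a rough sketch of asking the MediaWiki API for a file's direct URL instead of scraping it. It uses the `imageinfo` property of the query action with `iiprop=url`. The helper names are hypothetical, and the sample response is hand-written for illustration, not captured from the live API, so treat the response shape as an assumption to confirm against a real call.

```ruby
require 'json'
require 'uri'

# Build a MediaWiki API URL that asks for the direct URL of one file page.
def imageinfo_query_url(file_title)
  params = {
    action: 'query',
    titles: file_title,
    prop: 'imageinfo',
    iiprop: 'url',
    format: 'json'
  }
  "https://en.wikipedia.org/w/api.php?#{URI.encode_www_form(params)}"
end

# Return the first image URL found in a JSON response body, or nil if none.
def image_url_from_response(json_text)
  pages = JSON.parse(json_text).dig('query', 'pages') || {}
  pages.each_value do |page|
    info = page['imageinfo']
    return info.first['url'] if info && !info.empty?
  end
  nil
end

# Hand-written sample in the assumed response shape (not real API output).
sample = '{"query":{"pages":{"-1":{"title":"File:Baconbutty.jpg",' \
         '"imageinfo":[{"url":"https://upload.wikimedia.org/wikipedia/commons/3/38/baconbutty.jpg"}]}}}}'

puts imageinfo_query_url('file:baconbutty.jpg')
puts image_url_from_response(sample)
```

In a real script the response body would come from fetching the built URL, for instance with `URI.open(imageinfo_query_url(title)).read` after requiring open-uri.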

