linux - how to parse documents using crawlers -


I am new to this topic. My requirement is to parse documents of different types (HTML, PDF, TXT) using crawlers. Please suggest which crawler to use for this requirement, and provide tutorials or examples of how to parse documents using crawlers.

Thank you.

This is a broad question, so this answer is also broad and only touches the surface.
It comes down to two steps: (1) extracting the data from its source, and (2) matching and parsing the relevant data.

1a. Extracting data from the web

There are many ways to scrape data from the web. Different strategies can be used depending on whether the source is static or dynamic.

If the data is on static pages, you can download the HTML source of the pages (automated, not manually) and then extract the data out of the HTML source. Downloading the HTML source can be done with many different tools (in many different languages); something as simple as wget or curl will do.
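As a minimal sketch of that download step: the curl/wget invocations below use a placeholder URL, and the runnable part fetches a local file via a file:// URL purely so the example works without network access.

```shell
# In practice you would point curl or wget at the real page, e.g.:
#   curl -s -o page.html https://example.com/data    # placeholder URL
#   wget -O page.html https://example.com/data
# For a self-contained demo, create a fake "page" and fetch it locally.
cat > sample.html <<'EOF'
<html><body><h1>Price list</h1><p>Widget: $4.99</p></body></html>
EOF

# Save the HTML source to disk, ready for step 2 (parsing).
curl -s -o page.html "file://$PWD/sample.html"

grep 'Widget' page.html
```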

If the data is on a dynamic page (for example, if the data sits behind forms and needs a database query to view it), a good strategy is to use an automated web scraping or testing tool. There are many of these; see this list of automated data collection resources [1]. If you use such a tool, you can extract the data right away; you don't have the intermediate step of explicitly saving the HTML source to disk and parsing it afterwards.

1b. Extracting data from a PDF

Try Tabula first. It's an open-source web application that lets you visually extract tabular data from PDFs.

If your PDF doesn't have its data neatly structured in simple tables, or you have too much data for Tabula to be feasible, then I recommend using the *nix command-line tool pdftotext for converting Portable Document Format (PDF) files to plain text.

Use the command man pdftotext to see the manual page for the tool. One useful option is the -layout option, which tries to preserve the original layout in the text output. The default is to "undo" the physical layout of the document and instead output the text in reading order.
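The conversion itself is a single command (report.pdf and report.txt below are placeholder names). To keep the sketch self-contained, the heredoc hand-writes a tiny one-page PDF; normally you would already have the PDF file.

```shell
# Write a minimal hand-made PDF just so this demo has input to work on.
cat > report.pdf <<'EOF'
%PDF-1.4
1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]
  /Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >> endobj
4 0 obj << /Length 48 >> stream
BT /F1 12 Tf 72 720 Td (hello from a pdf) Tj ET
endstream
endobj
5 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj
trailer << /Root 1 0 R /Size 6 >>
%%EOF
EOF

# Convert to plain text, trying to preserve the original layout.
pdftotext -layout report.pdf report.txt
cat report.txt
```

Without -layout, pdftotext instead emits the text in reading order, which is often easier to parse in step 2.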

1c. Extracting data from a spreadsheet

Try xls2text for converting to text.

2. Parsing the (HTML/text) data

For parsing the data, there are many options. For example, you can use a combination of grep and sed, or the BeautifulSoup Python library if you're dealing with HTML source. But don't limit yourself to these options; you can use whatever language or tool you're familiar with.
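For instance, a rough grep-and-sed sketch (page.html and its contents are invented for illustration) that isolates the link targets from saved HTML:

```shell
# Invented sample input standing in for HTML you saved in step 1.
cat > page.html <<'EOF'
<html><body>
<a href="https://example.com/a">First</a>
<a href="https://example.com/b">Second</a>
</body></html>
EOF

# grep -o keeps only the matched text; sed strips the surrounding markup.
grep -o '<a href="[^"]*"' page.html | sed 's/^<a href="//; s/"$//' > links.txt
cat links.txt
```

This prints the two URLs, one per line. For anything beyond simple, regular markup, an HTML-aware parser like BeautifulSoup is the more robust choice.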

When you're parsing and extracting the data, you're essentially doing pattern matching. Look for unique patterns that make it easy to isolate the data you're after.

One method is of course regular expressions. Say you want to extract email addresses from a text file named file.

egrep -io "\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\b" file 

The above command will print the email addresses [2]. If you instead want to save them to a file, append > filename to the end of the command.
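A concrete run of that command (the contents of file are made up for illustration):

```shell
# Invented input file containing some email addresses.
cat > file <<'EOF'
Contact alice@example.com or Bob <BOB@test.org> for details.
Nothing to see here.
EOF

# Print the matches (-i: case-insensitive, -o: only the matched text)...
egrep -io "\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\b" file
# ...or save them to a file instead by redirecting the output.
egrep -io "\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}\b" file > emails.txt
```

Here the first command prints alice@example.com and BOB@test.org, one per line.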


[1] Note that this is not an exhaustive list; it's missing many options.
[2] That regular expression isn't bulletproof; there are extreme cases it doesn't cover. Alternatively, you can use a script I've created that is better at extracting email addresses from text files. It's more accurate at finding email addresses, easier to use, and you can pass it multiple files at once. You can access it here: https://gist.github.com/dideler/5219706

