bash - Filter out HTML code with grep

March 15, 2015

I am working on a project using a bash shell script. The idea is to grep a page retrieved with wget in order to pick out a paragraph on the web page. The area I want to copy starts with

<p><b>

but the paragraph contains other bits of HTML code, such as anchor tags, that I don't want in the output of grep.
I have tried

cat page.html| grep "<p><b>" >grep.txt

and the grep output file contains the paragraph I want. I then tried

cat grep.txt|grep -v '<p>|<b>|<a>' >grep.txt

but that clears the file and does not read anything. How can I exclude the HTML code?
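A note on why the file ends up empty: the shell truncates grep.txt when it sets up the `>` redirect, usually before cat has read anything from it. Separately, grep -v drops whole matching lines rather than removing tags inside a line, and without -E the `|` characters in the pattern are matched literally. The following is a minimal sketch of one alternative, not a definitive fix. It swaps the second grep for sed, and it assumes the paragraph sits on a single line and that a crude regex-based tag strip is acceptable. The function names strip_tags and extract_paragraph and the output file name paragraph.txt are hypothetical.

```shell
#!/usr/bin/env bash

# strip_tags (hypothetical helper): delete anything that looks like an
# HTML tag from standard input. This is a crude regex, not an HTML parser.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# extract_paragraph (hypothetical helper): keep only the lines of file $1
# that contain the "<p><b>" marker, then strip the tags from them.
extract_paragraph() {
    grep '<p><b>' "$1" | strip_tags
}

# Usage (file names are assumptions). Note that the output goes to a
# DIFFERENT file than the one being read, so nothing is truncated early:
#   extract_paragraph page.html > paragraph.txt
```

Writing to a separate file avoids the truncation problem entirely. A regex like this can misfire on tags that span several lines or on `>` characters inside attribute values, which is one reason a real HTML parser is often suggested for this kind of job.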

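I am also trying to follow the links in the paragraph with grep, in order to do the same thing for those pages. This goes 2 levels deep: the main page, and every sub page that stems from the first paragraph of the main page. I know this is a difficult idea, but I hope I have explained it well enough to get help. If you have any ideas, they would be appreciated.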
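The link-following step described above could be sketched roughly as follows. This is only an illustration under several assumptions: the links use double-quoted href attributes, they are site-relative (start with "/"), and the paragraph is on one line. The function names extract_links and follow_links, the base URL argument, and the sub_N.html file names are all hypothetical.

```shell
#!/usr/bin/env bash

# extract_links (hypothetical helper): print the href value of every
# <a ... href="..."> found on the "<p><b>" lines of the HTML file $1.
extract_links() {
    grep '<p><b>' "$1" \
        | grep -o '<a [^>]*href="[^"]*"' \
        | sed -e 's/.*href="//' -e 's/"$//'
}

# follow_links (hypothetical helper): download each linked page.
# $1 is an assumed base URL (for example https://example.org),
# $2 is the already downloaded main page.
follow_links() {
    base=$1
    n=0
    extract_links "$2" | while IFS= read -r link; do
        n=$((n + 1))
        # -q: quiet, -O: write the download to the named file
        wget -q -O "sub_$n.html" "$base$link"
    done
}

# Usage (names are assumptions):
#   follow_links https://example.org page.html
```

Each downloaded sub_N.html file could then be passed through the same paragraph extraction as the main page, which gives the two levels described in the question.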

Do you have to do this in bash? It seems to me that Python would lend itself to this problem, in particular a library called Beautiful Soup.

I've used it for parsing HTML in the past, and it's the easiest tool I could find. It has documentation for dealing with HTML.

Perhaps you could make a standalone piece of Python code that extracts the HTML and echoes the string you're after. That Python code could then be called inside your bash script if you have bash functions you want to perform on the string.
