bash - Filter out HTML code with grep


I am working on a project using a bash shell script. The idea is to grep a page retrieved with wget, in order to pick out a certain paragraph on the web page. The area I would like to copy usually starts with a

<p><b> 

but this paragraph also contains other bits of HTML code, such as anchor tags, which I don't want in the output of my grep.
I have tried

cat page.html| grep "<p><b>" >grep.txt 

and then a second grep on the output file, which contains the paragraph I want:

cat grep.txt|grep -v '<p>|<b>|<a>' >grep.txt 

but all this does is clear the file, and it doesn't read anything. How can I exclude the HTML code?

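I am also trying to follow any links in the paragraph that I grep, in order to do the same thing to those pages. Only 2 levels deep: the main page, and whatever sub page(s) stem from the first paragraph of the main page. I know this is a difficult idea, and I hope I have explained it well enough for you to help. If you have any ideas, they would be much appreciated.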

Do you have to do this in bash? It seems to me that Python would lend itself to this problem, in particular a library called Beautiful Soup.

I've used it for parsing HTML in the past, and it's the easiest tool I can find. It has good documentation for dealing with HTML.

Perhaps you could make a standalone Python script that extracts the HTML and echoes the string you're after. That Python script could then be called from inside your bash script if you have bash functions you want to perform on the string.
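
As a rough illustration of that standalone script, here is a minimal sketch. It uses Python's standard-library html.parser rather than Beautiful Soup, purely so that it runs without installing anything. The class name FirstBoldParagraph, the function extract_first_paragraph, and the file name first_paragraph.py are made up for this example, and it assumes the paragraph you want is the first `<p>` whose first child is a `<b>`.

```python
import sys
from html.parser import HTMLParser


class FirstBoldParagraph(HTMLParser):
    """Collect the text and links of the first <p> that opens with <b>.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.state = "seek"  # seek -> opened -> inside -> done
        self.text = []       # text fragments found inside the paragraph
        self.links = []      # href values of <a> tags inside the paragraph

    def handle_starttag(self, tag, attrs):
        if self.state == "done":
            return
        if tag == "p":
            if self.state == "inside":
                # An unclosed <p> ends where the next one begins.
                self.state = "done"
            else:
                self.state = "opened"
        elif self.state == "opened":
            # Only accept the paragraph if <b> is the first tag after <p>.
            self.state = "inside" if tag == "b" else "seek"
        elif self.state == "inside" and tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self.state == "opened" and data.strip():
            # Text between <p> and the first tag: not a <p><b> paragraph.
            self.state = "seek"
        elif self.state == "inside":
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            if self.state == "inside":
                self.state = "done"
            elif self.state == "opened":
                self.state = "seek"


def extract_first_paragraph(html):
    """Return (plain_text, links) for the first <p><b> paragraph."""
    parser = FirstBoldParagraph()
    parser.feed(html)
    parser.close()
    if parser.state not in ("inside", "done"):
        return "", []
    # Collapse runs of whitespace so the text fits on one line.
    text = " ".join("".join(parser.text).split())
    return text, parser.links


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 first_paragraph.py page.html
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        paragraph, hrefs = extract_first_paragraph(handle.read())
    print(paragraph)          # first line: the paragraph without tags
    for href in hrefs:
        print(href)           # following lines: one link per line
```

From the bash script, something like `python3 first_paragraph.py page.html > paragraph.txt` would then give the tag-free paragraph on the first line and the links on the lines after it. Those links could be passed back to wget and the same script run on each downloaded page to cover the second level. Relative links would still need to be resolved against the page URL first, which this sketch does not do.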

