bash - Filter out HTML code with grep
I am working on a project using a bash shell script. The idea is to grep a page retrieved with wget, in order to pick out a paragraph on the web page. The area I want to copy starts with a
<p><b>
but the paragraph contains other bits of HTML code, such as anchor tags, that I don't want in the output of grep.
I have tried
cat page.html| grep "<p><b>" >grep.txt
and the grep output file contains the paragraph I want. I then ran
cat grep.txt|grep -v '<p>|<b>|<a>' >grep.txt
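but this clears the file and does not read anything. How can I exclude the HTML code?
I am also trying to follow the links in the paragraph with grep, in order to do the same thing for those pages. That is 2 levels deep: the main page, and every sub page that stems from the first paragraph of the main page. I know this is a difficult idea, and I hope I have explained it well enough to get help. If you have any ideas, they would be appreciated.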
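For reference, a note on why the second command in the question leaves an empty file, with a minimal sketch of one alternative. The shell truncates grep.txt for the `>` redirection before grep gets to read it, so grep sees an empty input. Separately, `grep -v` drops whole lines that match; it cannot delete a tag from inside a line it keeps. The sketch below assumes that no tag spans a line break and that the text itself contains no literal `<` or `>`. The names `strip_tags` and `clean.txt` are hypothetical.

```shell
#!/usr/bin/env bash

# strip_tags: read HTML on stdin, write the same text with every <...> tag deleted.
# Assumption: each tag opens and closes on the same line.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Possible usage (commented out so the snippet has no side effects):
# keep only the lines containing "<p><b>", delete the tags inside them,
# and write to a different file than the one being read.
# grep '<p><b>' page.html | strip_tags > clean.txt
```

This is a regex-based shortcut, not an HTML parser, so it is only as reliable as the assumptions listed above.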
Do you have to do this in bash? It seems to me that Python would lend itself to this problem, in particular a library called Beautiful Soup.
I've used it for parsing HTML in the past and it's the easiest tool I could find. It has documentation on dealing with HTML.
Perhaps you could make a standalone Python program that extracts the HTML and echoes the string you're after. The Python code could then be called from inside a bash script if you have bash functions that you want to perform on the string.
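If the script does have to stay in bash, the link-following part of the question could be roughed out as below. This is only a sketch, not a replacement for a real parser such as Beautiful Soup: it assumes double-quoted `href` attributes and a grep that supports `-o` (GNU and BSD grep do, but it is not required by POSIX). The function name `extract_links` is hypothetical.

```shell
#!/usr/bin/env bash

# extract_links: read HTML on stdin, print the value of each href="..." on its own line.
# Assumption: href values are wrapped in double quotes.
extract_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Possible usage (commented out, since it needs page.html and network access):
# take the links from the first matching paragraph of the main page, fetch
# each one, and print the tag-free text of its "<p><b>" lines.
# grep '<p><b>' page.html | head -n 1 | extract_links | while read -r url; do
#     wget -q -O - "$url" | grep '<p><b>' | sed -e 's/<[^>]*>//g'
# done
```

Relative links would still have to be resolved against the main page's URL before wget can fetch them, which is one of the things an HTML library tends to make easier.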