Basic web scraping from the CLI

Question 1

With

curl "http://clojurescript.net/" | scrape -be '//body/script' | xml2json | jq '.html.body.script[].src'

you get

"http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"
"http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js"
"http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js"
"http://kanaka.github.io/cljs-bootstrap/web/repl-web.js"
"http://kanaka.github.io/cljs-bootstrap/web/repl-main.js"

The tools used are:

the awesome jq: https://stedolan.github.io/jq/;
scrape: https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/scrape;
xml2json: https://github.com/Inist-CNRS/node-xml2json-command.

Or using:

curl "http://clojurescript.net/" | hxnormalize -x | hxselect -i 'body > script' |  grep -oP '(http:.*?)(")' | sed 's/"//g'

you get:

http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js
http://kanaka.github.io/cljs-bootstrap/web/jqconsole.min.js
http://kanaka.github.io/cljs-bootstrap/web/jq_readline.js
http://kanaka.github.io/cljs-bootstrap/web/repl-web.js
http://kanaka.github.io/cljs-bootstrap/web/repl-main.js
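
For anyone unsure what the grep and sed stage of the second pipeline does, here is a rough Python sketch of the same idea. The function name and the sample markup are made up for illustration. Like the grep pattern, it only catches URLs that start with http: and are followed by a double quote on the same line, so https: or single-quoted attributes would be missed.

```python
import re

# Hypothetical sample markup, standing in for the output of hxselect.
SAMPLE = (
    '<script src="http://a.example/x.js"></script>\n'
    '<script src="http://b.example/y.js"></script>\n'
)


def extract_http_urls(markup):
    """Return every http: URL that is terminated by a double quote.

    Mirrors grep -oP '(http:.*?)(")' followed by sed 's/"//g':
    the lazy match stops at the first double quote, and the lookahead
    keeps that quote out of the result instead of stripping it later.
    """
    return re.findall(r'http:.*?(?=")', markup)


if __name__ == "__main__":
    for url in extract_http_urls(SAMPLE):
        print(url)
```

This is regex matching on markup, not real parsing, so it is only reasonable for quick one-off extraction like the pipeline above.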

Answer


Question 2

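I don't know of any standalone utility that can parse HTML. There are some utilities for XML, but I don't think any of them is easy to use.

Many programming languages have libraries for parsing HTML. Most Unix systems have Perl or Python. I recommend using Python's BeautifulSoup or Perl's HTML::TreeBuilder. Of course, you can use another language if you prefer (Ruby's Nokogiri, etc.).

Here is a Python one-liner that combines downloading and parsing: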

python2 -c 'import codecs, sys, urllib, BeautifulSoup; html = BeautifulSoup.BeautifulSoup(urllib.urlopen(sys.argv[1])); sys.stdout.writelines([e["src"] + "\n" for e in html.findAll("script")])' http://clojurescript.net/

Or, more readably, as a few lines of code:

python2 -c '
import sys, urllib, BeautifulSoup
# Download the page given as the first argument and parse it
html = BeautifulSoup.BeautifulSoup(urllib.urlopen(sys.argv[1]))
# Print the src attribute of every script element
scripts = html.findAll("script")
for e in scripts: print(e["src"])
' http://clojurescript.net/
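
The snippets above need Python 2 and the old BeautifulSoup 3 module. If only Python 3 is available, something along these lines should work with just the standard library. This is a minimal sketch, not a drop-in replacement: the names ScriptSrcCollector and script_sources are made up here, the page is assumed to decode as UTF-8, and script tags without a src attribute are skipped rather than raising an error.

```python
import sys
import urllib.request
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every script tag (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; inline scripts have no src
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def script_sources(html_text):
    """Return the src values of all script tags in html_text, in order."""
    parser = ScriptSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


if __name__ == "__main__" and len(sys.argv) > 1:
    # Download the page given as the first argument (UTF-8 is an assumption)
    with urllib.request.urlopen(sys.argv[1]) as response:
        page = response.read().decode("utf-8", errors="replace")
    for src in script_sources(page):
        print(src)
```

Saved as, say, scripts.py (a hypothetical file name), it would be run as python3 scripts.py http://clojurescript.net/ and print one src per line.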

Answer