An easy way to extract data from HTML

Question 1

You can use sed for this:

$ cat test

<td><a href="http://help.domain.com " target="_blank">help.domain.com</a></td>
<td><a href="http://hello.domain.com " target="_blank">hello.domain.com</a></td>
<td><a href="http://test.domain.com " target="_blank">test.domain.com</a></td>

$ sed 's/^.*">//;s/<.*//' test

help.domain.com
hello.domain.com
test.domain.com
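
For readers who want to see what the two substitutions do, here is a rough Python sketch that mirrors them with re.sub. The function name link_text and the sample variable are made up for illustration; this is not part of the sed answer itself.

```python
import re

def link_text(line: str) -> str:
    """Apply the same two substitutions as sed 's/^.*">//;s/<.*//' to one line."""
    # s/^.*">// : the greedy .* removes everything up to the LAST '">' on the line
    line = re.sub(r'^.*">', '', line)
    # s/<.*//  : removes everything from the first remaining '<' to the end
    return re.sub(r'<.*', '', line)

sample = '<td><a href="http://help.domain.com " target="_blank">help.domain.com</a></td>'
print(link_text(sample))
```

Because the first pattern is greedy, a line that holds two links would yield only the text of the last one, so this approach relies on one link per line.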

Question 2

You can use awk:

awk -F'">|</' '{ print $2 }' file

Output:

help.domain.com
hello.domain.com
test.domain.com
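
To make the field splitting visible, the following Python sketch splits a line on the same separator regex and picks the second field. The names split_fields and second_field are hypothetical helpers, used only to illustrate what -F'">|</' and $2 amount to.

```python
import re

def split_fields(line: str) -> list:
    """Split a line on the separator regex used in awk -F'">|</'."""
    return re.split(r'">|</', line)

def second_field(line: str) -> str:
    """Return what awk would print for $2 (empty string if the field is missing)."""
    fields = split_fields(line)
    # awk fields are 1-based, so $2 is index 1 in the Python list
    return fields[1] if len(fields) > 1 else ""

sample = '<td><a href="http://hello.domain.com " target="_blank">hello.domain.com</a></td>'
print(split_fields(sample))
print(second_field(sample))
```

The link text ends up in the second field only because the first separator on the line is the "> that closes the a tag; markup with other quoted attributes ending in "> before the link would shift the fields.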

Question 3

Maybe try lynx:

lynx -dump -listonly -nonumbers  http://example.com/data/123 | awk -F'[/:]+' '{print $2}'

cat file.html

<td><a href="http://help.example.com " target="_blank">help.example.com</a></td>
<td><a href="http://hello.example.com " target="_blank">hello.example.com</a></td>
<td><a href="http://test.example.com " target="_blank">test.example.com</a></td>

lynx -dump -listonly -nonumbers  file.html | awk -F'[/:]+' '{print $2}'

Output:

help.example.com
hello.example.com
test.example.com
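
The awk step only turns each URL that lynx lists into a host name. As a rough illustration, this Python sketch does the same split on [/:]+ and, for comparison, uses urllib.parse.urlsplit from the standard library. The function names host_by_split and host_by_urlsplit are invented for this example.

```python
import re
from urllib.parse import urlsplit

def host_by_split(url: str) -> str:
    """Mimic awk -F'[/:]+' '{print $2}': split on runs of '/' and ':'."""
    fields = re.split(r'[/:]+', url.strip())
    # For "http://host/path" the fields are "http", "host", "path", ...
    return fields[1] if len(fields) > 1 else ""

def host_by_urlsplit(url: str) -> str:
    """Let the URL parser find the host instead of counting fields."""
    return urlsplit(url.strip()).hostname or ""

for url in ["http://help.example.com/", "http://example.com:8080/data/123"]:
    print(host_by_split(url), host_by_urlsplit(url))
```

Both give the same host for ordinary http URLs. They differ for links without a host part such as mailto: addresses, where the field split still returns whatever follows the colon.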

Question 4

If this is a one-off job, the other answers may be fine.

In every other case, use a proper XML or HTML parser!

For example, BeautifulSoup:

curl -X POST http://example.com/data/123 | python -c '
from bs4 import BeautifulSoup
import sys
soup=BeautifulSoup(sys.stdin,"lxml")
for a in soup.find_all("a"):
  print(a.string)
'

Output:

help.example.com
hello.example.com
test.example.com

You may need to install bs4 with pip.

Of course, you don't need curl; you can request the page directly from python.
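
As a minimal sketch of fetching and parsing from Python alone, the example below uses only the standard library (urllib.request and html.parser) as a stand-in for bs4, so nothing has to be installed. The names LinkTextParser, link_texts and fetch_link_texts are hypothetical, and urlopen sends a plain GET here, unlike the -X POST in the curl command above.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkTextParser(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._buf = None  # None while outside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buf is not None:
            self.texts.append("".join(self._buf).strip())
            self._buf = None

def link_texts(html: str) -> list:
    """Return the text of every link in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts

def fetch_link_texts(url: str) -> list:
    """Download a page with a GET request and return its link texts."""
    with urlopen(url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return link_texts(resp.read().decode(charset, errors="replace"))
```

Calling fetch_link_texts("http://example.com/data/123") would print nothing by itself; loop over the returned list to print one link text per line, as the BeautifulSoup version does.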

Answer