Matching a fixed number of digits when crawling web content

2024-5-25

I am trying to parse the source of some web pages to find every href that looks like this:

href='http://example.org/index.php?showtopic=509480

Here the number after showtopic= is random, but it is always exactly six digits long (e.g. 123456 - 654321).

while read -r line
do
    source=$(curl -L line) #is this the right way to parse the source?
    grep "href='http://example.org/index.php?showtopic=" >> output.txt 
done <file.txt #file contains a list of web pages

How can I catch all of these lines when I don't know the number in advance? Could I run a second grep with a regex? I would like to use a range in awk, something like this:

awk "'/href='http://example.org/index.php?showtopic=/,/^\s/'" >> file.txt

Or chain two greps like this:

grep "href='http://example.org/index.php?showtopic=" | grep -e ^[0-9]{1,6}$ >> output.txt

Answer 1

cat input.txt |grep "href='http://example.org/index.php?showtopic=" > output.txt

cat prints the contents of the file, which are piped into grep. grep checks the input line by line and writes every whole line that matches to the output file.
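The grep command above matches any line containing the prefix, whatever follows showtopic=. As a minimal sketch of restricting it to exactly six digits, the pattern can be extended with an ERE digit count. The file name sample.html and its contents are made up for illustration, and the trailing ([^0-9]|$) assumes the number is followed by a non-digit or the end of the line. Note that the second grep in the question, ^[0-9]{1,6}$, is anchored to the whole line, so it would only match lines consisting of nothing but digits.

```shell
# Hypothetical sample input standing in for a saved page.
cat > sample.html <<'EOF'
<a href='http://example.org/index.php?showtopic=509480'>six digits</a>
<a href='http://example.org/index.php?showtopic=12345'>five digits</a>
<a href='http://example.org/index.php?showtopic=1234567'>seven digits</a>
<a href='http://example.org/about.php'>unrelated</a>
EOF

# -E enables extended regular expressions, so {6} means "exactly six times".
# The dots and the question mark are escaped so they match literally.
# ([^0-9]|$) rejects numbers that continue past six digits.
grep -E "href='http://example\.org/index\.php\?showtopic=[0-9]{6}([^0-9]|\$)" sample.html > output.txt

cat output.txt
```

Like the original command, this writes whole matching lines to output.txt; only the six-digit line from the sample should survive.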

Alternatively, you can use sed:

 sed -n "\#href='http://example.org/index.php?showtopic=#p"  input.txt >  output-sed.txt
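Both commands above read from a file that has already been saved. For the loop in the question, which fetches each page listed in file.txt, one possible sketch is to pipe curl straight into the filter instead of storing the page in a variable. The function names extract_topic_links and fetch_all are hypothetical, and the pattern again assumes the six digits are followed by a non-digit or the end of the line.

```shell
# Print each six-digit showtopic URL found in the HTML read from stdin.
# The first grep keeps the character after the number so that longer
# numbers are rejected; the second grep trims that character off again.
extract_topic_links() {
    grep -oE "http://example\.org/index\.php\?showtopic=[0-9]{6}([^0-9]|\$)" |
        grep -oE "http://example\.org/index\.php\?showtopic=[0-9]{6}"
}

# Fetch every page listed in the file given as $1 (one URL per line)
# and print the matching links from each of them.
fetch_all() {
    while IFS= read -r url; do
        # "$url" expands the loop variable; -s hides the progress meter,
        # -L follows redirects.
        curl -sL "$url" | extract_topic_links
    done < "$1"
}

# Example invocation (needs network access):
# fetch_all file.txt | sort -u > output.txt
```

Using grep -o prints only the matched URLs rather than whole lines, which may or may not be what is wanted; drop -o and the second grep to keep full lines as in the commands above.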
