Curl a txt file of URLs, but grep each URL separately from a single file.

Question 1

This can be done in two lines.

sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf|xargs -I {} curl -O "{}"
sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf|xargs -I {} basename "{}"|xargs -I {} sed '/mortgage/q' "{}"

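The first sed command on each line extracts the URLs from the URL file (/tmp/curl.conf in the example). The first line uses curl's -O option to save each page to a file named after the page. The second line goes back through each of those files and displays only the text you are interested in. Of course, if the word "mortgage" does not occur in a file, the whole file is printed.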

This leaves one temporary file per URL in the current directory.
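To make the two sed idioms concrete, here is a minimal offline sketch. The sample config contents and example.com URLs are made up for illustration, and `\s` assumes GNU sed.

```shell
#!/bin/bash
# Hypothetical sample config; the URLs are placeholders, nothing is downloaded.
conf=$(mktemp)
cat >"$conf" <<'EOF'
URL = http://example.com/a.html
user-agent = demo
URL = http://example.com/b.html
EOF

# -n suppresses normal output; the trailing p prints only lines where
# the substitution matched, leaving just the captured URL.
urls=$(sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' "$conf")
printf '%s\n' "$urls"

# '/mortgage/q' prints every line up to and including the first match,
# then quits, so later lines are never shown.
truncated=$(printf 'intro\nmortgage rates\nfooter\n' | sed '/mortgage/q')
printf '%s\n' "$truncated"

rm -f "$conf"
```

The first command prints the two URLs, one per line; the second prints `intro` and `mortgage rates` but not `footer`.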

Edit:

Here is a short script that avoids the leftover files and prints the results to stdout, from where you can redirect them if needed.

#!/bin/bash
TMPF=$(mktemp)
# sed extracts the URLs, one per line
sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' /tmp/curl.conf >"$TMPF"
while IFS= read -r URL; do
    # retrieve each web page and drop everything after the first line containing 'mortgage' (substitute whatever test you like)
    curl "$URL" 2>/dev/null | sed '/mortgage/q'
done <"$TMPF"
rm "$TMPF"
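The same loop can probably be written without any temporary file by piping sed straight into `while read`. The sketch below is one way to do that, not the original author's script: `fetch` and `grep_each_url` are hypothetical names, the `== url ==` header is an added convenience for telling the pages apart, and the pattern is assumed not to contain a `/`.

```shell
#!/bin/bash
# fetch is a hypothetical wrapper so the downloader can be swapped out.
# -s silences curl's progress meter.
fetch() { curl -s "$1"; }

# grep_each_url CONF PATTERN
# For every URL in CONF, print a header, then the page up to and
# including the first line matching PATTERN (the whole page if none matches).
grep_each_url() {
    sed -n 's/\s*URL\s*=\s*\(.*\)/\1/p' "$1" |
    while IFS= read -r url; do
        printf '== %s ==\n' "$url"
        # </dev/null keeps the fetcher from consuming the URL list on stdin.
        fetch "$url" </dev/null | sed "/$2/q"
    done
}
```

Something like `grep_each_url /tmp/curl.conf mortgage > results.txt` would then collect all pages in a single file; `results.txt` is just an example name.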

Answer

Question 2

This general trick still works even if the curl config file contains other options (e.g. user-agent, referer, etc.).

As a first step, assume the config file is named curl_config. Then awk '/^[Uu][Rr][Ll]/{print;print "output = dummy/"++k;next}1' curl_config > curl_config2 creates a new curl config file that adds a different, incrementally numbered output file name below each url/URL line.

Example:

[xiaobai@xiaobai curl]$ cat curl_config
URL = "www.google.com"
user-agent = "holeagent/5.0"

url = "m12345.google.com"
user-agent = "holeagent/5.0"

URL = "googlevideo.com"
user-agent = "holeagent/5.0"
[xiaobai@xiaobai curl]$ awk '/^[Uu][Rr][Ll]/{print;print "output = dummy/"++k;next}1' curl_config  > curl_config2 
[xiaobai@xiaobai curl]$ cat curl_config2
URL = "www.google.com"
output = dummy/1
user-agent = "holeagent/5.0"

url = "m12345.google.com"
output = dummy/2
user-agent = "holeagent/5.0"

URL = "googlevideo.com"
output = dummy/3
user-agent = "holeagent/5.0"
[xiaobai@xiaobai curl]$

Then run mkdir dummy to create a directory to hold these temporary files, and start an inotifywait session (replace sed '/google/q' with sed '/mortgage/q'):

[xiaobai@xiaobai curl]$ rm -r dummy; mkdir dummy;
[xiaobai@xiaobai curl]$ rm final 
[xiaobai@xiaobai curl]$ inotifywait -m dummy -e close_write | while read path action file; do echo "[$file]">> final ; sed '/google/q' "$path$file" >> final; echo "$path$file"; rm "$path$file"; done;
Setting up watches.
Watches established.

Open another bash/terminal session, rm the file named final if it exists, then run curl with the curl_config2 file created in the first step above:

[xiaobai@xiaobai curl]$ curl -vLK curl_config2
...processing

Now look at the inotifywait session: as soon as a file is closed after writing, the loop prints its path, runs sed on it, and removes it:

[xiaobai@xiaobai curl]$ inotifywait -m dummy -e close_write | while read path action file; do echo "[$file]">> final ; sed '/google/q' "$path$file" >> final; echo "$path$file"; rm "$path$file"; done;
Setting up watches.
Watches established.
dummy/1
dummy/3

Finally, you can see the output in the file named final. The [1] and [3] delimiters are generated by the echo "[$file]">> final above.

The reason for deleting each file immediately is that I assume the output files are large and there are many URLs still to process, so removing them right away saves disk space.
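The body of the watch loop can be pulled out into a small function, which makes it possible to try the "label, truncate, delete" step on a local file before wiring it to inotifywait. This is only a sketch: `collect_result` is a hypothetical name, and the pattern is assumed not to contain a `/`.

```shell
#!/bin/bash
# collect_result FILE OUTPUT PATTERN
# Appends a "[name]" delimiter and the content of FILE up to and including
# the first line matching PATTERN to OUTPUT, then deletes FILE to free disk space.
collect_result() {
    printf '[%s]\n' "$(basename "$1")" >>"$2"
    sed "/$3/q" "$1" >>"$2"
    rm -- "$1"
}

# Assumed usage, with inotify-tools installed and the dummy directory created:
# inotifywait -m dummy -e close_write |
#     while read -r path action file; do
#         collect_result "$path$file" final mortgage
#     done
```

Keeping the output file and the pattern as arguments means the same function serves both the google test run and the real mortgage search.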

Answer