wget - How do I download web pages based on a URL pattern?

2024-5-26

Consider a website, www.music.com, with the following directory structure:

/piano
   /covers
     /Chopin 
        apple.html
        bannan.js
        balloon.html
        index.html
     /Franz Liszt
        index.html
        roses.js
        Love Dream.html

     /Frodo
        index.html
        linkenso.html

/violin
   /covers
      /David
         Viva.html
      /Ross
         index.html

I want to fetch the index.html files from the nested directories under music.com/piano/covers whose subdirectory name starts with "Fr". In the example above, that means downloading only 2 files:

www.music.com/piano/covers/Franz Liszt/index.html
www.music.com/piano/covers/Frodo/index.html
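
As a quick sanity check of the pattern on its own, the sketch below runs it over a hypothetical list of URLs built from the directory tree above. It uses grep -E as a stand-in for wget's regex matching, on the assumption that wget's default regex type (POSIX) behaves like an unanchored extended regex here. The space in "Franz Liszt" is kept literal for readability; a real URL would normally percent-encode it.

```shell
# Hypothetical URL list derived from the directory tree in the question.
urls='http://www.music.com/
http://www.music.com/piano/covers/
http://www.music.com/piano/covers/Chopin/index.html
http://www.music.com/piano/covers/Franz Liszt/index.html
http://www.music.com/piano/covers/Frodo/index.html
http://www.music.com/piano/covers/Frodo/linkenso.html
http://www.music.com/violin/covers/Ross/index.html'

# The pattern from the question.
pattern='/piano/covers/Fr.*/index.html'

# Keep only the lines that the extended regex matches anywhere in the URL.
matched=$(printf '%s\n' "$urls" | grep -E "$pattern")
printf '%s\n' "$matched"
```

Under these assumptions the pattern selects exactly the two wanted index.html files, while the site root and the /piano/covers/ listing themselves do not match.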

I was thinking of using wget like this:

$ wget \
   --mirror \
   --header="Accept: text/html" \
   --page-requisites \
   --html-extension \
   --convert-links \
   --restrict-file-names=windows \
   --domains=www.music.com/piano/covers \
   --accept-regex="/piano/covers/Fr.*/index.html" \
        http://www.music.com

I tried the same thing on my own site, but it only downloaded the wrong file:

www.music.com/index.html

Why I used the options above:

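Using --recursive or -r is not the problem, since the error persists either way. Also, --page-requisites seems the better option because it does not need the server to provide all the information.
--domains: to avoid downloading anything outside the specified URL. This should be the case here, since I do not need any resources outside the piano/covers folder.
--header: I want to override the default Accept: */* so that my request does not ask for everything.
--html-extension: to download only HTML files.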
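
One thing worth checking in the list above: as far as I know, --domains takes a comma-separated list of host names, not URL paths. The sketch below assumes wget compares the host part of each URL against the --domains value as a suffix match. That is an assumption about wget's behaviour, and the function name host_matches_domain is made up for illustration.

```shell
# Hypothetical helper: does the host end with the given --domains value?
# This only imitates a suffix comparison; it is not wget's actual code.
host_matches_domain() {
  host=$1
  domain=$2
  case "$host" in
    *"$domain") echo "match" ;;
    *)          echo "no match" ;;
  esac
}

# A bare host name as the --domains value.
host_matches_domain www.music.com www.music.com

# The value used in the question, which includes a path.
host_matches_domain www.music.com www.music.com/piano/covers
```

If that assumption holds, a --domains value containing a path can never match the host www.music.com, so every link would be filtered out. For restricting by path, options such as --include-directories or --no-parent may be closer to the intent.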

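It looks like the --accept-regex part is not even being considered. Still, I think it is a better choice than -A, because the files I need are spread across several directories. Any ideas on how to get those two files using the regex passed to --accept-regex?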

Edit 1:

The URLs used in the example above return a 404 error when accessed. So here is the context for my own website, where I am actually trying to do this.

Directory structure of www.ajayhalthor.com:

/piano
   /nightwish-sahara
   /nightwish-amaranth
   /skillet-hero
   /skillet-the-last-night
   /breaking-benjamin-diary-of-jane
   /skillet-comatose
   /one-republic-counting-stars
   /skillet-falling-inside-the-black
   /63/index.html
   /a/few/more/links/index.html

/about
   /other/links/index.html
/Home
   /main/links/index.html

From this structure, I want to retrieve the pages under www.ajayhalthor.com/piano whose names start with "sk". That is, I want to retrieve the following:

www.ajayhalthor.com/piano/skillet-hero
www.ajayhalthor.com/piano/skillet-the-last-night
www.ajayhalthor.com/piano/skillet-comatose
www.ajayhalthor.com/piano/skillet-falling-inside-the-black

I run the following command:

$ wget \
   --mirror \
   --header="Accept: text/html" \
   --page-requisites \
   --html-extension \
   --convert-links \
   --restrict-file-names=windows \
   --domains=www.ajayhalthor.com/piano \
   --accept-regex="piano/sk.*" \
        http://www.ajayhalthor.com

It produces the following output:

Resolving www.ajayhalthor.com (www.ajayhalthor.com)... 23.229.213.7
Connecting to www.ajayhalthor.com (www.ajayhalthor.com)|23.229.213.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.ajayhalthor.com/index.html’

www.ajayhalthor.com/index.html              [ <=>                                                                          ]  24.28K  --.-KB/s    in 0.1s    

Last-modified header missing -- time-stamps turned off.
2017-01-19 01:56:11 (245 KB/s) - ‘www.ajayhalthor.com/index.html’ saved [24862]

FINISHED --2017-01-19 01:56:11--
Total wall clock time: 1.4s
Downloaded: 1 files, 24K in 0.1s (245 KB/s)
Converting links in www.ajayhalthor.com/index.html... 11-1
Converted links in 1 files in 0.003 seconds.

Only one file, www.ajayhalthor.com/index.html, was downloaded. Am I using --accept-regex correctly?
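
To narrow this down, here is a small sketch that checks each hop of a hypothetical link chain, from the start URL down to one of the wanted pages, against the regex. It assumes that wget applies --accept-regex to every link it considers following, and it again uses grep -E as an approximation of wget's POSIX regex matching. The function name classify and the intermediate URL http://www.ajayhalthor.com/piano/ are illustrative assumptions.

```shell
# The regex passed to --accept-regex in the command above.
pattern='piano/sk.*'

# Hypothetical helper: report whether a URL would pass the regex.
classify() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

# Assumed chain of links: site root, then the piano listing, then a wanted page.
classify 'http://www.ajayhalthor.com/'
classify 'http://www.ajayhalthor.com/piano/'
classify 'http://www.ajayhalthor.com/piano/skillet-hero'
```

In this sketch only the last URL passes. If the wanted pages are reachable only through a page that the regex rejects, that could be one reason the crawl stops after the start page.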
