Extract all file names from an index using wget

Question

The following solution only works for a standard, unformatted directory index generated by apache2. You can fetch the index file with wget and parse it with grep and cut.

#this will download the directory listing index.html file for /folder/
wget the.server.ip.address/folder/   

#this will grep for the table of the files, remove the top line (parent folder) and cut out
#the necessary fields
grep '</a></td>' index.html | tail -n +2 | cut -d'>' -f7 | cut -d'<' -f1
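To see why the pipeline picks field 7, the same steps can be traced in Python. This is only a sketch: the rows in SAMPLE_INDEX are made up to resemble Apache's HTML-table listing, and extract_names is a hypothetical helper, not part of any tool.

```python
# Hypothetical rows shaped like Apache's HTML-table directory listing (assumption)
SAMPLE_INDEX = """\
<tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/back.gif" alt="[PARENTDIR]"></td><td><a href="/">Parent Directory</a></td><td>&nbsp;</td></tr>
<tr><td valign="top"><img src="/icons/text.gif" alt="[TXT]"></td><td><a href="notes.txt">notes.txt</a></td><td align="right">1.2K</td></tr>
<tr><td valign="top"><img src="/icons/compressed.gif" alt="[   ]"></td><td><a href="data.tar.gz">data.tar.gz</a></td><td align="right"> 34M</td></tr>
"""

def extract_names(index_html):
    # grep '</a></td>': keep only the rows that contain a link cell
    rows = [line for line in index_html.splitlines() if '</a></td>' in line]
    # tail -n +2: drop the first match, the "Parent Directory" row
    rows = rows[1:]
    # cut -d'>' -f7: the seventh '>'-separated field, e.g. 'notes.txt</a'
    # cut -d'<' -f1: everything before the next '<'
    return [row.split('>')[6].split('<')[0] for row in rows]

print(extract_names(SAMPLE_INDEX))
```

If the server adds or removes a column before the link, the field number shifts, which is why the pipeline is tied to one listing format.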

As mentioned above, this only works if the server generates the directory listing with apache2's default options, configured like this:

<Directory /var/www/html/folder>
 Options +Indexes 
 AllowOverride None
 Allow from all
</Directory>

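With this configuration, wget retrieves an index.html whose directory listing has no particular formatting, but you can of course customize the directory listing with the following option: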

IndexOptions +option1 -option2 ...

To give a more precise answer for your situation, I would need a sample index.html file.

Here is a Python version as well.

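from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

def get_listing():
  # Directory index to read; replace with the listing you want to parse
  url = 'http://cdimage.debian.org/debian-cd/8.4.0-live/amd64/iso-hybrid/'
  for file_url in listFD(url):
    # Print only the last path component, i.e. the file name
    print(file_url.rsplit('/', 1)[-1])

def listFD(url, ext=''):
  # Download the index page and collect every link whose href ends with ext
  page = requests.get(url).text
  soup = BeautifulSoup(page, 'html.parser')
  hrefs = [node.get('href') for node in soup.find_all('a')]
  # Skip <a> tags without an href and resolve the rest against the index URL
  return [urljoin(url, href) for href in hrefs if href and href.endswith(ext)]

def main():
  get_listing()


if __name__ == '__main__':
  main()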

I used this page as a guide.
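The listFD function above returns every link on the page, including column-sort links and the parent directory. A possible variant that keeps only file entries, and needs nothing beyond the standard library, might look like the sketch below. LinkCollector, file_names, and the SAMPLE markup with its file names are all hypothetical and only illustrate the idea; the filter rules are assumptions about how a typical Apache listing writes its links.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

def file_names(index_html, ext=''):
    parser = LinkCollector()
    parser.feed(index_html)
    names = []
    for href in parser.hrefs:
        # Assumed conventions: '?...' are sort links, a leading '/' is the
        # parent directory, and a trailing '/' marks a subdirectory
        if href.startswith(('?', '/')) or href.endswith('/'):
            continue
        if href.endswith(ext):
            names.append(href)
    return names

# Made-up listing used only for illustration
SAMPLE = """
<pre><a href="?C=N;O=D">Name</a> <a href="?C=M;O=A">Last modified</a>
<a href="/parent/">Parent Directory</a>
<a href="CHECKSUMS">CHECKSUMS</a>
<a href="example-live.iso">example-live.iso</a>
<a href="old/">old/</a>
</pre>
"""

print(file_names(SAMPLE))
print(file_names(SAMPLE, '.iso'))
```

Because it looks only at anchor tags, this approach does not depend on whether the listing is rendered as a table, a list, or preformatted text.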

Answer 1