Fastest way to download data from a GCS bucket

Question

Answer 1

Well, I am posting this as an answer, though new answers are always welcome.

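Some gsutil pattern-matching operations can be very slow when they have to search through thousands or millions of files. It is better to list the matching objects first and then download them using their absolute paths.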

vikrant_singh_rana@cloudshell:~/download$ cat download_gcs_file.sh
#!/bin/bash

# Delete the listing file if it already exists in the current working directory
file="ls_output.csv"

if [ -f "$file" ] ; then
    rm "$file"
fi
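# List the objects matching the search pattern into ls_output.csv.
# gsutil ls -l prints size, timestamp and URL; awk writes a "filename" header,
# then the URL ($3) of every object whose size ($1) is at most 100 bytes.
gsutil ls -l "gs://test-bucket-data-prod-ingest/cm_data/AN/AM/*/a01_*_20210128*.csv.bz2" | awk '!hdr{ print "filename"; hdr=1; }; $1 <= 100{ print $3; }' > ls_output.csv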

input_file_path='/home/vikrant_singh_rana/download/ls_output.csv'

# Read each object URL from the input file and download it from GCS to the local output directory
count=0

{
    read -r  # skip the "filename" header line
    while IFS=, read -r inputfilename
    do

        echo "input filename is: $inputfilename"

        # Download only when the line is non-empty and is not the header row
        if [ -n "$inputfilename" ] && [ "$inputfilename" != "filename" ]
        then
            echo "downloading file: $inputfilename"
            gsutil -m cp -R "$inputfilename" /home/vikrant_singh_rana/download/output/
        else
            echo "Skipping empty line or header row"
        fi

        count=$((count + 1))
        echo "count is: $count"
    done
} < "$input_file_path"

# Decompress the downloaded .bz2 files to plain .csv
bzip2 -d /home/vikrant_singh_rana/download/output/*

This is the input file:

vikrant_singh_rana@cloudshell:~/download$ cat ls_output.csv
filename
gs://test-bucket-data-prod-ingest/cm_data/AN/AM/172.24.105.197-CORE-2/a01_1h_255_XYZ_202101282300_0009.csv.bz2
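
As a possible variation on the script above, the listing can be filtered once and the resulting URLs handed to a single gsutil invocation, instead of starting gsutil once per file. This is only a sketch, not the method used in the answer. It assumes the size/timestamp/URL column layout that gsutil ls -l prints. The function name extract_urls, the sample listing, the bucket name and the ./output/ destination are all hypothetical. The commented-out command relies on gsutil cp -I, which reads the URLs to copy from standard input.

```shell
#!/bin/bash

# Hypothetical helper: read `gsutil ls -l` style output on stdin and print
# only the object URLs (third column). Lines whose first field is not a
# plain number, such as the trailing "TOTAL:" summary, are skipped.
extract_urls() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^gs:\/\// { print $3 }'
}

# Made-up sample listing (assumed layout: size, timestamp, URL).
sample_listing='      2048  2021-01-28T23:05:11Z  gs://example-bucket/data/a01_file_1.csv.bz2
       512  2021-01-28T23:06:40Z  gs://example-bucket/data/a01_file_2.csv.bz2
TOTAL: 2 objects, 2560 bytes (2.5 KiB)'

urls=$(printf '%s\n' "$sample_listing" | extract_urls)
printf '%s\n' "$urls"

# With real data, the same filter could feed one parallel copy
# (left commented out so the sketch runs without gsutil or credentials):
# gsutil ls -l "gs://example-bucket/data/a01_*.csv.bz2" | extract_urls \
#     | gsutil -m cp -I ./output/
```

With a single invocation, the -m option can parallelise across all listed objects, which it cannot do when each call copies only one file.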
