Create new CSV files based on the latest timestamp data

Question

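Assumptions:

For an output file named 01_file1_202311301100.csv, the string 'file1' comes from the second "_"-delimited field of the input file name (e.g., Real_file1_table2.dat).
OP's code appears to create a new subdirectory for "yesterday", which would imply that input entries dated "today" should not be processed. For the purposes of this answer, the input and output files reside in the current directory; OP can extend the code to handle subdirectories and the "yesterday" vs. "today" distinction.
The only input field wrapped in double quotes is the first (comma-delimited) field.
The first field always has the format "YYYY-MM-DD HH:MM:SS"; otherwise the line is ignored.
No input field contains an embedded newline.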
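
As a quick illustration of the first and fourth assumptions, here is a minimal sketch. The helper names `second_field` and `is_timestamp` are hypothetical and not part of the answer's code, and the timestamp check is deliberately stricter than the looser regex the awk script uses.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the second "_"-delimited field of a file name.
second_field() {
    local _first field _rest
    IFS='_' read -r _first field _rest <<< "$1"
    printf '%s\n' "${field}"
}

# Hypothetical helper: succeed only if the argument looks like
# "YYYY-MM-DD HH:MM:SS", including the surrounding double quotes.
is_timestamp() {
    local re='^"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}"$'
    [[ "$1" =~ $re ]]
}

second_field "Real_file1_table2.dat"                      # prints: file1
is_timestamp '"2023-11-30 11:00:00"' && echo "valid"      # prints: valid
is_timestamp 'Timestamp'             || echo "ignored"    # prints: ignored
```

Storing the pattern in a variable and leaving it unquoted on the right-hand side of `=~` avoids quoting surprises in bash regex matching.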

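Overall design:

bash determines the prefix (pfx) for the output files (based on the input file name).
bash determines the name of the "last" output file.
pfx and the "last" output file name are passed to awk.
awk processes the input *.dat file:
  It builds an output file name from the contents of the first field (e.g., 2023-11-30 11:00:00 becomes 202311301100).
  If the output file name is less than the "last" output file name, the output file already exists, so the input line is ignored.
  If the output file name is equal to the "last" output file name, the output file is created again, overwriting the existing one (this covers the case where entries for, e.g., 2023-11-30 11:00 were added to the *.dat file both before and after a previous run of the awk script).
  If the output file name is greater than the "last" output file name, a new output file must be created.

One bash / awk approach: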
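
The three-way comparison in the design above can be sketched in isolation as follows. This is only an illustration under the stated assumptions; the function name `classify` and the action labels it prints are hypothetical and do not appear in the answer's script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the output file name for one timestamp and
# report how it compares to the "last" output file.
# usage: classify PFX LAST_FILE TIMESTAMP
classify() {
    awk -v pfx="$1" -v last_file="$2" -v ts="$3" '
    BEGIN {
        dt = ts
        gsub(/[^0-9]/, "", dt)                 # keep digits only
        dt = substr(dt, 1, 12)                 # YYYYMMDDHHMM
        outfile = pfx "_" dt ".csv"

        # plain string comparison; works because the names are fixed-width
        if      (outfile <  last_file) action = "skip"
        else if (outfile == last_file) action = "overwrite"
        else                           action = "create"

        print outfile, action
    }'
}

last="01_file1_202311301101.csv"
classify 01_file1 "${last}" '"2023-11-30 11:00:00"'   # 01_file1_202311301100.csv skip
classify 01_file1 "${last}" '"2023-11-30 11:01:00"'   # 01_file1_202311301101.csv overwrite
classify 01_file1 "${last}" '"2023-11-30 11:02:00"'   # 01_file1_202311301102.csv create
```

Because every file name has the same prefix and a zero-padded 12-digit timestamp, lexical order and chronological order coincide, which is what makes the string comparison safe.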

for datfile in *_table2.dat
do
    [[ ! -f "${datfile}" ]] && break

    ############
    #### the following bash code needs to be run before each run of the awk script

    IFS='_' read -r _ pfx _ <<< "${datfile}"

    case "${pfx}" in
        file1)  pfx="01_${pfx}" ;;
        file2)  pfx="02_${pfx}" ;;    
        file3)  pfx="03_${pfx}" ;;
            *)  pfx="00_${pfx}" ;;
    esac

    last_file="${pfx}_000000000000.csv"

    for outfile in "${pfx}"_*.csv
    do
        [[ -f "${outfile}" ]] && last_file="${outfile}"
    done

    ############
    #### at this point we have:
    ####   1) the '##_file#' prefix for our new output files(s)
    ####   2) the name of the 'last' output file

    awk -v pfx="${pfx}" -v last_file="${last_file}" '
    BEGIN      { FS=OFS=","
                 regex = "^\"[0-9]{4}.*\"$"                               # 1st field regex: "YYYY..."
               }

    FNR==2     { hdr = $0 }                                               # save the 2nd input line as the header for output files

    $1 ~ regex { dt = $1                                                    # copy 1st field
                 gsub(/[^[:digit:]]/,"",dt)                               # strip out everything other than digits
                 dt = substr(dt,1,12)                                     # grab YYYY-MM-DD HH:MM which now looks like YYYYMMDDHHMM

                 if ( dt != dt_prev ) {                                   # if this is a new dt value
                    dt_prev = dt
                    printme = 1                                           # default to printing input lines to new output file

                    close(outfile)                                        # close previous output file
                    outfile = pfx "_" dt ".csv"                           # build new output file name

                    if ( outfile < last_file ) {                          # if "less than" last file then we will skip
                       printf "WARNING: file exists: %s (skipping)\n", outfile
                       printme = 0
                    }
                    else
                    if ( outfile == last_file ) {                         # if "equal to" last file then overwrite
                       printf "WARNING: file exists: %s (overwriting)\n", outfile
                       print hdr > outfile                                # print default header to our overwrite file
                    }
                    else                                                  # else new output file is "greater than" last file
                       print hdr > outfile                                # print default header to our new output file
                 }

                 if ( printme ) {                                         # if printme==1 then print current line to outfile
                    print $1,$2,sprintf("%0.3f%s%0.3f%s%0.3f",$3,OFS,$4,OFS,$5) > outfile
                 }
               }
    ' "${datfile}"
done
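
The "last" output file lookup in the loop above relies on bash expanding globs in sorted order, so the final match carries the latest timestamp. A minimal sketch of that step on its own, using a scratch directory; the function name `find_last_file` and the directory argument are assumptions added for illustration:

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the name of the "last" output file for a prefix.
# Falls back to an all-zero timestamp when no output file exists yet.
# usage: find_last_file DIR PFX
find_last_file() {
    local dir="$1" pfx="$2" f last
    last="${pfx}_000000000000.csv"
    for f in "${dir}/${pfx}"_*.csv; do
        # an unmatched glob stays literal, so make sure the file really exists
        [[ -f "${f}" ]] && last="${f##*/}"
    done
    printf '%s\n' "${last}"
}

tmpdir=$(mktemp -d)
find_last_file "${tmpdir}" 01_file1      # 01_file1_000000000000.csv
touch "${tmpdir}/01_file1_202311301101.csv" "${tmpdir}/01_file1_202311301100.csv"
find_last_file "${tmpdir}" 01_file1      # 01_file1_202311301101.csv
rm -rf "${tmpdir}"
```

The all-zero fallback sorts before any real timestamp, so on a first run every input line falls into the "greater than" branch and gets a new output file.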

Running against OP's first version of Real_file1_table2.dat:

$ awk ....

$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676

==> 01_file1_202311301101.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969

To test the "overwrite" logic, we modify OP's second version of Real_file1_table2.dat as follows:

$ cat Real_file1_table2.2.dat
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999    # another 2023-11-30 11:01 entry
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702

Running against this new version of Real_file1_table2.dat:

$ awk ...
WARNING: file exists: 01_file1_202311301100.csv (skipping)
WARNING: file exists: 01_file1_202311301101.csv (overwriting)

$ head 01*csv
==> 01_file1_202311301100.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:00:00",289233,0.3495333,0.2412115,333.2676

==> 01_file1_202311301101.csv <==
"2023-11-30 11:01:00",289234,1.035533,1.019842,344.1969
"2023-11-30 11:01:00",666666,0.7777777,0.8888888,17.99999

==> 01_file1_202311301102.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:02:00",289235,0.7758334,0.7252186,17.75404

==> 01_file1_202311301103.csv <==
Timestamp,col1,col2,col3,col4
"2023-11-30 11:03:00",289236,0.7693,0.7103683,359.0702
