Escape nested double quotes in a broken CSV file

Question

I created a sample input file with 10 fields per line, where fields 4 and 9 may or may not be quoted.

$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25

I then wrote this script (using GNU awk for the third argument to match()) to identify the type of each input line and then fix up the quoted fields accordingly.

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # The 4th and 9th fields may or may not be quoted so we are looking
    # for one of these patterns of fields:
    #
    #    1,2,3,4,5,6,7,8,9,10           - type A
    #    1,2,3,"4",5,6,7,8,9,10         - type B
    #    1,2,3,4,5,6,7,8,"9",10         - type C
    #    1,2,3,"4",5,6,7,8,"9",10       - type D
    #
    # If we can determine which type of record we have then we can
    # identify the fields.

    delete f
    if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
        type = "A"
        split(a[0],f)
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
        type = "B"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
    }
    else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
        type = "C"
        split(a[1],f)
        f[9] = a[3]
        f[10] = a[4]
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
        type = "D"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
        f[9] = a[6]
        f[10] = a[7]
    }
    else {
        type = "Unknown"
        split($0,f)
        printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
    }

    # Uncomment the following lines to see what the above is doing:
    #print ORS "################" ORS "Type " type ":\t" $0
    #for (i=1; i in f; i++) {
        #print i, "<" f[i] ">"
    #}

    gsub(/^"|"$/,"",f[4])
    gsub(/"/,"\"\"",f[4])
    f[4] = "\"" f[4] "\""

    gsub(/^"|"$/,"",f[9])
    gsub(/"/,"\"\"",f[9])
    f[9] = "\"" f[9] "\""

    $0 = ""
    for (i in f) {
        $i = f[i]
    }
    print
}

.

$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
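One way to sanity-check output like the above is to count the fields on each repaired line while treating a doubled quote inside a quoted field as an escaped quote. The following is a minimal sketch in portable awk, not part of the original script; count_fields is a hypothetical helper name, and the loop assumes the input is already well-formed CSV.

```shell
# count_fields is a hypothetical helper: it prints the number of CSV fields
# on each input line, accepting "" as an escaped quote inside quoted fields.
count_fields() {
    awk '{
        rest = $0; n = 0
        while (1) {
            if (match(rest, /^"([^"]|"")*"/))   # quoted field, "" allowed inside
                len = RLENGTH
            else {
                match(rest, /^[^,]*/)           # unquoted, possibly empty, field
                len = RLENGTH
            }
            n++
            rest = substr(rest, len + 1)        # drop the field just counted
            if (rest == "") break
            rest = substr(rest, 2)              # assumes next char is the comma
        }
        print n
    }' "$@"
}

# The second line of the repaired output shown above.
printf '%s\n' 'n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25' | count_fields
```

For that repaired line the function prints 10, which matches the expected number of fields. Running it over the whole output (count_fields outfile, with outfile standing in for wherever the output was saved) should print 10 for every line.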

The output always quotes the 2 fields that may be quoted in the input. If you don't like that, it's a simple tweak left as an exercise. I also used the more traditional way of "escaping" double quotes in CSV, which is to double them. If you'd prefer \" to "", that's also a trivial change. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information on using awk with CSV and the CSV "standards".
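To illustrate the choice between the two escaping styles, here is a minimal sketch of the quoting step as a standalone function. It is an illustration rather than the original script: quote_field and its style argument ("double" or "backslash") are hypothetical names, and it rejoins the text with split() instead of the script's gsub() so that the backslash case does not depend on how a given awk treats backslashes in replacement strings.

```shell
# quote_field is a hypothetical helper; its style argument ("double" or
# "backslash") is an assumption made for this sketch.
quote_field() {
    awk -v style="$1" '
    function quote(s, esc,    n, parts, i, out) {
        gsub(/^"|"$/, "", s)              # strip the outer quotes, if any
        n = split(s, parts, "\"")         # break the text at each embedded quote
        out = parts[1]
        for (i = 2; i <= n; i++)
            out = out esc parts[i]        # rejoin with the chosen escape sequence
        return "\"" out "\""              # always wrap the result in quotes
    }
    { print quote($0, (style == "backslash") ? "\\\"" : "\"\"") }'
}

printf '%s\n' '"here is a,",string,","within,", quotes."' | quote_field double
printf '%s\n' '"here is a,",string,","within,", quotes."' | quote_field backslash
```

The first call prints "here is a,"",string,"",""within,"", quotes." as in the output above, and the second prints "here is a,\",string,\",\"within,\", quotes." instead.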
