![Escape nested double quotes in a broken CSV file](https://linux55.com/image/180022/%EC%86%90%EC%83%81%EB%90%9C%20CSV%20%ED%8C%8C%EC%9D%BC%EC%97%90%EC%84%9C%20%EC%A4%91%EC%B2%A9%EB%90%9C%20%ED%81%B0%EB%94%B0%EC%98%B4%ED%91%9C%EB%A5%BC%20%EC%9D%B4%EC%8A%A4%EC%BC%80%EC%9D%B4%ED%94%84%ED%95%98%EC%84%B8%EC%9A%94..png)
I have a broken "CSV" file with lots of nested double quotes. For example:
123,"I wonder how to escape "these" quotes with backslashes.",123,456
456,"I wonder how to escape "these" quotes with backslashes.",456,789
Does anyone know how to fix this?
Edit: here is a real example:
n9sih438,4994fa72322,PMC,Rapid Identification of Malaria Vaccine Candidates Based on alpha-Helical Coiled Coil Protein Motif,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"To identify malaria antigens for vaccine development, we selected alpha-helical coiled coil domains of proteins predicted to be present in the parasite erythrocytic stage. The corresponding synthetic peptides are expected to mimic structurally "native" epitopes. Indeed the 95 chemically synthesized peptides were all specifically recognized by human immune sera, though at various prevalence. Peptide specific antibodies were obtained both by affinity-purification from malaria immune sera and by immunization of mice. These antibodies did not show significant cross reactions, i.e., they were specific for the original peptide, reacted with native parasite proteins in infected erythrocytes and several were active in inhibiting in vitro parasite growth. Circular dichroism studies indicated that the selected peptides assumed partial or high alpha-helical content. Thus, we demonstrate that the bioinformatics/chemical synthesis approach described here can lead to the rapid identification of molecules which target biologically active antibodies, thus identifying suitable vaccine candidates. This strategy can be, in principle, extended to vaccine discovery in a wide range of other pathogens.",2007-07-25
Nested double quotes can appear in the title field (4th field) and the abstract field (9th field).
Answer 1
I created a sample input file with 10 fields per line, in which fields 4 and 9 may or may not be quoted:
$ cat file
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,here is an unquoted string,2007-07-25
n9sih438,4994fa72322,PMC,here is an unquoted string,10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,",string,","within,", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,",string,","within,", quotes.",2007-07-25
Then I wrote this script (using GNU awk for the 3rd argument to match()) to identify the type of each input line and then modify the quoted fields accordingly:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    # The 4th and 9th fields may or may not be quoted so we are looking
    # for one of these patterns of fields:
    #
    # 1,2,3,4,5,6,7,8,9,10 - type A
    # 1,2,3,"4",5,6,7,8,9,10 - type B
    # 1,2,3,4,5,6,7,8,"9",10 - type C
    # 1,2,3,"4",5,6,7,8,"9",10 - type D
    #
    # If we can determine which type of record we have then we can
    # identify the fields.

    delete f
    if ( match($0,/^(([^",]+,){9}[^",]+)$/,a) ) {
        type = "A"
        split(a[0],f)
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){5}[^",]+)$/,a) ) {
        type = "B"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
    }
    else if ( match($0,/^(([^",]+,){8})(".*"),([^",]+)$/,a) ) {
        type = "C"
        split(a[1],f)
        f[9] = a[3]
        f[10] = a[4]
    }
    else if ( match($0,/^(([^",]+,){3})(".*"),(([^",]+,){4})(".*"),([^",]+)$/,a) ) {
        type = "D"
        split(a[1],f)
        f[4] = a[3]
        split(a[4],tmp)
        for (i in tmp) {
            f[4+i] = tmp[i]
        }
        f[9] = a[6]
        f[10] = a[7]
    }
    else {
        type = "Unknown"
        split($0,f)
        printf "Warning, could not classify file \"%s\", line %d: %s\n", FILENAME, FNR, $0 | "cat>&2"
    }

    # Uncomment the following lines to see what the above is doing:
    #print ORS "################" ORS "Type " type ":\t" $0
    #for (i=1; i in f; i++) {
    #    print i, "<" f[i] ">"
    #}

    # Strip the outer quotes, double any inner quotes, then re-quote.
    gsub(/^"|"$/,"",f[4])
    gsub(/"/,"\"\"",f[4])
    f[4] = "\"" f[4] "\""

    gsub(/^"|"$/,"",f[9])
    gsub(/"/,"\"\"",f[9])
    f[9] = "\"" f[9] "\""

    # Rebuild the record from the repaired fields.
    $0 = ""
    for (i in f) {
        $i = f[i]
    }

    print
}
$ awk -f tst.awk file
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is an unquoted string",2007-07-25
n9sih438,4994fa72322,PMC,"here is an unquoted string",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
n9sih438,4994fa72322,PMC,"here is a,"",string,"",""within,"", quotes.",10.1371/journal.pone.0000645,PMC1920550,17653272,cc-by,"here is a,"",string,"",""within,"", quotes.",2007-07-25
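As a quick sanity check (not part of the original answer), the sketch below counts the fields on each line with a small quote-aware scanner written in POSIX awk. It assumes RFC 4180-style quoting, where a doubled quote inside a quoted field stands for one literal quote, and it assumes no field contains a newline. The function name `count_csv_fields` is made up for illustration.

```shell
# Hypothetical helper: print the number of CSV fields on each input line,
# treating commas inside double-quoted fields as data rather than separators.
count_csv_fields() {
    awk '{
        n = 1      # a line with no separators still has one field
        inq = 0    # 1 while the scanner is inside a quoted field
        len = length($0)
        for (i = 1; i <= len; i++) {
            c = substr($0, i, 1)
            if (c == "\"")
                inq = !inq          # a doubled "" toggles twice, so state is kept
            else if (c == "," && !inq)
                n++
        }
        printf "%d%s\n", n, (inq ? " (unbalanced quotes)" : "")
    }' "$@"
}
```

Piping the repaired output through it, for example `awk -f tst.awk file | count_csv_fields`, should report 10 for every line, whereas the broken input lines report more because their stray quotes expose embedded commas.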
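The output always quotes the 2 fields that may be quoted in the input. If you don't like that, changing it is a trivial tweak, left as an exercise. I also used the more traditional way of "escaping" a double quote in a CSV, which is to double it (""); if you would prefer \" then that is again a trivial change. See https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk for more information about using awk with CSVs and the CSV "standards".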
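For readers who do want backslash escaping, the change could look roughly like the sketch below. It is an illustration rather than the answer's own code: `requote` is a hypothetical helper that mirrors the three statements the script applies to f[4] and f[9], with only the replacement string changed, and `backslash_quote_field` is a made-up wrapper that applies it to one field per input line. Many CSV parsers expect doubled quotes rather than \", so it is worth checking what the consumer accepts.

```shell
# Hypothetical wrapper: read one field per line and re-quote it with \" escapes.
backslash_quote_field() {
    awk '
    # Hypothetical helper: strip the outer quotes, escape each inner quote
    # with a backslash, then wrap the result in quotes again.
    function requote(s) {
        gsub(/^"|"$/, "", s)
        gsub(/"/, "\\\"", s)    # the script uses "\"\"" here to double the quote
        return "\"" s "\""
    }
    { print requote($0) }' "$@"
}
```

Fed the fragment `"mimic "native" epitopes."`, this prints `"mimic \"native\" epitopes."`.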