Formatting a date field in a .CSV file that has multiple commas in its string fields

Answer 1

You are splitting on commas, but some of your strings contain commas, so I don't think $9 refers to the date column. Insert a print m after the line that sets m to see this:

m=substr($9,4,3)
print m

Example

MY M: lum
column1,column2,column3,column4,column5,column6, column7, Column8,00/00/2009, Column10
MY M: me"
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
MY M: tho
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
MY M: me"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

I think you need to rethink your approach, or escape the commas that appear inside the strings.
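To see why $9 lands on the wrong text, it helps to count the fields. The following is a minimal Ruby sketch, used here only for illustration; ROW is the first data row of the sample file and naive_fields is a hypothetical helper that splits on every comma the way awk does with -F,.

```ruby
# First data row of the sample file, copied from the output shown later in this answer.
ROW = '"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"'

# Hypothetical helper: split on every comma, giving quotes no special treatment.
def naive_fields(line)
  line.split(",")
end

fields = naive_fields(ROW)
puts fields.length    # 13, although the row only has 10 real columns
puts fields[8]        # what awk sees as $9: part of the author name
puts fields[8][3, 3]  # same slice as substr($9,4,3), giving: me"
```

The embedded commas push the date from the 9th position to the 12th, which is why the debug line prints me" instead of a month.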

Fix

awk has a strange but useful ability to split on groups of characters. One approach is to split on "," (quote, comma, quote) instead of on a bare comma.
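As a rough illustration of that idea outside awk, here is a small Ruby sketch. SAMPLE_ROW is the first data row of the sample file and quoted_fields is a hypothetical helper name.

```ruby
# First data row of the sample file, copied from the output shown later in this answer.
SAMPLE_ROW = '"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"'

# Hypothetical helper: split on the three-character sequence "," (quote, comma, quote).
def quoted_fields(line)
  line.split('","')
end

parts = quoted_fields(SAMPLE_ROW)
puts parts.length  # 10
puts parts[8]      # 11-OCT-11
puts parts.first   # "12   (the opening quote of the line is still attached)
puts parts.last    # 12"   (so is the closing quote)
```

The date is now in the 9th position, but the first and last fields keep a stray quote, and that is what the refinements below have to clean up.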

Example (refinement #1)

$ awk -F'","' '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 {
  if(NR==1){print}
  else{ m=substr($9,4,3); print "MY M: " m;
   $9 = sprintf("%02d/%02d/20%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
  print
 } }' OFS="," file.csv

Output

MY M: 
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10,,,,,,,,00/00/2000
MY M: OCT
"12,B000QRIGJ4,4432,string with quotes, and with a comma, and colon: in between,4432,author1, name,890,88,10/11/2011,12"
MY M: OCT
"4432,B000QRIGJ4,890,another, string with quotes, and with more than, two commas: in between,455,author2, name,12,455,10/12/2011,55"
MY M: OCT
"11,B000QRIGJ4,77,string with, commas and (paranthesis) and : colans, in between,12,author3, name,333,22,10/13/2011,232"

Even this isn't quite right. To restore the quotes you have to do some additional cleanup, and then remove the duplicated quotes at the beginning and end of each line.
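One way to picture that cleanup is to strip the outer quotes before touching the fields and put them back when joining. This is only a sketch in Ruby, not the awk used in this answer; requote_row and MONTH_NAMES are hypothetical names, and it assumes a data row with exactly the 10 quoted columns of the sample and a two-digit year in the 2000s.

```ruby
MONTH_NAMES = %w[JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC].freeze

# Hypothetical helper: reformat field 9 of one fully quoted data row.
def requote_row(line)
  fields = line.chomp.split('","')
  # Drop the quote that opens the line and the quote that closes it.
  fields[0]  = fields[0].delete_prefix('"')
  fields[-1] = fields[-1].delete_suffix('"')
  # DD-MMM-YY becomes MM/DD/20YY (assumption: every year is 20YY).
  day, mon, yy = fields[8].split("-")
  fields[8] = format("%02d/%02d/20%02d", MONTH_NAMES.index(mon) + 1, day.to_i, yy.to_i)
  # Re-join on "," and restore the outer quotes exactly once.
  '"' + fields.join('","') + '"'
end

puts requote_row('"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"')
```

Because the outer quotes are removed and restored in one place, no later pass is needed to collapse doubled quotes.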

Example (refinement #2)

$ awk -F'","' '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }
 { m=substr($9,4,3); print "MY M: " m;
 $9 = sprintf("\"%02d/%02d/20%02d\"",mdigit[m],substr($9,1,2),substr($9,8,20))
 for (i=1; i<=10; i++) printf("\"%s\",",$i); printf("%s\n","")
 /\"\"/ }' OFS="," file.csv

Output

MY M: 
"column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10","","","","","","","",""00/00/2000"","",
MY M: OCT
""12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88",""10/11/2011"","12"",
MY M: OCT
""4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455",""10/12/2011"","55"",
MY M: OCT
""11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22",""10/13/2011"","232"",

I'm not going to continue with this approach. I hope you can see that it is not a good way to solve the problem: it is hard to maintain and very brittle if the input changes over time.

Example (refinement #3)

OK, I couldn't just leave it like this. Here is a working example.

awk -F'","' '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month[i]]=i
 }

 { if (NR==1){print; next} }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/20%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 for (i=1; i<=10; i++) printf("\"%s\",",$i); printf("%s\n","")
 }' OFS="," file.csv | sed -e 's/""/"/g' -e 's/,$//'

Output

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","10/11/2011","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","10/12/2011","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","10/13/2011","232"
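One concrete case of the brittleness mentioned earlier: the final sed pass replaces every doubled quote, so it cannot tell a leftover quote from an empty field. The Ruby sketch below only imitates the two sed expressions (sed_cleanup is a hypothetical name, and the input line is made up for illustration, it is not from the sample file).

```ruby
# Hypothetical helper imitating: sed -e 's/""/"/g' -e 's/,$//'
def sed_cleanup(line)
  line.gsub('""', '"').sub(/,$/, "")
end

# A made-up row whose middle field is empty, in the shape the awk loop prints it.
puts sed_cleanup('"a","","b",')   # prints "a",","b" ; the empty field is gone
```

The result is no longer three fields, so this pipeline is only safe while no field is empty and no field contains an escaped quote.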

Answer 2

Use a tool that includes a proper CSV parser. For example, with Ruby:

ruby -rcsv -pe '
  if $. > 1
    row = CSV.parse_line($_)
    row[8] = Date.parse(row[8]).strftime("%Y/%m/%d")
    $_ = row.to_csv(:force_quotes=>true)
  end
' file.csv

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"
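The same idea can be written as a small script. This is a sketch rather than a replacement for the one-liner: reformat_line is a hypothetical helper, and it swaps Date.parse for Date.strptime with an explicit %d-%b-%y format, on the assumption that every date in the column looks like 11-OCT-11, so that unexpected input raises instead of being guessed at.

```ruby
require "csv"
require "date"

# Hypothetical helper: parse one CSV line, reformat column 9, and re-quote everything.
def reformat_line(line)
  row = CSV.parse_line(line)   # the parser handles commas inside quoted fields
  row[8] = Date.strptime(row[8], "%d-%b-%y").strftime("%Y/%m/%d")
  row.to_csv(force_quotes: true)   # returns the line with a trailing newline
end

puts reformat_line('"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"')
```

The parser, not a hand-written separator, decides where fields begin and end, so extra commas in the text columns need no special handling.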

Answer 3

The simple way

Change every DD-MMM-YYYY date to YYYY/MM/DD wherever one is found.

$ perl -pe 'BEGIN{ @month=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC); 
                # @month is 0-indexed, so JAN (index 0) must map to month 1
                for ($i=0; $i<12; $i++) {$mdigit{$month[$i]}=$i+1;}
               } 
          s#(\d{1,2})-(\w{3})-(\d{2,4})#20$3/$mdigit{$2}/$1#;' foo.csv

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","2011/10/11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","2011/10/12","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","2011/10/13","232"

The correct way

Change the format of field 9 only. You can use Perl's -a flag to split each line into fields as awk does (the fields are $F[0],$F[1]...$F[N-1]) and -F to set the field delimiter to ",":

perl -F'\",\"' -lane 'BEGIN{
               @month=qw(JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC); 
               for ($i=0; $i<12; $i++) {$mdigit{$month[$i]}=$i+1;}
              } 
              $F[8]=~s#(\d{1,2})-(\w{3})-(\d{2,4})#20$3/$mdigit{$2}/$1# if $.>1; 
              print join("\",\"",@F)' foo.csv

This prints YYYY/MM/DD and, as you did in your question, assumes that every year starts with 20.
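If the file might ever contain four-digit years, the hard-coded 20 would produce values like 201999. A hedged sketch of one way around that, written in Ruby for illustration only: to_slash_dates and MONTHS are hypothetical names, and treating every two-digit year as 20YY is the same assumption as above.

```ruby
MONTHS = %w[JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC].freeze

# Hypothetical helper: rewrite DD-MMM-YY or DD-MMM-YYYY dates as YYYY/MM/DD.
def to_slash_dates(text)
  text.gsub(/\b(\d{1,2})-([A-Za-z]{3})-(\d{4}|\d{2})\b/) do
    day, mon, year = $1.to_i, $2.upcase, $3
    month = MONTHS.index(mon)
    next $& if month.nil?                    # not a month name: leave the text untouched
    year = "20#{year}" if year.length == 2   # assumption: two-digit years are 20YY
    format("%s/%02d/%02d", year, month + 1, day)
  end
end

puts to_slash_dates('"88","11-OCT-11","12"')   # "88","2011/10/11","12"
puts to_slash_dates("01-JAN-1999")             # 1999/01/01
```

Month and day are zero-padded here, so the output always has the fixed-width YYYY/MM/DD shape.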

Answer 4

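Use Miller (mlr) to clean up the whitespace (some of the headers appear to contain spaces) and then convert the date in the Column9 field to the right format. The conversion first turns the given date into a Unix time with strptime() and then immediately reformats it into the wanted format with strftime().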

mlr --csv \
    clean-whitespace then \
    put '$Column9 = strftime(strptime($Column9, "%d-%b-%y"), "%Y/%m/%d")' file

For the data in the question, this gives:

column1,column2,column3,column4,column5,column6,column7,Column8,Column9,Column10
12,B000QRIGJ4,4432,"string with quotes, and with a comma, and colon: in between",4432,"author1, name",890,88,2011/10/11,12
4432,B000QRIGJ4,890,"another, string with quotes, and with more than, two commas: in between",455,"author2, name",12,455,2011/10/12,55
11,B000QRIGJ4,77,"string with, commas and (paranthesis) and : colans, in between",12,"author3, name",333,22,2011/10/13,232
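The parse-then-format round trip described above is not specific to Miller. As a rough sketch of the same two steps in Ruby (reformat_date is a hypothetical name, and working in UTC is a choice made here so the calendar date cannot shift with the local time zone):

```ruby
require "date"

# Hypothetical helper: text date -> Unix time -> text date in the new format.
def reformat_date(text)
  d = Date.strptime(text, "%d-%b-%y")               # step 1: parse DD-MMM-YY
  epoch = Time.utc(d.year, d.month, d.day).to_i     # seconds since 1970-01-01 UTC
  Time.at(epoch).utc.strftime("%Y/%m/%d")           # step 2: format the timestamp
end

puts reformat_date("11-OCT-11")   # 2011/10/11
```

Going through a real timestamp means the month name and the century are resolved by the date library rather than by string slicing.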

To quote all fields, add the --quote-all option to the command line right after --csv. By default, Miller only quotes the fields that actually need quoting.

With nicer formatting:

+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
| column1 | column2    | column3 | column4                                                                 | column5 | column6       | column7 | Column8 | Column9    | Column10 |
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
| 12      | B000QRIGJ4 | 4432    | string with quotes, and with a comma, and colon: in between             | 4432    | author1, name | 890     | 88      | 2011/10/11 | 12       |
| 4432    | B000QRIGJ4 | 890     | another, string with quotes, and with more than, two commas: in between | 455     | author2, name | 12      | 455     | 2011/10/12 | 55       |
| 11      | B000QRIGJ4 | 77      | string with, commas and (paranthesis) and : colans, in between          | 12      | author3, name | 333     | 22      | 2011/10/13 | 232      |
+---------+------------+---------+-------------------------------------------------------------------------+---------+---------------+---------+---------+------------+----------+
